This project classifies tweets with specified hashtag into positive and negative ones. It's implemented using Scala language and Apache Spark framework and includes several stages:
- Streaming - converting RSS stream of tweets from QueryFeed into Spark data abstraction - RDD (ready GitHub project was used for that)
- Preprocessing - remove redundant characters and words, convert to lowercase, etc.
- Model training - using models from Spark Machine Learning library and training them on Twitter Sentiment Analysis dataset.
- Classifying tweets - using saved ML model for classifying tweets from RSS stream.
classifier/— Classification moduleMain- example class of how to operate with Model class (also generated pretrained model)MLFindModel- used to identify the model with best accuracyModel- class which represents the classifierModelLoader- simple support static methods
preprocessing/— Preprocessing modulePreprocessingDataset- static method for preprocessing train dataset before learningPreprocessingExample- obvious from the namePreprocessTweet- class for processing text messages (described earlier)
streamer/- streaming module
- If you wish to train model - use example provided in
classifier.Main - If you wish to start streaming rss & process it - run
streamer.RSSDemowith one argument - tag (cat, dog, winter, etc.). Just one word, nothing else.
- Pretrained model is then stored close to
classifier/twits/train.csvresource (in model folder). - Each new batch produced by RSS generates two artifacts - folder with initial data and after model process.
(${tag}_input and ${tag}_result)
- All the manipulations with spark were done on local machine without usage of the cluster.
- Paths in a program are configured to use files from local machine.
- RDDs are generated each 10 seconds.
QueryFeed
RSS streaming GitHub repository
Twitter Sentiment Analysis