Final Project for Algorithms class with Professor Tristan. Made by Hannah Brooks, Megan Costello, Brielle Donowho, Colin Hall, and Ryan Rafferty.
Our project is built around a Twitter data miner and a sentiment analysis model applied to relevant tweets. We will first create the data miner using the “Tweepy” Python library to access Twitter’s API; we have created a developer account in order to do so. The data miner takes a search word or phrase and a number of tweets, and returns that many of the most recent tweets containing the search word or phrase. We then want to analyze the overall sentiment of those tweets to gain insight into the search word or phrase. The major components of the project are the theory behind sentiment analysis, the data miner, the sentiment analysis model, and data analysis of the model’s results. With these components we aim to analyze the sentiment around a search term either at the present moment or over time. This could be used to gauge public reaction to events or to actions by notable corporations and institutions. Beyond informing corporate strategy, it could provide political, social, and cultural insights.
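As a rough illustration of the data miner described above, here is a minimal sketch assuming Tweepy 4.x and the Twitter API v2 recent-search endpoint. The helper names make_client and fetch_recent_tweets are ours for illustration, and the bearer token would come from the developer account. The sketch sends a single request, so it returns at most 100 tweets; larger requests would need pagination.

```python
from typing import Any, List


def make_client(bearer_token: str) -> Any:
    """Build a Tweepy client (hypothetical helper).

    Tweepy is imported lazily so the rest of the sketch can be read and
    exercised without the library installed.
    """
    import tweepy  # third-party: pip install tweepy
    return tweepy.Client(bearer_token=bearer_token)


def fetch_recent_tweets(client: Any, query: str, limit: int) -> List[str]:
    """Return the text of up to `limit` recent tweets matching `query`."""
    # The recent-search endpoint accepts max_results between 10 and 100,
    # so request a clamped page size and trim to `limit` afterwards.
    page_size = max(10, min(limit, 100))
    response = client.search_recent_tweets(query=query, max_results=page_size)
    tweets = response.data or []  # data is None when nothing matches
    return [tweet.text for tweet in tweets][:limit]
```

With real credentials, usage would look like fetch_recent_tweets(make_client(token), "some phrase", 50), where token is a placeholder for the project’s own bearer token.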
Before we build the model, we will look at the derivation of the inference algorithms for Latent Dirichlet Allocation (LDA) using calculus, probability formulas, and mathematical analysis. From this we will be able to determine an MCMC-style algorithm by maximizing the likelihood function. After obtaining the tweets, we will perform a general sentiment analysis, looking primarily at polarity: positive, negative, or neutral. To do this, we first have to prepare the data. We will tokenize the tweets, splitting each one into a list of words so that the data set becomes a list of lists of words. We will also normalize and stem the data, converting words to their canonical form so that variants of the same word are counted as one. To remove noise, we will strip hyperlinks, stop words like “is” or “the”, Twitter handles, and extra whitespace. Once the data is ready to be analyzed, we will compare the words in each tweet against two lists of positive and negative words. Each positive word contributes a value of 1 and each negative word a value of -1, and the average of these values gives the tweet’s polarity score. If the score is between -1 and -0.4, the tweet has a negative sentiment; if it is between -0.4 and 0.4, it is neutral; and if it is between 0.4 and 1, it is positive. We will then count how many positive and negative tweets there were overall and compute the percentage of each.
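The preparation and scoring steps above can be sketched as follows. This is only an illustration under several assumptions: the word lists are tiny placeholders rather than real lexicons, the average is taken over the words that appear in either list (a tweet with no matches scores 0), scores exactly at -0.4 or 0.4 are treated as neutral, and stemming is left out to keep the sketch free of third-party dependencies.

```python
import re
from collections import Counter
from typing import Dict, List, Set

# Tiny illustrative word lists (assumptions); real lexicons would be far larger.
POSITIVE: Set[str] = {"good", "great", "love", "happy"}
NEGATIVE: Set[str] = {"bad", "awful", "hate", "sad"}
STOP_WORDS: Set[str] = {"a", "an", "and", "i", "is", "of", "the", "this", "to"}


def tokenize(tweet: str) -> List[str]:
    """Lowercase a tweet, strip links and handles, and drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"@\w+", " ", text)          # remove Twitter handles
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]


def polarity(tokens: List[str]) -> float:
    """Average +1/-1 over words found in either list (0.0 if none match)."""
    values = [1 if t in POSITIVE else -1
              for t in tokens if t in POSITIVE or t in NEGATIVE]
    return sum(values) / len(values) if values else 0.0


def label(score: float) -> str:
    """Map a score in [-1, 1] to a label; -0.4 and 0.4 count as neutral (assumption)."""
    if score < -0.4:
        return "negative"
    if score > 0.4:
        return "positive"
    return "neutral"


def summarize(tweets: List[str]) -> Dict[str, float]:
    """Percentage of tweets carrying each label (empty dict for no tweets)."""
    if not tweets:
        return {}
    counts = Counter(label(polarity(tokenize(t))) for t in tweets)
    return {name: 100.0 * counts[name] / len(tweets)
            for name in ("positive", "neutral", "negative")}
```

Averaging over matched words only is one possible reading of the scoring rule; averaging over every word in the tweet would pull most scores toward 0 and make the 0.4 thresholds much harder to reach.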
After collecting the applicable tweets and recording their sentiment data, we will also fit regressions that reveal the correlation between the presence of certain keywords in a tweet and its sentiment value. We could also produce scatter plots and other types of charts to represent the sentiment data visually, such as graphing the sentiment of tweets over time and grouping tweets by sentiment and by the keywords that are statistically significant to the sentiment score. In today’s age of social media, we feel this analysis can be extremely useful to companies looking to evaluate their brand image, marketers trying to pick up on consumer sentiment, and even government agencies attempting to gauge public opinion. This is a topic we are very interested in, and we are excited to explore how to leverage social media to gain impactful insights.
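One simple form the keyword regression could take is an ordinary least-squares line with a 0/1 predictor for whether a tweet contains the keyword. The sketch below is an assumption about how this might be set up, not a fixed design: the function names and the sample data are hypothetical, and the sentiment scores are taken as already computed. With a 0/1 predictor, the slope equals the difference in mean sentiment between tweets with and without the keyword.

```python
from typing import List, Tuple


def keyword_indicator(tweets: List[str], keyword: str) -> List[int]:
    """1 if the keyword appears as a whole word in the tweet, else 0."""
    return [1 if keyword.lower() in t.lower().split() else 0 for t in tweets]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor has no variation; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical tweets with already-computed polarity scores.
tweets = ["love the new update", "update is awful", "great weather",
          "bad update again", "nice day"]
scores = [1.0, -1.0, 1.0, -1.0, 0.0]

slope, intercept = fit_line(keyword_indicator(tweets, "update"), scores)
# intercept: mean score of tweets without the keyword
# slope: change in mean score when the keyword is present
```

A real analysis would need far more tweets and a significance test before reading anything into the slope.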