In this series of posts we set out to create a machine learning model that could distinguish positive tweets from negative tweets in Portuguese (from Portugal). While we achieved good results (70+% accuracy), there are some things that I would like to cover.
The first and probably most important thing is that we did not use “neutral tweets”. While some (if not most) papers simply ignore tweets with neutral content, I find that a sentiment analysis model that can only separate positive from negative tweets is lacking. In fact, in the startup I mentioned we had to consider neutral tweets, so here are a couple of things we tried:
- Creating a model like the one we built, but adding a threshold interval: tweets that the model could not confidently place as either negative or positive would be labeled neutral. This turned out not to work because our models separated the two classes quite clearly, so very few tweets fell inside the interval.
- Considering every tweet without emoticons a neutral tweet. We had labeled tweets and noticed that about 70% had neutral content, so the idea was that this would not impact our performance that much. It did.
- Adding neutral text from other sources. We experimented with adding sentences from different review sites like Zomato, Yelp, etc. to create a neutral dataset. This actually worked quite well.
- Analyzing the Twitter content and inferring neutral tweets based on POS tags. This was done here and it seems like a promising idea, but it would be a different project altogether.
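To make the first idea in the list above more concrete, here is a minimal sketch of a threshold interval. It assumes the classifier outputs a probability for the positive class; the function name `label_with_neutral_band`, the cut-offs `0.4` and `0.6`, and the example scores are all made up for illustration and are not the values we used.

```python
def label_with_neutral_band(p_positive, low=0.4, high=0.6):
    """Map a positive-class probability to a three-way label.

    `low` and `high` are hypothetical cut-offs: anything strictly below
    `low` is negative, anything strictly above `high` is positive, and
    everything in between (bounds included) is treated as neutral.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("p_positive must be a probability in [0, 1]")
    if p_positive < low:
        return "negative"
    if p_positive > high:
        return "positive"
    return "neutral"


# Made-up classifier outputs, only to show the behaviour.
scores = [0.02, 0.97, 0.55, 0.10, 0.91]
labels = [label_with_neutral_band(s) for s in scores]
print(labels)
```

Notice that when a model separates the two classes cleanly, most scores end up close to 0 or 1, so almost nothing lands inside the band. That is the problem we ran into with this approach.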
The second point I want to talk about is that the training dataset we used was quite small. Twitter has a rate limit on how much information you can get from it, and the Portuguese community on Twitter is not that big (at least for residents of Portugal). There is also the fact that, in this notebook, we only played around with default values (for word2vec, PCA and the models). We would probably be able to improve performance if we played around a bit more. Still, the point of these posts was to give an overview of how this kind of thing is/was done.
Lastly, some thoughts about working on this project. While we started with the idea of creating a perfect sentiment analysis engine, it became clear that, rather than worrying about identifying positive and neutral tweets, we should have been focusing on identifying negative tweets. In the context of the whole project, negative sentences bring much more information about your business than positive or neutral ones (you can learn much more from criticism than from praise).
There are also some cool things we started doing with word2vec and relationships between promotions and companies, but I’ll leave that for another post.
Hope you have enjoyed reading about this small adventure. Let me know if you want to hear more about it!