In this post I will cover the process of Data Extraction, Cleaning and Processing. The first challenge we faced was getting the data. There were three major issues:
- We needed a lot of text
- We needed the text to be in Portuguese (from Portugal, not from Brazil)
- We needed to be able to label this text quickly and inexpensively
Data Extraction
For the first issue we looked at several sources of information, from Facebook to YouTube to Zomato and even Portuguese book review databases, and decided that Twitter was the way to go. Not only does it have a lot of information readily available, but it also suited our main objective of tackling social media content. The next issue was how to identify Portuguese text and, in particular, how to distinguish European Portuguese from Brazilian Portuguese. To do this we created a crawler that looked for users located in Portugal whose profile language was Portuguese. We also made sure the tweets themselves were in Portuguese, since Portuguese users can still tweet in English (we tested langid, but Twitter proved better at labeling languages). Finally, we found a lot of accounts repeating tweets (almost like spam accounts), so we made sure to consider only distinct tweets.
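The crawler itself is out of scope for this post, but a minimal sketch of the idea might look like the following, assuming the tweepy library against the v1.1 REST API of the time (the seed account, credentials, and limits here are placeholders, not the ones actually used):

```python
import tweepy

# placeholder credentials
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# walk the followers of a seed account, keeping users whose profile
# claims a Portuguese location and language
portuguese_users = []
for user in tweepy.Cursor(api.followers, screen_name="some_seed_account").items(1000):
    if "portugal" in (user.location or "").lower() and user.lang == "pt":
        portuguese_users.append(user.id)

# fetch their timelines, trusting Twitter's own language detection
alltweets = []
for user_id in portuguese_users:
    for status in tweepy.Cursor(api.user_timeline, user_id=user_id).items(200):
        if status.lang == "pt":
            # store the account id so repeated tweets by the same
            # account can be dropped later with distinct()
            alltweets.append({"id": status.user.id, "detail": status.text, "lang": status.lang})
```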
```python
tweets = alltweets.filter("lang LIKE 'pt'")
print(tweets.count(), "language filtered tweets.")

# drop repeated (account, text) pairs to remove spam
beforeSpamRemoval = tweets.count()
tweets = tweets.map(lambda x: (x["id"], x["detail"])).distinct().cache()
afterSpamRemoval = tweets.count()
print(beforeSpamRemoval - afterSpamRemoval, "spam tweets removed")

# keep only the text, clean it, and keep the tweets that contain emoticons
tweets_text_only = tweets.map(lambda x: x[1])
processed_tweets = tweets_text_only.map(lambda x: processTweet(x)).cache()
tweets_with_emoticons = processed_tweets.filter(lambda x: len(get_emoticons(x)) > 0).cache()
print(tweets_with_emoticons.count(), "tweets with emoticons")
```
Next comes the problem of labeling. Since we had a small team and needed to reach a consensus on the neutrality of each tweet, we decided to take the approach described in this paper: label tweets based on the emoticons they contain. We then searched the web for the best regexes to find emoticons and classify them as either positive or negative (with some small adaptations):
Identify emoticons: (\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8]'?[\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)
Positive emoticons: [\:\;\=8B]'?[)>DpP*3]|[\:\;\=8B]'?[\-\^][)>DpP*3]
Negative emoticons: [\:\;\=8B]'?[(<\\/$Oo\|]|[\:\;\=8B]'?[\-\^][(<\\/$Oo\|]
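These patterns feed the get_emoticons and extract_polarity_from_emoticons helpers that appear in the code snippets. The exact implementations are not reproduced here, but a minimal sketch built directly from the regexes above might look like this (the handling of tweets with both positive and negative emoticons is an assumption):

```python
import re

EMOTICON_RE = re.compile(
    r"(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]"
    r"|[\:\;\=B8]'?[\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)"
)
POSITIVE_RE = re.compile(r"[\:\;\=8B]'?[)>DpP*3]|[\:\;\=8B]'?[\-\^][)>DpP*3]")
NEGATIVE_RE = re.compile(r"[\:\;\=8B]'?[(<\\/$Oo\|]|[\:\;\=8B]'?[\-\^][(<\\/$Oo\|]")

def get_emoticons(text):
    """Return every emoticon found in a tweet.

    e.g. get_emoticons("a foto ficou desfocada :) <3") -> [':)', '<3']
    """
    return EMOTICON_RE.findall(text)

def extract_polarity_from_emoticons(text):
    """Label text as positive, negative or neutral based on its emoticons."""
    positive = bool(POSITIVE_RE.search(text))
    negative = bool(NEGATIVE_RE.search(text))
    if positive and not negative:
        return "positive"
    if negative and not positive:
        return "negative"
    return "neutral"  # no emoticons, or mixed signals
```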
We then took a look at which emoticons were actually present and tested our regexes. The five most common were:
- ':)', 'positive'
- ';3', 'positive'
- ':D', 'positive'
- ';)', 'positive'
- ':(', 'negative'
The most surprising one was ;3. We later found out that the 3 comes from <3, which stands for a heart, so we actually got lucky.
```python
# count how often each emoticon appears, then attach its polarity
emoticons = alltweets.flatMap(lambda x: get_emoticons(x["detail"])) \
                     .map(lambda x: (x, 1)) \
                     .reduceByKey(lambda a, b: a + b)
emoticons = emoticons.map(lambda x: (x, extract_polarity_from_emoticons(x[0])))
# the five most frequent emoticons
emoticons.takeOrdered(5, key=lambda x: -x[0][1])
```
Data Cleaning and Processing
Following the suggestions of this paper, and some observations of our own, we applied a set of transformations that turns each tweet into a list of words we can then use in our models. These included:
- Removing extra spaces, trimming the text, and dealing with special cases like hashtags or URLs
- Swapping Portuguese special characters such as á, ão, etc. for their unaccented equivalents
- Removing punctuation (but not before swapping emoticons for "HAPPYSMILEY" or "SADSMILEY")
- Tokenizing words
- Removing repeated letters ("Fiixe" means the same as "Fixe", Portuguese slang for "cool")
- Removing stopwords (from the NLTK Portuguese corpus)
- Removing numbers
- Stemming words (this one turned out to cripple our results)
The end result for a sentence like "@TVieira a foto ficou desfocaadaaa! ée a base de uma parece e o chão de terra;)" (roughly: "@TVieira the photo came out blurry! it's the base of a wall and the dirt floor ;)") would be:
['foto', 'ficou', 'desfocada', 'base', 'parece', 'chao', 'terra', 'HAPPYSMILEY'] (non-stemmed version)
['fot', 'fic', 'desfoc', 'bas', 'parec', 'cha', 'terr', 'happysmiley'] (stemmed version)
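The cleaning code itself is not shown here, but a minimal sketch that reproduces the example above might look like this (tweet_to_tokens is a hypothetical name, and the repeated-letter rule is a guess at what reproduces the example output, since "desfocaadaaa" collapses to "desfocada" while "terra" keeps its double r):

```python
import re
import unicodedata
from nltk.corpus import stopwords   # requires nltk.download('stopwords')
from nltk.stem import RSLPStemmer   # requires nltk.download('rslp')

# emoticon patterns from the sketch earlier in the post
POSITIVE_RE = re.compile(r"[\:\;\=8B]'?[)>DpP*3]|[\:\;\=8B]'?[\-\^][)>DpP*3]")
NEGATIVE_RE = re.compile(r"[\:\;\=8B]'?[(<\\/$Oo\|]|[\:\;\=8B]'?[\-\^][(<\\/$Oo\|]")

PT_STOPWORDS = set(stopwords.words("portuguese"))
STEMMER = RSLPStemmer()  # NLTK's Portuguese (RSLP) stemmer

def tweet_to_tokens(tweet, stem=False):
    text = tweet.strip()
    # drop mentions, hashtags and urls
    text = re.sub(r"@\w+|#\w+|https?://\S+", " ", text)
    # swap emoticons for placeholder words before punctuation is stripped
    text = POSITIVE_RE.sub(" HAPPYSMILEY ", text)
    text = NEGATIVE_RE.sub(" SADSMILEY ", text)
    # normalize accented characters (á -> a, ã -> a, ç -> c, ...)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # collapse repeated vowels and runs of 3+ of the same letter
    # (assumed rule: "desfocaadaaa" -> "desfocada", "terra" untouched)
    text = re.sub(r"([aeiou])\1+", r"\1", text)
    text = re.sub(r"(\w)\1{2,}", r"\1", text)
    # strip punctuation, then tokenize on whitespace
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [t if t in ("HAPPYSMILEY", "SADSMILEY") else t.lower()
              for t in text.split()]
    # drop stopwords and pure numbers
    tokens = [t for t in tokens if t not in PT_STOPWORDS and not t.isdigit()]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens
```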
The last step before moving on is to keep only the tweets with either negative or positive connotations. We will discuss the importance of a neutral class later (in the evaluation part). We also make sure that both classes are equally represented in our final dataset.
```python
# label each tweet from its emoticons: (label, text)
labeled_tweets = tweets_with_emoticons.map(lambda x: (extract_polarity_from_emoticons(x), x))

# remove the neutral tweets
modelInput = labeled_tweets.filter(lambda x: x[0] != 'neutral')

# resample so that we have a balanced dataset
countLabels = modelInput.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
labelRatio = countLabels.collectAsMap()
print(labelRatio)

positive = labelRatio["positive"]
negative = labelRatio["negative"]
smallest = min(positive, negative)
fractions = {"positive": smallest / positive, "negative": smallest / negative}

# note: sampleByKey draws each record independently, so the classes end up
# only approximately balanced
sampled_model_input = modelInput.sampleByKey(False, fractions, seed=42)

countLabels = sampled_model_input.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
labelRatio = countLabels.collectAsMap()
print(labelRatio)
```
In the next post we will cover how to translate the list of words we obtained above into a usable vector that we can then feed to our models. See you there!