With the data in hand, we set out to research sentiment analysis methods that would fit our criteria. One such method uses Word2Vec to measure sentiment and, since we had already looked at word2vec for other challenges we were facing, we decided to go for it. This post is thus divided into two sections: Word2Vec and sentiment analysis (SA) modeling.

# Word2Vec

I am not going into detail about word2vec (this might help) but basically, using either a continuous bag-of-words or a skip-gram architecture, you get a vector representation of each word in a corpus of text. Fortunately for us, Spark already had this model in its ML library (its implementation uses skip-gram), so implementing it was pretty straightforward.

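```python
from pyspark.mllib.feature import Word2Vec

VECTOR_SIZE = 50

# Each tweet becomes a list of tokens; cache it because fit() iterates over the data
training_set = processed_tweets.map(tokenizeTweet).cache()

# Parentheses let the chained setters span several lines
word2vec_model = (Word2Vec()
                  .setLearningRate(0.01)
                  .setMinCount(5)
                  .setNumIterations(1)
                  .setNumPartitions(1)
                  .setVectorSize(VECTOR_SIZE)
                  .setSeed(42)
                  .fit(training_set))
```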

We decided to pick a couple of words to test our model and the results were quite impressive:

| Original | Top 1 | Top 2 | Top 3 | Top 4 | Top 5 |
|---|---|---|---|---|---|
| Cansado | cansada | ultimamente | nervosa | exausto | podre |
| Festa | celebrar | comemorar | jantar | kiay* | sabado |
| Corrida | km | running | nike | cascais | accao |

###### *kiay is a famous night club
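
For intuition, a "Top N" lookup like the one above boils down to ranking every word in the vocabulary by its similarity to the query word. In PySpark's MLlib the trained model exposes this as `findSynonyms(word, num)`. The plain-Python sketch below mimics the idea with cosine similarity on a tiny set of 3-dimensional vectors; the vectors and the helper names are made up for illustration and are not values from our model.

```python
import math

# Toy word vectors: hypothetical values for illustration only.
toy_vectors = {
    "cansado": [0.9, 0.1, 0.0],
    "exausto": [0.8, 0.2, 0.1],
    "festa":   [0.0, 0.9, 0.3],
    "jantar":  [0.1, 0.8, 0.4],
    "corrida": [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, top_n=5):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

neighbours = most_similar("cansado", toy_vectors, top_n=2)
```

With the real model, the equivalent call would be along the lines of `word2vec_model.findSynonyms("cansado", 5)`.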

# Loading data and PCA

While researching Word2Vec we came across Doc2Vec (an extension that learns vectors for complete documents or sentences). We gave it a try, but the results were not substantially different from using word2vec. Furthermore, a PySpark implementation of this algorithm was not available. Since we are dealing with short sentences (at most 140 characters), we decided instead to define each sentence as the average of the vectors of its words.

```python
import numpy as np

lookup = sqlContext.read.parquet("model/twitter_word2vec_" + str(VECTOR_SIZE) + ".ml/data").alias("lookup")
# at this point Spark does not allow the model to be used inside operations,
# so we broadcast the word -> vector mapping as a plain dictionary
lookup_bd = sc.broadcast(lookup.rdd.collectAsMap())

def tweet2vec(wlist):
    vecs = []
    for word in wlist:
        vec = lookup_bd.value.get(word)
        if vec is not None:
            vecs.append(vec)
        else:
            # we could use instead the median when word not found
            vecs.append([0] * VECTOR_SIZE)
    if len(vecs) > 0:
        # element-wise mean of all word vectors in the tweet
        return np.mean(vecs, axis=0, dtype='float').tolist()
    return [0] * VECTOR_SIZE
```
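
To make the averaging step concrete, here is a small standalone sketch of the same idea without Spark or NumPy. The `toy_lookup` dictionary and the `skip_unknown` flag are assumptions added for illustration. The flag shows one alternative to padding unknown words with zeros, which otherwise pulls the sentence vector toward the origin.

```python
def average_vectors(words, lookup, size, skip_unknown=False):
    """Average the vectors of `words`; unknown words are zero-padded or skipped."""
    vecs = []
    for word in words:
        vec = lookup.get(word)
        if vec is not None:
            vecs.append(vec)
        elif not skip_unknown:
            vecs.append([0.0] * size)  # same fallback as the zero padding above
    if not vecs:
        return [0.0] * size
    # zip(*vecs) groups the i-th component of every vector together
    return [sum(component) / len(vecs) for component in zip(*vecs)]

# Hypothetical 2-dimensional vectors; "kiay" is deliberately missing.
toy_lookup = {"boa": [1.0, 0.0], "festa": [0.0, 1.0]}
tokens = ["boa", "festa", "kiay"]

padded = average_vectors(tokens, toy_lookup, 2)
skipped = average_vectors(tokens, toy_lookup, 2, skip_unknown=True)
```

Here `padded` divides by three words while `skipped` divides by the two known ones, so the more out-of-vocabulary words a tweet has, the more the two strategies diverge.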

When choosing the dimension of our word2vec model we followed the usual guidelines, but it is quite likely that many of the resulting dimensions were correlated, so we decided to apply PCA to see whether this impacted our model. If you are not familiar with PCA, here is Wikipedia for you.

```python
from pyspark.ml.feature import PCA
from pyspark.mllib.regression import LabeledPoint

# PCA only available for DataFrames at the moment
df = sqlContext.createDataFrame(s_training_set, ["features", "label"])

pca = PCA(k=5, inputCol="features", outputCol="pcaFeatures")
pcaModel = pca.fit(df)

# explained variance not available as of yet
result = pcaModel.transform(df).select(["pcaFeatures", "label"])

# go back to an RDD of LabeledPoint for the MLlib classifiers
pca_training_set = result.rdd.map(lambda x: LabeledPoint(x["label"], x["pcaFeatures"])).cache()
```
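
Since the PCA model above did not expose the explained variance, one way to sanity-check a choice of `k` is to compute it separately on a sample. The sketch below is a plain-Python illustration, not Spark's implementation: it estimates the first principal component with power iteration and reports the share of total variance it captures. The function names and the sample data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equally sized rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def top_eigenpair(matrix, iterations=200):
    """Largest eigenvalue and its eigenvector via power iteration.

    Assumes the start vector is not orthogonal to the top eigenvector.
    """
    dims = len(matrix)
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        product = mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue for the unit vector `vec`
    eigenvalue = sum(v * m for v, m in zip(vec, mat_vec(matrix, vec)))
    return eigenvalue, vec

def explained_variance_ratio(rows):
    """Share of total variance captured by the first principal component."""
    cov = covariance_matrix(rows)
    total = sum(cov[i][i] for i in range(len(cov)))
    eigenvalue, _ = top_eigenpair(cov)
    return eigenvalue / total if total else 0.0

# Perfectly correlated toy data: the second column is twice the first.
sample = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
ratio = explained_variance_ratio(sample)
```

A ratio close to 1 means a single component already describes the data, which is the kind of redundancy that motivated trying PCA in the first place.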

From this point on we are ready to test out some models!

# Modeling

Looking at Spark MLlib (the RDD-based API, which has since been superseded by the DataFrame-based ml package), we have a lot of interesting classification algorithms, plus some ensemble models, that we can try out:

- LogisticRegressionWithSGD
- LogisticRegressionWithLBFGS
- SVMWithSGD
- NaiveBayesModel
- NaiveBayes
- DecisionTree
- RandomForest
- GradientBoostedTrees

Applying them is pretty straightforward:

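```python
from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                          LogisticRegressionWithLBFGS,
                                          SVMWithSGD)
from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees, RandomForest

# Each line trains a different classifier on the same RDD of LabeledPoint
model = LogisticRegressionWithSGD.train(s_training_set)
model = LogisticRegressionWithLBFGS.train(s_training_set)
model = SVMWithSGD.train(s_training_set)
model = DecisionTree.trainClassifier(s_training_set, categoricalFeaturesInfo={}, numClasses=2)
model = GradientBoostedTrees.trainClassifier(s_training_set, categoricalFeaturesInfo={})
model = RandomForest.trainClassifier(s_training_set, categoricalFeaturesInfo={}, numTrees=200, numClasses=2, seed=42)
```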

with the exception of naive Bayes, which does not allow negative feature values (luckily for us, the components of our word2vec vectors fall between -1 and 1, so rescaling them to 0-1 is quite easy; word2vec itself does not guarantee this range):

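```python
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

def scale_features(features):
    # map each component from [-1, 1] to [0, 1]
    return [(feature + 1) / 2.0 for feature in features]

naive_training_set = s_training_set.map(lambda x: LabeledPoint(x.label, scale_features(x.features)))
model = NaiveBayes.train(naive_training_set, 1.0)  # 1.0 is the smoothing parameter
```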
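
If the components cannot be assumed to lie between -1 and 1, a data-driven alternative is min-max scaling, where each feature's observed minimum and maximum define the range. The sketch below shows this on plain Python lists. `min_max_scale` is a hypothetical helper, and in Spark the per-feature minima and maxima would first have to be computed over the whole RDD.

```python
def min_max_scale(rows):
    """Rescale every column of `rows` to the [0, 1] interval."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # a constant column has no spread, so map it to 0.0
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ])
    return scaled

# Toy vectors whose first component falls outside [-1, 1].
vectors = [[-1.5, 0.2], [0.5, 0.2], [2.5, 0.2]]
scaled = min_max_scale(vectors)
```

The trade-off is that the scaling parameters now depend on the training data, so the same `lows` and `highs` would need to be reused when scaling new tweets.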

Now that it is all set up, let's test our models and figure out what can be improved!