Sentiment Analysis (PT-PT) – Part 2/3

With the data in hand, we set out to research sentiment analysis methods that would fit our criteria. One such method uses Word2Vec to measure sentiment and, since we had already looked at Word2Vec for other challenges we were facing, we decided to go for it. This post is thus divided into two sections: Word2Vec and SA modeling.


I am not going into detail about word2vec here, but basically it trains a shallow neural network, using either a continuous bag-of-words or a skip-gram architecture (Spark's implementation uses skip-gram), to get a vector representation of each word in a corpus of text. Fortunately for us, Spark already had this model in its ML library, so implementing it was pretty straightforward.

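# raw_tweets is an assumed name for the RDD of raw tweet strings
training_set = raw_tweets.map(lambda x: tokenizeTweet(x)).cache()
# assuming the RDD-based pyspark.mllib.feature.Word2Vec, which is fitted on an RDD of token lists
word2vec_model = Word2Vec().setVectorSize(VECTOR_SIZE).fit(training_set)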

We decided to pick a couple of words to test our model and the results were quite impressive:

Original | Top 1 | Top 2 | Top 3 | Top 4 | Top 5
*kiay is a famous night club
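
For readers curious about what such a "top 5" lookup does, here is a minimal, Spark-free sketch. It ranks words by cosine similarity, the usual way of comparing word2vec vectors. The tiny `toy_vectors` table and the `cosine` and `top_n` helpers are made up for illustration and do not come from our model; on a trained Spark model the corresponding call would be `findSynonyms(word, num)`.

```python
import math

# Hypothetical 3-dimensional vectors; a real model has far more dimensions.
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
    "club":  [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_n(word, vectors, n=5):
    """Return the n words most similar to `word`, best match first."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

neighbours = top_n("good", toy_vectors, n=2)
```

In this toy table "great" ranks first for "good" because the two vectors point in almost the same direction, while "bad" ranks last because it points the opposite way.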

Loading data and PCA

While researching Word2Vec we came across Doc2Vec (an extended approach that looks at complete documents or sentences). We gave it a try, but the results were not substantially different from those of word2vec. Furthermore, no PySpark implementation of this algorithm was available. Since we are dealing with short sentences (140 characters at most), we decided instead to define each sentence as the average of the vectors of its words.

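# word -> vector table saved with the model (assumed to be stored as Parquet)
lookup = sqlContext.read.parquet("model/twitter_word2vec_"+str(VECTOR_SIZE)+".ml/data").alias("lookup")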
lookup_bd = sc.broadcast(lookup.rdd.collectAsMap()) # at this point Spark does not allow the model to be used inside operations
import numpy as np

def tweet2vec(wlist):
    vecs = []
    for word in wlist:
        vec = lookup_bd.value.get(word)
        if vec is not None:
            vecs.append(vec)
        else:
            vecs.append([0] * VECTOR_SIZE) # we could use instead the median when word not found
    if len(vecs) > 0:
        # element-wise mean over all word vectors of the tweet
        return np.mean(vecs, axis=0,dtype='float').tolist()
    return [0] * VECTOR_SIZE
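
To make the averaging idea concrete, here is a small Spark-free sketch on a toy lookup table. The names `toy_lookup` and `average_vectors` and the three-dimensional vectors are hypothetical and only serve the illustration; unknown words are counted as zero vectors, which is one possible choice among several.

```python
VECTOR_SIZE = 3  # assumed toy dimension

# Hypothetical lookup table standing in for the broadcast word -> vector map.
toy_lookup = {
    "bom": [1.0, 0.0, 2.0],
    "dia": [3.0, 2.0, 0.0],
}

def average_vectors(words, lookup, size):
    """Average the word vectors of a sentence; unknown words count as zeros."""
    if not words:
        return [0.0] * size
    totals = [0.0] * size
    for word in words:
        vec = lookup.get(word, [0.0] * size)
        for i, value in enumerate(vec):
            totals[i] += value
    return [total / len(words) for total in totals]

sentence_vector = average_vectors(["bom", "dia", "xyz"], toy_lookup, VECTOR_SIZE)
```

Note that counting unknown words as zeros pulls the average towards the origin; skipping them instead would keep the average closer to the known words, at the cost of ignoring how much of the sentence was out of vocabulary.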

When choosing the dimension of our word2vec model we followed the usual guidelines, but it is quite likely that many of the resulting dimensions were correlated, so we decided to apply PCA and see whether this affected our models. If you are not familiar with PCA, Wikipedia has a good introduction.

from pyspark.ml.feature import PCA
from pyspark.mllib.regression import LabeledPoint

df = sqlContext.createDataFrame(s_training_set, ["features","label"]) # PCA only available for Dataframes at the moment
pca = PCA(k=5, inputCol="features", outputCol="pcaFeatures")
pcaModel = pca.fit(df)
result = pcaModel.transform(df).select(["pcaFeatures","label"]) # explained variance not available as of yet.
pca_training_set = result.rdd.map(lambda x: LabeledPoint(x["label"],x["pcaFeatures"])).cache()
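
Since the explained variance was not exposed at the time, it may help to see what that number means. The sketch below is a Spark-free illustration restricted to two dimensions, where the eigenvalues of the covariance matrix have a closed form; the helper names and the toy data are hypothetical, and real word2vec vectors would need a proper linear algebra library.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (var_x, cov_xy, var_y) of 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    return var_x, cov_xy, var_y

def explained_variance_ratio_2d(points):
    """Share of the total variance captured by each principal component.

    Assumes the data is not constant, so the total variance is non-zero.
    """
    a, b, d = covariance_2d(points)
    # Eigenvalues of the symmetric matrix [[a, b], [b, d]] in closed form.
    spread = math.sqrt((a - d) ** 2 + 4 * b ** 2)
    first = (a + d + spread) / 2
    second = (a + d - spread) / 2
    total = first + second
    return first / total, second / total

# Hypothetical, strongly correlated toy data: y is roughly 2 * x.
toy_points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
ratios = explained_variance_ratio_2d(toy_points)
```

When two dimensions are this correlated, the first component carries almost all of the variance, which is exactly the situation in which reducing the number of dimensions loses little information.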

From this point on we are ready to test out some models!


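Looking at Spark MLlib (the RDD-based API, since superseded by the DataFrame-based ml package), we have a lot of interesting classification algorithms (logistic regression, linear SVM, decision trees, naive Bayes) plus some ensemble models (gradient-boosted trees and random forests) we can try out.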

Applying them is pretty straightforward:

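from pyspark.mllib.classification import LogisticRegressionWithSGD, LogisticRegressionWithLBFGS, SVMWithSGD, NaiveBayes
from pyspark.mllib.tree import DecisionTree, GradientBoostedTrees, RandomForest

# each line trains one candidate classifier on the same RDD of LabeledPoint
model = LogisticRegressionWithSGD.train(s_training_set)
model = LogisticRegressionWithLBFGS.train(s_training_set)
model = SVMWithSGD.train(s_training_set)
model = DecisionTree.trainClassifier(s_training_set, categoricalFeaturesInfo={}, numClasses=2)
model = GradientBoostedTrees.trainClassifier(s_training_set, categoricalFeaturesInfo={})
model = RandomForest.trainClassifier(s_training_set, categoricalFeaturesInfo={}, numTrees=200, numClasses=2, seed=42)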

The exception is naive Bayes, which does not accept negative feature values. Luckily for us, the components of our word2vec vectors fall between -1 and 1, so rescaling them to the 0-1 range is quite easy:

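def scale_features(features):
    temp = []
    for feature in features:
        # linear map from [-1, 1] to [0, 1] (assumed rescaling)
        temp.append((feature + 1) / 2.0)
    return temp

# s_training_set is assumed to be the same RDD of LabeledPoint used for the other models
naive_training_set = s_training_set.map(lambda x: LabeledPoint(x.label, scale_features(x.features)))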
model = NaiveBayes.train(naive_training_set, 1.0)
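
If the features are not guaranteed to stay inside a fixed range, a min-max rescaling learned from the data is a common alternative. The sketch below is Spark-free and purely illustrative; `fit_min_max`, `min_max_scale` and the toy rows are hypothetical names and values, not part of our pipeline.

```python
def fit_min_max(rows):
    """Collect the per-column minimum and maximum over all feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def min_max_scale(row, mins, maxs):
    """Map each value into [0, 1] using its column range; constant columns become 0."""
    scaled = []
    for value, lo, hi in zip(row, mins, maxs):
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled

# Hypothetical feature rows whose first column falls outside [-1, 1].
toy_rows = [[-1.5, 0.2], [0.5, 0.2], [2.5, 0.2]]
mins, maxs = fit_min_max(toy_rows)
scaled_rows = [min_max_scale(r, mins, maxs) for r in toy_rows]
```

One caveat with this approach: values seen only at prediction time can fall outside the learned range and come out below 0 or above 1, so they may need clipping before being passed to naive Bayes.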

Now that it is all set up, let's test our models and figure out what can be improved!
