Sentiment Analysis (PT-PT) – Part 3/3

Evaluation

To evaluate our models we manually labeled about 1500 tweets as neutral, positive or negative. This proved to be a hard task, since for a large share of the tweets no consensus was reached; when in doubt we labeled the tweet as neutral. We then built a test sample with equal parts negative and positive tweets (we also evaluated the models against the training set so we could check for overfitting). Lastly, to evaluate the models built with PCA, we had to apply the same PCA transformation to the test set:

from pyspark.mllib.regression import LabeledPoint

# apply the PCA model fitted on the training set to the test features
df = sqlContext.createDataFrame(s_test_set, ["features", "label"])
result = pcaModel.transform(df).select(["pcaFeatures", "label"])
pca_test_set = result.map(lambda x: LabeledPoint(x["label"], x["pcaFeatures"])).cache()
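For completeness, the balanced test set s_test_set used above could have been built along these lines (a minimal sketch with hypothetical names such as labeled_rdd, not the exact code we used):

# labeled_rdd: RDD of LabeledPoint built from the ~1500 manually labeled tweets
# (neutral tweets discarded, labels: 0.0 = negative, 1.0 = positive)
positives = labeled_rdd.filter(lambda p: p.label == 1.0)
negatives = labeled_rdd.filter(lambda p: p.label == 0.0)

# downsample the majority class to the size of the minority class
n = min(positives.count(), negatives.count())
s_test_set = positives.sample(False, n / float(positives.count()), seed=42) \
                      .union(negatives.sample(False, n / float(negatives.count()), seed=42)) \
                      .cache()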

Once our models are trained we can use their predict function and compare the true labels against the predicted ones:

# extract the feature vectors and labels from the LabeledPoint RDDs
trainingFeatures = s_training_set.map(lambda x: x.features)
trainingLabels = s_training_set.map(lambda x: x.label)
trainingPredictions = model.predict(trainingFeatures)
trainingResults = trainingLabels.zip(trainingPredictions)

testingFeatures = s_test_set.map(lambda x: x.features)
testingLabels = s_test_set.map(lambda x: x.label)
testingPredictions = model.predict(testingFeatures)
testingResults = testingLabels.zip(testingPredictions)
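The models trained on the PCA features are evaluated in exactly the same way, just against the transformed sets (pca_model here is a hypothetical name for a classifier trained on pca_training_set in the previous part):

# same evaluation, but on the PCA-transformed test set
pcaTestingLabels = pca_test_set.map(lambda x: x.label)
pcaTestingFeatures = pca_test_set.map(lambda x: x.features)
pcaTestingPredictions = pca_model.predict(pcaTestingFeatures)
pcaTestingResults = pcaTestingLabels.zip(pcaTestingPredictions)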

At this point we are ready to calculate the accuracy of our model, as well as other metrics such as the area under the ROC curve. PySpark's default evaluation libraries did not provide an accuracy measure out of the box, so we calculated it ourselves. We also printed the prediction distribution to check whether the model was biased towards one of the classes.

from pyspark.mllib.evaluation import BinaryClassificationMetrics

# accuracy = 1 - fraction of (label, prediction) pairs that disagree
testingError = testingResults.filter(lambda x: x[0] != x[1]).count() / float(testingResults.count())
print("Accuracy (model) " + str(round((1 - testingError) * 100, 2)) + "%")

# BinaryClassificationMetrics expects (score, label) pairs, so we swap the tuple
metrics = BinaryClassificationMetrics(testingResults.map(lambda x: (x[1], x[0])))
print('ROC ', metrics.areaUnderROC)

# prediction distribution: how many tweets were predicted as each class
print("Prediction distribution")
countLabels = testingPredictions.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)
labels = countLabels.collectAsMap()
print(labels)
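Depending on the Spark version, pyspark.mllib.evaluation.MulticlassMetrics can also build a confusion matrix from the same pairs, which is another way to spot a bias towards one class (a minimal sketch, not part of our original pipeline):

from pyspark.mllib.evaluation import MulticlassMetrics

# MulticlassMetrics expects (prediction, label) pairs
multi_metrics = MulticlassMetrics(testingResults.map(lambda x: (x[1], x[0])))
print(multi_metrics.confusionMatrix().toArray())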

For our Base model we decided to always predict the majority label. Since we are using balanced datasets for both training and testing, this yields an accuracy of about 50%. We will discuss this in greater detail in the last chapter.
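Such a baseline boils down to a few lines (a sketch under the same variable names used above; any constant prediction works, as long as it is the majority class of the training set):

# find the majority label in the training set and always predict it
majorityLabel = s_training_set.map(lambda x: (x.label, 1)) \
                              .reduceByKey(lambda a, b: a + b) \
                              .sortBy(lambda x: x[1], ascending=False) \
                              .first()[0]

baselineResults = testingLabels.map(lambda label: (label, majorityLabel))
baselineError = baselineResults.filter(lambda x: x[0] != x[1]).count() / float(baselineResults.count())
print("Accuracy (base) " + str(round((1 - baselineError) * 100, 2)) + "%")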

Here are the final results for each of our models:

Model                        | Training w/o PCA | Training w/ PCA | Testing w/o PCA | Testing w/ PCA
Logistic Regression w/ SGD   | 50.38%           | 50.22%          | 62.99%          | 63.28%
Logistic Regression w/ LBFGS | 63.7%            | 56.57%          | 63.88%          | 60.9%
SVM w/ SGD                   | 49.97%           | 49.96%          | 62.09%          | 63.28%
Decision Tree                | 61.04%           | 58.11%          | 51.34%          | 72.16%
Gradient Boosted Trees       | 72.6%            | 64.09%          | 51.34%          | 63.88%
Random Forest                | 61.61%           | 57.92%          | 68.66%          | 70.96%
Naive Bayes                  | 52.13%           | -               | 68.06%          | -
Base                         | 50.07%           | 50.07%          | 50.9%           | 50.9%

While the results are not spectacular, they are already well above our base model, with some of the tree-based models reaching an accuracy above 70%. As we predicted, using PCA helped improve the final results, probably by decorrelating the components of the word vectors obtained from word2vec. We can also see that some models, such as Gradient Boosted Trees, seem to overfit: their performance on the training set is much higher than on the test set. It is also quite interesting that some models performed better on the test set than on the training set. This is most likely because smiley faces can be ambiguous, so a fair number of tweets in the training set were probably labeled as positive or negative when in fact they are not.

In the next chapter we will wrap up what we have learned so far and discuss possible next steps to improve these results. See you there!
