Hi everyone! The purpose of this research and implementation is to visualize the pulse of the upcoming November 2020 US Presidential Election using NLP.
We used state-of-the-art (SOTA) transfer learning techniques for NLP.
As we know, transfer learning is pretty common nowadays when dealing with images, e.g. using the huge ImageNet dataset, but it is still rather uncommon or experimental for text classification.
The idea behind transfer learning is that instead of training a model from scratch, we take a model pre-trained on a large dataset and then fine-tune it for a specific natural language task, in this case classifying social media posts.
New techniques such as ULMFiT, OpenAI GPT, ELMo and Google AI’s BERT have revolutionized the field of transfer learning in NLP by using language modelling during pre-training.
For our downstream task we use the ULMFiT approach: the pre-trained language model is fine-tuned to perform classification on text from a different distribution (tweets), which has its own semantic rules.
For the election, we scraped Twitter data about the 2 main parties presenting candidates for President: Republicans and Democrats.
We used hashtags that represent both parties, their candidates, and terms that express positive and negative sentiment.
Why choose Stance Classification (SC) over Sentiment Analysis (SA)
Stance Classification (SC) is the task of inferring from text whether the author is in favor of a given target, against it, or has a neutral position toward it.
This task, which can be complex even for humans (1), is related to argument mining, subjectivity analysis, and sentiment classification.
Generic sentiment classification is formulated as determining whether a piece of text is positive, negative, or neutral. However, in SC, systems must detect favorability toward a given (pre-chosen) target of interest.
In this sense, SC is more similar to target-dependent sentiment classification (2), with a major difference that the target of the stance might not be explicitly mentioned in text or might not be the target of the opinion (3).
Transfer learning in NLP is typically done as a multi-step process.
A network is first pre-trained in an unsupervised manner with a language modelling objective.
Afterwards the model is fine-tuned on a new supervised task. For this, several people on our team manually labelled tweets about each topic.
We divided the data by the 2 main parties, then labelled each tweet with a positive, negative or neutral stance regarding that specific topic.
Brief overview of ULMFiT (Universal Language Model Fine-Tuning for Text Classification)
Most of the datasets available for text classification are not large.
This makes it difficult to train deep neural networks, because they do not generalize well and tend to overfit.
In computer vision, pretty much every training run builds on the formidable ImageNet corpus.
Since we are not trying to reinvent the wheel here, we reuse the general image features already learned and then retrain certain parts of the network for the specific vision task that we are dealing with.
Howard and Ruder (4) propose an LSTM-based model (AWD-LSTM) that is trained on a general language modeling (LM) task and then fine-tuned for a specific text classification task; in our case, target-topic stance classification of tweets scraped using the Twitter API for Republicans/Democrats.
This works well because training the model first on a big corpus allows it to learn the general semantics of the language; the transfer happens when it then learns the specifics of a downstream domain such as Twitter, a blog or a scientific paper, since each one has its own jargon and “way of talking”.
The LM is trained on the large corpus because there it can pick up long-term dependencies in long sentences, hierarchical relations, etcetera.
In our case, the model was first trained on the WikiText-103 corpus, which consists of 28,595 preprocessed English Wikipedia articles with about 103 million tokens. The pre-trained model is provided by the fast.ai library.
There is also an additional step for language model fine-tuning.
The provided training data, around 1,600 tweets, can be augmented with Kaggle’s Sentiment140 (5) dataset, which has 1.6 million general-purpose tweets.
By fine-tuning the language model on this larger Twitter dataset, we might better learn the structure of Twitter conversations.
When extending the vocabulary, we only used a portion of Sentiment140.
These are the 3 basic techniques, which are explained in more detail in the paper:
- Discriminative fine-tuning: Different learning rates (LR) are used for different layers while fine-tuning the LM on the target task. This makes sense because the layers capture different types of information.
- Slanted triangular learning rates (STLR): Learning rates are first increased linearly, and then decreased linearly after a cut, i.e., there is a “short increase” and a “long decay.”
- Gradual unfreezing: During classifier training, the LM is gradually unfrozen starting from the last layer. If all the layers were trained from the beginning, what the LM learned would be forgotten quickly, a problem known as catastrophic forgetting.
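To make these three ideas more concrete, here is a minimal plain-Python sketch of the schedules involved. It is not the fast.ai implementation we used; the function names are our own for illustration, and the default values (a factor of 2.6 between layers, cut_frac of 0.1, ratio of 32 and a maximum rate of 0.01) are the ones suggested in the paper (4).

```python
import math


def discriminative_lrs(top_lr, n_layers, factor=2.6):
    """Per-layer learning rates, lowest layer first.

    The last layer gets top_lr; each layer below it gets the rate of
    the layer above divided by `factor`.
    """
    return [top_lr / factor ** (n_layers - 1 - i) for i in range(n_layers)]


def stlr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at training step t (0-based).

    The rate rises linearly until step `cut`, then decays linearly.
    max(1, ...) is our own guard against very short runs.
    """
    cut = max(1, math.floor(total_steps * cut_frac))
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def unfreezing_schedule(n_layers):
    """Indices of the trainable layers in each epoch, last layer first."""
    return [list(range(n_layers - 1 - epoch, n_layers))
            for epoch in range(n_layers)]


if __name__ == "__main__":
    print(discriminative_lrs(0.01, 3))
    print([round(stlr(t, 1000), 5) for t in (0, 50, 100, 500, 1000)])
    print(unfreezing_schedule(3))
```

With 1000 training steps, the rate peaks at step 100 and then decays over the remaining 900 steps, which is the “short increase” and “long decay” described above.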
On the 6 text classification tasks that they evaluated, the error was reduced by 18–24% on the majority of tasks.
Further, the following was observed:
- Only 100 labeled samples in classification were sufficient to match the performance of a model trained from scratch on 50–100x more samples.
- Pretraining is more useful on small and medium-sized datasets.
- LM quality affects final classification performance.
Our Election Dataset
Our Twitter dataset tries to mimic the one used in the SemEval 2016 shared task (6), which contains tweets that pertain to five different topics.
In our case, we scraped tweets for 2 topics, one per party in the election. We did this in a Python Jupyter Notebook using the Tweepy library.
The labelled data consists of a target topic, the tweet contents and the labelled stance of the tweet towards the target.
The data is then split into a training set and a test set.
The stance can be one of three possible labels (positive, negative or neutral), hence this is a multi-class dataset.
The total number of tweets available for this task (in the training set) is around 1,663, which amounts to roughly 700-800 tweets per topic. Thus, this can be considered a small dataset.
- Republicans Sample
- Democrats Sample
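As a rough illustration of the record layout described above (target topic, tweet contents, stance) and of the train/test split, here is a small sketch. The class name, the 80/20 split ratio, the seed and the example tweets are all assumptions made up for illustration; they are not taken from our actual dataset or notebook.

```python
import random
from dataclasses import dataclass


@dataclass
class StanceExample:
    target: str  # topic of the tweet, e.g. "Republicans" or "Democrats"
    tweet: str   # raw tweet text
    stance: str  # "positive", "negative" or "neutral" toward the target


def train_test_split(examples, test_frac=0.2, seed=42):
    """Shuffle a copy of the examples and hold out test_frac for testing.

    The 20% test fraction is an assumption, not the ratio we used.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]


# Invented placeholder rows, only to show the shape of the data.
examples = [
    StanceExample("Democrats", f"placeholder tweet {i}", "neutral")
    for i in range(10)
]
train, test = train_test_split(examples)
print(len(train), len(test))
```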
Here’s a quick peek at the results as cross-tabulations of the labelled versus the predicted classes.
The values where the row and column labels match (the diagonal) are the correctly predicted tweets; all the other cells are misclassifications.
- Republicans Cross-tab validations
- Democrats Cross-tab validations
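For readers who want to build this kind of cross-tab themselves, here is a minimal sketch using only the standard library. The function name and the tiny label lists are invented for illustration; they are not our real predictions.

```python
from collections import Counter


def crosstab(gold, predicted, labels):
    """Rows are the gold (manually labelled) classes, columns the predictions."""
    counts = Counter(zip(gold, predicted))
    return [[counts[(g, p)] for p in labels] for g in labels]


labels = ["positive", "negative", "neutral"]
gold = ["positive", "positive", "negative", "neutral", "negative"]
predicted = ["positive", "negative", "negative", "neutral", "negative"]

table = crosstab(gold, predicted, labels)
for label, row in zip(labels, table):
    print(f"{label:>8}", row)

# Correct predictions sit on the diagonal of the table.
correct = sum(table[i][i] for i in range(len(labels)))
print("correct:", correct, "of", len(gold))
```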
After the whole process, including the fine-tuning, we measure the results with a tool provided by the SemEval task (6).
They provide a Perl script which evaluates the predictions based on the F1 score (7), which considers both the precision p and the recall r of the test.
To run it, we just need a test file with the predictions made by the newly created model and a “Gold Standard” file (the same tweets, manually labelled by us).
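The Perl script itself is not reproduced here, but the computation behind it can be sketched in a few lines of Python. To our understanding, the official SemEval 2016 metric is the macro-average of the F1 scores of the “favor” and “against” classes only; the sketch below assumes our positive/negative labels play those roles, and the function names and toy data are invented for illustration.

```python
def f1_for_label(gold, predicted, label):
    """F1 = 2pr / (p + r) for one class, with p = precision and r = recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels=("positive", "negative")):
    """Unweighted mean of the per-class F1 scores over `labels`."""
    return sum(f1_for_label(gold, predicted, l) for l in labels) / len(labels)


# Toy data, not our real predictions.
gold_labels = ["positive", "positive", "negative", "neutral", "negative"]
pred_labels = ["positive", "negative", "negative", "neutral", "negative"]
print(round(macro_f1(gold_labels, pred_labels), 4))
```

Note that the neutral class still influences the score indirectly: a neutral tweet wrongly predicted as positive counts as a false positive for the positive class.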
Here are the results:
This score is similar to the one originally published by SemEval task 6 and to the one achieved by the winning team, MITRE Corporation (8).
Pulse of the Election
Finally, we exported the generated models and created 3 different Elastic Beanstalk environments on our enterprise AWS account:
- WORKER ENVIRONMENT, which triggers every X minutes (30 mins. by default) and is responsible for fetching a batch of fresh tweets, making the predictions and storing them in AWS’s DynamoDB.
- API ENVIRONMENT, dockerized and served by Uvicorn, a lightning-fast ASGI server. It is responsible for scanning the DB to fetch the predictions that match the web application filters.
- WEB-APP ENVIRONMENT, a Node.js Express application that polls for new data and shows the pulse of the election in a 2-column view, one column per party.
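To give an idea of what one worker cycle does, here is a simplified sketch. It is not our production code: the function name, the record fields and the three injected callables (standing in for the Twitter fetch, the exported model and the DynamoDB write) are hypothetical, so the example runs without any AWS or Twitter credentials.

```python
from datetime import datetime, timezone


def run_worker_cycle(fetch_tweets, predict_stance, store_record,
                     parties=("Republicans", "Democrats")):
    """Fetch fresh tweets per party, predict their stance and store the results.

    fetch_tweets(party) -> iterable of tweet texts      (hypothetical)
    predict_stance(party, tweet) -> stance label        (hypothetical)
    store_record(record) -> None, e.g. a DB write       (hypothetical)
    """
    stored = 0
    for party in parties:
        for tweet in fetch_tweets(party):
            record = {
                "party": party,
                "tweet": tweet,
                "stance": predict_stance(party, tweet),
                "predicted_at": datetime.now(timezone.utc).isoformat(),
            }
            store_record(record)
            stored += 1
    return stored


# In-memory stand-ins for the real services.
fake_db = []
n_stored = run_worker_cycle(
    fetch_tweets=lambda party: [f"placeholder tweet about {party}"],
    predict_stance=lambda party, tweet: "neutral",
    store_record=fake_db.append,
)
print(n_stored, "records stored")
```

Passing the fetch, predict and store steps in as functions keeps the cycle easy to test locally before wiring it to the scheduled trigger.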
You can take a look at it by going to:
Hope you find this fun!
One thing to note: the field of NLP, and transfer learning in particular, keeps moving fast, so this is an exciting time.
New methods and techniques keep popping up every once in a while, so stay tuned!
(1) Walker et al., 2012a https://www.aclweb.org/anthology/N12-1072.pdf
(2) Jiang et al., 2011 https://www.aclweb.org/anthology/P11-1016.pdf
(3) Mohammad et al., 2016 https://www.aclweb.org/anthology/L16-1623.pdf
(4) Howard and Ruder, 2018, “Universal Language Model Fine-tuning for Text Classification” https://arxiv.org/abs/1801.06146