Financial Sentiment Analysis Part II – Sentiment Extraction

As promised, I’ll devote this second post to walking through the remaining part of the Financial Sentiment Analysis pipeline.

Just to recap, the steps we wanted to clarify are the following:

  1. Scrape the historical archives of a financial blog to get the following information for each post: date, keywords, text.
  2. Save all this information in a JSON file.
  3. Read the JSON file with Pandas and preprocess the text with NLTK (Natural Language Toolkit) and BeautifulSoup.
  4. Perform Sentiment Analysis on the clean text data in order to get sentiment scores for each day.
  5. Generate a final Pandas DataFrame and correlate it with stock prices to test our hypothesis.
  6. Repeat points 1-5 for as many blogs as possible. For the sake of simplicity I report only the pipeline for a single blog, Bloomberg Business Week. The code can easily be adapted to other websites; in any case, all my experiments are available on GitHub.

In the previous post we went through steps 1-2. The goal of this second article is to cover the remaining steps (3-5), from data cleaning to sentiment extraction.

Step 3 – Data Cleaning

This step is crucial, as it consists of preprocessing all the information stored in the JSON file generated by the web crawl. The main tasks to be carried out are:

  1. read the JSON file into a Pandas DataFrame
  2. unlist all the entries (XPath queries return Python lists)
  3. convert dates to datetime
  4. join the keywords and body columns into one text column and drop the original two
  5. the result at this point must be a Pandas DataFrame with 2 columns: date of the post and text of the post
  6. turn all the text to lowercase
  7. get rid of all the HTML tags in the text using BeautifulSoup
  8. tokenize the text and get rid of stop words using NLTK
  9. return the clean text in the form of a list of words

Two functions take care of this work: readJson (steps 1-5) and cleanText (steps 6-9).
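The original listings are not reproduced here, so what follows is only a minimal sketch of what two such functions might look like. The JSON field names (date, keywords, body) are assumptions, and stripHtml and stopWords are hypothetical helpers. Tokenization uses a plain regular expression as a stand-in for an NLTK tokenizer, and both helpers fall back to cruder standard-library behaviour when BeautifulSoup, NLTK or the NLTK stop-word corpus is not available.

```python
import io
import re

import pandas as pd


def readJson(source):
    """Steps 1-5: load the scraped posts and return a DataFrame with 'date' and 'text'."""
    # The raw cells are lists, so dates are parsed explicitly further down.
    df = pd.read_json(source, convert_dates=False)
    # XPath queries return lists: collapse every cell into a single string.
    for col in ("date", "keywords", "body"):  # assumed field names
        df[col] = df[col].apply(lambda v: " ".join(v) if isinstance(v, list) else v)
    df["date"] = pd.to_datetime(df["date"])
    df["text"] = df["keywords"] + " " + df["body"]
    return df.drop(columns=["keywords", "body"])


def stripHtml(text):
    """Hypothetical helper: drop HTML tags with BeautifulSoup, or a crude regex without it."""
    try:
        from bs4 import BeautifulSoup
    except ImportError:
        return re.sub(r"<[^>]+>", " ", text)
    return BeautifulSoup(text, "html.parser").get_text(separator=" ")


def stopWords():
    """Hypothetical helper: NLTK's English stop words, or a tiny stand-in list."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):  # NLTK or its corpus is not installed
        return {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on", "for"}


def cleanText(text, stop_words):
    """Steps 6-9: lowercase, strip HTML, tokenize, remove stop words."""
    text = stripHtml(text.lower())
    tokens = re.findall(r"[a-z]+", text)  # simplification: letters-only tokens
    return [tok for tok in tokens if tok not in stop_words]


# Toy input mimicking one scraped post (hypothetical data).
raw = io.StringIO(
    '[{"date": ["2014-01-02"], "keywords": ["stocks"], '
    '"body": ["<p>The profit of the company</p>"]}]'
)
stop_words = stopWords()
posts = readJson(raw)
posts["text"] = posts["text"].apply(lambda t: cleanText(t, stop_words))
print(posts)
```

Building the stop-word set once and passing it in avoids reloading the corpus for every post, which matters when the archive contains thousands of entries.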

Step 4 – Computing Sentiment Score

This is the core of the whole pipeline. Computing the sentiment of a piece of text is an extremely complex problem. There is a whole branch of Machine Learning devoted to developing algorithms for this kind of problem (Natural Language Processing), and at present there are several possible approaches to consider.

  1. One solution consists in getting chunks of pre-prepared text labeled as Positive or Negative and then performing a supervised binary classification of our posts, dividing them into the two available categories.
  2. Another approach consists in getting a dictionary of Positive and Negative words and counting how many of each occur in each post, then deriving a Sentiment Measure from those counts.

I decided to follow the second path, which first of all required me to find dictionaries built for this specific purpose. After some intense googling I stumbled upon the web page of Prof. Bill McDonald, professor of Finance at the University of Notre Dame. His research group recently carried out some very interesting work on Sentiment Analysis of financial texts, showing that to get decent accuracy in this kind of computation it is essential to use a dictionary of words developed specifically for finance. In fact, it is not uncommon for words that have a negative meaning in a normal context to turn positive in a financial one. After the paper was published, the group put their dictionaries online, so I downloaded them and used them for my analysis.

To get to the final result (extracting the sentiment from a text) I wrote a handful of functions, described below:

    • loadPositive(): loads the dictionary of positive words into a list.

    • loadNegative(): loads the dictionary of negative words into a list.

    • countPos(cleantext, positive): counts the number of words contained both in the post and in the positive dictionary.

    • countNeg(cleantext, negative): counts the number of words contained both in the post and in the negative dictionary.

    • getSentiment(cleantext, negative, positive): returns the difference between the number of positive and negative words in a post.

    • upDateSentimentDataFrame(dataframe): performs all the computations described above on the whole Pandas DataFrame of posts. The returned DataFrame is then saved to a CSV file.

    • prepareToConcat(filename): reads the CSV sentiment DataFrame and groups it by day, averaging the sentiment scores for each day.
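The counting functions above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the original code: loadWords is a hypothetical generic loader standing in for loadPositive() and loadNegative(), it assumes one word per line in the dictionary file, and every occurrence of a dictionary word in the post is counted (not just distinct words). The word lists in the example are toy data, not the real dictionaries.

```python
def loadWords(path):
    """Hypothetical loader: read one word per line and return the words lowercased."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip().lower() for line in handle if line.strip()]


def countPos(cleantext, positive):
    """Count the tokens of the post that appear in the positive dictionary."""
    lookup = set(positive)  # set membership is fast even for long word lists
    return sum(1 for word in cleantext if word in lookup)


def countNeg(cleantext, negative):
    """Count the tokens of the post that appear in the negative dictionary."""
    lookup = set(negative)
    return sum(1 for word in cleantext if word in lookup)


def getSentiment(cleantext, negative, positive):
    """Sentiment score: number of positive words minus number of negative words."""
    return countPos(cleantext, positive) - countNeg(cleantext, negative)


# Toy dictionaries and a toy cleaned post (hypothetical data).
positive = ["gain", "profit", "strong"]
negative = ["loss", "default", "weak"]
post = ["profit", "rose", "despite", "weak", "demand", "strong", "gain"]

print(getSentiment(post, negative, positive))  # prints 2 (3 positive, 1 negative)
```

A raw difference favours long posts, which simply contain more words; dividing by the length of the post would be one way to normalize, though the pipeline described here uses the plain difference.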

Step 5 – Final Steps

We have almost reached the end. We now have a Pandas DataFrame indexed by date, with a single ‘score’ column holding the sentiment for each day. What remains is to plug this sentiment feature into our model.

To do so, we save the output of the prepareToConcat(filename) function to a sentiment.csv file and then apply mergeSentimenToStocks(stocks), which takes the previously built stock prices DataFrame as its argument and left joins it with our newly created sentiment DataFrame.
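As a rough illustration of the daily averaging and the left join, here is a small Pandas sketch. It skips the round trip through CSV files, and the column names (score, close) and all the values are hypothetical; it only shows the shape of the operations, not the actual prepareToConcat or mergeSentimenToStocks code.

```python
import pandas as pd

# Toy per-post scores (hypothetical values): two posts on the first day, one on the second.
posts = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-02", "2014-01-02", "2014-01-03"]),
    "score": [2, -1, 3],
})

# Average the scores per day; the result is indexed by date with one 'score' column.
sentiment = posts.groupby("date")[["score"]].mean()

# Toy stock prices (hypothetical values), indexed by trading day.
stocks = pd.DataFrame(
    {"close": [100.0, 101.5, 99.8]},
    index=pd.to_datetime(["2014-01-02", "2014-01-03", "2014-01-06"]),
)

# Left join on the date index: every trading day is kept,
# and days without any post get NaN in the 'score' column.
merged = stocks.join(sentiment, how="left")
print(merged)
```

The left join means the stock data drives the final index, so days with no blog posts end up with a missing score that has to be filled or dropped before modelling.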

There you go! The score of the financial blog is now a genuine new feature, and we can run a model on top of it. Task accomplished!

Cool, isn’t it?


by Francesco Pochetti