Financial Sentiment Analysis Part II - Sentiment Extraction -

Reading Time: 5 minutes

As promised I’ll devote this second post to walk trough the remaining part of the Financial Sentiment Anaysis pipeline.

Just to recap, the steps we wanted to clarify are the following:

1. Scrape the historical archives of a web financial blog in order to get for each post the following information: date, keywords, text.

1. Save all this information in a JSON file.

1. Read the JSON file with Pandas and preprocess the text with NLTK (Natural Language ToolKit) and BeautifulSoup.

1. Perform Sentiment Analysis on the clean text data in order to get sentiment scores for each day.

1. Generate a final Pandas DataFrame and correlate it with stocks prices to test our hypothesis.

Repeat points 1-5 for as many blogs as possible. For the sake of simplicity I report only the pipeline for a single blog, Bloomberg Business Week. The code can be easily readapted to other websites; in any case all my experiments are available on Github.

In the previous post we went trough steps 1-2. The goal of this second article is to cover the remaining steps (3-5), from data cleaning to sentiment extraction.

Table of Contents

Step 3 – Data Cleaning

This step is crucial as it consists in preprocessing all the information stored in the JSON file generated after the web crawling. The main tasks to be carried out are:

1. read JSON in Pandas DataFrame

1. unlist all the entries (XPath queries return Python lists)

1. convert dates to datetime

1. join keywords and body columns in one text column and drop the first 2

1. The first result must be a Pandas DataFrame with 2 columns: date of the post and text of the post.

1. turn all the text to lowercase

1. get rid of all the HTML tags from text using BeautifulSoup

1. tokenize the text and get rid of stop words using NLTK

return clean text in form of a list of words

Two functions take care of this work: readJson (step 1-5) and cleanText (step 6-9). Both reported below.

# import modules necessary for all the following functions
import re
import pandas as pd
from sklearn import preprocessing

def readJson(filename):
    """
    reads a json file and returns a clean pandas data frame
    """
    import pandas as pd
    df = pd.read_json(filename)
    
    def unlist(element):
        return ''.join(element)
    
    for column in df.columns:
        df[column] = df[column].apply(unlist)
    # gets only first 10 characters of date: year/month/day
    df['date'] = df['date'].apply(lambda x: x[:10])
    df['date'] = pd.to_datetime(df['date'])
    
    # if any removes duplicate posts
    df = df.drop_duplicates(subset = ['keywords'])
    # sorts dataframe by post date
    df = df.sort(columns='date')
    df['text'] = df['keywords'] + df['body'] 

    df = df.drop('body', 1)
    df = df.drop('keywords', 1)
    
    return df

def cleanText(text):
    """
    removes punctuation, stopwords and returns lowercase text in a list of single words
    """
    text = text.lower()    
    
    from bs4 import BeautifulSoup
    text = BeautifulSoup(text).get_text()
    
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    text = tokenizer.tokenize(text)
    
    from nltk.corpus import stopwords
    clean = [word for word in text if word not in stopwords.words('english')]
    
    return clean

Step 4 – Computing Sentiment Score

This is the core of the whole pipeline. The problem of computing the sentiment of a piece of text is extremely complex. There is a whole branch of Machine Learning devoted to developing algorithms for this kind of issues (Natural Language Processing) and there are at present several possible approaches to be considered.

1. One solution consists in getting chunks of pre-prepared texts labeled as Positive or Negative and then perform a supervised binary classification over our posts dividing them in the two available categories.

An other approach consists in getting a dictionary of Positive and Negative words and then count how many of each occur respectively in each post. Then get a Sentiment Measure out of that.

I decided to follow the second path, which first of all required me to find dictionaries available for this specific purpose. After an intense googling I stepped into the web page of Prof Bill McDonald, professor of Finance at the University of Notre Dame. His research group recently conducted a very interesting work over Sentiment Analysis of Financial Texts, showing that in order to get a decent accuracy in this kind of computations it is absolutely mandatory to use a dictionary of words developed for the specific financial purpose. In fact, it is not uncommon that words happening to have a negative meaning in a normal context may turn positive in a financial one. After the paper had been published these guys had put their own dictionaries online, thus I downloaded them and used for my analysis.

In order to get to the final result (getting the sentiment out of a text) I wrote a bunch of functions which I go through as follows:

- loadPositive(): loads the dictionary of positive words into a list.

 def loadPositive():
    """
    loading positive dictionary
    """
    myfile = open('/home/LoughranMcDonald_Positive.csv', "r")
    positives = myfile.readlines()
    positive = [pos.strip().lower() for pos in positives]
    return positive

- loadNegative(): loads the dictionary of negative words into a list.

def loadNegative():
    """
    loading positive dictionary
    """
    myfile = open('/home/LoughranMcDonald_Negative.csv', "r")
    negatives = myfile.readlines()
    negative = [neg.strip().lower() for neg in negatives]
    return negative

- countPos(cleantext, positive): counts the number of words contained both in the post and in the positive dictionary.

def countNeg(cleantext, negative):
    """
    counts negative words in cleantext
    """
    negs = [word for word in cleantext if word in negative]
    return len(negs)

- countNeg(cleantext, negative): counts the number of words contained both in the post and in the negative dictionary.

def countPos(cleantext, positive):
    """
    counts negative words in cleantext
    """
    pos = [word for word in cleantext if word in positive]
    return len(pos)

- getSentiment(cleantext, negative, positive): returns the difference between positive and negative words in a post.

def getSentiment(cleantext, negative, positive):
    """
    counts negative and positive words in cleantext and returns a score accordingly
    """
    positive = loadPositive()
    negative = loadNegative()
    return (countPos(cleantext, positive) - countNeg(cleantext, negative))

- upDateSentimentDataFrame(dataframe): performs all the computations described above on the whole Pandas dataframe of posts. The returned dataframe is then saved to csv file.

 
def updateSentimentDataFrame(df):
    """
    performs sentiment analysis on single text entry of dataframe and returns dataframe with scores
    """
    positive = loadPositive()
    negative = loadNegative()   
    
    df['text'] = df['text'].apply(cleanText)
    df['score'] = df['text'].apply(lambda x: getSentiment(x,negative, positive))

    return df

prepareToConcat(filename): reads the csv sentiment dataframe and groups by day performing the average of the sentiment per day.

def prepareToConcat(filename):
    """
    load a csv file and gets a score for the day
    """
    df = pd.read_csv(filename, parse_dates=['date'])
    df = df.drop('text', 1)
    df = df.dropna()
    df = df.groupby(['date']).mean()
    name = re.search( r'/(\w+).csv', filename)
    df.columns.values[0] = name.group(1)
    return df

Step 5 – Final Steps

It seems we arrived almost to the end. Now we have a Pandas DataFrame consisting of one single ‘score’ column (for each day) and indexed with dates. What we have to do now is to plug this sentiment feature in our model.

To do so we save the output of the prepareToConcat(filename) function to a sentiment.csv file and then we apply the mergeSentimenToStocks(stocks), which takes the previously built financial stocks prices dataframe as argument and left joins it with our new born sentiment dataframe.

def mergeSentimenToStocks(stocks):
    df = pd.read_csv('/home/sentiment.csv', index_col = 'date')
    final = stocks.join(df, how='left')
    return final

There you go! Now the score of the financial blog is a real new feature and we can run a model on top of that. Task accomplished!

Cool, isn’t it?

Additional resources

Around the topic of NLP and sentiment analysis, I highly recommend the following 3 blog posts. Very good and up to date content!

Twitter

Financial Sentiment Analysis Part II – Sentiment Extraction

Step 3 – Data Cleaning

Step 4 – Computing Sentiment Score

Step 5 – Final Steps

Additional resources

Discover more from