
Financial Sentiment Analysis Part I – Web Scraping

Reading Time: 5 minutes

It’s been a while without Mr Why’s posts! I apologize, but quite a lot has happened in the meantime. I quit my job in Italy and moved to Berlin to attend a three-month course in Data Analysis and Machine Learning. An amazing experience, which started at the beginning of August and will end on the 31st of October.

I’m learning a bunch of new interesting things and I thought it would be pretty awesome to share my experience with some of the challenges I’m facing at the moment. What I decided to tackle in this post is Web Crawling in Python, a fascinating topic I realized I needed to get through a couple of weeks ago, while working on my Portfolio Project on Stock Market Prediction at DSR.

Basically I’m studying a model to predict daily S&P 500 index returns. This is quite a hard task, as stock exchanges are notoriously governed by randomness; a huge amount of work has already been done on the problem, and people much more talented than me have faced it in the past and are still facing it today.

In order to accomplish the task I needed to gather information to feed the model. To build a decent model it is mandatory to find data with strong predictive value: if my goal is to forecast stock prices, I’d better find data as correlated as possible with market fluctuations. The higher the correlation between predictors and output, the better.

For this reason I thought that an interesting aspect to investigate could be the relation between exchanges and financial news. There is a huge variety of business blogs out there, publishing every day pieces of news, analysis, comparisons or simply thoughts regarding specific market segments or companies. When it comes to globally influential blogs, it is likely that if the general feeling coming from today’s set of posts is negative, investors may decide to sell, or not to buy, tomorrow. Many studies have shown that the decision-making process of the average investor is highly affected by public sentiment, so the idea of correlating the “feeling” of a specific day with the trading choices of the following day is definitely worth trying.

The steps to be followed in order to achieve this goal are the following:

  1. Scrape the historical archives of a financial web blog in order to get, for each post, the following information: date, keywords, text. The date of the post is fundamental to preserve the time information; the keywords are necessary for a simple reason: the text data we scrape is going to be full of HTML tags. The XPath language (which we’ll make extensive use of) was developed for the specific purpose of easily extracting information from a web page while getting rid of the tags in the first place. Nevertheless the process is not always smooth, and while selecting the post text we may lose some tagged words, which are often the most meaningful ones (companies’ or people’s names). These words frequently appear among the keywords or tags related to the post, so they are easily reachable with a proper XPath query (a short illustrative sketch follows this list).
  2. Save all this information in a JSON file.
  3.  Read the JSON file with Pandas and preprocess the text with NLTK (Natural Language ToolKit) and BeautifulSoup.
  4. Perform Sentiment Analysis on the clean text data in order to get sentiment scores for each day.
  5. Generate a final Pandas DataFrame and correlate it with stocks prices to test our hypothesis.
  6. Repeat points 1-5 for as many blogs as possible. For the sake of simplicity I report only the pipeline for a single blog, Bloomberg Business Week. The code can be easily adapted to other websites; in any case all my experiments are available on GitHub.
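
Before moving on, here is a minimal, self-contained sketch of the tag-stripping issue mentioned in point 1 (the HTML snippet is made up purely for illustration): selecting only the text() nodes of the post body loses the tagged company name, while the keywords remain trivially reachable through their meta tag.

from scrapy.selector import Selector

# toy HTML standing in for a blog-post page (illustrative only)
html = """
<html>
  <head><meta name="keywords" content="Apple,earnings,stocks"></head>
  <body>
    <div id="article_body">
      <p>Shares of <a href="#">Apple</a> rose sharply today.</p>
    </div>
  </body>
</html>
"""

sel = Selector(text=html)
# the keywords live in a <meta> tag, so they are trivial to extract
print(sel.xpath('//meta[@name="keywords"]/@content').extract())
# -> ['Apple,earnings,stocks']
# selecting only the text nodes of <p> drops the tagged word "Apple"
print(sel.xpath('//div[@id="article_body"]/p/text()').extract())
# -> ['Shares of ', ' rose sharply today.']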

The purpose of the present article is to walk through points 1 and 2. The remaining part of the pipeline will be covered in the following post.

So let’s start with the Web Crawling phase. The answer to this problem is web scraping in Python, or in other words Scrapy.

The example I report below is the code I wrote to crawl the Bloomberg Business Week archives. The website has a very clean structure, which facilitates the task.

The tool I used is the Scrapy library, which is a very handy Python package written for these purposes. Here is the link to the Scrapy First Tutorial, which can be quite useful to follow what is reported below.
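
For completeness, the project skeleton used below (named financeCrawler, as the imports suggest) can be generated with Scrapy’s standard project template; among other things this creates the items.py file where the first class lives and a spiders/ folder where the spider script goes.

scrapy startproject financeCrawler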

In order to have everything work correctly it is necessary to write two basic Classes (I renamed them to fit my project):

  • Class FinancecrawlerItem(scrapy.Item): contained in items.py, this class takes care of defining all the information we want to collect from the web: in my case, the date of the post, the keywords of the post and the body of the post.

import scrapy

class FinancecrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    keywords = scrapy.Field()
    body = scrapy.Field()
  • Class BWSpider(scrapy.Spider): this is the Spider proper. This class connects to the server, generates a response for each blog-post url and returns the information specified in the FinancecrawlerItem class (in the form of a dictionary). The Spider needs a list of urls to start performing its job (start_urls). In order to generate this list and feed the Spider I wrote another short script, which takes care of parsing the Business Week archive pages and retrieving all the relevant urls. The code for both is provided below and commented.
import scrapy

from financeCrawler.items import FinancecrawlerItem

def getCleanStartUrlList(filename):
    """
    Takes as input the name of the txt file generated by the
    archive-parsing script reported further below. Each line of this
    file is the url of a blog post published in the period
    (1 January 2008 - 15 August 2014). The function returns a list
    of all the urls to be scraped.
    """
    with open(filename, "r") as myfile:
        return [url.strip() for url in myfile.readlines()]

class BWSpider(scrapy.Spider):
    # name of the spider
    name = "busweek"
    # domains in which the spider can operate
    allowed_domains = ["businessweek.com"]
    # list of urls to be scraped
    urls = getCleanStartUrlList('businessweek.txt')
    start_urls = urls

    def parse(self, response):
        # the parse method is called by default on each url of the
        # start_urls list 
        item = FinancecrawlerItem()
        # the date, keywords and body attributes are retrieved from
        # the response page using the XPath query language
        item['date'] = response.xpath('//meta[@content][@name="pub_date"]/@content').extract()
        item['keywords'] = response.xpath('//meta[@content][@name="keywords"]/@content').extract() 
        item['body'] = response.xpath('//div[@id = "article_body"]/p/text()').extract()
        # the complete item filled with all its attributes 
        yield item
"""
The archive is nicely structured, thus the purpose of this script is 
to generate a txt file containing all the urls of the blog-posts 
published between 1 January 2008 and 15 August 2014.
In order to achieve this goal I implemented the following steps:
1- generate the urls of all the months in the time interval
2- generate the urls of all the days for each month
3- scrape each of the day-urls and get all the urls of the 
   posts published on that specific day.
4- repeat for all the days on which something was published 
"""
import scrapy
import urllib

def businessWeekUrl():
    totalWeeks = []
    totalPosts = []
    # landing page of the Business Week news archive
    url = 'http://www.businessweek.com/archive/news.html#r=404'
    data = urllib.urlopen(url).read()
    hxs = scrapy.Selector(text=data)

    # step 1: collect the urls of all the monthly archive pages and keep
    # only the months between January 2008 and August 2014
    months = hxs.xpath('//ul/li/a').re('http://www.businessweek.com/archive/\\d+-\\d+/news.html')
    admittMonths = 12*(2013-2007) + 8
    months = months[:admittMonths]

    # step 2: for each month, collect the urls of its daily archive pages
    for month in months:
        data = urllib.urlopen(month).read()
        hxs = scrapy.Selector(text=data)
        weeks = hxs.xpath('//ul[@class="weeks"]/li/a').re('http://www.businessweek.com/archive/\\d+-\\d+/news/day\\d+\.html')
        totalWeeks += weeks

    # step 3: for each day, collect the urls of the posts published that day
    for week in totalWeeks:
        data = urllib.urlopen(week).read()
        hxs = scrapy.Selector(text=data)
        posts = hxs.xpath('//ul[@class="archive"]/li/h1/a/@href').extract()
        totalPosts += posts

    # step 4: dump all the post urls to a txt file, one url per line
    with open("businessweek.txt", "a") as myfile:
        for post in totalPosts:
            myfile.write(post + '\n')

businessWeekUrl()
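
A small optional tweak before launching the crawl: the archive contains thousands of posts, so it is probably wise to slow the spider down a little in the project’s settings.py. The value below is just an indicative choice of mine, not something the rest of the code depends on.

# settings.py (excerpt) - be gentle with the server
DOWNLOAD_DELAY = 0.5  # seconds to wait between consecutive requests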

Here you go! Done! Now you just have to browse to the directory where you have stored all this fancy code and run:

scrapy crawl busweek -o businessweek.json

The previous command will run the Spider, echo the crawl log to the screen and dump the scraped items to a nicely formatted JSON file, which can then be read and parsed with Pandas. So let’s dive into the following post and go on with the pipeline!
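
Before we do, for reference, here is roughly the shape of a single entry of businessweek.json; the values are made up for illustration, but the three fields match the FinancecrawlerItem definition above (note that extract() always returns lists).

[
  {
    "date": ["2014-08-15T10:30:00"],
    "keywords": ["Apple,earnings,technology"],
    "body": ["Shares of ", " rose sharply today after the company reported ..."]
  }
]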

 
