
Financial Sentiment Analysis Part I – Web Scraping

It’s been a while without Mr Why’s posts! I apologize, but quite a lot has happened in the meantime. I quit my job in Italy and moved to Berlin to attend a three-month course in Data Analysis and Machine Learning. It is an amazing experience, which started at the beginning of August and will end on 31 October.

I’m learning a bunch of interesting new things and I thought it would be pretty awesome to share some of the challenges I’m facing at the moment. In this post I tackle Web Crawling in Python, a fascinating topic I realized I needed to get through a couple of weeks ago, while working on my Portfolio Project on Stock Market Prediction at DSR.

Basically I’m studying a model to predict daily S&P 500 index returns. This is quite a hard task, as stock exchanges are notoriously governed by randomness; a huge amount of work has already been done, and people much more talented than me have tackled this problem in the past and are still tackling it now.

To accomplish the task I needed to gather information to feed into the model. A decent model needs data with strong predictive value: if my goal is to forecast stock prices, I’d better find data as correlated as possible with market fluctuations. The higher the correlation between predictors and output, the better.

For this reason I thought that an interesting aspect to investigate would be the relation between exchanges and financial news. There is a huge variety of business blogs out there, publishing news, analyses, comparisons or simply thoughts about specific market segments or companies every day. For blogs with worldwide influence, it is plausible that if the general feeling coming from today’s posts is negative, investors may decide to sell, or not to buy, tomorrow. Many studies have shown that the decision-making process of a common investor is highly affected by public sentiment, so the idea of correlating the “feeling” of a specific day with the trading choices of the following day is definitely worth trying.
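To make the "following day" idea concrete, here is a minimal sketch of a lagged correlation in plain Python. The function names are hypothetical, and the sketch assumes two equally long lists aligned by trading day: one daily sentiment score and one daily return. It is only an illustration of the idea, not the model used in the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, returns, lag=1):
    """Correlate sentiment on day t with returns on day t + lag."""
    if len(sentiment) != len(returns):
        raise ValueError("series must be aligned day by day")
    if lag < 0 or lag >= len(sentiment):
        raise ValueError("lag must be between 0 and len(series) - 1")
    if lag == 0:
        return pearson(sentiment, returns)
    # Drop the last `lag` sentiment days and the first `lag` return days,
    # so that each sentiment value is paired with the return `lag` days later.
    return pearson(sentiment[:-lag], returns[lag:])
```

With `lag=1` each day’s sentiment is paired with the next day’s return, which is exactly the hypothesis described above; `lag=0` would test the same-day relation instead.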

The steps to be followed in order to achieve this goal are the following:

  1. Scrape the historical archives of a financial blog in order to get the following information for each post: date, keywords, text. The date is fundamental to keep the time information. The keywords are just as necessary, for a simple reason: the text we scrape is going to be full of HTML tags. The XPath language (which we’ll make extensive use of) was designed for selecting parts of a marked-up document such as a web page, and it lets us get rid of the tags in the first place. Nevertheless the process is not always smooth, and while selecting the post text we may lose some tagged words, which in general are the most meaningful ones (companies’ or people’s names). These words are often also the keywords or tags attached to the post, so they are easily reachable with a proper XPath query.
  2. Save all this information in a JSON file.
  3. Read the JSON file with Pandas and preprocess the text with NLTK (Natural Language Toolkit) and BeautifulSoup.
  4. Perform Sentiment Analysis on the clean text data in order to get sentiment scores for each day.
  5. Generate a final Pandas DataFrame and correlate it with stock prices to test our hypothesis.
  6. Repeat points 1-5 for as many blogs as possible. For the sake of simplicity I report only the pipeline for a single blog, Bloomberg Business Week. The code can be easily adapted to other websites; in any case all my experiments are available on GitHub.
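The point made in step 1 about losing tagged words can be seen in a tiny, self-contained sketch. It uses only the standard library (`xml.etree.ElementTree`, which supports a small subset of XPath) rather than Scrapy, and the sample markup is invented for illustration; it is an assumption, not the real structure of any Business Week page. Selecting only the direct text nodes of a paragraph, as the XPath `p/text()` does, drops the words wrapped in links, while the keyword list still carries them.

```python
import json
import xml.etree.ElementTree as ET

# Invented post fragment: the tag names and class are assumptions.
SAMPLE = """
<article>
  <time>2014-09-01</time>
  <p>Shares of <a href="/apple">Apple</a> fell after <a href="/samsung">Samsung</a> reported earnings.</p>
  <ul class="keywords"><li>Apple</li><li>Samsung</li></ul>
</article>
"""


def direct_text(element):
    """Emulate the XPath `element/text()`: direct text nodes only."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


def full_text(element):
    """Text of the element and of all its descendants."""
    return "".join(element.itertext())


def extract_post(xml_source):
    """Return the date, keywords and (lossy) body of the sample post."""
    root = ET.fromstring(xml_source.strip())
    paragraph = root.find("p")
    return {
        "date": root.findtext("time"),
        "keywords": [li.text for li in root.findall("ul[@class='keywords']/li")],
        "body": direct_text(paragraph),
    }


if __name__ == "__main__":
    post = extract_post(SAMPLE)
    # Step 2 of the list: serialize the collected fields as JSON.
    print(json.dumps(post, indent=2))
```

Here the body loses "Apple" and "Samsung" because they sit inside `<a>` tags, but both names survive in the `keywords` field, which is why the keywords are worth scraping alongside the text.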

The purpose of the present article is to walk through points 1 and 2. The remaining part of the pipeline will be covered in the following post.

So let’s start with the Web Crawling phase. The tool for this job is web scraping in Python, or in other words, Scrapy.

The example I report below is the code I wrote to crawl the Bloomberg Business Week archives. The web site has a very clean structure, which makes the task easier.

The tool I used is the Scrapy library, a very handy Python package written for exactly these purposes. The Scrapy First Tutorial, in the official documentation, can be quite useful for following what is reported below.

For everything to work correctly it is necessary to write two basic classes (I renamed them to fit my project):

  • Class FinancecrawlerItem(scrapy.Item): contained in items.py, this class defines all the information we want to collect from the web. In my case: the Date of the post, the Keywords of the post and the Body of the post.

  • Class BWSpider(scrapy.Spider): this is the proper Spider. This class connects to the server, gets a response for each blog post URL and returns the information specified in the FinancecrawlerItem class (in the form of a dictionary). The Spider needs a list of URLs to start performing its job (start_urls). In order to generate this list and feed the Spider I wrote another short script, which takes care of parsing the Business Week Archive page and retrieving all the relevant URLs. The code is provided below and fully commented.

Here you go! Done! Now you just have to browse to the directory where you have stored all this fancy code and launch the Spider from the command line.

This command will run the Spider and send the output of your code both to the screen and to a nicely formatted JSON file, which can then be read and parsed with Pandas. So let’s dive into the following post and go on with the pipeline!
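For reference, here is a hedged sketch of launching a crawl and reading the result back, written in Python so that it can be scripted. It relies on Scrapy’s `scrapy crawl <spider> -o <file>` command; the spider name and the output path are whatever you pass in, the function names are hypothetical, and the loader assumes each item has a `date` field, as in the item class described earlier. It is not necessarily the exact command used for this project.

```python
import json
import subprocess
from collections import defaultdict


def build_crawl_command(spider_name, output_path):
    """Return the argument list for `scrapy crawl <spider> -o <file>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def run_crawl(spider_name, output_path):
    """Run the crawl; must be called from inside the Scrapy project folder."""
    # Scrapy logs go to the screen, scraped items go to output_path.
    subprocess.run(build_crawl_command(spider_name, output_path), check=True)


def load_posts_by_date(path):
    """Read the JSON array written by the crawl and group posts by date."""
    with open(path, encoding="utf-8") as handle:
        posts = json.load(handle)
    by_date = defaultdict(list)
    for post in posts:
        by_date[post["date"]].append(post)
    return dict(by_date)
```

The same file can also be loaded straight into a DataFrame with `pandas.read_json`, which is the route the next post takes.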


by Francesco Pochetti