Using Topic Modelling to Analyse 10-K Filings - Auquan

by Wian Stipp

1. Getting The Data

1.1. Install Relevant Packages

In [0]:
# the -q (quiet) flag prevents printing the install messages
!pip install -q sec-edgar-downloader
!pip install -q html2text
In [0]:
from sec_edgar_downloader import Downloader
import textwrap
import html2text
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
from tqdm.notebook import tqdm

1.2. Download The Data

We use a tool called sec-edgar-downloader to scrape the HTML from the 10-K reports. For more details:

We will download the data directly into the default working directory that Google Colab uses. We also need to specify which companies we would like data for.

In [0]:
PATH = "/content"
dl = Downloader(PATH)
In [0]:
# SYMBOLS lists the tickers of the companies we want filings for
SYMBOLS = ["AMZN", "GOOGL", "MSFT"]

# The ARGS dictionary holds some hardcoded information that we might need to reuse
ARGS = {"Type of Report": "10-K",
        "Companies": SYMBOLS}

Then we can simply download the data by looping through each of the companies and downloading using the SEC Edgar tool. The data will download into the "content" directory as we specified above.

Throughout this notebook we also use the tqdm tool from tqdm.notebook. This is essentially an awesome progress bar that helps you see how far you are through a loop and the expected remaining time.

In [0]:
for symbol in tqdm(SYMBOLS):
  dl.get(ARGS["Type of Report"], symbol)

1.3. Extract Relevant Text

Now that we have downloaded the data, we need to extract the relevant text from the files. We also need to extract the year of the 10-K filing. This is easy since it is included in the document name.
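The year parsing works like this; the filing name below is a made-up example following EDGAR's accession-number format, where the middle piece is a two-digit year:

```python
# Hypothetical EDGAR filing name, e.g. "0001018724-19-000004.txt"
report = "0001018724-19-000004.txt"

# The two-digit year sits between the two hyphens
yr = int(report.split("-")[1])

# Anything above 20 is assumed to be a 19xx filing, otherwise 20xx
year = 1900 + yr if yr > 20 else 2000 + yr
print(year)  # prints 2019
```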

To extract the text we use a rough approximation: we simply take a portion of text near the start of the report.

html2text is another tool used here; it converts the files, which are all HTML, to plain text.

In [0]:
def createDataframe(company_list):

  df = pd.DataFrame(columns=["Company", "Year", "Report"])
  start_index = {"AMZN": 49257}
  end_index = {"AMZN": 185190}

  for company in tqdm(company_list):
      reports = os.listdir(PATH + "/sec_edgar_filings/" + company + "/10-K")
      for index, report in enumerate(reports):
        try:
          opened_file = open(PATH + "/sec_edgar_filings/" + company + "/10-K/" + report, "r")
          full_text =
          opened_file.close()
          # Use hand-picked slice boundaries where we have them, defaults otherwise
          if company in start_index.keys():
            start = start_index[company]
            end = end_index[company]
          else:
            start = 44800
            end = 200000
          text = html2text.html2text(full_text[start:end])
          t_len = len(text)
          relevant_text = text[round(t_len*0.003):round(t_len*0.08)]
          # The two-digit filing year sits in the middle of the report name
          yr = int(report.split("-")[1])
          if yr > 20:
            yr = 1900 + yr
          else:
            yr = 2000 + yr
          df = df.append({"Company": company, "Year": yr, "Report": relevant_text}, ignore_index=True)
        except:
          print(company, report, "Failed")
  return df
In [0]:
dataframe = createDataframe(ARGS["Companies"])

1.4. Data Cleaning

In [0]:
import re

def clean_dataset(text):
    # Make Lowercase
    text = text.lower()
    # Remove some remaining html
    text = re.sub(r"font", "", text)
    text = re.sub(r"size", "", text)
    text = re.sub(r"pt", "", text)
    text = re.sub(r"px", "", text)
    text = re.sub(r"padding", "", text)
    text = re.sub(r"family", "", text)
    text = re.sub(r"style", "", text)
    # Remove HTML special entities (e.g. &)
    text = re.sub(r"\&\w*;", "", text)
    # Remove tickers
    text = re.sub(r"\$\w*", "", text)
    # Remove hyperlinks & URLs
    text = re.sub(r"https?:\/\/.*\/\w*", "", text)
    text = re.sub(r"http(\S)+", "", text)
    text = re.sub(r"http ...", "", text)
    # Collapse runs of whitespace (including new line characters)
    text = re.sub(r"\s\s+", " ", text)
    text = re.sub(r"[ ]{2,}", " ", text)
    # &, < and >
    text = re.sub(r"&amp;?", "and", text)
    text = re.sub(r"&lt;", "<", text)
    text = re.sub(r"&gt;", ">", text)
    # Strip wiki-style link markup, keeping only the link text
    text = re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', text)
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode:
    text = "".join(c for c in text if c <= "\uFFFF")
    text = text.strip()
    text = " ".join(text.split())

    return text
In [0]:
dataframe["Report"] = dataframe["Report"].apply(clean_dataset)
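To see what a few of these rules do, here is a trimmed-down version of clean_dataset (three of its rules) applied to a made-up sample string:

```python
import re

def mini_clean(text):
    """A cut-down version of clean_dataset with three of its rules."""
    text = text.lower()
    text = re.sub(r"\&\w*;", "", text)   # HTML entities such as &nbsp;
    text = re.sub(r"\$\w*", "", text)    # cashtags such as $AMZN
    text = " ".join(text.split())        # collapse whitespace
    return text

print(mini_clean("Net&nbsp;sales grew;  see $AMZN  for details."))
# prints: netsales grew; see for details.
```

Note that removing an entity glues the words around it together ("netsales" above); for a rough keyword analysis like ours this costs little, but it is something to be aware of.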

2. Data Exploration

2.1. Word Cloud

Word clouds, as basic as they are, can be useful to visually represent the main themes in the report.

In [0]:
from wordcloud import WordCloud
In [0]:
def make_wordcloud(series):
  all_text = ','.join(list(series.values))
  wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3)
  wordcloud.generate(all_text)
  return wordcloud.to_image()
In [0]:
make_wordcloud(dataframe["Report"])

2.2. Word Cloud by Company

Let's see if we can see any immediate differences between companies by just using the word cloud. Since we wrote the word cloud generation into a function called make_wordcloud we can easily reuse it! Let's look at Alphabet vs Microsoft.

In [0]:
import numpy as np
In [0]:
GOOGL_condition = (dataframe.Company == "GOOGL")
make_wordcloud(dataframe[GOOGL_condition]["Report"])
In [0]:
MSFT_condition = (dataframe.Company == "MSFT")
make_wordcloud(dataframe[MSFT_condition]["Report"])

Just by visualising the data from Alphabet vs Microsoft, we can see that Microsoft seems to talk more about their services and products, while Alphabet seems to be more concerned about macroeconomic factors.

3. Topic Modelling

We will be using LDA for the topic model. LDA stands for Latent Dirichlet Allocation. If we break down that term a little: "latent" means unobserved; "Dirichlet" is named after the German mathematician Peter Gustav Lejeune Dirichlet, whose distribution the model is built on; and "allocation" refers to the nature of the problem, allocating latent topics to chunks of text.

LDA is an unsupervised technique, meaning we do not need labelled data, which is a big benefit when you have as many 10-K filings as we do. The mathematics behind it is quite deep, because it uses Bayesian methods, so we won't cover it here. If you would like to get a better idea of the mathematics, then I recommend you start here: or view the original paper (with Andrew Ng) here:

An intuitive way to understand how topic modelling works is that the model imagines each document contains a fixed number of topics. For each topic, there are certain words that are associated with that topic. Then a document can be modeled as some topics that are generating some words associated with the topics. For example, a document discussing Covid-19 and unemployment impact can be modelled as containing the topics: “Covid-19”, “economics”, “health” and “unemployment”. Each one of these topics has a specific vocabulary associated with it, which appears in the document. The model knows the document isn’t discussing the topic “trade” because words associated with “trade” do not appear in the document.

3.1. Text Preprocessing

NLTK, Gensim and spaCy are the primary packages we will be using to clean the text, preprocess it and then build the model. These libraries are very common in the NLP space nowadays and should become familiar to you over time.

We also use a special plotting tool called pyLDAvis. As the name suggests this enables you to visualise the Topic Modelling output by using a number of techniques, such as dimensionality reduction.

To prepare the text for the model we need to do a few things. The first is to remove stopwords, which are words that do not add much meaning to the text and hence just add noise to the model. The NLTK package has a long list of these words which we can make use of right away. Have a look at some of them below, stored in the list variable stop_words.

Note: If you find any other words that could be considered meaningless which remain after filtering out the stopwords, then you can just append them to the list.
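For example, tokens like "company" appear in every filing and carry little signal, so they are good candidates to append (the lists below are short stand-ins, not the real NLTK list):

```python
stop_words = ["the", "of", "and"]            # stand-in for stopwords.words('english')
stop_words.extend(["company", "fiscal"])     # hypothetical 10-K-specific additions

# Filtering a token list drops both the standard and the added stopwords
tokens = ["the", "company", "reported", "fiscal", "growth"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # prints: ['reported', 'growth']
```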

In [0]:
import nltk;'stopwords')
!python3 -m spacy download en
In [0]:
from pprint import pprint

import warnings

# Gensim is a great package that supports topic modelling and other NLP tools
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim
In [0]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
In [0]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
def remove_stopwords(texts):
  """Remove all of the stopwords we have specified in the list stop_words."""
  return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def bigrams(words, bi_min=3):
  bigram = gensim.models.Phrases(words, min_count = bi_min)
  bigram_mod = gensim.models.phrases.Phraser(bigram)
  return bigram_mod

def get_corpus(df):  
    words = list(sent_to_words(df.Report))
    words = remove_stopwords(words)
    bigram = bigrams(words)
    bigram = [bigram[report] for report in words]
    id2word = gensim.corpora.Dictionary(bigram)
    id2word.filter_extremes(no_below=3, no_above=0.35)
    corpus = [id2word.doc2bow(text) for text in bigram]
    return corpus, id2word, bigram
In [0]:
train_corpus, train_id2word, bigram_train = get_corpus(dataframe)

We need to choose the number of topics we are looking for. This is a hyperparameter and cannot be directly optimised.

In [0]:

3.2. Model

In [0]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    lda_train = gensim.models.ldamulticore.LdaMulticore(
                           corpus=train_corpus,
                           num_topics=10,   # illustrative choice; we tune this in section 4.2
                           id2word=train_id2word,
                           eval_every=1)
In [0]:
coherencemodel = CoherenceModel(lda_train, texts=bigram_train, dictionary=train_id2word, coherence='c_v')
print(coherencemodel.get_coherence())
In [0]:
# Visualize the topics using pyLDAvis
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_train, train_corpus, train_id2word)
vis

4. Improving The Model

As the results show, the model is decent at finding topics, but we can do better. We will look at two ways to improve the model: finding the optimal number of topics, and using Mallet.

4.1. Mallet

Mallet (Machine Learning for Language Toolkit), is a topic modelling package written in Java. Gensim has a wrapper to interact with the package, which we will take advantage of.

The difference between the LDA model we have been using and Mallet is that the original LDA uses variational Bayes sampling, while Mallet uses collapsed Gibbs sampling. The latter is more precise, but slower. In most cases Mallet performs much better than original LDA, so we will test it on our data. Also, as we will see, Mallet dramatically increases our coherence score, demonstrating that it is better suited to this task than the original LDA model.

We need to go through some additional steps to properly install Mallet and the wrapper from Gensim. Here is an excellent guide to using Mallet with Google Colab:

In [0]:
def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
  !java -version       #check java version
openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
In [0]:
install_java()
In [0]:
os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet'
In [0]:
from gensim.models.wrappers import LdaMallet
In [0]:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    lda_mallet = LdaMallet(mallet_path,
                           corpus=train_corpus,
                           num_topics=10,   # same illustrative topic count as before
                           id2word=train_id2word)
In [0]:
gensimmodelMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet)
In [0]:
coherencemodel = CoherenceModel(gensimmodelMallet, texts=bigram_train, dictionary=train_id2word, coherence='c_v')
print(coherencemodel.get_coherence())

As you can see we get a huge boost in coherence!

In [0]:
# Visualize the topics using pyLDAvis
vis = pyLDAvis.gensim.prepare(gensimmodelMallet, train_corpus, train_id2word)
vis

4.2. Finding Optimal Number of Topics

There is no easy way to obtain the optimal number of topics you should use. This is a hyperparameter and must be tuned manually. One way, albeit still not great, is to create a number of models, each with different topic numbers, and calculate the coherence scores for each model. Then we plot the coherence vs. number of topics, and find the elbow - the point at which the curve tapers off.

In [0]:
def plot_coherence(dictionary, corpus, texts, maximum=30, minimum=3, step=4):

  coherence_values = []
  model_list = []

  for num_topics in tqdm(range(minimum, maximum, step)):
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
    gensimMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model)
    coherencemodel = CoherenceModel(model=gensimMallet, texts=texts, dictionary=dictionary, coherence='c_v')
    model_list.append(gensimMallet)
    coherence_values.append(coherencemodel.get_coherence())

  return model_list, coherence_values
In [0]:
models, coherences = plot_coherence(train_id2word, train_corpus, bigram_train)

In [0]:
x = range(3, 30, 4)
plt.figure(figsize=(15, 10))
plt.plot(x, coherences)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# plt.legend(("coherence_values"), loc='best')

Unfortunately the curve is quite noisy, but the scale on the y-axis is narrow, so it doesn't make too much difference which value we choose. Also, if you run this again, the line can look quite different. Therefore, we should choose a number of topics that makes sense in this context: the fewer topics we choose, the more general and broad the topics will be, and vice versa.

5. Using The Model

Using the knowledge from the optimal topics we can select a "best model" to analyse the data with.

5.1. Visualisation

In [0]:
bestModelMallet = models[1]
In [0]:
# Visualize the topics using pyLDAvis
vis = pyLDAvis.gensim.prepare(bestModelMallet, train_corpus, train_id2word)
vis