Using Topic Modelling to Analyse 10-K Filings

In this article we explore topic modelling and apply it to analysing trends in company 10-K filings.

A Google Colab file with all the code can be found here. We recommend you open it to see all the details.

What is Topic Modelling?

Topic modelling is a subtask of natural language processing and information extraction from text. The aim is, for a given corpus of text, to model the latent (hidden, underlying) topics present in the text. Once you know the topics being discussed in the text, various further analysis can be done. For example, using the topics as features, classification, trend analysis or visualisation tasks can be performed. This makes topic modelling a useful tool in a data scientist’s toolbox.

Latent Dirichlet Allocation (LDA) is commonly used for topic modelling due to its ease of implementation and computation speed. If we break down that term a little: “latent” means unobserved; “Dirichlet” refers to the Dirichlet distribution, named after the German mathematician Peter Gustav Lejeune Dirichlet; and “allocation” refers to the nature of the problem: allocating latent topics to chunks of text.

An intuitive way to understand how topic modelling works is that the model imagines each document contains a fixed number of topics. For each topic, there are certain words that are associated with that topic. A document can then be modelled as a set of topics, each generating the words associated with it. For example, a document discussing Covid-19 and its unemployment impact can be modelled as containing the topics “Covid-19”, “economics”, “health” and “unemployment”. Each one of these topics has a specific vocabulary associated with it, which appears in the document. The model knows the document isn’t discussing the topic “trade” because words associated with “trade” do not appear in the document.
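This generative picture can be sketched in a few lines of plain Python. The topics, their vocabularies, and the document mixture below are all invented purely for illustration:

```python
import random

random.seed(0)

# Hypothetical topic-word vocabularies (invented for illustration)
topics = {
    "covid-19": ["virus", "pandemic", "vaccine", "lockdown"],
    "unemployment": ["jobs", "layoffs", "claims", "benefits"],
}

# A document is modelled as a mixture of topics, each generating words
doc_topics = ["covid-19", "unemployment"]
document = [random.choice(topics[random.choice(doc_topics)]) for _ in range(8)]
print(document)
```

LDA works in the opposite direction: given only the documents, it infers which topic-word distributions and topic mixtures most plausibly generated them.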

The underlying mathematics behind LDA is beyond the scope of this article, but relies on variational Bayesian inference. We also touch on the MALLET model, which is similar to LDA but has some advantages over the original.

Now that you have an idea of what topic modelling is and how it can be used, let’s explore how it can be applied to analyse 10-K filings for some major tech companies.


Code snippets will be provided throughout the article to show how these ideas are implemented, but much will be left out in the interest of space. The full detail can be found in the notebook.

To download the data, we use a Python package called “sec-edgar-downloader”, which can be easily pip installed. For this article we look at some major tech companies: Alphabet, Microsoft, Amazon, IBM, and Nvidia.

After downloading the data, we can produce a simple visualisation to get a better idea of the content of these 10-K filings. We define a function to create a word cloud given a Pandas Series object.

from wordcloud import WordCloud

def make_wordcloud(series):
  all_text = ','.join(list(series.values))
  wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3)
  wordcloud.generate(all_text)  # build the word cloud from the combined text
  return wordcloud.to_image()

If we run this on the entire dataset, we get the following output:

We can also compare word clouds between companies. For example, here is the word cloud for Alphabet:

And for Microsoft:

Just by visualising the data from Alphabet vs Microsoft, we can see that Microsoft seems to talk more about its services and products, while Alphabet seems more concerned with macroeconomic factors.

As with all NLP tasks, the text data needs to be cleaned and preprocessed to make it useful. We can apply regular expressions to filter out a lot of the junk, as well as removing stopwords. Stopwords are common words that do not add any meaning to the document, and hence just generate noise.

NLTK comes with a predefined list of stop words, which can be accessed as follows:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

This is simply a Python list of 179 words. If there are any idiosyncratic stop words related to your application, you can simply append them to this list before filtering them out.
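As a sketch of that workflow (using a small stand-in list rather than the full NLTK one, and two hypothetical filing-specific additions):

```python
# Stand-in for nltk's stopwords.words('english')
stop_words = ["the", "and", "of", "a"]

# Hypothetical 10-K-specific stop words appended before filtering
stop_words += ["item", "form"]

tokens = ["the", "company", "reported", "item", "revenue"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['company', 'reported', 'revenue']
```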

Bag of Words

From there we tokenise the text and use the Bag of Words model to put the text data into a form the computer can work with. The BoW model has two components: the vocabulary and the frequency (or some measure thereof). Finally, we can filter out both extremely high and low frequency words. We get rid of low frequency words to reduce the chance of overfitting, and remove high frequency words because they are usually not very relevant and so can hide the signal.

There are two primary components that go into preprocessing the text for the LDA model. We need the bigrams and an id2word mapping. The bigram function serves to automatically detect (using the gensim.models.Phrases method) which words ought to be grouped as a phrase. For example, in our context “original” and “equipment” are concatenated to “original_equipment”. This is useful both for reducing noise and for creating better features. The bi_min parameter requires at least 3 occurrences of the concatenated phrase in the corpus before confirming it as a valid phrase. We carry out this operation with the following function:

def bigrams(words, bi_min=3):
  bigram = gensim.models.Phrases(words, min_count = bi_min)
  bigram_mod = gensim.models.phrases.Phraser(bigram)
  return bigram_mod

The Phrases method learns these bigrams from the list of words, then the Phraser method exports them into a frozen model that can no longer be updated, which means less RAM usage and faster processing.

Now, for the id2word mapping, we take the list of bigrams, where each element is the bigram representation of a document, and then we feed it into Gensim’s Dictionary method. This creates a mapping between each token (as it is now called after converting into bigrams) and an id for that token. We also filter, such that tokens need to be in at least “no_below” documents, and in no more than “no_above” fraction of documents. Finally we convert all of our documents into a “corpus”, where a document is represented by a list of (id, frequency) tuples. The id value comes from the id2word mapping, and the frequency is calculated based on how many of these ids are in the document.

id2word = gensim.corpora.Dictionary(bigram)
id2word.filter_extremes(no_below=3, no_above=0.35)
corpus = [id2word.doc2bow(text) for text in bigram]

For example, passing the single word “microsoft” through doc2bow gives:

[(539, 1)]

So the word “microsoft” has id 539, and since we only passed the word in once, it has a frequency of 1. If we had passed in [“microsoft”, “microsoft”], then we would have gotten [(539, 2)]. More generally, combining two documents sums the frequencies of each token id, like a multiset union.
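This additivity can be sketched with plain Python Counters, which behave like the (id, frequency) tuples but are keyed by the words themselves (the documents below are made up):

```python
from collections import Counter

doc_a = Counter(["microsoft", "cloud"])
doc_b = Counter(["microsoft", "azure"])

# Combining two documents sums the frequency of each token
merged = doc_a + doc_b
print(merged["microsoft"])  # 2
```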

It is important to be aware of the BoW assumptions, predominantly that there are no relationships between words, and that a document’s meaning is composed solely of which words it contains, not their order. This is of course highly unrealistic; however, it is a simplifying assumption that works quite well in many cases.

It is also important to note that we do not need to use tf-idf (term frequency-inverse document frequency), because LDA addresses term frequency issues by construction.


Mallet (Machine Learning for Language Toolkit), is a topic modelling package written in Java. Gensim has a wrapper to interact with the package, which we will take advantage of.

The difference between the LDA model we have been using and Mallet is that the original LDA uses variational Bayesian inference, while Mallet uses collapsed Gibbs sampling. The latter is more precise, but slower. In most cases Mallet performs much better than the original LDA, so we will test it on our data. As shown in the notebook, Mallet dramatically increases our coherence score, demonstrating that it is better suited to this task than the original LDA model.

To use Mallet in Google Colab, we need to go through a few extra steps. First we install Java (in which Mallet is written).

import os

def install_java():
  !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      # install OpenJDK
  os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"   # set environment variable
  !java -version                                                  # check Java version

Then we download Mallet and unzip it:
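In a Colab cell this would look something like the following (the URL is the standard Mallet 2.0.8 distribution archive; check the Mallet site for the current link):

```shell
!wget -q http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip -qq mallet-2.0.8.zip -d /content
```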


Finally, we set the path to the Mallet binary:

os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet'

To use the model, we use the Gensim wrapper for it. We need to specify the path to the Mallet binary as the first argument. The other arguments include the training corpus, the number of topics (a hyperparameter) and the id2word mapping.

from gensim.models.wrappers import LdaMallet

lda_mallet = LdaMallet(
    mallet_path,
    corpus=corpus,
    num_topics=7,
    id2word=id2word
)

The last thing we need to do before we can use the model is to convert it to the Gensim format, like so:

gensimmodelMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet)

Now we can use the model just like we would use the original LDA model!

Note – Here is an excellent guide to using Mallet with Google Colab:

Finding The Optimal Number of Topics

One hyperparameter that needs to be tuned is the number of topics your model looks for. This is a fixed constant and must be tuned by you. There is no simple way to do this, but one approach is to train several models, each with a different number of topics, and compute the coherence score for each. Note we are using the Mallet variant of the model.

def plot_coherence(dictionary, corpus, texts, maximum=30, minimum=3, step=4):
  coherence_values = []
  model_list = []
  for num_topics in tqdm(range(minimum, maximum, step)):
    model = LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
    model_list.append(model)
    gensimMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model)
    coherencemodel = CoherenceModel(model=gensimMallet, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence_values.append(coherencemodel.get_coherence())
  return model_list, coherence_values

After computing this score for a number of models, plotting the coherence score against the number of topics should reveal an elbow shaped graph. The optimal number of topics is at the elbow, where the graph starts to flatten out. Our graph was pretty noisy and didn’t have an elbow shape. Exploring the results from each configuration led to 7 topics working the best. Choosing too few topics meant that each topic was too vague and high-level. Choosing too many topics is even more problematic, because there isn’t enough data for each topic and so there is a problem of overfitting.
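Plotting the scores returned by the function above might look like the following sketch (the coherence values here are placeholders, not our actual results):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, e.g. outside a notebook
import matplotlib.pyplot as plt

topic_counts = list(range(3, 30, 4))  # matches minimum=3, maximum=30, step=4
coherence_values = [0.31, 0.38, 0.42, 0.41, 0.43, 0.40, 0.42]  # placeholder scores

plt.plot(topic_counts, coherence_values, marker="o")
plt.xlabel("Number of topics")
plt.ylabel("Coherence score (c_v)")
plt.savefig("coherence.png")
```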

Using The Model

We can use a great plotting tool called pyLDAvis. As the name suggests, it enables you to visualise the topic modelling output using a number of techniques, such as dimensionality reduction.


We give a brief introduction to pyLDAvis, but for full details, please see “LDAvis: A method for visualizing and interpreting topics” by Sievert and Shirley.

There are five main components to the display. Refer to the image below to make sense of each component.

  1. Default Topic Circles: There are K circles for K topics. The bigger the circle, the more tokens that topic captures.
  2. Red Bars: Each red bar represents the estimated number of times a particular term can be found within a particular topic.
  3. Blue Bars: Displays the overall frequency of a particular term in the corpus.
  4. Topic Term Circles: For each term, K circles where each area is proportional to the frequency with which the term is estimated to be generated by each topic. So a bigger circle for a particular term and a particular topic implies that this term is estimated to be more likely to be generated under this topic.
  5. Distance Between Circles: Distance between circles are given by the Jensen-Shannon divergence. If you are familiar with KL-divergence, you know that it is asymmetric: for probability distributions P and Q on the same probability space, the KL-divergence from P to Q is not the same as Q to P (in general). However, Jensen-Shannon divergence, which is based on KL-divergence is in fact symmetric, and also guaranteed to have a finite value – two useful properties. Thus the further away two circles are, the more different the distributions of words under the two topics are.
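SciPy exposes this measure directly (as the Jensen-Shannon distance, the square root of the divergence), and the symmetry is easy to check; the two distributions below are made up for illustration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Two made-up topic-word distributions over the same three tokens
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])

# Unlike KL-divergence, Jensen-Shannon is symmetric: both directions agree
print(jensenshannon(p, q), jensenshannon(q, p))
```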

Inspecting topic 3 reveals the terms: “improved_advertising”, “cloud_platforms”, “phones_intelligent”, “information_investors”, “global_compute” and so on. This topic can be interpreted as integrating technology with current business practices. Let’s simply call this topic “Tech” for short.

One great thing we can now do is look at how certain mentions of particular topics change over time within each company. First, we need a function to extract the prominence of each topic for a given document. The process is: convert each document to a list of words, remove stopwords, convert to bigrams, lookup the id of each bigram as well as the frequency, find the topic scores using the topic model, and finally extract the score of the desired topic.

def get_df_topics(row, topic_id):
  words = list(sent_to_words([row]))
  words = remove_stopwords(words)
  bigram_mod = bigrams(words)
  bigram = [bigram_mod[report] for report in words]
  tokens = train_id2word.doc2bow(bigram[0])
  topics = bestModelMallet.get_document_topics(tokens)
  if topics is not None:
    for index, score in topics:
      if index == topic_id:
        return score

For Microsoft, the “Tech” topic has been wildly fluctuating since 1995, and is at a low point in recent years. Looking at IBM, “Tech” has been trending strongly upwards, but has recently fallen to a low point not seen since 2004. “Tech” at Amazon, on the other hand, has been trending upwards over the past 10 years.

Wrapping Up

We have learned about topic modelling, and more specifically LDA and Mallet. Topic modelling is a great unsupervised tool for extracting topics from documents, and Mallet is a particular implementation that improves on the main weakness of the original LDA model: precision.

We learned how to preprocess and clean text for building an LDA/Mallet model. First, cleaning the data with regular expressions, then removing stopwords, and creating a Bag of Words model. We discussed the assumptions of the BoW model, but recognised that it can be a useful simplification in many cases.

Then we built the Mallet and LDA models, chose the optimal number of topics, and visualised the topics in pyLDAvis. The visualisation methods of pyLDAvis were covered in detail.

After finding the topic representation for each document, we saw how further work can be done, such as looking at how certain topics trend over time.

If you would like to go further, I would recommend looking into Deep LDA or Neural Topic Modelling with Reinforcement Learning. 

