This article is accompanied by a Google Colab notebook, which contains all the code and additional mathematical details. You can find the notebook here:

https://colab.research.google.com/drive/15JsLw-JwXViKdNaWCrW24_vaBVjZ1ZTR?usp=sharing

What Will You Learn in This Article?

In this article we will explore how to automatically identify relevant news about a company using a technology called knowledge graphs (KGs). You will be able to flag news that mentions a company, its competitors, its people or its products. To do this, we will introduce KGs and look at how to construct your own using data from Wikidata and news sources.

The steps to do this are as follows:

  • Build a knowledge graph of companies and their related entities
  • Extract information from news
  • Add news data to the knowledge graph
  • Make inferences

What are Knowledge Graphs?

A knowledge graph organises information into an ontology. An ontology is simply a set of concepts and the relations between them. Once we have an ontology, we can leverage the mathematical study of networks to perform interesting inference tasks. This is how Google is able to present relevant information when you search for a company. Here is a simple diagram to help visualise a KG.

The nodes (everything at the end of a line) are entities. The edges/lines connecting these entities are called relations.

More formally, in mathematical terms, a knowledge graph is a set of (subject, predicate, object) triples:

G = { (s, p, o) | s, o ∈ E, p ∈ R }

where E is the set of entities and R is the set of relations: the subject s and object o are entities, while the predicate p is a relation.
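
In Python terms, a miniature knowledge graph could be represented as a plain list of such triples (the facts below are only illustrative and are not data we use later):

# A toy knowledge graph: each tuple is one (subject, predicate, object) triple
knowledge_graph = [
    ("Google", "owned by", "Alphabet Inc."),
    ("Alphabet Inc.", "founded by", "Larry Page"),
    ("Tesla, Inc.", "industry", "automotive industry"),
]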

Building a Knowledge Graph

Triples extraction from Wikidata

Wikidata is a free knowledge base that contains a huge amount of data, with a unique identifier for each entity and each relation. Entity identifiers are the letter Q followed by an integer, and predicate (relation) identifiers are the letter P followed by an integer. For example, Google has the id Q95, whereas the predicate “owned by” has the id P127. For our task, we would like to get Wikidata’s data for a list of companies; recalling the definition of a knowledge graph, this means we aim to extract (subject, predicate, object) triples from Wikidata.

Qwikidata is a Python library that makes it easy to extract data from Wikidata. Given a list of companies and a list of predicates we are interested in, let’s write a function that will extract the list of (subject, predicate, object) triples.

from tqdm import tqdm
from qwikidata.entity import WikidataItem, WikidataProperty
from qwikidata.linked_data_interface import get_entity_dict_from_api

def get_triples_from_wikidata(companies_list, predicate_list):
    subjects, predicates, objects = [], [], []
    for Q_id in tqdm(companies_list):
        # Fetch the company entity (e.g. "Q95" for Google) from the Wikidata API
        Q_company = WikidataItem(get_entity_dict_from_api(Q_id))
        for predicate in predicate_list:
            # Human-readable label of the predicate (e.g. "P127" -> "owned by")
            predicate_label = WikidataProperty(get_entity_dict_from_api(predicate)).get_label()
            # Each claim links the company to another entity through this predicate
            for claim in Q_company.get_claim_group(predicate):
                if claim.mainsnak.datavalue is None:
                    continue  # skip claims without a concrete value
                object_id = claim.mainsnak.datavalue.value["id"]
                object_entity = WikidataItem(get_entity_dict_from_api(object_id))
                subjects.append(Q_company.get_label())
                predicates.append(predicate_label)
                objects.append(object_entity.get_label())
    return subjects, predicates, objects

The idea of this function is to loop through each company and obtain its WikidataItem object from the API. From there, we loop through each of the predicates we would like to explore and check whether any objects are related to the company through that predicate. We can then store all of these triples in a Pandas dataframe, where each row is one triple.
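
As a minimal sketch of how this might look (the Q and P identifiers below are just a couple of illustrative examples; the full lists live in the notebook):

import pandas as pd

# Illustrative IDs only -- Q95 is Google, Q478214 should be Tesla, Inc. (check on wikidata.org)
companies = ["Q95", "Q478214"]
predicates = ["P127", "P169", "P452"]  # owned by, chief executive officer, industry

s, p, o = get_triples_from_wikidata(companies, predicates)
wikidata_triples = pd.DataFrame({"subject": s, "predicate": p, "object": o})
wikidata_triples.head()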

Additionally, we can visualise a segment of this KG using a package with graph-visualisation features, such as NetworkX. For example, here is what the neighbourhood of Tesla looks like:
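
One way to produce such a picture, assuming the triples are in the wikidata_triples dataframe from the sketch above and that the Wikidata label for Tesla is "Tesla, Inc.":

import networkx as nx
import matplotlib.pyplot as plt

# Build a directed graph from the triples and keep only Tesla and its direct neighbours
G = nx.from_pandas_edgelist(wikidata_triples, source="subject", target="object",
                            edge_attr="predicate", create_using=nx.DiGraph)
subgraph = G.subgraph(["Tesla, Inc."] + list(G.successors("Tesla, Inc.")))

pos = nx.spring_layout(subgraph, seed=0)
nx.draw(subgraph, pos, with_labels=True, node_color="lightblue", node_size=1500, font_size=8)
nx.draw_networkx_edge_labels(subgraph, pos,
                             edge_labels=nx.get_edge_attributes(subgraph, "predicate"))
plt.show()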

Wikidata is very reliable, and so now we have a good basis for our KG. In our small KG we have a few companies, as well as some useful relationships between the companies and other entities. However, this is still relatively barebones and it is unlikely that we can discover new sources of knowledge from the KG that we have. Let’s look at how we can fix that.

Enrichment using News Articles

Now that we have a strong foundation to build upon, we can look for further sources of data to enrich our knowledge graph. For this article, we will use data from one of Auquan’s Quant Quest (QQ) competitions: The Text Prediction Challenge. We will extract all of the news pieces that mention at least one of the companies in our company list. From there, we can enrich our KG with the information these articles contain after doing some preprocessing.

We are interested in just the content and the titles from the articles, which can all be stored in a Pandas dataframe for easy access.
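
A rough sketch of that filtering step, assuming the articles sit in a dataframe called news_df with "title" and "content" columns (both names are assumptions) and that the company labels come from our Wikidata triples:

# Keep only the articles that mention at least one of our companies
company_names = wikidata_triples["subject"].unique()

def mentions_a_company(row):
    text = f"{row['title']} {row['content']}"
    return any(name in text for name in company_names)

relevant_news = news_df[news_df.apply(mentions_a_company, axis=1)].reset_index(drop=True)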

After removing all of the articles that do not mention any of our companies, we are ready to perform the (subject, predicate, object) extraction. We can use SpaCy along with this excellent tool to do so. Extracting triples from free text is a field of study in its own right, so we will not discuss it in detail here. Again, we need a function to obtain all of the triples: looping through each article in the dataframe, we use SpaCy’s NLP model to tokenise the text before passing it to the findSVOs (subject-verb-object) tool.

import spacy
from tqdm import tqdm
# `findSVOs` is provided by the subject-verb-object extraction tool linked above

nlp = spacy.load("en_core_web_sm")  # any English SpaCy model will do here

def get_spacy_triples_from_news(dataframe, title_or_content="title"):
    subjects, predicates, objects = [], [], []
    for index, row in tqdm(dataframe.iterrows(), total=len(dataframe)):
        # Tokenise the article text with SpaCy, then extract subject-verb-object triples
        text = row[title_or_content]
        tokens = nlp(text)
        svos = findSVOs(tokens)
        for triple in svos:
            subjects.append(triple[0])
            predicates.append(triple[1])
            objects.append(triple[2])
    return subjects, predicates, objects
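
Calling this on the titles of the filtered articles and storing the result might look like this (variable names are illustrative):

import pandas as pd

s, p, o = get_spacy_triples_from_news(relevant_news, title_or_content="title")
news_triples = pd.DataFrame({"subject": s, "predicate": p, "object": o})
news_triples.sample(3)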

The triples extracted from this process are not as high quality as those from Wikidata; however, they still provide a lot of useful information. Here are three triples randomly sampled after being stored in another Pandas dataframe:

We can concatenate the Wikidata and the news triples and now we have our basic knowledge graph! 
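
A minimal sketch of that step, reusing the two dataframes from the earlier sketches and converting the result into the array format AmpliGraph expects:

import pandas as pd

kg_triples = pd.concat([wikidata_triples, news_triples], ignore_index=True)

# AmpliGraph works with an array of [subject, predicate, object] rows
X = kg_triples[["subject", "predicate", "object"]].to_numpy()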

In this section we learned how to extract triples from free text, in our case news articles. At the moment we just have a huge set of triples; in the next section we will use a package called AmpliGraph to build the knowledge graph into a model that learns embeddings.

Embedding Model

Now that we have our KG, we can build an embedding model from it using the AmpliGraph library. The embedding model can be thought of as a layer of abstraction built on top of the KG: it encodes concepts (entities and relations) from the KG into lower-dimensional vectors, much like word embeddings such as word2vec if you are familiar with NLP. Once we have these embeddings, we can perform mathematical operations on them.

The embedding model is learned by training a neural network over the knowledge graph. The aim is to minimise a loss function defined in terms of a scoring function, which assigns a score to each (subject, predicate, object) triple. Optimisation pushes the scoring function to give positive (subject, predicate, object) statements high scores and negative statements low scores.

The model we will use is called ComplEx. The paper detailing ComplEx can be found here: https://arxiv.org/abs/1606.06357. The model is instantiated as follows:

from ampligraph.latent_features import ComplEx
 
model = ComplEx(batches_count=50,
                epochs=300,
                k=20,
                eta=20,
                optimizer='adam', 
                optimizer_params={'lr':1e-4},
                loss='multiclass_nll',
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                seed=0, 
                verbose=True)
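
For reference, in the paper linked above ComplEx represents each entity and relation as a complex-valued vector and scores a triple as the real part of a trilinear product, f(s, p, o) = Re(⟨e_s, w_p, conj(e_o)⟩), where e_s and e_o are the subject and object embeddings, w_p is the predicate embedding and conj denotes the complex conjugate.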

In the configuration above, k is the dimensionality of the embedding space and eta controls the number of negative (corrupted) triples generated for each positive triple shown to the model.

The loss function we are using is multiclass NLL, which is the sum of the negative log-probabilities of the observed triples, with an added Lp regularisation penalty on the embeddings (here p = 3).

After training the model, we evaluate it with out-of-sample data.
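
A sketch of what training and evaluation could look like, assuming X is the triples array from earlier (the split size is arbitrary; see the notebook for the exact setup):

import numpy as np
from ampligraph.evaluation import train_test_split_no_unseen, evaluate_performance
from ampligraph.evaluation import mrr_score, hits_at_n_score

# Hold out a test set while keeping every entity and relation visible during training
X_train, X_test = train_test_split_no_unseen(X, test_size=500)

model.fit(X_train)

# Rank each held-out triple against corrupted versions of itself
ranks = evaluate_performance(X_test, model=model,
                             filter_triples=np.concatenate([X_train, X_test]),
                             verbose=True)

print("MRR:    ", mrr_score(ranks))
print("Hits@10:", hits_at_n_score(ranks, n=10))
print("Hits@3: ", hits_at_n_score(ranks, n=3))
print("Hits@1: ", hits_at_n_score(ranks, n=1))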

The results mean that the model ranks the correct entity within its top 10 predictions 24% of the time, within its top 3 predictions 21% of the time, and first 19% of the time. For a more detailed discussion of these metrics, see the notebook.

We have seen how to build a knowledge graph embedding model in AmpliGraph. All that we needed to feed it was the triples from Wikidata and the news data that we had collected earlier. Embedding models are trained using a neural architecture, where a scoring function learns to differentiate between true statements, and statements that are unlikely to be true.

Making Inferences

With the model fully trained, we can use the embeddings it has learned for knowledge discovery, entity prediction and visualisation.

Query Top N

If we remove either an entity or a relation from the triple, such as (subject, predicate, unknown), (unknown, predicate, object) or (subject, unknown, object), we can ask the learned model to fill in the “unknown”. For example, let’s try (“tesla”, unknown, “cars”). The top 3 results are:

(tesla, made, cars), (tesla, sold, cars), (tesla, delivered, cars). Query Top N can be used to find interesting relationships between entities.
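
AmpliGraph exposes this kind of query through its discovery module; a minimal sketch (entity labels are assumed to match those stored in our graph):

from ampligraph.discovery import query_topn

# Which predicates does the model consider most plausible between "tesla" and "cars"?
triples, scores = query_topn(model, top_n=3, head="tesla", tail="cars")
for (s, p, o), score in zip(triples, scores):
    print(f"({s}, {p}, {o})  score={score:.3f}")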

Ranking News

One naïve way to get the most relevant news for each company is to extract all of the valid triples from every article, find their embeddings, take an average and then find the company whose embedding is closest in the sense of a Euclidean distance metric. For more details, see the notebook.
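
A rough sketch of this idea, using only the entity embeddings (the helper below is hypothetical and simply skips terms the model has never seen):

import numpy as np

def rank_companies_for_article(article_triples, company_names, model):
    """Average the embeddings of an article's triples, then sort companies by distance."""
    known = set(model.ent_to_idx)
    terms = [t for triple in article_triples for t in triple if t in known]
    if not terms:
        return None
    article_vec = model.get_embeddings(np.array(terms), embedding_type="entity").mean(axis=0)
    company_vecs = model.get_embeddings(np.array(company_names), embedding_type="entity")
    distances = np.linalg.norm(company_vecs - article_vec, axis=1)
    return sorted(zip(company_names, distances), key=lambda pair: pair[1])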

This simple approach produces decent results:

Note that the predictions are in descending order, so “Prime Day was Amazon’s biggest day” is deemed the most relevant article. It is also interesting that the article with the headline “Apple owes $450M for book-publishing conspiracy” discusses both Apple and Amazon; the model has preferred to rank it as associated with Amazon, perhaps because books are mentioned frequently.

This naïve approach can be extended, perhaps by taking a weighted average of the individual triples’ embeddings, or by using a more advanced approach.

This basic approach can then be built upon to automatically flag news relevant to the companies you are interested in. Imagine you were invested in 200 companies: keeping on top of the breaking news would be a full-time job! You could instead run all the news through your relevance scorer to identify which articles you should be reading.

Conclusion

In summary, we have seen how to build a knowledge graph by first getting a “ground truth” from Wikidata and then enriching it with news articles, using some simple natural language processing techniques to extract (subject, predicate, object) triples. These triples represent concepts and the relationships between them. Since a knowledge graph is just a list of (subject, predicate, object) triples, we can train a neural network on this list, projecting it onto a (lower-dimensional) embedding space. Once we have an embedding space, we can perform mathematical operations on it, because it is just a vector space. For example, we can find the company whose embedding has the smallest Euclidean distance to an article’s embedding.


David

I'm the community lead at Auquan. Feel free to message me if you get into any problems. A fun fact about me is I actually studied medicine at university.
