This article refers to a problem from Quant-Quest’s UK Data Science Varsity and Asia Invitational competitions. You can view the problems and code along here

Introduction

There are several approaches to building text summaries. Broadly, they can be split into two groups, defined by the type of output they produce: 

  1. Extractive, where important sentences are selected from the input text to form a summary. Most summarization approaches today are extractive in nature.
  2. Abstractive, where the model forms its own phrases and sentences to offer a more coherent summary, like what a human would generate. This approach is definitely more appealing, but much more difficult than extractive summarization.

From a practical point of view, abstractive summaries require an extra leap of faith for people to adopt. Readers not only have to accept that the summarizer has identified the key sentences, but also that it has created a summary that conveys that information without mistakes. When high-value decisions are made on the back of such a summary, it makes sense to minimise this risk.

For this reason, we’ll be looking at an extractive approach based on this paper from the University of Texas. Our pipeline is going to consist of the following steps:

  1. Preparation
  2. ELMo encoding
  3. Clustering
  4. Summarization

Let’s look at each of these steps.

Step One: Preparation

These initial bits just download the packages we’re going to need and fetch the data for our problem (if you’re not doing one of these competitions, you may want to skip this step).

Firstly, we’re going to turn on auto-reload for this notebook so that any changes to the external libraries get automatically included here.

%load_ext autoreload
%autoreload 2

Next, here is a list of the non-standard packages we’re using:

# Uncomment to install any of the needed packages if you don't have them already
# !pip install tensorflow==1.14
# !pip install tensorflow_hub
# !pip install -U wget
# !pip install -U spacy

spaCy is an industrial-strength NLP library whose pre-trained models we’re going to use to split our sample text into sentences. We’re using the English, core, web-trained, medium model (en_core_web_md), so the code is pretty self-explanatory.

!python -m spacy download en_core_web_md
#this may take a little while

And the rest:

import numpy as np
import wget
from zipfile import ZipFile
import os
import glob
from rouge_score import rouge_scorer
from gensim.summarization.summarizer import summarize
from tqdm import tqdm
import spacy
nlp = spacy.load('en_core_web_md')

Finally, we’re ready to import the data and get started. We’ve just copied all of this straight out of the template file for the problem. If you want to understand what each part is specifically doing, check out the getting started section of the problem page.

# Copied from the template file
class Summarizer():

    def __init__(self):
        # Define any global variables or download any pre trained models here
        pass

    def download_data(self):
        '''
        Download data from S3 bucket
        '''
        if not os.path.isdir('summary_data_train'):
            print('Downloading!!')
            data_url = 'https://qq16-data.s3.us-east-2.amazonaws.com/summary_data_train.zip'
            filename = wget.download(data_url, out='.')
            with ZipFile(filename, 'r') as zipObj:
                zipObj.extractall()
        else:
            print('Already Downloaded data, delete previous folder to download again!!')

    def read_data(self):
        '''
        Read the data in the form of two dictionaries
        Returns:
            X_dict (dict: filename->text): dictionary containing text from the document
            y_dict (dict: filename->text): dictionary containing text from the summary
        '''
        X_filepath = glob.glob('summary_data_train/docs/*.txt')
        y_filepath = glob.glob('summary_data_train/summaries/*.txt')

        X_dict, y_dict = {}, {}
        for f1, f2 in zip(X_filepath, y_filepath):
            name = f1.split('\\')[-1].split('.')[0]
            with open(f1, 'r', encoding="utf8") as f:
                X_dict[name] = f.read()

            with open(f2, 'r', encoding="utf8") as f:
                y_dict[name] = f.read()
        return X_dict, y_dict

    def get_summary(self, raw_text):
        '''
        Function to generate the summary from raw text. THIS IS THE ONLY PART THAT IS REQUIRED TO BE FILLED.
        Args:
            raw_text (string): string of text from full document
        Returns:
            summary (text)
        '''
        # Using a naive model
        return summarize(raw_text, 0.55)

The last bit of preparation is downloading the data and reading it into the summarizer:

summarizer = Summarizer()
print('Step-1: Downloading data:')
try:
    summarizer.download_data()
except Exception as e:
    print(e)
    print('Failed to download data')

# Getting document and summary
print('Step-2: Reading documents and summaries:')
X_dict, y_dict = summarizer.read_data()
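
Before moving on, it can be worth a quick look at what we’ve just loaded. A minimal check (purely illustrative, not part of the template):

# Quick look at the data we just read in
print(f'{len(X_dict)} documents, {len(y_dict)} reference summaries')
sample_name = list(X_dict.keys())[0]
print(X_dict[sample_name][:300])  # first few hundred characters of one document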

Step Two: Encoding The Data

As you might be aware from the title, we are using ELMo as our encoder. Developed in 2018 by AllenNLP, ELMo goes beyond traditional embedding techniques, using a deep, bidirectional LSTM model to create word representations.

Rather than relying on a dictionary of words and their corresponding vectors, ELMo analyses words within the context in which they are used. This is important because a word like “sick” may have entirely opposite meanings depending on the context. It is also character-based, allowing the model to form representations of out-of-vocabulary words. 

This means that the way ELMo is used is quite different from word2vec or fastText. Rather than having a dictionary ‘look-up’ of words and their corresponding vectors, ELMo instead creates vectors on-the-fly by passing text through the deep learning model.
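
To make the context-dependence concrete, here is a small, self-contained sketch. The demo sentences, and the use of the module’s per-token “elmo” output rather than the pooled “default” output, are illustrative choices of mine: it embeds “sick” in a slang context and a literal one and compares the two vectors.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2")
demo = ["that concert was sick", "he stayed home because he was sick"]
# "elmo" returns one 1024-dimensional vector per (whitespace) token
token_embeddings = elmo(demo, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(token_embeddings)

# Pick out the vector for "sick" in each sentence and compare them
v1 = vectors[0, demo[0].split().index("sick")]
v2 = vectors[1, demo[1].split().index("sick")]
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(f"cosine similarity between the two 'sick' vectors: {cosine:.2f}")

A static embedding would give exactly the same vector in both sentences; ELMo does not.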

We’ll be using ELMo to create embeddings for the text.

Let’s start by working with just one document, parsing it into sentences using spaCy:

document = list(X_dict.values())[2]
summary = list(y_dict.values())[2]
doc = nlp(document)
sentences = [sent.text for sent in doc.sents]

We’re going to use TensorFlow Hub, as it makes this kind of NLP project slightly easier. The ELMo model it provides is fully pre-trained and includes all the layers, vectors and weights that we will need.

import tensorflow as tf
import tensorflow_hub as hub

url = "https://tfhub.dev/google/elmo/2"
embed = hub.Module(url)

Now we need to create the embeddings for our text

embeddings = embed(
    sentences,
    signature="default",
    as_dict=True)["default"]

# Start a session and run ELMo to return the embeddings in variable x

Then start up TensorFlow to do its magic

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    x = sess.run(embeddings)
x.shape

We should see a shape of X by 1024, with X representing the number of sentences in the text and 1024 being the number of dimensions ELMo creates for each sentence. 
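
A one-line check makes that explicit:

# One 1024-dimensional vector per sentence
assert x.shape == (len(sentences), 1024)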

Step Three: Clustering

At this point, we now have all our sentences represented as embeddings. Clustering is going to group sentences that are similar, from which we will be able to create our final summary.

The number of clusters we create here will be the number of sentences we’ll have in our summary. Currently, the approach here is pretty basic and that will limit the quality of our final summary. It would be wise to experiment with different clustering approaches and numbers of clusters to see how that affects your final results.

For our simple approach, we’ve just used a k-means clustering algorithm with 10 clusters, giving a 10-sentence summary. Do you think this would work well for small documents? What about large ones?

import numpy as np
from sklearn.cluster import KMeans

n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(x)
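
Note that a fixed 10 clusters compresses long documents heavily and will actually fail on very short ones, since scikit-learn’s KMeans raises an error when n_clusters exceeds the number of samples. One length-aware alternative, purely as a sketch (the one-summary-sentence-per-ten-input-sentences ratio and the cap of 10 are arbitrary choices, not tuned values):

# Scale the number of summary sentences with document length
n_clusters = max(1, min(10, len(sentences) // 10))
kmeans = KMeans(n_clusters=n_clusters).fit(x)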

Step Four: Summarizing

The final step is to take these clusters and create our summary. To do this, we’re going to identify the centre of each cluster and then find the sentence embedding closest to that point. These sentences are then joined together and returned as the text summary.

One thing to bear in mind here is that if you have some very unusual sentences, e.g. ones containing an HTML link, they are likely to end up in a cluster of their own and be pulled into the summary, even though they aren’t actually useful.
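
One simple mitigation, sketched below, is to filter such sentences out before they are embedded in Step Two, so they never get a cluster of their own (the regex and the four-word minimum are illustrative choices, not part of the original pipeline):

import re

# Run this before the embedding step: drop sentences that contain a URL
# or are very short, so they can't claim a cluster of their own
url_pattern = re.compile(r'https?://\S+|www\.\S+')
sentences = [s for s in sentences
             if not url_pattern.search(s) and len(s.split()) >= 4]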

from sklearn.metrics import pairwise_distances_argmin_min

avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, x)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])
summary
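
We imported rouge_scorer at the start but haven’t used it yet; here is a minimal sketch of how you could score this output against the reference summary for the same document (note that the reference we loaded earlier into summary has just been overwritten by the generated text, so we reload it from y_dict):

from rouge_score import rouge_scorer

reference = list(y_dict.values())[2]   # reload the reference summary for this document
generated = summary

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)
for name, score in scores.items():
    print(f'{name}: precision={score.precision:.3f}, recall={score.recall:.3f}, f1={score.fmeasure:.3f}')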

There you go.

This problem can be found on Auquan’s Quant Quest: links.auquan.com/asiaopen


David

I'm the community lead at Auquan. Feel free to message me if you get into any problems. A fun fact about me is I actually studied medicine at university.
