Text Auto Summarization (Extraction)

Recently I was given a topic to research: how to summarize text automatically. Here I share some of my findings, and I hope they are helpful.

Summarization Methods

We can classify summarization methods into different types by input type, purpose, and output type. Extractive and abstractive are the two most common approaches.


Here, I would like to introduce two extractive methods. One is stats-based, the other is deep learning-based.

Stats-based

  1. Idea: assign each word a weighted frequency. For each sentence, sum the weighted frequencies of the words it contains, then pick the sentences with the highest sums.
  2. Steps
    2.1 Preprocessing: replace extra whitespace characters and delete the parts we do not need to analyze.
replace = {
    ord('\f'): ' ',
    ord('\t'): ' ',
    ord('\n'): ' ',
    ord('\r'): None
}
data = data.translate(replace)  # str.translate returns a new string, so reassign it
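
As a quick illustration of the translation table above (using a made-up sample string, not part of the original pipeline), tabs and newlines collapse to spaces and carriage returns are dropped:

sample = "First line.\nSecond\tline.\r"
print(sample.translate(replace))  # "First line. Second line."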

2.2 Tokenize the text into sentences

import nltk
sent_list = nltk.sent_tokenize(data)  # data is the cleaned text from step 2.1

2.3 Get frequency of each word

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(data):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

2.4 Weighted frequency of occurrence

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

2.5 Calculate the sum of weighted frequencies for each sentence

sentence_scores = {}
for sent in sent_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:  # ignore very long sentences
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

2.6 Sort sentences in descending order of score and pick the top ones

import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

Deep Learning-based

  1. Idea: encode each sentence as a vector in a high-dimensional space, cluster the vectors using k-means, then pick the sentence closest to the center of each cluster to form the summary.
  2. Steps:
    2.1 Preprocessing and sentence tokenization (same as the stats-based method)
    2.2 Skip-Thought Encoder


Encoder Network: The encoder is typically a GRU-RNN which generates a fixed length vector representation h(i) for each sentence S(i) in the input.
Decoder Network: The decoder network takes this vector representation h(i) as input and tries to generate two sentences — S(i-1) and S(i+1), which could occur before and after the input sentence respectively.

These learned representations h(i) are such that embeddings of semantically similar sentences are closer to each other in vector space, and therefore are suitable for clustering.


Skip-Thoughts Architecture

import skipthoughts

# You would need to download the pre-trained models first
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
encoded = encoder.encode(sent_list)  # sent_list is the sentence list from step 2.1
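
To see why these embeddings are suitable for clustering, here is a minimal sketch (assuming encoded was produced above) that compares two sentence vectors with cosine similarity; semantically similar sentences should score closer to 1:

import numpy as np

def cosine(u, v):
    # cosine similarity between two sentence vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(encoded[0], encoded[1]))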

2.3 Clustering

import numpy as np
from sklearn.cluster import KMeans

n_clusters = int(np.ceil(len(encoded) ** 0.5))  # heuristic: about sqrt(#sentences) clusters
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

2.4 Summarization

from sklearn.metrics import pairwise_distances_argmin_min

avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))  # average position of cluster j's sentences in the document
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])  # keep clusters in document order
summary = ' '.join([sent_list[closest[idx]] for idx in ordering])

Reference:

  1. Unsupervised Text Summarization using Sentence Embeddings, https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
  2. Skip-Thought Vectors, https://arxiv.org/abs/1506.06726
  3. Text Summarization with NLTK in Python, https://stackabuse.com/text-summarization-with-nltk-in-python/

Basic NLP with spaCy

My first contact with NLP was sentiment classification with NLTK, which showed me how NLP works. A couple of days ago, since I needed to extract some keywords from one or more paragraphs, I tried spaCy, which I found easier for relatively simple tasks. I have summarized some key concepts from spaCy.io and included one of my examples for reference.

The pipeline of NLP

Whether we use NLTK or spaCy, the pipeline is almost the same:

1. Sentence Segmentation

It makes sense to break a paragraph down into sentences, since each sentence expresses its own topic. spaCy uses the dependency parse from its statistical model to determine sentence boundaries.

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Unlike structured papers, web articles are usually semi-structured. Hence, we may need some customized processing, which we can add to the standard pipeline.

def set_custom_boundaries(doc):
    # do something here
    return doc
# you can only set boundaries before a document is parsed
nlp.add_pipe(set_custom_boundaries, before='parser')
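
For example, a minimal sketch of what the body of set_custom_boundaries might look like (assuming, purely for illustration, that '|' marks a section break in the scraped text); it is then registered with nlp.add_pipe as shown above:

def set_custom_boundaries(doc):
    # hypothetical rule: start a new sentence after every '|' token
    for token in doc[:-1]:
        if token.text == '|':
            doc[token.i + 1].is_sent_start = True
    return doc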

2. Word Tokenization

Tokenization breaks a sentence into separate words called tokens. The tokenization rules are language-specific; they determine how punctuation, prefixes, suffixes and infixes are split.

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

3. PoS (Part-of-Speech) Tagging, Lemmatization and Stop Words

At this step, spaCy makes a prediction for each token and assigns the most likely tags. At the same time, spaCy determines each token's base form (lemma) and whether it is a stop word.

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, [child for child in token.children])
# token.lemma_: the base form of the word
# token.pos_: simple part-of-speech tag
# token.tag_: detailed part-of-speech tag
# token.dep_: syntactic dependency
# token.shape_: word shape
# token.is_alpha: whether the token is alphabetic
# token.is_stop: whether the token is a stop word
# token.children: the token's children in the parse tree

4. Dependency Parsing

After PoS tagging, we already know the relationships between words. The next step is to use a friendlier visualization to show the relationships within a sentence. The dependency visualizer draws a tree structure whose root is the main verb of the sentence.

from spacy import displacy
displacy.serve(doc, style='dep')
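
Since the parse is a tree rooted at the main verb, a small sketch can also walk it programmatically, printing each sentence's root token and its direct children:

for sent in doc.sents:
    root = sent.root
    print(root.text, root.pos_, [(child.text, child.dep_) for child in root.children])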

5. Finding Noun Phrases

Sometimes we need to simplify a sentence, and grouping noun phrases together helps with this.

The simple way:

for chunk in doc.noun_chunks:
    print(chunk.text)

A more flexible way is to iterate over the words of the sentence and consider the syntactic context to determine whether the word governs the phrase type you want.

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree
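
A small usage sketch: print each yielded subtree as plain text (each subtree is an iterator over tokens):

for subtree in iter_nps(doc):
    print(' '.join(token.text for token in subtree))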

6. Named Entity Recognition (NER)

The goal of Named Entity Recognition, or NER, is to detect and label nouns with the real-world concepts they represent. The most common entity types are: people's names, company names, geographic locations (both physical and political), product names, dates and times, amounts of money, and names of events.

from spacy import displacy
from IPython.display import display, HTML

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
html = displacy.render(doc, style='ent')
display(HTML(html))

7. Coreference Resolution

This is one of the hardest parts of NLP. How should a computer know what a pronoun like "it", "he" or "she" refers to? spaCy does not handle this well on its own, but there are deep learning models for it, such as Hugging Face's neuralcoref.
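
As a sketch (assuming the neuralcoref package and a compatible spaCy 2.x English model are installed), neuralcoref plugs into the spaCy pipeline and exposes the resolved clusters:

import spacy
import neuralcoref

nlp = spacy.load('en_core_web_sm')
neuralcoref.add_to_pipe(nlp)  # adds a 'neuralcoref' component to the pipeline
doc = nlp(u'My sister has a dog. She loves him.')
print(doc._.coref_clusters)   # clusters of mentions that refer to the same entity
print(doc._.coref_resolved)   # text with pronouns replaced by their main mentions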

My Example

Context

Speaking at a swearing-in ceremony for Associate Supreme Court Justice Brett Kavanaugh in the East Room of the White House Monday evening, President Trump apologized to Kavanaugh and his family “on behalf of our nation” for what he called a desperate Democrat-led campaign of “lies and deception” intent on derailing his confirmation.

PoS

Speaking speak VERB VBG advcl Xxxxx True False [at]
at at ADP IN prep xx True True [ceremony]
a a DET DT det x True True []
swearing swearing NOUN NN amod xxxx True False [in]
......
on on ADP IN prep xx True True [derailing]
derailing derail VERB VBG pcomp xxxx True False [confirmation]
his -PRON- ADJ PRP$ poss xxx True True []
confirmation confirmation NOUN NN dobj xxxx True False [his]

Dependency Parsing

Noun Phrases

a swearing-in ceremony
Associate Supreme Court Justice Brett Kavanaugh
the East Room
the White House
President Trump
Kavanaugh
his family
behalf
our nation
what
he
"lies
deception
his confirmation

NER
