Baisc NLP by spaCy

I had my first contact with NLP was sensitive classification by NLTK, which was refreshed me how NLP working. A couple of days ago, since I needed to extract some keywords from one or more paragraphs, I tried to understand spaCy which I thought is easier for relatively simple subjects. I summarized some key concepts from spaCy.io and put one of my examples for reference.

The pipeline of NLP

No matter we use NLTP or spaCy, there are almost same pipelines:

1. Sentence Segmentation

It makes sense we break down the paragraph into sentence since each sentence expresses its own topics. spaCy uses the dependency parse to determine sentence boundaries in term of stats model.

doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Unlike structure papers, web articles are usually semi-structured. Hence, we need to some customized processing and put it into the standard pipeline.

def set_custom_boundaries(doc):
    # do something here
    return doc
# you can only set boundaries before a document is parsed
nlp.add_pipe(set_custom_boundaries, before='parser')

2. Word Tokenization

Tokenization is used to break a sentence to separate words call Tokens. Tokenization is based on the language which decides the punctuations, prefix, suffix and inffix.

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

3. PoS(Parts of Speech), Lemmatization and stop words

At this step, spaCy makes a prediction for each token and put on the most likely tags for them. At the same time, spaCy figures out the basic form and stop words.

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, [child for child in token.children])
# token.lemma_: the basic form of word
# token.pos_: simple pos
# token.tag_: detail pos
# token.dep_: syntactic dependency
# token.shape_: word shape
# token.is_alpha_: is alpha word
# token.is_stop_: is stop word
# token.children: children token

4. Dependency Parsing

After PoS, we already know the relationship between words. Next step is using more friendly visualization to show the relationships in a sentence. Dependency visualizer is a tree sturcute where the root is main verb in the sentence.

display.serve(doc, style='dep')

5. Finding Noun Phrases

Sometimes, we need to simply the sentence. Group noun phrases together makes this more sense.

For simple way:

from spacy.symbols import *
for np in doc.noun_chunks:
    print(np.text)

For flexible way is iterating over the words of the sentence and consider the syntactic context to determine whether the word governs the phrase-type you want.

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree

6. Named Entity Recognition (NER)

The goal of Named Entity Recognition, or NER, is to detect and label these nouns with the real-world concepts that they represent. The most common NE are:People’s names,Company names,Geographic locations (Both physical and political),Product names,Dates and times, Amounts of money,Names of events

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
html = display.render(doc, style='ent')
display(HTML(html))

7. Coreference Resolution

This is the most hard part in NLP. How should computer know the pronominal, like “it”, “he” or “she”? spaCy is not doing it well. But there are deep learning models, like huggingface.

My Example

Context

Speaking at a swearing-in ceremony for Associate Supreme Court Justice Brett Kavanaugh in the East Room of the White House Monday evening, President Trump apologized to Kavanaugh and his family “on behalf of our nation” for what he called a desperate Democrat-led campaign of “lies and deception” intent on derailing his confirmation.

PoS

Speaking speak VERB VBG advcl Xxxxx True False [at]
at at ADP IN prep xx True True [ceremony]
a a DET DT det x True True []
swearing swearing NOUN NN amod xxxx True False [in]
......
on on ADP IN prep xx True True [derailing]
derailing derail VERB VBG pcomp xxxx True False [confirmation]
his -PRON- ADJ PRP$ poss xxx True True []
confirmation confirmation NOUN NN dobj xxxx True False [his]

Dependency Parsing

Noun Phrases

a swearing-in ceremony
Associate Supreme Court Justice Brett Kavanaugh
the East Room
the White House
President Trump
Kavanaugh
his family
behalf
our nation
what
he
"lies
deception
his confirmation

NER

Leave a Reply

Your email address will not be published. Required fields are marked *