Baisc NLP by spaCy

I had my first contact with NLP was sensitive classification by NLTK, which was refreshed me how NLP working. A couple of days ago, since I needed to extract some keywords from one or more paragraphs, I tried to understand spaCy which I thought is easier for relatively simple subjects. I summarized some key concepts from spaCy.io and put one of my examples for reference.

The pipeline of NLP

No matter we use NLTP or spaCy, there are almost same pipelines:

1. Sentence Segmentation

It makes sense we break down the paragraph into sentence since each sentence expresses its own topics. spaCy uses the dependency parse to determine sentence boundaries in term of stats model.

doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Unlike structure papers, web articles are usually semi-structured. Hence, we need to some customized processing and put it into the standard pipeline.

def set_custom_boundaries(doc):
    # do something here
    return doc
# you can only set boundaries before a document is parsed
nlp.add_pipe(set_custom_boundaries, before='parser')

2. Word Tokenization

Tokenization is used to break a sentence to separate words call Tokens. Tokenization is based on the language which decides the punctuations, prefix, suffix and inffix.

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

3. PoS(Parts of Speech), Lemmatization and stop words

At this step, spaCy makes a prediction for each token and put on the most likely tags for them. At the same time, spaCy figures out the basic form and stop words.

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, [child for child in token.children])
# token.lemma_: the basic form of word
# token.pos_: simple pos
# token.tag_: detail pos
# token.dep_: syntactic dependency
# token.shape_: word shape
# token.is_alpha_: is alpha word
# token.is_stop_: is stop word
# token.children: children token

4. Dependency Parsing

After PoS, we already know the relationship between words. Next step is using more friendly visualization to show the relationships in a sentence. Dependency visualizer is a tree sturcute where the root is main verb in the sentence.

display.serve(doc, style='dep')

5. Finding Noun Phrases

Sometimes, we need to simply the sentence. Group noun phrases together makes this more sense.

For simple way:

from spacy.symbols import *
for np in doc.noun_chunks:
    print(np.text)

For flexible way is iterating over the words of the sentence and consider the syntactic context to determine whether the word governs the phrase-type you want.

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree

6. Named Entity Recognition (NER)

The goal of Named Entity Recognition, or NER, is to detect and label these nouns with the real-world concepts that they represent. The most common NE are:People’s names,Company names,Geographic locations (Both physical and political),Product names,Dates and times, Amounts of money,Names of events

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
html = display.render(doc, style='ent')
display(HTML(html))

7. Coreference Resolution

This is the most hard part in NLP. How should computer know the pronominal, like “it”, “he” or “she”? spaCy is not doing it well. But there are deep learning models, like huggingface.

My Example

Context

Speaking at a swearing-in ceremony for Associate Supreme Court Justice Brett Kavanaugh in the East Room of the White House Monday evening, President Trump apologized to Kavanaugh and his family “on behalf of our nation” for what he called a desperate Democrat-led campaign of “lies and deception” intent on derailing his confirmation.

PoS

Speaking speak VERB VBG advcl Xxxxx True False [at]
at at ADP IN prep xx True True [ceremony]
a a DET DT det x True True []
swearing swearing NOUN NN amod xxxx True False [in]
......
on on ADP IN prep xx True True [derailing]
derailing derail VERB VBG pcomp xxxx True False [confirmation]
his -PRON- ADJ PRP$ poss xxx True True []
confirmation confirmation NOUN NN dobj xxxx True False [his]

Dependency Parsing

Noun Phrases

a swearing-in ceremony
Associate Supreme Court Justice Brett Kavanaugh
the East Room
the White House
President Trump
Kavanaugh
his family
behalf
our nation
what
he
"lies
deception
his confirmation

NER

Mandelbrot set

Defination of Mandelbort set

According to wiki, the Mandelbrot set is the set of complex numbers c for which the function f_{c}(z)=z^{2}+c does not diverge when iterated from  z=0 .

What magic does it has

If we try to find the set of c in a complex plane. Here is the magic:

where x axis is the real part of c, y axis is the imaginary of c. A point c is colored black if it belongs to the set.

How to find set

here is the pseducode from wiki:

For each pixel (Px, Py) on the screen, do:
{
  x0 = scaled x coordinate of pixel (scaled to lie in the Mandelbrot X scale (-2.5, 1))
  y0 = scaled y coordinate of pixel (scaled to lie in the Mandelbrot Y scale (-1, 1))
  x = 0.0
  y = 0.0
  iteration = 0
  max_iteration = 1000

  //when the sum of the squares of the real and imaginary parts exceed 4, the point has reached escape.
  while (x*x + y*y < 2*2  AND  iteration < max_iteration) {
    xtemp = x*x - y*y + x0
    y = 2*x*y + y0
    x = xtemp
    iteration = iteration + 1
  }
  color = palette[iteration]
  plot(Px, Py, color)
}

for complex numbers:

z=x+iy

 z^{2}=x^{2}+i2xy-y^{2}

 c=x_{0}+iy_{0}

Hence: x is sum of real parts in z and c, y is sum of imaginary parts of z and c. x={Re}(z^{2}+c)=x^{2}-y^{2}+x_{0} and  y={Im}(z^{2}+c)=2xy+y_{0}

How about 3D

I am going to try my 3D later, but please see the detail from skytopia firstly, I cited the formula directly.

3D formula is defined by:

z -> z^n + c where z and c are defined by {x,y,z}


{x,y,z}^n = r^n { sin(theta*n) * cos(phi*n) , sin(theta*n) * sin(phi*n) , cos(theta*n) }

where
r = sqrt(x^2 + y^2 + z^2)
theta = atan2( sqrt(x^2+y^2), z )
phi = atan2(y,x)

// z^n + c is similar to standard complex addition

{x,y,z}+{a,b,c} = {x+a, y+b, z+c}

//The rest of the algorithm is similar to the 2D Mandelbrot!

//Here is some pseudo code of the above:

r = sqrt(x*x + y*y + z*z )
theta = atan2(sqrt(x*x + y*y) , z)
phi = atan2(y,x)

newx = r^n * sin(theta*n) * cos(phi*n)
newy = r^n * sin(theta*n) * sin(phi*n)
newz = r^n * cos(theta*n)

Interesting Python I ( function )

how to pass-by-reference?

Many lauguages support pass by value or pass by reference, like C/C++. It copies the address of an argument into the formal parameter. Inside the function, the address is used to access the actual argument used in the call. It means the changes made to the parameter affect the passed argument. In Python, pass by reference is very tricky. There are two kinds of objects: mutable and immutable. string, tuple, numbers are immuable, list, dict, set are muable. When we try to change the value of immuable object, Python will create a copy of reference rather than changing the value of reference. Let us see the code:

    def ref_demo(x):
        print "x=",x," id=",id(x)
        x=42
        print "x=",x," id=",id(x)

    >>> x = 9
    >>> id(x)
    41902552
    >>> ref_demo(x)
    x= 9  id= 41902552
    x= 42  id= 41903752
    >>> id(x)
    41902552
    >>> 

We can find when x = 42, the address of x has changed.

And so on, if we pass a mutable object into a function, we can change it value as pass-by-reference.

*args and **kwargs

Before I explain them, I want to metion that * is used to unpack tuple or list into positional arguments and ** is used to it unpacks dictionary into named arguments.

* defines a variable number of arguments. The asterisk character has to precede a variable identifier in the parameter list.

>>> def print_everything(*args):
        for count, thing in enumerate(args):
...         print '{0}. {1}'.format(count, thing)
...
>>> print_everything('apple', 'banana', 'cabbage')
0. apple
1. banana
2. cabbage

** defines an arbitrary number of keyword parameters.

>>> def table_things(**kwargs):
...     for name, value in kwargs.items():
...         print '{0} = {1}'.format(name, value)
...
>>> table_things(apple = 'fruit', cabbage = 'vegetable')
cabbage = vegetable
apple = fruit

A * can appear in function calls as well, as we have just seen in the previous exercise: The semantics is in this case “inverse” to a star in a function definition. An argument will be unpacked and not packed. In other words, the elements of the list or tuple are singularized:

>>> def f(x,y,z):
...     print(x,y,z)
... 
>>> p = (47,11,12)
>>> f(*p)
(47, 11, 12)

There is also a mechanism for an arbitrary number of keyword parameters. To do this, we use the double asterisk “**” notation:

>>> def f(a,b,x,y):
...     print(a,b,x,y)
...
>>> d = {'a':'append', 'b':'block','x':'extract','y':'yes'}
>>> f(**d)
('append', 'block', 'extract', 'yes')

Understand PCA

What is PCA

PCA(Principal Component Analysis) is a non-parametric linear method to reduce dimension. It’s a very popular way used in dimension reduction.

How to use it

Use PCA is much easier than understand it. So, I put the function first. If we look at the document of sklearning, it says like this:

pca = PCA(n_components=2) # select how many components we want to pick up
pca.fit(X) # fit the PCA model, or we can use fit_transform(X)
...
pca.transform(X2) # transfer data by trained model

We can also use line graph to plot the variance% that PCA explained.

variance = covar_matrix.explained_variance_ratio_ #calculate variance ratios
var=np.cumsum(np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(30,100.5)
plt.style.context('seaborn-whitegrid')
plt.plot(var)

Don’t forget nomalization before PCA

How does it work?

1. First, let us see what our data looks like

We call left distribution as low redundancy, and right distribution as high redundancy. In the case of high redundancy, there are many related dimensions, e.g., how many hours you start and how many score you get in the test. PCA is used to reduce redundancy and noise in the dataset, and find the strongest signal.

2. How could we find signal and Nosie?

The answer is covariance matrix.

Diagonal is variance of x,y,z(3 dimensions) which is signal, off diagonal is covariance which is redundancy.

3. How could we increase signal and decrease redundancy?

Simple! make covariance matrix as a diagonal matrix, like this:

We need transfer original dataset X to dataset Y through linear transformation P: PX = Y, which makes covariance matrix of Y: Sy to be a diagonal matrix.We can seem this to be a changing the basis of coordinate, in the new coordinate, we can get the maxium variance in one axis.

More inforamtion and 3-D simulation, please visit: http://setosa.io/ev/principal-component-analysis/

4. Math approvel

Sy = 1(n-1) YYT

Sy = 1(n-1)(PX)(PX)T

Sy = 1(n-1)(PXXTPT)

Sy = 1(n-1)(P(XXT)PT)

Let A = XXT which is a symmetric matrix, Sy = 1(n-1)(PAPT)

According to the characters of symmetric matrix, A can be written as VDVT, where D is diagonal matrix, V is eigenvector matrix.

Here is the tricky part: let P = V, then

Sy = 1(n-1)(PAPT) = 1(n-1)(PPTDPPT). since the inverse of orthonormal matrix is its transpose, P−1 = PT.

Sy = 1(n-1)(PP-1DPP-1) = 1(n-1) D. Here is what we want! D is a diagonal matrix!

So, V(eigenvectors) is our result for P.

5. How gets the eigenvectors?

You can find more information here. Basically, the blew is the formula:

Reference: A Tutorial on Data Reduction, Shireen Elhabian and Aly Farag

Know your data

What’s my problem?

I don’t think data science is a kind of art stuff, essentially it is a science. For a long period, I found there was not much information about practice principles for EDA( exploratory data analysis). Yes, maybe there are bunch of helpful program snippets or stats books for reference, when it came to real projects, they were hard to use(or too many choices).

Temp solution

Since there is not a finest solution for me. I just combined several solutions together, and makes them look like a “piratical” solution. Here is it:

From up to down at the second layer, they are six steps before we do the modeling. The following lays are some possible methods we can use. I didn’t list all the information as too many information is gonna make things complex. For feature selection, there is a useful tool.However, if you do it by your coding, it won’t be hard.

I listed all the contents I put into this plot for your reference.

Comprehensive data exploration with Python: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

A Feature Selection Tool for Machine Learning in Python: https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0

Data Mining: Concepts and Techniques, 3rd ed.: http://hanj.cs.illinois.edu/bk3/

Efficiency of different programming languages

Most likely every programmer knows python is a low efficiency language, but how slow it is? See this picture:

enter image description here

The author compared virtually all languages considering three variables: energy consumption, memory consumption and execution time. Java and C are doing very well in energy and time. As to python, the numbers are not ideal as python is interpreted a language and using GIL mechanism.

I also did a experiment by myself, where I picked up k numbers from n numbers(n is much larger than k). I compared three languages: python, c and cuda with single and multi-threads. The column name (k|n) means k from n. Here is my result:

C is much faster than python, specially when data is raising fast.Python’s multi-threads is significantly better than single thread. However, I don’t know why C’s multi-threads is almost as same as the single thread.In multi-threads C, I found CPU usage is lower than 30% for each thread.Perhaps create threads takes time. Since I used clock object in C to calc time costing, it got the user time rather than cpu time.(real < user when parallely running). If we want to get real time, we should use clock_gettime(CLOCK_MONOTONIC, &start);Credit: [Dr.Greg Wolffe] . As to Cuda, in this case, no matter how data size raising, the time didn’t change much, although in the small data size, it didn’t run very fast compared with C.

You can find more detail on my Github.

Windows + Linux = WSL

I still remembered the song “pineapple”, the singer is a wired middle age man, looks like this: Today, what I wanna talk about is similar to “pineapple”. It’s called “WSL”.

what is WSL?

WSL is an abbr for windows subsystem for linux, where you can run linux on windows like an app.

How does it work?

very simple!

  1. run windows store , find ubuntu and install it.
  2. check the wsl option in turn windows feature on or off
    • you need to restart you computer after check it.
  3. run ubuntu as your application, all commands are same.

Where is GUI?

Sorry, by default, it is only a kernal system. So, basicly you have to run with commands. But there is a workaround: Xming. You can find detail here . I only tried two steps: install ‘Xming’ and setting display export DISPLAY=:0 . Then I did successfully run Geany with GUI! Don’t tell me you don’t know how to run app on consoler.

Existing Problem

I tried install program by sanpd, but it awalys showed me error message.

Who can try it

If you like both Linux and windows, WSL gives you a opportunity to waive the dual booting problem. You can aslo access windows drivers under /mnt/.

Have fun!

Data Science Master Curriculum

Wow…..

I happened to find a pretty good website called create your own data science master's. You can access from here.

Why is it good

  1. This person is so gentle to collect all DS related online courses together, although some of them are not free. Once I was always frustrated about finding the right & good rated resource, now he gave everything I need.
  2. According your backgrounds, you can pick up different pieces. e.g. I am CS background, so that I will put stats into the first line.
  3. It almost covers all contents in data scientist roadmap.

Is it a good way for us?

It depends. If you are already roll in a data science project. You might be going to take or have taken courses. But if you are a computer science background, an organized courses with a reasonable price may be a good choice.

Hello world!

Well. I have to agree “Hello world” is one of my favor titles. Since I bought a raspberry pi which supposed to do some deep learning works, like object detection, I have to do something before my camera coming from amazon.

So, I guess creating a pro personal website would be a good choice. I refer “Build a LAMP Web Server with WordPress”  and “offcial website“. In an nutshell, not hard stuffs, but some points need to be care.

Installation:

    1.  if you change password through mySQL directly, you have to transfer pwd through MD5 tool, here is one.
    2. if you happen to unable to upload file, please install “sudo apt-get install php7.0-gd” and restart apache “sudo service apache2 restart“.
    3. I initially started with offical website guide, but I suddenly realized I need to install LAMP(Linux, ApacheMySQLPHP)
    4. There might be requests for setting FTP when installing plugins, this can be solved by edit wp-config.php with adding “define(‘FS_METHOD’, ‘direct’);
    5. Give www-data authority to access wordpress folder: “sudo chown -R www-data:www-data /var/www/html“, /var/www/html is my wordpress folder

Backup:

  1. UpdraftPlus would be a good choice which provides several romote storage methods inculding google dirve.

After almost 2 hours mess around, I finally  create this website on raspberry!