Simple global warming analysis with PySpark

Purpose

Using PySpark to analyze the trend of global warming. The data comes from NCDC (http://www.ncdc.noaa.gov/) and covers 1980 to 1989 and 2000 to 2009 (except 1984 and 2004).

Two Strategies

  • Get the max/min temperature and max wind speed, filtering out only erroneous data (9999). Steps are as follows:
    • Load files into an RDD: sc.textFile("/home/DATA/NOAA_weather/{198[0-3],198[5-9]}/*.gz")
    • Extract fields from the files through a map function, parse_record, which returns key-value (tuple) data.
    • Filter out 9999 data: .filter(lambda x: x[1]!=9999)
    • reduceByKey to get the max or min data (the key is the year): .reduceByKey(lambda x,y: max(x,y))
  • Get the average temperature and average wind speed by year, latitude and longitude of each fixed land station (see the sketch after this list).
    • Load files into an RDD, same as in strategy 1.
    • Load the RDD into a DataFrame: sqlContext.createDataFrame(all_fields,schema=["date","report_type","lat","lon","wind_speed","wind_qulity","temp"])
    • Filter out erroneous data (9999) and keep only station type FM-12: df.where((df['lat']!='+9999') & (df['lon']!='+9999') & (df['wind_speed']!=9999) & (df['temp']!=9999) & (df['report_type']=='FM-12'))
    • Aggregate averages by year, latitude and longitude: df.groupBy(['date',"lat","lon"]).agg({"wind_speed":"avg","temp":"avg"})
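
A minimal end-to-end sketch of strategy 2 (not the exact code from infor_extra.py; parse_record is the field-extraction helper described above, and the output path is illustrative):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="global_warming")
sqlContext = SQLContext(sc)

# load raw files and parse each line into a tuple of fields
rdd = sc.textFile("/home/DATA/NOAA_weather/{198[0-3],198[5-9]}/*.gz")
all_fields = rdd.map(parse_record)  # parse_record is defined in infor_extra.py

# build a DataFrame and drop 9999 records, keeping fixed land stations (FM-12)
df = sqlContext.createDataFrame(all_fields,
        schema=["date","report_type","lat","lon","wind_speed","wind_qulity","temp"])
df = df.where((df['lat']!='+9999') & (df['lon']!='+9999') &
              (df['wind_speed']!=9999) & (df['temp']!=9999) &
              (df['report_type']=='FM-12'))

# average temperature and wind speed per year and station location
result = df.groupBy(['date',"lat","lon"]).agg({"wind_speed":"avg","temp":"avg"})
result.write.csv("output1")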

Result and visualization

  • the max/min temperature and max wind speed (based on strategy 1)
year, max_temp (×10), min_temp (×10), max_wind_speed (×10)
1980,600,-780,617
1981,580,-850,618
1983,616,-931,618
1984,617,-932,618
1982,617,-930,700
1986,607,-901,607
1987,607,-900,602
1985,611,-932,618
1989,606,-900,900
2000,568,-900,900
1988,607,-900,618
2002,568,-932,900
2001,568,-900,900
2003,565,-900,900
2005,610,-925,900
2006,610,-917,900
2007,610,-900,900
2008,610,-900,900
2009,610,-854,900

These extreme values probably occur in special areas, such as Antarctica or the Sahara Desert.

  • the average temperature and average wind speed by year, latitude and longitude (based on strategy 2)
img
img

From the trend line charts, we can see that the temperature in 2000 to 2009 is significantly higher (by 1.5-2.0℃) than in 1980 to 1989. The max wind speed is also greater than in most earlier years.

Let’s create another chart, which shows the temperature difference between the two decades.

img

Red dots mean the weather has become warmer; the other color means it has become colder. In most countries, the warm dots outnumber the cold ones. The most seriously affected area is Europe.

Files

  • infor_extra.py: main file to execute in Spark; outputs a single csv file in the folder named output1
  • weather_gap.py: calculates the gap between the two decades and exports it to output.csv
  • weather_overview.twb and weather_gap.twb are two Tableau files for visualization.
  • If an out-of-memory or GC error happens, please narrow the year range in infor_extra.py: rdd = sc.textFile("/home/DATA/NOAA_weather/{198[0-3],198[5-9]}/*.gz")

Code: https://github.com/neoaksa/HPC/tree/master/6.Global_Warming

Inside the Linux kernel

Cited from: http://turnoff.us/geek/inside-the-linux-kernel/

1. The foundation is the file system. PID 421 is accessing files, while a watchdog monitors the whole file system.
2. On the first floor we find the process table; the “mom” is PID 1. PID 1341 (Apache HTTP) serves port 80, PID 52 handles ssh (22), and nobody sits at the expired FTP port (21). There are also some others, like wine, pipes, cron, and a hacker sneaking up on the watchdog. Processes on this floor can go upstairs to do I/O or downstairs to access the file system.
3. Going upstairs, two PIDs are responsible for input and output (tty).

Singleton Pattern in Java, Python and C++

An implementation of the singleton pattern must:

  1. ensure that only one instance of the singleton class ever exists
  2. have the class create its own singleton instance
  3. provide global access to that instance.

Typically, this is done by:

  1. declaring all constructors of the class to be private
  2. providing a static method that returns a reference to the instance.

Java

// Lazy loading (not recommended)
public class Singleton {  
    private static Singleton instance;  
    private Singleton (){}  
    public static synchronized Singleton getInstance() {  
    if (instance == null) {  
        instance = new Singleton();  
    }  
    return instance;  
    }  
}
// Eager loading (recommended)
public final class Singleton {
    // create instance
    private static final Singleton INSTANCE = new Singleton();
    // private constructor, so no other instance can be created
    private Singleton() {}
    // return the singleton instance
    public static Singleton getInstance() {
        return INSTANCE;
    }
}
// enum (recommended)
public enum Singleton {  
    INSTANCE;  
    public void dummy() {  
    }  
}
// how to use
Singleton singleton = Singleton.INSTANCE;
singleton.dummy();

Python

# Lazy loading
class Singleton(object):
    def __new__(cls, *args, **kw):
        if not hasattr(cls, '_instance'):
            orig = super(Singleton, cls)
            cls._instance = orig.__new__(cls, *args, **kw)
        return cls._instance

class MyClass(Singleton):
    a = 1

# Metaclass
class Singleton(type):
    _instances = {}
    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]

class MyClass(metaclass=Singleton):
     pass

# Decorator
from functools import wraps

def singleton(cls):
    instances = {}
    @wraps(cls)
    def getinstance(*args, **kw):
        if cls not in instances:
            instances[cls] = cls(*args, **kw)
        return instances[cls]
    return getinstance

@singleton
class MyClass(object):
    a = 1

C++

#include <iostream>

class Singleton
{
    private:
        /* Here will be the instance stored. */
        static Singleton* instance;

        /* Private constructor to prevent instancing. */
        Singleton();

    public:
        /* Static access method. */
        static Singleton* getInstance();
};

/* Null, because instance will be initialized on demand. */
Singleton* Singleton::instance = 0;

Singleton* Singleton::getInstance()
{
    if (instance == 0)
    {
        instance = new Singleton();
    }

    return instance;
}

Singleton::Singleton()
{}

int main()
{
    //new Singleton(); // Won't work
    Singleton* s = Singleton::getInstance(); // Ok
    Singleton* r = Singleton::getInstance();

    /* The addresses will be the same. */
    std::cout << s << std::endl;
    std::cout << r << std::endl;
}

OpenCV 4.0 released!!!

OpenCV 4.0 has been released, more than 3 years after the 3.0 release. As one of the most widely used computer vision libraries, it can be used for object detection, image classification, moving object detection and human face detection. It includes lots of machine learning and state-of-the-art computer vision algorithms. It also supports a bunch of programming languages, like C++, Java, Python…

Summary of Upgrade

  1. C++11 is now required. Lots of the C API from 1.x has been removed. CMake >= 3.5.1 is required.
  2. A new module, G-API, has been added. It declares an image processing task in the form of expressions and then submits it for execution using a number of available backends.
  3. More support in the DNN module:
    3.1 Mask-RCNN model. Here is a guide using TensorFlow object detection.
    3.2 Integrated ONNX (Open Neural Network Exchange) parser.
  4. The Kinect Fusion algorithm has been implemented and optimized for CPU and GPU (OpenCL).
  5. QR code detector and decoder (see the sketch after this list).
  6. The high-quality DIS dense optical flow algorithm has been moved from opencv_contrib to the video module.
  7. Persistence (storing and loading structured data to/from XML, YAML or JSON) in the core module has been completely reimplemented in C++ and lost the C API as well.
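
For example, here is a minimal Python sketch of the new QR code detector (the image path qr.png is a made-up placeholder):

import cv2

img = cv2.imread("qr.png")                 # hypothetical input image
detector = cv2.QRCodeDetector()            # new in OpenCV 4.0
data, points, straight_qr = detector.detectAndDecode(img)
if points is not None:
    print("decoded:", data)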

MPI non-blocking receiving with OpenMP

A simple prototype

  1. In the cluster, a master-slave structure is set up. The master node is responsible for sending tasks to the slave nodes and regularly checking the status of all slave nodes (busy or idle). Slave nodes are responsible for splitting the task from the master into subtasks that run with multiple threads.
  2. The master node assigns tasks by iterating over each row of the first column in the lattice. If all slave nodes are busy, the master waits for feedback from the slave nodes; otherwise it sends a new task to an idle slave node.
  3. The master node uses a non-blocking method (MPI_Irecv) to get feedback from the slave nodes, so it can keep checking the status of all slave nodes while receiving feedback.
  4. Each slave node splits its task into subtasks by iterating over each row of the second column in the lattice, running the loop with a dynamic OpenMP schedule so that each thread stays busy all the time.

Non-blocking MPI

pool[node] records which slave nodes are working, and wait[node] records whether an MPI_Irecv has already been posted for that slave node. MPI_Request_get_status is then used to check the status of the request without blocking.

            //receive result from slave nodes
            for(int node=0; node < (num_nodes-1); node++){
                if(pool[node]==1){
                    if(wait[node]==0){
                        MPI_Irecv(&node_solution[node], 1, MPI_INT, node+1, TAG, MPI_COMM_WORLD,&rq[node]);
                        wait[node]=1;
                    }

                    MPI_Request_get_status(rq[node],&flag[node],MPI_STATUS_IGNORE);
                    if(flag[node]){
//                         printf("rec by node %d: %d - %d \n",row,node+1,node_solution[node]);
                        total_solution += node_solution[node];
                        pool[node]=0;
                        wait[node]=0;
                    }
                }
            }

How to compile

compile with OpenMP: mpic++ -fopenmp NQ-MPI.cpp -o NQ
run with MPI: mpirun -np 5 --host arch06,arch07,arch04,arch05,arch08 ./NQ

Only Programmers Understand

When a new intern is debugging

You set a breakpoint successfully, then

Start a unit test

Forget “where” in a SQL

A tiny bug for 10 hours

There is no bug, we pretend

Last minutes of the project

Let’s start a multi-thread program

You thought you caught all exceptions

Remove some unused code

First time you presented a demo to boss

pair programming

Review some code you created one month ago

Project went online after a perfect final test

Basic NLP with spaCy

My first contact with NLP was sentiment classification with NLTK, which showed me how NLP works. A couple of days ago, since I needed to extract some keywords from one or more paragraphs, I tried spaCy, which I think is easier for relatively simple tasks. I summarized some key concepts from spaCy.io and put up one of my examples for reference.

The pipeline of NLP

Whether we use NLTK or spaCy, the pipeline is almost the same:

1. Sentence Segmentation

It makes sense to break a paragraph down into sentences, since each sentence expresses its own topic. spaCy uses the dependency parse, based on a statistical model, to determine sentence boundaries.

doc = nlp(u"This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)

Unlike structured papers, web articles are usually semi-structured. Hence, we need some customized processing added into the standard pipeline.

def set_custom_boundaries(doc):
    # do something here
    return doc
# you can only set boundaries before a document is parsed
nlp.add_pipe(set_custom_boundaries, before='parser')
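
For example, here is a small sketch (based on the spaCy 2.x docs, not the original post) that starts a new sentence after every "..." token:

import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    # force a sentence start right after every "..." token
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')
doc = nlp(u'This is a sentence ... here is another one.')
print([sent.text for sent in doc.sents])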

2. Word Tokenization

Tokenization breaks a sentence into separate words called tokens. Tokenization is language-specific: the language determines how punctuation, prefixes, suffixes and infixes are handled.

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for token in doc:
    print(token.text)

3. PoS(Parts of Speech), Lemmatization and stop words

At this step, spaCy makes a prediction for each token and assigns the most likely tags. At the same time, spaCy figures out the base form (lemma) and the stop words.

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop, [child for child in token.children])
# token.lemma_: the basic form of word
# token.pos_: simple pos
# token.tag_: detail pos
# token.dep_: syntactic dependency
# token.shape_: word shape
# token.is_alpha: is an alphabetic word
# token.is_stop: is a stop word
# token.children: children token

4. Dependency Parsing

After PoS tagging, we already know the relationships between words. The next step is to use a friendlier visualization to show those relationships within a sentence. The dependency visualizer is a tree structure where the root is the main verb of the sentence.

from spacy import displacy
displacy.serve(doc, style='dep')

5. Finding Noun Phrases

Sometimes we need to simplify the sentence. Grouping noun phrases together makes it easier to make sense of.

The simple way:

from spacy.symbols import *
for np in doc.noun_chunks:
    print(np.text)

A more flexible way is to iterate over the words of the sentence and consider the syntactic context to determine whether each word governs the phrase type you want.

from spacy.symbols import *
np_labels = set([nsubj, nsubjpass, dobj, iobj, pobj]) # Probably others too
def iter_nps(doc):
    for word in doc:
        if word.dep in np_labels:
            yield word.subtree
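
A possible way to call it (my addition, not from the original post), joining each subtree back into text:

for np in iter_nps(doc):
    print(' '.join(token.text for token in np))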

6. Named Entity Recognition (NER)

The goal of Named Entity Recognition (NER) is to detect and label nouns with the real-world concepts they represent. The most common named entities are: people's names, company names, geographic locations (both physical and political), product names, dates and times, amounts of money, and names of events.

from spacy import displacy
from IPython.display import display, HTML

for entity in doc.ents:
    print(f"{entity.text} ({entity.label_})")
html = displacy.render(doc, style='ent')
display(HTML(html))

7. Coreference Resolution

This is the hardest part of NLP. How should the computer know what a pronoun like “it”, “he” or “she” refers to? spaCy does not handle this well, but there are deep learning models for it, such as the ones from Hugging Face.

My Example

Context

Speaking at a swearing-in ceremony for Associate Supreme Court Justice Brett Kavanaugh in the East Room of the White House Monday evening, President Trump apologized to Kavanaugh and his family “on behalf of our nation” for what he called a desperate Democrat-led campaign of “lies and deception” intent on derailing his confirmation.

PoS

Speaking speak VERB VBG advcl Xxxxx True False [at]
at at ADP IN prep xx True True [ceremony]
a a DET DT det x True True []
swearing swearing NOUN NN amod xxxx True False [in]
......
on on ADP IN prep xx True True [derailing]
derailing derail VERB VBG pcomp xxxx True False [confirmation]
his -PRON- ADJ PRP$ poss xxx True True []
confirmation confirmation NOUN NN dobj xxxx True False [his]

Dependency Parsing

Noun Phrases

a swearing-in ceremony
Associate Supreme Court Justice Brett Kavanaugh
the East Room
the White House
President Trump
Kavanaugh
his family
behalf
our nation
what
he
"lies
deception
his confirmation

NER

Mandelbrot set

Definition of the Mandelbrot set

According to the wiki, the Mandelbrot set is the set of complex numbers c for which the function f_c(z) = z^2 + c does not diverge when iterated from z = 0.

What magic does it have

If we plot the set of c in the complex plane, here is the magic:

where the x axis is the real part of c and the y axis is the imaginary part of c. A point c is colored black if it belongs to the set.

How to find the set

Here is the pseudocode from the wiki:

For each pixel (Px, Py) on the screen, do:
{
  x0 = scaled x coordinate of pixel (scaled to lie in the Mandelbrot X scale (-2.5, 1))
  y0 = scaled y coordinate of pixel (scaled to lie in the Mandelbrot Y scale (-1, 1))
  x = 0.0
  y = 0.0
  iteration = 0
  max_iteration = 1000

  //when the sum of the squares of the real and imaginary parts exceed 4, the point has reached escape.
  while (x*x + y*y < 2*2  AND  iteration < max_iteration) {
    xtemp = x*x - y*y + x0
    y = 2*x*y + y0
    x = xtemp
    iteration = iteration + 1
  }
  color = palette[iteration]
  plot(Px, Py, color)
}

For complex numbers:

z = x + iy

z^2 = x^2 + 2ixy - y^2

c = x_0 + iy_0

Hence x is the sum of the real parts of z^2 and c, and y is the sum of the imaginary parts: x = Re(z^2 + c) = x^2 - y^2 + x_0 and y = Im(z^2 + c) = 2xy + y_0.
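
Here is a direct Python translation of the pseudocode above, using Python's built-in complex type (the ASCII image size and the simple character "palette" are my own arbitrary choices):

def mandelbrot(width=80, height=40, max_iteration=1000):
    for py in range(height):
        row = ''
        for px in range(width):
            # scale pixel coordinates to the Mandelbrot ranges (-2.5, 1) and (-1, 1)
            c = complex(-2.5 + 3.5 * px / width, -1.0 + 2.0 * py / height)
            z = 0
            iteration = 0
            # escape once |z| exceeds 2 (i.e. |z|^2 > 4)
            while abs(z) <= 2 and iteration < max_iteration:
                z = z * z + c
                iteration += 1
            row += '#' if iteration == max_iteration else ' '
        print(row)

mandelbrot()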

How about 3D

I am going to try my own 3D version later, but please see the details on Skytopia first; I cite the formula from there directly.

3D formula is defined by:

z -> z^n + c where z and c are defined by {x,y,z}


{x,y,z}^n = r^n { sin(theta*n) * cos(phi*n) , sin(theta*n) * sin(phi*n) , cos(theta*n) }

where
r = sqrt(x^2 + y^2 + z^2)
theta = atan2( sqrt(x^2+y^2), z )
phi = atan2(y,x)

// z^n + c is similar to standard complex addition

{x,y,z}+{a,b,c} = {x+a, y+b, z+c}

//The rest of the algorithm is similar to the 2D Mandelbrot!

//Here is some pseudo code of the above:

r = sqrt(x*x + y*y + z*z )
theta = atan2(sqrt(x*x + y*y) , z)
phi = atan2(y,x)

newx = r^n * sin(theta*n) * cos(phi*n)
newy = r^n * sin(theta*n) * sin(phi*n)
newz = r^n * cos(theta*n)
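
A small Python sketch of a single iteration z -> z^n + c using the spherical formulas above (the power n=8 is the usual Mandelbulb choice, my own assumption):

import math

def triplex_pow_add(x, y, z, cx, cy, cz, n=8):
    # {x,y,z}^n in spherical coordinates, then add c component-wise
    r = math.sqrt(x*x + y*y + z*z)
    theta = math.atan2(math.sqrt(x*x + y*y), z)
    phi = math.atan2(y, x)
    newx = r**n * math.sin(theta*n) * math.cos(phi*n) + cx
    newy = r**n * math.sin(theta*n) * math.sin(phi*n) + cy
    newz = r**n * math.cos(theta*n) + cz
    return newx, newy, newz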

Interesting Python I (functions)

How to pass by reference?

Many languages, like C/C++, support both pass by value and pass by reference. With pass by reference, the address of an argument is copied into the formal parameter; inside the function, that address is used to access the actual argument, so changes made to the parameter affect the passed argument. In Python, pass by reference is trickier. There are two kinds of objects: mutable and immutable. Strings, tuples and numbers are immutable; lists, dicts and sets are mutable. When we try to change the value of an immutable object, Python rebinds the name to a new object rather than changing the value in place. Let us see the code:

    def ref_demo(x):
        print "x=",x," id=",id(x)
        x=42
        print "x=",x," id=",id(x)

    >>> x = 9
    >>> id(x)
    41902552
    >>> ref_demo(x)
    x= 9  id= 41902552
    x= 42  id= 41903752
    >>> id(x)
    41902552
    >>> 

We can see that when x = 42 is assigned inside the function, the address of x changes, while the outer x keeps its original address.

In contrast, if we pass a mutable object into a function, we can change its value in place, which behaves like pass-by-reference; see the sketch below.
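
A small illustration (my addition) with a list, which is mutable:

def append_item(items):
    # mutates the same list object the caller passed in
    items.append(42)

data = [1, 2, 3]
append_item(data)
print(data)   # [1, 2, 3, 42]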

*args and **kwargs

Before I explain them, I want to mention that * is used to unpack a tuple or list into positional arguments, and ** is used to unpack a dictionary into named arguments.

* defines a variable number of arguments. The asterisk character has to precede a variable identifier in the parameter list.

>>> def print_everything(*args):
...     for count, thing in enumerate(args):
...         print '{0}. {1}'.format(count, thing)
...
>>> print_everything('apple', 'banana', 'cabbage')
0. apple
1. banana
2. cabbage

** defines an arbitrary number of keyword parameters.

>>> def table_things(**kwargs):
...     for name, value in kwargs.items():
...         print '{0} = {1}'.format(name, value)
...
>>> table_things(apple = 'fruit', cabbage = 'vegetable')
cabbage = vegetable
apple = fruit

A * can appear in function calls as well, as we have just seen. The semantics are in this case the “inverse” of a star in a function definition: an argument is unpacked, not packed. In other words, the elements of the list or tuple are passed as individual arguments:

>>> def f(x,y,z):
...     print(x,y,z)
... 
>>> p = (47,11,12)
>>> f(*p)
(47, 11, 12)

There is also the corresponding mechanism for keyword parameters in a call: the double asterisk “**” unpacks a dictionary into keyword arguments:

>>> def f(a,b,x,y):
...     print(a,b,x,y)
...
>>> d = {'a':'append', 'b':'block','x':'extract','y':'yes'}
>>> f(**d)
('append', 'block', 'extract', 'yes')

Understand PCA

What is PCA

PCA (Principal Component Analysis) is a non-parametric linear method for dimensionality reduction. It is one of the most popular techniques for reducing dimensions.

How to use it

Using PCA is much easier than understanding it, so I put the usage first. If we look at the scikit-learn documentation, it looks like this:

from sklearn.decomposition import PCA

pca = PCA(n_components=2) # select how many components we want to keep
pca.fit(X) # fit the PCA model, or we can use fit_transform(X)
...
pca.transform(X2) # transform new data with the trained model

We can also use a line graph to plot the percentage of variance that PCA explains.

import numpy as np
import matplotlib.pyplot as plt

variance = pca.explained_variance_ratio_ # variance ratio of each component
var = np.cumsum(np.round(variance, decimals=3)*100) # cumulative % of variance explained
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(30,100.5)
plt.style.context('seaborn-whitegrid')
plt.plot(var)

Don’t forget normalization before PCA.

How does it work?

1. First, let us see what our data looks like

We call the left distribution low redundancy and the right distribution high redundancy. In the high-redundancy case there are many correlated dimensions, e.g., how many hours you study and what score you get on the test. PCA is used to reduce redundancy and noise in the dataset and find the strongest signal.

2. How could we find the signal and the noise?

The answer is the covariance matrix.

The diagonal holds the variances of x, y, z (3 dimensions), which is the signal; the off-diagonal entries are covariances, which represent redundancy.

3. How could we increase the signal and decrease the redundancy?

Simple! Make the covariance matrix a diagonal matrix, like this:

We need to transform the original dataset X into a dataset Y through a linear transformation P: PX = Y, chosen so that the covariance matrix of Y, Sy, becomes diagonal. We can see this as changing the basis of the coordinate system; in the new coordinates, one axis captures the maximum variance.

For more information and a 3-D simulation, please visit: http://setosa.io/ev/principal-component-analysis/

4. Mathematical proof

Sy = 1/(n-1) YY^T

Sy = 1/(n-1) (PX)(PX)^T

Sy = 1/(n-1) (P XX^T P^T)

Let A = XX^T, which is a symmetric matrix, so Sy = 1/(n-1) (P A P^T).

By the properties of symmetric matrices, A can be written as E D E^T, where D is a diagonal matrix and E is the matrix of eigenvectors.

Here is the tricky part: let P = E^T, so that each row of P is an eigenvector of A. Then A = P^T D P, and

Sy = 1/(n-1) (P A P^T) = 1/(n-1) (P P^T D P P^T). Since the inverse of an orthonormal matrix is its transpose, P^-1 = P^T, so

Sy = 1/(n-1) (P P^-1 D P P^-1) = 1/(n-1) D. This is what we want! D is a diagonal matrix.

So the eigenvectors of XX^T (the rows of P) are our principal components.

5. How do we get the eigenvectors?

You can find more information here. Basically, they come from the eigendecomposition of the covariance matrix A = XX^T described in step 4.
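
In practice numpy can do the eigendecomposition directly. Here is a small sketch of the whole derivation from step 4 (my own illustration, not code from the original post):

import numpy as np

# X: n_features x n_samples, as in the derivation above
X = np.random.rand(3, 100)
X = X - X.mean(axis=1, keepdims=True)      # center each feature

A = X @ X.T / (X.shape[1] - 1)             # covariance matrix
eigvals, E = np.linalg.eigh(A)             # columns of E are eigenvectors

order = np.argsort(eigvals)[::-1]          # sort by decreasing variance
P = E[:, order].T                          # rows of P are principal directions

Y = P @ X                                  # projected data
print(np.round(np.cov(Y), 3))              # (almost) diagonal covariance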

Reference: A Tutorial on Data Reduction, Shireen Elhabian and Aly Farag