Interesting Python I (Functions)

How to pass by reference?

Many languages, like C/C++, support both pass by value and pass by reference. With pass by reference, the address of an argument is copied into the formal parameter; inside the function, that address is used to access the actual argument used in the call, which means changes made to the parameter affect the passed argument. In Python, pass by reference is trickier. There are two kinds of objects: mutable and immutable. Strings, tuples, and numbers are immutable; lists, dicts, and sets are mutable. When we try to change the value of an immutable object, Python rebinds the name to a new object rather than modifying the object the reference points to. Let us see the code:

    def ref_demo(x):
        print("x=", x, " id=", id(x))
        x = 42
        print("x=", x, " id=", id(x))

    >>> x = 9
    >>> id(x)
    41902552
    >>> ref_demo(x)
    x= 9  id= 41902552
    x= 42  id= 41903752
    >>> id(x)
    41902552
    >>> 

We can see that after `x = 42`, the id of x inside the function has changed: the assignment rebound x to a new object, so the caller's x is untouched.

By contrast, if we pass a mutable object into a function, we can change its value in place, which behaves like pass by reference.
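For example, a small Python 3 sketch (the names append_demo and lst are just illustrative):

```python
def append_demo(items):
    # Mutating the list in place is visible to the caller,
    # because both names reference the same object.
    items.append(42)
    print("inside:", items, "id =", id(items))

lst = [9]
append_demo(lst)
print("outside:", lst)  # the caller sees [9, 42]
```

The id printed inside the function is the same as `id(lst)` outside, unlike the immutable case above.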

*args and **kwargs

Before I explain them, I want to mention that * unpacks a tuple or list into positional arguments, while ** unpacks a dictionary into keyword arguments.

In a function definition, * collects a variable number of positional arguments. The asterisk character has to precede the parameter name in the parameter list.

>>> def print_everything(*args):
...     for count, thing in enumerate(args):
...         print('{0}. {1}'.format(count, thing))
...
>>> print_everything('apple', 'banana', 'cabbage')
0. apple
1. banana
2. cabbage

** defines an arbitrary number of keyword parameters.

>>> def table_things(**kwargs):
...     for name, value in kwargs.items():
...         print('{0} = {1}'.format(name, value))
...
>>> table_things(apple = 'fruit', cabbage = 'vegetable')
cabbage = vegetable
apple = fruit

A * can appear in function calls as well, as we have just seen. In a call, the semantics are the “inverse” of a star in a function definition: the argument is unpacked rather than packed. In other words, the elements of the list or tuple are passed as separate positional arguments:

>>> def f(x, y, z):
...     print(x, y, z)
...
>>> p = (47, 11, 12)
>>> f(*p)
47 11 12

There is also a mechanism for an arbitrary number of keyword parameters. To do this, we use the double asterisk “**” notation:

>>> def f(a, b, x, y):
...     print(a, b, x, y)
...
>>> d = {'a':'append', 'b':'block', 'x':'extract', 'y':'yes'}
>>> f(**d)
append block extract yes
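In practice, the two mechanisms are often combined to forward arbitrary arguments to another function. A small sketch (the names log_call and add are just illustrative):

```python
def log_call(func, *args, **kwargs):
    # Pack whatever the caller passes, then unpack it again
    # when forwarding to the wrapped function.
    print("calling", func.__name__, "with", args, kwargs)
    return func(*args, **kwargs)

def add(x, y, z=0):
    return x + y + z

result = log_call(add, 1, 2, z=3)
print(result)  # 6
```

This pack-then-unpack pattern is the basis of most Python decorators and wrappers.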

Understanding PCA

What is PCA

PCA (Principal Component Analysis) is a non-parametric linear method for dimensionality reduction, and a very popular one.

How to use it

Using PCA is much easier than understanding it, so I put the code first. The scikit-learn documentation describes it like this:

from sklearn.decomposition import PCA

pca = PCA(n_components=2) # select how many components we want to keep
pca.fit(X) # fit the PCA model, or we can use fit_transform(X)
...
pca.transform(X2) # transform new data with the trained model

We can also use a line graph to plot the percentage of variance that PCA explains.

import numpy as np
import matplotlib.pyplot as plt

variance = pca.explained_variance_ratio_ # variance ratio of each component
var = np.cumsum(np.round(variance, decimals=3) * 100) # cumulative % variance explained
plt.style.use('seaborn-whitegrid')
plt.ylabel('% Variance Explained')
plt.xlabel('# of Features')
plt.title('PCA Analysis')
plt.ylim(30, 100.5)
plt.plot(var)

Don’t forget normalization before PCA.
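Putting the pieces together, here is a minimal end-to-end sketch (it assumes scikit-learn and numpy are installed; the data is synthetic, just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)  # a redundant dimension

X_std = StandardScaler().fit_transform(X)  # normalize first!
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component
```

Because column 1 is almost a copy of column 0, the first component alone captures a large share of the variance.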

How does it work?

1. First, let us see what our data looks like

We call the left distribution low redundancy and the right distribution high redundancy. In the high-redundancy case there are many correlated dimensions, e.g., how many hours you study and what score you get on the test. PCA is used to reduce redundancy and noise in the dataset and find the strongest signal.

2. How could we find signal and noise?

The answer is covariance matrix.

The diagonal holds the variances of x, y, z (3 dimensions), which are the signal; the off-diagonal entries are the covariances, which are the redundancy.
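A quick numpy sketch of this idea (toy data; note that np.cov treats each row as one variable):

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=500)
y = 0.9 * x + 0.1 * rng.normal(size=500)  # highly redundant with x
z = rng.normal(size=500)                  # independent noise

C = np.cov(np.vstack([x, y, z]))  # 3x3 covariance matrix
print(np.round(C, 2))
# diagonal: variances (signal); off-diagonal: covariances (redundancy)
```

Here C[0, 1] is large because y is nearly a copy of x, while C[0, 2] is close to zero.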

3. How could we increase signal and decrease redundancy?

Simple! Make the covariance matrix diagonal, like this:

We need to transform the original dataset X into a dataset Y through a linear transformation P: PX = Y, such that the covariance matrix of Y, Sy, is diagonal. We can see this as changing the basis of the coordinate system; in the new coordinates, one axis captures the maximum variance.

For more information and a 3-D simulation, please visit: http://setosa.io/ev/principal-component-analysis/

4. Math proof

Sy = (1/(n-1)) Y Yᵀ

Sy = (1/(n-1)) (PX)(PX)ᵀ

Sy = (1/(n-1)) P X Xᵀ Pᵀ

Sy = (1/(n-1)) P (X Xᵀ) Pᵀ

Let A = X Xᵀ, which is a symmetric matrix, so Sy = (1/(n-1)) P A Pᵀ.

By the properties of symmetric matrices, A can be written as V D Vᵀ, where D is a diagonal matrix and V is the matrix of eigenvectors (as columns).

Here is the tricky part: let P = Vᵀ, so that each row of P is an eigenvector of A. Then A = Pᵀ D P, and

Sy = (1/(n-1)) P A Pᵀ = (1/(n-1)) P Pᵀ D P Pᵀ. Since the inverse of an orthonormal matrix is its transpose, P⁻¹ = Pᵀ, so P Pᵀ = I.

Sy = (1/(n-1)) D. Here is what we want! D is a diagonal matrix!

So Vᵀ (the eigenvectors, as rows) is our result for P.
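We can check this derivation numerically with a small numpy sketch (synthetic data; X holds one variable per row, mean-centered):

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.normal(size=(3, 200))          # 3 variables, 200 observations
X = X - X.mean(axis=1, keepdims=True)  # center each variable

Sx = (X @ X.T) / (X.shape[1] - 1)      # covariance matrix of X
eigvals, V = np.linalg.eigh(Sx)        # columns of V are eigenvectors

P = V.T          # rows of P are eigenvectors
Y = P @ X        # the transformed data
Sy = (Y @ Y.T) / (Y.shape[1] - 1)

# Sy is (numerically) diagonal, with the eigenvalues on the diagonal
print(np.round(Sy, 6))
```

The off-diagonal entries of Sy vanish up to floating-point error, confirming that projecting onto the eigenvectors removes the redundancy.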

5. How to get the eigenvectors?

You can find more information here. Basically, we solve the standard eigenvalue equation Av = λv for the eigenvalues λ and the eigenvectors v.

Reference: A Tutorial on Data Reduction, Shireen Elhabian and Aly Farag

Know your data

What’s my problem?

I don’t think data science is a kind of art; essentially, it is a science. For a long period, I found there was not much information about practical principles for EDA (exploratory data analysis). Yes, there are a bunch of helpful code snippets and statistics books for reference, but when it comes to real projects, they are hard to use (or there are too many choices).

Temp solution

Since there is no perfect solution for me, I just combined several solutions together and made them look like a “practical” solution. Here it is:

From top to bottom, the second layer shows the six steps before we do the modeling. The layers below them list some possible methods we can use. I didn’t include everything, since too much information would make things complex. For feature selection, there is a useful tool. However, if you code it yourself, it won’t be hard.

I listed all the contents I put into this plot for your reference.

Comprehensive data exploration with Python: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

A Feature Selection Tool for Machine Learning in Python: https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0

Data Mining: Concepts and Techniques, 3rd ed.: http://hanj.cs.illinois.edu/bk3/

Efficiency of different programming languages

Most likely every programmer knows Python is a low-efficiency language, but how slow is it? See this picture:


The author compared virtually all languages on three variables: energy consumption, memory consumption, and execution time. Java and C do very well on energy and time. As for Python, the numbers are not ideal, since Python is an interpreted language and uses the GIL.

I also did an experiment myself, in which I picked k numbers out of n numbers (n much larger than k). I compared three languages: Python, C, and CUDA, with single and multiple threads. The column name (k|n) means picking k from n. Here is my result:

Some observations:

  • C is much faster than Python, especially as the data size grows.
  • Python’s multi-threaded version is significantly better than its single-threaded one.
  • However, C’s multi-threaded version was almost the same as the single-threaded one; CPU usage stayed below 30% per thread, so perhaps thread creation takes time. Also, since I used clock() in C to measure the cost, I got total CPU (user) time rather than real wall-clock time (real < user when running in parallel). To measure real time, we should use clock_gettime(CLOCK_MONOTONIC, &start); (credit: Dr. Greg Wolffe).
  • As for CUDA, in this case the time barely changed no matter how the data size grew, although at small data sizes it was not much faster than C.
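The same wall-clock vs CPU-time distinction exists in Python: time.process_time() sums CPU time across threads, while time.perf_counter() measures real elapsed time. A small sketch (the busy workload is just illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n):
    # CPU-bound toy workload: sum of squares below n
    s = 0
    for i in range(n):
        s += i * i
    return s

wall_start = time.perf_counter()
cpu_start = time.process_time()

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(busy, [200_000] * 4))

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
# With the GIL, wall time and total CPU time come out similar here;
# in C, truly parallel threads would give cpu > wall ("real < user").
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```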

You can find more detail on my Github.

Windows + Linux = WSL

I still remember the song “pineapple”; the singer is a weird middle-aged man who looks like this: Today, what I want to talk about sounds similar to “pineapple”. It’s called “WSL”.

What is WSL?

WSL is an abbreviation for Windows Subsystem for Linux, which lets you run Linux on Windows like an app.

How does it work?

Very simple!

  1. Run the Windows Store, find Ubuntu, and install it.
  2. Check the WSL option in “Turn Windows features on or off”.
    • You need to restart your computer after checking it.
  3. Run Ubuntu as a regular application; all the usual Linux commands are the same.

Where is the GUI?

Sorry, by default it is a console-only system, so basically you have to work from the command line. But there is a workaround: Xming. You can find the details here. I only tried two steps: install Xming and set the display with export DISPLAY=:0. Then I successfully ran Geany with a GUI! Don’t tell me you don’t know how to launch an app from the console.

Existing Problem

I tried installing programs with snapd, but it always showed me an error message.

Who can try it

If you like both Linux and Windows, WSL gives you an opportunity to avoid dual booting. You can also access your Windows drives under /mnt/.

Have fun!

Data Science Master Curriculum

Wow…..

I happened to find a pretty good website called “create your own data science master’s”. You can access it here.

Why is it good

  1. This person was kind enough to collect all DS-related online courses in one place, although some of them are not free. I used to be frustrated trying to find the right, well-rated resources; now he has gathered everything I need.
  2. Depending on your background, you can pick different pieces. E.g., I have a CS background, so I will put stats first.
  3. It covers almost all the content in the data scientist roadmap.

Is it a good way for us?

It depends. If you are already enrolled in a data science program, you might be taking or have already taken such courses. But if you come from a computer science background, an organized set of courses at a reasonable price may be a good choice.

Hello world!

Well, I have to agree “Hello world” is one of my favorite titles. Since I bought a Raspberry Pi that is supposed to do some deep learning work, like object detection, I have to do something before my camera arrives from Amazon.

So, I figured creating a proper personal website would be a good choice. I referred to “Build a LAMP Web Server with WordPress” and the official website. In a nutshell, it’s not hard, but a few points need care.

Installation:

  1. If you change the password directly through MySQL, you have to convert it with an MD5 tool; here is one.
  2. If you are unable to upload files, install the GD extension (“sudo apt-get install php7.0-gd”) and restart Apache (“sudo service apache2 restart”).
  3. I initially started with the official website guide, but I soon realized I needed to install LAMP (Linux, Apache, MySQL, PHP) first.
  4. There might be a prompt for FTP credentials when installing plugins; this can be solved by editing wp-config.php and adding “define('FS_METHOD', 'direct');”
  5. Give www-data authority to access the WordPress folder: “sudo chown -R www-data:www-data /var/www/html” (/var/www/html is my WordPress folder).

Backup:

  1. UpdraftPlus would be a good choice; it provides several remote storage methods, including Google Drive.

After almost 2 hours of messing around, I finally created this website on the Raspberry Pi!