Know your data

What’s my problem?

I don’t think data science is a kind of art; essentially, it is a science. Yet for a long time I could not find much guidance on practical principles for EDA (exploratory data analysis). Yes, there are plenty of helpful code snippets and stats books for reference, but when it came to real projects they were hard to apply (or offered too many choices).

Temp solution

Since I could not find a single best solution, I combined several approaches into one “practical” workflow. Here it is:

Reading from top to bottom, the second layer shows the six steps to complete before modeling. The layers below list some possible methods for each step. I didn’t list everything, because too much information would only make things complicated. For feature selection, there is a useful tool (see the second link in the list that follows). However, writing it yourself is not hard either.

Here are the sources I drew on for this plot:
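As a rough illustration of how a hand-written feature screen could look, here is a minimal sketch in plain Python. It only applies two simple checks (too many missing values, a single unique value). The function name `screen_features`, the `max_missing` threshold, and the toy table are all assumptions made for this example; this is not the API of the tool mentioned above.

```python
from typing import Dict, List, Optional

# A column is a list of values; None marks a missing entry (an assumption of this sketch).
Column = List[Optional[float]]


def screen_features(table: Dict[str, Column], max_missing: float = 0.6) -> List[str]:
    """Return the names of columns that pass two simple screening checks."""
    kept = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        # Fraction of missing entries; an empty column counts as fully missing.
        missing_fraction = 1 - len(present) / len(values) if values else 1.0
        if missing_fraction > max_missing:
            continue  # too sparse to be useful
        if len(set(present)) <= 1:
            continue  # a constant column carries no information
        kept.append(name)
    return kept


# Hypothetical toy data, only for demonstration.
toy = {
    "age": [23, 35, None, 41],
    "constant": [1, 1, 1, 1],
    "mostly_missing": [None, None, None, 7],
}
print(screen_features(toy))
```

On the toy table only `age` survives: `constant` has one unique value and `mostly_missing` is 75% empty.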

Comprehensive data exploration with Python: https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python

A Feature Selection Tool for Machine Learning in Python: https://towardsdatascience.com/a-feature-selection-tool-for-machine-learning-in-python-b64dd23710f0

Data Mining: Concepts and Techniques, 3rd ed.: http://hanj.cs.illinois.edu/bk3/

Efficiency of different programming languages

Most programmers probably know that Python is a slow language, but how slow is it? See this picture:

[Figure: comparison of programming languages by energy consumption, memory consumption, and execution time]

The author compared a wide range of languages on three variables: energy consumption, memory consumption, and execution time. Java and C do very well on energy and time. Python’s numbers are not ideal, since Python is an interpreted language and uses the GIL (global interpreter lock).

I also ran an experiment myself, in which I picked k numbers out of n numbers (n is much larger than k). I compared three languages, Python, C, and CUDA, in single-threaded and multi-threaded versions. The column name (k|n) means k from n. Here is my result:

C is much faster than Python, especially as the data size grows quickly. Python’s multi-threaded version is significantly better than its single-threaded one. However, I don’t know why C’s multi-threaded version takes almost the same time as the single-threaded one. In multi-threaded C, I found that CPU usage was below 30% for each thread. Perhaps creating the threads takes time. Also, I used clock() in C to measure the time cost, and it reports CPU time rather than wall-clock (real) time; when threads run in parallel, real time is less than CPU time. To get real time, we should use clock_gettime(CLOCK_MONOTONIC, &start); (credit: Dr. Greg Wolffe). As for CUDA, in this case the time barely changed no matter how the data size grew, although at small data sizes it did not run very fast compared with C.

You can find more details on my GitHub.
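The difference between the two clocks can be sketched in Python, which has close analogues: `time.process_time()` counts CPU time used by the process, while `time.perf_counter()` is a monotonic wall clock. The helper name `measure` is made up for this example, and the sleeping task is just a stand-in; it shows the opposite situation from the parallel case (here wall time is larger than CPU time), but the point is the same: the two clocks answer different questions.

```python
import time


def measure(task):
    """Run task() once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # monotonic wall clock
    cpu_start = time.process_time()    # CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


# Sleeping uses almost no CPU, so the two measurements diverge clearly.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

When benchmarking threaded code, the wall-clock number is the one that reflects how long the user actually waits.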

Windows + Linux = WSL

I still remember the song “pineapple”; the singer is a weird middle-aged man who looks like this: Today, what I want to talk about is similar to “pineapple”. It’s called “WSL”.

What is WSL?

WSL is short for Windows Subsystem for Linux, which lets you run Linux on Windows like an app.

How does it work?

Very simple!

  1. Open the Windows Store, find Ubuntu, and install it.
  2. Check the WSL option in “Turn Windows features on or off”.
    • You need to restart your computer after checking it.
  3. Run Ubuntu like any other application; all commands are the same.

Where is the GUI?

Sorry, by default it is a command-line-only system, so basically you have to work with commands. But there is a workaround: Xming. You can find details here. I only tried two steps: installing Xming and setting the display with export DISPLAY=:0 . Then I successfully ran Geany with a GUI! Don’t tell me you don’t know how to launch an app from the console.
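If you prefer to start GUI programs from a script instead of typing the export each time, the same idea can be sketched in Python: copy the environment, set DISPLAY, and pass it to the child process. The helper names `gui_env` and `launch` are made up for this example, and it assumes an X server such as Xming is already running on the Windows side.

```python
import os
import subprocess


def gui_env(display=":0"):
    """Return a copy of the current environment with DISPLAY set."""
    env = dict(os.environ)
    env["DISPLAY"] = display
    return env


def launch(command, display=":0"):
    """Start a GUI program (e.g. ["geany"]) pointed at the given X display."""
    return subprocess.Popen(command, env=gui_env(display))


# Inside WSL, with the X server running, one would call:
# launch(["geany"])
```

The current shell is left untouched; only the launched program sees the DISPLAY value.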

Existing Problem

I tried to install programs with snapd, but it always showed me an error message.

Who can try it

If you like both Linux and Windows, WSL gives you an opportunity to avoid the hassle of dual booting. You can also access Windows drives under /mnt/.
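To make the /mnt/ mapping concrete, here is a small sketch that converts a Windows path into the place where WSL normally mounts it (drive C: under /mnt/c, and so on). The function name `windows_to_wsl` is made up for this example, and it only handles plain drive-letter paths.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Map a drive-letter Windows path to its usual location under /mnt/ in WSL."""
    win = PureWindowsPath(path)
    drive = win.drive.rstrip(":").lower()
    if len(drive) != 1:
        # No drive letter (relative path) or a network share: not handled here.
        raise ValueError(f"expected a drive-letter path, got {path!r}")
    # parts[0] is the drive root (e.g. "C:\\"); the rest are the folders and file name.
    return str(PurePosixPath("/mnt", drive, *win.parts[1:]))


print(windows_to_wsl(r"C:\Users\me\data.csv"))
```

So a file you saved from a Windows program can be opened from the Ubuntu shell without copying it anywhere.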

Have fun!

Data Science Master Curriculum

Wow…..

I happened to find a pretty good website called “Create Your Own Data Science Master’s”. You can access it from here.

Why is it good

  1. This person was kind enough to collect all the DS-related online courses in one place, although some of them are not free. I used to be frustrated about finding the right, well-rated resources; now he has given me everything I need.
  2. Depending on your background, you can pick different pieces. E.g., I have a CS background, so I will put stats first.
  3. It covers almost all the content in the data scientist roadmap.

Is it a good way for us?

It depends. If you are already enrolled in a data science program, you are probably going to take, or have already taken, such courses. But if you come from a computer science background, an organized set of courses at a reasonable price may be a good choice.

Hello world!

Well, I have to admit “Hello world” is one of my favorite titles. I bought a Raspberry Pi that is supposed to do some deep learning work, like object detection, so I have to do something while I wait for my camera to arrive from Amazon.

So, I guess creating a pro personal website would be a good choice. I referred to “Build a LAMP Web Server with WordPress” and the “official website”. In a nutshell, it is not hard, but a few points need care.

Installation:

    1. If you change the password directly in MySQL, you have to convert the password with an MD5 tool first; here is one.
    2. If you are unable to upload files, install the GD package with “sudo apt-get install php7.0-gd” and restart Apache with “sudo service apache2 restart”.
    3. I initially started with the official website guide, but then realized I needed to install LAMP (Linux, Apache, MySQL, PHP) first.
    4. WordPress may ask for FTP settings when installing plugins; this can be solved by editing wp-config.php and adding “define('FS_METHOD', 'direct');”.
    5. Give www-data permission to access the WordPress folder: “sudo chown -R www-data:www-data /var/www/html”; /var/www/html is my WordPress folder.
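For the first point, an online tool is not strictly needed: the MD5 digest can be computed locally. Here is a minimal sketch in Python; the function name `md5_hex` is made up for this example, and it assumes the password is encoded as UTF-8. The resulting hex string is what would be pasted into the password field in MySQL.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Return the MD5 digest of a password as a 32-character hex string."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


print(md5_hex("hello"))  # 5d41402abc4b2a76b9719d911017c592
```

Doing it locally also means the password never has to be typed into a third-party web page.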

Backup:

  1. UpdraftPlus would be a good choice; it provides several remote storage methods, including Google Drive.

After almost 2 hours of messing around, I finally created this website on my Raspberry Pi!