Enable GPU Accelerate in WSL2 to support AI frameworks

Since Microsoft upgraded WSL to version 2, it introduced full Linux kernel and full VM manage features. Except the performance benefit through deep integration with windows, WSL2 allows installing additional powerful apps like docker and upgrading Linux kernel anytime when it is available.

Two months ago, Microsoft with NVIDIA brought GPU acceleration to WSL2. This new feature made me exciting, so that we don’t have to train our models on a separated Linux machine or install dual OS startup.

Figure 1: Stack image showing layers involved while running AI frameworks in WSL 2 containers. The container provides integration with CUDA related components. WSL2 communicates with windows host through GPU paravirtualization protocol

Before I start, I did some search about basic ideas of virtualization and WSL2 GPU. It is good for me to understand how GPU paravirtualization works in WSL2.

Types of virtualization

Figure 2: four major types of virtualization
  • Full virtualization. In full virtualization, there is almost a complete model of the underlying physical system resources that allows any and all installed software to run without modification. There are two types of full virtualization.
software assisted full virtualization( binary translation). like VMware workstation(32bit), virtual PC, VirtualBox(32 bits). issue: low performance
hardware- assisted full virtualization. eliminates the binary translation and directly interrupts with hardware ( intel VT-x and AMD-V). like , KVM, VMware ESX, Hyper-V, Xen. issue: virtual context execute privileged instruction directly on the processor.
  • Paravirtualization. Paravirtualization (PV) is an enhancement of virtualization technology in which a guest operating system (guest OS) is modified prior to installation inside a virtual machine (VM) in order to allow all guest OS within the system to share resources and successfully collaborate, rather than attempt to emulate an entire hardware environment. so the guests aware that it has been virtualized.  products like Xen, IBM LPAR, Oracle VM for X86
Xen supports both Full virtualization and Para-virtualization
  • Hybrid virtualization(hardware virtualized with PV drivers). virtual machine uses PV for specific hardware drivers(like I/O), and the host use full virtualization for other features. products like Oracle VM for x86, Xen. 
VMware paravirtual with hardware full virtualization
  • OS level Virtualization. aka containerization. No overhead . Products like docker, Linux LCX, AIX WPAR
The difference between VM and container

Except containerization, all virtualization use hypervisor to communicate with the host. We can take a look how hypervisor works blew.

  • Hypervisor
    • Emulation. (software full virtualization)
      • emulate a certain piece of hardware which guest VM can only see.
      • expense of performance since “common lowest” denominator
      • need to translate instruction
      • wide compatibility
    • Paravirtualization
      • only support certain hardware in certain configurations.
      • Direct hardware access is possible
      • Compatibility is limited
    • hardware pass-through(hardware full virtualization)
      • native performance, but need proper drivers for the real physical hardware
      • hardware specific images
      • GPU supported

GPU Virtualization on Windows

How it works on WSL

  • a new kernel driver “dxgkrnl” which expoes “/dev/dxg” device to user mode.
  • /dev/dxg mimic the native WDDM D3DKMT kernel service layer on Windows.
  • dxgkrnl communicate with its big brother on Windows through VM Bus WDDM paravirtualization protocol.
Figure 3: there is no partitioning of resources between Linux and Windows or limit on Linux application

DxCore & D3D12 on Linux

  • libd3d12.so is compiled from the same source code as d3d12.dll on windows
  • except Present() function, all others are same with windows. 
  • libxcore(DxCore) is a simplified version of dxgi
  • GPU manufacturer partners provide UMD(user mode driver) for Linux
Figure 4: D3D12 builds upon the /dev/dxg device

DirectML and AI Training

  • DirectML sits on top of D3D12 API, provides a a collection of compute compute operations.
  • Tensorflow with an integrated DirectML backend.
Figure 5: DirectML provides beginner a basic ML framework

OpenGL, OpenCL & Vulkan

  • Mesa library is the mapping layer which bring hardware acceleration for OpenCL , OpenGL
  • vulkan is not supported right now.
Figure 6: WSL2 only support OpenGL and OpenCL right now.

Nvidia CUDA

  • a version of CUDA taht directly targets WDDM 2.9 abstraction exposed by /dev/dxg. 
  • libcuda.so enables CUDA-X libaries such as cuDNN, cuBLAS, TensorRT.
  • available on any glibc-based WSL distro
Figure 7: NVIDIA-docker tolls available ( NVIDIA container toolkit), which provides us container like plugin and usage experience.

GPU container in WSL

  • libnvidia-container libarary is able to detect the presence of libdxcore.so at runtime and uses it to detect all the GPUs exposed to this interface.
  • driver store is a folder that containers all driver librarians for both Linux and Windows
Figure 8: NVIDIA docker provides NVIDIA container toolkits along with lots of good images.

GUI Application is still under developing.

How to enable GPU Acceleration in WSL

for the detail step, we can refer https://docs.nvidia.com/cuda/wsl-user-guide/index.html. Here I brief some keypoints:

  1. Windows version: 20150 or above (Dev Channel)
  2. Enable WSL 2
  3. Install Ubuntu On WSL
  4. Install Windows Terminal
  5. Upgrade kernel to 4.19.121 or higher
  6. NVIDA DRIVERS FOR CUDA ON WSL  https://developer.nvidia.com/cuda/wsl/download
  7. Install docker in WSL:  
    • curl https://get.docker.com | sh
    • You can see vmmen process on your windows task manger. It is the process for virtual machine in wsl2
  8. Install Nvidia Container Toolkit( nvidia-docker2)
Figure 9: docker in WSL2 with NIVIDA container toolkit

9. Start A TensorFlow Container

# test for docker
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
# pull tersorflow image and run it
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter

After you pull tersoflow image, and run it. You can see following instruction:

Figure 10: replace to localhost, and open this URL on your browser then we can use GPU acceleration in our WSL2


Para virtualization vs Full virtualization vs Hardware assisted Virtualization, https://www.unixarena.com/2017/12/para-virtualization-full-virtualization-hardware-assisted-virtualization.html/

Emulation, paravirtualization, and pass-through: what you need to know for client hypervisors, https://searchvirtualdesktop.techtarget.com/opinion/Emulation-paravirtualization-and-pass-through-what-you-need-to-know-for-client-hypervisors

DirectX is coming to the Windows Subsystem for Linux, https://devblogs.microsoft.com/directx/directx-heart-linux/

NVIDIA Container Toolkit, https://github.com/NVIDIA/nvidia-docker

CUDA on WSL User Guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html

NVIDIA Drivers for CUDA on WSL, https://developer.nvidia.com/cuda/wsl/download

Tensorflow image on Docker, https://www.tensorflow.org/install/docker

Eiganvectors from Eiganvalues

This is a unbelievable discovery from PETER B. DENTON, STEPHEN J. PARKE, TERENCE TAO, AND XINING ZHANG. The original paper you can find here. In a short nut, we can get eiganvector through eiganvalues only. This magic formular is like blew:

We only need to know eiganvalues in original matrix then we can calculate its eiganvectors through sub-matrix.

Currently, I don’t know or image what will effect on the road, but the eiganX is the basic for AI, dimension reduction, feature extraction, etc. It also may help us improve the speed to get eiganvector if we need to incrementally add data on a known matrix.

I wrote a very simple python script to prove this formula(surely, it is correct). I think it can archive by GPU as well.

import numpy as np
from numpy import linalg as LA

matrix_size = 6
org_x = np.diag(range(1,matrix_size+1))
w, v = LA.eig(org_x)

print("orgnal matriax: \n %s \n" % org_x)
print("eigan values: \n %s \n" % w)
print("normalized eigenvectors: \n %s \n" % v)


for _ in range(matrix_size):

for n in range(matrix_size):
    for j in range(matrix_size):
        sub_x = np.delete(org_x,j,axis=0)
        sub_x = np.delete(sub_x,j,axis=1)
        w1,v1 = LA.eig(sub_x)

        # in term of new formula to get orignal matrix eigenvecotr through eiganvalue
        numberator = 1
        denominator = 1
        for i in range(matrix_size-1):
            temp_n = w[n] - w1[i]
            numberator = numberator*temp_n

        for i in range(matrix_size):
                temp_d = w[n] - w[i]
                denominator = denominator*temp_d
        result[j] = numberator/denominator

    print("%s \n" % result)


orgnal matriax:
 [[1 0 0 0 0 0]
 [0 2 0 0 0 0]
 [0 0 3 0 0 0]
 [0 0 0 4 0 0]
 [0 0 0 0 5 0]
 [0 0 0 0 0 6]]

eigan values:
 [1. 2. 3. 4. 5. 6.]

normalized eigenvectors:
 [[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]]


[1.0, -0.0, -0.0, -0.0, -0.0, -0.0]

[0.0, 1.0, -0.0, -0.0, -0.0, -0.0]

[0.0, 0.0, 1.0, -0.0, -0.0, -0.0]

[0.0, 0.0, 0.0, 1.0, -0.0, -0.0]

[0.0, 0.0, 0.0, 0.0, 1.0, -0.0]

[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]


EIGENVECTORS FROM EIGENVALUES, https://arxiv.org/pdf/1908.03795.pdf

Tao’s Blog, https://terrytao.wordpress.com/2019/08/13/eigenvectors-from-eigenvalues/

Raspberry PI Security

Lots of ready-to-use opensource project can be found on internet for raspberry PI object detection. Most of them can do very well to motion detection or object classification. I am thinking how to merge them together to make a practical security system that can help all of us to make our home safety. Since It is open source, I will share my design, source code and project milestone. ( mostly to push myself to finish it eventually 🙂


  1. start camera to capture each frame
  2. save the first frame as reference. and this reference frame will be replaced every 5 mins when there is no movement detected.
  3. continuously compare the current frame with the reference frame.
  4. if any movement detected, draw some interesting areas.
  5. loop these interesting areas, if its size out of threshold, move it to next DNN network.
  6. use DNN network to classify the object.
  7. if object is human, trigger dedicated process “Event process”. includes, recording video, send notification and make speaker noise.
  8. loop to the next frame.


Oct 10th-15th: finish basic function to detect movement, recording video and send mail.

Oct 16th-22th: add DNN network

Oct 22th-30th: add speaker and integration test.

Potential update:

  • GPU accelerate on Cuda device
  • GUI
  • restful API



Demo(till Oct 15th):


Deep learning: How OpenCV’s blobFromImage works. https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works/

Raspberry Pi: Deep learning object detection with OpenCV. https://www.pyimagesearch.com/2017/10/16/raspberry-pi-deep-learning-object-detection-with-opencv/

how to install opencv on the raspberry pi 3 Model b+ (with camera) https://pysource.com/2018/10/31/raspberry-pi-3-and-opencv-3-installation-tutorial/

Home surveillance and motion detection with the Raspberry Pi, Python, OpenCV, and Dropbox. https://www.pyimagesearch.com/2015/06/01/home-surveillance-and-motion-detection-with-the-raspberry-pi-python-and-opencv/

Night Sight with Google Camera

As an amateur photographer, I was believing DSL is better than phone camera since it has a much larger CMOS so that it can receive more photons, until I installed Google Camera on my Galaxy S8. The results surprised me a lot. I mean, it performed better than my Conan 5DIII in the most of case without editing in the software like photoshop.

Except HDR+ and portrait mode, Google camera provides a magic mode called Night Sight. Unfortunately, this can only work on Pixel series (Pixel 3/3a is the best) phone with hardware support. You can find the A/B Testing here.

left: disable Night Sight; right: enable the Night Sight.

How does Night Sight improve the quality of shots in the night?

  • HDR+. HDR+ is the foundational function for Night Sight. It is a computational photography technology that improves this situation by capturing a burst of frames, aligning the frames in software, and merging them together. Since each frame is short enough to prevent the blur caused by hand shake, the result turns out to be sharper and wilder dynamic range than without HDR+.
This image has an empty alt attribute; its file name is IMG_20190518_101130.jpg

with HDR+, we can clear see the background in an indoor place
  • Positive-shutter-lag (PSL) . In the regular mode, Google camera uses zero-shutter-lag (ZSL) protocol which  limits exposures to at most 66ms no matter how dim the scene is, and allows our viewfinder to keep up a display rate of at least 15 frames per second. Using PSL means you need to hold still for a short time after pressing the shutter, but it allows the use of longer exposures, thereby improving SNR at much lower brightness levels. 
  • Motion metering. Optical image stabilization (OIS) is widely used in many devices to prevent hand shake. But it doesn’t help for long exposure and motion object. Pixel 3 adds motion metering to detect the motion object and adjust the exposure time for each frame. For example, if it detects a dog moving in the frame or we are using the tripod, it will increase exposure time.
  • Super Res Zoom.  HDR essentially uses algorithm to aliment and merge the frames to increase the SNR( signal to noise ratio). Pixel 3 provides a new algorithm called Super Res Zoom for super-resolution and reduce noise, since it averages multiple images together. Super Res Zoom produces better results for some nighttime scenes than HDR+, but it requires the faster processor of the Pixel 3.
  • Learning-based AWB algorithm. When it is dim, AWB( auto white balance) is not functional well. And it is an ill-posed problem, which means we cannot inverse the problem (find out the real color of object in the dark). In this case, Google camera utilizes machine learning to “guess” what is the true color and shift the white balance.
  •  S-curve into our tone mapping. As we know, if we exposure a picture for a long time, all the objects become brighter so that we can not figure out when this picture takes. Google uses sigmoid function to adjust the object brightness ( dark objects become darker, light objects become brighter).

Reference: Night Sight: Seeing in the Dark on Pixel Phones, https://ai.googleblog.com/2018/11/night-sight-seeing-in-dark-on-pixel.html

Brief Talk Object Detection Algorithm of YOLO

Figure 1: YOLO: Real-Time Object Detection

YOLO also know as You Only Look Once. Not like R-CNN, YOLO uses single CNN to do the object detection as well as localization which makes it super faster than R-CNN with only losing a little accuracy. From 2016 to 2018, YOLO has been imporved from v1 to v3. In this article, I will use a simple way to explain how YOLO works.

What tasks we need to solve in object detection problem?

Yolo use the same method as human to detect the object. There are three major steps: 1. is it an object? 2. what object is it? 3. where is the position and size of this object. BUT! Through CNN, YOLO can do these three things all together.

How YOLO solve this problem?

First, Let’s introduce Grid Cell in YOLO. The whole input image is divided into S \times S grid. Each grid cell predicts only one objects with fixed boundary boxes( say #B). For each boundary box has its own box confidence score to reflet the possibility of object. For each grid cell it predicts C conditional class probabilities( one per class). so that we will get
S\times S \times (B*5+C) predictions. Here 5 means central location(x,y), size( h,w) and confidence score of each boundary box.

Figure 2: the cell with red mark predicts two boundary boxes for a single object.

Then you will find so many boundary box from output. How we choose of them? Here we need to do Non-max suppression. The step is as blew:

  1. discard all boxes with box confidence less then a threshold. ( say 0.65)
  2. While there are any renaming box(overlapping):
    1. pick the box with the largest confidence that as a prediction
    2. discard any remaining boxes with IoU(intersection over union: you can see it as overlap size between two boundary box) greater than a threshold(say 0.5)
Figure 3: boundary boxes for each grid cell

After Non-max suppression, we need to calculate class confidence score , which equals to box confidence score * conditional class probability. Finally, we get the object with probability and its localization. (see Figure 1)

YOLO Network Design

Let’s see how YOLO v1 looks like. Input = 448*448 image, output = S\times S \times (B*5+C). There are 24 convolutional layers followed by 2 full connected layer for localization. It use sum-squared error between the predictions and ground truth to calculate loss which is consist of classification loss, localization loss and confidence loss.

Figure 4: YOLO v1 architecture

Classification loss:

Localization loss:

Confidence loss:

an object is detected in the box
an object is not detected in the box

The final loss add three components together.


YOLO v2 improves accuracy compared with YOLO v1.

  1. Add batch normalization on all of the convolutional layers. It get more than 2% improvement in accuracy.
  2. High-resolution classifier. First fine tune the classification network at the full 448 \times 448 resolution for 10 epochs on IMageNet. This gives network time to adjust tis filters to work better on higher resolution input.
  3. Convolutional with Anchor Boxes. YOLO v1 can only predicts 98 boxes per images and it makes arbitrary guesses on the boundary boxes which leads to bad generalization, but with anchor boxes, YOLO v2 predicts more than a thousand. Then it use dimension cluster and direct location prediction to get the boundary box.
  4. Dimension Cluster. use K-mean to get the boundary boxes patterns. Figure 5 might be the most common boundary boxes in spec dataset.
Figure 5: 5 anchor boxes
Figure 6: remove the class prediction from the cell level to the boundary box level

5. Direct location prediction. Since we use anchor boxes, we have to predict on the offsets to these anchors.

Figure 6: Bounding boxes with dimension priors and location prediction

6. Multi-Scale Training. Every 10 batches, YOLOv2 randomly selects another image size to train the model. This acts as data augmentation and forces the network to predict well for different input image dimension and scale. 


  1. Detection at three scales. YOLOv3 predicts boxes at 3 different scales. Then features are extracted from each scale by using a method similar to that of feature pyramid networks
  2. Bounding box predictions. YOLO v3 predicts the object score using logistic regression.
  3. Class prediction. Use independent logistic classifiers instead of softmax. This is done to make the classification multi-able classification.


YOLOv1 : https://arxiv.org/abs/1506.02640

YOLOv2 : https://pjreddie.com/media/files/papers/YOLO9000.pdf

YOLOv3 : https://pjreddie.com/media/files/papers/YOLOv3.pdf

Machine Learning on Spark — How it works and why it doesn’t work well

Spark provides spark MLlib for machine learning in a scalable environment. MLlib includes three major parts: Transformer, Estimator and Pipeline. Essentially, transformer takes a dataframe as an input and returns a new data frame with more columns. Most featurization tasks are transformer. Estimator takes a dataframes as an input and returns a model(transformer), as we know the ML algorithms.. Pipeline combines transformer and estimator together.
Additionally, data frame becomes the primary API for MLlib. There is not any more new features for RDD based API in Spark MLib.

If you already understood or used high level machine learning or deep learning frameworks, like scikit-learning, keras, tersorflow, you will find everything is so familiar with. But when you use spark MLlib in practice, you still need third library’s help. I will talk about it in the end.

Basic Stats

# Corrlation
from pyspark.ml.stat import Correlation
r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

# Summarizer
from pyspark.ml.stat import Summarizer
# compute statistics for multiple metrics without weight

# ChiSquare
r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))


# stop word remove
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
# tokenize
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
# n grame
ngram = NGram(n=2, inputCol="wordsData", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
# word frequence
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# idf
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

# word2vec
word2Vec = Word2Vec(vectorSize=N, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(documentDF)

# binarizer
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized_feature")
binarizedDataFrame = binarizer.transform(continuousDataFrame)

# reduce dimension to 3
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

# StringIndex
# encodes a string column of labels to a column of label of indices order by frequency or alphabet
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)

# OneHotEstimator
# we need to use StringIndex first if apply to categorical feature
encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
                                 outputCols=["categoryVec1", "categoryVec2"])
model = encoder.fit(df)encoded = model.transform(df)

# Normalize & Scaler
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")})
l1NormData = normalizer.transform(dataFrame)
# standard scaler
# withMean=false: standard deviation, withMean=true: mean 
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
# maxmin scaler
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
# max abs scaler
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")

# bin
from pyspark.ml.feature import Bucketizer
splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures")

# QuantileDiscretizer
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")

# ElementwiseProduct
transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
                                 inputCol="vector", outputCol="transformedVector")

# SQL Transformer
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

# VectorAssembler
# combine vector together for future model inputs
assembler = VectorAssembler(
    inputCols=["hour", "mobile", "userFeatures"], # the columns we need to combine
    outputCol="features") # output column

# Imputer
# handle missing value
imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])

# slice vector
slicer = VectorSlicer(inputCol="userFeatures", outputCol="features", indices=[1])

# ChiSqSelector
# use Chisqare to select the features
selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")

Clarification and Regression

# Linear regression
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# logistic regression
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8) 
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial") # multinomial

# decision tree
# classification
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
# regression
dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

# Random forest
# classification
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# regeression
rf = RandomForestRegressor(featuresCol="indexedFeatures")
# gradient-boosted 
gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

# preceptron
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)

# Linear SVM, there is no kernel SVM like RBF
lsvc = LinearSVC(maxIter=10, regParam=0.1)

# Naive Bayes
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

knn = KNNClassifier().setTopTreeSize(training.count().toInt / 500).setK(10)


# k-means
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

gmm = GaussianMixture().setK(2).setSeed(538009335)

Collaborative Filtering

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
model = als.fit(training)
# Evaluate the model by computing the RMSE on the test datapredictions = model.transform(test)evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
rmse = evaluator.evaluate(predictions)


# split train and test
train, test = data.randomSplit([0.9, 0.1], seed=12345)
# cross validation
crossval = CrossValidator(estimator=pipeline,
                          numFolds=2)  # use 3+ folds in practice

You might be already found the problem. The ecosystem of Spark MLlib is not as rich as scikit learning, and it is lack of deep learning (of course, its name is machine learning). According to Databricks documents, We still have the solutions.

  • Use scikit learning on single node. Very simple solution. but since scikit leanring load the data still in memory. If the note is faster enough(driver), we can get a good performance as well.
  • To solve deep learning problem. we have two work around methods.
    • Apply keras, tensorflow on single node with GPU acceleration(Recommend by databricks).
    • Distribute Training. It might be slower than on the single node because of communication overhead. There are two frameworks used for distribute training. Horovod and Apache SystemML. I’ve never use Horovod, but you can find information here. As to SystemML, it is more like a wrapper for high level API and provide cluster optimizer which parses the code into spark RDD(live variable analysis, propagate stats, rewrite by matrix decomposition and runtime instruction). From the official website, we know it is much faster than MLLib and native R. The problem is it didn’t update anymore since 2017.
# Create and save a single-hidden-layer Keras model for binary classification# NOTE: In a typical workflow, we'd train the model before exporting it to disk,# but we skip that step here for brevity
model = Sequential()
model.add(Dense(units=20, input_shape=[num_features], activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model_path = "/tmp/simple-binary-classification"

transformer = KerasTransformer(inputCol="features", outputCol="predictions", modelFile=model_path)

It seems no perfect solution for machine learning in spark, right? Don’t forget we have other time costing jobs: hyper parameters configuration and validation. We can run same model with different hyper parameters on different nodes using paramMap which similar to grid search or random search.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
paramGrid = ParamGridBuilder().addGrid(lr.maxIter, [10, 100, 1000]).addGrid(lr.regParam, [0.1, 0.01]).build()
crossval = CrossValidator(estimator=pipeline,
                      numFolds=2)  # use 3+ folds in practice
cvModel = crossval.fit(training)

There might be someone saying: why we don’t use MPI? The answer is simple, too complex. Although it can gain the perfect performance, and you can do whatever you want even running distributed GPU + CPU codes, there are too many things we need to manually configuration on low level API without fail tolerance.

In conclusion, we can utilize spark for ELT and training/ validation model to maximize the performance(it did really well for these works). But until now, we still need third frameworks to help us do deep learning or machine learning tasks on single strong node.


Spark MLlib: http://spark.apache.org/docs/latest/ml-guide.html

Databrick: https://docs.databricks.com/getting-started/index.html

Spark ML Tuning: http://spark.apache.org/docs/latest/ml-tuning.html

Harovod: https://github.com/horovod/horovod

Apache SystemML: https://systemml.apache.org/docs/1.1.0/beginners-guide-keras2dml

A Failed Text Classification

  • — version 1@20190401
  • –version 2@20190402: change to category to 2

Today I tried a text classification task where the data is about the message on the flights and labeled into 5 levels. Observably, it is a supervised problem. And I though there are bunch of solutions already for this kind of problem. So that I started with full of confidence. But…. the result was so bad, no more than 35% accurate for 5 classification. Only a little bit better than guess.

word tf-idf with kernal SVM32.6
word tf-idf with random forest31.3
Naïve Bayes35.3
word embedding(FastText) with GRU30.3

The detail can be found in Google Colaboratory.

I am considering following reasons leading this failure:

  1. The module is easy to over fitting. For example, when GRU model’s training loss decreasing, the invalidation loss was decreasing in the beginning, but after 40 epochs, it started to increasing or jumped up/down.
  2. Since I used trained embedding model(FastText) which is based on wiki but the dataset is in civil aviation. The words and word vectors may far away.
  3. In the data source, there might be lack of significant or clear rules to classify them to 5 categories. If we just label it to binary “attention/no worry”. The result will be better.

@20190402: I change the category from 5 to 2, hopefully the result would be better. But NO improvement. Only 63.5% for 2 categories.

Brief analysis on recommendation system of Netflix & YouTube

–03/19/2019 version 0.1

Last week, my wife told me she logged into my netflix’s account, then she found it was not hers immediately since the items did not match her tastes. This activated my interesting in the recommendation system of Netflix & Youtube which are the most watched channels in US. (maybe spotfiy will be the same way). Here I want to give a brief analysis how they work.


Before we quickly look how many different manners(that I knew) used in the recommendation systems.

Popularity. This is the simplest way in term of PV. It works very good for new users and avoid the “cold start” problem. However, the downside is this method can not provide the personalized recommendation. The way to optimize it is adding some categories at the beginning so that users can filter the categories by themselves.
Collaborative filtering (CF). The Collaborative Filtering (CF) algorithms are based on the idea that if two clients have similar rating history then they will behave similarly in the future (Breese,Heckerman, and Kadie, 1998). It can also split into two subcategories, one is Memory-based, another is Model-based.

  • Memory-based approach can be divided into User-based and Item-based.  They find the similar users or similar items respectively in term of Pearson Correlation.
    • User-based.
      1. Build correlation matrix S which is symmetric.

            \[S(i,k)=\frac{\sum_j (v_{ij}-\bar{v_i})(v_{kj}-\bar{v_k})}{\sqrt{\sum_j (v_{ij}-\bar{v_i})^2(v_{kj}-\bar{v_k})^2}}\]

      2. select top k users who has the largest scores.
      3. identify items that similar users like but the prediction user has not seem before.  The prediction of a recommendation is based on the wighted combination of the selected neighbor’s rating.

            \[p(i,k)=\bar{v}_i+\frac{\sum_{i=1}^{n}(v_{ij}-\bar{v}_k)\times S(i,k)}{\sum_{i=1}^{n}S(i,k)}\]

      4. pick up top N of movies based on the predicted rating.
    • Item-based.
      1. Build correlation matrix S based on items. (similar to user-based)
      2. Get the top n movies that prediction user watched and rated before.
      3. return the movies that mostly related to these n movies and the prediction user has never watched.
    • In the real word. The size of user are growing faster than item, and they are easy to be changed. So item-based are most frequency used. 
  • Model-based approach are based on matrix factorization which is popular in  dimension reduction. Here we use Singular value decomposition(SVD) to explain. 

        \[X_{n*m}=U_{n*r}\cdot S_{r*r}\cdot V_{r*m}^T\]

    , where U represents the freature vectors corresponding to the users in the latent space with dimension r, V represents the feature vectors corresponding to the items in the latent space with dimension r.  Once we find U and V, we can calculate any p(i,j) by U_i \cdot V_j.
  • CF is based on historical data, it has “cold start” problem. and the accuracy of prediction is based on the mount of data since the CF matrix has sparsity problem, e.g, few mistake rating will effect the prediction seriously. 

Contented-based(CB).  This approach is based on the information of item itself rather than only rating in CF approach. We need to create meta data for the items. These meta data can be tagged manual or use TF-IDF tech to automatically extra keywords. Then build the connection between the item that prediction user liked and the items with similar meta data. CB avoid of “cold start” and “over recommend” problems, however, it is hard to metain and keep accuracy of meta data.  

Hybrid. It combined CF and CB. We can merge the prediction together or set the weights in different scenarios. 

Deep Learning. In the large scale dataset, it is hard to use traditional recommendation system because of 4V(volume, variety, velocity, and veracity).   Deep learning model are good at solving complex problem( A review on deep learning for recommender systems: challenges and remedies).  We will introduce deep learning model used by YouTube in the next section.


I firstly log into the Netflix to find some information provided by the official website. Fortunately, there was a topic How Netflix’s Recommendations System Works. They didn’t give much detail about algorithms but the provides the clues which information they are using for predict users’ choices. Blew is their explanation:

We estimate the likelihood that you will watch a particular title in our catalog based on a number of factors including:

  • your interactions with our service (such as your viewing history and how you rated other titles),

  • other members with similar tastes and preferences on our service (more info here), and

  • information about the titles, such as their genre, categories, actors, release year, etc.

So, we can guess it is a hybrid approach combined with CF(item-base and user-based) and CB approaches. But we don’t know how they design it at this moment. Let keep reading from the official website.

In addition to knowing what you have watched on Netflix, to best personalize the recommendations we also look at things like:

  • the time of day you watch,

  • the devices you are watching Netflix on, and

  • how long you watch.

These actives are not mentioned in the basic section. They are all used as input vector for the deep learning model which we will see in YouTube section. 

It also mentioned “Cold start” problem:

When you create your Netflix account, or add a new profile in your account, we ask you to choose a few titles that you like. We use these titles to “jump start” your recommendations. Choosing a few titles you like is optional. If you choose to forego this step then we will start you off with a diverse and popular set of titles to get you going.

It’s clear they use popularity approach with categories to solve “cold start” problem. As user has more historical information, Netflix will use another approaches to replace the initial one. 

They also personalized row  and title inside:

In addition to choosing which titles to include in the rows on your Netflix homepage, our system also ranks each title within the row, and then ranks the rows themselves, using algorithms and complex systems to provide a personalized experience. …. In each row there are three layers of personalization:

  • the choice of row (e.g. Continue Watching, Trending Now, Award-Winning Comedies, etc.)
  • which titles appear in the row, and
  • the ranking of those titles.

They calculate the score for each item for each users, then sum up these scores into each category to decide the order of rows. As I said, we don’t know how they mix CB and CF to get the score of each item yet. But they are mixed for sure. 



As Google’s product, it is not surprised that YouTube uses Deep learning as a solution for recommendation system. It is too large both in user and item aspects. A simple stats model can not handle it well.  In the paper “Deep Neural Networks for YouTube Recommendations“, they explained how they use DL to YouTube.


It has two parts: Candidate Generation and Ranking. One for filtering hundred candidates from millions, second for sorting by adding more scenario or video features information. Let’s see how they work:

  • Candidate generation.  For candidate generation, it filters from millions videos, so it only uses user activities and scenario information. The basic idea is getting  probabilities of watching specific video V through user U and context C.


    . The key point is to get user vector u and v.  To get user vector u, author embeds the video watches and search tokens, then average them into watch vector and search vector, then combined with other geogrphic , video ages and gender vectors to get through 3 connected ReLU layer. The output is user vector u.  To get video vectors v, we need to use u to predict probabilities for all v through softmax. After training, the video vector v is what we want. In the serving processing, we only need to put u and v together to calculate the top N highest probability vectors. 
  • Ranking.Compared with Candidate Generation, the number of videos is much less. So we can put more video features into the embedding vectors. These features are mostly focus on scenario, like topic of video, how many videos the user watched under each topic and time since last watch.  It embeds categorical features with shared embeddings and continuous features with powers of normalization. 


  • Deep Neural Networks for YouTube Recommendations, Paul Covington, Jay Adams, Emre Sargin, https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf
  • How Netflix’s Recommendations System Works, n/a, https://help.netflix.com/en/node/100639
  • Finding the Latent Factors | Stanford University, https://www.youtube.com/watch?v=GGWBMg0i9d4&index=56&list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV
  • Recommendation System for Netflix, Leidy Esperanza MOLINA
    FERNÁNDEZ, https://beta.vu.nl/nl/Images/werkstuk-fernandez_tcm235-874624.pdf
  • 现在推荐算法都发展成什么样了?来看看这个你就知道了!,章华燕, https://mp.weixin.qq.com/s?__biz=MzIzNzA4NDk3Nw==&mid=2457737060&idx=1&sn=88ef898f5054ae9b8cb005c31b65ee2d&chksm=ff44bf3ac833362c2436002be265c390b033d3e7709553fd8b603e6269d58f366f689beb2639&mpshare=1&scene=1&srcid=#

Text Auto Summarization(Extraction)

Recently I was given a topic to research a manner to summary the text automatically. So I shared some my search results, hope it is helpful.

Summarization Methods

we can classify summarization methods into different types by input type, the purpose and output type. Typically, extractive and abstractive are the most common ways.


Here, we would like introduce two methods for Extractive. One is Stats-based , another is Deep Learning-based.


  1. Idea: for each word, we would give a weight frequency. For each sentence, we summary the weight frequency for the words inside. Then pick up the sentences ordered by the sum of weight frequency.
  2. Steps
    2.1 Preprocessing: replace extra whitespace characters or delete some parts we do not need to analysis.
replace = {
ord('\f') : ' ',
ord('\t') : ' ',
ord('\n') : ' ',
ord('\r') : None

2.2 Tokenizing the sentence

sent_list = nltk.sent_tokenize(content)

2.3 Get frequency of each word

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
if word not in stopwords: if word not in word_frequencies.keys():
word_frequencies[word] = 1
word_frequencies[word] += 1

2.4 Weighted frequency of occurrence

maximum_frequncy = max(word_frequencies.values())
for word in word_frequencies.keys():
word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)

2.5 Calculate the sum of weight frequency for each sentence

sentence_scores = {}
for sent in sentence_list:
for word in nltk.word_tokenize(sent.lower()):
if word in word_frequencies.keys():
if len(sent.split(' ')) < 30:
if sent not in sentence_scores.keys():
sentence_scores[sent] = word_frequencies[word]
sentence_scores[sent] += word_frequencies[word]

2.6 sort sentences in descending order of sum

import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)

Deep Learning-based

  1. Idea: vectorizing each sentence into a high dimension space, then cluster the vector using kmean, pick up the sentences which mostly close to the center of each cluster to form the summery of text.
  2. steps:
    2.1 prepossessing and tokenizing the sentence( same as stats-based method)
    2.2 Skip-Thought Encoder


Encoder Network: The encoder is typically a GRU-RNN which generates a fixed length vector representation h(i) for each sentence S(i) in the input.
Decoder Network: The decoder network takes this vector representation h(i) as input and tries to generate two sentences — S(i-1) and S(i+1), which could occur before and after the input sentence respectively.

These learned representations h(i) are such that embeddings of semantically similar sentences are closer to each other in vector space, and therefore are suitable for clustering.


Skip-Thoughts Architecture

import skipthoughts
# You would need to download pre-trained models first
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
encoded =  encoder.encode(sentences)

2.3 Clustering

import numpy as np
from sklearn.cluster import KMeans

n_clusters = np.ceil(len(encoded)**0.5)
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

2.4 Summerization

from sklearn.metrics import pairwise_distances_argmin_min

avg = []
for j in range(n_clusters):
idx = np.where(kmeans.labels_ == j)[0]
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([email[closest[idx]] for idx in ordering])


  1. Unsupervised Text Summarization using Sentence Embeddings,https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
  2. Skip-Thought Vectors, https://arxiv.org/abs/1506.06726
  3. Text Summarization with NLTK in Python, https://stackabuse.com/text-summarization-with-nltk-in-python/

Math in Machine Learning

Linear Algebra

  • mathematics of data: multivariate, least square, variance, covariance, PCA
  • equotion: y = A \cdot b, where A is a matrix, b is a vector of depency variable
  • application in ML
    1. Dataset and Data Files
    2. Images and Photographs
    3. One Hot Encoding: A one hot encoding is a representation of categorical variables as binary vectors. encoded = to_categorical(data)
    4. Linear Regression. L1 and L2
    5. Regularization
    6. Principal Component Analysis. PCA
    7. Singular-Value Decomposition. SVD. M=U*S*V
    8. Latent Semantic Analysis. LSA typically, we use tf-idf rather than number of terms. Through SVD, we know the different docments with same topic or the different terms with same topic
    9. Recommender Systems.
    10. Deep Learning


  • array broadcasting
    1. add a scalar or one dimension matrix to another matrix. y = A + b where b is broadcated.
    2. it oly works when when the shape of each dimension in the arrays are equal or one has the dimension size of 1.
    3. The dimensions are considered in reverse order, starting with the trailing dimension;


  • Vector
    1. lower letter. \upsilon = (\upsilon<em>1, \upsilon</em>2, \upsilon_3)
    2. Addtion, Substruction
    3. Multiplication, Divsion(Same length) a*b or a / b
    4. Dot product: a\cdot b
  • Vector Norm
    1. Defination: the length of vector
    2. L1. Manhattan Norm. L<em>1(\upsilon)=|a</em>1| + |a<em>2| + |a</em>3| python: norm(vector, 1) . Keep coeffiencents of model samll
    3. L2. Euclidean Norm. L<em>2(\upsilon)=\sqrt(a</em>1^2+a<em>2^2+a</em>3^2) python: norm(vector)
    4. Max Norm. L<em>max=max(a</em>1,a<em>2,a</em>3) python: norm(vector, inf)
  • Matrices
    1. upper letter. A=((a<em>{1,1},a</em>{1,2}),(a<em>{2,1},a</em>{2,2}) )
    2. Addtion, substruction(same dimension)
    3. Multiplication, Divsion( same dimension)
    4. Matrix dot product. If C=A\cdot B, A’s column(n) need to be same size to B’s row(m). python: A.dot(B) or A@B
    5. Matrix-Vector dot product. C=A\cdot \upsilon
    6. Matrix-Scalar. element-wise multiplication
    7. Type of Matrix
      1. square matrix. m=n. readily to add, mulitpy, rotate
      2. symmetric matrix. M=M^T
      3. triangular matrix. python: tril(vector) or triu(vector) lower tri or upper tri matrix
      4. Diagonal matrix. only diagonal line has value, doesnot have to be square matrix. python: diag(vector)
      5. identity matrix. Do not change vector when multiply to it. notatoin as I^n python: identity(dimension)
      6. orthogonal matrix. Two vectors are orthogonal when dot product is zeor. \upsilon \cdot \omega = 0 or \upsilon \cdot \omega^T = 0. which means the project of \upsilon to \omega is zero. An orthogonal matrix is a matrix which Q^T \cdot Q = I
    8. Matrix Operation
      1. Transpose. A^T number of rows and columns filpped. python: A.T
      2. Inverse. A^{-1} where AA^{-1}=I^n python: inv(A)
      3. Trace. tr(A) the sum of the values on the main diagonal of matrix. python: trace(A)
      4. Determinant. a square matrix is a scalar representation of the volume of the matrix. It tell the matrix is invertable. det(A) or |A|. python: det(A) .
      5. Rank. Number of linear indepent row or column(which is less). The number of dimesions spanned by all vectors in the matrix. python: rank(A)
    9. Sparse matrix
      1. sparsity score = \frac{count of non-zero elements}{total elements}
      2. example: word2vector
      3. space and time complexity
      4. Data and preperation
        1. record count of activity: match movie, listen a song, buy a product. It usually be encoded as : one hot, count encoding, TF-IDF
      5. Area: NLP, Recomand system, Computer vision with lots of black pixel.
      6. Solution to represent sparse matrix. reference
        1. Dictionary of keys:  (row, column)-pairs to the value of the elements.
        2. List of Lists: stores one list per row, with each entry containing the column index and the value.
        3. Coordinate List: a list of (row, column, value) tuples.
        4. Compressed Sparse Row: three (one-dimensional) arrays (A, IA, JA).
        5. Compressed Sparse Column: same as SCR
      7. example
        1. covert to sparse matrix python: csr_matrix(dense_matrix)
        2. covert to dense matrix python: sparse_matrix.todense()
        3. sparsity = 1.0 – count_nonzero(A) / A.size
    10. Tensor
      1. multidimensional array.
      2. algriothm is similar to matrix
      3. dot product: python: tensordot()


  • Matrix Decompositions
    1. LU Decomposition
      1. square matrix
      2. A = L\cdot U \cdot P, L is lower triangle matrix, U is upper triangle matrix. P matrix is used to permute the result or return result to the orignal order.
      3. python: lu(square_matrix)
    2. QR Decomposition
      1. n*m matrix
      2. A = Q \cdot R where Q a matrix with the size mm, and R is an upper triangle matrix with the size mn.
      3. python: qr(matrix)
    3. Cholesky Decomposition
      1. square symmtric matrix where values are greater than zero
      2. A = L\cdot L^T=U\cdot U^T, L is lower triangle matrix, U is upper triangle matrix.
      3. twice faster than LU decomposition.
      4. python: cholesky(matrix)
    4. EigenDecomposition
      1. eigenvector: A\cdot \upsilon = \lambda\cdot \upsilon, A is matrix we want to decomposite, \upsilon is eigenvector, \lambda is eigenvalue(scalar)
      2. a matrix could have one eigenvector and eigenvalue for each dimension. So the matrix A can be shown as prodcut of eigenvalues and eigenvectors. A = Q \cdot \Lambda \cdot Q^T where Q is the matrix of eigenvectors, \Lambda is the matrix of eigenvalue. This equotion also mean if we know eigenvalues and eigenvectors we can construct the orignal matrix.
      3. python: eig(matrix)
    5. SVD(singluar value decomposition)
      1. A = U\cdot \sum \cdot V^T, where A is m*n, U is m*m matrix, \sum is m*m diagonal matrix also known as singluar value, V^T is n*n matrix.
      2. python: svd(matrix)
      3. reduce dimension
        1. select top largest singluar values in \sum
        2. B = U\cdot \sum<em>k \cdot V</em>k^T, where column select from \sum, row selected from V^T, B is approximate of the orignal matrix A.
        3. `python: TruncatedSVD(n_components=2)


  • Multivari stats
    1. variance: \sigma^2 = \frac{1}{n-1} * \sum<em>{i=1}^{n}(x</em>i-\mu)^2, python: var(vector, ddof=1)
    2. standard deviation: s = \sqrt{\sigma^2}, python:std(M, ddof=1, axis=0)
    3. covariance: cov(x,y) = \frac{1}{n}\sum<em>{i=1}^{n}(x</em>i-\bar{x})(y_i-\bar{y}), python: cov(x,y)[0,1]
    4. coralation: cor(x,y) = \frac{cov(x,y)}{s<em>x*s</em>y}, normorlized to the value between -1 to 1. python: corrcoef(x,y)[0,1]
    5. PCA
      1. project high dimensions to subdimesnion
      2. steps:
        1. M = mean(A)
        2. C = A-M
        3. V = cov(C)
        4. values,vector = eig(V)
        5. B = select(values,vectors), which order by eigenvalue
      3. scikit learn

        pca = PCA(2) # get two components
        print(pca.componnets_) # values
        print(pca.explained_variance_) # vectors
        B = pca.transform(A) # transform to new matrix
    • Linear Regression
    1. y = X \cdot b, where b is coeffcient and unkown
    2. linear least squares( similar to MSE) ||X\cdot b - y|| = \sum<em>{i=1}^{m}\sum</em>{j=1}^{n}X<em>{i,j}\cdot (b</em>j - y_i)^2, then b = (X^T\cdot X)^{-1} \cdot X^T \cdot y. Issue: very slow
    3. MSE with SDG

Reference: Basics of Linear Algebra for Machine Learning, jason brownlee, https://machinelearningmastery.com/linear_algebra_for_machine_learning/