Love, Death & Robots

img
More than 80 per cent of the TV shows and movies people watch on Netflix are discovered through the platform’s recommendation system. In other words, we do not choose what to watch; the algorithm does.
Among all the streaming services I use or have used, including YouTube, Spotify, Netflix, Hulu, etc., the stickiest ones are Spotify and Netflix. Why? Personally, I think it comes down to these points:
1. The recommendation algorithm is combined with explicit user preference. Unlike YouTube, whose auto recommendation is too aggressive and keeps pushing content I don’t like, Netflix and Spotify are smarter: they let us choose the categories we like at the beginning, or at least present the main categories, which makes choosing easier. Put another way, they offer a new user something rather than nothing.
2. Netflix has hired real-life humans to categorize every bit of its TV shows and movies and apply tags to each of them, in order to create hyper-specific micro-genres such as “Visually-striking nostalgic dramas” or “Understated romantic road trip movies”. Since the catalog size is controlled by Netflix itself, this manual tagging is more precise than automatic tags.
3. They create original movies based on big data. Okay, maybe that is not a good thing. But they did it, and we love the movies.
img

Text Auto Summarization(Extraction)

Recently I was asked to research ways to summarize text automatically, so I am sharing some of my findings here. I hope they are helpful.

Summarization Methods

We can classify summarization methods into different types by input type, purpose and output type. Typically, extractive and abstractive are the most common approaches.

img

Here I would like to introduce two extractive methods: one is stats-based, the other is deep-learning-based.

Stats-based

  1. Idea: assign each word a weighted frequency; for each sentence, sum the weighted frequencies of the words it contains; then pick the sentences with the highest sums.
  2. Steps
    2.1 Preprocessing: replace extra whitespace characters and delete the parts we do not need to analyze.
replace = {
    ord('\f'): ' ',
    ord('\t'): ' ',
    ord('\n'): ' ',
    ord('\r'): None
}
data = data.translate(replace)  # str.translate returns a new string

2.2 Tokenizing the sentence

sent_list = nltk.sent_tokenize(content)

2.3 Get frequency of each word

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

2.4 Weighted frequency of occurrence

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

2.5 Calculate the sum of weighted frequencies for each sentence

sentence_scores = {}
for sent in sent_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

2.6 Sort sentences in descending order of their scores and pick the top ones

import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

Deep Learning-based

  1. Idea: embed each sentence into a high-dimensional vector space, cluster the vectors with k-means, then pick the sentence closest to the center of each cluster to form the summary of the text.
  2. Steps:
    2.1 Preprocessing and sentence tokenization (same as the stats-based method)
    2.2 Skip-Thought Encoder

img

Encoder Network: The encoder is typically a GRU-RNN which generates a fixed length vector representation h(i) for each sentence S(i) in the input.
Decoder Network: The decoder network takes this vector representation h(i) as input and tries to generate two sentences — S(i-1) and S(i+1), which could occur before and after the input sentence respectively.

These learned representations h(i) are such that embeddings of semantically similar sentences are closer to each other in vector space, and therefore are suitable for clustering.

img

Skip-Thoughts Architecture

import skipthoughts
# You would need to download pre-trained models first
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
encoded =  encoder.encode(sentences)

2.3 Clustering

import numpy as np
from sklearn.cluster import KMeans

n_clusters = int(np.ceil(len(encoded)**0.5))
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

2.4 Summarization

from sklearn.metrics import pairwise_distances_argmin_min

avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])

Reference:

  1. Unsupervised Text Summarization using Sentence Embeddings, https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
  2. Skip-Thought Vectors, https://arxiv.org/abs/1506.06726
  3. Text Summarization with NLTK in Python, https://stackabuse.com/text-summarization-with-nltk-in-python/

Math in Machine Learning

Linear Algebra

  • mathematics of data: multivariate statistics, least squares, variance, covariance, PCA
  • equation: y = A \cdot b, where A is a data matrix and b is a vector of coefficients
  • application in ML
    1. Dataset and Data Files
    2. Images and Photographs
    3. One Hot Encoding: a one hot encoding represents categorical variables as binary vectors, e.g. encoded = to_categorical(data) (see the sketch after this list)
    4. Linear Regression. L1 and L2
    5. Regularization
    6. Principal Component Analysis. PCA
    7. Singular-Value Decomposition. SVD. M = U \cdot S \cdot V^T
    8. Latent Semantic Analysis. LSA. Typically we use tf-idf rather than raw term counts. Through SVD, we can find different documents that share the same topic, or different terms that belong to the same topic.
    9. Recommender Systems.
    10. Deep Learning
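
As a quick illustration of the one-hot encoding item above, here is a minimal NumPy sketch; the label values are made up for the example, and Keras's to_categorical produces the same result:

import numpy as np

data = np.array([0, 2, 1, 2])        # hypothetical categorical labels
n_classes = data.max() + 1           # 3 classes in this toy example

# one row per sample, one column per class
encoded = np.eye(n_classes)[data]
print(encoded)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]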

Numpy

  • array broadcasting
    1. add a scalar or a one-dimensional array to a matrix, e.g. y = A + b, where b is broadcast
    2. it only works when the shape of each dimension in the arrays is equal or one of them has a dimension size of 1
    3. the dimensions are compared in reverse order, starting with the trailing dimension (see the sketch below)
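
A minimal NumPy sketch of these broadcasting rules (the shapes are chosen only for illustration):

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])       # shape (2, 3)
b = np.array([10, 20, 30])      # shape (3,)

# trailing dimensions match (3 == 3), so b is stretched along the first axis
print(A + b)
# [[11 22 33]
#  [14 25 36]]

# a scalar broadcasts against every element
print(A + 1)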

Matrices and Vectors

  • Vector
    1. lower-case letter: \upsilon = (\upsilon_1, \upsilon_2, \upsilon_3)
    2. Addition, Subtraction
    3. Multiplication, Division (same length): a*b or a/b
    4. Dot product: a \cdot b
  • Vector Norm
    1. Definition: the length of a vector
    2. L1. Manhattan Norm. L_1(\upsilon) = |a_1| + |a_2| + |a_3|. python: norm(vector, 1). Keeps the model coefficients small.
    3. L2. Euclidean Norm. L_2(\upsilon) = \sqrt{a_1^2 + a_2^2 + a_3^2}. python: norm(vector)
    4. Max Norm. L_{max}(\upsilon) = max(a_1, a_2, a_3). python: norm(vector, inf)
  • Matrices
    1. upper-case letter: A = ((a_{1,1}, a_{1,2}), (a_{2,1}, a_{2,2}))
    2. Addition, subtraction (same dimensions)
    3. Multiplication, division (same dimensions)
    4. Matrix dot product: if C = A \cdot B, the number of columns of A (n) must equal the number of rows of B (m). python: A.dot(B) or A @ B
    5. Matrix-vector dot product: C = A \cdot \upsilon
    6. Matrix-scalar: element-wise multiplication
    7. Types of matrices
      1. square matrix: m = n; easy to add, multiply, rotate
      2. symmetric matrix: M = M^T
      3. triangular matrix. python: tril(matrix) or triu(matrix) for the lower or upper triangular part
      4. diagonal matrix: only the main diagonal has values; it does not have to be square. python: diag(vector)
      5. identity matrix: does not change a vector when multiplied with it; notation I^n. python: identity(dimension)
      6. orthogonal matrix: two vectors are orthogonal when their dot product is zero, \upsilon \cdot \omega = 0 (equivalently \upsilon \cdot \omega^T = 0), which means the projection of \upsilon onto \omega is zero. An orthogonal matrix is a matrix for which Q^T \cdot Q = I
    8. Matrix operations
      1. Transpose: A^T, the rows and columns are flipped. python: A.T
      2. Inverse: A^{-1}, where A \cdot A^{-1} = I^n. python: inv(A)
      3. Trace: tr(A), the sum of the values on the main diagonal of the matrix. python: trace(A)
      4. Determinant: a scalar representation of the volume of a square matrix; it tells whether the matrix is invertible. det(A) or |A|. python: det(A)
      5. Rank: the number of linearly independent rows or columns (whichever is smaller), i.e. the number of dimensions spanned by the vectors of the matrix. python: matrix_rank(A)
    9. Sparse matrix
      1. sparsity score = \frac{count of zero elements}{total elements}
      2. example: word vectors such as one-hot / bag-of-words encodings
      3. challenges: space and time complexity when stored as a dense array
      4. Data and preparation
        1. records of counted activity: watching a movie, listening to a song, buying a product. They are usually encoded as one-hot, count encoding or TF-IDF.
      5. Areas: NLP, recommender systems, computer vision with lots of black pixels.
      6. Ways to represent a sparse matrix:
        1. Dictionary of Keys: maps (row, column) pairs to the value of the elements.
        2. List of Lists: stores one list per row, with each entry containing the column index and the value.
        3. Coordinate List: a list of (row, column, value) tuples.
        4. Compressed Sparse Row: three one-dimensional arrays (A, IA, JA).
        5. Compressed Sparse Column: same as CSR but column-oriented.
      7. example (see also the sketch after this list)
        1. convert to a sparse matrix. python: csr_matrix(dense_matrix)
        2. convert back to a dense matrix. python: sparse_matrix.todense()
        3. sparsity = 1.0 - count_nonzero(A) / A.size
    10. Tensor
      1. a multidimensional array
      2. the algorithms are similar to those for matrices
      3. dot product: python: tensordot()
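
A small sketch of the sparse-matrix items above, using NumPy and SciPy; the example matrix is arbitrary:

import numpy as np
from numpy import count_nonzero
from scipy.sparse import csr_matrix

A = np.array([[1, 0, 0, 1, 0, 0],
              [0, 0, 2, 0, 0, 1],
              [0, 0, 0, 2, 0, 0]])

S = csr_matrix(A)       # dense -> Compressed Sparse Row
B = S.todense()         # back to a dense matrix

sparsity = 1.0 - count_nonzero(A) / A.size
print(sparsity)         # about 0.72: most elements are zero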

Factorization

  • Matrix Decompositions
    1. LU Decomposition
      1. for square matrices
      2. A = P \cdot L \cdot U, where L is a lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix used to re-order the rows back to the original order.
      3. python: lu(square_matrix)
    2. QR Decomposition
      1. for any m*n matrix
      2. A = Q \cdot R, where Q is an m*m matrix and R is an upper triangular matrix of size m*n.
      3. python: qr(matrix)
    3. Cholesky Decomposition
      1. for square symmetric matrices whose eigenvalues are greater than zero (positive definite)
      2. A = L \cdot L^T = U^T \cdot U, where L is a lower triangular matrix and U is an upper triangular matrix.
      3. nearly twice as fast as LU decomposition.
      4. python: cholesky(matrix)
    4. Eigendecomposition
      1. eigenvector: A \cdot \upsilon = \lambda \cdot \upsilon, where A is the matrix we want to decompose, \upsilon is an eigenvector and \lambda is the corresponding eigenvalue (a scalar)
      2. a matrix can have one eigenvector and eigenvalue for each dimension, so the matrix A can be written as a product of its eigenvalues and eigenvectors: A = Q \cdot \Lambda \cdot Q^{-1} (Q^T when Q is orthogonal, e.g. for a symmetric A), where Q is the matrix of eigenvectors and \Lambda is the diagonal matrix of eigenvalues. This equation also means that if we know the eigenvalues and eigenvectors we can reconstruct the original matrix.
      3. python: eig(matrix)
    5. SVD (Singular Value Decomposition)
      1. A = U \cdot \Sigma \cdot V^T, where A is m*n, U is an m*m matrix, \Sigma is an m*n diagonal matrix of singular values, and V^T is an n*n matrix.
      2. python: svd(matrix)
      3. dimensionality reduction
        1. keep the k largest singular values in \Sigma
        2. B = U \cdot \Sigma_k \cdot V_k^T, where the columns are selected from \Sigma and the rows from V^T; B is an approximation of the original matrix A.
        3. python: TruncatedSVD(n_components=2) (a code sketch of these decompositions follows below)
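
A minimal sketch of these decompositions with NumPy/SciPy; the matrix is a toy symmetric positive definite example so every decomposition applies:

import numpy as np
from scipy.linalg import lu, qr, cholesky
from numpy.linalg import eig, svd

A = np.array([[2., 1.],
              [1., 2.]])

P, L, U = lu(A)                  # A = P @ L @ U
Q, R = qr(A)                     # A = Q @ R
Lc = cholesky(A, lower=True)     # A = Lc @ Lc.T
values, vectors = eig(A)         # A @ v = lambda * v
U2, s, VT = svd(A)               # A = U2 @ diag(s) @ VT

# low-rank approximation: keep only the largest singular value
k = 1
B = U2[:, :k] @ np.diag(s[:k]) @ VT[:k, :]
print(B)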

Stats

  • Multivariate stats
    1. variance: \sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2. python: var(vector, ddof=1)
    2. standard deviation: s = \sqrt{\sigma^2}. python: std(M, ddof=1, axis=0)
    3. covariance: cov(x,y) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}). python: cov(x,y)[0,1]
    4. correlation: cor(x,y) = \frac{cov(x,y)}{s_x s_y}, normalized to a value between -1 and 1. python: corrcoef(x,y)[0,1]
    5. PCA
      1. projects high-dimensional data onto a lower-dimensional subspace
      2. steps:
        1. M = mean(A)
        2. C = A - M
        3. V = cov(C)
        4. values, vectors = eig(V)
        5. B = select(values, vectors), ordered by eigenvalue
      3. scikit-learn
      3. scikit learn

        from sklearn.decomposition import PCA

        pca = PCA(2)                    # keep two components
        pca.fit(A)
        print(pca.components_)          # principal directions (eigenvectors)
        print(pca.explained_variance_)  # variance explained along each direction (eigenvalues)
        B = pca.transform(A)            # project A onto the new axes
    • Linear Regression
    1. y = X \cdot b, where b are the unknown coefficients
    2. linear least squares (similar to MSE): minimize ||X \cdot b - y||^2 = \sum_{i=1}^{m}(\sum_{j=1}^{n} X_{i,j} b_j - y_i)^2; the closed-form solution is b = (X^T \cdot X)^{-1} \cdot X^T \cdot y. Issue: very slow for large matrices (see the sketch below)
    3. MSE with SGD
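
A small sketch of the least-squares solution above using NumPy; the data is random and only meant to show the API:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 samples, 3 features
true_b = np.array([1.5, -2.0, 0.5])
y = X @ true_b + 0.1 * rng.normal(size=100)

# closed-form normal equation: b = (X^T X)^-1 X^T y
b_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# the numerically preferred route solves the same problem
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_normal, b_lstsq)                      # both close to true_b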

Reference: Basics of Linear Algebra for Machine Learning, Jason Brownlee, https://machinelearningmastery.com/linear_algebra_for_machine_learning/

Types of Convolution(Translation)

For most of us who have learned CNNs, we already know that the convolution operation extracts features from spatial relationships. Compared with a fully connected network, it benefits from weight sharing and translation invariance. There are many different kinds of convolution. Recently I found a very good article that summarizes this topic, so I translated it into English combined with my own understanding. If you want to read the original, you can go here.

1. Standard Convolution

1.1 Single channel

img

It is an element-wise multiplication followed by a sum. The convolutional filter moves over each element of the picture; here we set padding = 0 and stride = 1. This is very useful for grayscale images.

img

1.2 multi channels

img

Color pictures are made of 3 channels: red, green and blue. We create a 3x3x3 convolution which contains 3 convolutional kernels, one per channel, and then sum the three results together into a single-channel 2D array (a small sketch follows below).

img
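
A minimal NumPy sketch of this multi-channel convolution; the image size and kernel values are arbitrary:

import numpy as np

H, W, C = 5, 5, 3                       # toy RGB image
image = np.arange(H * W * C, dtype=float).reshape(H, W, C)
kernel = np.ones((3, 3, C))             # a 3x3x3 filter: one 3x3 kernel per channel

out_h, out_w = H - 3 + 1, W - 3 + 1     # padding = 0, stride = 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i+3, j:j+3, :]          # 3x3x3 window
        output[i, j] = np.sum(patch * kernel)   # element-wise multiply, sum over all channels

print(output.shape)   # (3, 3): the three channel results collapse into one 2D feature map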

Read More

First Running! Deep learning to CFD

This is our first successful application of ConvLSTM to CFD, although the case is simple and many factors are kept under control. The ground truth is generated by OpenFOAM, and the model is PredNet from Coxlab. We trained three models this time.

1. Training: use the Nth frame to predict the (N+1)th frame.
Prediction: use frames 1-10 to predict frames 2-11, then append the predicted 11th frame to frames 2-10 to form a new input (frames 2-10 are ground truth, frame 11 is predicted) and keep predicting frame 12, and so on up to frame 20, with a sliding window of 1 frame. (Only the first few frames are good, since later predictions are fed back as input; a rollout sketch follows below.)

LSTM
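
The sliding-window rollout in experiment 1 can be written as a short loop. This is only a hypothetical sketch: model.predict and the frame shapes stand in for the actual PredNet setup:

import numpy as np

def rollout(model, frames, context=10, horizon=10):
    # frames: ground-truth sequence with shape (T, H, W, C)
    window = list(frames[:context])                    # frames 1-10 as the initial input
    predictions = []
    for _ in range(horizon):
        x = np.expand_dims(np.stack(window), axis=0)   # batch of one sequence
        next_frame = model.predict(x)[0, -1]           # assumed: model returns the predicted sequence
        predictions.append(next_frame)
        window = window[1:] + [next_frame]             # slide the window by one frame,
                                                       # feeding the prediction back in
    return np.stack(predictions)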

2. Training: use the Nth frame to predict the (N+10)th frame.
Prediction: use frames 1-10 to predict frames 11-20, with no sliding window. (This works very well since all inputs are ground truth.)

LSTM

3. Training: use the Nth frame to predict the (N+1)th frame.
Prediction: use one frame to predict the next frame, like driving prediction. (The animation has a small problem: the right side is the prediction, the left side is the ground truth.)

LSTM

Convolutional LSTM

For some reason, there is a need to predict video frames. Video combines spatial and temporal dimensions: FCNs and LSTMs are good at each of them respectively, but to handle both at once we need ConvLSTM. Since I have just started to learn it, I am writing down some notes for better understanding.

1. First things first, let’s see what an LSTM looks like:

LSTM

From left to right, we can see the forget gate, input gate, input modulation gate and output gate. Along the top is the memory pipe, which simulates the way humans remember things. For more information on how LSTM works, please click here.
In Keras, there are already three kinds of RNN layers: SimpleRNN, LSTM and GRU, and they are all easy to use; a tiny example follows below.
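
For instance, a minimal Keras LSTM on dummy sequence data might look like this; the shapes and layer sizes are arbitrary and only show the API:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, features = 10, 4
X = np.random.rand(100, timesteps, features)      # 100 dummy sequences
y = np.random.rand(100, 1)

model = Sequential([
    LSTM(32, input_shape=(timesteps, features)),  # SimpleRNN or GRU can be swapped in here
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, epochs=2, verbose=0)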

2. What is ConvLSTM

Since LSTM is not good at handling spatial input, which it flattens into one dimension, ConvLSTM was created to allow multidimensional data, with convolutional operations inside each gate.
ConvLSTM

formula

We can see that the basic formulas are the same as LSTM; they just use convolutional operations instead of one-dimensional ones for the input, the previous output and the memory. Keras provides a component called ConvLSTM2D that wraps this ConvLSTM (see the sketch below).
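
A minimal ConvLSTM2D sketch on random video-like data; all shapes are chosen only for illustration:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Conv3D

# 50 dummy clips, 10 frames each, 40x40 pixels, 1 channel
X = np.random.rand(50, 10, 40, 40, 1)

model = Sequential([
    ConvLSTM2D(filters=16, kernel_size=(3, 3), padding='same',
               return_sequences=True, input_shape=(None, 40, 40, 1)),
    Conv3D(filters=1, kernel_size=(3, 3, 3), padding='same', activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(X, X, epochs=1, verbose=0)   # toy target: reconstruct the input frames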

3. Where do we use it?

As I said at the beginning, it is used for prediction over time and space. Work already done in academia includes precipitation nowcasting, video frame prediction and some physical movement prediction. You can find more in my references.
weather

Reference:
1. The bouncing ball: https://www.youtube.com/watch?v=RjZ1VKYyHhs
2. Weather forecast: https://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation-nowcasting.pdf
3. Some video predictions: https://www.youtube.com/watch?v=MjFpgyWH-pk

Incremental Load DW by using CDC in SSIS

To load data from an OLTP system into a DW, we have to face a problem: how to balance time and cost. Since data grows faster and faster, we would otherwise need to keep increasing hardware capacity to meet the time requirement, so incremental load comes in to reduce the data transmission significantly. There are three ways to achieve it: 1. use a datetime column, 2. Change Data Capture (CDC), 3. change tracking. For a long time I used the first way to capture changed data manually; it works, but it takes too much development and testing effort. Here I want to introduce CDC. Basically, CDC is just a feature that utilizes the LSN (Log Sequence Number) and log tables to capture changed data, while SSIS provides native components to work easily with this feature.

Let’s build a CDC workflow in SSIS for example:

Enable CDC feature

  • enable CDC by executing sp_cdc_enable_db (disable with sp_cdc_disable_db). Then check it with select name from sys.databases where is_cdc_enabled = 1
  • enable CDC for a specific table with sp_cdc_enable_table; then we can find the CDC table in the System Tables folder, or check with select name from sys.tables where is_tracked_by_cdc = 1.
exec sys.sp_cdc_enable_table
    @source_schema = N'Person'
    , @source_name = N'Address'
    , @role_name = N'cdc_Admin'
    , @capture_column_list = N'column1, column2'; -- track specific columns rather than the whole table

Control flow setting

  • add a CDC Control Task
  • set the CDC control operation to Mark CDC start and set a CDC state variable for saving the CDC state
  • run this control task; it will create a record in the cdc_states table
  • create two more CDC Control Tasks: set the operation of one to Get Processing Range and the other to Mark Processed Range; they will get the changed data and update the CDC state respectively.
  • put a data flow, which is responsible for the ETL operation, between the two CDC control tasks.

Data flow setting into staging table

  • add a CDC Source which points to the CDC-enabled table and choose the correct cdc_states table as well.
  • choose the Net CDC processing mode in the CDC Source.
  • add a CDC Splitter after the CDC Source, and create three Derived Column transformations for inserted (0), updated (2) and deleted (1) data.
  • create a Union All transformation to union all the data and export it to the staging database.
  • if necessary, add a truncate script before the whole control flow to delete everything in the staging database.
    img

Update fact table through staging tables

  • create an OLE DB source to connect to the staging database
  • use a Conditional Split to separate inserts from updates + deletes
  • for inserts, export directly; for updates + deletes, delete the matching rows from the fact table by identifier with an OLE DB Command transformation, then use another Conditional Split to export the update data.
  • if necessary, use a Lookup to replace some dimension columns
  • export to the fact database.

Do we really know how water moves?

Close your eyes and think about what happens when spray beats the shore, or when water from the faucet hits your body.
This is the simplest animation showing how water changes its speed after it meets a wall.

img

The color represents the speed of the water, which enters from the left side at 1 m/s. Can anyone simulate this in their head? I guess it is super hard unless you have seen thousands of similar pictures.
So if the simulation becomes much more complex than this one, like the spray beating the shore that I mentioned before, I guess there will be a big gap between our imagination and the real situation.

Learn Django with me(part 3)

Handle view and templates

Views consist of a set of functions which handle requests for different URLs matching specific URL patterns,
and they return either an HttpResponse or an Http404.

Firstly, let’s update webapp/views.py:

from django.http import HttpResponse
from django.shortcuts import render, get_object_or_404
from .models import Question

def index(request):
    # get the latest 5 questions
    latest_question_list = Question.objects.order_by('-pub_date')[:5]
    # create context
    context = {'latest_question_list': latest_question_list}
    # a shortcut for rendering the request with a template
    return render(request, 'webapp/index.html', context)

def detail(request, question_id):
    question = get_object_or_404(Question, pk=question_id)
    return render(request, 'webapp/detail.html', {'question': question})

def results(request, question_id):
    response = "You're looking at the results of question %s."
    return HttpResponse(response % question_id)

def vote(request, question_id):
    return HttpResponse("You're voting on question %s." % question_id)

Here we used the template webapp/index.html, which is located at webapp/templates/webapp/index.html. So let’s create a folder templates and its subfolder webapp; the code of index.html:

{# list all items from the question object #}
{% if latest_question_list %}
    <ul>
    {% for question in latest_question_list %}
        {# webapp is the namespace, detail is the name #}
        <li><a href="{% url 'webapp:detail' question.id %}">{{ question.question_text }}</a></li>
    {% endfor %}
    </ul>
{% else %}
    <p>No polls are available.</p>
{% endif %}

The tricky point is that when we refer to the details we use {% url 'webapp:detail' question.id %} instead of an absolute path. Here webapp is the namespace and detail is the name; both can be found in the updated webapp/urls.py:

from django.urls import path

from . import views

# namespace
app_name = 'webapp'
urlpatterns = [
    # ex: /webapp/
    path('', views.index, name='index'),
    # ex: /webapp/5/
    path('<int:question_id>/', views.detail, name='detail'),
    # ex: /webapp/5/results/
    path('<int:question_id>/results/', views.results, name='results'),
    # ex: /webapp/5/vote/
    path('<int:question_id>/vote/', views.vote, name='vote'),
]

Similar to the template `index.html`, we should add the template `webapp/templates/webapp/detail.html`:
<h1>{{ question.question_text }}</h1>
<ul>
{% for choice in question.choice_set.all %}
    <li>{{ choice.choice_text }}</li>
{% endfor %}
</ul>

All the dynamic code in the HTML is easy to understand, so I won’t spend time explaining it.

Now you can access http://localhost:8000/webapp/ to display the results. The whole process can be described like this:
1. The client sends a request to the server.
2. Django parses the URL using ROOT_URLCONF = 'mysite.urls' in mysite/settings.py, which points to mysite.urls.
3. According to the urlpatterns in mysite/urls.py, the request is transferred to the webapp folder.
4. The request is handled by webapp/urls.py, which points to the different functions in webapp/views.py. Here the second parameter of a function such as results comes from the URL pattern.
5. The view parses and handles the request, then retrieves a template from templates/webapp/.
6. An HttpResponse or Http404 goes back to the client.

In a nutshell, urls.py handles URL patterns and sends the request to views.py; views.py calls models.py and the templates to send the response back.

Get Rid of ETL, Move to Spark

ETL is the most common tool in the process of building an EDW, and of course the first step in data integration. As big data emerges, we find more and more customers starting to use Hadoop and Spark. Personally, I agree with the idea that Spark will replace most ETL tools.

Background

  • Business Intelligence -> big data
  • Data warehouse -> data lake
  • Applications -> Micro services

ETL hell

  • Data gets out of sync; each copy is a risk.
  • Performance issues and wasted server resources (peak performance), although ETL can do limited parallel work.
  • Plain-text code in hidden stages (typically VB or Java).
  • CSV files are not type safe.
  • All-or-nothing approach in batch jobs.
  • Legacy code.

Spark for ETL

  • parallel processing is built in
  • streaming can be used to parallelize ETL
  • Hadoop is the data source, so we don’t need copies, which reduces risk
  • just one codebase (Scala or Python)
  • machine learning included
  • security, unit testing, performance measurement, exception handling, monitoring

Code Demo

  1. Simple one
(spark.read.json("/sourcepath")    # Extract
    .filter(...)                   # Transform (this and the lines below)
    .agg(...)
    .write.mode("append")          # Load
    .parquet("/outputpath"))

2. Stream

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# @param1: master
# @param2: app name
sc = SparkContext("local[2]", "NetworkWordCount")
# @param1: spark context
# @param2: batch interval in seconds
ssc = StreamingContext(sc, 1)
stream = ssc.textFileStream("path")
# do transform
# do load
ssc.start()
ssc.awaitTermination()

Reference:
1. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
2. https://databricks.com/session/get-rid-of-traditional-etl-move-to-spark
3. https://www.slideshare.net/databricks/building-robust-etl-pipelines-with-apache-spark