From Qlikview to Tableau: a comparison from a developer’s viewpoint

— 03/26/2019 version 0.1

In the field of BI tools, the three most widely used are Tableau, PowerBI and Qlik. According to Gartner’s report, all of them are leaders in the market. Although there are some differences in features and operation, I think each of them can handle most typical visualization work. (Don’t compare them with D3 or matplotlib; the complexity and target customers are totally different.)

Tableau, Qlikview and PowerBI are leaders in the market

The downside of this situation for a BI developer or data analyst is that you have to learn all of them, since you never know your company’s or clients’ environment in advance. So, as Einstein said:

“The measure of intelligence is the ability to change.”

Albert Einstein

I first touched Qlikview almost 7 years ago. Compared with SSRS, it brought new ideas and quick access to BI. Three years ago I started to use Tableau, and I never thought development could be so easy and fast. In this post, as a developer who has used both, I will share some ideas by comparing Qlikview and Tableau on Performance, Visualization, Suitability, ETL, LOD or Set analysis, Table calculation, Tooltips, Sets/Filters/Groups, etc.

  • Performance. Qlikview and Tableau both perform well. They use in-memory technology to speed things up. You won’t feel much difference when handling small or medium-sized data (less than 1 million). Above this size, Tableau is slower than Qlikview, since Qlikview relies only on your RAM while Tableau uses a cube and RAM. The speed of Qlikview depends more on the design of your data model: the more sync tables there are, the more computation is needed to refresh data, and that is the usual reason for a slow Qlikview.
  • Visualization. Tableau provides a stylish, simple drag-and-drop way to work with dashboards. It is very quick to develop a new dashboard without doing much modeling work. Qlikview provides relatively complex but flexible dashboards and tables. Yes, you are right, “Table” is much better in Qlikview. Another advantage is drill-down: the cyclic drill-down feature makes it convenient for managers to find the key points at a lower level. However, as for maps, I have to say Qlikview has much to learn from Tableau. In Tableau, you can: 1. build a map by country, state, or latitude and longitude; 2. import a spatial file; 3. use a custom image as the map and set x, y coordinates on it.
Use x, y coordinate in a custom image
  • Suitability. In a nutshell, Tableau is good for both end users and IT; Qlikview is good for IT. Tableau is better for developing dashboards rapidly for specific purposes, like sales growth analysis. Qlikview is better for building an enterprise BI solution (all in one). This is easy to understand, since Qlikview is built on its data model, which connects everything together and surfaces it in the UI.
  • ETL. Both companies say their tools can do part of the ETL work. In my opinion, neither can replace a dedicated ETL tool; their ETL features are simple and amateur. Qlikview may do a little better thanks to incremental loading with QVD files. Without incremental loading, Tableau cannot handle large data sets. I hope this is solved soon.
use QVD files for incremental loading
  • LOD or Set analysis. Level of Detail (Tableau) and Set analysis (Qlikview) are my favorite features. Both let us control one or more dimensions. The only difference is that set analysis also lets the developer fix the value of a specific dimension.
LOD in Tableau
Set analysis in Qlikview
  • Table calculation. This feature and Tooltips are my two favorites in Tableau. In Qlikview, apart from simple percentages and cumulative sums, you have to code it yourself, like this rolling sum: “sum(aggr(rangesum(above(total sum({<Month=>}Amount),0,3)),Month))”. In Tableau, however, Table calculation gives a more convenient experience. You can also choose whether the calculation is scoped to the table or to the pane.
logic of custom table calculation in Tableau
  • Tooltips. This is my second favorite feature in Tableau. A long time ago, I was hoping Qlikview would provide sub-charts in tooltips. But so far it still supports only text in tooltips.
A sub-chart in a tooltip gives the end user more relevant information than text alone.
  • Filter, Set, Group. Qlikview has no directly corresponding concepts. A filter is just a field in the control panel; a set is much like a bookmark; and grouping is mostly done in the script. In Tableau, you need to set each filter repeatedly along with its working scope; a set is a dynamic subset of the data, and you can define a computed set or an in/out set at the global or region level; a group is a static subset at the region level. From a Qlikview point of view, “Set” is hard to understand, but it is essentially just a True/False flag.
Set detail members of in/out in set with parameters
  • Others. Qlikview can do lots of work in its script, I mean everything you can imagine, since its module function includes VBScript. It seems Tableau can use JScript as well, but you won’t want to use it. Tableau provides the “Story” feature, so users don’t need to export to PPT for a second round of development.

I didn’t mention some software or hardware features, like rapid prototyping ability or device support. I will cover them in a future post.

Brief analysis on recommendation system of Netflix & YouTube

–03/19/2019 version 0.1


Last week, my wife told me she had logged into my Netflix account, and she knew immediately it was not hers since the items did not match her tastes. This sparked my interest in the recommendation systems of Netflix and YouTube, which are the most watched channels in the US (maybe Spotify works the same way). Here I want to give a brief analysis of how they work.

Basic

First, let’s take a quick look at the different approaches (that I know of) used in recommendation systems.

Popularity. This is the simplest approach, based on page views (PV). It works very well for new users and avoids the “cold start” problem. However, the downside is that it cannot provide personalized recommendations. One way to improve it is to add some categories at the beginning so that users can filter by category themselves.
Collaborative filtering (CF). CF algorithms are based on the idea that if two clients have similar rating histories, they will behave similarly in the future (Breese, Heckerman, and Kadie, 1998). CF can be split into two subcategories: memory-based and model-based.

  • The memory-based approach can be divided into user-based and item-based. They find similar users or similar items, respectively, in terms of Pearson correlation.
    • User-based.
      1. Build correlation matrix S which is symmetric.

            \[S(i,k)=\frac{\sum_j (v_{ij}-\bar{v}_i)(v_{kj}-\bar{v}_k)}{\sqrt{\sum_j (v_{ij}-\bar{v}_i)^2\sum_j (v_{kj}-\bar{v}_k)^2}}\]

      2. Select the top k users who have the largest scores.
      3. Identify items that similar users like but the prediction user has not seen before. The prediction for user i on item j is based on the weighted combination of the selected neighbors’ ratings, where k runs over the n selected neighbors.

            \[p(i,j)=\bar{v}_i+\frac{\sum_{k=1}^{n}(v_{kj}-\bar{v}_k)\times S(i,k)}{\sum_{k=1}^{n}|S(i,k)|}\]

      4. Pick the top N movies based on the predicted rating.
    • Item-based.
      1. Build correlation matrix S based on items. (similar to user-based)
      2. Get the top n movies that the prediction user has watched and rated before.
      3. Return the movies that are most related to these n movies and that the prediction user has never watched.
    • In the real world. The number of users grows faster than the number of items, and users change easily. So item-based is the most frequently used.
  • The model-based approach is based on matrix factorization, which is popular in dimension reduction. Here we use singular value decomposition (SVD) to explain.

        \[X_{n\times m}=U_{n\times r}\cdot S_{r\times r}\cdot V_{m\times r}^T\]

    , where U represents the feature vectors of the users in the latent space with dimension r, and V represents the feature vectors of the items in the same latent space. Once we find U, S and V, we can calculate any p(i,j) as U_i \cdot S \cdot V_j^T.
  • CF is based on historical data, so it has the “cold start” problem. The accuracy of prediction also depends on the amount of data, since the CF matrix is sparse; e.g., a few mistaken ratings will seriously affect the prediction.
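
The user-based steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under my own assumptions: the users, items and ratings are made up, every other user who rated the item is treated as a neighbor (no top-k cut), and the denominator uses the sum of absolute similarities, which is a common normalization.

```python
from math import sqrt

# Toy ratings: user -> {item: rating}. All names and values are illustrative.
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 2, "C": 3, "D": 5},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
}

def mean(user):
    """Average rating of a user over everything they rated."""
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(u, k):
    """S(u, k): Pearson correlation over the items both users rated."""
    common = set(ratings[u]) & set(ratings[k])
    if not common:
        return 0.0
    mu, mk = mean(u), mean(k)
    num = sum((ratings[u][j] - mu) * (ratings[k][j] - mk) for j in common)
    den = sqrt(sum((ratings[u][j] - mu) ** 2 for j in common)
               * sum((ratings[k][j] - mk) ** 2 for j in common))
    return num / den if den else 0.0

def predict(u, item):
    """p(u, item): the user's mean plus similarity-weighted neighbor deviations."""
    neighbours = [k for k in ratings if k != u and item in ratings[k]]
    num = sum((ratings[k][item] - mean(k)) * pearson(u, k) for k in neighbours)
    den = sum(abs(pearson(u, k)) for k in neighbours)
    return mean(u) + (num / den if den else 0.0)

print(pearson("alice", "bob"), pearson("alice", "carol"))
print(predict("alice", "D"))
```

Here alice agrees with bob and disagrees with carol, so bob’s high rating for D and carol’s low one both push alice’s predicted rating for D above her own average.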

Content-based (CB). This approach is based on information about the item itself rather than only the ratings used in CF. We need to create metadata for the items. This metadata can be tagged manually or extracted automatically as keywords with TF-IDF. Then we connect the items the prediction user liked with items that have similar metadata. CB avoids the “cold start” and “over-recommendation” problems; however, it is hard to maintain the metadata and keep it accurate.
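
To make the content-based idea concrete, here is a small sketch that builds TF-IDF vectors from item descriptions and returns the item most similar to one the user liked. The item names, descriptions and helper functions are all hypothetical, and cosine similarity is my choice of similarity measure; a real system would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter

# Hypothetical item metadata; in practice these would be tags or synopses.
items = {
    "Movie A": "space robot war future",
    "Movie B": "robot love future city",
    "Movie C": "romantic comedy wedding city",
}

def tfidf_vectors(docs):
    """Return {name: {term: tf-idf weight}} for whitespace-tokenized documents."""
    tokens = {name: text.split() for name, text in docs.items()}
    n_docs = len(tokens)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for name, words in tokens.items():
        counts = Counter(words)
        vectors[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(liked, docs):
    """Name of the item whose metadata is closest to the liked item."""
    vectors = tfidf_vectors(docs)
    scores = {name: cosine(vectors[liked], vec)
              for name, vec in vectors.items() if name != liked}
    return max(scores, key=scores.get)

print(most_similar("Movie A", items))
```

No ratings are involved at all, which is why this approach still works for an item nobody has rated yet.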

Hybrid. It combines CF and CB. We can merge the predictions together or set different weights in different scenarios.

Deep Learning. On large-scale datasets, it is hard to use a traditional recommendation system because of the 4 Vs (volume, variety, velocity, and veracity). Deep learning models are good at solving complex problems (A review on deep learning for recommender systems: challenges and remedies). We will introduce the deep learning model used by YouTube in a later section.

Netflix

First I went to Netflix to find information provided on the official website. Fortunately, there was a topic, How Netflix’s Recommendations System Works. They don’t give much detail about the algorithms, but they do provide clues about which information they use to predict users’ choices. Below is their explanation:

We estimate the likelihood that you will watch a particular title in our catalog based on a number of factors including:

  • your interactions with our service (such as your viewing history and how you rated other titles),

  • other members with similar tastes and preferences on our service (more info here), and

  • information about the titles, such as their genre, categories, actors, release year, etc.

So we can guess it is a hybrid approach that combines CF (item-based and user-based) and CB. But we don’t know how they design it at this point. Let’s keep reading from the official website.

In addition to knowing what you have watched on Netflix, to best personalize the recommendations we also look at things like:

  • the time of day you watch,

  • the devices you are watching Netflix on, and

  • how long you watch.

These activities are not mentioned in the Basic section. They are all used as input vectors for the deep learning model, which we will see in the YouTube section.

It also mentions the “cold start” problem:

When you create your Netflix account, or add a new profile in your account, we ask you to choose a few titles that you like. We use these titles to “jump start” your recommendations. Choosing a few titles you like is optional. If you choose to forego this step then we will start you off with a diverse and popular set of titles to get you going.

It’s clear they use the popularity approach with categories to solve the “cold start” problem. As the user accumulates more history, Netflix replaces the initial approach with other ones.

They also personalize the rows and the titles inside them:

In addition to choosing which titles to include in the rows on your Netflix homepage, our system also ranks each title within the row, and then ranks the rows themselves, using algorithms and complex systems to provide a personalized experience. …. In each row there are three layers of personalization:

  • the choice of row (e.g. Continue Watching, Trending Now, Award-Winning Comedies, etc.)
  • which titles appear in the row, and
  • the ranking of those titles.

They calculate a score for each item for each user, then sum these scores within each category to decide the order of the rows. As I said, we don’t yet know how they mix CB and CF to get the score of each item. But they are mixed for sure.

 

YouTube

As a Google product, it is not surprising that YouTube uses deep learning for its recommendation system. It is too large in terms of both users and items, and a simple statistical model cannot handle it well. In the paper “Deep Neural Networks for YouTube Recommendations”, the authors explain how they apply DL to YouTube.

 

It has two parts: candidate generation and ranking. The first filters hundreds of candidates from millions of videos; the second sorts them using additional scenario and video feature information. Let’s see how they work:

  • Candidate generation. Since this stage filters from millions of videos, it only uses user activities and scenario information. The basic idea is to get the probability of watching a specific video i given user U and context C, where V is the set of all videos.

        \[P(w_t=i|U,C)=\frac{e^{v_i u}}{\sum_{j\in V}e^{v_j u}}\]

    The key point is to get the user vector u and the video vectors v. To get the user vector u, the authors embed the watched videos and search tokens, average them into a watch vector and a search vector, and then combine these with other geographic, video age and gender vectors and pass them through 3 fully connected ReLU layers. The output is the user vector u. To get the video vectors v, we use u to predict the probabilities for all v through a softmax. After training, the video vectors v are what we want. At serving time, we only need to put u and v together to find the top N videos with the highest probability.
  • Ranking. Compared with candidate generation, the number of videos is much smaller, so we can put more video features into the embedding vectors. These features mostly focus on the scenario, like the topic of the video, how many videos the user has watched under each topic, and the time since the last watch. It embeds categorical features with shared embeddings and feeds continuous features in normalized form together with their powers.
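
The serving step of candidate generation can be sketched as a softmax over dot products between the user vector and the video vectors. The 3-dimensional embeddings below are made up for illustration; in the paper they come out of the trained network, and the function names are my own.

```python
import math

def softmax_scores(user_vec, video_vecs):
    """P(watch = i | u) = exp(v_i . u) / sum_j exp(v_j . u)."""
    logits = {vid: sum(a * b for a, b in zip(vec, user_vec))
              for vid, vec in video_vecs.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {vid: math.exp(z - top) for vid, z in logits.items()}
    total = sum(exps.values())
    return {vid: e / total for vid, e in exps.items()}

def top_n(user_vec, video_vecs, n):
    """Ids of the n videos with the highest watch probability."""
    probs = softmax_scores(user_vec, video_vecs)
    return sorted(probs, key=probs.get, reverse=True)[:n]

# Made-up embeddings, for illustration only.
videos = {"v1": [0.9, 0.1, 0.0], "v2": [0.1, 0.8, 0.1], "v3": [0.7, 0.2, 0.1]}
user = [1.0, 0.0, 0.2]

print(softmax_scores(user, videos))
print(top_n(user, videos, 2))
```

Because the softmax is monotonic in the dot product, ranking by v·u alone gives the same top N, so with millions of videos a nearest-neighbor search over the video vectors can stand in for the full softmax.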

Reference:

  • Deep Neural Networks for YouTube Recommendations, Paul Covington, Jay Adams, Emre Sargin, https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf
  • How Netflix’s Recommendations System Works, n/a, https://help.netflix.com/en/node/100639
  • Finding the Latent Factors | Stanford University, https://www.youtube.com/watch?v=GGWBMg0i9d4&index=56&list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV
  • Recommendation System for Netflix, Leidy Esperanza MOLINA FERNÁNDEZ, https://beta.vu.nl/nl/Images/werkstuk-fernandez_tcm235-874624.pdf
  • How far have recommendation algorithms come? Take a look at this and you will know! (in Chinese), 章华燕, https://mp.weixin.qq.com/s?__biz=MzIzNzA4NDk3Nw==&mid=2457737060&idx=1&sn=88ef898f5054ae9b8cb005c31b65ee2d&chksm=ff44bf3ac833362c2436002be265c390b033d3e7709553fd8b603e6269d58f366f689beb2639&mpshare=1&scene=1&srcid=#

Love, Death & Robots

More than 80 per cent of the TV shows and movies people watch on Netflix are discovered through the platform’s recommendation system. We do not choose what to watch; the algorithm does.
Of all the services I use or have used, including YouTube, Spotify, Netflix, Hulu, etc., the best and stickiest ones are Spotify and Netflix. But why? Personally, I think these may be the reasons:
1. The algorithm combines the recommendation system with user preferences. YouTube, by contrast, uses automatic recommendation too aggressively, which means it always shows me some things I don’t like. Netflix and Spotify are smarter: we can choose the categories we like at the beginning, or they show us the main categories, which makes it easier to choose. In other words, they offer a new user something rather than nothing.
2. Netflix has hired real-life humans to categorize every TV show and movie and apply tags to each of them in order to create hyperspecific micro-genres such as “Visually-striking nostalgic dramas” or “Understated romantic road trip movies”. Since the size of the catalog is controlled by Netflix itself, manual tagging is more precise than automatic tagging.
3. They create original movies based on big data. Okay, maybe that’s not good. But they did, and we love the movies.

Text Auto Summarization(Extraction)

Recently I was asked to research ways to summarize text automatically. I am sharing some of my findings here and hope they are helpful.

Summarization Methods

We can classify summarization methods into different types by input type, purpose and output type. Typically, extractive and abstractive are the most common approaches.


Here we would like to introduce two extractive methods. One is stats-based, the other is deep learning-based.

Stats-based

  1. Idea: give each word a weighted frequency. For each sentence, sum the weighted frequencies of the words inside it. Then pick the sentences in order of that sum.
  2. Steps
    2.1 Preprocessing: replace extra whitespace characters or delete the parts we do not need to analyze.
replace = {
    ord('\f'): ' ',
    ord('\t'): ' ',
    ord('\n'): ' ',
    ord('\r'): None
}
# str.translate returns a new string, so keep the result
content = data.translate(replace)

2.2 Tokenizing the sentence

import nltk
sentence_list = nltk.sent_tokenize(content)

2.3 Get frequency of each word

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
# lowercase the text so the counts match the lowercased sentences scored later
for word in nltk.word_tokenize(content.lower()):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

2.4 Weighted frequency of occurrence

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

2.5 Calculate the sum of weighted frequencies for each sentence

sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            # only score sentences shorter than 30 words
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

2.6 sort sentences in descending order of sum

import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)
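
For readers who want to try the idea without installing NLTK, steps 2.1 to 2.6 can be put together with the standard library only. This is a rough sketch under my own assumptions: the sentence splitter is a naive regular expression, the stop-word list is a tiny made-up one, and the chosen sentences are returned in their original order rather than in score order.

```python
import heapq
import re
from collections import Counter

# A tiny illustrative stop-word list; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}

def summarize(text, n_sentences=2, max_words=30):
    # 2.1 / 2.2: normalize whitespace, then split into sentences naively.
    text = re.sub(r"\s+", " ", text).strip()
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # 2.3: word frequencies without stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    if not freq:
        return ""
    # 2.4: weighted frequency = count / highest count.
    top = max(freq.values())
    weights = {w: c / top for w, c in freq.items()}
    # 2.5: score each sentence, skipping very long ones.
    scores = {}
    for sent in sentences:
        if len(sent.split()) < max_words:
            scores[sent] = sum(weights.get(w, 0.0)
                               for w in re.findall(r"[a-z']+", sent.lower()))
    # 2.6: keep the best sentences, in the order they appear in the text.
    best = heapq.nlargest(n_sentences, scores, key=scores.get)
    return " ".join(s for s in sentences if s in best)

sample = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarize(sample, n_sentences=1))
```

Keeping the original order usually reads more naturally than joining by score, but either choice fits the method.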

Deep Learning-based

  1. Idea: vectorize each sentence into a high-dimensional space, cluster the vectors using k-means, and pick the sentence closest to the center of each cluster to form the summary of the text.
  2. Steps:
    2.1 Preprocessing and tokenizing the sentences (same as the stats-based method)
    2.2 Skip-Thought Encoder


Encoder Network: The encoder is typically a GRU-RNN which generates a fixed length vector representation h(i) for each sentence S(i) in the input.
Decoder Network: The decoder network takes this vector representation h(i) as input and tries to generate two sentences — S(i-1) and S(i+1), which could occur before and after the input sentence respectively.

These learned representations h(i) are such that embeddings of semantically similar sentences are closer to each other in vector space, and therefore are suitable for clustering.


Skip-Thoughts Architecture

import skipthoughts
# You would need to download pre-trained models first
model = skipthoughts.load_model()
encoder = skipthoughts.Encoder(model)
encoded =  encoder.encode(sentences)

2.3 Clustering

import numpy as np
from sklearn.cluster import KMeans

n_clusters = int(np.ceil(len(encoded)**0.5))  # KMeans needs an integer
kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

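2.4 Summarization

from sklearn.metrics import pairwise_distances_argmin_min

# average position of each cluster's sentences in the original text
avg = []
for j in range(int(n_clusters)):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))
# index of the sentence closest to each cluster center
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
# order the chosen sentences by where their cluster appears in the text
ordering = sorted(range(int(n_clusters)), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])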

Reference:

  1. Unsupervised Text Summarization using Sentence Embeddings, https://medium.com/jatana/unsupervised-text-summarization-using-sentence-embeddings-adb15ce83db1
  2. Skip-Thought Vectors, https://arxiv.org/abs/1506.06726
  3. Text Summarization with NLTK in Python, https://stackabuse.com/text-summarization-with-nltk-in-python/

Math in Machine Learning

Linear Algebra

  • mathematics of data: multivariate statistics, least squares, variance, covariance, PCA
  • equation: y = A \cdot b, where A is the data matrix, b is the vector of coefficients and y is the vector of the dependent variable
  • application in ML
    1. Dataset and Data Files
    2. Images and Photographs
    3. One Hot Encoding: a one-hot encoding is a representation of categorical variables as binary vectors. encoded = to_categorical(data)
    4. Linear Regression
    5. Regularization. L1 and L2
    6. Principal Component Analysis. PCA
    7. Singular-Value Decomposition. SVD. M=U \cdot S \cdot V^T
    8. Latent Semantic Analysis. LSA. Typically we use tf-idf rather than raw term counts. Through SVD, we can find the different documents with the same topic or the different terms with the same topic
    9. Recommender Systems.
    10. Deep Learning

Numpy

  • array broadcasting
    1. add a scalar or a one-dimensional array to a matrix. y = A + b, where b is broadcast.
    2. it only works when the sizes of each dimension in the arrays are equal or one of them is 1.
    3. The dimensions are compared in reverse order, starting with the trailing dimension.
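
The three broadcasting rules above can be checked with a few small arrays. The array values are arbitrary; only standard NumPy behavior is used.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
b = np.array([10, 20, 30])       # shape (3,), treated as (1, 3)
c = np.array([[100], [200]])     # shape (2, 1)

row_added = A + b       # b is repeated along the rows
col_added = A + c       # c is repeated along the columns
scalar_added = A + 1    # a scalar is broadcast to every element

# Trailing dimensions 3 and 2 are neither equal nor 1, so this must fail.
try:
    A + np.array([1, 2])
    failed = False
except ValueError:
    failed = True

print(row_added)
print(col_added)
print(failed)
```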

Matrices

  • Vector
    1. lowercase letter. \upsilon = (\upsilon_1, \upsilon_2, \upsilon_3)
    2. Addition, Subtraction
    3. Multiplication, Division (element-wise, same length): a*b or a / b
    4. Dot product: a\cdot b
  • Vector Norm
    1. Definition: the length of a vector
    2. L1. Manhattan Norm. L_1(\upsilon)=|\upsilon_1| + |\upsilon_2| + |\upsilon_3| python: norm(vector, 1). Used to keep the coefficients of a model small
    3. L2. Euclidean Norm. L_2(\upsilon)=\sqrt{\upsilon_1^2+\upsilon_2^2+\upsilon_3^2} python: norm(vector)
    4. Max Norm. L_{max}(\upsilon)=max(|\upsilon_1|,|\upsilon_2|,|\upsilon_3|) python: norm(vector, inf)
  • Matrices
    1. uppercase letter. A=((a_{1,1},a_{1,2}),(a_{2,1},a_{2,2}))
    2. Addition, Subtraction (same dimensions)
    3. Multiplication, Division (element-wise, same dimensions)
    4. Matrix dot product. If C=A\cdot B, the number of columns (n) of A needs to equal the number of rows (m) of B. python: A.dot(B) or A@B
    5. Matrix-Vector dot product. C=A\cdot \upsilon
    6. Matrix-Scalar. element-wise multiplication
    7. Type of Matrix
      1. square matrix. m=n. easy to add, multiply, rotate
      2. symmetric matrix. M=M^T
      3. triangular matrix. python: tril(matrix) or triu(matrix) for a lower or upper triangular matrix
      4. diagonal matrix. only the main diagonal has non-zero values; does not have to be a square matrix. python: diag(vector)
      5. identity matrix. Does not change a vector when multiplied by it. notation: I^n. python: identity(dimension)
      6. orthogonal matrix. Two vectors are orthogonal when their dot product is zero: \upsilon \cdot \omega = 0 or \upsilon \cdot \omega^T = 0, which means the projection of \upsilon onto \omega is zero. An orthogonal matrix is a matrix Q for which Q^T \cdot Q = I
    8. Matrix Operation
      1. Transpose. A^T, with rows and columns flipped. python: A.T
      2. Inverse. A^{-1} where A \cdot A^{-1}=I^n python: inv(A)
      3. Trace. tr(A), the sum of the values on the main diagonal of the matrix. python: trace(A)
      4. Determinant. The determinant of a square matrix is a scalar representation of the volume of the matrix. It tells whether the matrix is invertible (non-zero determinant). det(A) or |A|. python: det(A).
      5. Rank. The number of linearly independent rows or columns (whichever is smaller); the number of dimensions spanned by all vectors in the matrix. python: matrix_rank(A)
    9. Sparse matrix
      1. sparsity score = \frac{count of zero elements}{total elements}
      2. example: word2vector
      3. space and time complexity
      4. Data and preparation
        1. record counts of activity: watch a movie, listen to a song, buy a product. They are usually encoded as one hot, count encoding or TF-IDF
      5. Areas: NLP, recommender systems, computer vision with lots of black pixels.
      6. Solution to represent sparse matrix. reference
        1. Dictionary of keys:  (row, column)-pairs to the value of the elements.
        2. List of Lists: stores one list per row, with each entry containing the column index and the value.
        3. Coordinate List: a list of (row, column, value) tuples.
        4. Compressed Sparse Row: three (one-dimensional) arrays (A, IA, JA).
        5. Compressed Sparse Column: same as CSR, but compressed by column
      7. example
        1. convert to sparse matrix python: csr_matrix(dense_matrix)
        2. convert to dense matrix python: sparse_matrix.todense()
        3. sparsity = 1.0 - count_nonzero(A) / A.size
    10. Tensor
      1. multidimensional array.
      2. algorithms are similar to those for matrices
      3. dot product: python: tensordot()
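
A few of the notes above (vector norms, inverse, trace, determinant, rank and the sparsity score) can be tried out with NumPy. The vectors and matrices are arbitrary examples I picked so the results are easy to verify by hand; the coordinate list at the end is built manually just to show the idea, not with SciPy’s sparse classes.

```python
import numpy as np
from numpy.linalg import norm, det, inv, matrix_rank

v = np.array([3.0, -4.0, 0.0])
l1 = norm(v, 1)           # |3| + |-4| + |0|
l2 = norm(v)              # sqrt(3^2 + 4^2 + 0^2)
lmax = norm(v, np.inf)    # largest absolute component

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
A_inv = inv(A)            # A @ A_inv is the identity matrix
trace = np.trace(A)       # sum of the main diagonal
determinant = det(A)      # non-zero, so A is invertible
rank = matrix_rank(A)     # number of linearly independent rows/columns

S = np.array([[1, 0, 0, 0],
              [0, 0, 2, 0],
              [0, 0, 0, 0]])
# Sparsity = share of zero elements.
sparsity = 1.0 - np.count_nonzero(S) / S.size
# Coordinate-list form: one (row, column, value) tuple per non-zero entry.
coo = [(int(r), int(c), int(S[r, c])) for r, c in zip(*np.nonzero(S))]

print(l1, l2, lmax)
print(trace, determinant, rank)
print(sparsity, coo)
```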

Factorization

  • Matrix Decompositions
    1. LU Decomposition
      1. square matrix
      2. A = P \cdot L \cdot U, where L is a lower triangular matrix, U is an upper triangular matrix, and P is a permutation matrix used to permute the result or return it to the original order.
      3. python: lu(square_matrix)
    2. QR Decomposition
      1. m*n matrix
      2. A = Q \cdot R, where Q is a matrix of size m*m and R is an upper triangular matrix of size m*n.
      3. python: qr(matrix)
    3. Cholesky Decomposition
      1. square symmetric matrix where all eigenvalues are greater than zero (positive definite)
      2. A = L\cdot L^T=U^T\cdot U, where L is a lower triangular matrix and U is an upper triangular matrix.
      3. about twice as fast as LU decomposition.
      4. python: cholesky(matrix)
    4. EigenDecomposition
      1. eigenvector: A\cdot \upsilon = \lambda\cdot \upsilon, where A is the matrix we want to decompose, \upsilon is an eigenvector and \lambda is the eigenvalue (a scalar)
      2. a matrix can have one eigenvector and eigenvalue for each dimension, so the matrix A can be written as a product of its eigenvalues and eigenvectors: A = Q \cdot \Lambda \cdot Q^{-1}, where Q is the matrix of eigenvectors and \Lambda is the diagonal matrix of eigenvalues (for a symmetric A with orthonormal eigenvectors, Q^{-1} = Q^T). This equation also means that if we know the eigenvalues and eigenvectors we can reconstruct the original matrix.
      3. python: eig(matrix)
    5. SVD (singular value decomposition)
      1. A = U\cdot \Sigma \cdot V^T, where A is m*n, U is an m*m matrix, \Sigma is an m*n diagonal matrix holding the singular values, and V^T is an n*n matrix.
      2. python: svd(matrix)
      3. reduce dimension
        1. select the top k largest singular values in \Sigma
        2. B = U\cdot \Sigma_k \cdot V_k^T, where the k columns are selected from \Sigma and the k rows from V^T; B is an approximation of the original matrix A.
        3. python: TruncatedSVD(n_components=2)
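
The dimension-reduction step with SVD can be sketched directly with NumPy. The matrix below is an arbitrary rank-2 example, and rank_k_approx is a helper name I made up; note that this sketch also truncates U to its first k columns so that the shapes line up.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])   # shape (4, 3), rank 2

# s holds the singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """B = U_k . Sigma_k . V_k^T using the k largest singular values."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

B1 = rank_k_approx(1)
B2 = rank_k_approx(2)
err1 = np.linalg.norm(A - B1)   # error of the rank-1 approximation
err2 = np.linalg.norm(A - B2)   # close to zero because A has rank 2

# Reduced representation of the rows of A in 2 dimensions.
T = U[:, :2] * s[:2]

print(s)
print(err1, err2)
print(T.shape)
```

T is the same thing as projecting A onto the first two right singular vectors (A @ Vt[:2].T), which is the kind of reduced data a truncated SVD transform returns.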

Stats

  • Multivariate stats
    1. variance: \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n}(x_i-\bar{x})^2, python: var(vector, ddof=1)
    2. standard deviation: s = \sqrt{\sigma^2}, python: std(M, ddof=1, axis=0)
    3. covariance: cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), python: cov(x,y)[0,1]
    4. correlation: cor(x,y) = \frac{cov(x,y)}{s_x \cdot s_y}, normalized to a value between -1 and 1. python: corrcoef(x,y)[0,1]
    5. PCA
      1. project high-dimensional data onto a lower-dimensional subspace
      2. steps:
        1. M = mean(A)
        2. C = A-M
        3. V = cov(C)
        4. values, vectors = eig(V)
        5. B = select(values, vectors), ordered by eigenvalue
      3. scikit learn

        from sklearn.decomposition import PCA

        pca = PCA(2) # get two components
        pca.fit(A)
        print(pca.components_) # vectors (principal axes)
        print(pca.explained_variance_) # values (variance along each axis)
        B = pca.transform(A) # transform to new matrix
  • Linear Regression
    1. y = X \cdot b, where b is the coefficient vector and is unknown
    2. linear least squares (similar to MSE): ||X\cdot b - y||^2 = \sum_{i=1}^{m}(\sum_{j=1}^{n}X_{i,j}\cdot b_j - y_i)^2, then b = (X^T\cdot X)^{-1} \cdot X^T \cdot y. Issue: very slow
    3. MSE with SGD
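
The two ways of fitting a linear regression mentioned above can be compared on a tiny example. The data is made up so that y = 1 + 2x holds exactly; the learning rate and number of epochs for SGD are my own guesses that happen to work for this data, not general recommendations.

```python
import numpy as np

# The first column of ones models the intercept.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # y = 1 + 2x

# Closed form: b = (X^T X)^-1 X^T y
b_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# The same least-squares problem solved by NumPy without an explicit inverse.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Stochastic gradient descent on the squared error, one sample at a time.
b_sgd = np.zeros(X.shape[1])
learning_rate = 0.05
for epoch in range(500):
    for x_i, y_i in zip(X, y):
        error = x_i @ b_sgd - y_i
        b_sgd -= learning_rate * 2.0 * error * x_i   # gradient of error^2

print(b_normal, b_lstsq, b_sgd)
```

The closed form needs a matrix inverse, which gets expensive as the number of features grows; SGD avoids it at the cost of tuning the learning rate.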

Reference: Basics of Linear Algebra for Machine Learning, Jason Brownlee, https://machinelearningmastery.com/linear_algebra_for_machine_learning/