Apache Kafka Concepts and Theory.

It’s a little bit late to talk about Kafka, since this technology has been widely used for a long time. These days, I finally has time to learn it ans summary the major concepts inside. In this nutshell, I will split the page into three parts: why do we need it, basic concepts and how it works.

Why do we need Kafka?

Related image
before we use kafka

As we have more and more systems merged into a big network structure. and each system has their own protocols and communication methods. we have to take more time and endure the super complex structure. At the end, the cost will be raised as well.

Kafka as a central messaging bus

Kafka provides a central messaging bus with distributed, resilient and fault tolerant. Each source and target connects to kafka cluster as producers and consumers. They use same protocols to send and receive the messages(key-value and timestamp). We can seem it as a hybrid solution combining message system with distributed retention.

Concepts

Producers publish the data into kafka, it’s source of the data. which could be sensors, laptop, loT, logs etc.

  • producers choose which records to assign to which partition within the topic ( topic is stream of data, like table in database; partition is part of topic. we will talk about them soon)
  • Producers use round-robin or function to choose partition to archive load balance between nodes(brokers) in the cluster.

Brokers are nodes/servers in the cluster. They are key features for kafka features like fault tolerance, high performance.

Thumbnail
  • broker identified by ID
  • each brokers contains certain topic partitions
  • connect any broker = connect to the entire cluster
  • typically 3 or more brokers to achieve redundancy
  • message in a topic are spread across partitions in difference brokers, each partition can be replicated aross multiple brokers

Consumers subscribe the message from kafka cluster.

  • consumer can be a single receiver or a distributed cluster named consumer group
  • a consumer group can read from topic in parallel

Topics are a particular stream of data which identified by its name. Each topic is split into partitions. We can set partition number and replication factor for each topic. Partition is ordered immutable sequence of records(messages) with their id named offset.

  • offset’s order is only guaranteed in one partition
  • data is kept only for a limited time(default is one week).
  • partition is immutable
  • consumer can choose any offset as start reading point, so that each consumer wouldn’t impact others.

Connector is used to connect between topics and app/database. A Source Connector (with help of Source Tasks) is responsible for getting data into kafka while a Sink Connector (with help of Sink Tasks) is responsible for getting data out of Kafka.

Image result for kafka connector
  • stream an entire sql database to kafka
  • stream kafka topics into hdfs
  • recommend to leverage build-in connectors

How does it work?

Producer to kafka

  • Producer chooses topic and partition to send the inbound message into kafka.
  • Topic splits into several partitions. Each partition has one broker as leader and zero or more brokers as follower named ISR (in sync replica) . Only the leader can receive and serve data for partitions. Other brokers will sync the data from the leader.
  • A topic with replication factor N, it can tolerate up to N-1 server failures.

kafka to consumers

  • As we mentioned ahead, consumer can be single node or consumer groups.
  • if consumer number is greater than partitions number, some consumers will be idle. 
  • if consumer number is less than partitions number, some consumers will receive messages from multiple partitions. 
  • if consumer number is equal to partitions number, each consumer reads messages in order from exactly one partition

Data consistency and availability

  • Partition
    • Messages sent to a topic partition will be appended to the commit log in the order they are sent
    • a single consumer instance will see messages in the order they appear in the log (message order only guaranteed in a partition)
    • a message is ‘committed’ when all in sync replicas have applied it to their log
    • any committed message will not be lost, as long as at least one in sync replica is alive.
  • Producer options
    • wait for all in sync replicas to acknowledge the message
    • wait for only the leader to acknowledge the message
    • do not wait for acknowledgement
  • consumer options
    • receive each message at most once
      • restart from the next offset without ever having processed the message
      • potentially message loss
    • receive each message at least once
      • restart and process message again. duplicate messages in downstream,
      • no data loss
      • ( recommended, downside stream handles duplicate message)
    • receive each message exactly once
      • transitional level
      • re read the last transaction committed
      • no data loss and no data duplication
      • significantly decreasing the throughput using a transaction

Architecture

Message structure

Fundamental data flow

Kafka in Azure through HDinsight

Reference

Apache Kafka official website. https://kafka.apache.org/intro.

Kafka in a Nutshell. https://sookocheff.com/post/kafka/kafka-in-a-nutshell/

Apache Kafka Series – Learn Apache Kafka for Beginners v2, https://tpl.udemy.com/course/apache-kafka/learn/lecture/11566878?start=180#overview

What is Apache Kafka in Azure HDInsight, https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction

Eiganvectors from Eiganvalues

This is a unbelievable discovery from PETER B. DENTON, STEPHEN J. PARKE, TERENCE TAO, AND XINING ZHANG. The original paper you can find here. In a short nut, we can get eiganvector through eiganvalues only. This magic formular is like blew:

We only need to know eiganvalues in original matrix then we can calculate its eiganvectors through sub-matrix.

Currently, I don’t know or image what will effect on the road, but the eiganX is the basic for AI, dimension reduction, feature extraction, etc. It also may help us improve the speed to get eiganvector if we need to incrementally add data on a known matrix.

I wrote a very simple python script to prove this formula(surely, it is correct). I think it can archive by GPU as well.

import numpy as np
from numpy import linalg as LA

matrix_size = 6
org_x = np.diag(range(1,matrix_size+1))
w, v = LA.eig(org_x)


print("orgnal matriax: \n %s \n" % org_x)
print("eigan values: \n %s \n" % w)
print("normalized eigenvectors: \n %s \n" % v)

print("START NEW ALGORITHM \n")

result=[]
for _ in range(matrix_size):
    result.append(0)

for n in range(matrix_size):
    for j in range(matrix_size):
        sub_x = np.delete(org_x,j,axis=0)
        sub_x = np.delete(sub_x,j,axis=1)
        w1,v1 = LA.eig(sub_x)

        # in term of new formula to get orignal matrix eigenvecotr through eiganvalue
        numberator = 1
        denominator = 1
        for i in range(matrix_size-1):
            temp_n = w[n] - w1[i]
            numberator = numberator*temp_n

        for i in range(matrix_size):
            if(i!=n):
                temp_d = w[n] - w[i]
                denominator = denominator*temp_d
        result[j] = numberator/denominator

    print("%s \n" % result)

result:

orgnal matriax:
 [[1 0 0 0 0 0]
 [0 2 0 0 0 0]
 [0 0 3 0 0 0]
 [0 0 0 4 0 0]
 [0 0 0 0 5 0]
 [0 0 0 0 0 6]]

eigan values:
 [1. 2. 3. 4. 5. 6.]

normalized eigenvectors:
 [[1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]]

START NEW ALGORITHM

[1.0, -0.0, -0.0, -0.0, -0.0, -0.0]

[0.0, 1.0, -0.0, -0.0, -0.0, -0.0]

[0.0, 0.0, 1.0, -0.0, -0.0, -0.0]

[0.0, 0.0, 0.0, 1.0, -0.0, -0.0]

[0.0, 0.0, 0.0, 0.0, 1.0, -0.0]

[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

Reference:

EIGENVECTORS FROM EIGENVALUES, https://arxiv.org/pdf/1908.03795.pdf

Tao’s Blog, https://terrytao.wordpress.com/2019/08/13/eigenvectors-from-eigenvalues/

Simple way to create Gantt chart for Azure Devops through power BI

usually we use Gantt chart to track the project progress, take the picture blew for example, the dark part means completed, the light part means waiting for being done.

To reach this result, there are two solutions. The steps in azure devops are same for both solutions.

  • Add start date and end date for each tasks. The Admin can change the layout to make these two fields visible on edit page, but I can only change them in backlogs
  • Add Efforts information: estimate time, already done and left hours. The picture blow shows I have done 80% ‘s workload.
  • Create a view for listing all user stories and tasks. ( one time effort for extracting the data)

Power BI steps:

after imported, you will find gantt chart
  • Import data through “Get Data- More – Online Service – Azure Devops(beta)”, then choose the view you created in Devops.
  • set the Task, start date, end date, % complete for Gantt chat.

CRLF vs LF

These days I am coding a Linux shell running for AZcopy, since my laptop is running windows, so I switch between two operations systems. The strange problem came soon:

  1. ./azCopyToAzure.sh: line 7: $’\r’: command not found
  2. source ./config.cfg can not found
  3. when I was using echo ${log_dir}!!!, the result is !!!XXX, basically, the !!! will come to the beginning and overwrite ${log_dir}

After wasting almost two days, I found the problem. It caused by difference between end of sequence in windows and Linux. Here is from wiki:

The term CRLF refers to Carriage Return (ASCII 13, \r) Line Feed (ASCII 10, \n). They’re used to note the termination of a line, however, dealt with differently in today’s popular Operating Systems. For example: in Windows both a CR and LF are required to note the end of a line, whereas in Linux/UNIX a LF is only required. In the HTTP protocol, the CR-LF sequence is always used to terminate a line.

https://www.owasp.org/index.php/CRLF_Injection

In vs code , we have to change this setting at the bottom of right side.

I thought there should not be many person developing bash on windows, but if you are, hope this can solve some of your problems.

Raspberry PI Security

Lots of ready-to-use opensource project can be found on internet for raspberry PI object detection. Most of them can do very well to motion detection or object classification. I am thinking how to merge them together to make a practical security system that can help all of us to make our home safety. Since It is open source, I will share my design, source code and project milestone. ( mostly to push myself to finish it eventually 🙂

Design:

  1. start camera to capture each frame
  2. save the first frame as reference. and this reference frame will be replaced every 5 mins when there is no movement detected.
  3. continuously compare the current frame with the reference frame.
  4. if any movement detected, draw some interesting areas.
  5. loop these interesting areas, if its size out of threshold, move it to next DNN network.
  6. use DNN network to classify the object.
  7. if object is human, trigger dedicated process “Event process”. includes, recording video, send notification and make speaker noise.
  8. loop to the next frame.

Milestone:

Oct 10th-15th: finish basic function to detect movement, recording video and send mail.

Oct 16th-22th: add DNN network

Oct 22th-30th: add speaker and integration test.

Potential update:

  • GPU accelerate on Cuda device
  • GUI
  • restful API

Source:

https://github.com/neoaksa/raspiberry-security

Demo(till Oct 15th):

Reference:

Deep learning: How OpenCV’s blobFromImage works. https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works/

Raspberry Pi: Deep learning object detection with OpenCV. https://www.pyimagesearch.com/2017/10/16/raspberry-pi-deep-learning-object-detection-with-opencv/

how to install opencv on the raspberry pi 3 Model b+ (with camera) https://pysource.com/2018/10/31/raspberry-pi-3-and-opencv-3-installation-tutorial/

Home surveillance and motion detection with the Raspberry Pi, Python, OpenCV, and Dropbox. https://www.pyimagesearch.com/2015/06/01/home-surveillance-and-motion-detection-with-the-raspberry-pi-python-and-opencv/

Our new cloud architecture launched!

After so many discussion, evaluation and testing, we finally launched a basic architecture for Azure cloud. I hid some key words that explain the business flows underneaths. Basically, however it is good for all similar scenarios.

We tried to use tableau to read data from datalake directly, bot spark sql or native databrick JDBC are not stable for large size data(over 10,000,000). So we use RDBMS replace. However, if you already use powerBI, we tried, you can extract data from datalake directly without any problem.

Another thing is standards. Since we have lots of pipelines developed by a team. so we utilize data factory to standardize our components. But you can totally use coding style in databricks.

Git is very helpful in version control. ADF and databricks have provides the GUI and API to connect to Git as well.

One problem we have not solved yet is the storage life cycle which is fine in Blobstorage, but seems not ready in data lake gen2. I think Microsoft would fix it soon.

Set up MySQL on Azure Ubuntu and compare with Azure SQL

I will combine three parts: Create Ubuntu VM & attach data disk, Install and configure MySQL, Performance comparison with Azure SQL.

Create Ubuntu VM

  • Choose your size of VM. Here I used D4s_v3, which has 4 cores and 16GB memory. You need to choose disk for storage and set the initial admin password, I recommend premium SSD.
  • Open SSH access. Go to network, add SSH port 22 into your inbound port rules. Later we will add mySQL port 3306 as well.
  • Mount datadisk. Remember the disk for storage you chosen in the step 1? It wouldn’t mount automatically. So you need to do the following steps in your SSH.
dmesg | grep SCSI
sudo fdisk /dev/sdc
---------------------------------------
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-10485759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-10485759, default 10485759):
Using default value 10485759
Command (m for help): p

Disk /dev/sdc: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2a59b123

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048    10485759     5241856   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
------------------------------------------------
sudo mkfs -t ext4 /dev/sdc1
sudo mkdir /datadrive
sudo mount /dev/sdc1 /datadrive
------------------------------------------------
//if you want to automount, you need to edit fstab file
# retrieve UUID
ls -al /dev/disk/by-uuid/
# edit fstab 
sudo nano /etc/fstab
UUID=<ID> /datadrive auto defaults 0 0 

After these steps, you have done all configurations for Ubuntu. You can check the link by using df -H command.

Install and configure MySQL

  • Install MySQL.
sudo apt-get update
sudo apt-get install mysql-server
  • Allow remote access
sudo ufw enable
sudo ufw allow mysql

then edit “/etc/mysql/mysql.conf.d/mysqld.cnf ” to change bind-address to 0.0.0.0 which allow all ip to remote mySQL.

nano /etc/mysql/mysql.conf.d/mysqld.cnf
  • Start MySQL
sudo systemctl start mysql
  • Add a new root user. You can use any IP address to replace % blew.
CREATE USER '<username>'@'%' IDENTIFIED BY '<user password>';
  • Change the data dictionary. By default, the VM only provides 30GB, you have to use your extra disk to save the database.
# stop service
sudo systemctl stop mysql
# sync to new path
sudo rsync -av /var/lib/mysql /datadrive
# backup 
sudo mv /var/lib/mysql /var/lib/mysql.bak
# change configure files
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
-----------------------
datadir=/datadrive/mysql
--------------------
# configure AppArmor Access Control
sudo nano /etc/apparmor.d/tunables/alias
-----------------------
alias /var/lib/mysql/ -> /datadrive/mysql/,
----------------------
sudo systemctl restart apparmor
# dummy file
sudo mkdir /var/lib/mysql/mysql -p
# restart service
sudo systemctl start mysql

Then you can use select * from @@datadir to check the data dictionary.

Performance comparison with Azure SQL

  40,000,000 rows AzureSQL Ubuntu+mySQL
Write(from databricks) 25mins 31mins
Read(to Tableau) 44mins 12mins

what a surprise! VM mySQL is faster than AzureSQL.

Tips: How to write data into AzureSQL and mySQL through Databricks.

To SQL server:

# you have to load com.microsoft.azure:azure-sqldb-spark:1.0.2 into library first
%scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._


val config = Config(Map(
  "url"          -> "<accountname>.database.windows.net",
  "databaseName" -> "<dbname>",
  "dbTable"      -> "<tablename>",
  "user"         -> "<admin name>",
  "password"     -> "<password name>"
))

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Overwrite).sqlDB(config)

To mySQL:

%scala
val jdbcHostname = "<mysql address>"
val jdbcPort = 3306
val jdbcDatabase = "<dbname>"
val jdbcUsername = "<user name>"
val jdbcPassword ="<password>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

import org.apache.spark.sql.SaveMode


     df.write
     .mode(SaveMode.Overwrite) // <--- Append to the existing table
     .jdbc(jdbcUrl, "<table name>", connectionProperties)

Tableau 2019.3 starts to support Databricks

Since version 2019.3, Tableau starts to support databricks with the native connection driver. So we don’t have to use some wired unknown product or custom API to connect them together. Here is the step how to use it simply.

  • Download the new version of Tableau through pre-release page. (it will come to the official version soon)
  • Download the supported Simba ODBC driver. and install it.
  • After all these two steps, you will find Databrick option in your tableau connect.
Select Databricks connector

The next steps are figuring out the authorization information for databricks.

  • Login Databricks.
  • Go to Cluster – <your selection cluster name> – Advanced Options – JDBC/ODBC

Here you can find Server Hostname and Http path, copy them to authorization page.

For user name and password, you have to use token.

  • find the little logo on the right up corner. Press it then choose user setting.
  • Click “Generate new Token”. there will be a popup windows, save the series code which is your PASSWORD.
  • Enter “token” as username, the code you just got as the password. Then you can see the tables you created in databricks.

For Saving tables in databricks, it is also simple.

df_test = spark.read.format("parquet").load("/mnt/aclaradls01/test/dcu_health.parquet")
df_test.write.mode('overwrite').saveAsTable("df_test") 

Tips:

We can also use spark SQL to connect databricks. And even use databricks as API to connect to datalakes. It pretty much like a cluster running Spark. Here is the steps:

  1. create a view in databricks which links to the mounted datalake path. spark.sql(“create view df_test_view as SELECT * FROM parquet./mnt/<mounted_name>/<folder_name>/“)
  2. Add “spark.hadoop.hive.server2.enable.doAs false” into your cluster spark config.
  3. If you tried to load large size data which takes long time, add spark.executor.heartbeatInterval 10s and spark.network.timeout 9999s to your cluster spark config as well.

Reference:

Tableau with databricks. https://docs.databricks.com/user-guide/bi/tableau.html

How to connect parquet files from Azure through Apache Drill

Parquet format is a very good choice for big data. It is fast, small and distributed. But not all software support this format. So, we have to use Apache Drill as intermediate layer to connect source and target together.

drill query flow

Today, we are going to take tableau as example to explain how to read parquet files from Azure Blobstorage to Tableau dashboard.

  • Install java JDK.
  • Install 7-zip.
  • Setting up windows environment.
    1. Add JAVA_HOME to your environment variable. set the value as ‘C:\progra~1\Java\jdk1.8.0_221‘. Please change the path if you have a different version.
    2. Add %JAVA_HOME%\bin to Path variable.
  • Create UDF directories manually in bash.
    mkdir "%userprofile%\drill"
    mkdir "%userprofile%\drill\udf"
    mkdir "%userprofile%\drill\udf\registry"
    mkdir "%userprofile%\drill\udf\tmp"
    mkdir "%userprofile%\drill\udf\staging"
    takeown /R /F "%userprofile%\drill"
  • download and unzip Apache Drill to your install folder.
  • Use cmd to start Drill.
cd  \apache-drill-1.16.0\bin
drill-embedded.bat

now, you can see apache drill has been running. If you have data files on your local machine, you can simply retrive data by running SQL select * from dfs.<file path>.

  • Download support jar for Azure.Put them into folder ‘apache-drill-1.16.0\jars\3rdparty’
    1. hadoop-azure-2.7.7.jar
    2. azure-storage-8.0.0.jar
  • Configure connection information.
    • goto ‘apache-drill-1.16.0\conf’, create a copy of core-site-example.xml at the same folder, and change the name to core-site.xml. Modify the contents as blew, remember to change STORAGE_ACCOUNT_NAME and AUTHENTICATION_KEY in term of your azure blob storage account:
<property>
  <name>fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
   <value>AUTHENTICATION_KEY</value>
</property>
  • Configure connection information.(con’t)
    • goto ‘http://localhost:8047/storage‘ to open WebUI. Create a new plugin called ‘AZure’, the contents is as same as cp. change the connection and set config as ‘null’ since we already set it up in core-site.xml.
 "type": "file",
 "connection":"wasbs://container_name@STORAGE_ACCOUNT_NAME.blob.core.windows.net",
    "config": null,
  • Try to read parquet files into tableau.
    1. Open Tableau and choose connection Apache Drill. Entry *localhost for Server. Click Signin.
    2. Create a new custom SQL. select * from `az.default`.`parquet file name`.
  • Create view.
    1. Install Drill JDBC Driver.
    2. Go to Start, find Drill Explorer.
    3. Connect to Server. Choose SQL Tab, write SQL here, then click Create as View. Now, you can see the view under dfs.temp.
    4. Tableau can read this view anytime.

Plus: Currently, Drill API didn’t support Datalake Gen2 since it only supports Blob Storage API which is without hierarchy.