CRLF vs LF

These days I am coding a Linux shell running for AZcopy, since my laptop is running windows, so I switch between two operations systems. The strange problem came soon:

  1. ./azCopyToAzure.sh: line 7: $’\r’: command not found
  2. source ./config.cfg can not found
  3. when I was using echo ${log_dir}!!!, the result is !!!XXX, basically, the !!! will come to the beginning and overwrite ${log_dir}

After wasting almost two days, I found the problem. It caused by difference between end of sequence in windows and Linux. Here is from wiki:

The term CRLF refers to Carriage Return (ASCII 13, \r) Line Feed (ASCII 10, \n). They’re used to note the termination of a line, however, dealt with differently in today’s popular Operating Systems. For example: in Windows both a CR and LF are required to note the end of a line, whereas in Linux/UNIX a LF is only required. In the HTTP protocol, the CR-LF sequence is always used to terminate a line.

https://www.owasp.org/index.php/CRLF_Injection

In vs code , we have to change this setting at the bottom of right side.

I thought there should not be many person developing bash on windows, but if you are, hope this can solve some of your problems.

Raspberry PI Security

Lots of ready-to-use opensource project can be found on internet for raspberry PI object detection. Most of them can do very well to motion detection or object classification. I am thinking how to merge them together to make a practical security system that can help all of us to make our home safety. Since It is open source, I will share my design, source code and project milestone. ( mostly to push myself to finish it eventually 🙂

Design:

  1. start camera to capture each frame
  2. save the first frame as reference. and this reference frame will be replaced every 5 mins when there is no movement detected.
  3. continuously compare the current frame with the reference frame.
  4. if any movement detected, draw some interesting areas.
  5. loop these interesting areas, if its size out of threshold, move it to next DNN network.
  6. use DNN network to classify the object.
  7. if object is human, trigger dedicated process “Event process”. includes, recording video, send notification and make speaker noise.
  8. loop to the next frame.

Milestone:

Oct 10th-15th: finish basic function to detect movement, recording video and send mail.

Oct 16th-22th: add DNN network

Oct 22th-30th: add speaker and integration test.

Potential update:

  • GPU accelerate on Cuda device
  • GUI
  • restful API

Source:

https://github.com/neoaksa/raspiberry-security

Demo(till Oct 15th):

Reference:

Deep learning: How OpenCV’s blobFromImage works. https://www.pyimagesearch.com/2017/11/06/deep-learning-opencvs-blobfromimage-works/

Raspberry Pi: Deep learning object detection with OpenCV. https://www.pyimagesearch.com/2017/10/16/raspberry-pi-deep-learning-object-detection-with-opencv/

how to install opencv on the raspberry pi 3 Model b+ (with camera) https://pysource.com/2018/10/31/raspberry-pi-3-and-opencv-3-installation-tutorial/

Home surveillance and motion detection with the Raspberry Pi, Python, OpenCV, and Dropbox. https://www.pyimagesearch.com/2015/06/01/home-surveillance-and-motion-detection-with-the-raspberry-pi-python-and-opencv/

Our new cloud architecture launched!

After so many discussion, evaluation and testing, we finally launched a basic architecture for Azure cloud. I hid some key words that explain the business flows underneaths. Basically, however it is good for all similar scenarios.

We tried to use tableau to read data from datalake directly, bot spark sql or native databrick JDBC are not stable for large size data(over 10,000,000). So we use RDBMS replace. However, if you already use powerBI, we tried, you can extract data from datalake directly without any problem.

Another thing is standards. Since we have lots of pipelines developed by a team. so we utilize data factory to standardize our components. But you can totally use coding style in databricks.

Git is very helpful in version control. ADF and databricks have provides the GUI and API to connect to Git as well.

One problem we have not solved yet is the storage life cycle which is fine in Blobstorage, but seems not ready in data lake gen2. I think Microsoft would fix it soon.

Set up MySQL on Azure Ubuntu and compare with Azure SQL

I will combine three parts: Create Ubuntu VM & attach data disk, Install and configure MySQL, Performance comparison with Azure SQL.

Create Ubuntu VM

  • Choose your size of VM. Here I used D4s_v3, which has 4 cores and 16GB memory. You need to choose disk for storage and set the initial admin password, I recommend premium SSD.
  • Open SSH access. Go to network, add SSH port 22 into your inbound port rules. Later we will add mySQL port 3306 as well.
  • Mount datadisk. Remember the disk for storage you chosen in the step 1? It wouldn’t mount automatically. So you need to do the following steps in your SSH.
dmesg | grep SCSI
sudo fdisk /dev/sdc
---------------------------------------
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-10485759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-10485759, default 10485759):
Using default value 10485759
Command (m for help): p

Disk /dev/sdc: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2a59b123

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1            2048    10485759     5241856   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
------------------------------------------------
sudo mkfs -t ext4 /dev/sdc1
sudo mkdir /datadrive
sudo mount /dev/sdc1 /datadrive
------------------------------------------------
//if you want to automount, you need to edit fstab file
# retrieve UUID
ls -al /dev/disk/by-uuid/
# edit fstab 
sudo nano /etc/fstab
UUID=<ID> /datadrive auto defaults 0 0 

After these steps, you have done all configurations for Ubuntu. You can check the link by using df -H command.

Install and configure MySQL

  • Install MySQL.
sudo apt-get update
sudo apt-get install mysql-server
  • Allow remote access
sudo ufw enable
sudo ufw allow mysql

then edit “/etc/mysql/mysql.conf.d/mysqld.cnf ” to change bind-address to 0.0.0.0 which allow all ip to remote mySQL.

nano /etc/mysql/mysql.conf.d/mysqld.cnf
  • Start MySQL
sudo systemctl start mysql
  • Add a new root user. You can use any IP address to replace % blew.
CREATE USER '<username>'@'%' IDENTIFIED BY '<user password>';
  • Change the data dictionary. By default, the VM only provides 30GB, you have to use your extra disk to save the database.
# stop service
sudo systemctl stop mysql
# sync to new path
sudo rsync -av /var/lib/mysql /datadrive
# backup 
sudo mv /var/lib/mysql /var/lib/mysql.bak
# change configure files
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
-----------------------
datadir=/datadrive/mysql
--------------------
# configure AppArmor Access Control
sudo nano /etc/apparmor.d/tunables/alias
-----------------------
alias /var/lib/mysql/ -> /datadrive/mysql/,
----------------------
sudo systemctl restart apparmor
# dummy file
sudo mkdir /var/lib/mysql/mysql -p
# restart service
sudo systemctl start mysql

Then you can use select * from @@datadir to check the data dictionary.

Performance comparison with Azure SQL

  40,000,000 rows AzureSQL Ubuntu+mySQL
Write(from databricks) 25mins 31mins
Read(to Tableau) 44mins 12mins

what a surprise! VM mySQL is faster than AzureSQL.

Tips: How to write data into AzureSQL and mySQL through Databricks.

To SQL server:

# you have to load com.microsoft.azure:azure-sqldb-spark:1.0.2 into library first
%scala
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._


val config = Config(Map(
  "url"          -> "<accountname>.database.windows.net",
  "databaseName" -> "<dbname>",
  "dbTable"      -> "<tablename>",
  "user"         -> "<admin name>",
  "password"     -> "<password name>"
))

import org.apache.spark.sql.SaveMode

df.write.mode(SaveMode.Overwrite).sqlDB(config)

To mySQL:

%scala
val jdbcHostname = "<mysql address>"
val jdbcPort = 3306
val jdbcDatabase = "<dbname>"
val jdbcUsername = "<user name>"
val jdbcPassword ="<password>"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"

// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()

connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")

import org.apache.spark.sql.SaveMode


     df.write
     .mode(SaveMode.Overwrite) // <--- Append to the existing table
     .jdbc(jdbcUrl, "<table name>", connectionProperties)

Tableau 2019.3 starts to support Databricks

Since version 2019.3, Tableau starts to support databricks with the native connection driver. So we don’t have to use some wired unknown product or custom API to connect them together. Here is the step how to use it simply.

  • Download the new version of Tableau through pre-release page. (it will come to the official version soon)
  • Download the supported Simba ODBC driver. and install it.
  • After all these two steps, you will find Databrick option in your tableau connect.
Select Databricks connector

The next steps are figuring out the authorization information for databricks.

  • Login Databricks.
  • Go to Cluster – <your selection cluster name> – Advanced Options – JDBC/ODBC

Here you can find Server Hostname and Http path, copy them to authorization page.

For user name and password, you have to use token.

  • find the little logo on the right up corner. Press it then choose user setting.
  • Click “Generate new Token”. there will be a popup windows, save the series code which is your PASSWORD.
  • Enter “token” as username, the code you just got as the password. Then you can see the tables you created in databricks.

For Saving tables in databricks, it is also simple.

df_test = spark.read.format("parquet").load("/mnt/aclaradls01/test/dcu_health.parquet")
df_test.write.mode('overwrite').saveAsTable("df_test") 

Tips:

We can also use spark SQL to connect databricks. And even use databricks as API to connect to datalakes. It pretty much like a cluster running Spark. Here is the steps:

  1. create a view in databricks which links to the mounted datalake path. spark.sql(“create view df_test_view as SELECT * FROM parquet./mnt/<mounted_name>/<folder_name>/“)
  2. Add “spark.hadoop.hive.server2.enable.doAs false” into your cluster spark config.
  3. If you tried to load large size data which takes long time, add spark.executor.heartbeatInterval 10s and spark.network.timeout 9999s to your cluster spark config as well.

Reference:

Tableau with databricks. https://docs.databricks.com/user-guide/bi/tableau.html

How to connect parquet files from Azure through Apache Drill

Parquet format is a very good choice for big data. It is fast, small and distributed. But not all software support this format. So, we have to use Apache Drill as intermediate layer to connect source and target together.

drill query flow

Today, we are going to take tableau as example to explain how to read parquet files from Azure Blobstorage to Tableau dashboard.

  • Install java JDK.
  • Install 7-zip.
  • Setting up windows environment.
    1. Add JAVA_HOME to your environment variable. set the value as ‘C:\progra~1\Java\jdk1.8.0_221‘. Please change the path if you have a different version.
    2. Add %JAVA_HOME%\bin to Path variable.
  • Create UDF directories manually in bash.
    mkdir "%userprofile%\drill"
    mkdir "%userprofile%\drill\udf"
    mkdir "%userprofile%\drill\udf\registry"
    mkdir "%userprofile%\drill\udf\tmp"
    mkdir "%userprofile%\drill\udf\staging"
    takeown /R /F "%userprofile%\drill"
  • download and unzip Apache Drill to your install folder.
  • Use cmd to start Drill.
cd  \apache-drill-1.16.0\bin
drill-embedded.bat

now, you can see apache drill has been running. If you have data files on your local machine, you can simply retrive data by running SQL select * from dfs.<file path>.

  • Download support jar for Azure.Put them into folder ‘apache-drill-1.16.0\jars\3rdparty’
    1. hadoop-azure-2.7.7.jar
    2. azure-storage-8.0.0.jar
  • Configure connection information.
    • goto ‘apache-drill-1.16.0\conf’, create a copy of core-site-example.xml at the same folder, and change the name to core-site.xml. Modify the contents as blew, remember to change STORAGE_ACCOUNT_NAME and AUTHENTICATION_KEY in term of your azure blob storage account:
<property>
  <name>fs.azure.account.key.STORAGE_ACCOUNT_NAME.blob.core.windows.net</name>
   <value>AUTHENTICATION_KEY</value>
</property>
  • Configure connection information.(con’t)
    • goto ‘http://localhost:8047/storage‘ to open WebUI. Create a new plugin called ‘AZure’, the contents is as same as cp. change the connection and set config as ‘null’ since we already set it up in core-site.xml.
 "type": "file",
 "connection":"wasbs://container_name@STORAGE_ACCOUNT_NAME.blob.core.windows.net",
    "config": null,
  • Try to read parquet files into tableau.
    1. Open Tableau and choose connection Apache Drill. Entry *localhost for Server. Click Signin.
    2. Create a new custom SQL. select * from `az.default`.`parquet file name`.
  • Create view.
    1. Install Drill JDBC Driver.
    2. Go to Start, find Drill Explorer.
    3. Connect to Server. Choose SQL Tab, write SQL here, then click Create as View. Now, you can see the view under dfs.temp.
    4. Tableau can read this view anytime.

Plus: Currently, Drill API didn’t support Datalake Gen2 since it only supports Blob Storage API which is without hierarchy.

How to build a data pipeline in Databricks

For a long term, I thought there was no pipeline concept in Databricks. For the most engineers they will write the whole script into one notebook rather than split into several activities like in Data factory. In this case, we have to rewirte everything in the script when the next pipeline coming.

But recently, I found there was indeed a way to achieve pipeline in databricks. Here is the document. And follows my understanding.

  • Create widgets as parameters in callee notebook
dbutils.widgets.text("input", "default") # define a text widget in callee notebook
......
dbutils.notebook.exit(defaultText) # set return variable
  • Call callee notebook through caller notebook(we can write our pipeline in this caller notebook, then call all following activity notebooks)
status = dbutils.notebook.run("pipeline",30,{"input":"pass success"}) # transfer input = “pass success” to a notebook called pipeline
  • Callee notebook can return a string to indicate the execution statues( you still remember dbutils.notebook.exit in callee notebook right?)
returned_string = dbutils.notebook.run("pipeline2", 60) 
  • The data exchange between two notebooks can be done with global tempview which locates in memory.
# in callee
sqlContext.range(10).toDF("value").createOrReplaceGlobalTempView("my_data")
dbutils.notebook.exit("my_data")

# in caller
returned_table = dbutils.notebook.run("pipeline2", 60)
global_temp_db = spark.conf.get("spark.sql.globalTempDatabase")
display(table(global_temp_db + "." + returned_table))
  • We can also call function in another notebook, which means we can build a library for all notebook
%run ./aclaraLibary # in this notebook, there are two functions which are overloaded.
simple(1,2)
simple(1,2,3)

First step into Azure loT Edge

loT + AI + Cloud = Edge computing.

This is my understanding of edge computing. As the raising of loT, more and more AI work will be dedicated to the local machines, they may be weaker than the central cloud, but it provides significantly quicker reaction.

Frankly, I only have experience to install the models on raspberry Pi 3 which can be seem as a standalone system. In the real world, there are hundreds of endpoints, it is impossible to maintain them separately, and there is also existing communication between endpoints and cloud to transmit data.

Azure gives its own solution called Azure loT Edge. It builds on the top of loT hub(an bi-direction instant message control component )and has three major components: loT module, loT runtime and cloud interface. Let’s figure out some concepts first, then we tried to build a real systems.

Diagram - Quickstart architecture for device and cloud
Figure 1: loT edge architecture

loT hub: is a managed service, hosted in the cloud, that acts as a central message hub for bi-directional communication between your IoT application and the devices it manages. It provides a interface to manage all the loT edge devices and all the modules for delivery to devices.

loT runtime: hosted on device, has two responsibilities: communication and manage modules. These two roles are performed by two parts: loT edge hub and loT agent. loT edge hub takes responsibility of communication between loT hub and loT device and between loT modules, so we can consider loT edge hub is a local proxy for loT hub. loT agent runs and monitors modules on the device.

loT module: is the smallest unit to achieve the business logic on the device. It could be Azure service( AI, Azure function etc.) or the custom functions written by C#, java, python, node.js…. Modules are docker images pulled from Azure register service or Docker hub. The maximum number of modules is 20 in a single device. Each modules has four elements: docker image, instance, identity and twin. The instance is a running image on the device, identity is initialed when the instance created and it stored in loT hub(cloud) as well, in case we launch the communication between cloud and device. Twin is Json document contains some state information, like metadata, configuration and conditions. There are two types of properties in twin: desired properties and reported properties. They have different write/read authorities for device and cloud.

Figure 2: device twin. tags is used for grouping devices when we delivery modules.

The last thing is deployment manifest, which is a Json document tells device which modules to install and how to configure them. Please find the detail in the comments(start with “//”) blew.

{
  "modulesContent": {
// $edgeAgent indicates the docker version
    "$edgeAgent": {
      "properties.desired": {
        "schemaVersion": "1.0",
        "runtime": {
          "type": "docker",
          "settings": {
            "minDockerVersion": "v1.25",
            "loggingOptions": "",
            "registryCredentials": {}
          }
        },
// system module includes information of imges of edgeAgent & edgeHub, "createOptions" indicates when to launch these images.  
        "systemModules": {
          "edgeAgent": {
            "type": "docker",
            "settings": {
              "image": "mcr.microsoft.com/azureiotedge-agent:1.0.7",
              "createOptions": "{}"
            }
          },
          "edgeHub": {
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "mcr.microsoft.com/azureiotedge-hub:1.0.7",
              "createOptions": "{\"HostConfig\":{\"PortBindings\":{\"5671/tcp\":[{\"HostPort\":\"5671\"}],\"8883/tcp\":[{\"HostPort\":\"8883\"}],\"443/tcp\":[{\"HostPort\":\"443\"}]}}}"
            }
          }
        },
// modules includes all information about custom modules or Azure service 
        "modules": {
          "camera-capture": {
            "version": "1.0",
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "glovebox/camera-capture-opencv:1.1.45-arm32v7",
              "createOptions": "{\"Env\":[\"Video=0\",\"azureSpeechServicesKey=2f57f2d9f1074faaa0e9484e1f1c08c1\",\"AiEndpoint=http://image-classifier-service:80/image\"],\"HostConfig\":{\"PortBindings\":{\"5678/tcp\":[{\"HostPort\":\"5678\"}]},\"Devices\":[{\"PathOnHost\":\"/dev/video0\",\"PathInContainer\":\"/dev/video0\",\"CgroupPermissions\":\"mrw\"},{\"PathOnHost\":\"/dev/snd\",\"PathInContainer\":\"/dev/snd\",\"CgroupPermissions\":\"mrw\"}]}}"
            }
          },
          "image-classifier-service": {
            "version": "1.0",
            "type": "docker",
            "status": "running",
            "restartPolicy": "always",
            "settings": {
              "image": "glovebox/image-classifier-service:1.1.4-arm32v7",
              "createOptions": "{\"HostConfig\":{\"Binds\":[\"/home/pi/images:/images\"],\"PortBindings\":{\"8000/tcp\":[{\"HostPort\":\"80\"}],\"5679/tcp\":[{\"HostPort\":\"5679\"}]}}}"
            }
          }
        }
      }
    },
// $edgeHub includes routers which is a data pipeline
    "$edgeHub": {
      "properties.desired": {
        "schemaVersion": "1.0",
        "routes": {
          "camera-capture": "FROM /messages/modules/camera-capture/outputs/output1 INTO $upstream"
        },
        "storeAndForwardConfiguration": {
          "timeToLiveSecs": 7200
        }
      }
    }
  }
}

Let’s start to do a demo step by step.

  • Create a loT hub. There are three ways to create a loT hub. One is through the Azure portal, the second through the CLI, the third through vs code.
// through CLI
az iot hub create --resource-group {resource group name} --name {hub_name} 
Figure 3: Create loT hub through Azure loT hub toolkit.
  • Register an loT Edge device in the loT hub. In this step, we actually do nothing for the device. We create the device ID and connect string for using later.
// through CLI
az iot hub device-identity create --hub-name {hub_name} --device-id myEdgeDevice --edge-enabled
az iot hub device-identity show-connection-string --device-id myEdgeDevice --hub-name {hub_name}
// connection string will be created
"connectionString": "HostName={host name};DeviceId=loTEdgeTest;SharedAccessKey=Hn/60BimXvYBM16y0hgwZtGtcERG+FQI0FEFUtVCiXA="
Figure 4: Add a loT Edge device through Azure portal. The connection string will automatically created.
Figure 5: Add a loT Edge device through VS code. Click “Get Device Info” to get the connection string.
  • Install loT runtime and manage the device. Depending on the environment, we can install loT runtime on Linux or Linux containers on Windows. (sorry, I think few windows loT device existing). Linux container on windows is usually used for developing and testing in local, it is very easy to configure since everything done in the image, you can find the steps here. We am going through the how to set up in Linux.
// register repository for Ubuntu server 18.04
curl https://packages.microsoft.com/config/ubuntu/18.04/multiarch/prod.list > ./microsoft-prod.list
// copy list to container list
sudo cp ./microsoft-prod.list /etc/apt/sources.list.d/
// install Microsoft GPG public key
curl https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg
sudo cp ./microsoft.gpg /etc/apt/trusted.gpg.d/

//install container runtime
sudo apt-get update
sudo apt-get install moby-engine
sudo apt-get install moby-cli

// install lot edge security daemon
// security daemon is used for launching edge hub and edge agent
sudo apt-get update
sudo apt-get install iotedge

// manual provision
// set up the connection string you got in previous step
sudo nano /etc/iotedge/config.yaml
# Manual provisioning configuration
provisioning:
  source: "manual"
  device_connection_string: "<ADD DEVICE CONNECTION STRING HERE>"

// verify the installation
sudo systemctl restart iotedge
sudo iotedge list

There is another automatic method to manage the devices, called DPS(Device provisioning Service). But I am planing to talk about it as a new topic.

  • Developing module. As we know, each module is an image located on cloud register or local. Then it is deployed through loT hub to loT edge device. Fortunately, Microsoft has provided an integrated solution in Visual Studio Code. We mostly can do everything in vs code, rather than switch back and forth. Before we start, we need install some extensions for vs code: Azure IoT Tools , Docker extension, Azure IoT EdgeHub Dev Tool, and Visual Studio extension(s) specific to the language( I installed Python extension and C# extension ). Don’t forget to install common language environment: .net core or python, pip…..
Run New IoT Edge Solution
Figure 6: create a new loT edge solution

When you creating a module, you have to provide module’s image repository. It could be Azure container registry( <registry name>.azurecr.io/<module name> ) or local address ( localhost:5000/ <module name> )

Figure 7: solution structure on the left Tab

launch.json is debug configuration, config folder is the deployment manifest. modules folder has subfolders for each module, module.json file defines the Docker build process, the module version, and your docker registry, updating the version number, pushing the updated module to an image registry, and updating the deployment manifest for an edge device triggers the Azure IoT Edge runtime to pull down the new module to the edge device; deployment.template.json file is used for build process, define what module to build, what router used and what version of runtime. If you right click this file, you can find the option to create deployment manifest.

Figure 8: create deployment manifest through build template file
  • Test and Deployment. We can deploy the module to local docker container by right click” deployment.template.json ” and choose “build and push loT edge solution”(or simulator)
Figure 9: Test the module

Deployment is easy as well. right click the json file under config folder(this json file is generate by tempalte.json). choose “create deployment for single device”, then choose device ID to finish single device deployment.

Figure 10: select single device deployment or auto scale deployment
  • Manage device and module on Cloud. If you already upload the module to cloud container registry. I can deploy instantly through Azure portal and this process is automatically.
Figure 11: Set modules. we can add an existing module on container registry or add a custom model by click the device ID in Azure portal.

Next time, I will try to deploy an image classification module to raspberry(or Linux virtual machine) to achieve a simple edge machine learning.

Reference:

Quickstart: Deploy your first IoT Edge module to a Linux device, https://docs.microsoft.com/en-us/azure/iot-edge/quickstart-linux

Creating an image recognition solution with Azure IoT Edge and Azure Cognitive Services, https://dev.to/azure/creating-an-image-recognition-solution-with-azure-iot-edge-and-azure-cognitive-services-4n5i

loT EDGE document, https://docs.microsoft.com/en-us/azure/iot-edge/

Multi Processing in Python

Recently we get the work to improve the ETL efficiency. The latency mostly happen in I/O and file translation, since we only use one thread to handle it. So we want to use multi processing to accelerate this task.

Before everything, I think I need to introduce the difference between multi processing and multi threading in python. The major distance is multi threading is not parallel, there can only be one thread running at any given time in python. You can see it in blew.

  • Actually for CPU heavy tasks, multi threading is useless indeed. However it’s perfect for IO
  • Multiprocessing is always faster than serial, but don’t pop more than number of cores
  • Multiprocessing is for increasing speed. Multi threading is for hiding latency
  • Multiprocessing is best for computations. Multi threading is best for IO
  • If you have CPU heavy tasks, use multiprocessing with n_process = n_cores and never more. Never!
  • If you have IO heavy tasks, use multi threading with n_threads = m * n_cores with m a number bigger than 1 that you can tweak on your own. Try many values and choose the one with the best speedup because there isn’t a general rule. For instance the default value of m in ThreadPoolExecutor is set to 5 [Source] which honestly feels quite random in my opinion.

In this task, we utilized multi processing.

Pro

  1. 5x-10x faster than single core when utilizing 20 cores. (6087 files, 27s (20 cores) vs 180s(single core)
  2. no need to change current coding logic, only adapt to multiprocess pattern.
  3. no effect to upstream and down stream.
Image

Corn:

  1. need to change code by scripts, no general framework or library 
  2. manually decide which part to be parallel 

How to work:

  • Library: you need to import these libraries to enable multi processing in python
from multiprocessing import Pool
from functools import partial
import time # optional
  • Function: you need two functions. parallelize_dataframe is used split dataframe and call multi processing, multiprocess is used for multiprocessing. These two function need to put outside of main function. The yellow marked parameters can be changed by needs.
# split dataframe into parts for parallelize tasks and call multiprocess function
def parallelize_dataframe(df, func, num_cores,cust_id,prodcut_fk,first_last_data):
    df_split = np.array_split(df, num_cores)
    pool = Pool(num_cores)
    parm = partial(func, cust_id=cust_id, prodcut_fk=prodcut_fk,first_last_data=first_last_data)
    return_df = pool.map(parm, df_split)
    df = pd.DataFrame([item for items in return_df for item in items])
    pool.close()
    pool.join()
    return df

# multi processing
# each process handles parts of files imported
def multiprocess(dcu_files_merge_final_val_slice, cust_id,prodcut_fk,first_last_data):
  • Put another script into main function, and utilize  parallelize_dataframe  to execute.
if __name__ == '__main__':
.......
......
                   # set the process number
                    core_num = 20
                    rsr_df = parallelize_dataframe(dcu_files_merge_final_val,multiprocess,core_num, cust_id, prodcut_fk,first_last_data)
                    print(rsr_df.shape)
                    end = time.time()
                    print('Spend:',str(end-start)+'s')

I can’t share the completed code, but you should understand it very well. There are some issues that I have not figure out the solution. The major one is how to more flexible to transmit the parameters to the multicore function rather than hard coding each time.

update@20190621

We added the feature that unzip single zip file with multi processing supporting utilizing futures library.

import os
import zipfile
import concurrent

def _count_file_object(f):
    # Note that this iterates on 'f'.
    # You *could* do 'return len(f.read())'
    # which would be faster but potentially memory
    # inefficient and unrealistic in terms of this
    # benchmark experiment.
    total = 0
    for line in f:
        total += len(line)
    return total


def _count_file(fn):
    with open(fn, 'rb') as f:
        return _count_file_object(f)

def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)



def unzipper(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total

Ref :

Multithreading VS Multiprocessing in Python, https://medium.com/contentsquare-engineering-blog/multithreading-vs-multiprocessing-in-python-ece023ad55a

Fastest way to unzip a zip file in Python, https://www.peterbe.com/plog/fastest-way-to-unzip-a-zip-file-in-python