This post combines three parts: creating an Ubuntu VM and attaching a data disk, installing and configuring MySQL, and comparing performance with Azure SQL.
Create Ubuntu VM
- Choose a VM size. Here I used D4s_v3, which has 4 vCPUs and 16 GB of memory. You also need to choose a data disk for storage (I recommend Premium SSD) and set the initial admin password.
- Open SSH access. Go to Networking and add SSH port 22 to your inbound port rules. Later we will add the MySQL port 3306 as well.
- Mount the data disk. Remember the data disk you chose in step 1? It won't mount automatically, so you need to run the following steps over SSH.
dmesg | grep SCSI    # identify the newly attached data disk; here it shows up as /dev/sdc
sudo fdisk /dev/sdc  # create a partition on the data disk
---------------------------------------
Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)
Command (m for help): n
Partition type:
p primary (0 primary, 0 extended, 4 free)
e extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-10485759, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-10485759, default 10485759):
Using default value 10485759
Command (m for help): p
Disk /dev/sdc: 5368 MB, 5368709120 bytes
255 heads, 63 sectors/track, 652 cylinders, total 10485760 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x2a59b123
Device Boot Start End Blocks Id System
/dev/sdc1 2048 10485759 5241856 83 Linux
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
------------------------------------------------
sudo mkfs -t ext4 /dev/sdc1
sudo mkdir /datadrive
sudo mount /dev/sdc1 /datadrive
------------------------------------------------
# if you want to automount, you need to edit the fstab file
# retrieve UUID
ls -al /dev/disk/by-uuid/
# edit fstab
sudo nano /etc/fstab
UUID=<ID> /datadrive auto defaults 0 0
After these steps, you have done all the configuration for Ubuntu. You can check the mount by using the df -H command.
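Before rebooting, it's also worth validating the fstab entry; a quick sanity check (my own habit, not part of the original walkthrough):
# re-mount everything listed in fstab; an error here means a typo in the UUID or options
sudo mount -a
# confirm /datadrive is mounted with the expected size
df -H /datadrive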
Install and configure MySQL
- Install MySQL.
sudo apt-get update
sudo apt-get install mysql-server
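To confirm the installation succeeded, an optional check (not in the original steps):
mysql --version                         # prints the installed MySQL version
sudo systemctl status mysql --no-pager  # the service should show active (running)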
- Allow remote access. Remember to add port 3306 to the VM's inbound rules in Azure as mentioned earlier.
sudo ufw allow OpenSSH   # allow SSH first so enabling the firewall doesn't lock you out
sudo ufw enable
sudo ufw allow mysql
Then edit /etc/mysql/mysql.conf.d/mysqld.cnf to change bind-address to 0.0.0.0, which allows connections to MySQL from any IP.
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
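After the edit, the relevant line should look like this; substitute a specific IP address for 0.0.0.0 if you want to be stricter:
-----------------------
bind-address = 0.0.0.0
-----------------------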
- Start MySQL
sudo systemctl start mysql
- Add a new user for remote connections. Run the statement below in the MySQL shell (sudo mysql); you can replace % below with a specific IP address to restrict where connections may come from.
CREATE USER '<username>'@'%' IDENTIFIED BY '<user password>';
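Note that CREATE USER alone grants no privileges, so this login can't read any data yet. A minimal grant, assuming <dbname> is the database the user needs (run in the same MySQL shell, and adjust the scope to taste):
GRANT ALL PRIVILEGES ON <dbname>.* TO '<username>'@'%';
FLUSH PRIVILEGES;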
- Change the data directory. By default, the OS disk only provides about 30 GB, so you should use the extra data disk to store the database.
# stop service
sudo systemctl stop mysql
# sync to new path
sudo rsync -av /var/lib/mysql /datadrive
# backup
sudo mv /var/lib/mysql /var/lib/mysql.bak
# change the configuration file
sudo nano /etc/mysql/mysql.conf.d/mysqld.cnf
-----------------------
datadir=/datadrive/mysql
--------------------
# configure AppArmor Access Control
sudo nano /etc/apparmor.d/tunables/alias
-----------------------
alias /var/lib/mysql/ -> /datadrive/mysql/,
----------------------
sudo systemctl restart apparmor
# create a dummy directory structure so MySQL's startup check on /var/lib/mysql passes
sudo mkdir -p /var/lib/mysql/mysql
# restart service
sudo systemctl start mysql
Then you can run
SELECT @@datadir;
in the MySQL shell to check the data directory.
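To verify remote access end to end, you can also connect from another machine; this assumes port 3306 is already open in the VM's inbound rules, as noted earlier:
mysql -h <vm public ip> -u <username> -p -e "SELECT @@datadir, @@version;"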
Performance comparison with Azure SQL
| 40,000,000 rows         | Azure SQL | Ubuntu + MySQL |
| Write (from Databricks) | 25 mins   | 31 mins        |
| Read (to Tableau)       | 44 mins   | 12 mins        |
What a surprise! MySQL on the VM is slightly slower on writes but almost 4x faster on reads than Azure SQL.
Tips: how to write data into Azure SQL and MySQL through Databricks.
To Azure SQL:
%scala
// you have to load com.microsoft.azure:azure-sqldb-spark:1.0.2 into the cluster's libraries first
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"          -> "<accountname>.database.windows.net",
  "databaseName" -> "<dbname>",
  "dbTable"      -> "<tablename>",
  "user"         -> "<admin name>",
  "password"     -> "<password>"
))

import org.apache.spark.sql.SaveMode
df.write.mode(SaveMode.Overwrite).sqlDB(config)
To MySQL:
%scala
val jdbcHostname = "<mysql address>"
val jdbcPort = 3306
val jdbcDatabase = "<dbname>"
val jdbcUsername = "<user name>"
val jdbcPassword = "<password>"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
import org.apache.spark.sql.SaveMode
df.write
  .mode(SaveMode.Overwrite) // <--- overwrites the table; use SaveMode.Append to append instead
  .jdbc(jdbcUrl, "<table name>", connectionProperties)
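For completeness, here is a sketch of reading the table back with the same connection settings, using the standard spark.read.jdbc API (dfRead is just an illustrative name):
%scala
// reuses jdbcUrl and connectionProperties defined above
val dfRead = spark.read.jdbc(jdbcUrl, "<table name>", connectionProperties)
display(dfRead)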