Columnstore index for MS SQL SERVER

Columnstore is the most popular storage tech within big data. We must have already heard parquet, delta lake. They are both columnstore format which brings 10x times compress ratio and super faster query speed to analytic work. SQL server, one of fastest evolving relation database, also provides columnstore index with multiple optimizations.

Loading into a clustered columnstore index
Figure 1: Columnstore index compress the data by column oriented. And optimized by deltastore. It can reach 10x times compression and 100x times query speed.

SQL server provides clustered and non-clustered columnstore index. Delta store is a clustered B-tree index used only with columnstore index automatically. It stores rows until the number of rows reaches a threshold(~1048576 rows) them moved data into columnsotre, and set state from Open to Closed. I will show you at the end of this article.

Clustered columnstore index

  • Primary storage method for the entire table
  • All columns are included, there is no keys.
  • can only combined with non-clustered B-tree index to speed up
    • queries that search for specific values or small ranges of values.
    • updates and deletes of specific rows
  • usually for fact table or large dimension tables

Non-clustered columnstore index

  • We can indicate which columns to be indexed, usually for frequently used columns.
  • requires extra storage to store a copy of columns in the index(~10%)
  • can be combined with other index.

How to choose columnstore index?

Microsoft already provided the conventions on his document.

Figure 2: Choose the best columnstore index for your needs

I would recommend use columnstored index for most of OLAP work, as we need fast query without much delete/update tasks.

How to delete large size of data from columnstore table

Although Microsoft doesn’t suggest us to delete more than 10% data from columnstore table, there is still have chance we have to. In this case, I summarized some of my experience.

| Delete from columnstore is a soft delete

If you tried to delete rows from columnstore table, you will not actually delete the data, but sql server will mark this row as deleted.

You can run sql blew to find out the number of delete rows and deltastore. Once there are too many row marked “delete”, then you have to rebuild/reorganize the columnstore table. Remember , it is not like Optimize clause in deltalake, which is more like bin-packing for small files.

SELECT
       tables.name AS table_name,
       indexes.name AS index_name,
       partitions.partition_number,
       column_store_row_groups.row_group_id,
       column_store_row_groups.state_description,
       column_store_row_groups.total_rows,
       column_store_row_groups.size_in_bytes,
       column_store_row_groups.deleted_rows,
       internal_partitions.partition_id,
       internal_partitions.internal_object_type_desc,
       internal_partitions.rows
FROM sys.column_store_row_groups
INNER JOIN sys.indexes
ON indexes.index_id = column_store_row_groups.index_id
AND indexes.object_id = column_store_row_groups.object_id
INNER JOIN sys.tables
ON tables.object_id = indexes.object_id
INNER JOIN sys.partitions
ON partitions.partition_number = column_store_row_groups.partition_number
AND partitions.index_id = indexes.index_id
AND partitions.object_id = tables.object_id
LEFT JOIN sys.internal_partitions
ON internal_partitions.object_id = tables.object_id
AND column_store_row_groups.deleted_rows > 0
WHERE tables.name = 'table_name'
Figure 3: One delta group is open, and there is no delete rows

| Steps to delete large size of data in columnstored table

1. delete non-clustered B-tree index( after delete operation, rebuild it if needed)

If we run execution plan for bulk delete, you will find B-tree index related operations spend most of time rather than columnsotre index.

Figure 4: B-tree index related operations spend 90% time.

2. delete by a small batch

One of the down side of Azure SQL is we can not set log as simple. So it will take lots of transaction time when we try to delete a large table. The work around is delete by a small batch to make each log transaction smaller and quicker. Blew I gave an example to delete data by chunksize = 1000000.

deleteMore:
delete top(1000000) from table_name
where id%2=0
IF @@ROWCOUNT != 0
begin
       print current_timestamp
    goto deleteMore
end

3. rebuild/reorganizing columnstored index if needed

Sometimes, we have to rebuild/reorganizing index if we delete too many data and it affects performance of query. here I give a snippet to show how to archive it.

-- rebuild column stored index
alter index indexname on table_name rebuild/reorgnize
alter index all on tablename rebuild/reorgnize

-- check fragment
SELECT a.object_id, object_name(a.object_id) AS TableName,
    a.index_id, name AS IndedxName, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats
    (DB_ID (N'schema_name')
        , OBJECT_ID(N'table_name')
        , NULL
        , NULL
        , NULL) AS a
INNER JOIN sys.indexes AS b
    ON a.object_id = b.object_id
    AND a.index_id = b.index_id;
GO

Reference:

Choose the best columnstore index for your needs. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-design-guidance?view=sql-server-ver15#choose-the-best-columnstore-index-for-your-needs

Columnstore indexes – Query performance. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-query-performance?view=sql-server-ver15

How to efficiently delete rows while NOT using Truncate Table in a 500,000+ rows table. https://stackoverflow.com/questions/11230225/how-to-efficiently-delete-rows-while-not-using-truncate-table-in-a-500-000-rows

Hands-On with Columnstore Indexes: Part 1 Architecture. https://www.red-gate.com/simple-talk/sql/sql-development/hands-on-with-columnstore-indexes-part-1-architecture/

Enable GPU Accelerate in WSL2 to support AI frameworks

Since Microsoft upgraded WSL to version 2, it introduced full Linux kernel and full VM manage features. Except the performance benefit through deep integration with windows, WSL2 allows installing additional powerful apps like docker and upgrading Linux kernel anytime when it is available.

Two months ago, Microsoft with NVIDIA brought GPU acceleration to WSL2. This new feature made me exciting, so that we don’t have to train our models on a separated Linux machine or install dual OS startup.

Figure 1: Stack image showing layers involved while running AI frameworks in WSL 2 containers. The container provides integration with CUDA related components. WSL2 communicates with windows host through GPU paravirtualization protocol

Before I start, I did some search about basic ideas of virtualization and WSL2 GPU. It is good for me to understand how GPU paravirtualization works in WSL2.

Types of virtualization

Figure 2: four major types of virtualization
  • Full virtualization. In full virtualization, there is almost a complete model of the underlying physical system resources that allows any and all installed software to run without modification. There are two types of full virtualization.
software assisted full virtualization( binary translation). like VMware workstation(32bit), virtual PC, VirtualBox(32 bits). issue: low performance
hardware- assisted full virtualization. eliminates the binary translation and directly interrupts with hardware ( intel VT-x and AMD-V). like , KVM, VMware ESX, Hyper-V, Xen. issue: virtual context execute privileged instruction directly on the processor.
  • Paravirtualization. Paravirtualization (PV) is an enhancement of virtualization technology in which a guest operating system (guest OS) is modified prior to installation inside a virtual machine (VM) in order to allow all guest OS within the system to share resources and successfully collaborate, rather than attempt to emulate an entire hardware environment. so the guests aware that it has been virtualized.  products like Xen, IBM LPAR, Oracle VM for X86
Xen supports both Full virtualization and Para-virtualization
  • Hybrid virtualization(hardware virtualized with PV drivers). virtual machine uses PV for specific hardware drivers(like I/O), and the host use full virtualization for other features. products like Oracle VM for x86, Xen. 
VMware paravirtual with hardware full virtualization
  • OS level Virtualization. aka containerization. No overhead . Products like docker, Linux LCX, AIX WPAR
The difference between VM and container

Except containerization, all virtualization use hypervisor to communicate with the host. We can take a look how hypervisor works blew.

  • Hypervisor
    • Emulation. (software full virtualization)
      • emulate a certain piece of hardware which guest VM can only see.
      • expense of performance since “common lowest” denominator
      • need to translate instruction
      • wide compatibility
    • Paravirtualization
      • only support certain hardware in certain configurations.
      • Direct hardware access is possible
      • Compatibility is limited
    • hardware pass-through(hardware full virtualization)
      • native performance, but need proper drivers for the real physical hardware
      • hardware specific images
      • GPU supported

GPU Virtualization on Windows

How it works on WSL

  • a new kernel driver “dxgkrnl” which expoes “/dev/dxg” device to user mode.
  • /dev/dxg mimic the native WDDM D3DKMT kernel service layer on Windows.
  • dxgkrnl communicate with its big brother on Windows through VM Bus WDDM paravirtualization protocol.
Figure 3: there is no partitioning of resources between Linux and Windows or limit on Linux application

DxCore & D3D12 on Linux

  • libd3d12.so is compiled from the same source code as d3d12.dll on windows
  • except Present() function, all others are same with windows. 
  • libxcore(DxCore) is a simplified version of dxgi
  • GPU manufacturer partners provide UMD(user mode driver) for Linux
Figure 4: D3D12 builds upon the /dev/dxg device

DirectML and AI Training

  • DirectML sits on top of D3D12 API, provides a a collection of compute compute operations.
  • Tensorflow with an integrated DirectML backend.
Figure 5: DirectML provides beginner a basic ML framework

OpenGL, OpenCL & Vulkan

  • Mesa library is the mapping layer which bring hardware acceleration for OpenCL , OpenGL
  • vulkan is not supported right now.
Figure 6: WSL2 only support OpenGL and OpenCL right now.

Nvidia CUDA

  • a version of CUDA taht directly targets WDDM 2.9 abstraction exposed by /dev/dxg. 
  • libcuda.so enables CUDA-X libaries such as cuDNN, cuBLAS, TensorRT.
  • available on any glibc-based WSL distro
Figure 7: NVIDIA-docker tolls available ( NVIDIA container toolkit), which provides us container like plugin and usage experience.

GPU container in WSL

  • libnvidia-container libarary is able to detect the presence of libdxcore.so at runtime and uses it to detect all the GPUs exposed to this interface.
  • driver store is a folder that containers all driver librarians for both Linux and Windows
Figure 8: NVIDIA docker provides NVIDIA container toolkits along with lots of good images.

GUI Application is still under developing.

How to enable GPU Acceleration in WSL

for the detail step, we can refer https://docs.nvidia.com/cuda/wsl-user-guide/index.html. Here I brief some keypoints:

  1. Windows version: 20150 or above (Dev Channel)
  2. Enable WSL 2
  3. Install Ubuntu On WSL
  4. Install Windows Terminal
  5. Upgrade kernel to 4.19.121 or higher
  6. NVIDA DRIVERS FOR CUDA ON WSL  https://developer.nvidia.com/cuda/wsl/download
  7. Install docker in WSL:  
    • curl https://get.docker.com | sh
    • You can see vmmen process on your windows task manger. It is the process for virtual machine in wsl2
  8. Install Nvidia Container Toolkit( nvidia-docker2)
Figure 9: docker in WSL2 with NIVIDA container toolkit

9. Start A TensorFlow Container

# test for docker
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
# pull tersorflow image and run it
docker run -it --gpus all -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3-jupyter

After you pull tersoflow image, and run it. You can see following instruction:

Figure 10: replace 127.0.0.1 to localhost, and open this URL on your browser then we can use GPU acceleration in our WSL2

Reference:

Para virtualization vs Full virtualization vs Hardware assisted Virtualization, https://www.unixarena.com/2017/12/para-virtualization-full-virtualization-hardware-assisted-virtualization.html/

Emulation, paravirtualization, and pass-through: what you need to know for client hypervisors, https://searchvirtualdesktop.techtarget.com/opinion/Emulation-paravirtualization-and-pass-through-what-you-need-to-know-for-client-hypervisors

DirectX is coming to the Windows Subsystem for Linux, https://devblogs.microsoft.com/directx/directx-heart-linux/

NVIDIA Container Toolkit, https://github.com/NVIDIA/nvidia-docker

CUDA on WSL User Guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html

NVIDIA Drivers for CUDA on WSL, https://developer.nvidia.com/cuda/wsl/download

Tensorflow image on Docker, https://www.tensorflow.org/install/docker


Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated in /home/jietao/jie-tao/wp-content/themes/zacklive/library/zacklive.php on line 283