标签: CUDA

CUDA学习笔记

0 目标

快速掌握CUDA编程的大致方式
了解CUDA软硬件组织架构与关系
了解如何查找文档

1 介绍

1.1 架构

CUDA (Compute Unified Device Architecture)：统一计算设备架构，在 GPU 上发布的一个新的硬件和软件架构,它不需要映射到一个图形 API 便可在 GPU 上管理和进行并行数据计算
CPU、GPU架构对比：GPU 被设计用于高密度和并行计算,更确切地说是用于图形渲染，因此更多的晶体管被投入到数据处理而不是数据缓存和流量控制
CUDA软件堆栈：CUDA 软件堆栈由几层组成:一个硬件驱动程序,一个应用程序编程接口(API)和它的Runtime, 还有二个高级的通用数学库,CUFFT 和CUBLAS
CUDA内存操作：CUDA 提供一般 DRAM 内存寻址方式: Gather 和 Scatter内存操作，它可以在 DRAM的任何区域进行读写数据的操作
允许并行数据缓冲或者在 On-chip 内存共享使数据更接近ALU，可以进行快速的常规读写存取,在线程之间共享数据。应用程序可以最小化数据到 DRAM 的 overfetch 和 round-trips, 从而减少对 DRAM 内存带宽的依赖

BradZhone大约 5 分钟

HugeCTR源码阅读笔记

1. Data

深度推荐模型的输入特征可分为数值特征和分类特征，数值特征是一组连续值，而分类特征是离散值，以HugeCTR按照Criteo点击率数据格式（每个数据sample包括14个数值特征和26个分类特征，总共40个特征）合成的数据为例：

数值特征（numeric feature、dense feature）：

	_col0	_col1	_col2	_col3	_col4	_col5	_col6
0	0.080380	0.435741	0.078185	0.194161	0.087724	0.845081	0.937019
1	0.310647	0.669963	0.218886	0.945537	0.735421	0.637027	0.007011
2	0.337267	0.908792	0.795987	0.608301	0.290421	0.012273	0.671650
3	0.873908	0.694296	0.796788	0.553089	0.872149	0.502299	0.114150
4	0.333109	0.456773	0.403027	0.091778	0.215718	0.729457	0.941204

	_col7	_col8	_col9	_col10	_col11	_col12	_col13
0	0.977882	0.042342	0.054632	0.855919	0.264451	0.224891	0.467242
1	0.204856	0.307856	0.775143	0.265654	0.301945	0.066413	0.499416
2	0.960113	0.018073	0.639101	0.229013	0.645756	0.123180	0.894010
3	0.444433	0.001794	0.147979	0.083302	0.744487	0.971924	0.362019
4	0.997079	0.563684	0.811862	0.457039	0.133213	0.169442	0.124149

分类特征（sparse feature、category feature）：

	_col14	_col15	_col16	_col17	_col18	_col19	_col20	_col21	_col22
0	151	0	9	13	1	1	1	9	4
1	0	0	11	4801	44	2	160	9	0
2	4549	3	1	10	31	2	485	2	10
3	0	0	1	1	1	0	3	111	10
4	2	5	160	0	72	0	13	53	0

_col23 _col24 _col25 _col26 _col27 _col28 _col29 _col30 _col31

0 2 395 41 1 14 5 2 7 0

1 101 3 1 1 4 1 1 1 1

2 2 19 6 6 1 0 4 1 2

3 1 2 38 6 1 7 1 2 0

4 63 616 7 1 175 23 4 0 1
_col32 _col33 _col34 _col35 _col36 _col37 _col38 _col39

0 1 1 5283 4 0 21 33 1

1 2 3 204 310 1640 6 4 6

2 4 7 29 2 11 66 2 22

3 9 43 2 10 286 6 2 0

4 0 477 10 6 0 2 0 30

	_col23	_col24	_col25	_col26	_col27	_col28	_col29	_col30	_col31
0	2	395	41	1	14	5	2	7	0
1	101	3	1	1	4	1	1	1	1
2	2	19	6	6	1	0	4	1	2
3	1	2	38	6	1	7	1	2	0
4	63	616	7	1	175	23	4	0	1

	_col32	_col33	_col34	_col35	_col36	_col37	_col38	_col39
0	1	1	5283	4	0	21	33	1
1	2	3	204	310	1640	6	4	6
2	4	7	29	2	11	66	2	22
3	9	43	2	10	286	6	2	0
4	0	477	10	6	0	2	0	30

数据相关属性：

以生成模拟数据为例：

# generate_data.py

import hugectr
from hugectr.tools import DataGenerator, DataGeneratorParams
from mpi4py import MPI
import argparse
parser = argparse.ArgumentParser(description=("Data Generation"))

parser.add_argument("--num_files", type=int, help="number of files in training data", default = 8)
parser.add_argument("--eval_num_files", type=int, help="number of files in validation data", default = 2)
parser.add_argument('--num_samples_per_file', type=int, help="number of samples per file", default=1000000)
parser.add_argument('--dir_name', type=str, help="data directory name(Required)")
args = parser.parse_args()

data_generator_params = DataGeneratorParams(
  format = hugectr.DataReaderType_t.Parquet,
  label_dim = 1,
  dense_dim = 13,
  num_slot = 26,
  num_files = args.num_files,
  eval_num_files = args.eval_num_files,
  i64_input_key = True,
  num_samples_per_file = args.num_samples_per_file,
  source = "./etc_data/" + args.dir_name + "/file_list.txt",
  eval_source = "./etc_data/" + args.dir_name + "/file_list_test.txt",
  slot_size_array = [12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33],
  nnz_array = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
  # for parquet, check_type doesn't make any difference
  check_type = hugectr.Check_t.Non,
  dist_type = hugectr.Distribution_t.PowerLaw,
  power_law_type = hugectr.PowerLaw_t.Short)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()

num_files：即数据集分为几个子集，也是使用ETC训练时每一个pass所用到的数据，子集数等于训练所需pass数
num_samples_per_file：也即每个子数据集的样本数
label_dim：每一个sample的标签维度
dense_dim：连续特征的数量
num_slot：分类特征的数量，也即slot数（特征域数量）
slot_size_array：每一个slot中的最大特征数量（也就是特征域的大小）
nnz_array：样本在对应的slot (特征域)中最多可同时有几个特征（用于选择是one-hot还是multi-hot）

执行脚本生成模拟数据（总共产生8个训练数据子集和2个测试子集）：

python generate_data.py --dir_name "file0"

[HCTR][03:28:28.823][INFO][RK0][main]: Generate Parquet dataset
[HCTR][03:28:28.823][INFO][RK0][main]: train data folder: ./etc_data/file0, eval data folder: ./etc_data/file0, slot_size_array: 12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33, nnz array: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, #files for train: 8, #files for eval: 2, #samples per file: 1000000, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0 exist
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0/train/gen_0.parquet
[HCTR][03:28:33.757][INFO][RK0][main]: ./etc_data/file0/train/gen_1.parquet
[HCTR][03:28:36.560][INFO][RK0][main]: ./etc_data/file0/train/gen_2.parquet
[HCTR][03:28:39.337][INFO][RK0][main]: ./etc_data/file0/train/gen_3.parquet
[HCTR][03:28:42.083][INFO][RK0][main]: ./etc_data/file0/train/gen_4.parquet
[HCTR][03:28:44.807][INFO][RK0][main]: ./etc_data/file0/train/gen_5.parquet
[HCTR][03:28:47.641][INFO][RK0][main]: ./etc_data/file0/train/gen_6.parquet
[HCTR][03:28:50.377][INFO][RK0][main]: ./etc_data/file0/train/gen_7.parquet
[HCTR][03:28:53.131][INFO][RK0][main]: ./etc_data/file0/file_list.txt done!
[HCTR][03:28:53.132][INFO][RK0][main]: ./etc_data/file0/val/gen_0.parquet
[HCTR][03:28:55.941][INFO][RK0][main]: ./etc_data/file0/val/gen_1.parquet
[HCTR][03:28:58.788][INFO][RK0][main]: ./etc_data/file0/file_list_test.txt done!

输出keysets（以gen_0为例）：

每一个子数据集（每个pass用的数据集）会生成一个keysets，所有数据的keysets构成一个更大的集合all_keysets，最终将根据all_keysets生成keysets文件xxx.keyset供GPU在t时刻预读取t+1时刻的数据到ETC中

  ------------------------
  unique_keys:
  0            0
  1            1
  2            2
  3            3
  4            4
          ...  
  12115    12978
  12116    12979
  12117    12981
  12118    12983
  12119    12986
  Name: _col14, Length: 12120, dtype: int64
  ------------------------
  unique_keys:
  0       12988
  1       12989
  2       12990
  3       12991
  4       12992
          ...  
  7077    20111
  7078    20112
  7079    20113
  7080    20114
  7081    20116
  Name: _col15, Length: 7082, dtype: int64
  ------------------------
  unique_keys:
  0       20117
  1       20118
  2       20119
  3       20120
  4       20121
          ...  
  8554    28827
  8555    28828
  8556    28830
  8557    28831
  8558    28834
  Name: _col16, Length: 8559, dtype: int64
  ------------------------
  unique_keys:
  0       28837
  1       28838
  2       28839
  3       28840
  4       28841
          ...  
  5798    34652
  5799    34653
  5800    34654
  5801    34655
  5802    34656
  Name: _col17, Length: 5803, dtype: int64

使用ETC分pass训练示例(与上例不同，此例使用10个pass来训练)：

Pass ID	Number of Unique Keys	Embedding size (GB)
#0	24199179	11.54
#1	26015075	12.40
#2	27387817	13.06
#3	23672542	11.29
#4	26053910	12.42
#5	27697628	13.21
#6	24727672	11.79
#7	25643779	12.23
#8	26374086	12.58
#9	26580983	12.67

数据预处理时，HugeCTR将分类特征转换为整型序列的方法：
- 使用NVTabular：
- ```
# e.g. using NVTabular

target_encode = (
    ['brand', 'user_id', 'product_id', 'cat_2', ['ts_weekday', 'ts_day']] >>
    nvt.ops.TargetEncoding(
        nvt.ColumnGroup('target'),
        kfold=5,
        p_smooth=20,
        out_dtype="float32",
        )
```
- https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/ops/targetencoding.html
  
  Target encoding is a common feature-engineering technique for categorical columns in tabular datasets. For each categorical group, the mean of a continuous target column is calculated, and the group-specific mean of each row is used to create a new feature (column). To prevent overfitting, the following additional logic is applied:
  1. Cross Validation: To prevent overfitting in training data, a cross-validation strategy is used - The data is split into k random “folds”, and the mean values within the i-th fold are calculated with data from all other folds. The cross-validation strategy is only employed when the dataset is used to update recorded statistics. For transformation-only workflow execution, global-mean statistics are used instead.
  2. Smoothing: To prevent overfitting for low cardinality categories, the means are smoothed with the overall mean of the target variable.

BradZhone大约 13 分钟

Distributed_embeddings

1. Introduce

参考链接：
- 项目地址：https://github.com/NVIDIA-Merlin/distributed-embeddings
- 相关blog：https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
- 相关项目：https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Recommendation/DLRM#hybrid-parallel-training-with-merlin-distributed-embeddings
- 项目文档：https://nvidia-merlin.github.io/distributed-embeddings/index.html
基于TF2构建大embedding，提供可伸缩模型并行封装器，能够自动将嵌入表分布到多GPU上（目前支持table-wise和column-wise）
支持混合模型并行（dist_model_parallel）：

仅需修改少量代码即可使用

import dist_model_parallel as dmp
 
class MyEmbeddingModel(tf.keras.Model):
  def  __init__(self, table_sizes):
    ...
    self.embedding_layers = [tf.keras.layers.Embedding(input_dim, output_dim) for input_dim, output_dim in table_sizes]
    # 1. Add this line to wrap list of embedding layers used in the model
    self.embedding_layers = dmp.DistributedEmbedding(self.embedding_layers)
  def call(self, inputs):
    # embedding_outputs = [e(i) for e, i in zip(self.embedding_layers, inputs)]
    embedding_outputs = self.embedding_layers(inputs)
     
@tf.function
def training_step(inputs, labels, first_batch):
  with tf.GradientTape() as tape:
    probs = model(inputs)
    loss_value = loss(labels, probs)
 
  # 2. Change Horovod Gradient Tape to dmp tape
  # tape = hvd.DistributedGradientTape(tape)
  tape = dmp.DistributedGradientTape(tape)
  grads = tape.gradient(loss_value, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))
 
  if first_batch:
    # 3. Change Horovod broadcast_variables to dmp's
    # hvd.broadcast_variables(model.variables, root_rank=0)
    dmp.broadcast_variables(model.variables, root_rank=0)
  return loss_value

示例

import tensorflow as tf
import distributed_embeddings.python.layers.dist_model_parallel as dmp
from distributed_embeddings.python.layers.dist_model_parallel import Embedding, DistributedEmbedding
layer0 = Embedding(1000, 64, name="emb_0")
layer1 = Embedding(1000, 64, combiner="mean", name="emb_1")
layer2 = tf.keras.layers.Embedding(1000, 64, name="emb_2")
layer3 = tf.keras.layers.Embedding(1000, 64, name="emb_3")
layers0 = [layer0, layer1]
layers1 = [layer2, layer3]
emb_layers0 = DistributedEmbedding(layers0, column_slice_threshold=32000)
emb_layers1 = DistributedEmbedding(layers1, column_slice_threshold=32000)

API:

distributed_embeddings.python.layers.embedding.

Embedding
- 接口基本上对齐 tf.keras.layers.Embedding，且增加了支持同时lookup多个embedding 再将它们combine（sum/mean）为一个vector输出
- 支持多种数据类型：onehot/fixed hotness (multihot, 输入tensor维度相同)/variable hotness（multihot，输入tensor维度不同，使用tf.RaggedTensor）/sparse tensor(multihot, 使用tf.sparse.SparseTensor，按照csr格式存储稀疏tensor)
distributed_embeddings.python.layers.dist_model_parallel.

DistributedEmbedding
- 支持按列切分embedding table，适用于emb table规模庞大且无法存入单张卡内存的情况
- 可设置column_slice_threshold参数确定切分阈值（elements num）
- 可封装共享embedding table，如两个特征域，一个表示watched_video, 一个表示browsed_video，他们的特征都是videos,因此可公用同一个emb table

性能表现：

DLRM model with 113 billion parameters (421 GiB model size) trained on the Criteo Terabyte Click Logs dataset（官方数据，未实测）

Hardware	Description	Training Throughput (samples/second)	Speedup over CPU
2 x AMD EPYC 7742	Both MLP layers and embeddings on CPU	17.7k	1x
1 x A100-80GB; 2 x AMD EPYC 7742	Large embeddings on CPU, everything else on GPU	768k	43x
DGX A100 (8xA100-80GB)	Hybrid parallel with NVIDIA Merlin Distributed-Embeddings, whole model on GPU	12.1M	683x

Synthetic models benchmark （见第三节test部分benchmarks测试）

模型规模定义

Model	Total number of embedding tables	Total embedding size (GiB)
Tiny	55	4.2
Small	107	26.3
Medium	311	206.2
Large	612	773.8
Jumbo	1,022	3,109.5

官方性能(DGX-A100-80GB, batchsize=65536, optimizer=adagrad )

Model	Training step time (ms)	Training step time (ms)	Training step time (ms)	Training step time (ms)	Training step time (ms)
Model	1 GPU	8 GPU	16 GPU	32 GPU	128 GPU
Tiny	17.6	3.6	3.2
Small	57.8	14.0	11.6	7.4
Medium		64.4	44.9	31.1	17.2
Large				65.0	33.4
Jumbo					102.3

实测性能(Tesla T4, batchsize=65536, optimizer=adagrad)

Model	Training step time (ms)	Training step time (ms)
Model	1 GPU	4 GPU
Tiny	42.703	82.856

对比TF原生数据并行

Solution	Training step time (ms)	Training step time (ms)	Training step time (ms)	Training step time (ms)
Solution	1 GPU	2 GPU	4 GPU	8 GPU
NVIDIA Merlin Distributed Embeddings Model Parallel	17.7	11.6	6.4	4.2
Native TensorFlow Data Parallel	19.9	20.2	21.2	22.3

BradZhone大约 4 分钟

HugeCTR 学习笔记

1 背景

1.1 推荐系统与深度学习

随着互联网的发展，受益于数据爆炸式地增长，用户获取信息的途径与方式越来越轻松多样，但也因为其中夹杂着大量庞杂冗余甚至无用的信息，如何提供用户真正感兴趣的内容也成为了各大企业尤其是商业领域重点关的问题。推荐系统就是从海量的数据中，根据用户偏好为其选择出可能感兴趣的内容并推送给用户
CTR（Click-trough rate）也即点击率，是用于评估广告、搜索内容、博文等质量、搜索相关程度以及用户喜爱程度的重要指标，也能反馈给信息提供者所推荐给用户的内容是否合适、质量是否上乘、该内容是否选对了潜在受众。CTR的定义为内容被用户点击的次数除以内容展示给用户的次数，e.g. 一条广告被用户刷到了100次，但用户只点进去了1次，那么点击率就是1%
从技术架构上，可将推荐系统分为数据与模型部分：
- 数据部分主要负责“用户”“物品”“场景”的信息收集与处理；
- 模型部分是推荐系统的主体，模型的结构一般由“召回层”“排序层”“补充策略与算法层”组成：
  - 召回层：一般利用高效的召回规则、算法或简单的模型,快速从海量的候选集中召回用户可能感兴趣的物品
  - 排序层：利用排序模型对初筛的候选集进行精排序
  - 补充策略与算法层：也被称为“再排序层”,可以在将推荐列表返回用户之前,为兼顾结果的“多样性”“流行度”“新鲜度”等指标,结合一些补充的策略和算法对推荐列表进行一定的调整,最终形成用户可见的推荐列表
img
与传统推荐系统实现方式相比，深度学习推荐模型具有更强的表达能力，模型结构更加灵活能够适应不同的使用场景，但现代推荐系统及使用场景有以下特点与难点：
- 现代推荐模型合并了TB级别的嵌入表，传统推理服务架构无法将整个模型部署到单个服务器上（高时延、高存储占用）
- 许多推荐场景需要支持在线推理与模型更新，要求低时延
- 查找embedding的过程是独立的，因此容易实现并行化（GPU：higher bandwidth & throughput），但也需要占用大量内存资源和少量的计算资源（不平衡的资源需求降低了GPU在推理系统中的吸引力）→ 大多现存解决方案将嵌入查找操作与在GPU中的稠密计算相解耦，放入CPU中进行（放弃GPU带宽优势，CPU与GPU间的通信带宽成为首要瓶颈）
- 现实世界推荐数据集的经验证据表明，在 CTR 和其他推荐任务的推理过程中嵌入key访问通常表现出很强的局部性，并且大致遵循幂律分布，具有长尾效应 （大量特征的embedding总和占据了整个模型的大部分，但是他们出现的频率非常低，因此将这种特征长期存储在CPU和GPU中是低效的）
  
  Visualization of power law distribution representing the likelihood of embedding key accesses. A few embeddings are accessed far more often than the others.

BradZhone大约 22 分钟

cuCollections性能测试

0. 测试目标

cucollection性能分析（测试在负载因子为0.8/0.9时的性能，以及空载时的插入性能，吞吐量，带宽 etc）key为64bit int，数据使用均匀分布
- [x] 阅读benchmark代码，修改benchmark中的参数，测试不同负载因子下的性能( dynamic, static)
- [x] 弄清example中的示例，基本用法，可参考test
- [x] 根据所需测试性能参数修改benchmark测试

1. 环境配置

BradZhone大约 14 分钟

WARPCORE 学习笔记

项目链接：https://github.com/sleeepyjack/warpcore

文档：https://sleeepyjack.github.io/warpcore/index.html

BradZhone大约 10 分钟