1. Data
-
The input features of a deep recommendation model fall into two groups: numeric features, which take continuous values, and categorical features, which take discrete values. Take as an example the data HugeCTR synthesizes in the Criteo click-through-rate format, where each sample has 40 columns in total: 14 numeric-type columns (a 1-dimensional label plus 13 dense features) and 26 categorical features:
-
Numeric features (also called dense features):
-
      _col0     _col1     _col2     _col3     _col4     _col5     _col6
0  0.080380  0.435741  0.078185  0.194161  0.087724  0.845081  0.937019
1  0.310647  0.669963  0.218886  0.945537  0.735421  0.637027  0.007011
2  0.337267  0.908792  0.795987  0.608301  0.290421  0.012273  0.671650
3  0.873908  0.694296  0.796788  0.553089  0.872149  0.502299  0.114150
4  0.333109  0.456773  0.403027  0.091778  0.215718  0.729457  0.941204
-
      _col7     _col8     _col9    _col10    _col11    _col12    _col13
0  0.977882  0.042342  0.054632  0.855919  0.264451  0.224891  0.467242
1  0.204856  0.307856  0.775143  0.265654  0.301945  0.066413  0.499416
2  0.960113  0.018073  0.639101  0.229013  0.645756  0.123180  0.894010
3  0.444433  0.001794  0.147979  0.083302  0.744487  0.971924  0.362019
4  0.997079  0.563684  0.811862  0.457039  0.133213  0.169442  0.124149
-
-
Categorical features (also called sparse features or category features):
-
   _col14  _col15  _col16  _col17  _col18  _col19  _col20  _col21  _col22
0     151       0       9      13       1       1       1       9       4
1       0       0      11    4801      44       2     160       9       0
2    4549       3       1      10      31       2     485       2      10
3       0       0       1       1       1       0       3     111      10
4       2       5     160       0      72       0      13      53       0
-
   _col23  _col24  _col25  _col26  _col27  _col28  _col29  _col30  _col31
0       2     395      41       1      14       5       2       7       0
1     101       3       1       1       4       1       1       1       1
2       2      19       6       6       1       0       4       1       2
3       1       2      38       6       1       7       1       2       0
4      63     616       7       1     175      23       4       0       1
-
   _col32  _col33  _col34  _col35  _col36  _col37  _col38  _col39
0       1       1    5283       4       0      21      33       1
1       2       3     204     310    1640       6       4       6
2       4       7      29       2      11      66       2      22
3       9      43       2      10     286       6       2       0
4       0     477      10       6       0       2       0      30
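The two tables above can be reproduced by reading any of the generated Parquet files with pandas. A minimal sketch, assuming the file layout produced by the generator run shown later in this section:

# inspect_data.py -- a minimal sketch; the file path assumes the generator
# invocation shown below (python generate_data.py --dir_name "file0").
import pandas as pd

df = pd.read_parquet("./etc_data/file0/train/gen_0.parquet")

numeric_cols = [f"_col{i}" for i in range(14)]      # label + dense features
sparse_cols  = [f"_col{i}" for i in range(14, 40)]  # 26 categorical slots

print(df[numeric_cols].head())  # continuous values, as in the first table
print(df[sparse_cols].head())   # integer category IDs, as in the second table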
-
-
Data-related attributes:
-
Taking the generation of synthetic data as an example:
-
# generate_data.py
import hugectr
from hugectr.tools import DataGenerator, DataGeneratorParams
from mpi4py import MPI
import argparse

parser = argparse.ArgumentParser(description=("Data Generation"))
parser.add_argument("--num_files", type=int, help="number of files in training data", default = 8)
parser.add_argument("--eval_num_files", type=int, help="number of files in validation data", default = 2)
parser.add_argument('--num_samples_per_file', type=int, help="number of samples per file", default=1000000)
parser.add_argument('--dir_name', type=str, help="data directory name(Required)")
args = parser.parse_args()

data_generator_params = DataGeneratorParams(
    format = hugectr.DataReaderType_t.Parquet,
    label_dim = 1,
    dense_dim = 13,
    num_slot = 26,
    num_files = args.num_files,
    eval_num_files = args.eval_num_files,
    i64_input_key = True,
    num_samples_per_file = args.num_samples_per_file,
    source = "./etc_data/" + args.dir_name + "/file_list.txt",
    eval_source = "./etc_data/" + args.dir_name + "/file_list_test.txt",
    slot_size_array = [12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33],
    nnz_array = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    # for parquet, check_type doesn't make any difference
    check_type = hugectr.Check_t.Non,
    dist_type = hugectr.Distribution_t.PowerLaw,
    power_law_type = hugectr.PowerLaw_t.Short)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()
-
num_files: the number of subsets the dataset is split into. When training with ETC, each subset is the data consumed by one pass, so the number of subsets equals the number of training passes.
-
num_samples_per_file: the number of samples in each subset.
-
label_dim: the label dimension of each sample.
-
dense_dim: the number of continuous (dense) features.
-
num_slot: the number of categorical features, i.e. the number of slots (feature fields).
-
slot_size_array: the maximum number of distinct features in each slot (i.e. the cardinality of each feature field); see the sketch after this list for how these sizes map to global key ranges.
-
nnz_array: the maximum number of features a sample can have at the same time in the corresponding slot (feature field), which determines whether the slot is one-hot or multi-hot.
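The generated category IDs are globally unique across slots: each slot occupies a contiguous key range whose boundaries are the prefix sums of slot_size_array, which is exactly what the keyset listing further below shows (e.g. _col15 starts at 12988, _col16 at 20117). A small illustrative sketch (not part of the HugeCTR API):

# slot_ranges.py -- illustration only: derive each slot's global key range
# from slot_size_array via prefix sums.
import numpy as np

slot_size_array = [12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274,
                   10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817,
                   13908, 13447, 9447, 5867, 45, 33]

offsets = np.concatenate([[0], np.cumsum(slot_size_array)])
for i, (lo, hi) in enumerate(zip(offsets[:-1], offsets[1:])):
    print(f"_col{14 + i}: keys in [{lo}, {hi})")
# _col14: keys in [0, 12988)
# _col15: keys in [12988, 20117)
# _col16: keys in [20117, 28837)
# ...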
-
-
Run the script to generate the synthetic data (8 training subsets and 2 test subsets in total):
-
python generate_data.py --dir_name "file0"
[HCTR][03:28:28.823][INFO][RK0][main]: Generate Parquet dataset
[HCTR][03:28:28.823][INFO][RK0][main]: train data folder: ./etc_data/file0, eval data folder: ./etc_data/file0, slot_size_array: 12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33, nnz array: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, #files for train: 8, #files for eval: 2, #samples per file: 1000000, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0 exist
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0/train/gen_0.parquet
[HCTR][03:28:33.757][INFO][RK0][main]: ./etc_data/file0/train/gen_1.parquet
[HCTR][03:28:36.560][INFO][RK0][main]: ./etc_data/file0/train/gen_2.parquet
[HCTR][03:28:39.337][INFO][RK0][main]: ./etc_data/file0/train/gen_3.parquet
[HCTR][03:28:42.083][INFO][RK0][main]: ./etc_data/file0/train/gen_4.parquet
[HCTR][03:28:44.807][INFO][RK0][main]: ./etc_data/file0/train/gen_5.parquet
[HCTR][03:28:47.641][INFO][RK0][main]: ./etc_data/file0/train/gen_6.parquet
[HCTR][03:28:50.377][INFO][RK0][main]: ./etc_data/file0/train/gen_7.parquet
[HCTR][03:28:53.131][INFO][RK0][main]: ./etc_data/file0/file_list.txt done!
[HCTR][03:28:53.132][INFO][RK0][main]: ./etc_data/file0/val/gen_0.parquet
[HCTR][03:28:55.941][INFO][RK0][main]: ./etc_data/file0/val/gen_1.parquet
[HCTR][03:28:58.788][INFO][RK0][main]: ./etc_data/file0/file_list_test.txt done!
-
Output keysets (taking gen_0 as an example):
-
Each subset (the data used by one pass) produces its own keyset, and the keysets of all subsets form a larger collection, all_keysets. From all_keysets, keyset files (xxx.keyset) are generated so that at time t the GPU can prefetch the data needed at time t+1 into the ETC.
-
------------------------
unique_keys:
0            0
1            1
2            2
3            3
4            4
         ...
12115    12978
12116    12979
12117    12981
12118    12983
12119    12986
Name: _col14, Length: 12120, dtype: int64
------------------------
unique_keys:
0       12988
1       12989
2       12990
3       12991
4       12992
        ...
7077    20111
7078    20112
7079    20113
7080    20114
7081    20116
Name: _col15, Length: 7082, dtype: int64
------------------------
unique_keys:
0       20117
1       20118
2       20119
3       20120
4       20121
        ...
8554    28827
8555    28828
8556    28830
8557    28831
8558    28834
Name: _col16, Length: 8559, dtype: int64
------------------------
unique_keys:
0       28837
1       28838
2       28839
3       28840
4       28841
        ...
5798    34652
5799    34653
5800    34654
5801    34655
5802    34656
Name: _col17, Length: 5803, dtype: int64
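A minimal sketch of how such a per-pass keyset could be assembled: read the pass's Parquet file, collect the unique keys of every categorical column, and dump them as a flat int64 array. The on-disk .keyset format assumed here (a flat array of int64 keys) and the file names are assumptions; refer to the HugeCTR ETC examples for the authoritative tooling.

# make_keyset.py -- a sketch, not the official HugeCTR tooling. Assumes the
# keyset file is simply a flat array of int64 keys.
import numpy as np
import pandas as pd

df = pd.read_parquet("./etc_data/file0/train/gen_0.parquet")
sparse_cols = [f"_col{i}" for i in range(14, 40)]

# Union of the unique keys of all categorical columns in this pass
unique_keys = np.unique(np.concatenate([df[c].unique() for c in sparse_cols]))
unique_keys.astype(np.int64).tofile("./etc_data/file0/train/gen_0.keyset")
print(f"gen_0: {len(unique_keys)} unique keys")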
-
-
-
-
Example of training with ETC pass by pass (unlike the example above, this run uses 10 passes):
-
Pass ID    Number of Unique Keys    Embedding size (GB)
#0         24199179                 11.54
#1         26015075                 12.40
#2         27387817                 13.06
#3         23672542                 11.29
#4         26053910                 12.42
#5         27697628                 13.21
#6         24727672                 11.79
#7         25643779                 12.23
#8         26374086                 12.58
#9         26580983                 12.67
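The "Embedding size (GB)" column follows directly from the number of unique keys per pass. Assuming float32 weights and an embedding vector size of 128 (an assumption, but one that reproduces the figures in the table), each pass needs keys × 128 × 4 bytes of embedding storage:

# embedding_size.py -- back-of-the-envelope check of the table above,
# assuming emb_vec_size = 128 and float32 (4-byte) weights.
EMB_VEC_SIZE = 128    # assumed embedding vector size
BYTES_PER_WEIGHT = 4  # float32

unique_keys_per_pass = [24199179, 26015075, 27387817, 23672542, 26053910,
                        27697628, 24727672, 25643779, 26374086, 26580983]

for pass_id, n_keys in enumerate(unique_keys_per_pass):
    size_gb = n_keys * EMB_VEC_SIZE * BYTES_PER_WEIGHT / 2**30
    print(f"#{pass_id}: {size_gb:.2f} GB")  # #0: 11.54, #1: 12.40, ...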
-
-
During data preprocessing, categorical features are converted into integer sequences for HugeCTR:
-
Using NVTabular:
-
# e.g. using NVTabular
import nvtabular as nvt

target_encode = (
    ['brand', 'user_id', 'product_id', 'cat_2', ['ts_weekday', 'ts_day']] >>
    nvt.ops.TargetEncoding(
        nvt.ColumnGroup('target'),
        kfold=5,
        p_smooth=20,
        out_dtype="float32",
    )
)
-
https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/ops/targetencoding.html
Target encoding is a common feature-engineering technique for categorical columns in tabular datasets. For each categorical group, the mean of a continuous target column is calculated, and the group-specific mean of each row is used to create a new feature (column). To prevent overfitting, the following additional logic is applied:
-
Cross Validation: To prevent overfitting in training data, a cross-validation strategy is used - The data is split into k random “folds”, and the mean values within the i-th fold are calculated with data from all other folds. The cross-validation strategy is only employed when the dataset is used to update recorded statistics. For transformation-only workflow execution, global-mean statistics are used instead.
-
Smoothing: To prevent overfitting for low cardinality categories, the means are smoothed with the overall mean of the target variable.
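Conceptually, the smoothed encoding blends the per-category mean with the global mean, weighted by the category count and p_smooth. A simplified pandas illustration of the smoothing formula (not the NVTabular implementation, and it omits the k-fold logic; column names are made up):

# smoothed_te_sketch.py -- simplified illustration of smoothed target encoding,
# not the NVTabular implementation (no k-fold logic). Column names are examples.
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, p_smooth=20):
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Rare categories are pulled toward the global mean to limit overfitting.
    smoothed = (stats["count"] * stats["mean"] + p_smooth * global_mean) / (
        stats["count"] + p_smooth
    )
    return df[cat_col].map(smoothed)

df = pd.DataFrame({"brand": ["a", "a", "b", "c"], "target": [1, 0, 1, 0]})
df["brand_TE"] = smoothed_target_encode(df, "brand", "target")
print(df)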
-
-