1. Data
-
The input features of a deep recommendation model fall into two groups: numeric features, which take continuous values, and categorical features, which take discrete values. Take as an example the data HugeCTR synthesizes in the Criteo click-through-rate format, where each sample has 40 columns in total: 14 numeric-type columns (a 1-dimensional label plus 13 dense features) and 26 categorical features:
-
Numeric features (also called dense features):
-
      _col0     _col1     _col2     _col3     _col4     _col5     _col6
0  0.080380  0.435741  0.078185  0.194161  0.087724  0.845081  0.937019
1  0.310647  0.669963  0.218886  0.945537  0.735421  0.637027  0.007011
2  0.337267  0.908792  0.795987  0.608301  0.290421  0.012273  0.671650
3  0.873908  0.694296  0.796788  0.553089  0.872149  0.502299  0.114150
4  0.333109  0.456773  0.403027  0.091778  0.215718  0.729457  0.941204
-
      _col7     _col8     _col9    _col10    _col11    _col12    _col13
0  0.977882  0.042342  0.054632  0.855919  0.264451  0.224891  0.467242
1  0.204856  0.307856  0.775143  0.265654  0.301945  0.066413  0.499416
2  0.960113  0.018073  0.639101  0.229013  0.645756  0.123180  0.894010
3  0.444433  0.001794  0.147979  0.083302  0.744487  0.971924  0.362019
4  0.997079  0.563684  0.811862  0.457039  0.133213  0.169442  0.124149
-
-
Categorical features (also called sparse features or category features):
-
   _col14  _col15  _col16  _col17  _col18  _col19  _col20  _col21  _col22
0     151       0       9      13       1       1       1       9       4
1       0       0      11    4801      44       2     160       9       0
2    4549       3       1      10      31       2     485       2      10
3       0       0       1       1       1       0       3     111      10
4       2       5     160       0      72       0      13      53       0
-
   _col23  _col24  _col25  _col26  _col27  _col28  _col29  _col30  _col31
0       2     395      41       1      14       5       2       7       0
1     101       3       1       1       4       1       1       1       1
2       2      19       6       6       1       0       4       1       2
3       1       2      38       6       1       7       1       2       0
4      63     616       7       1     175      23       4       0       1
-
   _col32  _col33  _col34  _col35  _col36  _col37  _col38  _col39
0       1       1    5283       4       0      21      33       1
1       2       3     204     310    1640       6       4       6
2       4       7      29       2      11      66       2      22
3       9      43       2      10     286       6       2       0
4       0     477      10       6       0       2       0      30
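The two tables above can be reproduced by reading any of the generated Parquet files with pandas. A minimal sketch, assuming the file layout produced by the generator run shown later in this section:

# inspect_data.py -- a minimal sketch; the file path assumes the generator
# invocation shown below (python generate_data.py --dir_name "file0").
import pandas as pd

df = pd.read_parquet("./etc_data/file0/train/gen_0.parquet")

numeric_cols = [f"_col{i}" for i in range(14)]      # label + dense features
sparse_cols  = [f"_col{i}" for i in range(14, 40)]  # 26 categorical slots

print(df[numeric_cols].head())  # continuous values, as in the first table
print(df[sparse_cols].head())   # integer category IDs, as in the second table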
-
-
Data-related attributes:
-
Taking the generation of synthetic data as an example:
-
# generate_data.py
import hugectr
from hugectr.tools import DataGenerator, DataGeneratorParams
from mpi4py import MPI
import argparse

parser = argparse.ArgumentParser(description=("Data Generation"))
parser.add_argument("--num_files", type=int, help="number of files in training data", default = 8)
parser.add_argument("--eval_num_files", type=int, help="number of files in validation data", default = 2)
parser.add_argument('--num_samples_per_file', type=int, help="number of samples per file", default=1000000)
parser.add_argument('--dir_name', type=str, help="data directory name(Required)")
args = parser.parse_args()

data_generator_params = DataGeneratorParams(
    format = hugectr.DataReaderType_t.Parquet,
    label_dim = 1,
    dense_dim = 13,
    num_slot = 26,
    num_files = args.num_files,
    eval_num_files = args.eval_num_files,
    i64_input_key = True,
    num_samples_per_file = args.num_samples_per_file,
    source = "./etc_data/" + args.dir_name + "/file_list.txt",
    eval_source = "./etc_data/" + args.dir_name + "/file_list_test.txt",
    slot_size_array = [12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33],
    nnz_array = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    # for parquet, check_type doesn't make any difference
    check_type = hugectr.Check_t.Non,
    dist_type = hugectr.Distribution_t.PowerLaw,
    power_law_type = hugectr.PowerLaw_t.Short)
data_generator = DataGenerator(data_generator_params)
data_generator.generate()
-
num_files: the number of subsets the dataset is split into. When training with ETC, each subset is the data consumed by one pass, so the number of subsets equals the number of training passes.
-
num_samples_per_file: the number of samples in each subset.
-
label_dim: the label dimension of each sample.
-
dense_dim: the number of continuous (dense) features.
-
num_slot: the number of categorical features, i.e. the number of slots (feature fields).
-
slot_size_array: the maximum number of distinct features in each slot (i.e. the cardinality of each feature field); see the sketch after this list for how these sizes map to global key ranges.
-
nnz_array: the maximum number of features a sample can have at the same time in the corresponding slot (feature field), which determines whether the slot is one-hot or multi-hot.
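The generated category IDs are globally unique across slots: each slot occupies a contiguous key range whose boundaries are the prefix sums of slot_size_array, which is exactly what the keyset listing further below shows (e.g. _col15 starts at 12988, _col16 at 20117). A small illustrative sketch (not part of the HugeCTR API):

# slot_ranges.py -- illustration only: derive each slot's global key range
# from slot_size_array via prefix sums.
import numpy as np

slot_size_array = [12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274,
                   10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817,
                   13908, 13447, 9447, 5867, 45, 33]

offsets = np.concatenate([[0], np.cumsum(slot_size_array)])
for i, (lo, hi) in enumerate(zip(offsets[:-1], offsets[1:])):
    print(f"_col{14 + i}: keys in [{lo}, {hi})")
# _col14: keys in [0, 12988)
# _col15: keys in [12988, 20117)
# _col16: keys in [20117, 28837)
# ...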
-
-
Run the script to generate the synthetic data (8 training subsets and 2 test subsets in total):
-
python generate_data.py --dir_name "file0"
[HCTR][03:28:28.823][INFO][RK0][main]: Generate Parquet dataset
[HCTR][03:28:28.823][INFO][RK0][main]: train data folder: ./etc_data/file0, eval data folder: ./etc_data/file0, slot_size_array: 12988, 7129, 8720, 5820, 15196, 4, 4914, 1020, 30, 14274, 10220, 15088, 10, 1518, 3672, 48, 4, 820, 15, 12817, 13908, 13447, 9447, 5867, 45, 33, nnz array: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, #files for train: 8, #files for eval: 2, #samples per file: 1000000, Use power law distribution: 1, alpha of power law: 1.3
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0 exist
[HCTR][03:28:28.823][INFO][RK0][main]: ./etc_data/file0/train/gen_0.parquet
[HCTR][03:28:33.757][INFO][RK0][main]: ./etc_data/file0/train/gen_1.parquet
[HCTR][03:28:36.560][INFO][RK0][main]: ./etc_data/file0/train/gen_2.parquet
[HCTR][03:28:39.337][INFO][RK0][main]: ./etc_data/file0/train/gen_3.parquet
[HCTR][03:28:42.083][INFO][RK0][main]: ./etc_data/file0/train/gen_4.parquet
[HCTR][03:28:44.807][INFO][RK0][main]: ./etc_data/file0/train/gen_5.parquet
[HCTR][03:28:47.641][INFO][RK0][main]: ./etc_data/file0/train/gen_6.parquet
[HCTR][03:28:50.377][INFO][RK0][main]: ./etc_data/file0/train/gen_7.parquet
[HCTR][03:28:53.131][INFO][RK0][main]: ./etc_data/file0/file_list.txt done!
[HCTR][03:28:53.132][INFO][RK0][main]: ./etc_data/file0/val/gen_0.parquet
[HCTR][03:28:55.941][INFO][RK0][main]: ./etc_data/file0/val/gen_1.parquet
[HCTR][03:28:58.788][INFO][RK0][main]: ./etc_data/file0/file_list_test.txt done!
-
Output keysets (taking gen_0 as an example):
-
Each subset (the data used by one pass) produces its own keyset, and the keysets of all subsets form a larger collection, all_keysets. From all_keysets, keyset files (xxx.keyset) are generated so that at time t the GPU can prefetch the data needed at time t+1 into the ETC.
-
------------------------
unique_keys:
0            0
1            1
2            2
3            3
4            4
         ...
12115    12978
12116    12979
12117    12981
12118    12983
12119    12986
Name: _col14, Length: 12120, dtype: int64
------------------------
unique_keys:
0       12988
1       12989
2       12990
3       12991
4       12992
        ...
7077    20111
7078    20112
7079    20113
7080    20114
7081    20116
Name: _col15, Length: 7082, dtype: int64
------------------------
unique_keys:
0       20117
1       20118
2       20119
3       20120
4       20121
        ...
8554    28827
8555    28828
8556    28830
8557    28831
8558    28834
Name: _col16, Length: 8559, dtype: int64
------------------------
unique_keys:
0       28837
1       28838
2       28839
3       28840
4       28841
        ...
5798    34652
5799    34653
5800    34654
5801    34655
5802    34656
Name: _col17, Length: 5803, dtype: int64
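A minimal sketch of how such a per-pass keyset could be assembled: read the pass's Parquet file, collect the unique keys of every categorical column, and dump them as a flat int64 array. The on-disk .keyset format assumed here (a flat array of int64 keys) and the file names are assumptions; refer to the HugeCTR ETC examples for the authoritative tooling.

# make_keyset.py -- a sketch, not the official HugeCTR tooling. Assumes the
# keyset file is simply a flat array of int64 keys.
import numpy as np
import pandas as pd

df = pd.read_parquet("./etc_data/file0/train/gen_0.parquet")
sparse_cols = [f"_col{i}" for i in range(14, 40)]

# Union of the unique keys of all categorical columns in this pass
unique_keys = np.unique(np.concatenate([df[c].unique() for c in sparse_cols]))
unique_keys.astype(np.int64).tofile("./etc_data/file0/train/gen_0.keyset")
print(f"gen_0: {len(unique_keys)} unique keys")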
-
-
-
-
Example of training with ETC pass by pass (unlike the example above, this run uses 10 passes):
-
Pass ID    Number of Unique Keys    Embedding size (GB)
#0         24199179                 11.54
#1         26015075                 12.40
#2         27387817                 13.06
#3         23672542                 11.29
#4         26053910                 12.42
#5         27697628                 13.21
#6         24727672                 11.79
#7         25643779                 12.23
#8         26374086                 12.58
#9         26580983                 12.67
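The "Embedding size (GB)" column follows directly from the number of unique keys per pass. Assuming float32 weights and an embedding vector size of 128 (an assumption, but one that reproduces the figures in the table), each pass needs keys × 128 × 4 bytes of embedding storage:

# embedding_size.py -- back-of-the-envelope check of the table above,
# assuming emb_vec_size = 128 and float32 (4-byte) weights.
EMB_VEC_SIZE = 128    # assumed embedding vector size
BYTES_PER_WEIGHT = 4  # float32

unique_keys_per_pass = [24199179, 26015075, 27387817, 23672542, 26053910,
                        27697628, 24727672, 25643779, 26374086, 26580983]

for pass_id, n_keys in enumerate(unique_keys_per_pass):
    size_gb = n_keys * EMB_VEC_SIZE * BYTES_PER_WEIGHT / 2**30
    print(f"#{pass_id}: {size_gb:.2f} GB")  # #0: 11.54, #1: 12.40, ...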
-
-
During data preprocessing, categorical features are converted into integer sequences for HugeCTR:
-
Using NVTabular:
-
# e.g. using NVTabular
import nvtabular as nvt

target_encode = (
    ['brand', 'user_id', 'product_id', 'cat_2', ['ts_weekday', 'ts_day']] >>
    nvt.ops.TargetEncoding(
        nvt.ColumnGroup('target'),
        kfold=5,
        p_smooth=20,
        out_dtype="float32",
    )
)
-
https://nvidia-merlin.github.io/NVTabular/v0.7.1/api/ops/targetencoding.html
Target encoding is a common feature-engineering technique for categorical columns in tabular datasets. For each categorical group, the mean of a continuous target column is calculated, and the group-specific mean of each row is used to create a new feature (column). To prevent overfitting, the following additional logic is applied:
-
Cross Validation: To prevent overfitting in training data, a cross-validation strategy is used - The data is split into k random “folds”, and the mean values within the i-th fold are calculated with data from all other folds. The cross-validation strategy is only employed when the dataset is used to update recorded statistics. For transformation-only workflow execution, global-mean statistics are used instead.
-
Smoothing: To prevent overfitting for low cardinality categories, the means are smoothed with the overall mean of the target variable.
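Conceptually, the smoothed encoding blends the per-category mean with the global mean, weighted by the category count and p_smooth. A simplified pandas illustration of the smoothing formula (not the NVTabular implementation, and it omits the k-fold logic; column names are made up):

# smoothed_te_sketch.py -- simplified illustration of smoothed target encoding,
# not the NVTabular implementation (no k-fold logic). Column names are examples.
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, p_smooth=20):
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    # Rare categories are pulled toward the global mean to limit overfitting.
    smoothed = (stats["count"] * stats["mean"] + p_smooth * global_mean) / (
        stats["count"] + p_smooth
    )
    return df[cat_col].map(smoothed)

df = pd.DataFrame({"brand": ["a", "a", "b", "c"], "target": [1, 0, 1, 0]})
df["brand_TE"] = smoothed_target_encode(df, "brand", "target")
print(df)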
-
-