1. Introduction
References:
- Project repository: https://github.com/NVIDIA-Merlin/distributed-embeddings
- Related blog: https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
- Related project: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Recommendation/DLRM#hybrid-parallel-training-with-merlin-distributed-embeddings
- Project documentation: https://nvidia-merlin.github.io/distributed-embeddings/index.html
Built on TensorFlow 2 for large embeddings, it provides a scalable model-parallel wrapper that automatically distributes embedding tables across multiple GPUs (currently supports table-wise and column-wise sharding).
Supports hybrid model parallelism (dist_model_parallel):
Only a few lines of code need to change to use it:
```python
import dist_model_parallel as dmp

class MyEmbeddingModel(tf.keras.Model):
  def __init__(self, table_sizes):
    ...
    self.embedding_layers = [tf.keras.layers.Embedding(input_dim, output_dim)
                             for input_dim, output_dim in table_sizes]
    # 1. Add this line to wrap list of embedding layers used in the model
    self.embedding_layers = dmp.DistributedEmbedding(self.embedding_layers)

  def call(self, inputs):
    # embedding_outputs = [e(i) for e, i in zip(self.embedding_layers, inputs)]
    embedding_outputs = self.embedding_layers(inputs)

@tf.function
def training_step(inputs, labels, first_batch):
  with tf.GradientTape() as tape:
    probs = model(inputs)
    loss_value = loss(labels, probs)

  # 2. Change Horovod Gradient Tape to dmp tape
  # tape = hvd.DistributedGradientTape(tape)
  tape = dmp.DistributedGradientTape(tape)
  grads = tape.gradient(loss_value, model.trainable_variables)
  opt.apply_gradients(zip(grads, model.trainable_variables))

  if first_batch:
    # 3. Change Horovod broadcast_variables to dmp's
    # hvd.broadcast_variables(model.variables, root_rank=0)
    dmp.broadcast_variables(model.variables, root_rank=0)
  return loss_value
```
Example:
```python
import tensorflow as tf
from distributed_embeddings.python.layers.embedding import Embedding
from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding

# distributed-embeddings layers; the second combines multi-hot inputs with "mean"
layer0 = Embedding(1000, 64, name="emb_0")
layer1 = Embedding(1000, 64, combiner="mean", name="emb_1")
# plain Keras embedding layers can be wrapped as well
layer2 = tf.keras.layers.Embedding(1000, 64, name="emb_2")
layer3 = tf.keras.layers.Embedding(1000, 64, name="emb_3")

layers0 = [layer0, layer1]
layers1 = [layer2, layer3]

# Distribute each group of tables across GPUs; tables with more than
# column_slice_threshold elements are sliced column-wise
emb_layers0 = DistributedEmbedding(layers0, column_slice_threshold=32000)
emb_layers1 = DistributedEmbedding(layers1, column_slice_threshold=32000)
```
API:
distributed_embeddings.python.layers.embedding.Embedding
- The interface largely mirrors tf.keras.layers.Embedding, with added support for looking up multiple embedding rows at once and combining them (sum/mean) into a single output vector.
- Supports several input formats (illustrated in the sketch below): one-hot; fixed hotness (multi-hot, every sample has the same number of ids, so a regular dense tensor can be used); variable hotness (multi-hot, samples have different numbers of ids, passed as tf.RaggedTensor); and sparse input (multi-hot, passed as tf.sparse.SparseTensor, with the sparse tensor stored in CSR format).
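A minimal sketch of these input formats; the table size, ids, and the sum combiner below are illustrative assumptions, not values from the project:

```python
import tensorflow as tf
from distributed_embeddings.python.layers.embedding import Embedding

# One-hot lookup: no combiner, behaves like tf.keras.layers.Embedding
onehot_emb = Embedding(1000, 16)
print(onehot_emb(tf.constant([1, 7, 42])).shape)  # (3, 16)

# Multi-hot lookups: the combiner reduces the looked-up rows to one vector per sample
multihot_emb = Embedding(1000, 16, combiner="sum")

# Fixed hotness: every sample has the same number of ids (plain dense tensor)
print(multihot_emb(tf.constant([[1, 2, 3], [4, 5, 6]])).shape)  # (2, 16)

# Variable hotness: samples have different numbers of ids (tf.RaggedTensor)
print(multihot_emb(tf.ragged.constant([[1, 2], [3], [4, 5, 6]])).shape)  # (3, 16)

# Sparse input: ids carried in a tf.sparse.SparseTensor
sparse_ids = tf.sparse.SparseTensor(indices=[[0, 0], [0, 1], [1, 0]],
                                    values=[1, 2, 3],
                                    dense_shape=[2, 2])
print(multihot_emb(sparse_ids).shape)  # (2, 16)
```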
distributed_embeddings.python.layers.dist_model_parallel.DistributedEmbedding
- Supports slicing an embedding table by column, for cases where a table is too large to fit in a single GPU's memory.
- The column_slice_threshold parameter sets the slicing threshold in number of elements; for instance, each 1000 x 64 table in the example above holds 64,000 elements, so with column_slice_threshold=32000 it would be sliced column-wise (roughly into two 1000 x 32 slices).
- Can wrap shared embedding tables, e.g. two feature fields watched_video and browsed_video whose values are both videos and can therefore share the same embedding table (see the sketch after this list).
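A hedged sketch of sharing one table between two feature fields; the input_table_map argument and all sizes here are assumptions for illustration, not values from this note:

```python
import tensorflow as tf
import horovod.tensorflow as hvd
from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding

hvd.init()  # dist_model_parallel distributes tables across the Horovod ranks

video_table = tf.keras.layers.Embedding(100_000, 64)  # shared by both video fields
category_table = tf.keras.layers.Embedding(5_000, 32)

# Inputs arrive in the order [watched_video, browsed_video, category];
# input_table_map (assumed API) routes the first two inputs to table 0 and the third to table 1.
shared_emb = DistributedEmbedding(
    [video_table, category_table],
    input_table_map=[0, 0, 1],
)

watched_video = tf.constant([[11], [12]])
browsed_video = tf.constant([[13], [14]])
category = tf.constant([[1], [2]])
outputs = shared_emb([watched_video, browsed_video, category])  # three output tensors
```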
Performance:
- DLRM model with 113 billion parameters (421 GiB model size) trained on the Criteo Terabyte Click Logs dataset (official numbers, not measured locally)
| Hardware | Description | Training Throughput (samples/second) | Speedup over CPU |
| --- | --- | --- | --- |
| 2 x AMD EPYC 7742 | Both MLP layers and embeddings on CPU | 17.7k | 1x |
| 1 x A100-80GB; 2 x AMD EPYC 7742 | Large embeddings on CPU, everything else on GPU | 768k | 43x |
| DGX A100 (8xA100-80GB) | Hybrid parallel with NVIDIA Merlin Distributed-Embeddings, whole model on GPU | 12.1M | 683x |

- Synthetic models benchmark (see the benchmark tests in the test part of Section 3)
- Model size definitions

| Model | Total number of embedding tables | Total embedding size (GiB) |
| --- | --- | --- |
| Tiny | 55 | 4.2 |
| Small | 107 | 26.3 |
| Medium | 311 | 206.2 |
| Large | 612 | 773.8 |
| Jumbo | 1,022 | 3,109.5 |

- Official performance (DGX-A100-80GB, batch size = 65536, optimizer = Adagrad); training step time in ms
| Model | 1 GPU | 8 GPU | 16 GPU | 32 GPU | 128 GPU |
| --- | --- | --- | --- | --- | --- |
| Tiny | 17.6 | 3.6 | 3.2 | | |
| Small | 57.8 | 14.0 | 11.6 | 7.4 | |
| Medium | | 64.4 | 44.9 | 31.1 | 17.2 |
| Large | | | 65.0 | 33.4 | |
| Jumbo | | | | | 102.3 |

- Measured performance (Tesla T4, batch size = 65536, optimizer = Adagrad); training step time in ms
| Model | 1 GPU | 4 GPU |
| --- | --- | --- |
| Tiny | 42.703 | 82.856 |

- Comparison with native TensorFlow data parallelism; training step time in ms
| Solution | 1 GPU | 2 GPU | 4 GPU | 8 GPU |
| --- | --- | --- | --- | --- |
| NVIDIA Merlin Distributed Embeddings Model Parallel | 17.7 | 11.6 | 6.4 | 4.2 |
| Native TensorFlow Data Parallel | 19.9 | 20.2 | 21.2 | 22.3 |