Distributed_embeddings

1. Introduction

  • References:

    • Project repository: https://github.com/NVIDIA-Merlin/distributed-embeddings
    • Related blog post: https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
    • Related project: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Recommendation/DLRM#hybrid-parallel-training-with-merlin-distributed-embeddings
    • Project documentation: https://nvidia-merlin.github.io/distributed-embeddings/index.html
  • Builds large embeddings on top of TF2 and provides a scalable model-parallel wrapper that automatically distributes embedding tables across multiple GPUs (currently supports table-wise and column-wise sharding)

  • Supports hybrid model parallelism (dist_model_parallel):

  • Only a few lines of code need to change to use it:

    import tensorflow as tf
    import distributed_embeddings.python.layers.dist_model_parallel as dmp
     
    class MyEmbeddingModel(tf.keras.Model):
      def __init__(self, table_sizes):
        ...
        self.embedding_layers = [tf.keras.layers.Embedding(input_dim, output_dim) for input_dim, output_dim in table_sizes]
        # 1. Add this line to wrap list of embedding layers used in the model
        self.embedding_layers = dmp.DistributedEmbedding(self.embedding_layers)
      def call(self, inputs):
        # embedding_outputs = [e(i) for e, i in zip(self.embedding_layers, inputs)]
        embedding_outputs = self.embedding_layers(inputs)
         
    @tf.function
    def training_step(inputs, labels, first_batch):
      with tf.GradientTape() as tape:
        probs = model(inputs)
        loss_value = loss(labels, probs)
     
      # 2. Change Horovod Gradient Tape to dmp tape
      # tape = hvd.DistributedGradientTape(tape)
      tape = dmp.DistributedGradientTape(tape)
      grads = tape.gradient(loss_value, model.trainable_variables)
      opt.apply_gradients(zip(grads, model.trainable_variables))
     
      if first_batch:
        # 3. Change Horovod broadcast_variables to dmp's
        # hvd.broadcast_variables(model.variables, root_rank=0)
        dmp.broadcast_variables(model.variables, root_rank=0)
      return loss_value
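
  • The snippet above assumes the usual Horovod setup has already run (dmp.DistributedGradientTape and dmp.broadcast_variables are drop-in replacements for the Horovod versions). Below is a minimal setup sketch, assuming the standard horovod.tensorflow API and one process per GPU; table_sizes, the optimizer, and the loss are placeholders added only for illustration:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    # Pin one GPU to each Horovod process.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder (input_dim, output_dim) pairs and training objects used by the snippet above.
    table_sizes = [(100000, 64), (50000, 32)]
    model = MyEmbeddingModel(table_sizes)
    opt = tf.keras.optimizers.SGD(0.01)
    loss = tf.keras.losses.BinaryCrossentropy()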
    
  • Example:

    import tensorflow as tf
    from distributed_embeddings.python.layers.embedding import Embedding
    from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding
    layer0 = Embedding(1000, 64, name="emb_0")
    layer1 = Embedding(1000, 64, combiner="mean", name="emb_1")
    layer2 = tf.keras.layers.Embedding(1000, 64, name="emb_2")
    layer3 = tf.keras.layers.Embedding(1000, 64, name="emb_3")
    layers0 = [layer0, layer1]
    layers1 = [layer2, layer3]
    emb_layers0 = DistributedEmbedding(layers0, column_slice_threshold=32000)
    emb_layers1 = DistributedEmbedding(layers1, column_slice_threshold=32000)
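
  • Once wrapped, the DistributedEmbedding object is called with a list of input tensors, one per wrapped table, and returns the per-table outputs in the same order (mirroring the commented-out list comprehension in the first snippet). A rough call sketch continuing the example above, with made-up ids and assuming Horovod has been initialized beforehand:

    batch_size = 4
    ids_a = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int64)     # one id per sample for emb_0
    ids_b = tf.random.uniform([batch_size, 3], maxval=1000, dtype=tf.int64)  # 3 ids per sample for the "mean" combiner of emb_1

    outputs = emb_layers0([ids_a, ids_b])     # list of two tensors, one per wrapped table
    print([tuple(o.shape) for o in outputs])  # expected [(4, 64), (4, 64)]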
    
  • API:

    • distributed_embeddings.python.layers.embedding.Embedding

      • The interface largely mirrors tf.keras.layers.Embedding, with added support for looking up multiple embeddings per sample and combining them (sum/mean) into a single output vector
      • Supports several input formats: one-hot / fixed hotness (multi-hot, same number of ids per sample, plain dense tensor) / variable hotness (multi-hot, different number of ids per sample, via tf.RaggedTensor) / sparse tensor (multi-hot, via tf.sparse.SparseTensor, which stores the sparse tensor in a CSR-like layout); see the sketch after this list
    • distributed_embeddings.python.layers.dist_model_parallel.DistributedEmbedding

      • Supports slicing an embedding table by columns, for tables too large to fit in the memory of a single GPU
      • The column_slice_threshold parameter sets the slicing threshold (in number of elements)
      • Can wrap shared embedding tables: for example, two feature fields, watched_video and browsed_video, both contain video ids, so they can share a single embedding table (see the shared-table sketch after this list)
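
    To make the input formats above concrete, here is a minimal sketch of the standalone Embedding layer, assuming it accepts the call patterns described above (a plain dense tensor for fixed hotness and a tf.RaggedTensor for variable hotness); the table size, ids, and shapes are made up for illustration:

      import tensorflow as tf
      from distributed_embeddings.python.layers.embedding import Embedding

      emb_onehot = Embedding(1000, 16)                   # no combiner: behaves like tf.keras.layers.Embedding
      emb_multi  = Embedding(1000, 16, combiner="sum")   # combines the per-sample lookups into one vector

      onehot_ids = tf.constant([1, 42, 7])                       # one id per sample, shape [3]
      fixed_ids  = tf.constant([[1, 2], [3, 4], [5, 6]])         # fixed hotness: 2 ids per sample, shape [3, 2]
      ragged_ids = tf.ragged.constant([[1], [2, 3, 4], [5, 6]])  # variable hotness: per-sample counts differ

      print(emb_onehot(onehot_ids).shape)  # expected (3, 16)
      print(emb_multi(fixed_ids).shape)    # expected (3, 16) after the "sum" combiner
      print(emb_multi(ragged_ids).shape)   # expected (3, 16)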
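
    For the shared-table case (watched_video / browsed_video above), my understanding is that DistributedEmbedding lets several inputs map onto one wrapped table through an input-to-table mapping argument; the argument name input_table_map below reflects my reading of the project docs and should be treated as an assumption to verify against the API reference linked above:

      from distributed_embeddings.python.layers.embedding import Embedding
      from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding

      video_table = Embedding(100000, 32)  # one physical table holding all video embeddings
      shared = DistributedEmbedding(
          [video_table],
          input_table_map=[0, 0])  # assumed argument: inputs 0 and 1 both look up table 0
      # shared([watched_video_ids, browsed_video_ids]) would then return two outputs backed by one table.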

    Performance:

    • DLRM model with 113 billion parameters (421 GiB model size) trained on the Criteo Terabyte Click Logs dataset (official numbers, not measured locally)

      Hardware                         | Description                                                                    | Training Throughput (samples/second) | Speedup over CPU
      2 x AMD EPYC 7742                | Both MLP layers and embeddings on CPU                                          | 17.7k                                | 1x
      1 x A100-80GB; 2 x AMD EPYC 7742 | Large embeddings on CPU, everything else on GPU                                | 768k                                 | 43x
      DGX A100 (8xA100-80GB)           | Hybrid parallel with NVIDIA Merlin Distributed-Embeddings, whole model on GPU  | 12.1M                                | 683x

    Synthetic models benchmark (see the benchmark runs in the test part of Section 3)

    • Model size definitions

      Model  | Total number of embedding tables | Total embedding size (GiB)
      Tiny   | 55                               | 4.2
      Small  | 107                              | 26.3
      Medium | 311                              | 206.2
      Large  | 612                              | 773.8
      Jumbo  | 1,022                            | 3,109.5
    • Official performance (DGX A100-80GB, batch size = 65536, optimizer = Adagrad); training step time (ms)

      Model  | 1 GPU | 8 GPU | 16 GPU | 32 GPU | 128 GPU
      Tiny   | 17.6  | 3.6   | 3.2    | -      | -
      Small  | 57.8  | 14.0  | 11.6   | 7.4    | -
      Medium | -     | 64.4  | 44.9   | 31.1   | 17.2
      Large  | -     | -     | -      | 65.0   | 33.4
      Jumbo  | -     | -     | -      | -      | 102.3
    • Measured performance (Tesla T4, batch size = 65536, optimizer = Adagrad); training step time (ms)

      Model | 1 GPU  | 4 GPU
      Tiny  | 42.703 | 82.856
    • Comparison with native TensorFlow data parallelism; training step time (ms)

      Solution                                            | 1 GPU | 2 GPU | 4 GPU | 8 GPU
      NVIDIA Merlin Distributed Embeddings Model Parallel | 17.7  | 11.6  | 6.4   | 4.2
      Native TensorFlow Data Parallel                     | 19.9  | 20.2  | 21.2  | 22.3

cuCollections Performance Testing

0. Test Goals

  • Performance analysis of cuCollections (measure performance at load factors of 0.8/0.9, plus insertion performance into an empty table, throughput, bandwidth, etc.); keys are 64-bit ints and the data is uniformly distributed
    • [x] Read the benchmark code, modify its parameters, and test performance at different load factors (dynamic and static maps)
    • [x] Work through the examples and basic usage; the tests can be used as a reference
    • [x] Modify the benchmarks to measure the required performance metrics

1. Environment Setup

