Distributed_embeddings

1. Introduction

  • References:

    • Project repository: https://github.com/NVIDIA-Merlin/distributed-embeddings
    • Related blog post: https://developer.nvidia.com/blog/fast-terabyte-scale-recommender-training-made-easy-with-nvidia-merlin-distributed-embeddings/
    • Related project: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Recommendation/DLRM#hybrid-parallel-training-with-merlin-distributed-embeddings
    • Project documentation: https://nvidia-merlin.github.io/distributed-embeddings/index.html
  • Builds large embeddings on top of TF2 and provides a scalable model-parallel wrapper that automatically distributes embedding tables across multiple GPUs (currently supports table-wise and column-wise sharding)

  • Supports hybrid model parallelism (dist_model_parallel):

  • Only a few lines of code need to change to use it:

    import tensorflow as tf
    import distributed_embeddings.python.layers.dist_model_parallel as dmp
     
    class MyEmbeddingModel(tf.keras.Model):
      def __init__(self, table_sizes):
        ...
        self.embedding_layers = [tf.keras.layers.Embedding(input_dim, output_dim) for input_dim, output_dim in table_sizes]
        # 1. Add this line to wrap list of embedding layers used in the model
        self.embedding_layers = dmp.DistributedEmbedding(self.embedding_layers)
      def call(self, inputs):
        # embedding_outputs = [e(i) for e, i in zip(self.embedding_layers, inputs)]
        embedding_outputs = self.embedding_layers(inputs)
         
    @tf.function
    def training_step(inputs, labels, first_batch):
      with tf.GradientTape() as tape:
        probs = model(inputs)
        loss_value = loss(labels, probs)
     
      # 2. Change Horovod Gradient Tape to dmp tape
      # tape = hvd.DistributedGradientTape(tape)
      tape = dmp.DistributedGradientTape(tape)
      grads = tape.gradient(loss_value, model.trainable_variables)
      opt.apply_gradients(zip(grads, model.trainable_variables))
     
      if first_batch:
        # 3. Change Horovod broadcast_variables to dmp's
        # hvd.broadcast_variables(model.variables, root_rank=0)
        dmp.broadcast_variables(model.variables, root_rank=0)
      return loss_value
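
  • The snippet above assumes the usual Horovod setup has already run (dmp.DistributedGradientTape and dmp.broadcast_variables are drop-in replacements for the Horovod versions). Below is a minimal setup sketch, assuming the standard horovod.tensorflow API and one process per GPU; table_sizes, the optimizer, and the loss are placeholders added only for illustration:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()
    # Pin one GPU to each Horovod process.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder (input_dim, output_dim) pairs and training objects used by the snippet above.
    table_sizes = [(100000, 64), (50000, 32)]
    model = MyEmbeddingModel(table_sizes)
    opt = tf.keras.optimizers.SGD(0.01)
    loss = tf.keras.losses.BinaryCrossentropy()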
    
  • Example:

    import tensorflow as tf
    from distributed_embeddings.python.layers.embedding import Embedding
    from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding
    layer0 = Embedding(1000, 64, name="emb_0")
    layer1 = Embedding(1000, 64, combiner="mean", name="emb_1")
    layer2 = tf.keras.layers.Embedding(1000, 64, name="emb_2")
    layer3 = tf.keras.layers.Embedding(1000, 64, name="emb_3")
    layers0 = [layer0, layer1]
    layers1 = [layer2, layer3]
    emb_layers0 = DistributedEmbedding(layers0, column_slice_threshold=32000)
    emb_layers1 = DistributedEmbedding(layers1, column_slice_threshold=32000)
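
  • Once wrapped, the DistributedEmbedding object is called with a list of input tensors, one per wrapped table, and returns the per-table outputs in the same order (mirroring the commented-out list comprehension in the first snippet). A rough call sketch continuing the example above, with made-up ids and assuming Horovod has been initialized beforehand:

    batch_size = 4
    ids_a = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int64)     # one id per sample for emb_0
    ids_b = tf.random.uniform([batch_size, 3], maxval=1000, dtype=tf.int64)  # 3 ids per sample for the "mean" combiner of emb_1

    outputs = emb_layers0([ids_a, ids_b])     # list of two tensors, one per wrapped table
    print([tuple(o.shape) for o in outputs])  # expected [(4, 64), (4, 64)]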
    
  • API:

    • distributed_embeddings.python.layers.embedding.Embedding

      • The interface largely mirrors tf.keras.layers.Embedding, with added support for looking up multiple embeddings per sample and combining them (sum/mean) into a single output vector
      • Supports several input formats: one-hot / fixed hotness (multi-hot, same number of ids per sample, plain dense tensor) / variable hotness (multi-hot, different number of ids per sample, via tf.RaggedTensor) / sparse tensor (multi-hot, via tf.sparse.SparseTensor, which stores the sparse tensor in a CSR-like layout); see the sketch after this list
    • distributed_embeddings.python.layers.dist_model_parallel.DistributedEmbedding

      • Supports slicing an embedding table by columns, for tables too large to fit in the memory of a single GPU
      • The column_slice_threshold parameter sets the slicing threshold (in number of elements)
      • Can wrap shared embedding tables: for example, two feature fields, watched_video and browsed_video, both contain video ids, so they can share a single embedding table (see the shared-table sketch after this list)
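
    To make the input formats above concrete, here is a minimal sketch of the standalone Embedding layer, assuming it accepts the call patterns described above (a plain dense tensor for fixed hotness and a tf.RaggedTensor for variable hotness); the table size, ids, and shapes are made up for illustration:

      import tensorflow as tf
      from distributed_embeddings.python.layers.embedding import Embedding

      emb_onehot = Embedding(1000, 16)                   # no combiner: behaves like tf.keras.layers.Embedding
      emb_multi  = Embedding(1000, 16, combiner="sum")   # combines the per-sample lookups into one vector

      onehot_ids = tf.constant([1, 42, 7])                       # one id per sample, shape [3]
      fixed_ids  = tf.constant([[1, 2], [3, 4], [5, 6]])         # fixed hotness: 2 ids per sample, shape [3, 2]
      ragged_ids = tf.ragged.constant([[1], [2, 3, 4], [5, 6]])  # variable hotness: per-sample counts differ

      print(emb_onehot(onehot_ids).shape)  # expected (3, 16)
      print(emb_multi(fixed_ids).shape)    # expected (3, 16) after the "sum" combiner
      print(emb_multi(ragged_ids).shape)   # expected (3, 16)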
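
    For the shared-table case (watched_video / browsed_video above), my understanding is that DistributedEmbedding lets several inputs map onto one wrapped table through an input-to-table mapping argument; the argument name input_table_map below reflects my reading of the project docs and should be treated as an assumption to verify against the API reference linked above:

      from distributed_embeddings.python.layers.embedding import Embedding
      from distributed_embeddings.python.layers.dist_model_parallel import DistributedEmbedding

      video_table = Embedding(100000, 32)  # one physical table holding all video embeddings
      shared = DistributedEmbedding(
          [video_table],
          input_table_map=[0, 0])  # assumed argument: inputs 0 and 1 both look up table 0
      # shared([watched_video_ids, browsed_video_ids]) would then return two outputs backed by one table.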

    Performance:

    • DLRM model with 113 billion parameters (421 GiB model size) trained on the Criteo Terabyte Click Logs dataset (official numbers, not measured locally)

      Hardware                         | Description                                                                    | Training Throughput (samples/second) | Speedup over CPU
      2 x AMD EPYC 7742                | Both MLP layers and embeddings on CPU                                          | 17.7k                                | 1x
      1 x A100-80GB; 2 x AMD EPYC 7742 | Large embeddings on CPU, everything else on GPU                                | 768k                                 | 43x
      DGX A100 (8xA100-80GB)           | Hybrid parallel with NVIDIA Merlin Distributed-Embeddings, whole model on GPU  | 12.1M                                | 683x

    Synthetic models benchmark (see the benchmark runs in the test part of Section 3)

    • Model size definitions

      Model  | Total number of embedding tables | Total embedding size (GiB)
      Tiny   | 55                               | 4.2
      Small  | 107                              | 26.3
      Medium | 311                              | 206.2
      Large  | 612                              | 773.8
      Jumbo  | 1,022                            | 3,109.5
    • Official performance (DGX A100-80GB, batch size = 65536, optimizer = Adagrad); training step time (ms)

      Model  | 1 GPU | 8 GPU | 16 GPU | 32 GPU | 128 GPU
      Tiny   | 17.6  | 3.6   | 3.2    | -      | -
      Small  | 57.8  | 14.0  | 11.6   | 7.4    | -
      Medium | -     | 64.4  | 44.9   | 31.1   | 17.2
      Large  | -     | -     | -      | 65.0   | 33.4
      Jumbo  | -     | -     | -      | -      | 102.3
    • Measured performance (Tesla T4, batch size = 65536, optimizer = Adagrad); training step time (ms)

      Model | 1 GPU  | 4 GPU
      Tiny  | 42.703 | 82.856
    • Comparison with native TensorFlow data parallelism; training step time (ms)

      Solution                                            | 1 GPU | 2 GPU | 4 GPU | 8 GPU
      NVIDIA Merlin Distributed Embeddings Model Parallel | 17.7  | 11.6  | 6.4   | 4.2
      Native TensorFlow Data Parallel                     | 19.9  | 20.2  | 21.2  | 22.3

cuCollections Performance Testing

0. Test Goals

  • Performance analysis of cuCollections (measure performance at load factors of 0.8/0.9, plus insertion performance into an empty table, throughput, bandwidth, etc.); keys are 64-bit ints and the data is uniformly distributed
    • [x] Read the benchmark code, modify its parameters, and test performance at different load factors (dynamic and static maps)
    • [x] Work through the examples and basic usage; the tests can be used as a reference
    • [x] Modify the benchmarks to measure the required performance metrics

1. Environment Setup

