Conversation

Chamberlain0w0 (Contributor) commented on Dec 24, 2025

ZeRO-1 optimization based on a distributed optimizer is enabled via --use_distributed_optimizer. Stage 1 shards optimizer state, so it brings no benefit for SGD, but the option is still added to the GPT-2 training script to prepare for the later Stage 2/3 extensions.

Implementation details:

  1. Add the ParamAndGradBucket/Group/Buffer infrastructure, following an approach similar to Megatron-LM: all params/grads are laid out contiguously in a single 1-D buffer and split into buckets that are communicated group by group;
  2. Modify the DistributedDataParallel logic: the constructor now creates the buffers above, partitions the bucket groups, and registers the hooks. The original reducer code path is kept as a separate branch and is still used when --use_distributed_optimizer is not set;
  3. Add a DistributedOptimizer class that inherits from Optimizer. Each optimizer type also gets a small factory function that only takes models->Parameters(); DistributedOptimizer receives this function at construction time and uses it inside its constructor to create the base_optimizer (a rough sketch of this pattern follows the list);
  4. Change the PP constructor parameters so that the optimizer is passed in later, at Module::TrainStep; the PP object therefore no longer holds an optimizer at all;
  5. Print the per-iteration peak allocated/reserved memory in the training loop.
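
To make items 1 and 3 concrete, here is a minimal C++ sketch of the flat param/grad buffer and the creator-function pattern. All type and member names below (FlatBuffer, MakeOptimizerFn, BuildShardParams, ...) are illustrative assumptions, not the actual infini_train API in this PR.

#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct Tensor {}; // minimal stand-in for infini_train's Tensor

// All params/grads of a module live contiguously in one 1-D allocation,
// split into buckets that are reduced/communicated as a group.
struct FlatBuffer {
    std::shared_ptr<Tensor> data;                    // the contiguous 1-D buffer
    std::vector<std::pair<size_t, size_t>> buckets;  // (offset, numel) per bucket
};

struct Optimizer {
    virtual ~Optimizer() = default;
    virtual void Step() = 0;
};

// Each concrete optimizer gets a small factory that only takes a parameter list,
// so DistributedOptimizer can build the base optimizer over its own shard.
using MakeOptimizerFn
    = std::function<std::unique_ptr<Optimizer>(const std::vector<std::shared_ptr<Tensor>> &)>;

class DistributedOptimizer {
public:
    DistributedOptimizer(std::vector<std::shared_ptr<FlatBuffer>> buffers, size_t dp_world_size,
                         size_t dp_rank, const MakeOptimizerFn &make_optimizer)
        : buffers_(std::move(buffers)), dp_world_size_(dp_world_size), dp_rank_(dp_rank) {
        // Build this rank's shard views, then hand them to the base optimizer.
        // The factory is used exactly once here, so it is not stored as a member.
        base_optimizer_ = make_optimizer(BuildShardParams());
    }

    void Step() {
        base_optimizer_->Step(); // update only this rank's shard
        // ZeRO-1: an all-gather of the updated shards would follow here.
    }

private:
    // Real code would slice each buffer into dp_world_size_ shards and return
    // the views owned by dp_rank_; elided in this sketch.
    std::vector<std::shared_ptr<Tensor>> BuildShardParams() { return {}; }

    std::vector<std::shared_ptr<FlatBuffer>> buffers_;
    size_t dp_world_size_ = 1;
    size_t dp_rank_ = 0;
    std::unique_ptr<Optimizer> base_optimizer_;
};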

A detailed walkthrough of the implementation is available here: https://gxtctab8no8.feishu.cn/wiki/XQbGwXSsZi3MutkZuhXcKWsinnY#share-QU0fdM1cYoT06vxmwMbcnLJUn1b

Two important details:

  1. When the DDP object is constructed, it also builds the model's buffers/bucket_groups; the DistributedOptimizer then receives those buffers/groups from the DDP object at construction time;
  2. Because the current PP + DDP implementation wraps each chunk as its own DDP object, the buffers/bucket_groups handed to the DistributedOptimizer must first be aggregated across chunks into one combined list before being passed in (see the sketch after this list).
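
A hedged sketch of detail 2 follows; the accessor names (param_grad_buffers(), bucket_groups()) are assumptions used only for illustration, not the exact PR API. The idea is simply to concatenate the per-chunk buffers and bucket groups into flat lists before constructing the DistributedOptimizer.

#include <memory>
#include <utility>
#include <vector>

// Gather the buffers/bucket groups owned by each DDP-wrapped PP chunk into one
// combined list each; the result is what gets passed to DistributedOptimizer.
template <typename Buffer, typename BucketGroup, typename DDPModule>
std::pair<std::vector<std::shared_ptr<Buffer>>, std::vector<std::shared_ptr<BucketGroup>>>
CollectAcrossChunks(const std::vector<std::shared_ptr<DDPModule>> &ddp_chunks) {
    std::vector<std::shared_ptr<Buffer>> all_buffers;
    std::vector<std::shared_ptr<BucketGroup>> all_groups;
    for (const auto &chunk : ddp_chunks) {
        const auto &bufs = chunk->param_grad_buffers();  // assumed accessor
        const auto &groups = chunk->bucket_groups();     // assumed accessor
        all_buffers.insert(all_buffers.end(), bufs.begin(), bufs.end());
        all_groups.insert(all_groups.end(), groups.begin(), groups.end());
    }
    return {std::move(all_buffers), std::move(all_groups)};
}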

Chamberlain0w0 force-pushed the feat/distributed_optimizer branch from be321ca to 4afb235 on Jan 12, 2026.
Chamberlain0w0 changed the title from "[WIP] feat: Add DistributedOptimizer and support ZeRO-1" to "feat: Add DistributedOptimizer and support ZeRO-1" on Jan 12, 2026.
#include <unordered_map>
#include <vector>

#include "infini_train/include/nn/parallel/param_and_grad_buffer.h"

Collaborator: Use a forward declaration here.

Contributor (author): Changed.

BuildShardParamsAndBindGrads();

// Build base optimizer
base_optimizer_ = creator_(shard_params_);

Collaborator: creator is only called once, in the constructor, right? There's no need to store it as a member.

Contributor (author): Removed.

: Optimizer(full_params), param_grad_buffers_(buffers), bucket_groups_(bucket_groups), dp_pg_(dp_pg),
dp_world_size_(dp_world_size), dp_rank_(dp_rank), creator_(std::move(creator)) {

CHECK(dp_pg_);

Collaborator: dp_pg_ doesn't seem to be used anywhere in DistributedOptimizer.

Contributor (author): Removed for now. It is indeed unused at the moment; only the size and rank are needed. If it becomes necessary later, we can add it back.

namespace infini_train::nn::parallel {

namespace {
std::shared_ptr<Tensor> GetShardView(const std::shared_ptr<Tensor> &buffer, size_t world_size, size_t rank) {

Collaborator: This function is never called.

Contributor (author): Removed.

const std::vector<std::shared_ptr<Tensor>> &full_params,
const std::vector<std::shared_ptr<ParamAndGradBuffer>> &buffers,
const std::vector<std::shared_ptr<ParamAndGradBucketGroup>> &bucket_groups,
const ProcessGroup *dp_pg, size_t dp_world_size, size_t dp_rank)

Collaborator: "ddp"

Contributor (author): Changed.

*/
explicit Reducer(std::vector<std::shared_ptr<Tensor>> parameters, std::vector<std::vector<size_t>> bucket_indices,
const ReducerOptions &opts);
const DistributedDataParallelConfig ddp_config = DistributedDataParallelConfig());

Collaborator: Avoid default arguments where possible.

Contributor (author): Removed.

}
}

void Tensor::SetData(const Tensor &tensor, size_t offset, bool overwrite) {

Collaborator: The semantics of overwrite here is whether to copy the current Tensor's data into the corresponding position of the target buffer before rebinding to that buffer. The name overwrite is a bit ambiguous; rename it to preserve_data.

Contributor (author): Changed.
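
For illustration only, a standalone toy version of the preserve_data semantics described above; this is not infini_train's Tensor, and the ToyTensor type exists purely to show "copy the old contents into the target slice before rebinding".

#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

struct ToyTensor {
    std::shared_ptr<std::vector<float>> storage; // shared backing buffer
    size_t offset = 0;                           // start of this view in the buffer
    size_t numel = 0;                            // number of elements in this view

    float *data() { return storage->data() + offset; }

    // Rebind this tensor to `buffer` at `new_offset`. If preserve_data is true,
    // first copy the current contents into that slice so rebinding keeps them.
    void SetData(const ToyTensor &buffer, size_t new_offset, bool preserve_data) {
        assert(new_offset + numel <= buffer.storage->size());
        if (preserve_data) {
            std::memcpy(buffer.storage->data() + new_offset, data(), numel * sizeof(float));
        }
        storage = buffer.storage;
        offset = new_offset;
    }
};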

@@ -1,5 +1,6 @@
#include "infini_train/include/optimizer.h"

#include <utility>

Collaborator: Is this header really necessary?

Contributor (author): Removed.

const Device *PipelineStage::device() const { return device_; }
const std::vector<std::vector<int64_t>> &PipelineStage::recv_shape() const { return recv_shape_; }
std::shared_ptr<Optimizer> PipelineStage::optimizer() { return optimizer_; }
// std::shared_ptr<Optimizer> PipelineStage::optimizer() { return optimizer_; }

Collaborator: Just delete it.

Contributor (author): Deleted.

"nthread_per_process": 8,
"num_iteration": 10,
"batch_size": 40,
"batch_size": 20,

Collaborator: Why change it to 20?

Contributor (author): Changed.

kilinchange (Collaborator) commented: Put the ddp-related files in their own ddp folder under parallel.
