feat: Add DistributedOptimizer and support ZeRO-1 #98
base: master
Conversation
Force-pushed from be321ca to 4afb235
Force-pushed from df89009 to 837b825
```cpp
#include <unordered_map>
#include <vector>

#include "infini_train/include/nn/parallel/param_and_grad_buffer.h"
```
Use a forward declaration instead.
Changed.
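For reference, a minimal sketch of what the forward declarations could look like (whether these types are classes or structs in the actual code is an assumption):

```cpp
// Hypothetical sketch: forward-declare the buffer types in the header and move the
// full include of param_and_grad_buffer.h into the corresponding .cc file.
namespace infini_train::nn::parallel {
class ParamAndGradBuffer;
class ParamAndGradBucketGroup;
} // namespace infini_train::nn::parallel
```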
```cpp
BuildShardParamsAndBindGrads();

// Build base optimizer
base_optimizer_ = creator_(shard_params_);
```
creator is only called once, at construction time, right? There's no need to store it as a member.
Removed.
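For reference, a self-contained toy sketch of the agreed pattern (stand-in types only, not the project's real signatures): the factory callable is consumed once in the constructor instead of being kept in a creator_ member.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

struct Optimizer {}; // stand-in for the real base optimizer type

class DistributedOptimizerLike {
public:
    using Creator = std::function<std::shared_ptr<Optimizer>(const std::vector<int> &)>;

    DistributedOptimizerLike(std::vector<int> shard_params, const Creator &creator)
        : shard_params_(std::move(shard_params)),
          // The creator is invoked exactly once, right here; no creator_ member is stored.
          base_optimizer_(creator(shard_params_)) {}

private:
    std::vector<int> shard_params_; // stand-in for the real shard parameter list
    std::shared_ptr<Optimizer> base_optimizer_;
};

int main() {
    DistributedOptimizerLike opt({1, 2, 3},
                                 [](const std::vector<int> &) { return std::make_shared<Optimizer>(); });
}
```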
```cpp
    : Optimizer(full_params), param_grad_buffers_(buffers), bucket_groups_(bucket_groups), dp_pg_(dp_pg),
      dp_world_size_(dp_world_size), dp_rank_(dp_rank), creator_(std::move(creator)) {

    CHECK(dp_pg_);
```
dp_pg_ doesn't seem to be used anywhere in DistributedOptimizer.
Removed it for now; it genuinely isn't needed yet, only the world size and rank are used. If it becomes necessary later we can add it back as appropriate.
```cpp
namespace infini_train::nn::parallel {

namespace {
std::shared_ptr<Tensor> GetShardView(const std::shared_ptr<Tensor> &buffer, size_t world_size, size_t rank) {
```
This function is never called.
Removed.
```cpp
const std::vector<std::shared_ptr<Tensor>> &full_params,
const std::vector<std::shared_ptr<ParamAndGradBuffer>> &buffers,
const std::vector<std::shared_ptr<ParamAndGradBucketGroup>> &bucket_groups,
const ProcessGroup *dp_pg, size_t dp_world_size, size_t dp_rank)
```
"ddp"
Changed.
```diff
  */
 explicit Reducer(std::vector<std::shared_ptr<Tensor>> parameters, std::vector<std::vector<size_t>> bucket_indices,
-                 const ReducerOptions &opts);
+                 const DistributedDataParallelConfig ddp_config = DistributedDataParallelConfig());
```
Avoid default parameters where possible.
Removed.
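For reference, a toy sketch of the guideline (stand-in types and fields only): the config becomes a required parameter and every call site constructs it explicitly.

```cpp
#include <cstdio>

struct DistributedDataParallelConfig {
    int bucket_size_mb = 25; // stand-in field, not the real config contents
};

class Reducer {
public:
    // No "= DistributedDataParallelConfig()" default: callers must pass the config.
    explicit Reducer(const DistributedDataParallelConfig &ddp_config) : ddp_config_(ddp_config) {}

    void Print() const { std::printf("bucket_size_mb = %d\n", ddp_config_.bucket_size_mb); }

private:
    DistributedDataParallelConfig ddp_config_;
};

int main() {
    DistributedDataParallelConfig config; // built explicitly at the call site
    Reducer reducer(config);
    reducer.Print();
}
```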
infini_train/src/tensor.cc
Outdated
```cpp
}
}

void Tensor::SetData(const Tensor &tensor, size_t offset, bool overwrite) {
```
The semantics of overwrite here is whether, before rebinding the buffer, the current Tensor's data is copied into the corresponding position in the target buffer. The name overwrite is a bit ambiguous; let's rename it to preserve_data.
Changed.
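For reference, a toy sketch of the semantics described above (stand-in types, not the real Tensor API): preserve_data decides whether the tensor's current contents are copied into the target buffer at the given offset before the tensor is rebound to view that buffer region.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyTensor {
    std::vector<float> storage;           // privately owned data before rebinding
    std::vector<float> *buffer = nullptr; // shared flat buffer after rebinding
    size_t buffer_offset = 0;

    void SetData(std::vector<float> &target_buffer, size_t offset, bool preserve_data) {
        assert(offset + storage.size() <= target_buffer.size());
        if (preserve_data) {
            // Copy the existing values into the buffer slot so they survive the rebind.
            std::copy(storage.begin(), storage.end(),
                      target_buffer.begin() + static_cast<std::ptrdiff_t>(offset));
        }
        // Rebind: from now on this tensor views target_buffer[offset, offset + size).
        buffer = &target_buffer;
        buffer_offset = offset;
        storage.clear();
    }
};

int main() {
    std::vector<float> flat(8, 0.0f);
    ToyTensor t{{1.0f, 2.0f, 3.0f}};
    t.SetData(flat, 2, /*preserve_data=*/true); // flat becomes {0,0,1,2,3,0,0,0}
}
```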
infini_train/src/optimizer.cc
Outdated
```diff
@@ -1,5 +1,6 @@
 #include "infini_train/include/optimizer.h"
 
+#include <utility>
```
Is it necessary to add this header?
Removed.
```diff
 const Device *PipelineStage::device() const { return device_; }
 const std::vector<std::vector<int64_t>> &PipelineStage::recv_shape() const { return recv_shape_; }
-std::shared_ptr<Optimizer> PipelineStage::optimizer() { return optimizer_; }
+// std::shared_ptr<Optimizer> PipelineStage::optimizer() { return optimizer_; }
```
Just delete it outright.
Removed.
scripts/test_config.json
Outdated
| "nthread_per_process": 8, | ||
| "num_iteration": 10, | ||
| "batch_size": 40, | ||
| "batch_size": 20, |
Why change it to 20?
Changed.
Let's put the DDP-related files into a separate ddp folder under parallel.
ZeRO-1 based on a distributed optimizer is enabled via --use_distributed_optimizer. Stage 1 optimization brings no benefit for SGD, but the corresponding option is still added to the GPT-2 training script to prepare for the later stage 2/3 extensions. Implementation details:

- When --use_distributed_optimizer is not set, the original logic is used unchanged;
- A construction function based on models->Parameters() is passed in; DistributedOptimizer accepts this function at construction time and creates the base_optimizer inside its own constructor;
- For the detailed implementation flow, see: https://gxtctab8no8.feishu.cn/wiki/XQbGwXSsZi3MutkZuhXcKWsinnY#share-QU0fdM1cYoT06vxmwMbcnLJUn1b
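For a rough intuition of what the stage-1 sharding described above buys (and why it has little to offer a stateless SGD, which keeps no per-parameter optimizer state), here is a toy sketch of the shard-ownership idea; it is not the actual DistributedOptimizer code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Each data-parallel rank owns 1/world_size of the flattened parameter buffer and
// builds its base optimizer (and therefore its optimizer state) over that shard only.
std::pair<size_t, size_t> ShardRange(size_t numel, size_t world_size, size_t rank) {
    const size_t shard = numel / world_size; // assume numel divides evenly for simplicity
    return {rank * shard, (rank + 1) * shard};
}

int main() {
    const size_t world_size = 4;
    const size_t rank = 1;
    std::vector<float> flat_params(1024); // flattened parameter buffer, replicated on every rank
    const auto [begin, end] = ShardRange(flat_params.size(), world_size, rank);
    // Optimizer state (e.g. Adam moments) is only kept for flat_params[begin, end);
    // after the local update the shards are allgathered so every rank sees full params.
    (void)begin;
    (void)end;
}
```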
Two important details: