
Conversation

@shihaobai
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @shihaobai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the LightLLM serving framework by introducing robust support for Reinforcement Learning (RL) training, particularly for VERL. The core changes revolve around advanced GPU memory management, enabling dynamic model weight updates, and refining the server's architecture for better control and communication. These enhancements pave the way for more flexible and efficient online model training and deployment scenarios.

Highlights

  • RL Training Support: Introduced comprehensive support for Reinforcement Learning (RL) training, specifically for VERL, by integrating new memory management capabilities and dynamic weight update mechanisms.
  • Memory Management: Implemented TorchMemorySaverWrapper to efficiently manage GPU memory regions (weights, KV cache, and CUDA graphs), allowing for pausing and resuming memory occupation. This includes new API endpoints for release_memory_occupation and resume_memory_occupation.
  • Dynamic Weight Updates: Added functionality for online weight updates, enabling models to be updated without restarting the server. This is facilitated by new API endpoints such as init_weights_update_group, destroy_weights_update_group, update_weights_from_distributed, and update_weights_from_tensor; a hedged client sketch follows this list.
  • API and Server Refactoring: Refactored server startup logic and API endpoints to improve modularity and support new features. This includes new endpoints for server information, weight version, request abortion, cache flushing, and generation control (pause/continue).
  • Inter-Process Communication (IPC) Enhancements: Introduced new data structures in io_struct.py for clearer and more robust communication between different server components. Updated ZMQ socket types for more efficient message passing.
  • Distributed Environment Improvements: Enhanced distributed training utilities by adding init_custom_process_group for flexible process group initialization and modifying init_process_group calls to avoid device-bound split optimizations that could hinder RL weight updates.
  • Serialization and Tensor Handling: Added serializer.py and tensor_bucket.py to provide secure and efficient serialization of tensors across multiprocessing boundaries, including support for flattened tensor buckets.
  • Kernel Autotuning Configurations: Included new Triton kernel autotuning configurations for NVIDIA H200 and H100 GPUs, optimizing performance for various grouped matrix multiplication and MoE operations.
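
For reviewers who want to exercise the new control plane, here is a minimal client sketch. It assumes the functions listed above are exposed as HTTP POST routes with the same names on a locally running server; the base URL, paths, and payload fields are illustrative assumptions, not taken from this PR.

# Hypothetical client calls against an assumed LightLLM server on
# localhost:8000; endpoint paths and JSON fields below are illustrative only.
import requests

BASE = "http://localhost:8000"

# Release GPU memory held by weights / KV cache / CUDA graphs while an RL
# trainer needs the device, then resume before serving again.
requests.post(f"{BASE}/release_memory_occupation")
requests.post(f"{BASE}/resume_memory_occupation")

# Online weight update flow: join a process group shared with the trainer,
# receive new weights through it, then tear the group down.
requests.post(f"{BASE}/init_weights_update_group", json={
    "master_address": "127.0.0.1",   # assumed field names
    "master_port": 29500,
    "rank_offset": 0,
    "world_size": 2,
    "group_name": "verl_weight_update",
})
requests.post(f"{BASE}/update_weights_from_distributed", json={
    "names": ["model.layers.0.self_attn.q_proj.weight"],  # illustrative
})
requests.post(f"{BASE}/destroy_weights_update_group")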



@gemini-code-assist bot left a comment


Code Review

This pull request introduces significant changes to support reinforcement learning (RL) training, including dynamic memory management and online weight updates. The refactoring of I/O structures and communication patterns is a positive step towards a more modular and maintainable codebase. The introduction of TorchMemorySaverWrapper provides a clean way to optionally use memory-saving features.

However, I've identified several issues that need attention, including a critical bug in the distributed handling of aborted requests, a logic error in request scheduling, and a potential race condition with asyncio.Event. There are also opportunities for code simplification and improved error handling. Please see the detailed comments for each issue.

    [req.is_aborted for req in self.running_batch.reqs], dtype=torch.bool, device="cpu"
)
if self.is_multinode_tp:
    dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)

critical

The all_reduce operation on aborted_req_mask uses dist.ReduceOp.MIN. For a boolean tensor, this is equivalent to a logical AND operation across all ranks. This means a request will only be considered aborted if it is marked as aborted on all ranks. The correct behavior should be to consider a request aborted if it is aborted on any rank. This requires a logical OR, which corresponds to dist.ReduceOp.MAX.

Suggested change
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MAX, group=self.mulitnode_group)
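
To make the semantics concrete, here is a small standalone sketch (not code from this PR; it assumes two local processes and the gloo backend) that reduces a 0/1 mask standing in for the boolean aborted mask. ReduceOp.MIN behaves as a logical AND across ranks, while ReduceOp.MAX behaves as a logical OR, which is what "aborted on any rank" requires.

# Demo: MIN over a 0/1 mask == AND across ranks, MAX == OR across ranks.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29511",
        rank=rank,
        world_size=world_size,
    )
    # Only rank 0 marks the request as aborted (1 = aborted, 0 = not aborted).
    mask = torch.tensor([1 if rank == 0 else 0], dtype=torch.int64)

    and_mask = mask.clone()
    dist.all_reduce(and_mask, op=dist.ReduceOp.MIN)  # -> 0: "aborted on all ranks" is False

    or_mask = mask.clone()
    dist.all_reduce(or_mask, op=dist.ReduceOp.MAX)   # -> 1: "aborted on any rank" is True

    if rank == 0:
        print("MIN (AND):", and_mask.item(), "MAX (OR):", or_mask.item())
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)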

aborted_req_mask = torch.tensor(
    [req.is_aborted for req in new_batch.reqs], dtype=torch.bool, device="cpu"
)
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)

critical

The all_reduce operation on aborted_req_mask uses dist.ReduceOp.MIN. For a boolean tensor, this is equivalent to a logical AND operation across all ranks. This means a request will only be considered aborted if it is marked as aborted on all ranks. The correct behavior should be to consider a request aborted if it is aborted on any rank. This requires a logical OR, which corresponds to dist.ReduceOp.MAX.

Suggested change
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MAX, group=self.mulitnode_group)

select_req_ids.append(req_id)

aborted_req_mask = torch.tensor(aborted_req_mask, dtype=torch.bool, device="cpu")
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)

critical

The all_reduce operation on aborted_req_mask uses dist.ReduceOp.MIN. For a boolean tensor, this is equivalent to a logical AND operation across all ranks. This means a request will only be considered aborted if it is marked as aborted on all ranks. The correct behavior should be to consider a request aborted if it is aborted on any rank. This requires a logical OR, which corresponds to dist.ReduceOp.MAX.

Suggested change
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MIN, group=self.mulitnode_group)
dist.all_reduce(aborted_req_mask, op=dist.ReduceOp.MAX, group=self.mulitnode_group)

Comment on lines +103 to +104
return "unknown" # need fix
# raise RuntimeError("No GPU available")

critical

The function get_current_device_name returns "unknown" when no GPU is available. This could mask potential configuration errors. It would be better to raise a RuntimeError as suggested in the commented-out code to fail fast and provide a clear error message when a GPU is expected but not found.

Suggested change
return "unknown" # need fix
# raise RuntimeError("No GPU available")
raise RuntimeError("No GPU available")
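
A fail-fast variant could look like the following sketch (not the repository's exact implementation; it assumes the helper is only called on hosts where a CUDA device is expected):

import torch


def get_current_device_name() -> str:
    # Fail fast with a clear message instead of returning a placeholder
    # string that downstream configuration lookups would silently accept.
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(torch.cuda.current_device())
    raise RuntimeError("No GPU available")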

Comment on lines +896 to +911
async def http_to_model_special_request(
    self, request: GeneralHttpToModelRpcReq, timeout: int = 300
) -> GeneralModelToHttpRpcRsp:
    event = await self.get_event_for_func(request.func_name)
    await self.transfer_to_next_module(request)
    try:
        await asyncio.wait_for(event.wait(), timeout=timeout)
        ret = event.result

    except asyncio.TimeoutError:
        ret = GeneralModelToHttpRpcRsp(success=False, msg="wait for response timeout", func_name=request.func_name)
    except Exception as e:
        ret = GeneralModelToHttpRpcRsp(
            success=False, msg="wait for response error: %s" % str(e), func_name=request.func_name
        )
    return ret

high

The asyncio.Event object is not cleared after being used. If the same special request function is called again, the old event will still be set, causing event.wait() to return immediately with potentially stale data. This can lead to incorrect behavior. The event should be cleared before it's waited on.

Suggested change
async def http_to_model_special_request(
self, request: GeneralHttpToModelRpcReq, timeout: int = 300
) -> GeneralModelToHttpRpcRsp:
event = await self.get_event_for_func(request.func_name)
await self.transfer_to_next_module(request)
try:
await asyncio.wait_for(event.wait(), timeout=timeout)
ret = event.result
except asyncio.TimeoutError:
ret = GeneralModelToHttpRpcRsp(success=False, msg="wait for response timeout", func_name=request.func_name)
except Exception as e:
ret = GeneralModelToHttpRpcRsp(
success=False, msg="wait for response error: %s" % str(e), func_name=request.func_name
)
return ret
async def http_to_model_special_request(
self, request: GeneralHttpToModelRpcReq, timeout: int = 300
) -> GeneralModelToHttpRpcRsp:
event = await self.get_event_for_func(request.func_name)
event.clear()
await self.transfer_to_next_module(request)
try:
await asyncio.wait_for(event.wait(), timeout=timeout)
ret = event.result
except asyncio.TimeoutError:
ret = GeneralModelToHttpRpcRsp(success=False, msg="wait for response timeout", func_name=request.func_name)
except Exception as e:
ret = GeneralModelToHttpRpcRsp(
success=False, msg="wait for response error: %s" % str(e), func_name=request.func_name
)
return ret
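
The hazard is easy to reproduce outside the server. The standalone snippet below (not PR code) shows that a set-but-never-cleared asyncio.Event makes every later wait() return immediately, and that clear() restores the blocking behaviour:

import asyncio


async def main():
    event = asyncio.Event()
    event.set()                      # first response arrived, the event was set

    # Second "request": without clear(), wait() returns at once, so stale
    # state attached to the event would be read as if it were a new result.
    await asyncio.wait_for(event.wait(), timeout=1.0)
    print("wait() returned immediately on the stale event")

    event.clear()                    # the fix: reset before reusing the event
    try:
        await asyncio.wait_for(event.wait(), timeout=0.1)
    except asyncio.TimeoutError:
        print("after clear(), wait() blocks until a new set()")


asyncio.run(main())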

cur_group_reqs, is_busy, new_batch_first_router_need_tokens
)
if ok_insert:
if ok_insert and False:

high

The condition if ok_insert and False: will always evaluate to false, which prevents the last group of requests (cur_group_reqs) from being added to the can_run_list. This appears to be a logic error, likely a leftover from debugging, and will cause requests to be unnecessarily delayed or starved.

Suggested change
if ok_insert and False:
if ok_insert:

Comment on lines +1077 to +1080
def release_all(self):
    self.torch_memory_saver.pause(tag=MemoryTag.WEIGHT)
    self.torch_memory_saver.pause(tag=MemoryTag.KV_CACHE)
    self.torch_memory_saver.pause(tag=MemoryTag.GRAPH)

medium

The release_all method can be made more concise and maintainable by iterating over a list of memory tags. This avoids code repetition and makes it easier to add or remove tags in the future.

Suggested change
def release_all(self):
    self.torch_memory_saver.pause(tag=MemoryTag.WEIGHT)
    self.torch_memory_saver.pause(tag=MemoryTag.KV_CACHE)
    self.torch_memory_saver.pause(tag=MemoryTag.GRAPH)
def release_all(self):
    for tag in [MemoryTag.WEIGHT, MemoryTag.KV_CACHE, MemoryTag.GRAPH]:
        self.torch_memory_saver.pause(tag=tag)

Comment on lines +1091 to +1094
def resume_all(self):
    self.torch_memory_saver.resume(tag=MemoryTag.WEIGHT)
    self.torch_memory_saver.resume(tag=MemoryTag.KV_CACHE)
    self.torch_memory_saver.resume(tag=MemoryTag.GRAPH)

medium

The resume_all method can be made more concise and maintainable by iterating over a list of memory tags. This avoids code repetition and makes it easier to add or remove tags in the future.

Suggested change
def resume_all(self):
    self.torch_memory_saver.resume(tag=MemoryTag.WEIGHT)
    self.torch_memory_saver.resume(tag=MemoryTag.KV_CACHE)
    self.torch_memory_saver.resume(tag=MemoryTag.GRAPH)
def resume_all(self):
    for tag in [MemoryTag.WEIGHT, MemoryTag.KV_CACHE, MemoryTag.GRAPH]:
        self.torch_memory_saver.resume(tag=tag)

if abort_all:
for group_req_id in list(self.req_id_to_out_inf.keys()):
await self.abort(group_req_id)
pass

medium

The pass statement at the end of this method is unnecessary and can be removed for cleaner code.
