fix: k8s limits enforce stuff for pods, no need for separate overhead #117
base: main
Conversation
… of resource manager -> removed
📝 Walkthrough
The coordinator transitions from a resource-centric, polling-based model to an event-driven, queue-backed architecture: ResourceManager and related exports removed, ExecutionCoordinator simplified to accept explicit dispatcher/queue_manager collaborators, provider wiring moved to explicit multi-method providers with lifecycle-managed consumer startup, and docs/tests updated accordingly.
Sequence Diagram(s)
sequenceDiagram
participant App as Application
participant QM as QueueManager
participant Coord as ExecutionCoordinator
participant Disp as EventDispatcher
participant Prod as Producer
Note over Coord: Event-driven scheduling
App->>Coord: ExecutionRequested event
Coord->>QM: Enqueue execution (by priority)
QM-->>Coord: Enqueued confirmation
Coord->>Coord: Is item front of queue?
alt If front
Coord->>Disp: Emit CreatePodCommand
Disp->>Prod: Publish command
Prod-->>Worker: Worker receives command
Coord->>Coord: Mark active (dedup guard)
end
Worker->>Coord: ExecutionCompleted / Failed / Cancelled
Coord->>Coord: Remove active tracking
Coord->>Coord: Attempt schedule next (repeat)
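The scheduling flow in the diagram above can be sketched roughly as follows. This is a simplified illustration assuming a priority queue with pop/enqueue methods and a set-based dedup guard; the actual ExecutionCoordinator, EventDispatcher, and QueueManager interfaces in the repository may differ.

```python
# Simplified sketch of the event-driven scheduling loop; method names beyond
# those shown in the diagram are assumptions, not the repo's actual API.
class ExecutionCoordinator:
    def __init__(self, dispatcher, queue_manager):
        self.dispatcher = dispatcher
        self.queue_manager = queue_manager
        self._active = set()  # dedup guard for already-dispatched executions

    async def handle_execution_requested(self, event):
        # Enqueue by priority, then try to dispatch the front of the queue.
        await self.queue_manager.enqueue(event, priority=event.priority)
        await self._schedule_next()

    async def _schedule_next(self):
        event = await self.queue_manager.pop()  # highest-priority pending request
        if event is None or event.execution_id in self._active:
            return
        self._active.add(event.execution_id)  # mark active before dispatching
        # Emit the command; a producer downstream publishes it to the worker.
        await self.dispatcher.dispatch(CreatePodCommand(execution_id=event.execution_id))

    async def handle_execution_finished(self, event):
        # Completed / Failed / Cancelled all release tracking and reschedule.
        self._active.discard(event.execution_id)
        await self._schedule_next()
```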
sequenceDiagram
participant DI as DependencyInjector
participant CP as CoordinatorProvider
participant UC as UnifiedConsumer
participant Coord as ExecutionCoordinator
participant QM as QueueManager
Note over DI,CP: Provider wiring & lifecycle
DI->>CP: call get_coordinator_consumer(...)
CP->>CP: get_coordinator_dispatcher(...)
CP-->>CP: returns EventDispatcher
CP->>CP: get_queue_manager(...)
CP-->>CP: returns QueueManager
CP->>CP: get_execution_coordinator(...)
CP-->>CP: returns ExecutionCoordinator
CP->>UC: construct UnifiedConsumer(coordinator, dispatcher, ...)
UC->>QM: start queue manager lifecycle
UC->>Coord: start coordinator lifecycle (consumer-driven)
UC-->>DI: yields running consumer
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@backend/app/core/providers.py`:
- Around lines 743-752: The generator starts the queue manager with coordinator.queue_manager.start() and then calls consumer.start(); if consumer.start() raises, the finally block after yield never runs and the queue manager stays started. Wrap the consumer.start() call in a try/except that, on exception, awaits coordinator.queue_manager.stop() (optionally logging) and re-raises, so the queue manager is cleaned up on startup failure. Keep the existing finally block (which calls consumer.stop() and coordinator.queue_manager.stop()) for normal teardown after yield.
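A minimal sketch of the suggested guard, assuming the provider is an async generator and that coordinator.queue_manager and consumer expose the async start()/stop() methods quoted in the file context further down; parameter names and the surrounding DI wiring are illustrative, not the full providers.py implementation.

```python
# Sketch of the suggested startup guard inside the provider generator.
async def get_coordinator_consumer(coordinator, consumer, logger):
    await coordinator.queue_manager.start()
    try:
        await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
    except Exception:
        # Consumer startup failed after the queue manager was started:
        # stop the queue manager so no background task is orphaned, then re-raise.
        await coordinator.queue_manager.stop()
        raise
    logger.info("Coordinator consumer started")
    try:
        yield consumer
    finally:
        # Existing teardown path, unchanged: runs when the provider shuts down.
        await consumer.stop()
        await coordinator.queue_manager.stop()
```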
In `@backend/app/services/coordinator/coordinator.py`:
- Around lines 60-74: The wrapper methods (_handle_requested_wrapper, _handle_completed_wrapper, _handle_failed_wrapper, _handle_cancelled_wrapper) currently rely on assert for runtime type validation, which is stripped when Python runs with -O. Replace each assert isinstance(...) with an explicit isinstance check that raises a clear exception (e.g., TypeError or ValueError) when the event is not the expected type (ExecutionRequestedEvent, ExecutionCompletedEvent, ExecutionFailedEvent, ExecutionCancelledEvent), then call the existing handler (_handle_execution_requested, _handle_execution_completed, _handle_execution_failed, _handle_execution_cancelled).
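One wrapper rewritten this way might look like the following; the DomainEvent base type and the async handler signature are assumptions for illustration.

```python
# Illustrative only: the DomainEvent base type and handler signature are assumed;
# the explicit type check replacing assert is the point of the change.
async def _handle_requested_wrapper(self, event: DomainEvent) -> None:
    if not isinstance(event, ExecutionRequestedEvent):
        # Unlike assert, this check is not stripped under `python -O`.
        raise TypeError(
            f"Expected ExecutionRequestedEvent, got {type(event).__name__}"
        )
    await self._handle_execution_requested(event)
```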
In `@docs/components/workers/coordinator.md`:
- Line 34: The statement in coordinator.md that executions are "requeued with reduced priority" when resources are unavailable is inaccurate now that the coordinator no longer tracks resources. Update the sentence to reflect the actual requeue trigger (for example, when pod creation fails or the scheduler rejects the pod), or delete it entirely if the coordinator no longer requeues in those cases, and make sure the coordinator is referenced consistently in the updated text.
🧹 Nitpick comments (1)
backend/app/services/coordinator/coordinator.py (1)
179-186: Consider retry or requeue for transient scheduling failures. When `_publish_execution_started` fails, the execution is removed from `_active_executions` and marked as failed, but it was already popped from the queue. If the failure is transient (e.g., temporary Kafka unavailability), the execution is permanently lost rather than retried. This fail-fast approach is valid for simplicity, but consider whether transient failures should trigger a requeue with backoff rather than immediate failure, especially for user-initiated executions.
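If requeueing were preferred over fail-fast, a hedged sketch could look like this; the retry bound, the enqueue call, and the shape of the active-execution tracking are assumptions, not the project's actual API.

```python
import asyncio

MAX_START_ATTEMPTS = 3  # illustrative bound, not from the codebase

async def _start_with_retry(self, execution) -> None:
    for attempt in range(MAX_START_ATTEMPTS):
        try:
            await self._publish_execution_started(execution)
            return
        except Exception:
            # Back off before retrying what may be a transient failure
            # (e.g., momentary Kafka unavailability).
            await asyncio.sleep(2 ** attempt)
    # Out of attempts: release tracking and put the execution back on the
    # queue instead of dropping it, so it can be retried later.
    self._active_executions.pop(execution.execution_id, None)  # assumed dict-like
    await self.queue_manager.enqueue(execution)  # hypothetical requeue call
```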
1 issue found across 15 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/app/core/providers.py">
<violation number="1" location="backend/app/core/providers.py:743">
P2: If consumer.start fails, the queue manager cleanup task stays running because it’s started before any error handling. Ensure queue_manager.stop() is called when consumer startup fails to avoid orphaned tasks.</violation>
</file>
2 issues found across 16 files
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="backend/app/core/providers.py">
<violation number="1" location="backend/app/core/providers.py:743">
P2: Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</violation>
</file>
<file name="docs/architecture/kafka-topic-architecture.md">
<violation number="1" location="docs/architecture/kafka-topic-architecture.md:27">
P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</violation>
</file>
await coordinator.queue_manager.start()
await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
P2: Queue manager is started before the try/finally, so a failure in consumer.start(...) leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/core/providers.py, line 743:
<comment>Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</comment>
<file context>
@@ -702,9 +666,90 @@ def get_admin_user_service(
+ event_metrics=event_metrics,
+ )
+
+ await coordinator.queue_manager.start()
+ await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+ logger.info("Coordinator consumer started")
</file context>
Suggested change:
- await coordinator.queue_manager.start()
- await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+ try:
+     await coordinator.queue_manager.start()
+     await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+ except Exception:
+     await coordinator.queue_manager.stop()
+     raise
- Has the user exceeded their rate limit?
- Are sufficient resources available? (removed)
- Is the queue full? (added)
P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/architecture/kafka-topic-architecture.md, line 27:
<comment>This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</comment>
<file context>
@@ -24,7 +24,7 @@ When a user submits code, the API creates an `ExecutionRequestedEvent` and publi
- Has the user exceeded their rate limit?
-- Are sufficient resources available?
+- Is the queue full?
- Should this execution be prioritized or queued?
</file context>



Summary by cubic
Removed the in-process ResourceManager and now rely on Kubernetes pod requests/limits to enforce CPU/memory. Simplified the coordinator to event-driven scheduling and moved the Kafka consumer lifecycle to DI for cleaner startup/shutdown.
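As background on how the Kubernetes side takes over from ResourceManager: setting requests/limits on the executor pod lets the scheduler and kubelet enforce CPU/memory, with no separate overhead bookkeeping in the coordinator. A minimal sketch using the official Python client; the resource values and container naming are illustrative, not taken from this repository.

```python
# Illustrative pod resource spec; values and names are not from this repo.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
    limits={"cpu": "500m", "memory": "512Mi"},    # hard cap enforced by the kubelet
)

executor = client.V1Container(
    name="executor",
    image="executor:latest",
    resources=resources,
)

pod_spec = client.V1PodSpec(containers=[executor], restart_policy="Never")
```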
Written for commit abb0461. Summary will update on new commits.