
Conversation

@HardMax71 (Owner) commented Jan 29, 2026


Summary by cubic

Removed the in-process ResourceManager; CPU/memory are now enforced by Kubernetes pod requests/limits. Simplified the coordinator to event-driven scheduling and moved the Kafka consumer lifecycle into DI for cleaner startup/shutdown.
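
As context for the first point, here is a minimal sketch of the Kubernetes-side enforcement a pod spec provides via requests/limits, built with the official `kubernetes` Python client; the container name, image, and values are illustrative and not taken from this repo:

```python
from kubernetes import client

# Hypothetical pod spec: the scheduler and kubelet enforce these values,
# so no in-process ResourceManager is needed.
resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # scheduling guarantee
    limits={"cpu": "500m", "memory": "512Mi"},    # hard cap enforced by Kubernetes
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="execution-pod-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="runner",
                image="python:3.12-slim",
                resources=resources,
            )
        ],
    ),
)
```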

  • Refactors

    • Dropped ResourceManager and all allocation logic/metrics; coordinator now publishes CreatePodCommand directly.
    • Switched to reactive scheduling (no background loop); schedule next on queue-front or completion/failure/cancel.
    • Added DI providers for dispatcher, queue manager, coordinator, and a consumer that starts/stops both.
    • Simplified run_coordinator to resolve UnifiedConsumer and wait for shutdown.
    • Updated tests: removed resource manager tests and coordinator status/consumer assertions.
  • Docs

    • Updated architecture and worker docs to reflect K8s-enforced limits and queue-only coordinator.
    • Removed resource allocation references/metrics and clarified event flow and configuration.

Written for commit abb0461. Summary will update on new commits.

Summary by CodeRabbit

  • Refactor

    • Execution scheduling moved from resource-allocation polling to reactive, event-driven queue management; coordinator now publishes pod-creation commands and relies on Kubernetes for resource limits.
    • Coordinator lifecycle simplified: consumer/queue lifecycle is DI-managed and the coordinator is stateless.
  • Documentation

    • Architecture, component, lifecycle, and metrics docs updated to reflect queue-based prioritization, reactive scheduling, and removal of resource-allocation metrics.

@coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

The coordinator transitions from a resource-centric, polling-based model to an event-driven, queue-backed architecture: ResourceManager and related exports removed, ExecutionCoordinator simplified to accept explicit dispatcher/queue_manager collaborators, provider wiring moved to explicit multi-method providers with lifecycle-managed consumer startup, and docs/tests updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Provider Wiring Reorganization**<br>`backend/app/core/providers.py` | Added multi-method provider pattern (`get_coordinator_dispatcher`, `get_queue_manager`, `get_execution_coordinator`, `get_coordinator_consumer`); removed constructor registration of `_provide_execution_coordinator` and `_provide_saga_orchestrator`; imported `QueueManager`. |
| **Coordinator Core Refactor**<br>`backend/app/services/coordinator/coordinator.py`, `backend/app/services/coordinator/__init__.py` | `ExecutionCoordinator` constructor simplified to accept producer, dispatcher, queue_manager, execution_repository, logger, coordinator_metrics; removed ResourceManager, EventStore, Settings dependencies and lifecycle mixins; replaced polling/scheduling loop with event-driven queue scheduling; removed ResourceAllocation/ResourceManager exports. |
| **Resource Management Removal**<br>`backend/app/services/coordinator/resource_manager.py` | Deleted module and all types/functions for resource pools, allocations, and ResourceManager (resource allocation, tracking, and metrics subsystem removed). |
| **Tests Updated / Removed**<br>`backend/tests/e2e/services/coordinator/test_execution_coordinator.py`, `backend/tests/unit/services/coordinator/test_resource_manager.py` | Removed resource-manager-focused unit tests and some e2e checks (status/resource tests and a consumer existence check); retained scheduling/priority/queue-related tests. |
| **Runtime Consumer Wiring**<br>`backend/workers/run_coordinator.py` | Replaced direct ExecutionCoordinator instantiation and polling loop with DI-triggered UnifiedConsumer startup; simplified shutdown to await shutdown event; added UnifiedConsumer import. |
| **Documentation Updates**<br>`docs/architecture/execution-queue.md`, `docs/architecture/kafka-topic-architecture.md`, `docs/architecture/lifecycle.md`, `docs/architecture/services-overview.md`, `docs/components/*`, `docs/operations/metrics-reference.md` | Reworked docs to reflect reactive scheduling and queue-based gating; removed resource allocation details/metrics and ResourceManager references; updated component descriptions and configuration tables to drop resource parameters. |
| **Container Composition**<br>`backend/app/core/container.py` | Added `CoordinatorProvider()` to main DI container composition. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant QM as QueueManager
    participant Coord as ExecutionCoordinator
    participant Disp as EventDispatcher
    participant Prod as Producer

    Note over Coord: Event-driven scheduling

    App->>Coord: ExecutionRequested event
    Coord->>QM: Enqueue execution (by priority)
    QM-->>Coord: Enqueued confirmation

    Coord->>Coord: Is item front of queue?
    alt If front
        Coord->>Disp: Emit CreatePodCommand
        Disp->>Prod: Publish command
        Prod-->>Worker: Worker receives command
        Coord->>Coord: Mark active (dedup guard)
    end

    Worker->>Coord: ExecutionCompleted / Failed / Cancelled
    Coord->>Coord: Remove active tracking
    Coord->>Coord: Attempt schedule next (repeat)
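
A compact sketch of this reactive loop follows. It is illustrative only: the class name, the single-active-slot simplification, and the `print` stand-in for the Kafka producer are assumptions; only the event names and the queue-front/dedup idea come from the PR.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class SketchCoordinator:
    """Toy model of the reactive scheduling above (not the repo's ExecutionCoordinator)."""
    queue: asyncio.PriorityQueue = field(default_factory=asyncio.PriorityQueue)
    active: set[str] = field(default_factory=set)

    async def on_execution_requested(self, execution_id: str, priority: int) -> None:
        await self.queue.put((priority, execution_id))  # enqueue by priority
        await self._schedule_next()

    async def on_execution_finished(self, execution_id: str) -> None:
        # Completed, failed, and cancelled all release tracking and trigger the next schedule.
        self.active.discard(execution_id)
        await self._schedule_next()

    async def _schedule_next(self) -> None:
        # Single-active-slot simplification; the real coordinator tracks a set of
        # active executions and guards against double-dispatch.
        if self.active or self.queue.empty():
            return
        _, execution_id = await self.queue.get()
        self.active.add(execution_id)
        print(f"CreatePodCommand -> {execution_id}")  # stand-in for publishing via Kafka
```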
sequenceDiagram
    participant DI as DependencyInjector
    participant CP as CoordinatorProvider
    participant UC as UnifiedConsumer
    participant Coord as ExecutionCoordinator
    participant QM as QueueManager

    Note over DI,CP: Provider wiring & lifecycle

    DI->>CP: call get_coordinator_consumer(...)
    CP->>CP: get_coordinator_dispatcher(...)
    CP-->>CP: returns EventDispatcher
    CP->>CP: get_queue_manager(...)
    CP-->>CP: returns QueueManager
    CP->>CP: get_execution_coordinator(...)
    CP-->>CP: returns ExecutionCoordinator
    CP->>UC: construct UnifiedConsumer(coordinator, dispatcher, ...)
    UC->>QM: start queue manager lifecycle
    UC->>Coord: start coordinator lifecycle (consumer-driven)
    UC-->>DI: yields running consumer
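
In code, the wiring above amounts to an async-generator provider that starts the queue manager and consumer, yields the running consumer, and stops both on teardown. The following is a hedged sketch, not the repo's `get_coordinator_consumer`: the DI framework and real signatures are omitted, the arguments are duck-typed objects with async `start()`/`stop()`, and it folds in the startup-failure cleanup the reviewers flag further down.

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def coordinator_consumer_lifecycle(consumer, queue_manager, subscriptions):
    # Assumed interfaces: both collaborators expose async start()/stop(),
    # mirroring UnifiedConsumer and QueueManager in this PR.
    await queue_manager.start()
    try:
        await consumer.start(subscriptions)
    except Exception:
        # If consumer startup fails before the yield, stop the queue manager
        # so no orphaned background task is left running.
        await queue_manager.stop()
        raise
    try:
        yield consumer  # DI keeps this alive for the worker's lifetime
    finally:
        await consumer.stop()
        await queue_manager.stop()
```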

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I hopped from pools to queues today,
Events now guide where jobs should play.
No more counts of cores on my mind—
Just priorities and commands, well-timed.
Cheers to simpler hops and smoother day!

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.83%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title uses vague and colloquial language ('stuff', 'overhead') that obscures the actual technical change, making it unclear what the PR accomplishes. | Use a clear, specific title like 'Refactor coordinator to use Kubernetes resource limits instead of in-process ResourceManager' to accurately convey the main architectural change. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@backend/app/core/providers.py`:
- Around lines 743-752: The generator starts the queue manager with coordinator.queue_manager.start() and then calls consumer.start(); if consumer.start() raises, the finally block after the yield never runs and the queue manager stays started. Wrap the consumer.start() call in a try/except that, on exception, awaits coordinator.queue_manager.stop() (optionally logging) and re-raises, so the queue manager is cleaned up. Keep the existing finally block (which calls consumer.stop() and coordinator.queue_manager.stop()) for normal teardown after the yield.

In `@backend/app/services/coordinator/coordinator.py`:
- Around lines 60-74: The wrapper methods (_handle_requested_wrapper, _handle_completed_wrapper, _handle_failed_wrapper, _handle_cancelled_wrapper) use assert for runtime type validation, which is stripped under Python -O. Replace each assert isinstance(...) with an explicit isinstance check that raises a clear exception (e.g., TypeError or ValueError) when the event is not the expected type (ExecutionRequestedEvent, ExecutionCompletedEvent, ExecutionFailedEvent, ExecutionCancelledEvent), then call the existing handler (_handle_execution_requested, _handle_execution_completed, _handle_execution_failed, _handle_execution_cancelled). A minimal sketch of one wrapper follows this list.

In `@docs/components/workers/coordinator.md`:
- Line 34: The statement in coordinator.md that "When resources are unavailable, executions are requeued with reduced priority" is inaccurate now that the coordinator no longer tracks resources. Update the sentence to describe the actual requeue trigger, e.g., "requeued with reduced priority when pod creation fails or when the scheduler rejects the pod", or delete the line if the coordinator no longer requeues in those cases, and keep references to the coordinator consistent in the updated text.
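
A minimal sketch of one wrapper under the coordinator.py suggestion above; the event and handler names come from that comment, and the surrounding ExecutionCoordinator class is omitted:

```python
async def _handle_requested_wrapper(self, event: object) -> None:
    # Explicit runtime check instead of `assert isinstance(...)`,
    # which is stripped when Python runs with -O.
    if not isinstance(event, ExecutionRequestedEvent):
        raise TypeError(f"Expected ExecutionRequestedEvent, got {type(event).__name__}")
    await self._handle_execution_requested(event)
```
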
🧹 Nitpick comments (1)
backend/app/services/coordinator/coordinator.py (1)

179-186: Consider retry or requeue for transient scheduling failures.

When _publish_execution_started fails, the execution is removed from _active_executions and marked as failed, but it was already popped from the queue. If the failure is transient (e.g., temporary Kafka unavailability), the execution is permanently lost rather than retried.

This fail-fast approach is valid for simplicity, but consider whether transient failures should trigger a requeue with backoff rather than immediate failure, especially for user-initiated executions.
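
If this nitpick is taken up, one hedged shape for it is sketched below. Only `_publish_execution_started` and `_active_executions` come from the comment; `schedule_attempts`, `_mark_failed`, `queue_manager.requeue`, and the retry budget are hypothetical names, not existing APIs in this repo.

```python
import asyncio

async def _schedule_with_requeue(self, execution) -> None:
    """Illustrative only: requeue on transient publish failure instead of failing outright."""
    try:
        await self._publish_execution_started(execution)
    except Exception:
        self._active_executions.discard(execution.id)
        if execution.schedule_attempts >= 3:             # hypothetical retry budget
            await self._mark_failed(execution)           # fall back to current fail-fast path
            return
        execution.schedule_attempts += 1
        await asyncio.sleep(2 ** execution.schedule_attempts)  # simple exponential backoff
        await self.queue_manager.requeue(execution)             # hypothetical requeue API
```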

@cubic-dev-ai bot left a comment

1 issue found across 15 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/core/providers.py">

<violation number="1" location="backend/app/core/providers.py:743">
P2: If consumer.start fails, the queue manager cleanup task stays running because it’s started before any error handling. Ensure queue_manager.stop() is called when consumer startup fails to avoid orphaned tasks.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@sonarqubecloud

@cubic-dev-ai bot left a comment

2 issues found across 16 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/core/providers.py">

<violation number="1" location="backend/app/core/providers.py:743">
P2: Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</violation>
</file>

<file name="docs/architecture/kafka-topic-architecture.md">

<violation number="1" location="docs/architecture/kafka-topic-architecture.md:27">
P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +743 to +744
await coordinator.queue_manager.start()
await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
@cubic-dev-ai bot commented Jan 29, 2026

P2: Queue manager is started before the try/finally, so a failure in consumer.start(...) leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/core/providers.py, line 743:

<comment>Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</comment>

<file context>
@@ -702,9 +666,90 @@ def get_admin_user_service(
+            event_metrics=event_metrics,
+        )
+
+        await coordinator.queue_manager.start()
+        await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+        logger.info("Coordinator consumer started")
</file context>
Suggested change
-await coordinator.queue_manager.start()
-await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+try:
+    await coordinator.queue_manager.start()
+    await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+except Exception:
+    await coordinator.queue_manager.stop()
+    raise

 - Has the user exceeded their rate limit?
-- Are sufficient resources available?
+- Is the queue full?
@cubic-dev-ai bot commented Jan 29, 2026

P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/architecture/kafka-topic-architecture.md, line 27:

<comment>This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</comment>

<file context>
@@ -24,7 +24,7 @@ When a user submits code, the API creates an `ExecutionRequestedEvent` and publi
 
 - Has the user exceeded their rate limit?
-- Are sufficient resources available?
+- Is the queue full?
 - Should this execution be prioritized or queued?
 
</file context>