
Conversation

@HardMax71 (Owner) commented Jan 29, 2026


Summary by cubic

Removed the in-process ResourceManager; CPU/memory are now enforced by Kubernetes pod requests/limits. Simplified the coordinator to event-driven scheduling and moved the Kafka consumer lifecycle into DI for cleaner startup/shutdown.
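
As context for the first point, here is a minimal sketch of the Kubernetes-side enforcement a pod spec provides via requests/limits, built with the official `kubernetes` Python client; the container name, image, and values are illustrative and not taken from this repo:

```python
from kubernetes import client

# Hypothetical pod spec: the scheduler and kubelet enforce these values,
# so no in-process ResourceManager is needed.
resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # scheduling guarantee
    limits={"cpu": "500m", "memory": "512Mi"},    # hard cap enforced by Kubernetes
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="execution-pod-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="runner",
                image="python:3.12-slim",
                resources=resources,
            )
        ],
    ),
)
```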

  • Refactors

    • Dropped ResourceManager and all allocation logic/metrics; coordinator now publishes CreatePodCommand directly.
    • Switched to reactive scheduling (no background loop); schedule next on queue-front or completion/failure/cancel.
    • Added DI providers for dispatcher, queue manager, coordinator, and a consumer that starts/stops both.
    • Simplified run_coordinator to resolve UnifiedConsumer and wait for shutdown.
    • Updated tests: removed resource manager tests and coordinator status/consumer assertions.
  • Docs

    • Updated architecture and worker docs to reflect K8s-enforced limits and queue-only coordinator.
    • Removed resource allocation references/metrics and clarified event flow and configuration.

Written for commit abb0461. Summary will update on new commits.

Summary by CodeRabbit

  • Refactor

    • Execution scheduling moved from resource-allocation polling to reactive, event-driven queue management; coordinator now publishes pod-creation commands and relies on Kubernetes for resource limits.
    • Coordinator lifecycle simplified: consumer/queue lifecycle is DI-managed and the coordinator is stateless.
  • Documentation

    • Architecture, component, lifecycle, and metrics docs updated to reflect queue-based prioritization, reactive scheduling, and removal of resource-allocation metrics.

@coderabbitai bot commented Jan 29, 2026

📝 Walkthrough

The coordinator transitions from a resource-centric, polling-based model to an event-driven, queue-backed architecture: ResourceManager and related exports removed, ExecutionCoordinator simplified to accept explicit dispatcher/queue_manager collaborators, provider wiring moved to explicit multi-method providers with lifecycle-managed consumer startup, and docs/tests updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Provider Wiring Reorganization**<br>`backend/app/core/providers.py` | Added multi-method provider pattern (`get_coordinator_dispatcher`, `get_queue_manager`, `get_execution_coordinator`, `get_coordinator_consumer`); removed constructor registration of `_provide_execution_coordinator` and `_provide_saga_orchestrator`; imported `QueueManager`. |
| **Coordinator Core Refactor**<br>`backend/app/services/coordinator/coordinator.py`, `backend/app/services/coordinator/__init__.py` | `ExecutionCoordinator` constructor simplified to accept producer, dispatcher, queue_manager, execution_repository, logger, coordinator_metrics; removed ResourceManager, EventStore, Settings dependencies and lifecycle mixins; replaced polling/scheduling loop with event-driven queue scheduling; removed ResourceAllocation/ResourceManager exports. |
| **Resource Management Removal**<br>`backend/app/services/coordinator/resource_manager.py` | Deleted module and all types/functions for resource pools, allocations, and ResourceManager (resource allocation, tracking, and metrics subsystem removed). |
| **Tests Updated / Removed**<br>`backend/tests/e2e/services/coordinator/test_execution_coordinator.py`, `backend/tests/unit/services/coordinator/test_resource_manager.py` | Removed resource-manager-focused unit tests and some e2e checks (status/resource tests and a consumer existence check); retained scheduling/priority/queue-related tests. |
| **Runtime Consumer Wiring**<br>`backend/workers/run_coordinator.py` | Replaced direct ExecutionCoordinator instantiation and polling loop with DI-triggered UnifiedConsumer startup; simplified shutdown to await shutdown event; added UnifiedConsumer import. |
| **Documentation Updates**<br>`docs/architecture/execution-queue.md`, `docs/architecture/kafka-topic-architecture.md`, `docs/architecture/lifecycle.md`, `docs/architecture/services-overview.md`, `docs/components/*`, `docs/operations/metrics-reference.md` | Reworked docs to reflect reactive scheduling and queue-based gating; removed resource allocation details/metrics and ResourceManager references; updated component descriptions and configuration tables to drop resource parameters. |
| **Container Composition**<br>`backend/app/core/container.py` | Added `CoordinatorProvider()` to main DI container composition. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant QM as QueueManager
    participant Coord as ExecutionCoordinator
    participant Disp as EventDispatcher
    participant Prod as Producer

    Note over Coord: Event-driven scheduling

    App->>Coord: ExecutionRequested event
    Coord->>QM: Enqueue execution (by priority)
    QM-->>Coord: Enqueued confirmation

    Coord->>Coord: Is item front of queue?
    alt If front
        Coord->>Disp: Emit CreatePodCommand
        Disp->>Prod: Publish command
        Prod-->>Worker: Worker receives command
        Coord->>Coord: Mark active (dedup guard)
    end

    Worker->>Coord: ExecutionCompleted / Failed / Cancelled
    Coord->>Coord: Remove active tracking
    Coord->>Coord: Attempt schedule next (repeat)
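
A compact sketch of this reactive loop follows. It is illustrative only: the class name, the single-active-slot simplification, and the `print` stand-in for the Kafka producer are assumptions; only the event names and the queue-front/dedup idea come from the PR.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class SketchCoordinator:
    """Toy model of the reactive scheduling above (not the repo's ExecutionCoordinator)."""
    queue: asyncio.PriorityQueue = field(default_factory=asyncio.PriorityQueue)
    active: set[str] = field(default_factory=set)

    async def on_execution_requested(self, execution_id: str, priority: int) -> None:
        await self.queue.put((priority, execution_id))  # enqueue by priority
        await self._schedule_next()

    async def on_execution_finished(self, execution_id: str) -> None:
        # Completed, failed, and cancelled all release tracking and trigger the next schedule.
        self.active.discard(execution_id)
        await self._schedule_next()

    async def _schedule_next(self) -> None:
        # Single-active-slot simplification; the real coordinator tracks a set of
        # active executions and guards against double-dispatch.
        if self.active or self.queue.empty():
            return
        _, execution_id = await self.queue.get()
        self.active.add(execution_id)
        print(f"CreatePodCommand -> {execution_id}")  # stand-in for publishing via Kafka
```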
sequenceDiagram
    participant DI as DependencyInjector
    participant CP as CoordinatorProvider
    participant UC as UnifiedConsumer
    participant Coord as ExecutionCoordinator
    participant QM as QueueManager

    Note over DI,CP: Provider wiring & lifecycle

    DI->>CP: call get_coordinator_consumer(...)
    CP->>CP: get_coordinator_dispatcher(...)
    CP-->>CP: returns EventDispatcher
    CP->>CP: get_queue_manager(...)
    CP-->>CP: returns QueueManager
    CP->>CP: get_execution_coordinator(...)
    CP-->>CP: returns ExecutionCoordinator
    CP->>UC: construct UnifiedConsumer(coordinator, dispatcher, ...)
    UC->>QM: start queue manager lifecycle
    UC->>Coord: start coordinator lifecycle (consumer-driven)
    UC-->>DI: yields running consumer
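
In code, the wiring above amounts to an async-generator provider that starts the queue manager and consumer, yields the running consumer, and stops both on teardown. The following is a hedged sketch, not the repo's `get_coordinator_consumer`: the DI framework and real signatures are omitted, the arguments are duck-typed objects with async `start()`/`stop()`, and it folds in the startup-failure cleanup the reviewers flag further down.

```python
from contextlib import asynccontextmanager

@asynccontextmanager
async def coordinator_consumer_lifecycle(consumer, queue_manager, subscriptions):
    # Assumed interfaces: both collaborators expose async start()/stop(),
    # mirroring UnifiedConsumer and QueueManager in this PR.
    await queue_manager.start()
    try:
        await consumer.start(subscriptions)
    except Exception:
        # If consumer startup fails before the yield, stop the queue manager
        # so no orphaned background task is left running.
        await queue_manager.stop()
        raise
    try:
        yield consumer  # DI keeps this alive for the worker's lifetime
    finally:
        await consumer.stop()
        await queue_manager.stop()
```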

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐇 I hopped from pools to queues today,
Events now guide where jobs should play.
No more counts of cores on my mind—
Just priorities and commands, well-timed.
Cheers to simpler hops and smoother day!

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.83%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title uses vague and colloquial language ('stuff', 'overhead') that obscures the actual technical change, making it unclear what the PR accomplishes. | Use a clear, specific title like 'Refactor coordinator to use Kubernetes resource limits instead of in-process ResourceManager' to accurately convey the main architectural change. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@backend/app/core/providers.py`:
- Around lines 743-752: The generator starts the queue manager with coordinator.queue_manager.start() and then calls consumer.start(); if consumer.start() raises, the finally block after the yield never runs and the queue manager stays started. Wrap the consumer.start() call in a try/except that, on exception, awaits coordinator.queue_manager.stop() (optionally logging) and re-raises, so the queue manager is cleaned up. Keep the existing finally block (which calls consumer.stop() and coordinator.queue_manager.stop()) for normal teardown after the yield.

In `@backend/app/services/coordinator/coordinator.py`:
- Around lines 60-74: The wrapper methods (_handle_requested_wrapper, _handle_completed_wrapper, _handle_failed_wrapper, _handle_cancelled_wrapper) use assert for runtime type validation, which is stripped under Python -O. Replace each assert isinstance(...) with an explicit isinstance check that raises a clear exception (e.g., TypeError or ValueError) when the event is not the expected type (ExecutionRequestedEvent, ExecutionCompletedEvent, ExecutionFailedEvent, ExecutionCancelledEvent), then call the existing handler (_handle_execution_requested, _handle_execution_completed, _handle_execution_failed, _handle_execution_cancelled). A minimal sketch of one wrapper follows this list.

In `@docs/components/workers/coordinator.md`:
- Line 34: The statement in coordinator.md that "When resources are unavailable, executions are requeued with reduced priority" is inaccurate now that the coordinator no longer tracks resources. Update the sentence to describe the actual requeue trigger, e.g., "requeued with reduced priority when pod creation fails or when the scheduler rejects the pod", or delete the line if the coordinator no longer requeues in those cases, and keep references to the coordinator consistent in the updated text.
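
A minimal sketch of one wrapper under the coordinator.py suggestion above; the event and handler names come from that comment, and the surrounding ExecutionCoordinator class is omitted:

```python
async def _handle_requested_wrapper(self, event: object) -> None:
    # Explicit runtime check instead of `assert isinstance(...)`,
    # which is stripped when Python runs with -O.
    if not isinstance(event, ExecutionRequestedEvent):
        raise TypeError(f"Expected ExecutionRequestedEvent, got {type(event).__name__}")
    await self._handle_execution_requested(event)
```
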
🧹 Nitpick comments (1)
backend/app/services/coordinator/coordinator.py (1)

179-186: Consider retry or requeue for transient scheduling failures.

When _publish_execution_started fails, the execution is removed from _active_executions and marked as failed, but it was already popped from the queue. If the failure is transient (e.g., temporary Kafka unavailability), the execution is permanently lost rather than retried.

This fail-fast approach is valid for simplicity, but consider whether transient failures should trigger a requeue with backoff rather than immediate failure, especially for user-initiated executions.
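
If this nitpick is taken up, one hedged shape for it is sketched below. Only `_publish_execution_started` and `_active_executions` come from the comment; `schedule_attempts`, `_mark_failed`, `queue_manager.requeue`, and the retry budget are hypothetical names, not existing APIs in this repo.

```python
import asyncio

async def _schedule_with_requeue(self, execution) -> None:
    """Illustrative only: requeue on transient publish failure instead of failing outright."""
    try:
        await self._publish_execution_started(execution)
    except Exception:
        self._active_executions.discard(execution.id)
        if execution.schedule_attempts >= 3:             # hypothetical retry budget
            await self._mark_failed(execution)           # fall back to current fail-fast path
            return
        execution.schedule_attempts += 1
        await asyncio.sleep(2 ** execution.schedule_attempts)  # simple exponential backoff
        await self.queue_manager.requeue(execution)             # hypothetical requeue API
```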

@cubic-dev-ai bot left a comment

1 issue found across 15 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/core/providers.py">

<violation number="1" location="backend/app/core/providers.py:743">
P2: If consumer.start fails, the queue manager cleanup task stays running because it’s started before any error handling. Ensure queue_manager.stop() is called when consumer startup fails to avoid orphaned tasks.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@sonarqubecloud

@cubic-dev-ai bot left a comment

2 issues found across 16 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="backend/app/core/providers.py">

<violation number="1" location="backend/app/core/providers.py:743">
P2: Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</violation>
</file>

<file name="docs/architecture/kafka-topic-architecture.md">

<violation number="1" location="docs/architecture/kafka-topic-architecture.md:27">
P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +743 to +744
await coordinator.queue_manager.start()
await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
@cubic-dev-ai bot commented Jan 29, 2026

P2: Queue manager is started before the try/finally, so a failure in consumer.start(...) leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/app/core/providers.py, line 743:

<comment>Queue manager is started before the `try/finally`, so a failure in `consumer.start(...)` leaves it running without cleanup. Wrap startup in a try/except to stop the queue manager on startup failures.</comment>

<file context>
@@ -702,9 +666,90 @@ def get_admin_user_service(
+            event_metrics=event_metrics,
+        )
+
+        await coordinator.queue_manager.start()
+        await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+        logger.info("Coordinator consumer started")
</file context>
Suggested change
-await coordinator.queue_manager.start()
-await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+try:
+    await coordinator.queue_manager.start()
+    await consumer.start(list(CONSUMER_GROUP_SUBSCRIPTIONS[GroupId.EXECUTION_COORDINATOR]))
+except Exception:
+    await coordinator.queue_manager.stop()
+    raise

 - Has the user exceeded their rate limit?
-- Are sufficient resources available?
+- Is the queue full?
@cubic-dev-ai bot commented Jan 29, 2026

P3: This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/architecture/kafka-topic-architecture.md, line 27:

<comment>This new queue-based validation conflicts with the following sentence about executions "waiting for resources," which is now outdated and confusing after removing resource management. Update the later sentence to align with queue-based scheduling.</comment>

<file context>
@@ -24,7 +24,7 @@ When a user submits code, the API creates an `ExecutionRequestedEvent` and publi
 
 - Has the user exceeded their rate limit?
-- Are sufficient resources available?
+- Is the queue full?
 - Should this execution be prioritized or queued?
 
</file context>