hiroTamada (Contributor) commented Jan 22, 2026

Summary

  • Add real-time VM resource utilization metrics using /proc/<pid>/stat and /proc/<pid>/statm for accurate per-process measurements
  • Use /proc instead of cgroups, which aggregate usage at the session level
  • Add REST endpoint GET /instances/{id}/stats for per-instance utilization data
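As a rough illustration of the /proc/<pid>/stat approach in the summary above, the sketch below parses CPU time from a stat line. It is not the PR's ReadProcStat: the function name parseCPUSeconds, the sample line, and the assumption of 100 clock ticks per second (the usual USER_HZ on Linux) are all illustrative.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clockTicksPerSec is an assumption: USER_HZ is 100 on typical Linux builds.
// Go's standard library has no wrapper for sysconf(_SC_CLK_TCK).
const clockTicksPerSec = 100

// parseCPUSeconds (hypothetical helper) extracts utime and stime, fields 14
// and 15 of /proc/<pid>/stat, and converts their sum from ticks to seconds.
func parseCPUSeconds(stat string) (float64, error) {
	// Field 2 (comm) is wrapped in parentheses and may contain spaces,
	// so split after the last ')' instead of splitting the whole line.
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line: no closing parenthesis")
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is field 3 (state), so utime is fields[11] and stime is fields[12].
	if len(fields) < 13 {
		return 0, fmt.Errorf("malformed stat line: %d fields after comm", len(fields))
	}
	utime, err := strconv.ParseUint(fields[11], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse utime: %w", err)
	}
	stime, err := strconv.ParseUint(fields[12], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse stime: %w", err)
	}
	return float64(utime+stime) / clockTicksPerSec, nil
}

func main() {
	// Made-up sample line; a real caller would read /proc/<pid>/stat.
	sample := "1234 (cloud hypervisor) S 1 1234 1234 0 -1 4194560 500 0 0 0 2000 994 0 0 20 0 4 0 100"
	secs, err := parseCPUSeconds(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("cpu_seconds=%.2f\n", secs)
}
```

Splitting after the last closing parenthesis matters because a process name containing spaces would otherwise shift every later field by one.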

New OTel Metrics

| Metric | Description |
| --- | --- |
| hypeman_vm_cpu_seconds_total | CPU time consumed by VM hypervisor process |
| hypeman_vm_allocated_vcpus | Number of vCPUs allocated to the VM |
| hypeman_vm_memory_rss_bytes | Resident Set Size (actual physical memory used) |
| hypeman_vm_memory_vms_bytes | Virtual Memory Size |
| hypeman_vm_allocated_memory_bytes | Total memory allocated (Size + HotplugSize) |
| hypeman_vm_network_rx_bytes_total | Network bytes received (from TAP interface) |
| hypeman_vm_network_tx_bytes_total | Network bytes transmitted (from TAP interface) |
| hypeman_vm_memory_utilization_ratio | RSS / allocated memory |

Stats Endpoint

curl -H "Authorization: Bearer <token>" http://localhost:8083/instances/{id}/stats

Response:

{
  "instance_id": "qilviffnqzck2jrim1x6s2b1",
  "instance_name": "test-vm",
  "cpu_seconds": 29.94,
  "memory_rss_bytes": 443338752,
  "memory_vms_bytes": 4330745856,
  "network_rx_bytes": 0,
  "network_tx_bytes": 0,
  "allocated_vcpus": 2,
  "allocated_memory_bytes": 4294967296,
  "memory_utilization_ratio": 0.103
}

Prometheus Queries

# CPU utilization as a fraction of allocated vCPUs (0-1)
rate(hypeman_vm_cpu_seconds_total[1m]) / hypeman_vm_allocated_vcpus

# Memory utilization
hypeman_vm_memory_rss_bytes / hypeman_vm_allocated_memory_bytes
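To make the arithmetic behind the two queries above concrete, here is a small sketch that computes the same quantities from raw samples. The function names and the second CPU sample (59.94 seconds, taken 60 seconds later) are assumptions for illustration. The memory figures come from the example response. Unlike Prometheus's rate(), this sketch returns 0 on a counter reset.

```go
package main

import "fmt"

// cpuUtilization approximates rate(cpu_seconds_total[interval]) / allocated_vcpus
// from two samples of the CPU-seconds counter.
func cpuUtilization(prevCPUSec, currCPUSec, intervalSec float64, vcpus int) float64 {
	// Simplification: invalid input or a counter reset yields 0.
	if intervalSec <= 0 || vcpus <= 0 || currCPUSec < prevCPUSec {
		return 0
	}
	return (currCPUSec - prevCPUSec) / intervalSec / float64(vcpus)
}

// memoryUtilization is RSS divided by allocated memory.
func memoryUtilization(rssBytes, allocatedBytes uint64) float64 {
	if allocatedBytes == 0 {
		return 0
	}
	return float64(rssBytes) / float64(allocatedBytes)
}

func main() {
	// 30 CPU-seconds consumed over 60 s on a 2-vCPU VM.
	fmt.Printf("cpu=%.2f\n", cpuUtilization(29.94, 59.94, 60, 2)) // prints cpu=0.25
	// Values from the example response body.
	fmt.Printf("mem=%.3f\n", memoryUtilization(443338752, 4294967296)) // prints mem=0.103
}
```

Because the counter covers the whole hypervisor process, the CPU figure may include work done by non-vCPU threads.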

Test plan

  • Unit tests pass (go test ./lib/resources/...)
  • Manual test: verify /instances/{id}/stats returns correct data for running VM
  • Manual test: verify OTel metrics appear in Signoz after deployment

Note

Adds real-time VM utilization surfaced via OTel and a public stats API.

  • Implements /proc/<pid>/stat + /proc/<pid>/statm and TAP reads in resources (ReadProcStat, ReadProcStatm, ReadTAPStats) with CollectVMUtilization
  • Registers OTel instruments in resources/utilization_metrics.go (e.g., hypeman_vm_cpu_seconds_total, hypeman_vm_memory_rss_bytes, ..._vms_bytes, ..._allocated_*, ..._network_{rx,tx}_bytes_total, ..._memory_utilization_ratio)
  • Extends resources.Manager to wire utilization source and initialize metrics; providers now call InitializeMetrics
  • instances.Manager adds ListRunningInstancesInfo and TAP name helper; used as UtilizationSource
  • Adds REST endpoint GET /instances/{id}/stats (oapi server/client, OpenAPI schema, API handler) returning InstanceStats
  • Updates Grafana dashboard with VM CPU, memory (RSS/VMS), network I/O, and memory utilization panels
  • Adds unit tests for utilization readers, metrics registration, and TAP naming

Written by Cursor Bugbot for commit cafebe8.
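For the TAP reads mentioned in the note above, one common way to get per-interface byte counters on Linux is the sysfs statistics directory. The sketch below assumes that approach. The PR's ReadTAPStats may work differently, and readTAPBytes, parseCounter, and the interface name tap0 are placeholders.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseCounter parses the body of one sysfs counter file, such as "12345\n".
func parseCounter(body string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(body), 10, 64)
}

// readTAPBytes (hypothetical helper) reads rx_bytes and tx_bytes for one
// interface. sysfsRoot is normally "/sys/class/net"; it is a parameter so
// the function can be pointed at a fixture directory.
// The counters are reported from the host side of the interface.
func readTAPBytes(sysfsRoot, ifName string) (rx, tx uint64, err error) {
	read := func(counter string) (uint64, error) {
		b, err := os.ReadFile(filepath.Join(sysfsRoot, ifName, "statistics", counter))
		if err != nil {
			return 0, err
		}
		return parseCounter(string(b))
	}
	if rx, err = read("rx_bytes"); err != nil {
		return 0, 0, err
	}
	if tx, err = read("tx_bytes"); err != nil {
		return 0, 0, err
	}
	return rx, tx, nil
}

func main() {
	// "tap0" is a placeholder; the real TAP name derives from the instance.
	rx, tx, err := readTAPBytes("/sys/class/net", "tap0")
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("rx=%d tx=%d\n", rx, tx)
}
```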

Add real-time VM resource utilization metrics using /proc/<pid>/stat and
/proc/<pid>/statm for accurate per-process measurements (instead of cgroups
which aggregate at the session level).

New metrics exported via OpenTelemetry:
- hypeman_vm_cpu_seconds_total: CPU time consumed by VM hypervisor
- hypeman_vm_allocated_vcpus: Number of vCPUs allocated
- hypeman_vm_memory_rss_bytes: Resident Set Size (actual physical memory)
- hypeman_vm_memory_vms_bytes: Virtual Memory Size
- hypeman_vm_allocated_memory_bytes: Total allocated memory
- hypeman_vm_network_rx_bytes_total: Network bytes received (from TAP)
- hypeman_vm_network_tx_bytes_total: Network bytes transmitted (from TAP)
- hypeman_vm_memory_utilization_ratio: RSS / allocated memory

Also adds REST endpoint GET /instances/{id}/stats for per-instance stats.
- Add InstanceStats schema and /instances/{id}/stats endpoint to openapi.yaml
- Regenerate oapi code with make oapi-generate
- Move stats implementation to instances.go following existing patterns
- Remove custom stats.go and route from main.go

github-actions bot commented Jan 22, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat(metrics): add per-VM resource utilization metrics

Edit this comment to update it. It will appear in the SDK's changelogs.

⚠️ hypeman-typescript

There was a regression in your SDK.
generate ⚠️ (prev: generate ✅) → build ✅ → lint ✅ → test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/b9586192a186850493fbf2166adb3fe6d2c65a7f/dist.tar.gz

New diagnostics (1 warning)
⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

⚠️ hypeman-go

There was a regression in your SDK.
generate ⚠️ (prev: generate ✅) → lint ✅ → test ✅

go get github.com/stainless-sdks/hypeman-go@840b32a2e7c794c5b575dc7105ca18c60878ea5e

New diagnostics (1 warning)
⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

hypeman-cli

Unknown conclusion: fatal

New diagnostics (1 warning)
⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-01-22 21:49:15 UTC

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Replace hardcoded 4096 page size with os.Getpagesize() to support
ARM systems (AWS Graviton, Apple Silicon) which may use 16KB or 64KB
pages. Without this fix, memory metrics would be underreported by
4x-16x on non-x86 systems.
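The page-size fix described above can be sketched as follows: /proc/<pid>/statm reports sizes in pages, so the byte values depend on the multiplier. The helper name parseStatmBytes and the sample line are illustrative rather than taken from the PR. os.Getpagesize() is the standard-library call the commit message names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatmBytes (hypothetical helper) converts the first two fields of
// /proc/<pid>/statm, total program size and resident set size, from pages
// to bytes using the supplied page size.
func parseStatmBytes(statm string, pageSize int) (vmsBytes, rssBytes uint64, err error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, 0, fmt.Errorf("malformed statm line: %d fields", len(fields))
	}
	sizePages, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse size: %w", err)
	}
	residentPages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse resident: %w", err)
	}
	ps := uint64(pageSize)
	return sizePages * ps, residentPages * ps, nil
}

func main() {
	// Made-up sample line; a real caller would read /proc/<pid>/statm.
	sample := "1057311 108237 2048 512 0 90000 0"
	// Use the kernel's actual page size instead of a hardcoded 4096.
	vms, rss, err := parseStatmBytes(sample, os.Getpagesize())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("vms=%d rss=%d (page size %d)\n", vms, rss, os.Getpagesize())
}
```

With 4096-byte pages the sample yields the RSS and VMS values from the example response. With 16 KB pages the same page counts correspond to four times as many bytes, which is the underreporting the commit message describes.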
}

// generateTAPName generates TAP device name from instance ID
func generateTAPName(instanceID string) string {
Collaborator

We should reuse the existing code that determines the TAP name from the instance name.

ListInstanceAllocations(ctx context.Context) ([]resources.InstanceAllocation, error)
// ListRunningInstancesInfo returns info needed for utilization metrics collection.
// Used by the resource manager for VM utilization tracking.
ListRunningInstancesInfo(ctx context.Context) ([]resources.InstanceUtilizationInfo, error)
Collaborator

Maybe most of the code in this PR can go in a new lib/ directory like lib/vm_metrics, with a manager like VmMetricsManager.


// generateTAPName generates TAP device name from instance ID.
// This matches the logic in network/allocate.go.
func generateTAPName(instanceID string) string {
Collaborator

here too


// VMUtilization holds actual resource utilization metrics for a VM.
// These are real-time values read from /proc/<pid>/stat, /proc/<pid>/statm, and TAP interfaces.
type VMUtilization struct {
Collaborator

I think it might be nice to move the new code for this feature in lib/vm_metrics or lib/utilization or whatever name is best, just because it's a separate feature from the current lib/resources feature which is about the host's resources. I think since this is all net-new that would allow for almost all this change to live in new feature directory and isolated from the rest mostly.

if err != nil {
log.DebugContext(ctx, "failed to read proc stat", "pid", pid, "error", err)
} else {
stats.CpuSeconds = float64(cpuUsec) / 1_000_000.0
Collaborator

There seems to be too much logic in the API handler. The handler ought to just translate from domain types (e.g. lib/utilization/types.go) into API types and deal with other handler-level concerns like error mapping.
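One way to read the suggestion above: derived values such as CPU seconds and the utilization ratio live on a domain type, and the handler only copies fields into the API type. The sketch below assumes that split. The domain type's fields and methods and the toInstanceStats mapper are hypothetical. Only the JSON keys are taken from the example response in the PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VMUtilization is a hypothetical domain type that owns the unit
// conversions and derived values.
type VMUtilization struct {
	InstanceID           string
	InstanceName         string
	CPUUsec              uint64
	MemoryRSSBytes       uint64
	MemoryVMSBytes       uint64
	NetworkRxBytes       uint64
	NetworkTxBytes       uint64
	AllocatedVcpus       int
	AllocatedMemoryBytes uint64
}

// CPUSeconds converts the raw microsecond counter to seconds.
func (u VMUtilization) CPUSeconds() float64 { return float64(u.CPUUsec) / 1_000_000.0 }

// MemoryUtilizationRatio is RSS / allocated memory, or 0 if nothing is allocated.
func (u VMUtilization) MemoryUtilizationRatio() float64 {
	if u.AllocatedMemoryBytes == 0 {
		return 0
	}
	return float64(u.MemoryRSSBytes) / float64(u.AllocatedMemoryBytes)
}

// InstanceStats mirrors the JSON shape of the example response.
type InstanceStats struct {
	InstanceID             string  `json:"instance_id"`
	InstanceName           string  `json:"instance_name"`
	CPUSeconds             float64 `json:"cpu_seconds"`
	MemoryRSSBytes         uint64  `json:"memory_rss_bytes"`
	MemoryVMSBytes         uint64  `json:"memory_vms_bytes"`
	NetworkRxBytes         uint64  `json:"network_rx_bytes"`
	NetworkTxBytes         uint64  `json:"network_tx_bytes"`
	AllocatedVcpus         int     `json:"allocated_vcpus"`
	AllocatedMemoryBytes   uint64  `json:"allocated_memory_bytes"`
	MemoryUtilizationRatio float64 `json:"memory_utilization_ratio"`
}

// toInstanceStats is the handler-level concern: a pure field mapping.
func toInstanceStats(u VMUtilization) InstanceStats {
	return InstanceStats{
		InstanceID:             u.InstanceID,
		InstanceName:           u.InstanceName,
		CPUSeconds:             u.CPUSeconds(),
		MemoryRSSBytes:         u.MemoryRSSBytes,
		MemoryVMSBytes:         u.MemoryVMSBytes,
		NetworkRxBytes:         u.NetworkRxBytes,
		NetworkTxBytes:         u.NetworkTxBytes,
		AllocatedVcpus:         u.AllocatedVcpus,
		AllocatedMemoryBytes:   u.AllocatedMemoryBytes,
		MemoryUtilizationRatio: u.MemoryUtilizationRatio(),
	}
}

func main() {
	u := VMUtilization{InstanceID: "example-id", InstanceName: "test-vm", CPUUsec: 29_940_000,
		MemoryRSSBytes: 443338752, MemoryVMSBytes: 4330745856,
		AllocatedVcpus: 2, AllocatedMemoryBytes: 4294967296}
	out, err := json.MarshalIndent(toInstanceStats(u), "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```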

Collaborator

Also, if this moves to a new lib/ directory, it would be nice to add a README explaining the feature, similar to other features in the repo.
