feat(metrics): add per-VM resource utilization metrics #67

hiroTamada · 2026-01-22T21:04:07Z

Summary

Add real-time VM resource utilization metrics using /proc/<pid>/stat and /proc/<pid>/statm for accurate per-process measurements
Use /proc instead of cgroups to avoid session-level aggregation issues
Add REST endpoint GET /instances/{id}/stats for per-instance utilization data
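As a rough illustration of the /proc/<pid>/stat approach described above, here is a minimal sketch of turning that file into CPU seconds. The helper name `parseCPUSeconds` and the fixed clock-tick rate of 100 are assumptions made for this example; the PR's own `ReadProcStat` may be structured differently (the diff suggests it returns microseconds).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// clockTicksPerSecond is USER_HZ. Assumed to be 100, the usual Linux value;
// a real reader might query sysconf(_SC_CLK_TCK) instead.
const clockTicksPerSecond = 100

// parseCPUSeconds (hypothetical helper) extracts utime+stime from the
// contents of /proc/<pid>/stat and converts clock ticks to seconds.
func parseCPUSeconds(stat string) (float64, error) {
	// The comm field is wrapped in parentheses and may itself contain
	// spaces or ')' characters, so split on the last ')'.
	end := strings.LastIndexByte(stat, ')')
	if end < 0 {
		return 0, fmt.Errorf("malformed stat line: missing ')'")
	}
	fields := strings.Fields(stat[end+1:])
	// fields[0] is field 3 (state), so utime (field 14) and stime
	// (field 15) sit at indexes 11 and 12.
	if len(fields) < 13 {
		return 0, fmt.Errorf("malformed stat line: only %d fields after comm", len(fields))
	}
	utime, err := strconv.ParseUint(fields[11], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse utime: %w", err)
	}
	stime, err := strconv.ParseUint(fields[12], 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse stime: %w", err)
	}
	return float64(utime+stime) / clockTicksPerSecond, nil
}

func main() {
	raw, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("read failed (not Linux?):", err)
		return
	}
	secs, err := parseCPUSeconds(string(raw))
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("cpu_seconds=%.2f\n", secs)
}
```

Splitting on the last closing parenthesis matters because a process name can contain spaces, which would otherwise shift every field index.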

New OTel Metrics

Metric	Description
`hypeman_vm_cpu_seconds_total`	CPU time consumed by VM hypervisor process
`hypeman_vm_allocated_vcpus`	Number of vCPUs allocated to the VM
`hypeman_vm_memory_rss_bytes`	Resident Set Size (actual physical memory used)
`hypeman_vm_memory_vms_bytes`	Virtual Memory Size
`hypeman_vm_allocated_memory_bytes`	Total memory allocated (Size + HotplugSize)
`hypeman_vm_network_rx_bytes_total`	Network bytes received (from TAP interface)
`hypeman_vm_network_tx_bytes_total`	Network bytes transmitted (from TAP interface)
`hypeman_vm_memory_utilization_ratio`	RSS / allocated memory
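The network counters in the table are described as coming from the TAP interface. One common way to read such counters on Linux is the per-interface sysfs statistics files; the sketch below assumes that approach. The helper names and the `sysfsRoot` parameter are hypothetical, and the PR's `ReadTAPStats` may obtain the values differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// tapCounterPath (hypothetical helper) builds the sysfs path of one
// statistics counter, e.g. /sys/class/net/tap0/statistics/rx_bytes.
func tapCounterPath(sysfsRoot, iface, counter string) string {
	return filepath.Join(sysfsRoot, "class", "net", iface, "statistics", counter)
}

// parseCounter parses the single decimal value a statistics file contains.
func parseCounter(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readTAPCounter reads and parses one counter for the given interface.
func readTAPCounter(sysfsRoot, iface, counter string) (uint64, error) {
	raw, err := os.ReadFile(tapCounterPath(sysfsRoot, iface, counter))
	if err != nil {
		return 0, fmt.Errorf("read %s for %s: %w", counter, iface, err)
	}
	return parseCounter(string(raw))
}

func main() {
	// "lo" is used only so the example runs on any Linux host; a real
	// caller would pass the VM's TAP device name.
	for _, counter := range []string{"rx_bytes", "tx_bytes"} {
		v, err := readTAPCounter("/sys", "lo", counter)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Printf("%s=%d\n", counter, v)
	}
}
```

Counters on a TAP device are reported from the host's point of view, so an implementation has to decide which one it exposes as the guest's "received" and "transmitted" bytes.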

Stats Endpoint

curl -H "Authorization: Bearer <token>" http://localhost:8083/instances/{id}/stats

Response:

{
  "instance_id": "qilviffnqzck2jrim1x6s2b1",
  "instance_name": "test-vm",
  "cpu_seconds": 29.94,
  "memory_rss_bytes": 443338752,
  "memory_vms_bytes": 4330745856,
  "network_rx_bytes": 0,
  "network_tx_bytes": 0,
  "allocated_vcpus": 2,
  "allocated_memory_bytes": 4294967296,
  "memory_utilization_ratio": 0.103
}
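For readers consuming this response from Go without the generated client, a minimal decoding sketch follows. The `InstanceStats` struct here is a hand-written mirror of the JSON above, not the type generated by oapi, and `recomputeRatio` is a hypothetical helper that re-derives `memory_utilization_ratio` from the two byte counts.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InstanceStats mirrors the example response; field names follow the JSON
// keys shown above. This is an illustrative type, not the generated one.
type InstanceStats struct {
	InstanceID             string  `json:"instance_id"`
	InstanceName           string  `json:"instance_name"`
	CPUSeconds             float64 `json:"cpu_seconds"`
	MemoryRSSBytes         int64   `json:"memory_rss_bytes"`
	MemoryVMSBytes         int64   `json:"memory_vms_bytes"`
	NetworkRxBytes         int64   `json:"network_rx_bytes"`
	NetworkTxBytes         int64   `json:"network_tx_bytes"`
	AllocatedVCPUs         int     `json:"allocated_vcpus"`
	AllocatedMemoryBytes   int64   `json:"allocated_memory_bytes"`
	MemoryUtilizationRatio float64 `json:"memory_utilization_ratio"`
}

// decodeInstanceStats unmarshals a response body into InstanceStats.
func decodeInstanceStats(body []byte) (InstanceStats, error) {
	var s InstanceStats
	if err := json.Unmarshal(body, &s); err != nil {
		return InstanceStats{}, fmt.Errorf("decode stats: %w", err)
	}
	return s, nil
}

// recomputeRatio derives RSS / allocated memory, returning 0 when the
// allocation is unknown to avoid dividing by zero.
func recomputeRatio(s InstanceStats) float64 {
	if s.AllocatedMemoryBytes <= 0 {
		return 0
	}
	return float64(s.MemoryRSSBytes) / float64(s.AllocatedMemoryBytes)
}

func main() {
	body := []byte(`{"instance_id":"qilviffnqzck2jrim1x6s2b1","instance_name":"test-vm",
		"cpu_seconds":29.94,"memory_rss_bytes":443338752,"memory_vms_bytes":4330745856,
		"network_rx_bytes":0,"network_tx_bytes":0,"allocated_vcpus":2,
		"allocated_memory_bytes":4294967296,"memory_utilization_ratio":0.103}`)
	s, err := decodeInstanceStats(body)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%s: reported ratio %.3f, recomputed %.3f\n",
		s.InstanceName, s.MemoryUtilizationRatio, recomputeRatio(s))
}
```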

Prometheus Queries

# CPU utilization (0-1 per vCPU)
rate(hypeman_vm_cpu_seconds_total[1m]) / hypeman_vm_allocated_vcpus

# Memory utilization
hypeman_vm_memory_rss_bytes / hypeman_vm_allocated_memory_bytes
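To make the arithmetic behind the CPU query concrete, the sketch below computes the same quantity from two samples of the cumulative counter: the increase in CPU seconds divided by elapsed wall-clock seconds, then by the vCPU count. The `cpuSample` type and function name are hypothetical, and this is a simplification of what `rate()` does (no extrapolation and no counter-reset handling beyond rejecting a decrease).

```go
package main

import "fmt"

// cpuSample is one observation of the cumulative CPU counter.
type cpuSample struct {
	cpuSeconds  float64 // value of hypeman_vm_cpu_seconds_total
	unixSeconds float64 // wall-clock time of the observation
}

// cpuUtilizationPerVCPU returns (delta CPU seconds / elapsed seconds) / vcpus.
// The second result is false when the inputs cannot yield a meaningful rate.
func cpuUtilizationPerVCPU(prev, cur cpuSample, vcpus int) (float64, bool) {
	elapsed := cur.unixSeconds - prev.unixSeconds
	if elapsed <= 0 || vcpus <= 0 || cur.cpuSeconds < prev.cpuSeconds {
		return 0, false
	}
	return (cur.cpuSeconds - prev.cpuSeconds) / elapsed / float64(vcpus), true
}

func main() {
	prev := cpuSample{cpuSeconds: 29.94, unixSeconds: 0}
	cur := cpuSample{cpuSeconds: 59.94, unixSeconds: 60}
	// 30 CPU seconds over 60 wall-clock seconds on 2 vCPUs.
	if u, ok := cpuUtilizationPerVCPU(prev, cur, 2); ok {
		fmt.Printf("utilization per vCPU: %.2f\n", u)
	}
}
```

Because the counter covers the whole hypervisor process, the result can include VMM overhead on top of guest vCPU time.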

Test plan

Unit tests pass (go test ./lib/resources/...)
Manual test: verify /instances/{id}/stats returns correct data for running VM
Manual test: verify OTel metrics appear in Signoz after deployment

Note

Adds real-time VM utilization surfaced via OTel and a public stats API.

Implements /proc/<pid>/stat + /proc/<pid>/statm and TAP reads in resources (ReadProcStat, ReadProcStatm, ReadTAPStats) with CollectVMUtilization
Registers OTel instruments in resources/utilization_metrics.go (e.g., hypeman_vm_cpu_seconds_total, hypeman_vm_memory_rss_bytes, ..._vms_bytes, ..._allocated_*, ..._network_{rx,tx}_bytes_total, ..._memory_utilization_ratio)
Extends resources.Manager to wire utilization source and initialize metrics; providers now call InitializeMetrics
instances.Manager adds ListRunningInstancesInfo and TAP name helper; used as UtilizationSource
Adds REST endpoint GET /instances/{id}/stats (oapi server/client, OpenAPI schema, API handler) returning InstanceStats
Updates Grafana dashboard with VM CPU, memory (RSS/VMS), network I/O, and memory utilization panels
Adds unit tests for utilization readers, metrics registration, and TAP naming

Written by Cursor Bugbot for commit cafebe8. This will update automatically on new commits.

Add real-time VM resource utilization metrics using /proc/<pid>/stat and /proc/<pid>/statm for accurate per-process measurements (instead of cgroups, which aggregate at the session level).

New metrics exported via OpenTelemetry:
- hypeman_vm_cpu_seconds_total: CPU time consumed by VM hypervisor
- hypeman_vm_allocated_vcpus: Number of vCPUs allocated
- hypeman_vm_memory_rss_bytes: Resident Set Size (actual physical memory)
- hypeman_vm_memory_vms_bytes: Virtual Memory Size
- hypeman_vm_allocated_memory_bytes: Total allocated memory
- hypeman_vm_network_rx_bytes_total: Network bytes received (from TAP)
- hypeman_vm_network_tx_bytes_total: Network bytes transmitted (from TAP)
- hypeman_vm_memory_utilization_ratio: RSS / allocated memory

Also adds REST endpoint GET /instances/{id}/stats for per-instance stats.

- Add InstanceStats schema and /instances/{id}/stats endpoint to openapi.yaml
- Regenerate oapi code with make oapi-generate
- Move stats implementation to instances.go following existing patterns
- Remove custom stats.go and route from main.go

github-actions · 2026-01-22T21:11:11Z

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat(metrics): add per-VM resource utilization metrics

Edit this comment to update it. It will appear in the SDK's changelogs.

⚠️

hypeman-typescript studio · code · diff

There was a regression in your SDK.
generate ⚠️ (prev: generate ✅) → build ✅ → lint ✅ → test ✅
npm install https://pkg.stainless.com/s/hypeman-typescript/b9586192a186850493fbf2166adb3fe6d2c65a7f/dist.tar.gz
New diagnostics (1 warning)

⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

⚠️

hypeman-go studio · code · diff

There was a regression in your SDK.
generate ⚠️ (prev: generate ✅) → lint ✅ → test ✅
go get github.com/stainless-sdks/hypeman-go@840b32a2e7c794c5b575dc7105ca18c60878ea5e
New diagnostics (1 warning)

⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

❗ hypeman-cli studio

Unknown conclusion: fatal

New diagnostics (1 warning)

⚠️ Endpoint/NotConfigured: `get /instances/{id}/stats` exists in the OpenAPI spec, but isn't specified in the Stainless config, so code will not be generated for it.

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-01-22 21:49:15 UTC

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

lib/resources/utilization.go

Replace hardcoded 4096 page size with os.Getpagesize() to support ARM systems (AWS Graviton, Apple Silicon) which may use 16KB or 64KB pages. Without this fix, memory metrics would be underreported by 4x-16x on non-x86 systems.
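A minimal sketch of the fix described above: /proc/<pid>/statm reports sizes in pages, so the byte values depend on multiplying by the runtime page size rather than a constant. The helper name `parseStatm` and its signature are assumptions for illustration; only `os.Getpagesize()` is taken from the comment itself.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatm (hypothetical helper) converts the first two statm fields,
// total program size and resident set size, from pages to bytes.
// It returns (rssBytes, vmsBytes, error).
func parseStatm(statm string, pageSize int) (uint64, uint64, error) {
	fields := strings.Fields(statm)
	if len(fields) < 2 {
		return 0, 0, fmt.Errorf("malformed statm: got %d fields", len(fields))
	}
	sizePages, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse size: %w", err)
	}
	residentPages, err := strconv.ParseUint(fields[1], 10, 64)
	if err != nil {
		return 0, 0, fmt.Errorf("parse resident: %w", err)
	}
	page := uint64(pageSize)
	return residentPages * page, sizePages * page, nil
}

func main() {
	raw, err := os.ReadFile("/proc/self/statm")
	if err != nil {
		fmt.Println("read failed (not Linux?):", err)
		return
	}
	// os.Getpagesize() reports the kernel's page size for this host,
	// e.g. 4096 on most x86-64 systems and 16384 or 65536 on some ARM ones.
	rss, vms, err := parseStatm(string(raw), os.Getpagesize())
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Printf("rss=%d bytes vms=%d bytes\n", rss, vms)
}
```

Passing the page size as a parameter keeps the parsing logic testable with fixed inputs, while the caller supplies the host's actual value.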

sjmiller609 · 2026-01-22T22:11:05Z

cmd/api/api/instances.go

+}
+
+// generateTAPName generates TAP device name from instance ID
+func generateTAPName(instanceID string) string {


We should reuse the existing code that determines the TAP name from the instance name.

sjmiller609 · 2026-01-22T22:14:45Z

lib/instances/manager.go

 	ListInstanceAllocations(ctx context.Context) ([]resources.InstanceAllocation, error)
+	// ListRunningInstancesInfo returns info needed for utilization metrics collection.
+	// Used by the resource manager for VM utilization tracking.
+	ListRunningInstancesInfo(ctx context.Context) ([]resources.InstanceUtilizationInfo, error)


Maybe most of the code in this PR can go in a new lib/ directory like lib/vm_metrics, with a manager like VmMetricsManager.

sjmiller609 · 2026-01-22T22:15:13Z

lib/instances/manager.go

+
+// generateTAPName generates TAP device name from instance ID.
+// This matches the logic in network/allocate.go.
+func generateTAPName(instanceID string) string {


sjmiller609 · 2026-01-22T22:17:36Z

lib/resources/utilization.go

+
+// VMUtilization holds actual resource utilization metrics for a VM.
+// These are real-time values read from /proc/<pid>/stat, /proc/<pid>/statm, and TAP interfaces.
+type VMUtilization struct {


I think it might be nice to move the new code for this feature into lib/vm_metrics or lib/utilization (or whatever name is best), since it's a separate feature from the current lib/resources feature, which is about the host's resources. Since this is all net-new, almost all of this change could live in a new feature directory, mostly isolated from the rest.

sjmiller609 · 2026-01-22T22:19:02Z

cmd/api/api/instances.go

+		if err != nil {
+			log.DebugContext(ctx, "failed to read proc stat", "pid", pid, "error", err)
+		} else {
+			stats.CpuSeconds = float64(cpuUsec) / 1_000_000.0


There seems to be too much logic in the API handler. The handler ought to just translate from domain types (e.g. lib/utilization/types.go) into API types and deal with other handler-level concerns like error mapping.

Also, if this moves to a new lib/ directory, it would be nice to add a README explaining the feature, similar to other features in the repo.
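One way to picture the split suggested here is a domain type that owns the measurements and derived values, with the handler reduced to a field-by-field mapping. All type, field, and function names in this sketch are hypothetical; it is not the layout the PR ended up with.

```go
package main

import "fmt"

// VMUtilization is an illustrative domain type, as it might live in a
// package like lib/utilization. Values are kept in their raw units.
type VMUtilization struct {
	CPUUsec              uint64
	MemoryRSSBytes       uint64
	AllocatedMemoryBytes uint64
}

// MemoryUtilizationRatio is domain logic: RSS / allocated memory,
// or 0 when nothing is allocated.
func (u VMUtilization) MemoryUtilizationRatio() float64 {
	if u.AllocatedMemoryBytes == 0 {
		return 0
	}
	return float64(u.MemoryRSSBytes) / float64(u.AllocatedMemoryBytes)
}

// InstanceStats is an illustrative API type with a subset of the
// response fields.
type InstanceStats struct {
	InstanceID             string
	CPUSeconds             float64
	MemoryRSSBytes         int64
	MemoryUtilizationRatio float64
}

// toInstanceStats is all the handler would do: convert units and copy
// fields, leaving measurement and derivation to the domain package.
func toInstanceStats(id string, u VMUtilization) InstanceStats {
	return InstanceStats{
		InstanceID:             id,
		CPUSeconds:             float64(u.CPUUsec) / 1_000_000.0,
		MemoryRSSBytes:         int64(u.MemoryRSSBytes),
		MemoryUtilizationRatio: u.MemoryUtilizationRatio(),
	}
}

func main() {
	u := VMUtilization{
		CPUUsec:              29_940_000,
		MemoryRSSBytes:       443338752,
		AllocatedMemoryBytes: 4294967296,
	}
	fmt.Printf("%+v\n", toInstanceStats("qilviffnqzck2jrim1x6s2b1", u))
}
```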

hiroTamada added 2 commits January 22, 2026 16:02

fix: add ListRunningInstancesInfo to mock in builds tests

58fc604

cursor bot reviewed Jan 22, 2026

hiroTamada requested review from rgarcia and sjmiller609 January 22, 2026 21:49

sjmiller609 reviewed Jan 22, 2026

3 participants
