Skip to content

Conversation

@abuccts
Copy link
Member

@abuccts abuccts commented Jan 27, 2026

Fix numa parsing in job exporter on GB200.

Fix numa parsing in job exporter on GB200.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a NUMA parsing issue in the job exporter that occurs on GB200 hardware. The GB200's numactl --hardware output includes header lines in the format "node X cpus:" without CPU lists, which the previous code would incorrectly attempt to parse.

Changes:

  • Added a check to skip lines ending with "cpus:" to prevent parsing empty CPU lists on GB200 systems
Comments suppressed due to low confidence (2)

src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281

  • The fix correctly filters out header lines ending with "cpus:" to prevent parsing issues on GB200. However, there's a potential logic issue in the subsequent lines (278-281). When a valid line is found (passing all three conditions on line 276), the code checks for ':' again on line 278, which is redundant since line 276 already requires 'cpus' to be in the line. More importantly, if a line passes the check on line 276 but doesn't contain ':', the code continues without processing that line, potentially missing valid NUMA data.

Consider refactoring the inner condition to be clearer about what it's checking, or combine the checks on lines 276 and 278 into a single, more explicit condition.

        if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
            current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
            if ':' in line:
                cpus_str = line.split(': ')[1].split()
                for cpu in cpus_str:
                    numa_mapping[int(cpu)] = current_numa_domain

src/job-exporter/src/Moneo/src/worker/exporters/node_exporter.py:281

  • This change fixes a parsing issue for GB200 hardware but lacks test coverage. The repository has comprehensive test coverage for other exporters (test_amd.py, test_nvidia.py, test_collector.py, etc.), but there are no tests for the get_core_numa_mapping function or the node_exporter module. Consider adding unit tests that verify the function correctly handles various numactl output formats, including:
  1. Standard output with CPU lists
  2. GB200-specific output with empty "node X cpus:" lines
  3. The fallback to lscpu when core_count doesn't match

This would prevent regressions and document the expected behavior across different hardware platforms.

        if 'node ' in line and 'cpus' in line and not line.strip().endswith('cpus:'):
            current_numa_domain = int(re.search(r'node (\d+)', line).group(1))
            if ':' in line:
                cpus_str = line.split(': ')[1].split()
                for cpu in cpus_str:
                    numa_mapping[int(cpu)] = current_numa_domain

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants