
[Bug] HeadNode checks all compute nodes when adding queue, causing timeout with running jobs #7203

@almightychang

Description

Required Info:

  • AWS ParallelCluster version: 3.14.0
  • Cluster name: pcluster-prod
  • Region: us-east-2
  • Output of pcluster describe-cluster command:
{
  "clusterName": "pcluster-prod",
  "version": "3.14.0",
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce:

When adding a new SLURM queue to an existing ParallelCluster, the cluster update fails with a HeadNodeWaitCondition timeout after 35 minutes. The root cause is that the HeadNode readiness check validates cluster_config_version for all existing compute nodes, not just the nodes in the newly added queue.

Root Cause

The HeadNode runs /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py which calls check_deployed_config_version(). This function queries all compute/login nodes in the cluster:

for instance_ids in list_cluster_instance_ids_iterator(
    cluster_name=cluster_name,
    node_type=["Compute", "LoginNode"],  # ← Checks ALL nodes
    instance_state=["running"],
    region=region,
):

Problem: When adding a new queue:

  • New config version is generated (e.g., iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr)
  • Existing nodes in other queues still have old version (e.g., x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX)
  • These nodes have running jobs (18+ hours of compute) and were never updated
  • TERMINATE strategy does NOT terminate nodes when only adding queues (only applies to modified queues)
  • Result: Version mismatch → readiness check fails → timeout → rollback
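The mismatch is visible directly in the cluster's DynamoDB table, where each instance records the config version it deployed. A minimal way to confirm it for one of the affected instances (table name, key layout, and instance ID taken from the error log and workaround below):

# Read the cluster_config_version recorded for one affected instance
aws dynamodb get-item \
  --table-name parallelcluster-pcluster-prod \
  --region us-east-2 \
  --key '{"Id":{"S":"CLUSTER_CONFIG.i-0cb03846fa80df86f"}}' \
  --projection-expression "#d.#cv" \
  --expression-attribute-names '{"#d":"Data","#cv":"cluster_config_version"}'
# Returns the old version (x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX), while the readiness check expects the new one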

Steps to Reproduce

  1. Create a cluster with existing queues that have running compute nodes:
SlurmQueues:
  - Name: existing-queue
    ComputeResources:
      - Name: compute-resource
        InstanceType: p5e.48xlarge
        MinCount: 2
        MaxCount: 10
  2. Submit long-running jobs (12+ hours) to the existing queue

  3. While jobs are running, add a NEW queue to config.yaml:

SlurmQueues:
  # ... existing queues unchanged ...
  - Name: new-queue  # ← NEW QUEUE
    CapacityType: ONDEMAND
    ComputeResources:
      - Name: new-compute
        InstanceType: g6e.12xlarge
        MinCount: 0
        MaxCount: 4
  4. Run pcluster update-cluster --cluster-name pcluster-prod --cluster-configuration config.yaml

  5. Observe:

    • CloudFormation ComputeFleet update succeeds (new queue created)
    • HeadNode readiness check fails with:
      CheckFailedError: Check failed due to the following erroneous records:
        * wrong records (7): [
            ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
            ...
          ]
      
    • After 35 minutes: HeadNodeWaitCondition timeout
    • CloudFormation rolls back to previous version

Expected Behavior

When adding a new queue to an existing cluster:

  1. New queue should be added without affecting existing queues
  2. Existing compute nodes should NOT need to be updated
  3. Config version check should either:
    • Only check nodes in modified queues, OR
    • Skip version check when only adding queues

Actual Behavior

  • HeadNode checks config_version for all compute nodes regardless of which queues changed
  • Adding a queue changes the global config version
  • Existing nodes retain old config version since they weren't updated
  • Readiness check fails → timeout → rollback

Impact

  • Cannot add new queues to a cluster with running jobs
  • Must wait for all jobs to complete or manually terminate them
  • Defeats the purpose of QueueUpdateStrategy: DRAIN, which is meant to allow updates without interrupting jobs (see the snippet after this list)
  • Prevents elastic capacity scaling during production workloads
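For reference, QueueUpdateStrategy is configured under the Slurm scheduling settings; the snippet below is a minimal illustration of the setting this report refers to (values are examples only):

Scheduling:
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN  # or TERMINATE; neither avoids the all-node version check described here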

Evidence

From HeadNode /var/log/chef-client.log:

File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
  raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
  * missing records (0): []
  * incomplete records (0): []
  * wrong records (7): [
      ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0636cf08ef148ccc9', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-06c7a230de8c7e2a1', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ed7b4f23f5e71954', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-03bd69018b354ad0e', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-05c4f92e4e3ed0732', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
      ('i-0ce324d164ac31a1c', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX')
    ]
[2026-01-21T07:45:32+00:00] INFO: Retrying execution of execute[Check cluster readiness], 7 attempts left

Affected Nodes (7 nodes across 4 different queues, all with running jobs):

  • Queue alinlab-gpu-2c: 1 node, 2 jobs
  • Queue rlwrld-gpu: 2 nodes, 2 jobs
  • Queue rlwrld-cpu: 1 node, 1 job
  • Queue alinlab-gpu-2b: 3 nodes, 3 jobs
  • Total: 8 running jobs, longest running 18.5 hours

None of these queues were modified - only a new queue was added.

Workaround

Manually update cluster_config_version in DynamoDB for existing nodes:

# 1. Get old and new config versions from HeadNode logs
OLD_VERSION="x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
NEW_VERSION="iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"

# 2. List affected instance IDs from error logs
INSTANCE_IDS=(
  i-0cb03846fa80df86f
  i-0636cf08ef148ccc9
  i-06c7a230de8c7e2a1
  i-0ed7b4f23f5e71954
  i-03bd69018b354ad0e
  i-05c4f92e4e3ed0732
  i-0ce324d164ac31a1c
)

# 3. Update config_version in DynamoDB
for instance_id in "${INSTANCE_IDS[@]}"; do
  aws dynamodb update-item \
    --table-name parallelcluster-pcluster-prod \
    --region us-east-2 \
    --key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.$instance_id\"}}" \
    --update-expression "SET #data.#cv = :newver, #data.#time = :time" \
    --expression-attribute-names '{"#data":"Data","#cv":"cluster_config_version","#time":"lastUpdateTime"}' \
    --expression-attribute-values "{\":newver\":{\"S\":\"$NEW_VERSION\"},\":time\":{\"S\":\"$(date -u '+%Y-%m-%d %H:%M:%S UTC')\"}}"
done

# 4. Wait for HeadNode to complete readiness check (~2 minutes)

Result: Cluster update completed successfully, jobs continued running without interruption.
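To confirm the retry picks up the corrected versions, the readiness check can be watched on the HeadNode (same log file as in the Evidence section) and the overall update status polled from a workstation:

# On the HeadNode: watch the readiness check retries until they stop failing
sudo tail -f /var/log/chef-client.log | grep "Check cluster readiness"

# From a workstation: poll the cluster until the update is reported as complete
pcluster describe-cluster --cluster-name pcluster-prod --region us-east-2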

Safety: This workaround is safe when:

  • Only adding new queues (no modifications to existing queues)
  • No CustomActions or IAM policy changes affecting existing nodes
  • The update only touches metadata, not the actual node configuration

Proposed Fix

Option A: Modify check_deployed_config_version() to only check nodes in modified queues

from typing import List, Optional

def check_deployed_config_version(
    cluster_name: str,
    table_name: str,
    expected_config_version: str,
    region: str,
    modified_queues: Optional[List[str]] = None,  # ← NEW PARAMETER
):
    if modified_queues:
        # Only check nodes in the specified queues,
        # filtering instances by the tag parallelcluster:queue-name
        ...
    else:
        # Check all nodes (backward compatible)
        ...
Option B: Skip version check when only adding queues (no modifications)

def check_deployed_config_version(..., change_type: Optional[str] = None):
    if change_type == 'QUEUE_ADDITION':
        logger.info("Skipping config version check for queue addition")
        return

Option C: Add configuration option to skip readiness check

DevSettings:
  SkipConfigVersionCheckOnQueueAddition: true

Infrastructure Available

ParallelCluster already tags instances with queue names:

# cli/src/pcluster/constants.py
PCLUSTER_QUEUE_NAME_TAG = "parallelcluster:queue-name"

The tag exists but check_cluster_ready.py doesn't use it for filtering.
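As a rough sketch of what queue-scoped filtering could look like (illustrative only: the queue name, and the use of the standard parallelcluster:cluster-name tag, are assumptions rather than part of the current check):

# Hypothetical: enumerate only the running compute instances of one queue via its queue-name tag
aws ec2 describe-instances \
  --region us-east-2 \
  --filters "Name=tag:parallelcluster:cluster-name,Values=pcluster-prod" \
            "Name=tag:parallelcluster:queue-name,Values=existing-queue" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId"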

Environment Details

  • ParallelCluster Version: 3.14.0
  • Region: us-east-2
  • Scheduler: SLURM
  • QueueUpdateStrategy: TERMINATE (changing to DRAIN doesn't help)
  • Date: 2026-01-21

This appears to be a design issue where the readiness check assumes a monolithic cluster configuration rather than supporting incremental queue additions. A fix would enable truly elastic HPC clusters that can scale capacity without disrupting running workloads.
