Description
Required Info:
- AWS ParallelCluster version: 3.14.0
- Cluster name: pcluster-prod
- Region: us-east-2
- Output of pcluster describe-cluster command:
{
  "clusterName": "pcluster-prod",
  "version": "3.14.0",
  "clusterStatus": "UPDATE_FAILED",
  "cloudFormationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Bug description and how to reproduce:
When adding a new SLURM queue to an existing ParallelCluster, the cluster update fails with a HeadNodeWaitCondition timeout after 35 minutes. The root cause is that the HeadNode readiness check validates cluster_config_version for all existing compute nodes, not just the nodes in the newly added queue.
Root Cause
The HeadNode runs /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py which calls check_deployed_config_version(). This function queries all compute/login nodes in the cluster:
    for instance_ids in list_cluster_instance_ids_iterator(
        cluster_name=cluster_name,
        node_type=["Compute", "LoginNode"],  # ← Checks ALL nodes
        instance_state=["running"],
        region=region,
    ):

Problem: When adding a new queue:
- A new config version is generated (e.g., iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr)
- Existing nodes in other queues still have the old version (e.g., x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX)
- These nodes have running jobs (18+ hours of compute) and were never updated
- The TERMINATE strategy does NOT terminate nodes when only adding queues (it only applies to modified queues)
- Result: version mismatch → readiness check fails → timeout → rollback
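To make the failure mode concrete, here is an illustrative sketch (not the actual check_cluster_ready.py code) of the comparison the readiness check effectively performs, assuming the DynamoDB layout shown in the workaround below (items keyed by CLUSTER_CONFIG.<instance-id> with a Data.cluster_config_version attribute); the function name is hypothetical:

import boto3

def find_wrong_records(table_name, region, instance_ids, expected_version):
    # Return (instance_id, stored_version) pairs whose stored config version
    # differs from the expected one -- the "wrong records" the check reports.
    table = boto3.resource("dynamodb", region_name=region).Table(table_name)
    wrong = []
    for instance_id in instance_ids:
        item = table.get_item(Key={"Id": f"CLUSTER_CONFIG.{instance_id}"}).get("Item", {})
        stored_version = item.get("Data", {}).get("cluster_config_version")
        if stored_version != expected_version:
            wrong.append((instance_id, stored_version))
    return wrong

Any node launched before the update still reports the old version, so adding a queue turns every pre-existing compute node into a "wrong record" even though its queue was not touched.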
Steps to Reproduce
- Create a cluster with existing queues that have running compute nodes:
SlurmQueues:
  - Name: existing-queue
    ComputeResources:
      - Name: compute-resource
        InstanceType: p5e.48xlarge
        MinCount: 2
        MaxCount: 10

- Submit long-running jobs (12+ hours) to the existing queue
- While jobs are running, add a NEW queue to config.yaml:
SlurmQueues:
  # ... existing queues unchanged ...
  - Name: new-queue            # ← NEW QUEUE
    CapacityType: ONDEMAND
    ComputeResources:
      - Name: new-compute
        InstanceType: g6e.12xlarge
        MinCount: 0
        MaxCount: 4

- Run pcluster update-cluster --cluster-name pcluster-prod --cluster-configuration config.yaml
- Observe:
  - CloudFormation ComputeFleet update succeeds (new queue created)
  - HeadNode readiness check fails with:
    CheckFailedError: Check failed due to the following erroneous records: * wrong records (7): [ ('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'), ... ]
  - After 35 minutes: HeadNodeWaitCondition timeout
  - CloudFormation rolls back to the previous version
Expected Behavior
When adding a new queue to an existing cluster:
- New queue should be added without affecting existing queues
- Existing compute nodes should NOT need to be updated
- Config version check should either:
  - only check nodes in modified queues, OR
  - skip the version check when only adding queues
Actual Behavior
- HeadNode checks config_version for all compute nodes regardless of which queues changed
- Adding a queue changes the global config version
- Existing nodes retain old config version since they weren't updated
- Readiness check fails → timeout → rollback
Impact
- Cannot add new queues to a cluster with running jobs
- Must wait for all jobs to complete or manually terminate them
- Defeats the purpose of QueueUpdateStrategy: DRAIN, which is meant to allow updates without interrupting jobs
- Prevents elastic capacity scaling during production workloads
Evidence
From HeadNode /var/log/chef-client.log:
File "/opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py", line 116, in check_deployed_config_version
raise CheckFailedError(
common.exceptions.CheckFailedError: Check failed due to the following erroneous records:
* missing records (0): []
* incomplete records (0): []
* wrong records (7): [
('i-0cb03846fa80df86f', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-0636cf08ef148ccc9', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-06c7a230de8c7e2a1', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-0ed7b4f23f5e71954', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-03bd69018b354ad0e', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-05c4f92e4e3ed0732', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX'),
('i-0ce324d164ac31a1c', 'x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX')
]
[2026-01-21T07:45:32+00:00] INFO: Retrying execution of execute[Check cluster readiness], 7 attempts left
Affected Nodes (7 nodes across 4 different queues, all with running jobs):
- Queue alinlab-gpu-2c: 1 node, 2 jobs
- Queue rlwrld-gpu: 2 nodes, 2 jobs
- Queue rlwrld-cpu: 1 node, 1 job
- Queue alinlab-gpu-2b: 3 nodes, 3 jobs
- Total: 8 running jobs, longest running 18.5 hours
None of these queues were modified - only a new queue was added.
Workaround
Manually update cluster_config_version in DynamoDB for existing nodes:
# 1. Get old and new config versions from HeadNode logs
OLD_VERSION="x2XtFktnSTEzs3v_r4L4Qjze1k0TyCyX"
NEW_VERSION="iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"
# 2. List affected instance IDs from error logs
INSTANCE_IDS=(
i-0cb03846fa80df86f
i-0636cf08ef148ccc9
i-06c7a230de8c7e2a1
i-0ed7b4f23f5e71954
i-03bd69018b354ad0e
i-05c4f92e4e3ed0732
i-0ce324d164ac31a1c
)
# 3. Update config_version in DynamoDB
for instance_id in "${INSTANCE_IDS[@]}"; do
aws dynamodb update-item \
--table-name parallelcluster-pcluster-prod \
--region us-east-2 \
--key "{\"Id\":{\"S\":\"CLUSTER_CONFIG.$instance_id\"}}" \
--update-expression "SET #data.#cv = :newver, #data.#time = :time" \
--expression-attribute-names '{"#data":"Data","#cv":"cluster_config_version","#time":"lastUpdateTime"}' \
--expression-attribute-values "{\":newver\":{\"S\":\"$NEW_VERSION\"},\":time\":{\"S\":\"$(date -u '+%Y-%m-%d %H:%M:%S UTC')\"}}"
done
# 4. Wait for HeadNode to complete readiness check (~2 minutes)

Result: Cluster update completed successfully; jobs continued running without interruption.
Safety: This workaround is safe when:
- Only adding new queues (no modifications to existing queues)
- No CustomActions or IAM policy changes affecting existing nodes
- Essentially updating metadata only, not actual node configuration
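For anyone scripting the workaround from Python rather than the AWS CLI, here is a minimal boto3 sketch of the same metadata update, reusing the table name, key layout, and attribute names from the loop above (fill in the full instance list for your cluster):

from datetime import datetime, timezone
import boto3

TABLE_NAME = "parallelcluster-pcluster-prod"
REGION = "us-east-2"
NEW_VERSION = "iGGLMANbWzWVLBzgVcfpxqP8UHsEMowr"
INSTANCE_IDS = ["i-0cb03846fa80df86f", "i-0636cf08ef148ccc9"]  # ...remaining affected nodes

table = boto3.resource("dynamodb", region_name=REGION).Table(TABLE_NAME)
now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")

for instance_id in INSTANCE_IDS:
    # Same operation as the aws dynamodb update-item call above: set
    # Data.cluster_config_version to the new version and refresh lastUpdateTime.
    table.update_item(
        Key={"Id": f"CLUSTER_CONFIG.{instance_id}"},
        UpdateExpression="SET #data.#cv = :newver, #data.#time = :time",
        ExpressionAttributeNames={
            "#data": "Data",
            "#cv": "cluster_config_version",
            "#time": "lastUpdateTime",
        },
        ExpressionAttributeValues={":newver": NEW_VERSION, ":time": now},
    )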
Proposed Fix
Option A: Modify check_deployed_config_version() to only check nodes in modified queues
from typing import List, Optional

def check_deployed_config_version(
    cluster_name: str,
    table_name: str,
    expected_config_version: str,
    region: str,
    modified_queues: Optional[List[str]] = None,  # ← NEW PARAMETER
):
    if modified_queues:
        # Only check nodes in the specified queues,
        # filtering by tag:parallelcluster:queue-name
        ...
    else:
        # Check all nodes (backward compatible)
        ...
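One way a caller could populate modified_queues is by diffing the previous and new cluster configurations. The sketch below is illustrative only (it assumes both YAML files are available to the check, which is not current ParallelCluster behavior) and deliberately excludes newly added queues, since their nodes do not exist yet:

import yaml

def changed_queue_names(old_config_path, new_config_path):
    # Return names of queues present in both configs whose definition changed.
    def queues(path):
        with open(path) as f:
            config = yaml.safe_load(f)
        return {q["Name"]: q for q in config["Scheduling"]["SlurmQueues"]}

    old_queues, new_queues = queues(old_config_path), queues(new_config_path)
    return [name for name, queue in new_queues.items()
            if name in old_queues and old_queues[name] != queue]

In the scenario reported here, this diff would be empty, so the readiness check would have nothing to verify and the update could proceed.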

Option B: Skip version check when only adding queues (no modifications)

def check_deployed_config_version(..., change_type: Optional[str] = None):
    if change_type == 'QUEUE_ADDITION':
        logger.info("Skipping config version check for queue addition")
        return

Option C: Add configuration option to skip readiness check
DevSettings:
  SkipConfigVersionCheckOnQueueAddition: true

Infrastructure Available
ParallelCluster already tags instances with queue names:
# cli/src/pcluster/constants.py
PCLUSTER_QUEUE_NAME_TAG = "parallelcluster:queue-name"

The tag exists but check_cluster_ready.py doesn't use it for filtering.
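Because the tag is already applied, the check could scope its instance query per queue. A minimal boto3 sketch of the lookup Option A would need (illustrative; it also assumes the standard parallelcluster:cluster-name tag is present on compute instances):

import boto3

def compute_instance_ids_for_queues(cluster_name, queue_names, region):
    # List running instances belonging to the given queues, using the
    # existing parallelcluster:queue-name tag for filtering.
    ec2 = boto3.client("ec2", region_name=region)
    instance_ids = []
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:parallelcluster:cluster-name", "Values": [cluster_name]},
            {"Name": "tag:parallelcluster:queue-name", "Values": list(queue_names)},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            instance_ids.extend(i["InstanceId"] for i in reservation["Instances"])
    return instance_ids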
Related Issues
- Similar issue reported in [Bug] Orchestration Deadlock: Check cluster readiness runs before clustermgtd restart, causing timeout on stale DynamoDB records #7166 (ghost records during update)
- Related to Cluster Update Failure When Adding a New Slurm Queue #4286 (cluster update failure when adding queue, different root cause)
Environment Details
- ParallelCluster Version: 3.14.0
- Region: us-east-2
- Scheduler: SLURM
- QueueUpdateStrategy: TERMINATE (changing to DRAIN doesn't help)
- Date: 2026-01-21
This appears to be a design issue where the readiness check assumes a monolithic cluster configuration rather than supporting incremental queue additions. A fix would enable truly elastic HPC clusters that can scale capacity without disrupting running workloads.