@azertyfun

Hi!

We have been using core-dump-handler for a little while but frequently encountered the following behavior on OpenShift:

[2025-08-05T07:07:10Z INFO  core_dump_agent] Setting s3 endpoint location to: <REDACTED>
[2025-08-05T07:07:10Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/<REDACTED>.zip
[2025-08-05T07:07:10Z INFO  core_dump_agent] zip size is 129879392
[2025-08-05T07:07:19Z ERROR core_dump_agent] Upload Failed hyper: channel closed

Restarting the pod would attempt the upload again, but it would not succeed either. We've had to resort to fetching the zip file using kubectl, which is quite painful operationally.

For me the issue is twofold:

  1. The upload fails
  2. Failed uploads are not retried until the pod is (for whatever reason) re-created. This forces us to proactively monitor the logs for failed upload attempts.

For the first problem, simply updating rust-s3 and its dependencies worked a treat:

Retrying reqwest: error sending request for url (http://<REDACTED>/core-dumps-storage-bucket-<SNIP>.zip?partNumber=3&uploadId=<REDACTED>)

The lack of retry isn't nearly as painful, but to offer a harder guarantee that we won't have missing uploads (e.g. because of an issue with the S3 bucket itself), I have added a retry mechanism with exponential backoff.
On that front the application behavior is a bit spaghetti, and I am not 100% sure that use_inotify == "true" is the right condition to check to enable the retry behavior. However, in my setup (k8s using inotify) this works a treat.
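
For readers unfamiliar with the pattern, here is a minimal sketch of retry with exponential backoff. It is not the code from this PR: `backoff_delay`, `retry_with_backoff`, their parameters, and the injected `sleep` callback are all illustrative assumptions, and the upload itself is stood in for by any closure returning a `Result`.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `max`.
/// Hypothetical helper, not part of core-dump-handler.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturating arithmetic keeps large attempt counts from overflowing.
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Runs `op` until it succeeds or `max_attempts` calls have been made.
/// `op` is always called at least once. `sleep` is injected so callers can
/// pass `std::thread::sleep` in production or a recorder in tests.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

With a one-second base and a 60-second cap, the waits between attempts grow as 1 s, 2 s, 4 s, and so on until they reach the cap. A real agent would likely also add jitter and log each failed attempt, which this sketch leaves out.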


This is actually a rebase of a broader change we have implemented internally, which also includes Prometheus metrics for core dumps so that we can write alerts (which is why I didn't raise an issue beforehand; the work had to be done anyway). If this PR gets merged, I will create a follow-up for that.

Noticed the following error would happen a lot on OpenShift:

> [2025-08-05T07:07:10Z INFO  core_dump_agent] Setting s3 endpoint location to: <REDACTED>
> [2025-08-05T07:07:10Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/<REDACTED>.zip
> [2025-08-05T07:07:10Z INFO  core_dump_agent] zip size is 129879392
> [2025-08-05T07:07:19Z ERROR core_dump_agent] Upload Failed hyper: channel closed

Restarting the pod or retrying the upload would not help.

After upgrading, the uploads finally worked again:

> Retrying reqwest: error sending request for url (http://<REDACTED>/core-dumps-storage-bucket-<SNIP>.zip?partNumber=3&uploadId=<REDACTED>)

Signed-off-by: Nathan Monfils <nathan.monfils@destiny.eu>
@azertyfun force-pushed the fix-file-upload branch 2 times, most recently from b6c0d77 to 350152f, on January 23, 2026 15:41
Nathan Monfils added 2 commits January 23, 2026 16:42
Prevents zip files from being lost if the upload failed for whichever
reason.

Signed-off-by: Nathan Monfils <nathan.monfils@destiny.eu>
Signed-off-by: Nathan Monfils <nathan.monfils@destiny.eu>