Add fabric manager configuration support #2045
base: main
Conversation
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
How coincidental that I resolved to implement something like this and 2 hours ago you submitted this draft! I want to ask what the plan is for the CDI side. The ideal scenario is that the fabric manager can be spawned as a Kata container, which means we need to inject the NVSwitch VFIO cdevs just like how we do for passthrough GPUs. When I tried to use GPU Operator a few months ago, this was simply not possible at the time, so I used libvirt instead. Does the GPU Operator CDI already expose the NVSwitches to k8s now? I apologize if my knowledge is a little out of date.
5f8e006 to e27e938
…-passthrough Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
When clusterPolicy.fabricManager.mode=shared-nvswitch and workload=vm-passthrough, the vfio-manager now preserves the NVIDIA driver for fabric management while enabling GPU device passthrough to VMs.

Changes:
- Modify TransformVFIOManager to detect shared-nvswitch mode.
- Replace driver uninstall init container with device unbind init container.
- Use vfio-manage unbind --all to detach devices from nvidia driver.
- Keep nvidia driver loaded for fabric management functionality.
- Add comprehensive unit tests for both normal and shared-nvswitch modes.

The new flow for shared-nvswitch mode for the vfio-manager:
1. InitContainer: vfio-manage unbind --all (unbind from nvidia driver)
2. Container: vfio-manage bind --all (bind to vfio-pci)

This enables simultaneous fabric management and VM passthrough capabilities.
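As a rough illustration of that flow, a vfio-manager pod spec fragment might look like the sketch below; the container names and overall shape are assumptions (only the `vfio-manage` commands and the `vfio-device-unbind` name appear in this PR):

```yaml
# Illustrative excerpt only -- not the operator's actual manifest.
initContainers:
- name: vfio-device-unbind
  command: ["vfio-manage", "unbind", "--all"]  # step 1: detach GPUs from the nvidia driver
containers:
- name: vfio-manager                           # placeholder name
  command: ["vfio-manage", "bind", "--all"]    # step 2: bind GPUs to vfio-pci
```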
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
c53ceaa to 70c5d78
Signed-off-by: Michail Resvanis <mresvani@redhat.com>
70c5d78 to 28c95d9
```sh
# For vm-passthrough with shared-nvswitch mode, nvidia-smi may fail due to unbound devices
# Fall back to checking if nvidia module is loaded when FABRIC_MANAGER_FABRIC_MODE=1
```
Question (for my understanding) -- GPUs may not be bound to the nvidia driver since there is a chance that the vfio-manager already ran and unbound the devices? Am I understanding this correctly?
```sh
    exit 1
# For vm-passthrough with shared-nvswitch mode, nvidia-smi may fail due to unbound devices
# Fall back to checking if nvidia module is loaded when FABRIC_MANAGER_FABRIC_MODE=1
if [ "${FABRIC_MANAGER_FABRIC_MODE:-}" = "1" ] && lsmod | grep -q "^nvidia "; then
```
Question -- isn't the right-hand side of this if statement redundant? Don't we already know the nvidia module is loaded prior to this (see L19-22)?
This file is hard for me to review since the diff is large and it appears to mostly be a change in indentation. Did anything meaningful change here besides the indentation? (if not, I'd prefer if we reverted this to minimize the diff)
```go
if config.FabricManager.IsSharedNVSwitchMode() {
	// In shared-nvswitch mode, replace driver uninstall with device unbind
	// Find the k8s-driver-manager init container and replace it with vfio-manage unbind
	for i := range obj.Spec.Template.Spec.InitContainers {
```
We have a helper `findContainerByName()` that we should use here:

```go
container := findContainerByName(obj.Spec.Template.Spec.InitContainers, "k8s-driver-manager")
container.Name = "vfio-device-unbind"
// ... all other transformations ...
```
```go
initContainer.Command = []string{"/bin/sh"}
initContainer.Args = []string{"-c", `
# For shared-nvswitch mode, wait for driver to be ready before unbinding
echo "Shared NVSwitch mode detected, waiting for driver readiness..."
until [ -f /run/nvidia/validations/driver-ready ]
do
  echo "waiting for the driver validations to be ready..."
  sleep 5
done

set -o allexport
cat /run/nvidia/validations/driver-ready
. /run/nvidia/validations/driver-ready

echo "Driver is ready, proceeding with device unbind"
exec vfio-manage unbind --all`}
```
Instead of adding this in code, what about encapsulating this logic in a custom entrypoint script that is stored in a ConfigMap? The fabric manager mode can be indicated via an envvar and the entrypoint script can check the envvar to determine what actions to take (and what command to run -- k8s-driver-manager uninstall_driver vs vfio-manage unbind --all).
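A sketch of that suggestion, for concreteness -- the ConfigMap name, script key, and mount wiring are made up for illustration; only `FABRIC_MANAGER_FABRIC_MODE`, the readiness file, and the two commands appear in this PR:

```yaml
# Hypothetical ConfigMap carrying a shared entrypoint for the init container.
apiVersion: v1
kind: ConfigMap
metadata:
  name: vfio-manager-entrypoint
data:
  entrypoint.sh: |
    #!/bin/sh
    # Choose the init action based on the fabric manager mode env var.
    if [ "${FABRIC_MANAGER_FABRIC_MODE:-0}" = "1" ]; then
      # shared-nvswitch: keep the nvidia driver loaded, only unbind the devices
      until [ -f /run/nvidia/validations/driver-ready ]; do
        echo "waiting for the driver validations to be ready..."
        sleep 5
      done
      exec vfio-manage unbind --all
    fi
    # full-passthrough: uninstall the driver as before
    exec k8s-driver-manager uninstall_driver
```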
Question -- do we need to add an extra host path volume to the …
Description
This PR enables Fabric Manager (FM) configuration for `vm-passthrough` workloads using the Shared NVSwitch virtualization model.

It enables users to configure the Fabric Manager mode (i.e. `FABRIC_MODE=[0,1,2]`: `0` - full-passthrough, `1` - shared NVSwitch, `2` - vGPU) through the `ClusterPolicy` CRD, providing better support for NVIDIA multi-GPU systems in virtualized environments.

In the FM shared NVSwitch virtualization model the NVIDIA driver on the host is used for the NVSwitch devices, while the GPU devices are bound to the `vfio-pci` driver. The goal is for the GPU devices to be passed through to KubeVirt VMs, while the respective fabric is managed on the host.

Depends on / relates to: NVIDIA/gpu-driver-container#538
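For illustration, enabling this could look roughly like the excerpt below; the `fabricManager` field path follows the `clusterPolicy.fabricManager.mode` naming used in the commits and may differ in the final API, while the `sandboxWorkloads` fields are the existing ClusterPolicy API:

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  sandboxWorkloads:
    enabled: true
    defaultWorkload: vm-passthrough
  fabricManager:
    mode: shared-nvswitch  # FABRIC_MODE=1; full-passthrough (FABRIC_MODE=0) is the default
```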
Changes
ClusterPolicy API
- Add `FabricManagerSpec` to the `ClusterPolicy` CRD with support for two modes:
  - `full-passthrough` (`FABRIC_MODE=0`) - default mode.
  - `shared-nvswitch` (`FABRIC_MODE=1`) - shared NVSwitch virtualization mode.

Controller logic
- Detect `vm-passthrough` with FM shared NVSwitch mode and pass an env var to the driver container to indicate the selected fabric mode (the driver container is the one configuring and starting the FM), as sketched below.
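Assuming the variable checked in the validator snippet above, the env var passed to the driver container would look something like:

```yaml
# Illustrative excerpt of the driver container spec.
env:
- name: FABRIC_MANAGER_FABRIC_MODE
  value: "1"  # 0 = full-passthrough, 1 = shared NVSwitch, 2 = vGPU
```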
Driver state management

- Handle the `vm-passthrough` and FM shared NVSwitch mode case.
- Adjust the driver state for `vm-passthrough` with shared NVSwitch mode.

Sandbox validator

- When `nvidia-smi` fails because the GPU devices are unbound in shared NVSwitch mode, fall back to checking that the nvidia module is loaded.
VFIO manager
- In shared NVSwitch mode, wait for the driver validation to complete before unbinding; the validation runs `nvidia-smi`, which requires the driver to be loaded and bound to the GPU devices. Once that's done we can bind the GPU devices to `vfio-pci`.
- Replace the driver uninstall init container with `vfio-manage unbind --all` when FM shared NVSwitch mode is enabled.
Checklist

- [ ] `make lint`
- [ ] `make validate-generated-assets`
- [ ] `make validate-modules`

Testing
TBD