Skip to content

Conversation

@klihub
Copy link
Contributor

@klihub klihub commented Oct 18, 2025

When injecting mounts to a container with user namespaces in use, inject the mounts with UID and GID mappings taken from the OCI Spec.

Fixes #287.

@elezar @haircommander Although this simplistic fix works for the reported case, it might be a bit too simplistic (in the long run).

In particular, I wonder if we should eventually provide a way for the one requesting CDI injection (typically a DRA driver) to also provide UID-/GID-mappings for injected mounts, and only let CDI itself pick and inject UID-/GID-mappings on its own, if the necessary preconditions apply but mappings are not provided.

IOW, should we provide a way for a DRA driver itself to pick and inject the mappings ? If we are of the opinion that we should, then I think we have a bit of a problem with APIs and how things are currently wired up with CDI/CRI. The straightforward way to allow a driver to pick mappings would be to simply extend the CDI notion of a mount with UID- and GID-mappings. However, that would imply that only dynamically generated CDI Specs would allow injecting non-static mappings, since there is no way to pass on any additionald/dynamic parameters along a CDI device injection request over CRI (or in the CDI/runtime API).

@klihub klihub force-pushed the fixes/uid-gid-mappings branch from 87acc6c to 19dae60 Compare October 18, 2025 18:44
@klihub klihub marked this pull request as ready for review October 19, 2025 16:30
@klihub klihub requested a review from elezar October 20, 2025 07:21
@klihub klihub force-pushed the fixes/uid-gid-mappings branch from 19dae60 to 0c5ec3a Compare October 20, 2025 08:46
pkg/cdi/oci.go Outdated
type extraOCIMountOption func(*spec.Mount)

// withMountIDMappings adds UID and GID mappings for the given mount.
func withMountIDMappings(uid, gid []spec.LinuxIDMapping) extraOCIMountOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to have uid and gid settable by separate options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with anything. Separate options, combined options, even with simply doing it like this

func (m *Mount) toOCI(uids, gids []spec.LinuxIDMapping) spec.Mount {
	om := spec.Mount{
		Source:      m.HostPath,
		Destination: m.ContainerPath,
		Options:     m.Options,
		Type:        m.Type,
		UIDMappings: uids,
		GIDMappings: gids,
	}
	return om

Destination: "/dest/path/a",
UIDMappings: []oci.LinuxIDMapping{
{
ContainerID: 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is ContainerID non-zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean in practice ? In principle, one could set up arbitrary mappings with multiple ranges which is possible, but in practice I think you'll only see a single range and starting (in the container) with 0, because that's what this is primarily used for. But from the OCI Spec point of view, nothing dictates any range or sub-range to start (in the container) at 0.

@elezar
Copy link
Contributor

elezar commented Oct 20, 2025

Is something similar required for devices since the uid and gid in the spec are the IDs in the container (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#devices)?

@klihub
Copy link
Contributor Author

klihub commented Oct 20, 2025

Is something similar required for devices since the uid and gid in the spec are the IDs in the container (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#devices)?

At least nothing identical we can do for devices, since they don't have UID/GID mappings, only mounts have them. Also AFAICT, device access control is a bit simpler (when compared with access control for a full bind-mounted subdirectory), just device node permissions and cgroup access rules.

@klihub klihub force-pushed the fixes/uid-gid-mappings branch 2 times, most recently from 069d16f to 7832481 Compare October 20, 2025 09:34
When injecting mounts to a container with user namespaces in use,
inject the mounts with UID and GID mappings taken from the OCI
Spec.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
@klihub klihub force-pushed the fixes/uid-gid-mappings branch from 7832481 to 7116a1d Compare October 20, 2025 13:40
@kad
Copy link
Contributor

kad commented Oct 21, 2025

Device permissions are handled inside the containerd/cri-o based on runAsUser in non-root containers.

@rata
Copy link

rata commented Oct 27, 2025

Hi! userns KEP author here. I'm not very familiar with the CDI injection process to provide more input. do you have any links I should read?

My current opinion without much education on CDI injection is that we should allow drivers to set some mappings, but we should complete with sane defaults (as this PR does) if userns are used and those fields are not set. Wouldn't that solve the "However, that would imply that only dynamically generated CDI Specs would allow injecting non-static mappings" issue you pointed in the description?

The k8s end user currently can't select mappings for mounts, as the kubelet will set them to what makes sense in that context. However, if CDI might inject devices or things that are not just bind-mounts (like a device node), setting the UID/GID for it will be critical to work with user-namespaces (uidmappings only work for bind-mounts). I wonder if the CDI drivers have access to the mapping used for the userns, as you basically need to set the UID/GID of the inode to the hostUID of the userns mapping.

Sorry I took some time to reply, I was on PTO and just coming back. I'm happy to help in any way that I can with this :)

@klihub
Copy link
Contributor Author

klihub commented Nov 4, 2025

b> Hi! userns KEP author here. I'm not very familiar with the CDI injection process to provide more input. do you have any links I should read?

@rata No, unfortunatly I don't have, but I think that it's enough to take a look at the CDI Spec (Mount) definition and the minimalist CRI CDI device injection request/message which is then also reflected in the library entry point signature for device injection to show what I meant.

My current opinion without much education on CDI injection is that we should allow drivers to set some mappings, but we should complete with sane defaults (as this PR does) if userns are used and those fields are not set. Wouldn't that solve the "However, that would imply that only dynamically generated CDI Specs would allow injecting non-static mappings" issue you pointed in the description?

Yes, it would, but given the above linked setup it requires more changes than just adding new data to the CDI Spec. And that's why I brought the whole thing up, because my gut feeling is/was the same: the requester of a CDI device injection, typically a DRA driver, should eventually be able to provide mappings at will. However, ATM if we add UID-/GID-mapping attributes to CDI Spec Mounts without other changes, the only way for a driver to provide mappings would be to dynamically generate a per-claim CDI Spec with the desired mappings and then reference that by name the the CDI-specific CRI injection request/message bits.

This would be far from ideal. But anything else will require further changes. To CRI, to be able to pass extra parameters for the injection, and to CDI, to allow such parameters to be passed in for injection to the library and used. And this last bit (constructing injection data from a CDI Spec and extra information provided by the injection request itself) is something new that we don't have in place at the moment. So it doesn't come as easily as the rest of just adding some new data to some entity to be injected.

The k8s end user currently can't select mappings for mounts, as the kubelet will set them to what makes sense in that context. However, if CDI might inject devices or things that are not just bind-mounts (like a device node), setting the UID/GID for it will be critical to work with user-namespaces (uidmappings only work for bind-mounts).

I wonder if the CDI drivers have access to the mapping used for the userns, as you basically need to set the UID/GID of the inode to the hostUID of the userns mapping.

If you mean DRA drivers, then I think a DRA driver, purely in the 'K8s/kubelet DRA driver' context, does not have access to that information.

Currently if an injected device node does not come with UID/GID's or permission bits specified in the CDI Spec, IIRC we take a look at the corresponding device node on the host side and if it's there we take those missing bits of information from there. Do you think we should change this logic and, for instance, if there are UID/GID-mappings present in the container try setting the device node UID/GID based on that mapping (for root/UID/GID 0) ?

Sorry I took some time to reply, I was on PTO and just coming back. I'm happy to help in any way that I can with this :)

No problemo. Happy that you can chime in and help.

@rata
Copy link

rata commented Nov 4, 2025

Currently if an injected device node does not come with UID/GID's or permission bits specified in the CDI Spec, IIRC we take a look at the corresponding device node on the host side and if it's there we take those missing bits of information from there. Do you think we should change this logic and, for instance, if there are UID/GID-mappings present in the container try setting the device node UID/GID based on that mapping (for root/UID/GID 0) ?

Yes, I was thinking just that too as an alternative :)

What kind of mount are you injecting here? Is it only bind-mounts or do you also inject inodes (like a device, char device, block device or some else)?

For bind-mounts we should inject the uid/gid mappings here (as otherwise we need to chown, it's a complete mess, we can't share, etc.).

For devices (char device, block device, etc.), I think we should used the mapped UID/GID. In other words, if the inode on the host has UID 0, then we should use the hostUID that is mapped to 0 inside the container.

@klihub what do you think? Do you think it's doable?

@elezar
Copy link
Contributor

elezar commented Nov 4, 2025

For devices (char device, block device, etc.), I think we should used the mapped UID/GID. In other words, if the inode on the host has UID 0, then we should use the hostUID that is mapped to 0 inside the container.

@rata could you describe what you are proposing using the OCI Runtime Spec fields?

@rata
Copy link

rata commented Nov 4, 2025

@elezar sure.

The devices object in the runtime-spec has a uid/gid: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#devices. I'm proposing to set those when devices are injected.

I'm proposing to set it to the user mapped inside the userns. The mapping in the OCI runtime spec is here: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#user-namespace-mappings. Using the example from the spec:

"uidMappings": [
    {
        "containerID": 0,
        "hostID": 1000,
        "size": 32000
    }
],
"gidMappings": [
    {
        "containerID": 0,
        "hostID": 1000,
        "size": 32000
    }
]

If the device we want (

type Mount struct {
HostPath string `json:"hostPath" yaml:"hostPath"`
ContainerPath string `json:"containerPath" yaml:"containerPath"`
Options []string `json:"options,omitempty" yaml:"options,omitempty"`
Type string `json:"type,omitempty" yaml:"type,omitempty"` // Added in v0.4.0
}
) the hostPath is owned by UID 0 and GID 0, then using this examples we want to assign it to "0 inside the container". Then, we check what is the mapped ID for 0 inside the container.

The UID and GID 0 inside the container are mapped to the ID 1000 outside the container. Then, the device will have the uid and gid set to 1000 in this example.

Is this example clear? Don't hesitate to ask otherwise :)

@klihub
Copy link
Contributor Author

klihub commented Nov 4, 2025

Currently if an injected device node does not come with UID/GID's or permission bits specified in the CDI Spec, IIRC we take a look at the corresponding device node on the host side and if it's there we take those missing bits of information from there. Do you think we should change this logic and, for instance, if there are UID/GID-mappings present in the container try setting the device node UID/GID based on that mapping (for root/UID/GID 0) ?

Yes, I was thinking just that too as an alternative :)

What kind of mount are you injecting here? Is it only bind-mounts or do you also inject inodes (like a device, char device, block device or some else)?

For bind-mounts we should inject the uid/gid mappings here (as otherwise we need to chown, it's a complete mess, we can't share, etc.).

For devices (char device, block device, etc.), I think we should used the mapped UID/GID. In other words, if the inode on the host has UID 0, then we should use the hostUID that is mapped to 0 inside the container.

@klihub what do you think? Do you think it's doable?

@rata Yes, I was thinking much along the same lines... to add to this PR for devices. Slightly differently though as I was thinking that instead of the host side UID/GID 0 check, what I'd do is that if the CDI Spec does not explicitly specify a UID/GID for the device node (which now causes them to be looked for in the corresponding host device node), we'd check if there are UID/GID-mappings set up for root/UID 0 in the container. And if there are, we'd use the mapped UID/GID for UID/GID 0 as the device node UID/GID. But your version might be better...

@rata
Copy link

rata commented Nov 4, 2025

Cool. I'd add to that, to handle all the cases, that if the UID/GID is not mapped, then we set uid/gid as 0 inside the container.

Just curious, completely out of topic to understand the big picture, how are people using GPUs using this? Is there any example workflow I can run locally or something?

@everzakov
Copy link

everzakov commented Nov 4, 2025

Sorry for interrupting. However, only setting necessary uid/gid for the device won't work (if container has user ns) because now runc don't do anything with devices if container has a user namespace (only set root uid in device config state). It uses mount bind command instead of mknod that's why the device will have the same uid/gid as on the host. So, if the device has a 0660 permission and we will try to use not the default uid/gid mapping then we will have EPERM error.

I have made an example to show the problem. CDI commit: everzakov@429161b Containerd everzakov/containerd@52d45ca My simple dra plugin https://github.com/everzakov/dra-plugin-health-example/tree/userns-with-devices

So, in host i have created the /dev/another_urandom device using mknod with 1,9 major/minor, root owner and 0660 permissions.

I have created the new container but i have got EPERM error. Also, you can see that all devices have nobody, nogroup:

vboxuser@vboxuser:~/dra-plugin-health-example$ cat /run/cdi/dev0-dra-example-com.json 
{"cdiVersion":"1.0.0","kind":"example.com/device","devices":[{"name":"dev0-dra-example-com","containerEdits":{"env":["Check=True"],"deviceNodes":[{"path":"/dev/another_urandom","hostPath":"/dev/another_urandom","type":"c","major":1,"minor":9,"permissions":"rwm","uid":0,"gid":0}]}}],"containerEdits":{}}vboxuser@vboxuser:~/dra-plugin-health-example$ kubectl get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE   NOMINATED NODE   READINESS GATES
dra-example-kubeletplugin-s7f45   1/1     Running   0          30s   10.244.0.98   node   <none>           <none>
example-pod                       3/3     Running   0          30s   10.244.0.99   node   <none>           <none>
vboxuser@vboxuser:~/dra-plugin-health-example$ kubectl  exec -ti example-pod -- /bin/bash
Defaulted container "container1" out of: container1, container2, container3
root@example-pod:/# ls -la /dev/
total 4
drwxr-xr-x 5 root   root     380 Nov  4 18:06 .
drwxr-xr-x 1 root   root    4096 Nov  4 18:06 ..
crw-rw---- 1 nobody nogroup 1, 9 Nov  4 15:52 another_urandom
lrwxrwxrwx 1 root   root      11 Nov  4 18:06 core -> /proc/kcore
lrwxrwxrwx 1 root   root      13 Nov  4 18:06 fd -> /proc/self/fd
crw-rw-rw- 1 nobody nogroup 1, 7 Nov  4 15:19 full
drwxrwxrwt 2 root   nogroup   40 Nov  4 18:06 mqueue
crw-rw-rw- 1 nobody nogroup 1, 3 Nov  4 15:19 null
lrwxrwxrwx 1 root   root       8 Nov  4 18:06 ptmx -> pts/ptmx
drwxr-xr-x 2 root   root       0 Nov  4 18:06 pts
crw-rw-rw- 1 nobody nogroup 1, 8 Nov  4 15:19 random
drwxrwxrwt 2 nobody nogroup   40 Nov  4 18:06 shm
lrwxrwxrwx 1 root   root      15 Nov  4 18:06 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root   root      15 Nov  4 18:06 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root   root      15 Nov  4 18:06 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root   root       0 Nov  4 18:06 termination-log
crw-rw-rw- 1 nobody nogroup 5, 0 Nov  4 18:10 tty
crw-rw-rw- 1 nobody nogroup 1, 9 Nov  4 15:19 urandom
crw-rw-rw- 1 nobody nogroup 1, 5 Nov  4 15:19 zero
root@example-pod:/# ls -la /dev/another_urandom 
crw-rw---- 1 nobody nogroup 1, 9 Nov  4 15:52 /dev/another_urandom
root@example-pod:/# head -n 1 /dev/another_urandom 
head: cannot open '/dev/another_urandom' for reading: Permission denied

My CDI commit have set the right uid/gid for a device:

vboxuser@vboxuser:~/container-device-interface$ sudo crictl inspect c4b49517e67c4 | jq .info.config.CDI_devices
[
  {
    "name": "example.com/device=dev0-dra-example-com"
  }
]

vboxuser@vboxuser:~/container-device-interface$ sudo ctr -n k8s.io c info c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b  | jq .Spec.linux.gidMappings
[
  {
    "containerID": 0,
    "hostID": 2460155904,
    "size": 65536
  }
]

vboxuser@vboxuser:~/container-device-interface$ sudo ctr -n k8s.io c info c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b  | jq .Spec.linux.devices
[
  {
    "path": "/dev/another_urandom",
    "type": "c",
    "major": 1,
    "minor": 9,
    "uid": 2460155904,
    "gid": 2460155904
  }
]

vboxuser@vboxuser:~/container-device-interface$ sudo python3 -m json.tool /run/containerd/runc/k8s.io/c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b/state.json  | jq .config.devices
[
  {
    "type": 99,
    "major": 1,
    "minor": 3,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/null",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 1,
    "minor": 8,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/random",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 1,
    "minor": 7,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/full",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 5,
    "minor": 0,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/tty",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 1,
    "minor": 5,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/zero",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 1,
    "minor": 9,
    "permissions": "rwm",
    "allow": true,
    "path": "/dev/urandom",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  },
  {
    "type": 99,
    "major": 1,
    "minor": 9,
    "permissions": "",
    "allow": false,
    "path": "/dev/another_urandom",
    "file_mode": 438,
    "uid": 2460155904,
    "gid": 2460155904
  }
]

The initial device can be chowned to the right user. However, we wan't be able to use this device with several containers.

vboxuser@vboxuser:~/container-device-interface$ sudo python3 -m json.tool /run/containerd/runc/k8s.io/c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b/state.json  | grep pid
    "init_process_pid": 74042

vboxuser@vboxuser:~/container-device-interface$ sudo  chown 2460155904.2460155904 /proc/74042/root/dev/another_urandom
vboxuser@vboxuser:~/container-device-interface$ ls -la /dev/another_urandom 
crw-rw---- 1 2460155904 2460155904 1, 9 Nov  4 15:52 /dev/another_urandom

root@example-pod:/# head -n 1 /dev/another_urandom 
����x\��{����9���jDJl�������,Ȭ��O9 �3=�J��Hjo]��7��_�[��4oN���ؐ0�zWE�;��J�.y,�}��.��!n�|����z�<�slJ�&1�p
root@example-pod:/# ls -la /dev/another_urandom 
crw-rw---- 1 root root 1, 9 Nov  4 15:52 /dev/another_urandom

PS. The default devices have 0666 permissions that's why we can use them in the container.

@rata
Copy link

rata commented Nov 5, 2025

@everzakov ohh, thanks for catching this! Okay, so runc is already setting the uid/gid to root inside the container and that doesn't work :(

So, for bind-mounts of filesystems (a file or a directory), what this PR is doing is fine.

For other devices, we need to see what we can do. Because the option that should work, isn't working and as @everzakov was saying maybe it can't work without changing the uid/gid of the device on the host, that might be problematic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CDI-injected bind-mount inaccessible for pods created with hostUsers: false

5 participants