-
Notifications
You must be signed in to change notification settings - Fork 47
cdi: inject mount UID/GID mappings if user NS is in use. #288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
87acc6c to
19dae60
Compare
19dae60 to
0c5ec3a
Compare
pkg/cdi/oci.go
Outdated
| type extraOCIMountOption func(*spec.Mount) | ||
|
|
||
| // withMountIDMappings adds UID and GID mappings for the given mount. | ||
| func withMountIDMappings(uid, gid []spec.LinuxIDMapping) extraOCIMountOption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have uid and gid settable by separate options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with anything. Separate options, combined options, even with simply doing it like this
func (m *Mount) toOCI(uids, gids []spec.LinuxIDMapping) spec.Mount {
om := spec.Mount{
Source: m.HostPath,
Destination: m.ContainerPath,
Options: m.Options,
Type: m.Type,
UIDMappings: uids,
GIDMappings: gids,
}
return om| Destination: "/dest/path/a", | ||
| UIDMappings: []oci.LinuxIDMapping{ | ||
| { | ||
| ContainerID: 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is ContainerID non-zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean in practice ? In principle, one could set up arbitrary mappings with multiple ranges which is possible, but in practice I think you'll only see a single range and starting (in the container) with 0, because that's what this is primarily used for. But from the OCI Spec point of view, nothing dictates any range or sub-range to start (in the container) at 0.
|
Is something similar required for devices since the uid and gid in the spec are the IDs in the container (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#devices)? |
At least nothing identical we can do for devices, since they don't have UID/GID mappings, only mounts have them. Also AFAICT, device access control is a bit simpler (when compared with access control for a full bind-mounted subdirectory), just device node permissions and cgroup access rules. |
069d16f to
7832481
Compare
When injecting mounts to a container with user namespaces in use, inject the mounts with UID and GID mappings taken from the OCI Spec. Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
7832481 to
7116a1d
Compare
|
Device permissions are handled inside the containerd/cri-o based on runAsUser in non-root containers. |
|
Hi! userns KEP author here. I'm not very familiar with the CDI injection process to provide more input. do you have any links I should read? My current opinion without much education on CDI injection is that we should allow drivers to set some mappings, but we should complete with sane defaults (as this PR does) if userns are used and those fields are not set. Wouldn't that solve the "However, that would imply that only dynamically generated CDI Specs would allow injecting non-static mappings" issue you pointed in the description? The k8s end user currently can't select mappings for mounts, as the kubelet will set them to what makes sense in that context. However, if CDI might inject devices or things that are not just bind-mounts (like a device node), setting the UID/GID for it will be critical to work with user-namespaces (uidmappings only work for bind-mounts). I wonder if the CDI drivers have access to the mapping used for the userns, as you basically need to set the UID/GID of the inode to the hostUID of the userns mapping. Sorry I took some time to reply, I was on PTO and just coming back. I'm happy to help in any way that I can with this :) |
|
b> Hi! userns KEP author here. I'm not very familiar with the CDI injection process to provide more input. do you have any links I should read? @rata No, unfortunatly I don't have, but I think that it's enough to take a look at the CDI Spec (Mount) definition and the minimalist CRI CDI device injection request/message which is then also reflected in the library entry point signature for device injection to show what I meant.
Yes, it would, but given the above linked setup it requires more changes than just adding new data to the CDI Spec. And that's why I brought the whole thing up, because my gut feeling is/was the same: the requester of a CDI device injection, typically a DRA driver, should eventually be able to provide mappings at will. However, ATM if we add UID-/GID-mapping attributes to CDI Spec Mounts without other changes, the only way for a driver to provide mappings would be to dynamically generate a per-claim CDI Spec with the desired mappings and then reference that by name the the CDI-specific CRI injection request/message bits. This would be far from ideal. But anything else will require further changes. To CRI, to be able to pass extra parameters for the injection, and to CDI, to allow such parameters to be passed in for injection to the library and used. And this last bit (constructing injection data from a CDI Spec and extra information provided by the injection request itself) is something new that we don't have in place at the moment. So it doesn't come as easily as the rest of just adding some new data to some entity to be injected.
If you mean DRA drivers, then I think a DRA driver, purely in the 'K8s/kubelet DRA driver' context, does not have access to that information. Currently if an injected device node does not come with UID/GID's or permission bits specified in the CDI Spec, IIRC we take a look at the corresponding device node on the host side and if it's there we take those missing bits of information from there. Do you think we should change this logic and, for instance, if there are UID/GID-mappings present in the container try setting the device node UID/GID based on that mapping (for root/UID/GID 0) ?
No problemo. Happy that you can chime in and help. |
Yes, I was thinking just that too as an alternative :) What kind of mount are you injecting here? Is it only bind-mounts or do you also inject inodes (like a device, char device, block device or some else)? For bind-mounts we should inject the uid/gid mappings here (as otherwise we need to chown, it's a complete mess, we can't share, etc.). For devices (char device, block device, etc.), I think we should used the mapped UID/GID. In other words, if the inode on the host has UID 0, then we should use the hostUID that is mapped to 0 inside the container. @klihub what do you think? Do you think it's doable? |
@rata could you describe what you are proposing using the OCI Runtime Spec fields? |
|
@elezar sure. The devices object in the runtime-spec has a uid/gid: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#devices. I'm proposing to set those when devices are injected. I'm proposing to set it to the user mapped inside the userns. The mapping in the OCI runtime spec is here: https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#user-namespace-mappings. Using the example from the spec: If the device we want ( container-device-interface/specs-go/config.go Lines 49 to 54 in ffd52dc
The UID and GID 0 inside the container are mapped to the ID 1000 outside the container. Then, the device will have the uid and gid set to 1000 in this example. Is this example clear? Don't hesitate to ask otherwise :) |
@rata Yes, I was thinking much along the same lines... to add to this PR for devices. Slightly differently though as I was thinking that instead of the host side UID/GID 0 check, what I'd do is that if the CDI Spec does not explicitly specify a UID/GID for the device node (which now causes them to be looked for in the corresponding host device node), we'd check if there are UID/GID-mappings set up for root/UID 0 in the container. And if there are, we'd use the mapped UID/GID for UID/GID 0 as the device node UID/GID. But your version might be better... |
|
Cool. I'd add to that, to handle all the cases, that if the UID/GID is not mapped, then we set uid/gid as 0 inside the container. Just curious, completely out of topic to understand the big picture, how are people using GPUs using this? Is there any example workflow I can run locally or something? |
|
Sorry for interrupting. However, only setting necessary uid/gid for the device won't work (if container has user ns) because now runc don't do anything with devices if container has a user namespace (only set root uid in device config state). It uses mount bind command instead of mknod that's why the device will have the same uid/gid as on the host. So, if the device has a 0660 permission and we will try to use not the default uid/gid mapping then we will have EPERM error. I have made an example to show the problem. CDI commit: everzakov@429161b Containerd everzakov/containerd@52d45ca My simple dra plugin https://github.com/everzakov/dra-plugin-health-example/tree/userns-with-devices So, in host i have created the /dev/another_urandom device using mknod with 1,9 major/minor, root owner and 0660 permissions. I have created the new container but i have got EPERM error. Also, you can see that all devices have nobody, nogroup: vboxuser@vboxuser:~/dra-plugin-health-example$ cat /run/cdi/dev0-dra-example-com.json
{"cdiVersion":"1.0.0","kind":"example.com/device","devices":[{"name":"dev0-dra-example-com","containerEdits":{"env":["Check=True"],"deviceNodes":[{"path":"/dev/another_urandom","hostPath":"/dev/another_urandom","type":"c","major":1,"minor":9,"permissions":"rwm","uid":0,"gid":0}]}}],"containerEdits":{}}vboxuser@vboxuser:~/dra-plugin-health-example$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dra-example-kubeletplugin-s7f45 1/1 Running 0 30s 10.244.0.98 node <none> <none>
example-pod 3/3 Running 0 30s 10.244.0.99 node <none> <none>
vboxuser@vboxuser:~/dra-plugin-health-example$ kubectl exec -ti example-pod -- /bin/bash
Defaulted container "container1" out of: container1, container2, container3
root@example-pod:/# ls -la /dev/
total 4
drwxr-xr-x 5 root root 380 Nov 4 18:06 .
drwxr-xr-x 1 root root 4096 Nov 4 18:06 ..
crw-rw---- 1 nobody nogroup 1, 9 Nov 4 15:52 another_urandom
lrwxrwxrwx 1 root root 11 Nov 4 18:06 core -> /proc/kcore
lrwxrwxrwx 1 root root 13 Nov 4 18:06 fd -> /proc/self/fd
crw-rw-rw- 1 nobody nogroup 1, 7 Nov 4 15:19 full
drwxrwxrwt 2 root nogroup 40 Nov 4 18:06 mqueue
crw-rw-rw- 1 nobody nogroup 1, 3 Nov 4 15:19 null
lrwxrwxrwx 1 root root 8 Nov 4 18:06 ptmx -> pts/ptmx
drwxr-xr-x 2 root root 0 Nov 4 18:06 pts
crw-rw-rw- 1 nobody nogroup 1, 8 Nov 4 15:19 random
drwxrwxrwt 2 nobody nogroup 40 Nov 4 18:06 shm
lrwxrwxrwx 1 root root 15 Nov 4 18:06 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Nov 4 18:06 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Nov 4 18:06 stdout -> /proc/self/fd/1
-rw-rw-rw- 1 root root 0 Nov 4 18:06 termination-log
crw-rw-rw- 1 nobody nogroup 5, 0 Nov 4 18:10 tty
crw-rw-rw- 1 nobody nogroup 1, 9 Nov 4 15:19 urandom
crw-rw-rw- 1 nobody nogroup 1, 5 Nov 4 15:19 zero
root@example-pod:/# ls -la /dev/another_urandom
crw-rw---- 1 nobody nogroup 1, 9 Nov 4 15:52 /dev/another_urandom
root@example-pod:/# head -n 1 /dev/another_urandom
head: cannot open '/dev/another_urandom' for reading: Permission deniedMy CDI commit have set the right uid/gid for a device: vboxuser@vboxuser:~/container-device-interface$ sudo crictl inspect c4b49517e67c4 | jq .info.config.CDI_devices
[
{
"name": "example.com/device=dev0-dra-example-com"
}
]
vboxuser@vboxuser:~/container-device-interface$ sudo ctr -n k8s.io c info c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b | jq .Spec.linux.gidMappings
[
{
"containerID": 0,
"hostID": 2460155904,
"size": 65536
}
]
vboxuser@vboxuser:~/container-device-interface$ sudo ctr -n k8s.io c info c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b | jq .Spec.linux.devices
[
{
"path": "/dev/another_urandom",
"type": "c",
"major": 1,
"minor": 9,
"uid": 2460155904,
"gid": 2460155904
}
]
vboxuser@vboxuser:~/container-device-interface$ sudo python3 -m json.tool /run/containerd/runc/k8s.io/c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b/state.json | jq .config.devices
[
{
"type": 99,
"major": 1,
"minor": 3,
"permissions": "rwm",
"allow": true,
"path": "/dev/null",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 1,
"minor": 8,
"permissions": "rwm",
"allow": true,
"path": "/dev/random",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 1,
"minor": 7,
"permissions": "rwm",
"allow": true,
"path": "/dev/full",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 5,
"minor": 0,
"permissions": "rwm",
"allow": true,
"path": "/dev/tty",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 1,
"minor": 5,
"permissions": "rwm",
"allow": true,
"path": "/dev/zero",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 1,
"minor": 9,
"permissions": "rwm",
"allow": true,
"path": "/dev/urandom",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
},
{
"type": 99,
"major": 1,
"minor": 9,
"permissions": "",
"allow": false,
"path": "/dev/another_urandom",
"file_mode": 438,
"uid": 2460155904,
"gid": 2460155904
}
]The initial device can be chowned to the right user. However, we wan't be able to use this device with several containers. vboxuser@vboxuser:~/container-device-interface$ sudo python3 -m json.tool /run/containerd/runc/k8s.io/c4b49517e67c4342355f8825fcbe8c65c8e7c4abc165fd4ca7f11699c2f8aa3b/state.json | grep pid
"init_process_pid": 74042
vboxuser@vboxuser:~/container-device-interface$ sudo chown 2460155904.2460155904 /proc/74042/root/dev/another_urandom
vboxuser@vboxuser:~/container-device-interface$ ls -la /dev/another_urandom
crw-rw---- 1 2460155904 2460155904 1, 9 Nov 4 15:52 /dev/another_urandom
root@example-pod:/# head -n 1 /dev/another_urandom
����x\��{����9���jDJl�������,Ȭ��O9 �3=�J��Hjo]��7��_�[��4oN���ؐ0�zWE�;��J�.y,�}��.��!n�|����z�<�slJ�&1�p
root@example-pod:/# ls -la /dev/another_urandom
crw-rw---- 1 root root 1, 9 Nov 4 15:52 /dev/another_urandomPS. The default devices have 0666 permissions that's why we can use them in the container. |
|
@everzakov ohh, thanks for catching this! Okay, so runc is already setting the uid/gid to root inside the container and that doesn't work :( So, for bind-mounts of filesystems (a file or a directory), what this PR is doing is fine. For other devices, we need to see what we can do. Because the option that should work, isn't working and as @everzakov was saying maybe it can't work without changing the uid/gid of the device on the host, that might be problematic. |
When injecting mounts to a container with user namespaces in use, inject the mounts with UID and GID mappings taken from the OCI Spec.
Fixes #287.
@elezar @haircommander Although this simplistic fix works for the reported case, it might be a bit too simplistic (in the long run).
In particular, I wonder if we should eventually provide a way for the one requesting CDI injection (typically a DRA driver) to also provide UID-/GID-mappings for injected mounts, and only let CDI itself pick and inject UID-/GID-mappings on its own, if the necessary preconditions apply but mappings are not provided.
IOW, should we provide a way for a DRA driver itself to pick and inject the mappings ? If we are of the opinion that we should, then I think we have a bit of a problem with APIs and how things are currently wired up with CDI/CRI. The straightforward way to allow a driver to pick mappings would be to simply extend the CDI notion of a mount with UID- and GID-mappings. However, that would imply that only dynamically generated CDI Specs would allow injecting non-static mappings, since there is no way to pass on any additionald/dynamic parameters along a CDI device injection request over CRI (or in the CDI/runtime API).