Enabling GPU access with Compose
Compose services can define GPU device reservations if the Docker host contains such devices and the Docker daemon is set accordingly. For this, make sure you install the prerequisites if you have not already done so.
The examples in the following sections focus specifically on providing service containers access to GPU devices with Docker Compose. You can use either the docker-compose or the docker compose command.
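As a quick sanity check before moving on to Compose, you can run a throwaway container directly with the docker CLI (a sketch, assuming the NVIDIA drivers and the NVIDIA Container Toolkit are already installed on the host). If this prints the nvidia-smi table, the host is ready for the examples below:

$ # One-off test container; --gpus all exposes every host GPU to it
$ docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi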
Use of service runtime property from Compose v2.3 format (legacy)
Docker Compose v1.27.0+ switched to using the Compose Specification schema, which is a combination of all properties from the 2.x and 3.x versions. This re-enabled the use of the service-level runtime property to provide GPU access to service containers. However, it does not allow control over specific properties of the GPU devices.
services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    runtime: nvidia
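For comparison, the runtime property maps onto the --runtime flag of docker run. A roughly equivalent one-off invocation (a sketch, assuming the nvidia runtime is registered with the Docker daemon) would be:

$ # Legacy approach: select the nvidia runtime explicitly
$ docker run --rm --runtime=nvidia nvidia/cuda:10.2-base nvidia-smi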
Enabling GPU access to service containers
Docker Compose v1.28.0+ allows defining GPU reservations using the device structure defined in the Compose Specification. This provides more granular control over a GPU reservation, as custom values can be set for the following device properties:

- capabilities - value specified as a list of strings (for example capabilities: [gpu]). You must set this field in the Compose file. Otherwise, it returns an error on service deployment.
- count - value specified as an int or the value all, representing the number of GPU devices that should be reserved (providing the host holds that number of GPUs); the all value is sketched after the note below.
- device_ids - value specified as a list of strings representing GPU device IDs from the host. You can find the device ID in the output of nvidia-smi on the host.
- driver - value specified as a string (for example driver: 'nvidia')
- options - key-value pairs representing driver-specific options.
Note

You must set the capabilities field. Otherwise, it returns an error on service deployment.

count and device_ids are mutually exclusive. You must only define one field at a time.
For more information on these properties, see the deploy section in the Compose Specification.
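Before the concrete examples, here is a minimal sketch of the all value for count mentioned above. It reserves every GPU the host exposes, which is also the default behavior when neither count nor device_ids is set:

services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            # count: all reserves every GPU on the host
            - driver: nvidia
              count: all
              capabilities: [gpu]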
Example of a Compose file for running a service with access to 1 GPU device:
services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Run with Docker Compose:
$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1
test_1 | +-----------------------------------------------------------------------------+
test_1 | | NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.1 |
test_1 | |-------------------------------+----------------------+----------------------+
test_1 | | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
test_1 | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
test_1 | | | | MIG M. |
test_1 | |===============================+======================+======================|
test_1 | | 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
test_1 | | N/A 23C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
test_1 | | | | N/A |
test_1 | +-------------------------------+----------------------+----------------------+
test_1 |
test_1 | +-----------------------------------------------------------------------------+
test_1 | | Processes: |
test_1 | | GPU GI CI PID Type Process name GPU Memory |
test_1 | | ID ID Usage |
test_1 | |=============================================================================|
test_1 | | No running processes found |
test_1 | +-----------------------------------------------------------------------------+
gpu_test_1 exited with code 0
If no count or device_ids are set, all GPUs available on the host are used by default:
services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1
test_1 | I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
.....
test_1 | I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402]
Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
test_1 | /device:GPU:0
gpu_test_1 exited with code 0
On machines hosting multiple GPUs, the device_ids field can be set to target specific GPU devices, and count can be used to limit the number of GPU devices assigned to a service container. If count exceeds the number of GPUs available on the host, the deployment errors out.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1B.0 Off | 0 |
| N/A 72C P8 12W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:1C.0 Off | 0 |
| N/A 67C P8 11W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000000:00:1D.0 Off | 0 |
| N/A 74C P8 12W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 62C P8 11W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
To enable access only to the GPU-0 and GPU-3 devices:
services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '3']
              capabilities: [gpu]
$ docker-compose up
...
Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1b.0, compute capability: 7.5)
...
Created TensorFlow device (/device:GPU:1 with 13970 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
...
gpu_test_1 exited with code 0
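On the same four-GPU host, you could instead cap the reservation with count rather than enumerating IDs. A sketch (note that which two of the four devices are assigned is left to the runtime rather than chosen explicitly):

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
            # Reserve any two of the host's GPUs
            - driver: nvidia
              count: 2
              capabilities: [gpu]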