TL;DR
- Set up a GPU instance (p2.xlarge)
- Install GPU Driver (https://docs.nvidia.com/datacenter/tesla/)
- Install nvidia-container-toolkit (https://github.com/NVIDIA/nvidia-docker)
- nvidia-smi
- FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
- docker run --gpus all <image>
Capacity Reservation
On the EC2 dashboard, go to Capacity Reservations > Create a Capacity Reservation.
- Instance type: p2.xlarge
- Availability zone: us-east-1a
- Total capacity: 1 instance
- Capacity reservation details: ends At specific time, and set the end date.
Then click Create. A warning appears:
```
You have requested more vCPU capacity than your current vCPU limit of ${normalized_limit} allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
```
At http://aws.amazon.com/contact-us/ec2-request, request the following:
- Region: US East (Northern Virginia)
- Primary Instance Type: All P instances
- New Limit value: 4
- Use case description:
```
Hi, I would like to make a request to increase the number of vCPUs for the p2.xlarge instance.
```
In my case, I set the capacity reservation to end at a specific time (for example, two hours from now, since it costs 0.9 USD/hour).
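For reference, the same reservation can also be created from the AWS CLI; a rough sketch, assuming the CLI is configured for us-east-1 (the end date below is just a placeholder):
```bash
# Reserve one p2.xlarge in us-east-1a until a chosen end time (placeholder timestamp)
aws ec2 create-capacity-reservation \
    --instance-type p2.xlarge \
    --instance-platform Linux/UNIX \
    --availability-zone us-east-1a \
    --instance-count 1 \
    --end-date-type limited \
    --end-date 2023-08-22T12:00:00Z
```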
Create GPU instance on EC2
When creating the instance, select the same region and instance type: p2.xlarge. Also, expand the disk space to 20 GB.
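If you prefer the CLI over the console, a sketch of an equivalent launch (the AMI ID and key pair name are placeholders; the block-device mapping grows the root volume to 20 GB):
```bash
# Launch a p2.xlarge in us-east-1a with a 20 GB root volume (placeholders in angle brackets)
aws ec2 run-instances \
    --image-id <ubuntu-22.04-ami-id> \
    --instance-type p2.xlarge \
    --key-name <my-key-pair> \
    --placement AvailabilityZone=us-east-1a \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":20}}]'
```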
1. Install Docker on EC2
```bash
sudo apt-get update
sudo apt-get install docker.io
sudo gpasswd -a ubuntu docker
docker --version
```
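The docker group membership only takes effect in a new login session; a quick sanity check after re-logging in (hello-world is just a small public test image):
```bash
# refresh group membership in the current shell (or simply log out and back in)
newgrp docker
# verify that docker works without sudo
docker run hello-world
```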
2. Install NVIDIA Driver
Check the NVIDIA Tesla Installation Notes:
https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts
Check System Management Interface
If nvidia-smi does not work, you need to install the driver.
On EC2, check the Linux version:
```bash
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.2 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
```
Check recommended NVIDIA driver version:
```bash
sudo apt-get update
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
```
Output
```
== /sys/devices/pci0000:00/0000:00:1e.0 ==
modalias : pci:v000010DEd0000102Dsv000010DEsd0000106Cbc03sc02i00
vendor   : NVIDIA Corporation
model    : GK210GL [Tesla K80]
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-418-server - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
```
In this case, we want to install nvidia-driver-470, the recommended driver.
```bash
sudo apt install nvidia-driver-470
sudo reboot
```
Run nvidia-smi (system management interface):
```bash
$ nvidia-smi
Tue Aug 22 09:52:43 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 58W / 149W | 0MiB / 11441MiB | 54% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
3. Install NVIDIA Container Toolkit
The installation command at the links above did not work for me:
```bash
$ sudo apt-get update \
&& sudo apt-get install -y nvidia-container-toolkit-base
...
E: Unable to locate package nvidia-container-toolkit
```
Instead, some manual commands from a GitHub issue worked for me:
```bash
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
```
Check that it was installed successfully:
```bash
$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.13.5
commit: 6b8589dcb4dead72ab64f14a5912886e6165c079
```
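As a side note, toolkit releases in the 1.13 line can also register the NVIDIA runtime in /etc/docker/daemon.json for you; a sketch, in case --gpus does not work out of the box:
```bash
# write the nvidia runtime entry into /etc/docker/daemon.json, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```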
Check the latest (or appropriate) tag of nvidia/cuda at https://hub.docker.com/r/nvidia/cuda/tags.
Then run docker with it.
```bash
$ docker run --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:12.2.0-base-ubuntu22.04' locally
12.2.0-base-ubuntu22.04: Pulling from nvidia/cuda
6b851dcae6ca: Pull complete
8f5f0e71700a: Pull complete
fac7ce4a13c3: Pull complete
1af9bee222cb: Pull complete
d47e0a26d15c: Pull complete
Digest: sha256:f8870283bea6a85ba4b4a5e1b65158dd15e8009e433539e7c83c94707e703a1b
Status: Downloaded newer image for nvidia/cuda:12.2.0-base-ubuntu22.04
Tue Aug 22 12:56:41 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 12.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:1E.0 Off | 0 |
| N/A 37C P0 57W / 149W | 0MiB / 11441MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
The above shows the nvidia-smi output from inside a Docker container (via the NVIDIA Container Toolkit).
Upload build context
Access SFTP
```bash
# upload the build context over SFTP
$ sftp -i ~/.ssh/mydocker.pem ubuntu@ec2-54-146-60-95.compute-1.amazonaws.com
sftp> put -r dsenv_build

# then SSH back in and edit the Dockerfile
$ ssh -i mygpukey.pem ubuntu@<hostname>
$ vim dsenv_build/Dockerfile
```
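Alternatively, the upload can be done non-interactively with scp (a sketch reusing the same key and host):
```bash
scp -i ~/.ssh/mydocker.pem -r dsenv_build ubuntu@ec2-54-146-60-95.compute-1.amazonaws.com:~/
```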
Update Dockerfile for GPU
```dockerfile
# CUDA 12.1 + cuDNN 8 runtime image as the base
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    sudo \
    wget \
    vim

# install Anaconda into /opt/anaconda3 and put it on PATH
WORKDIR /opt
RUN wget https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh && \
    sh Anaconda3-2019.10-Linux-x86_64.sh -b -p /opt/anaconda3 && \
    rm -f Anaconda3-2019.10-Linux-x86_64.sh
ENV PATH /opt/anaconda3/bin:$PATH

# GPU-enabled deep learning stack
RUN pip install --upgrade pip && pip install \
    keras==2.3 \
    scipy==1.4.1 \
    tensorflow-gpu==2.1

WORKDIR /
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--LabApp.token=''"]
```
Build the image from the uploaded build context.
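For example (the image tag dsenv-gpu is just a placeholder name):
```bash
cd dsenv_build
docker build -t dsenv-gpu .
```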
Run a container on GPU
```bash
docker run --gpus all -v ~:/work -p 8888:8888 <image>
nvidia-smi
```
Access Jupyter Lab at <Public DNS>:8888.
Run the code at https://github.com/keras-team/keras/blob/keras-2/examples/mnist_cnn.py
Check how nvidia-smi behaves on EC2 while it runs.
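For example, GPU utilization can be watched from the host while the MNIST example trains:
```bash
# refresh the nvidia-smi view every second
watch -n 1 nvidia-smi
```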