Update: I recommend reading NVIDIA's GPU Server Application Deployment Made Easy for an alternative approach (Ansible) to setting up the NVIDIA Docker plugin.
There are many tutorials on how to leverage Amazon's vast computing power to perform deep learning tasks. I would like to take this opportunity to contribute to that collection.
I started working with Amazon's EC2 instances for deep learning by reading some of these tutorials, most notably:
- Installing CUDA, OpenCL, and PyOpenCL on AWS EC2
- Deep Learning Tutorial for Kaggle’s Facial Keypoints Detection
- Installing TensorFlow on AWS
I would launch an instance, install the necessary dependencies, and create an AMI. I repeated this for each deep learning framework I wanted to use (Theano, Caffe, TensorFlow, etc.).
I recently came upon NVIDIA Docker and I haven’t stopped using it since. I’m only storing one AMI now and updating dependencies has become hassle-free. NVIDIA Docker is a thin wrapper for Docker that can, in addition to the default functionality, discover available GPU devices and their respective driver files.
Throughout this tutorial, I'm going to assume you've selected one of the GPU EC2 instances running Ubuntu 14.04.4 LTS (Trusty Tahr). Issues might arise if your setup differs.
It will take roughly 15 minutes to complete this tutorial, from launching the instance to having a container ready with Keras, Theano, and CUDA.
Creating the EC2 Instance
For starters, I recommend reading Getting Started with Amazon EC2 Linux Instances.
To summarize,
1. Go to the EC2 page on the AWS Console and click the blue Launch Instance button.
2. Choose the latest stable Ubuntu AMI. You can find it on the Quick Start and Community AMIs panes.
3. Select one of the GPU instances: g2.2xlarge (1 GPU) or g2.8xlarge (4 GPUs).
4. Choose ‘Request Spot Instances’ if you want to save up to 90% on instance costs. Spot instances provide compute power at a much cheaper rate but come with the risk of being terminated unexpectedly (depending on your maximum bid price). If you can't handle the interruptions and are willing to pay more, stick with the default on-demand instance. If you prefer the AWS CLI to the console, see the sketch after this list.
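For reference, a roughly equivalent on-demand launch from the AWS CLI might look like the following. This is only a sketch; the AMI ID and key pair name are placeholders you would replace with your own.
aws ec2 run-instances --image-id ami-xxxxxxxx --count 1 --instance-type g2.2xlarge --key-name my-keypair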
Installing the prerequisites
Connect to your instance. Read Connecting to Your Linux Instance Using SSH for instructions.
ssh -i [my_keypair.pem] ubuntu@[dns_of_ec2_instance]
Connecting to an instance running Ubuntu using an SSH client.
Once you’ve made your way into the instance, it’s time to start installing everything we need to start deep learning.
In order to use NVIDIA Docker, we need to fulfill the nvidia-docker prerequisites.
Update all the default packages on the instance.
sudo apt-get update && sudo apt-get upgrade
Install Docker on the instance. You need to follow this guide, which consists of updating your apt sources and installing the linux-image-extra kernel package and docker-engine.
To summarize,
sudo apt-key adv --keyserver hkp://p80.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
Create the file /etc/apt/sources.list.d/docker.list and add the line: deb https://apt.dockerproject.org/repo ubuntu-trusty main
sudo apt-get update
apt-cache policy docker-engine
sudo apt-get install linux-image-extra-$(uname -r)
sudo apt-get install docker-engine
If you’ve followed all of those instructions, you can test it out using the following:
sudo docker run hello-world
Install the necessary graphics drivers. Read more here.
According to the PPA page, nvidia-361 is the recommended version.
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-361
Install nvidia-modprobe. It loads the NVIDIA kernel module and creates NVIDIA character device files.
sudo apt-get install nvidia-modprobe
Installing NVIDIA Docker
If you've followed the instructions above, the next few steps should be a breeze.
Install NVIDIA Docker.
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.0-rc.3/nvidia-docker_1.0.0.rc.3-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb
The following command can be used to test everything so far. I've also included roughly what it should return.
sudo nvidia-docker run --rm nvidia/cuda nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.45.18 Driver Version: 361.45.18 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 29C P8 19W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I highly recommend creating an AMI at this point; it will save you from having to follow this tutorial again. Read Creating an Amazon EBS-Backed Linux AMI.
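If you prefer scripting this step, the AWS CLI can create the image as well. This is a minimal sketch, assuming the AWS CLI is configured; the instance ID and image name are placeholders:
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "ubuntu-nvidia-docker-base" --description "Ubuntu 14.04 with NVIDIA drivers, Docker and nvidia-docker"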
Select a Docker image from Kaixhin's repository. Let's pick kaixhin/cuda-keras and download it.
sudo nvidia-docker pull kaixhin/cuda-keras
Create a container with the image.
sudo nvidia-docker run -it kaixhin/cuda-keras
Voila! You've got yourself a container set up with Keras, Theano, and CUDA.
Extras
Adding code and data
You’ve got a container now but no code or data. What is the point?!?!
In the EC2 instance, create a directory where your code and data will reside. You can use s3cmd to move data from/to Amazon's S3.
sudo apt-get install s3cmd
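As a quick illustration, after running s3cmd --configure with your AWS credentials, pulling a dataset down from S3 might look like this (the bucket and paths are placeholders):
s3cmd --configure
s3cmd sync s3://my-bucket/datasets/ /home/ubuntu/data/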
One way of moving files onto the container is Docker's cp command. Unfortunately, with a lot of data and code, this can be quite a hassle.
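For the occasional file, a one-off copy looks like this (the paths and container ID are placeholders):
sudo docker cp /home/ubuntu/data/train.csv [CONTAINER_ID]:/root/train.csv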
I recommend attaching a data volume to a container. Next time you run a container, use the -v flag.
sudo nvidia-docker run -v /home/ubuntu/[HOST_DIR]:/[CONTAINER_DIR] -it kaixhin/cuda-keras
cd to /[CONTAINER_DIR] and you will find everything that is in [HOST_DIR]. Any changes in [HOST_DIR] will be directly reflected in the container (without having to restart it).
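For example, a file created on the host shows up immediately inside the running container, which makes for a quick sanity check of the mount (the file name is just an example):
echo "hello from the host" > /home/ubuntu/[HOST_DIR]/hello.txt
cat /[CONTAINER_DIR]/hello.txt  # run this inside the container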
Additional dependencies
Go to Kaixhin’s repository and download the Dockerfile related to the image you’re interested in building.
Modify the Dockerfile and copy it over to your EC2 instance. I recommend reading Best practices for writing Dockerfiles.
Create a directory [DOCKER_DIR] and move the modified Dockerfile into that directory.
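As an illustration only, a modified Dockerfile that adds a couple of extra Python packages on top of the kaixhin/cuda-keras image might look like this (assuming pip is available in the base image; the packages are just examples):
FROM kaixhin/cuda-keras
# Example extra dependencies; replace with whatever your project needs
RUN pip install h5py pandas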
Run the following:
sudo nvidia-docker build [DOCKER_DIR]
Next time you run a container, you can use the id of the image you just built.
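For example, if the build step printed an image ID (shown here as a placeholder), you can start a container from it with:
sudo nvidia-docker run -it [IMAGE_ID]
Alternatively, passing -t [NAME] to the build command gives the image a friendlier name to run.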
Conclusion
Once I’ve got everything set up, I’ll usually run some code, detach from my container and tail the logs from the host. I hope you enjoyed my tutorial and found it useful. If you have any suggestions or questions, feel free to reach out in the comments below.
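P.S. If you're curious what that detach-and-tail workflow looks like, here is a minimal sketch (the training script and container ID are placeholders):
python train.py  # inside the container: start training
# detach without stopping the container: press Ctrl-p, then Ctrl-q
sudo docker ps  # back on the host: find the container ID
sudo docker logs --follow [CONTAINER_ID]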