Learn how to submit an AI Training job through the ovhai CLI. To illustrate the submission, we will iteratively build a command to run the notebook image ovhcom/ai-training-transformers:3.1.0, which comes with the Hugging Face framework preinstalled. This Docker image is freely available.
Requirements
- a working ovhai CLI (see our CLI - Installation guide)
Instructions
job run
If you need any help while submitting a new job, run ovhai job run --help:
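For example, to display the full list of available flags and arguments:

```bash
ovhai job run --help
```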
Size your run
First, you need to tweak the resources you need for your new run depending on your expected workload.
For example, if you are exploring your data or designing the neural network you plan to train, you might start with a few vCPUs. Once your experiment is ready, switch over to GPUs for training.
The --cpu and --gpu flags are exclusive: if GPU resources are specified, the CPU flag is ignored and the standard GPU to CPU ratio is applied. You can find out more about these ratios in the capabilities.
If you provision GPUs for your run, you can also select the model of GPU you wish to use with the --gpu-model flag. If this flag is not specified, the default GPU model for the cluster on which you submit is used. You can find out about the default GPU for your cluster with the ovhai capabilities flavor list command.
The maximum number of vCPUs or GPUs available depends on the GPU model and the cluster you are using. You can find out about your cluster resource limitations with ovhai capabilities flavor list.
For this experiment, we will deploy a notebook with 1 GPU of the default model:
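At this stage, a minimal sketch of the command could look like the following (the image is the notebook image introduced above; all other settings keep their defaults):

```bash
ovhai job run --gpu 1 ovhcom/ai-training-transformers:3.1.0
```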
- If no resource flag is specified, the job will run with one unit of the default GPU model.
- If both CPU and GPU flags are provided, only the GPU one is considered.
Attaching volumes
This step assumes that you either have data in your OVHcloud Object Storage that you wish to use during your experiment, or that you need to save your job results into the Object Storage. To learn more about data, volumes, and permissions, check out the data page.
You can attach as many volumes as you want to your job with various options. Let us go through those options and outline a few good practices with volume mounts.
The --volume flag is used to attach a container as a volume to the job. The volume description <container@region/prefix:mount_path:permission:cache> sets the options for the volume and the synchronization process:
- container: the container in OVHcloud Object Storage to synchronize
- region: the Object Storage region on which the container is located
- prefix: objects in the container are filtered based on this prefix; only matching objects are synced
- mount_path: the location in the job where the synced data is mounted
- permission: the permission rights on the mounted data. Available rights are read only (ro), read write (rw), or read write delete (rwd). Data mounted with ro permission is not synced back at the end of the job, which avoids unnecessary synchronization delay on static data.
- cache: whether the synced data should be added to the project cache. Available options are either cache or no-cache. Data in the cache can be used by other jobs without additional synchronization. To benefit from the cache, the new jobs also need to mount the data with the cache option.
Let's assume you have a team of data scientists working on the same input dataset, but each is running their own experiment. In this case, a good practice is to mount the input dataset with ro permission and cache activated for each experiment: the input data is synced only once and never synced back. In addition, each experiment will yield specific results that should be stored in a dedicated container. For each job, we would then mount an output container with rw permission and no cache. If a container does not exist yet in the Object Storage, it is created during the data synchronization.
Assuming our data is located in the Vint Hill Object Storage in a container named dataset, the command would now be:
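A possible sketch of the command at this stage is shown below. The <region> placeholder stands for the region code of your Vint Hill Object Storage, the /workspace/dataset mount path is an illustrative choice, and the prefix part of the volume description is omitted so the whole container is synced; adapt these values to your project:

```bash
ovhai job run --gpu 1 \
  --volume dataset@<region>:/workspace/dataset:ro:cache \
  ovhcom/ai-training-transformers:3.1.0
```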
- Data in the cache does not persist indefinitely. After a period of inactivity the data is emptied from the cache. Inactivity is defined as having no running jobs using the data in cache.
Define your process
Once resources and volumes are set up, you will now need to define the specifics of the process running within your job. First, you need a Docker image that you either built yourself or found freely available on a public repository such as DockerHub. In our example, we will use the notebook image ovhcom/ai-training-transformers:3.1.0.
You can tweak the behavior of your Docker image without having to rebuild it every time (like updating the number of epochs for a training run) by using the --env flag. Using this, you can simply set environment variables directly in your job, e.g.:
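For instance, a hypothetical EPOCHS variable could be set as in the sketch below; the variable name and value are purely illustrative, and whether they are used depends entirely on your image (the NAME=value form can be confirmed with ovhai job run --help):

```bash
ovhai job run --gpu 1 \
  --env EPOCHS=50 \
  ovhcom/ai-training-transformers:3.1.0
```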
In our example, we do not require any environment variable.
It is also possible to override the default CMD or ENTRYPOINT of the Docker image: simply add the new command at the end of the job run request. To make sure flags from your command are not interpreted as ovhai parameters, you can prefix your command with --. To simply print Hello World, the command would be:
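A minimal sketch, reusing the notebook image from this guide purely for illustration (any image providing a shell would do):

```bash
ovhai job run ovhcom/ai-training-transformers:3.1.0 -- bash -c "echo Hello World"
```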
When a job is running, a job_url is associated with it, which allows you to access any service exposed in your job. By default, the port exposed behind this URL is 8080. In our case, the Jupyter Notebook is directly exposed on 8080 and we do not need to override it. However, if you are running an experiment and monitoring it with TensorBoard, whose default port is 6006, you can override the port with:
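A possible sketch, assuming the port is overridden with a flag such as --default-http-port and a hypothetical TensorBoard image (flag names can vary between CLI versions, so confirm with ovhai job run --help):

```bash
ovhai job run --gpu 1 \
  --default-http-port 6006 \
  <your-tensorboard-image>
```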
Extra options
A few other options are available for your jobs.
- --timeout: the delay after which the job will stop even if the process in the job did not end; it helps you control your consumption
- --label: free labels to help you organize your jobs; labels are also used to scope app_token, learn more about app_token and how to create them here
- --read-user: you can add a read-user to a job; a read user will only have access to the service exposed behind the job_url. The read-user must match the username of an AI Platform user with an AI Training read role.
- --ssh-public-keys: allows you to access your job through SSH; it is particularly useful to set up a VSCode Remote
- --from: run a job based on the specification of a previous one. All options will override the base job values. The --image flag is used to override the image of the base job.
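As an illustration, a couple of these options could be combined as in the sketch below; the timeout value format and the label syntax shown here are assumptions, so check ovhai job run --help for the exact forms expected by your CLI version:

```bash
ovhai job run --gpu 1 \
  --timeout 7200 \
  --label experiment=transformers-demo \
  ovhcom/ai-training-transformers:3.1.0
```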
Run a job
Finally, to submit a notebook job with 1 GPU, a dataset container and an output container, we run:
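Putting the previous steps together, the final command could look like the sketch below. The <region> placeholder, the output container name and the /workspace mount paths are assumptions to adapt to your own project; the dataset is mounted read only with cache, while the results container is mounted read write without cache:

```bash
ovhai job run --gpu 1 \
  --volume dataset@<region>:/workspace/dataset:ro:cache \
  --volume output@<region>:/workspace/output:rw:no-cache \
  ovhcom/ai-training-transformers:3.1.0
```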
You can then follow the progress of all your jobs using the following commands:
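For example, to list all your jobs and their current status:

```bash
ovhai job list
```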
If you want to fetch the specific job you just submitted, retrieve its ID and then:
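For example, replacing <job-id> with the ID returned at submission time:

```bash
ovhai job get <job-id>
```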
For more information about the job and its lifecycle refer to the jobs page.
Going further
To learn more about the CLI and the available commands to interact with your job, check out the overview of ovhai.
For more information and tutorials, please see our other AI & Machine Learning support guides or explore the guides for other OVHcloud products and services.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.