Definition
A job in AI Training is the workload unit submitted to the cluster. A job runs as a Docker container within the OVHcloud infrastructure.
Each AI Training job is linked to a Public Cloud project. It specifies the resources to use for the training task and a Docker image, which can be publicly available, stored in the AI Training shared registry scoped to your project, or stored in a private registry that you have added. For the latter, see the OVHcloud documentation on how to add, use, and manage registries.
Considerations
- An AI Training job runs continuously until it finishes or the user interrupts it manually, unless it exceeds 7 days of running, in which case it is stopped automatically. You can choose to restart it automatically using the auto-restart option (set this parameter to True). The job will then restart as is.
- Data can be attached to a job to serve as input for your training workload, as output (e.g., model weights), or both.
- If you do not customize your resource request, the job requests 1 GPU by default. Memory is not customizable.
- Billing for jobs is minute-based and covers the period from the beginning to the end of the job's RUNNING status. Each commenced minute is billed completely.
- You can read further on job limitations here.
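The minute-based billing rule above can be illustrated with a minimal sketch. The helper name `billed_minutes` and the constant `MAX_RUN_DURATION` are hypothetical and are not part of any OVHcloud API; the sketch only assumes that you know when the job entered and left the RUNNING status.

```python
import math
from datetime import datetime, timedelta

# Maximum run length stated above: a job is stopped after 7 days of running.
MAX_RUN_DURATION = timedelta(days=7)


def billed_minutes(running_start: datetime, running_end: datetime) -> int:
    """Return the minutes billed for one RUNNING period (hypothetical helper)."""
    if running_end < running_start:
        raise ValueError("running_end must not be earlier than running_start")
    elapsed_seconds = (running_end - running_start).total_seconds()
    # Each commenced minute is billed completely, so round up to a whole minute.
    return math.ceil(elapsed_seconds / 60)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)
    # 61 seconds of RUNNING time commences a second minute.
    print(billed_minutes(start, start + timedelta(seconds=61)))  # 2
    # A job that reaches the 7-day limit.
    print(billed_minutes(start, start + MAX_RUN_DURATION))  # 10080
```

Multiplying the result by the per-minute rate of the resources you requested gives a rough cost estimate; the actual rates are listed in the OVHcloud pricing documentation and are not reproduced here.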
Under the hood
Jobs in AI Training are Docker containers within the OVHcloud infrastructure.
Job lifecycle
During its lifetime, the job will transition between the following states:
Only the RUNNING time of the job is billed. For more information about job billing, refer to this documentation.
NOTE: You can only SSH into your job while it is in the RUNNING state.
- QUEUED: The job run request is about to be processed.
- INITIALIZING: The job instance is created, and the data is synchronized from the Object Storage. You can learn more about data synchronization here.
- PENDING: The job is being started.
- RUNNING: The job is running.
- INTERRUPTING: The job is still running, but an interruption order was received and is about to be processed.
- FINALIZING: The job instance is deleted, and the data is synchronized back to the Object Storage. You can learn more about data synchronization here.
- DONE: The job ended normally.
- TIMEOUT: The job is still running but is about to be interrupted because the timeout was reached.
- INTERRUPTED: The job ended and was interrupted.
- FAILED: The job ended with an error (e.g., the process in the job finished with a non-0 exit code, the Docker image could not be pulled, etc.).
- ERROR: The job ended due to a backend error.
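As an illustration of how a client script might reason about these states, here is a minimal sketch. The `JobState` enum, the helper functions, and the `fetch_state` callable are all hypothetical names introduced for this example, not part of an OVHcloud SDK. The set of ended states follows the descriptions above, where TIMEOUT is described as still running.

```python
import time
from enum import Enum
from typing import Callable


class JobState(Enum):
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    INTERRUPTING = "INTERRUPTING"
    FINALIZING = "FINALIZING"
    DONE = "DONE"
    TIMEOUT = "TIMEOUT"
    INTERRUPTED = "INTERRUPTED"
    FAILED = "FAILED"
    ERROR = "ERROR"


# States whose description says the job has ended.
ENDED_STATES = frozenset(
    {JobState.DONE, JobState.INTERRUPTED, JobState.FAILED, JobState.ERROR}
)


def has_ended(state: JobState) -> bool:
    return state in ENDED_STATES


def is_billed(state: JobState) -> bool:
    # Only the RUNNING time of the job is billed.
    return state is JobState.RUNNING


def can_ssh(state: JobState) -> bool:
    # SSH access is only possible while the job is RUNNING.
    return state is JobState.RUNNING


def wait_until_ended(
    fetch_state: Callable[[], JobState], poll_seconds: float = 10.0
) -> JobState:
    """Poll a caller-supplied state getter until the job has ended."""
    while True:
        state = fetch_state()
        if has_ended(state):
            return state
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Simulated sequence of states standing in for real status calls.
    fake_states = iter([JobState.QUEUED, JobState.RUNNING, JobState.DONE])
    final = wait_until_ended(lambda: next(fake_states), poll_seconds=0)
    print(final.name)  # DONE
```

In practice, `fetch_state` would wrap whichever client you use to read the job status, converting the returned string with `JobState(status_string)`.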
Go further
- You can check the OVHcloud documentation on how to create a data container.
- You can check the OVHcloud documentation on how to submit a job.