This tutorial gives you some hints on how to debug your jobs if things go wrong.
Requirements
- an AI Training Job you would like to start
- the OVHcloud AI CLI installed
Instructions
What is an AI Training job and how do I run one?
All steps for starting and working on AI Training are described in the AI Training - Getting Started guide.
How do I get my files back once I have finished training?
When you use AI Training, make sure to mount Object Storage containers to your job, and save the files you want to keep on these volumes. Once your training job is complete (status is DONE), the job synchronizes your data to the mounted Object Storage container(s).
If no volumes are mounted in the specified location, your files will be saved in the job's local and ephemeral storage and then deleted once the job is finished.
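As a minimal sketch of this precaution, a training script can check at startup that its output directory already exists before writing anything to it. The function names `require_output_dir` and `save_text` are hypothetical helpers, not part of AI Training, and the path you pass in is whatever location you chose when mounting your container.

```python
from pathlib import Path


def require_output_dir(mount_path: str) -> Path:
    """Return mount_path as a Path, or fail fast if it is not an existing directory.

    The directory is deliberately not created here: a directory created by the
    script itself would live in the job's ephemeral storage and be lost.
    """
    path = Path(mount_path)
    if not path.is_dir():
        raise FileNotFoundError(
            f"{mount_path} does not exist; is an Object Storage container mounted there?"
        )
    return path


def save_text(output_dir: Path, name: str, content: str) -> Path:
    """Write content to a file inside output_dir and return the file's path."""
    target = output_dir / name
    target.write_text(content, encoding="utf-8")
    return target
```

Calling `require_output_dir` once at the top of the script means that a missing mount surfaces as an immediate error rather than as missing files after the job has finished. Note that an existing directory is not proof that a volume is mounted there; the check only catches the case where the location is absent.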
Which commands and arguments can I use to debug?
Many options and sub-commands are available in the ovhai CLI tool, which is the recommended way to interact with AI Solutions. To install it, follow the CLI Installation guide.
To get a list of available sub-commands and arguments, use the command:
Further details on each sub-command can be accessed with:
Where can I find the UUID of my job?
The UUIDs of your jobs appear in the OVHcloud Control Panel when you go to the AI Training section.
You can also find them when you list your existing jobs, using the ovhai CLI with the following command:
If your job is not listed, you may use this command to list all the jobs:
Why has my job FAILED?
First, check the return code / error code of your job
You can find the return code of your job by running:
Your return code is listed in the "Infos" field in the "Status"-section:
The following info is returned if there was an issue downloading (pulling) your image. Check for typos in the image name, and check for access issues if you are trying to use a non-public image.
Check if there are any error messages
Your stdout (Output) and stderr (Error) messages can be read with:
Note that you can also consult them from the OVHcloud Control Panel, by going to AI Training > Job UUID > Logs.
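Once you have the log output saved as text, a short script can pull out the lines most likely to explain a failure. The sketch below is one possible approach, assuming the logs are plain text; the marker list and the function name `find_error_lines` are illustrative choices, not something defined by AI Training.

```python
# Illustrative markers only; adjust them to what your framework actually prints.
ERROR_MARKERS = ("traceback", "error", "exception", "killed")


def find_error_lines(log_text, markers=ERROR_MARKERS):
    """Return (line_number, line) pairs for lines containing any marker.

    Matching is case-insensitive and line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append((number, line))
    return hits
```

Substring matching is deliberately crude: it will also flag harmless lines that merely mention the word "error", so treat the result as a list of places to look at first rather than as a diagnosis.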
Debug interactively
If the answers above don't help you solve your issue, it may help to run your job more interactively.
To skip any "autostart" of your image, you can start the job with a bash shell running an infinite sleep, and then connect to it over SSH.
Verify you can connect to the SSH host by running the following command:
You may now start your commands and/or use the typical command line tools to debug your issue within the container.
Debug your code
The easiest way to debug your code may be to use the interactive debug session described above and to run or compile your code interactively, checking for example for:
- error messages
- syntax errors
- missing libraries
- wrong versions
You can do this by running (parts of) your Python code with:
or using any other debugger.
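For the "missing libraries" and "wrong versions" checks in particular, a few lines of standard-library Python run inside the container can report what is actually installed. This is a sketch only: the helper names `check_modules` and `installed_version` are hypothetical, and the module and distribution names you pass in depend on your own project.

```python
import importlib.util
from importlib import metadata


def check_modules(module_names):
    """Map each top-level module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


def installed_version(distribution_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None
```

Keep in mind that the import name of a module and the name of the distribution that provides it can differ, which is why the two lookups are kept separate.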
Is it possible to update a running job?
It is not possible to update a running job. If you wish to change the specification of a job, you need to interrupt the current one and recreate it.
How is the product billed?
During its lifetime, the job transitions between the following statuses: QUEUED, INITIALIZING, PENDING, RUNNING, INTERRUPTING, FINALIZING, and DONE.
Billing is per minute and covers the time from the start to the end of the job's RUNNING status. Each started minute is billed in full. Jobs that never reach the RUNNING state are not billed.
The price will depend on the compute resources you use (CPUs and GPUs) and their running time.
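The rounding rule can be illustrated with a short calculation. This is a sketch under stated assumptions: `price_per_minute` is a hypothetical figure standing for the combined per-minute price of the resources your job uses, and the function names are made up for this example. Refer to the billing guide for actual prices.

```python
import math


def billed_minutes(running_seconds):
    """Minutes billed for the time spent in RUNNING: every started minute counts in full."""
    if running_seconds <= 0:
        return 0  # a job that never reached RUNNING is not billed
    return math.ceil(running_seconds / 60)


def estimated_cost(running_seconds, price_per_minute):
    """Estimated cost, given a hypothetical per-minute price for the job's resources."""
    return billed_minutes(running_seconds) * price_per_minute
```

For example, a job that spends 61 seconds in RUNNING is billed for 2 minutes, while one that spends exactly 60 seconds is billed for 1.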
For more information about AI Training billing and pricing examples, please check the AI Training - Billing and lifecycle guide.
How long can I use my AI Training job?
An AI Training job runs continuously until it completes or until the user interrupts it manually, up to a maximum of 7 days of running time, after which it is stopped automatically. You can have the job restarted automatically by using the auto-restart option (set this parameter to True); the job then restarts as is. To increase the 7-day limit, contact Customer Support and ask for an upgrade of this quota for your Public Cloud project.
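To plan around this limit, you can estimate when a job will be stopped. The sketch below assumes the 7 days are counted from the moment the job starts running and that the default quota applies; the names `MAX_RUNNING_TIME`, `automatic_stop_time` and `time_left` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default limit described above; your project's quota may differ if Support raised it.
MAX_RUNNING_TIME = timedelta(days=7)


def automatic_stop_time(running_since, limit=MAX_RUNNING_TIME):
    """Return the moment at which a job running since running_since reaches the limit."""
    return running_since + limit


def time_left(running_since, now, limit=MAX_RUNNING_TIME):
    """Return the remaining running time before the limit, never less than zero."""
    remaining = automatic_stop_time(running_since, limit) - now
    return max(remaining, timedelta(0))


# Example with a timezone-aware start time.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
stop = automatic_stop_time(start)
```

Use timezone-aware datetimes for both arguments, since Python refuses to subtract an aware datetime from a naive one.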
Go further
For more information and tutorials, please see our other AI & Machine Learning support guides or explore the guides for other OVHcloud products and services.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.