AI Training - Billing and lifecycle

The OVHcloud AI Training service provides a Container as a Service platform backed by CPU or GPU resources, without the hassle of installing or operating them. This guide covers the lifecycle of an AI Training job and how it is billed.

Introduction

AI Training jobs are linked to a Public Cloud project. The whole project is billed at the end of the month. With pay-as-you-go, you only pay for what you consume, based on the compute resources you use (CPUs and GPUs) and how long they run.

AI Training job lifecycle

During its lifetime, the AI Training job will go through the following statuses:

QUEUED: The job run request is about to be processed.
INITIALIZING: The job instance is created, and the data is synchronized from the Object Storage. To learn more about data synchronization, see the Data - How it works section.
PENDING: The job is being started.
RUNNING: The job is running, and you can connect to it. Compute resources (GPUs/CPUs) are allocated to your specific job, and the data is available.
INTERRUPTING: The job is still running, but an interruption order was received and is about to be processed.
FINALIZING: The job instance is deleted, and the data is synchronized back to the Object Storage. To learn more about data synchronization, see the Data - How it works section.
DONE: The job ended normally.
TIMEOUT: The job is still running but is about to be interrupted because the timeout was reached.
INTERRUPTED: The job ended because it was interrupted.
FAILED: The job ended with an error, e.g., the process in the job finished with a non-zero exit code, or the Docker image could not be pulled. For more information, refer to this section of our Troubleshooting documentation.
ERROR: The job ended due to a backend error. You may contact our support.
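
When writing scripts around jobs, it can help to separate the statuses in which a job has ended from those in which it is still progressing. The following is a minimal sketch that only encodes the list above. The names `JobStatus`, `ENDED_STATUSES` and `has_ended` are hypothetical and not part of any OVHcloud tool or API. TIMEOUT is treated as not ended because the description above says the job is still running in that status.

```python
from enum import Enum


class JobStatus(Enum):
    """Job statuses as listed in this guide (hypothetical helper type)."""
    QUEUED = "QUEUED"
    INITIALIZING = "INITIALIZING"
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    INTERRUPTING = "INTERRUPTING"
    FINALIZING = "FINALIZING"
    DONE = "DONE"
    TIMEOUT = "TIMEOUT"
    INTERRUPTED = "INTERRUPTED"
    FAILED = "FAILED"
    ERROR = "ERROR"


# Statuses whose description above says the job "ended".
ENDED_STATUSES = frozenset(
    {JobStatus.DONE, JobStatus.INTERRUPTED, JobStatus.FAILED, JobStatus.ERROR}
)


def has_ended(status_name: str) -> bool:
    """Return True if the given status name denotes a job that has ended."""
    return JobStatus(status_name) in ENDED_STATUSES
```

For instance, `has_ended("FAILED")` is true while `has_ended("RUNNING")` is false, and an unknown status name raises `ValueError`.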

Billing principles

AI Training is a pay-per-use solution. You only pay for the resources consumed during the RUNNING phase of your jobs.

The billing principle is simple: you select the number of compute resources (CPUs or GPUs) you want to work with, and you pay only for those.
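
To illustrate that only the RUNNING phase counts, the sketch below sums the time a job spent in RUNNING from a list of status changes. The history format (pairs of status name and timestamp) and the function `running_time` are assumptions made for this illustration, not the format returned by OVHcloud. This guide does not say how time spent in INTERRUPTING or TIMEOUT is treated, so the sketch counts strictly RUNNING intervals.

```python
from datetime import datetime, timedelta


def running_time(history, end):
    """Sum the time spent in RUNNING.

    `history` is a chronologically ordered list of (status, entered_at)
    pairs; `end` is the instant at which the last status was left.
    """
    total = timedelta()
    # Each status lasts until the next one starts; the last one lasts until `end`.
    boundaries = [entered_at for _, entered_at in history[1:]] + [end]
    for (status, start), stop in zip(history, boundaries):
        if status == "RUNNING":
            total += stop - start
    return total


# Hypothetical timeline, for illustration only.
day = datetime(2024, 1, 1)
history = [
    ("QUEUED", day.replace(hour=10, minute=0)),
    ("INITIALIZING", day.replace(hour=10, minute=1)),
    ("PENDING", day.replace(hour=10, minute=2)),
    ("RUNNING", day.replace(hour=10, minute=3)),
    ("INTERRUPTING", day.replace(hour=10, minute=48)),
    ("FINALIZING", day.replace(hour=10, minute=49)),
]
billable = running_time(history, end=day.replace(hour=10, minute=50))
print(billable)  # 0:45:00
```

In this timeline the job exists for 50 minutes, but only the 45 minutes between entering RUNNING and entering INTERRUPTING are counted.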

Included in AI Training resources:

AI Training managed service
Dedicated CPU/GPU compute resources (based on the selected amount)
Ephemeral local storage (size depends on the selected compute resources)
Ingress/Egress network traffic

Optional with AI Training:

Private registries, based on OVHcloud Managed Private registry pricing
Remote storage space, based on OVHcloud Object Storage pricing
Egress traffic for remote Object storage

[Figure: visual explanation of paid items]

[Figure: more detailed view of paid items]

Compute resources details

When creating an AI Training job, you select its compute resources (CPUs or GPUs). Their official pricing is available in the OVHcloud Control Panel or on the OVHcloud Public Cloud website.

Compute rates are quoted per hour to make prices easier to read, but billing granularity is per minute.
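
As a rough illustration of per-minute billing with an hourly rate, the sketch below converts a duration in whole minutes into a cost. The function `compute_cost` is hypothetical. How partial minutes are rounded is not stated in this guide, so the sketch only accepts whole minutes. `Decimal` is used to avoid floating-point rounding on currency amounts.

```python
from decimal import Decimal


def compute_cost(minutes: int, units: int, hourly_rate: str) -> Decimal:
    """Cost of `units` resources running for `minutes` whole minutes.

    `hourly_rate` is the price of one resource per hour, given as a string
    so that it converts to Decimal without floating-point error.
    """
    if minutes < 0 or units < 0:
        raise ValueError("minutes and units must be non-negative")
    # Multiply first and divide by 60 last to keep the result exact when possible.
    return Decimal(hourly_rate) * minutes * units / 60


# Two CPUs at an example rate of $0.04 per hour, running for 30 minutes.
print(compute_cost(minutes=30, units=2, hourly_rate="0.04"))
```

With an hourly-only model, a 30-minute run would be charged as a full hour. With per-minute granularity, it costs half the hourly price.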

Storage details

Ephemeral local storage

Each compute resource (CPU or GPU) comes with local storage. This storage is ephemeral: its content is not saved when you delete an AI Training job.

The sizing depends on the selected amount of compute resources; check the details on the OVHcloud Public Cloud website.

Remote Object storage

When working with remote data, you pay separately for the storage of this data. Object Storage pricing is separate from AI Training pricing.

Pricing examples

Example 1: One GPU job for 45 minutes, then deleted

We start one AI Training job with one GPU, and we keep it running for 45 minutes, then we delete it.

compute resources: 1 x GPU NVIDIA L4 ($0.91 / hour)
remote storage: nothing
duration: 45 minutes

Compute cost: 0.75 (hours) x 1 (GPU) x $0.91 (price / GPU) = $0.6825

Storage cost: none

Total: $0.6825, billed at the end of the month

Example 2: One CPU job for 10 days

We start one AI Training job with two CPUs, and we keep it running for 10 days, then it is stopped.

compute resources: 2 x CPU ($0.04 / hour)
remote storage: nothing
duration: 10 days, then stopped

Compute cost: 10 (days) x 24 (hours) x 2 (CPU) x $0.04 (price / CPU) = $19.20

Storage cost: none

Total: $19.20, billed at the end of the month

Example 3: One GPU job for 5 hours

We start one AI Training job with two GPUs, and we keep it running for 5 hours, then it is stopped.

compute resources: 2 x GPU NVIDIA L4 ($0.91 / hour)
remote storage: nothing
duration: 5 hours, then stopped

Compute cost: 5 (hours) x 2 (GPU) x $0.91 (price / GPU) = $9.10

Storage cost: none

Total: $9.10, billed at the end of the month
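
The three totals above can be checked with a few lines of Python. This is only a sketch of the arithmetic shown in the examples (hours x units x hourly price). The names `EXAMPLES` and `total` are hypothetical, and the rates are the example values used in this guide, not a price list.

```python
from decimal import Decimal

# (label, hours, units, hourly price per unit), copied from the examples above.
EXAMPLES = [
    ("Example 1", Decimal("0.75"), 1, Decimal("0.91")),
    ("Example 2", Decimal(10 * 24), 2, Decimal("0.04")),
    ("Example 3", Decimal(5), 2, Decimal("0.91")),
]


def total(hours: Decimal, units: int, price: Decimal) -> Decimal:
    """Compute cost as hours x units x hourly price per unit."""
    return hours * units * price


totals = {label: total(hours, units, price) for label, hours, units, price in EXAMPLES}
for label, amount in totals.items():
    print(f"{label}: ${amount}")
```

Storage cost is zero in all three examples, so the compute cost is also the total.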

Go further

For more information and tutorials, please see our other AI & Machine Learning support guides or explore the guides for other OVHcloud products and services.

If you need training or technical assistance to implement our solutions, contact your sales representative or request a quote to have our Professional Services experts carry out a custom analysis of your project.
