Description
AI-Training provides users with access to several types of resources. You can view the resource usage of a job through a dedicated Grafana dashboard.
Requirements
- a running job on AI-Training
UI Access
The URL to access the monitoring is built like this: https://monitoring.<REGION>.ai.cloud.ovh.us/d/gpu?var-job=<JOB-ID>
It can be fetched in the CLI using this command:
The URL is displayed on the Monitoring Url line of the output and can be opened in any private browser.
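If you already know the region and job ID, the URL pattern given in this section can also be assembled by hand. The snippet below is a minimal sketch of that pattern only: the function name `monitoring_url` is hypothetical, and the region and job ID used in the usage line are placeholders, not real values.

```python
from urllib.parse import urlencode


def monitoring_url(region: str, job_id: str) -> str:
    """Build the dashboard URL from the pattern described in this guide.

    `region` and `job_id` are inserted as-is; this helper does not check
    that they correspond to an existing region or job.
    """
    if not region or not job_id:
        raise ValueError("region and job_id must be non-empty")
    # The job is passed to the dashboard through the `var-job` query parameter.
    query = urlencode({"var-job": job_id})
    return f"https://monitoring.{region}.ai.cloud.ovh.us/d/gpu?{query}"


# Placeholder values for illustration only.
print(monitoring_url("<REGION>".strip("<>"), "<JOB-ID>".strip("<>")))
```

Replace the placeholders with the region and ID of your own running job before opening the resulting URL.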
Panel Details
We will now go through each panel to give a short description.
GPU Usage
NOTE: This panel will only be present on GPU jobs.
This panel displays the usage of each GPU allocated to your job.
GPU Memory
NOTE: This panel will only be present on GPU jobs.
This panel displays the usage and limit of memory for each GPU allocated to your job.
CPU Usage
This panel displays the overall CPU usage of your job.
Memory Usage
This panel displays the usage and limit of memory allocated to your job.
Network Usage
This panel displays input and output traffic on your job.
Ephemeral Storage Usage
This panel displays the usage and limit of ephemeral storage allocated to your job. Jobs can use ephemeral storage for data that is not stored within a synchronized container.
If your usage goes beyond the limit of the ephemeral storage, your job will be rejected.
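To judge how close a job is to that limit, you can compare the two values shown on the panel. The following is a minimal sketch of that arithmetic, assuming you read the usage and limit off the panel yourself; the function name `storage_headroom`, the 90% warning threshold, and the sample figures are all hypothetical and are not part of the dashboard.

```python
def storage_headroom(used_bytes: int, limit_bytes: int, warn_ratio: float = 0.9) -> dict:
    """Compare ephemeral storage usage with its limit.

    `warn_ratio` is an arbitrary threshold chosen for this example.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    return {
        "ratio": ratio,                                        # fraction of the limit in use
        "remaining_bytes": max(limit_bytes - used_bytes, 0),   # never negative
        "warn": ratio >= warn_ratio,                           # close to the limit
        "over_limit": used_bytes > limit_bytes,                # limit exceeded
    }


GIB = 1024 ** 3
# Hypothetical reading: 45 GiB used out of a 50 GiB limit.
print(storage_headroom(45 * GIB, 50 * GIB))
```

The same comparison applies to the memory and GPU memory panels, which also display a usage value alongside a limit.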
Go further
For more information and tutorials, please see our other AI & Machine Learning support guides or explore the guides for other OVHcloud products and services.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.