AI Deploy - Tutorial - How to load test your application with Locust – Support Guides

Learn how to load test your deployed applications by gradually querying your APIs with a load testing tool.

Usually, your challenge is to forecast your compute needs, for example, how many CPUs or GPUs will be required for 1,000 API calls per hour and acceptable latency. There are several applications to simulate the number of users and requests that you may have to address. In this tutorial, we will use one of them and interpret the results.

AI Deploy is covered by OVHcloud Public Cloud Special Conditions.

Requirements

Access to the OVHcloud Control Panel
A Public Cloud project in your OVHcloud account
An app with an API running in AI Deploy on your Public Cloud project
A Python environment, with enough CPU, RAM, and internet access (a virtual machine is recommended)

Selecting the right load testing tool for your needs

Depending on your preferred programming language and time to spend on this topic, you can opt for different options.

You can go for a SaaS load tester, such as Gatling.io or K6.io; nothing to install, easy to start.

A second option is using open-source load testing tools. Some tools are only command-line based, such as hey or Wrk2, while others come with a web interface like Locust.

Selecting the right tool for the right test is mandatory. For the next parts, we will select and use Locust, allowing us to show visual graphs.

Instructions

Deploy an app with a REST API

Feel free to deploy any app and API that you would like to load test, as long as you can query it via REST queries.

For this tutorial, we will load test a spam classifier API from the AI Deploy app portfolio. This API takes sentences (emails) as input text and outputs a spam probability score.

You can deploy this API easily from the OVHcloud Control Panel or OVHcloud CLI. A good strategy is to deploy with autoscaling, with minimum and maximum replicas. This way, we will monitor the growth of used replicas.

Here is the CLI command used to deploy it, with autoscaling going from one to five replicas and a CPU threshold of 75%:

ovhai app run --name spamclassifier --cpu 1 \
--auto-min-replicas 1 \
--auto-max-replicas 5 \
--auto-resource-type CPU \
--auto-resource-usage-threshold 75 \
ij8m5ic1.c1.va1.container-registry.ovh.us/ai-deploy-portfolio/fastapi-spam-classification
--unsecure-http #use this attribute if you want your application to be reachable without any authentication

Verify that your API is up and running with cURL

To be able to connect to your AI Deploy app, you have to create a token bearer for your OVHcloud AI user.

Once deployed, let's test first our API with a simple cURL command. Here is the command to try in a terminal:

curl -s -X POST \
"<api_url>/spam_detection_path" \
-H "Authorization: Bearer <token>" \
-H "Content-Type: application/json" \
-d '{"message":"This is a test from my machine"}' | jq

Here is the result given by our call:

{
  "label": "ham",
  "label_probability": 0.9875114695528484
}

A few explanations on the lines above:

In the first line, we specify that we will use a POST method.
We specify the URL where the POST request will be executed. The api_url is the URL of your API. It should be similar to: https://fd5dfa12-9bce-444c-81d7-544f46c4ddcf.app.us-east-va.ai.cloud.ovh.us/.
We put the token to access our API, generated via the OVHcloud Control Panel or ovhai CLI. We specify it in the header of the request. If you want to know more about the generation and use of tokens, you can follow this tutorial.
We specify that our body is in a JSON format.
We put in our body the message we want to send to the spam classifier. In your case, the body could be different because it depends on the API. Our objective is that the spam classifier will send us the probability of each response. The last | jq instruction allows us to have a good display of the result in the terminal.

We now have the confirmation that our API is up and running, let's try to load test it.

We will simply simulate several curl commands. With the Locust tool, we can simulate several users and define the number of calls per minute to the API. This can be easily done with Locust's interface. But before using this interface, we need to launch Locust and configure the tool. This can be easily done with Python.

Install locust.io

Locust is an open-source Python package that you can install with one line of code. Follow their official documentation:

pip3 install locust

You can install it on your personal computer, but keep in mind that load testing tools will require four elements to not become the bottleneck in your load test:

Enough compute (CPU).
Enough memory (RAM).
Low latency connectivity to your API.
No "noisy neighbors", meaning no software installed that can compromise your results. Imagine your CPU power getting used by video rendering; it will bias your results.

For all these reasons, a Public Cloud instance is recommended, such as a medium-sized virtual machine. For this tutorial, we will use an OVHcloud B2-30 instance.

Configure Locust

To configure the software, you need to create a file named locustfile.py. In this file, you can put the path where you want to make your request, the headers of your request, the type of the request (POST, GET, etc) and the body if you want to add a body to the request.

A generic file will look like this:

from locust import HttpUser, task

class HelloWorldUser(HttpUser):
    @task
    def hello_world(self):
        self.client.get("/hello")
        self.client.get("/world")

For our API and our needs, the locust file will be slightly modified:

# Import the Locust dependencies
from locust import HttpUser, task

# Import general library from python
import os
import random

# Import lorem ipsum library to generate some random texts
from lorem_text import lorem

# Import dotenv to load the environments variables
from dotenv import load_dotenv
load_dotenv()

# Create a table with some lorem ipsum texts
messages = []

# Add 1000 random texts of 10 paragraphs each (simulate 1000 emails)
for i in range(1000):
    messages.append(lorem.paragraphs(10))

# Define the headers of the request. Token is stored as an environment variable here
headers = {
    "Authorization": f"Bearer {os.getenv('TOKEN')}", "Content-Type": "application/json"}

class HelloWorldUser(HttpUser):
    # Definition of the first path where we do our post request
    @task
    def hello_world(self):
        # Define the body with the email choose randomly from the tab of all the emails
        body = {"message": random.choice(messages)}
        # Do the post request on the spam detection path
        self.client.post("/spam_detection_path",
                         headers=headers, json=body)

For your own needs, you will have to change the path, the headers, and the body because these are parameters that change from one API to another.

Once your locustfile.py is ready and your token environment variable is set, launch Locust:

locust -f locustfile.py --host <api_url>

Open the Locust web interface on <http://your_IP:8089>.

The web interface should look as below:

Run your load tests

You now have your app running on OVHcloud and Locust configured. Let's simulate some user calls.

From the web interface, fill in the number of simultaneous users (API calls) and incremental step (spawn rate).

For this tutorial, we will add 480 users in total, and a spawn rate of two users added per second. We will simulate this for four minutes. We suppose that this case is for a rush on the API. Most of the time, we can assume that there aren't so many users on the platform.

Launch the test.

Interpret the results via Locust

At the end of the load test, you will see this quick summary:

If we want to get more details about our test, we can see the graphs provided by Locust in the charts tab. Here is what we can see:

We deployed this API from one to five replicas, each with one CPU, and enabled autoscaling. From the chart above, we can see that the API handled the load fairly well, with no recorded failures during the test. The request rate quickly ramped up and stabilized around 230–250 RPS.

However, we observe a consistent increase in response times over the duration of the test. The latency climbed steadily, with the 50th percentile nearing 2.5 seconds and the 95th percentile approaching 3.1 seconds by the end. This suggests the system may have been under increasing pressure, potentially before scaling events occurred.

Result interpretation may depend on your needs and performance criteria. We can make a new test with more users to see the limits of our APIs, and put several tasks in the locustfile.py.

One thing can not be seen here: OVHcloud backend scaling. We deployed our app with autoscaling, from one replica minimum to five replicas maximum. Did we use them? Were they useful and at maximum capacity?

Let's see the same results in detail with the AI Deploy monitoring tool.

Interpret the results with the AI Deploy Monitoring

Go to the OVHcloud Control Panel and get the details of your deployed application. Click on the Access Dashboards button.

This dashboard is provided for free in AI Deploy for each deployed application. All of the deployed apps are combined in a simple Grafana Dashboard.

You can select the deployed app at the top of this Grafana dashboard, as shown below:

With this dashboard, you can see the percentage of CPU used in real time, the HTTP latency of your API, the autoscaling of the app, network bandwidth, and more. Vertical blue bars show scaling events.

Here is the result for the CPU load, overall (all replicas combined):

We can see that our app has scaled a few times and hasn't reached maximum capacity usage, thanks to autoscaling. Let's now take a look at the latency of our application:

Here we see that the latency has increased gradually since we have a spawn rate of two new users per second. API latency was stable at approximately 2.55s, then peaked at 4.59s.

Again, interpretation will depend on your needs. Do we need to provide more CPU because the latency is too high? This question will vary depending on your customers need. For an anti-spam, adding one second is quite significant for a company receiving thousands of emails per day, not so disturbing if it's a dozen per day. Let's now take a look at the scaling of our application:

We can see that the threshold has been capped at 75% for the autoscaling, and this has been respected. Of the five replicas provided to the application, four have been used.

In conclusion, both Locust and AI Deploy Monitoring are useful for interpreting results but, more important than tools, is to define realistic workloads and performance criteria.

While Locust is measuring an end-to-end latency (from Locust virtual machine here, to the API model deployed), AI Deploy monitoring is only measuring the backbone latency (from the query to the answer). That's why latency values are higher on the Locust side, reaching 3.1 seconds.

Go further

Locust official documentation: Locust.io

Comparison of load testing tools: Comparison of load testing tools

For more information and tutorials, please see our other AI & Machine Learning support guides or explore the guides for other OVHcloud products and services.

If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts for a custom analysis of your project.