# Rate Limits

The IONOS AI Model Hub offers scalable access to inference features like Chat Completions, Image Generation, and Embeddings through an OpenAI-compatible API. The API enforces rate limits on all incoming requests to ensure performance, fairness, and availability.

This article is intended to help both developers and technical decision-makers understand:

* The benefits of rate limits for IONOS and its customers,
* The types and values of rate limits applied,
* What happens when limits are exceeded,
* And how to track limits effectively.

## Benefits of rate limits

Rate limits protect system integrity and ensure a reliable experience for all users by:

* **Preventing resource monopolization**: Limits prevent users or applications from consuming disproportionate resources.
* **Mitigating misuse and abuse**: They deter malicious activities such as scraping, brute-force attacks, or spamming.
* **Stabilizing performance under load**: During traffic spikes, limits help maintain responsiveness and availability.
* **Supporting long-term scalability**: Predictable request patterns enable better infrastructure planning and resource allocation.

## IONOS AI Model Hub rate limits

The IONOS AI Model Hub enforces the following rate limits:

| Endpoint/Scope    | Base Limit        | Burst Limit        | Notes                                     |
| ----------------- | ----------------- | ------------------ | ----------------------------------------- |
| General API Usage | 5 requests/second | 10 requests/second | Applies across all endpoints per contract |
| Image Generation  | 10 images/minute  | 20 images/minute   | Additional limit specifically for images  |

### Scopes explained

* **General API Usage**: Applies to all requests across Chat Completions, Image Generation, Embeddings, Predictions, and Document Collections.
* **Image Generation**: Due to higher compute demands, image endpoints have a stricter, more conservative rate limit.

{% hint style="info" %}
**Contract-based enforcement:** Rate limits are enforced per IONOS contract, not per user or endpoint. All activity under the same contract counts toward shared quotas.
{% endhint %}

### Understanding the two-tiered rate limit model

IONOS applies a two-tiered rate-limiting system consisting of a **base limit** and a **burst limit**.

* **Base limit**: The sustained request rate you are allowed to maintain over time. Staying within this rate ensures uninterrupted access.
* **Burst limit**: The maximum number of requests allowed within a 2-second window, accommodating sudden traffic spikes that exceed the sustained rate. Example: up to 10 requests in a single 2-second burst.

**How it works:**

* The base limit of 5 requests/second is continuously replenished.
* You can temporarily exceed this rate by making up to 10 requests in a 2-second window, using the burst allowance.
* After using a full burst (e.g., 10 requests at once), you must wait for the base tokens to replenish at 5 requests per second, so it takes 2 seconds to regain the full burst capacity.
* The **X-RateLimit-Remaining** header shows how many requests are still available in the current 2-second burst window. For more information, see [<mark style="color:blue;">Hitting rate limits</mark>](#hitting-rate-limits).

This tiered model balances flexibility and fairness by allowing short bursts without compromising system stability.
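The interaction between the base and burst limits can be sketched as a token bucket: the bucket holds at most the burst capacity and refills at the base rate. The following minimal Python sketch (using the documented values of 5 requests/second and a 10-request burst; the function name is illustrative) shows why it takes 2 seconds to regain full burst capacity after spending it all at once:

```python
# Two-tiered limit modeled as a token bucket.
# Assumed values from the table above: base 5 req/s, burst capacity 10.
BASE_RATE = 5.0    # tokens replenished per second (base limit)
BURST_CAP = 10.0   # maximum tokens (full 2-second burst)

def tokens_after(burst_used: float, seconds_elapsed: float) -> float:
    """Tokens available after spending `burst_used` tokens and waiting."""
    remaining = BURST_CAP - burst_used
    return min(BURST_CAP, remaining + BASE_RATE * seconds_elapsed)

# After a full burst of 10 requests:
print(tokens_after(10, 0))  # 0.0  -> further requests would be rejected
print(tokens_after(10, 1))  # 5.0  -> half the burst capacity is back
print(tokens_after(10, 2))  # 10.0 -> full burst capacity restored
```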

## Hitting rate limits

When rate limits are exceeded, the API returns an `HTTP 429 Too Many Requests` response.

Additionally, every API response includes headers that provide real-time information about your current rate limit usage:

| Header                  | Description                                                                                                                                                                                               |
| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `X-RateLimit-Limit`     | The maximum number of requests per minute (example: 300 RPM). Over short windows, the limit is enforced at per-second resolution (see `X-RateLimit-Burst`); the actual value depends on your contract. |
| `X-RateLimit-Burst`     | Maximum number of requests allowed in a 2-second burst window. The actual value depends on your contract.                                                                                                 |
| `X-RateLimit-Remaining` | The number of requests remaining in the current 2-second burst window. This value resets on window expiration and operates independently of the per-minute limit.                                         |
| `Retry-After`           | The number of seconds the client must wait before retrying the request. This header appears only in `429 Too Many Requests` responses.                                                                    |

{% hint style="info" %}
**Note:** Some limits are reported using minute-based values for easier tracking, but enforcement occurs per second.
{% endhint %}

Example response headers:

```bash
X-RateLimit-Limit: 300
X-RateLimit-Burst: 10
X-RateLimit-Remaining: 9
```

By monitoring these headers, clients can adjust their request rates dynamically and avoid unintentionally breaching limits.
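As a minimal sketch of such monitoring, the snippet below inspects the headers from the example above and decides whether the client should pause before the next request. The header names come from the table; the threshold and function name are illustrative assumptions, not part of the API:

```python
def should_pause(headers: dict) -> bool:
    """Return True when the current burst window is nearly exhausted.
    The threshold of 1 remaining request is an illustrative choice."""
    # Treat a missing header conservatively (pause rather than risk a 429).
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    return remaining <= 1

# Headers from the example response above:
headers = {
    "X-RateLimit-Limit": "300",
    "X-RateLimit-Burst": "10",
    "X-RateLimit-Remaining": "9",
}
print(should_pause(headers))  # False -> plenty of burst capacity left
```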

## Mitigation strategies and best practices

To avoid hitting rate limits and ensure smoother operation of your applications, consider the following mitigation strategies and best practices:

**1. Monitor quota in real time**

* Use rate limit headers to track your remaining quota.
* Distribute requests evenly rather than sending them all at once.

**2. Implement client-side throttling**

* Limit outgoing requests based on rate thresholds.
* Use algorithms like token bucket or leaky bucket to space out requests.
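A client-side token bucket can be sketched in a few lines of Python. The rate and capacity values below mirror the documented limits but should be matched to your own contract; the class and method names are illustrative:

```python
import time

class TokenBucket:
    """Client-side token-bucket throttle. Values are illustrative;
    set rate/capacity to your contract's base and burst limits."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum bucket size (burst allowance)
        self.tokens = capacity
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a token is available, then consume it."""
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at the burst capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)

bucket = TokenBucket(rate=5, capacity=10)
for _ in range(3):
    bucket.acquire()  # each outgoing API call would follow an acquire()
```

Calling `acquire()` before every request spaces traffic to the sustained rate while still permitting short bursts up to the bucket's capacity.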

**3. Use the `Retry-After` header**

* When you receive a `429` response, read the `Retry-After` header value and wait the specified number of seconds before resubmitting the request.
* If the `Retry-After` value is missing from the header, use an exponential backoff strategy to increase the delay after each failed attempt.
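The two rules above can be combined into one retry helper. This is a sketch, not an official client: `send` stands in for any callable returning an object with `status_code` and `headers` attributes (a hypothetical shape mirroring common HTTP client libraries), and the attempt cap is an illustrative choice:

```python
import time

def call_with_retries(send, max_attempts: int = 5):
    """Retry on HTTP 429, honoring Retry-After when present and
    falling back to exponential backoff when it is missing."""
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        retry_after = response.headers.get("Retry-After")
        # Prefer the server's advice; otherwise back off exponentially.
        delay = float(retry_after) if retry_after else 2 ** attempt
        time.sleep(delay)
    raise RuntimeError("rate limit retries exhausted")
```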

**4. Build in resilience**

* Log failed and retried requests to detect rate-limit issues and system bottlenecks.
* Ensure your system can gracefully handle rate-limiting responses and temporary interruptions.

**5. Perform load testing**

* Simulate production traffic to identify potential rate-limit issues.
* Validate throttling and retry logic under high load conditions.

## Summary

Rate limits are essential for maintaining the reliability and scalability of your AI-powered applications. By understanding how the IONOS AI Model Hub enforces these limits, you can optimize usage, plan for peak loads, and build resilient integrations.

For more, explore the official API documentation or consider integrating real-time quota monitoring into your observability stack.

* [<mark style="color:blue;">AI Model Hub API Documentation</mark>](https://api.ionos.com/docs/inference-modelhub/v1/)
* [<mark style="color:blue;">AI Model Hub OpenAI-compatible API Documentation</mark>](https://api.ionos.com/docs/inference-openai/v1/)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.ionos.com/cloud/ai/ai-model-hub/how-tos/rate-limits.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
