Rate Limits

The IONOS AI Model Hub offers scalable access to inference features like Chat Completions, Image Generation, and Embeddings via an OpenAI-compatible API. The API enforces rate limits on all incoming requests to ensure performance, fairness, and availability.

This article is intended to help both developers and technical decision-makers understand:

  • The benefits of rate limits for IONOS and its customers,

  • The types and values of rate limits applied,

  • What happens when limits are exceeded,

  • How to track rate limit usage effectively.

Benefits of rate limits

Rate limits protect system integrity and ensure a reliable experience for all users by:

  • Preventing resource monopolization: Limits prevent users or applications from consuming disproportionate resources.

  • Mitigating misuse and abuse: They deter malicious activities such as scraping, brute-force attacks, or spamming.

  • Stabilizing performance under load: During traffic spikes, limits help maintain responsiveness and availability.

  • Supporting long-term scalability: Predictable request patterns enable better infrastructure planning and resource allocation.

IONOS AI Model Hub rate limits

The IONOS AI Model Hub enforces the following rate limits:

Endpoint/Scope       Base Limit            Burst Limit            Notes
General API Usage    5 requests/second     10 requests/second     Applies across all endpoints per contract
Image Generation     10 images/minute      20 images/minute       Additional limit specifically for images

Scopes explained

  • General API Usage: Applies to all requests across Chat Completions, Image Generation, Embeddings, Predictions, and Document Collections.

  • Image Generation: Because image generation is more compute-intensive, image endpoints are subject to an additional, stricter limit.

Contract-based enforcement: Rate limits are enforced per IONOS contract, not per user or endpoint. All activity under the same contract counts toward shared quotas.

Understanding the two-tiered rate limit model

IONOS applies a two-tiered rate-limiting system consisting of a base limit and a burst limit.

  • Base limit: The sustained request rate you are allowed to maintain over time. Staying within this rate ensures uninterrupted access.

  • Burst limit: The short-term maximum number of requests allowed in a 1-second window (e.g., 10 requests/second), useful for absorbing traffic spikes.

How it works:

  • The base limit of 5 requests/second is continuously replenished.

  • You can temporarily exceed this rate by making up to 10 requests in a second, using the burst allowance.

  • After using a full burst (e.g., 10 requests at once), you must wait for the base tokens to replenish at 5 requests per second, so it takes 2 seconds to regain the full burst capacity.

  • The X-RateLimit-Remaining header (see further explanation below) shows how many requests are still allowed in the current second.

This tiered model balances flexibility and fairness by allowing short bursts without compromising system stability.
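
To make the mechanics concrete, the behavior above can be modeled as a token bucket whose capacity is the burst limit (10) and whose refill rate is the base limit (5 tokens/second). The Python sketch below is purely illustrative; the class and names are our own, not part of an IONOS SDK, and it only mirrors the observable behavior described above:

import time

class TokenBucket:
    """Illustrative token bucket: capacity = burst limit, refill rate = base limit."""

    def __init__(self, rate=5.0, capacity=10):
        self.rate = rate          # tokens replenished per second (base limit)
        self.capacity = capacity  # maximum stored tokens (burst limit)
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def try_consume(self):
        now = time.monotonic()
        # Replenish at the base rate, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # out of tokens: the API would answer HTTP 429

bucket = TokenBucket()
# The first 10 calls succeed (full burst); the 11th fails until tokens
# replenish at 5/second, so full burst capacity returns after 2 seconds.
print([bucket.try_consume() for _ in range(11)])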

Hitting rate limits

When rate limits are exceeded, the API returns an HTTP 429 Too Many Requests response.

Additionally, every API response includes headers that provide real-time information about your current rate limit usage:

Header                   Description
X-RateLimit-Limit        Maximum number of requests allowed per minute.
X-RateLimit-Burst        Maximum number of requests allowed in a 1-second burst window.
X-RateLimit-Remaining    Requests remaining in the current burst window.

Some limits are reported as minute-based values for easier tracking, but enforcement occurs per second: the 5 requests/second base limit, for example, is reported as X-RateLimit-Limit: 300 (5 × 60).

Example response headers:

X-RateLimit-Limit: 300
X-RateLimit-Burst: 10
X-RateLimit-Remaining: 9

By monitoring these headers, clients can adjust their request rates dynamically and avoid unintentionally breaching limits.
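
For example, a client might inspect these headers after each call and slow down as the remaining quota shrinks. The snippet below is a minimal sketch using the Python requests library; the endpoint URL, API token, and model name are placeholders you must replace with your own values:

import requests

# Placeholders: substitute your actual endpoint, API token, and model name.
API_URL = "https://<your-ai-model-hub-endpoint>/v1/chat/completions"
HEADERS = {"Authorization": "Bearer <YOUR_API_TOKEN>"}

resp = requests.post(
    API_URL,
    headers=HEADERS,
    json={"model": "<model-name>", "messages": [{"role": "user", "content": "Hello"}]},
)

burst = int(resp.headers.get("X-RateLimit-Burst", "10"))
remaining = int(resp.headers.get("X-RateLimit-Remaining", "0"))

if resp.status_code == 429:
    print("Rate limited: back off before retrying.")
elif remaining <= burst // 5:
    print(f"Only {remaining} of {burst} burst requests left; slowing down.")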

Mitigation strategies and best practices

To avoid hitting rate limits and ensure smoother operation of your applications, consider the following mitigation strategies and best practices:

1. Monitor quota in real time

  • Use rate limit headers to track your remaining quota.

  • Distribute requests evenly rather than sending them all at once.

2. Implement client-side throttling

  • Limit outgoing requests based on rate thresholds.

  • Use algorithms like token bucket or leaky bucket to space out requests (a minimal spacing throttle is sketched below).
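
A spacing throttle can be as simple as enforcing a minimum interval between outgoing calls; 0.2 seconds corresponds to the 5 requests/second base limit. The decorator below is an illustrative sketch, not a library API:

import time

def throttled(min_interval=0.2):
    """Enforce a minimum interval between calls (0.2 s ≈ 5 requests/second)."""
    def decorator(func):
        last_call = [0.0]  # mutable cell so the wrapper can update it
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.monotonic() - last_call[0])
            if wait > 0:
                time.sleep(wait)  # space requests out instead of bursting
            last_call[0] = time.monotonic()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@throttled(min_interval=0.2)
def call_api():
    ...  # issue a single request to the AI Model Hub here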

3. Use exponential backoff

  • Avoid retrying immediately after a 429 response.

  • Implement exponential backoff logic that increases the delay after each retry attempt, as sketched below.
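
A minimal backoff helper might look like this; send_request stands for any function that performs one API call and returns a response object (a hypothetical name, used only for illustration):

import random
import time

def request_with_backoff(send_request, max_retries=5):
    """Retry on HTTP 429, doubling the delay each attempt and adding jitter."""
    for attempt in range(max_retries):
        resp = send_request()
        if resp.status_code != 429:
            return resp
        delay = 2 ** attempt + random.uniform(0, 1)  # 1 s, 2 s, 4 s, ... plus jitter
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries")

The random jitter prevents many clients from retrying in lockstep after a shared traffic spike.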

4. Build in resilience

  • Log failed and retried requests to detect rate-limit issues and system bottlenecks (see the logging sketch below).

  • Ensure your system can gracefully handle rate-limiting responses and temporary interruptions.
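
A lightweight way to surface rate-limit pressure is to log every 429 together with the headers described above. The helper below is a sketch using Python's standard logging module; the logger name and function are illustrative:

import logging

logger = logging.getLogger("ai-model-hub.client")

def record_rate_limit(resp):
    """Log 429 responses so rate-limit pressure shows up in your logs."""
    if resp.status_code == 429:
        logger.warning(
            "Rate limited; remaining=%s, burst=%s",
            resp.headers.get("X-RateLimit-Remaining"),
            resp.headers.get("X-RateLimit-Burst"),
        )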

5. Perform load testing

  • Simulate production traffic to identify potential rate-limit issues.

  • Validate throttling and retry logic under high load conditions, for example with the concurrent smoke test below.
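
A simple concurrent smoke test is often enough to verify that your throttling and backoff hold up; send_request is again a hypothetical stand-in for one API call:

from concurrent.futures import ThreadPoolExecutor

def smoke_test(send_request, total=50, workers=10):
    """Fire `total` concurrent requests and count how many are rate limited."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(lambda _: send_request(), range(total)))
    limited = sum(1 for r in responses if r.status_code == 429)
    print(f"{limited}/{total} requests answered with HTTP 429")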

Summary

Rate limits are essential for maintaining the reliability and scalability of your AI-powered applications. By understanding how the IONOS AI Model Hub enforces these limits, you can optimize usage, plan for peak loads, and build resilient integrations.

For more, explore the official API documentation or consider integrating real-time quota monitoring into your observability stack.
