Rate Limits
AI Model Hub for Free: From December 1, 2024, to June 30, 2025, IONOS is offering all foundation models in the AI Model Hub for free. Create your contract today and kickstart your AI journey!
The IONOS AI Model Hub offers scalable access to inference features like Chat Completions, Image Generation, and Embeddings via an OpenAI-compatible API. The API enforces rate limits on all incoming requests to ensure performance, fairness, and availability.
This article is intended to help both developers and technical decision-makers understand:
The benefits of rate limits for IONOS and its customers
The types and values of rate limits applied
What happens when limits are exceeded
How to track limits effectively
Benefits of rate limits
Rate limits protect system integrity and ensure a reliable experience for all users by:
Preventing resource monopolization: Limits prevent users or applications from consuming disproportionate resources.
Mitigating misuse and abuse: They deter malicious activities such as scraping, brute-force attacks, or spamming.
Stabilizing performance under load: During traffic spikes, limits help maintain responsiveness and availability.
Supporting long-term scalability: Predictable request patterns enable better infrastructure planning and resource allocation.
IONOS AI Model Hub rate limits
The IONOS AI Model Hub enforces rate limits at two scopes, each with a base limit and a burst limit:
| Scope | Base limit | Burst limit | Notes |
| --- | --- | --- | --- |
| General API Usage | 5 requests/second | 10 requests/second | Applies across all endpoints per contract |
| Image Generation | 10 images/minute | 20 images/minute | Additional limit specifically for images |
Scopes explained
General API Usage: Applies to all requests across Chat Completions, Image Generation, Embeddings, Predictions, and Document Collections.
Image Generation: Because image generation is more compute-intensive, image endpoints are subject to an additional, stricter limit.
Understanding the two-tiered rate limit model
IONOS applies a two-tiered rate-limiting system consisting of a base limit and a burst limit.
Base limit: The sustained request rate you are allowed to maintain over time. Staying within this rate ensures uninterrupted access.
Burst limit: The short-term maximum number of requests allowed in a 1-second window (e.g., 10 requests/second), useful for absorbing traffic spikes.
How it works:
The base limit of 5 requests/second is continuously replenished.
You can temporarily exceed this rate by making up to 10 requests in a second, using the burst allowance.
After using a full burst (e.g., 10 requests at once), you must wait for the base tokens to replenish at 5 requests per second, so it takes 2 seconds to regain the full burst capacity.
The X-RateLimit-Remaining header (see further explanation below) shows how many requests are still allowed in the current second.
This tiered model balances flexibility and fairness by allowing short bursts without compromising system stability.
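To make the replenishment arithmetic concrete, here is a minimal sketch of such a token bucket (an illustration of the model described above, not the actual server implementation), with a capacity of 10 tokens and a refill rate of 5 tokens per second:

```python
import time

class TokenBucket:
    """Toy model of the two-tiered limit: capacity = burst, refill = base rate."""

    def __init__(self, capacity: float = 10, refill_per_second: float = 5):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Replenish continuously at the base rate, up to the burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket()
# A burst of 10 requests succeeds; the 11th is rejected until tokens refill.
print([bucket.allow() for _ in range(11)])  # [True] * 10 + [False]
time.sleep(2)                               # 2 s * 5 tokens/s restores the full burst
print(bucket.allow())                       # True
```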
Hitting rate limits
When rate limits are exceeded, the API returns an HTTP 429 Too Many Requests response.
Additionally, every API response includes headers that provide real-time information about your current rate limit usage:
| Header | Description |
| --- | --- |
| X-RateLimit-Limit | Maximum number of sustained requests allowed per second (the base limit) |
| X-RateLimit-Burst | Maximum number of requests allowed in the 1-second burst window (the burst limit) |
| X-RateLimit-Remaining | Requests remaining in the current burst window |
Example response headers (the values below are illustrative):
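```
X-RateLimit-Limit: 5
X-RateLimit-Burst: 10
X-RateLimit-Remaining: 7
```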
By monitoring these headers, clients can adjust their request rates dynamically and avoid unintentionally breaching limits.
Mitigation strategies and best practices
To avoid hitting rate limits and ensure smoother operation of your applications, consider the following mitigation strategies and best practices:
1. Monitor quota in real time
Use rate limit headers to track your remaining quota.
Distribute requests evenly rather than sending them all at once.
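For example, a minimal sketch assuming an OpenAI-compatible Chat Completions endpoint; the URL, model name, and token variable are placeholders to replace with the values from your contract:

```python
import os
import requests

# Placeholder endpoint and model; substitute the values from your contract.
API_URL = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['IONOS_API_TOKEN']}"}
payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # example model name
    "messages": [{"role": "user", "content": "Hello"}],
}

response = requests.post(API_URL, headers=headers, json=payload)

# Every response carries the current quota state in its headers.
print(
    "limit:", response.headers.get("X-RateLimit-Limit"),
    "burst:", response.headers.get("X-RateLimit-Burst"),
    "remaining:", response.headers.get("X-RateLimit-Remaining"),
)
```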
2. Implement client-side throttling
Limit outgoing requests based on rate thresholds.
Use algorithms like token bucket or leaky bucket to space out requests.
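A minimal client-side throttle sketch that spaces calls so outgoing traffic stays at or below the 5 requests/second base rate:

```python
import threading
import time

class Throttle:
    """Blocks callers so that at most `rate` calls per second go out."""

    def __init__(self, rate: float = 5.0):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self) -> None:
        with self.lock:
            now = time.monotonic()
            if self.next_slot < now:
                self.next_slot = now
            delay = self.next_slot - now
            self.next_slot += self.interval
        if delay > 0:
            time.sleep(delay)

throttle = Throttle(rate=5.0)
for i in range(10):
    throttle.wait()  # spaces calls 0.2 s apart
    print(time.monotonic(), "request", i)  # replace with the real API call
```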
3. Use exponential backoff
Avoid retrying immediately after a 429 response.
Implement exponential backoff logic that increases the delay after each retry attempt.
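A sketch of a retry wrapper; the starting delay, doubling factor, and jitter are illustrative choices rather than values prescribed by the API:

```python
import random
import time

import requests

def post_with_backoff(url, *, max_retries=5, base_delay=0.5, **kwargs):
    """Retry on HTTP 429, doubling the delay (plus jitter) after each attempt."""
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
        time.sleep(delay)
    # Hand the caller the final 429 response rather than raising here.
    return response
```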
4. Build in resilience
Log failed and retried requests to detect rate-limit issues and system bottlenecks.
Ensure your system can gracefully handle rate-limiting responses and temporary interruptions.
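For instance, a small helper (a sketch, assuming the `requests` response interface) that logs every rate-limited response together with its quota headers:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ratelimit")

def log_rate_limit(response) -> None:
    """Record 429s with the quota headers for later analysis."""
    if response.status_code == 429:
        logger.warning(
            "Rate limited: limit=%s burst=%s remaining=%s",
            response.headers.get("X-RateLimit-Limit"),
            response.headers.get("X-RateLimit-Burst"),
            response.headers.get("X-RateLimit-Remaining"),
        )
```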
5. Perform load testing
Simulate production traffic to identify potential rate-limit issues.
Validate throttling and retry logic under high load conditions.
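A minimal load-test sketch using a thread pool; the URL and token are placeholders, and in practice you would reuse your real request payload and client:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://api.example.com/v1/models"  # placeholder; use your endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder token

def send_request(_):
    return requests.get(API_URL, headers=HEADERS, timeout=10).status_code

# Fire 100 requests from 20 workers and tally the status codes.
with ThreadPoolExecutor(max_workers=20) as pool:
    statuses = Counter(pool.map(send_request, range(100)))

print(statuses)  # a healthy run shows mostly 200s; 429s mean you hit the limit
```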
Summary
Rate limits are essential for maintaining the reliability and scalability of your AI-powered applications. By understanding how the IONOS AI Model Hub enforces these limits, you can optimize usage, plan for peak loads, and build resilient integrations.
For more, explore the official API documentation or consider integrating real-time quota monitoring into your observability stack.