# Llama 3.1 405B

**Summary:** Llama 3.1 405B is Meta's flagship open-source large language model, offering strong reasoning abilities and broad knowledge coverage. It is suited to the most demanding AI applications, including advanced research, complex problem-solving, sophisticated content creation, and enterprise-grade AI solutions, where maximum intelligence and accuracy matter more than the longer inference times inherent to a model of this scale.

|                                                                       **Intelligence**                                                                      |                   **Speed**                  |                   **Sovereignty**                  |                                                                 **Input**                                                                 |                                                                 **Output**                                                                |
| :---------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------: | :------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------------------------------------------------------------------------: |
| ![Intelligence active](/files/dnDi7yuqXqkBFqwaxdnm) ![Intelligence active](/files/dnDi7yuqXqkBFqwaxdnm) ![Intelligence active](/files/dnDi7yuqXqkBFqwaxdnm) | ![Speed active](/files/evfYW3bq4dTBLlZH3dQf) | ![Sovereignty active](/files/bNpzGRJfez9SidEjNCoy) | ![Text active](/files/45qlqURbT8c2Ekr8HJfK) ![Image inactive](/files/0mPVwOtrYhZrpz9clC3D) ![Audio inactive](/files/PRglWWEC5Zoc5fgynNLM) | ![Text active](/files/45qlqURbT8c2Ekr8HJfK) ![Image inactive](/files/0mPVwOtrYhZrpz9clC3D) ![Audio inactive](/files/PRglWWEC5Zoc5fgynNLM) |
|                                                                            *High*                                                                           |                     *Low*                    |                        *Low*                       |                                                                   *Text*                                                                  |                                                                   *Text*                                                                  |

## Central parameters

**Description:** Largest open-source model from Meta with 405B parameters, optimized with FP8 quantization for maximum intelligence and knowledge coverage.

**Model identifier:** `meta-llama/Meta-Llama-3.1-405B-Instruct-FP8`

## IONOS AI Model Hub Lifecycle and Alternatives

| **IONOS Launch** | **End of Life** |                                                                                                          **Alternative**                                                                                                         | **Successor** |
| :--------------: | :-------------: | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :-----------: |
| *August 1, 2024* |       N/A       | [<mark style="color:blue;">**Llama 3.3 (70B)**</mark>](/cloud/ai/ai-model-hub/models/llms/meta-llama-3-3-70b.md), [<mark style="color:blue;">**GPT-OSS 120B**</mark>](/cloud/ai/ai-model-hub/models/llms/openai-gpt-oss-120b.md) |               |

## Origin

|                            **Provider**                            | **Country** |                                       **License**                                      | **Flavor** |   **Release**   |
| :----------------------------------------------------------------: | :---------: | :------------------------------------------------------------------------------------: | :--------: | :-------------: |
| [<mark style="color:blue;">**Meta**</mark>](https://www.meta.com/) |     USA     | [<mark style="color:blue;">**License**</mark>](https://llama.meta.com/llama3/license/) |  Instruct  | *July 23, 2024* |

## Technology

| **Context window** | **Parameters** | **Quantization** | **Multilingual** |                                               **Further details**                                              |
| :----------------: | :------------: | :--------------: | :--------------: | :------------------------------------------------------------------------------------------------------------: |
|       *128k*       |     *405B*     |      *FP8*       |       *Yes*      | [<mark style="color:blue;">**Hugging Face**</mark>](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) |

## Modalities

|     **Text**     |   **Image**   |   **Audio**   |
| :--------------: | :-----------: | :-----------: |
| Input and output | Not supported | Not supported |

## Endpoints

| **Chat Completions** | **Embeddings** | **Image generation** |
| :------------------: | :------------: | :------------------: |
|  v1/chat/completions |  Not supported |     Not supported    |

## Features

| **Streaming** | **Reasoning** | **Tool calling** |
| :-----------: | :-----------: | :--------------: |
|   Supported   | Not supported |     Supported    |

## Usage example

### Chat completions

The following example demonstrates how to use **Llama 3.1 405B** for complex reasoning tasks.

**API Endpoint:** `POST https://openai.inference.de-txl.ionos.com/v1/chat/completions`

**Request:**

```json
{
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Explain the concept of quantum entanglement to a 5-year-old using simple analogies."
    }
  ],
  "temperature": 0.7,
  "max_tokens": 100
}
```

**Response:**

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Imagine you have two magic dice. No matter how far apart they are—even if one is on Earth and the other is on Mars—if you roll a 6 on one, the other one will instantly show a 6 too! They are connected in a special way that lets them 'talk' to each other instantly."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 60,
    "total_tokens": 85
  }
}
```
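Because the endpoint follows the OpenAI chat completions schema, the request above can be assembled with Python's standard library alone. The following is a minimal sketch; the helper name and token placeholder are illustrative, and the actual send is left out so the example stays offline:

```python
import json
import urllib.request

API_URL = "https://openai.inference.de-txl.ionos.com/v1/chat/completions"

def build_request(token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated POST request for the chat completions endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # your IONOS API token
            "Content-Type": "application/json",
        },
        method="POST",
    )

payload = {
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "temperature": 0.7,
    "max_tokens": 100,
}
req = build_request("<your-token>", payload)
# urllib.request.urlopen(req) would send the request and return JSON in the
# structure shown above; sending is omitted here to keep the sketch offline.
```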

## Troubleshooting

### Infinite or repetitive response loops

Llama 3.1 405B can produce repetitive output that does not terminate naturally. When `max_tokens` is set to the context window maximum (128000) or left unconfigured, the model keeps generating until it hits that context window ceiling.

**Recommended mitigations:**

1. **Set max\_tokens explicitly.** Avoid setting `max_tokens` (or `max_completion_tokens`) to the full context window (128000). Instead, limit the value to match your specific use case:

   | **Use case**                               | **Recommended value** |
   | ------------------------------------------ | :-------------------: |
   | Conversational use and short responses     |          2048         |
   | Detailed analysis and code generation      |          8192         |
   | Long-form documents and research summaries |         16384         |

{% hint style="info" %}
**Note:** Use values exceeding 16384 only when the task strictly requires them. When doing so, always implement stop sequences (see Step 2) to ensure the model terminates correctly.
{% endhint %}

2. **Add explicit stop sequences.** Add both Llama 3 end-of-turn tokens as stop strings in your request:

```json
"stop": ["<|eot_id|>", "<|end_of_text|>"]
```

3. **Use sampling instead of greedy decoding.** Avoid combining `temperature: 0` with ambiguous or contradictory prompts, as greedy decoding can get stuck repeating the most likely token sequence. Instead, use the following:

   ```json
   "temperature": 0.6,
   "top_p": 0.9
   ```
4. **Apply a frequency penalty.** Setting `frequency_penalty` to a small positive value reduces the likelihood of the model repeating the same tokens. A value between `0.1` and `0.3` is effective for most use cases.

   ```json
   "frequency_penalty": 0.1
   ```

**Example request with all mitigations applied:**

```json
{
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "messages": [{ "role": "user", "content": "Hello" }],
  "max_tokens": 2048,
  "temperature": 0.6,
  "top_p": 0.9,
  "frequency_penalty": 0.1,
  "stop": ["<|eot_id|>", "<|end_of_text|>"]
}
```
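The four mitigations above can be bundled into a small request builder. The following Python sketch is illustrative (the helper name is hypothetical); its defaults mirror the recommended values:

```python
import json

def chat_payload(prompt: str, max_tokens: int = 2048) -> dict:
    """Build a chat completion payload with all loop mitigations applied."""
    return {
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,    # bounded, never the full 128k window
        "temperature": 0.6,          # sampling instead of greedy decoding
        "top_p": 0.9,
        "frequency_penalty": 0.1,    # discourages token repetition
        "stop": ["<|eot_id|>", "<|end_of_text|>"],  # Llama 3 end-of-turn tokens
    }

print(json.dumps(chat_payload("Hello"), indent=2))
```

Raising `max_tokens` per use case (for example, `chat_payload(prompt, max_tokens=8192)` for code generation) keeps the other safeguards in place.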

## Rate limits

Rate limits ensure fair usage and reliable access to the AI Model Hub. Beyond the [<mark style="color:blue;">contract-wide rate limits</mark>](/cloud/ai/ai-model-hub/how-tos/rate-limits.md), no model-specific limits apply.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.ionos.com/cloud/ai/ai-model-hub/models/llms/meta-llama-3-1-405b.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
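Questions containing spaces or punctuation need to be URL-encoded before being passed as the `ask` parameter. A minimal Python sketch (the helper name is illustrative):

```python
from urllib.parse import urlencode

BASE = "https://docs.ionos.com/cloud/ai/ai-model-hub/models/llms/meta-llama-3-1-405b.md"

def ask_url(question: str) -> str:
    """Return the documentation query URL with the question URL-encoded."""
    return f"{BASE}?{urlencode({'ask': question})}"

print(ask_url("What is the context window of Llama 3.1 405B?"))
```

An HTTP GET on the printed URL returns a direct answer with supporting excerpts, as described above.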
