Llama 3.1 405B

Summary: Llama 3.1 405B is Meta's flagship large language model representing the pinnacle of open-source AI capabilities with exceptional reasoning abilities and comprehensive knowledge coverage. This massive model excels in the most demanding AI applications including advanced research, complex problem-solving, sophisticated content creation, and enterprise-grade AI solutions where maximum intelligence and accuracy are paramount, despite longer inference times inherent to its large-scale architecture.

Intelligence

Speed

Sovereignty

Input

Output

High

Low

Text

Central parameters

Description: Largest open-source model from Meta with 405B parameters, optimized with FP8 quantization for maximum intelligence and knowledge coverage.

Model identifier: meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

IONOS AI Model Hub Lifecycle and Alternatives

IONOS Launch

End of Life

Alternative

Successor

August 1, 2024

N/A

Llama 3.3 (70B), GPT-OSS 120B

Origin

Provider

Country

License

Flavor

Release

Meta

USA

License

Instruct

July 23, 2024

Technology

Context window

Parameters

Quantization

Multilingual

Further details

128k

406B

int4

Yes

Hugging Face

Modalities

Text

Image

Audio

Input and output

Not supported

Endpoints

Chat Completions

Embeddings

Image generation

v1/chat/completions

Not supported

Features

Streaming

Reasoning

Tool calling

Supported

Not supported

Supported

Usage example

Chat completions

The following example demonstrates how to use Llama 3.1 405B for complex reasoning tasks.

API Endpoint: POST https://openai.inference.de-txl.ionos.com/v1/chat/completions

Request:

{
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "messages": [
    {
      "role": "user",
      "content": "Explain the concept of quantum entanglement to a 5-year-old using simple analogies."
    }
  ],
  "temperature": 0.7,
  "max_tokens": 100
}

Response:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Imagine you have two magic dice. No matter how far apart they are—even if one is on Earth and the other is on Mars—if you roll a 6 on one, the other one will instantly show a 6 too! They are connected in a special way that lets them 'talk' to each other instantly."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 60,
    "total_tokens": 85
  }
}

Troubleshooting

Infinite or repetitive response loops

Llama 3.1 405B can produce repetitive output that does not terminate naturally. When "max_tokens" is set to the context window maximum (128000) or left unconfigured, the model continues generating until it hits that context window ceiling.

Recommended mitigations:

Set max_tokens explicitly. Avoid setting max_tokens (or max_completion_tokens) to the full context window (128000). Instead, limit the value to match your specific use case:
Use case
Recommended value
Conversational use and short responses
2048
Detailed analysis and code generation
8192
Long-form documents and research summaries
16384

Note: Use values exceeding 16384 only when the task strictly requires them. When doing so, always implement stop sequences (see Step 2) to ensure the model terminates correctly.

Add explicit stop sequences. Add both Llama 3 end-of-turn tokens as stop strings in your request:

"stop": ["<|eot_id|>", "<|end_of_text|>"]

Use sampling instead of greedy decoding. Avoid combining temperature: 0 with ambiguous or contradictory prompts, as this often triggers infinite loops. Instead, use the following:
```
"temperature": 0.6,
"top_p": 0.9
```
Apply a frequency penalty. Setting frequency_penalty to a small positive value reduces the likelihood of the model repeating the same tokens. A value between 0.1 and 0.3 is effective for most use cases.
```
"frequency_penalty": 0.1
```

Example request with all mitigations applied:

{
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
  "messages": [{ "role": "user", "content": "Hello" }],
  "max_tokens": 2048,
  "temperature": 0.6,
  "top_p": 0.9,
  "frequency_penalty": 0.1,
  "stop": ["<|eot_id|>", "<|end_of_text|>"]
}

Rate limits

Rate limits ensure fair usage and reliable access to the AI Model Hub. In addition to the contract-wide rate limits, no model-specific limits apply.

PreviousGPT-OSS 120B NextCoding Models

Last updated 2 days ago

Was this helpful?

Good night

hashtagCentral parameters

hashtagIONOS AI Model Hub Lifecycle and Alternatives

hashtagOrigin

hashtagTechnology

hashtagModalities

hashtagEndpoints

hashtagFeatures

hashtagUsage example

hashtagChat completions

hashtagTroubleshooting

hashtagInfinite or repetitive response loops

hashtagRate limits

Central parameters

IONOS AI Model Hub Lifecycle and Alternatives

Origin

Technology

Modalities

Endpoints

Features

Usage example

Chat completions

Troubleshooting

Infinite or repetitive response loops

Rate limits