Generate conversational responses with streaming support and multimodal capabilities.
This endpoint generates chat completions using language models, supporting both text-only conversations and vision/multimodal inputs. It also supports function calling, streaming responses, and automatic conversation tracking.
Request Body:
- model (required): Model ID - e.g., nugen-flash-instruct; use a vision model for multimodal input
- messages (required, minimum 1): Array of message objects, each containing:
  - role: Message role (system, user, or assistant)
  - content: Text string OR array of content objects (for multimodal):
    - {"type": "text", "text": "your message"}
    - {"type": "image_url", "image_url": {"url": "https://..."}}
    - {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}}
  - name (optional): Author name (a-z, A-Z, 0-9, underscores, max 64 chars)
- max_tokens (optional): Maximum number of tokens to generate in the completion
- prompt_truncate_len (optional): Size to which to truncate chat prompts (default: 1500)
- temperature (optional): Sampling temperature between 0 and 2 (default: 1)
- stream (optional): Enable streaming responses (default: false)
- tools (optional): List of tools/functions available to the model
- tool_choice (optional): Tool selection mode (auto, none, or a specific tool)
- top_p (optional): Nucleus sampling parameter
- top_k (optional): Top-k sampling parameter
- n (optional): Number of completions to generate (default: 1)
- reasoning (optional): Reasoning configuration object with:
  - effort: Reasoning effort level (xhigh, high, medium, low, minimal, none)
  - max_tokens: Token limit for reasoning
  - exclude: Set true to exclude reasoning tokens from the response
  - enabled: Enable reasoning with default parameters
Optional Headers:
- X-Session-ID: Session identifier for multi-turn conversation tracking
Returns:
Non-streaming mode - complete response with:
- id: Unique identifier for the response
- object: Object type (always chat.completion)
- created: Unix timestamp when the completion was created
- model: Model used for the chat completion
- choices: List of completion choices, each containing:
  - index: Index of the choice
  - message: Response message with:
    - role: Role of the author (always assistant)
    - content: Generated response text
    - tool_calls (optional): Tool calls made by the model (for function calling)
  - finish_reason: Reason the model stopped (stop for a natural stop, length if max tokens was reached)
- usage: Token usage statistics:
  - prompt_tokens: Number of tokens in the prompt
  - completion_tokens: Number of tokens generated
  - total_tokens: Total tokens used (prompt + completion)
- confidence_score (optional): Confidence score from Domain-Aligned AI models
Streaming mode - ChatCompletionChunk stream with:
- id: Unique identifier
- created: Timestamp
- model: Model ID
- choices: List of chunk choices with:
  - index: Choice index
  - delta: Delta content with role and content
  - finish_reason: Reason for stopping (only in the final chunk)
- usage (optional): Only present in the final chunk
Example Request (Text Chat):
POST /api/v3/inference/chat/completions
Headers: {"Authorization": "Bearer <api_key>"}
{
"model": "nugen-flash-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 500,
"temperature": 0.7
}
Example Response (Text Chat):
{
"id": "nugen-abc123",
"object": "chat.completion",
"created": 1704123600.0,
"model": "nugen-flash-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 20,
"completion_tokens": 8,
"total_tokens": 28
},
"confidence_score": 89.5454
}
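The request body above can be assembled programmatically. A minimal sketch in Python; the build_chat_payload helper is hypothetical (not part of any SDK) and simply mirrors the Request Body fields documented above:

```python
import json

def build_chat_payload(model, messages, **options):
    """Assemble a chat-completions request body from the documented
    fields (hypothetical helper, not an official SDK)."""
    payload = {"model": model, "messages": messages}
    payload.update(options)  # max_tokens, temperature, stream, tools, ...
    return payload

payload = build_chat_payload(
    "nugen-flash-instruct",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=500,
    temperature=0.7,
)
print(json.dumps(payload, indent=2))
```

POST this body to /api/v3/inference/chat/completions with the Authorization: Bearer <api_key> header shown above.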
Example Request (Vision - Image URL):
POST /api/v3/inference/chat/completions
Headers: {"Authorization": "Bearer <api_key>"}
{
"model": "qwen3-vl-30b",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe what you see in this image."
},
{
"type": "image_url",
"image_url": {
"url": "https://example.com/image.jpg"
}
}
]
}
],
"max_tokens": 1000
}
Example Request (Vision - Base64 Image):
POST /api/v3/inference/chat/completions
Headers: {"Authorization": "Bearer <api_key>"}
{
"model": "qwen3-vl-30b",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD..."
}
}
]
}
]
}
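Base64 data URLs like the one above can be produced from raw image bytes. A minimal sketch (the helper name is illustrative):

```python
import base64

def image_bytes_to_data_url(data: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL suitable for the
    image_url.url field (illustrative helper)."""
    return f"data:{mime};base64," + base64.b64encode(data).decode("ascii")

# JPEG files start with the bytes FF D8 FF; a real call would use
# open("photo.jpg", "rb").read() instead of this stub.
url = image_bytes_to_data_url(b"\xff\xd8\xff\xe0", "image/jpeg")
print(url)  # data:image/jpeg;base64,/9j/4A==
```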
Example Request (Function Calling):
POST /api/v3/inference/chat/completions
Headers: {"Authorization": "Bearer <api_key>"}
{
"model": "nugen-flash-instruct",
"messages": [
{"role": "user", "content": "What's the weather in Boston?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
}
}
],
"tool_choice": "auto"
}
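When the model responds with tool_calls instead of text, the function name and its JSON-encoded arguments need decoding before dispatch. A sketch under the assumption that arguments arrive as a JSON string; the sample response fragment below is illustrative, not captured output:

```python
import json

def extract_tool_calls(response: dict):
    """Return (name, arguments) pairs from a non-streaming response.
    Assumes each tool call's arguments field is a JSON-encoded string."""
    calls = []
    for choice in response.get("choices", []):
        for call in choice.get("message", {}).get("tool_calls") or []:
            fn = call["function"]
            calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls

# Illustrative response fragment for the weather request above.
sample = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Boston\"}",
                },
            }],
        },
        "finish_reason": "stop",
    }]
}
print(extract_tool_calls(sample))  # [('get_weather', {'location': 'Boston'})]
```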
Example Request (Streaming):
POST /api/v3/inference/chat/completions
Headers: {"Authorization": "Bearer <api_key>"}
{
"model": "nugen-flash-instruct",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}
Example Response (Streaming):
data: {"id":"nugen-abc123","created":1704123600.0,"model":"nugen-flash-instruct","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}
data: {"id":"nugen-abc123","created":1704123600.0,"model":"nugen-flash-instruct","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}
data: {"id":"nugen-abc123","created":1704123600.0,"model":"nugen-flash-instruct","choices":[{"index":0,"delta":{"content":" a time"},"finish_reason":"stop"}],"usage":{"prompt_tokens":10,"completion_tokens":5,"total_tokens":15}}
data: [DONE]
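The stream arrives as server-sent events, one data: line per chunk, terminated by data: [DONE]; reassembling the text means concatenating each chunk's delta.content. A minimal parser sketch over the example chunks above (abbreviated):

```python
import json

def parse_sse_line(line: str):
    """Decode one SSE line; return the chunk dict, or None for
    [DONE] and non-data lines."""
    if not line.startswith("data: "):
        return None
    body = line[len("data: "):].strip()
    if body == "[DONE]":
        return None
    return json.loads(body)

# The chunks from the example streaming response, abbreviated.
lines = [
    'data: {"id":"nugen-abc123","choices":[{"index":0,"delta":{"role":"assistant","content":"Once"},"finish_reason":null}]}',
    'data: {"id":"nugen-abc123","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}',
    'data: {"id":"nugen-abc123","choices":[{"index":0,"delta":{"content":" a time"},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
text = ""
for line in lines:
    chunk = parse_sse_line(line)
    if chunk is not None:
        text += chunk["choices"][0]["delta"].get("content", "")
print(text)  # Once upon a time
```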
Notes:
- Function calling is supported via the tools and tool_choice parameters
- confidence_score is only available for Domain-Aligned AI models
- Use the X-Session-ID header for multi-turn conversation tracking
- Use prompt_truncate_len to control context window usage for long conversations
- The reasoning parameter enables advanced reasoning capabilities for supported models
- Authenticate with a Bearer authentication header of the form Bearer <token>, where <token> is your auth token