AnyInference (Beta)

LLM Inference
Ready in Minutes

Deploy production inference endpoints powered by vLLM on high-performance GPUs. OpenAI-compatible API, validated configurations, real-time monitoring — no infrastructure to manage.

GPU cost + 10% service fee · Billed in 5-minute increments
Terminal

# Your endpoint is ready. Use it like OpenAI:
$ curl https://your-endpoint.vast.ai/v1/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model":"nvidia/nemotron-3-super","messages":[{"role":"user","content":"Hello!"}]}'

# Response:
{"choices":[{"message":{"content":"Hello! How can I help you today?"}}]}

Available Models

Pre-configured profiles validated on real hardware. Every configuration is tested end-to-end before it goes live.

GLM-5 744B (NVFP4)

Zhipu AI (Z.ai) / NVIDIA (NVFP4)

NVFP4 (FP4+FP8 scales)

MoE · 744B / 40B active · 257 experts

Zhipu AI's 744B MoE model with 40B active params. MLA + DeepSeek Sparse Attention for efficient long-context inference. NVFP4 quantization on 8x RTX PRO 6000.

MoE · Reasoning · Tool Calling · Multi-GPU
Context: 33K · Speed: TBD · GPU: 8x RTX PRO 6000

MiniMax M2.5 229B (NVFP4)

MiniMax / lukealonso (NVFP4)

NVFP4 (FP4+FP8 scales)

MoE · 229B / 10B active · 256 experts

Full MiniMax M2.5 model with 229B total params (10B active) in NVFP4. Requires 4x RTX PRO 6000. Top-tier MoE reasoning model with Lightning Attention for long contexts.

MoE · Reasoning · Tool Calling · Multi-GPU
Context: 33K · Speed: TBD · GPU: 4x RTX PRO 6000

MiniMax M2.5 REAP 139B (NVFP4)

MiniMax / Cerebras (REAP pruning) / lukealonso (NVFP4)

NVFP4 (FP4+FP8 scales)

MoE · 139B / 10B active · 154 experts

MiniMax M2.5 pruned to 139B total params (10B active) via Cerebras REAP. NVFP4 quantization on 2x RTX PRO 6000. Retains ~96-98% of full model quality with 40% fewer expert parameters.

MoE · Reasoning · Tool Calling · Multi-GPU
Context: 66K · Speed: TBD · GPU: 2x RTX PRO 6000

NVIDIA Nemotron 3 Super 120B

NVIDIA

NVFP4 (E2M1)

Hybrid · 120B / 12B active

NVIDIA's hybrid reasoning model with 120B total parameters and 12B active. Uses NVFP4 quantization for Blackwell Tensor Cores. Single GPU (262K context) or 2-GPU tensor parallelism (1M context).

Hybrid · Reasoning
Context: 262K · Speed: ~70 tok/s · GPU: 1x RTX PRO 6000

Qwen 3.5 122B A10B

Alibaba / Qwen Team

AWQ 4-bit

MoE · 122B / 10B active · 128 experts

Alibaba's 122B Mixture of Experts model with only 10B active parameters per token. Excellent efficiency with native tool calling and reasoning support.

MoE · Reasoning · Tool Calling
Context: 262K · Speed: ~83 tok/s · GPU: 1x RTX PRO 6000

Qwen 3.5 27B AWQ

Alibaba / Qwen Team

AWQ 4-bit

Hybrid · 27B / 27B active

Alibaba's dense 27B model with hybrid Mamba+Attention architecture. AWQ 4-bit quantization fits a single RTX 5090 (33K context) or RTX PRO 6000 (full 262K context). Strong reasoning with native tool calling and thinking support.

Hybrid · Reasoning · Tool Calling
Context: 33K · Speed: ~74.3 tok/s · GPU: 1x RTX 5090

Qwen 3.5 35B A3B

Alibaba / Qwen Team

GPTQ 4-bit

MoE · 35B / 3B active · 128 experts

Alibaba's 35B Mixture of Experts model with only 3B active parameters per token. GPTQ 4-bit quantization fits a single RTX 5090. Frontier-class reasoning with native tool calling, vision, and thinking support.

MoE · Reasoning · Tool Calling · Vision
Context: 16K · Speed: ~200 tok/s · GPU: 1x RTX 5090

How It Works

From model selection to a live endpoint in three steps.

1

Choose a Model

Select from validated profiles for models like Nemotron and Qwen. Each one is pre-tuned with optimized vLLM settings for specific GPU hardware.

2

Pick a GPU

We search the Vast.ai marketplace and rank available offers by price, reliability, and performance so you can make an informed choice.

3

Deploy

We provision the instance, download the model, and start vLLM. You get an OpenAI-compatible endpoint URL ready to handle requests.
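
Once the endpoint reports ready, a quick way to confirm it is serving is to list its models: vLLM's OpenAI-compatible server exposes the standard /v1/models route. A minimal sketch, with placeholder endpoint and key values:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                       # placeholder key
)

# Lists the model(s) loaded by the vLLM server behind the endpoint.
for model in client.models.list():
    print(model.id)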

Built for Developers

OpenAI-Compatible API

Standard OpenAI-compatible endpoints. Drop your URL into any application that uses the OpenAI SDK — no code changes required.

Smart GPU Matching

Automatically searches the marketplace and scores offers by price, performance, and reliability so you get the best value.

One-Click Deploy

Select a model, pick a GPU, and deploy. Provisioning, model download, and server startup are fully automated.

Validated Configurations

Every profile is tested on real GPU hardware with optimized vLLM settings. No guessing about memory, context length, or engine flags.

Real-Time Monitoring

Live GPU telemetry, health checks, and cost tracking streamed directly from the agent running on your instance.

Built-in Chat

Test your model immediately with the integrated chat interface. Supports tool calling and streaming responses out of the box.
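
Streaming uses the standard stream=True flag of the chat completions API, and tool calling passes through the standard tools parameter of the same API. A minimal streaming sketch, with placeholder endpoint, key, and model name:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

# stream=True yields incremental token deltas instead of one final message.
stream = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()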

Post-Deploy Benchmarking

Automatically benchmark throughput and latency after deployment so you know exactly what performance you're getting.
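
If you want to spot-check throughput yourself, the standard usage block on a non-streaming response is enough for a rough tokens-per-second figure. A minimal sketch, again with placeholder values:

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",  # placeholder
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain KV caching in two paragraphs."}],
)
elapsed = time.perf_counter() - start

# completion_tokens comes from the response's standard usage block.
tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")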

Zero-SSH Deployment

A lightweight Go agent handles GPU setup, model download, and vLLM lifecycle. No SSH access needed — fully automated and secure.

Transparent Pricing

GPU compute cost plus a 10% service fee. Billed in 5-minute increments — you only pay for what you use.
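
As a back-of-the-envelope check, a cost estimate multiplies the GPU's hourly rate by 1.10 and the billed minutes. The round-up-to-the-next-5-minute-block behavior below is an assumption for illustration, not a documented guarantee:

import math

def estimate_cost(gpu_rate_per_hour: float, minutes: float) -> float:
    # Assumed: usage rounds up to the next 5-minute billing block.
    billed_minutes = math.ceil(minutes / 5) * 5
    return gpu_rate_per_hour * 1.10 * billed_minutes / 60

# e.g. a $2.00/hr GPU used for 17 minutes bills as 20 minutes:
print(f"${estimate_cost(2.00, 17):.2f}")  # ~$0.73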