LLM Inference
Ready in Minutes
Deploy production inference endpoints powered by vLLM on high-performance GPUs. OpenAI-compatible API, validated configurations, real-time monitoring — no infrastructure to manage.
# Your endpoint is ready. Use it like OpenAI:
$ curl https://your-endpoint.vast.ai/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"nvidia/nemotron-3-super","messages":[{"role":"user","content":"Hello!"}]}'
# Response:
{"choices":[{"message":{"content":"Hello! How can I help you today?"}}]}
Available Models
Pre-configured profiles validated on real hardware. Every configuration is tested end-to-end before it goes live.
GLM-5 744B (NVFP4)
Zhipu AI (Z.ai) / NVIDIA (NVFP4)
MoE · 744B / 40B active · 257 experts
Zhipu AI's 744B MoE model with 40B active parameters. MLA + DeepSeek Sparse Attention for efficient long-context inference. NVFP4 quantization on 8x RTX PRO 6000.
Context: 33K · Throughput: TBD · GPU: 8x RTX PRO 6000
MiniMax M2.5 229B (NVFP4)
MiniMax / lukealonso (NVFP4)
MoE · 229B / 10B active · 256 experts
Full MiniMax M2.5 model with 229B total parameters (10B active) in NVFP4. Requires 4x RTX PRO 6000. Top-tier MoE reasoning model with Lightning Attention for long contexts.
Context: 33K · Throughput: TBD · GPU: 4x RTX PRO 6000
MiniMax M2.5 REAP 139B (NVFP4)
MiniMax / Cerebras (REAP pruning) / lukealonso (NVFP4)
MoE · 139B / 10B active · 154 experts
MiniMax M2.5 pruned to 139B total params (10B active) via Cerebras REAP. NVFP4 quantization on 2x RTX PRO 6000. Retains ~96-98% of full model quality with 40% fewer expert parameters.
Context: 66K · Throughput: TBD · GPU: 2x RTX PRO 6000
NVIDIA Nemotron 3 Super 120B
NVIDIA
Hybrid · 120B / 12B active
NVIDIA's hybrid reasoning model with 120B total parameters and 12B active. Uses NVFP4 quantization for Blackwell Tensor Cores. Single GPU (262K context) or 2-GPU tensor parallelism (1M context).
Context: 262K · Throughput: ~70 tok/s · GPU: 1x RTX PRO 6000
Qwen 3.5 122B A10B
Alibaba / Qwen Team
MoE · 122B / 10B active · 128 experts
Alibaba's 122B Mixture of Experts model with only 10B active parameters per token. Excellent efficiency with native tool calling and reasoning support.
Context: 262K · Throughput: ~83 tok/s · GPU: 1x RTX PRO 6000
Qwen 3.5 27B AWQ
Alibaba / Qwen Team
Hybrid · 27B / 27B active
Alibaba's dense 27B model with a hybrid Mamba+Attention architecture. AWQ 4-bit quantization fits a single RTX 5090 (33K context) or an RTX PRO 6000 (full 262K context). Strong reasoning with native tool calling and thinking support.
Context: 33K · Throughput: ~74.3 tok/s · GPU: 1x RTX 5090
Qwen 3.5 35B A3B
Alibaba / Qwen Team
MoE · 35B / 3B active · 128 experts
Alibaba's 35B Mixture of Experts model with only 3B active parameters per token. GPTQ 4-bit quantization fits a single RTX 5090. Frontier-class reasoning with native tool calling, vision, and thinking support.
Context: 16K · Throughput: ~200 tok/s · GPU: 1x RTX 5090
How It Works
From model selection to a live endpoint in three steps.
Choose a Model
Select from validated profiles for models like Nemotron and Qwen. Each profile ships with vLLM settings tuned for its target GPU hardware.
Pick a GPU
We search the Vast.ai marketplace and rank available offers by price, reliability, and performance so you can make an informed choice.
Deploy
We provision the instance, download the model, and start vLLM. You get an OpenAI-compatible endpoint URL ready to handle requests.
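Once the endpoint is live, any OpenAI-compatible client can talk to it. A minimal verification sketch using the OpenAI Python SDK; the URL and key below are placeholders for the values your deployment reports:

from openai import OpenAI

# Placeholders: substitute the endpoint URL and API key from your dashboard.
client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",
    api_key="YOUR_API_KEY",
)

# /v1/models is part of the OpenAI-compatible surface served by vLLM,
# so listing models confirms the endpoint is up and what it is hosting.
for model in client.models.list():
    print(model.id)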
Built for Developers
OpenAI-Compatible API
Standard OpenAI-compatible endpoints. Drop your URL into any application that uses the OpenAI SDK — no code changes required.
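For example, the curl request at the top of this page maps one-to-one onto the OpenAI Python SDK; only the base_url changes (URL and key are placeholders):

from openai import OpenAI

# Same request as the curl example above; only the base_url is non-default.
client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)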
Smart GPU Matching
Automatically searches the marketplace and scores offers by price, performance, and reliability so you get the best value.
One-Click Deploy
Select a model, pick a GPU, and deploy. Provisioning, model download, and server startup are fully automated.
Validated Configurations
Every profile is tested on real GPU hardware with optimized vLLM settings. No guessing about memory, context length, or engine flags.
Real-Time Monitoring
Live GPU telemetry, health checks, and cost tracking streamed directly from the agent running on your instance.
Built-in Chat
Test your model immediately with the integrated chat interface. Supports tool calling and streaming responses out of the box.
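Streaming goes through the same API surface your own clients use. A short sketch of consuming a streamed response with the OpenAI SDK; endpoint details are placeholders as before:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",
    api_key="YOUR_API_KEY",
)

# stream=True returns incremental deltas instead of a single final message.
stream = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()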
Post-Deploy Benchmarking
Automatically benchmark throughput and latency after deployment so you know exactly what performance you're getting.
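If you want a rough client-side sanity check to compare against the built-in numbers, a sketch like this works against any OpenAI-compatible endpoint (placeholders as before; this is an illustration, not the built-in benchmark):

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.vast.ai/v1",
    api_key="YOUR_API_KEY",
)

start = time.perf_counter()
response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=256,
)
elapsed = time.perf_counter() - start

# usage.completion_tokens is reported on non-streaming responses, so
# tokens divided by wall-clock time gives a rough end-to-end rate.
tokens = response.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")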
Zero-SSH Deployment
A lightweight Go agent handles GPU setup, model download, and vLLM lifecycle. No SSH access needed — fully automated and secure.
Transparent Pricing
GPU compute cost plus a 10% service fee. Billed in 5-minute increments — you only pay for what you use.
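As a worked example of the math (the hourly rate is hypothetical, and rounding up to the next 5-minute increment is our assumption about how the billing granularity applies):

import math

def estimated_cost(gpu_rate_per_hour: float, minutes_used: float) -> float:
    # GPU rate plus the 10% service fee, billed in 5-minute increments
    # (assumed to round up to the next increment).
    billed_minutes = math.ceil(minutes_used / 5) * 5
    return gpu_rate_per_hour * 1.10 * (billed_minutes / 60)

# Hypothetical $2.00/hr GPU used for 47 minutes bills as 50 minutes:
# 2.00 * 1.10 * (50/60) ≈ $1.83
print(f"${estimated_cost(2.00, 47):.2f}")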