Baseten
by Baseten Labs, Inc.
Deploy and scale ML models with fast production inference
Baseten is a model inference and deployment platform for running open-source, custom, and fine-tuned models in production with low-latency, autoscaling GPUs, per-token Model APIs, and the open-source Truss packaging framework.
At a Glance
- Category
- Inference
- Pricing
- Usage-based, Free tier, Enterprise
- Target Market
- Enterprise, Startups, Developers
- Founded
- 2019
- Headquarters
- San Francisco, California, USA
Key Features
- ✓Dedicated Deployments
Autoscaling dedicated GPU inference for custom or fine-tuned models with per-minute billing and scale-to-zero.
- ✓Model APIs
Pre-optimized, hosted open-source models via an OpenAI-compatible, per-token API.
- ✓Truss (open source)
Single-config packaging of any framework (vLLM, SGLang, TensorRT-LLM, diffusers, and more) into a production endpoint.
- ✓Baseten Inference Stack
TensorRT-LLM optimization and custom kernels for low-latency, high-throughput serving.
- ✓Multi-cloud and Self-hosted/Hybrid
Deploy in Baseten Cloud or your own VPC across 20+ providers and regions.
- ✓Autoscaling and scale-to-zero
Fast cold starts with billing only for active inference.
Capabilities
Use Cases
- •Real-time voice agents and AI phone calls
Low-latency text-to-speech and transcription streaming for conversational voice applications.
- •Production LLM apps
Serving open-source or fine-tuned LLMs behind an OpenAI-compatible API at scale.
- •High-throughput transcription
Whisper-based audio-to-text with predictable, sub-300ms latency.
- •Image generation and ComfyUI pipelines
Deploying custom diffusion models and multi-step image workflows.
Ideal For
Best For
- ✓Putting open-source or custom models into low-latency production
- ✓Dedicated, autoscaling GPU inference without managing Kubernetes
- ✓Latency-sensitive workloads (voice, transcription, real-time LLM)
- ✓Self-hosted or VPC deployment with a managed experience
Not Ideal For
- ✗Teams wanting a no-code end-user chatbot or app builder
- ✗Users needing only a single flat-rate API
Integrations
Deployment
Pricing
Basic
$0/month pay-as-you-go
- ✓Per-minute GPU billing (e.g., A100 80GB $4.00/hr, H100 80GB $6.50/hr)
- ✓Model APIs per token (e.g., GPT-OSS 120B $0.10/$0.50 per 1M)
- ✓Scale-to-zero; pay only for active inference
- ✓Starter credits for new accounts
Pro
Custom / volume discounts
- ✓Higher limits
- ✓Volume discounts
Enterprise
Custom
- ✓Self-hosted / VPC deployment
- ✓Dedicated support
- ✓Volume discounts
Usage-based pricing published at baseten.co/pricing (verified 2026-07). Dedicated GPUs billed per minute (T4 from $0.6312/hr up to B200 180GB $9.98/hr; idle scale-to-zero replicas free). Model APIs billed per 1M tokens with a cached-input discount. New accounts receive starter credits. Model-API token prices change as the model roster updates.
Connect
Stay Ahead of the Curve
Weekly enterprise AI insights for technology leaders. No spam, no vendor pitches—unsubscribe anytime.
Subscribe