Together AI
by Together Computer Inc. (Together AI)
The AI Native Cloud
Together AI is a full-stack AI acceleration cloud offering serverless and dedicated inference, fine-tuning, and on-demand/reserved NVIDIA GPU clusters for running open-source and custom generative AI models in production.
At a Glance
- Category: Infrastructure & Cloud
- Pricing: usage-based, pay-per-token, per-GPU-hour, reserved/committed
- Target Market: AI startups, enterprises, ML researchers, developers building on open-source models
- Founded: 2022
- Headquarters: San Francisco, California, United States
Key Features
- ✓Serverless inference
On-demand, pay-per-token access to 200+ open-source chat, vision, image, video, and audio models with low-latency inference.
- ✓Dedicated inference endpoints
Single-tenant GPU deployments (e.g., H100, B200) for predictable performance, plus dedicated container inference for generative media.
- ✓GPU clusters
On-demand and reserved NVIDIA H100/H200/Blackwell B200 clusters scaling from single instances to thousands of GPUs, with managed storage and zero egress fees.
- ✓Fine-tuning
Adapt open-source models to production tasks with per-token fine-tuning pricing and instant deployment.
- ✓Together Kernel Collection & research optimizations
Proprietary optimized kernels and a research-driven inference stack, which the company claims deliver ~2x faster inference, 60% lower cost, and 90% faster pre-training.
Capabilities
Use Cases
- •Production open-source model serving
Deploy and scale inference for Llama, Mistral, DeepSeek, and other open-source models via serverless or dedicated endpoints.
- •Large-scale model training
Reserve NVIDIA GPU clusters with managed storage and optimized kernels to pre-train or fine-tune large models cost-effectively.
- •Custom enterprise AI applications
Fine-tune open models on proprietary data and serve them through dedicated infrastructure for enterprises like Salesforce and Zoom.
Ideal For
Best For
- ✓Open-source model inference at scale
- ✓Fine-tuning custom models
- ✓GPU cluster compute for training
- ✓Cost-optimized production AI workloads
Market Analysis
Pros
- ✓Fast, low-latency inference across many open-source models
- ✓Broad full-stack offering from serverless to GPU clusters
- ✓Competitive cost vs. closed-model providers
- ✓Strong value and a reliable API, according to user reviews
Cons
- ✗Not beginner-friendly; documentation is thin in places for non-developers
- ✗Some billing complaints (unexpected charges, confusing invoices) on Trustpilot
- ✗Requires comfort with APIs/code
Pricing
Serverless inference
Pay-per-token (chat/vision $0.0015–$4.50 per 1M input tokens)
- ✓200+ open-source models
- ✓On-demand access
- ✓Image, video, and audio generation pricing per unit
Dedicated inference
Per GPU-hour (1x H100 80GB $6.49/hr; 1x HGX B200 180GB $11.95/hr)
- ✓Single-tenant GPU endpoints
- ✓Dedicated container inference for media
GPU clusters
On-demand $4.79–$8.19/GPU-hr; reserved $3.29–$7.99/GPU-hr
- ✓H100, H200, B200 clusters
- ✓Volume discounts on 7–180+ day reservations
- ✓Managed storage with zero egress fees
Fine-tuning
Per 1M tokens ($0.48–$1.35 for models up to 16B; higher for specialized models)
- ✓Fine-tune open-source models
- ✓Instant deployment
Pricing is pay-per-use with no traditional subscription tiers: per token for LLMs and embeddings, per image or video for generative media, per minute for audio, and per GPU-hour for dedicated endpoints and clusters. The company advertises 'Start for free, scale on demand,' though specific free-credit amounts are not detailed publicly.
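To make the pay-per-use figures concrete, the following is a rough sketch of how a monthly bill could be estimated from the list prices quoted in this section. The function names, the 50M-token workload, the 8-GPU cluster, and the 720-hour month are illustrative assumptions, not part of any Together AI tooling. Only input-token prices are listed here, so output tokens are left out, and pairing the low end of the on-demand range with the low end of the reserved range is also an assumption, since the listing does not say which GPU each endpoint of the range refers to.

```python
# Rough cost estimator using the list prices quoted in this profile (USD).
# All workload sizes below are hypothetical and chosen only for illustration.

SERVERLESS_INPUT_PER_1M = (0.0015, 4.50)        # chat/vision, low and high end
H100_DEDICATED_PER_HOUR = 6.49                  # 1x H100 80GB dedicated endpoint
CLUSTER_ON_DEMAND_PER_GPU_HOUR = (4.79, 8.19)   # on-demand range
CLUSTER_RESERVED_PER_GPU_HOUR = (3.29, 7.99)    # reserved range


def token_cost(tokens: int, price_per_1m: float) -> float:
    """Cost of processing `tokens` input tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_1m


def gpu_cost(gpus: int, hours: float, price_per_gpu_hour: float) -> float:
    """Cost of running `gpus` GPUs for `hours` hours at a per-GPU-hour price."""
    return gpus * hours * price_per_gpu_hour


if __name__ == "__main__":
    monthly_tokens = 50_000_000   # assumed workload: 50M input tokens
    hours_in_month = 720          # assumed 30-day month, running 24/7

    low, high = (token_cost(monthly_tokens, p) for p in SERVERLESS_INPUT_PER_1M)
    print(f"Serverless input tokens: ${low:,.2f} to ${high:,.2f}")

    dedicated = gpu_cost(1, hours_in_month, H100_DEDICATED_PER_HOUR)
    print(f"One dedicated H100 endpoint: ${dedicated:,.2f}")

    on_demand = gpu_cost(8, hours_in_month, CLUSTER_ON_DEMAND_PER_GPU_HOUR[0])
    reserved = gpu_cost(8, hours_in_month, CLUSTER_RESERVED_PER_GPU_HOUR[0])
    print(f"8-GPU cluster, on-demand (low end): ${on_demand:,.2f}")
    print(f"8-GPU cluster, reserved (low end):  ${reserved:,.2f}")
```

The spread in the serverless range is wide because per-token prices vary by model, so an estimate is only meaningful once a specific model is chosen. For always-on workloads, the per-GPU-hour lines show why reserved capacity or a dedicated endpoint may be worth comparing against serverless usage.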