Fireworks AI
by Fireworks AI, Inc.
Fast, low-cost inference and fine-tuning for open models
Fireworks AI, built by the creators of PyTorch, is a fast and cost-efficient platform for serving and fine-tuning open and proprietary generative models across text, image, audio, and embeddings via OpenAI-compatible APIs.
At a Glance
- Category
- Inference
- Pricing
- Usage-based, Free tier, Enterprise
- Target Market
- Enterprise, Startups, Developers
- Founded
- 2022
- Headquarters
- Redwood City, California, USA
Key Features
- ✓FireAttention
Proprietary CUDA inference engine for high-speed, low-cost serving, optimized for long context.
- ✓FireFunction
Function-calling model for building compound-AI and agentic systems, compatible with OpenAI function calling.
- ✓Fine-tuning
LoRA and full-parameter SFT/DPO with multi-LoRA production deployment.
- ✓OpenAI/Anthropic-compatible API
Drop-in replacement for existing application code using standard SDKs.
- ✓Serverless and dedicated GPU inference
Pay-per-token serverless plus on-demand and reserved GPU deployments (H100/H200/B200/B300).
- ✓Multi-modal serving
Text, vision, image (FLUX), audio (Whisper), and embedding models on one platform.
Capabilities
Use Cases
- •Code assistants
Powers AI coding tools such as Cursor and Sourcegraph with fast code-model inference.
- •RAG and agentic backends
Function calling plus fast inference for retrieval-augmented search copilots and agents.
- •Custom fine-tuned domain models
Deploy private, specialized models cheaply via LoRA fine-tuning.
- •Real-time audio transcription
Whisper v3 transcription for fast, low-cost voice and audio applications.
Ideal For
Best For
- ✓Fast, low-cost inference for open-source LLMs
- ✓Fine-tuning and deploying custom models quickly
- ✓OpenAI-compatible drop-in at lower cost
- ✓High-throughput agentic and RAG backends
- ✓Multi-modal (text, image, audio) production apps
Not Ideal For
- ✗Teams wanting their own proprietary frontier model
- ✗Fully self-hosted / on-prem-only deployments
Integrations
Deployment
Pricing
Serverless (pay-per-token)
Usage-based per token
- ✓From $0.10 per 1M tokens (<4B params), $0.20 (4-16B), $0.90 (>16B)
- ✓Cached input billed at 50%
- ✓Batch inference at 50%
On-demand GPUs
Per GPU-hour
- ✓H100/H200 $7.00/hr
- ✓B200 $10.00/hr
- ✓B300 $12.00/hr
Enterprise / Reserved
Custom
- ✓Reserved capacity
- ✓Private/VPC deployment
- ✓Volume discounts
Usage-based pricing published at fireworks.ai/pricing (verified 2026-07). New signups receive $1 in free credits (no open-ended free tier). Fine-tuning billed per 1M training tokens (e.g., up to 16B: LoRA SFT $0.50); embeddings from $0.008 per 1M tokens; speech-to-text per audio minute. Image and enterprise/reserved pricing may be custom.
Connect
Stay Ahead of the Curve
Weekly enterprise AI insights for technology leaders. No spam, no vendor pitches—unsubscribe anytime.
Subscribe