Groq
by Groq, Inc.
Ultra-fast LPU inference for open-weight LLMs
Groq is an AI inference platform powered by its custom LPU (Language Processing Unit) chip, delivering ultra-low-latency, high-throughput token generation for open-weight LLMs through GroqCloud's OpenAI-compatible API.
At a Glance
- Category
- Inference
- Pricing
- Usage-based, Free tier, Enterprise
- Target Market
- Developers, Startups, Enterprise
- Founded
- 2016
- Headquarters
- San Jose, California, USA
Key Features
- ✓LPU inference chip
Custom silicon purpose-built for AI inference, delivering very high tokens/sec at low cost.
- ✓GroqCloud API
OpenAI-compatible, tokens-as-a-service inference for open-weight models.
- ✓Built-in agentic tools
Web search, website visit, code execution, and browser automation callable from the API.
- ✓Batch API
50% lower cost for large-scale asynchronous inference jobs.
- ✓Prompt caching
50% discount on cached input tokens for repeated context.
- ✓LoRA adapter serving
Deploy multiple custom LoRA fine-tunes at base-model speed (enterprise tier).
Capabilities
Use Cases
- •Real-time conversational AI and voice agents
Low-latency token streaming plus Whisper speech-to-text and Orpheus text-to-speech power responsive chat and voice assistants.
- •High-throughput agentic applications
Agents that make many fast LLM calls with built-in web search and code execution via Groq Compound.
- •Speech transcription at scale
Whisper Large v3 Turbo transcription at very high real-time factors and low cost.
- •Sovereign and regulated on-prem inference
GroqRack clusters for data-residency and air-gapped enterprise deployments.
Ideal For
Best For
- ✓Ultra-low-latency LLM inference
- ✓Real-time voice and chat assistants
- ✓High-volume, cost-sensitive token workloads
- ✓Migrating off OpenAI with minimal code change
- ✓On-premise or sovereign inference deployments
Not Ideal For
- ✗Teams needing proprietary frontier models (GPT-4/Claude/Gemini)
- ✗Image-generation workloads
- ✗Managed model training or full fine-tuning as a service
Integrations
Deployment
Pricing
Free
$0
- ✓Free API key with rate limits
- ✓OpenAI-compatible endpoints
Pay-as-you-go
Usage-based per token
- ✓Per-token pricing (e.g., Llama 3.1 8B Instant $0.05/$0.08 per 1M in/out; Llama 3.3 70B $0.59/$0.79)
- ✓Batch API 50% discount
- ✓Prompt caching 50% discount
Enterprise
Custom
- ✓Dedicated capacity
- ✓LoRA adapter serving
- ✓On-prem GroqRack / GroqNode
Usage-based per-token pricing published at groq.com/pricing (verified 2026-07). Free developer tier with rate limits, then pay-as-you-go. Speech-to-text billed per hour transcribed (Whisper V3 Large ~$0.111/hr; Turbo ~$0.04/hr); text-to-speech per 1M characters. Built-in tools priced separately. Prices change as the model roster updates.
Connect
Stay Ahead of the Curve
Weekly enterprise AI insights for technology leaders. No spam, no vendor pitches—unsubscribe anytime.
Subscribe