G

Groq

by Groq, Inc.

InferenceAI InfrastructureLLM API

Ultra-fast LPU inference for open-weight LLMs

Usage-based · Free tier · Enterprise·Added July 2, 2026·Updated July 2, 2026
Share:
THE DAILY BRIEF
Groq

by Groq, Inc.

InferenceAI InfrastructureLLM API

Ultra-fast LPU inference for open-weight LLMs

Usage-based · Free tier · Enterprise

Groq is an AI inference platform powered by its custom LPU (Language Processing Unit) chip, delivering ultra-low-latency, high-throughput token generation for open-weight LLMs through GroqCloud's OpenAI-compatible API.

At a Glance

Category
Inference
Pricing
Usage-based, Free tier, Enterprise
Target Market
Developers, Startups, Enterprise
Founded
2016
Headquarters
San Jose, California, USA

Key Features

  • LPU inference chip
  • GroqCloud API
  • Built-in agentic tools
  • Batch API
  • Prompt caching
  • LoRA adapter serving

Capabilities

api access
text generation
code generation
speech to text
text to speech
agent orchestration

Use Cases

  • Real-time conversational AI and voice agents
  • High-throughput agentic applications
  • Speech transcription at scale
  • Sovereign and regulated on-prem inference

Ideal For

Best For

  • Ultra-low-latency LLM inference
  • Real-time voice and chat assistants
  • High-volume, cost-sensitive token workloads
  • Migrating off OpenAI with minimal code change
  • On-premise or sovereign inference deployments

Not Ideal For

  • Teams needing proprietary frontier models (GPT-4/Claude/Gemini)
  • Image-generation workloads
  • Managed model training or full fine-tuning as a service

Pricing

Free

$0

  • Free API key with rate limits
  • OpenAI-compatible endpoints

Pay-as-you-go

Usage-based per token

  • Per-token pricing (e.g., Llama 3.1 8B Instant $0.05/$0.08 per 1M in/out; Llama 3.3 70B $0.59/$0.79)
  • Batch API 50% discount
  • Prompt caching 50% discount

Enterprise

Custom

  • Dedicated capacity
  • LoRA adapter serving
  • On-prem GroqRack / GroqNode

Usage-based per-token pricing published at groq.com/pricing (verified 2026-07). Free developer tier with rate limits, then pay-as-you-go. Speech-to-text billed per hour transcribed (Whisper V3 Large ~$0.111/hr; Turbo ~$0.04/hr); text-to-speech per 1M characters. Built-in tools priced separately. Prices change as the model roster updates.

THE DAILY BRIEF

Enterprise AI insights for technology and business leaders, twice weekly.

beri.net

Subscribe at beri.net/subscribe for twice-weekly AI insights delivered to your inbox.

LinkedIn: linkedin.com/in/rberi  |  X: x.com/rajeshberi

© 2026 Rajesh Beri. All rights reserved.

Groq is an AI inference platform powered by its custom LPU (Language Processing Unit) chip, delivering ultra-low-latency, high-throughput token generation for open-weight LLMs through GroqCloud's OpenAI-compatible API.

At a Glance

Category
Inference
Pricing
Usage-based, Free tier, Enterprise
Target Market
Developers, Startups, Enterprise
Founded
2016
Headquarters
San Jose, California, USA

Key Features

  • LPU inference chip

    Custom silicon purpose-built for AI inference, delivering very high tokens/sec at low cost.

  • GroqCloud API

    OpenAI-compatible, tokens-as-a-service inference for open-weight models.

  • Built-in agentic tools

    Web search, website visit, code execution, and browser automation callable from the API.

  • Batch API

    50% lower cost for large-scale asynchronous inference jobs.

  • Prompt caching

    50% discount on cached input tokens for repeated context.

  • LoRA adapter serving

    Deploy multiple custom LoRA fine-tunes at base-model speed (enterprise tier).

Capabilities

api access
text generation
code generation
speech to text
text to speech
agent orchestration

Use Cases

  • Real-time conversational AI and voice agents

    Low-latency token streaming plus Whisper speech-to-text and Orpheus text-to-speech power responsive chat and voice assistants.

  • High-throughput agentic applications

    Agents that make many fast LLM calls with built-in web search and code execution via Groq Compound.

  • Speech transcription at scale

    Whisper Large v3 Turbo transcription at very high real-time factors and low cost.

  • Sovereign and regulated on-prem inference

    GroqRack clusters for data-residency and air-gapped enterprise deployments.

Ideal For

Best For

  • Ultra-low-latency LLM inference
  • Real-time voice and chat assistants
  • High-volume, cost-sensitive token workloads
  • Migrating off OpenAI with minimal code change
  • On-premise or sovereign inference deployments

Not Ideal For

  • Teams needing proprietary frontier models (GPT-4/Claude/Gemini)
  • Image-generation workloads
  • Managed model training or full fine-tuning as a service

Integrations

API Support
SDK Available
SDK:PythonJavaScript/TypeScript

Deployment

Self-Hosted
Cloud-Hosted
On-Premise
GroqCloud (hosted API)GroqRack / GroqNode on-premise clusters

Pricing

Free Trial Available

Free

$0

  • Free API key with rate limits
  • OpenAI-compatible endpoints

Pay-as-you-go

Usage-based per token

  • Per-token pricing (e.g., Llama 3.1 8B Instant $0.05/$0.08 per 1M in/out; Llama 3.3 70B $0.59/$0.79)
  • Batch API 50% discount
  • Prompt caching 50% discount

Enterprise

Custom

  • Dedicated capacity
  • LoRA adapter serving
  • On-prem GroqRack / GroqNode

Usage-based per-token pricing published at groq.com/pricing (verified 2026-07). Free developer tier with rate limits, then pay-as-you-go. Speech-to-text billed per hour transcribed (Whisper V3 Large ~$0.111/hr; Turbo ~$0.04/hr); text-to-speech per 1M characters. Built-in tools priced separately. Prices change as the model roster updates.

Connect

Newsletter

Stay Ahead of the Curve

Weekly enterprise AI insights for technology leaders. No spam, no vendor pitches—unsubscribe anytime.

Subscribe