Attention Is All You Need (Original Transformer Paper)
by Vaswani et al., Google
The 2017 paper that introduced the transformer and set off the entire LLM era.
Overview
This is the paper that started it all. Vaswani et al. proposed the transformer, an architecture built purely on attention mechanisms, removing the recurrence and convolutions that dominated sequence modeling before it. Nearly every modern LLM descends from this design. Reading the original (ideally alongside The Illustrated Transformer or an annotated implementation) is a rite of passage and pays off in genuine understanding.
At a Glance
- Topic: Models
- Level: Advanced
- Format: Paper
- Cost: Free
- Duration: ~1-2 hours
- Provider: Vaswani et al., Google
- Hands-on: No
- Certificate: None
What You’ll Learn
- ✓ The original definition of scaled dot-product attention
- ✓ Multi-head attention and positional encodings
- ✓ The encoder-decoder transformer design
- ✓ The reasoning that replaced RNNs with attention
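To give a feel for the first item, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The function names, the list-of-lists representation, and the toy inputs are assumptions made for illustration; real implementations use batched tensor operations, masking, and multiple heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Illustrative helper (not from the paper's code): Q, K, V are lists of
    equal-length float lists; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        output.append(out)
    return output

# Toy inputs (hypothetical): two queries, two keys, two value vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
```

In this toy case each query matches one key more strongly than the other, so each output row is pulled toward the corresponding value vector while remaining a blend of both.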
Highlights
- The single most influential modern ML paper
- Best read with an annotated companion
Who It’s For
Best For
- ✓ Learners going to the primary source
Prerequisites
- Familiarity with neural networks and attention
FAQ
What is Attention Is All You Need (Original Transformer Paper)?
It is the foundational research paper that introduced the transformer, an architecture based entirely on attention that dispenses with recurrence and convolutions.
Is Attention Is All You Need (Original Transformer Paper) free?
Attention Is All You Need (Original Transformer Paper) is free to access.
What level is Attention Is All You Need (Original Transformer Paper) for?
Attention Is All You Need (Original Transformer Paper) is aimed at an advanced audience. Recommended background: familiarity with neural networks and attention.
How long does Attention Is All You Need (Original Transformer Paper) take?
Expect roughly 1-2 hours. Most learners work through it at their own pace.
What will I learn from Attention Is All You Need (Original Transformer Paper)?
You'll learn: The original definition of scaled dot-product attention; Multi-head attention and positional encodings; The encoder-decoder transformer design; The reasoning that replaced RNNs with attention.
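As a taste of the positional encodings mentioned above, the sketch below builds the sinusoidal table, where dimension 2i uses sin(pos / 10000^(2i/d_model)) and dimension 2i+1 uses the matching cosine. The function name and the small sizes are assumptions chosen for illustration, not code from the paper.

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings.

    Illustrative helper: even dimensions use sine, odd dimensions use
    cosine, and each sine/cosine pair shares one frequency.
    """
    table = []
    for pos in range(max_len):
        row = []
        for j in range(d_model):
            i = j // 2  # index of the sine/cosine pair this dimension belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Hypothetical small sizes, just to inspect the table by hand.
pe = sinusoidal_positional_encoding(max_len=4, d_model=6)
```

Each row is added to the token embedding at that position, which gives an attention-only model a signal about token order.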