Attention Is All You Need (Original Transformer Paper)

by Vaswani et al., Google

Advanced · Paper · Free · ~1-2 hours

The 2017 paper that introduced the transformer and set off the entire LLM era.

Reviewed July 3, 2026

Overview

Vaswani et al. proposed the transformer, an architecture built purely on attention mechanisms, removing the recurrence and convolutions that dominated sequence modeling before it. Nearly every modern LLM descends from this design. Reading the original (ideally alongside The Illustrated Transformer or an annotated implementation) is a rite of passage and pays off in genuine understanding.

At a Glance

Topic: Models
Level: Advanced
Format: Paper
Cost: Free
Duration: ~1-2 hours
Provider: Vaswani et al., Google
Hands-on: No
Certificate: None

What You’ll Learn

  • The original definition of scaled dot-product attention
  • Multi-head attention and positional encodings
  • The encoder-decoder transformer design
  • The reasoning that replaced RNNs with attention
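
As a rough preview of the first item above, here is a minimal sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, written in plain Python for readability. It assumes a single head, no masking, and no batching; the function names and the toy Q, K, V values are illustrative and do not come from the paper or any library.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one head, without masking.

    Q, K, V are lists of row vectors (hypothetical toy representation).
    Returns (output rows, attention weight rows).
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output column at a time.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy example: two queries, two keys, two values (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, and each query puts the most weight on the key it aligns with. Multi-head attention, as described in the paper, runs several such attention computations in parallel on learned projections of Q, K, and V and concatenates the results.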

Highlights

  • The single most influential modern ML paper
  • Best read with an annotated companion

Who It’s For

Best For

  • Learners going to the primary source

Prerequisites

  • Neural network and attention familiarity

FAQ

What is Attention Is All You Need (Original Transformer Paper)?

The foundational research paper introducing the transformer architecture based entirely on attention, dispensing with recurrence.

Is Attention Is All You Need (Original Transformer Paper) free?

Attention Is All You Need (Original Transformer Paper) is free to access.

What level is Attention Is All You Need (Original Transformer Paper) for?

Attention Is All You Need (Original Transformer Paper) is aimed at an advanced audience. Recommended background: familiarity with neural networks and attention.

How long does Attention Is All You Need (Original Transformer Paper) take?

Expect roughly 1-2 hours. Most learners work through it at their own pace.

What will I learn from Attention Is All You Need (Original Transformer Paper)?

You'll learn: The original definition of scaled dot-product attention; Multi-head attention and positional encodings; The encoder-decoder transformer design; The reasoning that replaced RNNs with attention.

Topics

transformer, attention, research paper, foundational