A reinforcement-learning project

A chess engine that teaches itself

An AlphaZero-style reinforcement-learning system, built from scratch in Python & PyTorch — a neural network (policy + value) guided by Monte-Carlo Tree Search that learns purely by playing itself. This site showcases that project: read how it learns, watch it self-play, and play a game yourself.

Heads-up: to keep games enjoyable today, the live board's opponent is a fast classical engine. The reinforcement-learning network is the from-scratch learning project; see How it works.

Neural network

A residual CNN reads the board and outputs a move policy and a value (who's winning) — no hand-written chess knowledge.

Search (MCTS)

Monte-Carlo Tree Search with the PUCT rule looks ahead, turning the network's hunches into much stronger moves.

Self-play learning

The engine is its own opponent and its own teacher: games become training data, which sharpens the next generation.

Play it now

The live board uses a fast classical search so it plays sound chess today — captures, tactics and real checkmates.

How this is different from other chess engines

Most chess engines are pre-built black boxes. This one is about understanding how an engine learns and thinks.

It learns, it isn't told

Engines like Stockfish run on decades of human-tuned chess knowledge. This one starts from random weights and teaches itself (the AlphaZero idea), discovering good play from its own games.

It shows its thinking

See the live evaluation, win probability and the moves it's considering, drawn as arrows on the board. Most engines never let you see why.

Built from scratch to be understood

The network, the search and the self-play loop are all written from scratch and explained in plain English on the How it works page. It's a learning tool, not just a player.

Watch it play itself

Hit “Watch self-play” and the engine reasons through a whole game on its own — exactly the kind of self-play it learns from.

You play White. Click a piece then its destination — or drag it.

Tip: right-click + drag to draw an arrow; right-click a square to mark it. Keys: N new · F flip · U undo · H hint.

How the engine works

The same three ideas that powered DeepMind's AlphaZero — implemented from scratch to be read and understood, not just run.

The simple version (30 seconds)

Nobody teaches the engine chess strategy. It only knows the rules. It gets good through one repeating loop:

Guess. A neural network looks at the board and guesses the best moves and who's winning.
Think. A search plays those candidate moves out several moves ahead to find something better than the first guess.
Play itself. It plays thousands of games against itself using that search.
Learn. After each game, it nudges the network toward the moves the search preferred and toward predicting who actually won, so next time the guess is already smarter.

Repeat millions of times and the engine bootstraps itself from random moves to real chess — with zero human strategy added. Everything below is just that loop, in detail.

1 · One network, two heads — the "guess"

A position is encoded as an 18 × 8 × 8 stack of planes (piece locations, castling rights, …), always shown from the side-to-move's perspective. A residual convolutional network maps it to:

a policy: a probability distribution over all 8×8×73 = 4672 encoded moves ("which moves look promising?"), and
a value in [−1, 1]: "who is winning?"
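
The 4672 figure comes from pairing each of the 64 origin squares with 73 move-type planes. As a rough illustration of how a flat policy index could map to and from that shape, here is a minimal sketch. The ordering (square first, plane second) and the names `NUM_PLANES`, `encode_move` and `decode_move` are assumptions made for this example, not necessarily the layout this project uses.

```python
# Hypothetical flat indexing for an 8 x 8 x 73 policy output.
BOARD_SIZE = 8
NUM_PLANES = 73  # move-type planes per origin square
POLICY_SIZE = BOARD_SIZE * BOARD_SIZE * NUM_PLANES  # 4672


def encode_move(rank: int, file: int, plane: int) -> int:
    """Map (origin rank, origin file, move-type plane) to a flat index."""
    if not (0 <= rank < BOARD_SIZE and 0 <= file < BOARD_SIZE):
        raise ValueError("square out of range")
    if not 0 <= plane < NUM_PLANES:
        raise ValueError("plane out of range")
    return (rank * BOARD_SIZE + file) * NUM_PLANES + plane


def decode_move(index: int) -> tuple[int, int, int]:
    """Invert encode_move: flat index back to (rank, file, plane)."""
    if not 0 <= index < POLICY_SIZE:
        raise ValueError("index out of range")
    square, plane = divmod(index, NUM_PLANES)
    rank, file = divmod(square, BOARD_SIZE)
    return rank, file, plane
```

Most of those 4672 slots are illegal in any given position, so a real engine also needs some way to restrict the policy to legal moves; that step is left out here.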

2 · Search that thinks ahead (MCTS + PUCT) — the "think"

Each move, hundreds of simulations descend a search tree, balancing what looks good now against what's worth exploring, using the PUCT rule:

a* = argmaxₐ [ Q(s,a) + c · P(s,a) · √(Σ N(s,b)) / (1 + N(s,a)) ]

Exploitation (Q, the average value found) plus exploration (the network's prior P, damped by how often a move was tried). The most-visited move is the search's verdict — far stronger than the raw network.
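
To make the rule concrete, the selection step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: the per-move statistics are kept in a plain dict, the constant `c_puct = 1.5` is an arbitrary example value, and unvisited moves are given Q = 0.

```python
import math


def puct_select(stats: dict, c_puct: float = 1.5) -> str:
    """Return the move maximising Q + c * P * sqrt(sum_b N_b) / (1 + N).

    stats maps each move to a dict with keys
    "N" (visit count), "W" (total backed-up value) and "P" (network prior).
    """
    total_visits = sum(s["N"] for s in stats.values())
    sqrt_total = math.sqrt(total_visits)

    def score(move: str) -> float:
        s = stats[move]
        # Q is the average value found so far; 0 for unvisited moves (assumption).
        q = s["W"] / s["N"] if s["N"] > 0 else 0.0
        # Exploration bonus: large for high-prior, rarely tried moves.
        u = c_puct * s["P"] * sqrt_total / (1 + s["N"])
        return q + u

    return max(stats, key=score)


stats = {
    "e2e4": {"N": 10, "W": 6.0, "P": 0.5},
    "d2d4": {"N": 2, "W": 0.8, "P": 0.3},
    "g1f3": {"N": 0, "W": 0.0, "P": 0.2},
}
chosen = puct_select(stats)
```

With these numbers the never-tried move g1f3 is chosen: its exploration bonus outweighs the higher average value of e2e4. Setting `c_puct` to 0 removes the bonus and the choice falls back to e2e4, the move with the best Q.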

3 · The self-play loop — "play itself" & "learn"

01 · Play a full game against itself, picking moves with MCTS.
02 · Record every (position, search-policy, who-won) triple.
03 · Train the network to predict the search and the outcome.
04 · The stronger network produces better games. Repeat.
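
The recording step can be sketched as follows. It is a minimal illustration with several assumptions: the search policy is taken to be the normalised visit counts, the result is stored from the point of view of the side to move in each position, and the helper names `visits_to_policy` and `label_game` are invented for this example.

```python
def visits_to_policy(visit_counts: dict) -> dict:
    """Normalise MCTS visit counts into a policy target that sums to 1."""
    total = sum(visit_counts.values())
    if total == 0:
        raise ValueError("no visits recorded")
    return {move: n / total for move, n in visit_counts.items()}


def label_game(history: list, winner: int) -> list:
    """Turn one finished game into (position, policy, z) training triples.

    history: list of (position, side_to_move, visit_counts) tuples,
             with side_to_move = +1 for White and -1 for Black.
    winner:  +1 if White won, -1 if Black won, 0 for a draw.
    z is +1 when the side to move went on to win, -1 when it lost.
    """
    samples = []
    for position, side_to_move, visit_counts in history:
        z = winner * side_to_move
        samples.append((position, visits_to_policy(visit_counts), z))
    return samples


history = [
    ("start", +1, {"e2e4": 3, "d2d4": 1}),
    ("after e2e4", -1, {"e7e5": 2}),
]
samples = label_game(history, winner=+1)
```

Here White wins, so the first position (White to move) is labelled z = +1 and the second (Black to move) z = −1.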

The training objective combines both heads (plus weight decay):

L = (v − z)² − πᵀ log p + λ‖θ‖²

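The three terms are the value error (squared error against the game result z), the policy cross-entropy against the MCTS target π, and an L2 weight-decay penalty on the parameters θ. Minimising the first two is what makes a single network good at both halves of the search.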
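
For a single training example the objective can be evaluated by hand. The sketch below uses plain Python lists rather than tensors, and the `weight_decay` default is an arbitrary illustrative value, not the one used in this project.

```python
import math


def alphazero_loss(v, z, pi, p, theta, weight_decay=1e-4):
    """Single-example loss: (v - z)^2 - pi . log p + lambda * ||theta||^2.

    v: predicted value, z: game result, pi: MCTS policy target,
    p: predicted policy, theta: flat list of network weights.
    """
    value_loss = (v - z) ** 2
    # Cross-entropy against the search policy. Moves with a zero target
    # contribute nothing, so they are skipped.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = weight_decay * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty


loss = alphazero_loss(
    v=0.5, z=1.0,
    pi=[0.5, 0.5], p=[0.5, 0.5],
    theta=[1.0, 2.0], weight_decay=0.1,
)
```

In this example the value term is 0.25, the policy term is ln 2 (about 0.693) and the penalty is 0.5, for a total of roughly 1.443. A real training step would average this over a batch and let the framework's optimiser handle weight decay.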

4 · A real training run

[Figure: loss curves from an actual self-play run. The policy loss drops sharply as the network learns to imitate the search.]

Honest note: reaching strong play needs far more (GPU) self-play than a laptop CPU allows. The learning machinery is correct; the limit is compute. So the live board above uses a fast classical alpha-beta engine (material + endgame heuristics, quiescence) that plays sound chess today, while the neural network remains the from-scratch learning project.
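
For readers unfamiliar with alpha-beta, the core idea fits in a few lines. The sketch below runs on a toy tree of nested lists instead of chess positions and leaves out the evaluation heuristics and quiescence search mentioned above, so it shows only the pruning idea and is not the live board's engine.

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf")):
    """Negamax alpha-beta search on a toy tree.

    A leaf is a number: the static evaluation from the point of view
    of the side to move at that leaf. An internal node is a list of children.
    """
    if not isinstance(node, list):
        return node
    best = float("-inf")
    for child in node:
        # The child's score is from the opponent's view, so negate it.
        score = -alphabeta(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # the opponent would never allow this line; stop searching it
    return best


tree = [[3, 5], [2, 9]]
result = alphabeta(tree)
```

On this tree the result is 3: the first reply guarantees at least 3, and in the second branch the leaf 9 is never examined because the leaf 2 already shows that branch is worse.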
