I train self-supervised models on chess game data. My Python pipeline using python-chess took 25 minutes to parse and tokenize 1M games from Lichess PGN dumps. I rewrote it in Rust. It now takes 15 seconds.
This post covers the architecture, why Rust was the right choice, and what I learned.
The problem
Training a chess move predictor requires converting PGN (Portable Game Notation) files into tokenized sequences — arrays of integer IDs that a neural network can consume. A typical Lichess monthly dump has 5M+ games in a zstd-compressed PGN file.
My Python pipeline had three bottlenecks:
- PGN parsing — python-chess parses SAN notation, validates moves on a board, handles edge cases. Correct, but slow. ~15 minutes for 1M games.
- Tokenization — converting validated UCI moves to token IDs, tracking piece types and turns. ~10 minutes.
- Memory — all games loaded into a Python list of dicts. 1M games = ~4GB RAM.
The Rust rewrite
The tool is called ailed-soulsteal (named after Alucard’s Soul Steal from Castlevania: Symphony of the Night — the AILED project has a Castlevania naming theme).
Architecture
Three layers, each behind a clean boundary:

```
Input Layer   →   Filter Layer   →   Output Layer
(PGN parser)      (ELO, result)     (.somabin binary)
```
Everything streams — games are parsed, filtered, tokenized, and written one at a time. Memory usage stays constant regardless of input size.
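The streaming pipeline can be sketched as a chain of lazy iterator adapters. This is an illustration, not the actual soulsteal API — `Game`, `parse`, `tokenize`, and `run_pipeline` are invented names, and the "records" stand in for what would really be PGN streamed out of a zstd reader:

```rust
// Illustrative names only; the real tool streams PGN from a zstd reader.
struct Game {
    elo: u32,
    moves: Vec<String>,
}

// Stand-in parser: records are fake "elo;move;move;..." strings.
// `into_iter().filter_map(...)` is lazy — nothing is parsed until pulled.
fn parse(records: Vec<String>) -> impl Iterator<Item = Game> {
    records.into_iter().filter_map(|r| {
        let mut parts = r.split(';');
        let elo = parts.next()?.parse().ok()?;
        Some(Game { elo, moves: parts.map(str::to_string).collect() })
    })
}

// Toy tokenizer: one integer id per move (here, just the SAN length).
fn tokenize(g: &Game) -> Vec<u16> {
    g.moves.iter().map(|m| m.len() as u16).collect()
}

fn run_pipeline(records: Vec<String>, lo: u32, hi: u32) -> Vec<Vec<u16>> {
    parse(records)
        .filter(|g| (lo..=hi).contains(&g.elo)) // filter layer (ELO band)
        .map(|g| tokenize(&g))                  // tokenize one game at a time
        .collect()                              // output layer would write here
}
```

Because every stage is an iterator adapter, only one game is materialized at any moment — the `collect()` at the end would be a streaming binary writer in the real tool.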
Move validation with shakmaty
shakmaty is a pure Rust chess library. It handles SAN parsing, move validation, and piece type lookup — the same things python-chess does, but at native speed.
```rust
// Parse SAN, validate it against the current position, and advance.
let san: shakmaty::san::San = san_str.parse().ok()?;
let m = san.to_move(&pos).ok()?;           // legality check against `pos`
let uci = uci_string(&m);                  // helper: Move -> UCI string
let category = role_to_category(m.role()); // helper: piece role -> token category
pos = pos.play(&m).ok()?;                  // apply the move (bitboard update)
```
This is where most of the speedup comes from. shakmaty’s play() is essentially a few bitboard operations — no Python overhead, no GC pressure.
Binary output format
Instead of writing JSON or CSV, I designed a binary format (.somabin) optimized for ML training:
```
Header (64 bytes): magic, version, vocab_size, num_games, ...
Index Table:       byte offset of each game (enables random access)
Data Section:      per game: [seq_len, token_ids, turn_ids, category_ids, outcome]
```
The index table is the key insight. A PyTorch `Dataset.__getitem__(i)` can seek directly to game i via mmap without scanning the file. Loading 50K games takes 20ms. Random access runs at 500K games/sec.
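To make the index-table idea concrete, here is a minimal sketch of the pattern in Rust — a deliberately simplified layout (`[num_games: u32][offset: u64 × n][games…]`, each game just `[seq_len: u16][token_ids: u16 × seq_len]`), not the actual .somabin spec:

```rust
use std::io::{Cursor, Read, Seek, SeekFrom, Write};

// Simplified layout: [num_games: u32 LE][offsets: u64 LE * n][games...].
fn write_dataset(games: &[Vec<u16>]) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.write_all(&(games.len() as u32).to_le_bytes()).unwrap();
    // Reserve the index table, then backfill each offset as its game is written.
    let index_pos = buf.len();
    buf.resize(index_pos + games.len() * 8, 0);
    for (i, game) in games.iter().enumerate() {
        let offset = buf.len() as u64;
        buf[index_pos + i * 8..index_pos + (i + 1) * 8]
            .copy_from_slice(&offset.to_le_bytes());
        buf.write_all(&(game.len() as u16).to_le_bytes()).unwrap();
        for &tok in game {
            buf.write_all(&tok.to_le_bytes()).unwrap();
        }
    }
    buf
}

// Random access: read offset i from the index, seek straight to the game.
// No scan of the preceding games is ever needed.
fn read_game(data: &[u8], i: usize) -> Vec<u16> {
    let mut cur = Cursor::new(data);
    cur.seek(SeekFrom::Start(4 + (i as u64) * 8)).unwrap();
    let mut off = [0u8; 8];
    cur.read_exact(&mut off).unwrap();
    cur.seek(SeekFrom::Start(u64::from_le_bytes(off))).unwrap();
    let mut len = [0u8; 2];
    cur.read_exact(&mut len).unwrap();
    let n = u16::from_le_bytes(len) as usize;
    let mut toks = vec![0u8; n * 2];
    cur.read_exact(&mut toks).unwrap();
    toks.chunks(2).map(|c| u16::from_le_bytes([c[0], c[1]])).collect()
}
```

With the file mmapped, `read_game` is a couple of pointer jumps — which is why per-item latency stays flat no matter how large the dataset grows.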
Benchmarks
Processing Lichess monthly dumps (zstd compressed) on an M1 MacBook:
| Month | Input | Games (Elo 1000–1800) | Time | Rate |
|---|---|---|---|---|
| 2016-01 | 831 MB .zst | 2,060,197 | 45s | 46K/s |
| 2016-02 | 866 MB .zst | 2,071,332 | 46s | 45K/s |
| 2016-03 | 994 MB .zst | 2,399,234 | 54s | 45K/s |
| 2016-04 | 1.0 GB .zst | 2,438,621 | 55s | 44K/s |
| 2016-07 | 1.0 GB .zst | 2,598,733 | 59s | 44K/s |
11.6M games in 4.3 minutes. The equivalent Python pipeline would take roughly 5 hours.
What I learned
Streaming wins. The biggest architectural decision was making everything an iterator. Games flow through parse → filter → tokenize → write without buffering. Memory usage is constant at ~10MB regardless of input size.
Binary formats beat JSON for ML. My first version wrote JSONL. A 1M-game JSONL file was 2GB and took 30 seconds to load in Python. The .somabin binary for the same data is 550MB and loads in 20ms via mmap.
shakmaty is excellent. Chess move validation is the bottleneck in any PGN pipeline. shakmaty’s bitboard implementation made this a non-issue.
Rust’s type system caught real bugs. The GameParser and GameTokenizer traits enforce separation between parsing and tokenization. When I mixed them up during development, the compiler told me immediately.
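For flavor, here is a guess at the shape of those two traits — the real signatures live in the repo, and the `Toy` backend below is purely illustrative. The point is that a parser cannot accidentally be used where a tokenizer is expected, and a Go/SGF or Shogi/KIF backend would slot in by implementing the same traits:

```rust
// Hypothetical trait shapes, not the actual soulsteal definitions.
trait GameParser {
    type Game;
    fn parse(&self, raw: &str) -> Option<Self::Game>;
}

trait GameTokenizer {
    type Game;
    fn tokenize(&self, game: &Self::Game) -> Vec<u16>;
}

// Toy backend: a "game" is a move list, a token id is the SAN length.
struct Toy;

impl GameParser for Toy {
    type Game = Vec<String>;
    fn parse(&self, raw: &str) -> Option<Self::Game> {
        Some(raw.split_whitespace().map(str::to_string).collect())
    }
}

impl GameTokenizer for Toy {
    type Game = Vec<String>;
    fn tokenize(&self, game: &Self::Game) -> Vec<u16> {
        game.iter().map(|m| m.len() as u16).collect()
    }
}
```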
Try it
```shell
cargo install ailed-soulsteal

# Generate vocabulary
soulsteal vocab --generate -o vocab.json

# Tokenize a Lichess dump
soulsteal tokenize lichess_2016-02.pgn.zst \
  -o train.somabin \
  --vocab vocab.json \
  --elo 1000:1800

# Inspect
soulsteal info train.somabin
soulsteal stats train.somabin
```
Pre-tokenized datasets are available on Hugging Face.
The tool is designed to support any turn-based game — Go (SGF), Shogi (KIF), etc. Chess is the v1 implementation, but the GameParser and GameTokenizer traits are game-agnostic.
Source: github.com/Ailed-AI/ailed-soulsteal
crates.io: ailed-soulsteal
License: MIT




