Frank Chiarulli Jr.

SMLL: Using 200MB of Neural Network to Save 400 Bytes

February 6, 2026

We compressed Jane Austen’s opening line to 10 bytes. The model weights required to decompress it are 200 megabytes.

This is the future of compression.

Results

| Text | Original | gzip | SMLL |
| --- | --- | --- | --- |
| “It is a truth universally acknowledged…” | 117 bytes | 125 bytes | 10 bytes |
| “It was the best of times…” | 52 bytes | 60 bytes | 6 bytes |
| “All happy families are alike…” | 76 bytes | 84 bytes | 9 bytes |

gzip makes these files larger. We made them 10x smaller. (Yes, gzip has ~20 bytes of header overhead on tiny inputs. The real benchmarks are below.)

The compression ratio on LLM-generated text is 14.96x. gzip achieves 1.89x. We are 8x better than gzip at compressing text, provided you don’t count the 200MB model both parties need to agree on beforehand.

Background

In 1948, Claude Shannon proved there’s a theoretical minimum number of bits required to encode any message. This minimum is the entropy: roughly, how “surprising” the data is on average.

If you can predict what comes next with high confidence, the next symbol carries little information. If you can’t predict it at all, it carries maximum information.

LLMs do just this. When GPT outputs “the probability of the next token being ‘the’ is 73%,” it’s directly stating how much information that token carries: -log₂(0.73) = 0.45 bits.
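
That arithmetic is a one-liner to check:

import math

p = 0.73                 # model's stated probability for the next token
print(-math.log2(p))     # ≈ 0.45 bits of information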

Arithmetic coding is a compression algorithm from the 1970s that converts a stream of probabilities into a bitstream. You give it probabilities, it gives you near-optimal compression.

So we plugged an LLM into an arithmetic coder. The LLM provides probabilities. The arithmetic coder converts them to bits. The result approaches the theoretical limit of compression.

This idea is not new. DeepMind published a paper about it in 2023. Fabrice Bellard built ts_zip. The Hutter Prize, running since 2006, offers €500,000 for compressing Wikipedia, on the explicit premise that compression and intelligence are related. We just wanted to see the numbers ourselves.

How It Works

Text → Tokenizer → LLM → Probabilities → Arithmetic Coder → Bits

For each token:

  1. Ask the LLM: “Given the previous tokens, what’s the probability distribution over the next token?”
  2. Look up the actual token’s probability
  3. Feed that probability to the arithmetic coder
  4. The arithmetic coder outputs bits proportional to -log₂(probability)
  5. Append the token to context, repeat
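
A minimal sketch of that loop, with model, coder, and tokenizer as illustrative stand-ins for the llama.cpp wrapper, the arithmetic coder, and the tokenizer (not SMLL’s actual API):

def compress(text, model, coder, tokenizer):
    context = []
    for token in tokenizer.encode(text):
        probs = model.next_token_probs(context)  # step 1: distribution over the vocabulary
        coder.encode_symbol(token, probs)        # steps 2-4: emits about -log2(probs[token]) bits
        context.append(token)                    # step 5: extend context, repeat
    # Encode an end-of-sequence marker so the decoder knows where to stop.
    coder.encode_symbol(tokenizer.eos_id, model.next_token_probs(context))
    return coder.finish()                        # flush the final bits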

Decompression is symmetric. Read bits, query the LLM for probabilities, decode the token, extend context, continue until end-of-sequence.
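
The decode loop, sketched with the same illustrative names:

def decompress(bits, model, coder, tokenizer):
    coder.start(bits)
    context = []
    while True:
        probs = model.next_token_probs(context)  # same weights, same context, same distribution
        token = coder.decode_symbol(probs)       # reads just enough bits to identify the token
        if token == tokenizer.eos_id:
            break
        context.append(token)
    return tokenizer.decode(context)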

The model must be identical on both ends. The weights are the codebook. Different weights, different probabilities, wrong tokens, garbage output. This is what we in the business call a feature.

We use arithmetic coding rather than Huffman because it achieves fractional bits per symbol. A token with 70% probability takes 0.51 bits, not Huffman’s rounded-up 1 bit.

The implementation uses numerically-stable softmax for probability extraction, 32-bit fixed-point arithmetic coding with a “bits outstanding” counter for underflow, and vocabulary sorted by probability to ensure encoder and decoder compute identical CDFs. Built on llama.cpp for inference. Model format is GGUF. Python bindings via pybind11.
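
A sketch of the probability-extraction step, assuming raw logits from llama.cpp (illustrative, not the exact SMLL code):

import numpy as np

def logits_to_cdf(logits, total=1 << 16):
    # Numerically stable softmax: subtract the max logit before exponentiating.
    z = logits - np.max(logits)
    p = np.exp(z)
    p /= p.sum()
    # Sort by probability so encoder and decoder build the same cumulative
    # distribution, then quantize to integer frequencies (at least 1 each)
    # for the fixed-point arithmetic coder.
    order = np.argsort(-p, kind="stable")
    freqs = np.maximum(1, np.round(p[order] * total).astype(np.int64))
    cdf = np.concatenate(([0], np.cumsum(freqs)))
    return order, cdf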

Benchmarks

By Content Type

| Content Type | SMLL | gzip | zstd |
| --- | --- | --- | --- |
| LLM-Generated | 14.96x | 1.89x | 1.86x |
| Wikipedia | 14.83x | 2.05x | 2.02x |
| C Code | 11.19x | 2.60x | 2.53x |
| Python Code | 10.48x | 3.19x | 3.19x |
| Natural Prose | 9.75x | 1.81x | 1.79x |
| JSON | 7.86x | 2.72x | 2.66x |
| Repetitive | 75.00x | 22.22x | 20.00x |
| UUIDs | 0.94x | 1.71x | 1.76x |

SMLL wins on 7 of 8 content types. UUIDs are random hexadecimal strings. The LLM cannot predict random data. Neither can anything else. This is fine.

LLM-generated text compresses best. The model is predicting outputs similar to what it would generate. This is circular in a way that happens to be useful.

By Text Length

| Length (chars) | SMLL (bits/char) | gzip (bits/char) |
| --- | --- | --- |
| 50 | 2.24 | 9.12 |
| 100 | 1.44 | 6.96 |
| 500 | 0.98 | 4.98 |
| 1000 | 0.85 | 4.47 |

Compression improves with length because the LLM accumulates context. More context means better predictions means fewer bits. At 1000 characters we’re under 1 bit per character. The theoretical minimum for English is estimated around 0.6-1.3 bits per character. We’re in that range.
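
Bits per character here is just the compressed size in bits divided by the input length in characters; with the compress() call from the Usage section below, it’s one line:

# bits/char for a round trip: compressed bytes × 8 / input characters
bits_per_char = 8 * len(compressed) / len(text)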

Speed

| Method | Throughput |
| --- | --- |
| SMLL | 700 chars/sec |
| gzip | 6,500,000 chars/sec |

SMLL is approximately 10,000x slower. Each token requires a forward pass through the neural network. This is the cost of using a 360 million parameter model as your compression dictionary.

A 10KB document takes about 15 seconds to compress. A 1MB document takes about 25 minutes. These are the trade-offs one makes.

On Practicality

The model is 200MB. The encoder and decoder must have identical model weights. Compression is 10,000x slower than gzip.

In exchange, you get 8x better compression on natural language and approach the theoretical limit of what’s possible.

Whether this trade-off makes sense depends on your situation. If you’re archiving text you’ll rarely decompress and storage costs more than compute, maybe. If you’re compressing HTTP responses, absolutely not.

The more interesting observation is theoretical. Cross-entropy loss, the thing we train LLMs on, directly measures compression efficiency. When we say “this model has lower perplexity,” we mean “this model compresses text better.” Language modeling is compression. Compression is language modeling. These are the same research problem wearing different clothes.
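
The correspondence is literal: the arithmetic coder’s output length is, up to rounding, the model’s total cross-entropy on the text measured in bits. A sketch:

import math

def cross_entropy_bits_per_token(token_probs):
    # token_probs: the model's probability for each token that actually occurred.
    # Summing -log2(p) gives (up to rounding) the compressed length in bits;
    # the average is the cross-entropy loss, converted from nats to bits.
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)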

Solomonoff induction, the theoretical ideal for prediction, is defined in terms of Kolmogorov complexity: the length of the shortest program that produces the data. The Hutter Prize makes this connection explicit: compress Wikipedia well enough and you’ve demonstrated something about intelligence.

We haven’t demonstrated anything about intelligence yet. A simple n-gram lookup table might achieve similar ratios through memorization alone. If the LLM meaningfully outperforms lookup tables on novel text, that’s a stronger claim about what “compression = intelligence” actually means. I think this would be an interesting follow-up.

We’ve demonstrated that a 360M parameter model can compress text to 0.68 bits per character, which is close to optimal. I found this fascinating; I hope you did too.

Usage

If you want to try it out or run your own benchmarks, source code and installation instructions are on GitHub.

pip install smll

import smll

# Compressor and decompressor must load identical model weights; here,
# a 4-bit quantized SmolLM2-360M in GGUF format.
with smll.Compressor.from_pretrained(
    "QuantFactory/SmolLM2-360M-GGUF",
    "*Q4_0.gguf"
) as c:
    text = "To be, or not to be, that is the question."

    compressed = c.compress(text)
    decompressed = c.decompress(compressed)

    assert decompressed == text
    print(f"{len(text)} bytes → {len(compressed)} bytes")

This was built at the Recurse Center with Lauria. If spending time on experiments like this sounds interesting to you, you should apply.

