The milestone · The Forge

Training completed

The Forge.

The first iCrisol Mini has been trained from scratch. No cluster of thousands of GPUs, no trillions of tokens, no army of researchers. One person, one machine, 24 hours and ten cents of electricity. This is the open, honest record of that milestone — and of what it means that, with so little, a cognitive organism has come to breathe.


The magnitude is in the disproportion.

The big models are born from budgets in the hundreds of millions. iCrisol is born from constraint and turns it into a thesis: if this holds with so little, the paradigm matters.

single person

Design, architecture, corpus, training and product — one developer.

vs Hundreds of researchers at the big labs.

single machine

One NVIDIA DGX Spark GB10 (128 GB unified memory).

vs Tens of thousands of GPUs in dedicated clusters.

164 M

tokens seen

Just 0.87% of one epoch over an 18.83 B-token corpus.

vs Trillions of tokens and full epochs.
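The 164 M figure can be reproduced from the run configuration. The sketch below is a back-of-the-envelope check, assuming every one of the 5,000 steps processes a full effective batch of 32 sequences at the full 1,024-token context; the function names are illustrative and not part of the iCrisol codebase.

```python
# Back-of-the-envelope check of "tokens seen", using the published run settings.
# Assumption: every step consumes a full batch of full-length sequences.

def tokens_seen(steps: int, effective_batch: int, seq_len: int) -> int:
    """Total tokens processed over the whole run."""
    return steps * effective_batch * seq_len


def epoch_fraction(seen: int, corpus_tokens: float) -> float:
    """Share of a single pass over the corpus covered by the run."""
    return seen / corpus_tokens


seen = tokens_seen(steps=5_000, effective_batch=32, seq_len=1_024)
fraction = epoch_fraction(seen, corpus_tokens=18.83e9)

print(f"tokens seen: {seen / 1e6:.2f} M")         # 163.84 M
print(f"fraction of one epoch: {fraction:.2%}")   # 0.87%
```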

24 h

of training

A single from-scratch run, 5,000 steps, on desktop-class hardware.

vs Months of massive parallel compute.

€0.10

of electricity

51.8 W average · 1.27 kWh · at €0.08/kWh.

vs Power bills the size of a city.
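The electricity figure is simple arithmetic: average power times duration, times the price per kWh. A minimal sketch, assuming the 24.4 h run time and 51.8 W average draw quoted for the run; the helper names are hypothetical.

```python
# Energy and cost of the run from average power draw and duration.
# Inputs are the published figures; the helpers are illustrative only.

def energy_kwh(avg_power_w: float, hours: float) -> float:
    """Energy in kWh for a constant average power draw."""
    return avg_power_w * hours / 1000.0


def cost_eur(kwh: float, price_per_kwh: float) -> float:
    """Electricity cost at a flat price per kWh."""
    return kwh * price_per_kwh


kwh = energy_kwh(avg_power_w=51.8, hours=24.4)
cost = cost_eur(kwh, price_per_kwh=0.08)

print(f"energy: {kwh:.2f} kWh")
print(f"cost: EUR {cost:.2f}")
```

With these rounded inputs the result is about 1.26 kWh, which agrees with the quoted 1.27 kWh to within rounding, and the cost rounds to €0.10.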

Training, in data.

The real loss curve and the perplexity reached, compared with chance and with a mature transformer. Ups and downs included: this is how an organism learns from scratch.

✓ Real data · forja_mini_5000 run · 5,000 steps · 24.4 h · 2026
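For readers who want to relate the cross-entropy values to the perplexities in the charts, perplexity is the exponential of the cross-entropy. The sketch below assumes the reported CE is measured in nats (which matches CE 3.528 giving a perplexity of about 34) and interprets "chance" as a uniform guess over the 64,000-token vocabulary; both are assumptions, not statements from the run logs.

```python
import math

# Perplexity from cross-entropy, plus the "chance" baseline.
# Assumptions: CE is in nats; chance means a uniform distribution over the vocabulary.

def perplexity(ce_nats: float) -> float:
    """Perplexity corresponding to a cross-entropy given in nats."""
    return math.exp(ce_nats)


def chance_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy of a uniform guess over the vocabulary, in nats."""
    return math.log(vocab_size)


best_ppl = perplexity(3.528)           # best CE reported for the run
final_ppl = perplexity(7.43)           # final CE reported for the run
chance_ce = chance_cross_entropy(64_000)

print(f"best PPL: {best_ppl:.1f}")
print(f"final PPL: {final_ppl:.0f}")
print(f"chance CE: {chance_ce:.2f} nats (PPL = vocabulary size)")
```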

Figure: Loss curve (cross-entropy). CE throughout training, with its real ups and downs, unretouched.

Figure: Perplexity reached vs. references. Lower is better; logarithmic scale.

Figure: Reasoning wakes up on its own. NAR geometry (lower = aligning) and causal confidence CAG (rising).

Figure: The data chasm. Training tokens; logarithmic scale.

The astonishing part isn't what it knows. It's how little it learned from.

Each Crisol expert weighs ~105 million parameters, roughly the size of GPT-2 (124 M). For the full 2.13 B-parameter model, theory (the Chinchilla law) asks for some 42.6 billion tokens; the industry trains even expert-sized models on hundreds of billions of tokens, and models of comparable total size on trillions. The first Crisol saw 164 million: less than 1% of a single pass through its library. And still, it breathes.

105 M

parameters per expert

Roughly the size of GPT-2 (124 M). Crisol has 12 such experts, one per layer, and the knowledge is distilled into each, not diluted across a colossus.

1 / 260

of what theory asked for

The Chinchilla law recommends ~42.6 B tokens for 2.13 B parameters. The model saw 164 M — 0.38% of that optimum (and just 0.87% of one epoch of the 18.83 B corpus).

100–300 B

tokens the industry uses

A model of roughly an expert's size (≈105 M to 125 M parameters) is trained today on hundreds of billions of tokens. Ours saw 164 M: between 600 and 1,800 times fewer.

* Open models of comparable total size (2 to 3 B parameters) are trained today on between 2 and 18 trillion tokens. The "Industry" bar in the chart uses the conservative end of that range (2 T).
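The ratios quoted in this section follow from a few divisions. The sketch below assumes the common reading of the Chinchilla result as roughly 20 training tokens per parameter, which is consistent with 42.6 B tokens for 2.13 B parameters; the constant and function names are illustrative.

```python
# Reproducing the data-gap ratios from the published figures.
# Assumption: "Chinchilla-optimal" is read as ~20 training tokens per parameter.

TOKENS_PER_PARAM = 20  # rule-of-thumb value, not taken from the run itself


def chinchilla_tokens(params: float) -> float:
    """Approximate compute-optimal token count for a given parameter count."""
    return TOKENS_PER_PARAM * params


def times_fewer(reference_tokens: float, seen_tokens: float) -> float:
    """How many times smaller the run's token budget is than a reference."""
    return reference_tokens / seen_tokens


seen = 163.84e6                      # tokens seen in the run
optimum = chinchilla_tokens(2.13e9)  # about 42.6 B tokens

print(f"Chinchilla optimum: {optimum / 1e9:.1f} B tokens")
print(f"share of optimum seen: {seen / optimum:.2%}")
print(f"vs optimum: 1 / {times_fewer(optimum, seen):.0f}")
print(f"vs 100 B tokens: {times_fewer(100e9, seen):.0f}x fewer")
print(f"vs 300 B tokens: {times_fewer(300e9, seen):.0f}x fewer")
```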

Run spec sheet

The Forge configuration.

Parameters: ~2.13 B (one expert per layer)
Architecture: 12 layers · 5 universal slots · 1 active
Holographic space: holo 4096 · NAR 2048 · NOE 2048
Expert (SwiGLU): dim 8192 · 1 per layer
Steps: 5,000 · from scratch
Context: seq_len 1024 · effective batch 32
Learning rate: 1e-4 → 5e-6 · warmup 100 · cosine
Vocabulary: 64,000 (multilingual BPE)
Precision: bfloat16
Hardware: DGX Spark GB10 (ARM64, 128 GB)
Tokens seen: 163.84 M · 0.0087 epochs
Forge time: 24.4 h · 51.8 W average
Electricity cost: ~€0.10 (1.27 kWh at €0.08/kWh)
Best / final CE: 3.528 (PPL ~34) / 7.43, with ups and downs, unretouched
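The learning-rate entry in the spec sheet (1e-4 → 5e-6, warmup 100, cosine) can be read as a warmup-plus-cosine schedule. The sketch below is one plausible interpretation, not the run's actual scheduler: it assumes a linear warmup over the first 100 steps and a cosine decay over the remaining 4,900 steps down to the 5e-6 floor. The function `lr_at` and its defaults are hypothetical.

```python
import math

# One possible reading of "1e-4 -> 5e-6, warmup 100, cosine".
# Assumptions: linear warmup, then cosine decay to the floor at the last step.

def lr_at(step: int, peak: float = 1e-4, floor: float = 5e-6,
          warmup: int = 100, total: int = 5_000) -> float:
    """Learning rate at a given 0-indexed step under warmup + cosine decay."""
    if step < warmup:
        # Linear ramp that reaches the peak on the last warmup step.
        return peak * (step + 1) / warmup
    # Progress through the decay phase, clamped to [0, 1].
    progress = min((step - warmup) / max(1, total - warmup), 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


for s in (0, 99, 100, 2_550, 5_000):
    print(f"step {s:>5}: lr = {lr_at(s):.2e}")
```

Under these assumptions the rate peaks at 1e-4 when warmup ends, sits at the midpoint between peak and floor halfway through the decay, and ends at 5e-6.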

Why this milestone matters

If a sovereign cognitive organism can be born with this, it stops being a lab promise and becomes a real possibility.

The first iCrisol doesn't compete on scale. It proves the paradigm — living memory, causality, sovereignty, modularity — works from the very first brick. The rest is growth.
