A fine-tuned language model trained on the complete works of Rudolf Steiner
Rudolf Steiner's Gesamtausgabe spans 354 volumes and roughly 100 million tokens of philosophy, education, agriculture, medicine, and esoteric science. It is one of the largest coherent intellectual systems by a single author — and it is almost entirely invisible to modern AI.
A local 14B-parameter model scores just 23% on anthroposophy questions, and standard RAG over the full corpus adds only 3 points. Meanwhile, the best frontier model (Claude Opus) scores 91%, showing the knowledge is learnable in principle, yet inaccessible to smaller models without targeted training.
This project bridges that gap: fine-tuning an open-weight model on the entire corpus so it runs locally, privately, and for free.
AnthroBench v1.0 tested 9 models across 24 questions spanning epistemology, Christology, karma, Waldorf education, biodynamics, eurythmy, and more. Each answer is scored against atomic golden facts from the primary texts.
| Model | Score |
|---|---|
| Qwen3 14B (bare) | 23.1% |
| Qwen3 14B + RAG | 26.2% |
| GPT-4o | 53.2% |
| Claude Haiku 4.5 | 60.2% |
| Grok-3 | 63.9% |
| Grok-4 | 69.3% |
| Gemini 2.5 Pro | 79.5% |
| Claude Sonnet 4.6 | 84.1% |
| Claude Opus 4.6 | 91.1% |
| Anthroposophy.ai (target) | > 50% |
Goal: double the local model's score, making it competitive with GPT-4o on domain-specific questions while running entirely on-device. Full methodology and analysis in the whitepaper.
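The atomic scoring can be sketched as the fraction of golden facts an answer recovers. This toy version uses a naive substring check; the real harness presumably uses a stricter judge, and the facts below are purely illustrative.

```python
# Toy atomic-fact scorer: an answer earns credit for each golden fact it
# contains. Substring matching stands in for whatever entailment check
# the real benchmark uses; facts and answers here are made up.

def score_answer(answer: str, golden_facts: list[str]) -> float:
    """Return the fraction of golden facts present in the answer."""
    text = answer.lower()
    hits = sum(1 for fact in golden_facts if fact.lower() in text)
    return hits / len(golden_facts)

facts = ["twelve senses", "threefold social order"]
print(score_answer("Steiner described twelve senses.", facts))  # 0.5
```

A model's benchmark score is then the mean of these per-question fractions.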
The entire Gesamtausgabe is public domain (Steiner died in 1925). Source corpus maintained at steiner-ga with full provenance.
Qwen3 14B base, quantized to 4-bit via MLX. 14.8B parameters, 40 transformer layers. Chosen for strong multilingual capability and Apache 2.0 license.
Two-stage QLoRA: continued pre-training (CPT) on raw text, then supervised fine-tuning (SFT) on synthetic Q&A. Rank 64, targeting all 7 linear projections across all 40 layers.
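A sketch of what this setup might look like as an mlx-lm LoRA training config. Field names follow mlx-lm's published example config (worth verifying against the installed version); all paths are placeholders.

```yaml
# Hypothetical config for `mlx_lm.lora -c config.yaml` matching the
# stated recipe: rank 64, scale 2.0, no dropout, all 40 layers,
# all 7 linear projections, LR 5e-5.
model: ./qwen3-14b-4bit
train: true
data: ./data                # expects train.jsonl / valid.jsonl
num_layers: 40              # adapt every transformer layer
learning_rate: 5e-5
lora_parameters:
  rank: 64
  scale: 2.0
  dropout: 0.0
  keys: ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj",
         "self_attn.o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

The same trainer covers both stages; CPT runs it over raw text completions, SFT over the synthetic Q&A pairs.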
Apple M4 Max, 36GB unified memory, 32-core GPU. Everything trains locally using MLX — no cloud GPUs, no API costs, fully reproducible on consumer hardware.
22-experiment greedy coordinate search across 8 dimensions. Winner: LR 5e-5, rank 64, scale 2.0, Adam, constant schedule, all layers, no dropout.
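Greedy coordinate search is simple enough to sketch: start from a baseline configuration, then sweep one dimension at a time, keeping any improvement. The objective below is a toy stand-in for an actual training-and-eval run.

```python
# Greedy coordinate search over a hyperparameter grid: optimize one
# dimension at a time while holding the others at their best-so-far
# values. The toy objective replaces a real training run.

def greedy_coordinate_search(grid: dict, objective) -> dict:
    best = {k: v[0] for k, v in grid.items()}   # baseline: first choice each
    for key, choices in grid.items():           # one dimension at a time
        for choice in choices:
            candidate = {**best, key: choice}
            if objective(candidate) < objective(best):
                best = candidate
    return best

grid = {"lr": [1e-4, 5e-5], "rank": [16, 64]}
toy = lambda h: abs(h["lr"] - 5e-5) + 1 / h["rank"]
print(greedy_coordinate_search(grid, toy))  # {'lr': 5e-05, 'rank': 64}
```

With 8 dimensions, this visits one run per candidate value rather than the full Cartesian product, which is how 22 experiments can cover the space.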
Loss: 2.38 → 2.03 — noisy but trending downward. Training on M4 Max at ~60 iterations/hour.
| Outcome | AnthroBench Target | Meaning |
|---|---|---|
| Minimum Viable | > 40% | Real domain knowledge transferred. Worth iterating. |
| Good | 50–60% | Competitive with GPT-4o, running locally for free. |
| Great | 65%+ | Best open-weight anthroposophy model available anywhere. |
The model weights are ephemeral — they'll be replaced as better base models arrive. What endures:
- A reusable benchmark for measuring any model's knowledge of anthroposophy: 24 questions, 127 golden facts, atomic scoring.
- A 124M-token corpus, cleaned, deduplicated, and curriculum-ordered, ready for any base model.
- 6,812 quality-filtered Q&A pairs spanning 12 domains of Steiner's work.
- An end-to-end reproducible pipeline: corpus audit → data engineering → HP search → training → eval. Runs on a laptop.
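The deduplication step can be illustrated with exact matching over normalized paragraphs; the pipeline's actual method may be fuzzier (e.g. near-duplicate detection), so treat this as a minimal sketch.

```python
# Exact paragraph-level deduplication via hashing of normalized text.
# Normalization here is just strip + lowercase; illustrative only.
import hashlib

def dedupe(paragraphs: list[str]) -> list[str]:
    seen, out = set(), []
    for p in paragraphs:
        h = hashlib.sha256(p.strip().lower().encode()).hexdigest()
        if h not in seen:          # keep first occurrence only
            seen.add(h)
            out.append(p)
    return out

print(dedupe(["Dornach lecture.", "dornach lecture.", "Second text."]))
# ['Dornach lecture.', 'Second text.']
```

Exact-match dedup matters for this corpus because the Gesamtausgabe repeats lecture passages across volumes.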
This project is a personal research effort to make Rudolf Steiner's work more accessible through AI. Steiner's writings are public domain. The training pipeline, benchmark, and datasets will be open-sourced upon completion.
Built with MLX, mlx-lm, and Qwen. Benchmarked against models from Anthropic, OpenAI, Google, and xAI.
Related project: Astrosophy.ai — astrosophical chart computation in the Steiner-Sucher tradition.