anthroposophy.ai

A fine-tuned language model trained on the complete works of Rudolf Steiner

Open Source · Local-First · Apple Silicon · Training in Progress

The Problem

Rudolf Steiner's Gesamtausgabe spans 354 volumes and roughly 100 million tokens of philosophy, education, agriculture, medicine, and esoteric science. It is one of the largest coherent intellectual systems by a single author — and it is almost entirely invisible to modern AI.

A local 14B-parameter model scores just 23% on anthroposophy questions. Standard RAG over the full corpus adds only 3 points. Meanwhile, the best frontier model (Claude Opus) scores 91%, proving the knowledge is learnable — but inaccessible to smaller models without targeted training.

This project bridges that gap: fine-tuning an open-weight model on the entire corpus so it runs locally, privately, and for free.

The 68-Point Gap

AnthroBench v1.0 tested 9 models across 24 questions spanning epistemology, Christology, karma, Waldorf education, biodynamics, eurythmy, and more. Each answer was scored against atomic golden facts drawn from the primary texts.
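The atomic-fact scoring idea can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: it uses a naive substring match, and `score_answer` and its arguments are hypothetical names.

```python
def score_answer(answer: str, golden_facts: list[str]) -> float:
    """Return the fraction of golden facts present in an answer.

    Naive substring check standing in for AnthroBench's real judging
    pipeline; the atomic-fact principle is the same: each fact is
    scored independently, and the question score is the hit rate.
    """
    text = answer.lower()
    hits = sum(1 for fact in golden_facts if fact.lower() in text)
    return hits / len(golden_facts)
```

Scoring per atomic fact rather than per answer keeps partial credit well-defined and makes scores comparable across models.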

Model                       Score
Qwen3 14B (bare)            23.1%
Qwen3 14B + RAG             26.2%
GPT-4o                      53.2%
Claude Haiku 4.5            60.2%
Grok-3                      63.9%
Grok-4                      69.3%
Gemini 2.5 Pro              79.5%
Claude Sonnet 4.6           84.1%
Claude Opus 4.6             91.1%
Anthroposophy.ai (target)   > 50%

Goal: double the local model's score, making it competitive with GPT-4o on domain-specific questions while running entirely on-device. Full methodology and analysis in the whitepaper.

The Corpus

  • 334 GA volumes
  • 124M tokens
  • 16,835 documents
  • 1861–1925 span

Training Data

  • 102M tokens from Steiner (English + German)
  • 22M tokens FineWeb-Edu (general knowledge)
  • 13 curriculum batches, chunked at 4,096 tokens
  • 18% general data interleaving
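The chunking and interleaving steps above can be sketched roughly like this. Function names are illustrative, and the real pipeline's curriculum ordering and exact mixing logic may differ:

```python
import random

def chunk(tokens: list[int], size: int = 4096) -> list[list[int]]:
    """Split a token stream into fixed-size training chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def interleave(domain: list, general: list, frac: float = 0.18, seed: int = 0) -> list:
    """Mix general-knowledge chunks into domain chunks at roughly `frac`
    of the final stream, then shuffle deterministically."""
    # Solve n / (len(domain) + n) = frac for the number of general chunks.
    n_general = round(len(domain) * frac / (1 - frac))
    mixed = domain + general[:n_general]
    random.Random(seed).shuffle(mixed)
    return mixed
```

Interleaving a slice of general data (FineWeb-Edu here) is a standard hedge against catastrophic forgetting during continued pre-training on a narrow corpus.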

SFT Data

  • 6,812 synthetic Q&A pairs (Claude-generated)
  • Bloom's Taxonomy + cross-reference questions
  • Quality filtered to ≥ 4.0/5.0 by LLM judge
  • 5% general instruction data (OpenHermes)
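The quality gate might look like the following sketch, where `judge_score` is an assumed field name for the LLM judge's 1–5 rating:

```python
def filter_sft_pairs(pairs: list[dict], min_score: float = 4.0) -> list[dict]:
    """Keep only Q&A pairs rated at or above min_score by the LLM judge.

    The 'judge_score' key is illustrative; the project's actual
    record schema is not specified here.
    """
    return [p for p in pairs if p["judge_score"] >= min_score]
```

A cut at 4.0/5.0 reducing 26K candidates to 6,812 pairs implies roughly three in four candidates were rejected, which is typical for synthetic data pipelines that favor precision over volume.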

The entire Gesamtausgabe is public domain (Steiner died in 1925). Source corpus maintained at steiner-ga with full provenance.

Approach

Base Model

Qwen3 14B base, quantized to 4-bit via MLX. 14.8B parameters, 40 transformer layers. Chosen for strong multilingual capability and Apache 2.0 license.

Fine-Tuning

Two-stage QLoRA: continued pre-training (CPT) on raw text, then supervised fine-tuning (SFT) on synthetic Q&A. Rank 64, targeting all 7 linear projections across all 40 layers.
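Assuming mlx-lm's YAML-configured LoRA trainer, the setup described above might be expressed roughly like this. Key names follow mlx-lm's published example configs, but the paths and exact option names are assumptions to verify against your installed version:

```yaml
# Illustrative mlx-lm LoRA config sketch -- not the project's actual file.
model: "Qwen/Qwen3-14B"        # 4-bit quantized base
train: true
data: "data/cpt"               # directory with train.jsonl / valid.jsonl
learning_rate: 5e-5
iters: 3000
num_layers: 40                 # adapt all 40 transformer layers
lora_parameters:
  rank: 64
  scale: 2.0
  dropout: 0.0
  keys: ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj",
         "self_attn.o_proj", "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj"]
```

The `keys` list spells out the "all 7 linear projections": the four attention projections plus the three MLP projections of each layer.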

Hardware

Apple M4 Max, 36GB unified memory, 32-core GPU. Everything trains locally using MLX — no cloud GPUs, no API costs, fully reproducible on consumer hardware.

Hyperparameters

22-experiment greedy coordinate search across 8 dimensions. Winner: LR 5e-5, rank 64, scale 2.0, Adam, constant schedule, all layers, no dropout.
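Greedy coordinate search itself is simple to sketch: optimize one hyperparameter dimension at a time, keeping the incumbent best while sweeping the next. This is illustrative code, not the project's actual harness:

```python
def greedy_coordinate_search(space: dict, evaluate, start: dict):
    """Greedy coordinate descent over a hyperparameter grid.

    Sweeps each dimension in turn, keeping the best configuration
    found so far. Total cost is 1 + sum(len(opts) - 1) evaluations,
    far cheaper than the full cartesian grid.
    """
    best = dict(start)
    best_score = evaluate(best)
    for dim, options in space.items():
        for value in options:
            if value == best[dim]:
                continue  # already evaluated as part of the incumbent
            candidate = {**best, dim: value}
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score
```

The trade-off: greedy search can miss interactions between dimensions, but with 8 dimensions and ~20 minutes per run it is the only search that fits on a single laptop.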

Training Progress

  1. Corpus Audit (Done): verified 334 volumes, 16,835 files, 100% catalog coverage
  2. RAG Baseline (Done): 10-model benchmark establishing the 68-point gap
  3. Corpus Engineering (Done): 124.5M tokens, 13 curriculum batches, dedup & cleaning
  4. Synthetic Q&A (Done): 6,812 filtered pairs from 26K candidates, Bloom's Taxonomy
  5. Hyperparameter Search (Done): 22 experiments, 8 dimensions, ~20 min each on M4 Max
  6. CPT Training (In progress): 3,000 iterations on the full 124M-token corpus
  7. Evaluation (Pending): AnthroBench v2 (300 questions), perplexity, general benchmarks
  8. SFT Training (Pending): supervised fine-tuning on curated Q&A pairs
  9. Release (Pending): GGUF export, Ollama model, Open WebUI integration
CPT progress: 1,428 / 3,000 iterations (47.6%). Loss has fallen from 2.38 to 2.03, noisy but trending downward, at ~60 iterations/hour on the M4 Max.
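At that rate, the remaining wall-clock time is straightforward to estimate (a trivial helper, not part of the pipeline):

```python
def eta_hours(done: int, total: int, iters_per_hour: float) -> float:
    """Remaining wall-clock hours at the current training rate."""
    return (total - done) / iters_per_hour

remaining = eta_hours(1428, 3000, 60)  # ~26 hours of training left
```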

Success Criteria

Outcome          AnthroBench Target   Meaning
Minimum Viable   > 40%                Real domain knowledge transferred. Worth iterating.
Good             50–60%               Competitive with GPT-4o, running locally for free.
Great            65%+                 Best open-weight anthroposophy model available anywhere.

Durable Assets

The model weights are ephemeral — they'll be replaced as better base models arrive. What endures:

AnthroBench

A reusable benchmark for measuring any model's knowledge of anthroposophy. 24 questions, 127 golden facts, atomic scoring.

Training Corpus

124M tokens, cleaned, deduplicated, and curriculum-ordered. Ready for any base model.

SFT Dataset

6,812 quality-filtered Q&A pairs spanning 12 domains of Steiner's work.

Pipeline

End-to-end reproducible: corpus audit → data engineering → HP search → training → eval. Runs on a laptop.

About

This project is a personal research effort to make Rudolf Steiner's work more accessible through AI. Steiner's writings are public domain. The training pipeline, benchmark, and datasets will be open-sourced upon completion.

Built with MLX, mlx-lm, and Qwen. Benchmarked against models from Anthropic, OpenAI, Google, and xAI.

Related project: Astrosophy.ai — astrosophical chart computation in the Steiner-Sucher tradition.