Looped & Recursive Laguna-XS.2

[Figure: looped Laguna]

Two experiments squeezing more out of poolside/Laguna-XS.2 (33B total / 3B active, 256-expert MoE): one at inference time, one by shrinking the weights.

1. Training-free looped transformer

We turn Laguna into a looped transformer at inference time, re-applying a mid-stack block of layers with a damped Runge–Kutta update, following Chen et al. 2026, Training-Free Looped Transformers. No training, no new weights, no architecture change.
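The looping idea can be sketched in a few lines. This is a minimal illustration, not the report's implementation. It assumes the looped block is residual, `block(h) = h + f(h)`, so re-applying it can be read as integrating `dh/dt = f(h)`. It also assumes the extra passes use a Heun (second-order Runge–Kutta) step scaled by a damping factor. The names `block_delta`, `looped_block`, `num_loops`, and `damping` are hypothetical, and the exact update rule in Chen et al. may differ.

```python
import math

def block_delta(h):
    """Toy stand-in for f(h), the residual update of a mid-stack block.

    In a real model this would be the attention + MoE layers; here it is
    a fixed elementwise function so the sketch runs on CPU.
    """
    return [math.tanh(-0.5 * x) for x in h]

def looped_block(h, delta_fn, num_loops=2, damping=0.5):
    """Apply a residual block once, then re-apply it with damped Heun steps.

    Assumption: the first pass is the ordinary forward pass, and each extra
    loop adds damping * (k1 + k2) / 2, where k1 and k2 are the two RK2 stages.
    """
    if num_loops < 1:
        raise ValueError("num_loops must be at least 1")
    # Ordinary pass: h + f(h), identical to the unmodified model.
    h = [x + d for x, d in zip(h, delta_fn(h))]
    for _ in range(num_loops - 1):
        k1 = delta_fn(h)                                # slope at the current state
        h_pred = [x + d for x, d in zip(h, k1)]         # Euler predictor
        k2 = delta_fn(h_pred)                           # slope at the predicted state
        h = [x + damping * 0.5 * (a + b) for x, a, b in zip(h, k1, k2)]
    return h

hidden = [1.0, -2.0]
print(looped_block(hidden, block_delta, num_loops=1))   # plain forward pass
print(looped_block(hidden, block_delta, num_loops=3))   # two extra damped loops
```

With `num_loops=1` or `damping=0` the sketch reduces to the unmodified block, which is the sense in which looping adds no weights and no architecture change.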

Headline: the method transfers to a large fine-grained MoE and gives a small but consistent gain. It is positive on 5 of 6 knowledge benchmarks and significant on ARC (p=0.005) and MMLU (p=0.038). But the intuitive levers for improving it (looping deeper, looping the global-attention layers) mostly don't pan out. It's a local refiner, not a reasoning amplifier.

[Figure: per-task forest plot]

📄 Full report → REPORT_LOOPED.md: method, K-sweep, the refuted global-vs-sliding hypothesis, mechanism probes, and all numbers.

2. Relaxed Recursive Transformer

We tie MoE expert banks across layers to shrink the model, then uptrain with an LM loss plus knowledge distillation (KD) from the full model (Bae et al., 2025). Inference compute is unchanged; the cost is paid once, in training. The reference method was only tried on dense models; this applies it to a large MoE.
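As a rough illustration of the uptraining objective, the sketch below combines a next-token cross-entropy term with a KL term that pulls the tied (student) model toward the full (teacher) model. The text above only says "LM loss + KD", so the mixing weight `kd_weight`, the `temperature`, and the function names are assumptions made for this example, not the settings used in the report.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def uptraining_loss(student_logits, teacher_logits, target_id,
                    kd_weight=0.5, temperature=1.0):
    """LM cross-entropy plus KL(teacher || student) for a single token.

    Assumption: the two terms are mixed linearly, with the usual T^2
    scaling on the distillation term.
    """
    # LM loss: negative log-likelihood of the true next token.
    lm_loss = -log_softmax(student_logits)[target_id]
    # KD loss: KL divergence from the teacher's softened distribution.
    s_logp = log_softmax(student_logits, temperature)
    t_logp = log_softmax(teacher_logits, temperature)
    kd_loss = sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))
    return (1.0 - kd_weight) * lm_loss + kd_weight * temperature ** 2 * kd_loss

# Toy 3-token vocabulary: the student disagrees with the teacher.
print(uptraining_loss([0.2, 1.0, -0.5], [0.0, 2.0, -1.0], target_id=1))
```

When student and teacher logits match, the KD term vanishes and only the weighted LM loss remains.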

Headline: distillation recovers large tying perturbations. At 4.5–9.1% fewer stored params, held-out Python perplexity lands within ~1 perplexity point of a matched reference, and that gap stays roughly constant as compression grows.
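The "fewer stored params" figure comes down to counting: every group of layers that shares one expert bank stores a single copy instead of one per layer. The helper below sketches that arithmetic under simplifying assumptions (all layers the same size, embeddings ignored). The function name and the toy sizes are hypothetical and are not Laguna's actual configuration, so the output is not meant to reproduce the 4.5–9.1% range.

```python
def stored_param_fraction_saved(num_layers, expert_bank_params,
                                other_params_per_layer, tie_groups):
    """Fraction of stored parameters removed by tying expert banks.

    tie_groups is a list of lists of layer indices; the layers in each
    group share one expert bank, so a group of size n drops n - 1 banks.
    """
    flat = [i for group in tie_groups for i in group]
    if len(flat) != len(set(flat)):
        raise ValueError("a layer may belong to at most one tie group")
    if any(i < 0 or i >= num_layers for i in flat):
        raise ValueError("layer index out of range")
    total = num_layers * (expert_bank_params + other_params_per_layer)
    banks_removed = len(flat) - len(tie_groups)
    return banks_removed * expert_bank_params / total

# Toy numbers: 10 layers, each with 8 units of expert weights and 2 of
# everything else; layers 4 and 5 share one bank.
print(stored_param_fraction_saved(10, 8, 2, [[4, 5]]))
```

Only the stored weights shrink in this accounting. Each token still passes through every layer, which is why inference compute stays the same.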

[Figure: recovery gap across configurations]

📄 Full report → REPORT_RRT.md: method, setup, per-config gaps, and limitations.


Code & setup

uv sync
uv run python scripts/fetch_laguna_src.py   # pull model source (no weights)
uv run pytest                               # CPU suite (skips network/GPU tests)

Full method, GPU run instructions, and results are in the reports above.