Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: values; this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
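
Steps 6 and 7 could be sketched roughly as follows. This is an illustration, not the code in wiki-distill.py: the function name `stage_actions`, the JSON field names (`type`, `page_type`, `slug`, `body`, `notes`), and the `patterns` page type are assumptions. Only the three action kinds and the four provenance keys come from the list above.

```python
import json
import tempfile
from pathlib import Path


def stage_actions(raw_json, topic, sources, staging_root):
    """Write new_page/update_page actions under staging/<type>/; ignore skips."""
    written = []
    for action in json.loads(raw_json)["actions"]:
        if action["type"] == "skip":
            continue  # the model judged the topic already covered
        # Provenance keys as listed in step 7; the value layout is assumed.
        frontmatter = "\n".join([
            "---",
            "staged_by: wiki-distill",
            f"distill_topic: {topic}",
            "distill_source_conversations:",
            *[f"  - {source}" for source in sources],
            f"compilation_notes: {json.dumps(action.get('notes', ''))}",
            "---",
        ])
        target = Path(staging_root) / action["page_type"] / f"{action['slug']}.md"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(frontmatter + "\n\n" + action["body"] + "\n", encoding="utf-8")
        written.append(target)
    return written


# Hypothetical model output for one topic group (field names are assumed).
EXAMPLE = json.dumps({"actions": [
    {"type": "new_page", "page_type": "patterns", "slug": "zoho-api",
     "body": "# Zoho API", "notes": "rolled up 2 conversations"},
    {"type": "skip", "reason": "already covered by a live page"},
]})

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in stage_actions(EXAMPLE, "zoho-api", ["conv-a.md", "conv-b.md"], tmp):
            print(path.relative_to(tmp))
```

Nothing is written for a skip action, so a topic the model rejects leaves no trace in staging.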
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
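
The bootstrap rule amounts to picking a date window from whether state exists yet. A minimal sketch, assuming a missing state file is what marks a first run and that the 7-day window counts back from today; `distill_window` and its parameters are hypothetical names, not the script's API.

```python
from datetime import date, timedelta

FIRST_RUN_LOOKBACK_DAYS = 7  # figure taken from the description above


def distill_window(today, state_exists, force_first_run=False):
    """Return the (start, end) dates whose conversations seed the topic set."""
    if force_first_run or not state_exists:
        # First run: widen the window so the state file starts with
        # more than a single day of topics.
        return today - timedelta(days=FIRST_RUN_LOOKBACK_DAYS), today
    # Daily run: today only.
    return today, today


if __name__ == "__main__":
    print(distill_window(date.today(), state_exists=False))
    print(distill_window(date.today(), state_exists=True))
```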
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
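
The narrow-filter, wide-rollup behavior can be sketched in a few lines. This is a simplified model under stated assumptions: `Conversation`, `rollup`, and `high_signal_bullets` are illustrative names (the real helpers live in scripts/wiki_lib.py and may differ), and the `hall_events` key in the demo is a guess at how a skipped hall is stored.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date

# The three high-signal hall types named in step 4.
HIGH_SIGNAL = ("hall_facts", "hall_discoveries", "hall_advice")


@dataclass
class Conversation:
    """Simplified stand-in for one summarized conversation."""
    path: str
    day: date
    topics: list
    halls: dict = field(default_factory=dict)  # hall name -> list of bullets


def rollup(conversations, start, end):
    """Group ALL conversations in history under each topic seen in [start, end]."""
    active = {t for c in conversations if start <= c.day <= end for t in c.topics}
    groups = defaultdict(list)
    for conv in conversations:  # full history, not just the window
        for topic in conv.topics:
            if topic in active:
                groups[topic].append(conv)
    return dict(groups)


def high_signal_bullets(group):
    """Keep only fact/discovery/advice bullets from a topic group."""
    return [b for c in group for hall in HIGH_SIGNAL for b in c.halls.get(hall, [])]


if __name__ == "__main__":
    today = date(2025, 1, 10)
    history = [  # made-up data for illustration
        Conversation("old.md", date(2024, 3, 1), ["zoho-api"],
                     {"hall_facts": ["fact from an old session"],
                      "hall_events": ["event that gets skipped"]}),
        Conversation("idle.md", date(2024, 5, 1), ["dormant-topic"],
                     {"hall_facts": ["never pulled in today"]}),
        Conversation("new.md", today, ["zoho-api"],
                     {"hall_advice": ["advice from today"]}),
    ]
    for topic, group in rollup(history, today, today).items():
        print(topic, [c.path for c in group], high_signal_bullets(group))
```

In the demo, the 2024 conversation is pulled in only because its topic resurfaced today, while `dormant-topic` stays untouched until some new conversation mentions it.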
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
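
How the flags shape a run might be modeled as below. wiki-maintain.sh itself is a shell script, so this Python sketch only illustrates the ordering. The flag semantics are assumptions read off the flag names (--distill-only runs just Phase 1a, --no-distill drops it), as is treating the two as mutually exclusive; --distill-first-run is left out because it presumably changes the distill window rather than the phase list.

```python
# Phase order as listed above: (phase id, step).
PHASES = [
    ("1a", "wiki-distill.py"),
    ("1b", "wiki-harvest.py"),
    ("2", "wiki-hygiene.py"),
    ("3", "qmd reindex"),
]


def plan_phases(distill_only=False, no_distill=False):
    """Return the phases a run would execute, in order (assumed semantics)."""
    if distill_only and no_distill:
        raise ValueError("--distill-only and --no-distill cannot be combined")
    if distill_only:
        return [p for p in PHASES if p[0] == "1a"]
    if no_distill:
        return [p for p in PHASES if p[0] != "1a"]
    return list(PHASES)


if __name__ == "__main__":
    for phase_id, step in plan_phases():
        print(f"Phase {phase_id}: {step}")
```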
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
updated test count in the verify section
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
Two changes, one commit:
1. Reframe "weaknesses" as "extensions memex adds":
Karpathy's gist is a concept pitch, not an implementation. Reframe
the seven places memex extends the pattern as engineering-layer
additions rather than problems to fix. Cleaner narrative — memex
builds on Karpathy's work instead of critiquing it.
Touches README.md (Why each part exists + Credits) and
DESIGN-RATIONALE.md (section titles, trade-off framing, biggest
layer section, scope note at the end).
2. Replace docs/artifacts/signal-and-noise.html with the full
upstream version:
The earlier abbreviated copy dropped the MemPalace integration tab,
the detailed mitigation steps with effort pips, the impact
before/after cards, and the qmd vs ChromaDB comparison. This
restores all of that. Also swaps self-references from "LLM Wiki"
to "memex" while leaving external "LLM Wiki v2" community
citations alone (those refer to a separate pattern and aren't ours
to rename).
The live hosted copy at eric-turner.com/memex/signal-and-noise.html
has already been updated via scp — Hugo picks up static changes with
--poll 1s so the public URL reflects this file immediately.
Replace all four references to the Claude public artifact URL with the
self-hosted version at eric-turner.com/memex/signal-and-noise.html plus
the offline-capable archive at docs/artifacts/signal-and-noise.html.
The Claude artifact can now be unpublished without breaking any links
in the repo. The self-hosted HTML is deployed to the Hugo site's static
directory and lives alongside the archived copy in this repo — either
can stand on its own.
Archive a self-contained HTML copy of the design rationale artifact —
the interactive Signal & Noise analysis of Karpathy's pattern that
produced memex. Fully self-contained (inline CSS + JS, only external
dependency is Google Fonts), works offline, renders identically in any
modern browser.
Updated the README Credits section to link:
1. Live interactive version at eric-turner.com/memex/signal-and-noise.html
2. Original Claude artifact
3. Archived copy in this repo
4. Condensed written version in DESIGN-RATIONALE.md
The archived HTML means the analysis survives even if the live site or
the Claude artifact URL ever goes away.
Replace project self-references ("LLM Wiki" becomes "memex") throughout
README, SETUP, and the example CLAUDE.md files. External artifact titles
are preserved as-is since they refer to the actual title of the Claude
design artifact.
Also add a "Why 'memex'?" aside to the README that roots the project in
Vannevar Bush's 1945 "As We May Think" essay, where the term originates.
The compounding knowledge wiki is the LLM-era realization of Bush's
memex concept: the "associative trails" he imagined are the related:
frontmatter fields and wikilinks the agent maintains.
Kept the lowercase "LLM wiki" where it refers to the generic pattern
(e.g. "an LLM wiki persists its mistakes"), since that names the class
of system, not this specific project.
A compounding LLM-maintained knowledge wiki.
Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.
Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed