feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: values; this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)
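
A minimal sketch of steps 1-3, assuming each summarized conversation
exposes a date and a list of topics. The Conversation class and both
function names are hypothetical illustrations, not the actual
wiki-distill.py API:

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Conversation:
    """Hypothetical stand-in for one summarized conversation file."""
    path: str
    day: date
    topics: list = field(default_factory=list)


def topics_of_today(conversations, today):
    """Steps 1-2: collect the topics of every conversation dated today."""
    found = set()
    for conv in conversations:
        if conv.day == today:
            found.update(conv.topics)
    return found


def rollup(conversations, today):
    """Step 3: for each of today's topics, gather every conversation in
    history (today's included) that shares it, oldest first."""
    groups = {topic: [] for topic in topics_of_today(conversations, today)}
    for conv in sorted(conversations, key=lambda c: c.day):
        for topic in set(conv.topics):
            if topic in groups:
                groups[topic].append(conv)
    return groups


# Made-up sample data: one old zoho-api session, one unrelated session,
# and one session from "today" that touches zoho-api again.
history = [
    Conversation("a.md", date(2026, 1, 5), ["zoho-api"]),
    Conversation("b.md", date(2026, 3, 9), ["docker"]),
    Conversation("c.md", date(2026, 4, 12), ["zoho-api", "cron"]),
]
groups = rollup(history, today=date(2026, 4, 12))
```

Here the zoho-api group contains both a.md and c.md, while docker is
left alone because nothing dated today mentions it. That is the
narrow filter and wide rollup in miniature.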

First-run bootstrap: uses a 7-day lookback instead of today-only so
the state file is seeded from more than a single day of topics.
After that, daily runs stay narrow.
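
The window selection could be sketched as below. This assumes a
first run is detected by a missing state file or forced with a flag,
and that the 7 days include today; both are assumptions, and
seed_dates is a hypothetical name:

```python
from datetime import date, timedelta

FIRST_RUN_LOOKBACK_DAYS = 7  # lookback length stated in the commit message


def seed_dates(today, state_exists, force_first_run=False):
    """Return the dates whose conversations seed the topic set.

    Assumption: a first run means the state file is absent, or the
    caller forces it (as a flag like --distill-first-run might).
    """
    if state_exists and not force_first_run:
        return {today}
    # Assumption: the window counts today as day one of seven.
    return {today - timedelta(days=n) for n in range(FIRST_RUN_LOOKBACK_DAYS)}
```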

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
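
The phase ordering can be pictured with a small sketch. The real
orchestrator is a shell script; plan_phases is a hypothetical Python
rendering, and the flag semantics shown (--distill-only runs just
Phase 1a, --no-distill drops it) are inferred from the flag names,
not confirmed here:

```python
def plan_phases(distill_only=False, no_distill=False):
    """Return the ordered phase names for one maintenance run."""
    if distill_only and no_distill:
        raise ValueError("--distill-only and --no-distill are contradictory")
    # Order from the commit message: distill runs before harvest.
    phases = ["distill", "harvest", "hygiene", "reindex"]
    if distill_only:
        return ["distill"]
    if no_distill:
        return phases[1:]
    return phases
```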

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. The first-run dry run found 116 topic
groups worth distilling (plus 3 too thin to distill). Tested a
single-topic compile with --topic zoho-api: the LLM rolled up 2
conversations (34 bullets), synthesized a proper pattern page with
"What / Why / Known Limitations" structure, linked it to existing
wiki pages, and landed it in staging with full distill provenance.
The LLM also correctly rejected claude-code-statusline (already
well covered by an existing live page), so the "skip" path works.
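
To make the new_page / update_page / skip handling concrete, here is
a hedged sketch of turning model output into staged files. The
frontmatter keys come from this commit message and the README; the
JSON field names (action, type, slug, body, notes) and the
stage_actions helper are assumptions made for illustration:

```python
import json


def stage_actions(raw_json, topic, sources):
    """Map model output to {relative_path: file_text}, ignoring skips."""
    staged = {}
    for action in json.loads(raw_json)["actions"]:
        kind = action["action"]
        if kind == "skip":
            continue  # topic already covered; nothing to stage
        if kind not in ("new_page", "update_page"):
            raise ValueError(f"unknown action: {kind}")
        lines = [
            "---",
            "origin: automated",
            "status: pending",
            "staged_by: wiki-distill",
            f"distill_topic: {topic}",
            "distill_source_conversations:",
        ]
        lines += [f"  - {src}" for src in sources]
        # json.dumps yields a double-quoted string, which YAML also accepts.
        lines.append(f"compilation_notes: {json.dumps(action.get('notes', ''))}")
        lines += ["---", "", action["body"]]
        path = f"staging/{action['type']}/{action['slug']}.md"
        staged[path] = "\n".join(lines) + "\n"
    return staged
```

A response holding one new_page and one skip would therefore produce
exactly one file under staging/<type>/.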

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split
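
For orientation, the hall filtering behind the wiki_lib.py helpers
might look roughly like this. The input shape (a dict of hall name to
bullets, where a bullet may be a string or a nested list of
continuation lines) is an assumption, and flatten and
high_signal_bullets are hypothetical names rather than the real
helper signatures:

```python
# The three high-signal halls named in the commit message.
HIGH_SIGNAL = ("hall_facts", "hall_discoveries", "hall_advice")


def flatten(bullet):
    """Collapse a bullet (string or nested list) into one clean line."""
    if isinstance(bullet, (list, tuple)):
        return " ".join(flatten(part) for part in bullet)
    return " ".join(str(bullet).split())


def high_signal_bullets(halls):
    """Keep only non-empty high-signal halls, with flattened bullets."""
    return {
        name: [flatten(b) for b in halls[name]]
        for name in HIGH_SIGNAL
        if halls.get(name)
    }
```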

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note .distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
Eric Turner
2026-04-12 22:34:33 -06:00
parent 4c6b7609a1
commit 997aa837de
11 changed files with 1732 additions and 66 deletions


@@ -114,13 +114,14 @@ to one of those extensions:
| What memex adds | How it works |
|-----------------|--------------|
+| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. |
| **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. |
| **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. |
-| **Traceable sources for every claim** | Every compiled page traces back to an immutable `raw/harvested/*.md` file with a SHA-256 content hash. Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
-| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
+| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
+| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
| **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. |
| **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. |
-| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Harvest and hygiene state files sync across machines so both agree on what's been processed. |
+| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. |
The short version: Karpathy shared the idea, milla-jovovich's mempalace
added the structural memory taxonomy, and memex is the automation layer
@@ -147,21 +148,28 @@ memex doesn't cover.
│ summarize-conversations.py --claude (daily)
┌─────────────────────┐
-│ conversations/ │ summaries with related: wiki links
+│ conversations/ │ summaries with halls + topics + related:
│ <project>/*.md │ (status: summarized)
└──────────┬──────────┘
-│ wiki-harvest.py (daily)
-┌─────────────────────┐
-│ raw/harvested/ │ fetched URL content
-│ *.md │ (immutable source material)
-└──────────┬──────────┘
-│ claude -p compile step
-┌─────────────────────┐
-│ staging/<type>/ pending pages
- *.md │ (status: pending, origin: automated)
-└──────────┬──────────┘
+└──────┬───────────┬──
+└──▶ wiki-distill.py (daily Phase 1a) ──┐
+│ - rollup by today's topics │
+│ - pull historical conversations│
+- extract fact/discovery/advice│
+│ - claude -p → new or update │
+│ wiki-harvest.py (daily Phase 1b) │
+▼ │
+┌─────────────────────┐
+raw/harvested/ fetched URL content │
+│ *.md │ (immutable source material) │
+└──────────┬──────────┘ │
+│ claude -p compile step │
+▼ │
+┌──────────────────────────────────────────────────────┐ │
+│ staging/<type>/ pending pages │◀─┘
+│ *.md (status: pending, origin: auto) │
+└──────────┬───────────────────────────────────────────┘
│ human review (wiki-staging.py --review)
┌─────────────────────┐
@@ -283,6 +291,7 @@ wiki/
├── reports/ ← Hygiene operation logs
├── scripts/ ← The automation pipeline
├── tests/ ← Pytest suite (171 tests)
+├── .distill-state.json ← Conversation distill state (committed, synced)
├── .harvest-state.json ← URL dedup state (committed, synced)
├── .hygiene-state.json ← Content hashes, deferred issues (committed, synced)
└── .mine-state.json ← Conversation extraction offsets (gitignored, per-machine)
@@ -333,11 +342,12 @@ Eleven scripts organized in three layers:
- `update-conversation-index.py` — Regenerate conversation index + wake-up context
**Automation layer** (maintains the wiki):
-- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants
+- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, hall extraction, constants
+- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop)
- `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
- `wiki-staging.py` — Human review (list/promote/reject/review/sync)
- `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
-- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene
+- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene
**Sync layer**:
- `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling