feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: fields — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
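The narrow-today/wide-history selection (steps 1–4) plus the self-triggering behavior can be sketched like this. The conversation dict shape and both helper names are hypothetical; the hall names and the filtering rule are from this commit.

```python
from collections import defaultdict

# Halls treated as canonical knowledge; event/preference/tooling are skipped.
HIGH_SIGNAL_HALLS = ("hall_facts", "hall_discoveries", "hall_advice")

def topic_groups(conversations: list[dict], today: str) -> dict[str, list[dict]]:
    # Steps 1-2: only topics appearing in conversations dated today.
    todays_topics = {t for c in conversations if c["date"] == today
                     for t in c["topics"]}
    # Step 3: for each such topic, roll up ALL historical conversations
    # sharing it — this is what makes dormant topics self-triggering.
    groups = defaultdict(list)
    for c in conversations:
        for t in todays_topics & set(c["topics"]):
            groups[t].append(c)
    return dict(groups)

def high_signal_bullets(conv: dict) -> list[str]:
    # Step 4: keep only fact/discovery/advice bullets.
    return [b for hall in HIGH_SIGNAL_HALLS for b in conv.get(hall, [])]
```

Processing scope stays small (only today's topics are keys), while each group's value carries the full history for LLM context.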
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
```diff
@@ -43,10 +43,13 @@ repo preserves all of them:
 
 Karpathy's gist is a concept pitch. He was explicit that he was sharing
 an "idea file" for others to build on, not publishing a working
-implementation. The analysis identified seven places where the core idea
-needs an engineering layer to become practical day-to-day — five have
-first-class answers in memex, and two remain scoped-out trade-offs that
-the architecture cleanly acknowledges.
+implementation. The analysis identified eight places where the core idea
+needs an engineering layer to become practical day-to-day. The first
+seven emerged from the original Signal & Noise review; the eighth
+(conversation distillation) surfaced after building the other layers
+and realizing that the conversations themselves were being mined,
+summarized, indexed, and scanned for URLs — but the knowledge *inside*
+them was never becoming wiki pages.
 
 ### 1. Claim freshness and reversibility
 
```
```diff
@@ -236,6 +239,71 @@ story. If you need any of that, you need a different architecture.
 This is for the personal and small-team case where git + Tailscale is
 the right amount of rigor.
 
+### 8. Closing the MemPalace loop — conversation distillation
+
+**The gap**: The mining pipeline extracts Claude Code sessions into
+transcripts, classifies them by memory type (fact/discovery/preference/
+advice/event/tooling), and tags them with topics. The URL harvester
+scans them for cited links. Hygiene refreshes `last_verified` on any
+wiki page that appears in a conversation's `related:` field. But none
+of those steps actually *compile the knowledge inside the conversations
+themselves into wiki pages.* A decision made in a session, a root cause
+found during debugging, a pattern spotted in review — these stay in the
+conversation summaries (searchable but not synthesized) until a human
+manually writes them up. That's the last piece of the MemPalace model
+that wasn't wired through: **closet content was never becoming the
+source for the wiki proper**.
+
+**How memex extends it**:
+
+- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
+  URL harvesting. The ordering is deliberate: conversation content
+  should drive the page, and URL harvesting should only supplement
+  what the conversations are already covering.
+- **Narrow today-filter with historical rollup** — daily runs only
+  look at topics appearing in TODAY's summarized conversations, but
+  for each such topic the script pulls in ALL historical conversations
+  sharing that topic. Processing scope stays small; LLM context stays
+  wide. Old topics that resurface in new sessions automatically
+  trigger a re-distillation of the full history on that topic.
+- **First-run bootstrap** — the very first run uses a 7-day lookback
+  to seed the state. After that, daily runs stay narrow.
+- **High-signal halls only** — distill reads `hall_facts`,
+  `hall_discoveries`, and `hall_advice` bullets. Skips `hall_events`
+  (temporal, not knowledge), `hall_preferences` (user working style),
+  and `hall_tooling` (often low-signal). These are the halls the
+  MemPalace taxonomy treats as "canonical knowledge" vs "context."
+- **claude -p compile step** — each topic group (topic + all matching
+  conversations + their high-signal halls) is sent to `claude -p`
+  with the current wiki index. The model decides whether to create a
+  new page, update an existing one, emit both, or skip (topic not
+  substantive enough or already well-covered).
+- **Staging output with distill provenance** — new/updated pages land
+  in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
+  `distill_source_conversations` frontmatter fields. Every page traces
+  back to the exact conversations it was distilled from.
+- **State file `.distill-state.json`** tracks processed conversations
+  by content hash and topic set, so re-runs only process what actually
+  changed. A conversation gets re-distilled if its body changes OR if
+  it gains a new topic not seen at previous distill time.
+
+**Why this matters**: Without distillation, the MemPalace integration
+was incomplete — the closet summaries existed, the structural metadata
+existed, qmd could search them, but knowledge discovered during work
+never escaped the conversation archive. You could find "we had a
+debugging session about X last month" but couldn't find "here's the
+canonical page on X that captures what we learned." This extension
+turns the MemPalace layer from a searchable archive into a proper
+**ingest pipeline** for the wiki.
+
+**Residual consideration**: Summarization quality is now load-bearing.
+The distill step trusts the summarizer's classification of bullets
+into halls. If the summarizer puts a debugging dead-end in
+`hall_discoveries`, it may enter the wiki compilation pipeline. The
+`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
+substantiveness check (it can choose `skip` with a reason) together
+catch most noise, and the staging review catches the rest.
+
 ---
 
 ## The biggest layer — active upkeep
```
```diff
@@ -255,7 +323,7 @@ thing the human has to think about:
 | Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
 | Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
 | Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
-| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
+| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
 | Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
 
 If you disable all of these, you get the same outcome as every
```