feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is the 8th extension memex adds to Karpathy's pattern and the one that makes the MemPalace integration a real ingest pipeline instead of just a searchable archive beside the wiki. ## The gap distill closes The mining layer was extracting Claude Code sessions, classifying bullets into halls (fact/discovery/preference/advice/event/tooling), and tagging topics. The URL harvester scanned conversations for cited links. Hygiene refreshed last_verified on wiki pages referenced in related: fields. But none of those steps compiled the knowledge *inside* the conversations themselves into wiki pages. Decisions, root causes, and patterns stayed in the summaries forever — findable via qmd but never synthesized into canonical pages. ## What distill does Narrow today-filter with historical rollup: 1. Find all summarized conversations dated TODAY 2. Extract their topics: — this is the "topics of today" set 3. For each topic in that set, pull ALL summarized conversations across history that share that topic (full historical context) 4. Extract hall_facts + hall_discoveries + hall_advice bullets (the high-signal hall types — skips event/preference/tooling) 5. Send topic group + wiki index.md to claude -p 6. Model emits JSON actions[]: new_page / update_page / skip 7. Write each action to staging/<type>/ with distill provenance frontmatter (staged_by: wiki-distill, distill_topic, distill_source_conversations, compilation_notes) First-run bootstrap: uses 7-day lookback instead of today-only so the state file gets seeded reasonably. After that, daily runs stay narrow. Self-triggering: dormant topics that resurface in a new conversation automatically pull in all historical conversations on that topic via the rollup. Old knowledge gets distilled when it becomes relevant again without manual intervention. ## Orchestration — distill BEFORE harvest wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest): 1a. wiki-distill.py — conversations → staging (PRIORITY) 1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement) 2. wiki-hygiene.py — decay, archive, repair, checks 3. qmd reindex Conversation content drives the page shape; URL harvesting fills gaps for external references conversations don't cover. New flags: --distill-only, --no-distill, --distill-first-run. ## Verified on real wiki Tested end-to-end on the production wiki with 611 summarized conversations across 14 wings. First-run dry-run found 116 topic groups worth distilling (+ 3 too-thin). Tested single-topic compile with --topic zoho-api: the LLM rolled up 2 conversations (34 bullets), synthesized a proper pattern page with "What / Why / Known Limitations" structure, linked it to existing wiki pages, and landed it in staging with full distill provenance. LLM correctly rejected claude-code-statusline (already well-covered by an existing live page) — so the "skip" path works. ## Code additions - scripts/wiki-distill.py (new, ~530 lines) - scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls + high_signal_halls + _flatten_bullet helpers - scripts/wiki-maintain.sh: Phase 1a distill, new flags - tests/test_wiki_distill.py (21 new tests — hall parsing, rollup, state management, CLI smoke tests) - tests/test_shell_scripts.py: updated phase-name assertion for the Phase 1a/1b split ## Docs additions - README.md: 8th row in extensions table, updated compounding-loop diagram, new wiki-distill.py reference in architecture overview - docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace loop" with full mempalace taxonomy mapping - docs/ARCHITECTURE.md: wiki-distill.py section, updated phase order, updated state file table, updated dep graph - docs/SETUP.md: updated cron comment, first-run distill guidance, verify section test count - .gitignore: note distill-state.json is committed (sync across machines), not gitignored - docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level tab with flow diagram, hall filter table, narrow-today/wide- history explanation, staging provenance example ## Tests 192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
2026-04-12 22:34:33 -06:00
parent 4c6b7609a1
commit 997aa837de
11 changed files with 1732 additions and 66 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -77,6 +77,7 @@ Automation + lifecycle management on top of both:
     ┌─────────────────────────────────┐
     │    AUTOMATION LAYER             │
     │  wiki_lib.py  (shared helpers)  │
+     │  wiki-distill.py                │  (conversations → staging) ← closes MemPalace loop
     │  wiki-harvest.py                │  (URL → raw → staging)
     │  wiki-staging.py                │  (human review)
     │  wiki-hygiene.py                │  (decay, archive, repair, checks)
@@ -169,10 +170,63 @@ Provides:
 All paths honor the `WIKI_DIR` environment variable, so tests and
 alternate installs can override the root.

+### `wiki-distill.py`
+
+**Closes the MemPalace loop.** Reads the *content* of summarized
+conversations — not the URLs they cite — and compiles wiki pages from
+the high-signal hall entries (`hall_facts`, `hall_discoveries`,
+`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL
+harvesting.
+
+**Scope filter (deliberately narrow)**:
+1. Find all summarized conversations dated TODAY
+2. Extract their `topics:` — this is the "topics-of-today" set
+3. For each topic in that set, pull ALL summarized conversations across
+   history that share that topic (full historical context via rollup)
+4. Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet
+   content from each conversation's body
+5. Send the topic group (topic + matching conversations + halls) to
+   `claude -p` with the current `index.md`
+6. Model emits a JSON `actions` array with `new_page` / `update_page` /
+   `skip` verdicts; the script writes each to `staging/<type>/`
+
+**First-run bootstrap**: the very first run uses a 7-day lookback
+instead of today-only, so the state file gets seeded with a reasonable
+starting set. After that, daily runs stay narrow.
+
+**Self-triggering**: dormant topics that resurface in a new
+conversation automatically pull in all historical conversations on
+that topic via the rollup. No manual intervention needed to
+reprocess old knowledge when it becomes relevant again.
+
+**Model routing**: haiku for short topic groups (< 15K chars prompt,
+< 20 bullets), sonnet for longer ones.
+
+**State** lives in `.distill-state.json` — tracks processed
+conversations by content hash and topics-at-distill-time. A
+conversation is re-processed if its body changes OR if it gains a new
+topic not seen at previous distill.
+
+**Staging output** includes distill-specific frontmatter:
+- `staged_by: wiki-distill`
+- `distill_topic: <topic>`
+- `distill_source_conversations: <comma-separated conversation paths>`
+
+Commands:
+- `wiki-distill.py` — today-only rollup (default mode after first run)
+- `wiki-distill.py --first-run` — 7-day lookback bootstrap
+- `wiki-distill.py --topic TOPIC` — explicit single-topic processing
+- `wiki-distill.py --project WING` — only today-topics from this wing
+- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes
+- `wiki-distill.py --no-compile` — rollup only, skip claude -p step
+- `wiki-distill.py --limit N` — stop after N topic groups
+
 ### `wiki-harvest.py`

 Scans summarized conversations for HTTP(S) URLs, classifies them,
-fetches content, and compiles pending wiki pages.
+fetches content, and compiles pending wiki pages. Runs as Phase 1b in
+`wiki-maintain.sh`, after distill — URL content is treated as a
+supplement to conversation-driven knowledge, not the primary source.

 URL classification:
 - **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
@@ -254,13 +308,17 @@ full-mode runs can skip unchanged pages. Reports land in
 Top-level orchestrator:

 ```
-Phase 1: wiki-harvest.py     (unless --hygiene-only)
-Phase 2: wiki-hygiene.py     (--full for the weekly pass, else quick)
-Phase 3: qmd update && qmd embed     (unless --no-reindex or --dry-run)
+Phase 1a: wiki-distill.py    (unless --no-distill or --harvest-only / --hygiene-only)
+Phase 1b: wiki-harvest.py    (unless --distill-only / --hygiene-only)
+Phase 2:  wiki-hygiene.py    (--full for the weekly pass, else quick)
+Phase 3:  qmd update && qmd embed     (unless --no-reindex or --dry-run)
 ```

-Flags pass through to child scripts. Error-tolerant: if one phase fails,
-the others still run. Logs to `scripts/.maintain.log`.
+Ordering is deliberate: distill runs before harvest so that
+conversation content drives the page shape, and URL harvesting only
+supplements what the conversations are already covering. Flags pass
+through to child scripts. Error-tolerant: if one phase fails, the
+others still run. Logs to `scripts/.maintain.log`.

 ---

@@ -289,6 +347,7 @@ Three JSON files track per-pipeline state:
 | File | Owner | Synced? | Purpose |
 |------|-------|---------|---------|
 | `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
+| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag |
 | `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
 | `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

@@ -301,13 +360,15 @@ because Claude Code session files live at OS-specific paths.
 ## Module dependency graph

 ```
-wiki_lib.py  ─┬─>  wiki-harvest.py
+wiki_lib.py  ─┬─>  wiki-distill.py
+              ├─>  wiki-harvest.py
              ├─>  wiki-staging.py
              └─>  wiki-hygiene.py

-wiki-maintain.sh  ─>  wiki-harvest.py
-                  ─>  wiki-hygiene.py
-                  ─>  qmd (external)
+wiki-maintain.sh  ─>  wiki-distill.py   (Phase 1a — conversations → staging)
+                  ─>  wiki-harvest.py   (Phase 1b — URLs → staging)
+                  ─>  wiki-hygiene.py   (Phase 2)
+                  ─>  qmd (external)    (Phase 3)

 mine-conversations.sh  ─>  extract-sessions.py
                       ─>  summarize-conversations.py