feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: fields — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)
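Steps 1-3 above amount to a small grouping pass. A minimal sketch, assuming conversations are available as dicts with date and topics fields (the function name and record shape are illustrative, not wiki-distill.py's actual internals):

```python
from collections import defaultdict


def rollup_topic_groups(
    conversations: list[dict], today: str
) -> dict[str, list[dict]]:
    """Narrow today-filter, wide historical rollup.

    For every topic that appears on a conversation dated `today`,
    collect ALL conversations across history sharing that topic.
    """
    # Steps 1-2: the "topics of today" set
    todays_topics: set[str] = set()
    for conv in conversations:
        if conv["date"] == today:
            todays_topics.update(conv["topics"])

    # Step 3: pull full historical context per topic
    groups: dict[str, list[dict]] = defaultdict(list)
    for conv in conversations:
        for topic in todays_topics & set(conv["topics"]):
            groups[topic].append(conv)
    return dict(groups)
```

This is also what makes the self-triggering behavior fall out for free: a dormant topic resurfacing today drags its whole history into the group.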

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
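The bootstrap-vs-daily window choice reduces to a tiny pure function. A sketch: distill_window and its signature are hypothetical; only the 7-day-lookback / today-only behavior comes from the description above.

```python
from datetime import date, timedelta


def distill_window(
    state_exists: bool, today: date, bootstrap_days: int = 7
) -> tuple[date, date]:
    """(start, end) range of conversation dates to consider, inclusive.

    First run (no state file yet): look back a week so the state file
    gets seeded reasonably. Every run after that stays narrow.
    """
    if not state_exists:
        return (today - timedelta(days=bootstrap_days), today)
    return (today, today)
```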

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), confirming that the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
Author: Eric Turner
Date:   2026-04-12 22:34:33 -06:00
Parent: 4c6b7609a1
Commit: 997aa837de
11 changed files with 1732 additions and 66 deletions


@@ -209,3 +209,63 @@ def iter_archived_pages() -> list[WikiPage]:

```python
def page_content_hash(page: WikiPage) -> str:
    """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
    return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()


# ---------------------------------------------------------------------------
# Conversation hall parsing
# ---------------------------------------------------------------------------
#
# Summarized conversations have sections in the body like:
#   ## Decisions (hall: fact)
#   - bullet
#   - bullet
#   ## Discoveries (hall: discovery)
#   - bullet
#
# Hall types used by the summarizer: fact, discovery, preference, advice,
# event, tooling. Only fact/discovery/advice are high-signal enough to
# distill into wiki pages; the others are tracked but not auto-promoted.
HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}

_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)


def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
    """Extract hall-bucketed bullet content from a summarized conversation body.

    Returns a dict like:
        {"fact": ["claim one", "claim two"],
         "discovery": ["root cause X"],
         "advice": ["do Y", "consider Z"], ...}

    Empty hall types are omitted. Bullet lines are stripped of leading "- "
    and trailing whitespace; multi-line bullets are joined with a space.
    """
    result: dict[str, list[str]] = {}
    for match in _HALL_SECTION_RE.finditer(page.body):
        hall_type = match.group(1).strip().lower()
        section_body = match.group(2)
        bullets = [
            _flatten_bullet(b.group(1))
            for b in _BULLET_RE.finditer(section_body)
        ]
        bullets = [b for b in bullets if b]
        if bullets:
            result.setdefault(hall_type, []).extend(bullets)
    return result


def _flatten_bullet(text: str) -> str:
    """Collapse a possibly-multiline bullet into a single clean line."""
    return " ".join(text.split()).strip()


def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
    """Return only fact/discovery/advice content from a conversation."""
    all_halls = parse_conversation_halls(page)
    return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}
```
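For a feel of what these helpers produce, here is the same parsing logic run against a small hand-written conversation body. The regexes are copied verbatim from above; the body string stands in for WikiPage.body, and the bullet text is made up for illustration.

```python
import re

HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}
_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)

# A plain string stands in for WikiPage.body here.
body = """\
## Decisions (hall: fact)
- pin Python to 3.12 for the summarizer
## Discoveries (hall: discovery)
- retry wrapper was masking the real timeout
## Misc (hall: tooling)
- installed jq on the runner
"""

halls: dict[str, list[str]] = {}
for m in _HALL_SECTION_RE.finditer(body):
    hall = m.group(1).strip().lower()
    bullets = [" ".join(b.group(1).split()) for b in _BULLET_RE.finditer(m.group(2))]
    halls.setdefault(hall, []).extend(b for b in bullets if b)

# tooling is parsed but dropped by the high-signal filter, mirroring
# high_signal_halls() above.
high_signal = {k: v for k, v in halls.items() if k in HIGH_SIGNAL_HALLS}
```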