feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
This commit is contained in:
Eric Turner
2026-04-12 22:34:33 -06:00
parent 4c6b7609a1
commit 997aa837de
11 changed files with 1732 additions and 66 deletions

View File

@@ -1209,6 +1209,7 @@
<button class="tab-btn" onclick="switchTab(this, 'tab-signals')">Signal Breakdown</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-mitigations')">Mitigations ★</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-mempalace')" style="color:var(--accent-green);font-weight:600">MemPalace ⬡</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-distill')" style="color:var(--accent-amber);font-weight:600">Distill ⬣</button>
</div>
<!-- TAB: PROS & CONS -->
@@ -2259,6 +2260,255 @@
</div><!-- /tab-mempalace -->
<!-- TAB: DISTILL — the 8th extension, closing the MemPalace loop -->
<div id="tab-distill" class="tab-panel">
<div class="palace-hero" style="background:linear-gradient(135deg, #2a1810 0%, #1a1a10 50%, #0a1510 100%); border-color:#4a3a1a;">
<div class="kicker" style="color:#f0c060">⬣ The 8th Extension — Closing the MemPalace Loop</div>
<h3>Closet summaries <em>become</em> the source for the wiki itself.</h3>
<p>The first seven extensions came out of the Signal &amp; Noise review. The eighth surfaced only after the other layers were built — and it's the one that makes the MemPalace integration a real pipeline into the wiki instead of just a searchable archive beside it. The mining layer was extracting sessions, classifying bullets into halls, tagging topics, and making everything searchable via qmd. But the knowledge <em>inside</em> the conversations was never being compiled into wiki pages. A decision made in a session, a root cause found during debugging, a pattern spotted in review — these stayed in the conversation summaries forever, findable but not synthesized.</p>
<p style="color:#f0c060;font-size:12.5px;font-family:'JetBrains Mono',monospace;letter-spacing:0.05em;">This is what the <code>wiki-distill.py</code> script solves. It's Phase 1a of <code>wiki-maintain.sh</code> and runs before URL harvesting because conversation content should drive the page, not the URLs the conversation cites.</p>
<div class="hero-stats">
<div class="hstat"><span class="hval">Phase 1a</span><span class="hlbl">Runs before harvest</span></div>
<div class="hstat"><span class="hval">today</span><span class="hlbl">Narrow filter — today's topics</span></div>
<div class="hstat"><span class="hval">∀ history</span><span class="hlbl">Rollup all past conversations on each topic</span></div>
<div class="hstat"><span class="hval">3 halls</span><span class="hlbl">fact + discovery + advice</span></div>
<div class="hstat"><span class="hval">haiku/sonnet</span><span class="hlbl">Auto-routed by topic size</span></div>
</div>
</div>
<!-- FLOW DIAGRAM -->
<div class="flow-diagram">
<div class="flow-title">Distill Flow — Conversation Content → Wiki Pages</div>
<div class="flow-label">Narrow: what topics to process today</div>
<div class="flow-row">
<div class="flow-node convo">Today's<br>conversations</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Extract<br>topics[]</div>
<div class="flow-arrow">=</div>
<div class="flow-node wiki">Topics of<br>today set</div>
</div>
<div class="flow-label" style="margin-top:14px">Wide: pull full history for each today-topic</div>
<div class="flow-row">
<div class="flow-node wiki">Each<br>today-topic</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Rollup ALL<br>historical convs</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Extract<br>fact / discovery / advice</div>
<div class="flow-arrow"></div>
<div class="flow-node llm">claude -p<br>distill prompt</div>
</div>
<div class="flow-label" style="margin-top:14px">Compile: model decides new / update / skip</div>
<div class="flow-row">
<div class="flow-node llm">JSON<br>actions[]</div>
<div class="flow-arrow"></div>
<div class="flow-node wiki">new_page</div>
<div class="flow-arrow">+</div>
<div class="flow-node wiki">update_page<br>(modifies existing)</div>
<div class="flow-arrow"></div>
<div class="flow-node raw">staging/&lt;type&gt;/<br>pending review</div>
</div>
</div>
<!-- SECTION: WHY IT COMPLETES MEMPALACE -->
<div class="section-header">
<h2>Why This Completes MemPalace</h2>
<span class="section-tag" style="border-color:var(--accent-amber);color:var(--accent-amber);background:#fff8e6">Pipeline Closure</span>
</div>
<div class="palace-map">
<div class="palace-cell">
<div class="pc-icon">📦</div>
<div class="pc-term">Drawer — before</div>
<div class="pc-name">Verbatim Archive</div>
<div class="pc-desc">Full transcripts stored, searchable via qmd. No compilation — if you wanted canonical knowledge from them, you had to write it up manually.</div>
<div class="pc-wiki-map">Status: already working</div>
</div>
<div class="palace-cell">
<div class="pc-icon">🗂️</div>
<div class="pc-term">Closet — before</div>
<div class="pc-name">Summary Layer</div>
<div class="pc-desc">Summaries with hall classification (fact / discovery / preference / advice / event / tooling) and topics. Searchable. Terminal: never fed forward into the wiki compiler.</div>
<div class="pc-wiki-map">Status: terminal data, not flowing</div>
</div>
<div class="palace-cell">
<div class="pc-icon"></div>
<div class="pc-term">Distill — NEW</div>
<div class="pc-name">Compiler Bridge</div>
<div class="pc-desc">Reads closet content by topic, rolls up all matching conversations across history, filters to high-signal halls only, sends to claude -p with the current wiki index, emits new or updated wiki pages to staging.</div>
<div class="pc-wiki-map">Status: wiki-distill.py</div>
</div>
<div class="palace-cell">
<div class="pc-icon">📄</div>
<div class="pc-term">Wiki Pages — NEW</div>
<div class="pc-name">Distilled Knowledge</div>
<div class="pc-desc">Pages in staging/&lt;type&gt;/ with full distill provenance: distill_topic, distill_source_conversations, compilation_notes. Promote via staging review. Session knowledge becomes canonical knowledge.</div>
<div class="pc-wiki-map">Status: origin=automated, staged_by=wiki-distill</div>
</div>
</div>
<!-- HALL FILTERING -->
<div class="section-header">
<h2>Which Halls Get Distilled</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">High Signal Only</span>
</div>
<table class="compare-table">
<thead>
<tr>
<th>Hall</th>
<th style="text-align:center">Distilled?</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td class="row-label">hall_facts</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Decisions locked in, choices made, specs agreed. Canonical knowledge.</td>
</tr>
<tr>
<td class="row-label">hall_discoveries</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Root causes, breakthroughs, non-obvious findings. The highest-signal content in any session.</td>
</tr>
<tr>
<td class="row-label">hall_advice</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Recommendations, lessons learned, "next time do X." Worth capturing as patterns.</td>
</tr>
<tr>
<td class="row-label">hall_events</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Deployments, incidents, milestones. Temporal data — belongs in logs, not the wiki.</td>
</tr>
<tr>
<td class="row-label">hall_preferences</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>User working style notes. Belong in personal configs, not the shared wiki.</td>
</tr>
<tr>
<td class="row-label">hall_tooling</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Script/command usage, failures, improvements. Usually low-signal or duplicates what's already in the wiki.</td>
</tr>
</tbody>
</table>
<!-- HOW THE NARROW-TODAY + WIDE-HISTORY FILTER WORKS -->
<div class="section-header">
<h2>The Narrow-Today / Wide-History Filter</h2>
<span class="section-tag" style="border-color:var(--accent-blue);color:var(--accent-blue);background:#e8eef5">Key Design</span>
</div>
<div class="mitigation-intro">
<strong>Processing scope stays narrow; LLM context stays wide.</strong> This is the key property that makes distill cheap enough to run daily and smart enough to produce good pages.
</div>
<div class="mitigation-steps">
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">01</span>
<span class="step-title">Daily filter: only process topics appearing in TODAY's conversations</span>
<span class="step-tool-tag">Scope</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>Each daily run only looks at conversations dated today. It extracts the <code>topics:</code> frontmatter from each — that union becomes the "topics of today" set. If you didn't discuss a topic today, it's not in the processing scope. This keeps the cron job cheap and predictable: if today was a light session day, distill runs fast. If today was a heavy architecture discussion, distill does real work.</p>
<div class="tip-box"><strong>First run only:</strong> The very first run uses a 7-day lookback instead of today-only so the state file gets seeded. After that first bootstrap, daily runs stay narrow.</div>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">02</span>
<span class="step-title">Historical rollup: for each today-topic, pull ALL matching conversations</span>
<span class="step-tool-tag">Context</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>Once the today-topic set is known, for each topic the script walks the entire conversation archive and pulls every summarized conversation that shares that topic. A discussion about <code>blue-green-deploy</code> today might roll up 16 conversations across the last 6 months. The claude -p call sees the full history, not just today's fragment.</p>
<p>This is what makes the distilled pages <em>good</em>. The LLM isn't guessing what a pattern looks like from one session — it's synthesizing across everything you've ever discussed on the topic.</p>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">03</span>
<span class="step-title">Self-triggering: dormant topics wake up when they resurface</span>
<span class="step-tool-tag">Emergent</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>The narrow-today/wide-history combination produces a useful emergent property: <strong>dormant topics wake up automatically.</strong> If you discussed <code>database-migrations</code> three months ago and it never came up again, it's not in the daily scope. But the day you mention it again in any new conversation, that topic enters today's set — and the rollup pulls in all three months of historical discussion. The wiki page gets updated with fresh synthesis across the full history without you having to manually trigger reprocessing.</p>
<div class="tip-box"><strong>What this means in practice:</strong> Old knowledge gets distilled <em>when it becomes relevant again</em>. You don't need to remember to ask "hey, is there a wiki page for X?" — the next time X comes up in a session, distill will check the wiki state and either create or update the page for you.</div>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">04</span>
<span class="step-title">State tracking by content hash + topic set</span>
<span class="step-tool-tag">.distill-state.json</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>A conversation is considered "already distilled" only if its body hash AND its topic set match what was seen at the last distill. If the body changes (summarizer re-ran and updated the bullets) OR a new topic is added, the conversation gets re-processed on the next run. Topics get tracked so rejected ones don't get reprocessed forever — if the LLM says "this topic doesn't deserve a wiki page" once, it stays rejected until something meaningful changes.</p>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">05</span>
<span class="step-title">Distill runs BEFORE harvest — conversation content has priority</span>
<span class="step-tool-tag">Phase 1a</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>The orchestrator runs distill as Phase 1a and harvest as Phase 1b. Deliberate: if a topic is being actively discussed in your sessions, you want the wiki page to reflect <em>your</em> synthesis of what you've learned, not just the external URL cited in passing. URL harvesting then fills in gaps — it picks up the docs pages, blog posts, and references that your sessions didn't already cover.</p>
<div class="warn-box">Both phases can produce staging pages. If distill creates <code>patterns/docker-hardening.md</code> and harvest creates <code>patterns/docker-hardening.md</code>, the staging-unique-path helper appends a short hash suffix so they don't collide. The reviewer sees both in staging and picks the better one (usually distill, since it has historical context).</div>
</div>
</div>
</div>
<!-- STAGING FRONTMATTER -->
<div class="section-header">
<h2>Distill Staging Provenance</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">Traceable</span>
</div>
<p style="font-size:13.5px;color:var(--muted);margin-bottom:20px;line-height:1.6;">Every distilled page lands in staging with full provenance in its frontmatter. When you review a page in staging, you can see exactly which conversations it came from and jump directly to those transcripts.</p>
<div class="flow-diagram" style="background:#0d0d0d; border-color:#2a2a2a;">
<div class="flow-title" style="color:#c4b99a">Example: staging/patterns/zoho-crm-integration.md frontmatter</div>
<pre style="font-family:'JetBrains Mono',monospace;font-size:11px;color:#c4b99a;line-height:1.6;margin:0;padding:14px 0;overflow-x:auto;">---
origin: automated
status: pending
staged_date: 2026-04-12
staged_by: wiki-distill
target_path: patterns/zoho-crm-integration.md
distill_topic: zoho-api
distill_source_conversations: conversations/general/2026-04-06-73d15650.md,conversations/mc/2026-03-30-64089d1d.md
compilation_notes: Two separate incidents discovered the same Zoho CRM v2 API limitations, documenting them as a pattern page prevents re-investigation and provides a canonical reference for future Zoho integrations.
title: Zoho CRM Integration
type: pattern
confidence: high
sources: [conversations/general/2026-04-06-73d15650.md, conversations/mc/2026-03-30-64089d1d.md]
related: [database-migrations.md, activity-event-auditing.md]
last_compiled: 2026-04-12
last_verified: 2026-04-12
---</pre>
</div>
<div class="pull-quote" style="border-left-color:var(--accent-amber)">
Without distillation, MemPalace was a searchable archive sitting beside the wiki. With distillation, it's a real ingest pipeline — closet content becomes the source material for the wiki proper, completing the eight-extension story.
<span class="attribution">— memex design rationale, April 2026</span>
</div>
</div><!-- /tab-distill -->
</div><!-- /page -->
<footer class="page-footer">