feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: fields — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
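The narrow-today/wide-history selection (steps 1–4) plus the self-triggering behavior can be sketched like this. The conversation dict shape and both helper names are hypothetical; the hall names and the filtering rule are from this commit.

```python
from collections import defaultdict

# Halls treated as canonical knowledge; event/preference/tooling are skipped.
HIGH_SIGNAL_HALLS = ("hall_facts", "hall_discoveries", "hall_advice")

def topic_groups(conversations: list[dict], today: str) -> dict[str, list[dict]]:
    # Steps 1-2: only topics appearing in conversations dated today.
    todays_topics = {t for c in conversations if c["date"] == today
                     for t in c["topics"]}
    # Step 3: for each such topic, roll up ALL historical conversations
    # sharing it — this is what makes dormant topics self-triggering.
    groups = defaultdict(list)
    for c in conversations:
        for t in todays_topics & set(c["topics"]):
            groups[t].append(c)
    return dict(groups)

def high_signal_bullets(conv: dict) -> list[str]:
    # Step 4: keep only fact/discovery/advice bullets.
    return [b for hall in HIGH_SIGNAL_HALLS for b in conv.get(hall, [])]
```

Processing scope stays small (only today's topics are keys), while each group's value carries the full history for LLM context.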
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
```diff
@@ -43,10 +43,13 @@ repo preserves all of them:
 
 Karpathy's gist is a concept pitch. He was explicit that he was sharing
 an "idea file" for others to build on, not publishing a working
-implementation. The analysis identified seven places where the core idea
-needs an engineering layer to become practical day-to-day — five have
-first-class answers in memex, and two remain scoped-out trade-offs that
-the architecture cleanly acknowledges.
+implementation. The analysis identified eight places where the core idea
+needs an engineering layer to become practical day-to-day. The first
+seven emerged from the original Signal & Noise review; the eighth
+(conversation distillation) surfaced after building the other layers
+and realizing that the conversations themselves were being mined,
+summarized, indexed, and scanned for URLs — but the knowledge *inside*
+them was never becoming wiki pages.
 
 ### 1. Claim freshness and reversibility
 
```
```diff
@@ -236,6 +239,71 @@ story. If you need any of that, you need a different architecture.
 This is for the personal and small-team case where git + Tailscale is
 the right amount of rigor.
 
+### 8. Closing the MemPalace loop — conversation distillation
+
+**The gap**: The mining pipeline extracts Claude Code sessions into
+transcripts, classifies them by memory type (fact/discovery/preference/
+advice/event/tooling), and tags them with topics. The URL harvester
+scans them for cited links. Hygiene refreshes `last_verified` on any
+wiki page that appears in a conversation's `related:` field. But none
+of those steps actually *compile the knowledge inside the conversations
+themselves into wiki pages.* A decision made in a session, a root cause
+found during debugging, a pattern spotted in review — these stay in the
+conversation summaries (searchable but not synthesized) until a human
+manually writes them up. That's the last piece of the MemPalace model
+that wasn't wired through: **closet content was never becoming the
+source for the wiki proper**.
+
+**How memex extends it**:
+
+- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
+  URL harvesting. The ordering is deliberate: conversation content
+  should drive the page, and URL harvesting should only supplement
+  what the conversations are already covering.
+- **Narrow today-filter with historical rollup** — daily runs only
+  look at topics appearing in TODAY's summarized conversations, but
+  for each such topic the script pulls in ALL historical conversations
+  sharing that topic. Processing scope stays small; LLM context stays
+  wide. Old topics that resurface in new sessions automatically
+  trigger a re-distillation of the full history on that topic.
+- **First-run bootstrap** — the very first run uses a 7-day lookback
+  to seed the state. After that, daily runs stay narrow.
+- **High-signal halls only** — distill reads `hall_facts`,
+  `hall_discoveries`, and `hall_advice` bullets. Skips `hall_events`
+  (temporal, not knowledge), `hall_preferences` (user working style),
+  and `hall_tooling` (often low-signal). These are the halls the
+  MemPalace taxonomy treats as "canonical knowledge" vs "context."
+- **claude -p compile step** — each topic group (topic + all matching
+  conversations + their high-signal halls) is sent to `claude -p`
+  with the current wiki index. The model decides whether to create a
+  new page, update an existing one, emit both, or skip (topic not
+  substantive enough or already well-covered).
+- **Staging output with distill provenance** — new/updated pages land
+  in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
+  `distill_source_conversations` frontmatter fields. Every page traces
+  back to the exact conversations it was distilled from.
+- **State file `.distill-state.json`** tracks processed conversations
+  by content hash and topic set, so re-runs only process what actually
+  changed. A conversation gets re-distilled if its body changes OR if
+  it gains a new topic not seen at previous distill time.
+
+**Why this matters**: Without distillation, the MemPalace integration
+was incomplete — the closet summaries existed, the structural metadata
+existed, qmd could search them, but knowledge discovered during work
+never escaped the conversation archive. You could find "we had a
+debugging session about X last month" but couldn't find "here's the
+canonical page on X that captures what we learned." This extension
+turns the MemPalace layer from a searchable archive into a proper
+**ingest pipeline** for the wiki.
+
+**Residual consideration**: Summarization quality is now load-bearing.
+The distill step trusts the summarizer's classification of bullets
+into halls. If the summarizer puts a debugging dead-end in
+`hall_discoveries`, it may enter the wiki compilation pipeline. The
+`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
+substantiveness check (it can choose `skip` with a reason) together
+catch most noise, and the staging review catches the rest.
+
 ---
 
 ## The biggest layer — active upkeep
```
```diff
@@ -255,7 +323,7 @@ thing the human has to think about:
 | Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
 | Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
 | Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
-| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
+| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
 | Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
 
 If you disable all of these, you get the same outcome as every
```