feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
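Steps 1-4 above can be sketched roughly like this. This is a minimal illustration only: the `rollup` helper, the conversation dict shapes, and the `HIGH_SIGNAL_HALLS` contents mirror the description above but are not the real wiki-distill.py internals.

```python
from datetime import date

# Hall types treated as high-signal (facts/discoveries/advice, per the
# filter described above; event/preference/tooling are skipped).
HIGH_SIGNAL_HALLS = {"hall_facts", "hall_discoveries", "hall_advice"}

def rollup(conversations, today=None):
    """Narrow today-filter with historical rollup (illustrative).

    conversations: list of dicts with 'date' (ISO string), 'topics'
    (list of strings), and 'halls' (hall name -> list of bullets).
    These shapes are assumptions, not the real summary format.
    """
    today = today or date.today().isoformat()
    # Steps 1-2: collect the "topics of today" set
    todays_topics = set()
    for conv in conversations:
        if conv["date"] == today:
            todays_topics.update(conv["topics"])
    # Steps 3-4: for each topic, pull ALL history sharing it and keep
    # only bullets from the high-signal halls
    groups = {}
    for topic in todays_topics:
        matching = [c for c in conversations if topic in c["topics"]]
        bullets = [
            bullet
            for c in matching
            for hall, items in c["halls"].items()
            if hall in HIGH_SIGNAL_HALLS
            for bullet in items
        ]
        groups[topic] = {"conversations": matching, "bullets": bullets}
    return groups
```

Note how a single today-dated conversation on a topic is enough to pull every historical conversation on that topic into the group.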
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
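The `actions[]` payload from step 6 might look like the following. Only the `new_page` / `update_page` / `skip` verdicts come from this commit; every other field name here is an illustrative assumption:

```json
{
  "actions": [
    {
      "action": "new_page",
      "title": "zoho-api-rate-limits",
      "type": "pattern",
      "body": "..."
    },
    {
      "action": "skip",
      "reason": "already covered by an existing live page"
    }
  ]
}
```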
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
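Typical invocations with the new flags might look like this (paths assume the `scripts/` layout listed under "Code additions"):

```
./scripts/wiki-maintain.sh                      # full run: distill, harvest, hygiene, reindex
./scripts/wiki-maintain.sh --distill-first-run  # bootstrap with the 7-day lookback
./scripts/wiki-maintain.sh --distill-only       # Phase 1a only
./scripts/wiki-maintain.sh --no-distill         # skip Phase 1a, run the rest
```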
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
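The `wiki_lib.py` helpers listed above might behave roughly like this sketch. The `## hall_*` heading format, the function signatures, and the bullet syntax are all assumptions for illustration, not the real parser:

```python
import re

# Assumed to mirror the HIGH_SIGNAL_HALLS constant named above.
HIGH_SIGNAL_HALLS = {"hall_facts", "hall_discoveries", "hall_advice"}

def parse_conversation_halls(body: str) -> dict[str, list[str]]:
    """Group '- ' bullets under '## hall_*' headings in a summary body."""
    halls: dict[str, list[str]] = {}
    current = None
    for line in body.splitlines():
        heading = re.match(r"^#+\s*(hall_\w+)\s*$", line)
        if heading:
            current = heading.group(1)
            halls.setdefault(current, [])
        elif current and line.lstrip().startswith("- "):
            halls[current].append(line.lstrip()[2:].strip())
    return halls

def high_signal_halls(halls: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the high-signal hall types."""
    return {h: b for h, b in halls.items() if h in HIGH_SIGNAL_HALLS}
```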
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
````diff
@@ -77,6 +77,7 @@ Automation + lifecycle management on top of both:
 ┌─────────────────────────────────┐
 │         AUTOMATION LAYER        │
 │ wiki_lib.py (shared helpers)    │
+│ wiki-distill.py │ (conversations → staging) ← closes MemPalace loop
 │ wiki-harvest.py │ (URL → raw → staging)
 │ wiki-staging.py │ (human review)
 │ wiki-hygiene.py │ (decay, archive, repair, checks)
````
````diff
@@ -169,10 +170,63 @@ Provides:
 All paths honor the `WIKI_DIR` environment variable, so tests and
 alternate installs can override the root.
 
+### `wiki-distill.py`
+
+**Closes the MemPalace loop.** Reads the *content* of summarized
+conversations — not the URLs they cite — and compiles wiki pages from
+the high-signal hall entries (`hall_facts`, `hall_discoveries`,
+`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL
+harvesting.
+
+**Scope filter (deliberately narrow)**:
+1. Find all summarized conversations dated TODAY
+2. Extract their `topics:` — this is the "topics-of-today" set
+3. For each topic in that set, pull ALL summarized conversations across
+   history that share that topic (full historical context via rollup)
+4. Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet
+   content from each conversation's body
+5. Send the topic group (topic + matching conversations + halls) to
+   `claude -p` with the current `index.md`
+6. Model emits a JSON `actions` array with `new_page` / `update_page` /
+   `skip` verdicts; the script writes each to `staging/<type>/`
+
+**First-run bootstrap**: the very first run uses a 7-day lookback
+instead of today-only, so the state file gets seeded with a reasonable
+starting set. After that, daily runs stay narrow.
+
+**Self-triggering**: dormant topics that resurface in a new
+conversation automatically pull in all historical conversations on
+that topic via the rollup. No manual intervention needed to
+reprocess old knowledge when it becomes relevant again.
+
+**Model routing**: haiku for short topic groups (< 15K chars prompt,
+< 20 bullets), sonnet for longer ones.
+
+**State** lives in `.distill-state.json` — tracks processed
+conversations by content hash and topics-at-distill-time. A
+conversation is re-processed if its body changes OR if it gains a new
+topic not seen at previous distill.
+
+**Staging output** includes distill-specific frontmatter:
+- `staged_by: wiki-distill`
+- `distill_topic: <topic>`
+- `distill_source_conversations: <comma-separated conversation paths>`
+
+Commands:
+- `wiki-distill.py` — today-only rollup (default mode after first run)
+- `wiki-distill.py --first-run` — 7-day lookback bootstrap
+- `wiki-distill.py --topic TOPIC` — explicit single-topic processing
+- `wiki-distill.py --project WING` — only today-topics from this wing
+- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes
+- `wiki-distill.py --no-compile` — rollup only, skip claude -p step
+- `wiki-distill.py --limit N` — stop after N topic groups
+
 ### `wiki-harvest.py`
 
 Scans summarized conversations for HTTP(S) URLs, classifies them,
-fetches content, and compiles pending wiki pages.
+fetches content, and compiles pending wiki pages. Runs as Phase 1b in
+`wiki-maintain.sh`, after distill — URL content is treated as a
+supplement to conversation-driven knowledge, not the primary source.
 
 URL classification:
 - **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
````
````diff
@@ -254,13 +308,17 @@ full-mode runs can skip unchanged pages. Reports land in
 Top-level orchestrator:
 
 ```
-Phase 1:  wiki-harvest.py  (unless --hygiene-only)
+Phase 1a: wiki-distill.py  (unless --no-distill or --harvest-only / --hygiene-only)
+Phase 1b: wiki-harvest.py  (unless --distill-only / --hygiene-only)
 Phase 2:  wiki-hygiene.py  (--full for the weekly pass, else quick)
 Phase 3:  qmd update && qmd embed  (unless --no-reindex or --dry-run)
 ```
 
-Flags pass through to child scripts. Error-tolerant: if one phase fails,
-the others still run. Logs to `scripts/.maintain.log`.
+Ordering is deliberate: distill runs before harvest so that
+conversation content drives the page shape, and URL harvesting only
+supplements what the conversations are already covering. Flags pass
+through to child scripts. Error-tolerant: if one phase fails, the
+others still run. Logs to `scripts/.maintain.log`.
 
 ---
````
````diff
@@ -289,6 +347,7 @@ Three JSON files track per-pipeline state:
 | File | Owner | Synced? | Purpose |
 |------|-------|---------|---------|
 | `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
+| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag |
 | `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
 | `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
````
````diff
@@ -301,13 +360,15 @@ because Claude Code session files live at OS-specific paths.
 ## Module dependency graph
 
 ```
-wiki_lib.py ─┬─> wiki-harvest.py
+wiki_lib.py ─┬─> wiki-distill.py
+             ├─> wiki-harvest.py
              ├─> wiki-staging.py
              └─> wiki-hygiene.py
 
-wiki-maintain.sh ─> wiki-harvest.py
-                 ─> wiki-hygiene.py
-                 ─> qmd (external)
+wiki-maintain.sh ─> wiki-distill.py  (Phase 1a — conversations → staging)
+                 ─> wiki-harvest.py  (Phase 1b — URLs → staging)
+                 ─> wiki-hygiene.py  (Phase 2)
+                 ─> qmd (external)   (Phase 3)
 
 mine-conversations.sh ─> extract-sessions.py
                       ─> summarize-conversations.py
````