feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
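The actions[] contract in steps 5–7 can be sketched as follows. The action names (new_page / update_page / skip) come from the pipeline above; the `slug` and `reason` fields and the `parse_actions` helper are illustrative assumptions, not the script's exact schema:

```python
import json

# Hypothetical example of the actions[] JSON the model emits in step 6.
# Field names other than "action" (slug, reason) are illustrative.
raw = """
{
  "actions": [
    {"action": "new_page", "slug": "zoho-api-rate-limits",
     "reason": "no existing page covers this pattern"},
    {"action": "update_page", "slug": "qmd-search",
     "reason": "new detail worth merging into the live page"},
    {"action": "delete_page", "slug": "oops",
     "reason": "models sometimes invent actions; drop them"},
    {"action": "skip", "slug": "claude-code-statusline",
     "reason": "already well covered by a live page"}
  ]
}
"""

VALID_ACTIONS = {"new_page", "update_page", "skip"}

def parse_actions(payload: str) -> list[dict]:
    """Keep only recognized actions from the model's JSON response."""
    actions = json.loads(payload).get("actions", [])
    return [a for a in actions if a.get("action") in VALID_ACTIONS]

actions = parse_actions(raw)  # the unknown "delete_page" entry is dropped
```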
First-run bootstrap: the first run uses a 7-day lookback instead of
today-only so the state file is seeded with a reasonable baseline.
After that, daily runs stay narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
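The narrow-today / wide-history selection can be sketched as below, assuming each summarized conversation is represented as a dict with `date`, `topics`, and `status` fields (the real script reads these from frontmatter via wiki_lib; this is a sketch of the selection logic, not its implementation):

```python
from datetime import date, timedelta

def rollup(conversations, today, first_run=False):
    """Group ALL historical conversations by the topics seen 'today'."""
    # First run uses a 7-day lookback to seed the state file; after
    # that, only today's conversations define the topic set.
    lookback = 7 if first_run else 1
    cutoff = today - timedelta(days=lookback - 1)
    summarized = [c for c in conversations if c["status"] == "summarized"]
    recent = [c for c in summarized if c["date"] >= cutoff]
    todays_topics = {t for c in recent for t in c["topics"]}
    # Wide history: every summarized conversation sharing one of
    # today's topics is pulled in, which is how dormant topics get
    # re-distilled automatically when they resurface.
    return {
        topic: [c for c in summarized if topic in c["topics"]]
        for topic in todays_topics
    }

conversations = [
    {"date": date(2024, 6, 10), "topics": ["zoho-api"], "status": "summarized"},
    {"date": date(2024, 3, 1), "topics": ["zoho-api", "auth"], "status": "summarized"},
    {"date": date(2024, 3, 2), "topics": ["statusline"], "status": "summarized"},
]
groups = rollup(conversations, today=date(2024, 6, 10))
```

Only "zoho-api" is a topic of today, but its group contains both the March and June conversations: that is the historical rollup.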
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page) — so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
README.md: 48 lines changed

```diff
@@ -114,13 +114,14 @@ to one of those extensions:
 | What memex adds | How it works |
 |-----------------|--------------|
+| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. |
 | **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. |
 | **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. |
-| **Traceable sources for every claim** | Every compiled page traces back to an immutable `raw/harvested/*.md` file with a SHA-256 content hash. Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
-| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
+| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
+| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
 | **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. |
 | **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. |
-| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Harvest and hygiene state files sync across machines so both agree on what's been processed. |
+| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. |
 
 The short version: Karpathy shared the idea, milla-jovovich's mempalace
 added the structural memory taxonomy, and memex is the automation layer
@@ -147,21 +148,28 @@ memex doesn't cover.
     │ summarize-conversations.py --claude (daily)
     ▼
 ┌─────────────────────┐
-│ conversations/      │ summaries with related: wiki links
+│ conversations/      │ summaries with halls + topics + related:
 │ <project>/*.md      │ (status: summarized)
-└──────────┬──────────┘
-    │ wiki-harvest.py (daily)
-    ▼
-┌─────────────────────┐
-│ raw/harvested/      │ fetched URL content
-│ *.md                │ (immutable source material)
-└──────────┬──────────┘
-    │ claude -p compile step
-    ▼
-┌─────────────────────┐
-│ staging/<type>/     │ pending pages
-│ *.md                │ (status: pending, origin: automated)
-└──────────┬──────────┘
+└──────┬───────────┬──┘
+       │           │
+       │           └──▶ wiki-distill.py (daily Phase 1a) ──┐
+       │                 - rollup by today's topics        │
+       │                 - pull historical conversations   │
+       │                 - extract fact/discovery/advice   │
+       │                 - claude -p → new or update       │
+       │                                                   │
+       │ wiki-harvest.py (daily Phase 1b)                  │
+       ▼                                                   │
+┌─────────────────────┐                                    │
+│ raw/harvested/      │ fetched URL content                │
+│ *.md                │ (immutable source material)        │
+└──────────┬──────────┘                                    │
+    │ claude -p compile step                               │
+    ▼                                                      │
+┌──────────────────────────────────────────────────────┐   │
+│ staging/<type>/     pending pages                    │◀──┘
+│ *.md                (status: pending, origin: auto)  │
+└──────────┬───────────────────────────────────────────┘
     │ human review (wiki-staging.py --review)
     ▼
 ┌─────────────────────┐
@@ -283,6 +291,7 @@ wiki/
 ├── reports/              ← Hygiene operation logs
 ├── scripts/              ← The automation pipeline
 ├── tests/                ← Pytest suite (171 tests)
+├── .distill-state.json   ← Conversation distill state (committed, synced)
 ├── .harvest-state.json   ← URL dedup state (committed, synced)
 ├── .hygiene-state.json   ← Content hashes, deferred issues (committed, synced)
 └── .mine-state.json      ← Conversation extraction offsets (gitignored, per-machine)
@@ -333,11 +342,12 @@ Eleven scripts organized in three layers:
 - `update-conversation-index.py` — Regenerate conversation index + wake-up context
 
 **Automation layer** (maintains the wiki):
-- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants
+- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, hall extraction, constants
+- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop)
 - `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
 - `wiki-staging.py` — Human review (list/promote/reject/review/sync)
 - `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
-- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene
+- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene
 
 **Sync layer**:
 - `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling
```