memex/docs/ARCHITECTURE.md
Eric Turner 997aa837de feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page) — so the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
2026-04-12 22:34:33 -06:00


# Architecture
Eleven scripts across three conceptual layers. This document walks through
what each one does, how they talk to each other, and where the seams are
for customization.
> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
> behind each component, with links to the interactive design artifact.
## Borrowed concepts
The architecture is a synthesis of two external ideas with an automation
layer on top. The terminology often maps 1:1, so it's worth calling out
which concepts came from where:
### From Karpathy's persistent-wiki gist
| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |
### From [mempalace](https://github.com/milla-jovovich/mempalace)
MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:
| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |
The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K-token ceiling without needing a full vector-DB rebuild.
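The wing + room pre-narrowing can be sketched in a few lines. This is an illustrative stand-in, not the repo's code: it assumes the `conversations/<wing>/` layout and a single-line `topics:` frontmatter field described above, and the function name and frontmatter scan are hypothetical simplifications.

```python
from pathlib import Path

def narrow_corpus(root: Path, wing: str, topic: str) -> list[Path]:
    """Pre-narrow the search set: first to one wing (directory),
    then to files whose frontmatter mentions the topic. Only this
    subset is handed to the search layer."""
    candidates = []
    for path in sorted((root / "conversations" / wing).glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Cheap scan: match the topic inside a `topics:` line.
            if line.startswith("topics:") and topic in line:
                candidates.append(path)
                break
    return candidates
```

The point is that the expensive step (semantic search) only ever sees a pre-filtered slice of the corpus, which is where the retrieval boost comes from.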
### What this repo adds
Automation + lifecycle management on top of both:
- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union
---
## Overview
```
┌──────────────────────────────────┐
│ SYNC LAYER                       │
│   wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ MINING LAYER                     │
│   extract-sessions.py            │  (Claude Code JSONL → markdown)
│   summarize-conversations.py     │  (LLM classify + summarize)
│   update-conversation-index.py   │  (regenerate indexes + wake-up)
│   mine-conversations.sh          │  (orchestrator)
└──────────────────────────────────┘
┌──────────────────────────────────┐
│ AUTOMATION LAYER                 │
│   wiki_lib.py (shared helpers)   │
│   wiki-distill.py                │  (conversations → staging) ← closes MemPalace loop
│   wiki-harvest.py                │  (URL → raw → staging)
│   wiki-staging.py                │  (human review)
│   wiki-hygiene.py                │  (decay, archive, repair, checks)
│   wiki-maintain.sh               │  (orchestrator)
└──────────────────────────────────┘
```
Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.
---
## Mining layer
### `extract-sessions.py`
Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially-processed sessions.
Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction,
it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
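The byte-offset incrementality is the load-bearing trick here. A minimal sketch of the idea, assuming a JSONL session file and a flat offset map (the function name and state shape are illustrative, not the actual `.mine-state.json` schema):

```python
import json
from pathlib import Path

def read_new_lines(session: Path, state_file: Path) -> list[str]:
    """Read only the JSONL lines appended since the last run,
    tracking a per-file byte offset in a small state file. Safe to
    re-run: already-seen bytes are never re-read."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    offset = state.get(str(session), 0)
    with session.open("rb") as fh:
        fh.seek(offset)          # skip everything processed last time
        chunk = fh.read()
    state[str(session)] = offset + len(chunk)
    state_file.write_text(json.dumps(state))
    return [line for line in chunk.decode("utf-8").splitlines() if line.strip()]
```

This is also what makes session-resumption handling cheap: a grown file just yields its tail.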
### `summarize-conversations.py`
Sends extracted transcripts to an LLM for classification and summarization.
Supports two backends:
1. **`--claude` mode** (recommended): Uses `claude -p` with
haiku for short sessions (≤200 messages) and sonnet for longer ones.
Runs chunked over long transcripts, keeping a rolling context window.
2. **Local LLM mode** (default, omit `--claude`): Uses a local
`llama-server` instance at `localhost:8080` (or WSL gateway:8081 on
Windows Subsystem for Linux). Requires llama.cpp installed and a GGUF
model loaded.
Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki
page links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.
### `update-conversation-index.py`
Regenerates three files from the summarized conversations:
- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
of every session ("current focus areas, recent decisions, active
concerns")
- `context/active-concerns.md` — longer-form current state
The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.
### `mine-conversations.sh`
Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.
---
## Automation layer
### `wiki_lib.py`
The shared library. Everything in the automation layer imports from here.
Provides:
- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.
All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
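The override pattern is the usual environment-variable fallback. A sketch of how such a helper might look (the function name and the default location are illustrative, not copied from `wiki_lib.py`):

```python
import os
from pathlib import Path

def wiki_root() -> Path:
    """Resolve the wiki root, honoring the WIKI_DIR override so
    tests and alternate installs can repoint the whole toolchain."""
    return Path(os.environ.get("WIKI_DIR", str(Path.home() / "wiki")))
```

Derived constants (`STAGING_DIR`, `ARCHIVE_DIR`, etc.) would then be built relative to this root, so a single environment variable moves everything at once.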
### `wiki-distill.py`
**Closes the MemPalace loop.** Reads the *content* of summarized
conversations — not the URLs they cite — and compiles wiki pages from
the high-signal hall entries (`hall_facts`, `hall_discoveries`,
`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL
harvesting.
**Scope filter (deliberately narrow)**:
1. Find all summarized conversations dated TODAY
2. Extract their `topics:` — this is the "topics-of-today" set
3. For each topic in that set, pull ALL summarized conversations across
history that share that topic (full historical context via rollup)
4. Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet
content from each conversation's body
5. Send the topic group (topic + matching conversations + halls) to
`claude -p` with the current `index.md`
6. Model emits a JSON `actions` array with `new_page` / `update_page` /
`skip` verdicts; the script writes each to `staging/<type>/`
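Steps 1 through 3, the narrow-today / wide-history rollup, reduce to a small grouping function. A sketch under simplified assumptions (conversations as dicts with `date` and `topics` keys; the real script reads these from frontmatter):

```python
def rollup(conversations: list[dict], today: str) -> dict:
    """Topics seen in today's conversations select ALL historical
    conversations that share them, giving each topic group its full
    historical context."""
    today_topics = set()
    for conv in conversations:
        if conv["date"] == today:
            today_topics.update(conv["topics"])
    # Wide-history pull: every conversation ever tagged with the topic.
    return {
        topic: [c for c in conversations if topic in c["topics"]]
        for topic in sorted(today_topics)
    }
```

Note the asymmetry: the *filter* is narrow (today only) but each selected group is *wide* (all of history), which is what keeps daily runs cheap without losing context.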
**First-run bootstrap**: the very first run uses a 7-day lookback
instead of today-only, so the state file gets seeded with a reasonable
starting set. After that, daily runs stay narrow.
**Self-triggering**: dormant topics that resurface in a new
conversation automatically pull in all historical conversations on
that topic via the rollup. No manual intervention needed to
reprocess old knowledge when it becomes relevant again.
**Model routing**: haiku for short topic groups (< 15K chars prompt,
< 20 bullets), sonnet for longer ones.
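The routing rule is a two-threshold check. A sketch using the thresholds quoted above (the function name is illustrative):

```python
def pick_model(prompt_chars: int, bullet_count: int) -> str:
    """Route short topic groups to haiku, anything larger to sonnet,
    per the < 15K chars / < 20 bullets thresholds."""
    if prompt_chars < 15_000 and bullet_count < 20:
        return "haiku"
    return "sonnet"
```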
**State** lives in `.distill-state.json` — tracks processed
conversations by content hash and topics-at-distill-time. A
conversation is re-processed if its body changes OR if it gains a new
topic not seen at previous distill.
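That re-processing rule (changed body OR new topic) can be sketched directly. The state-entry shape here is an illustrative simplification of what `.distill-state.json` might hold:

```python
import hashlib

def needs_reprocess(body, topics, state_entry):
    """Re-distill a conversation if its body changed since last
    distill OR it gained a topic not seen at that time."""
    if state_entry is None:
        return True  # never distilled before
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest != state_entry["hash"]:
        return True  # body changed
    # Any topic not present at the previous distill triggers a re-run.
    return bool(set(topics) - set(state_entry["topics"]))
```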
**Staging output** includes distill-specific frontmatter:
- `staged_by: wiki-distill`
- `distill_topic: <topic>`
- `distill_source_conversations: <comma-separated conversation paths>`
Commands:
- `wiki-distill.py` — today-only rollup (default mode after first run)
- `wiki-distill.py --first-run` — 7-day lookback bootstrap
- `wiki-distill.py --topic TOPIC` — explicit single-topic processing
- `wiki-distill.py --project WING` — only today-topics from this wing
- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes
- `wiki-distill.py --no-compile` — rollup only, skip claude -p step
- `wiki-distill.py --limit N` — stop after N topic groups
### `wiki-harvest.py`
Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages. Runs as Phase 1b in
`wiki-maintain.sh`, after distill — URL content is treated as a
supplement to conversation-driven knowledge, not the primary source.
URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools
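The classifier is essentially ordered pattern matching. A sketch with placeholder pattern lists (the real `SKIP_DOMAIN_PATTERNS` / `C_TYPE_URL_PATTERNS` live in `wiki-harvest.py` and are configurable; these examples are illustrative):

```python
import re

# Hypothetical pattern lists standing in for the configurable ones.
SKIP_PATTERNS = [r"localhost", r"127\.0\.0\.1", r"\.internal\b"]
CHECK_PATTERNS = [r"github\.com/.+/issues/", r"stackoverflow\.com/questions/"]

def classify_url(url: str) -> str:
    if any(re.search(p, url) for p in SKIP_PATTERNS):
        return "skip"     # Type D: never fetch
    if any(re.search(p, url) for p in CHECK_PATTERNS):
        return "check"    # Type C: fetch only if the topic is already in the wiki
    return "harvest"      # Type A/B: fetch and compile
```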
Fetch cascade (tries in order, validates at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
URL was mentioned during the session
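The cascade shape (try fetchers in order, validate each result, fall through on failure) generalizes to a small helper. This is a sketch, not the harvester's code: the validation here is just a length floor, where the real script does more, and the caller would supply the `trafilatura` / `crwl` command lines from the list above:

```python
import subprocess

def fetch_with_cascade(commands, min_chars=40):
    """Run each fetcher command in order; accept the first stdout
    that exits cleanly and passes a minimal validation check."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired):
            continue  # fetcher not installed or hung: try the next one
        if result.returncode == 0 and len(result.stdout.strip()) >= min_chars:
            return result.stdout
    return None  # caller falls back to the conversation transcript
```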
Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording source URL, fetch method, and a content hash.
Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:
- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough
Result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.
### `wiki-staging.py`
Pure file operations — no LLM calls. Human review pipeline for automated
content.
Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve, strip staging fields, move to live, update
main index, rewrite cross-refs, preserve `origin: automated` as audit trail
- `--reject <path> --reason "..."` — delete, record in
`.harvest-state.json` rejected_urls so the harvester won't re-create
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
### `wiki-hygiene.py`
The heavy lifter. Two modes:
**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to index
- Detect broken cross-references, fuzzy-match to the intended target
via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync
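The broken-reference repair step is worth seeing concretely, since `difflib.get_close_matches` does all the work. A minimal sketch (the wrapper name and cutoff are illustrative; the stdlib call is the real mechanism named above):

```python
import difflib

def repair_ref(broken, live_pages):
    """Fuzzy-match a broken cross-reference to its likely intended
    target among the live pages; return None if nothing is close."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

A typo'd link like `patterns/zoho-apii.md` resolves to the real page; a reference with no plausible target stays unresolved and gets reported instead of "fixed" wrongly.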
**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — weaker merged into stronger, auto-archives
the loser with `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
mentioning `Node 18` when recent conversations are using `Node 20`)
State lives in `.hygiene-state.json` — tracks content hashes per page so
full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.
### `wiki-maintain.sh`
Top-level orchestrator:
```
Phase 1a: wiki-distill.py            (unless --no-distill or --harvest-only / --hygiene-only)
Phase 1b: wiki-harvest.py            (unless --distill-only / --hygiene-only)
Phase 2:  wiki-hygiene.py            (--full for the weekly pass, else quick)
Phase 3:  qmd update && qmd embed    (unless --no-reindex or --dry-run)
```
Ordering is deliberate: distill runs before harvest so that
conversation content drives the page shape, and URL harvesting only
supplements what the conversations are already covering. Flags pass
through to child scripts. Error-tolerant: if one phase fails, the
others still run. Logs to `scripts/.maintain.log`.
---
## Sync layer
### `wiki-sync.sh`
Git-based sync for cross-machine use. Commands:
- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report
The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).
---
## State files
Four JSON files track per-pipeline state:
| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
Distill, harvest, and hygiene state need to sync across machines so
both installations agree on what's been processed. Mining state is
per-machine because Claude Code session files live at OS-specific paths.
---
## Module dependency graph
```
wiki_lib.py ─┬─> wiki-distill.py
             ├─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─> wiki-distill.py    (Phase 1a — conversations → staging)
                 ─> wiki-harvest.py    (Phase 1b — URLs → staging)
                 ─> wiki-hygiene.py    (Phase 2)
                 ─> qmd (external)     (Phase 3)

mine-conversations.sh ─> extract-sessions.py
                      ─> summarize-conversations.py
                      ─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```
`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.
---
## Extension seams
The places to modify when customizing:
1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
`SUMMARIZE_TOOLS`, `MAX_BASH_OUTPUT_LINES` to tune transcript shape.
2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
`PROJECT_ORDER` control how the index groups conversations.
3. **`scripts/wiki-harvest.py`** —
- `SKIP_DOMAIN_PATTERNS` — your internal domains
- `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
- `FETCH_DELAY_SECONDS` — rate limit between fetches
- `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
- `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet
4. **`scripts/wiki-hygiene.py`** —
- `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
— decay thresholds in days
- `EMPTY_STUB_THRESHOLD` — what counts as a stub
- `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
- `REQUIRED_FIELDS` — frontmatter fields the repair step enforces
5. **`scripts/summarize-conversations.py`** —
- `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
- `MINE_PROMPT_FILE` — the LLM system prompt for summarization
- Backend selection (claude vs llama-server)
6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
every session. This is where you tell the agent how to maintain the
wiki, what conventions to follow, when to flag things to you.
See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.