# Design Rationale — Signal & Noise
Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."
Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:
> **Interactive version**: [eric-turner.com/memex/signal-and-noise.html](https://eric-turner.com/memex/signal-and-noise.html)
> — tabs for pros/cons, vs RAG, use-case fits, signal breakdown, mitigations
>
> **Self-contained archive**: [`artifacts/signal-and-noise.html`](artifacts/signal-and-noise.html)
> — same content, works offline
The analysis walks through the pattern's seven genuine strengths, eight
places where it needs an engineering layer to be practical, and the
concrete extension for each. memex is the implementation of those
extensions. If you want to understand *why* a component exists, the
interactive version has the longer-form argument; this document is the
condensed written version.
---
## Where the pattern is genuinely strong
The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:
| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |
---
## Where memex extends the pattern
Karpathy's gist is a concept pitch. He was explicit that he was sharing
an "idea file" for others to build on, not publishing a working
implementation. The analysis identified eight places where the core idea
needs an engineering layer to become practical day-to-day. The first
seven emerged from the original Signal & Noise review; the eighth
(conversation distillation) surfaced after building the other layers
and realizing that the conversations themselves were being mined,
summarized, indexed, and scanned for URLs — but the knowledge *inside*
them was never becoming wiki pages.
### 1. Claim freshness and reversibility
**The gap**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM-maintained wiki is stateful. If a
claim is wrong at ingest time, it stays wrong until something corrects
it. For the pattern to work long-term, claims need a way to earn trust
over time and lose it when unused.
**How memex extends it**:
- **`confidence` field** — every page carries `high`/`medium`/`low` with
decay based on `last_verified`. Wrong claims aren't treated as
permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
they're excluded from default search. If they get referenced again
they're auto-restored with `confidence: medium` (never straight to
`high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
are the ground truth. Every compiled wiki page can be traced back to
its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
sonnet to find conflicting claims across pages. Report-only (humans
decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
Nothing enters the live wiki without human approval, so errors have
two chances to get caught (AI compile + human review) before they
become persistent.
### 2. Scalable search beyond the context window
**The gap**: The pattern works beautifully up to ~100 articles, where
`index.md` still fits in context. Karpathy's own wiki was right at the
ceiling. Past that point, the agent needs a real search layer — loading
the full index stops being practical.
**How memex extends it**:
- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
up in the default configuration so the agent never has to load the
full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
project code (wing) and topic (room, via the `topics:` frontmatter).
Retrieval is pre-narrowed to the relevant wing before search runs.
This extends the effective ceiling because `qmd` works on a relevant
subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
`includeByDefault: false`, so archived pages don't eat context until
explicitly queried.
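The wing/room pre-narrowing can be illustrated with a small sketch (the hand-rolled frontmatter reader is for brevity only, and `narrow` is a hypothetical helper name; the real scripts presumably use a proper YAML parser):

```python
import re
from pathlib import Path

def frontmatter(text: str) -> dict:
    """Minimal frontmatter reader (sketch). Handles only 'key: value'
    and 'key: [a, b]' lines -- real code would use PyYAML."""
    meta = {}
    m = re.match(r"^---\n(.*?)\n---", text, re.S)
    if not m:
        return meta
    for line in m.group(1).splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            val = val.strip()
            if val.startswith("[") and val.endswith("]"):
                meta[key.strip()] = [v.strip() for v in val[1:-1].split(",")]
            else:
                meta[key.strip()] = val
    return meta

def narrow(root: Path, wing: str, topic: str) -> list[Path]:
    """Pre-narrow retrieval: only conversations in one wing (project
    directory) whose topics: frontmatter carries the given topic."""
    hits = []
    for path in (root / "conversations" / wing).glob("*.md"):
        if topic in frontmatter(path.read_text()).get("topics", []):
            hits.append(path)
    return hits
```

The returned subset, not the whole corpus, is what gets handed to `qmd` for the hybrid search pass.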
### 3. Traceable sources for every claim
**The gap**: In precision-sensitive domains (API specs, version
constraints, legal records, medical protocols), LLM-generated content
needs to be verifiable against a source. For the pattern to work in
those contexts, every claim needs to trace back to something immutable.
**How memex extends it**:
- **Staging workflow** — every automated page goes through human review.
For precision-critical content, that review IS the cross-check. The
AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
explanation of what it did and why. Makes review faster — you can
spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
(via `CLAUDE.md`) tell it to flag low-confidence content when
citing. Humans see the warning before acting.
**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
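The hash-based traceability amounts to something like the following sketch (`verify_sources` is a hypothetical helper; the exact hashing scope used by the real scripts is an assumption):

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of a raw source file, mirroring the content_hash
    frontmatter convention described above (a sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_sources(page_sources: dict[str, str], raw_dir: Path) -> list[str]:
    """Return the source files whose current hash no longer matches
    the hash recorded when the wiki page was compiled."""
    stale = []
    for name, recorded in page_sources.items():
        if content_hash(raw_dir / name) != recorded:
            stale.append(name)
    return stale
```

Because `raw/harvested/` is immutable by convention, a non-empty result means the provenance chain was broken, not that the source legitimately changed.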
### 4. Continuous feed without manual discipline
**The gap**: Community analysis of 120+ comments on Karpathy's gist
converged on one clear finding: this is the #1 friction point. Most
people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.
**How memex extends it** (this is the biggest layer):
- **Automation replaces human discipline** — daily cron runs
`wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
`--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
manually. Every Claude Code session becomes potential ingest. The
feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
summarizer links a conversation to a wiki page via `related:`, the
hygiene script picks that up and bumps `last_verified`. Pages stay
fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
for 6/9/12 months get downgraded and eventually archived. The wiki
self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
flags the things that *do* need human judgment. Everything else is
auto-fixed.
This is the single biggest layer memex adds. Nothing about it is
exotic — it's a cron-scheduled pipeline that runs the scripts you'd
otherwise have to remember to run. That's the whole trick.
### 5. Keeping the human engaged with their own knowledge
**The gap**: Hacker News critics pointed out that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. If the LLM does all of
it, you can end up with a comprehensive wiki you haven't internalized.
For the pattern to be an actual memory aid and not a false one, the
human needs touchpoints that keep them engaged.
**How memex extends it**:
- **Staging review is a forcing function** — you see every automated
page before it lands. Even skimming forces engagement with the
material.
- **`qmd query "..."` for exploration** — searching the wiki is an
active process, not passive retrieval. You're asking questions, not
pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
the agent reads at session start. You read it too (or the agent reads
it to you) — ongoing re-exposure to your own knowledge base.
**Caveat**: memex is designed as *augmentation*, not *replacement*.
It's most valuable when you engage with it actively — reading your own
wake-up briefing, spot-checking promoted pages, noticing decay flags.
If you only consult the wiki through the agent and never look at it
yourself, you've outsourced the learning. That's a usage pattern
choice, not an architecture problem.
### 6. Hybrid retrieval — structure and semantics
**The gap**: Explicit wikilinks catch direct topic references but miss
semantic neighbors that use different wording. At scale, the pattern
benefits from vector similarity to find cross-topic connections the
human (or the LLM at ingest time) didn't think to link manually.
**How memex extends it**:
- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
(wings) and topic frontmatter narrow the search space before the
hybrid search runs. Structure + semantics is stronger than either
alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
LLM to find pages that *should* link to each other but don't, then
auto-adds them. This is the explicit-linking approach catching up to
semantic retrieval over time.
**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
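As a generic illustration of how lexical and vector rankings can be fused (qmd's internal fusion and LLM re-ranking may well differ), reciprocal-rank fusion combines the two orderings:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal-rank fusion of several ranked lists, e.g. a BM25
    ranking and a vector-similarity ranking. Illustrative only."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # each list contributes 1/(k + rank); k damps the head
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that places mid-list in both rankings can outrank one that tops a single ranking, which is the point: structure and semantics reinforce each other.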
### 7. Cross-machine collaboration
**The gap**: Karpathy's gist describes a single-user, single-machine
setup. In practice, people work from multiple machines (laptop,
workstation, server) and sometimes collaborate with small teams. The
pattern needs a sync story that handles concurrent writes gracefully.
**How memex extends it**:
- **Git-based sync with merge-union** — concurrent writes on different
machines auto-resolve because markdown is set to `merge=union` in
`.gitattributes`. Both sides win.
- **State file sync** — `.harvest-state.json` and `.hygiene-state.json`
are committed, so two machines running the same pipeline agree on
what's already been processed instead of re-doing the work.
- **Network boundary as access gate** — the suggested deployment is
over Tailscale or a VPN, so the network enforces who can reach the
wiki at all. Simple and sufficient for personal/family/small-team
use.
**Explicit scope**: memex is **deliberately not** enterprise knowledge
management. No audit trails, no fine-grained permissions, no compliance
story. If you need any of that, you need a different architecture.
This is for the personal and small-team case where git + Tailscale is
the right amount of rigor.
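The merge-union behavior comes down to a single `.gitattributes` rule:

```
*.md merge=union
```

With `merge=union`, a conflicting hunk keeps the lines from both sides instead of emitting conflict markers, which is the right default for append-heavy markdown like conversation logs and wiki pages.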
### 8. Closing the MemPalace loop — conversation distillation
**The gap**: The mining pipeline extracts Claude Code sessions into
transcripts, classifies them by memory type (fact/discovery/preference/
advice/event/tooling), and tags them with topics. The URL harvester
scans them for cited links. Hygiene refreshes `last_verified` on any
wiki page that appears in a conversation's `related:` field. But none
of those steps actually *compile the knowledge inside the conversations
themselves into wiki pages.* A decision made in a session, a root cause
found during debugging, a pattern spotted in review — these stay in the
conversation summaries (searchable but not synthesized) until a human
manually writes them up. That's the last piece of the MemPalace model
that wasn't wired through: **closet content was never becoming the
source for the wiki proper**.
**How memex extends it**:
- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
URL harvesting. The ordering is deliberate: conversation content
should drive the page, and URL harvesting should only supplement
what the conversations are already covering.
- **Narrow today-filter with historical rollup** — daily runs only
look at topics appearing in TODAY's summarized conversations, but
for each such topic the script pulls in ALL historical conversations
sharing that topic. Processing scope stays small; LLM context stays
wide. Old topics that resurface in new sessions automatically
trigger a re-distillation of the full history on that topic.
- **First-run bootstrap** — the very first run uses a 7-day lookback
to seed the state. After that, daily runs stay narrow.
- **High-signal halls only** — distill reads `hall_facts`,
`hall_discoveries`, and `hall_advice` bullets. Skips `hall_events`
(temporal, not knowledge), `hall_preferences` (user working style),
and `hall_tooling` (often low-signal). These are the halls the
MemPalace taxonomy treats as "canonical knowledge" vs "context."
- **claude -p compile step** — each topic group (topic + all matching
conversations + their high-signal halls) is sent to `claude -p`
with the current wiki index. The model decides whether to create a
new page, update an existing one, emit both, or skip (topic not
substantive enough or already well-covered).
- **Staging output with distill provenance** — new/updated pages land
in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
`distill_source_conversations` frontmatter fields. Every page traces
back to the exact conversations it was distilled from.
- **State file `.distill-state.json`** tracks processed conversations
by content hash and topic set, so re-runs only process what actually
changed. A conversation gets re-distilled if its body changes OR if
it gains a new topic not seen at previous distill time.
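The re-distill decision in the last bullet can be sketched like this (the state file's JSON shape is an assumption; only the hash-or-new-topic rule comes from the text above):

```python
import hashlib
import json
from pathlib import Path

def needs_redistill(conv_path: Path, topics: set[str],
                    state_file: Path = Path(".distill-state.json")) -> bool:
    """Re-distill a conversation if its body changed (content hash
    differs) OR it gained a topic not seen at previous distill time."""
    digest = hashlib.sha256(conv_path.read_bytes()).hexdigest()
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    entry = state.get(conv_path.name)
    if entry is None:
        return True                     # never distilled
    if entry["content_hash"] != digest:
        return True                     # body changed
    return bool(topics - set(entry["topics"]))  # new topic appeared
```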
**Why this matters**: Without distillation, the MemPalace integration
was incomplete — the closet summaries existed, the structural metadata
existed, qmd could search them, but knowledge discovered during work
never escaped the conversation archive. You could find "we had a
debugging session about X last month" but couldn't find "here's the
canonical page on X that captures what we learned." This extension
turns the MemPalace layer from a searchable archive into a proper
**ingest pipeline** for the wiki.
**Residual consideration**: Summarization quality is now load-bearing.
The distill step trusts the summarizer's classification of bullets
into halls. If the summarizer puts a debugging dead-end in
`hall_discoveries`, it may enter the wiki compilation pipeline. The
`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
substantiveness check (it can choose `skip` with a reason) together
catch most noise, and the staging review catches the rest.
---
## The biggest layer — active upkeep
The other seven extensions are important, but this is the one that makes
or breaks the pattern in practice. The community data is unambiguous:
- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks
The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:
| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing automation that lets the pattern
actually compound over months and years instead of requiring a
disciplined human to keep it alive.
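Concretely, the table above maps to a crontab along these lines (install paths are illustrative; adjust to your checkout):

```
*/15 * * * *  ~/memex/scripts/wiki-sync.sh
0 */2 * * *   ~/memex/scripts/wiki-sync.sh full
0 * * * *     ~/memex/scripts/mine-conversations.sh --extract-only
0 2 * * *     ~/memex/scripts/summarize-conversations.py --claude
0 3 * * *     ~/memex/scripts/wiki-maintain.sh
0 4 * * 0     ~/memex/scripts/wiki-maintain.sh --hygiene-only --full
```

Per the table, the daily 2am job also rebuilds the conversation index after summarizing.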
---
## What was borrowed from where
This repo is a synthesis of two ideas with an automation layer on top:
### From Karpathy
- The core pattern: LLM-maintained persistent wiki, compile at ingest
time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time
### From mempalace
- **Wings** = per-person or per-project namespaces → this repo uses
project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
`conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
preference / advice / tooling) → the `halls:` frontmatter field
classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
conversation
- **Drawers** = verbatim archive, never lost → the extracted
conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
boost over flat search
The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds
- **Automation layer** tying the pieces together with cron-friendly
orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
content
- **Confidence decay + auto-archive + auto-restore** as the "retention
curve" that community analysis identified as critical for long-term
wiki health
- **`qmd` integration** as the scalable search layer (chosen over
ChromaDB because it uses the same markdown storage as the wiki —
one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
wiki lives on multiple machines without merge hell
---
## What memex deliberately doesn't try to do
Five things memex is explicitly scoped around — not because they're
unsolvable, but because solving them well requires a different kind of
architecture than a personal/small-team wiki. If any of these are
dealbreakers for your use case, memex is probably not the right fit:
1. **Enterprise scale** — millions of documents, hundreds of users,
RBAC, compliance: these need real enterprise knowledge management
infrastructure. memex is tuned for personal and small-team use.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
works great up to thousands of pages. At millions, a dedicated
vector database with specialized retrieval wins.
3. **Replacing your own learning** — memex is an augmentation layer,
not a substitute for reading. Used well, it's a memory aid; used as
a bypass, it just lets you forget more.
4. **Precision-critical source of truth** — for legal, medical, or
regulatory data, memex is a drafting tool. Human domain-expert
review still owns the final call.
5. **Access control** — the network boundary (Tailscale) is the
fastest path to "only authorized people can reach it." memex itself
doesn't enforce permissions inside that boundary.
These are scope decisions, not unfinished work. memex is the best
personal/small-team answer to Karpathy's pattern I could build; it's
not trying to be every answer.
---
## Further reading
- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
— the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
structural memory layer
- [Signal & Noise interactive analysis](https://eric-turner.com/memex/signal-and-noise.html)
— the design rationale this document summarizes (live interactive version)
- [`artifacts/signal-and-noise.html`](artifacts/signal-and-noise.html)
— self-contained archive of the same analysis, works offline
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups