memex/README.md
Eric Turner 997aa837de feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
2026-04-12 22:34:33 -06:00


# memex — Compounding Knowledge for AI Agents
A persistent, LLM-maintained knowledge base that sits between you and the
sources it was compiled from. Unlike RAG — which re-discovers the same
answers on every query — memex **gets richer over time**. Facts get
cross-referenced, contradictions get flagged, stale advice ages out and
gets archived, and new knowledge discovered during a session gets written
back so it's there next time.
The agent reads the wiki at the start of every session and updates it as
new things are learned. The wiki is the long-term memory; the session is
the working memory.
> **Why "memex"?** In 1945, Vannevar Bush wrote
> [*As We May Think*](https://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/)
> describing a hypothetical machine called the **memex** (a portmanteau
> of "memory" and "index") that would store and cross-reference a
> person's entire library of books, records, and communications, with
> "associative trails" linking related ideas. He imagined someone using
> it would build up a personal knowledge web over a lifetime, and that
> the trails themselves — the network of learned associations — were
> more valuable than any individual document.
>
> Eighty years later, LLMs make the memex finally buildable. The
> "associative trails" Bush imagined are the `related:` frontmatter
> fields and wikilinks the agent maintains. This repo is one attempt
> at that.
> **Inspiration**: memex combines
> [Andrej Karpathy's persistent-wiki gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
> and [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace),
> and adds an automation layer on top so the wiki maintains itself.
---
## The problem with stateless RAG
Most people's experience with LLMs and documents looks like RAG: you upload
files, the LLM retrieves chunks at query time, generates an answer, done.
This works — but the LLM is rediscovering knowledge from scratch on every
question. There's no accumulation.
Ask the same subtle question twice and the LLM does all the same work twice.
Ask something that requires synthesizing five documents and the LLM has to
find and piece together the relevant fragments every time. Nothing is built
up. NotebookLM, ChatGPT file uploads, and most RAG systems work this way.
Worse, raw sources go stale. URLs rot. Documentation lags. Blog posts
get retracted. If your knowledge base is "the original documents,"
stale advice keeps showing up alongside current advice and there's no way
to know which is which.
## The core idea — a compounding wiki
Instead of retrieving from raw documents at query time, the LLM
**incrementally builds and maintains a persistent wiki** — a structured,
interlinked collection of markdown files that sits between you and the
raw sources.
When a new source shows up (a doc page, a blog post, a CLI `--help`, a
conversation transcript), the LLM doesn't just index it. It reads it,
extracts what's load-bearing, and integrates it into the existing wiki —
updating topic pages, revising summaries, noting where new data
contradicts old claims, strengthening or challenging the evolving
synthesis. The knowledge is compiled once and then *kept current*, not
re-derived on every query.
This is the key difference: **the wiki is a persistent, compounding
artifact.** The cross-references are already there. The contradictions have
already been flagged. The synthesis already reflects everything the LLM
has read. The wiki gets richer with every source added and every question
asked.
You never (or rarely) write the wiki yourself. The LLM writes and maintains
all of it. You're in charge of sourcing, exploration, and asking the right
questions. The LLM does the summarizing, cross-referencing, filing, and
bookkeeping that make a knowledge base actually useful over time.
---
## What this adds beyond Karpathy's gist
Karpathy's gist describes the *idea* — a wiki the agent maintains. This
repo is a working implementation with an automation layer that handles the
lifecycle of knowledge, not just its creation:
| Layer | What it does |
|-------|--------------|
| **Conversation mining** | Extracts Claude Code session transcripts into searchable markdown. Summarizes them via `claude -p` with model routing (haiku for short sessions, sonnet for long ones). Links summaries to wiki pages by topic. |
| **URL harvesting** | Scans summarized conversations for external reference URLs. Fetches them via `trafilatura` → `crawl4ai` → stealth mode cascade. Compiles clean markdown into pending wiki pages. |
| **Human-in-the-loop staging** | Automated content lands in `staging/` with `status: pending`. You review via CLI, interactive prompts, or an in-session Claude review. Nothing automated goes live without approval. |
| **Staleness decay** | Every page tracks `last_verified`. After 6 months without a refresh signal, confidence decays `high → medium`; 9 months → `low`; 12 months → `stale` → auto-archived. |
| **Auto-restoration** | Archived pages that get referenced again in new conversations or wiki updates are automatically restored. |
| **Hygiene** | Daily structural checks (orphans, broken cross-refs, index drift, frontmatter repair). Weekly LLM-powered checks (duplicates, contradictions, missing cross-references). |
| **Orchestrator** | One script chains all of the above into a daily cron-able pipeline. |
The result: you don't have to maintain the wiki. You just *use* it. The
automation handles harvesting new knowledge, retiring old knowledge,
keeping cross-references intact, and flagging ambiguity for review.
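
The staleness-decay row above can be sketched in a few lines. This is a minimal illustration, not the code in `wiki-hygiene.py`: the function names are hypothetical, "months" is assumed to mean whole calendar months, and the rule that decay never raises a page's confidence is an assumption.

```python
from datetime import date

# Idle-month thresholds from the table above, checked from most to least stale.
DECAY_STEPS = [(12, "stale"), (9, "low"), (6, "medium")]
ORDER = ["high", "medium", "low", "stale"]


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed from `earlier` to `later` (assumed definition)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1
    return max(months, 0)


def decayed_confidence(current: str, last_verified: date, today: date) -> str:
    """Return the confidence a page should carry after idle-time decay."""
    idle = months_between(last_verified, today)
    for threshold, level in DECAY_STEPS:
        if idle >= threshold:
            # Assumption: decay only lowers confidence, so keep the lower level.
            return max(current, level, key=ORDER.index)
    return current
```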
---
## How memex extends Karpathy's pattern
Before implementing anything, the design was worked out interactively
with Claude as a structured
[Signal & Noise analysis](https://eric-turner.com/memex/signal-and-noise.html).
Karpathy's original gist is a concept pitch, not an implementation —
he was explicit that he was sharing an "idea file" for others to build
on. memex is one attempt at that build-out. The analysis identified
seven places where the core idea needed an engineering layer to become
practical day-to-day. The table below lists those extensions plus an
eighth, conversation distillation, and every automation component in
this repo maps to one of them:
| What memex adds | How it works |
|-----------------|--------------|
| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. |
| **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. |
| **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. |
| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
| **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. |
| **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. |
| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. |
The short version: Karpathy shared the idea, milla-jovovich's mempalace
added the structural memory taxonomy, and memex is the automation layer
that lets the whole thing run day-to-day without constant human
maintenance. See **[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md)**
for the longer rationale on each extension, plus honest notes on what
memex doesn't cover.
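
The "today's topics, all of history" rollup in the conversation-distillation row can be sketched as follows. This is an illustration under stated assumptions rather than the logic in `wiki-distill.py`: the `Conversation` class, the function name, and the in-memory data shape are hypothetical, and the hall tuple is defined here from the three hall names in the table, not imported from `wiki_lib.py`.

```python
from dataclasses import dataclass, field
from datetime import date

# The three high-signal halls named in the table above; other halls are ignored.
HIGH_SIGNAL_HALLS = ("hall_facts", "hall_discoveries", "hall_advice")


@dataclass
class Conversation:
    """Hypothetical in-memory stand-in for one summarized conversation file."""
    path: str
    day: date
    topics: list[str]
    halls: dict[str, list[str]] = field(default_factory=dict)


def rollup_by_todays_topics(conversations, today):
    """Map each topic seen today to high-signal bullets from every conversation on it."""
    todays_topics = {t for c in conversations if c.day == today for t in c.topics}
    groups = {}
    for topic in sorted(todays_topics):
        # Wide history: any conversation sharing the topic, whatever its date.
        members = [c for c in conversations if topic in c.topics]
        bullets = [
            bullet
            for c in members
            for hall in HIGH_SIGNAL_HALLS
            for bullet in c.halls.get(hall, [])
        ]
        groups[topic] = {"sources": [c.path for c in members], "bullets": bullets}
    return groups
```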
---
## Compounding loop
```
┌─────────────────────┐
│  Claude Code        │
│  sessions (.jsonl)  │
└──────────┬──────────┘
           │ extract-sessions.py (hourly, no LLM)
           ▼
┌─────────────────────┐
│  conversations/     │  markdown transcripts
│  <project>/*.md     │  (status: extracted)
└──────────┬──────────┘
           │ summarize-conversations.py --claude (daily)
           ▼
┌─────────────────────┐
│  conversations/     │  summaries with halls + topics + related:
│  <project>/*.md     │  (status: summarized)
└──────┬───────────┬──┘
       │           │
       │           └──▶ wiki-distill.py (daily Phase 1a) ──┐
       │                  - rollup by today's topics       │
       │                  - pull historical conversations  │
       │                  - extract fact/discovery/advice  │
       │                  - claude -p → new or update      │
       │                                                   │
       │ wiki-harvest.py (daily Phase 1b)                  │
       ▼                                                   │
┌─────────────────────┐                                    │
│  raw/harvested/     │  fetched URL content               │
│  *.md               │  (immutable source material)       │
└──────────┬──────────┘                                    │
           │ claude -p compile step                        │
           ▼                                               │
┌──────────────────────────────────────────────────────┐   │
│  staging/<type>/   pending pages                     │◀──┘
│  *.md              (status: pending, origin: auto)   │
└──────────┬───────────────────────────────────────────┘
           │ human review (wiki-staging.py --review)
           ▼
┌─────────────────────┐
│  patterns/          │  LIVE wiki
│  decisions/         │  (origin: manual or promoted-from-automated)
│  concepts/          │
│  environments/      │
└──────────┬──────────┘
           │ wiki-hygiene.py (daily quick / weekly full)
           │  - refresh last_verified from new conversations
           │  - decay confidence on idle pages
           │  - auto-restore archived pages referenced again
           │  - fuzzy-fix broken cross-references
           ▼
┌─────────────────────┐
│  archive/<type>/    │  stale/superseded content
│  *.md               │  (excluded from default search)
└─────────────────────┘
```
Every arrow except staging review is automated. That one human step is
quick because the AI compilation step has already written the page; you
just approve or reject.
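
To make the approve step concrete, here is a rough sketch of what promoting a staged page might involve. It is not the behavior of `wiki-staging.py`: the function name and signature are hypothetical, and dropping the `status` field on promotion is an assumption. The only parts taken from this README are the `staging/<type>/` layout and the fact that a promoted page gets `last_verified` set.

```python
from datetime import date
from pathlib import PurePosixPath


def promote(staged_path: str, frontmatter: dict, today: date) -> tuple[str, dict]:
    """Return the live path and updated frontmatter for an approved staged page."""
    parts = PurePosixPath(staged_path).parts
    if len(parts) < 3 or parts[0] != "staging":
        raise ValueError(f"not a staged page: {staged_path}")
    # staging/patterns/x.md -> patterns/x.md
    live_path = str(PurePosixPath(*parts[1:]))
    updated = dict(frontmatter)
    # Assumption: a live page no longer carries the pending status marker.
    updated.pop("status", None)
    # Promotion counts as a verification, so stamp the date.
    updated["last_verified"] = today.isoformat()
    return live_path, updated
```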
---
## Quick start — two paths
### Path A: just the idea (Karpathy-style)
Open a Claude Code session in an empty directory and tell it:
```
I want you to start maintaining a persistent knowledge wiki for me.
Create a directory structure with patterns/, decisions/, concepts/, and
environments/ subdirectories. Each page should have YAML frontmatter with
title, type, confidence, sources, related, last_compiled, and last_verified
fields. Create an index.md at the root that catalogs every page.
From now on, when I share a source (a doc page, a CLI --help, a conversation
I had), read it, extract what's load-bearing, and integrate it into the
wiki. Update existing pages when new knowledge refines them. Flag
contradictions between pages. Create new pages when topics aren't
covered yet. Update index.md every time you create or remove a page.
When I ask a question, read the relevant wiki pages first, then answer.
If you rely on a wiki page with `confidence: low`, flag that to me.
```
That's the whole idea. The agent will build you a growing markdown tree
that compounds over time. This is the minimum viable version.
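
For a sense of what a page in that tree looks like on disk, the sketch below renders one with the frontmatter fields named in the prompt. The function name and the example values are hypothetical. Values are written with `json.dumps`, on the assumption that JSON-style strings and lists are accepted by the YAML parser in use, which avoids a YAML dependency.

```python
import json

# Frontmatter fields listed in the prompt above, in the order they are written.
FIELDS = ("title", "type", "confidence", "sources", "related",
          "last_compiled", "last_verified")


def render_page(meta: dict, body: str) -> str:
    """Render a wiki page as markdown with a YAML frontmatter header."""
    missing = [name for name in FIELDS if name not in meta]
    if missing:
        raise ValueError(f"missing frontmatter fields: {missing}")
    # JSON scalars and lists double as YAML flow values (assumed parser behavior).
    header = [f"{name}: {json.dumps(meta[name])}" for name in FIELDS]
    return "\n".join(["---", *header, "---", "", body.rstrip(), ""])


if __name__ == "__main__":
    print(render_page(
        {
            "title": "Rate limits",          # hypothetical example page
            "type": "pattern",
            "confidence": "medium",
            "sources": ["raw/harvested/example.md"],
            "related": ["patterns/retry"],
            "last_compiled": "2026-04-12",
            "last_verified": "2026-04-12",
        },
        "Back off before retrying.",
    ))
```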
### Path B: the full automation (this repo)
```bash
git clone <this-repo> ~/projects/wiki
cd ~/projects/wiki
# Install the Python extraction tools
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup
# Install qmd for full-text + vector search
npm install -g @tobilu/qmd
# Configure qmd (3 collections — see docs/SETUP.md for the YAML)
# Edit scripts/extract-sessions.py with your project codes
# Edit scripts/update-conversation-index.py with matching display names
# Copy the example CLAUDE.md files (wiki schema + global instructions)
cp docs/examples/wiki-CLAUDE.md CLAUDE.md
cat docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# edit both for your conventions
# Run the full pipeline once, manually
bash scripts/mine-conversations.sh --extract-only # Fast, no LLM
python3 scripts/summarize-conversations.py --claude # Classify + summarize
python3 scripts/update-conversation-index.py --reindex
# Then maintain
bash scripts/wiki-maintain.sh # Daily pipeline (distill + harvest + hygiene)
bash scripts/wiki-maintain.sh --hygiene-only --full # Weekly deep pass
```
See [`docs/SETUP.md`](docs/SETUP.md) for complete setup including qmd
configuration (three collections: `wiki`, `wiki-archive`,
`wiki-conversations`), optional cron schedules, git sync, and the
post-merge hook. See [`docs/examples/`](docs/examples/) for starter
`CLAUDE.md` files (wiki schema + global instructions) with explicit
guidance on using the three qmd collections.
---
## Directory layout after setup
```
wiki/
├── CLAUDE.md ← Schema + instructions the agent reads every session
├── index.md ← Content catalog (the agent reads this first)
├── patterns/ ← HOW things should be built (LIVE)
├── decisions/ ← WHY we chose this approach (LIVE)
├── concepts/ ← WHAT the foundational ideas are (LIVE)
├── environments/ ← WHERE implementations differ (LIVE)
├── staging/ ← PENDING automated content awaiting review
│ ├── index.md
│ └── <type>/
├── archive/ ← STALE / superseded (excluded from search)
│ ├── index.md
│ └── <type>/
├── raw/ ← Immutable source material (never modified)
│ ├── <topic>/
│ └── harvested/ ← URL harvester output
├── conversations/ ← Mined Claude Code session transcripts
│ ├── index.md
│ └── <project>/
├── context/ ← Auto-updated AI session briefing
│ ├── wake-up.md ← Loaded at the start of every session
│ └── active-concerns.md ← Current blockers and focus areas
├── reports/ ← Hygiene operation logs
├── scripts/ ← The automation pipeline
├── tests/ ← Pytest suite (192 tests)
├── .distill-state.json ← Conversation distill state (committed, synced)
├── .harvest-state.json ← URL dedup state (committed, synced)
├── .hygiene-state.json ← Content hashes, deferred issues (committed, synced)
└── .mine-state.json ← Conversation extraction offsets (gitignored, per-machine)
```
---
## What's Claude-specific (and what isn't)
This repo is built around **Claude Code** as the agent. Specifically:
1. **Session mining** expects `~/.claude/projects/<hashed-path>/*.jsonl`
files written by the Claude Code CLI. Other agents won't produce these.
2. **Summarization** uses `claude -p` (the Claude Code CLI's one-shot mode)
with haiku/sonnet routing by conversation length. Other LLM CLIs would
need a different wrapper.
3. **URL compilation** uses `claude -p` to turn raw harvested content into
a wiki page with proper frontmatter.
4. **The agent itself** (the thing that reads `CLAUDE.md` and maintains the
wiki conversationally) is Claude Code. Any agent that reads markdown
and can write files could do this job — `CLAUDE.md` is just a text
file telling the agent what the wiki's conventions are.
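
As an illustration of the wrapper that item 2 says another LLM CLI would need, the sketch below only plans the call and does not run it. The function name, the character threshold, and the prompt text are hypothetical. The real routing rule in `summarize-conversations.py` is not shown in this README, and passing the model alias through `--model` with the transcript piped on stdin is an assumption about how the call is made.

```python
def plan_summarize_call(transcript: str, long_threshold_chars: int = 20_000):
    """Return (argv, stdin_text) for a one-shot summarization call.

    Short sessions route to haiku and long ones to sonnet, as described above.
    The threshold is a placeholder, not the value the repo uses.
    """
    model = "sonnet" if len(transcript) > long_threshold_chars else "haiku"
    argv = ["claude", "-p", "Summarize this conversation.", "--model", model]
    # The transcript is meant to be piped on stdin, not passed as an argument.
    return argv, transcript


if __name__ == "__main__":
    argv, stdin_text = plan_summarize_call("user: hi\nassistant: hello\n")
    print(argv)
    # To execute the plan, something like:
    #   subprocess.run(argv, input=stdin_text, text=True, capture_output=True)
```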
**What's NOT Claude-specific**:
- The wiki schema (frontmatter, directory layout, lifecycle states)
- The staleness decay model and archive/restore semantics
- The human-in-the-loop staging workflow
- The hygiene checks (orphans, broken cross-refs, duplicates)
- The `trafilatura` + `crawl4ai` URL fetching
- The qmd search integration
- The git-based cross-machine sync
If you use a different agent, you replace parts **1-4** above with
equivalents for your agent. The other 80% of the repo is agent-agnostic.
See [`docs/CUSTOMIZE.md`](docs/CUSTOMIZE.md) for concrete adaptation
recipes.
---
## Architecture at a glance
Eleven scripts organized in three layers:
**Mining layer** (ingests conversations):
- `extract-sessions.py` — Parse Claude Code JSONL → markdown transcripts
- `summarize-conversations.py` — Classify + summarize via `claude -p`
- `update-conversation-index.py` — Regenerate conversation index + wake-up context
**Automation layer** (maintains the wiki):
- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, hall extraction, constants
- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop)
- `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
- `wiki-staging.py` — Human review (list/promote/reject/review/sync)
- `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene
**Sync layer**:
- `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling
- `mine-conversations.sh` — Mining orchestrator
See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for a deeper tour.
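
As one illustration of the quick structural checks attributed to `wiki-hygiene.py` (orphans, broken cross-refs, index drift), here is a minimal sketch over an in-memory page graph. The function name and data shapes are hypothetical. The definitions used here are assumptions: an orphan is a page no other page lists in `related:`, and drift is any mismatch between the pages on disk and the entries in `index.md`.

```python
def structural_issues(pages: dict[str, list[str]], index_entries: set[str]) -> dict[str, list]:
    """Report broken refs, orphans, and index drift.

    `pages` maps each page path to the paths listed in its `related:` field.
    """
    known = set(pages)
    referenced = {target for targets in pages.values() for target in targets}
    return {
        # A related: entry that points at a page that does not exist.
        "broken_refs": sorted(
            (source, target)
            for source, targets in pages.items()
            for target in targets
            if target not in known
        ),
        # A page that nothing else links to.
        "orphans": sorted(page for page in known if page not in referenced),
        # Pages missing from the index, or index entries with no page.
        "index_drift": sorted(known ^ index_entries),
    }
```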
---
## Why markdown, not a real database?
Markdown files are:
- **Human-readable without any tooling** — you can browse in Obsidian, VS Code, or `cat`
- **Git-native** — full history, branching, rollback, cross-machine sync for free
- **Agent-friendly** — every LLM was trained on markdown, so reading and writing it is free
- **Durable** — no schema migrations, no database corruption, no vendor lock-in
- **Interoperable** — Obsidian graph view, `grep`, `qmd`, `ripgrep`, any editor
A SQLite file with the same content would be faster to query but harder
to browse, harder to merge, harder to audit, and fundamentally less
*collaborative* between you and the agent. Markdown is to knowledge
management what Postgres is to transactions.
---
## Testing
Full pytest suite in `tests/`: 192 tests across all scripts, runs in
**~1.5 seconds**, no network or LLM calls needed, works on macOS and
Linux/WSL.
```bash
cd tests && python3 -m pytest
# or
bash tests/run.sh
```
The test suite uses a disposable `tmp_wiki` fixture so no test ever
touches your real wiki.
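
A disposable wiki of this kind can be approximated with the standard library alone. The helper below is a sketch, not the repo's `tmp_wiki` fixture: `make_tmp_wiki` is a hypothetical name, and the directories it creates are taken from the layout shown earlier in this README. Under pytest, the same helper could be wrapped in a fixture that passes in `tmp_path`.

```python
import tempfile
from pathlib import Path

# Live page directories from the directory layout section.
LIVE_DIRS = ("patterns", "decisions", "concepts", "environments")


def make_tmp_wiki(root: Path) -> Path:
    """Create an empty wiki skeleton under `root` and return it."""
    for name in (*LIVE_DIRS, "staging", "archive", "conversations", "raw/harvested"):
        (root / name).mkdir(parents=True, exist_ok=True)
    (root / "index.md").write_text("# Index\n", encoding="utf-8")
    return root


if __name__ == "__main__":
    # The temporary directory is deleted on exit, so nothing touches a real wiki.
    with tempfile.TemporaryDirectory() as tmp:
        wiki = make_tmp_wiki(Path(tmp))
        print(sorted(p.name for p in wiki.iterdir()))
```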
---
## Credits and inspiration
This repo is a synthesis of two existing ideas with an automation layer
on top. It would not exist without either of them.
**Core pattern — [Andrej Karpathy — "Agent-Maintained Persistent Wiki" gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)**
The foundational idea of a compounding LLM-maintained wiki that moves
synthesis from query-time (RAG) to ingest-time. memex is an
implementation of Karpathy's pattern with the engineering layer that
turns the concept into something practical to run day-to-day.
**Structural memory taxonomy — [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace)**
The wing/room/hall/closet/drawer/tunnel concepts that turn a flat
corpus into something you can navigate without reading everything. See
[`ARCHITECTURE.md#borrowed-concepts`](docs/ARCHITECTURE.md#borrowed-concepts)
for the explicit mapping of MemPalace terms to this repo's
implementation.
**Search layer — [qmd](https://github.com/tobi/qmd)** by Tobi Lütke
(Shopify CEO). Local BM25 + vector + LLM re-ranking on markdown files.
Chosen over ChromaDB because it uses the same storage format as the
wiki — one index to maintain, not two. Explicitly recommended by
Karpathy as well.
**URL extraction stack** — [trafilatura](https://github.com/adbar/trafilatura)
for fast static-page extraction and [crawl4ai](https://github.com/unclecode/crawl4ai)
for JS-rendered and anti-bot cases. The two-tool cascade handles
essentially any web content without needing a full browser stack for
simple pages.
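
The cascade idea is simple enough to sketch without calling either library. In the sketch below the fetchers are injected as plain callables, so nothing here is the trafilatura or crawl4ai API. The function name and the `min_chars` cutoff for "enough text" are hypothetical, and `wiki-harvest.py` may decide when to fall through differently.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes a URL and returns extracted text, or None/"" when it gets nothing.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_cascade(
    url: str,
    fetchers: Iterable[tuple[str, Fetcher]],
    min_chars: int = 200,
) -> Optional[tuple[str, str]]:
    """Try each fetcher in order; return (name, text) from the first usable result."""
    for name, fetch in fetchers:
        try:
            text = fetch(url)
        except Exception:
            # A failing tier falls through to the next, heavier one.
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text
    return None
```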
**The agent** — [Claude Code](https://claude.com/claude-code) by Anthropic.
The repo is Claude-specific (see the section above for what that means
and how to adapt for other agents).
**Design process** — memex was designed interactively with Claude as a
structured Signal & Noise analysis before any code was written. The
analysis walks through the seven real strengths of Karpathy's pattern
and seven places where it needs an engineering layer to be practical,
and works through the concrete extension for each. Every component in
this repo maps back to a specific extension identified there.
- **Live interactive version**:
[eric-turner.com/memex/signal-and-noise.html](https://eric-turner.com/memex/signal-and-noise.html)
— click tabs to explore pros/cons, vs RAG, use-case fits, signal
breakdown, and mitigations
- **Self-contained archive in this repo**:
[`docs/artifacts/signal-and-noise.html`](docs/artifacts/signal-and-noise.html)
— download and open locally; works offline
- **Condensed written version**:
[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md)
— every tradeoff and mitigation rendered as prose
---
## License
MIT — see [`LICENSE`](LICENSE).
## Contributing
This is a personal project that I'm making public in case the pattern is
useful to others. Issues and PRs welcome, but I make no promises about
response time. If you fork and make it your own, I'd love to hear how you
adapted it.