Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene. Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
# LLM Wiki — Compounding Knowledge for AI Agents

A persistent, LLM-maintained knowledge base that sits between you and the
sources it was compiled from. Unlike RAG — which re-discovers the same
answers on every query — the wiki **gets richer over time**. Facts get
cross-referenced, contradictions get flagged, stale advice ages out and
gets archived, and new knowledge discovered during a session gets written
back so it's there next time.

The agent reads the wiki at the start of every session and updates it as
new things are learned. The wiki is the long-term memory; the session is
the working memory.

> **Inspiration**: this combines the ideas from
> [Andrej Karpathy's persistent-wiki gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
> and [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace),
> and adds an automation layer on top so the wiki maintains itself.

---

## The problem with stateless RAG

Most people's experience with LLMs and documents looks like RAG: you upload
files, the LLM retrieves chunks at query time, generates an answer, done.
This works — but the LLM is rediscovering knowledge from scratch on every
question. There's no accumulation.

Ask the same subtle question twice and the LLM does all the same work twice.
Ask something that requires synthesizing five documents and the LLM has to
find and piece together the relevant fragments every time. Nothing is built
up. NotebookLM, ChatGPT file uploads, and most RAG systems work this way.

Worse, raw sources go stale. URLs rot. Documentation lags. Blog posts
get retracted. If your knowledge base is "the original documents,"
stale advice keeps showing up alongside current advice and there's no way
to know which is which.

## The core idea — a compounding wiki

Instead of retrieving from raw documents at query time, the LLM
**incrementally builds and maintains a persistent wiki** — a structured,
interlinked collection of markdown files that sits between you and the
raw sources.

When a new source shows up (a doc page, a blog post, a CLI `--help`, a
conversation transcript), the LLM doesn't just index it. It reads it,
extracts what's load-bearing, and integrates it into the existing wiki —
updating topic pages, revising summaries, noting where new data
contradicts old claims, strengthening or challenging the evolving
synthesis. The knowledge is compiled once and then *kept current*, not
re-derived on every query.

This is the key difference: **the wiki is a persistent, compounding
artifact.** The cross-references are already there. The contradictions have
already been flagged. The synthesis already reflects everything the LLM
has read. The wiki gets richer with every source added and every question
asked.

You never (or rarely) write the wiki yourself. The LLM writes and maintains
all of it. You're in charge of sourcing, exploration, and asking the right
questions. The LLM does the summarizing, cross-referencing, filing, and
bookkeeping that make a knowledge base actually useful over time.

---

## What this adds beyond Karpathy's gist

Karpathy's gist describes the *idea* — a wiki the agent maintains. This
repo is a working implementation with an automation layer that handles the
lifecycle of knowledge, not just its creation:

| Layer | What it does |
|-------|--------------|
| **Conversation mining** | Extracts Claude Code session transcripts into searchable markdown. Summarizes them via `claude -p` with model routing (haiku for short sessions, sonnet for long ones). Links summaries to wiki pages by topic. |
| **URL harvesting** | Scans summarized conversations for external reference URLs. Fetches them via `trafilatura` → `crawl4ai` → stealth mode cascade. Compiles clean markdown into pending wiki pages. |
| **Human-in-the-loop staging** | Automated content lands in `staging/` with `status: pending`. You review via CLI, interactive prompts, or an in-session Claude review. Nothing automated goes live without approval. |
| **Staleness decay** | Every page tracks `last_verified`. After 6 months without a refresh signal, confidence decays `high → medium`; 9 months → `low`; 12 months → `stale` → auto-archived. |
| **Auto-restoration** | Archived pages that get referenced again in new conversations or wiki updates are automatically restored. |
| **Hygiene** | Daily structural checks (orphans, broken cross-refs, index drift, frontmatter repair). Weekly LLM-powered checks (duplicates, contradictions, missing cross-references). |
| **Orchestrator** | One script chains all of the above into a daily cron-able pipeline. |
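The decay schedule in the table can be sketched as a small function. This is an illustrative assumption, not the repo's actual code: the function name is invented, months are approximated as 30 days, and the return value is an upper bound on confidence given idle time (a page already at `medium` would not be promoted by it).

```python
from datetime import date

def decayed_confidence(last_verified: date, today: date) -> str:
    """Ceiling on a page's confidence given how long it has idled.
    Thresholds mirror the table: 6 months, 9 months, 12 months."""
    months_idle = (today - last_verified).days / 30
    if months_idle >= 12:
        return "stale"   # flagged for auto-archival by the hygiene pass
    if months_idle >= 9:
        return "low"
    if months_idle >= 6:
        return "medium"
    return "high"
```

A page verified in January and untouched until the following August would come back `medium`; anything idle a year or more decays to `stale`.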

The result: you don't have to maintain the wiki. You just *use* it. The
automation handles harvesting new knowledge, retiring old knowledge,
keeping cross-references intact, and flagging ambiguity for review.

---

## Why each part exists

Before implementing anything, the design was worked out interactively
with Claude as a [Signal & Noise analysis of Karpathy's
pattern](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c).
That analysis found seven real weaknesses in the core pattern. This
repo exists because each weakness has a concrete mitigation — and
every component maps directly to one:

| Karpathy-pattern weakness | How this repo answers it |
|---------------------------|--------------------------|
| **Errors persist and compound** | `confidence` field with time-based decay → pages age out visibly. Staging review catches automated content before it goes live. Full-mode hygiene does LLM contradiction detection. |
| **Hard ~50K-token ceiling** | `qmd` (BM25 + vector + re-ranking) set up from day one. Wing/room structural filtering narrows search before retrieval. Archive collection is excluded from default search. |
| **Manual cross-checking returns** | Every wiki claim traces back to immutable `raw/harvested/*.md` with SHA-256 hash. Staging review IS the cross-check. `compilation_notes` field makes review fast. |
| **Knowledge staleness** (the #1 failure mode in community data) | Daily + weekly cron removes "I forgot" as a failure mode. `last_verified` auto-refreshes from conversation references. Decayed pages auto-archive. |
| **Cognitive outsourcing risk** | Staging review forces engagement with every automated page. `qmd query` makes retrieval an active exploration. The ~200-token wake-up briefing is read by the human too. |
| **Weaker semantic retrieval** | `qmd` hybrid (BM25 + vector). Full-mode hygiene adds missing cross-references. Structural metadata (wings, rooms) complements semantic search. |
| **No access control** | Git sync with `merge=union` markdown handling. Network-boundary ACL via Tailscale is the suggested path. *This one is a residual trade-off — see [DESIGN-RATIONALE.md](docs/DESIGN-RATIONALE.md).* |

The short version: Karpathy published the idea, the community found the
holes, and this repo is the automation layer that plugs the holes.
See **[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md)** for the
full argument with honest residual trade-offs and what this repo
explicitly does NOT solve.

---

## Compounding loop

```
┌─────────────────────┐
│ Claude Code         │
│ sessions (.jsonl)   │
└──────────┬──────────┘
           │  extract-sessions.py (hourly, no LLM)
           ▼
┌─────────────────────┐
│ conversations/      │  markdown transcripts
│ <project>/*.md      │  (status: extracted)
└──────────┬──────────┘
           │  summarize-conversations.py --claude (daily)
           ▼
┌─────────────────────┐
│ conversations/      │  summaries with related: wiki links
│ <project>/*.md      │  (status: summarized)
└──────────┬──────────┘
           │  wiki-harvest.py (daily)
           ▼
┌─────────────────────┐
│ raw/harvested/      │  fetched URL content
│ *.md                │  (immutable source material)
└──────────┬──────────┘
           │  claude -p compile step
           ▼
┌─────────────────────┐
│ staging/<type>/     │  pending pages
│ *.md                │  (status: pending, origin: automated)
└──────────┬──────────┘
           │  human review (wiki-staging.py --review)
           ▼
┌─────────────────────┐
│ patterns/           │  LIVE wiki
│ decisions/          │  (origin: manual or promoted-from-automated)
│ concepts/           │
│ environments/       │
└──────────┬──────────┘
           │  wiki-hygiene.py (daily quick / weekly full)
           │   - refresh last_verified from new conversations
           │   - decay confidence on idle pages
           │   - auto-restore archived pages referenced again
           │   - fuzzy-fix broken cross-references
           ▼
┌─────────────────────┐
│ archive/<type>/     │  stale/superseded content
│ *.md                │  (excluded from default search)
└─────────────────────┘
```

Every arrow is automated. The only human step is staging review — and
that's quick because the AI compilation step already wrote the page;
you just approve or reject.

---

## Quick start — two paths

### Path A: just the idea (Karpathy-style)

Open a Claude Code session in an empty directory and tell it:

```
I want you to start maintaining a persistent knowledge wiki for me.
Create a directory structure with patterns/, decisions/, concepts/, and
environments/ subdirectories. Each page should have YAML frontmatter with
title, type, confidence, sources, related, last_compiled, and last_verified
fields. Create an index.md at the root that catalogs every page.

From now on, when I share a source (a doc page, a CLI --help, a conversation
I had), read it, extract what's load-bearing, and integrate it into the
wiki. Update existing pages when new knowledge refines them. Flag
contradictions between pages. Create new pages when topics aren't
covered yet. Update index.md every time you create or remove a page.

When I ask a question, read the relevant wiki pages first, then answer.
If you rely on a wiki page with `confidence: low`, flag that to me.
```

That's the whole idea. The agent will build you a growing markdown tree
that compounds over time. This is the minimum viable version.
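A page carrying the frontmatter fields named in the prompt might look like this. The page topic and all field values here are invented for illustration; the exact value formats the agent settles on may differ:

```yaml
---
title: Retry with exponential backoff
type: pattern
confidence: high
sources:
  - raw/harvested/2024-05-10-backoff-blog.md
related:
  - concepts/idempotency.md
last_compiled: 2024-05-10
last_verified: 2024-06-01
---
```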

### Path B: the full automation (this repo)

```bash
git clone <this-repo> ~/projects/wiki
cd ~/projects/wiki

# Install the Python extraction tools
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup

# Install qmd for full-text + vector search
npm install -g @tobilu/qmd

# Configure qmd (3 collections — see docs/SETUP.md for the YAML)
# Edit scripts/extract-sessions.py with your project codes
# Edit scripts/update-conversation-index.py with matching display names

# Copy the example CLAUDE.md files (wiki schema + global instructions)
cp docs/examples/wiki-CLAUDE.md CLAUDE.md
cat docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# edit both for your conventions

# Run the full pipeline once, manually
bash scripts/mine-conversations.sh --extract-only     # Fast, no LLM
python3 scripts/summarize-conversations.py --claude   # Classify + summarize
python3 scripts/update-conversation-index.py --reindex

# Then maintain
bash scripts/wiki-maintain.sh                         # Daily hygiene
bash scripts/wiki-maintain.sh --hygiene-only --full   # Weekly deep pass
```

See [`docs/SETUP.md`](docs/SETUP.md) for complete setup including qmd
configuration (three collections: `wiki`, `wiki-archive`,
`wiki-conversations`), optional cron schedules, git sync, and the
post-merge hook. See [`docs/examples/`](docs/examples/) for starter
`CLAUDE.md` files (wiki schema + global instructions) with explicit
guidance on using the three qmd collections.
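Once the manual run looks right, the cadences mentioned throughout (hourly extraction, daily maintenance, weekly deep pass) map onto a small crontab. The times below are arbitrary placeholders; see `docs/SETUP.md` for the recommended schedules:

```
# Illustrative crontab; adjust paths and times to taste
0 * * * *   bash ~/projects/wiki/scripts/mine-conversations.sh --extract-only
30 3 * * *  bash ~/projects/wiki/scripts/wiki-maintain.sh
30 4 * * 0  bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --full
```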

---

## Directory layout after setup

```
wiki/
├── CLAUDE.md              ← Schema + instructions the agent reads every session
├── index.md               ← Content catalog (the agent reads this first)
├── patterns/              ← HOW things should be built (LIVE)
├── decisions/             ← WHY we chose this approach (LIVE)
├── concepts/              ← WHAT the foundational ideas are (LIVE)
├── environments/          ← WHERE implementations differ (LIVE)
├── staging/               ← PENDING automated content awaiting review
│   ├── index.md
│   └── <type>/
├── archive/               ← STALE / superseded (excluded from search)
│   ├── index.md
│   └── <type>/
├── raw/                   ← Immutable source material (never modified)
│   ├── <topic>/
│   └── harvested/         ← URL harvester output
├── conversations/         ← Mined Claude Code session transcripts
│   ├── index.md
│   └── <project>/
├── context/               ← Auto-updated AI session briefing
│   ├── wake-up.md         ← Loaded at the start of every session
│   └── active-concerns.md ← Current blockers and focus areas
├── reports/               ← Hygiene operation logs
├── scripts/               ← The automation pipeline
├── tests/                 ← Pytest suite (171 tests)
├── .harvest-state.json    ← URL dedup state (committed, synced)
├── .hygiene-state.json    ← Content hashes, deferred issues (committed, synced)
└── .mine-state.json       ← Conversation extraction offsets (gitignored, per-machine)
```

---

## What's Claude-specific (and what isn't)

This repo is built around **Claude Code** as the agent. Specifically:

1. **Session mining** expects `~/.claude/projects/<hashed-path>/*.jsonl`
   files written by the Claude Code CLI. Other agents won't produce these.
2. **Summarization** uses `claude -p` (the Claude Code CLI's one-shot mode)
   with haiku/sonnet routing by conversation length. Other LLM CLIs would
   need a different wrapper.
3. **URL compilation** uses `claude -p` to turn raw harvested content into
   a wiki page with proper frontmatter.
4. **The agent itself** (the thing that reads `CLAUDE.md` and maintains the
   wiki conversationally) is Claude Code. Any agent that reads markdown
   and can write files could do this job — `CLAUDE.md` is just a text
   file telling the agent what the wiki's conventions are.
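The length-based model routing in item 2 could be as simple as the sketch below. The threshold, function name, and use of character count (rather than, say, token count) are illustrative assumptions, not the script's actual logic:

```python
def pick_model(transcript: str, long_threshold_chars: int = 40_000) -> str:
    """Route short sessions to a cheap model and long ones to a stronger
    one; the result would be passed along to the `claude -p` invocation."""
    return "sonnet" if len(transcript) >= long_threshold_chars else "haiku"
```

Swapping in a different agent means replacing this routing with whatever model-selection knob your LLM CLI exposes.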

**What's NOT Claude-specific**:

- The wiki schema (frontmatter, directory layout, lifecycle states)
- The staleness decay model and archive/restore semantics
- The human-in-the-loop staging workflow
- The hygiene checks (orphans, broken cross-refs, duplicates)
- The `trafilatura` + `crawl4ai` URL fetching
- The qmd search integration
- The git-based cross-machine sync

If you use a different agent, you replace parts **1-4** above with
equivalents for your agent. The other 80% of the repo is agent-agnostic.
See [`docs/CUSTOMIZE.md`](docs/CUSTOMIZE.md) for concrete adaptation
recipes.

---

## Architecture at a glance

Eleven scripts organized in three layers:

**Mining layer** (ingests conversations):
- `extract-sessions.py` — Parse Claude Code JSONL → markdown transcripts
- `summarize-conversations.py` — Classify + summarize via `claude -p`
- `update-conversation-index.py` — Regenerate conversation index + wake-up context

**Automation layer** (maintains the wiki):
- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants
- `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
- `wiki-staging.py` — Human review (list/promote/reject/review/sync)
- `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene
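The shared frontmatter parsing in `wiki_lib.py` presumably resembles this minimal sketch. The function name and return shape are assumptions (the real code builds a `WikiPage` dataclass), and this version only handles flat `key: value` lines, not nested YAML:

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split a wiki page into (frontmatter dict, markdown body)."""
    if not text.startswith("---\n"):
        return {}, text                      # no frontmatter block
    header, _, body = text[4:].partition("\n---\n")
    meta = {}
    for line in header.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body.lstrip("\n")
```

Centralizing this in one module is what lets harvest, staging, and hygiene all agree on how `confidence`, `status`, and `last_verified` are read and rewritten.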

**Sync layer**:
- `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling
- `mine-conversations.sh` — Mining orchestrator
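The merge-union handling refers to git's built-in `union` merge driver, enabled per path in `.gitattributes`. On conflicting hunks it keeps both sides' lines instead of emitting conflict markers, which suits append-heavy wiki pages, though it can occasionally interleave lines oddly:

```
*.md merge=union
```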

See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for a deeper tour.

---

## Why markdown, not a real database?

Markdown files are:

- **Human-readable without any tooling** — you can browse in Obsidian, VS Code, or `cat`
- **Git-native** — full history, branching, rollback, cross-machine sync for free
- **Agent-friendly** — every LLM was trained on markdown, so reading and writing it is free
- **Durable** — no schema migrations, no database corruption, no vendor lock-in
- **Interoperable** — Obsidian graph view, `grep`, `qmd`, `ripgrep`, any editor

A SQLite file with the same content would be faster to query but harder
to browse, harder to merge, harder to audit, and fundamentally less
*collaborative* between you and the agent. Markdown is to knowledge
management what Postgres is to transactions.

---

## Testing

Full pytest suite in `tests/` — 171 tests across all scripts, runs in
**~1.3 seconds**, no network or LLM calls needed, works on macOS and
Linux/WSL.

```bash
cd tests && python3 -m pytest
# or
bash tests/run.sh
```

The test suite uses a disposable `tmp_wiki` fixture so no test ever
touches your real wiki.
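The isolation idea behind the `tmp_wiki` fixture can be sketched as a plain helper. The names and the exact directory set are assumptions based on the layout above; the suite's real fixture presumably wraps the same idea in pytest's `tmp_path`:

```python
import tempfile
from pathlib import Path

LIVE_DIRS = ["patterns", "decisions", "concepts", "environments",
             "staging", "archive", "raw", "conversations"]

def make_tmp_wiki() -> Path:
    """Create a disposable wiki skeleton in a fresh temp directory,
    so tests mutate it freely without touching the real wiki."""
    root = Path(tempfile.mkdtemp(prefix="tmp_wiki_"))
    for name in LIVE_DIRS:
        (root / name).mkdir()
    (root / "index.md").write_text("# Index\n")
    return root
```

Because every test gets its own skeleton, the suite needs no network, no LLM, and no cleanup ordering between tests.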

---

## Credits and inspiration

This repo is a synthesis of two existing ideas with an automation layer
on top. It would not exist without either of them.

**Core pattern — [Andrej Karpathy — "Agent-Maintained Persistent Wiki" gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)**
The foundational idea of a compounding LLM-maintained wiki that moves
synthesis from query-time (RAG) to ingest-time. This repo is an
implementation of Karpathy's pattern with the community-identified
failure modes plugged.

**Structural memory taxonomy — [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace)**
The wing/room/hall/closet/drawer/tunnel concepts that turn a flat
corpus into something you can navigate without reading everything. See
[`ARCHITECTURE.md#borrowed-concepts`](docs/ARCHITECTURE.md#borrowed-concepts)
for the explicit mapping of MemPalace terms to this repo's
implementation.

**Search layer — [qmd](https://github.com/tobi/qmd)** by Tobi Lütke
(Shopify CEO). Local BM25 + vector + LLM re-ranking on markdown files.
Chosen over ChromaDB because it uses the same storage format as the
wiki — one index to maintain, not two. Explicitly recommended by
Karpathy as well.

**URL extraction stack** — [trafilatura](https://github.com/adbar/trafilatura)
for fast static-page extraction and [crawl4ai](https://github.com/unclecode/crawl4ai)
for JS-rendered and anti-bot cases. The two-tool cascade handles
essentially any web content without needing a full browser stack for
simple pages.

**The agent** — [Claude Code](https://claude.com/claude-code) by Anthropic.
The repo is Claude-specific (see the section above for what that means
and how to adapt for other agents).

**Design process** — this repo was designed interactively with Claude
as a structured Signal & Noise analysis before any code was written.
The interactive design artifact is here:
[The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c).
That artifact walks through the seven real strengths and seven real
weaknesses of the core pattern, then works through concrete mitigations
for each weakness. Every component in this repo maps back to a specific
mitigation identified there.
[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md) is the condensed
version of that analysis as it applies to this implementation.

---

## License

MIT — see [`LICENSE`](LICENSE).

## Contributing

This is a personal project that I'm making public in case the pattern is
useful to others. Issues and PRs welcome, but I make no promises about
response time. If you fork and make it your own, I'd love to hear how you
adapted it.