Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed

Author: Eric Turner
Date: 2026-04-12 21:16:02 -06:00
Commit: ee54a2f5d4
31 changed files with 10792 additions and 0 deletions

docs/DESIGN-RATIONALE.md

# Design Rationale — Signal & Noise
Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."
Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:
> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.

---

## Where the pattern is genuinely strong

The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:

| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |

---

## Where the pattern is genuinely weak — and how this repo answers

The analysis identified seven real weaknesses. Each has direct
mitigations in this repo; the trade-offs that remain are called out
inline and collected at the end.
### 1. Errors persist and compound
**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.
**How this repo mitigates**:
- **`confidence` field** — every page carries `high`/`medium`/`low` with
decay based on `last_verified`. Wrong claims aren't treated as
permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
they're excluded from default search. If they get referenced again
they're auto-restored with `confidence: medium` (never straight to
`high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
are the ground truth. Every compiled wiki page can be traced back to
its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
sonnet to find conflicting claims across pages. Report-only (humans
decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
Nothing enters the live wiki without human approval, so errors have
two chances to get caught (AI compile + human review) before they
become persistent.
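
Concretely, these mechanisms imply per-page frontmatter along these lines (a sketch: `confidence`, `last_verified`, and `sources:` come from the text above; the title and file names are purely illustrative):

```yaml
---
title: qmd-hybrid-search          # hypothetical page name
confidence: medium                 # high / medium / low; decays from last_verified
last_verified: 2026-04-01
sources:
  - raw/harvested/2026-03-28-qmd-notes.md   # immutable ground truth this page compiles
---
```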
### 2. Hard scale ceiling at ~50K tokens
**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.
**How this repo mitigates**:
- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
up in the default configuration so the agent never has to load the
full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
project code (wing) and topic (room, via the `topics:` frontmatter).
Retrieval is pre-narrowed to the relevant wing before search runs.
This extends the effective ceiling because `qmd` works on a relevant
subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
`includeByDefault: false`, so archived pages don't eat context until
explicitly queried.
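
The wing/room narrowing step is just a metadata pre-filter before search. A minimal sketch (the `Page` shape and field names are assumptions; a real pipeline would hand the surviving subset to `qmd` rather than print it):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    path: str
    wing: str                              # project code, e.g. "mc" or "wiki"
    topics: list = field(default_factory=list)  # "room" metadata

def narrow(pages, wing=None, topic=None):
    """Structural pre-filter: shrink the corpus by wing/topic metadata
    so BM25/vector search only ever runs over a relevant subset."""
    hits = pages
    if wing is not None:
        hits = [p for p in hits if p.wing == wing]
    if topic is not None:
        hits = [p for p in hits if topic in p.topics]
    return hits

pages = [
    Page("wiki/qmd-setup.md", "wiki", ["retrieval"]),
    Page("conversations/mc/session-01.md", "mc", ["gameplay"]),
    Page("wiki/decay-policy.md", "wiki", ["hygiene"]),
]
subset = narrow(pages, wing="wiki", topic="retrieval")
print([p.path for p in subset])  # ['wiki/qmd-setup.md']
```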
### 3. Manual cross-checking burden returns in precision-critical domains
**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.
**How this repo mitigates**:
- **Staging workflow** — every automated page goes through human review.
For precision-critical content, that review IS the cross-check. The
AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
explanation of what it did and why. Makes review faster — you can
spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
(via `CLAUDE.md`) tell it to flag low-confidence content when
citing. Humans see the warning before acting.
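
Verification against a raw source is mechanical: re-hash the harvested file and compare. A sketch (function names are illustrative; the text above only specifies SHA-256 and the `content_hash` field):

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 of the raw source body, as stored in the content_hash
    frontmatter field (hex digest format is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify(raw_text: str, recorded_hash: str) -> bool:
    """True iff the harvested source is byte-identical to what the
    wiki page was compiled from."""
    return content_hash(raw_text) == recorded_hash

source = "qmd supports BM25, vector, and LLM re-ranking.\n"
recorded = content_hash(source)      # stored at harvest time
assert verify(source, recorded)
assert not verify(source + "edited", recorded)
```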
**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
### 4. Knowledge staleness without active upkeep
**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.
**How this repo mitigates** (this is the biggest thing):
- **Automation replaces human discipline** — daily cron runs
`wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
`--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
manually. Every Claude Code session becomes potential ingest. The
feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
summarizer links a conversation to a wiki page via `related:`, the
hygiene script picks that up and bumps `last_verified`. Pages stay
fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
for 6/9/12 months get downgraded and eventually archived. The wiki
self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
flags the things that *do* need human judgment. Everything else is
auto-fixed.
This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.
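The 6/9/12-month decay schedule can be sketched in a few lines (assumptions: a month is approximated as 30 days, and the return values here stand in for whatever the real hygiene script writes back):

```python
from datetime import date

def decayed_confidence(confidence: str, last_verified: date, today: date) -> str:
    """Apply the 6/9/12-month decay thresholds: downgrade, then
    downgrade again, then archive."""
    age_months = (today - last_verified).days / 30
    if age_months >= 12:
        return "archive"
    if age_months >= 9:
        return "low"
    if age_months >= 6:
        return {"high": "medium", "medium": "low", "low": "low"}[confidence]
    return confidence

def restore_confidence() -> str:
    """Archived pages referenced again come back at medium, never high."""
    return "medium"

today = date(2026, 4, 12)
assert decayed_confidence("high", date(2026, 1, 1), today) == "high"     # fresh
assert decayed_confidence("high", date(2025, 9, 1), today) == "medium"   # > 6 months
assert decayed_confidence("high", date(2025, 6, 1), today) == "low"      # > 9 months
assert decayed_confidence("high", date(2025, 3, 1), today) == "archive"  # > 12 months
```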
### 5. Cognitive outsourcing risk
**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.
**How this repo mitigates**:
- **Staging review is a forcing function** — you see every automated
page before it lands. Even skimming forces engagement with the
material.
- **`qmd query "..."` for exploration** — searching the wiki is an
active process, not passive retrieval. You're asking questions, not
pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
the agent reads at session start. You read it too (or the agent reads
it to you) — ongoing re-exposure to your own knowledge base.
**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.
### 6. Weaker semantic retrieval than RAG at scale
**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.
**How this repo mitigates**:
- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
(wings) and topic frontmatter narrow the search space before the
hybrid search runs. Structure + semantics is stronger than either
alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
LLM to find pages that *should* link to each other but don't, then
auto-adds them. This is the explicit-linking approach catching up to
semantic retrieval over time.
**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. No access control or multi-user support
**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.
**How this repo mitigates**:
- **Git-based sync with merge-union** — concurrent writes on different
machines auto-resolve because markdown is set to `merge=union` in
`.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
deployment is over Tailscale or a VPN, so the network does the work a
RBAC layer would otherwise do. Not enterprise-grade, but sufficient
for personal/family/small-team use.
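
The merge-union behavior is a single line of git configuration; a minimal `.gitattributes` sketch:

```
# union merge: on conflicting edits to the same markdown file,
# keep both sides' lines instead of writing conflict markers
*.md merge=union
```

Union merge is line-based, which suits append-heavy notes; be aware it can interleave lines when both sides edit the same region.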
**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.

---

## The #1 failure mode — active upkeep

Every other weakness has a mitigation. *Active upkeep is the one that
kills wikis in the wild.* The community data is unambiguous:
- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks
The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:

| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
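
The cadence table maps directly onto crontab entries; an illustrative sketch (the `~/memex/scripts/` install path is an assumption):

```
*/15 * * * *  ~/memex/scripts/wiki-sync.sh
0 */2 * * *   ~/memex/scripts/wiki-sync.sh full
0 * * * *     ~/memex/scripts/mine-conversations.sh --extract-only
0 2 * * *     ~/memex/scripts/summarize-conversations.py --claude
0 3 * * *     ~/memex/scripts/wiki-maintain.sh
0 4 * * 0     ~/memex/scripts/wiki-maintain.sh --hygiene-only --full
```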
If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.

---

## What was borrowed from where

This repo is a synthesis of two ideas with an automation layer on top:
### From Karpathy
- The core pattern: LLM-maintained persistent wiki, compile at ingest
time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time
### From mempalace
- **Wings** = per-person or per-project namespaces → this repo uses
project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
`conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
preference / advice / tooling) → the `halls:` frontmatter field
classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
conversation
- **Drawers** = verbatim archive, never lost → the extracted
conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
boost over flat search
The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds
- **Automation layer** tying the pieces together with cron-friendly
orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
content
- **Confidence decay + auto-archive + auto-restore** as the "retention
curve" that community analysis identified as critical for long-term
wiki health
- **`qmd` integration** as the scalable search layer (chosen over
ChromaDB because it uses the same markdown storage as the wiki —
one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
wiki lives on multiple machines without merge hell

---

## Honest residual trade-offs

Five items from the analysis that this repo doesn't fully solve and
where you should know the limits:
1. **Enterprise scale** — this is a personal/small-team tool. Millions
of documents, hundreds of users, RBAC, compliance: wrong
architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
use this as a drafting tool, not a source of truth. Human
domain-expert review is not replaceable.
5. **Access control** — network boundary (Tailscale) is the fastest
path; nothing in the repo itself enforces permissions.
If any of these are dealbreakers for your use case, a different
architecture is probably what you need.

---

## Further reading

- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
— the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
— the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups