Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed

Author: Eric Turner
Date: 2026-04-12 21:16:02 -06:00
Commit: ee54a2f5d4
31 changed files with 10792 additions and 0 deletions

docs/DESIGN-RATIONALE.md

# Design Rationale — Signal & Noise
Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."
Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:
> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.

---

## Where the pattern is genuinely strong

The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:

| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |

---

## Where the pattern is genuinely weak — and how this repo answers

The analysis identified seven real weaknesses. Each has direct
mitigations in this repo; the trade-offs that remain are called out
inline and collected at the end.
### 1. Errors persist and compound
**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.
**How this repo mitigates**:
- **`confidence` field** — every page carries `high`/`medium`/`low` with
decay based on `last_verified`. Wrong claims aren't treated as
permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
they're excluded from default search. If they get referenced again
they're auto-restored with `confidence: medium` (never straight to
`high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
are the ground truth. Every compiled wiki page can be traced back to
its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
sonnet to find conflicting claims across pages. Report-only (humans
decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
Nothing enters the live wiki without human approval, so errors have
two chances to get caught (AI compile + human review) before they
become persistent.
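
Concretely, these mechanisms imply per-page frontmatter along these lines (a sketch: `confidence`, `last_verified`, and `sources:` come from the text above; the title and file names are purely illustrative):

```yaml
---
title: qmd-hybrid-search          # hypothetical page name
confidence: medium                 # high / medium / low; decays from last_verified
last_verified: 2026-04-01
sources:
  - raw/harvested/2026-03-28-qmd-notes.md   # immutable ground truth this page compiles
---
```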
### 2. Hard scale ceiling at ~50K tokens
**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.
**How this repo mitigates**:
- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
up in the default configuration so the agent never has to load the
full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
project code (wing) and topic (room, via the `topics:` frontmatter).
Retrieval is pre-narrowed to the relevant wing before search runs.
This extends the effective ceiling because `qmd` works on a relevant
subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
`includeByDefault: false`, so archived pages don't eat context until
explicitly queried.
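
The wing/room narrowing step is just a metadata pre-filter before search. A minimal sketch (the `Page` shape and field names are assumptions; a real pipeline would hand the surviving subset to `qmd` rather than print it):

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    path: str
    wing: str                              # project code, e.g. "mc" or "wiki"
    topics: list = field(default_factory=list)  # "room" metadata

def narrow(pages, wing=None, topic=None):
    """Structural pre-filter: shrink the corpus by wing/topic metadata
    so BM25/vector search only ever runs over a relevant subset."""
    hits = pages
    if wing is not None:
        hits = [p for p in hits if p.wing == wing]
    if topic is not None:
        hits = [p for p in hits if topic in p.topics]
    return hits

pages = [
    Page("wiki/qmd-setup.md", "wiki", ["retrieval"]),
    Page("conversations/mc/session-01.md", "mc", ["gameplay"]),
    Page("wiki/decay-policy.md", "wiki", ["hygiene"]),
]
subset = narrow(pages, wing="wiki", topic="retrieval")
print([p.path for p in subset])  # ['wiki/qmd-setup.md']
```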
### 3. Manual cross-checking burden returns in precision-critical domains
**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.
**How this repo mitigates**:
- **Staging workflow** — every automated page goes through human review.
For precision-critical content, that review IS the cross-check. The
AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
explanation of what it did and why. Makes review faster — you can
spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
(via `CLAUDE.md`) tell it to flag low-confidence content when
citing. Humans see the warning before acting.
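
Verification against a raw source is mechanical: re-hash the harvested file and compare. A sketch (function names are illustrative; the text above only specifies SHA-256 and the `content_hash` field):

```python
import hashlib

def content_hash(text: str) -> str:
    """SHA-256 of the raw source body, as stored in the content_hash
    frontmatter field (hex digest format is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify(raw_text: str, recorded_hash: str) -> bool:
    """True iff the harvested source is byte-identical to what the
    wiki page was compiled from."""
    return content_hash(raw_text) == recorded_hash

source = "qmd supports BM25, vector, and LLM re-ranking.\n"
recorded = content_hash(source)      # stored at harvest time
assert verify(source, recorded)
assert not verify(source + "edited", recorded)
```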
**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
### 4. Knowledge staleness without active upkeep
**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.
**How this repo mitigates** (this is the biggest thing):
- **Automation replaces human discipline** — daily cron runs
`wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
`--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
manually. Every Claude Code session becomes potential ingest. The
feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
summarizer links a conversation to a wiki page via `related:`, the
hygiene script picks that up and bumps `last_verified`. Pages stay
fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
for 6/9/12 months get downgraded and eventually archived. The wiki
self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
flags the things that *do* need human judgment. Everything else is
auto-fixed.
This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.
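The 6/9/12-month decay schedule can be sketched in a few lines (assumptions: a month is approximated as 30 days, and the return values here stand in for whatever the real hygiene script writes back):

```python
from datetime import date

def decayed_confidence(confidence: str, last_verified: date, today: date) -> str:
    """Apply the 6/9/12-month decay thresholds: downgrade, then
    downgrade again, then archive."""
    age_months = (today - last_verified).days / 30
    if age_months >= 12:
        return "archive"
    if age_months >= 9:
        return "low"
    if age_months >= 6:
        return {"high": "medium", "medium": "low", "low": "low"}[confidence]
    return confidence

def restore_confidence() -> str:
    """Archived pages referenced again come back at medium, never high."""
    return "medium"

today = date(2026, 4, 12)
assert decayed_confidence("high", date(2026, 1, 1), today) == "high"     # fresh
assert decayed_confidence("high", date(2025, 9, 1), today) == "medium"   # > 6 months
assert decayed_confidence("high", date(2025, 6, 1), today) == "low"      # > 9 months
assert decayed_confidence("high", date(2025, 3, 1), today) == "archive"  # > 12 months
```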
### 5. Cognitive outsourcing risk
**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.
**How this repo mitigates**:
- **Staging review is a forcing function** — you see every automated
page before it lands. Even skimming forces engagement with the
material.
- **`qmd query "..."` for exploration** — searching the wiki is an
active process, not passive retrieval. You're asking questions, not
pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
the agent reads at session start. You read it too (or the agent reads
it to you) — ongoing re-exposure to your own knowledge base.
**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.
### 6. Weaker semantic retrieval than RAG at scale
**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.
**How this repo mitigates**:
- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
(wings) and topic frontmatter narrow the search space before the
hybrid search runs. Structure + semantics is stronger than either
alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
LLM to find pages that *should* link to each other but don't, then
auto-adds them. This is the explicit-linking approach catching up to
semantic retrieval over time.
**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. No access control or multi-user support
**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.
**How this repo mitigates**:
- **Git-based sync with merge-union** — concurrent writes on different
machines auto-resolve because markdown is set to `merge=union` in
`.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
deployment is over Tailscale or a VPN, so the network does the work a
RBAC layer would otherwise do. Not enterprise-grade, but sufficient
for personal/family/small-team use.
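
The merge-union behavior is a single line of git configuration; a minimal `.gitattributes` sketch:

```
# union merge: on conflicting edits to the same markdown file,
# keep both sides' lines instead of writing conflict markers
*.md merge=union
```

Union merge is line-based, which suits append-heavy notes; be aware it can interleave lines when both sides edit the same region.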
**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.

---

## The #1 failure mode — active upkeep

Every other weakness has a mitigation. *Active upkeep is the one that
kills wikis in the wild.* The community data is unambiguous:
- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks
The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:

| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
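
The cadence table maps directly onto crontab entries; an illustrative sketch (the `~/memex/scripts/` install path is an assumption):

```
*/15 * * * *  ~/memex/scripts/wiki-sync.sh
0 */2 * * *   ~/memex/scripts/wiki-sync.sh full
0 * * * *     ~/memex/scripts/mine-conversations.sh --extract-only
0 2 * * *     ~/memex/scripts/summarize-conversations.py --claude
0 3 * * *     ~/memex/scripts/wiki-maintain.sh
0 4 * * 0     ~/memex/scripts/wiki-maintain.sh --hygiene-only --full
```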
If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.

---

## What was borrowed from where

This repo is a synthesis of two ideas with an automation layer on top:
### From Karpathy
- The core pattern: LLM-maintained persistent wiki, compile at ingest
time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time
### From mempalace
- **Wings** = per-person or per-project namespaces → this repo uses
project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
`conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
preference / advice / tooling) → the `halls:` frontmatter field
classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
conversation
- **Drawers** = verbatim archive, never lost → the extracted
conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
boost over flat search
The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds
- **Automation layer** tying the pieces together with cron-friendly
orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
content
- **Confidence decay + auto-archive + auto-restore** as the "retention
curve" that community analysis identified as critical for long-term
wiki health
- **`qmd` integration** as the scalable search layer (chosen over
ChromaDB because it uses the same markdown storage as the wiki —
one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
wiki lives on multiple machines without merge hell

---

## Honest residual trade-offs

Five items from the analysis that this repo doesn't fully solve and
where you should know the limits:
1. **Enterprise scale** — this is a personal/small-team tool. Millions
of documents, hundreds of users, RBAC, compliance: wrong
architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
use this as a drafting tool, not a source of truth. Human
domain-expert review is not replaceable.
5. **Access control** — network boundary (Tailscale) is the fastest
path; nothing in the repo itself enforces permissions.
If any of these are dealbreakers for your use case, a different
architecture is probably what you need.

---

## Further reading

- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
— the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
— the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups