Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
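Steps 1–4 of the rollup can be sketched in Python. The data shapes and
function name below are illustrative stand-ins for wiki-distill.py's
actual internals, assuming each summarized conversation exposes its
date, topics, and hall bullets:

```python
from collections import defaultdict
from datetime import date

HIGH_SIGNAL_HALLS = {"hall_facts", "hall_discoveries", "hall_advice"}

def build_topic_groups(conversations, today=None):
    """Sketch of the narrow-today / wide-history rollup.

    `conversations` is a list of dicts with `id`, `date` (datetime.date),
    `topics` (list of str), and `halls` (hall name -> list of bullets) --
    a simplified stand-in for the real summarized-conversation parsing.
    """
    today = today or date.today()
    # Steps 1-2: topics appearing in any conversation summarized today.
    todays_topics = {
        t for c in conversations if c["date"] == today for t in c["topics"]
    }
    # Steps 3-4: for each such topic, roll up ALL historical conversations
    # sharing it, keeping only the high-signal hall bullets.
    groups = defaultdict(list)
    for c in conversations:
        for t in todays_topics & set(c["topics"]):
            bullets = [
                b for hall, bs in c["halls"].items()
                if hall in HIGH_SIGNAL_HALLS for b in bs
            ]
            if bullets:
                groups[t].append({"conversation": c["id"], "bullets": bullets})
    return dict(groups)
```

The key property: the outer filter (today's topics) bounds daily work,
while the inner scan (all history) keeps the LLM's context complete for
each topic.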
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
# Design Rationale — Signal & Noise

Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."

Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:

> **Interactive version**: [eric-turner.com/memex/signal-and-noise.html](https://eric-turner.com/memex/signal-and-noise.html)
> — tabs for pros/cons, vs RAG, use-case fits, signal breakdown, mitigations
>
> **Self-contained archive**: [`artifacts/signal-and-noise.html`](artifacts/signal-and-noise.html)
> — same content, works offline

The analysis walks through the pattern's seven genuine strengths, eight
places where it needs an engineering layer to be practical, and the
concrete extension for each. memex is the implementation of those
extensions. If you want to understand *why* a component exists, the
interactive version has the longer-form argument; this document is the
condensed written version.

---
## Where the pattern is genuinely strong

The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:

| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |

---
## Where memex extends the pattern

Karpathy's gist is a concept pitch. He was explicit that he was sharing
an "idea file" for others to build on, not publishing a working
implementation. The analysis identified eight places where the core idea
needs an engineering layer to become practical day-to-day. The first
seven emerged from the original Signal & Noise review; the eighth
(conversation distillation) surfaced after building the other layers
and realizing that the conversations themselves were being mined,
summarized, indexed, and scanned for URLs — but the knowledge *inside*
them was never becoming wiki pages.
### 1. Claim freshness and reversibility

**The gap**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM-maintained wiki is stateful. If a
claim is wrong at ingest time, it stays wrong until something corrects
it. For the pattern to work long-term, claims need a way to earn trust
over time and lose it when unused.

**How memex extends it**:

- **`confidence` field** — every page carries `high`/`medium`/`low` with
  decay based on `last_verified`. Wrong claims aren't treated as
  permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
  they're excluded from default search. If they get referenced again
  they're auto-restored with `confidence: medium` (never straight to
  `high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
  are the ground truth. Every compiled wiki page can be traced back to
  its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
  sonnet to find conflicting claims across pages. Report-only (humans
  decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
  Nothing enters the live wiki without human approval, so errors have
  two chances to get caught (AI compile + human review) before they
  become persistent.
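The decay half of this can be sketched in Python. The mapping of the
6/9/12-month schedule (described under extension 4) to specific
downgrade targets is an assumption for illustration, not
wiki-hygiene.py's actual API:

```python
from datetime import date

# Illustrative mapping of staleness thresholds to outcomes; the exact
# downgrade targets are an assumption, not the script's real schema.
DECAY_STEPS = [(12, "archive"), (9, "low"), (6, "medium")]
ORDER = ["archive", "low", "medium", "high"]  # weakest to strongest

def decayed_confidence(current, last_verified, today=None):
    """Return the confidence a page should carry given its staleness.

    Only ever downgrades: decay never raises a page's confidence.
    """
    today = today or date.today()
    months_stale = (today.year - last_verified.year) * 12 \
        + (today.month - last_verified.month)
    for threshold, target in DECAY_STEPS:
        if months_stale >= threshold:
            # Keep whichever is weaker: the current value or the target.
            return min(current, target, key=ORDER.index)
    return current  # fresh enough: unchanged
```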
### 2. Scalable search beyond the context window

**The gap**: The pattern works beautifully up to ~100 articles, where
`index.md` still fits in context. Karpathy's own wiki was right at the
ceiling. Past that point, the agent needs a real search layer — loading
the full index stops being practical.

**How memex extends it**:

- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
  up in the default configuration so the agent never has to load the
  full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
  project code (wing) and topic (room, via the `topics:` frontmatter).
  Retrieval is pre-narrowed to the relevant wing before search runs.
  This extends the effective ceiling because `qmd` works on a relevant
  subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
  weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
  `includeByDefault: false`, so archived pages don't eat context until
  explicitly queried.
### 3. Traceable sources for every claim

**The gap**: In precision-sensitive domains (API specs, version
constraints, legal records, medical protocols), LLM-generated content
needs to be verifiable against a source. For the pattern to work in
those contexts, every claim needs to trace back to something immutable.

**How memex extends it**:

- **Staging workflow** — every automated page goes through human review.
  For precision-critical content, that review IS the cross-check. The
  AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
  explanation of what it did and why. Makes review faster — you can
  spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
  file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
  means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
  (via `CLAUDE.md`) tell it to flag low-confidence content when
  citing. Humans see the warning before acting.

**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
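The source-verification mechanics are simple enough to sketch. The
function names here are illustrative; what they show is the contract
the `content_hash` frontmatter field implies:

```python
import hashlib
from pathlib import Path

def content_hash(path):
    """SHA-256 fingerprint of a raw harvested file.

    Sketch: the pipeline records a hash like this in a compiled page's
    frontmatter so a reviewer can detect whether the cited source
    changed after compile time.
    """
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def source_unchanged(path, recorded_hash):
    # Verification = compare bytes on disk against the hash recorded at
    # compile time, not "trust the LLM."
    return content_hash(path) == recorded_hash
```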
### 4. Continuous feed without manual discipline

**The gap**: Community analysis of 120+ comments on Karpathy's gist
converged on one clear finding: this is the #1 friction point. Most
people who try the pattern get the folder structure right and still
end up with a wiki that slowly becomes unreliable because they stop
feeding it. A six-week half-life is typical.

**How memex extends it** (this is the biggest layer):

- **Automation replaces human discipline** — daily cron runs
  `wiki-maintain.sh` (distill + harvest + hygiene + qmd reindex);
  weekly cron runs `--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
  manually. Every Claude Code session becomes potential ingest. The
  feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
  summarizer links a conversation to a wiki page via `related:`, the
  hygiene script picks that up and bumps `last_verified`. Pages stay
  fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
  for 6/9/12 months get downgraded and eventually archived. The wiki
  self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
  flags the things that *do* need human judgment. Everything else is
  auto-fixed.

This is the single biggest layer memex adds. Nothing about it is
exotic — it's a cron-scheduled pipeline that runs the scripts you'd
otherwise have to remember to run. That's the whole trick.
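The `last_verified` feedback loop can be sketched as follows. The data
shapes (dicts for conversations and pages) are illustrative stand-ins
for the real frontmatter parsing, and the function name is hypothetical:

```python
from datetime import date

def refresh_from_related(conversations, pages, today):
    """Sketch of the freshness feedback loop described above.

    `conversations` carry a `related` list of wiki page names (mirroring
    the related: frontmatter); `pages` maps page name -> metadata dict.
    """
    for conv in conversations:
        for name in conv.get("related", []):
            page = pages.get(name)
            # Any mention in a fresh conversation counts as a refresh
            # signal: bump last_verified so decay restarts from today.
            if page is not None and page["last_verified"] < today:
                page["last_verified"] = today
    return pages
```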
### 5. Keeping the human engaged with their own knowledge

**The gap**: Hacker News critics pointed out that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. If the LLM does all of
it, you can end up with a comprehensive wiki you haven't internalized.
For the pattern to be an actual memory aid and not a false one, the
human needs touchpoints that keep them engaged.

**How memex extends it**:

- **Staging review is a forcing function** — you see every automated
  page before it lands. Even skimming forces engagement with the
  material.
- **`qmd query "..."` for exploration** — searching the wiki is an
  active process, not passive retrieval. You're asking questions, not
  pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
  the agent reads at session start. You read it too (or the agent reads
  it to you) — ongoing re-exposure to your own knowledge base.

**Caveat**: memex is designed as *augmentation*, not *replacement*.
It's most valuable when you engage with it actively — reading your own
wake-up briefing, spot-checking promoted pages, noticing decay flags.
If you only consult the wiki through the agent and never look at it
yourself, you've outsourced the learning. That's a usage pattern
choice, not an architecture problem.
### 6. Hybrid retrieval — structure and semantics

**The gap**: Explicit wikilinks catch direct topic references but miss
semantic neighbors that use different wording. At scale, the pattern
benefits from vector similarity to find cross-topic connections the
human (or the LLM at ingest time) didn't think to link manually.

**How memex extends it**:

- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
  similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
  (wings) and topic frontmatter narrow the search space before the
  hybrid search runs. Structure + semantics is stronger than either
  alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
  LLM to find pages that *should* link to each other but don't, then
  auto-adds them. This is the explicit-linking approach catching up to
  semantic retrieval over time.

**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. Cross-machine collaboration

**The gap**: Karpathy's gist describes a single-user, single-machine
setup. In practice, people work from multiple machines (laptop,
workstation, server) and sometimes collaborate with small teams. The
pattern needs a sync story that handles concurrent writes gracefully.

**How memex extends it**:

- **Git-based sync with merge-union** — concurrent writes on different
  machines auto-resolve because markdown is set to `merge=union` in
  `.gitattributes`. Both sides win.
- **State file sync** — `.harvest-state.json` and `.hygiene-state.json`
  are committed, so two machines running the same pipeline agree on
  what's already been processed instead of re-doing the work.
- **Network boundary as access gate** — the suggested deployment is
  over Tailscale or a VPN, so the network enforces who can reach the
  wiki at all. Simple and sufficient for personal/family/small-team
  use.

**Explicit scope**: memex is **deliberately not** enterprise knowledge
management. No audit trails, no fine-grained permissions, no compliance
story. If you need any of that, you need a different architecture.
This is for the personal and small-team case where git + Tailscale is
the right amount of rigor.
### 8. Closing the MemPalace loop — conversation distillation

**The gap**: The mining pipeline extracts Claude Code sessions into
transcripts, classifies them by memory type (fact/discovery/preference/
advice/event/tooling), and tags them with topics. The URL harvester
scans them for cited links. Hygiene refreshes `last_verified` on any
wiki page that appears in a conversation's `related:` field. But none
of those steps actually *compile the knowledge inside the conversations
themselves into wiki pages.* A decision made in a session, a root cause
found during debugging, a pattern spotted in review — these stay in the
conversation summaries (searchable but not synthesized) until a human
manually writes them up. That's the last piece of the MemPalace model
that wasn't wired through: **closet content was never becoming the
source for the wiki proper**.

**How memex extends it**:

- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
  URL harvesting. The ordering is deliberate: conversation content
  should drive the page, and URL harvesting should only supplement
  what the conversations are already covering.
- **Narrow today-filter with historical rollup** — daily runs only
  look at topics appearing in TODAY's summarized conversations, but
  for each such topic the script pulls in ALL historical conversations
  sharing that topic. Processing scope stays small; LLM context stays
  wide. Old topics that resurface in new sessions automatically
  trigger a re-distillation of the full history on that topic.
- **First-run bootstrap** — the very first run uses a 7-day lookback
  to seed the state. After that, daily runs stay narrow.
- **High-signal halls only** — distill reads `hall_facts`,
  `hall_discoveries`, and `hall_advice` bullets. It skips `hall_events`
  (temporal, not knowledge), `hall_preferences` (user working style),
  and `hall_tooling` (often low-signal). These are the halls the
  MemPalace taxonomy treats as "canonical knowledge" vs "context."
- **`claude -p` compile step** — each topic group (topic + all matching
  conversations + their high-signal halls) is sent to `claude -p`
  with the current wiki index. The model decides whether to create a
  new page, update an existing one, emit both, or skip (topic not
  substantive enough or already well-covered).
- **Staging output with distill provenance** — new/updated pages land
  in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
  `distill_source_conversations` frontmatter fields. Every page traces
  back to the exact conversations it was distilled from.
- **State file `.distill-state.json`** tracks processed conversations
  by content hash and topic set, so re-runs only process what actually
  changed. A conversation gets re-distilled if its body changes OR if
  it gains a new topic not seen at previous distill time.
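The re-run decision in the last bullet reduces to a small predicate.
The state record shape here is an assumption inferred from the
description above, not `.distill-state.json`'s actual schema:

```python
def needs_redistill(body_hash, topics, prev_entry):
    """Sketch of the distill state-file re-run decision.

    `prev_entry` is the state record from the last distill (None if the
    conversation was never processed); key names are illustrative.
    """
    if prev_entry is None:
        return True  # never distilled
    if prev_entry["content_hash"] != body_hash:
        return True  # conversation body changed since last distill
    # Re-distill if the conversation gained a topic not seen last time.
    return not set(topics) <= set(prev_entry["topics"])
```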
**Why this matters**: Without distillation, the MemPalace integration
was incomplete — the closet summaries existed, the structural metadata
existed, qmd could search them, but knowledge discovered during work
never escaped the conversation archive. You could find "we had a
debugging session about X last month" but couldn't find "here's the
canonical page on X that captures what we learned." This extension
turns the MemPalace layer from a searchable archive into a proper
**ingest pipeline** for the wiki.

**Residual consideration**: Summarization quality is now load-bearing.
The distill step trusts the summarizer's classification of bullets
into halls. If the summarizer puts a debugging dead-end in
`hall_discoveries`, it may enter the wiki compilation pipeline. The
`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
substantiveness check (it can choose `skip` with a reason) together
catch most noise, and the staging review catches the rest.

---
## The biggest layer — active upkeep

The other seven extensions are important, but this is the one that makes
or breaks the pattern in practice. The community data is unambiguous:

- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks

The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:

| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |

If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing automation that lets the pattern
actually compound over months and years instead of requiring a
disciplined human to keep it alive.

---
## What was borrowed from where

This repo is a synthesis of two ideas with an automation layer on top:

### From Karpathy

- The core pattern: LLM-maintained persistent wiki, compile at ingest
  time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time

### From mempalace

- **Wings** = per-person or per-project namespaces → this repo uses
  project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
  `conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
  conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
  preference / advice / tooling) → the `halls:` frontmatter field
  classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
  conversation
- **Drawers** = verbatim archive, never lost → the extracted
  conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
  linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
  boost over flat search

The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.

### What this repo adds

- **Automation layer** tying the pieces together with cron-friendly
  orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
  content
- **Confidence decay + auto-archive + auto-restore** as the "retention
  curve" that community analysis identified as critical for long-term
  wiki health
- **`qmd` integration** as the scalable search layer (chosen over
  ChromaDB because it uses the same markdown storage as the wiki —
  one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
  automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
  wiki lives on multiple machines without merge hell

---
## What memex deliberately doesn't try to do

Five things memex is explicitly scoped around — not because they're
unsolvable, but because solving them well requires a different kind of
architecture than a personal/small-team wiki. If any of these are
dealbreakers for your use case, memex is probably not the right fit:

1. **Enterprise scale** — millions of documents, hundreds of users,
   RBAC, compliance: these need real enterprise knowledge management
   infrastructure. memex is tuned for personal and small-team use.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
   works great up to thousands of pages. At millions, a dedicated
   vector database with specialized retrieval wins.
3. **Replacing your own learning** — memex is an augmentation layer,
   not a substitute for reading. Used well, it's a memory aid; used as
   a bypass, it just lets you forget more.
4. **Precision-critical source of truth** — for legal, medical, or
   regulatory data, memex is a drafting tool. Human domain-expert
   review still owns the final call.
5. **Access control** — the network boundary (Tailscale) is the
   fastest path to "only authorized people can reach it." memex itself
   doesn't enforce permissions inside that boundary.

These are scope decisions, not unfinished work. memex is the best
personal/small-team answer to Karpathy's pattern I could build; it's
not trying to be every answer.

---
## Further reading

- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
  — the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
  structural memory layer
- [Signal & Noise interactive analysis](https://eric-turner.com/memex/signal-and-noise.html)
  — the design rationale this document summarizes (live interactive version)
- [`artifacts/signal-and-noise.html`](artifacts/signal-and-noise.html)
  — self-contained archive of the same analysis, works offline
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups