Initial commit — memex
A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
docs/ARCHITECTURE.md
# Architecture

Eleven scripts across three conceptual layers. This document walks through
what each one does, how they talk to each other, and where the seams are
for customization.

> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
> behind each component, with links to the interactive design artifact.
## Borrowed concepts

The architecture is a synthesis of two external ideas with an automation
layer on top. The terminology often maps 1:1, so it's worth calling out
which concepts came from where:

### From Karpathy's persistent-wiki gist

| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |

### From [mempalace](https://github.com/milla-jovovich/mempalace)

MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:

| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |
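
As a concrete illustration of the mapping, a summarized conversation file's frontmatter combines the room, hall, and tunnel fields in one block. The field names come from the table above; the values here are invented:

```yaml
---
topics: [retry-logic, http-clients]            # Rooms
halls: [discovery, tooling]                    # Halls
related:                                       # Tunnels
  - patterns/http-retry-backoff.md             # (hypothetical wiki page)
  - conversations/acme/2024-05-12-timeouts.md  # (hypothetical sibling conversation)
---
```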

The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K token ceiling without needing a full vector DB rebuild.

### What this repo adds

Automation + lifecycle management on top of both:

- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
  because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union

---

## Overview

```
┌─────────────────────────────────┐
│           SYNC LAYER            │
│ wiki-sync.sh                    │  (git commit/pull/push, qmd reindex)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│          MINING LAYER           │
│ extract-sessions.py             │  (Claude Code JSONL → markdown)
│ summarize-conversations.py      │  (LLM classify + summarize)
│ update-conversation-index.py    │  (regenerate indexes + wake-up)
│ mine-conversations.sh           │  (orchestrator)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│        AUTOMATION LAYER         │
│ wiki_lib.py (shared helpers)    │
│ wiki-harvest.py                 │  (URL → raw → staging)
│ wiki-staging.py                 │  (human review)
│ wiki-hygiene.py                 │  (decay, archive, repair, checks)
│ wiki-maintain.sh                │  (orchestrator)
└─────────────────────────────────┘
```

Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.

---

## Mining layer

### `extract-sessions.py`

Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially-processed sessions.

Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
  paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction,
  it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
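
The incremental mechanism is simple: remember how many bytes of each JSONL file have already been consumed, and seek past them next time. A rough sketch of that shape; the real `.mine-state.json` schema may differ:

```python
import json
from pathlib import Path

STATE_FILE = Path(".mine-state.json")

def extract_new_lines(session_path: Path) -> list[dict]:
    """Read only the JSONL lines appended since the last run,
    tracking a byte offset per session file."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    offset = state.get(str(session_path), 0)
    records = []
    with session_path.open("rb") as f:
        f.seek(offset)  # skip everything already processed
        for raw in f:
            line = raw.decode("utf-8").strip()
            if line:
                records.append(json.loads(line))
        state[str(session_path)] = f.tell()  # remember where we stopped
    STATE_FILE.write_text(json.dumps(state))
    return records
```

Re-running on an unchanged file yields nothing new; a grown file yields only the appended records.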

### `summarize-conversations.py`

Sends extracted transcripts to an LLM for classification and summarization.
Supports two backends:

1. **`--claude` mode** (recommended): Uses `claude -p` with
   haiku for short sessions (≤200 messages) and sonnet for longer ones.
   Runs chunked over long transcripts, keeping a rolling context window.

2. **Local LLM mode** (default, omit `--claude`): Uses a local
   `llama-server` instance at `localhost:8080` (or WSL gateway:8081 on
   Windows Subsystem for Linux). Requires llama.cpp installed and a GGUF
   model loaded.
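
The model routing amounts to a message-count threshold. A hedged sketch of the idea (the exact `claude -p --model ...` invocation and the bare `haiku`/`sonnet` aliases are assumptions, not the script's verified flags):

```python
import subprocess

LONG_SESSION_THRESHOLD = 200  # messages; mirrors the ≤200 cutoff above

def pick_model(message_count: int) -> str:
    """Route short sessions to the cheap model, long ones to the stronger one."""
    return "haiku" if message_count <= LONG_SESSION_THRESHOLD else "sonnet"

def summarize(transcript: str, message_count: int) -> str:
    """Send a transcript to the claude CLI for summarization (illustrative call)."""
    result = subprocess.run(
        ["claude", "-p", "--model", pick_model(message_count)],
        input=transcript, capture_output=True, text=True, check=True,
    )
    return result.stdout
```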

Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki
page links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.

### `update-conversation-index.py`

Regenerates three files from the summarized conversations:

- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
  of every session ("current focus areas, recent decisions, active
  concerns")
- `context/active-concerns.md` — longer-form current state

The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.

### `mine-conversations.sh`

Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.

---

## Automation layer

### `wiki_lib.py`

The shared library. Everything in the automation layer imports from here.
Provides:

- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
  the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.

All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
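
The body-only hash matters because hygiene passes rewrite frontmatter constantly (decay, `last_verified` refreshes); a frontmatter-inclusive hash would mark every page as changed on every run. A minimal sketch of the idea, not `wiki_lib.py`'s exact implementation:

```python
import hashlib

def page_content_hash(markdown: str) -> str:
    """SHA-256 of the page body only, ignoring the YAML frontmatter block."""
    body = markdown
    if markdown.startswith("---\n"):
        # strip the leading `--- ... ---` frontmatter block if present
        end = markdown.find("\n---\n", 4)
        if end != -1:
            body = markdown[end + len("\n---\n"):]
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Two pages whose frontmatter differs but whose bodies match hash identically, so only real content edits trigger re-processing.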

### `wiki-harvest.py`

Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages.

URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
  the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools
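
The classifier reduces to pattern lists checked in priority order. A sketch with illustrative patterns (the real lists live in `SKIP_DOMAIN_PATTERNS` and `C_TYPE_URL_PATTERNS`):

```python
import re
from urllib.parse import urlparse

# illustrative patterns, not the script's actual lists
SKIP_HOSTS = re.compile(r"(localhost|127\.0\.0\.1|10\.\d+\.\d+\.\d+|\.internal$|slack\.com)")
CHECK_URLS = re.compile(r"(github\.com/.+/issues/|stackoverflow\.com/questions/)")

def classify_url(url: str) -> str:
    """Return 'skip', 'check', or 'harvest' for a candidate URL."""
    host = urlparse(url).hostname or ""
    if SKIP_HOSTS.search(host):
        return "skip"     # Type D: internal / private / chat
    if CHECK_URLS.search(url):
        return "check"    # Type C: only harvest if the topic already exists
    return "harvest"      # Type A/B: docs, articles, blogs
```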

Fetch cascade (tries in order, validates at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
   URL was mentioned during the session

Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording source URL, fetch method, and a content hash.

Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:
- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough

Result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.
### `wiki-staging.py`

Pure file operations — no LLM calls. Human review pipeline for automated
content.

Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve, strip staging fields, move to live, update
  main index, rewrite cross-refs, preserve `origin: automated` as audit trail
- `--reject <path> --reason "..."` — delete, record in
  `.harvest-state.json` `rejected_urls` so the harvester won't re-create it
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
### `wiki-hygiene.py`

The heavy lifter. Two modes:

**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
  the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to the index
- Detect broken cross-references, fuzzy-match to the intended target
  via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync

**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — weaker merged into stronger, auto-archives
  the loser with `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
  mentioning `Node 18` when recent conversations are using `Node 20`)

State lives in `.hygiene-state.json` — tracks content hashes per page so
full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.

### `wiki-maintain.sh`

Top-level orchestrator:

```
Phase 1: wiki-harvest.py          (unless --hygiene-only)
Phase 2: wiki-hygiene.py          (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed  (unless --no-reindex or --dry-run)
```

Flags pass through to child scripts. Error-tolerant: if one phase fails,
the others still run. Logs to `scripts/.maintain.log`.
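
The error tolerance is just sequential execution with failures recorded rather than raised. A sketch of the same shape in Python (the real orchestrator is a shell script):

```python
import subprocess

def run_phases(phases: list[list[str]]) -> dict[str, bool]:
    """Run each phase in order; a failing phase is recorded
    but does not stop the phases after it."""
    results = {}
    for cmd in phases:
        name = cmd[0]
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            results[name] = proc.returncode == 0
        except FileNotFoundError:
            results[name] = False  # missing tool counts as a failed phase
    return results
```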

---

## Sync layer

### `wiki-sync.sh`

Git-based sync for cross-machine use. Commands:

- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report

The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).

---

## State files

Three JSON files track per-pipeline state:

| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

Harvest and hygiene state need to sync across machines so both
installations agree on what's been processed. Mining state is per-machine
because Claude Code session files live at OS-specific paths.
---

## Module dependency graph

```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─┬─> wiki-harvest.py
                  ├─> wiki-hygiene.py
                  └─> qmd (external)

mine-conversations.sh ─┬─> extract-sessions.py
                       ├─> summarize-conversations.py
                       └─> update-conversation-index.py

extract-sessions.py             (standalone — reads Claude JSONL)
summarize-conversations.py   ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```

`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.

---

## Extension seams

The places to modify when customizing:

1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
   project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
   `SUMMARIZE_TOOLS`, `MAX_BASH_OUTPUT_LINES` to tune transcript shape.

2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
   `PROJECT_ORDER` control how the index groups conversations.

3. **`scripts/wiki-harvest.py`** —
   - `SKIP_DOMAIN_PATTERNS` — your internal domains
   - `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
   - `FETCH_DELAY_SECONDS` — rate limit between fetches
   - `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
   - `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet

4. **`scripts/wiki-hygiene.py`** —
   - `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
     — decay thresholds in days
   - `EMPTY_STUB_THRESHOLD` — what counts as a stub
   - `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
   - `REQUIRED_FIELDS` — frontmatter fields the repair step enforces

5. **`scripts/summarize-conversations.py`** —
   - `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
   - `MINE_PROMPT_FILE` — the LLM system prompt for summarization
   - Backend selection (claude vs llama-server)

6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
   every session. This is where you tell the agent how to maintain the
   wiki, what conventions to follow, when to flag things to you.
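
For instance, a customized `PROJECT_MAP` might look like this. The directory names and wing codes are invented, but the shape (Claude's flattened project-path directory name mapped to a short code) matches the description above:

```python
# Hypothetical PROJECT_MAP entries: keys are Claude Code project
# directory names, values are the short wiki "wing" codes.
PROJECT_MAP = {
    "-home-alice-src-acme-api": "acme",
    "-home-alice-src-blog": "blog",
}

def wing_for(project_dir: str) -> str:
    """Resolve a project directory to its wing code, falling back to a
    sanitized version of the raw name when no code is configured."""
    return PROJECT_MAP.get(project_dir, project_dir.strip("-").replace("-", "_"))
```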

See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.