Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
2026-04-12 21:16:02 -06:00
commit ee54a2f5d4
31 changed files with 10792 additions and 0 deletions
--- a/docs/ARCHITECTURE.md
+++ b/docs/ARCHITECTURE.md
@@ -0,0 +1,360 @@
+# Architecture
+
+Eleven scripts across three conceptual layers. This document walks through
+what each one does, how they talk to each other, and where the seams are
+for customization.
+
+> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
+> behind each component, with links to the interactive design artifact.
+
+## Borrowed concepts
+
+The architecture is a synthesis of two external ideas with an automation
+layer on top. The terminology often maps 1:1, so it's worth calling out
+which concepts came from where:
+
+### From Karpathy's persistent-wiki gist
+
+| Concept | How this repo implements it |
+|---------|-----------------------------|
+| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
+| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
+| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
+| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
+| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |
+
+### From [mempalace](https://github.com/milla-jovovich/mempalace)
+
+MemPalace gave us the structural memory taxonomy that turns a flat
+corpus into something you can navigate without reading everything. The
+concepts map directly:
+
+| MemPalace term | Meaning | How this repo implements it |
+|----------------|---------|-----------------------------|
+| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
+| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
+| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
+| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
+| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
+| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |
+
+MemPalace's benchmarks report a **+34% retrieval boost** from
+wing + room filtering over flat search. The same idea carries over here:
+`qmd` searches a pre-narrowed subset of the corpus instead of
+everything, which is how the wiki scales past the Karpathy-pattern's
+~50K token ceiling without needing a full vector DB rebuild.
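
To make the pre-narrowing concrete, here is a minimal Python sketch of wing + room filtering ahead of a search. The `Conversation` record and the `narrow` helper are hypothetical names used only for illustration; how `qmd` itself scopes a query is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    path: str
    wing: str                                        # project code (the "wing")
    topics: list[str] = field(default_factory=list)  # "rooms" within the wing


def narrow(conversations, wing=None, room=None):
    """Keep only conversations matching the requested wing and/or room."""
    kept = []
    for conv in conversations:
        if wing is not None and conv.wing != wing:
            continue
        if room is not None and room not in conv.topics:
            continue
        kept.append(conv)
    return kept


corpus = [
    Conversation("conversations/web/a.md", "web", ["auth", "caching"]),
    Conversation("conversations/web/b.md", "web", ["deploy"]),
    Conversation("conversations/ops/c.md", "ops", ["deploy"]),
]
# Only the narrowed subset would be handed to the search step.
subset = narrow(corpus, wing="web", room="deploy")
```

Filtering on `room` alone (no `wing`) returns the same topic across projects, which is roughly what the Tunnel concept describes.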
+
+### What this repo adds
+
+Automation + lifecycle management on top of both:
+
+- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
+- **Staging pipeline** — human-in-the-loop checkpoint for automated content
+- **Confidence decay + auto-archive + auto-restore** — the retention curve
+- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
+  because it uses markdown storage like the wiki itself)
+- **Hygiene reports** — fixed vs needs-review separation
+- **Cross-machine sync** — git with markdown merge-union
+
+---
+
+## Overview
+
+```
+     ┌─────────────────────────────────┐
+     │       SYNC LAYER                │
+     │  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
+     └─────────────────────────────────┘
+                     │
+     ┌─────────────────────────────────┐
+     │       MINING LAYER              │
+     │  extract-sessions.py            │  (Claude Code JSONL → markdown)
+     │  summarize-conversations.py     │  (LLM classify + summarize)
+     │  update-conversation-index.py   │  (regenerate indexes + wake-up)
+     │  mine-conversations.sh          │  (orchestrator)
+     └─────────────────────────────────┘
+                     │
+     ┌─────────────────────────────────┐
+     │    AUTOMATION LAYER             │
+     │  wiki_lib.py  (shared helpers)  │
+     │  wiki-harvest.py                │  (URL → raw → staging)
+     │  wiki-staging.py                │  (human review)
+     │  wiki-hygiene.py                │  (decay, archive, repair, checks)
+     │  wiki-maintain.sh               │  (orchestrator)
+     └─────────────────────────────────┘
+```
+
+Each layer is independent — you can run the mining layer without the
+automation layer, or vice versa. The layers communicate through files on
+disk (conversation markdown, raw harvested pages, staging pages, wiki
+pages), never through in-memory state.
+
+---
+
+## Mining layer
+
+### `extract-sessions.py`
+
+Parses Claude Code JSONL session files from `~/.claude/projects/` into
+clean markdown transcripts under `conversations/<project-code>/`.
+Deterministic, no LLM calls. Incremental — tracks byte offsets in
+`.mine-state.json` so it safely re-runs on partially-processed sessions.
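
The byte-offset bookkeeping can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `read_new_records` is a hypothetical helper, and the state is a plain dict mapping file path to offset, whereas the real layout of `.mine-state.json` is not documented here.

```python
import json
from pathlib import Path


def read_new_records(session_path: Path, state: dict) -> list[dict]:
    """Read JSONL records appended since the offset stored in state."""
    key = str(session_path)
    offset = state.get(key, 0)
    records = []
    with session_path.open("rb") as fh:
        fh.seek(offset)
        for line in fh:
            if not line.endswith(b"\n"):
                break  # partial trailing line: leave it for the next run
            offset += len(line)
            stripped = line.strip()
            if stripped:
                records.append(json.loads(stripped))
    state[key] = offset  # the caller persists this, e.g. as JSON on disk
    return records
```

Stopping at a line without a trailing newline is what makes re-runs safe while a session file is still being written.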
+
+Key features:
+- Summarizes tool calls by tool type: full output for `Bash` and `Skill`,
+  paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
+- Caps Bash output at 200 lines to prevent transcript bloat
+- Handles session resumption — if a session has grown since last extraction,
+  it appends new messages without re-processing old ones
+- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
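
A rough sketch of the per-tool rendering and the 200-line cap described above. `render_tool_call`, `PATHS_ONLY_TOOLS`, and `PATH_AND_SUMMARY_TOOLS` are hypothetical names, and using the first output line as the "summary" is an assumption; the script's real formatting may differ.

```python
MAX_BASH_OUTPUT_LINES = 200                  # cap stated above
KEEP_FULL_OUTPUT_TOOLS = {"Bash", "Skill"}
PATHS_ONLY_TOOLS = {"Read", "Glob", "Grep"}  # hypothetical constant name
PATH_AND_SUMMARY_TOOLS = {"Edit", "Write"}   # hypothetical constant name


def render_tool_call(tool: str, target: str, output: str) -> str:
    """Render one tool call for the transcript, by tool category."""
    header = f"[{tool}] {target}"
    if tool in KEEP_FULL_OUTPUT_TOOLS:
        lines = output.splitlines()
        if tool == "Bash" and len(lines) > MAX_BASH_OUTPUT_LINES:
            dropped = len(lines) - MAX_BASH_OUTPUT_LINES
            lines = lines[:MAX_BASH_OUTPUT_LINES]
            lines.append(f"... ({dropped} more lines truncated)")
        return "\n".join([header, *lines])
    if tool in PATHS_ONLY_TOOLS:
        return header  # path only, output dropped
    if tool in PATH_AND_SUMMARY_TOOLS:
        # Assumption: the first output line stands in for the summary.
        first_line = output.splitlines()[0] if output.strip() else ""
        return f"{header}: {first_line}" if first_line else header
    return f"[{tool}]"
```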
+
+### `summarize-conversations.py`
+
+Sends extracted transcripts to an LLM for classification and summarization.
+Supports two backends:
+
+1. **`--claude` mode** (recommended): Uses `claude -p` with
+   haiku for short sessions (≤200 messages) and sonnet for longer ones.
+   Runs chunked over long transcripts, keeping a rolling context window.
+
+2. **Local LLM mode** (default, omit `--claude`): Uses a local
+   `llama-server` instance at `localhost:8080` (or WSL gateway:8081 on
+   Windows Subsystem for Linux). Requires llama.cpp installed and a GGUF
+   model loaded.
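
The routing rule and the chunking idea can be sketched like this. The threshold value follows the "≤200 messages" rule above; `pick_model` and `chunk_with_context` are hypothetical names, and carrying the tail of the previous messages is only one reading of "rolling context window" (the script may carry a running summary instead).

```python
CLAUDE_LONG_THRESHOLD = 200  # messages; matches the "<=200 messages" rule above


def pick_model(message_count: int) -> str:
    """Route short sessions to haiku and longer ones to sonnet."""
    return "haiku" if message_count <= CLAUDE_LONG_THRESHOLD else "sonnet"


def chunk_with_context(messages: list[str], chunk_size: int, carry: int):
    """Yield (context, chunk) pairs; context is the tail of the prior messages."""
    for start in range(0, len(messages), chunk_size):
        context = messages[max(0, start - carry):start]
        yield context, messages[start:start + chunk_size]
```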
+
+Output: adds frontmatter to each conversation file — `topics`, `halls`
+(fact/discovery/preference/advice/event/tooling), and `related` wiki
+page links. The `related` links are load-bearing: they're what
+`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
+being discussed.
+
+### `update-conversation-index.py`
+
+Regenerates three files from the summarized conversations:
+
+- `conversations/index.md` — catalog of all conversations grouped by project
+- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
+  of every session ("current focus areas, recent decisions, active
+  concerns")
+- `context/active-concerns.md` — longer-form current state
+
+The wake-up file is important: it's what gives the agent *continuity*
+across sessions without forcing you to re-explain context every time.
+
+### `mine-conversations.sh`
+
+Orchestrator chaining extract → summarize → index. Supports
+`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
+and `--dry-run`.
+
+---
+
+## Automation layer
+
+### `wiki_lib.py`
+
+The shared library. Everything in the automation layer imports from here.
+Provides:
+
+- `WikiPage` dataclass — path + frontmatter + body + raw YAML
+- `parse_page(path)` — safe markdown parser with YAML frontmatter
+- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
+  the frontmatter patterns we use)
+- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
+- `write_page(page, ...)` — full round-trip writer
+- `page_content_hash(page)` — body-only SHA-256 for change detection
+- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
+- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.
+
+All paths honor the `WIKI_DIR` environment variable, so tests and
+alternate installs can override the root.
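
The following sketch shows the general shape of frontmatter splitting, body-only hashing, and the `WIKI_DIR` override. It is not the code in `wiki_lib.py`: `wiki_root`, `split_frontmatter`, and `body_hash` are hypothetical stand-ins, and the fallback path is an assumption.

```python
import hashlib
import os
from pathlib import Path


def wiki_root() -> Path:
    """Resolve the wiki root, letting WIKI_DIR override a default location."""
    # The fallback path is an assumption made for this sketch.
    return Path(os.environ.get("WIKI_DIR", str(Path.home() / "wiki")))


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split page text into (raw_yaml, body); raw_yaml is '' without frontmatter."""
    if text.startswith("---\n"):
        end = text.find("\n---\n", 3)
        if end != -1:
            return text[4:end], text[end + 5:]
    return "", text


def body_hash(text: str) -> str:
    """SHA-256 over the body only, so frontmatter edits do not change it."""
    _, body = split_frontmatter(text)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Hashing the body alone means that a metadata-only change, such as a refreshed `last_verified`, does not register as a content change.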
+
+### `wiki-harvest.py`
+
+Scans summarized conversations for HTTP(S) URLs, classifies them,
+fetches content, and compiles pending wiki pages.
+
+URL classification:
+- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
+- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
+  the topic is already covered in the wiki (to avoid noise)
+- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools
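
A minimal sketch of the three-way classification. The pattern lists below are placeholder values, not the script's real configuration, and `classify_url` is a hypothetical name. The topic-match that decides whether a "check" URL is actually harvested is not shown, nor is the chat-tool skip rule.

```python
import ipaddress
import re
from urllib.parse import urlparse

# Placeholder patterns for illustration only.
SKIP_DOMAIN_PATTERNS = [r"(^|\.)corp\.example\.com$", r"\.local$"]
C_TYPE_URL_PATTERNS = [
    r"github\.com/[^/]+/[^/]+/issues/\d+",
    r"stackoverflow\.com/questions/\d+",
]


def _is_private_ip(host: str) -> bool:
    """True for literal IPs in private or loopback ranges."""
    try:
        return ipaddress.ip_address(host).is_private
    except ValueError:
        return False  # not an IP literal


def classify_url(url: str) -> str:
    """Return 'skip', 'check', or 'harvest' for a URL."""
    host = urlparse(url).hostname or ""
    if host == "localhost" or _is_private_ip(host):
        return "skip"
    if any(re.search(p, host) for p in SKIP_DOMAIN_PATTERNS):
        return "skip"
    if any(re.search(p, url) for p in C_TYPE_URL_PATTERNS):
        return "check"
    return "harvest"
```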
+
+Fetch cascade (tries in order, validates at each step):
+1. `trafilatura -u <url> --markdown --no-comments --precision`
+2. `crwl <url> -o markdown-fit`
+3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
+4. Conversation-transcript fallback — pull inline content from where the
+   URL was mentioned during the session
+
+Validated content goes to `raw/harvested/<domain>-<path>.md` with
+frontmatter recording source URL, fetch method, and a content hash.
+
+Compilation step: sends the raw content + `index.md` + conversation
+context to `claude -p`, asking for a JSON verdict:
+- `new_page` — create a new wiki page
+- `update_page` — update an existing page (with `modifies:` field)
+- `both` — do both
+- `skip` — content isn't substantive enough
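
Because the compile step returns free-form LLM output, the verdict benefits from a sanity check before anything is written. The sketch below is an assumption-heavy illustration: the `verdict` key name, the `parse_verdict` helper, and the rule that `both` also needs a `modifies` target are not confirmed by this document.

```python
import json

VALID_VERDICTS = {"new_page", "update_page", "both", "skip"}


def parse_verdict(raw: str) -> dict:
    """Parse and sanity-check the compile step's JSON reply."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("verdict must be a JSON object")
    verdict = data.get("verdict")  # key name is an assumption
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    if verdict in {"update_page", "both"} and not data.get("modifies"):
        raise ValueError(f"{verdict} requires a 'modifies' target")
    return data
```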
+
+Result lands in `staging/<type>/` with `origin: automated`,
+`status: pending`, and all the staging-specific frontmatter that gets
+stripped on promotion.
+
+### `wiki-staging.py`
+
+Pure file operations — no LLM calls. Human review pipeline for automated
+content.
+
+Commands:
+- `--list` / `--list --json` — pending items with metadata
+- `--stats` — counts by type/source + age stats
+- `--review` — interactive a/r/s/q loop with preview
+- `--promote <path>` — approve, strip staging fields, move to live, update
+  main index, rewrite cross-refs, preserve `origin: automated` as audit trail
+- `--reject <path> --reason "..."` — delete, record in
+  `.harvest-state.json` rejected_urls so the harvester won't re-create
+- `--promote-all` — bulk approve everything
+- `--sync` — regenerate `staging/index.md`, detect drift
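
The frontmatter side of promote and reject can be sketched as two small dict operations. The set of staging-only fields and the shape of `rejected_urls` (a URL-to-reason mapping) are assumptions made for this example; only `origin` surviving promotion and the `rejected_urls` key come from the description above.

```python
# Hypothetical field names; the real staging-only set is defined by the script.
STAGING_ONLY_FIELDS = {"status", "staged_date", "staged_by"}


def promoted_frontmatter(staged: dict) -> dict:
    """Drop staging-only keys; everything else, including origin, survives."""
    return {k: v for k, v in staged.items() if k not in STAGING_ONLY_FIELDS}


def record_rejection(harvest_state: dict, url: str, reason: str) -> None:
    """Remember a rejected URL so a later harvest run can skip it."""
    harvest_state.setdefault("rejected_urls", {})[url] = reason
```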
+
+### `wiki-hygiene.py`
+
+The heavy lifter. Two modes:
+
+**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
+- Backfill `last_verified` from `last_compiled`/git/mtime
+- Refresh `last_verified` from conversation `related:` links — this is
+  the "something's still being discussed" signal
+- Auto-restore archived pages that are referenced again
+- Repair frontmatter (missing/invalid fields get sensible defaults)
+- Apply confidence decay per thresholds (6/9/12 months)
+- Archive stale and superseded pages
+- Detect index drift (pages on disk not in index, stale index entries)
+- Detect orphan pages (no inbound links) and auto-add them to index
+- Detect broken cross-references, fuzzy-match to the intended target
+  via `difflib.get_close_matches`, fix in place
+- Report empty stubs (body < 100 chars)
+- Detect state file drift (references to missing files)
+- Regenerate `staging/index.md` and `archive/index.md` if out of sync
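
Two of the quick-mode steps are easy to sketch: confidence decay and fuzzy repair of broken cross-references. The day counts below are assumed equivalents of the 6/9/12-month thresholds, the 0.8 cutoff is an assumption, and `decayed_confidence` and `repair_link` are hypothetical names; `difflib.get_close_matches` is the standard-library call named above.

```python
import difflib
from datetime import date

DECAY_HIGH_TO_MEDIUM = 180   # ~6 months in days (assumed value)
DECAY_MEDIUM_TO_LOW = 270    # ~9 months in days (assumed value)
DECAY_LOW_TO_STALE = 365     # ~12 months in days (assumed value)
LEVELS = ["high", "medium", "low", "stale"]


def decayed_confidence(current: str, last_verified: date, today: date) -> str:
    """Lower confidence by age since last_verified; never raise it."""
    age_days = (today - last_verified).days
    if age_days >= DECAY_LOW_TO_STALE:
        by_age = "stale"
    elif age_days >= DECAY_MEDIUM_TO_LOW:
        by_age = "low"
    elif age_days >= DECAY_HIGH_TO_MEDIUM:
        by_age = "medium"
    else:
        by_age = "high"
    return LEVELS[max(LEVELS.index(current), LEVELS.index(by_age))]


def repair_link(target: str, known_pages: list[str], cutoff: float = 0.8):
    """Return the closest existing page for a broken link, or None."""
    matches = difflib.get_close_matches(target, known_pages, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A fairly high cutoff keeps the in-place fix conservative: a link with no close match is better reported than rewritten to the wrong page.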
+
+**Full mode** (LLM-powered, run weekly) runs everything in quick mode, plus:
+- Missing cross-references (haiku, batched 5 pages per call)
+- Duplicate coverage (sonnet — weaker merged into stronger, auto-archives
+  the loser with `archived_reason: Merged into <winner>`)
+- Contradictions (sonnet, **report-only** — the human decides)
+- Technology lifecycle (regex + conversation comparison — flags pages
+  mentioning `Node 18` when recent conversations are using `Node 20`)
+
+State lives in `.hygiene-state.json` — tracks content hashes per page so
+full-mode runs can skip unchanged pages. Reports land in
+`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.
+
+### `wiki-maintain.sh`
+
+Top-level orchestrator:
+
+```
+Phase 1: wiki-harvest.py     (unless --hygiene-only)
+Phase 2: wiki-hygiene.py     (--full for the weekly pass, else quick)
+Phase 3: qmd update && qmd embed     (unless --no-reindex or --dry-run)
+```
+
+Flags pass through to child scripts. Error-tolerant: if one phase fails,
+the others still run. Logs to `scripts/.maintain.log`.
+
+---
+
+## Sync layer
+
+### `wiki-sync.sh`
+
+Git-based sync for cross-machine use. Commands:
+
+- `--commit` — stage and commit local changes
+- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
+- `--push` — push to origin
+- `full` — commit + pull + push + qmd reindex
+- `--status` — read-only sync state report
+
+The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
+auto-resolve by keeping both versions. This works because most conflicts
+are additive (two machines both adding new entries).
+
+---
+
+## State files
+
+Three JSON files track per-pipeline state:
+
+| File | Owner | Synced? | Purpose |
+|------|-------|---------|---------|
+| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
+| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
+| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
+
+Harvest and hygiene state need to sync across machines so both
+installations agree on what's been processed. Mining state is per-machine
+because Claude Code session files live at OS-specific paths.
+
+---
+
+## Module dependency graph
+
+```
+wiki_lib.py  ─┬─>  wiki-harvest.py
+              ├─>  wiki-staging.py
+              └─>  wiki-hygiene.py
+
+wiki-maintain.sh  ─>  wiki-harvest.py
+                  ─>  wiki-hygiene.py
+                  ─>  qmd (external)
+
+mine-conversations.sh  ─>  extract-sessions.py
+                       ─>  summarize-conversations.py
+                       ─>  update-conversation-index.py
+
+extract-sessions.py     (standalone — reads Claude JSONL)
+summarize-conversations.py  ─>  claude CLI (or llama-server)
+update-conversation-index.py  ─>  qmd (external)
+```
+
+`wiki_lib.py` is the only shared Python module — everything else is
+self-contained within its layer.
+
+---
+
+## Extension seams
+
+The places to modify when customizing:
+
+1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
+   project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
+   `SUMMARIZE_TOOLS`, `MAX_BASH_OUTPUT_LINES` to tune transcript shape.
+
+2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
+   `PROJECT_ORDER` control how the index groups conversations.
+
+3. **`scripts/wiki-harvest.py`** —
+   - `SKIP_DOMAIN_PATTERNS` — your internal domains
+   - `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
+   - `FETCH_DELAY_SECONDS` — rate limit between fetches
+   - `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
+   - `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet
+
+4. **`scripts/wiki-hygiene.py`** —
+   - `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
+     — decay thresholds in days
+   - `EMPTY_STUB_THRESHOLD` — what counts as a stub
+   - `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
+   - `REQUIRED_FIELDS` — frontmatter fields the repair step enforces
+
+5. **`scripts/summarize-conversations.py`** —
+   - `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
+   - `MINE_PROMPT_FILE` — the LLM system prompt for summarization
+   - Backend selection (claude vs llama-server)
+
+6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
+   every session. This is where you tell the agent how to maintain the
+   wiki, what conventions to follow, and when to flag things to you.
+
+See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.