Architecture

Eleven scripts across three conceptual layers. This document walks through what each one does, how they talk to each other, and where the seams are for customization.

See also: DESIGN-RATIONALE.md — the why behind each component, with links to the interactive design artifact.

Borrowed concepts

The architecture is a synthesis of two external ideas with an automation layer on top. The terminology often maps 1:1, so it's worth calling out which concepts came from where:

From Karpathy's persistent-wiki gist

| Concept | How this repo implements it |
| --- | --- |
| Immutable raw/ sources | raw/ directory — never modified by the agent |
| LLM-compiled wiki/ pages | patterns/ decisions/ concepts/ environments/ |
| Schema file disciplining the agent | CLAUDE.md at the wiki root |
| Periodic "lint" passes | wiki-hygiene.py --quick (daily) + --full (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |

From mempalace

MemPalace gave us the structural memory taxonomy that turns a flat corpus into something you can navigate without reading everything. The concepts map directly:

| MemPalace term | Meaning | How this repo implements it |
| --- | --- | --- |
| Wing | Per-person or per-project namespace | Project code in conversations/<code>/ (set by PROJECT_MAP in extract-sessions.py) |
| Room | Topic within a wing | topics: frontmatter field on summarized conversation files |
| Closet | Summary layer — high-signal compressed knowledge | The summary body written by summarize-conversations.py --claude |
| Drawer | Verbatim archive, never lost | The extracted transcript under conversations/<wing>/*.md (before summarization) |
| Hall | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | halls: frontmatter field classified by the summarizer |
| Tunnel | Cross-wing connection — same topic in multiple projects | related: frontmatter linking conversations to wiki pages and to each other |

MemPalace's benchmarks document the key benefit of wing + room filtering: a +34% retrieval boost over flat search, because qmd can search a pre-narrowed subset of the corpus instead of everything. This is why the wiki scales past the Karpathy pattern's ~50K-token ceiling without needing a full vector-DB rebuild.
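
The narrowing is cheap because wings and rooms are plain directory and frontmatter structure. A minimal sketch of the pre-narrowing idea (illustrative only; the real search goes through qmd, whose interface is not shown here):

    from pathlib import Path

    def narrowed_candidates(root: Path, wing: str, room: str) -> list[Path]:
        """Pre-narrow search: one wing's directory, then only files tagged with the room."""
        hits = []
        for page in sorted((root / "conversations" / wing).glob("*.md")):
            text = page.read_text(encoding="utf-8")
            # crude frontmatter scan; a real check would parse the topics: field
            frontmatter = text.split("---")[1] if text.startswith("---") else ""
            if room in frontmatter:
                hits.append(page)
        return hits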

What this repo adds

Automation + lifecycle management on top of both:

  • Automation layer — cron-friendly orchestration via wiki-maintain.sh
  • Staging pipeline — human-in-the-loop checkpoint for automated content
  • Confidence decay + auto-archive + auto-restore — the retention curve
  • qmd integration — the scalable search layer (chosen over ChromaDB because it uses markdown storage like the wiki itself)
  • Hygiene reports — fixed vs needs-review separation
  • Cross-machine sync — git with markdown merge-union

Overview

     ┌─────────────────────────────────┐
     │       SYNC LAYER                │
     │  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
     └─────────────────────────────────┘
                     │
     ┌─────────────────────────────────┐
     │       MINING LAYER              │
     │  extract-sessions.py            │  (Claude Code JSONL → markdown)
     │  summarize-conversations.py     │  (LLM classify + summarize)
     │  update-conversation-index.py   │  (regenerate indexes + wake-up)
     │  mine-conversations.sh          │  (orchestrator)
     └─────────────────────────────────┘
                     │
     ┌─────────────────────────────────┐
     │    AUTOMATION LAYER             │
     │  wiki_lib.py  (shared helpers)  │
     │  wiki-harvest.py                │  (URL → raw → staging)
     │  wiki-staging.py                │  (human review)
     │  wiki-hygiene.py                │  (decay, archive, repair, checks)
     │  wiki-maintain.sh               │  (orchestrator)
     └─────────────────────────────────┘

Each layer is independent — you can run the mining layer without the automation layer, or vice versa. The layers communicate through files on disk (conversation markdown, raw harvested pages, staging pages, wiki pages), never through in-memory state.


Mining layer

extract-sessions.py

Parses Claude Code JSONL session files from ~/.claude/projects/ into clean markdown transcripts under conversations/<project-code>/. Deterministic, no LLM calls. Incremental — tracks byte offsets in .mine-state.json so it safely re-runs on partially-processed sessions.

Key features:

  • Summarizes tool calls intelligently: full output for Bash and Skill, paths-only for Read/Glob/Grep, path + summary for Edit/Write
  • Caps Bash output at 200 lines to prevent transcript bloat
  • Handles session resumption — if a session has grown since last extraction, it appends new messages without re-processing old ones
  • Maps Claude project directory names to short wiki codes via PROJECT_MAP
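
The incremental behavior hinges on those byte offsets. A minimal sketch of the idea (the real .mine-state.json schema may differ):

    import json
    from pathlib import Path

    STATE = Path(".mine-state.json")

    def read_new_lines(session: Path) -> list[str]:
        """Read only the JSONL lines appended since the last run."""
        state = json.loads(STATE.read_text()) if STATE.exists() else {}
        offset = state.get(str(session), 0)
        with session.open("rb") as f:
            f.seek(offset)                            # skip already-processed bytes
            chunk = f.read()
        state[str(session)] = offset + len(chunk)     # advance the stored offset
        STATE.write_text(json.dumps(state))
        return chunk.decode("utf-8").splitlines()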

summarize-conversations.py

Sends extracted transcripts to an LLM for classification and summarization. Supports two backends:

  1. --claude mode (recommended): Uses claude -p with haiku for short sessions (≤200 messages) and sonnet for longer ones. Runs chunked over long transcripts, keeping a rolling context window.

  2. Local LLM mode (default, omit --claude): Uses a local llama-server instance at localhost:8080 (or WSL gateway:8081 on Windows Subsystem for Linux). Requires llama.cpp installed and a GGUF model loaded.

Output: adds frontmatter to each conversation file — topics, halls (fact/discovery/preference/advice/event/tooling), and related wiki page links. The related links are load-bearing: they're what wiki-hygiene.py uses to refresh last_verified on pages that are still being discussed.
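
After a pass, a conversation file carries frontmatter shaped roughly like this (field names are the ones above; the values and page names are made up):

    ---
    topics: [qmd-indexing, staging-review]
    halls: [discovery, tooling]
    related: [patterns/qmd-search.md, decisions/staging-pipeline.md]
    ---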

update-conversation-index.py

Regenerates three files from the summarized conversations:

  • conversations/index.md — catalog of all conversations grouped by project
  • context/wake-up.md — a ~200-token briefing the agent loads at the start of every session ("current focus areas, recent decisions, active concerns")
  • context/active-concerns.md — longer-form current state

The wake-up file is important: it's what gives the agent continuity across sessions without forcing you to re-explain context every time.
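
For shape, a wake-up briefing might read (entirely illustrative; the real file is generated from your recent conversations):

    Focus: tuning qmd indexing for the conversations collection.
    Recent decisions: staging promotion keeps origin: automated as an audit trail.
    Active concerns: harvest fetches failing on JS-heavy pages.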

mine-conversations.sh

Orchestrator chaining extract → summarize → index. Supports --extract-only, --summarize-only, --index-only, --project <code>, and --dry-run.


Automation layer

wiki_lib.py

The shared library. Everything in the automation layer imports from here. Provides:

  • WikiPage dataclass — path + frontmatter + body + raw YAML
  • parse_page(path) — safe markdown parser with YAML frontmatter
  • parse_yaml_lite(text) — subset YAML parser (no external deps, handles the frontmatter patterns we use)
  • serialize_frontmatter(fm) — writes YAML back in canonical key order
  • write_page(page, ...) — full round-trip writer
  • page_content_hash(page) — body-only SHA-256 for change detection
  • iter_live_pages() / iter_staging_pages() / iter_archived_pages()
  • Shared constants: WIKI_DIR, STAGING_DIR, ARCHIVE_DIR, etc.

All paths honor the WIKI_DIR environment variable, so tests and alternate installs can override the root.
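
A typical round-trip through the helpers above (a sketch under assumptions: iter_live_pages() is taken to yield paths, and frontmatter to behave like a dict; check wiki_lib.py for the real signatures):

    from wiki_lib import iter_live_pages, parse_page, write_page

    for path in iter_live_pages():                          # assumed to yield paths
        page = parse_page(path)                             # WikiPage: path + frontmatter + body
        page.frontmatter.setdefault("confidence", "medium") # assumed dict-like access
        write_page(page)                                    # serializes YAML in canonical key order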

wiki-harvest.py

Scans summarized conversations for HTTP(S) URLs, classifies them, fetches content, and compiles pending wiki pages.

URL classification:

  • Harvest (Type A/B) — docs, articles, blogs → fetch and compile
  • Check (Type C) — GitHub issues, Stack Overflow — only harvest if the topic is already covered in the wiki (to avoid noise)
  • Skip (Type D) — internal domains, localhost, private IPs, chat tools
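
A sketch of the decision (the pattern lists here are placeholders; the real ones are SKIP_DOMAIN_PATTERNS and C_TYPE_URL_PATTERNS, listed under Extension seams):

    import re

    SKIP_PATTERNS = [r"localhost", r"127\.0\.0\.1", r"\.corp\."]         # placeholders
    CHECK_PATTERNS = [r"github\.com/.+/issues/", r"stackoverflow\.com"]  # placeholders

    def classify(url: str) -> str:
        if any(re.search(p, url) for p in SKIP_PATTERNS):
            return "skip"       # Type D: never fetched
        if any(re.search(p, url) for p in CHECK_PATTERNS):
            return "check"      # Type C: fetch only if the topic is already in the wiki
        return "harvest"        # Type A/B: docs, articles, blogs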

Fetch cascade (tries in order, validates at each step):

  1. trafilatura -u <url> --markdown --no-comments --precision
  2. crwl <url> -o markdown-fit
  3. crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true" (stealth)
  4. Conversation-transcript fallback — pull inline content from where the URL was mentioned during the session
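
In code, the cascade is a loop over those fetchers with a validity check between attempts. A sketch (commands copied from the list above; the length check stands in for the real validation):

    import subprocess

    def fetch(url: str) -> str | None:
        attempts = [
            ["trafilatura", "-u", url, "--markdown", "--no-comments", "--precision"],
            ["crwl", url, "-o", "markdown-fit"],
            ["crwl", url, "-o", "markdown-fit",
             "-b", "user_agent_mode=random", "-c", "magic=true"],     # stealth pass
        ]
        for cmd in attempts:
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode == 0 and len(result.stdout) > 500:  # stand-in validation
                return result.stdout
        return None   # caller falls back to the conversation transcript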

Validated content goes to raw/harvested/<domain>-<path>.md with frontmatter recording source URL, fetch method, and a content hash.

Compilation step: sends the raw content + index.md + conversation context to claude -p, asking for a JSON verdict:

  • new_page — create a new wiki page
  • update_page — update an existing page (with modifies: field)
  • both — do both
  • skip — content isn't substantive enough
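
An illustrative verdict (the verdict values are from the list above; the other fields are assumptions about the payload shape):

    {
      "verdict": "update_page",
      "modifies": "patterns/qmd-search.md",
      "reason": "adds retrieval benchmarks the existing page lacks"
    }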

Result lands in staging/<type>/ with origin: automated, status: pending, and all the staging-specific frontmatter that gets stripped on promotion.

wiki-staging.py

Pure file operations — no LLM calls. Human review pipeline for automated content.

Commands:

  • --list / --list --json — pending items with metadata
  • --stats — counts by type/source + age stats
  • --review — interactive a/r/s/q loop with preview
  • --promote <path> — approve, strip staging fields, move to live, update the main index, rewrite cross-refs, and preserve origin: automated as an audit trail
  • --reject <path> --reason "..." — delete, and record the URL in .harvest-state.json rejected_urls so the harvester won't re-create it
  • --promote-all — bulk approve everything
  • --sync — regenerate staging/index.md, detect drift

wiki-hygiene.py

The heavy lifter. Two modes:

Quick mode (no LLM, ~1 second on a 100-page wiki, run daily):

  • Backfill last_verified from last_compiled/git/mtime
  • Refresh last_verified from conversation related: links — this is the "something's still being discussed" signal
  • Auto-restore archived pages that are referenced again
  • Repair frontmatter (missing/invalid fields get sensible defaults)
  • Apply confidence decay per thresholds (6/9/12 months; see the sketch after this list)
  • Archive stale and superseded pages
  • Detect index drift (pages on disk not in index, stale index entries)
  • Detect orphan pages (no inbound links) and auto-add them to index
  • Detect broken cross-references, fuzzy-match to the intended target via difflib.get_close_matches, fix in place
  • Report empty stubs (body < 100 chars)
  • Detect state file drift (references to missing files)
  • Regenerate staging/index.md and archive/index.md if out of sync
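
The decay rule itself is a threshold comparison against last_verified. A minimal sketch (constant names from Extension seams below; the day counts approximate the 6/9/12-month defaults):

    from datetime import date

    # Days since last_verified before each step down (approx. 6/9/12 months).
    DECAY_HIGH_TO_MEDIUM, DECAY_MEDIUM_TO_LOW, DECAY_LOW_TO_STALE = 180, 270, 365

    def decay(confidence: str, last_verified: date, today: date) -> str:
        """Step confidence down one level once the page is old enough."""
        age = (today - last_verified).days
        if confidence == "high" and age >= DECAY_HIGH_TO_MEDIUM:
            return "medium"
        if confidence == "medium" and age >= DECAY_MEDIUM_TO_LOW:
            return "low"
        if confidence == "low" and age >= DECAY_LOW_TO_STALE:
            return "stale"
        return confidence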

Full mode (LLM-powered, run weekly — extends quick mode with):

  • Missing cross-references (haiku, batched 5 pages per call)
  • Duplicate coverage (sonnet — weaker merged into stronger, auto-archives the loser with archived_reason: Merged into <winner>)
  • Contradictions (sonnet, report-only — the human decides)
  • Technology lifecycle (regex + conversation comparison — flags pages mentioning Node 18 when recent conversations are using Node 20)

State lives in .hygiene-state.json — tracks content hashes per page so full-mode runs can skip unchanged pages. Reports land in reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md.

wiki-maintain.sh

Top-level orchestrator:

Phase 1: wiki-harvest.py     (unless --hygiene-only)
Phase 2: wiki-hygiene.py     (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed     (unless --no-reindex or --dry-run)

Flags pass through to child scripts. Error-tolerant: if one phase fails, the others still run. Logs to scripts/.maintain.log.


Sync layer

wiki-sync.sh

Git-based sync for cross-machine use. Commands:

  • --commit — stage and commit local changes
  • --pull — git pull with markdown merge-union (keeps both sides on conflict)
  • --push — push to origin
  • full — commit + pull + push + qmd reindex
  • --status — read-only sync state report

The .gitattributes file sets *.md merge=union so markdown conflicts auto-resolve by keeping both versions. This works because most conflicts are additive (two machines both adding new entries).


State files

Three JSON files track per-pipeline state:

| File | Owner | Synced? | Purpose |
| --- | --- | --- | --- |
| .mine-state.json | extract-sessions.py, summarize-conversations.py | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| .harvest-state.json | wiki-harvest.py | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| .hygiene-state.json | wiki-hygiene.py | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

Harvest and hygiene state need to sync across machines so both installations agree on what's been processed. Mining state is per-machine because Claude Code session files live at OS-specific paths.


Module dependency graph

wiki_lib.py  ─┬─>  wiki-harvest.py
              ├─>  wiki-staging.py
              └─>  wiki-hygiene.py

wiki-maintain.sh  ─>  wiki-harvest.py
                  ─>  wiki-hygiene.py
                  ─>  qmd (external)

mine-conversations.sh  ─>  extract-sessions.py
                       ─>  summarize-conversations.py
                       ─>  update-conversation-index.py

extract-sessions.py     (standalone — reads Claude JSONL)
summarize-conversations.py  ─>  claude CLI (or llama-server)
update-conversation-index.py  ─>  qmd (external)

wiki_lib.py is the only shared Python module — everything else is self-contained within its layer.


Extension seams

The places to modify when customizing:

  1. scripts/extract-sessions.py — PROJECT_MAP controls how Claude project directories become wiki "wings" (see the sketch after this list). Also KEEP_FULL_OUTPUT_TOOLS, SUMMARIZE_TOOLS, MAX_BASH_OUTPUT_LINES to tune transcript shape.

  2. scripts/update-conversation-index.py — PROJECT_NAMES and PROJECT_ORDER control how the index groups conversations.

  3. scripts/wiki-harvest.py

    • SKIP_DOMAIN_PATTERNS — your internal domains
    • C_TYPE_URL_PATTERNS — URL shapes that need topic-match before harvesting
    • FETCH_DELAY_SECONDS — rate limit between fetches
    • COMPILE_PROMPT_TEMPLATE — what the AI compile step tells the LLM
    • SONNET_CONTENT_THRESHOLD — size cutoff for haiku vs sonnet
  4. scripts/wiki-hygiene.py

    • DECAY_HIGH_TO_MEDIUM / DECAY_MEDIUM_TO_LOW / DECAY_LOW_TO_STALE — decay thresholds in days
    • EMPTY_STUB_THRESHOLD — what counts as a stub
    • VERSION_REGEX — which tools/runtimes to track for lifecycle checks
    • REQUIRED_FIELDS — frontmatter fields the repair step enforces
  5. scripts/summarize-conversations.py

    • CLAUDE_LONG_THRESHOLD — haiku/sonnet routing cutoff
    • MINE_PROMPT_FILE — the LLM system prompt for summarization
    • Backend selection (claude vs llama-server)
  6. CLAUDE.md at the wiki root — the instructions the agent reads every session. This is where you tell the agent how to maintain the wiki, what conventions to follow, when to flag things to you.
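
As an example of the first seam, PROJECT_MAP is just a dict from Claude's project-directory names to short wing codes (entries here are made up):

    # In scripts/extract-sessions.py (illustrative entries only).
    PROJECT_MAP = {
        "-home-you-src-acme-api": "acme",   # ~/.claude/projects/ directory name → wing code
        "-home-you-src-notes": "notes",
    }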

See docs/CUSTOMIZE.md for recipes.