A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene. Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
# Architecture

Eleven scripts across three conceptual layers. This document walks through what each one does, how they talk to each other, and where the seams are for customization.

> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why* behind each component, with links to the interactive design artifact.

## Borrowed concepts

The architecture is a synthesis of two external ideas with an automation layer on top. The terminology often maps 1:1, so it's worth calling out which concepts came from where:

### From Karpathy's persistent-wiki gist

| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |

### From [mempalace](https://github.com/milla-jovovich/mempalace)

MemPalace gave us the structural memory taxonomy that turns a flat corpus into something you can navigate without reading everything. The concepts map directly:

| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |

The key benefit of wing + room filtering is documented in MemPalace's benchmarks as a **+34% retrieval boost** over flat search — because `qmd` can search a pre-narrowed subset of the corpus instead of everything. This is why the wiki scales past the Karpathy pattern's ~50K-token ceiling without needing a full vector DB rebuild.

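The mechanics of the pre-narrowing are simple enough to sketch. This is an illustrative toy, not code from the repo: the page records, field names, and the `narrow()` helper are all invented for the example, standing in for filtering on frontmatter before handing the subset to the search backend.

```python
def narrow(pages, wing=None, topic=None):
    """Return only the pages a search backend needs to consider."""
    hits = pages
    if wing is not None:
        hits = [p for p in hits if p["wing"] == wing]
    if topic is not None:
        hits = [p for p in hits if topic in p["topics"]]
    return hits

# Toy corpus: in the real wiki these fields live in YAML frontmatter.
corpus = [
    {"wing": "projA", "topics": ["qmd", "sync"]},
    {"wing": "projA", "topics": ["decay"]},
    {"wing": "projB", "topics": ["qmd"]},
]

# Search only projA pages about qmd instead of the whole corpus.
subset = narrow(corpus, wing="projA", topic="qmd")
```

The search itself stays unchanged; only its input shrinks, which is where the retrieval gain comes from.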
### What this repo adds

Automation + lifecycle management on top of both:

- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union

---

## Overview

```
┌─────────────────────────────────┐
│           SYNC LAYER            │
│  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│          MINING LAYER           │
│  extract-sessions.py            │  (Claude Code JSONL → markdown)
│  summarize-conversations.py     │  (LLM classify + summarize)
│  update-conversation-index.py   │  (regenerate indexes + wake-up)
│  mine-conversations.sh          │  (orchestrator)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│        AUTOMATION LAYER         │
│  wiki_lib.py (shared helpers)   │
│  wiki-harvest.py                │  (URL → raw → staging)
│  wiki-staging.py                │  (human review)
│  wiki-hygiene.py                │  (decay, archive, repair, checks)
│  wiki-maintain.sh               │  (orchestrator)
└─────────────────────────────────┘
```

Each layer is independent — you can run the mining layer without the automation layer, or vice versa. The layers communicate through files on disk (conversation markdown, raw harvested pages, staging pages, wiki pages), never through in-memory state.

---

## Mining layer

### `extract-sessions.py`

Parses Claude Code JSONL session files from `~/.claude/projects/` into clean markdown transcripts under `conversations/<project-code>/`. Deterministic, no LLM calls. Incremental — tracks byte offsets in `.mine-state.json` so it safely re-runs on partially-processed sessions.

Key features:

- Summarizes tool calls intelligently: full output for `Bash` and `Skill`, paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction, it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`

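The incremental byte-offset trick reduces to a few lines. This is a simplified sketch, not the repo's implementation: it assumes the state dict maps session paths to byte offsets as described above, and `read_new_lines` is an invented name.

```python
from pathlib import Path

def read_new_lines(session: Path, state: dict) -> list[str]:
    """Return only the JSONL lines appended since the recorded offset."""
    offset = state.get(str(session), 0)
    with session.open("rb") as f:
        f.seek(offset)                  # skip everything already processed
        chunk = f.read()
    state[str(session)] = offset + len(chunk)  # remember how far we got
    return chunk.decode("utf-8").splitlines()
```

On a resumed session, a second call sees only the appended messages — which is what makes re-runs safe on partially-processed files.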
### `summarize-conversations.py`

Sends extracted transcripts to an LLM for classification and summarization. Supports two backends:

1. **`--claude` mode** (recommended): uses `claude -p` with haiku for short sessions (≤200 messages) and sonnet for longer ones. Runs chunked over long transcripts, keeping a rolling context window.

2. **Local LLM mode** (default, omit `--claude`): uses a local `llama-server` instance at `localhost:8080` (or the WSL gateway on port 8081 under Windows Subsystem for Linux). Requires llama.cpp installed and a GGUF model loaded.

Output: adds frontmatter to each conversation file — `topics`, `halls` (fact/discovery/preference/advice/event/tooling), and `related` wiki page links. The `related` links are load-bearing: they're what `wiki-hygiene.py` uses to refresh `last_verified` on pages that are still being discussed.

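Concretely, a summarized conversation file ends up with frontmatter along these lines (the topic names, hall values, and paths here are invented for illustration):

```
---
topics: [qmd-indexing, sync]
halls: [discovery, tooling]
related:
  - patterns/search-layer.md
  - conversations/projA/2024-05-01-sync-debug.md
---
```
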
### `update-conversation-index.py`

Regenerates three files from the summarized conversations:

- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start of every session ("current focus areas, recent decisions, active concerns")
- `context/active-concerns.md` — longer-form current state

The wake-up file is important: it's what gives the agent *continuity* across sessions without forcing you to re-explain context every time.

### `mine-conversations.sh`

Orchestrator chaining extract → summarize → index. Supports `--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`, and `--dry-run`.

---

## Automation layer

### `wiki_lib.py`

The shared library. Everything in the automation layer imports from here. Provides:

- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.

All paths honor the `WIKI_DIR` environment variable, so tests and alternate installs can override the root.

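The shape of the parsing and hashing helpers is easy to sketch. This is an illustrative approximation, not the repo's actual code — `split_frontmatter` and `body_hash` are invented stand-ins for `parse_page()` and `page_content_hash()`:

```python
import hashlib

def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a page into (frontmatter_yaml, body)."""
    if text.startswith("---\n"):
        yaml_part, _, body = text[4:].partition("\n---\n")
        return yaml_part, body.lstrip("\n")
    return "", text  # no frontmatter block at all

def body_hash(text: str) -> str:
    """Body-only SHA-256: frontmatter churn doesn't count as a change."""
    _, body = split_frontmatter(text)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Hashing only the body is what lets hygiene runs touch `last_verified` in frontmatter without invalidating the change-detection cache.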
### `wiki-harvest.py`

Scans summarized conversations for HTTP(S) URLs, classifies them, fetches content, and compiles pending wiki pages.

URL classification:

- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools

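A minimal version of that triage might look like the following. The pattern lists are invented stand-ins for `SKIP_DOMAIN_PATTERNS` and `C_TYPE_URL_PATTERNS`; the real classifier is more thorough:

```python
import re
from urllib.parse import urlparse

SKIP = [r"^localhost$", r"^127\.", r"\.internal$"]           # Type D shapes
CHECK = [r"github\.com/.+/issues/", r"stackoverflow\.com/"]  # Type C shapes

def classify(url: str) -> str:
    parts = urlparse(url)
    host = parts.hostname or ""
    if any(re.search(p, host) for p in SKIP):
        return "skip"
    if any(re.search(p, host + parts.path) for p in CHECK):
        return "check"
    return "harvest"  # default: Type A/B, fetch and compile
```

In the real pipeline a "check" verdict additionally requires a topic match against the existing wiki before the URL is fetched.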
Fetch cascade (tries in order, validates at each step):

1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the URL was mentioned during the session

Validated content goes to `raw/harvested/<domain>-<path>.md` with frontmatter recording source URL, fetch method, and a content hash.

Compilation step: sends the raw content + `index.md` + conversation context to `claude -p`, asking for a JSON verdict:

- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough

Result lands in `staging/<type>/` with `origin: automated`, `status: pending`, and all the staging-specific frontmatter that gets stripped on promotion.

### `wiki-staging.py`

Pure file operations — no LLM calls. The human review pipeline for automated content.

Commands:

- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve: strip staging fields, move to live, update the main index, rewrite cross-refs, preserve `origin: automated` as an audit trail
- `--reject <path> --reason "..."` — delete, and record the URL in `.harvest-state.json`'s `rejected_urls` so the harvester won't re-create it
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift

### `wiki-hygiene.py`

The heavy lifter. Two modes:

**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):

- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to index
- Detect broken cross-references, fuzzy-match to the intended target via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync

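The confidence-decay step reduces to a small pure function. A sketch under the 6/9/12-month thresholds (roughly 180/270/365 days), with all names invented for the example — the real script reads `last_verified` from frontmatter and writes the new level back:

```python
LEVELS = ["stale", "low", "medium", "high"]
DECAY = [(365, "stale"), (270, "low"), (180, "medium")]  # oldest threshold first

def decayed(confidence: str, age_days: int) -> str:
    """Downgrade confidence once last_verified is old enough; never upgrade."""
    target = confidence
    for days, floor in DECAY:
        if age_days >= days:
            target = floor
            break
    if LEVELS.index(target) > LEVELS.index(confidence):
        return confidence  # decay must never raise confidence
    return target
```

A `high` page untouched for seven months drops to `medium`; a `low` page at the same age stays `low`.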
**Full mode** (LLM-powered, run weekly — extends quick mode with):

- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — the weaker page is merged into the stronger, and the loser is auto-archived with `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages mentioning `Node 18` when recent conversations are using `Node 20`)

State lives in `.hygiene-state.json` — it tracks content hashes per page so full-mode runs can skip unchanged pages. Reports land in `reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.

### `wiki-maintain.sh`

Top-level orchestrator:

```
Phase 1: wiki-harvest.py              (unless --hygiene-only)
Phase 2: wiki-hygiene.py              (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed      (unless --no-reindex or --dry-run)
```

Flags pass through to child scripts. Error-tolerant: if one phase fails, the others still run. Logs to `scripts/.maintain.log`.

---

## Sync layer

### `wiki-sync.sh`

Git-based sync for cross-machine use. Commands:

- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report

The `.gitattributes` file sets `*.md merge=union` so markdown conflicts auto-resolve by keeping both versions. This works because most conflicts are additive (two machines both adding new entries).

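The whole mechanism is one line of configuration in `.gitattributes`:

```
*.md merge=union
```

With git's built-in `union` merge driver, a conflicting hunk is resolved by keeping the lines from both sides instead of inserting conflict markers — safe for append-mostly markdown, but not something you'd want on code.
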
---

## State files

Three JSON files track per-pipeline state:

| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

Harvest and hygiene state need to sync across machines so both installations agree on what's been processed. Mining state is per-machine because Claude Code session files live at OS-specific paths.

---

## Module dependency graph

```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─> wiki-harvest.py
                 ─> wiki-hygiene.py
                 ─> qmd (external)

mine-conversations.sh ─> extract-sessions.py
                      ─> summarize-conversations.py
                      ─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```

`wiki_lib.py` is the only shared Python module — everything else is self-contained within its layer.


---

## Extension seams

The places to modify when customizing:

1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`, `SUMMARIZE_TOOLS`, and `MAX_BASH_OUTPUT_LINES` to tune transcript shape.

2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and `PROJECT_ORDER` control how the index groups conversations.

3. **`scripts/wiki-harvest.py`** —
   - `SKIP_DOMAIN_PATTERNS` — your internal domains
   - `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
   - `FETCH_DELAY_SECONDS` — rate limit between fetches
   - `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
   - `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet

4. **`scripts/wiki-hygiene.py`** —
   - `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE` — decay thresholds in days
   - `EMPTY_STUB_THRESHOLD` — what counts as a stub
   - `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
   - `REQUIRED_FIELDS` — frontmatter fields the repair step enforces

5. **`scripts/summarize-conversations.py`** —
   - `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
   - `MINE_PROMPT_FILE` — the LLM system prompt for summarization
   - Backend selection (claude vs llama-server)

6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads every session. This is where you tell the agent how to maintain the wiki, what conventions to follow, and when to flag things to you.

See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.