Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
Commit ee54a2f5d4 by Eric Turner, 2026-04-12 21:16:02 -06:00
31 changed files with 10792 additions and 0 deletions

docs/ARCHITECTURE.md (new file, 360 lines)

# Architecture
Eleven scripts across three conceptual layers. This document walks through
what each one does, how they talk to each other, and where the seams are
for customization.
> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
> behind each component, with links to the interactive design artifact.
## Borrowed concepts
The architecture is a synthesis of two external ideas with an automation
layer on top. The terminology often maps 1:1, so it's worth calling out
which concepts came from where:
### From Karpathy's persistent-wiki gist
| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |
### From [mempalace](https://github.com/milla-jovovich/mempalace)
MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:
| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |
The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K-token ceiling without needing a full vector DB rebuild.
### What this repo adds
Automation + lifecycle management on top of both:
- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union
---
## Overview
```
┌─────────────────────────────────┐
│           SYNC LAYER            │
│  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
└─────────────────────────────────┘

┌─────────────────────────────────┐
│          MINING LAYER           │
│  extract-sessions.py            │  (Claude Code JSONL → markdown)
│  summarize-conversations.py     │  (LLM classify + summarize)
│  update-conversation-index.py   │  (regenerate indexes + wake-up)
│  mine-conversations.sh          │  (orchestrator)
└─────────────────────────────────┘

┌─────────────────────────────────┐
│        AUTOMATION LAYER         │
│  wiki_lib.py (shared helpers)   │
│  wiki-harvest.py                │  (URL → raw → staging)
│  wiki-staging.py                │  (human review)
│  wiki-hygiene.py                │  (decay, archive, repair, checks)
│  wiki-maintain.sh               │  (orchestrator)
└─────────────────────────────────┘
```
Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.
---
## Mining layer
### `extract-sessions.py`
Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially-processed sessions.
Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction,
it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
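The incremental re-run behavior can be sketched in a few lines. This is a simplified illustration, not the script's actual code: the state-file shape and function names here are assumptions, but the byte-offset idea is the one described above.

```python
import json
from pathlib import Path

STATE_FILE = Path(".mine-state.json")  # per-machine, gitignored (per the docs)

def load_state() -> dict:
    # Map of session-file path -> byte offset already processed.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def read_new_lines(session_path: Path, state: dict) -> list[str]:
    """Return only the JSONL lines appended since the last run."""
    offset = state.get(str(session_path), 0)
    data = session_path.read_bytes()
    if offset > len(data):       # file truncated or rotated: start over
        offset = 0
    new = data[offset:]
    state[str(session_path)] = len(data)   # advance the recorded offset
    return new.decode("utf-8", errors="replace").splitlines()
```

Because only the suffix past the stored offset is read, a resumed session yields just its new messages on the next run.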
### `summarize-conversations.py`
Sends extracted transcripts to an LLM for classification and summarization.
Supports two backends:
1. **`--claude` mode** (recommended): Uses `claude -p` with
haiku for short sessions (≤200 messages) and sonnet for longer ones.
Runs chunked over long transcripts, keeping a rolling context window.
2. **Local LLM mode** (default, omit `--claude`): Uses a local
   `llama-server` instance at `localhost:8080` (or the WSL gateway address
   on port 8081 under Windows Subsystem for Linux). Requires llama.cpp
   installed and a GGUF model loaded.
Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki
page links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.
### `update-conversation-index.py`
Regenerates three files from the summarized conversations:
- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
of every session ("current focus areas, recent decisions, active
concerns")
- `context/active-concerns.md` — longer-form current state
The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.
### `mine-conversations.sh`
Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.
---
## Automation layer
### `wiki_lib.py`
The shared library. Everything in the automation layer imports from here.
Provides:
- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.
All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
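For illustration, a minimal parser in the spirit of `parse_yaml_lite` might look like the sketch below. This is not the real implementation; it handles only the scalar, inline-list, and block-list shapes that frontmatter typically uses.

```python
def parse_yaml_lite(text: str) -> dict:
    """Parse the scalar and flat-list YAML subset typical of frontmatter."""
    result: dict = {}
    current_list = None                  # key of the block list being collected
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith("- "):    # item in a block list
            if current_list is not None:
                result[current_list].append(stripped[2:].strip())
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value == "":                  # bare "key:" means a block list follows
            result[key] = []
            current_list = key
        elif value.startswith("[") and value.endswith("]"):
            result[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
            current_list = None
        else:
            result[key] = value
            current_list = None
    return result
```

The trade-off is deliberate: no external YAML dependency, in exchange for supporting only the patterns the wiki actually writes.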
### `wiki-harvest.py`
Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages.
URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools
Fetch cascade (tries in order, validates at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
URL was mentioned during the session
Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording source URL, fetch method, and a content hash.
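The cascade-with-validation control flow is simple to express generically. In the real script each fetcher shells out to the commands listed above; in this sketch the fetchers are plain callables, and the `min_chars` validation rule is illustrative.

```python
from typing import Callable, Optional

def fetch_cascade(url: str,
                  fetchers: list[Callable[[str], Optional[str]]],
                  min_chars: int = 200) -> Optional[str]:
    """Try each fetcher in order; accept the first result that validates."""
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue                 # a failed fetcher just falls through
        if content and len(content) >= min_chars:
            return content           # validated: stop the cascade
    return None
```

A fetcher that raises, returns nothing, or returns too little content simply hands off to the next rung, which is what lets the stealth and transcript fallbacks exist without special-casing.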
Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:
- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough
Result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.
### `wiki-staging.py`
Pure file operations — no LLM calls. Human review pipeline for automated
content.
Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve, strip staging fields, move to live, update the
  main index, rewrite cross-refs, preserve `origin: automated` as an audit trail
- `--reject <path> --reason "..."` — delete, record in the
  `.harvest-state.json` `rejected_urls` list so the harvester won't re-create it
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
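The frontmatter-stripping half of promotion might look like this sketch. The exact set of staging-only field names is an assumption; the point is that staging metadata is removed while `origin: automated` survives into the live page.

```python
# Hypothetical staging-only field names -- not the script's actual list.
STAGING_ONLY_FIELDS = {"status", "staged_at", "compile_verdict", "modifies"}

def promote_frontmatter(fm: dict) -> dict:
    """Return frontmatter with staging-specific fields stripped.

    `origin` is intentionally NOT in STAGING_ONLY_FIELDS, so the
    `origin: automated` audit trail is preserved on the live page.
    """
    return {k: v for k, v in fm.items() if k not in STAGING_ONLY_FIELDS}
```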
### `wiki-hygiene.py`
The heavy lifter. Two modes:
**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to index
- Detect broken cross-references, fuzzy-match to the intended target
via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync
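The broken-reference repair is essentially a one-liner around `difflib.get_close_matches`. A sketch (the 0.6 cutoff here is illustrative, not necessarily the script's setting):

```python
import difflib
from typing import Optional

def fix_broken_ref(broken: str, live_pages: list[str]) -> Optional[str]:
    """Fuzzy-match a broken wiki link to its most likely intended target."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

Anything below the cutoff stays unmatched, so a genuinely dead link is reported rather than silently rewritten to a wrong target.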
**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — weaker merged into stronger, auto-archives
the loser with `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
mentioning `Node 18` when recent conversations are using `Node 20`)
State lives in `.hygiene-state.json` — tracks content hashes per page so
full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.
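The hash-based skip can be modeled as below. This is a simplified sketch (the real state file also tracks deferred issues and timestamps), but it shows why the hash is body-only: frontmatter churn alone never triggers an expensive LLM re-check.

```python
import hashlib

def body_hash(body: str) -> str:
    """Body-only SHA-256, so frontmatter edits don't invalidate the cache."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def pages_needing_full_check(pages: dict[str, str],
                             state: dict[str, str]) -> list[str]:
    """Paths whose body changed since the hash recorded in state."""
    return [path for path, body in pages.items()
            if state.get(path) != body_hash(body)]
```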
### `wiki-maintain.sh`
Top-level orchestrator:
```
Phase 1: wiki-harvest.py (unless --hygiene-only)
Phase 2: wiki-hygiene.py (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed (unless --no-reindex or --dry-run)
```
Flags pass through to child scripts. Error-tolerant: if one phase fails,
the others still run. Logs to `scripts/.maintain.log`.
---
## Sync layer
### `wiki-sync.sh`
Git-based sync for cross-machine use. Commands:
- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report
The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).
---
## State files
Three JSON files track per-pipeline state:
| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
Harvest and hygiene state need to sync across machines so both
installations agree on what's been processed. Mining state is per-machine
because Claude Code session files live at OS-specific paths.
---
## Module dependency graph
```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─┬─> wiki-harvest.py
                  ├─> wiki-hygiene.py
                  └─> qmd (external)

mine-conversations.sh ─┬─> extract-sessions.py
                       ├─> summarize-conversations.py
                       └─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```
`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.
---
## Extension seams
The places to modify when customizing:
1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
`SUMMARIZE_TOOLS`, `MAX_BASH_OUTPUT_LINES` to tune transcript shape.
2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
`PROJECT_ORDER` control how the index groups conversations.
3. **`scripts/wiki-harvest.py`** —
- `SKIP_DOMAIN_PATTERNS` — your internal domains
- `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
- `FETCH_DELAY_SECONDS` — rate limit between fetches
- `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
- `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet
4. **`scripts/wiki-hygiene.py`** —
- `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
— decay thresholds in days
- `EMPTY_STUB_THRESHOLD` — what counts as a stub
- `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
- `REQUIRED_FIELDS` — frontmatter fields the repair step enforces
5. **`scripts/summarize-conversations.py`** —
- `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
- `MINE_PROMPT_FILE` — the LLM system prompt for summarization
- Backend selection (claude vs llama-server)
6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
every session. This is where you tell the agent how to maintain the
wiki, what conventions to follow, when to flag things to you.
See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.

docs/CUSTOMIZE.md (new file, 432 lines)

# Customization Guide
This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.
## What's actually required for the core idea
The minimum viable compounding wiki is:
1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
maintain the wiki
**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
---
## Adapting for non-Claude-Code agents
Four script components are Claude-specific. Each has a natural
replacement path:
### 1. `extract-sessions.py` — Claude Code JSONL parsing
**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.
**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.
**Replacements**:
- **Cursor**: Cursor stores chat history in `~/Library/Application
Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
equivalent `extract-sessions.py` that queries that SQLite and produces
the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
each project directory. A much simpler extractor: walk all project
directories, read each `.aider.chat.history.md`, split on session
boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
tool uses — the target format is a markdown file with a specific
frontmatter shape (`title`, `type: conversation`, `project`, `date`,
`status: extracted`, `messages: N`, body of user/assistant turns).
Anything that produces files in that shape will flow through the rest
of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
by hand and set `status: extracted` yourself.
The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
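The Aider path above is simple enough to sketch. Note the `# aider chat started at` banner used as the session boundary here is an assumption about Aider's history format; check it against your own `.aider.chat.history.md` files before relying on it.

```python
import re

def split_aider_sessions(history_text: str) -> list[str]:
    """Split one .aider.chat.history.md file into per-session chunks.

    Assumes each session opens with a '# aider chat started at ...'
    banner line; everything between banners is one session body.
    """
    parts = re.split(r"(?m)^# aider chat started at .*$", history_text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk would then be written to `conversations/<project>/` with the frontmatter shape described above (`title`, `type: conversation`, `project`, `date`, `status: extracted`, `messages: N`).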
### 2. `summarize-conversations.py` — `claude -p` summarization
**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.
**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.
**Replacements**:
- **OpenAI**: Replace the `call_claude` helper with a function that calls
`openai` Python SDK or `gpt` CLI. Use gpt-4o-mini for short
conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
`--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
`http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: `AI_BASE_URL` and `AI_MODEL` env
vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
hand to set `status: summarized` and add your own `topics`/`related`
frontmatter. Tedious but works for a small wiki.
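The short-vs-long routing and the request shape generalize to any OpenAI-compatible backend. A sketch, with model names taken from the list above and the HTTP transport omitted; the function names and the `threshold` default (mirroring the documented ≤200-message cutoff) are illustrative:

```python
def pick_model(message_count: int,
               short_model: str = "gpt-4o-mini",
               long_model: str = "gpt-4o",
               threshold: int = 200) -> str:
    """Mirror the haiku/sonnet routing: cheap model for short sessions."""
    return short_model if message_count <= threshold else long_model

def build_chat_request(model: str, system: str, transcript: str) -> dict:
    """Payload for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": transcript},
        ],
    }
```

POST the dict to `$AI_BASE_URL/chat/completions` with whatever HTTP client you prefer; the same payload works against OpenAI, Ollama, and llama-server.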
### 3. `wiki-harvest.py` — AI compile step
**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Any other LLM**: Replace `call_claude_compile()` with a function that
calls your preferred backend. The prompt template
(`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
and the harvester will save raw content to `raw/harvested/` without
trying to compile it. You can then manually (or via a different script)
turn the raw content into wiki pages.
### 4. `wiki-hygiene.py --full` — LLM-powered checks
**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
default). Quick mode has no LLM calls and catches 90% of structural
issues. Contradictions and duplicates just have to be caught by human
review during `wiki-staging.py --review` sessions.
### 5. `CLAUDE.md` at the wiki root
**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.
**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.
**Replacements**:
| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |
The content is the same — just rename the file and point your agent at
it.
---
## Running without cron
Cron is convenient but not required. Alternatives:
### Manual runs
Just call the scripts when you want the wiki updated:
```bash
cd ~/projects/wiki
# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh
# When you want hygiene + harvest
bash scripts/wiki-maintain.sh
# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```
This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.
### systemd timers (Linux)
More observable than cron, better journaling:
```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline
[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```
```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service # see logs
```
### launchd (macOS)
More native than cron on macOS:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key><string>com.user.wiki-maintain</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
</array>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key><integer>3</integer>
<key>Minute</key><integer>0</integer>
</dict>
<key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
<key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```
```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki # verify
```
### Git hooks (pre-push)
Run hygiene before every push so the wiki is always clean when it hits
the remote:
```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```
Downside: every push is slow. Upside: you never push a broken wiki.
### CI pipeline
Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:
```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
hygiene:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
- run: python3 scripts/wiki-hygiene.py --check-only
```
`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.
---
## Minimal subsets
You don't have to run the whole pipeline. Pick what's useful:
### "Just the wiki" (no automation)
- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent
This is the Karpathy-gist version. Works great for small wikis.
### "Wiki + mining" (no harvesting, no hygiene)
- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually
Useful if you want session continuity (the wake-up briefing) without
the full automation.
### "Wiki + hygiene" (no mining, no harvesting)
- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues
Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.
### "Wiki + harvesting" (no session mining)
- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete mining scripts
- Source URLs manually — put them in a file and point the harvester at
it. You'd need to write a wrapper that extracts URLs from your source
file and feeds them into the fetch cascade.
Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).
---
## Schema customization
The repo uses these live content types:
- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ
These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:
1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
`scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
Examples of alternative schemas:
**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources
**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure
**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers
None of these are better or worse — pick what matches how you think.
---
## Frontmatter customization
The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.
Useful additions you might want:
```yaml
author: alice # who wrote or introduced the page
tags: [auth, security] # flat tag list
urgency: high # for to-do-style wiki pages
stakeholders: # who cares about this page
- product-team
- security-team
review_by: 2026-06-01 # explicit review date instead of age-based decay
```
If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.
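For orientation, a hypothetical `expected_confidence()` keyed off `last_verified` might look like the sketch below; changing the decay basis means changing which frontmatter field feeds the `last_verified` argument. The thresholds mirror the documented 6/9/12-month curve, but this is an illustration, not the script's actual code.

```python
from datetime import date

# Thresholds in days, mirroring the documented 6/9/12-month decay curve.
DECAY_HIGH_TO_MEDIUM = 180
DECAY_MEDIUM_TO_LOW = 270
DECAY_LOW_TO_STALE = 365

def expected_confidence(last_verified: date, today: date) -> str:
    """What confidence a page *should* have, given time since verification."""
    age = (today - last_verified).days
    if age >= DECAY_LOW_TO_STALE:
        return "stale"          # candidate for archiving
    if age >= DECAY_MEDIUM_TO_LOW:
        return "low"
    if age >= DECAY_HIGH_TO_MEDIUM:
        return "medium"
    return "high"
```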
---
## Working across multiple wikis
The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:
```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh
# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh
# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```
Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.
---
## What I'd change if starting over
Honest notes on the design choices, in case you're about to fork:
1. **Config should be in YAML, not inline constants.** I bolted a
"CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
as a shortcut. Better: a `config.yaml` at the wiki root that all
scripts read.
2. **The mining layer is tightly coupled to Claude Code.** A cleaner
design would put a `Session` interface in `wiki_lib.py` and have
extractors for each agent produce `Session` objects — the rest of the
pipeline would be agent-agnostic.
3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
`wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
file and grew.
4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
awkward.** Standard Python convention is underscores. I used hyphens
for consistency with the shell scripts, and `conftest.py` has a
module-loader workaround. A cleaner fork would use underscores
everywhere.
5. **The wiki schema assumes you know what you want to catalog.** If
you don't, start with a free-form `notes/` directory and let
categories emerge organically, then refactor into `patterns/` etc.
later.
None of these are blockers. They're all "if I were designing v2"
observations.

docs/DESIGN-RATIONALE.md (new file, 338 lines)

# Design Rationale — Signal & Noise
Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."
Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:
> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.
---
## Where the pattern is genuinely strong
The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:
| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |
---
## Where the pattern is genuinely weak — and how this repo answers
The analysis identified seven real weaknesses. Five have direct
mitigations in this repo; two remain open trade-offs you should be aware
of.
### 1. Errors persist and compound
**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.
**How this repo mitigates**:
- **`confidence` field** — every page carries `high`/`medium`/`low` with
decay based on `last_verified`. Wrong claims aren't treated as
permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
they're excluded from default search. If they get referenced again
they're auto-restored with `confidence: medium` (never straight to
`high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
are the ground truth. Every compiled wiki page can be traced back to
its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
sonnet to find conflicting claims across pages. Report-only (humans
decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
Nothing enters the live wiki without human approval, so errors have
two chances to get caught (AI compile + human review) before they
become persistent.
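Taken together, these mitigations live in a page's frontmatter. A sketch of what that looks like — field names come from this document; the values and exact layout are illustrative, not the repo's canonical schema:

```yaml
title: docker-healthchecks
confidence: medium          # high/medium/low — decays as last_verified ages
last_verified: 2026-03-01   # bumped when a conversation references the page
origin: automated           # manual pages skip staging; automated ones don't
sources:
  - raw/harvested/2026-02-14-docker-docs.md  # immutable ground truth
```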
### 2. Hard scale ceiling at ~50K tokens
**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.
**How this repo mitigates**:
- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
up in the default configuration so the agent never has to load the
full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
project code (wing) and topic (room, via the `topics:` frontmatter).
Retrieval is pre-narrowed to the relevant wing before search runs.
This extends the effective ceiling because `qmd` works on a relevant
subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
`includeByDefault: false`, so archived pages don't eat context until
explicitly queried.
### 3. Manual cross-checking burden returns in precision-critical domains
**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.
**How this repo mitigates**:
- **Staging workflow** — every automated page goes through human review.
For precision-critical content, that review IS the cross-check. The
AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
explanation of what it did and why. Makes review faster — you can
spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
(via `CLAUDE.md`) tell it to flag low-confidence content when
citing. Humans see the warning before acting.
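The traceability claim is mechanically simple: verification is a hash comparison against the raw file. A minimal sketch, assuming the stored `content_hash` is a hex SHA-256 digest (the repo's actual check lives in its scripts):

```python
import hashlib

def content_hash(path: str) -> str:
    """SHA-256 hex digest of a raw source file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def source_unchanged(path: str, stored_hash: str) -> bool:
    """A wiki claim's cited source is intact if the stored hash still matches."""
    return content_hash(path) == stored_hash
```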
**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
### 4. Knowledge staleness without active upkeep
**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.
**How this repo mitigates** (this is the biggest thing):
- **Automation replaces human discipline** — daily cron runs
`wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
`--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
manually. Every Claude Code session becomes potential ingest. The
feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
summarizer links a conversation to a wiki page via `related:`, the
hygiene script picks that up and bumps `last_verified`. Pages stay
fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
for 6/9/12 months get downgraded and eventually archived. The wiki
self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
flags the things that *do* need human judgment. Everything else is
auto-fixed.
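The decay thresholds reduce to simple date arithmetic. A sketch of the 6/9/12-month ladder — illustrative only; the real script's exact rules and field handling may differ:

```python
from datetime import date

def decay_action(last_verified: date, today: date) -> str:
    """Map staleness to an action: 6 months -> medium, 9 -> low, 12 -> archive."""
    months = (today.year - last_verified.year) * 12 + (today.month - last_verified.month)
    if months >= 12:
        return "archive"
    if months >= 9:
        return "confidence: low"
    if months >= 6:
        return "confidence: medium"
    return "fresh"
```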
This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.
### 5. Cognitive outsourcing risk
**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.
**How this repo mitigates**:
- **Staging review is a forcing function** — you see every automated
page before it lands. Even skimming forces engagement with the
material.
- **`qmd query "..."` for exploration** — searching the wiki is an
active process, not passive retrieval. You're asking questions, not
pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
the agent reads at session start. You read it too (or the agent reads
it to you) — ongoing re-exposure to your own knowledge base.
**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.
### 6. Weaker semantic retrieval than RAG at scale
**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.
**How this repo mitigates**:
- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
(wings) and topic frontmatter narrow the search space before the
hybrid search runs. Structure + semantics is stronger than either
alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
LLM to find pages that *should* link to each other but don't, then
auto-adds them. This is the explicit-linking approach catching up to
semantic retrieval over time.
**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. No access control or multi-user support
**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.
**How this repo mitigates**:
- **Git-based sync with merge-union** — concurrent writes on different
machines auto-resolve because markdown is set to `merge=union` in
`.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
deployment is over Tailscale or a VPN, so the network does the work a
RBAC layer would otherwise do. Not enterprise-grade, but sufficient
for personal/family/small-team use.
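The merge-union behavior is a single `.gitattributes` line (assuming markdown is the only union-merged type here):

```
*.md merge=union
```

With `merge=union`, git keeps both sides' lines on a concurrent edit instead of raising a conflict — a reasonable fit for append-heavy markdown, though it can interleave lines oddly when both sides edit the same paragraph.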
**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.
---
## The #1 failure mode — active upkeep
Every other weakness has a mitigation. *Lapsed upkeep is the failure
mode that kills wikis in the wild.* The community data is unambiguous:
- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks
The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:
| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.
---
## What was borrowed from where
This repo is a synthesis of two ideas with an automation layer on top:
### From Karpathy
- The core pattern: LLM-maintained persistent wiki, compile at ingest
time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time
### From mempalace
- **Wings** = per-person or per-project namespaces → this repo uses
project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
`conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
preference / advice / tooling) → the `halls:` frontmatter field
classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
conversation
- **Drawers** = verbatim archive, never lost → the extracted
conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
boost over flat search
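Concretely, the taxonomy lands as frontmatter on each summarized conversation. A sketch — field names from the mapping above, values purely illustrative:

```yaml
project: web                 # wing
topics: [auth, sessions]     # rooms
halls: [decision, tooling]   # memory-type corridors
related:
  - patterns/session-handling.md   # tunnel to a wiki page
```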
The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds
- **Automation layer** tying the pieces together with cron-friendly
orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
content
- **Confidence decay + auto-archive + auto-restore** as the "retention
curve" that community analysis identified as critical for long-term
wiki health
- **`qmd` integration** as the scalable search layer (chosen over
ChromaDB because it uses the same markdown storage as the wiki —
one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
wiki lives on multiple machines without merge hell
---
## Honest residual trade-offs
Five items from the analysis that this repo doesn't fully solve and
where you should know the limits:
1. **Enterprise scale** — this is a personal/small-team tool. Millions
of documents, hundreds of users, RBAC, compliance: wrong
architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
   is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
use this as a drafting tool, not a source of truth. Human
domain-expert review is not replaceable.
5. **Access control** — network boundary (Tailscale) is the fastest
path; nothing in the repo itself enforces permissions.
If any of these are dealbreakers for your use case, a different
architecture is probably what you need.
---
## Further reading
- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
— the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
— the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups

docs/SETUP.md (new file, 502 lines)
# Setup Guide
Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).
Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.
---
## 1. Prerequisites
### Required
- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
API key. Required for summarization and the harvester's AI compile step.
Without `claude`, you can still use the wiki, but the automation layer
falls back to manual or local-LLM paths.
### Python tools (recommended via `pipx`)
```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup # installs Playwright browsers
```
Verify: `trafilatura --version` and `crwl --help` should both work.
### Optional
- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
summarization path instead of `claude -p`
---
## 2. Clone the repo
```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```
The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.
---
## 3. Configure qmd search
`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:
- **`wiki`** — live content (patterns/decisions/concepts/environments),
staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
search; query explicitly with `-c wiki-archive` when digging into
history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
Excluded from default search because they'd flood results with noisy
tool-call output; query explicitly with `-c wiki-conversations` when
looking for "what did I discuss about X last month?"
```bash
npm install -g @tobilu/qmd
```
Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:
```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
wiki:
path: /Users/YOUR_USER/projects/wiki # ← replace with your actual path
pattern: "**/*.md"
ignore:
- "archive/**"
- "reports/**"
- "plans/**"
- "conversations/**"
- "scripts/**"
- "context/**"
wiki-archive:
path: /Users/YOUR_USER/projects/wiki/archive
pattern: "**/*.md"
includeByDefault: false
wiki-conversations:
path: /Users/YOUR_USER/projects/wiki/conversations
pattern: "**/*.md"
includeByDefault: false
ignore:
- "index.md"
YAML
```
On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.
Build the indexes:
```bash
qmd update # scan files into all three collections
qmd embed # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)
```
Verify:
```bash
qmd collection list
# Expected:
# wiki — N files
# wiki-archive — M files [excluded]
# wiki-conversations — K files [excluded]
```
The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.
**When to query which**:
```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5
# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5
# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5
# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```
---
## 4. Configure the Python scripts
Three scripts need per-user configuration:
### `scripts/extract-sessions.py` — `PROJECT_MAP`
This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`
where the hashed path is derived from the absolute path to your project.
Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:
```python
PROJECT_MAP: dict[str, str] = {
"projects-wiki": "wiki",
"-claude": "cl",
"my-webapp": "web", # map "mydir/my-webapp" → wing "web"
"mobile-app": "mob",
"work-monorepo": "work",
"-home": "general", # catch-all for unmatched sessions
}
```
Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.
### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`
Matching display names for every code in `PROJECT_MAP`:
```python
PROJECT_NAMES: dict[str, str] = {
"wiki": "WIKI — This Wiki",
"cl": "CL — Claude Config",
"web": "WEB — My Webapp",
"mob": "MOB — Mobile App",
"work": "WORK — Day Job",
"general": "General — Cross-Project",
}
PROJECT_ORDER = [
"work", "web", "mob", # most-active first
"wiki", "cl", "general",
]
```
### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`
Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns use `re.search`:
```python
SKIP_DOMAIN_PATTERNS = [
# ... (generic ones are already there)
r"\.mycompany\.com$",
r"^git\.mydomain\.com$",
]
```
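Because the patterns use `re.search`, anchoring matters — the `$` pins the match to the end of the domain string. A quick sanity check (illustrative patterns only):

```python
import re

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]

def should_skip(domain: str) -> bool:
    """True if the harvester should not fetch this domain."""
    return any(re.search(p, domain) for p in SKIP_DOMAIN_PATTERNS)
```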
---
## 5. Create the post-merge hook
The hook rebuilds the qmd index automatically after every `git pull`:
```bash
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
if command -v qmd &>/dev/null; then
echo "wiki: rebuilding qmd index..."
qmd update 2>/dev/null
# WSL / Linux: no GPU, force CPU-only embeddings
if [[ "$(uname -s)" == "Linux" ]]; then
NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
else
qmd embed 2>/dev/null
fi
echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge
```
`.git/hooks/` isn't tracked by git, so this step runs on every machine
where you clone the repo.
---
## 6. Backfill frontmatter (first-time setup or fresh clone)
If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:
```bash
cd ~/projects/wiki
# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill
# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page
changed = 0
for p in iter_live_pages():
if 'origin' not in p.frontmatter:
p.frontmatter['origin'] = 'manual'
write_page(p)
changed += 1
print(f'{changed} page(s) backfilled')
"
```
For a brand-new empty wiki, there's nothing to backfill — skip this step.
---
## 7. Run the pipeline manually once
Before setting up cron, do a full end-to-end dry run to make sure
everything's wired up:
```bash
cd ~/projects/wiki
# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only
# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude
# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex
# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```
Expected output from step 4: all three phases are reported, with phase 3
(qmd reindex) marked as skipped in dry-run mode, and a final `finished in Ns` line.
---
## 8. Cron setup (optional)
If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.
```bash
crontab -e
```
```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"
# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin
# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes
*/15 * * * * { /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push; } >> /tmp/wiki-sync.log 2>&1
# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1
# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1
# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1
# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1
# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```
Replace `YOUR_USER` and the node path as appropriate for your system.
**macOS note**: `cron` needs Full Disk Access if you're pointing it at
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
a plist — same effect, easier permission model on macOS.
**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.
**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.
---
## 9. Tell Claude Code about the wiki
Two separate CLAUDE.md files work together:
1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
schema the agent reads when working INSIDE the wiki. Tells it how to
maintain pages, apply frontmatter, handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
the agent reads on EVERY session (regardless of directory). Tells it
when and how to consult the wiki from any other project.
Both are provided as starter templates you can copy and adapt:
### (a) Wiki schema — copy to the wiki root
```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```
This file is ~200 lines. It defines:
- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
which question type
Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.
### (b) Global wake-up + query instructions
Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:
```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content
```
This adds:
- **Wake-Up Context** — read `context/wake-up.md` at session start
- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
example queries for each
- **Rules When Citing** — flag `confidence: low`, `status: pending`,
and archived pages to the user
Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.
---
## 10. Verify
```bash
cd ~/projects/wiki
# Sync state
bash scripts/wiki-sync.sh --status
# Search
qmd collection list
qmd search "test" --json -n 3 # won't return anything if wiki is empty
# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"
# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile
# Run the test suite
cd tests && python3 -m pytest
```
Expected:
- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds
---
## Troubleshooting
**qmd search returns nothing**
```bash
qmd collection list # verify path points at the right place
qmd update # rebuild index
qmd embed # rebuild embeddings
cat ~/.config/qmd/index.yml # verify config is correct for your machine
```
**qmd collection points at the wrong path**
Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.
**qmd returns archived pages in default searches**
Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.
**`claude -p` fails in cron ("not authenticated")**
Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.
**`wiki-harvest.py` fetch failures**
```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit
# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```
**`wiki-hygiene.py` archived a page unexpectedly**
Check `last_verified` vs decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:
```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```
**Both machines ran maintenance simultaneously**
Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
occur. Pick ONE machine for maintenance; disable the maintenance cron
on the other. Leave sync cron running on both so changes still propagate.
**Tests fail**
Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.
---
## Minimal install (skip everything except the idea)
If you want the conceptual wiki without any of the automation, all you
actually need is:
1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
[`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, Aider
session works)
Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."
You can bolt on the automation layer later if/when it becomes worth
the setup effort.

docs/examples/global-CLAUDE.md (new file, 161 lines)
# Global Claude Code Instructions — Wiki Section
**What this is**: Content to add to your global `~/.claude/CLAUDE.md`
(the user-level instructions Claude Code reads at the start of every
session, regardless of which project you're in). These instructions tell
Claude how to consult the wiki from outside the wiki directory.
**Where to paste it**: Append these sections to `~/.claude/CLAUDE.md`.
Don't overwrite the whole file — this is additive.
---
Copy everything below this line into your global `~/.claude/CLAUDE.md`:
---
## Wake-Up Context
At the start of each session, read `~/projects/wiki/context/wake-up.md`
for a briefing on active projects, recent decisions, and current
concerns. This provides conversation continuity across sessions.
## LLM Wiki — When to Consult It
**Before creating API endpoints, Docker configs, CI pipelines, or making
architectural decisions**, check the wiki at `~/projects/wiki/` for
established patterns and decisions.
The wiki captures the **why** behind patterns — not just what to do, but
the reasoning, constraints, alternatives rejected, and
environment-specific differences. It compounds over time as projects
discover new knowledge.
**When to read from the wiki** (query mode):
- Creating any operational endpoint (/health, /version, /status)
- Setting up secrets management in a new service
- Writing Dockerfiles or docker-compose configurations
- Configuring CI/CD pipelines
- Adding database users or migrations
- Making architectural decisions that should be consistent across projects
**When to write back to the wiki** (ingest mode):
- When you discover something new that should apply across projects
- When a project reveals an exception or edge case to an existing pattern
- When a decision is made that future projects should follow
- When the human explicitly says "add this to the wiki"
Human-initiated wiki writes go directly to the live wiki with
`origin: manual`. Script-initiated writes go through `staging/` first.
See the wiki's own `CLAUDE.md` for the full ingest protocol.
## LLM Wiki — How to Search It
Use the `qmd` CLI for fast, structured search. DO NOT read `index.md`
for large queries — it's only for full-catalog browsing. DO NOT grep the
wiki manually when `qmd` is available.
The wiki has **three qmd collections**. Pick the right one for the
question:
### Default collection: `wiki` (live content)
For "what's our current pattern for X?" type questions. This is the
default — no `-c` flag needed.
```bash
# Keyword search (fast, BM25)
qmd search "health endpoint version" --json -n 5
# Semantic search (finds conceptually related pages)
qmd vsearch "how should API endpoints be structured" --json -n 5
# Best quality — hybrid BM25 + vector + LLM re-ranking
qmd query "health endpoint" --json -n 5
# Then read the matched page
cat ~/projects/wiki/patterns/health-endpoints.md
```
### Archive collection: `wiki-archive` (stale / superseded)
For "what was our OLD pattern before we changed it?" questions. This is
excluded from default searches; query explicitly with `-c wiki-archive`.
```bash
# "Did we used to use Alpine? Why did we stop?"
qmd search "alpine" -c wiki-archive --json -n 5
# Semantic search across archive
qmd vsearch "container base image considerations" -c wiki-archive --json -n 5
```
When you cite content from an archived page, tell the user it's
archived and may be outdated.
### Conversations collection: `wiki-conversations` (mined session transcripts)
For "when did we discuss this, and what did we decide?" questions. This
is the mined history of your actual Claude Code sessions — decisions,
debugging breakthroughs, design discussions. Excluded from default
searches because transcripts would flood results.
```bash
# "When did we decide to use staging?"
qmd search "staging review workflow" -c wiki-conversations --json -n 5
# "What debugging did we do around Docker networking?"
qmd vsearch "docker network conflicts" -c wiki-conversations --json -n 5
```
Useful for:
- Tracing the reasoning behind a decision back to the session where it
was made
- Finding a solution to a problem you remember solving but didn't write
up
- Context-gathering when returning to a project after time away
### Searching across all collections
Rarely needed, but for "find everything on this topic across time":
```bash
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```
## LLM Wiki — Rules When Citing
1. **Always use `--json`** for structured qmd output. Never try to parse
prose.
2. **Flag `confidence: low` pages** to the user when citing. The content
may be aging out.
3. **Flag `status: pending` pages** (in `staging/`) as unverified when
citing: "Note: this is from a pending wiki page that has not been
human-reviewed yet."
4. **Flag archived pages** as "archived and may be outdated" when citing.
5. **Use `index.md` for browsing only**, not for targeted lookups. `qmd`
is faster and more accurate.
6. **Prefer semantic search for conceptual queries**, keyword search for
specific names/terms.
## LLM Wiki — Quick Reference
- `~/projects/wiki/CLAUDE.md` — Full wiki schema and operations (read this when working IN the wiki)
- `~/projects/wiki/index.md` — Content catalog (browse the full wiki)
- `~/projects/wiki/patterns/` — How things should be built
- `~/projects/wiki/decisions/` — Why we chose this approach
- `~/projects/wiki/environments/` — Where environments differ
- `~/projects/wiki/concepts/` — Foundational ideas
- `~/projects/wiki/raw/` — Immutable source material (never modify)
- `~/projects/wiki/staging/` — Pending automated content (flag when citing)
- `~/projects/wiki/archive/` — Stale content (flag when citing)
- `~/projects/wiki/conversations/` — Session history (search via `-c wiki-conversations`)
---
**End of additions for `~/.claude/CLAUDE.md`.**
See also the wiki's own `CLAUDE.md` at the wiki root — that file tells
the agent how to *maintain* the wiki when working inside it. This file
(the global one) tells the agent how to *consult* the wiki from anywhere
else.

@@ -0,0 +1,278 @@
# LLM Wiki — Schema
This is a persistent, compounding knowledge base maintained by LLM agents.
It captures the **why** behind patterns, decisions, and implementations —
not just the what. Copy this file to the root of your wiki directory
(e.g. `~/projects/wiki/CLAUDE.md`) and edit it for your own conventions.
> This is an example `CLAUDE.md` for the wiki root. The agent reads this
> at the start of every session when working inside the wiki. It's the
> "constitution" that tells the agent how to maintain the knowledge base.
## How This Wiki Works
**You are the maintainer.** When working in this wiki directory, you read
raw sources, compile knowledge into wiki pages, maintain cross-references,
and keep everything consistent.
**You are a consumer.** When working in any other project directory, you
read wiki pages to inform your work — applying established patterns,
respecting decisions, and understanding context.
## Directory Structure
```
wiki/
├── CLAUDE.md ← You are here (schema)
├── index.md ← Content catalog — read this FIRST on any query
├── log.md ← Chronological record of all operations
├── patterns/ ← LIVE: HOW things should be built (with WHY)
├── decisions/ ← LIVE: WHY we chose this approach (with alternatives rejected)
├── environments/ ← LIVE: WHERE implementations differ
├── concepts/ ← LIVE: WHAT the foundational ideas are
├── raw/ ← Immutable source material (NEVER modify)
│ └── harvested/ ← URL harvester output
├── staging/ ← PENDING automated content awaiting human review
│ ├── index.md
│ └── <type>/
├── archive/ ← STALE / superseded (excluded from default search)
│ ├── index.md
│ └── <type>/
├── conversations/ ← Mined Claude Code session transcripts
│ ├── index.md
│ └── <wing>/ ← per-project or per-person (MemPalace "wing")
├── context/ ← Auto-updated AI session briefing
│ ├── wake-up.md ← Loaded at the start of every session
│ └── active-concerns.md
├── reports/ ← Hygiene operation logs
└── scripts/ ← The automation pipeline
```
**Core rule — automated vs manual content**:
| Origin | Destination | Status |
|--------|-------------|--------|
| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
| Human-reviewed from staging | Live wiki (promoted) | `verified` |
Managed via `scripts/wiki-staging.py --list / --promote / --reject / --review`.
## Page Conventions
### Frontmatter (required on all wiki pages)
```yaml
---
title: Page Title
type: pattern | decision | environment | concept
confidence: high | medium | low
origin: manual | automated # How the page entered the wiki
sources: [list of raw/ files this was compiled from]
related: [list of other wiki pages this connects to]
last_compiled: YYYY-MM-DD # Date this page was last (re)compiled from sources
last_verified: YYYY-MM-DD # Date the content was last confirmed accurate
---
```
**`origin` values**:
- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.
**Confidence decay**: Pages with no refresh signal for 6 months decay `high → medium`; 9 months → `low`; 12 months → `stale` (auto-archived). `last_verified` drives decay, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
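The decay schedule above reduces to a pure function of page age. A minimal sketch (months are approximated as 30-day periods here; `wiki-hygiene.py` is the authoritative implementation):

```python
from datetime import date

def decayed_confidence(last_verified: date, today: date) -> str:
    """Map the age of `last_verified` to a confidence value, per the
    6/9/12-month thresholds. 30-day months are a simplification."""
    age_days = (today - last_verified).days
    if age_days >= 12 * 30:
        return "stale"   # auto-archived by the hygiene pass
    if age_days >= 9 * 30:
        return "low"
    if age_days >= 6 * 30:
        return "medium"
    return "high"
```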
### Staging Frontmatter (pages in `staging/<type>/`)
Automated-origin pages get additional staging metadata that is **stripped on promotion**:
```yaml
---
title: ...
type: ...
origin: automated
status: pending # Awaiting review
staged_date: YYYY-MM-DD # When the automated script staged this
staged_by: wiki-harvest # Which script staged it (wiki-harvest, wiki-hygiene, ...)
target_path: patterns/foo.md # Where it should land on promotion
modifies: patterns/bar.md # Only present when this is an update to an existing live page
compilation_notes: "..." # AI's explanation of what it did and why
harvest_source: https://... # Only present for URL-harvested content
sources: [...]
related: [...]
last_verified: YYYY-MM-DD
---
```
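The stripping step on promotion can be sketched in a few lines. The key set below is inferred from the schema above and is an assumption (`wiki-staging.py` is authoritative; whether `harvest_source` is also stripped is unspecified, so it is left out):

```python
# Staging-only keys per the frontmatter schema above. harvest_source
# is deliberately excluded: its fate on promotion is unspecified.
STAGING_ONLY_KEYS = {
    "status", "staged_date", "staged_by",
    "target_path", "modifies", "compilation_notes",
}

def promote_frontmatter(fm: dict) -> dict:
    """Return a copy of the frontmatter with staging metadata removed,
    as happens when a pending page is promoted to the live wiki."""
    return {k: v for k, v in fm.items() if k not in STAGING_ONLY_KEYS}
```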
### Pattern Pages (`patterns/`)
Structure:
1. **What** — One-paragraph description of the pattern
2. **Why** — The reasoning, constraints, and goals that led to this pattern
3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
4. **Structure** — The specification: fields, endpoints, formats, conventions
5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
6. **History** — Key changes and the decisions that drove them
### Decision Pages (`decisions/`)
Structure:
1. **Decision** — One sentence: what we decided
2. **Context** — What problem or constraint prompted this
3. **Options Considered** — What alternatives existed (with pros/cons)
4. **Rationale** — Why this option won
5. **Consequences** — What this decision enables and constrains
6. **Status** — Active | Superseded by [link] | Under Review
### Environment Pages (`environments/`)
Structure:
1. **Overview** — What this environment is (platform, CI, infra)
2. **Key Differences** — Table comparing environments for this domain
3. **Implementation Details** — Environment-specific configs, credentials, deploy method
4. **Gotchas** — Things that have bitten us
### Concept Pages (`concepts/`)
Structure:
1. **Definition** — What this concept means in our context
2. **Why It Matters** — How this concept shapes our decisions
3. **Related Patterns** — Links to patterns that implement this concept
4. **Related Decisions** — Links to decisions driven by this concept
## Operations
### Ingest (adding new knowledge)
When a new raw source is added or you learn something new:
1. Read the source material thoroughly
2. Identify which existing wiki pages need updating
3. Identify if new pages are needed
4. Update/create pages following the conventions above
5. Update cross-references (`related:` frontmatter) on all affected pages
6. Update `index.md` with any new pages
7. Set `last_verified:` to today's date on every page you create or update
8. Set `origin: manual` on any page you create when a human directed you to
9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`
**Where to write**:
- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.
### Query (answering questions from other projects)
When working in another project and consulting the wiki:
1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
2. Read the specific pattern/decision/concept pages
3. Apply the knowledge, respecting environment differences
4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)
### Search Strategy — which qmd collection to use
The wiki has three qmd collections. Pick the right one for the question:
| Question type | Collection | Command |
|---|---|---|
| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |
**Rules of thumb**:
- Use `qmd search` for keyword matches (BM25, fast)
- Use `qmd vsearch` for conceptual / semantically-similar queries (vector)
- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
- Always use `--json` for structured output
- Read individual matched pages with `cat` or your file tool after finding them
### Mine (conversation extraction and summarization)
Four-phase pipeline that extracts sessions into searchable conversation pages:
1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
3. **Index** (`update-conversation-index.py --reindex`) — Regenerate conversation index + `context/wake-up.md`
4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages
Full pipeline via `mine-conversations.sh`. Extraction is incremental (tracks byte offsets). Summarization is incremental (tracks message count).
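The incremental-extraction idea reduces to tracking a byte offset per session file. A minimal, pure sketch (the real `extract-sessions.py` persists the offsets in a state file and may differ in detail):

```python
def extract_new(data: bytes, state: dict, key: str) -> list[str]:
    """Return only the lines appended since the offset recorded for
    `key`, then advance the offset. extract-sessions.py persists the
    state dict on disk between runs; here it is kept in memory."""
    offset = state.get(key, 0)
    chunk = data[offset:]
    state[key] = len(data)
    return chunk.decode("utf-8").splitlines()
```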
### Maintain (wiki health automation)
`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:
```bash
bash scripts/wiki-maintain.sh # Harvest + quick hygiene + reindex
bash scripts/wiki-maintain.sh --full # Harvest + full hygiene (LLM) + reindex
bash scripts/wiki-maintain.sh --harvest-only # Harvest only
bash scripts/wiki-maintain.sh --hygiene-only # Hygiene only
bash scripts/wiki-maintain.sh --dry-run # Show what would run
```
### Lint (periodic health check)
Automated via `scripts/wiki-hygiene.py`. Two tiers:
**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):
- Backfill missing `last_verified`
- Refresh `last_verified` from conversation `related:` references
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing required fields, invalid values)
- Confidence decay per 6/9/12-month thresholds
- Archive stale and superseded pages
- Orphan pages (auto-linked into `index.md`)
- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
- Main index drift (auto add missing entries, remove stale ones)
- Empty stubs (report-only)
- State file drift (report-only)
- Staging/archive index resync
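The broken-cross-reference repair mentioned above can be sketched with `difflib` directly (the 0.8 cutoff is an assumption; `wiki-hygiene.py` picks its own threshold):

```python
import difflib

def repair_reference(broken, live_pages):
    """Suggest a fix for a broken cross-reference by fuzzy-matching it
    against live page paths; return None when nothing is close enough
    (those land in the needs-review report instead)."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.8)
    return matches[0] if matches else None
```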
**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):
- Everything in quick mode, plus:
- Missing cross-references between related pages (haiku)
- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
- Contradictions between pages (sonnet, report-only)
- Technology lifecycle — flag pages referencing versions older than what's in recent conversations
**Reports** (written to `reports/`):
- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment
## Cross-Reference Conventions
- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
- Link to raw sources: `[Source](../raw/path/to/file.md)`
- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`
## Naming Conventions
- Filenames: `kebab-case.md`
- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)
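A title-to-filename helper consistent with these conventions might look like this (a minimal sketch, not the pipeline's exact slug rules):

```python
import re

def slug(title: str) -> str:
    """Kebab-case a page title into a filename per the conventions
    above: lowercase, non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") + ".md"
```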
## Customization Notes
Things you should change for your own wiki:
1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.
2. **Page-type sections** — the "Structure" blocks under each page type reflect my own conventions. Define yours.
3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.
4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).
5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.