Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
Commit ee54a2f5d4 by Eric Turner, 2026-04-12 21:16:02 -06:00
31 changed files with 10792 additions and 0 deletions

docs/ARCHITECTURE.md (new file, 360 lines)

# Architecture
Eleven scripts across three conceptual layers. This document walks through
what each one does, how they talk to each other, and where the seams are
for customization.
> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
> behind each component, with links to the interactive design artifact.
## Borrowed concepts
The architecture is a synthesis of two external ideas with an automation
layer on top. The terminology often maps 1:1, so it's worth calling out
which concepts came from where:
### From Karpathy's persistent-wiki gist
| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |
### From [mempalace](https://github.com/milla-jovovich/mempalace)
MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:
| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |
The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K-token ceiling without needing a full vector DB rebuild.
### What this repo adds
Automation + lifecycle management on top of both:
- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union
---
## Overview
```
┌─────────────────────────────────┐
│           SYNC LAYER            │
│  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
└─────────────────────────────────┘

┌─────────────────────────────────┐
│          MINING LAYER           │
│  extract-sessions.py            │  (Claude Code JSONL → markdown)
│  summarize-conversations.py     │  (LLM classify + summarize)
│  update-conversation-index.py   │  (regenerate indexes + wake-up)
│  mine-conversations.sh          │  (orchestrator)
└─────────────────────────────────┘

┌─────────────────────────────────┐
│        AUTOMATION LAYER         │
│  wiki_lib.py (shared helpers)   │
│  wiki-harvest.py                │  (URL → raw → staging)
│  wiki-staging.py                │  (human review)
│  wiki-hygiene.py                │  (decay, archive, repair, checks)
│  wiki-maintain.sh               │  (orchestrator)
└─────────────────────────────────┘
```
Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.
---
## Mining layer
### `extract-sessions.py`
Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially-processed sessions.
Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction,
it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
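The incremental re-run behavior can be sketched in a few lines. This is a simplified illustration, not the script's actual code: the state-file shape and function names here are assumptions, but the byte-offset idea is the one described above.

```python
import json
from pathlib import Path

STATE_FILE = Path(".mine-state.json")  # per-machine, gitignored (per the docs)

def load_state() -> dict:
    # Map of session-file path -> byte offset already processed.
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def read_new_lines(session_path: Path, state: dict) -> list[str]:
    """Return only the JSONL lines appended since the last run."""
    offset = state.get(str(session_path), 0)
    data = session_path.read_bytes()
    if offset > len(data):       # file truncated or rotated: start over
        offset = 0
    new = data[offset:]
    state[str(session_path)] = len(data)   # advance the recorded offset
    return new.decode("utf-8", errors="replace").splitlines()
```

Because only the suffix past the stored offset is read, a resumed session yields just its new messages on the next run.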
### `summarize-conversations.py`
Sends extracted transcripts to an LLM for classification and summarization.
Supports two backends:
1. **`--claude` mode** (recommended): Uses `claude -p` with
haiku for short sessions (≤200 messages) and sonnet for longer ones.
Runs chunked over long transcripts, keeping a rolling context window.
2. **Local LLM mode** (default, omit `--claude`): Uses a local
   `llama-server` instance at `localhost:8080` (or the WSL gateway address
   on port 8081 under Windows Subsystem for Linux). Requires llama.cpp
   installed and a GGUF model loaded.
Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki
page links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.
### `update-conversation-index.py`
Regenerates three files from the summarized conversations:
- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
of every session ("current focus areas, recent decisions, active
concerns")
- `context/active-concerns.md` — longer-form current state
The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.
### `mine-conversations.sh`
Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.
---
## Automation layer
### `wiki_lib.py`
The shared library. Everything in the automation layer imports from here.
Provides:
- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.
All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
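For illustration, a minimal parser in the spirit of `parse_yaml_lite` might look like the sketch below. This is not the real implementation; it handles only the scalar, inline-list, and block-list shapes that frontmatter typically uses.

```python
def parse_yaml_lite(text: str) -> dict:
    """Parse the scalar and flat-list YAML subset typical of frontmatter."""
    result: dict = {}
    current_list = None                  # key of the block list being collected
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith("- "):    # item in a block list
            if current_list is not None:
                result[current_list].append(stripped[2:].strip())
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if value == "":                  # bare "key:" means a block list follows
            result[key] = []
            current_list = key
        elif value.startswith("[") and value.endswith("]"):
            result[key] = [v.strip() for v in value[1:-1].split(",") if v.strip()]
            current_list = None
        else:
            result[key] = value
            current_list = None
    return result
```

The trade-off is deliberate: no external YAML dependency, in exchange for supporting only the patterns the wiki actually writes.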
### `wiki-harvest.py`
Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages.
URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools
Fetch cascade (tries in order, validates at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
URL was mentioned during the session
Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording source URL, fetch method, and a content hash.
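The cascade-with-validation control flow is simple to express generically. In the real script each fetcher shells out to the commands listed above; in this sketch the fetchers are plain callables, and the `min_chars` validation rule is illustrative.

```python
from typing import Callable, Optional

def fetch_cascade(url: str,
                  fetchers: list[Callable[[str], Optional[str]]],
                  min_chars: int = 200) -> Optional[str]:
    """Try each fetcher in order; accept the first result that validates."""
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue                 # a failed fetcher just falls through
        if content and len(content) >= min_chars:
            return content           # validated: stop the cascade
    return None
```

A fetcher that raises, returns nothing, or returns too little content simply hands off to the next rung, which is what lets the stealth and transcript fallbacks exist without special-casing.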
Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:
- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough
Result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.
### `wiki-staging.py`
Pure file operations — no LLM calls. Human review pipeline for automated
content.
Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve, strip staging fields, move to live, update the
  main index, rewrite cross-refs, preserve `origin: automated` as an audit trail
- `--reject <path> --reason "..."` — delete, record in the
  `.harvest-state.json` `rejected_urls` list so the harvester won't re-create it
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
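The frontmatter-stripping half of promotion might look like this sketch. The exact set of staging-only field names is an assumption; the point is that staging metadata is removed while `origin: automated` survives into the live page.

```python
# Hypothetical staging-only field names -- not the script's actual list.
STAGING_ONLY_FIELDS = {"status", "staged_at", "compile_verdict", "modifies"}

def promote_frontmatter(fm: dict) -> dict:
    """Return frontmatter with staging-specific fields stripped.

    `origin` is intentionally NOT in STAGING_ONLY_FIELDS, so the
    `origin: automated` audit trail is preserved on the live page.
    """
    return {k: v for k, v in fm.items() if k not in STAGING_ONLY_FIELDS}
```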
### `wiki-hygiene.py`
The heavy lifter. Two modes:
**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to index
- Detect broken cross-references, fuzzy-match to the intended target
via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync
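The broken-reference repair is essentially a one-liner around `difflib.get_close_matches`. A sketch (the 0.6 cutoff here is illustrative, not necessarily the script's setting):

```python
import difflib
from typing import Optional

def fix_broken_ref(broken: str, live_pages: list[str]) -> Optional[str]:
    """Fuzzy-match a broken wiki link to its most likely intended target."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.6)
    return matches[0] if matches else None
```

Anything below the cutoff stays unmatched, so a genuinely dead link is reported rather than silently rewritten to a wrong target.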
**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — weaker merged into stronger, auto-archives
the loser with `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
mentioning `Node 18` when recent conversations are using `Node 20`)
State lives in `.hygiene-state.json` — tracks content hashes per page so
full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.
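The hash-based skip can be modeled as below. This is a simplified sketch (the real state file also tracks deferred issues and timestamps), but it shows why the hash is body-only: frontmatter churn alone never triggers an expensive LLM re-check.

```python
import hashlib

def body_hash(body: str) -> str:
    """Body-only SHA-256, so frontmatter edits don't invalidate the cache."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def pages_needing_full_check(pages: dict[str, str],
                             state: dict[str, str]) -> list[str]:
    """Paths whose body changed since the hash recorded in state."""
    return [path for path, body in pages.items()
            if state.get(path) != body_hash(body)]
```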
### `wiki-maintain.sh`
Top-level orchestrator:
```
Phase 1: wiki-harvest.py (unless --hygiene-only)
Phase 2: wiki-hygiene.py (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed (unless --no-reindex or --dry-run)
```
Flags pass through to child scripts. Error-tolerant: if one phase fails,
the others still run. Logs to `scripts/.maintain.log`.
---
## Sync layer
### `wiki-sync.sh`
Git-based sync for cross-machine use. Commands:
- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report
The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).
---
## State files
Three JSON files track per-pipeline state:
| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
Harvest and hygiene state need to sync across machines so both
installations agree on what's been processed. Mining state is per-machine
because Claude Code session files live at OS-specific paths.
---
## Module dependency graph
```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─┬─> wiki-harvest.py
                  ├─> wiki-hygiene.py
                  └─> qmd (external)

mine-conversations.sh ─┬─> extract-sessions.py
                       ├─> summarize-conversations.py
                       └─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```
`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.
---
## Extension seams
The places to modify when customizing:
1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
`SUMMARIZE_TOOLS`, `MAX_BASH_OUTPUT_LINES` to tune transcript shape.
2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
`PROJECT_ORDER` control how the index groups conversations.
3. **`scripts/wiki-harvest.py`** —
- `SKIP_DOMAIN_PATTERNS` — your internal domains
- `C_TYPE_URL_PATTERNS` — URL shapes that need topic-match before harvesting
- `FETCH_DELAY_SECONDS` — rate limit between fetches
- `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
- `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet
4. **`scripts/wiki-hygiene.py`** —
- `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
— decay thresholds in days
- `EMPTY_STUB_THRESHOLD` — what counts as a stub
- `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
- `REQUIRED_FIELDS` — frontmatter fields the repair step enforces
5. **`scripts/summarize-conversations.py`** —
- `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
- `MINE_PROMPT_FILE` — the LLM system prompt for summarization
- Backend selection (claude vs llama-server)
6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
every session. This is where you tell the agent how to maintain the
wiki, what conventions to follow, when to flag things to you.
See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.

docs/CUSTOMIZE.md (new file, 432 lines)

# Customization Guide
This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.
## What's actually required for the core idea
The minimum viable compounding wiki is:
1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
maintain the wiki
**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
---
## Adapting for non-Claude-Code agents
Four script components are Claude-specific. Each has a natural
replacement path:
### 1. `extract-sessions.py` — Claude Code JSONL parsing
**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.
**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.
**Replacements**:
- **Cursor**: Cursor stores chat history in `~/Library/Application
Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
equivalent `extract-sessions.py` that queries that SQLite and produces
the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
each project directory. A much simpler extractor: walk all project
directories, read each `.aider.chat.history.md`, split on session
boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
tool uses — the target format is a markdown file with a specific
frontmatter shape (`title`, `type: conversation`, `project`, `date`,
`status: extracted`, `messages: N`, body of user/assistant turns).
Anything that produces files in that shape will flow through the rest
of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
by hand and set `status: extracted` yourself.
The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
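The Aider path above is simple enough to sketch. Note the `# aider chat started at` banner used as the session boundary here is an assumption about Aider's history format; check it against your own `.aider.chat.history.md` files before relying on it.

```python
import re

def split_aider_sessions(history_text: str) -> list[str]:
    """Split one .aider.chat.history.md file into per-session chunks.

    Assumes each session opens with a '# aider chat started at ...'
    banner line; everything between banners is one session body.
    """
    parts = re.split(r"(?m)^# aider chat started at .*$", history_text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk would then be written to `conversations/<project>/` with the frontmatter shape described above (`title`, `type: conversation`, `project`, `date`, `status: extracted`, `messages: N`).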
### 2. `summarize-conversations.py` — `claude -p` summarization
**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.
**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.
**Replacements**:
- **OpenAI**: Replace the `call_claude` helper with a function that calls
`openai` Python SDK or `gpt` CLI. Use gpt-4o-mini for short
conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
`--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
`http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: `AI_BASE_URL` and `AI_MODEL` env
vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
hand to set `status: summarized` and add your own `topics`/`related`
frontmatter. Tedious but works for a small wiki.
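The short-vs-long routing and the request shape generalize to any OpenAI-compatible backend. A sketch, with model names taken from the list above and the HTTP transport omitted; the function names and the `threshold` default (mirroring the documented ≤200-message cutoff) are illustrative:

```python
def pick_model(message_count: int,
               short_model: str = "gpt-4o-mini",
               long_model: str = "gpt-4o",
               threshold: int = 200) -> str:
    """Mirror the haiku/sonnet routing: cheap model for short sessions."""
    return short_model if message_count <= threshold else long_model

def build_chat_request(model: str, system: str, transcript: str) -> dict:
    """Payload for an OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": transcript},
        ],
    }
```

POST the dict to `$AI_BASE_URL/chat/completions` with whatever HTTP client you prefer; the same payload works against OpenAI, Ollama, and llama-server.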
### 3. `wiki-harvest.py` — AI compile step
**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Any other LLM**: Replace `call_claude_compile()` with a function that
calls your preferred backend. The prompt template
(`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
and the harvester will save raw content to `raw/harvested/` without
trying to compile it. You can then manually (or via a different script)
turn the raw content into wiki pages.
### 4. `wiki-hygiene.py --full` — LLM-powered checks
**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
default). Quick mode has no LLM calls and catches 90% of structural
issues. Contradictions and duplicates just have to be caught by human
review during `wiki-staging.py --review` sessions.
### 5. `CLAUDE.md` at the wiki root
**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.
**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.
**Replacements**:
| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |
The content is the same — just rename the file and point your agent at
it.
---
## Running without cron
Cron is convenient but not required. Alternatives:
### Manual runs
Just call the scripts when you want the wiki updated:
```bash
cd ~/projects/wiki
# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh
# When you want hygiene + harvest
bash scripts/wiki-maintain.sh
# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```
This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.
### systemd timers (Linux)
More observable than cron, better journaling:
```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline
[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```
```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service # see logs
```
### launchd (macOS)
More native than cron on macOS:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key><string>com.user.wiki-maintain</string>
<key>ProgramArguments</key>
<array>
<string>/bin/bash</string>
<string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
</array>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key><integer>3</integer>
<key>Minute</key><integer>0</integer>
</dict>
<key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
<key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```
```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki # verify
```
### Git hooks (pre-push)
Run hygiene before every push so the wiki is always clean when it hits
the remote:
```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```
Downside: every push is slow. Upside: you never push a broken wiki.
### CI pipeline
Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:
```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
hygiene:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
- run: python3 scripts/wiki-hygiene.py --check-only
```
`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.
---
## Minimal subsets
You don't have to run the whole pipeline. Pick what's useful:
### "Just the wiki" (no automation)
- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent
This is the Karpathy-gist version. Works great for small wikis.
### "Wiki + mining" (no harvesting, no hygiene)
- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually
Useful if you want session continuity (the wake-up briefing) without
the full automation.
### "Wiki + hygiene" (no mining, no harvesting)
- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues
Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.
### "Wiki + harvesting" (no session mining)
- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete mining scripts
- Source URLs manually — put them in a file and point the harvester at
it. You'd need to write a wrapper that extracts URLs from your source
file and feeds them into the fetch cascade.
Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).
---
## Schema customization
The repo uses these live content types:
- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ
These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:
1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
`scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
Examples of alternative schemas:
**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources
**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure
**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers
None of these are better or worse — pick what matches how you think.
---
## Frontmatter customization
The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.
Useful additions you might want:
```yaml
author: alice # who wrote or introduced the page
tags: [auth, security] # flat tag list
urgency: high # for to-do-style wiki pages
stakeholders: # who cares about this page
- product-team
- security-team
review_by: 2026-06-01 # explicit review date instead of age-based decay
```
If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.
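For orientation, a hypothetical `expected_confidence()` keyed off `last_verified` might look like the sketch below; changing the decay basis means changing which frontmatter field feeds the `last_verified` argument. The thresholds mirror the documented 6/9/12-month curve, but this is an illustration, not the script's actual code.

```python
from datetime import date

# Thresholds in days, mirroring the documented 6/9/12-month decay curve.
DECAY_HIGH_TO_MEDIUM = 180
DECAY_MEDIUM_TO_LOW = 270
DECAY_LOW_TO_STALE = 365

def expected_confidence(last_verified: date, today: date) -> str:
    """What confidence a page *should* have, given time since verification."""
    age = (today - last_verified).days
    if age >= DECAY_LOW_TO_STALE:
        return "stale"          # candidate for archiving
    if age >= DECAY_MEDIUM_TO_LOW:
        return "low"
    if age >= DECAY_HIGH_TO_MEDIUM:
        return "medium"
    return "high"
```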
---
## Working across multiple wikis
The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:
```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh
# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh
# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```
Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.
---
## What I'd change if starting over
Honest notes on the design choices, in case you're about to fork:
1. **Config should be in YAML, not inline constants.** I bolted a
"CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
as a shortcut. Better: a `config.yaml` at the wiki root that all
scripts read.
2. **The mining layer is tightly coupled to Claude Code.** A cleaner
design would put a `Session` interface in `wiki_lib.py` and have
extractors for each agent produce `Session` objects — the rest of the
pipeline would be agent-agnostic.
3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
`wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
file and grew.
4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
awkward.** Standard Python convention is underscores. I used hyphens
for consistency with the shell scripts, and `conftest.py` has a
module-loader workaround. A cleaner fork would use underscores
everywhere.
5. **The wiki schema assumes you know what you want to catalog.** If
you don't, start with a free-form `notes/` directory and let
categories emerge organically, then refactor into `patterns/` etc.
later.
None of these are blockers. They're all "if I were designing v2"
observations.

docs/DESIGN-RATIONALE.md (new file, 338 lines)

# Design Rationale — Signal & Noise
Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."
Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:
> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.
---
## Where the pattern is genuinely strong
The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:
| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |
---
## Where the pattern is genuinely weak — and how this repo answers
The analysis identified seven real weaknesses. Five have direct
mitigations in this repo; two remain open trade-offs you should be aware
of.
### 1. Errors persist and compound
**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.
**How this repo mitigates**:
- **`confidence` field** — every page carries `high`/`medium`/`low` with
decay based on `last_verified`. Wrong claims aren't treated as
permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
they're excluded from default search. If they get referenced again
they're auto-restored with `confidence: medium` (never straight to
`high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
are the ground truth. Every compiled wiki page can be traced back to
its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
sonnet to find conflicting claims across pages. Report-only (humans
decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
Nothing enters the live wiki without human approval, so errors have
two chances to get caught (AI compile + human review) before they
become persistent.
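Taken together, these mitigations live in a page's frontmatter. A sketch of what that looks like — field names come from this document; the values and exact layout are illustrative, not the repo's canonical schema:

```yaml
title: docker-healthchecks
confidence: medium          # high/medium/low — decays as last_verified ages
last_verified: 2026-03-01   # bumped when a conversation references the page
origin: automated           # manual pages skip staging; automated ones don't
sources:
  - raw/harvested/2026-02-14-docker-docs.md  # immutable ground truth
```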
### 2. Hard scale ceiling at ~50K tokens
**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.
**How this repo mitigates**:
- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
up in the default configuration so the agent never has to load the
full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
project code (wing) and topic (room, via the `topics:` frontmatter).
Retrieval is pre-narrowed to the relevant wing before search runs.
This extends the effective ceiling because `qmd` works on a relevant
subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
`includeByDefault: false`, so archived pages don't eat context until
explicitly queried.
### 3. Manual cross-checking burden returns in precision-critical domains
**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.
**How this repo mitigates**:
- **Staging workflow** — every automated page goes through human review.
For precision-critical content, that review IS the cross-check. The
AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
explanation of what it did and why. Makes review faster — you can
spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
(via `CLAUDE.md`) tell it to flag low-confidence content when
citing. Humans see the warning before acting.
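The traceability claim is mechanically simple: verification is a hash comparison against the raw file. A minimal sketch, assuming the stored `content_hash` is a hex SHA-256 digest (the repo's actual check lives in its scripts):

```python
import hashlib

def content_hash(path: str) -> str:
    """SHA-256 hex digest of a raw source file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def source_unchanged(path: str, stored_hash: str) -> bool:
    """A wiki claim's cited source is intact if the stored hash still matches."""
    return content_hash(path) == stored_hash
```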
**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
### 4. Knowledge staleness without active upkeep
**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.
**How this repo mitigates** (this is the biggest thing):
- **Automation replaces human discipline** — daily cron runs
`wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
`--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
manually. Every Claude Code session becomes potential ingest. The
feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
summarizer links a conversation to a wiki page via `related:`, the
hygiene script picks that up and bumps `last_verified`. Pages stay
fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
for 6/9/12 months get downgraded and eventually archived. The wiki
self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
flags the things that *do* need human judgment. Everything else is
auto-fixed.
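The decay thresholds reduce to simple date arithmetic. A sketch of the 6/9/12-month ladder — illustrative only; the real script's exact rules and field handling may differ:

```python
from datetime import date

def decay_action(last_verified: date, today: date) -> str:
    """Map staleness to an action: 6 months -> medium, 9 -> low, 12 -> archive."""
    months = (today.year - last_verified.year) * 12 + (today.month - last_verified.month)
    if months >= 12:
        return "archive"
    if months >= 9:
        return "confidence: low"
    if months >= 6:
        return "confidence: medium"
    return "fresh"
```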
This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.
### 5. Cognitive outsourcing risk
**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.
**How this repo mitigates**:
- **Staging review is a forcing function** — you see every automated
page before it lands. Even skimming forces engagement with the
material.
- **`qmd query "..."` for exploration** — searching the wiki is an
active process, not passive retrieval. You're asking questions, not
pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
the agent reads at session start. You read it too (or the agent reads
it to you) — ongoing re-exposure to your own knowledge base.
**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.
### 6. Weaker semantic retrieval than RAG at scale
**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.
**How this repo mitigates**:
- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
(wings) and topic frontmatter narrow the search space before the
hybrid search runs. Structure + semantics is stronger than either
alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
LLM to find pages that *should* link to each other but don't, then
auto-adds them. This is the explicit-linking approach catching up to
semantic retrieval over time.
**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. No access control or multi-user support
**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.
**How this repo mitigates**:
- **Git-based sync with merge-union** — concurrent writes on different
machines auto-resolve because markdown is set to `merge=union` in
`.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
deployment is over Tailscale or a VPN, so the network does the work a
RBAC layer would otherwise do. Not enterprise-grade, but sufficient
for personal/family/small-team use.
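The merge-union behavior is a single `.gitattributes` line (assuming markdown is the only union-merged type here):

```
*.md merge=union
```

With `merge=union`, git keeps both sides' lines on a concurrent edit instead of raising a conflict — a reasonable fit for append-heavy markdown, though it can interleave lines oddly when both sides edit the same paragraph.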
**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.
---
## The #1 failure mode — active upkeep
Every other weakness has a mitigation. *Lapsed upkeep is the failure
mode that kills wikis in the wild.* The community data is unambiguous:
- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks
The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:
| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.
---
## What was borrowed from where
This repo is a synthesis of two ideas with an automation layer on top:
### From Karpathy
- The core pattern: LLM-maintained persistent wiki, compile at ingest
time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time
### From mempalace
- **Wings** = per-person or per-project namespaces → this repo uses
project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
`conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
preference / advice / tooling) → the `halls:` frontmatter field
classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
conversation
- **Drawers** = verbatim archive, never lost → the extracted
conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
boost over flat search
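Concretely, the taxonomy lands as frontmatter on each summarized conversation. A sketch — field names from the mapping above, values purely illustrative:

```yaml
project: web                 # wing
topics: [auth, sessions]     # rooms
halls: [decision, tooling]   # memory-type corridors
related:
  - patterns/session-handling.md   # tunnel to a wiki page
```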
The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds
- **Automation layer** tying the pieces together with cron-friendly
orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
content
- **Confidence decay + auto-archive + auto-restore** as the "retention
curve" that community analysis identified as critical for long-term
wiki health
- **`qmd` integration** as the scalable search layer (chosen over
ChromaDB because it uses the same markdown storage as the wiki —
one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
wiki lives on multiple machines without merge hell
---
## Honest residual trade-offs
Five items from the analysis that this repo doesn't fully solve and
where you should know the limits:
1. **Enterprise scale** — this is a personal/small-team tool. Millions
of documents, hundreds of users, RBAC, compliance: wrong
architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
   is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
use this as a drafting tool, not a source of truth. Human
domain-expert review is not replaceable.
5. **Access control** — network boundary (Tailscale) is the fastest
path; nothing in the repo itself enforces permissions.
If any of these are dealbreakers for your use case, a different
architecture is probably what you need.
---
## Further reading
- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
— the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
— the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups

docs/SETUP.md (new file, 502 lines)
# Setup Guide
Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).
Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.
---
## 1. Prerequisites
### Required
- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
API key. Required for summarization and the harvester's AI compile step.
Without `claude`, you can still use the wiki, but the automation layer
falls back to manual or local-LLM paths.
### Python tools (recommended via `pipx`)
```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup # installs Playwright browsers
```
Verify: `trafilatura --version` and `crwl --help` should both work.
### Optional
- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
summarization path instead of `claude -p`
---
## 2. Clone the repo
```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```
The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.
---
## 3. Configure qmd search
`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:
- **`wiki`** — live content (patterns/decisions/concepts/environments),
staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
search; query explicitly with `-c wiki-archive` when digging into
history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
Excluded from default search because they'd flood results with noisy
tool-call output; query explicitly with `-c wiki-conversations` when
looking for "what did I discuss about X last month?"
```bash
npm install -g @tobilu/qmd
```
Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:
```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
wiki:
path: /Users/YOUR_USER/projects/wiki # ← replace with your actual path
pattern: "**/*.md"
ignore:
- "archive/**"
- "reports/**"
- "plans/**"
- "conversations/**"
- "scripts/**"
- "context/**"
wiki-archive:
path: /Users/YOUR_USER/projects/wiki/archive
pattern: "**/*.md"
includeByDefault: false
wiki-conversations:
path: /Users/YOUR_USER/projects/wiki/conversations
pattern: "**/*.md"
includeByDefault: false
ignore:
- "index.md"
YAML
```
On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.
Build the indexes:
```bash
qmd update # scan files into all three collections
qmd embed # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)
```
Verify:
```bash
qmd collection list
# Expected:
# wiki — N files
# wiki-archive — M files [excluded]
# wiki-conversations — K files [excluded]
```
The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.
**When to query which**:
```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5
# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5
# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5
# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```
---
## 4. Configure the Python scripts
Three scripts need per-user configuration:
### `scripts/extract-sessions.py` — `PROJECT_MAP`
This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`
where the hashed path is derived from the absolute path to your project.
Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:
```python
PROJECT_MAP: dict[str, str] = {
"projects-wiki": "wiki",
"-claude": "cl",
"my-webapp": "web", # map "mydir/my-webapp" → wing "web"
"mobile-app": "mob",
"work-monorepo": "work",
"-home": "general", # catch-all for unmatched sessions
}
```
Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.
### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`
Matching display names for every code in `PROJECT_MAP`:
```python
PROJECT_NAMES: dict[str, str] = {
"wiki": "WIKI — This Wiki",
"cl": "CL — Claude Config",
"web": "WEB — My Webapp",
"mob": "MOB — Mobile App",
"work": "WORK — Day Job",
"general": "General — Cross-Project",
}
PROJECT_ORDER = [
"work", "web", "mob", # most-active first
"wiki", "cl", "general",
]
```
### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`
Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns use `re.search`:
```python
SKIP_DOMAIN_PATTERNS = [
# ... (generic ones are already there)
r"\.mycompany\.com$",
r"^git\.mydomain\.com$",
]
```
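Because the patterns use `re.search`, anchoring matters — the `$` pins the match to the end of the domain string. A quick sanity check (illustrative patterns only):

```python
import re

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]

def should_skip(domain: str) -> bool:
    """True if the harvester should not fetch this domain."""
    return any(re.search(p, domain) for p in SKIP_DOMAIN_PATTERNS)
```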
---
## 5. Create the post-merge hook
The hook rebuilds the qmd index automatically after every `git pull`:
```bash
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
if command -v qmd &>/dev/null; then
echo "wiki: rebuilding qmd index..."
qmd update 2>/dev/null
# WSL / Linux: no GPU, force CPU-only embeddings
if [[ "$(uname -s)" == "Linux" ]]; then
NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
else
qmd embed 2>/dev/null
fi
echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge
```
`.git/hooks/` isn't tracked by git, so this step runs on every machine
where you clone the repo.
---
## 6. Backfill frontmatter (first-time setup or fresh clone)
If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:
```bash
cd ~/projects/wiki
# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill
# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page
changed = 0
for p in iter_live_pages():
if 'origin' not in p.frontmatter:
p.frontmatter['origin'] = 'manual'
write_page(p)
changed += 1
print(f'{changed} page(s) backfilled')
"
```
For a brand-new empty wiki, there's nothing to backfill — skip this step.
---
## 7. Run the pipeline manually once
Before setting up cron, do a full end-to-end dry run to make sure
everything's wired up:
```bash
cd ~/projects/wiki
# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only
# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude
# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex
# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```
Expected output from step 4: all three phases are reported, with phase 3
(qmd reindex) marked as skipped in dry-run mode, and a final `finished in Ns` line.
---
## 8. Cron setup (optional)
If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.
```bash
crontab -e
```
```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"
# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin
# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes
*/15 * * * * { /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push; } >> /tmp/wiki-sync.log 2>&1
# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1
# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1
# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1
# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1
# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```
Replace `YOUR_USER` and the node path as appropriate for your system.
**macOS note**: `cron` needs Full Disk Access if you're pointing it at
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
a plist — same effect, easier permission model on macOS.
**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.
**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.
---
## 9. Tell Claude Code about the wiki
Two separate CLAUDE.md files work together:
1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
schema the agent reads when working INSIDE the wiki. Tells it how to
maintain pages, apply frontmatter, handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
the agent reads on EVERY session (regardless of directory). Tells it
when and how to consult the wiki from any other project.
Both are provided as starter templates you can copy and adapt:
### (a) Wiki schema — copy to the wiki root
```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```
This file is ~200 lines. It defines:
- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
which question type
Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.
### (b) Global wake-up + query instructions
Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:
```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content
```
This adds:
- **Wake-Up Context** — read `context/wake-up.md` at session start
- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
example queries for each
- **Rules When Citing** — flag `confidence: low`, `status: pending`,
and archived pages to the user
Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.
---
## 10. Verify
```bash
cd ~/projects/wiki
# Sync state
bash scripts/wiki-sync.sh --status
# Search
qmd collection list
qmd search "test" --json -n 3 # won't return anything if wiki is empty
# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"
# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile
# Run the test suite
cd tests && python3 -m pytest
```
Expected:
- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds
---
## Troubleshooting
**qmd search returns nothing**
```bash
qmd collection list # verify path points at the right place
qmd update # rebuild index
qmd embed # rebuild embeddings
cat ~/.config/qmd/index.yml # verify config is correct for your machine
```
**qmd collection points at the wrong path**
Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.
**qmd returns archived pages in default searches**
Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.
**`claude -p` fails in cron ("not authenticated")**
Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.
**`wiki-harvest.py` fetch failures**
```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit
# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```
**`wiki-hygiene.py` archived a page unexpectedly**
Check `last_verified` vs decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:
```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```
**Both machines ran maintenance simultaneously**
Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
occur. Pick ONE machine for maintenance; disable the maintenance cron
on the other. Leave sync cron running on both so changes still propagate.
**Tests fail**
Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.
---
## Minimal install (skip everything except the idea)
If you want the conceptual wiki without any of the automation, all you
actually need is:
1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
[`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, Aider
session works)
Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."
You can bolt on the automation layer later if/when it becomes worth
the setup effort.

docs/examples/global-CLAUDE.md (new file, 161 lines)
# Global Claude Code Instructions — Wiki Section
**What this is**: Content to add to your global `~/.claude/CLAUDE.md`
(the user-level instructions Claude Code reads at the start of every
session, regardless of which project you're in). These instructions tell
Claude how to consult the wiki from outside the wiki directory.
**Where to paste it**: Append these sections to `~/.claude/CLAUDE.md`.
Don't overwrite the whole file — this is additive.
---
Copy everything below this line into your global `~/.claude/CLAUDE.md`:
---
## Wake-Up Context
At the start of each session, read `~/projects/wiki/context/wake-up.md`
for a briefing on active projects, recent decisions, and current
concerns. This provides conversation continuity across sessions.
## LLM Wiki — When to Consult It
**Before creating API endpoints, Docker configs, CI pipelines, or making
architectural decisions**, check the wiki at `~/projects/wiki/` for
established patterns and decisions.
The wiki captures the **why** behind patterns — not just what to do, but
the reasoning, constraints, alternatives rejected, and
environment-specific differences. It compounds over time as projects
discover new knowledge.
**When to read from the wiki** (query mode):
- Creating any operational endpoint (/health, /version, /status)
- Setting up secrets management in a new service
- Writing Dockerfiles or docker-compose configurations
- Configuring CI/CD pipelines
- Adding database users or migrations
- Making architectural decisions that should be consistent across projects
**When to write back to the wiki** (ingest mode):
- When you discover something new that should apply across projects
- When a project reveals an exception or edge case to an existing pattern
- When a decision is made that future projects should follow
- When the human explicitly says "add this to the wiki"
Human-initiated wiki writes go directly to the live wiki with
`origin: manual`. Script-initiated writes go through `staging/` first.
See the wiki's own `CLAUDE.md` for the full ingest protocol.
## LLM Wiki — How to Search It
Use the `qmd` CLI for fast, structured search. DO NOT read `index.md`
for large queries — it's only for full-catalog browsing. DO NOT grep the
wiki manually when `qmd` is available.
The wiki has **three qmd collections**. Pick the right one for the
question:
### Default collection: `wiki` (live content)
For "what's our current pattern for X?" type questions. This is the
default — no `-c` flag needed.
```bash
# Keyword search (fast, BM25)
qmd search "health endpoint version" --json -n 5
# Semantic search (finds conceptually related pages)
qmd vsearch "how should API endpoints be structured" --json -n 5
# Best quality — hybrid BM25 + vector + LLM re-ranking
qmd query "health endpoint" --json -n 5
# Then read the matched page
cat ~/projects/wiki/patterns/health-endpoints.md
```
### Archive collection: `wiki-archive` (stale / superseded)
For "what was our OLD pattern before we changed it?" questions. This is
excluded from default searches; query explicitly with `-c wiki-archive`.
```bash
# "Did we used to use Alpine? Why did we stop?"
qmd search "alpine" -c wiki-archive --json -n 5
# Semantic search across archive
qmd vsearch "container base image considerations" -c wiki-archive --json -n 5
```
When you cite content from an archived page, tell the user it's
archived and may be outdated.
### Conversations collection: `wiki-conversations` (mined session transcripts)
For "when did we discuss this, and what did we decide?" questions. This
is the mined history of your actual Claude Code sessions — decisions,
debugging breakthroughs, design discussions. Excluded from default
searches because transcripts would flood results.
```bash
# "When did we decide to use staging?"
qmd search "staging review workflow" -c wiki-conversations --json -n 5
# "What debugging did we do around Docker networking?"
qmd vsearch "docker network conflicts" -c wiki-conversations --json -n 5
```
Useful for:
- Tracing the reasoning behind a decision back to the session where it
was made
- Finding a solution to a problem you remember solving but didn't write
up
- Context-gathering when returning to a project after time away
### Searching across all collections
Rarely needed, but for "find everything on this topic across time":
```bash
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```
## LLM Wiki — Rules When Citing
1. **Always use `--json`** for structured qmd output. Never try to parse
prose.
2. **Flag `confidence: low` pages** to the user when citing. The content
may be aging out.
3. **Flag `status: pending` pages** (in `staging/`) as unverified when
citing: "Note: this is from a pending wiki page that has not been
human-reviewed yet."
4. **Flag archived pages** as "archived and may be outdated" when citing.
5. **Use `index.md` for browsing only**, not for targeted lookups. `qmd`
is faster and more accurate.
6. **Prefer semantic search for conceptual queries**, keyword search for
specific names/terms.
## LLM Wiki — Quick Reference
- `~/projects/wiki/CLAUDE.md` — Full wiki schema and operations (read this when working IN the wiki)
- `~/projects/wiki/index.md` — Content catalog (browse the full wiki)
- `~/projects/wiki/patterns/` — How things should be built
- `~/projects/wiki/decisions/` — Why we chose this approach
- `~/projects/wiki/environments/` — Where environments differ
- `~/projects/wiki/concepts/` — Foundational ideas
- `~/projects/wiki/raw/` — Immutable source material (never modify)
- `~/projects/wiki/staging/` — Pending automated content (flag when citing)
- `~/projects/wiki/archive/` — Stale content (flag when citing)
- `~/projects/wiki/conversations/` — Session history (search via `-c wiki-conversations`)
---
**End of additions for `~/.claude/CLAUDE.md`.**
See also the wiki's own `CLAUDE.md` at the wiki root — that file tells
the agent how to *maintain* the wiki when working inside it. This file
(the global one) tells the agent how to *consult* the wiki from anywhere
else.

@@ -0,0 +1,278 @@
# LLM Wiki — Schema
This is a persistent, compounding knowledge base maintained by LLM agents.
It captures the **why** behind patterns, decisions, and implementations —
not just the what. Copy this file to the root of your wiki directory
(e.g. `~/projects/wiki/CLAUDE.md`) and edit it for your own conventions.
> This is an example `CLAUDE.md` for the wiki root. The agent reads this
> at the start of every session when working inside the wiki. It's the
> "constitution" that tells the agent how to maintain the knowledge base.
## How This Wiki Works
**You are the maintainer.** When working in this wiki directory, you read
raw sources, compile knowledge into wiki pages, maintain cross-references,
and keep everything consistent.
**You are a consumer.** When working in any other project directory, you
read wiki pages to inform your work — applying established patterns,
respecting decisions, and understanding context.
## Directory Structure
```
wiki/
├── CLAUDE.md ← You are here (schema)
├── index.md ← Content catalog — read this FIRST on any query
├── log.md ← Chronological record of all operations
├── patterns/ ← LIVE: HOW things should be built (with WHY)
├── decisions/ ← LIVE: WHY we chose this approach (with alternatives rejected)
├── environments/ ← LIVE: WHERE implementations differ
├── concepts/ ← LIVE: WHAT the foundational ideas are
├── raw/ ← Immutable source material (NEVER modify)
│ └── harvested/ ← URL harvester output
├── staging/ ← PENDING automated content awaiting human review
│ ├── index.md
│ └── <type>/
├── archive/ ← STALE / superseded (excluded from default search)
│ ├── index.md
│ └── <type>/
├── conversations/ ← Mined Claude Code session transcripts
│ ├── index.md
│ └── <wing>/ ← per-project or per-person (MemPalace "wing")
├── context/ ← Auto-updated AI session briefing
│ ├── wake-up.md ← Loaded at the start of every session
│ └── active-concerns.md
├── reports/ ← Hygiene operation logs
└── scripts/ ← The automation pipeline
```
**Core rule — automated vs manual content**:
| Origin | Destination | Status |
|--------|-------------|--------|
| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
| Human-reviewed from staging | Live wiki (promoted) | `verified` |
Managed via `scripts/wiki-staging.py --list / --promote / --reject / --review`.
## Page Conventions
### Frontmatter (required on all wiki pages)
```yaml
---
title: Page Title
type: pattern | decision | environment | concept
confidence: high | medium | low
origin: manual | automated # How the page entered the wiki
sources: [list of raw/ files this was compiled from]
related: [list of other wiki pages this connects to]
last_compiled: YYYY-MM-DD # Date this page was last (re)compiled from sources
last_verified: YYYY-MM-DD # Date the content was last confirmed accurate
---
```
**`origin` values**:
- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.
**Confidence decay**: Pages with no refresh signal for 6 months decay `high → medium`; 9 months → `low`; 12 months → `stale` (auto-archived). `last_verified` drives decay, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
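The decay schedule above reduces to a pure function of page age. A minimal sketch (months are approximated as 30-day periods here; `wiki-hygiene.py` is the authoritative implementation):

```python
from datetime import date

def decayed_confidence(last_verified: date, today: date) -> str:
    """Map the age of `last_verified` to a confidence value, per the
    6/9/12-month thresholds. 30-day months are a simplification."""
    age_days = (today - last_verified).days
    if age_days >= 12 * 30:
        return "stale"   # auto-archived by the hygiene pass
    if age_days >= 9 * 30:
        return "low"
    if age_days >= 6 * 30:
        return "medium"
    return "high"
```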
### Staging Frontmatter (pages in `staging/<type>/`)
Automated-origin pages get additional staging metadata that is **stripped on promotion**:
```yaml
---
title: ...
type: ...
origin: automated
status: pending # Awaiting review
staged_date: YYYY-MM-DD # When the automated script staged this
staged_by: wiki-harvest # Which script staged it (wiki-harvest, wiki-hygiene, ...)
target_path: patterns/foo.md # Where it should land on promotion
modifies: patterns/bar.md # Only present when this is an update to an existing live page
compilation_notes: "..." # AI's explanation of what it did and why
harvest_source: https://... # Only present for URL-harvested content
sources: [...]
related: [...]
last_verified: YYYY-MM-DD
---
```
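The stripping step on promotion can be sketched in a few lines. The key set below is inferred from the schema above and is an assumption (`wiki-staging.py` is authoritative; whether `harvest_source` is also stripped is unspecified, so it is left out):

```python
# Staging-only keys per the frontmatter schema above. harvest_source
# is deliberately excluded: its fate on promotion is unspecified.
STAGING_ONLY_KEYS = {
    "status", "staged_date", "staged_by",
    "target_path", "modifies", "compilation_notes",
}

def promote_frontmatter(fm: dict) -> dict:
    """Return a copy of the frontmatter with staging metadata removed,
    as happens when a pending page is promoted to the live wiki."""
    return {k: v for k, v in fm.items() if k not in STAGING_ONLY_KEYS}
```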
### Pattern Pages (`patterns/`)
Structure:
1. **What** — One-paragraph description of the pattern
2. **Why** — The reasoning, constraints, and goals that led to this pattern
3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
4. **Structure** — The specification: fields, endpoints, formats, conventions
5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
6. **History** — Key changes and the decisions that drove them
### Decision Pages (`decisions/`)
Structure:
1. **Decision** — One sentence: what we decided
2. **Context** — What problem or constraint prompted this
3. **Options Considered** — What alternatives existed (with pros/cons)
4. **Rationale** — Why this option won
5. **Consequences** — What this decision enables and constrains
6. **Status** — Active | Superseded by [link] | Under Review
### Environment Pages (`environments/`)
Structure:
1. **Overview** — What this environment is (platform, CI, infra)
2. **Key Differences** — Table comparing environments for this domain
3. **Implementation Details** — Environment-specific configs, credentials, deploy method
4. **Gotchas** — Things that have bitten us
### Concept Pages (`concepts/`)
Structure:
1. **Definition** — What this concept means in our context
2. **Why It Matters** — How this concept shapes our decisions
3. **Related Patterns** — Links to patterns that implement this concept
4. **Related Decisions** — Links to decisions driven by this concept
## Operations
### Ingest (adding new knowledge)
When a new raw source is added or you learn something new:
1. Read the source material thoroughly
2. Identify which existing wiki pages need updating
3. Identify if new pages are needed
4. Update/create pages following the conventions above
5. Update cross-references (`related:` frontmatter) on all affected pages
6. Update `index.md` with any new pages
7. Set `last_verified:` to today's date on every page you create or update
8. Set `origin: manual` on any page you create when a human directed you to
9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`
**Where to write**:
- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.
### Query (answering questions from other projects)
When working in another project and consulting the wiki:
1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
2. Read the specific pattern/decision/concept pages
3. Apply the knowledge, respecting environment differences
4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)
### Search Strategy — which qmd collection to use
The wiki has three qmd collections. Pick the right one for the question:
| Question type | Collection | Command |
|---|---|---|
| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |
**Rules of thumb**:
- Use `qmd search` for keyword matches (BM25, fast)
- Use `qmd vsearch` for conceptual / semantically-similar queries (vector)
- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
- Always use `--json` for structured output
- Read individual matched pages with `cat` or your file tool after finding them
### Mine (conversation extraction and summarization)
Four-phase pipeline that extracts sessions into searchable conversation pages:
1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
3. **Index** (`update-conversation-index.py --reindex`) — Regenerate conversation index + `context/wake-up.md`
4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages
Full pipeline via `mine-conversations.sh`. Extraction is incremental (tracks byte offsets). Summarization is incremental (tracks message count).
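The incremental-extraction idea reduces to tracking a byte offset per session file. A minimal, pure sketch (the real `extract-sessions.py` persists the offsets in a state file and may differ in detail):

```python
def extract_new(data: bytes, state: dict, key: str) -> list[str]:
    """Return only the lines appended since the offset recorded for
    `key`, then advance the offset. extract-sessions.py persists the
    state dict on disk between runs; here it is kept in memory."""
    offset = state.get(key, 0)
    chunk = data[offset:]
    state[key] = len(data)
    return chunk.decode("utf-8").splitlines()
```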
### Maintain (wiki health automation)
`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:
```bash
bash scripts/wiki-maintain.sh # Harvest + quick hygiene + reindex
bash scripts/wiki-maintain.sh --full # Harvest + full hygiene (LLM) + reindex
bash scripts/wiki-maintain.sh --harvest-only # Harvest only
bash scripts/wiki-maintain.sh --hygiene-only # Hygiene only
bash scripts/wiki-maintain.sh --dry-run # Show what would run
```
### Lint (periodic health check)
Automated via `scripts/wiki-hygiene.py`. Two tiers:
**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):
- Backfill missing `last_verified`
- Refresh `last_verified` from conversation `related:` references
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing required fields, invalid values)
- Confidence decay per 6/9/12-month thresholds
- Archive stale and superseded pages
- Orphan pages (auto-linked into `index.md`)
- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
- Main index drift (auto add missing entries, remove stale ones)
- Empty stubs (report-only)
- State file drift (report-only)
- Staging/archive index resync
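The broken-cross-reference repair mentioned above can be sketched with `difflib` directly (the 0.8 cutoff is an assumption; `wiki-hygiene.py` picks its own threshold):

```python
import difflib

def repair_reference(broken, live_pages):
    """Suggest a fix for a broken cross-reference by fuzzy-matching it
    against live page paths; return None when nothing is close enough
    (those land in the needs-review report instead)."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.8)
    return matches[0] if matches else None
```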
**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):
- Everything in quick mode, plus:
- Missing cross-references between related pages (haiku)
- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
- Contradictions between pages (sonnet, report-only)
- Technology lifecycle — flag pages referencing versions older than what's in recent conversations
**Reports** (written to `reports/`):
- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment
## Cross-Reference Conventions
- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
- Link to raw sources: `[Source](../raw/path/to/file.md)`
- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`
## Naming Conventions
- Filenames: `kebab-case.md`
- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)
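A title-to-filename helper consistent with these conventions might look like this (a minimal sketch, not the pipeline's exact slug rules):

```python
import re

def slug(title: str) -> str:
    """Kebab-case a page title into a filename per the conventions
    above: lowercase, non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") + ".md"
```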
## Customization Notes
Things you should change for your own wiki:
1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.
2. **Page-type sections** — the "Structure" blocks under each page type reflect my own conventions. Define yours.
3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.
4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).
5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.