Initial commit — memex

A compounding LLM-maintained knowledge wiki. A synthesis of Andrej
Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an
automation layer on top for conversation mining, URL harvesting,
human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed

**File: `docs/ARCHITECTURE.md`** (new file, 360 lines)

# Architecture

Eleven scripts across three conceptual layers. This document walks through
what each one does, how they talk to each other, and where the seams are
for customization.

> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
> behind each component, with links to the interactive design artifact.

## Borrowed concepts

The architecture is a synthesis of two external ideas with an automation
layer on top. The terminology often maps 1:1, so it's worth calling out
which concepts came from where:
### From Karpathy's persistent-wiki gist

| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/`, `decisions/`, `concepts/`, `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |
### From [mempalace](https://github.com/milla-jovovich/mempalace)

MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:

| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |

The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K-token ceiling without needing a full vector-DB rebuild.
### What this repo adds

Automation + lifecycle management on top of both:

- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
  because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union

---
## Overview
|
||||
|
||||
```
|
||||
┌─────────────────────────────────┐
|
||||
│ SYNC LAYER │
|
||||
│ wiki-sync.sh │ (git commit/pull/push, qmd reindex)
|
||||
└─────────────────────────────────┘
|
||||
│
|
||||
┌─────────────────────────────────┐
|
||||
│ MINING LAYER │
|
||||
│ extract-sessions.py │ (Claude Code JSONL → markdown)
|
||||
│ summarize-conversations.py │ (LLM classify + summarize)
|
||||
│ update-conversation-index.py │ (regenerate indexes + wake-up)
|
||||
│ mine-conversations.sh │ (orchestrator)
|
||||
└─────────────────────────────────┘
|
||||
│
|
||||
┌─────────────────────────────────┐
|
||||
│ AUTOMATION LAYER │
|
||||
│ wiki_lib.py (shared helpers) │
|
||||
│ wiki-harvest.py │ (URL → raw → staging)
|
||||
│ wiki-staging.py │ (human review)
|
||||
│ wiki-hygiene.py │ (decay, archive, repair, checks)
|
||||
│ wiki-maintain.sh │ (orchestrator)
|
||||
└─────────────────────────────────┘
|
||||
```
|
||||
|
||||
Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.

---

## Mining layer
### `extract-sessions.py`

Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — it tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially processed sessions.

Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
  paths only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since the last
  extraction, it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
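The incremental byte-offset mechanism can be sketched like this. `new_lines` and the flat state layout are illustrative assumptions; the real script keeps more per-session metadata:

```python
import json
from pathlib import Path

# Hypothetical location; the real script resolves this against the wiki root.
STATE_FILE = Path(".mine-state.json")

def new_lines(session_path: Path) -> list[dict]:
    """Return only the JSONL records appended since the last run."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    offset = state.get(str(session_path), 0)
    with open(session_path, "rb") as f:
        f.seek(offset)          # skip everything already processed
        chunk = f.read()
    records = [json.loads(line) for line in chunk.splitlines() if line.strip()]
    # Remember where this session now ends, so re-runs are cheap and safe.
    state[str(session_path)] = offset + len(chunk)
    STATE_FILE.write_text(json.dumps(state))
    return records
```

Because the offset is in bytes, a resumed session that simply grew at the end is handled for free: the next run starts reading exactly where the last one stopped.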
### `summarize-conversations.py`

Sends extracted transcripts to an LLM for classification and summarization.
It supports two backends:

1. **`--claude` mode** (recommended): Uses `claude -p` with
   haiku for short sessions (≤200 messages) and sonnet for longer ones.
   Runs chunked over long transcripts, keeping a rolling context window.

2. **Local LLM mode** (default, omit `--claude`): Uses a local
   `llama-server` instance at `localhost:8080` (or the WSL gateway on
   port 8081 under Windows Subsystem for Linux). Requires llama.cpp
   installed and a GGUF model loaded.

Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki-page
links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.
### `update-conversation-index.py`

Regenerates three files from the summarized conversations:

- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
  of every session ("current focus areas, recent decisions, active
  concerns")
- `context/active-concerns.md` — longer-form current state

The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.
### `mine-conversations.sh`

Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.

---

## Automation layer
### `wiki_lib.py`

The shared library. Everything in the automation layer imports from here.
It provides:

- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps; handles
  the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.

All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
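The parse/hash pair above is the heart of the library. A simplified stand-in (not the library's actual code: this one only handles flat `key: value` frontmatter, where `parse_yaml_lite` handles lists and nesting too) shows the shape of the round trip and why the hash is body-only:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class WikiPage:
    frontmatter: dict
    body: str

def parse_page_text(text: str) -> WikiPage:
    """Split '---'-delimited YAML frontmatter from the markdown body."""
    if text.startswith("---\n"):
        fm_text, _, body = text[4:].partition("\n---\n")
        fm = dict(line.split(": ", 1) for line in fm_text.splitlines()
                  if ": " in line)
        return WikiPage(fm, body.lstrip("\n"))
    return WikiPage({}, text)

def page_content_hash(page: WikiPage) -> str:
    """Body-only hash: frontmatter churn (decay, timestamps) must not
    count as a content change."""
    return hashlib.sha256(page.body.encode()).hexdigest()
```

Hashing only the body is the design choice that lets hygiene update `confidence` and `last_verified` without triggering its own change-detection on the next run.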
### `wiki-harvest.py`

Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages.

URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — harvested only if
  the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools

Fetch cascade (tries in order, validating at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
   URL was mentioned during the session

Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording the source URL, fetch method, and a content hash.
Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:

- `new_page` — create a new wiki page
- `update_page` — update an existing page (with a `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough

The result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.
### `wiki-staging.py`

Pure file operations — no LLM calls. The human-review pipeline for
automated content.

Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve: strip staging fields, move to live, update
  the main index, rewrite cross-refs, preserve `origin: automated` as an
  audit trail
- `--reject <path> --reason "..."` — delete and record in
  `.harvest-state.json` `rejected_urls` so the harvester won't re-create it
- `--promote-all` — bulk-approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
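The promote operation reduces to frontmatter filtering plus a move. In this sketch the `STAGING_KEYS` set is an assumption (the real list of staging-only fields lives in the scripts), and it assumes no `---` rules in the page body:

```python
import shutil
from pathlib import Path

# Assumed staging-only keys; the real set is defined by the pipeline.
STAGING_KEYS = {"status", "review_notes", "staged_at"}

def promote(staged: Path, live_dir: Path) -> Path:
    """Strip staging-specific frontmatter lines and move the page live.
    `origin: automated` is deliberately NOT stripped: it is the audit trail."""
    out, in_fm = [], False
    for line in staged.read_text().splitlines(keepends=True):
        if line.strip() == "---":
            in_fm = not in_fm       # toggle at the frontmatter delimiters
        elif in_fm and line.split(":", 1)[0].strip() in STAGING_KEYS:
            continue                # drop staging-only fields on promotion
        out.append(line)
    dest = live_dir / staged.name
    dest.write_text("".join(out))
    staged.unlink()
    return dest
```

The real command also updates the main index and rewrites cross-references; this sketch only covers the field-stripping move.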
### `wiki-hygiene.py`

The heavy lifter. Two modes:

**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
  the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per the thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in the index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to the index
- Detect broken cross-references, fuzzy-match to the intended target
  via `difflib.get_close_matches`, and fix in place
- Report empty stubs (body < 100 chars)
- Detect state-file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync
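The confidence-decay step above maps page age onto discrete levels. A minimal sketch, using the threshold constant names this script actually exposes (see the extension seams section) but with the level mapping itself as an assumption:

```python
from datetime import date

# Decay thresholds in days, per the 6/9/12-month schedule.
DECAY_HIGH_TO_MEDIUM = 180
DECAY_MEDIUM_TO_LOW = 270
DECAY_LOW_TO_STALE = 365

def expected_confidence(last_verified: date, today: date) -> str:
    """Map age since last_verified onto a confidence level."""
    age = (today - last_verified).days
    if age < DECAY_HIGH_TO_MEDIUM:
        return "high"
    if age < DECAY_MEDIUM_TO_LOW:
        return "medium"
    if age < DECAY_LOW_TO_STALE:
        return "low"
    return "stale"  # candidate for archiving
```

This is also why refreshing `last_verified` from conversation `related:` links matters: a page that keeps coming up in sessions never ages into the decay thresholds.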
**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — the weaker page is merged into the stronger
  one, and the loser is auto-archived with
  `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
  mentioning `Node 18` when recent conversations are using `Node 20`)

State lives in `.hygiene-state.json` — it tracks content hashes per page
so full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.
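The skip-unchanged mechanism is what keeps the weekly LLM pass cheap. A sketch of the idea, with the state-file layout as an assumption:

```python
import hashlib
import json
from pathlib import Path

def pages_needing_full_check(pages: dict[str, str],
                             state_file: Path) -> list[str]:
    """Return page names whose body hash changed since the last full run.
    `pages` maps page name -> body text; the state layout is assumed."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    hashes = state.setdefault("hashes", {})
    changed = []
    for name, body in pages.items():
        h = hashlib.sha256(body.encode()).hexdigest()
        if hashes.get(name) != h:
            changed.append(name)   # new or modified since last full run
            hashes[name] = h
    state_file.write_text(json.dumps(state))
    return changed
```

On a mostly stable wiki this turns the expensive duplicate/contradiction checks into a no-op for the vast majority of pages each week.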
### `wiki-maintain.sh`

Top-level orchestrator:

```
Phase 1: wiki-harvest.py          (unless --hygiene-only)
Phase 2: wiki-hygiene.py          (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed  (unless --no-reindex or --dry-run)
```

Flags pass through to child scripts. Error-tolerant: if one phase fails,
the others still run. Logs to `scripts/.maintain.log`.

---
## Sync layer

### `wiki-sync.sh`

Git-based sync for cross-machine use. Commands:

- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync-state report

The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).

---
## State files

Three JSON files track per-pipeline state:

| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

Harvest and hygiene state need to sync across machines so both
installations agree on what's been processed. Mining state is per-machine
because Claude Code session files live at OS-specific paths.

---
## Module dependency graph

```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─> wiki-harvest.py
                 ─> wiki-hygiene.py
                 ─> qmd (external)

mine-conversations.sh ─> extract-sessions.py
                      ─> summarize-conversations.py
                      ─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```

`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.

---
## Extension seams

The places to modify when customizing:

1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
   project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
   `SUMMARIZE_TOOLS`, and `MAX_BASH_OUTPUT_LINES` to tune transcript shape.

2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
   `PROJECT_ORDER` control how the index groups conversations.

3. **`scripts/wiki-harvest.py`** —
   - `SKIP_DOMAIN_PATTERNS` — your internal domains
   - `C_TYPE_URL_PATTERNS` — URL shapes that need a topic match before harvesting
   - `FETCH_DELAY_SECONDS` — rate limit between fetches
   - `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
   - `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet

4. **`scripts/wiki-hygiene.py`** —
   - `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
     — decay thresholds in days
   - `EMPTY_STUB_THRESHOLD` — what counts as a stub
   - `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
   - `REQUIRED_FIELDS` — frontmatter fields the repair step enforces

5. **`scripts/summarize-conversations.py`** —
   - `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
   - `MINE_PROMPT_FILE` — the LLM system prompt for summarization
   - Backend selection (claude vs llama-server)

6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
   every session. This is where you tell the agent how to maintain the
   wiki, what conventions to follow, and when to flag things to you.

See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.

**File: `docs/CUSTOMIZE.md`** (new file, 432 lines)

# Customization Guide

This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.

## What's actually required for the core idea

The minimum viable compounding wiki is:

1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
   it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
   maintain the wiki

**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.

---
## Adapting for non-Claude-Code agents

Five components are Claude-specific (four scripts plus the `CLAUDE.md`
schema file). Each has a natural replacement path:

### 1. `extract-sessions.py` — Claude Code JSONL parsing

**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.

**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.

**Replacements**:

- **Cursor**: Cursor stores chat history in `~/Library/Application
  Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
  equivalent `extract-sessions.py` that queries that SQLite database and
  produces the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
  each project directory. A much simpler extractor: walk all project
  directories, read each `.aider.chat.history.md`, split on session
  boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
  tool uses — the target format is a markdown file with a specific
  frontmatter shape (`title`, `type: conversation`, `project`, `date`,
  `status: extracted`, `messages: N`, body of user/assistant turns).
  Anything that produces files in that shape will flow through the rest
  of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
  interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
  by hand and set `status: extracted` yourself.

The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
### 2. `summarize-conversations.py` — `claude -p` summarization

**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.

**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.

**Replacements**:

- **OpenAI**: Replace the `call_claude` helper with a function that calls
  the `openai` Python SDK or a `gpt` CLI. Use gpt-4o-mini for short
  conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
  `--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
  gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
  `http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: The `AI_BASE_URL` and `AI_MODEL` env
  vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
  hand to set `status: summarized` and add your own `topics`/`related`
  frontmatter. Tedious, but it works for a small wiki.
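A stdlib-only sketch of what a `call_claude` replacement might look like against any OpenAI-compatible server. The env-var names mirror the `AI_BASE_URL`/`AI_MODEL` knobs above; `call_llm`, the default model, and the temperature are illustrative assumptions:

```python
import json
import os
import urllib.request

BASE_URL = os.environ.get("AI_BASE_URL", "http://localhost:8080/v1")
MODEL = os.environ.get("AI_MODEL", "gpt-4o-mini")

def build_payload(system: str, transcript: str, model: str = MODEL) -> dict:
    """Chat-completions body accepted by OpenAI-compatible servers."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": transcript},
        ],
        "temperature": 0.2,   # assumed: low, for stable classification
    }

def call_llm(system: str, transcript: str) -> str:
    """Drop-in sketch for the call_claude helper."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(system, transcript)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', 'none')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Because llama-server, Ollama, and the hosted providers all speak this same chat-completions shape, swapping backends is a matter of changing two env vars.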
### 3. `wiki-harvest.py` — AI compile step

**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Any other LLM**: Replace `call_claude_compile()` with a function that
  calls your preferred backend. The prompt template
  (`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
  and the harvester will save raw content to `raw/harvested/` without
  trying to compile it. You can then turn the raw content into wiki pages
  manually (or via a different script).

### 4. `wiki-hygiene.py --full` — LLM-powered checks

**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
  default). Quick mode has no LLM calls and catches 90% of structural
  issues. Contradictions and duplicates then have to be caught by human
  review during `wiki-staging.py --review` sessions.
### 5. `CLAUDE.md` at the wiki root

**What it does**: The instructions Claude Code reads at the start of
every session, explaining the wiki schema and maintenance operations.

**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.

**Replacements**:

| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |

The content is the same — just rename the file and point your agent at it.

---
## Running without cron

Cron is convenient but not required. Alternatives:

### Manual runs

Just call the scripts when you want the wiki updated:

```bash
cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```

This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.
### systemd timers (Linux)

More observable than cron, with better journaling:

```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```

```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service   # see logs
```
### launchd (macOS)

More native than cron on macOS. Save this as
`~/Library/LaunchAgents/com.user.wiki-maintain.plist` (the comment sits
after the XML declaration, which must come first for the file to be
valid):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki   # verify
```
### Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits
the remote:

```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```

Downside: every push is slow. Upside: you never push a broken wiki.
### CI pipeline

Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:

```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only
```

`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.

---
## Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

### "Just the wiki" (no automation)

- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. It works great for small wikis.

### "Wiki + mining" (no harvesting, no hygiene)

- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining, but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without
the full automation.

### "Wiki + hygiene" (no mining, no harvesting)

- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.

### "Wiki + harvesting" (no session mining)

- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete the mining scripts
- Source URLs manually — put them in a file and point the harvester at
  it. You'd need to write a wrapper that extracts URLs from your source
  file and feeds them into the fetch cascade.

Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, a Pocket export, RSS).

---
## Schema customization

The repo uses these live content types:

- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:

1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
   `scripts/wiki-hygiene.py` (the `VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match

Examples of alternative schemas:

**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources

**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure

**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers

None of these are better or worse — pick what matches how you think.

---
## Frontmatter customization

The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.

Useful additions you might want:

```yaml
author: alice            # who wrote or introduced the page
tags: [auth, security]   # flat tag list
urgency: high            # for to-do-style wiki pages
stakeholders:            # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01    # explicit review date instead of age-based decay
```
If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.

---
## Working across multiple wikis

The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:

```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```

Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.

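Honoring `WIKI_DIR` in a script of your own is a one-liner. A minimal sketch, assuming the convention that an unset variable means the current directory:

```python
import os
from pathlib import Path

def wiki_root() -> Path:
    """Resolve the wiki root: $WIKI_DIR if set, else the current directory."""
    return Path(os.environ.get("WIKI_DIR", ".")).expanduser().resolve()
```
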
---
## What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

1. **Config should be in YAML, not inline constants.** I bolted a
   "CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
   as a shortcut. Better: a `config.yaml` at the wiki root that all
   scripts read.

2. **The mining layer is tightly coupled to Claude Code.** A cleaner
   design would put a `Session` interface in `wiki_lib.py` and have
   extractors for each agent produce `Session` objects — the rest of the
   pipeline would be agent-agnostic.

3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
   it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
   `wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
   file and grew.

4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
   awkward.** Standard Python convention is underscores. I used hyphens
   for consistency with the shell scripts, and `conftest.py` has a
   module-loader workaround. A cleaner fork would use underscores
   everywhere.

5. **The wiki schema assumes you know what you want to catalog.** If
   you don't, start with a free-form `notes/` directory, let categories
   emerge organically, and refactor into `patterns/` etc. later.

None of these are blockers. They're all "if I were designing v2"
observations.

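The `Session` interface proposed in item 2 might be sketched like this. All names are hypothetical, since the refactor was never done:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """One mined agent conversation, whichever tool produced it."""
    session_id: str
    project: str                      # wing code, e.g. "web"
    started_at: str                   # ISO 8601 timestamp
    messages: list[dict] = field(default_factory=list)  # {"role", "content"}

def from_claude_code(raw: dict) -> Session:
    """Example per-agent extractor adapting one raw format to the interface."""
    return Session(
        session_id=raw["id"],
        project=raw.get("project", "general"),
        started_at=raw["timestamp"],
        messages=[{"role": m["role"], "content": m["content"]}
                  for m in raw.get("messages", [])],
    )
```

Everything downstream (summarize, index, stage) would then accept `Session` objects and never touch agent-specific JSON.
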
338
docs/DESIGN-RATIONALE.md
Normal file
@@ -0,0 +1,338 @@

# Design Rationale — Signal & Noise

Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."

Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:

> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)

That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.

---
## Where the pattern is genuinely strong

The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:

| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |

---

## Where the pattern is genuinely weak — and how this repo answers

The analysis identified seven real weaknesses. Five have direct
mitigations in this repo; two remain open trade-offs you should be aware
of.
### 1. Errors persist and compound

**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.

**How this repo mitigates**:

- **`confidence` field** — every page carries `high`/`medium`/`low` with
  decay based on `last_verified`. Wrong claims aren't treated as
  permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/`, where
  they're excluded from default search. If they get referenced again,
  they're auto-restored with `confidence: medium` (never straight to
  `high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
  are the ground truth. Every compiled wiki page can be traced back to
  its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
  sonnet to find conflicting claims across pages. Report-only (humans
  decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
  Nothing enters the live wiki without human approval, so errors have
  two chances to get caught (AI compile + human review) before they
  become persistent.

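The restore rule above is small enough to state as code. A sketch, with the function name assumed:

```python
def confidence_on_restore(previous: str) -> str:
    """Cap a restored page's confidence at `medium`.

    A page that decayed into archive/ must re-earn `high` through a fresh
    `last_verified` bump; pages archived at `low` come back as `low`.
    """
    return "low" if previous == "low" else "medium"
```
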
### 2. Hard scale ceiling at ~50K tokens

**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.

**How this repo mitigates**:

- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
  up in the default configuration so the agent never has to load the
  full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
  project code (wing) and topic (room, via the `topics:` frontmatter).
  Retrieval is pre-narrowed to the relevant wing before search runs.
  This extends the effective ceiling because `qmd` works on a relevant
  subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
  weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
  `includeByDefault: false`, so archived pages don't eat context until
  explicitly queried.

### 3. Manual cross-checking burden returns in precision-critical domains

**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.

**How this repo mitigates**:

- **Staging workflow** — every automated page goes through human review.
  For precision-critical content, that review IS the cross-check. The
  AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
  explanation of what it did and why. Makes review faster — you can
  spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
  file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
  means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
  (via `CLAUDE.md`) tell it to flag low-confidence content when
  citing. Humans see the warning before acting.

**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.
### 4. Knowledge staleness without active upkeep

**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. A six-week half-life is
typical.

**How this repo mitigates** (this is the biggest thing):

- **Automation replaces human discipline** — daily cron runs
  `wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
  `--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
  manually. Every Claude Code session becomes potential ingest. The
  feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
  summarizer links a conversation to a wiki page via `related:`, the
  hygiene script picks that up and bumps `last_verified`. Pages stay
  fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
  for 6/9/12 months get downgraded and eventually archived. The wiki
  self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
  flags the things that *do* need human judgment. Everything else is
  auto-fixed.

This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.
### 5. Cognitive outsourcing risk

**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.

**How this repo mitigates**:

- **Staging review is a forcing function** — you see every automated
  page before it lands. Even skimming forces engagement with the
  material.
- **`qmd query "..."` for exploration** — searching the wiki is an
  active process, not passive retrieval. You're asking questions, not
  pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
  the agent reads at session start. You read it too (or the agent reads
  it to you) — ongoing re-exposure to your own knowledge base.

**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.
### 6. Weaker semantic retrieval than RAG at scale

**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.

**How this repo mitigates**:

- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
  similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
  (wings) and topic frontmatter narrow the search space before the
  hybrid search runs. Structure + semantics is stronger than either
  alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
  LLM to find pages that *should* link to each other but don't, then
  auto-adds them. This is the explicit-linking approach catching up to
  semantic retrieval over time.

**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.
### 7. No access control or multi-user support

**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.

**How this repo mitigates**:

- **Git-based sync with merge-union** — concurrent writes on different
  machines auto-resolve because markdown is set to `merge=union` in
  `.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
  deployment is over Tailscale or a VPN, so the network does the work a
  RBAC layer would otherwise do. Not enterprise-grade, but sufficient
  for personal/family/small-team use.

**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.

---

## The #1 failure mode — active upkeep

Every other weakness has a mitigation. *Active upkeep is the one that
kills wikis in the wild.* The community data is unambiguous:

- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks

The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:

| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |

If you disable all of these, you get the same outcome as every
abandoned wiki: a six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.

---

## What was borrowed from where

This repo is a synthesis of two ideas with an automation layer on top:

### From Karpathy

- The core pattern: an LLM-maintained persistent wiki that compiles at
  ingest time instead of retrieving at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time

### From mempalace

- **Wings** = per-person or per-project namespaces → this repo uses
  project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
  `conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
  conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
  preference / advice / tooling) → the `halls:` frontmatter field
  classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
  conversation
- **Drawers** = verbatim archive, never lost → the extracted
  conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
  linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
  boost over flat search

The mempalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.
### What this repo adds

- **Automation layer** tying the pieces together with cron-friendly
  orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
  content
- **Confidence decay + auto-archive + auto-restore** as the "retention
  curve" that community analysis identified as critical for long-term
  wiki health
- **`qmd` integration** as the scalable search layer (chosen over
  ChromaDB because it uses the same markdown storage as the wiki —
  one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
  automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
  wiki lives on multiple machines without merge hell

---

## Honest residual trade-offs

Five items from the analysis that this repo doesn't fully solve, and
where you should know the limits:

1. **Enterprise scale** — this is a personal/small-team tool. Millions
   of documents, hundreds of users, RBAC, compliance: wrong
   architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
   is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
   yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
   use this as a drafting tool, not a source of truth. Human
   domain-expert review is not replaceable.
5. **Access control** — a network boundary (Tailscale) is the fastest
   path; nothing in the repo itself enforces permissions.

If any of these are dealbreakers for your use case, a different
architecture is probably what you need.

---

## Further reading

- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
  — the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
  structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
  — the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups

502
docs/SETUP.md
Normal file
@@ -0,0 +1,502 @@

# Setup Guide

Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).

Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.

---
## 1. Prerequisites

### Required

- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
  API key. Required for summarization and the harvester's AI compile step.
  Without `claude`, you can still use the wiki, but the automation layer
  falls back to manual or local-LLM paths.

### Python tools (recommended via `pipx`)

```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup   # installs Playwright browsers
```

Verify: `trafilatura --version` and `crwl --help` should both work.

### Optional

- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
  summarization path instead of `claude -p`

---

## 2. Clone the repo

```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```

The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.

---

## 3. Configure qmd search

`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:

- **`wiki`** — live content (patterns/decisions/concepts/environments),
  staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
  search; query explicitly with `-c wiki-archive` when digging into
  history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
  Excluded from default search because they'd flood results with noisy
  tool-call output; query explicitly with `-c wiki-conversations` when
  looking for "what did I discuss about X last month?"

```bash
npm install -g @tobilu/qmd
```

Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:
```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
  wiki:
    path: /Users/YOUR_USER/projects/wiki   # ← replace with your actual path
    pattern: "**/*.md"
    ignore:
      - "archive/**"
      - "reports/**"
      - "plans/**"
      - "conversations/**"
      - "scripts/**"
      - "context/**"

  wiki-archive:
    path: /Users/YOUR_USER/projects/wiki/archive
    pattern: "**/*.md"
    includeByDefault: false

  wiki-conversations:
    path: /Users/YOUR_USER/projects/wiki/conversations
    pattern: "**/*.md"
    includeByDefault: false
    ignore:
      - "index.md"
YAML
```

On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.

Build the indexes:

```bash
qmd update   # scan files into all three collections
qmd embed    # generate vector embeddings (~2 min first run, plus ~30 min for conversations on CPU)
```

Verify:

```bash
qmd collection list
# Expected:
#   wiki               — N files
#   wiki-archive       — M files [excluded]
#   wiki-conversations — K files [excluded]
```

The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.

**When to query which**:

```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5

# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5

# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5

# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

---
## 4. Configure the Python scripts

Three scripts need per-user configuration:

### `scripts/extract-sessions.py` — `PROJECT_MAP`

This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`,
where the hashed path is derived from the absolute path to your project.

Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:
```python
PROJECT_MAP: dict[str, str] = {
    "projects-wiki": "wiki",
    "-claude": "cl",
    "my-webapp": "web",       # map "mydir/my-webapp" → wing "web"
    "mobile-app": "mob",
    "work-monorepo": "work",
    "-home": "general",       # catch-all for unmatched sessions
}
```

Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.

### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`

Matching display names for every code in `PROJECT_MAP`:

```python
PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    "web": "WEB — My Webapp",
    "mob": "MOB — Mobile App",
    "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

PROJECT_ORDER = [
    "work", "web", "mob",     # most-active first
    "wiki", "cl", "general",
]
```
### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`

Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns are matched with `re.search`:

```python
SKIP_DOMAIN_PATTERNS = [
    # ... (generic ones are already there)
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]
```
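Because the patterns are applied with `re.search`, a quick way to sanity-check a new entry is a helper like this. The function name is an assumption; the harvester's own skip logic may be organized differently:

```python
import re

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]

def should_skip(domain: str) -> bool:
    """True if the harvester should not fetch URLs on this domain."""
    return any(re.search(pattern, domain) for pattern in SKIP_DOMAIN_PATTERNS)
```

Note the anchors: `$` keeps `git.mydomain.com.evil.net` from matching, and the leading `\.` keeps `notmycompany.com` out.
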
---

## 5. Create the post-merge hook

The hook rebuilds the qmd index automatically after every `git pull`:
```bash
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail

if command -v qmd &>/dev/null; then
  echo "wiki: rebuilding qmd index..."
  qmd update 2>/dev/null
  # WSL / Linux: no GPU, force CPU-only embeddings
  if [[ "$(uname -s)" == "Linux" ]]; then
    NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
  else
    qmd embed 2>/dev/null
  fi
  echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge
```
`.git/hooks/` isn't tracked by git, so repeat this step on every machine
where you clone the repo.

---
## 6. Backfill frontmatter (first-time setup or fresh clone)

If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:

```bash
cd ~/projects/wiki

# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill

# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page

changed = 0
for p in iter_live_pages():
    if 'origin' not in p.frontmatter:
        p.frontmatter['origin'] = 'manual'
        write_page(p)
        changed += 1
print(f'{changed} page(s) backfilled')
"
```

For a brand-new empty wiki, there's nothing to backfill — skip this step.

---

## 7. Run the pipeline manually once

Before setting up cron, do a full end-to-end dry run to make sure
everything's wired up:

```bash
cd ~/projects/wiki

# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only

# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude

# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex

# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```

Expected output from step 4: all three phases run, phase 3 (qmd reindex)
shows as skipped in dry-run mode, and you see `finished in Ns`.

---

## 8. Cron setup (optional)

If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.

```bash
crontab -e
```

```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"

# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin

# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes
*/15 * * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push >> /tmp/wiki-sync.log 2>&1

# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1

# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1

# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1

# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1

# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```
Replace `YOUR_USER` and the node path as appropriate for your system.
|
||||
|
||||
**macOS note**: `cron` needs Full Disk Access if you're pointing it at
|
||||
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
|
||||
a plist — same effect, easier permission model on macOS.
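
For the `launchd` route, a minimal plist sketch, saved as e.g.
`~/Library/LaunchAgents/com.user.wiki-maintain.plist`. The label, paths,
and schedule here are illustrative assumptions, mirroring the daily 3am
maintenance cron entry:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Hypothetical label; pick any reverse-DNS name -->
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <!-- Daily at 3am, same cadence as the cron entry -->
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.log</string>
</dict>
</plist>
```

Load it once with `launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist`;
launchd then runs the job on schedule without cron's Full Disk Access friction.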

**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.

**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.

---

## 9. Tell Claude Code about the wiki

Two separate CLAUDE.md files work together:

1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
   schema the agent reads when working INSIDE the wiki. Tells it how to
   maintain pages, apply frontmatter, handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
   the agent reads on EVERY session (regardless of directory). Tells it
   when and how to consult the wiki from any other project.

Both are provided as starter templates you can copy and adapt:

### (a) Wiki schema — copy to the wiki root

```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```

This file is ~200 lines. It defines:

- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
  which question type

Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.

### (b) Global wake-up + query instructions

Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:

```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content
```

This adds:

- **Wake-Up Context** — read `context/wake-up.md` at session start
- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
  collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
  example queries for each
- **Rules When Citing** — flag `confidence: low`, `status: pending`,
  and archived pages to the user

Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.

---

## 10. Verify

```bash
cd ~/projects/wiki

# Sync state
bash scripts/wiki-sync.sh --status

# Search
qmd collection list
qmd search "test" --json -n 3   # won't return anything if wiki is empty

# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"

# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile

# Run the test suite
cd tests && python3 -m pytest
```

Expected:

- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds

---

## Troubleshooting

**qmd search returns nothing**

```bash
qmd collection list          # verify path points at the right place
qmd update                   # rebuild index
qmd embed                    # rebuild embeddings
cat ~/.config/qmd/index.yml  # verify config is correct for your machine
```

**qmd collection points at the wrong path**

Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.

**qmd returns archived pages in default searches**

Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.

**`claude -p` fails in cron ("not authenticated")**

Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.

**`wiki-harvest.py` fetch failures**

```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit

# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```

**`wiki-hygiene.py` archived a page unexpectedly**

Check `last_verified` vs decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:

```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```

**Both machines ran maintenance simultaneously**

Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
occur. Pick ONE machine for maintenance; disable the maintenance cron
on the other. Leave sync cron running on both so changes still propagate.
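
If a conflict has already landed, the state files can often be reconciled
rather than discarded. One plausible reconciliation, assuming a flat
`{path: progress-marker}` layout (an assumption — check the real schema in
`scripts/wiki_lib.py`), is to take the union and keep whichever side has
progressed further:

```python
import json

def merge_states(ours, theirs):
    """Union two flat state dicts; on collision keep the entry that
    has progressed further (larger offset / later marker)."""
    merged = dict(theirs)
    for key, value in ours.items():
        merged[key] = max(merged[key], value) if key in merged else value
    return merged

# Example: each machine summarized different sessions.
ours = {"session-a.jsonl": 120, "session-b.jsonl": 40}
theirs = {"session-b.jsonl": 55, "session-c.jsonl": 10}
print(json.dumps(merge_states(ours, theirs), sort_keys=True))
```

Write the merged result back over the conflicted file, then commit as usual.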

**Tests fail**

Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify that
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.

---

## Minimal install (skip everything except the idea)

If you want the conceptual wiki without any of the automation, all you
actually need is:

1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
   [`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, Aider
   session works)

Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."

You can bolt on the automation layer later if/when it becomes worth
the setup effort.

161
docs/examples/global-CLAUDE.md
Normal file
@@ -0,0 +1,161 @@

# Global Claude Code Instructions — Wiki Section

**What this is**: Content to add to your global `~/.claude/CLAUDE.md`
(the user-level instructions Claude Code reads at the start of every
session, regardless of which project you're in). These instructions tell
Claude how to consult the wiki from outside the wiki directory.

**Where to paste it**: Append these sections to `~/.claude/CLAUDE.md`.
Don't overwrite the whole file — this is additive.

---

Copy everything below this line into your global `~/.claude/CLAUDE.md`:

---

## Wake-Up Context

At the start of each session, read `~/projects/wiki/context/wake-up.md`
for a briefing on active projects, recent decisions, and current
concerns. This provides conversation continuity across sessions.

## LLM Wiki — When to Consult It

**Before creating API endpoints, Docker configs, CI pipelines, or making
architectural decisions**, check the wiki at `~/projects/wiki/` for
established patterns and decisions.

The wiki captures the **why** behind patterns — not just what to do, but
the reasoning, constraints, alternatives rejected, and
environment-specific differences. It compounds over time as projects
discover new knowledge.

**When to read from the wiki** (query mode):

- Creating any operational endpoint (/health, /version, /status)
- Setting up secrets management in a new service
- Writing Dockerfiles or docker-compose configurations
- Configuring CI/CD pipelines
- Adding database users or migrations
- Making architectural decisions that should be consistent across projects

**When to write back to the wiki** (ingest mode):

- When you discover something new that should apply across projects
- When a project reveals an exception or edge case to an existing pattern
- When a decision is made that future projects should follow
- When the human explicitly says "add this to the wiki"

Human-initiated wiki writes go directly to the live wiki with
`origin: manual`. Script-initiated writes go through `staging/` first.
See the wiki's own `CLAUDE.md` for the full ingest protocol.

## LLM Wiki — How to Search It

Use the `qmd` CLI for fast, structured search. DO NOT read `index.md`
for targeted lookups — it's only for full-catalog browsing. DO NOT grep
the wiki manually when `qmd` is available.

The wiki has **three qmd collections**. Pick the right one for the
question:

### Default collection: `wiki` (live content)

For "what's our current pattern for X?" type questions. This is the
default — no `-c` flag needed.

```bash
# Keyword search (fast, BM25)
qmd search "health endpoint version" --json -n 5

# Semantic search (finds conceptually related pages)
qmd vsearch "how should API endpoints be structured" --json -n 5

# Best quality — hybrid BM25 + vector + LLM re-ranking
qmd query "health endpoint" --json -n 5

# Then read the matched page
cat ~/projects/wiki/patterns/health-endpoints.md
```

### Archive collection: `wiki-archive` (stale / superseded)

For "what was our OLD pattern before we changed it?" questions. This is
excluded from default searches; query explicitly with `-c wiki-archive`.

```bash
# "Did we used to use Alpine? Why did we stop?"
qmd search "alpine" -c wiki-archive --json -n 5

# Semantic search across archive
qmd vsearch "container base image considerations" -c wiki-archive --json -n 5
```

When you cite content from an archived page, tell the user it's
archived and may be outdated.

### Conversations collection: `wiki-conversations` (mined session transcripts)

For "when did we discuss this, and what did we decide?" questions. This
is the mined history of your actual Claude Code sessions — decisions,
debugging breakthroughs, design discussions. Excluded from default
searches because transcripts would flood results.

```bash
# "When did we decide to use staging?"
qmd search "staging review workflow" -c wiki-conversations --json -n 5

# "What debugging did we do around Docker networking?"
qmd vsearch "docker network conflicts" -c wiki-conversations --json -n 5
```

Useful for:

- Tracing the reasoning behind a decision back to the session where it
  was made
- Finding a solution to a problem you remember solving but didn't write up
- Context-gathering when returning to a project after time away

### Searching across all collections

Rarely needed, but for "find everything on this topic across time":

```bash
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

## LLM Wiki — Rules When Citing

1. **Always use `--json`** for structured qmd output. Never try to parse
   prose.
2. **Flag `confidence: low` pages** to the user when citing. The content
   may be aging out.
3. **Flag `status: pending` pages** (in `staging/`) as unverified when
   citing: "Note: this is from a pending wiki page that has not been
   human-reviewed yet."
4. **Flag archived pages** as "archived and may be outdated" when citing.
5. **Use `index.md` for browsing only**, not for targeted lookups. `qmd`
   is faster and more accurate.
6. **Prefer semantic search for conceptual queries**, keyword search for
   specific names/terms.
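
The rules above boil down to one small pre-citation check. A hypothetical
helper (not part of the pipeline), assuming pages carry the frontmatter
fields defined in the wiki schema:

```python
def citation_flags(frontmatter, path):
    """Return the caveats to surface to the user when citing a wiki page."""
    flags = []
    if frontmatter.get("confidence") == "low":
        flags.append("confidence is low; content may be aging out")
    if frontmatter.get("status") == "pending":
        flags.append("pending wiki page in staging, not yet human-reviewed")
    if path.startswith("archive/"):
        flags.append("archived and may be outdated")
    return flags

# A staged page triggers the pending flag; an archive/ path triggers the
# archived flag regardless of frontmatter.
print(citation_flags({"confidence": "low", "status": "pending"},
                     "staging/patterns/foo.md"))
```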

## LLM Wiki — Quick Reference

- `~/projects/wiki/CLAUDE.md` — Full wiki schema and operations (read this when working IN the wiki)
- `~/projects/wiki/index.md` — Content catalog (browse the full wiki)
- `~/projects/wiki/patterns/` — How things should be built
- `~/projects/wiki/decisions/` — Why we chose this approach
- `~/projects/wiki/environments/` — Where environments differ
- `~/projects/wiki/concepts/` — Foundational ideas
- `~/projects/wiki/raw/` — Immutable source material (never modify)
- `~/projects/wiki/staging/` — Pending automated content (flag when citing)
- `~/projects/wiki/archive/` — Stale content (flag when citing)
- `~/projects/wiki/conversations/` — Session history (search via `-c wiki-conversations`)

---

**End of additions for `~/.claude/CLAUDE.md`.**

See also the wiki's own `CLAUDE.md` at the wiki root — that file tells
the agent how to *maintain* the wiki when working inside it. This file
(the global one) tells the agent how to *consult* the wiki from anywhere
else.

278
docs/examples/wiki-CLAUDE.md
Normal file
@@ -0,0 +1,278 @@

# LLM Wiki — Schema

This is a persistent, compounding knowledge base maintained by LLM agents.
It captures the **why** behind patterns, decisions, and implementations —
not just the what. Copy this file to the root of your wiki directory
(i.e. `~/projects/wiki/CLAUDE.md`) and edit for your own conventions.

> This is an example `CLAUDE.md` for the wiki root. The agent reads this
> at the start of every session when working inside the wiki. It's the
> "constitution" that tells the agent how to maintain the knowledge base.

## How This Wiki Works

**You are the maintainer.** When working in this wiki directory, you read
raw sources, compile knowledge into wiki pages, maintain cross-references,
and keep everything consistent.

**You are a consumer.** When working in any other project directory, you
read wiki pages to inform your work — applying established patterns,
respecting decisions, and understanding context.

## Directory Structure

```
wiki/
├── CLAUDE.md           ← You are here (schema)
├── index.md            ← Content catalog — read this FIRST on any query
├── log.md              ← Chronological record of all operations
│
├── patterns/           ← LIVE: HOW things should be built (with WHY)
├── decisions/          ← LIVE: WHY we chose this approach (with alternatives rejected)
├── environments/       ← LIVE: WHERE implementations differ
├── concepts/           ← LIVE: WHAT the foundational ideas are
│
├── raw/                ← Immutable source material (NEVER modify)
│   └── harvested/      ← URL harvester output
│
├── staging/            ← PENDING automated content awaiting human review
│   ├── index.md
│   └── <type>/
│
├── archive/            ← STALE / superseded (excluded from default search)
│   ├── index.md
│   └── <type>/
│
├── conversations/      ← Mined Claude Code session transcripts
│   ├── index.md
│   └── <wing>/         ← per-project or per-person (MemPalace "wing")
│
├── context/            ← Auto-updated AI session briefing
│   ├── wake-up.md      ← Loaded at the start of every session
│   └── active-concerns.md
│
├── reports/            ← Hygiene operation logs
└── scripts/            ← The automation pipeline
```

**Core rule — automated vs manual content**:

| Origin | Destination | Status |
|--------|-------------|--------|
| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
| Human-reviewed from staging | Live wiki (promoted) | `verified` |

Managed via `scripts/wiki-staging.py --list / --promote / --reject / --review`.
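
Promotion amounts to stripping the staging-only fields so that what
survives is ordinary live-page frontmatter. A sketch of that step (the
real logic lives in `scripts/wiki-staging.py`; the field set follows the
staging frontmatter spec later in this file):

```python
# Fields that exist only while a page sits in staging/.
STAGING_ONLY = {"status", "staged_date", "staged_by", "target_path",
                "modifies", "compilation_notes", "harvest_source"}

def promote(staged):
    """Drop staging-only metadata from a staged page's frontmatter."""
    return {k: v for k, v in staged.items() if k not in STAGING_ONLY}

staged = {"title": "Foo", "type": "pattern", "origin": "automated",
          "status": "pending", "staged_by": "wiki-harvest",
          "target_path": "patterns/foo.md"}
print(promote(staged))  # only title, type, and origin survive
```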

## Page Conventions

### Frontmatter (required on all wiki pages)

```yaml
---
title: Page Title
type: pattern | decision | environment | concept
confidence: high | medium | low
origin: manual | automated   # How the page entered the wiki
sources: [list of raw/ files this was compiled from]
related: [list of other wiki pages this connects to]
last_compiled: YYYY-MM-DD    # Date this page was last (re)compiled from sources
last_verified: YYYY-MM-DD    # Date the content was last confirmed accurate
---
```

**`origin` values**:

- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.

**Confidence decay**: Pages with no refresh signal for 6 months decay `high → medium`; 9 months → `low`; 12 months → `stale` (auto-archived). `last_verified` drives decay, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
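
The decay schedule can be computed directly from `last_verified`. A
sketch at month granularity (thresholds per the 6/9/12-month rule above;
the function name is illustrative, and the actual implementation is in
`scripts/wiki-hygiene.py`):

```python
from datetime import date

def decayed_confidence(last_verified, today):
    """Map the staleness of last_verified onto a confidence ceiling."""
    months = ((today.year - last_verified.year) * 12
              + (today.month - last_verified.month))
    if months >= 12:
        return "stale"    # auto-archived
    if months >= 9:
        return "low"
    if months >= 6:
        return "medium"
    return "high"

print(decayed_confidence(date(2024, 1, 15), date(2024, 8, 20)))  # prints "medium"
```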

### Staging Frontmatter (pages in `staging/<type>/`)

Automated-origin pages get additional staging metadata that is **stripped on promotion**:

```yaml
---
title: ...
type: ...
origin: automated
status: pending               # Awaiting review
staged_date: YYYY-MM-DD       # When the automated script staged this
staged_by: wiki-harvest       # Which script staged it (wiki-harvest, wiki-hygiene, ...)
target_path: patterns/foo.md  # Where it should land on promotion
modifies: patterns/bar.md     # Only present when this is an update to an existing live page
compilation_notes: "..."      # AI's explanation of what it did and why
harvest_source: https://...   # Only present for URL-harvested content
sources: [...]
related: [...]
last_verified: YYYY-MM-DD
---
```

### Pattern Pages (`patterns/`)

Structure:

1. **What** — One-paragraph description of the pattern
2. **Why** — The reasoning, constraints, and goals that led to this pattern
3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
4. **Structure** — The specification: fields, endpoints, formats, conventions
5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
6. **History** — Key changes and the decisions that drove them

### Decision Pages (`decisions/`)

Structure:

1. **Decision** — One sentence: what we decided
2. **Context** — What problem or constraint prompted this
3. **Options Considered** — What alternatives existed (with pros/cons)
4. **Rationale** — Why this option won
5. **Consequences** — What this decision enables and constrains
6. **Status** — Active | Superseded by [link] | Under Review

### Environment Pages (`environments/`)

Structure:

1. **Overview** — What this environment is (platform, CI, infra)
2. **Key Differences** — Table comparing environments for this domain
3. **Implementation Details** — Environment-specific configs, credentials, deploy method
4. **Gotchas** — Things that have bitten us

### Concept Pages (`concepts/`)

Structure:

1. **Definition** — What this concept means in our context
2. **Why It Matters** — How this concept shapes our decisions
3. **Related Patterns** — Links to patterns that implement this concept
4. **Related Decisions** — Links to decisions driven by this concept

## Operations

### Ingest (adding new knowledge)

When a new raw source is added or you learn something new:

1. Read the source material thoroughly
2. Identify which existing wiki pages need updating
3. Identify whether new pages are needed
4. Update/create pages following the conventions above
5. Update cross-references (`related:` frontmatter) on all affected pages
6. Update `index.md` with any new pages
7. Set `last_verified:` to today's date on every page you create or update
8. Set `origin: manual` on any page you create when a human directed you to
9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`

**Where to write**:

- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.

### Query (answering questions from other projects)

When working in another project and consulting the wiki:

1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
2. Read the specific pattern/decision/concept pages
3. Apply the knowledge, respecting environment differences
4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)

### Search Strategy — which qmd collection to use

The wiki has three qmd collections. Pick the right one for the question:

| Question type | Collection | Command |
|---|---|---|
| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |

**Rules of thumb**:

- Use `qmd search` for keyword matches (BM25, fast)
- Use `qmd vsearch` for conceptual / semantically similar queries (vector)
- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
- Always use `--json` for structured output
- Read individual matched pages with `cat` or your file tool after finding them

### Mine (conversation extraction and summarization)

Four-phase pipeline that extracts sessions into searchable conversation pages:

1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
3. **Index** (`update-conversation-index.py --reindex`) — Regenerate the conversation index + `context/wake-up.md`
4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages

Full pipeline via `mine-conversations.sh`. Extraction is incremental (tracks byte offsets). Summarization is incremental (tracks message count).
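
The incremental bookkeeping works like this: record how many bytes of
each session file have been consumed, and on the next run seek to that
offset and read only what's new. A simplified sketch (the state layout is
illustrative; see `extract-sessions.py` for the real mechanism):

```python
import os
import tempfile

def read_new_bytes(path, state):
    """Return content appended to path since the recorded offset,
    plus the updated state."""
    offset = state.get(path, 0)
    with open(path, "rb") as f:
        f.seek(offset)
        new = f.read()
    return new.decode("utf-8"), {**state, path: offset + len(new)}

# Demo with a throwaway file standing in for a session log.
tmp = tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False)
tmp.write('{"msg": "first"}\n')
tmp.close()
state = {}
_, state = read_new_bytes(tmp.name, state)    # first run reads everything
with open(tmp.name, "a") as f:
    f.write('{"msg": "second"}\n')            # the session grows
new, state = read_new_bytes(tmp.name, state)  # second run reads only the new line
print(new, end="")
os.unlink(tmp.name)
```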

### Maintain (wiki health automation)

`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:

```bash
bash scripts/wiki-maintain.sh                 # Harvest + quick hygiene + reindex
bash scripts/wiki-maintain.sh --full          # Harvest + full hygiene (LLM) + reindex
bash scripts/wiki-maintain.sh --harvest-only  # Harvest only
bash scripts/wiki-maintain.sh --hygiene-only  # Hygiene only
bash scripts/wiki-maintain.sh --dry-run       # Show what would run
```

### Lint (periodic health check)

Automated via `scripts/wiki-hygiene.py`. Two tiers:

**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):

- Backfill missing `last_verified`
- Refresh `last_verified` from conversation `related:` references
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing required fields, invalid values)
- Confidence decay per 6/9/12-month thresholds
- Archive stale and superseded pages
- Orphan pages (auto-linked into `index.md`)
- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
- Main index drift (auto add missing entries, remove stale ones)
- Empty stubs (report-only)
- State file drift (report-only)
- Staging/archive index resync
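
The fuzzy-match repair for broken cross-references is, conceptually,
`difflib.get_close_matches` over the live page paths. A sketch (the
function name and cutoff are illustrative; the real check lives in
`scripts/wiki-hygiene.py`):

```python
import difflib

def repair_link(broken, live_pages):
    """Suggest the closest live page for a broken cross-reference,
    or None if nothing is similar enough."""
    matches = difflib.get_close_matches(broken, live_pages, n=1, cutoff=0.6)
    return matches[0] if matches else None

pages = ["patterns/health-endpoints.md", "patterns/secrets-at-startup.md",
         "decisions/no-alpine.md"]
print(repair_link("patterns/health-endpoint.md", pages))  # near-miss gets repaired
```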

**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):

- Everything in quick mode, plus:
- Missing cross-references between related pages (haiku)
- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
- Contradictions between pages (sonnet, report-only)
- Technology lifecycle — flag pages referencing versions older than what's in recent conversations

**Reports** (written to `reports/`):

- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment

## Cross-Reference Conventions

- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
- Link to raw sources: `[Source](../raw/path/to/file.md)`
- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`

## Naming Conventions

- Filenames: `kebab-case.md`
- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)

## Customization Notes

Things you should change for your own wiki:

1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.

2. **Page-type sections** — the "Structure" blocks under each page type are for my use. Define your own conventions.

3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.

4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).

5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.