A compounding LLM-maintained knowledge wiki. A synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene. Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
# Setup Guide
Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).

Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.

---
## 1. Prerequisites

### Required

- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
  API key. Required for summarization and the harvester's AI compile step.
  Without `claude`, you can still use the wiki, but the automation layer
  falls back to manual or local-LLM paths.
### Python tools (recommended via `pipx`)

```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup   # installs Playwright browsers
```

Verify: `trafilatura --version` and `crwl --help` should both work.
### Optional

- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
  summarization path instead of `claude -p`

---
## 2. Clone the repo

```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```

The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.

---
## 3. Configure qmd search

`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:

- **`wiki`** — live content (patterns/decisions/concepts/environments),
  staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
  search; query explicitly with `-c wiki-archive` when digging into
  history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
  Excluded from default search because they'd flood results with noisy
  tool-call output; query explicitly with `-c wiki-conversations` when
  looking for "what did I discuss about X last month?"
```bash
npm install -g @tobilu/qmd
```

Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:
```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
  wiki:
    path: /Users/YOUR_USER/projects/wiki   # ← replace with your actual path
    pattern: "**/*.md"
    ignore:
      - "archive/**"
      - "reports/**"
      - "plans/**"
      - "conversations/**"
      - "scripts/**"
      - "context/**"

  wiki-archive:
    path: /Users/YOUR_USER/projects/wiki/archive
    pattern: "**/*.md"
    includeByDefault: false

  wiki-conversations:
    path: /Users/YOUR_USER/projects/wiki/conversations
    pattern: "**/*.md"
    includeByDefault: false
    ignore:
      - "index.md"
YAML
```

On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.
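If you'd rather not edit the paths by hand, a one-liner can rewrite the placeholder (this assumes the YAML above was written verbatim with `/Users/YOUR_USER`; GNU sed syntax, so on macOS/BSD sed use `sed -i ''`):

```bash
# Rewrite the placeholder home prefix to this machine's real home directory.
sed -i "s|/Users/YOUR_USER|$HOME|g" ~/.config/qmd/index.yml
grep 'path:' ~/.config/qmd/index.yml   # spot-check the rewritten paths
```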
Build the indexes:

```bash
qmd update   # scan files into all three collections
qmd embed    # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)
```
Verify:

```bash
qmd collection list
# Expected:
#   wiki               — N files
#   wiki-archive       — M files  [excluded]
#   wiki-conversations — K files  [excluded]
```

The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.
**When to query which**:

```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5

# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5

# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5

# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

---
## 4. Configure the Python scripts

Three scripts need per-user configuration:

### `scripts/extract-sessions.py` — `PROJECT_MAP`

This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<encoded-path>/`,
where the directory name is derived from the absolute path to your project
(path separators become dashes).

Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:
```python
PROJECT_MAP: dict[str, str] = {
    "projects-wiki": "wiki",
    "-claude": "cl",
    "my-webapp": "web",      # map "mydir/my-webapp" → wing "web"
    "mobile-app": "mob",
    "work-monorepo": "work",
    "-home": "general",      # catch-all for unmatched sessions
}
```
Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.
### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`

Matching display names for every code in `PROJECT_MAP`:

```python
PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    "web": "WEB — My Webapp",
    "mob": "MOB — Mobile App",
    "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

PROJECT_ORDER = [
    "work", "web", "mob",   # most-active first
    "wiki", "cl", "general",
]
```
### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`

Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns use `re.search`:

```python
SKIP_DOMAIN_PATTERNS = [
    # ... (generic ones are already there)
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]
```
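Conceptually, the skip check is a `re.search` over each URL's hostname. A sketch (the helper name `should_skip` is hypothetical; the real script's function may differ):

```python
import re
from urllib.parse import urlparse

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]

def should_skip(url: str) -> bool:
    """Illustrative: True if the URL's hostname matches any skip pattern."""
    host = urlparse(url).hostname or ""
    return any(re.search(pattern, host) for pattern in SKIP_DOMAIN_PATTERNS)
```

Note that the anchors matter: `$` pins the match to the end of the hostname, so `\.mycompany\.com$` catches every subdomain while leaving `mycompany.com.evil.example` alone.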
---

## 5. Create the post-merge hook

The hook rebuilds the qmd index automatically after every `git pull`:
```bash
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail

if command -v qmd &>/dev/null; then
  echo "wiki: rebuilding qmd index..."
  qmd update 2>/dev/null
  # WSL / Linux: no GPU, force CPU-only embeddings
  if [[ "$(uname -s)" == "Linux" ]]; then
    NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
  else
    qmd embed 2>/dev/null
  fi
  echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge
```

`.git/hooks/` isn't tracked by git, so this step must be repeated on every
machine where you clone the repo.
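If repeating that setup on each clone gets tedious, one alternative (assuming git 2.9 or newer) is to keep hooks in a tracked directory and point git at it with `core.hooksPath`. A sketch, demonstrated in a scratch repo; in practice you'd run the last two commands inside `~/projects/wiki` after committing your hook into `scripts/hooks/`:

```bash
# Demo in a throwaway repo; substitute your real wiki checkout.
cd "$(mktemp -d)" && git init -q .
mkdir -p scripts/hooks        # tracked hooks live here, committed with the repo
git config core.hooksPath scripts/hooks
git config core.hooksPath     # prints: scripts/hooks
```

After that, each fresh clone needs only the single `git config` command instead of recreating the hook file.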
---

## 6. Backfill frontmatter (first-time setup or fresh clone)

If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:
```bash
cd ~/projects/wiki

# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill

# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page
changed = 0
for p in iter_live_pages():
    if 'origin' not in p.frontmatter:
        p.frontmatter['origin'] = 'manual'
        write_page(p)
        changed += 1
print(f'{changed} page(s) backfilled')
"
```

For a brand-new empty wiki, there's nothing to backfill — skip this step.
---

## 7. Run the pipeline manually once

Before setting up cron, do a full end-to-end dry run to make sure
everything is wired up:
```bash
cd ~/projects/wiki

# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only

# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude

# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex

# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```

Expected output from step 4: all three phases run, phase 3 (qmd reindex)
shows as skipped in dry-run mode, and you see `finished in Ns`.
---

## 8. Cron setup (optional)

If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.

```bash
crontab -e
```
```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"

# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin

# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes (grouped in a subshell so ALL output
# hits the log, not just the last command's)
*/15 * * * * ( /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push ) >> /tmp/wiki-sync.log 2>&1

# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1

# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1

# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1

# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1

# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```

Replace `YOUR_USER` and the node path as appropriate for your system.
**macOS note**: `cron` needs Full Disk Access if you're pointing it at
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
a plist — same effect, easier permission model on macOS.

**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.

**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.
---

## 9. Tell Claude Code about the wiki

Two separate CLAUDE.md files work together:

1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
   schema the agent reads when working INSIDE the wiki. Tells it how to
   maintain pages, apply frontmatter, handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
   the agent reads on EVERY session (regardless of directory). Tells it
   when and how to consult the wiki from any other project.
Both are provided as starter templates you can copy and adapt:

### (a) Wiki schema — copy to the wiki root

```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```

This file is ~200 lines. It defines:

- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
  which question type

Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.
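For orientation, a page's frontmatter might look roughly like this. The field names are the ones this guide mentions elsewhere (`origin`, `last_verified`, `confidence`, `status`); the `type` field and all of the values are illustrative, so defer to the schema file for the authoritative spec:

```yaml
---
type: pattern            # pattern / decision / environment / concept
origin: manual           # or whatever the automation stamps on mined pages
last_verified: 2025-01-15
confidence: high         # low-confidence pages get flagged when cited
status: active           # pending pages sit in staging
---
```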
### (b) Global wake-up + query instructions

Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:

```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content
```

This adds:

- **Wake-Up Context** — read `context/wake-up.md` at session start
- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
  collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
  example queries for each
- **Rules When Citing** — flag `confidence: low`, `status: pending`,
  and archived pages to the user

Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.
---

## 10. Verify

```bash
cd ~/projects/wiki

# Sync state
bash scripts/wiki-sync.sh --status

# Search
qmd collection list
qmd search "test" --json -n 3   # won't return anything if the wiki is empty

# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"

# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile

# Run the test suite
cd tests && python3 -m pytest
```

Expected:

- `qmd collection list` shows all three collections: `wiki`,
  `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds
---

## Troubleshooting

**qmd search returns nothing**

```bash
qmd collection list           # verify path points at the right place
qmd update                    # rebuild index
qmd embed                     # rebuild embeddings
cat ~/.config/qmd/index.yml   # verify config is correct for your machine
```

**qmd collection points at the wrong path**

Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.

**qmd returns archived pages in default searches**

Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.
**`claude -p` fails in cron ("not authenticated")**

Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.
**`wiki-harvest.py` fetch failures**

```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit

# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```

**`wiki-hygiene.py` archived a page unexpectedly**

Check `last_verified` against the decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:

```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```
**Both machines ran maintenance simultaneously**

Expect merge conflicts on `.harvest-state.json` / `.hygiene-state.json`.
Pick ONE machine for maintenance and disable the maintenance cron on the
other. Leave the sync cron running on both so changes still propagate.
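To disable only the maintenance jobs on the second machine, you can filter them out of the crontab. A sketch (assumes the maintenance entries contain the marker string `wiki-maintain`, as in the step 8 examples):

```bash
# Save a working copy of the crontab; harmless if none is installed yet.
crontab -l > crontab.txt 2>/dev/null || true
# Drop only the maintenance entries, keeping sync and mining jobs.
grep -v 'wiki-maintain' crontab.txt > crontab.filtered || true
# Review crontab.filtered, then install it with: crontab crontab.filtered
```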
**Tests fail**

Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.
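That override is most likely the standard env-var pattern, roughly like this sketch (illustrative; the function name and default path are assumptions, so check `scripts/wiki_lib.py` for the actual spelling):

```python
import os
from pathlib import Path

def wiki_dir() -> Path:
    """Resolve the wiki root; tests point WIKI_DIR at a temp directory,
    normal runs fall back to the real wiki location."""
    return Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
```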
---

## Minimal install (skip everything except the idea)

If you want the conceptual wiki without any of the automation, all you
actually need is:

1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
   [`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, or
   Aider session works)

Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."

You can bolt on the automation layer later if/when it becomes worth
the setup effort.