# Setup Guide

Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).

Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.

---

## 1. Prerequisites

### Required

- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
  API key. Required for summarization and the harvester's AI compile step.
  Without `claude`, you can still use the wiki, but the automation layer
  falls back to manual or local-LLM paths.

### Python tools (recommended via `pipx`)

```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup  # installs Playwright browsers
```

Verify: `trafilatura --version` and `crwl --help` should both work.

### Optional

- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
  summarization path instead of `claude -p`

---

## 2. Clone the repo

```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```

The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.

---

## 3. Configure qmd search

`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:

- **`wiki`** — live content (patterns/decisions/concepts/environments),
  staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
  search; query explicitly with `-c wiki-archive` when digging into
  history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
  Excluded from default search because they'd flood results with noisy
  tool-call output; query explicitly with `-c wiki-conversations` when
  looking for "what did I discuss about X last month?"

```bash
npm install -g @tobilu/qmd
```

Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:

```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
  wiki:
    path: /Users/YOUR_USER/projects/wiki   # ← replace with your actual path
    pattern: "**/*.md"
    ignore:
      - "archive/**"
      - "reports/**"
      - "plans/**"
      - "conversations/**"
      - "scripts/**"
      - "context/**"

  wiki-archive:
    path: /Users/YOUR_USER/projects/wiki/archive
    pattern: "**/*.md"
    includeByDefault: false

  wiki-conversations:
    path: /Users/YOUR_USER/projects/wiki/conversations
    pattern: "**/*.md"
    includeByDefault: false
    ignore:
      - "index.md"
YAML
```

On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.

Build the indexes:

```bash
qmd update   # scan files into all three collections
qmd embed    # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)
```

Verify:

```bash
qmd collection list
# Expected:
#   wiki — N files
#   wiki-archive — M files [excluded]
#   wiki-conversations — K files [excluded]
```

The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.

**When to query which**:

```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5

# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5

# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5

# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

---

## 4. Configure the Python scripts

Three scripts need per-user configuration:

### `scripts/extract-sessions.py` — `PROJECT_MAP`

This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`,
where the hashed path is derived from the absolute path to your project.

Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:

```python
PROJECT_MAP: dict[str, str] = {
    "projects-wiki": "wiki",
    "-claude": "cl",
    "my-webapp": "web",       # map "mydir/my-webapp" → wing "web"
    "mobile-app": "mob",
    "work-monorepo": "work",
    "-home": "general",       # catch-all for unmatched sessions
}
```
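
The lookup itself is simple: the first key that matches the end of the
session directory name wins. A minimal sketch of the idea (`wing_for` is an
illustrative name, not a function from the script):

```python
# Illustrative sketch of PROJECT_MAP resolution, NOT the script's real code.
# The first map key that matches the END of the derived directory name
# decides the wing; a catch-all covers everything else.
PROJECT_MAP = {
    "projects-wiki": "wiki",
    "my-webapp": "web",
    "-home": "general",   # catch-all
}

def wing_for(session_dir: str, default: str = "general") -> str:
    for suffix, wing in PROJECT_MAP.items():
        if session_dir.endswith(suffix):
            return wing
    return default

print(wing_for("-Users-me-projects-wiki"))  # → wiki
```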

Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.

### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`

Matching display names for every code in `PROJECT_MAP`:

```python
PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    "web": "WEB — My Webapp",
    "mob": "MOB — Mobile App",
    "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

PROJECT_ORDER = [
    "work", "web", "mob",   # most-active first
    "wiki", "cl", "general",
]
```

### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`

Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns use `re.search`:

```python
SKIP_DOMAIN_PATTERNS = [
    # ... (generic ones are already there)
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]
```
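
To sanity-check a new pattern before a harvest run, the skip test can be
reproduced in isolation. A sketch under the assumption that patterns are
matched against the URL's hostname (`should_skip` is an illustrative name,
not the script's):

```python
import re
from urllib.parse import urlparse

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]

def should_skip(url: str) -> bool:
    # A URL is skipped when any pattern re.search-es its hostname.
    host = urlparse(url).hostname or ""
    return any(re.search(p, host) for p in SKIP_DOMAIN_PATTERNS)

print(should_skip("https://git.mydomain.com/me/repo"))  # True
print(should_skip("https://example.com/article"))       # False
```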

---

## 5. Create the post-merge hook

The hook rebuilds the qmd index automatically after every `git pull`:

```bash
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail

if command -v qmd &>/dev/null; then
  echo "wiki: rebuilding qmd index..."
  qmd update 2>/dev/null
  # WSL / Linux: no GPU, force CPU-only embeddings
  if [[ "$(uname -s)" == "Linux" ]]; then
    NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
  else
    qmd embed 2>/dev/null
  fi
  echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge
```

`.git/hooks/` isn't tracked by git, so this step must be repeated on every
machine where you clone the repo.

---

## 6. Backfill frontmatter (first-time setup or fresh clone)

If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:

```bash
cd ~/projects/wiki

# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill

# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page
changed = 0
for p in iter_live_pages():
    if 'origin' not in p.frontmatter:
        p.frontmatter['origin'] = 'manual'
        write_page(p)
        changed += 1
print(f'{changed} page(s) backfilled')
"
```
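
After both commands, every live page should carry the two fields. For
orientation, a hypothetical backfilled page's frontmatter (the date value is
an example; these two fields are the only ones this step touches):

```yaml
---
origin: manual            # set by the one-shot inline backfill above
last_verified: 2025-01-15 # set by wiki-hygiene.py --backfill
---
```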

For a brand-new empty wiki, there's nothing to backfill — skip this step.

---

## 7. Run the pipeline manually once

Before setting up cron, do a full end-to-end dry run to make sure
everything's wired up:

```bash
cd ~/projects/wiki

# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only

# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude

# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex

# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```

Expected output from step 4: all three phases run, phase 3 (qmd reindex)
shows as skipped in dry-run mode, and you see `finished in Ns`.

---

## 8. Cron setup (optional)

If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.

```bash
crontab -e
```

```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"

# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin

# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes
*/15 * * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push >> /tmp/wiki-sync.log 2>&1

# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1

# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1

# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1

# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1

# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```

Replace `YOUR_USER` and the node path as appropriate for your system.

**macOS note**: `cron` needs Full Disk Access if you're pointing it at
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
a plist — same effect, easier permission model on macOS.

**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.

**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.

---

## 9. Tell Claude Code about the wiki

Two separate `CLAUDE.md` files work together:

1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
   schema the agent reads when working INSIDE the wiki. Tells it how to
   maintain pages, apply frontmatter, and handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
   the agent reads on EVERY session (regardless of directory). Tells it
   when and how to consult the wiki from any other project.

Both are provided as starter templates you can copy and adapt:

### (a) Wiki schema — copy to the wiki root

```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```

This file is ~200 lines. It defines:

- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
  which question type

Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.

### (b) Global wake-up + query instructions

Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:

```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content
```

This adds:

- **Wake-Up Context** — read `context/wake-up.md` at session start
- **memex — When to Consult It** — query mode vs ingest mode rules
- **memex — How to Search It** — explicit guidance for all three qmd
  collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
  example queries for each
- **memex — Rules When Citing** — flag `confidence: low`,
  `status: pending`, and archived pages to the user

Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.

---

## 10. Verify

```bash
cd ~/projects/wiki

# Sync state
bash scripts/wiki-sync.sh --status

# Search
qmd collection list
qmd search "test" --json -n 3   # won't return anything if wiki is empty

# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"

# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile

# Run the test suite
cd tests && python3 -m pytest
```

Expected:

- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds

---

## Troubleshooting

**qmd search returns nothing**

```bash
qmd collection list          # verify path points at the right place
qmd update                   # rebuild index
qmd embed                    # rebuild embeddings
cat ~/.config/qmd/index.yml  # verify config is correct for your machine
```

**qmd collection points at the wrong path**

Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.

**qmd returns archived pages in default searches**

Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.

**`claude -p` fails in cron ("not authenticated")**

Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.

**`wiki-harvest.py` fetch failures**

```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit

# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```

**`wiki-hygiene.py` archived a page unexpectedly**

Check `last_verified` vs decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:

```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```

**Both machines ran maintenance simultaneously**

Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
occur. Pick ONE machine for maintenance; disable the maintenance cron
on the other. Leave sync cron running on both so changes still propagate.

**Tests fail**

Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.

---

## Minimal install (skip everything except the idea)

If you want the conceptual wiki without any of the automation, all you
actually need is:

1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
   [`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, or
   Aider session works)

Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."

You can bolt on the automation layer later if/when it becomes worth
the setup effort.