Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
2026-04-12 21:16:02 -06:00
commit ee54a2f5d4
31 changed files with 10792 additions and 0 deletions
--- a/docs/SETUP.md
+++ b/docs/SETUP.md
@@ -0,0 +1,502 @@
+# Setup Guide
+
+Complete installation for the full automation pipeline. For the conceptual
+version (just the idea, no scripts), see the "Quick start — Path A" section
+in the [README](../README.md).
+
+Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
+on any POSIX system with Python 3.11+, Node.js 18+, and bash.
+
+---
+
+## 1. Prerequisites
+
+### Required
+
+- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
+- **Node.js 18+** (for `qmd` search)
+- **Python 3.11+** (for all pipeline scripts)
+- **`claude` CLI** with valid authentication — Max subscription OAuth or
+  API key. Required for summarization and the harvester's AI compile step.
+  Without `claude`, you can still use the wiki, but the automation layer
+  falls back to manual or local-LLM paths.
+
+### Python tools (recommended via `pipx`)
+
+```bash
+# URL content extraction — required for wiki-harvest.py
+pipx install trafilatura
+pipx install crawl4ai && crawl4ai-setup    # installs Playwright browsers
+```
+
+Verify: `trafilatura --version` and `crwl --help` should both work.
+
+### Optional
+
+- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
+- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
+  summarization path instead of `claude -p`
+
+---
+
+## 2. Clone the repo
+
+```bash
+git clone <your-gitea-or-github-url> ~/projects/wiki
+cd ~/projects/wiki
+```
+
+The repo contains scripts, tests, docs, and example content — but no
+actual wiki pages. The wiki grows as you use it.
+
+---
+
+## 3. Configure qmd search
+
+`qmd` handles BM25 full-text search and vector search over the wiki.
+The pipeline uses **three** collections:
+
+- **`wiki`** — live content (patterns/decisions/concepts/environments),
+  staging, and raw sources. The default search surface.
+- **`wiki-archive`** — stale / superseded pages. Excluded from default
+  search; query explicitly with `-c wiki-archive` when digging into
+  history.
+- **`wiki-conversations`** — mined Claude Code session transcripts.
+  Excluded from default search because they'd flood results with noisy
+  tool-call output; query explicitly with `-c wiki-conversations` when
+  looking for "what did I discuss about X last month?"
+
+```bash
+npm install -g @tobilu/qmd
+```
+
+Configure via YAML directly — the CLI doesn't support `ignore` or
+`includeByDefault`, so we edit the config file:
+
+```bash
+mkdir -p ~/.config/qmd
+cat > ~/.config/qmd/index.yml <<'YAML'
+collections:
+  wiki:
+    path: /Users/YOUR_USER/projects/wiki   # ← replace with your actual path
+    pattern: "**/*.md"
+    ignore:
+      - "archive/**"
+      - "reports/**"
+      - "plans/**"
+      - "conversations/**"
+      - "scripts/**"
+      - "context/**"
+
+  wiki-archive:
+    path: /Users/YOUR_USER/projects/wiki/archive
+    pattern: "**/*.md"
+    includeByDefault: false
+
+  wiki-conversations:
+    path: /Users/YOUR_USER/projects/wiki/conversations
+    pattern: "**/*.md"
+    includeByDefault: false
+    ignore:
+      - "index.md"
+YAML
+```
+
+On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.
+
+Build the indexes:
+
+```bash
+qmd update     # scan files into all three collections
+qmd embed      # generate vector embeddings (first run: ~2 min, plus ~30 min for conversations on CPU)
+```
+
+Verify:
+
+```bash
+qmd collection list
+# Expected:
+#   wiki                — N files
+#   wiki-archive        — M files [excluded]
+#   wiki-conversations  — K files [excluded]
+```
+
+The `[excluded]` tag on the non-default collections confirms
+`includeByDefault: false` is honored.
+
+**When to query which**:
+
+```bash
+# "What's the current pattern for X?"
+qmd search "topic" --json -n 5
+
+# "What was the OLD pattern, before we changed it?"
+qmd search "topic" -c wiki-archive --json -n 5
+
+# "When did we discuss this, and what did we decide?"
+qmd search "topic" -c wiki-conversations --json -n 5
+
+# Everything — history + current + conversations
+qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
+```
+
+---
+
+## 4. Configure the Python scripts
+
+Three scripts need per-user configuration:
+
+### `scripts/extract-sessions.py` — `PROJECT_MAP`
+
+This maps Claude Code project directory suffixes to short wiki codes
+("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`
+where the hashed path is derived from the absolute path to your project.
+
+Open the script and edit the `PROJECT_MAP` dict near the top. Look for
+the `CONFIGURE ME` block. Examples:
+
+```python
+PROJECT_MAP: dict[str, str] = {
+    "projects-wiki": "wiki",
+    "-claude": "cl",
+    "my-webapp": "web",       # map "mydir/my-webapp" → wing "web"
+    "mobile-app": "mob",
+    "work-monorepo": "work",
+    "-home": "general",       # catch-all for unmatched sessions
+}
+```
+
+Run `ls ~/.claude/projects/` to see what directory names Claude is
+actually producing on your machine — the suffix in `PROJECT_MAP` matches
+against the end of each directory name.
+
+### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`
+
+Matching display names for every code in `PROJECT_MAP`:
+
+```python
+PROJECT_NAMES: dict[str, str] = {
+    "wiki": "WIKI — This Wiki",
+    "cl": "CL — Claude Config",
+    "web": "WEB — My Webapp",
+    "mob": "MOB — Mobile App",
+    "work": "WORK — Day Job",
+    "general": "General — Cross-Project",
+}
+
+PROJECT_ORDER = [
+    "work", "web", "mob",   # most-active first
+    "wiki", "cl", "general",
+]
+```
+
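Since the two scripts are edited separately, it is easy for a wing code to exist in `PROJECT_MAP` without a display name or an ordering entry. The sketch below shows one way to check for that; `check_project_config` is a hypothetical helper, not part of the repo, and the display names are shortened placeholders.

```python
PROJECT_MAP = {"projects-wiki": "wiki", "-claude": "cl", "my-webapp": "web"}
PROJECT_NAMES = {"wiki": "This Wiki", "cl": "Claude Config", "web": "My Webapp"}
PROJECT_ORDER = ["web", "wiki", "cl"]


def check_project_config(
    project_map: dict[str, str],
    project_names: dict[str, str],
    project_order: list[str],
) -> dict[str, list[str]]:
    """Report wing codes that are missing from, or unknown to, each structure."""
    codes = set(project_map.values())
    return {
        "missing_name": sorted(codes - set(project_names)),
        "missing_order": sorted(codes - set(project_order)),
        "unknown_in_order": sorted(set(project_order) - codes),
    }


report = check_project_config(PROJECT_MAP, PROJECT_NAMES, PROJECT_ORDER)
for problem, codes in report.items():
    if codes:
        print(f"{problem}: {', '.join(codes)}")
```

An empty report means every code in `PROJECT_MAP` has both a display name and a position in `PROJECT_ORDER`.
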
+### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`
+
+Add your internal/personal domains so the harvester doesn't try to fetch
+them. Patterns use `re.search`:
+
+```python
+SKIP_DOMAIN_PATTERNS = [
+    # ... (generic ones are already there)
+    r"\.mycompany\.com$",
+    r"^git\.mydomain\.com$",
+]
+```
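
Because the patterns go through `re.search`, anchors decide how much of a domain each one covers. The sketch below illustrates that with the two example patterns. It assumes the patterns are tested against the URL's hostname, which may not be exactly what `wiki-harvest.py` does, and `should_skip` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",      # any subdomain of mycompany.com
    r"^git\.mydomain\.com$",   # exactly git.mydomain.com
]


def should_skip(url: str) -> bool:
    """Assumed behavior: test each pattern against the URL's hostname."""
    host = urlparse(url).hostname or ""
    return any(re.search(pattern, host) for pattern in SKIP_DOMAIN_PATTERNS)


for url in (
    "https://wiki.mycompany.com/page",  # subdomain, matched by pattern 1
    "https://mycompany.com/",           # bare domain, no leading dot
    "https://git.mydomain.com/repo",    # exact host, matched by pattern 2
    "https://notgit.mydomain.com/",     # rejected by the ^ anchor
):
    print(url, should_skip(url))
```

Note that `\.mycompany\.com$` requires a leading dot, so the bare domain is not skipped. If you want both, a pattern such as `(^|\.)mycompany\.com$` covers the bare domain and its subdomains.
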
+
+---
+
+## 5. Create the post-merge hook
+
+The hook rebuilds the qmd index automatically after every `git pull` that
+merges changes (git runs `post-merge` only after a successful merge):
+
+```bash
+cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
+#!/usr/bin/env bash
+set -euo pipefail
+
+if command -v qmd &>/dev/null; then
+  echo "wiki: rebuilding qmd index..."
+  qmd update 2>/dev/null
+  # WSL / Linux: no GPU, force CPU-only embeddings
+  if [[ "$(uname -s)" == "Linux" ]]; then
+    NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
+  else
+    qmd embed 2>/dev/null
+  fi
+  echo "wiki: qmd index updated"
+fi
+HOOK
+chmod +x ~/projects/wiki/.git/hooks/post-merge
+```
+
+`.git/hooks/` isn't tracked by git, so repeat this step on every machine
+where you clone the repo.
+
+---
+
+## 6. Backfill frontmatter (first-time setup or fresh clone)
+
+If you're starting with existing wiki pages that don't yet have
+`last_verified` or `origin`, backfill them:
+
+```bash
+cd ~/projects/wiki
+
+# Backfill last_verified from last_compiled/git/mtime
+python3 scripts/wiki-hygiene.py --backfill
+
+# Backfill origin: manual on pre-automation pages (one-shot inline)
+python3 -c "
+import sys
+sys.path.insert(0, 'scripts')
+from wiki_lib import iter_live_pages, write_page
+changed = 0
+for p in iter_live_pages():
+    if 'origin' not in p.frontmatter:
+        p.frontmatter['origin'] = 'manual'
+        write_page(p)
+        changed += 1
+print(f'{changed} page(s) backfilled')
+"
+```
+
+For a brand-new empty wiki, there's nothing to backfill — skip this step.
+
+---
+
+## 7. Run the pipeline manually once
+
+Before setting up cron, run each stage by hand once to make sure
+everything's wired up. Steps 1-3 do real work (step 2 makes real LLM
+calls); only step 4 is a dry run:
+
+```bash
+cd ~/projects/wiki
+
+# 1. Extract any existing Claude Code sessions
+bash scripts/mine-conversations.sh --extract-only
+
+# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
+python3 scripts/summarize-conversations.py --claude
+
+# 3. Regenerate conversation index + wake-up context
+python3 scripts/update-conversation-index.py --reindex
+
+# 4. Dry-run the maintenance pipeline
+bash scripts/wiki-maintain.sh --dry-run --no-compile
+```
+
+Expected output from step 4: all three phases run, phase 3 (qmd reindex)
+shows as skipped in dry-run mode, and you see `finished in Ns`.
+
+---
+
+## 8. Cron setup (optional)
+
+If you want full automation, add these cron jobs. **Run them on only ONE
+machine** — state files sync via git, so the other machine picks up the
+results automatically.
+
+```bash
+crontab -e
+```
+
+```cron
+# Wiki SSH key for cron (if your remote uses SSH with a key)
+GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"
+
+# PATH for cron so claude, qmd, node, python3, pipx tools are findable
+PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin
+
+# ─── Sync ──────────────────────────────────────────────────────────────────
+# commit/pull/push every 15 minutes
+*/15 * * * * ( /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push ) >> /tmp/wiki-sync.log 2>&1
+
+# full sync with qmd reindex every 2 hours
+0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1
+
+# ─── Mining ────────────────────────────────────────────────────────────────
+# Extract new sessions hourly (no LLM, fast)
+0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1
+
+# Summarize + index daily at 2am (uses claude -p)
+0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1
+
+# ─── Maintenance ───────────────────────────────────────────────────────────
+# Daily at 3am: harvest + quick hygiene + qmd reindex
+0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1
+
+# Weekly Sunday at 4am: full hygiene with LLM checks
+0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
+```
+
+Replace `YOUR_USER` and the node path as appropriate for your system.
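
Cron jobs see only the `PATH` set in the crontab, not the one from your login shell, so a tool that works interactively can still be missing under cron. The sketch below checks a PATH string for the commands this guide relies on. The `CRON_PATH` value is a placeholder to replace with your own crontab line, and `missing_tools` is a hypothetical helper rather than something shipped in the repo.

```python
import shutil

# PATH value copied from your crontab; this one is a placeholder.
CRON_PATH = "/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin"

# Commands the cron jobs in this guide invoke, directly or indirectly.
TOOLS = ["claude", "qmd", "node", "python3", "git", "trafilatura", "crwl"]


def missing_tools(tools: list[str], path: str) -> list[str]:
    """Return the tools that cannot be found on the given PATH string."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    for tool in missing_tools(TOOLS, CRON_PATH):
        print(f"not on cron PATH: {tool}")
```

If the script prints nothing, every listed command resolves under the PATH you gave it.
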
+
+**macOS note**: `cron` needs Full Disk Access if you're pointing it at
+files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
+a plist — same effect, easier permission model on macOS.
+
+**WSL note**: make sure `cron` is actually running (`sudo service cron
+start`). Cron doesn't auto-start in WSL by default.
+
+**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
+Run `claude --version` once interactively as your user to prime the
+token cache — cron then picks up the cached credentials.
+
+---
+
+## 9. Tell Claude Code about the wiki
+
+Two separate CLAUDE.md files work together:
+
+1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
+   schema the agent reads when working INSIDE the wiki. Tells it how to
+   maintain pages, apply frontmatter, handle staging/archival.
+2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
+   the agent reads on EVERY session (regardless of directory). Tells it
+   when and how to consult the wiki from any other project.
+
+Both are provided as starter templates you can copy and adapt:
+
+### (a) Wiki schema — copy to the wiki root
+
+```bash
+cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
+# then edit ~/projects/wiki/CLAUDE.md for your own conventions
+```
+
+This file is ~200 lines. It defines:
+- Directory structure and the automated-vs-manual core rule
+- Frontmatter spec (required fields, staging fields, archive fields)
+- Page-type conventions (pattern / decision / environment / concept)
+- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
+- **Search Strategy** — which of the three qmd collections to use for
+  which question type
+
+Customize the sections marked **"Customization Notes"** at the bottom
+for your own categories, environments, and cross-reference format.
+
+### (b) Global wake-up + query instructions
+
+Append the contents of `docs/examples/global-CLAUDE.md` to your global
+Claude Code instructions:
+
+```bash
+cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
+# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
+# content
+```
+
+This adds:
+- **Wake-Up Context** — read `context/wake-up.md` at session start
+- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
+- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
+  collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
+  example queries for each
+- **Rules When Citing** — flag `confidence: low`, `status: pending`,
+  and archived pages to the user
+
+Together these give the agent a complete picture: how to maintain the
+wiki when working inside it, and how to consult it from anywhere else.
+
+---
+
+## 10. Verify
+
+```bash
+cd ~/projects/wiki
+
+# Sync state
+bash scripts/wiki-sync.sh --status
+
+# Search
+qmd collection list
+qmd search "test" --json -n 3   # won't return anything if wiki is empty
+
+# Mining
+tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"
+
+# End-to-end maintenance dry-run (no writes, no LLM, no network)
+bash scripts/wiki-maintain.sh --dry-run --no-compile
+
+# Run the test suite
+cd tests && python3 -m pytest
+```
+
+Expected:
+- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
+- `wiki-maintain.sh --dry-run` completes all three phases
+- `pytest` passes all 171 tests in ~1.3 seconds
+
+---
+
+## Troubleshooting
+
+**qmd search returns nothing**
+```bash
+qmd collection list          # verify path points at the right place
+qmd update                   # rebuild index
+qmd embed                    # rebuild embeddings
+cat ~/.config/qmd/index.yml  # verify config is correct for your machine
+```
+
+**qmd collection points at the wrong path**
+Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
+from inside the target directory — it can interpret the path oddly.
+
+**qmd returns archived pages in default searches**
+Verify `wiki-archive` has `includeByDefault: false` in the YAML and
+`qmd collection list` shows `[excluded]`.
+
+**`claude -p` fails in cron ("not authenticated")**
+Cron has no browser. Run `claude --version` once as the same user
+outside cron to cache OAuth tokens; cron will pick them up. Also verify
+the `PATH` directive at the top of the crontab includes the directory
+containing `claude`.
+
+**`wiki-harvest.py` fetch failures**
+```bash
+# Verify the extraction tools work
+trafilatura -u "https://example.com" --markdown --no-comments --precision
+crwl "https://example.com" -o markdown-fit
+
+# Check harvest state
+python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
+```
+
+**`wiki-hygiene.py` archived a page unexpectedly**
+Check `last_verified` vs decay thresholds. If the page was never
+referenced in a conversation, it decayed naturally. Restore with:
+```bash
+python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
+```
+
+**Both machines ran maintenance simultaneously**
+Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
+occur. Pick ONE machine for maintenance; disable the maintenance cron
+on the other. Leave sync cron running on both so changes still propagate.
+
+**Tests fail**
+Run `cd tests && python3 -m pytest -v` for verbose output. If the
+failure mentions `WIKI_DIR` or module loading, verify
+`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
+near the top.
+
+---
+
+## Minimal install (skip everything except the idea)
+
+If you want the conceptual wiki without any of the automation, all you
+actually need is:
+
+1. An empty directory
+2. `CLAUDE.md` telling your agent the conventions (see the schema in
+   [`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
+3. `index.md` for the agent to catalog pages
+4. An agent that can read and write files (any Claude Code, Cursor, or
+   Aider session works)
+
+Then tell the agent: "Start maintaining a wiki here. Every time I share
+a source, integrate it. When I ask a question, check the wiki first."
+
+You can bolt on the automation layer later if/when it becomes worth
+the setup effort.