Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
Author: Eric Turner
Date: 2026-04-12 21:16:02 -06:00
Commit: ee54a2f5d4
31 changed files with 10792 additions and 0 deletions

docs/CUSTOMIZE.md
# Customization Guide
This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.
## What's actually required for the core idea
The minimum viable compounding wiki is:
1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
maintain the wiki
**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
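As a concrete starting point, that minimum can be bootstrapped in a few lines. This is a sketch: the directory names follow this repo's schema, and the file contents are placeholders you'd replace with a real index and real agent instructions.

```python
from pathlib import Path

def bootstrap(root: str = "wiki") -> None:
    """Create the minimum viable tree: content dirs, an index, and the
    instruction file the agent reads at the start of each session."""
    for d in ("patterns", "decisions", "concepts", "environments"):
        Path(root, d).mkdir(parents=True, exist_ok=True)
    Path(root, "index.md").write_text("# Wiki Index\n", encoding="utf-8")
    Path(root, "CLAUDE.md").write_text(
        "# How to maintain this wiki\n", encoding="utf-8")
```

Everything after this point in the repo exists to feed and groom that tree automatically.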
---
## Adapting for non-Claude-Code agents
Five components are Claude-specific (four scripts plus the `CLAUDE.md`
convention). Each has a natural replacement path:
### 1. `extract-sessions.py` — Claude Code JSONL parsing
**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.
**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.
**Replacements**:
- **Cursor**: Cursor stores chat history in `~/Library/Application
Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
equivalent `extract-sessions.py` that queries that SQLite and produces
the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
each project directory. A much simpler extractor: walk all project
directories, read each `.aider.chat.history.md`, split on session
boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / Gemini CLI / other**: Whatever session format your
tool uses — the target format is a markdown file with a specific
frontmatter shape (`title`, `type: conversation`, `project`, `date`,
`status: extracted`, `messages: N`, body of user/assistant turns).
Anything that produces files in that shape will flow through the rest
of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
by hand and set `status: extracted` yourself.
The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
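Whatever the source, the write side of a custom extractor is small. A sketch of an emitter for that target shape — the frontmatter fields are the ones listed above; the exact body formatting is an assumption:

```python
from datetime import date
from pathlib import Path

def write_transcript(out_dir, slug, project, turns):
    """Write one conversation in the frontmatter shape the rest of the
    pipeline expects. `turns` is a list of (role, text) pairs."""
    body = "\n\n".join(f"**{role}**: {text}" for role, text in turns)
    page = (
        "---\n"
        f"title: {slug}\n"
        "type: conversation\n"
        f"project: {project}\n"
        f"date: {date.today().isoformat()}\n"
        "status: extracted\n"
        f"messages: {len(turns)}\n"
        "---\n\n"
        f"{body}\n"
    )
    path = Path(out_dir) / f"{date.today().isoformat()}-{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(page, encoding="utf-8")
    return path
```

An Aider or Cursor extractor reduces to "parse your tool's history, call this per session."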
### 2. `summarize-conversations.py` — `claude -p` summarization
**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.
**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.
**Replacements**:
- **OpenAI**: Replace the `call_claude` helper with a function that calls
`openai` Python SDK or `gpt` CLI. Use gpt-4o-mini for short
conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
`--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
`http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: `AI_BASE_URL` and `AI_MODEL` env
vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
hand to set `status: summarized` and add your own `topics`/`related`
frontmatter. Tedious but works for a small wiki.
### 3. `wiki-harvest.py` — AI compile step
**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Any other LLM**: Replace `call_claude_compile()` with a function that
calls your preferred backend. The prompt template
(`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
and the harvester will save raw content to `raw/harvested/` without
trying to compile it. You can then manually (or via a different script)
turn the raw content into wiki pages.
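Whichever backend you swap in, it pays to validate the reply before acting on it. A defensive parser for the verdict shape described above — the exact key names are an assumption about the prompt template, so check them against `COMPILE_PROMPT_TEMPLATE`:

```python
import json

VALID_VERDICTS = {"new_page", "update_page", "both", "skip"}

def parse_verdict(raw: str) -> dict:
    """Parse the compile step's JSON reply, tolerating the ```json fences
    some models add; fall back to 'skip' rather than crash the run."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return {"verdict": "skip", "reason": "unparseable reply"}
    if verdict.get("verdict") not in VALID_VERDICTS:
        return {"verdict": "skip", "reason": "unknown verdict"}
    return verdict
```

Falling back to `skip` means a flaky model degrades to "nothing harvested" instead of corrupting pages.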
### 4. `wiki-hygiene.py --full` — LLM-powered checks
**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
default). Quick mode has no LLM calls and catches 90% of structural
issues. Contradictions and duplicates just have to be caught by human
review during `wiki-staging.py --review` sessions.
### 5. `CLAUDE.md` at the wiki root
**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.
**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.
**Replacements**:
| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |
The content is the same — just rename the file and point your agent at
it.
---
## Running without cron
Cron is convenient but not required. Alternatives:
### Manual runs
Just call the scripts when you want the wiki updated:
```bash
cd ~/projects/wiki
# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh
# When you want hygiene + harvest
bash scripts/wiki-maintain.sh
# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```
This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.
### systemd timers (Linux)
More observable than cron, better journaling:
```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline
[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```
```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service # see logs
```
### launchd (macOS)
More native than cron on macOS:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- Save as ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```
```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki # verify
```
### Git hooks (pre-push)
Run hygiene before every push so the wiki is always clean when it hits
the remote:
```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```
Downside: every push is slow. Upside: you never push a broken wiki.
### CI pipeline
Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:
```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: python3 scripts/wiki-hygiene.py --check-only
```
`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.
---
## Minimal subsets
You don't have to run the whole pipeline. Pick what's useful:
### "Just the wiki" (no automation)
- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent
This is the Karpathy-gist version. Works great for small wikis.
### "Wiki + mining" (no harvesting, no hygiene)
- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually
Useful if you want session continuity (the wake-up briefing) without
the full automation.
### "Wiki + hygiene" (no mining, no harvesting)
- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues
Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.
### "Wiki + harvesting" (no session mining)
- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete mining scripts
- Source URLs manually — put them in a file and point the harvester at
it. You'd need to write a wrapper that extracts URLs from your source
file and feeds them into the fetch cascade.
Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).
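That wrapper can be tiny. Here's a sketch that pulls URLs out of any text file (a bookmarks export, a Pocket dump, a plain list on stdin); how you hand the resulting list to `wiki-harvest.py` depends on its CLI, which this sketch doesn't assume:

```python
import re
import sys

URL_RE = re.compile(r"""https?://[^\s<>"'\)\]]+""")

def extract_urls(text: str) -> list[str]:
    """Return unique URLs in first-seen order, trailing punctuation trimmed."""
    seen, out = set(), []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;")
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out

if __name__ == "__main__":
    for url in extract_urls(sys.stdin.read()):
        print(url)
```

Run it as `python3 extract_urls.py < bookmarks.html` and feed the output into the fetch cascade however the harvester accepts input.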
---
## Schema customization
The repo uses these live content types:
- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ
These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:
1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
`scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
Examples of alternative schemas:
**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources
**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure
**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers
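Applied to the research schema above, steps 2 and 3 would look roughly like this — a sketch only, since the exact constant shapes in `wiki_lib.py` and `wiki-hygiene.py` may differ:

```python
# scripts/wiki_lib.py — step 2: the directories hygiene treats as live content
LIVE_CONTENT_DIRS = ["findings", "hypotheses", "methods", "literature"]

# scripts/wiki-hygiene.py — step 3: the type: values frontmatter may use
VALID_TYPES = {"finding", "hypothesis", "method", "literature"}
```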
None of these are better or worse — pick what matches how you think.
---
## Frontmatter customization
The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.
Useful additions you might want:
```yaml
author: alice # who wrote or introduced the page
tags: [auth, security] # flat tag list
urgency: high # for to-do-style wiki pages
stakeholders: # who cares about this page
- product-team
- security-team
review_by: 2026-06-01 # explicit review date instead of age-based decay
```
If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.
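A sketch of that change, assuming `expected_confidence()` takes the page's parsed frontmatter as a dict — the real signature in `wiki-hygiene.py` may differ:

```python
from datetime import date

def expected_confidence(frontmatter: dict) -> str:
    """Key decay off an explicit review_by date instead of last_verified age."""
    review_by = frontmatter.get("review_by")
    if review_by is None:
        return "unknown"  # no review date set; leave confidence alone
    due = date.fromisoformat(str(review_by))  # handles str or date values
    return "stale" if date.today() > due else "fresh"
```

The explicit-date variant trades the zero-maintenance property of age-based decay for precise control over when each page comes up for review.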
---
## Working across multiple wikis
The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:
```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh
# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh
# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```
Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.
---
## What I'd change if starting over
Honest notes on the design choices, in case you're about to fork:
1. **Config should be in YAML, not inline constants.** I bolted a
"CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
as a shortcut. Better: a `config.yaml` at the wiki root that all
scripts read.
2. **The mining layer is tightly coupled to Claude Code.** A cleaner
design would put a `Session` interface in `wiki_lib.py` and have
extractors for each agent produce `Session` objects — the rest of the
pipeline would be agent-agnostic.
3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
`wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
file and grew.
4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
awkward.** Standard Python convention is underscores. I used hyphens
for consistency with the shell scripts, and `conftest.py` has a
module-loader workaround. A cleaner fork would use underscores
everywhere.
5. **The wiki schema assumes you know what you want to catalog.** If
you don't, start with a free-form `notes/` directory and let
categories emerge organically, then refactor into `patterns/` etc.
later.
None of these are blockers. They're all "if I were designing v2"
observations.