A compounding LLM-maintained knowledge wiki. Synthesis of Andrej
Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an
automation layer on top for conversation mining, URL harvesting,
human-in-the-loop staging, staleness decay, and hygiene. Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example `CLAUDE.md` files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
# Customization Guide

This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.

## What's actually required for the core idea

The minimum viable compounding wiki is:

1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
   it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how
   to maintain the wiki

**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
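
Concretely, using the directory names this repo ships with, the minimum
setup is just a handful of files (the inline comments restate the
schema's own one-line descriptions):

```text
wiki/
├── CLAUDE.md       # schema + maintenance instructions the agent reads
├── index.md        # table of contents
├── patterns/       # HOW things should be built
├── decisions/      # WHY we chose this approach
├── concepts/       # WHAT the foundational ideas are
└── environments/   # WHERE implementations differ
```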

---

## Adapting for non-Claude-Code agents

Five components are Claude-specific (four scripts plus the `CLAUDE.md`
convention file). Each has a natural replacement path:

### 1. `extract-sessions.py` — Claude Code JSONL parsing

**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.

**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.

**Replacements**:

- **Cursor**: Cursor stores chat history in `~/Library/Application
  Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
  equivalent `extract-sessions.py` that queries that SQLite store and
  produces the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
  each project directory. A much simpler extractor: walk all project
  directories, read each `.aider.chat.history.md`, split on session
  boundaries, and write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
  tool uses — the target format is a markdown file with a specific
  frontmatter shape (`title`, `type: conversation`, `project`, `date`,
  `status: extracted`, `messages: N`, body of user/assistant turns).
  Anything that produces files in that shape will flow through the rest
  of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
  interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
  by hand and set `status: extracted` yourself.

The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
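
For reference, a transcript in that target shape might look like this.
The frontmatter keys are the ones listed above; the values and the
turn format shown in the body are illustrative assumptions:

```markdown
---
title: Debugging the fetch cascade
type: conversation
project: wiki
date: 2026-01-15
status: extracted
messages: 12
---

**User**: ...

**Assistant**: ...
```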

### 2. `summarize-conversations.py` — `claude -p` summarization

**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.

**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.

**Replacements**:

- **OpenAI**: Replace the `call_claude` helper with a function that
  calls the `openai` Python SDK or the `gpt` CLI. Use gpt-4o-mini for
  short conversations (equivalent to haiku routing) and gpt-4o for long
  ones.
- **Local LLM**: The script already supports this path — just omit the
  `--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
  gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
  `http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: The `AI_BASE_URL` and `AI_MODEL`
  env vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
  hand to set `status: summarized` and add your own `topics`/`related`
  frontmatter. Tedious, but it works for a small wiki.
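
As a sketch of that replacement, the helper below talks to any
OpenAI-compatible `/chat/completions` endpoint and reproduces the
short/long routing. The function names and the 8000-character threshold
are assumptions; only `AI_BASE_URL` and `AI_MODEL` come from the
script's documented interface:

```python
import json
import os
import urllib.request

def pick_model(text, threshold_chars=8000):
    """Route short conversations to the cheap model (haiku-equivalent)."""
    return "gpt-4o-mini" if len(text) < threshold_chars else "gpt-4o"

def call_llm(prompt, model=None):
    """POST to any OpenAI-compatible /chat/completions endpoint."""
    base_url = os.environ.get("AI_BASE_URL", "https://api.openai.com/v1")
    model = model or os.environ.get("AI_MODEL", "gpt-4o-mini")
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping `call_claude` for something like `call_llm` changes only the
transport; the classification prompts stay the same.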

### 3. `wiki-harvest.py` — AI compile step

**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Any other LLM**: Replace `call_claude_compile()` with a function
  that calls your preferred backend. The prompt template
  (`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
  and the harvester will save raw content to `raw/harvested/` without
  trying to compile it. You can then turn the raw content into wiki
  pages manually (or via a different script).
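
For orientation, the structured verdict might look something like this.
The `verdict` values are the ones named above; the surrounding field
names are assumptions about the exact shape, so check the prompt
template before relying on them:

```json
{
  "verdict": "new_page",
  "title": "Example Topic",
  "path": "concepts/example-topic.md",
  "content": "...markdown page body..."
}
```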

### 4. `wiki-hygiene.py --full` — LLM-powered checks

**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
  default). Quick mode makes no LLM calls and catches 90% of structural
  issues. Contradictions and duplicates then have to be caught by human
  review during `wiki-staging.py --review` sessions.

### 5. `CLAUDE.md` at the wiki root

**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.

**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.

**Replacements**:

| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |

The content is the same — just rename the file and point your agent at
it.

---

## Running without cron

Cron is convenient but not required. Alternatives:

### Manual runs

Just call the scripts when you want the wiki updated:

```bash
cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```

This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.

### systemd timers (Linux)

More observable than cron, with better journaling:

```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```

```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
systemctl --user daemon-reload            # pick up the new unit files
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service  # see logs
```

### launchd (macOS)

More native than cron on macOS:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- Save as ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki  # verify
```

### Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits
the remote:

```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```

Downside: every push is slow. Upside: you never push a broken wiki.

### CI pipeline

Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:

```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only
```

`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.

---

## Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

### "Just the wiki" (no automation)

- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. It works great for small wikis.

### "Wiki + mining" (no harvesting, no hygiene)

- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining, but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without
the full automation.

### "Wiki + hygiene" (no mining, no harvesting)

- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.

### "Wiki + harvesting" (no session mining)

- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete the mining scripts
- Source URLs manually — put them in a file and point the harvester at
  it. You'd need to write a wrapper that extracts URLs from your source
  file and feeds them into the fetch cascade.

Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).

---

## Schema customization

The repo uses these live content types:

- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:

1. Rename or add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
   `scripts/wiki-hygiene.py` (the `VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
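
Steps 2 and 3 amount to one-line edits like the following. The exact
shapes of the constants here are assumptions — check the real
definitions in the scripts before editing:

```python
# scripts/wiki_lib.py — directories the pipeline treats as live content
LIVE_CONTENT_DIRS = ["patterns", "decisions", "concepts", "environments"]

# scripts/wiki-hygiene.py — allowed values for the `type:` frontmatter field
VALID_TYPES = {"pattern", "decision", "concept", "environment"}
```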

Examples of alternative schemas:

**Research wiki**:

- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources

**Product wiki**:

- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure

**Personal knowledge wiki**:

- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers

None of these are better or worse — pick what matches how you think.

---

## Frontmatter customization

The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.

Useful additions you might want:

```yaml
author: alice            # who wrote or introduced the page
tags: [auth, security]   # flat tag list
urgency: high            # for to-do-style wiki pages
stakeholders:            # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01    # explicit review date instead of age-based decay
```

If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.

---

## Working across multiple wikis

The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:

```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```

Each wiki has its own state files, its own cron entries, and its own
qmd collection. You can symlink or copy `scripts/` into each wiki, or
run all three against a single checked-out copy of the scripts.

---

## What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

1. **Config should be in YAML, not inline constants.** I bolted a
   "CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
   as a shortcut. Better: a `config.yaml` at the wiki root that all
   scripts read.
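
Such a `config.yaml` might look like this; the keys mirror the inline
constants named above, but the nested shapes are guesses, not the
repo's actual format:

```yaml
# config.yaml at the wiki root, read by every script
project_map:
  ~/projects/wiki: wiki
  ~/projects/work: work
skip_domain_patterns:
  - localhost
  - "*.internal"
```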

2. **The mining layer is tightly coupled to Claude Code.** A cleaner
   design would put a `Session` interface in `wiki_lib.py` and have
   extractors for each agent produce `Session` objects — the rest of the
   pipeline would be agent-agnostic.
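
That `Session` interface might be sketched like so; the field names are
assumptions, chosen to match the frontmatter shape the pipeline already
expects:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Agent-agnostic transcript in the pipeline's frontmatter shape."""
    title: str
    project: str
    date: str                     # ISO date, e.g. "2026-01-15"
    messages: list = field(default_factory=list)  # user/assistant turns
    status: str = "extracted"

    def to_markdown(self):
        """Render the markdown file the downstream scripts expect."""
        head = (
            "---\n"
            f"title: {self.title}\n"
            "type: conversation\n"
            f"project: {self.project}\n"
            f"date: {self.date}\n"
            f"status: {self.status}\n"
            f"messages: {len(self.messages)}\n"
            "---\n\n"
        )
        return head + "\n\n".join(self.messages)
```

Each agent-specific extractor would build `Session` objects; only
`to_markdown()` touches the on-disk format.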

3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
   it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
   `wiki_hygiene/llm.py`, etc. would be cleaner. It started as a single
   file and grew.

4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
   awkward.** Standard Python convention is underscores. I used hyphens
   for consistency with the shell scripts, and `conftest.py` has a
   module-loader workaround. A cleaner fork would use underscores
   everywhere.

5. **The wiki schema assumes you know what you want to catalog.** If
   you don't, start with a free-form `notes/` directory, let categories
   emerge organically, and then refactor into `patterns/` etc. later.

None of these are blockers. They're all "if I were designing v2"
observations.