Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed

docs/CUSTOMIZE.md (new file, 432 lines)

# Customization Guide

This repo is built around Claude Code, cron-based automation, and a specific directory layout. None of those are load-bearing for the core idea. This document walks through adapting it for different agents, different scheduling, and different subsets of functionality.

## What's actually required for the core idea

The minimum viable compounding wiki is:

1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to maintain the wiki

**Everything else in this repo is optional optimization** — automated extraction, URL harvesting, hygiene checks, cron scheduling. They're worth the setup effort once the wiki grows past a few dozen pages, but they're not the *idea*.

---

## Adapting for non-Claude-Code agents

Five components are Claude-specific: four scripts plus the `CLAUDE.md` convention. Each has a natural replacement path:

### 1. `extract-sessions.py` — Claude Code JSONL parsing

**What it does**: Reads session files from `~/.claude/projects/` and converts them to markdown transcripts.

**What's Claude-specific**: The JSONL format and directory structure are specific to the Claude Code CLI. Other agents don't produce these files.

**Replacements**:

- **Cursor**: Cursor stores chat history in `~/Library/Application Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an equivalent `extract-sessions.py` that queries that SQLite database and produces the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in each project directory. A much simpler extractor: walk all project directories, read each `.aider.chat.history.md`, split on session boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / Gemini CLI / other**: Whatever session format your tool uses — the target format is a markdown file with a specific frontmatter shape (`title`, `type: conversation`, `project`, `date`, `status: extracted`, `messages: N`, body of user/assistant turns). Anything that produces files in that shape will flow through the rest of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste interesting conversations into `conversations/general/YYYY-MM-DD-slug.md` by hand and set `status: extracted` yourself.

The pipeline downstream of `extract-sessions.py` doesn't care how the transcripts got there, only that they exist with the right frontmatter.
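The Aider path is simple enough to sketch end to end. A minimal extractor, assuming Aider's `# aider chat started at …` session markers and the frontmatter shape described above (`PROJECT_DIRS` is a placeholder you would configure):

```python
#!/usr/bin/env python3
"""Sketch: extract Aider chat history into the pipeline's transcript shape."""
import re
from datetime import date
from pathlib import Path

# Placeholder: point this at your own project directories.
PROJECT_DIRS = [Path.home() / "projects" / "myproject"]
OUT_ROOT = Path("conversations")

def split_sessions(history_text: str) -> list[str]:
    """Aider prefixes each session with a '# aider chat started at ...' line."""
    parts = re.split(r"(?m)^#+ aider chat started at .*$", history_text)
    return [p.strip() for p in parts if p.strip()]

def write_session(project: str, n: int, body: str) -> Path:
    """Write one session in the frontmatter shape the pipeline expects."""
    messages = len(re.findall(r"(?m)^#{3,4} ", body)) or 1  # rough turn count
    out_dir = OUT_ROOT / project
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / f"{date.today().isoformat()}-session-{n}.md"
    out.write_text(
        "---\n"
        f"title: {project} session {n}\n"
        "type: conversation\n"
        f"project: {project}\n"
        f"date: {date.today().isoformat()}\n"
        "status: extracted\n"
        f"messages: {messages}\n"
        "---\n\n" + body
    )
    return out

if __name__ == "__main__":
    for proj_dir in PROJECT_DIRS:
        history = proj_dir / ".aider.chat.history.md"
        if history.exists():
            for i, sess in enumerate(split_sessions(history.read_text()), 1):
                write_session(proj_dir.name, i, sess)
```

Session boundaries and turn counts are heuristics here; adjust the regexes to whatever your Aider version actually emits.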
### 2. `summarize-conversations.py` — `claude -p` summarization

**What it does**: Classifies extracted conversations into "halls" (fact/discovery/preference/advice/event/tooling) and writes summaries.

**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.

**Replacements**:

- **OpenAI**: Replace the `call_claude` helper with a function that calls the `openai` Python SDK or `gpt` CLI. Use gpt-4o-mini for short conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the `--claude` flag and run a `llama-server` on localhost:8080 (or the WSL gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g. `http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: The `AI_BASE_URL` and `AI_MODEL` env vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by hand to set `status: summarized` and add your own `topics`/`related` frontmatter. Tedious but works for a small wiki.
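For the OpenAI-compatible route, the helper swap can be sketched with nothing but the standard library. `build_request` and `call_model` are hypothetical names, not the script's actual helpers; only the `AI_BASE_URL`/`AI_MODEL` env vars come from the docs above:

```python
"""Sketch: a backend-agnostic replacement for the summarization helper."""
import json
import os
import urllib.request

def build_request(prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON payload from AI_BASE_URL / AI_MODEL env vars."""
    base = os.environ.get("AI_BASE_URL", "http://localhost:8080/v1").rstrip("/")
    model = os.environ.get("AI_MODEL", "phi-4")
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return f"{base}/chat/completions", payload

def call_model(prompt: str, timeout: int = 120) -> str:
    """POST to any OpenAI-compatible /chat/completions endpoint."""
    url, payload = build_request(prompt)
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Because both Ollama and `llama-server` speak this API shape, one helper covers all three non-Claude bullets above.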
### 3. `wiki-harvest.py` — AI compile step

**What it does**: After fetching raw URL content, sends it to `claude -p` to get a structured JSON verdict (new_page / update_page / both / skip) plus the page content.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Any other LLM**: Replace `call_claude_compile()` with a function that calls your preferred backend. The prompt template (`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile` and the harvester will save raw content to `raw/harvested/` without trying to compile it. You can then manually (or via a different script) turn the raw content into wiki pages.
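Whichever backend you swap in, it must return the same verdict shape. A defensive parse, sketched under the assumption that the reply carries the verdict in a `verdict` field and the page body in `content` (both field names are guesses; only the four verdict values come from the docs above):

```python
"""Sketch: validate the compile step's JSON verdict before acting on it."""
import json

VALID_VERDICTS = {"new_page", "update_page", "both", "skip"}

def parse_verdict(raw: str) -> dict:
    """Parse the model's JSON reply and reject malformed verdicts."""
    data = json.loads(raw)
    verdict = data.get("verdict")
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # Assumed: 'content' carries the page body for every non-skip verdict.
    if verdict != "skip" and not data.get("content"):
        raise ValueError("non-skip verdict without page content")
    return data
```

Validating up front means a backend that hallucinates a fifth verdict fails loudly instead of silently writing a malformed page.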
### 4. `wiki-hygiene.py --full` — LLM-powered checks

**What it does**: Duplicate detection, contradiction detection, missing cross-reference suggestions.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the default). Quick mode has no LLM calls and catches 90% of structural issues. Contradictions and duplicates just have to be caught by human review during `wiki-staging.py --review` sessions.

### 5. `CLAUDE.md` at the wiki root

**What it does**: The instructions Claude Code reads at the start of every session that explain the wiki schema and maintenance operations.

**What's Claude-specific**: The filename. Claude Code specifically looks for `CLAUDE.md`; other agents look for other files.

**Replacements**:

| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |

The content is the same — just rename the file and point your agent at it.

---

## Running without cron

Cron is convenient but not required. Alternatives:

### Manual runs

Just call the scripts when you want the wiki updated:

```bash
cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```

This is arguably *better* than cron if you work in bursts — run maintenance when you start a session, not on a schedule.

### systemd timers (Linux)

More observable than cron, better journaling:

```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```

```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service  # see logs
```

### launchd (macOS)

More native than cron on macOS:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki  # verify
```

### Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits the remote:

```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```

Downside: every push is slow. Upside: you never push a broken wiki.

### CI pipeline

Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:

```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only
```

`--check-only` reports issues without auto-fixing them, so CI can flag problems without modifying files.

---

## Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

### "Just the wiki" (no automation)

- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. Works great for small wikis.

### "Wiki + mining" (no harvesting, no hygiene)

- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without the full automation.

### "Wiki + hygiene" (no mining, no harvesting)

- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for orphans, broken links, and staleness.

### "Wiki + harvesting" (no session mining)

- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete mining scripts
- Source URLs manually — put them in a file and point the harvester at it. You'd need to write a wrapper that extracts URLs from your source file and feeds them into the fetch cascade.
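Such a wrapper can be little more than a regex pass over the source file; a sketch (the function names are illustrative, and handing the resulting list to the fetch cascade is left to you):

```python
"""Sketch: pull URLs out of an arbitrary bookmarks/notes file for harvesting."""
import re
from pathlib import Path

URL_RE = re.compile(r'https?://[^\s)">\]]+')

def extract_urls(text: str) -> list[str]:
    """Return de-duplicated URLs in first-seen order, trailing punctuation stripped."""
    seen: dict[str, None] = {}
    for match in URL_RE.finditer(text):
        seen.setdefault(match.group().rstrip(".,;"), None)
    return list(seen)

def harvest_file(path: Path) -> list[str]:
    """Read a source file and return its URLs, ready to feed to the harvester."""
    return extract_urls(path.read_text())
```

The same function works on a Pocket export, a browser-bookmarks HTML dump, or a plain text list, since it only cares about the URL pattern.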
Useful if URLs come from somewhere other than Claude Code sessions (e.g. browser bookmarks, Pocket export, RSS).

---

## Schema customization

The repo uses these live content types:

- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need different categories. To change them:

1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in `scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
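For a research-style schema, steps 2 and 3 amount to edits like the following sketch. The constant names come from the steps above, but their exact shape in your checkout may differ:

```python
# Sketch: steps 2-3 for a research-focused schema. LIVE_CONTENT_DIRS and
# VALID_TYPES are named in the docs; their real shape may differ.

# scripts/wiki_lib.py
LIVE_CONTENT_DIRS = ["findings", "hypotheses", "methods", "literature"]

# scripts/wiki-hygiene.py
VALID_TYPES = {"finding", "hypothesis", "method", "literature"}

def is_live_page(relative_path: str) -> bool:
    """True if the page lives in one of the live content directories."""
    top = relative_path.split("/", 1)[0]
    return top in LIVE_CONTENT_DIRS
```

Keeping the directory list and the `type:` set in sync is the whole trick; a mismatch is exactly the kind of thing quick-mode hygiene should catch.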
Examples of alternative schemas:

**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources

**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure

**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers

None of these are better or worse — pick what matches how you think.

---

## Frontmatter customization

The required fields are documented in `CLAUDE.md` (frontmatter spec). You can add your own fields freely — the parser and hygiene checks ignore unknown keys.

Useful additions you might want:

```yaml
author: alice            # who wrote or introduced the page
tags: [auth, security]   # flat tag list
urgency: high            # for to-do-style wiki pages
stakeholders:            # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01    # explicit review date instead of age-based decay
```

If you want age-based decay to key off a different field than `last_verified` (say, `review_by`), edit `expected_confidence()` in `scripts/wiki-hygiene.py` to read from your custom field.
---

## Working across multiple wikis

The scripts all honor the `WIKI_DIR` environment variable. Run multiple wikis against the same scripts:

```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```

Each has its own state files, its own cron entries, its own qmd collection. You can symlink or copy `scripts/` into each wiki, or run all three against a single checked-out copy of the scripts.

---

## What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

1. **Config should be in YAML, not inline constants.** I bolted a "CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS` as a shortcut. Better: a `config.yaml` at the wiki root that all scripts read.

2. **The mining layer is tightly coupled to Claude Code.** A cleaner design would put a `Session` interface in `wiki_lib.py` and have extractors for each agent produce `Session` objects — the rest of the pipeline would be agent-agnostic.
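That `Session` interface could be as small as a dataclass mirroring the transcript frontmatter. A design sketch, with the field set inferred from the extractor section above (this class does not exist in `wiki_lib.py`):

```python
"""Sketch: an agent-agnostic Session interface for the mining layer."""
from dataclasses import dataclass, field

@dataclass
class Session:
    title: str
    project: str
    date: str                                       # ISO date of the conversation
    turns: list[str] = field(default_factory=list)  # user/assistant bodies

    def to_markdown(self) -> str:
        """Render in the pipeline's extracted-transcript shape."""
        frontmatter = (
            "---\n"
            f"title: {self.title}\n"
            "type: conversation\n"
            f"project: {self.project}\n"
            f"date: {self.date}\n"
            "status: extracted\n"
            f"messages: {len(self.turns)}\n"
            "---\n\n"
        )
        return frontmatter + "\n\n".join(self.turns)
```

Each per-agent extractor would emit `Session` objects, and only `to_markdown()` would know the pipeline's file format.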
3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`, `wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single file and grew.

4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports awkward.** Standard Python convention is underscores. I used hyphens for consistency with the shell scripts, and `conftest.py` has a module-loader workaround. A cleaner fork would use underscores everywhere.

5. **The wiki schema assumes you know what you want to catalog.** If you don't, start with a free-form `notes/` directory and let categories emerge organically, then refactor into `patterns/` etc. later.

None of these are blockers. They're all "if I were designing v2" observations.