Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
Author: Eric Turner
Date: 2026-04-12 21:16:02 -06:00
Commit: ee54a2f5d4
31 changed files with 10792 additions and 0 deletions

docs/CUSTOMIZE.md
# Customization Guide
This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.
## What's actually required for the core idea
The minimum viable compounding wiki is:
1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
maintain the wiki
**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
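As a concrete starting point, that minimum can be bootstrapped in a few lines. This is a sketch: the directory names follow this repo's schema, and the file contents are placeholders you'd replace with a real index and real agent instructions.

```python
from pathlib import Path

def bootstrap(root: str = "wiki") -> None:
    """Create the minimum viable tree: content dirs, an index, and the
    instruction file the agent reads at the start of each session."""
    for d in ("patterns", "decisions", "concepts", "environments"):
        Path(root, d).mkdir(parents=True, exist_ok=True)
    Path(root, "index.md").write_text("# Wiki Index\n", encoding="utf-8")
    Path(root, "CLAUDE.md").write_text(
        "# How to maintain this wiki\n", encoding="utf-8")
```

Everything after this point in the repo exists to feed and groom that tree automatically.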
---
## Adapting for non-Claude-Code agents
Five components are Claude-specific (four scripts plus the `CLAUDE.md`
convention). Each has a natural replacement path:
### 1. `extract-sessions.py` — Claude Code JSONL parsing
**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.
**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.
**Replacements**:
- **Cursor**: Cursor stores chat history in `~/Library/Application
Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
equivalent `extract-sessions.py` that queries that SQLite and produces
the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
each project directory. A much simpler extractor: walk all project
directories, read each `.aider.chat.history.md`, split on session
boundaries, write to `conversations/<project>/`.
- **OpenAI Codex / Gemini CLI / other**: Whatever session format your
tool uses — the target format is a markdown file with a specific
frontmatter shape (`title`, `type: conversation`, `project`, `date`,
`status: extracted`, `messages: N`, body of user/assistant turns).
Anything that produces files in that shape will flow through the rest
of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
by hand and set `status: extracted` yourself.
The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
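Whatever the source, the write side of a custom extractor is small. A sketch of an emitter for that target shape — the frontmatter fields are the ones listed above; the exact body formatting is an assumption:

```python
from datetime import date
from pathlib import Path

def write_transcript(out_dir, slug, project, turns):
    """Write one conversation in the frontmatter shape the rest of the
    pipeline expects. `turns` is a list of (role, text) pairs."""
    body = "\n\n".join(f"**{role}**: {text}" for role, text in turns)
    page = (
        "---\n"
        f"title: {slug}\n"
        "type: conversation\n"
        f"project: {project}\n"
        f"date: {date.today().isoformat()}\n"
        "status: extracted\n"
        f"messages: {len(turns)}\n"
        "---\n\n"
        f"{body}\n"
    )
    path = Path(out_dir) / f"{date.today().isoformat()}-{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(page, encoding="utf-8")
    return path
```

An Aider or Cursor extractor reduces to "parse your tool's history, call this per session."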
### 2. `summarize-conversations.py` — `claude -p` summarization
**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.
**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.
**Replacements**:
- **OpenAI**: Replace the `call_claude` helper with a function that calls
`openai` Python SDK or `gpt` CLI. Use gpt-4o-mini for short
conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
`--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
`http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: `AI_BASE_URL` and `AI_MODEL` env
vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
hand to set `status: summarized` and add your own `topics`/`related`
frontmatter. Tedious but works for a small wiki.
### 3. `wiki-harvest.py` — AI compile step
**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Any other LLM**: Replace `call_claude_compile()` with a function that
calls your preferred backend. The prompt template
(`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
and the harvester will save raw content to `raw/harvested/` without
trying to compile it. You can then manually (or via a different script)
turn the raw content into wiki pages.
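Whichever backend you swap in, it pays to validate the reply before acting on it. A defensive parser for the verdict shape described above — the exact key names are an assumption about the prompt template, so check them against `COMPILE_PROMPT_TEMPLATE`:

```python
import json

VALID_VERDICTS = {"new_page", "update_page", "both", "skip"}

def parse_verdict(raw: str) -> dict:
    """Parse the compile step's JSON reply, tolerating the ```json fences
    some models add; fall back to 'skip' rather than crash the run."""
    text = raw.strip()
    if text.startswith("```"):
        text = text.strip("`")
        text = text.split("\n", 1)[1] if "\n" in text else text
    try:
        verdict = json.loads(text)
    except json.JSONDecodeError:
        return {"verdict": "skip", "reason": "unparseable reply"}
    if verdict.get("verdict") not in VALID_VERDICTS:
        return {"verdict": "skip", "reason": "unknown verdict"}
    return verdict
```

Falling back to `skip` means a flaky model degrades to "nothing harvested" instead of corrupting pages.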
### 4. `wiki-hygiene.py --full` — LLM-powered checks
**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.
**What's Claude-specific**: `claude -p --model haiku|sonnet`.
**Replacements**:
- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
default). Quick mode has no LLM calls and catches 90% of structural
issues. Contradictions and duplicates just have to be caught by human
review during `wiki-staging.py --review` sessions.
### 5. `CLAUDE.md` at the wiki root
**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.
**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.
**Replacements**:
| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |
The content is the same — just rename the file and point your agent at
it.
---
## Running without cron
Cron is convenient but not required. Alternatives:
### Manual runs
Just call the scripts when you want the wiki updated:
```bash
cd ~/projects/wiki
# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh
# When you want hygiene + harvest
bash scripts/wiki-maintain.sh
# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```
This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.
### systemd timers (Linux)
More observable than cron, better journaling:
```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline
[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```
```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
```
```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service # see logs
```
### launchd (macOS)
More native than cron on macOS:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- Save as ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```
```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki # verify
```
### Git hooks (pre-push)
Run hygiene before every push so the wiki is always clean when it hits
the remote:
```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```
Downside: every push is slow. Upside: you never push a broken wiki.
### CI pipeline
Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:
```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - run: python3 scripts/wiki-hygiene.py --check-only
```
`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.
---
## Minimal subsets
You don't have to run the whole pipeline. Pick what's useful:
### "Just the wiki" (no automation)
- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent
This is the Karpathy-gist version. Works great for small wikis.
### "Wiki + mining" (no harvesting, no hygiene)
- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually
Useful if you want session continuity (the wake-up briefing) without
the full automation.
### "Wiki + hygiene" (no mining, no harvesting)
- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues
Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.
### "Wiki + harvesting" (no session mining)
- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete mining scripts
- Source URLs manually — put them in a file and point the harvester at
it. You'd need to write a wrapper that extracts URLs from your source
file and feeds them into the fetch cascade.
Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).
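That wrapper can be tiny. Here's a sketch that pulls URLs out of any text file (a bookmarks export, a Pocket dump, a plain list on stdin); how you hand the resulting list to `wiki-harvest.py` depends on its CLI, which this sketch doesn't assume:

```python
import re
import sys

URL_RE = re.compile(r"""https?://[^\s<>"'\)\]]+""")

def extract_urls(text: str) -> list[str]:
    """Return unique URLs in first-seen order, trailing punctuation trimmed."""
    seen, out = set(), []
    for url in URL_RE.findall(text):
        url = url.rstrip(".,;")
        if url not in seen:
            seen.add(url)
            out.append(url)
    return out

if __name__ == "__main__":
    for url in extract_urls(sys.stdin.read()):
        print(url)
```

Run it as `python3 extract_urls.py < bookmarks.html` and feed the output into the fetch cascade however the harvester accepts input.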
---
## Schema customization
The repo uses these live content types:
- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ
These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:
1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
`scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
Examples of alternative schemas:
**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources
**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure
**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers
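Applied to the research schema above, steps 2 and 3 would look roughly like this — a sketch only, since the exact constant shapes in `wiki_lib.py` and `wiki-hygiene.py` may differ:

```python
# scripts/wiki_lib.py — step 2: the directories hygiene treats as live content
LIVE_CONTENT_DIRS = ["findings", "hypotheses", "methods", "literature"]

# scripts/wiki-hygiene.py — step 3: the type: values frontmatter may use
VALID_TYPES = {"finding", "hypothesis", "method", "literature"}
```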
None of these are better or worse — pick what matches how you think.
---
## Frontmatter customization
The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.
Useful additions you might want:
```yaml
author: alice # who wrote or introduced the page
tags: [auth, security] # flat tag list
urgency: high # for to-do-style wiki pages
stakeholders: # who cares about this page
- product-team
- security-team
review_by: 2026-06-01 # explicit review date instead of age-based decay
```
If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.
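A sketch of that change, assuming `expected_confidence()` takes the page's parsed frontmatter as a dict — the real signature in `wiki-hygiene.py` may differ:

```python
from datetime import date

def expected_confidence(frontmatter: dict) -> str:
    """Key decay off an explicit review_by date instead of last_verified age."""
    review_by = frontmatter.get("review_by")
    if review_by is None:
        return "unknown"  # no review date set; leave confidence alone
    due = date.fromisoformat(str(review_by))  # handles str or date values
    return "stale" if date.today() > due else "fresh"
```

The explicit-date variant trades the zero-maintenance property of age-based decay for precise control over when each page comes up for review.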
---
## Working across multiple wikis
The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:
```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh
# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh
# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```
Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.
---
## What I'd change if starting over
Honest notes on the design choices, in case you're about to fork:
1. **Config should be in YAML, not inline constants.** I bolted a
"CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
as a shortcut. Better: a `config.yaml` at the wiki root that all
scripts read.
2. **The mining layer is tightly coupled to Claude Code.** A cleaner
design would put a `Session` interface in `wiki_lib.py` and have
extractors for each agent produce `Session` objects — the rest of the
pipeline would be agent-agnostic.
3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
`wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
file and grew.
4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
awkward.** Standard Python convention is underscores. I used hyphens
for consistency with the shell scripts, and `conftest.py` has a
module-loader workaround. A cleaner fork would use underscores
everywhere.
5. **The wiki schema assumes you know what you want to catalog.** If
you don't, start with a free-form `notes/` directory and let
categories emerge organically, then refactor into `patterns/` etc.
later.
None of these are blockers. They're all "if I were designing v2"
observations.