A compounding LLM-maintained knowledge wiki. Synthesis of Andrej
Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an
automation layer on top for conversation mining, URL harvesting,
human-in-the-loop staging, staleness decay, and hygiene. Includes:

- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example `CLAUDE.md` files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
# Customization Guide

This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.

## What's actually required for the core idea

The minimum viable compounding wiki is:

1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
   it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how
   to maintain the wiki

**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.
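
Concretely, using the directory names this repo ships with, the minimum
setup is just a handful of files (the inline comments restate the
schema's own one-line descriptions):

```text
wiki/
├── CLAUDE.md       # schema + maintenance instructions the agent reads
├── index.md        # table of contents
├── patterns/       # HOW things should be built
├── decisions/      # WHY we chose this approach
├── concepts/       # WHAT the foundational ideas are
└── environments/   # WHERE implementations differ
```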

---

## Adapting for non-Claude-Code agents

Five components are Claude-specific (four scripts plus the `CLAUDE.md`
convention file). Each has a natural replacement path:

### 1. `extract-sessions.py` — Claude Code JSONL parsing

**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.

**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.

**Replacements**:

- **Cursor**: Cursor stores chat history in `~/Library/Application
  Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
  equivalent `extract-sessions.py` that queries that SQLite store and
  produces the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
  each project directory. A much simpler extractor: walk all project
  directories, read each `.aider.chat.history.md`, split on session
  boundaries, and write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
  tool uses — the target format is a markdown file with a specific
  frontmatter shape (`title`, `type: conversation`, `project`, `date`,
  `status: extracted`, `messages: N`, body of user/assistant turns).
  Anything that produces files in that shape will flow through the rest
  of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
  interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
  by hand and set `status: extracted` yourself.

The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
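
For reference, a transcript in that target shape might look like this.
The frontmatter keys are the ones listed above; the values and the
turn format shown in the body are illustrative assumptions:

```markdown
---
title: Debugging the fetch cascade
type: conversation
project: wiki
date: 2026-01-15
status: extracted
messages: 12
---

**User**: ...

**Assistant**: ...
```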

### 2. `summarize-conversations.py` — `claude -p` summarization

**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.

**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.

**Replacements**:

- **OpenAI**: Replace the `call_claude` helper with a function that
  calls the `openai` Python SDK or the `gpt` CLI. Use gpt-4o-mini for
  short conversations (equivalent to haiku routing) and gpt-4o for long
  ones.
- **Local LLM**: The script already supports this path — just omit the
  `--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
  gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
  `http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: The `AI_BASE_URL` and `AI_MODEL`
  env vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
  hand to set `status: summarized` and add your own `topics`/`related`
  frontmatter. Tedious, but it works for a small wiki.
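
As a sketch of that replacement, the helper below talks to any
OpenAI-compatible `/chat/completions` endpoint and reproduces the
short/long routing. The function names and the 8000-character threshold
are assumptions; only `AI_BASE_URL` and `AI_MODEL` come from the
script's documented interface:

```python
import json
import os
import urllib.request

def pick_model(text, threshold_chars=8000):
    """Route short conversations to the cheap model (haiku-equivalent)."""
    return "gpt-4o-mini" if len(text) < threshold_chars else "gpt-4o"

def call_llm(prompt, model=None):
    """POST to any OpenAI-compatible /chat/completions endpoint."""
    base_url = os.environ.get("AI_BASE_URL", "https://api.openai.com/v1")
    model = model or os.environ.get("AI_MODEL", "gpt-4o-mini")
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Swapping `call_claude` for something like `call_llm` changes only the
transport; the classification prompts stay the same.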

### 3. `wiki-harvest.py` — AI compile step

**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Any other LLM**: Replace `call_claude_compile()` with a function
  that calls your preferred backend. The prompt template
  (`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
  and the harvester will save raw content to `raw/harvested/` without
  trying to compile it. You can then turn the raw content into wiki
  pages manually (or via a different script).
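
For orientation, the structured verdict might look something like this.
The `verdict` values are the ones named above; the surrounding field
names are assumptions about the exact shape, so check the prompt
template before relying on them:

```json
{
  "verdict": "new_page",
  "title": "Example Topic",
  "path": "concepts/example-topic.md",
  "content": "...markdown page body..."
}
```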

### 4. `wiki-hygiene.py --full` — LLM-powered checks

**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
  default). Quick mode makes no LLM calls and catches 90% of structural
  issues. Contradictions and duplicates then have to be caught by human
  review during `wiki-staging.py --review` sessions.

### 5. `CLAUDE.md` at the wiki root

**What it does**: The instructions Claude Code reads at the start of
every session that explain the wiki schema and maintenance operations.

**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.

**Replacements**:

| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |

The content is the same — just rename the file and point your agent at
it.

---

## Running without cron

Cron is convenient but not required. Alternatives:

### Manual runs

Just call the scripts when you want the wiki updated:

```bash
cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```

This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.

### systemd timers (Linux)

More observable than cron, with better journaling:

```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```

```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
systemctl --user daemon-reload            # pick up the new unit files
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service  # see logs
```

### launchd (macOS)

More native than cron on macOS:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<!-- Save as ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki  # verify
```

### Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits
the remote:

```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```

Downside: every push is slow. Upside: you never push a broken wiki.

### CI pipeline

Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:

```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only
```

`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.

---

## Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

### "Just the wiki" (no automation)

- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. It works great for small wikis.

### "Wiki + mining" (no harvesting, no hygiene)

- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining, but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without
the full automation.

### "Wiki + hygiene" (no mining, no harvesting)

- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.

### "Wiki + harvesting" (no session mining)

- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete the mining scripts
- Source URLs manually — put them in a file and point the harvester at
  it. You'd need to write a wrapper that extracts URLs from your source
  file and feeds them into the fetch cascade.

Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, Pocket export, RSS).

---

## Schema customization

The repo uses these live content types:

- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:

1. Rename or add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
   `scripts/wiki-hygiene.py` (the `VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match
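
Steps 2 and 3 amount to one-line edits like the following. The exact
shapes of the constants here are assumptions — check the real
definitions in the scripts before editing:

```python
# scripts/wiki_lib.py — directories the pipeline treats as live content
LIVE_CONTENT_DIRS = ["patterns", "decisions", "concepts", "environments"]

# scripts/wiki-hygiene.py — allowed values for the `type:` frontmatter field
VALID_TYPES = {"pattern", "decision", "concept", "environment"}
```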

Examples of alternative schemas:

**Research wiki**:

- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources

**Product wiki**:

- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure

**Personal knowledge wiki**:

- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers

None of these are better or worse — pick what matches how you think.

---

## Frontmatter customization

The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.

Useful additions you might want:

```yaml
author: alice            # who wrote or introduced the page
tags: [auth, security]   # flat tag list
urgency: high            # for to-do-style wiki pages
stakeholders:            # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01    # explicit review date instead of age-based decay
```

If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.

---

## Working across multiple wikis

The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:

```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```

Each wiki has its own state files, its own cron entries, and its own
qmd collection. You can symlink or copy `scripts/` into each wiki, or
run all three against a single checked-out copy of the scripts.

---

## What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

1. **Config should be in YAML, not inline constants.** I bolted a
   "CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
   as a shortcut. Better: a `config.yaml` at the wiki root that all
   scripts read.
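
Such a `config.yaml` might look like this; the keys mirror the inline
constants named above, but the nested shapes are guesses, not the
repo's actual format:

```yaml
# config.yaml at the wiki root, read by every script
project_map:
  ~/projects/wiki: wiki
  ~/projects/work: work
skip_domain_patterns:
  - localhost
  - "*.internal"
```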

2. **The mining layer is tightly coupled to Claude Code.** A cleaner
   design would put a `Session` interface in `wiki_lib.py` and have
   extractors for each agent produce `Session` objects — the rest of the
   pipeline would be agent-agnostic.
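
That `Session` interface might be sketched like so; the field names are
assumptions, chosen to match the frontmatter shape the pipeline already
expects:

```python
from dataclasses import dataclass, field

@dataclass
class Session:
    """Agent-agnostic transcript in the pipeline's frontmatter shape."""
    title: str
    project: str
    date: str                     # ISO date, e.g. "2026-01-15"
    messages: list = field(default_factory=list)  # user/assistant turns
    status: str = "extracted"

    def to_markdown(self):
        """Render the markdown file the downstream scripts expect."""
        head = (
            "---\n"
            f"title: {self.title}\n"
            "type: conversation\n"
            f"project: {self.project}\n"
            f"date: {self.date}\n"
            f"status: {self.status}\n"
            f"messages: {len(self.messages)}\n"
            "---\n\n"
        )
        return head + "\n\n".join(self.messages)
```

Each agent-specific extractor would build `Session` objects; only
`to_markdown()` touches the on-disk format.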

3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
   it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
   `wiki_hygiene/llm.py`, etc. would be cleaner. It started as a single
   file and grew.

4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
   awkward.** Standard Python convention is underscores. I used hyphens
   for consistency with the shell scripts, and `conftest.py` has a
   module-loader workaround. A cleaner fork would use underscores
   everywhere.

5. **The wiki schema assumes you know what you want to catalog.** If
   you don't, start with a free-form `notes/` directory, let categories
   emerge organically, and then refactor into `patterns/` etc. later.

None of these are blockers. They're all "if I were designing v2"
observations.