memex/docs/CUSTOMIZE.md
Eric Turner ee54a2f5d4 Initial commit — memex
A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
2026-04-12 21:16:02 -06:00


Customization Guide

This repo is built around Claude Code, cron-based automation, and a specific directory layout. None of those are load-bearing for the core idea. This document walks through adapting it for different agents, different scheduling, and different subsets of functionality.

What's actually required for the core idea

The minimum viable compounding wiki is:

  1. A markdown directory tree
  2. An agent that reads the tree at the start of a session and writes to it during the session
  3. Some convention (a CLAUDE.md or equivalent) telling the agent how to maintain the wiki

Everything else in this repo is optional optimization — automated extraction, URL harvesting, hygiene checks, cron scheduling. They're worth the setup effort once the wiki grows past a few dozen pages, but they're not the idea.


Adapting for non-Claude-Code agents

Five components are Claude-specific — four scripts plus the CLAUDE.md convention file. Each has a natural replacement path:

1. extract-sessions.py — Claude Code JSONL parsing

What it does: Reads session files from ~/.claude/projects/ and converts them to markdown transcripts.

What's Claude-specific: The JSONL format and directory structure are specific to the Claude Code CLI. Other agents don't produce these files.

Replacements:

  • Cursor: Cursor stores chat history in ~/Library/Application Support/Cursor/User/globalStorage/ (macOS) as a SQLite database. Write an equivalent extract-sessions.py that queries that database and produces the same markdown format.
  • Aider: Aider stores chat history as .aider.chat.history.md in each project directory. A much simpler extractor: walk all project directories, read each .aider.chat.history.md, split on session boundaries, write to conversations/<project>/.
  • OpenAI Codex / Gemini CLI / other: Whatever session format your tool uses — the target format is a markdown file with a specific frontmatter shape (title, type: conversation, project, date, status: extracted, messages: N, body of user/assistant turns). Anything that produces files in that shape will flow through the rest of the pipeline unchanged.
  • No agent at all — just manual: Skip this script entirely. Paste interesting conversations into conversations/general/YYYY-MM-DD-slug.md by hand and set status: extracted yourself.

The pipeline downstream of extract-sessions.py doesn't care how the transcripts got there, only that they exist with the right frontmatter.
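For reference, a conversation file in that target shape might look like the sketch below. The field list comes from the description above; the body format and any additional keys are guesses, so check an existing extracted file before relying on it:

```markdown
---
title: Debugging the staging queue
type: conversation
project: wiki
date: 2026-04-10
status: extracted
messages: 14
---

**user:** ...

**assistant:** ...
```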

2. summarize-conversations.py — claude -p summarization

What it does: Classifies extracted conversations into "halls" (fact/discovery/preference/advice/event/tooling) and writes summaries.

What's Claude-specific: Uses claude -p with haiku/sonnet routing.

Replacements:

  • OpenAI: Replace the call_claude helper with a function that calls the OpenAI Python SDK (or a CLI wrapper around it). Use gpt-4o-mini for short conversations (equivalent to haiku routing) and gpt-4o for long ones.
  • Local LLM: The script already supports this path — just omit the --claude flag and run a llama-server on localhost:8080 (or the WSL gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
  • Ollama: Point AI_BASE_URL at your Ollama endpoint (e.g. http://localhost:11434/v1). Ollama exposes an OpenAI-compatible API.
  • Any OpenAI-compatible endpoint: AI_BASE_URL and AI_MODEL env vars configure the script — no code changes needed.
  • No LLM at all — manual summaries: Edit each conversation file by hand to set status: summarized and add your own topics/related frontmatter. Tedious but works for a small wiki.
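Because the script honors AI_BASE_URL and AI_MODEL, a replacement helper can stay backend-agnostic. Here is a minimal sketch using only the standard library; the helper names, default values, and the AI_API_KEY variable are illustrative assumptions, not the script's actual internals:

```python
import json
import os
import urllib.request


def build_chat_request(prompt, system=None):
    """Build an OpenAI-compatible /chat/completions request from the
    AI_BASE_URL / AI_MODEL env vars (defaults here are illustrative)."""
    base = os.environ.get("AI_BASE_URL", "http://localhost:8080/v1").rstrip("/")
    model = os.environ.get("AI_MODEL", "local")
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return base + "/chat/completions", {"model": model, "messages": messages}


def call_llm(prompt, system=None, timeout=120):
    """Stand-in for a Claude-specific helper: POST the prompt to any
    OpenAI-compatible endpoint and return the reply text."""
    url, body = build_chat_request(prompt, system)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ.get("AI_API_KEY", "none"),
        },
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same call_llm works unchanged against Ollama, llama-server, or the OpenAI API, since they all speak the same /chat/completions shape.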

3. wiki-harvest.py — AI compile step

What it does: After fetching raw URL content, sends it to claude -p to get a structured JSON verdict (new_page / update_page / both / skip) plus the page content.

What's Claude-specific: claude -p --model haiku|sonnet.

Replacements:

  • Any other LLM: Replace call_claude_compile() with a function that calls your preferred backend. The prompt template (COMPILE_PROMPT_TEMPLATE) is reusable — just swap the transport.
  • Skip AI compilation entirely: Run wiki-harvest.py --no-compile and the harvester will save raw content to raw/harvested/ without trying to compile it. You can then manually (or via a different script) turn the raw content into wiki pages.
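Whichever backend you swap in, the compile step still has to validate the model's JSON verdict. A minimal sketch: the four verdict values come from the description above, while the "content" key is an assumption about the response shape:

```python
import json

# The four verdicts named in this doc's description of the compile step.
VALID_VERDICTS = {"new_page", "update_page", "both", "skip"}


def parse_verdict(raw):
    """Parse and validate a compile verdict returned by the model.
    Raises ValueError on anything outside the known verdict set."""
    data = json.loads(raw)
    verdict = data.get("verdict")
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict, data.get("content", "")
```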

4. wiki-hygiene.py --full — LLM-powered checks

What it does: Duplicate detection, contradiction detection, missing cross-reference suggestions.

What's Claude-specific: claude -p --model haiku|sonnet.

Replacements:

  • Same as #3: Replace the call_claude() helper in wiki-hygiene.py.
  • Skip full mode entirely: Only run wiki-hygiene.py --quick (the default). Quick mode has no LLM calls and catches 90% of structural issues. Contradictions and duplicates just have to be caught by human review during wiki-staging.py --review sessions.
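For a sense of what quick-mode structural checks involve, here is a sketch of one of them (broken relative links). This is a stand-in, not the real script's logic; it assumes standard markdown link syntax:

```python
import re
from pathlib import Path

# Capture markdown link targets, stopping at ")" or a "#fragment".
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#]+)")


def broken_links(wiki_dir):
    """Report relative markdown links that point at files which don't
    exist. Returns a list of (page, target) pairs."""
    root = Path(wiki_dir)
    problems = []
    for page in root.rglob("*.md"):
        for m in LINK_RE.finditer(page.read_text(encoding="utf-8")):
            target = m.group(1).strip()
            if "://" in target:  # skip external URLs
                continue
            if not (page.parent / target).exists():
                problems.append((str(page), target))
    return problems
```

Orphan and staleness checks follow the same pattern: walk the tree, parse cheaply, report.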

5. CLAUDE.md at the wiki root

What it does: The instructions Claude Code reads at the start of every session that explain the wiki schema and maintenance operations.

What's Claude-specific: The filename. Claude Code specifically looks for CLAUDE.md; other agents look for other files.

Replacements:

| Agent | Equivalent file |
| --- | --- |
| Claude Code | CLAUDE.md |
| Cursor | .cursorrules or .cursor/rules/ |
| Aider | CONVENTIONS.md (read via --read CONVENTIONS.md) |
| Gemini CLI | GEMINI.md |
| Continue.dev | config.json prompts or .continue/rules/ |

The content is the same — just rename the file and point your agent at it.


Running without cron

Cron is convenient but not required. Alternatives:

Manual runs

Just call the scripts when you want the wiki updated:

cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full

This is arguably better than cron if you work in bursts — run maintenance when you start a session, not on a schedule.

systemd timers (Linux)

More observable than cron, better journaling:

# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh

# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service  # see logs

launchd (macOS)

More native than cron on macOS:

<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>

launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki  # verify

Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits the remote:

cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push

Downside: every push is slow. Upside: you never push a broken wiki.

CI pipeline

Run wiki-hygiene.py --check-only in a CI workflow on every PR:

# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only

--check-only reports issues without auto-fixing them, so CI can flag problems without modifying files.


Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

"Just the wiki" (no automation)

  • Delete scripts/wiki-* and scripts/*-conversations*
  • Delete tests/
  • Keep the directory structure (patterns/, decisions/, etc.)
  • Keep index.md and CLAUDE.md
  • Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. Works great for small wikis.

"Wiki + mining" (no harvesting, no hygiene)

  • Keep the mining layer (extract-sessions.py, summarize-conversations.py, update-conversation-index.py)
  • Delete the automation layer (wiki-harvest.py, wiki-hygiene.py, wiki-staging.py, wiki-maintain.sh)
  • The wiki grows from session mining but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without the full automation.

"Wiki + hygiene" (no mining, no harvesting)

  • Keep wiki-hygiene.py and wiki_lib.py
  • Delete everything else
  • Run wiki-hygiene.py --quick periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for orphans, broken links, and staleness.

"Wiki + harvesting" (no session mining)

  • Keep wiki-harvest.py, wiki-staging.py, wiki_lib.py
  • Delete mining scripts
  • Source URLs manually — put them in a file and point the harvester at it. You'd need to write a wrapper that extracts URLs from your source file and feeds them into the fetch cascade.

Useful if URLs come from somewhere other than Claude Code sessions (e.g. browser bookmarks, Pocket export, RSS).
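The wrapper described in that last bullet can be very small. A sketch; the --url flag passed to wiki-harvest.py is hypothetical, so check the script's actual CLI before using it:

```python
import re
import subprocess
from pathlib import Path

# Grab http(s) URLs, stopping at whitespace or common closing punctuation.
URL_RE = re.compile(r"https?://[^\s)>\]\"']+")


def extract_urls(text):
    """Pull URLs out of a free-form source file (bookmarks export, notes)."""
    return sorted(set(URL_RE.findall(text)))


def feed_harvester(source_file):
    urls = extract_urls(Path(source_file).read_text(encoding="utf-8"))
    for url in urls:
        # Hypothetical flag: verify wiki-harvest.py's real CLI first.
        subprocess.run(["python3", "scripts/wiki-harvest.py", "--url", url],
                       check=False)
```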


Schema customization

The repo uses these live content types:

  • patterns/ — HOW things should be built
  • decisions/ — WHY we chose this approach
  • concepts/ — WHAT the foundational ideas are
  • environments/ — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need different categories. To change them:

  1. Rename / add directories under the wiki root
  2. Edit LIVE_CONTENT_DIRS in scripts/wiki_lib.py
  3. Update the type: frontmatter validation in scripts/wiki-hygiene.py (VALID_TYPES constant)
  4. Update CLAUDE.md to describe the new categories
  5. Update index.md section headers to match
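Steps 2 and 3 amount to editing two constants. A sketch using a research-style schema; the constant names come from this guide, but their exact shapes in the scripts are assumptions:

```python
# scripts/wiki_lib.py, step 2. The constant name comes from this guide;
# its exact shape (tuple vs list) is an assumption.
LIVE_CONTENT_DIRS = ("findings", "hypotheses", "methods", "literature")

# scripts/wiki-hygiene.py, step 3. Singular type names are a guess at
# the frontmatter convention; match whatever your pages actually use.
VALID_TYPES = {"finding", "hypothesis", "method", "literature"}
```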

Examples of alternative schemas:

Research wiki:

  • findings/ — experimental results
  • hypotheses/ — what you're testing
  • methods/ — how you test
  • literature/ — external sources

Product wiki:

  • features/ — what the product does
  • decisions/ — why we chose this
  • users/ — personas, interviews, feedback
  • metrics/ — what we measure

Personal knowledge wiki:

  • topics/ — general subject matter
  • projects/ — specific ongoing work
  • journal/ — dated entries
  • references/ — external links/papers

None of these are better or worse — pick what matches how you think.


Frontmatter customization

The required fields are documented in CLAUDE.md (frontmatter spec). You can add your own fields freely — the parser and hygiene checks ignore unknown keys.

Useful additions you might want:

author: alice              # who wrote or introduced the page
tags: [auth, security]     # flat tag list
urgency: high              # for to-do-style wiki pages
stakeholders:              # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01      # explicit review date instead of age-based decay

If you want age-based decay to key off a different field than last_verified (say, review_by), edit expected_confidence() in scripts/wiki-hygiene.py to read from your custom field.
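That edit might look roughly like this. The function name comes from the guide; its real signature, thresholds, and return values are assumptions:

```python
from datetime import date


def expected_confidence(meta, today=None):
    """Sketch of an expected_confidence() keyed off review_by instead of
    last_verified. Thresholds and labels are illustrative only."""
    today = today or date.today()
    review_by = meta.get("review_by")
    if review_by is None:
        return "low"  # no review date: treat as stale
    due = date.fromisoformat(str(review_by))
    if today <= due:
        return "high"  # still within its review window
    days_over = (today - due).days
    return "medium" if days_over <= 90 else "low"
```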


Working across multiple wikis

The scripts all honor the WIKI_DIR environment variable. Run multiple wikis against the same scripts:

# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh

Each has its own state files, its own cron entries, its own qmd collection. You can symlink or copy scripts/ into each wiki, or run all three against a single checked-out copy of the scripts.


What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

  1. Config should be in YAML, not inline constants. I bolted a "CONFIGURE ME" comment onto PROJECT_MAP and SKIP_DOMAIN_PATTERNS as a shortcut. Better: a config.yaml at the wiki root that all scripts read.

  2. The mining layer is tightly coupled to Claude Code. A cleaner design would put a Session interface in wiki_lib.py and have extractors for each agent produce Session objects — the rest of the pipeline would be agent-agnostic.

  3. The hygiene script is a monolith. 1100+ lines is a lot. Splitting it into wiki_hygiene/checks.py, wiki_hygiene/archive.py, wiki_hygiene/llm.py, etc., would be cleaner. It started as a single file and grew.

  4. The hyphenated filenames (wiki-harvest.py) make Python imports awkward. Standard Python convention is underscores. I used hyphens for consistency with the shell scripts, and conftest.py has a module-loader workaround. A cleaner fork would use underscores everywhere.

  5. The wiki schema assumes you know what you want to catalog. If you don't, start with a free-form notes/ directory and let categories emerge organically, then refactor into patterns/ etc. later.

None of these are blockers. They're all "if I were designing v2" observations.
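As an illustration of note 2, a hypothetical Session interface (nothing in the current code defines this):

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Session:
    """Agent-agnostic session record: a hypothetical v2 interface.
    Each extractor (Claude Code, Cursor, Aider, ...) would produce these."""
    project: str
    started: str  # ISO-8601 timestamp
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (role, text)


def to_markdown(session: Session) -> str:
    """Render a Session to a markdown transcript; the downstream pipeline
    then never needs to know which agent produced it."""
    lines = [f"# {session.project} ({session.started})", ""]
    for role, text in session.turns:
        lines.append(f"**{role}:** {text}")
    return "\n".join(lines)
```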