Setup Guide

Complete installation for the full automation pipeline. For the conceptual version (just the idea, no scripts), see the "Quick start — Path A" section in the README.

Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work on any POSIX system with Python 3.11+, Node.js 18+, and bash.


1. Prerequisites

Required

  • git with SSH or HTTPS access to your remote (for cross-machine sync)
  • Node.js 18+ (for qmd search)
  • Python 3.11+ (for all pipeline scripts)
  • claude CLI with valid authentication — Max subscription OAuth or API key. Required for summarization and the harvester's AI compile step. Without claude, you can still use the wiki, but the automation layer falls back to manual or local-LLM paths.
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup    # installs Playwright browsers

Verify: trafilatura --version and crwl --help should both work.
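
The rest of the required stack is just as quick to check:

python3 --version   # needs 3.11+
node --version      # needs 18+
git --version
claude --version    # also primes the OAuth token cache that cron will need later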

Optional

  • pytest — only needed to run the test suite (pip install --user pytest)
  • llama.cpp / llama-server — only if you want the legacy local-LLM summarization path instead of claude -p

2. Clone the repo

git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki

The repo contains scripts, tests, docs, and example content — but no actual wiki pages. The wiki grows as you use it.


3. Install and configure qmd

qmd handles BM25 full-text search and vector search over the wiki. The pipeline uses three collections:

  • wiki — live content (patterns/decisions/concepts/environments), staging, and raw sources. The default search surface.
  • wiki-archive — stale / superseded pages. Excluded from default search; query explicitly with -c wiki-archive when digging into history.
  • wiki-conversations — mined Claude Code session transcripts. Excluded from default search because they'd flood results with noisy tool-call output; query explicitly with -c wiki-conversations when looking for "what did I discuss about X last month?"
Install the CLI:

npm install -g @tobilu/qmd

Configure via YAML directly — the CLI doesn't support ignore or includeByDefault, so we edit the config file:

mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
  wiki:
    path: /Users/YOUR_USER/projects/wiki   # ← replace with your actual path
    pattern: "**/*.md"
    ignore:
      - "archive/**"
      - "reports/**"
      - "plans/**"
      - "conversations/**"
      - "scripts/**"
      - "context/**"

  wiki-archive:
    path: /Users/YOUR_USER/projects/wiki/archive
    pattern: "**/*.md"
    includeByDefault: false

  wiki-conversations:
    path: /Users/YOUR_USER/projects/wiki/conversations
    pattern: "**/*.md"
    includeByDefault: false
    ignore:
      - "index.md"
YAML

On Linux/WSL, replace /Users/YOUR_USER with /home/YOUR_USER.
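
If you'd rather not hand-edit the paths, substitute your home directory in one pass (GNU sed shown; on macOS use sed -i '' instead of sed -i):

sed -i "s|/Users/YOUR_USER|$HOME|g" ~/.config/qmd/index.yml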

Build the indexes:

qmd update     # scan files into all three collections
qmd embed      # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)

Verify:

qmd collection list
# Expected:
#   wiki                — N files
#   wiki-archive        — M files [excluded]
#   wiki-conversations  — K files [excluded]

The [excluded] tag on the non-default collections confirms includeByDefault: false is honored.

When to query which:

# "What's the current pattern for X?"
qmd search "topic" --json -n 5

# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5

# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5

# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10

4. Configure the Python scripts

Three scripts need per-user configuration:

scripts/extract-sessions.py — PROJECT_MAP

This maps Claude Code project directory suffixes to short wiki codes ("wings"). Claude stores sessions under ~/.claude/projects/<hashed-path>/ where the hashed path is derived from the absolute path to your project.

Open the script and edit the PROJECT_MAP dict near the top. Look for the CONFIGURE ME block. Examples:

PROJECT_MAP: dict[str, str] = {
    "projects-wiki": "wiki",
    "-claude": "cl",
    "my-webapp": "web",       # map "mydir/my-webapp" → wing "web"
    "mobile-app": "mob",
    "work-monorepo": "work",
    "-home": "general",       # catch-all for unmatched sessions
}

Run ls ~/.claude/projects/ to see what directory names Claude is actually producing on your machine — the suffix in PROJECT_MAP matches against the end of each directory name.
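
If it's unclear which wing a given directory would land in, here is a quick sanity check. It's a sketch that assumes plain endswith matching, per the suffix rule above; paste a real directory name from the ls output:

python3 - <<'PY'
PROJECT_MAP = {"projects-wiki": "wiki", "my-webapp": "web", "-home": "general"}
dirname = "-Users-you-projects-my-webapp"   # ← a real entry from ~/.claude/projects/
wing = next((w for suffix, w in PROJECT_MAP.items() if dirname.endswith(suffix)), None)
print(wing or "no match, falls through to your catch-all")
PY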

scripts/update-conversation-index.py — PROJECT_NAMES / PROJECT_ORDER

Matching display names for every code in PROJECT_MAP:

PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    "web": "WEB — My Webapp",
    "mob": "MOB — Mobile App",
    "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

PROJECT_ORDER = [
    "work", "web", "mob",   # most-active first
    "wiki", "cl", "general",
]

scripts/wiki-harvest.py — SKIP_DOMAIN_PATTERNS

Add your internal/personal domains so the harvester doesn't try to fetch them. Patterns use re.search:

SKIP_DOMAIN_PATTERNS = [
    # ... (generic ones are already there)
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]
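
You can test a candidate pattern before committing it. The harvester applies these with re.search, so this one-liner mirrors the check (assuming patterns are matched against the bare domain):

python3 -c "import re; print(bool(re.search(r'\.mycompany\.com$', 'wiki.mycompany.com')))"   # True = harvester skips it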

5. Create the post-merge hook

The hook rebuilds the qmd index automatically after every git pull:

cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail

if command -v qmd &>/dev/null; then
  echo "wiki: rebuilding qmd index..."
  qmd update 2>/dev/null
  # WSL / Linux: no GPU, force CPU-only embeddings
  if [[ "$(uname -s)" == "Linux" ]]; then
    NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
  else
    qmd embed 2>/dev/null
  fi
  echo "wiki: qmd index updated"
fi
HOOK
chmod +x ~/projects/wiki/.git/hooks/post-merge

.git/hooks/ isn't tracked by git, so repeat this step on every machine where you clone the repo.
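
The hook only fires on a real merge, but you can exercise it directly to confirm it's executable and that qmd is on PATH:

bash ~/projects/wiki/.git/hooks/post-merge
# should print "wiki: rebuilding qmd index..." then "wiki: qmd index updated"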


6. Backfill frontmatter (first-time setup or fresh clone)

If you're starting with existing wiki pages that don't yet have last_verified or origin, backfill them:

cd ~/projects/wiki

# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill

# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page
changed = 0
for p in iter_live_pages():
    if 'origin' not in p.frontmatter:
        p.frontmatter['origin'] = 'manual'
        write_page(p)
        changed += 1
print(f'{changed} page(s) backfilled')
"

For a brand-new empty wiki, there's nothing to backfill — skip this step.


7. Run the pipeline manually once

Before setting up cron, do a full end-to-end dry run to make sure everything's wired up:

cd ~/projects/wiki

# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only

# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude

# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex

# 4. First-run distill bootstrap (7-day lookback, burns claude -p calls)
#    Only do this if you have summarized conversations from recent work.
#    Skip it if you're starting with a fresh wiki.
python3 scripts/wiki-distill.py --first-run --dry-run    # plan
python3 scripts/wiki-distill.py --first-run              # actually do it

# 5. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile

Expected output from step 5: all four phases run (distill, harvest, hygiene, reindex), phase 3 (qmd reindex) shows as skipped in dry-run mode, and you see finished in Ns.


8. Cron setup (optional)

If you want full automation, add these cron jobs. Run them on only ONE machine — state files sync via git, so the other machine picks up the results automatically.

crontab -e
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"

# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin

# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes
*/15 * * * * { /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push; } >> /tmp/wiki-sync.log 2>&1

# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1

# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1

# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1

# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: distill conversations + harvest URLs + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1

# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1

Replace YOUR_USER and the node path as appropriate for your system.
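
To find the right directories to bake into that PATH line:

which claude qmd node python3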

macOS note: cron needs Full Disk Access if you're pointing it at files in ~/Documents or ~/Desktop. Alternatively use launchd with a plist — same effect, easier permission model on macOS.

WSL note: make sure cron is actually running (sudo service cron start). Cron doesn't auto-start in WSL by default.

claude -p in cron: OAuth tokens must be cached before cron runs it. Run claude --version once interactively as your user to prime the token cache — cron then picks up the cached credentials.
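
You can approximate cron's sparse environment to test this ahead of the schedule. A sketch; reuse the exact PATH value from your crontab:

env -i HOME="$HOME" PATH="/home/YOUR_USER/.local/bin:/usr/bin:/bin" claude --version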


9. Tell Claude Code about the wiki

Two separate CLAUDE.md files work together:

  1. The wiki's own CLAUDE.md at ~/projects/wiki/CLAUDE.md — the schema the agent reads when working INSIDE the wiki. Tells it how to maintain pages, apply frontmatter, handle staging/archival.
  2. Your global ~/.claude/CLAUDE.md — the user-level instructions the agent reads on EVERY session (regardless of directory). Tells it when and how to consult the wiki from any other project.

Both are provided as starter templates you can copy and adapt:

(a) Wiki schema — copy to the wiki root

cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions

This file is ~200 lines. It defines:

  • Directory structure and the automated-vs-manual core rule
  • Frontmatter spec (required fields, staging fields, archive fields)
  • Page-type conventions (pattern / decision / environment / concept)
  • Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
  • Search Strategy — which of the three qmd collections to use for which question type

Customize the sections marked "Customization Notes" at the bottom for your own categories, environments, and cross-reference format.

(b) Global wake-up + query instructions

Append the contents of docs/examples/global-CLAUDE.md to your global Claude Code instructions:

cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing
# content

This adds:

  • Wake-Up Context — read context/wake-up.md at session start
  • memex — When to Consult It — query mode vs ingest mode rules
  • memex — How to Search It — explicit guidance for all three qmd collections (wiki, wiki-archive, wiki-conversations) with example queries for each
  • memex — Rules When Citing — flag confidence: low, status: pending, and archived pages to the user

Together these give the agent a complete picture: how to maintain the wiki when working inside it, and how to consult it from anywhere else.


10. Verify

cd ~/projects/wiki

# Sync state
bash scripts/wiki-sync.sh --status

# Search
qmd collection list
qmd search "test" --json -n 3   # won't return anything if wiki is empty

# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"

# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile

# Run the test suite
cd tests && python3 -m pytest

Expected:

  • qmd collection list shows all three collections: wiki, wiki-archive [excluded], wiki-conversations [excluded]
  • wiki-maintain.sh --dry-run completes all four phases (distill, harvest, hygiene, reindex)
  • pytest passes all 192 tests in ~1.5 seconds

Troubleshooting

qmd search returns nothing

qmd collection list          # verify path points at the right place
qmd update                   # rebuild index
qmd embed                    # rebuild embeddings
cat ~/.config/qmd/index.yml  # verify config is correct for your machine

qmd collection points at the wrong path

Edit ~/.config/qmd/index.yml directly. Don't use qmd collection add from inside the target directory — it can interpret the path oddly.

qmd returns archived pages in default searches

Verify wiki-archive has includeByDefault: false in the YAML and qmd collection list shows [excluded].

claude -p fails in cron ("not authenticated")

Cron has no browser. Run claude --version once as the same user outside cron to cache OAuth tokens; cron will pick them up. Also verify the PATH directive at the top of the crontab includes the directory containing claude.

wiki-harvest.py fetch failures

# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit

# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"

wiki-hygiene.py archived a page unexpectedly

Check last_verified vs the decay thresholds. If the page was never referenced in a conversation, it decayed naturally. Restore with:

python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md

Both machines ran maintenance simultaneously

Merge conflicts on .harvest-state.json / .hygiene-state.json will occur. Pick ONE machine for maintenance; disable the maintenance cron on the other. Leave sync cron running on both so changes still propagate.
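
To see at a glance which machine owns maintenance:

crontab -l | grep wiki-maintain
# only ONE machine should show these entries; comment them out on the other (crontab -e)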

Tests fail

Run cd tests && python3 -m pytest -v for verbose output. If the failure mentions WIKI_DIR or module loading, verify scripts/wiki_lib.py exists and contains the WIKI_DIR env var override near the top.
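
To confirm the override is where the tests expect it:

grep -n "WIKI_DIR" ~/projects/wiki/scripts/wiki_lib.py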


Minimal install (skip everything except the idea)

If you want the conceptual wiki without any of the automation, all you actually need is:

  1. An empty directory
  2. CLAUDE.md telling your agent the conventions (see the schema in ARCHITECTURE.md or Karpathy's gist)
  3. index.md for the agent to catalog pages
  4. An agent that can read and write files (any Claude Code, Cursor, Aider session works)

Then tell the agent: "Start maintaining a wiki here. Every time I share a source, integrate it. When I ask a question, check the wiki first."
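
A minimal bootstrap, for reference (any directory works; ~/wiki and the file contents here are just placeholders):

mkdir -p ~/wiki && cd ~/wiki
printf '# Wiki Index\n\nNo pages yet.\n' > index.md
"${EDITOR:-nano}" CLAUDE.md   # write your conventions, or adapt the schema from ARCHITECTURE.md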

You can bolt on the automation layer later if/when it becomes worth the setup effort.