feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)
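
For reference, the injected provenance block from step 7 renders roughly like this (paths, topic, and notes here are illustrative, not from a real run; the distill fields sit above the page's own frontmatter):

```yaml
---
origin: automated
status: pending
staged_date: 2026-04-12
staged_by: wiki-distill
target_path: patterns/example-topic.md
distill_topic: example-topic
distill_source_conversations: conversations/mc/2026-04-12-example.md
compilation_notes: Rolled up 2 conversations into one pattern page.
title: Example Topic        # page's own frontmatter continues below
type: pattern
---
```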

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
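
The narrow-today/wide-history selection in steps 1-3 can be sketched in a few lines (dict stand-ins for the real conversation pages; dates and topics are hypothetical):

```python
from datetime import date, timedelta

def topics_of_today(convs, target, lookback_days=0):
    # Steps 1+2: topics from conversations dated within the window.
    cutoff = target - timedelta(days=lookback_days)
    return {t for c in convs if c["date"] >= cutoff for t in c["topics"]}

def rollup(topic, convs):
    # Step 3: every conversation ever tagged with the topic, newest first.
    return sorted((c for c in convs if topic in c["topics"]),
                  key=lambda c: c["date"], reverse=True)

convs = [
    {"date": date(2026, 4, 12), "topics": ["zoho-api"]},           # today
    {"date": date(2026, 4, 10), "topics": ["qmd"]},                # this week
    {"date": date(2025, 11, 2), "topics": ["zoho-api", "oauth"]},  # dormant
]
topics_of_today(convs, date(2026, 4, 12))                   # {"zoho-api"}
topics_of_today(convs, date(2026, 4, 12), lookback_days=7)  # adds "qmd" (bootstrap)
rollup("zoho-api", convs)  # both zoho-api conversations, dormant one included
```

The same rollup call is what makes the self-triggering property fall out for free: one new conversation on a dormant topic drags the whole history back into scope.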

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page), so the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
Eric Turner
2026-04-12 22:34:33 -06:00
parent 4c6b7609a1
commit 997aa837de
11 changed files with 1732 additions and 66 deletions

scripts/wiki-distill.py (new file, 700 lines)

@@ -0,0 +1,700 @@
#!/usr/bin/env python3
"""Distill wiki pages from summarized conversation content.
This is the "closing the MemPalace loop" step: closet summaries become
the source material for new or updated wiki pages. It's parallel to
wiki-harvest.py (which compiles URL content into wiki pages) but operates
on the *content of the conversations themselves* rather than the URLs
they cite.
Scope filter (deliberately narrow):
  1. Find all summarized conversations dated TODAY
  2. Extract their `topics:` — this is the "topics-of-today" set
  3. For each topic in that set, pull ALL summarized conversations across
     history that share that topic (rollup for full context)
  4. For each topic group, extract `hall_facts` + `hall_discoveries` +
     `hall_advice` bullet content from the body
  5. Send the topic group + relevant hall entries to `claude -p` with
     the current index.md, ask for new_page / update_page / both / skip
  6. Write result(s) to staging/<type>/ with `staged_by: wiki-distill`

First run bootstrap (--first-run or empty state):
  - Instead of "topics-of-today", use "topics-from-the-last-7-days"
  - This seeds the state file so subsequent runs can stay narrow

Self-triggering property:
  - Old dormant topics that resurface in a new conversation will
    automatically pull in all historical conversations on that topic
    via the rollup — no need to manually trigger reprocessing

State: `.distill-state.json` tracks processed conversations (path +
content hash + topics seen at distill time). A conversation is
re-processed if its content hash changes OR it has a new topic not
seen during the previous distill.

Usage:
  python3 scripts/wiki-distill.py                # Today-only rollup
  python3 scripts/wiki-distill.py --first-run    # Last 7 days rollup
  python3 scripts/wiki-distill.py --topic TOPIC  # Process one topic explicitly
  python3 scripts/wiki-distill.py --project mc   # Only this wing's today topics
  python3 scripts/wiki-distill.py --dry-run      # Plan only, no LLM, no writes
  python3 scripts/wiki-distill.py --no-compile   # Parse/rollup only, skip claude -p
  python3 scripts/wiki-distill.py --limit N      # Cap at N topic groups processed
"""
from __future__ import annotations
import argparse
import hashlib
import json
import os
import re
import subprocess
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from pathlib import Path
from typing import Any
sys.path.insert(0, str(Path(__file__).parent))
from wiki_lib import (  # noqa: E402
    CONVERSATIONS_DIR,
    INDEX_FILE,
    STAGING_DIR,
    WIKI_DIR,
    WikiPage,
    high_signal_halls,
    parse_date,
    parse_page,
    today,
)
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
DISTILL_STATE_FILE = WIKI_DIR / ".distill-state.json"
CLAUDE_HAIKU_MODEL = "haiku"
CLAUDE_SONNET_MODEL = "sonnet"
# Content size (characters) above which we route to sonnet
SONNET_CONTENT_THRESHOLD = 15_000
CLAUDE_TIMEOUT = 600
FIRST_RUN_LOOKBACK_DAYS = 7
# Minimum number of total hall bullets across the topic group to bother
# asking the LLM. A topic with only one fact/discovery across history is
# usually not enough signal to warrant a wiki page.
MIN_BULLETS_PER_TOPIC = 2
# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------
def load_state() -> dict[str, Any]:
    defaults: dict[str, Any] = {
        "processed_convs": {},
        "processed_topics": {},
        "rejected_topics": {},
        "last_run": None,
        "first_run_complete": False,
    }
    if DISTILL_STATE_FILE.exists():
        try:
            with open(DISTILL_STATE_FILE) as f:
                state = json.load(f)
            for k, v in defaults.items():
                state.setdefault(k, v)
            return state
        except (OSError, json.JSONDecodeError):
            pass
    return defaults


def save_state(state: dict[str, Any]) -> None:
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    tmp = DISTILL_STATE_FILE.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)
    tmp.replace(DISTILL_STATE_FILE)


def conv_content_hash(conv: WikiPage) -> str:
    return "sha256:" + hashlib.sha256(conv.body.encode("utf-8")).hexdigest()


def conv_needs_distill(state: dict[str, Any], conv: WikiPage) -> bool:
    """Return True if this conversation should be re-processed."""
    rel = str(conv.path.relative_to(WIKI_DIR))
    entry = state.get("processed_convs", {}).get(rel)
    if not entry:
        return True
    if entry.get("content_hash") != conv_content_hash(conv):
        return True
    # New topics that weren't seen at distill time → re-process
    seen_topics = set(entry.get("topics_at_distill", []))
    current_topics = set(conv.frontmatter.get("topics") or [])
    if current_topics - seen_topics:
        return True
    return False


def mark_conv_distilled(
    state: dict[str, Any],
    conv: WikiPage,
    output_pages: list[str],
) -> None:
    rel = str(conv.path.relative_to(WIKI_DIR))
    state.setdefault("processed_convs", {})[rel] = {
        "distilled_date": today().isoformat(),
        "content_hash": conv_content_hash(conv),
        "topics_at_distill": list(conv.frontmatter.get("topics") or []),
        "output_pages": output_pages,
    }
# ---------------------------------------------------------------------------
# Conversation discovery & topic rollup
# ---------------------------------------------------------------------------
def iter_summarized_conversations(project_filter: str | None = None) -> list[WikiPage]:
    """Walk conversations/ and return all summarized conversation pages."""
    if not CONVERSATIONS_DIR.exists():
        return []
    results: list[WikiPage] = []
    for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
        if not project_dir.is_dir():
            continue
        if project_filter and project_dir.name != project_filter:
            continue
        for md in sorted(project_dir.glob("*.md")):
            page = parse_page(md)
            if not page:
                continue
            if page.frontmatter.get("status") != "summarized":
                continue
            results.append(page)
    return results


def extract_topics_from_today(
    conversations: list[WikiPage],
    target_date: date,
    lookback_days: int = 0,
) -> set[str]:
    """Find the set of topics appearing in conversations dated ≥ (target - lookback).

    lookback_days=0 → only today
    lookback_days=7 → today and the previous 7 days
    """
    cutoff = target_date - timedelta(days=lookback_days)
    topics: set[str] = set()
    for conv in conversations:
        d = parse_date(conv.frontmatter.get("date"))
        if d and d >= cutoff:
            for t in conv.frontmatter.get("topics") or []:
                t_clean = str(t).strip()
                if t_clean:
                    topics.add(t_clean)
    return topics


def rollup_conversations_by_topic(
    topic: str, conversations: list[WikiPage]
) -> list[WikiPage]:
    """Return all conversations (across all time) whose topics: list contains `topic`."""
    results: list[WikiPage] = []
    for conv in conversations:
        conv_topics = conv.frontmatter.get("topics") or []
        if topic in conv_topics:
            results.append(conv)
    # Most recent first so the LLM sees the current state before the backstory
    results.sort(
        key=lambda c: parse_date(c.frontmatter.get("date")) or date.min,
        reverse=True,
    )
    return results
# ---------------------------------------------------------------------------
# Build the LLM input for a topic group
# ---------------------------------------------------------------------------
@dataclass
class TopicGroup:
    topic: str
    conversations: list[WikiPage]
    halls_by_conv: list[dict[str, list[str]]]
    total_bullets: int


def build_topic_group(topic: str, conversations: list[WikiPage]) -> TopicGroup:
    halls_by_conv: list[dict[str, list[str]]] = []
    total = 0
    for conv in conversations:
        halls = high_signal_halls(conv)
        halls_by_conv.append(halls)
        total += sum(len(v) for v in halls.values())
    return TopicGroup(
        topic=topic,
        conversations=conversations,
        halls_by_conv=halls_by_conv,
        total_bullets=total,
    )
def format_topic_group_for_llm(group: TopicGroup) -> str:
    """Render a topic group as a prompt-friendly markdown block."""
    lines = [f"# Topic: {group.topic}", ""]
    lines.append(
        f"Found {len(group.conversations)} summarized conversation(s) tagged "
        f"with this topic, containing {group.total_bullets} high-signal bullets "
        f"across fact/discovery/advice halls."
    )
    lines.append("")
    for conv, halls in zip(group.conversations, group.halls_by_conv):
        rel = str(conv.path.relative_to(WIKI_DIR))
        date_str = conv.frontmatter.get("date", "unknown")
        title = conv.frontmatter.get("title", conv.path.stem)
        project = conv.frontmatter.get("project", "?")
        lines.append(f"## {date_str}: {title} ({project})")
        lines.append(f"_Source: `{rel}`_")
        lines.append("")
        for hall_type in ("fact", "discovery", "advice"):
            bullets = halls.get(hall_type) or []
            if not bullets:
                continue
            label = {"fact": "Decisions", "discovery": "Discoveries", "advice": "Advice"}[hall_type]
            lines.append(f"**{label}:**")
            for b in bullets:
                lines.append(f"- {b}")
            lines.append("")
    return "\n".join(lines)
# ---------------------------------------------------------------------------
# Claude compilation
# ---------------------------------------------------------------------------
DISTILL_PROMPT_TEMPLATE = """You are distilling wiki pages from summarized conversation content.
The wiki schema and conventions are defined in CLAUDE.md. The wiki has four
content directories: patterns/ (HOW), decisions/ (WHY), environments/ (WHERE),
concepts/ (WHAT). All pages require YAML frontmatter with title, type,
confidence, origin, sources, related, last_compiled, last_verified.
IMPORTANT: Do NOT include `status`, `staged_*`, `target_path`, `modifies`,
or `compilation_notes` fields in your page frontmatter — the distill script
injects those automatically.
Your task: given a topic group (all conversations across history that share
a topic, with their decisions/discoveries/advice), decide what wiki pages
should be created or updated. Emit a single JSON object with an `actions`
array. Each action is one of:
- "new_page" — create a new wiki page from the distilled knowledge
- "update_page" — update an existing live wiki page (add content, merge)
- "skip" — content isn't substantive enough for a wiki page
OR the topic is already well-covered elsewhere
Schema:
{{
  "rationale": "1-2 sentences explaining your decision",
  "actions": [
    {{
      "type": "new_page",
      "directory": "patterns" | "decisions" | "environments" | "concepts",
      "filename": "kebab-case-name.md",
      "content": "full markdown including frontmatter"
    }},
    {{
      "type": "update_page",
      "path": "patterns/existing-page.md",
      "content": "full updated markdown including frontmatter (merged)"
    }},
    {{
      "type": "skip",
      "reason": "why this topic doesn't need a wiki page"
    }}
  ]
}}
You can emit MULTIPLE actions — e.g. a new_page for a concept and an
update_page to an existing pattern that now has new context.
Emit ONLY the JSON object. No prose, no markdown fences.
--- WIKI INDEX (existing pages) ---
{wiki_index}
--- TOPIC GROUP ---
{topic_group}
"""
def call_claude_distill(prompt: str, model: str) -> dict[str, Any] | None:
    try:
        result = subprocess.run(
            ["claude", "-p", "--model", model, "--output-format", "text", prompt],
            capture_output=True,
            text=True,
            timeout=CLAUDE_TIMEOUT,
        )
    except FileNotFoundError:
        print(" [warn] claude CLI not found — skipping compilation", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print(" [warn] claude -p timed out", file=sys.stderr)
        return None
    if result.returncode != 0:
        print(f" [warn] claude -p failed: {result.stderr.strip()[:200]}", file=sys.stderr)
        return None
    output = result.stdout.strip()
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        print(f" [warn] no JSON found in claude output ({len(output)} chars)", file=sys.stderr)
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        print(f" [warn] JSON parse failed: {e}", file=sys.stderr)
        return None
# ---------------------------------------------------------------------------
# Staging output
# ---------------------------------------------------------------------------
STAGING_INJECT_TEMPLATE = (
    "---\n"
    "origin: automated\n"
    "status: pending\n"
    "staged_date: {staged_date}\n"
    "staged_by: wiki-distill\n"
    "target_path: {target_path}\n"
    "{modifies_line}"
    "distill_topic: {topic}\n"
    "distill_source_conversations: {source_convs}\n"
    "compilation_notes: {compilation_notes}\n"
)


def _inject_staging_frontmatter(
    content: str,
    target_path: str,
    topic: str,
    source_convs: list[str],
    compilation_notes: str,
    modifies: str | None,
) -> str:
    content = re.sub(
        r"^(status|origin|staged_\w+|target_path|modifies|distill_\w+|compilation_notes):.*\n",
        "",
        content,
        flags=re.MULTILINE,
    )
    modifies_line = f"modifies: {modifies}\n" if modifies else ""
    clean_notes = compilation_notes.replace("\n", " ").replace("\r", " ").strip()
    sources_yaml = ",".join(source_convs)
    injection = STAGING_INJECT_TEMPLATE.format(
        staged_date=datetime.now(timezone.utc).date().isoformat(),
        target_path=target_path,
        modifies_line=modifies_line,
        topic=topic,
        source_convs=sources_yaml,
        compilation_notes=clean_notes or "(distilled from conversation topic group)",
    )
    if content.startswith("---\n"):
        return injection + content[4:]
    return injection + "---\n" + content


def _unique_staging_path(base: Path) -> Path:
    if not base.exists():
        return base
    suffix = hashlib.sha256(str(base).encode() + str(time.time()).encode()).hexdigest()[:6]
    return base.with_stem(f"{base.stem}-{suffix}")
def apply_distill_actions(
    result: dict[str, Any],
    topic: str,
    source_convs: list[str],
    dry_run: bool,
) -> list[Path]:
    written: list[Path] = []
    actions = result.get("actions") or []
    rationale = result.get("rationale", "")
    for action in actions:
        action_type = action.get("type")
        if action_type == "skip":
            reason = action.get("reason", "not substantive enough")
            print(f" [skip] topic={topic!r}: {reason}")
            continue
        if action_type == "new_page":
            directory = action.get("directory") or "patterns"
            filename = action.get("filename")
            content = action.get("content")
            if not filename or not content:
                print(f" [warn] incomplete new_page action for topic={topic!r}", file=sys.stderr)
                continue
            target_rel = f"{directory}/{filename}"
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] new_page → {dest.relative_to(WIKI_DIR)}")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=None,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [new] {dest.relative_to(WIKI_DIR)}")
            continue
        if action_type == "update_page":
            target_rel = action.get("path")
            content = action.get("content")
            if not target_rel or not content:
                print(f" [warn] incomplete update_page action for topic={topic!r}", file=sys.stderr)
                continue
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] update_page → {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=target_rel,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [upd] {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
            continue
        print(f" [warn] unknown action type: {action_type!r}", file=sys.stderr)
    return written
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
def pick_model(topic_group: TopicGroup, prompt: str) -> str:
    if len(prompt) > SONNET_CONTENT_THRESHOLD or topic_group.total_bullets > 20:
        return CLAUDE_SONNET_MODEL
    return CLAUDE_HAIKU_MODEL


def process_topic(
    topic: str,
    conversations: list[WikiPage],
    state: dict[str, Any],
    dry_run: bool,
    compile_enabled: bool,
) -> tuple[str, list[Path]]:
    """Process a single topic group. Returns (status, written_paths)."""
    group = build_topic_group(topic, conversations)
    if group.total_bullets < MIN_BULLETS_PER_TOPIC:
        return f"too-thin (only {group.total_bullets} bullets)", []
    if topic in state.get("rejected_topics", {}):
        return "previously-rejected", []
    wiki_index_text = ""
    try:
        wiki_index_text = INDEX_FILE.read_text()[:15_000]
    except OSError:
        pass
    topic_group_text = format_topic_group_for_llm(group)
    prompt = DISTILL_PROMPT_TEMPLATE.format(
        wiki_index=wiki_index_text,
        topic_group=topic_group_text,
    )
    if dry_run:
        model = pick_model(group, prompt)
        return (
            f"would-distill ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets, {model})"
        ), []
    if not compile_enabled:
        return (
            f"skipped-compile ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets)"
        ), []
    model = pick_model(group, prompt)
    print(f" [compile] topic={topic!r} "
          f"convs={len(group.conversations)} bullets={group.total_bullets} model={model}")
    result = call_claude_distill(prompt, model)
    if result is None:
        return "compile-failed", []
    actions = result.get("actions") or []
    if not actions or all(a.get("type") == "skip" for a in actions):
        reason = result.get("rationale", "AI chose to skip")
        state.setdefault("rejected_topics", {})[topic] = {
            "reason": reason,
            "rejected_date": today().isoformat(),
        }
        return "rejected-by-llm", []
    source_convs = [str(c.path.relative_to(WIKI_DIR)) for c in group.conversations]
    written = apply_distill_actions(result, topic, source_convs, dry_run=False)
    for conv in group.conversations:
        mark_conv_distilled(state, conv, [str(p.relative_to(WIKI_DIR)) for p in written])
    state.setdefault("processed_topics", {})[topic] = {
        "distilled_date": today().isoformat(),
        "conversations": source_convs,
        "output_pages": [str(p.relative_to(WIKI_DIR)) for p in written],
    }
    return f"distilled ({len(written)} page(s))", written
def run(
    *,
    first_run: bool,
    explicit_topic: str | None,
    project_filter: str | None,
    dry_run: bool,
    compile_enabled: bool,
    limit: int,
) -> int:
    state = load_state()
    if not state.get("first_run_complete"):
        first_run = True
    all_convs = iter_summarized_conversations(project_filter)
    print(f"Scanning {len(all_convs)} summarized conversation(s)...")
    # Figure out which topics to process
    if explicit_topic:
        topics_to_process: set[str] = {explicit_topic}
        print(f"Explicit topic mode: {explicit_topic!r}")
    else:
        lookback = FIRST_RUN_LOOKBACK_DAYS if first_run else 0
        topics_to_process = extract_topics_from_today(all_convs, today(), lookback)
        if first_run:
            print(f"First-run bootstrap: last {FIRST_RUN_LOOKBACK_DAYS} days → "
                  f"{len(topics_to_process)} topic(s)")
        else:
            print(f"Today-only mode: {len(topics_to_process)} topic(s) from today's conversations")
    if not topics_to_process:
        print("No topics to distill.")
        if first_run:
            state["first_run_complete"] = True
            save_state(state)
        return 0
    # Sort for deterministic ordering
    topics_ordered = sorted(topics_to_process)
    stats: dict[str, int] = {}
    processed = 0
    total_written: list[Path] = []
    for topic in topics_ordered:
        convs = rollup_conversations_by_topic(topic, all_convs)
        if not convs:
            stats["no-matches"] = stats.get("no-matches", 0) + 1
            continue
        print(f"\n[{topic}] rollup: {len(convs)} conversation(s)")
        status, written = process_topic(
            topic, convs, state, dry_run=dry_run, compile_enabled=compile_enabled
        )
        stats[status.split(" ")[0]] = stats.get(status.split(" ")[0], 0) + 1
        print(f" [{status}]")
        total_written.extend(written)
        if not dry_run:
            processed += 1
            save_state(state)
        if limit and processed >= limit:
            print(f"\nLimit reached ({limit}); stopping.")
            break
    if first_run and not dry_run:
        state["first_run_complete"] = True
    if not dry_run:
        save_state(state)
    print("\nSummary:")
    for status, count in sorted(stats.items()):
        print(f" {status}: {count}")
    print(f"\n{len(total_written)} staging page(s) written")
    return 0
def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("--first-run", action="store_true",
                        help="Bootstrap with last 7 days instead of today-only")
    parser.add_argument("--topic", default=None,
                        help="Process one specific topic explicitly")
    parser.add_argument("--project", default=None,
                        help="Only consider conversations under this wing")
    parser.add_argument("--dry-run", action="store_true",
                        help="Plan only; no LLM calls, no writes")
    parser.add_argument("--no-compile", action="store_true",
                        help="Parse + rollup only; skip claude -p step")
    parser.add_argument("--limit", type=int, default=0,
                        help="Stop after N topic groups processed (0 = unlimited)")
    args = parser.parse_args()
    return run(
        first_run=args.first_run,
        explicit_topic=args.topic,
        project_filter=args.project,
        dry_run=args.dry_run,
        compile_enabled=not args.no_compile,
        limit=args.limit,
    )


if __name__ == "__main__":
    sys.exit(main())

scripts/wiki-maintain.sh

@@ -3,19 +3,26 @@ set -euo pipefail
# wiki-maintain.sh — Top-level orchestrator for wiki maintenance.
#
# Chains the three maintenance scripts in the correct order:
# 1. wiki-harvest.py (URL harvesting from summarized conversations)
# 2. wiki-hygiene.py (quick or full hygiene checks)
# 3. qmd update && qmd embed (reindex after changes)
# Chains the maintenance scripts in the correct order:
# 1a. wiki-distill.py (closet summaries → wiki pages via claude -p)
# 1b. wiki-harvest.py (URL content from conversations → wiki pages)
# 2. wiki-hygiene.py (quick or full hygiene checks)
# 3. qmd update && qmd embed (reindex after changes)
#
# Distill runs BEFORE harvest: conversation content takes priority over
# URL content. If a topic is already discussed in the conversations, we
# want the conversation rollup to drive the page, not a cited URL.
#
# Usage:
# wiki-maintain.sh # Harvest + quick hygiene
# wiki-maintain.sh --full # Harvest + full hygiene (LLM-powered)
# wiki-maintain.sh # Distill + harvest + quick hygiene + reindex
# wiki-maintain.sh --full # Everything with full hygiene (LLM)
# wiki-maintain.sh --distill-only # Conversation distillation only
# wiki-maintain.sh --harvest-only # URL harvesting only
# wiki-maintain.sh --hygiene-only # Quick hygiene only
# wiki-maintain.sh --hygiene-only --full # Full hygiene only
# wiki-maintain.sh --dry-run # Show what would run (no writes)
# wiki-maintain.sh --no-compile # Harvest without claude -p compilation step
# wiki-maintain.sh --hygiene-only # Hygiene only
# wiki-maintain.sh --no-distill # Skip distillation phase
# wiki-maintain.sh --distill-first-run # Bootstrap distill with last 7 days
# wiki-maintain.sh --dry-run # Show what would run (no writes, no LLM)
# wiki-maintain.sh --no-compile # Skip claude -p in harvest AND distill
# wiki-maintain.sh --no-reindex # Skip qmd update/embed after
#
# Log file: scripts/.maintain.log (rotated manually)
@@ -32,22 +39,28 @@ LOG_FILE="${SCRIPTS_DIR}/.maintain.log"
# -----------------------------------------------------------------------------
FULL_MODE=false
DISTILL_ONLY=false
HARVEST_ONLY=false
HYGIENE_ONLY=false
NO_DISTILL=false
DISTILL_FIRST_RUN=false
DRY_RUN=false
NO_COMPILE=false
NO_REINDEX=false
while [[ $# -gt 0 ]]; do
case "$1" in
--full) FULL_MODE=true; shift ;;
--harvest-only) HARVEST_ONLY=true; shift ;;
--hygiene-only) HYGIENE_ONLY=true; shift ;;
--dry-run) DRY_RUN=true; shift ;;
--no-compile) NO_COMPILE=true; shift ;;
--no-reindex) NO_REINDEX=true; shift ;;
--full) FULL_MODE=true; shift ;;
--distill-only) DISTILL_ONLY=true; shift ;;
--harvest-only) HARVEST_ONLY=true; shift ;;
--hygiene-only) HYGIENE_ONLY=true; shift ;;
--no-distill) NO_DISTILL=true; shift ;;
--distill-first-run) DISTILL_FIRST_RUN=true; shift ;;
--dry-run) DRY_RUN=true; shift ;;
--no-compile) NO_COMPILE=true; shift ;;
--no-reindex) NO_REINDEX=true; shift ;;
-h|--help)
sed -n '3,20p' "$0" | sed 's/^# \?//'
sed -n '3,28p' "$0" | sed 's/^# \?//'
exit 0
;;
*)
@@ -57,8 +70,13 @@ while [[ $# -gt 0 ]]; do
esac
done
if [[ "${HARVEST_ONLY}" == "true" && "${HYGIENE_ONLY}" == "true" ]]; then
echo "--harvest-only and --hygiene-only are mutually exclusive" >&2
# Mutex check — only one "only" flag at a time
only_count=0
${DISTILL_ONLY} && only_count=$((only_count + 1))
${HARVEST_ONLY} && only_count=$((only_count + 1))
${HYGIENE_ONLY} && only_count=$((only_count + 1))
if [[ $only_count -gt 1 ]]; then
echo "--distill-only, --harvest-only, and --hygiene-only are mutually exclusive" >&2
exit 1
fi
@@ -91,13 +109,36 @@ cd "${WIKI_DIR}"
for req in python3 qmd; do
if ! command -v "${req}" >/dev/null 2>&1; then
if [[ "${req}" == "qmd" && "${NO_REINDEX}" == "true" ]]; then
continue # qmd not required if --no-reindex
continue
fi
echo "Required command not found: ${req}" >&2
exit 1
fi
done
# -----------------------------------------------------------------------------
# Determine which phases to run
# -----------------------------------------------------------------------------
run_distill=true
run_harvest=true
run_hygiene=true
${NO_DISTILL} && run_distill=false
if ${DISTILL_ONLY}; then
run_harvest=false
run_hygiene=false
fi
if ${HARVEST_ONLY}; then
run_distill=false
run_hygiene=false
fi
if ${HYGIENE_ONLY}; then
run_distill=false
run_harvest=false
fi
# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------
@@ -105,18 +146,39 @@ done
START_TS="$(date '+%s')"
section "wiki-maintain.sh starting"
log "mode: $(${FULL_MODE} && echo full || echo quick)"
log "harvest: $(${HYGIENE_ONLY} && echo skipped || echo enabled)"
log "hygiene: $(${HARVEST_ONLY} && echo skipped || echo enabled)"
log "distill: $(${run_distill} && echo enabled || echo skipped)"
log "harvest: $(${run_harvest} && echo enabled || echo skipped)"
log "hygiene: $(${run_hygiene} && echo enabled || echo skipped)"
log "reindex: $(${NO_REINDEX} && echo skipped || echo enabled)"
log "dry-run: ${DRY_RUN}"
log "wiki: ${WIKI_DIR}"
# -----------------------------------------------------------------------------
# Phase 1: Harvest
# Phase 1a: Distill — conversations → wiki pages
# -----------------------------------------------------------------------------
if [[ "${HYGIENE_ONLY}" != "true" ]]; then
section "Phase 1: URL harvesting"
if ${run_distill}; then
section "Phase 1a: Conversation distillation"
distill_args=()
${DRY_RUN} && distill_args+=(--dry-run)
${NO_COMPILE} && distill_args+=(--no-compile)
${DISTILL_FIRST_RUN} && distill_args+=(--first-run)
if python3 "${SCRIPTS_DIR}/wiki-distill.py" "${distill_args[@]}"; then
log "distill completed"
else
log "[error] distill failed (exit $?) — continuing to harvest"
fi
else
section "Phase 1a: Conversation distillation (skipped)"
fi
# -----------------------------------------------------------------------------
# Phase 1b: Harvest — URLs cited in conversations → raw/ → wiki pages
# -----------------------------------------------------------------------------
if ${run_harvest}; then
section "Phase 1b: URL harvesting"
harvest_args=()
${DRY_RUN} && harvest_args+=(--dry-run)
${NO_COMPILE} && harvest_args+=(--no-compile)
@@ -127,14 +189,14 @@ if [[ "${HYGIENE_ONLY}" != "true" ]]; then
log "[error] harvest failed (exit $?) — continuing to hygiene"
fi
else
section "Phase 1: URL harvesting (skipped)"
section "Phase 1b: URL harvesting (skipped)"
fi
# -----------------------------------------------------------------------------
# Phase 2: Hygiene
# -----------------------------------------------------------------------------
if [[ "${HARVEST_ONLY}" != "true" ]]; then
if ${run_hygiene}; then
section "Phase 2: Hygiene checks"
hygiene_args=()
if ${FULL_MODE}; then

scripts/wiki_lib.py

@@ -209,3 +209,63 @@ def iter_archived_pages() -> list[WikiPage]:
def page_content_hash(page: WikiPage) -> str:
    """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
    return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()


# ---------------------------------------------------------------------------
# Conversation hall parsing
# ---------------------------------------------------------------------------
#
# Summarized conversations have sections in the body like:
#
#   ## Decisions (hall: fact)
#   - bullet
#   - bullet
#
#   ## Discoveries (hall: discovery)
#   - bullet
#
# Hall types used by the summarizer: fact, discovery, preference, advice,
# event, tooling. Only fact/discovery/advice are high-signal enough to
# distill into wiki pages; the others are tracked but not auto-promoted.

HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}

_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)


def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
    """Extract hall-bucketed bullet content from a summarized conversation body.

    Returns a dict like:
        {"fact": ["claim one", "claim two"],
         "discovery": ["root cause X"],
         "advice": ["do Y", "consider Z"], ...}

    Empty hall types are omitted. Bullet lines are stripped of leading "- "
    and trailing whitespace; multi-line bullets are joined with a space.
    """
    result: dict[str, list[str]] = {}
    for match in _HALL_SECTION_RE.finditer(page.body):
        hall_type = match.group(1).strip().lower()
        section_body = match.group(2)
        bullets = [
            _flatten_bullet(b.group(1))
            for b in _BULLET_RE.finditer(section_body)
        ]
        bullets = [b for b in bullets if b]
        if bullets:
            result.setdefault(hall_type, []).extend(bullets)
    return result


def _flatten_bullet(text: str) -> str:
    """Collapse a possibly-multiline bullet into a single clean line."""
    return " ".join(text.split()).strip()


def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
    """Return only fact/discovery/advice content from a conversation."""
    all_halls = parse_conversation_halls(page)
    return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}
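
The hall parser can be exercised standalone; this is a minimal self-contained sketch using the same regexes on a raw body string (the sample body is invented, and `parse_halls` is a simplified stand-in for the real `parse_conversation_halls`, which takes a WikiPage):

```python
import re

# Same regex shapes as _HALL_SECTION_RE / _BULLET_RE in wiki_lib.py.
HALL_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)

def parse_halls(body: str) -> dict[str, list[str]]:
    out: dict[str, list[str]] = {}
    for m in HALL_RE.finditer(body):
        # Flatten each bullet to a single clean line.
        bullets = [" ".join(b.group(1).split()) for b in BULLET_RE.finditer(m.group(2))]
        bullets = [b for b in bullets if b]
        if bullets:
            out.setdefault(m.group(1).lower(), []).extend(bullets)
    return out

body = """## Decisions (hall: fact)
- use staging for all writes
## Discoveries (hall: discovery)
- root cause was a stale index
"""
parse_halls(body)  # {"fact": [...], "discovery": [...]}
```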