feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
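The narrow-today/wide-history selection reduces to a few lines. A minimal sketch, with dict-shaped conversations standing in for parsed wiki pages (an assumption for illustration, not the script's real data model):

```python
from datetime import date, timedelta

def topics_of_today(convs, target, lookback_days=0):
    """Steps 1-2: topics from conversations dated within the window."""
    cutoff = target - timedelta(days=lookback_days)
    return {t for c in convs if c["date"] >= cutoff for t in c["topics"]}

def rollup(topic, convs):
    """Step 3: all conversations across history sharing `topic`, newest first."""
    return sorted((c for c in convs if topic in c["topics"]),
                  key=lambda c: c["date"], reverse=True)

convs = [
    {"date": date(2024, 5, 1), "topics": ["zoho-api"]},          # dormant history
    {"date": date(2024, 6, 10), "topics": ["zoho-api", "cron"]}, # today
]
# Today's conversation re-activates the topic, so the rollup pulls in
# the dormant May conversation too — the self-triggering property.
print(topics_of_today(convs, date(2024, 6, 10)))
print(len(rollup("zoho-api", convs)))
```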
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
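The three `--*-only` flags are mutually exclusive; the rule boils down to a counter check. A standalone sketch of that rule (names mirror the orchestrator's variables, but this is an illustration, not the script itself):

```shell
#!/bin/sh
# Accepts one boolean per "--*-only" flag; succeeds iff at most one is set.
check_only_flags() {
  count=0
  for flag in "$@"; do
    if [ "$flag" = "true" ]; then
      count=$((count + 1))
    fi
  done
  [ "$count" -le 1 ]
}

#                DISTILL_ONLY HARVEST_ONLY HYGIENE_ONLY
check_only_flags true         false        false && echo "ok: one phase selected"
check_only_flags true         true         false || echo "rejected: two --*-only flags"
```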
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~700 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
scripts/wiki-distill.py (new file, 700 lines):
#!/usr/bin/env python3
"""Distill wiki pages from summarized conversation content.

This is the "closing the MemPalace loop" step: closet summaries become
the source material for new or updated wiki pages. It's parallel to
wiki-harvest.py (which compiles URL content into wiki pages) but operates
on the *content of the conversations themselves* rather than the URLs
they cite.

Scope filter (deliberately narrow):

1. Find all summarized conversations dated TODAY
2. Extract their `topics:` — this is the "topics-of-today" set
3. For each topic in that set, pull ALL summarized conversations across
   history that share that topic (rollup for full context)
4. For each topic group, extract `hall_facts` + `hall_discoveries` +
   `hall_advice` bullet content from the body
5. Send the topic group + relevant hall entries to `claude -p` with
   the current index.md, ask for new_page / update_page / both / skip
6. Write result(s) to staging/<type>/ with `staged_by: wiki-distill`

First run bootstrap (--first-run or empty state):

- Instead of "topics-of-today", use "topics-from-the-last-7-days"
- This seeds the state file so subsequent runs can stay narrow

Self-triggering property:

- Old dormant topics that resurface in a new conversation will
  automatically pull in all historical conversations on that topic
  via the rollup — no need to manually trigger reprocessing

State: `.distill-state.json` tracks processed conversations (path +
content hash + topics seen at distill time). A conversation is
re-processed if its content hash changes OR it has a new topic not
seen during the previous distill.

Usage:
    python3 scripts/wiki-distill.py               # Today-only rollup
    python3 scripts/wiki-distill.py --first-run   # Last 7 days rollup
    python3 scripts/wiki-distill.py --topic TOPIC # Process one topic explicitly
    python3 scripts/wiki-distill.py --project mc  # Only this wing's today topics
    python3 scripts/wiki-distill.py --dry-run     # Plan only, no LLM, no writes
    python3 scripts/wiki-distill.py --no-compile  # Parse/rollup only, skip claude -p
    python3 scripts/wiki-distill.py --limit N     # Cap at N topic groups processed
"""

from __future__ import annotations

import argparse
import hashlib
import json
import os
import re
import subprocess
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).parent))
from wiki_lib import (  # noqa: E402
    CONVERSATIONS_DIR,
    INDEX_FILE,
    STAGING_DIR,
    WIKI_DIR,
    WikiPage,
    high_signal_halls,
    parse_date,
    parse_page,
    today,
)

sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

DISTILL_STATE_FILE = WIKI_DIR / ".distill-state.json"

CLAUDE_HAIKU_MODEL = "haiku"
CLAUDE_SONNET_MODEL = "sonnet"
# Content size (characters) above which we route to sonnet
SONNET_CONTENT_THRESHOLD = 15_000
CLAUDE_TIMEOUT = 600

FIRST_RUN_LOOKBACK_DAYS = 7

# Minimum number of total hall bullets across the topic group to bother
# asking the LLM. A topic with only one fact/discovery across history is
# usually not enough signal to warrant a wiki page.
MIN_BULLETS_PER_TOPIC = 2


# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


def load_state() -> dict[str, Any]:
    defaults: dict[str, Any] = {
        "processed_convs": {},
        "processed_topics": {},
        "rejected_topics": {},
        "last_run": None,
        "first_run_complete": False,
    }
    if DISTILL_STATE_FILE.exists():
        try:
            with open(DISTILL_STATE_FILE) as f:
                state = json.load(f)
            for k, v in defaults.items():
                state.setdefault(k, v)
            return state
        except (OSError, json.JSONDecodeError):
            pass
    return defaults


def save_state(state: dict[str, Any]) -> None:
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    tmp = DISTILL_STATE_FILE.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)
    tmp.replace(DISTILL_STATE_FILE)


def conv_content_hash(conv: WikiPage) -> str:
    return "sha256:" + hashlib.sha256(conv.body.encode("utf-8")).hexdigest()


def conv_needs_distill(state: dict[str, Any], conv: WikiPage) -> bool:
    """Return True if this conversation should be re-processed."""
    rel = str(conv.path.relative_to(WIKI_DIR))
    entry = state.get("processed_convs", {}).get(rel)
    if not entry:
        return True
    if entry.get("content_hash") != conv_content_hash(conv):
        return True
    # New topics that weren't seen at distill time → re-process
    seen_topics = set(entry.get("topics_at_distill", []))
    current_topics = set(conv.frontmatter.get("topics") or [])
    if current_topics - seen_topics:
        return True
    return False


def mark_conv_distilled(
    state: dict[str, Any],
    conv: WikiPage,
    output_pages: list[str],
) -> None:
    rel = str(conv.path.relative_to(WIKI_DIR))
    state.setdefault("processed_convs", {})[rel] = {
        "distilled_date": today().isoformat(),
        "content_hash": conv_content_hash(conv),
        "topics_at_distill": list(conv.frontmatter.get("topics") or []),
        "output_pages": output_pages,
    }


# ---------------------------------------------------------------------------
# Conversation discovery & topic rollup
# ---------------------------------------------------------------------------


def iter_summarized_conversations(project_filter: str | None = None) -> list[WikiPage]:
    """Walk conversations/ and return all summarized conversation pages."""
    if not CONVERSATIONS_DIR.exists():
        return []
    results: list[WikiPage] = []
    for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
        if not project_dir.is_dir():
            continue
        if project_filter and project_dir.name != project_filter:
            continue
        for md in sorted(project_dir.glob("*.md")):
            page = parse_page(md)
            if not page:
                continue
            if page.frontmatter.get("status") != "summarized":
                continue
            results.append(page)
    return results


def extract_topics_from_today(
    conversations: list[WikiPage],
    target_date: date,
    lookback_days: int = 0,
) -> set[str]:
    """Find the set of topics appearing in conversations dated ≥ (target - lookback).

    lookback_days=0 → only today
    lookback_days=7 → today and the previous 7 days
    """
    cutoff = target_date - timedelta(days=lookback_days)
    topics: set[str] = set()
    for conv in conversations:
        d = parse_date(conv.frontmatter.get("date"))
        if d and d >= cutoff:
            for t in conv.frontmatter.get("topics") or []:
                t_clean = str(t).strip()
                if t_clean:
                    topics.add(t_clean)
    return topics


def rollup_conversations_by_topic(
    topic: str, conversations: list[WikiPage]
) -> list[WikiPage]:
    """Return all conversations (across all time) whose topics: list contains `topic`."""
    results: list[WikiPage] = []
    for conv in conversations:
        conv_topics = conv.frontmatter.get("topics") or []
        if topic in conv_topics:
            results.append(conv)
    # Most recent first so the LLM sees the current state before the backstory
    results.sort(
        key=lambda c: parse_date(c.frontmatter.get("date")) or date.min,
        reverse=True,
    )
    return results


# ---------------------------------------------------------------------------
# Build the LLM input for a topic group
# ---------------------------------------------------------------------------


@dataclass
class TopicGroup:
    topic: str
    conversations: list[WikiPage]
    halls_by_conv: list[dict[str, list[str]]]
    total_bullets: int


def build_topic_group(topic: str, conversations: list[WikiPage]) -> TopicGroup:
    halls_by_conv: list[dict[str, list[str]]] = []
    total = 0
    for conv in conversations:
        halls = high_signal_halls(conv)
        halls_by_conv.append(halls)
        total += sum(len(v) for v in halls.values())
    return TopicGroup(
        topic=topic,
        conversations=conversations,
        halls_by_conv=halls_by_conv,
        total_bullets=total,
    )


def format_topic_group_for_llm(group: TopicGroup) -> str:
    """Render a topic group as a prompt-friendly markdown block."""
    lines = [f"# Topic: {group.topic}", ""]
    lines.append(
        f"Found {len(group.conversations)} summarized conversation(s) tagged "
        f"with this topic, containing {group.total_bullets} high-signal bullets "
        f"across fact/discovery/advice halls."
    )
    lines.append("")
    for conv, halls in zip(group.conversations, group.halls_by_conv):
        rel = str(conv.path.relative_to(WIKI_DIR))
        date_str = conv.frontmatter.get("date", "unknown")
        title = conv.frontmatter.get("title", conv.path.stem)
        project = conv.frontmatter.get("project", "?")
        lines.append(f"## {date_str} — {title} ({project})")
        lines.append(f"_Source: `{rel}`_")
        lines.append("")
        for hall_type in ("fact", "discovery", "advice"):
            bullets = halls.get(hall_type) or []
            if not bullets:
                continue
            label = {"fact": "Decisions", "discovery": "Discoveries", "advice": "Advice"}[hall_type]
            lines.append(f"**{label}:**")
            for b in bullets:
                lines.append(f"- {b}")
            lines.append("")
    return "\n".join(lines)


# ---------------------------------------------------------------------------
# Claude compilation
# ---------------------------------------------------------------------------


DISTILL_PROMPT_TEMPLATE = """You are distilling wiki pages from summarized conversation content.

The wiki schema and conventions are defined in CLAUDE.md. The wiki has four
content directories: patterns/ (HOW), decisions/ (WHY), environments/ (WHERE),
concepts/ (WHAT). All pages require YAML frontmatter with title, type,
confidence, origin, sources, related, last_compiled, last_verified.

IMPORTANT: Do NOT include `status`, `staged_*`, `target_path`, `modifies`,
or `compilation_notes` fields in your page frontmatter — the distill script
injects those automatically.

Your task: given a topic group (all conversations across history that share
a topic, with their decisions/discoveries/advice), decide what wiki pages
should be created or updated. Emit a single JSON object with an `actions`
array. Each action is one of:

- "new_page" — create a new wiki page from the distilled knowledge
- "update_page" — update an existing live wiki page (add content, merge)
- "skip" — content isn't substantive enough for a wiki page
  OR the topic is already well-covered elsewhere

Schema:

{{
  "rationale": "1-2 sentences explaining your decision",
  "actions": [
    {{
      "type": "new_page",
      "directory": "patterns" | "decisions" | "environments" | "concepts",
      "filename": "kebab-case-name.md",
      "content": "full markdown including frontmatter"
    }},
    {{
      "type": "update_page",
      "path": "patterns/existing-page.md",
      "content": "full updated markdown including frontmatter (merged)"
    }},
    {{
      "type": "skip",
      "reason": "why this topic doesn't need a wiki page"
    }}
  ]
}}

You can emit MULTIPLE actions — e.g. a new_page for a concept and an
update_page to an existing pattern that now has new context.

Emit ONLY the JSON object. No prose, no markdown fences.

--- WIKI INDEX (existing pages) ---

{wiki_index}

--- TOPIC GROUP ---

{topic_group}
"""


def call_claude_distill(prompt: str, model: str) -> dict[str, Any] | None:
    try:
        result = subprocess.run(
            ["claude", "-p", "--model", model, "--output-format", "text", prompt],
            capture_output=True,
            text=True,
            timeout=CLAUDE_TIMEOUT,
        )
    except FileNotFoundError:
        print(" [warn] claude CLI not found — skipping compilation", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print(" [warn] claude -p timed out", file=sys.stderr)
        return None
    if result.returncode != 0:
        print(f" [warn] claude -p failed: {result.stderr.strip()[:200]}", file=sys.stderr)
        return None

    output = result.stdout.strip()
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        print(f" [warn] no JSON found in claude output ({len(output)} chars)", file=sys.stderr)
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        print(f" [warn] JSON parse failed: {e}", file=sys.stderr)
        return None


# ---------------------------------------------------------------------------
# Staging output
# ---------------------------------------------------------------------------


STAGING_INJECT_TEMPLATE = (
    "---\n"
    "origin: automated\n"
    "status: pending\n"
    "staged_date: {staged_date}\n"
    "staged_by: wiki-distill\n"
    "target_path: {target_path}\n"
    "{modifies_line}"
    "distill_topic: {topic}\n"
    "distill_source_conversations: {source_convs}\n"
    "compilation_notes: {compilation_notes}\n"
)


def _inject_staging_frontmatter(
    content: str,
    target_path: str,
    topic: str,
    source_convs: list[str],
    compilation_notes: str,
    modifies: str | None,
) -> str:
    content = re.sub(
        r"^(status|origin|staged_\w+|target_path|modifies|distill_\w+|compilation_notes):.*\n",
        "",
        content,
        flags=re.MULTILINE,
    )

    modifies_line = f"modifies: {modifies}\n" if modifies else ""
    clean_notes = compilation_notes.replace("\n", " ").replace("\r", " ").strip()
    sources_yaml = ",".join(source_convs)
    injection = STAGING_INJECT_TEMPLATE.format(
        staged_date=datetime.now(timezone.utc).date().isoformat(),
        target_path=target_path,
        modifies_line=modifies_line,
        topic=topic,
        source_convs=sources_yaml,
        compilation_notes=clean_notes or "(distilled from conversation topic group)",
    )

    if content.startswith("---\n"):
        return injection + content[4:]
    return injection + "---\n" + content


def _unique_staging_path(base: Path) -> Path:
    if not base.exists():
        return base
    suffix = hashlib.sha256(str(base).encode() + str(time.time()).encode()).hexdigest()[:6]
    return base.with_stem(f"{base.stem}-{suffix}")


def apply_distill_actions(
    result: dict[str, Any],
    topic: str,
    source_convs: list[str],
    dry_run: bool,
) -> list[Path]:
    written: list[Path] = []
    actions = result.get("actions") or []
    rationale = result.get("rationale", "")

    for action in actions:
        action_type = action.get("type")
        if action_type == "skip":
            reason = action.get("reason", "not substantive enough")
            print(f" [skip] topic={topic!r}: {reason}")
            continue

        if action_type == "new_page":
            directory = action.get("directory") or "patterns"
            filename = action.get("filename")
            content = action.get("content")
            if not filename or not content:
                print(f" [warn] incomplete new_page action for topic={topic!r}", file=sys.stderr)
                continue
            target_rel = f"{directory}/{filename}"
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] new_page → {dest.relative_to(WIKI_DIR)}")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=None,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [new] {dest.relative_to(WIKI_DIR)}")
            continue

        if action_type == "update_page":
            target_rel = action.get("path")
            content = action.get("content")
            if not target_rel or not content:
                print(f" [warn] incomplete update_page action for topic={topic!r}", file=sys.stderr)
                continue
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] update_page → {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=target_rel,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [upd] {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
            continue

        print(f" [warn] unknown action type: {action_type!r}", file=sys.stderr)

    return written


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------


def pick_model(topic_group: TopicGroup, prompt: str) -> str:
    if len(prompt) > SONNET_CONTENT_THRESHOLD or topic_group.total_bullets > 20:
        return CLAUDE_SONNET_MODEL
    return CLAUDE_HAIKU_MODEL


def process_topic(
    topic: str,
    conversations: list[WikiPage],
    state: dict[str, Any],
    dry_run: bool,
    compile_enabled: bool,
) -> tuple[str, list[Path]]:
    """Process a single topic group. Returns (status, written_paths)."""

    group = build_topic_group(topic, conversations)

    if group.total_bullets < MIN_BULLETS_PER_TOPIC:
        return f"too-thin (only {group.total_bullets} bullets)", []

    if topic in state.get("rejected_topics", {}):
        return "previously-rejected", []

    wiki_index_text = ""
    try:
        wiki_index_text = INDEX_FILE.read_text()[:15_000]
    except OSError:
        pass

    topic_group_text = format_topic_group_for_llm(group)
    prompt = DISTILL_PROMPT_TEMPLATE.format(
        wiki_index=wiki_index_text,
        topic_group=topic_group_text,
    )

    if dry_run:
        model = pick_model(group, prompt)
        return (
            f"would-distill ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets, {model})"
        ), []

    if not compile_enabled:
        return (
            f"skipped-compile ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets)"
        ), []

    model = pick_model(group, prompt)
    print(f" [compile] topic={topic!r} "
          f"convs={len(group.conversations)} bullets={group.total_bullets} model={model}")

    result = call_claude_distill(prompt, model)
    if result is None:
        return "compile-failed", []

    actions = result.get("actions") or []
    if not actions or all(a.get("type") == "skip" for a in actions):
        reason = result.get("rationale", "AI chose to skip")
        state.setdefault("rejected_topics", {})[topic] = {
            "reason": reason,
            "rejected_date": today().isoformat(),
        }
        return "rejected-by-llm", []

    source_convs = [str(c.path.relative_to(WIKI_DIR)) for c in group.conversations]
    written = apply_distill_actions(result, topic, source_convs, dry_run=False)

    for conv in group.conversations:
        mark_conv_distilled(state, conv, [str(p.relative_to(WIKI_DIR)) for p in written])

    state.setdefault("processed_topics", {})[topic] = {
        "distilled_date": today().isoformat(),
        "conversations": source_convs,
        "output_pages": [str(p.relative_to(WIKI_DIR)) for p in written],
    }

    return f"distilled ({len(written)} page(s))", written


def run(
    *,
    first_run: bool,
    explicit_topic: str | None,
    project_filter: str | None,
    dry_run: bool,
    compile_enabled: bool,
    limit: int,
) -> int:
    state = load_state()
    if not state.get("first_run_complete"):
        first_run = True

    all_convs = iter_summarized_conversations(project_filter)
    print(f"Scanning {len(all_convs)} summarized conversation(s)...")

    # Figure out which topics to process
    if explicit_topic:
        topics_to_process: set[str] = {explicit_topic}
        print(f"Explicit topic mode: {explicit_topic!r}")
    else:
        lookback = FIRST_RUN_LOOKBACK_DAYS if first_run else 0
        topics_to_process = extract_topics_from_today(all_convs, today(), lookback)
        if first_run:
            print(f"First-run bootstrap: last {FIRST_RUN_LOOKBACK_DAYS} days → "
                  f"{len(topics_to_process)} topic(s)")
        else:
            print(f"Today-only mode: {len(topics_to_process)} topic(s) from today's conversations")

    if not topics_to_process:
        print("No topics to distill.")
        if first_run:
            state["first_run_complete"] = True
            save_state(state)
        return 0

    # Sort for deterministic ordering
    topics_ordered = sorted(topics_to_process)

    stats: dict[str, int] = {}
    processed = 0
    total_written: list[Path] = []

    for topic in topics_ordered:
        convs = rollup_conversations_by_topic(topic, all_convs)
        if not convs:
            stats["no-matches"] = stats.get("no-matches", 0) + 1
            continue

        print(f"\n[{topic}] rollup: {len(convs)} conversation(s)")
        status, written = process_topic(
            topic, convs, state, dry_run=dry_run, compile_enabled=compile_enabled
        )
        stats[status.split(" ")[0]] = stats.get(status.split(" ")[0], 0) + 1
        print(f" [{status}]")

        total_written.extend(written)
        if not dry_run:
            processed += 1
            save_state(state)

        if limit and processed >= limit:
            print(f"\nLimit reached ({limit}); stopping.")
            break

    if first_run and not dry_run:
        state["first_run_complete"] = True
    if not dry_run:
        save_state(state)

    print("\nSummary:")
    for status, count in sorted(stats.items()):
        print(f" {status}: {count}")
    print(f"\n{len(total_written)} staging page(s) written")
    return 0


def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("--first-run", action="store_true",
                        help="Bootstrap with last 7 days instead of today-only")
    parser.add_argument("--topic", default=None,
                        help="Process one specific topic explicitly")
    parser.add_argument("--project", default=None,
                        help="Only consider conversations under this wing")
    parser.add_argument("--dry-run", action="store_true",
                        help="Plan only; no LLM calls, no writes")
    parser.add_argument("--no-compile", action="store_true",
                        help="Parse + rollup only; skip claude -p step")
    parser.add_argument("--limit", type=int, default=0,
                        help="Stop after N topic groups processed (0 = unlimited)")
    args = parser.parse_args()

    return run(
        first_run=args.first_run,
        explicit_topic=args.topic,
        project_filter=args.project,
        dry_run=args.dry_run,
        compile_enabled=not args.no_compile,
        limit=args.limit,
    )


if __name__ == "__main__":
    sys.exit(main())
|
||||
@@ -3,19 +3,26 @@ set -euo pipefail
|
||||
|
||||
# wiki-maintain.sh — Top-level orchestrator for wiki maintenance.
|
||||
#
|
||||
-# Chains the three maintenance scripts in the correct order:
-#   1. wiki-harvest.py (URL harvesting from summarized conversations)
-#   2. wiki-hygiene.py (quick or full hygiene checks)
-#   3. qmd update && qmd embed (reindex after changes)
+# Chains the maintenance scripts in the correct order:
+#   1a. wiki-distill.py (conversation summaries → wiki pages via claude -p)
+#   1b. wiki-harvest.py (URL content from conversations → wiki pages)
+#   2.  wiki-hygiene.py (quick or full hygiene checks)
+#   3.  qmd update && qmd embed (reindex after changes)
 #
+# Distill runs BEFORE harvest: conversation content takes priority over
+# URL content. If a topic is already discussed in the conversations, we
+# want the conversation rollup to drive the page, not a cited URL.
+#
 # Usage:
-#   wiki-maintain.sh                        # Harvest + quick hygiene
-#   wiki-maintain.sh --full                 # Harvest + full hygiene (LLM-powered)
+#   wiki-maintain.sh                        # Distill + harvest + quick hygiene + reindex
+#   wiki-maintain.sh --full                 # Everything with full hygiene (LLM)
+#   wiki-maintain.sh --distill-only         # Conversation distillation only
+#   wiki-maintain.sh --harvest-only         # URL harvesting only
+#   wiki-maintain.sh --hygiene-only         # Quick hygiene only
+#   wiki-maintain.sh --hygiene-only --full  # Full hygiene only
-#   wiki-maintain.sh --dry-run              # Show what would run (no writes)
-#   wiki-maintain.sh --no-compile           # Harvest without claude -p compilation step
-#   wiki-maintain.sh --hygiene-only         # Hygiene only
+#   wiki-maintain.sh --no-distill           # Skip distillation phase
+#   wiki-maintain.sh --distill-first-run    # Bootstrap distill with last 7 days
+#   wiki-maintain.sh --dry-run              # Show what would run (no writes, no LLM)
+#   wiki-maintain.sh --no-compile           # Skip claude -p in harvest AND distill
+#   wiki-maintain.sh --no-reindex           # Skip qmd update/embed after
 #
 # Log file: scripts/.maintain.log (rotated manually)
@@ -32,22 +39,28 @@ LOG_FILE="${SCRIPTS_DIR}/.maintain.log"
 # -----------------------------------------------------------------------------
 
 FULL_MODE=false
+DISTILL_ONLY=false
 HARVEST_ONLY=false
 HYGIENE_ONLY=false
+NO_DISTILL=false
+DISTILL_FIRST_RUN=false
 DRY_RUN=false
 NO_COMPILE=false
 NO_REINDEX=false
 
 while [[ $# -gt 0 ]]; do
   case "$1" in
-    --full) FULL_MODE=true; shift ;;
-    --harvest-only) HARVEST_ONLY=true; shift ;;
-    --hygiene-only) HYGIENE_ONLY=true; shift ;;
-    --dry-run) DRY_RUN=true; shift ;;
-    --no-compile) NO_COMPILE=true; shift ;;
-    --no-reindex) NO_REINDEX=true; shift ;;
+    --full) FULL_MODE=true; shift ;;
+    --distill-only) DISTILL_ONLY=true; shift ;;
+    --harvest-only) HARVEST_ONLY=true; shift ;;
+    --hygiene-only) HYGIENE_ONLY=true; shift ;;
+    --no-distill) NO_DISTILL=true; shift ;;
+    --distill-first-run) DISTILL_FIRST_RUN=true; shift ;;
+    --dry-run) DRY_RUN=true; shift ;;
+    --no-compile) NO_COMPILE=true; shift ;;
+    --no-reindex) NO_REINDEX=true; shift ;;
     -h|--help)
-      sed -n '3,20p' "$0" | sed 's/^# \?//'
+      sed -n '3,28p' "$0" | sed 's/^# \?//'
      exit 0
      ;;
    *)
@@ -57,8 +70,13 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
-if [[ "${HARVEST_ONLY}" == "true" && "${HYGIENE_ONLY}" == "true" ]]; then
-  echo "--harvest-only and --hygiene-only are mutually exclusive" >&2
+# Mutex check — only one "only" flag at a time
+only_count=0
+${DISTILL_ONLY} && only_count=$((only_count + 1))
+${HARVEST_ONLY} && only_count=$((only_count + 1))
+${HYGIENE_ONLY} && only_count=$((only_count + 1))
+if [[ $only_count -gt 1 ]]; then
+  echo "--distill-only, --harvest-only, and --hygiene-only are mutually exclusive" >&2
   exit 1
 fi
 
@@ -91,13 +109,36 @@ cd "${WIKI_DIR}"
 for req in python3 qmd; do
   if ! command -v "${req}" >/dev/null 2>&1; then
     if [[ "${req}" == "qmd" && "${NO_REINDEX}" == "true" ]]; then
-      continue # qmd not required if --no-reindex
+      continue
     fi
     echo "Required command not found: ${req}" >&2
     exit 1
   fi
 done
 
 # -----------------------------------------------------------------------------
+# Determine which phases to run
+# -----------------------------------------------------------------------------
+
+run_distill=true
+run_harvest=true
+run_hygiene=true
+
+${NO_DISTILL} && run_distill=false
+
+if ${DISTILL_ONLY}; then
+  run_harvest=false
+  run_hygiene=false
+fi
+if ${HARVEST_ONLY}; then
+  run_distill=false
+  run_hygiene=false
+fi
+if ${HYGIENE_ONLY}; then
+  run_distill=false
+  run_harvest=false
+fi
+
+# -----------------------------------------------------------------------------
 # Pipeline
 # -----------------------------------------------------------------------------
@@ -105,18 +146,39 @@ done
 START_TS="$(date '+%s')"
 section "wiki-maintain.sh starting"
 log "mode: $(${FULL_MODE} && echo full || echo quick)"
-log "harvest: $(${HYGIENE_ONLY} && echo skipped || echo enabled)"
-log "hygiene: $(${HARVEST_ONLY} && echo skipped || echo enabled)"
+log "distill: $(${run_distill} && echo enabled || echo skipped)"
+log "harvest: $(${run_harvest} && echo enabled || echo skipped)"
+log "hygiene: $(${run_hygiene} && echo enabled || echo skipped)"
 log "reindex: $(${NO_REINDEX} && echo skipped || echo enabled)"
 log "dry-run: ${DRY_RUN}"
 log "wiki: ${WIKI_DIR}"
 
 # -----------------------------------------------------------------------------
-# Phase 1: Harvest
+# Phase 1a: Distill — conversations → wiki pages
 # -----------------------------------------------------------------------------
 
-if [[ "${HYGIENE_ONLY}" != "true" ]]; then
-  section "Phase 1: URL harvesting"
+if ${run_distill}; then
+  section "Phase 1a: Conversation distillation"
+  distill_args=()
+  ${DRY_RUN} && distill_args+=(--dry-run)
+  ${NO_COMPILE} && distill_args+=(--no-compile)
+  ${DISTILL_FIRST_RUN} && distill_args+=(--first-run)
+
+  if python3 "${SCRIPTS_DIR}/wiki-distill.py" "${distill_args[@]}"; then
+    log "distill completed"
+  else
+    log "[error] distill failed (exit $?) — continuing to harvest"
+  fi
+else
+  section "Phase 1a: Conversation distillation (skipped)"
+fi
+
+# -----------------------------------------------------------------------------
+# Phase 1b: Harvest — URLs cited in conversations → raw/ → wiki pages
+# -----------------------------------------------------------------------------
+
+if ${run_harvest}; then
+  section "Phase 1b: URL harvesting"
   harvest_args=()
   ${DRY_RUN} && harvest_args+=(--dry-run)
   ${NO_COMPILE} && harvest_args+=(--no-compile)
@@ -127,14 +189,14 @@ if [[ "${HYGIENE_ONLY}" != "true" ]]; then
     log "[error] harvest failed (exit $?) — continuing to hygiene"
   fi
 else
-  section "Phase 1: URL harvesting (skipped)"
+  section "Phase 1b: URL harvesting (skipped)"
 fi
 
 # -----------------------------------------------------------------------------
 # Phase 2: Hygiene
 # -----------------------------------------------------------------------------
 
-if [[ "${HARVEST_ONLY}" != "true" ]]; then
+if ${run_hygiene}; then
   section "Phase 2: Hygiene checks"
   hygiene_args=()
   if ${FULL_MODE}; then
 
@@ -209,3 +209,63 @@ def iter_archived_pages() -> list[WikiPage]:
 def page_content_hash(page: WikiPage) -> str:
     """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
     return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()
+
+
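The body-only hashing above can be sanity-checked in isolation. A minimal sketch, using a hypothetical two-field stand-in for WikiPage (the real class is defined elsewhere in these scripts), showing that frontmatter edits leave the hash unchanged:

```python
import hashlib
from dataclasses import dataclass


# Hypothetical minimal stand-in for WikiPage, just enough to exercise
# page_content_hash; the real class lives elsewhere in the repo.
@dataclass
class WikiPage:
    frontmatter: dict
    body: str


def page_content_hash(page: WikiPage) -> str:
    # Same body-only hash as above: frontmatter is deliberately excluded.
    return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()


before = WikiPage({"last_verified": "2024-01-01"}, "Same body.\n")
after = WikiPage({"last_verified": "2026-02-10"}, "Same body.\n")
# Hygiene refreshed last_verified, but the content hash stays stable.
```

Because only `page.body` feeds the digest, hygiene's mechanical `last_verified` refreshes never look like content changes downstream.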
+# ---------------------------------------------------------------------------
+# Conversation hall parsing
+# ---------------------------------------------------------------------------
+#
+# Summarized conversations have sections in the body like:
+#   ## Decisions (hall: fact)
+#   - bullet
+#   - bullet
+#   ## Discoveries (hall: discovery)
+#   - bullet
+#
+# Hall types used by the summarizer: fact, discovery, preference, advice,
+# event, tooling. Only fact/discovery/advice are high-signal enough to
+# distill into wiki pages; the others are tracked but not auto-promoted.
+
+HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}
+
+_HALL_SECTION_RE = re.compile(
+    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
+    re.MULTILINE | re.DOTALL,
+)
+_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)
+
+
+def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
+    """Extract hall-bucketed bullet content from a summarized conversation body.
+
+    Returns a dict like:
+        {"fact": ["claim one", "claim two"],
+         "discovery": ["root cause X"],
+         "advice": ["do Y", "consider Z"], ...}
+
+    Empty hall types are omitted. Bullet lines are stripped of leading "- "
+    and trailing whitespace; multi-line bullets are joined with a space.
+    """
+    result: dict[str, list[str]] = {}
+    for match in _HALL_SECTION_RE.finditer(page.body):
+        hall_type = match.group(1).strip().lower()
+        section_body = match.group(2)
+        bullets = [
+            _flatten_bullet(b.group(1))
+            for b in _BULLET_RE.finditer(section_body)
+        ]
+        bullets = [b for b in bullets if b]
+        if bullets:
+            result.setdefault(hall_type, []).extend(bullets)
+    return result
+
+
+def _flatten_bullet(text: str) -> str:
+    """Collapse a possibly-multiline bullet into a single clean line."""
+    return " ".join(text.split()).strip()
+
+
+def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
+    """Return only fact/discovery/advice content from a conversation."""
+    all_halls = parse_conversation_halls(page)
+    return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}
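The hall parser added in this hunk can be exercised end to end. A self-contained sketch that inlines the same regexes and runs them over a raw body string instead of a WikiPage; the sample conversation body is invented for illustration:

```python
import re

HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}

# Same section/bullet regexes as in the hunk above.
_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)


def parse_halls(body: str) -> dict[str, list[str]]:
    # Simplified to take a raw string so the sketch runs standalone.
    result: dict[str, list[str]] = {}
    for match in _HALL_SECTION_RE.finditer(body):
        hall = match.group(1).strip().lower()
        bullets = [" ".join(b.group(1).split()) for b in _BULLET_RE.finditer(match.group(2))]
        bullets = [b for b in bullets if b]
        if bullets:
            result.setdefault(hall, []).extend(bullets)
    return result


sample = """\
## Decisions (hall: fact)
- chose sqlite for the distill state file
## Discoveries (hall: discovery)
- qmd embed is idempotent
## Mood (hall: event)
- long debugging session
"""

halls = parse_halls(sample)
high_signal = {k: v for k, v in halls.items() if k in HIGH_SIGNAL_HALLS}
# halls keeps all three halls; high_signal drops the "event" section.
```

The `(?=^##\s|\Z)` lookahead is what keeps each section's capture from swallowing the next `##` heading, so every hall gets only its own bullets.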