Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: fields — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
   (the high-signal hall types; event/preference/tooling are skipped)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
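
A staged page produced by distill might carry frontmatter like the
following. The key names are the ones listed above; the title, paths,
and values are invented for illustration:

```yaml
---
title: Zoho API rate-limit pattern
type: pattern
status: staged
staged_by: wiki-distill
distill_topic: zoho-api
distill_source_conversations:
  - conversations/2026-01-12-zoho-oauth.md
  - conversations/2026-02-03-zoho-bulk-read.md
compilation_notes: rolled up 2 conversations (34 bullets)
---
```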
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well covered
by an existing live page), so the "skip" path works.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
## scripts/wiki_lib.py (272 lines, 8.1 KiB, Python)
```python
"""Shared helpers for wiki maintenance scripts.

Provides frontmatter parsing/serialization, WikiPage dataclass, and common
constants used by wiki-hygiene.py, wiki-staging.py, and wiki-harvest.py.
"""

from __future__ import annotations

import hashlib
import os
import re
from dataclasses import dataclass
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Any

# Wiki root — override via WIKI_DIR env var for tests / alternate installs
WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
INDEX_FILE = WIKI_DIR / "index.md"
STAGING_DIR = WIKI_DIR / "staging"
STAGING_INDEX = STAGING_DIR / "index.md"
ARCHIVE_DIR = WIKI_DIR / "archive"
ARCHIVE_INDEX = ARCHIVE_DIR / "index.md"
REPORTS_DIR = WIKI_DIR / "reports"
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
HARVEST_STATE_FILE = WIKI_DIR / ".harvest-state.json"

LIVE_CONTENT_DIRS = ["patterns", "decisions", "concepts", "environments"]

FM_FENCE = "---\n"


@dataclass
class WikiPage:
    path: Path
    frontmatter: dict[str, Any]
    fm_raw: str
    body: str
    fm_start: int


def today() -> date:
    return datetime.now(timezone.utc).date()


def parse_date(value: Any) -> date | None:
    if not value:
        return None
    if isinstance(value, date):
        return value
    s = str(value).strip()
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None


def parse_page(path: Path) -> WikiPage | None:
    """Parse a markdown page with YAML frontmatter. Returns None if no frontmatter."""
    try:
        text = path.read_text()
    except OSError:
        return None
    if not text.startswith(FM_FENCE):
        return None
    end = text.find("\n---\n", 4)
    if end == -1:
        return None
    fm_raw = text[4:end]
    body = text[end + 5 :]
    fm = parse_yaml_lite(fm_raw)
    return WikiPage(path=path, frontmatter=fm, fm_raw=fm_raw, body=body, fm_start=end + 5)


def parse_yaml_lite(text: str) -> dict[str, Any]:
    """Parse a subset of YAML used in wiki frontmatter.

    Supports:
    - key: value
    - key: [a, b, c]
    - key:
        - a
        - b
    """
    result: dict[str, Any] = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.strip() or line.lstrip().startswith("#"):
            i += 1
            continue
        m = re.match(r"^([\w_-]+):\s*(.*)$", line)
        if not m:
            i += 1
            continue
        key, rest = m.group(1), m.group(2).strip()
        if rest == "":
            items: list[str] = []
            j = i + 1
            while j < len(lines) and re.match(r"^\s+-\s+", lines[j]):
                items.append(re.sub(r"^\s+-\s+", "", lines[j]).strip())
                j += 1
            if items:
                result[key] = items
                i = j
                continue
            result[key] = ""
            i += 1
            continue
        if rest.startswith("[") and rest.endswith("]"):
            inner = rest[1:-1].strip()
            if inner:
                result[key] = [x.strip().strip('"').strip("'") for x in inner.split(",")]
            else:
                result[key] = []
            i += 1
            continue
        result[key] = rest.strip('"').strip("'")
        i += 1
    return result


# Canonical frontmatter key order for serialization
PREFERRED_KEY_ORDER = [
    "title", "type", "confidence",
    "status", "origin",
    "last_compiled", "last_verified",
    "staged_date", "staged_by", "target_path", "modifies", "compilation_notes",
    "archived_date", "archived_reason", "original_path",
    "sources", "related",
]


def serialize_frontmatter(fm: dict[str, Any]) -> str:
    """Serialize a frontmatter dict back to YAML in the wiki's canonical style."""
    out_lines: list[str] = []
    seen: set[str] = set()
    for key in PREFERRED_KEY_ORDER:
        if key in fm:
            out_lines.append(_format_fm_entry(key, fm[key]))
            seen.add(key)
    for key in sorted(fm.keys()):
        if key in seen:
            continue
        out_lines.append(_format_fm_entry(key, fm[key]))
    return "\n".join(out_lines)


def _format_fm_entry(key: str, value: Any) -> str:
    if isinstance(value, list):
        if not value:
            return f"{key}: []"
        lines = [f"{key}:"]
        for item in value:
            lines.append(f"  - {item}")
        return "\n".join(lines)
    return f"{key}: {value}"


def write_page(page: WikiPage, new_fm: dict[str, Any] | None = None, new_body: str | None = None) -> None:
    fm = new_fm if new_fm is not None else page.frontmatter
    body = new_body if new_body is not None else page.body
    fm_yaml = serialize_frontmatter(fm)
    text = f"---\n{fm_yaml}\n---\n{body}"
    page.path.write_text(text)


def iter_live_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    for sub in LIVE_CONTENT_DIRS:
        for md in sorted((WIKI_DIR / sub).glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def iter_staging_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    if not STAGING_DIR.exists():
        return pages
    for sub in LIVE_CONTENT_DIRS:
        d = STAGING_DIR / sub
        if not d.exists():
            continue
        for md in sorted(d.glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def iter_archived_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    if not ARCHIVE_DIR.exists():
        return pages
    for sub in LIVE_CONTENT_DIRS:
        d = ARCHIVE_DIR / sub
        if not d.exists():
            continue
        for md in sorted(d.glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def page_content_hash(page: WikiPage) -> str:
    """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
    return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()


# ---------------------------------------------------------------------------
# Conversation hall parsing
# ---------------------------------------------------------------------------
#
# Summarized conversations have sections in the body like:
#   ## Decisions (hall: fact)
#   - bullet
#   - bullet
#   ## Discoveries (hall: discovery)
#   - bullet
#
# Hall types used by the summarizer: fact, discovery, preference, advice,
# event, tooling. Only fact/discovery/advice are high-signal enough to
# distill into wiki pages; the others are tracked but not auto-promoted.

HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}

_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)


def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
    """Extract hall-bucketed bullet content from a summarized conversation body.

    Returns a dict like:
      {"fact": ["claim one", "claim two"],
       "discovery": ["root cause X"],
       "advice": ["do Y", "consider Z"], ...}

    Empty hall types are omitted. Bullet lines are stripped of leading "- "
    and trailing whitespace; multi-line bullets are joined with a space.
    """
    result: dict[str, list[str]] = {}
    for match in _HALL_SECTION_RE.finditer(page.body):
        hall_type = match.group(1).strip().lower()
        section_body = match.group(2)
        bullets = [
            _flatten_bullet(b.group(1))
            for b in _BULLET_RE.finditer(section_body)
        ]
        bullets = [b for b in bullets if b]
        if bullets:
            result.setdefault(hall_type, []).extend(bullets)
    return result


def _flatten_bullet(text: str) -> str:
    """Collapse a possibly-multiline bullet into a single clean line."""
    return " ".join(text.split()).strip()


def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
    """Return only fact/discovery/advice content from a conversation."""
    all_halls = parse_conversation_halls(page)
    return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}
```