feat(distill): close the MemPalace loop — conversations → wiki pages

Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.

## The gap distill closes

The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.

## What distill does

Narrow today-filter with historical rollup:

  1. Find all summarized conversations dated TODAY
  2. Extract their topics: — this is the "topics of today" set
  3. For each topic in that set, pull ALL summarized conversations
     across history that share that topic (full historical context)
  4. Extract hall_facts + hall_discoveries + hall_advice bullets
     (the high-signal hall types — skips event/preference/tooling)
  5. Send topic group + wiki index.md to claude -p
  6. Model emits JSON actions[]: new_page / update_page / skip
  7. Write each action to staging/<type>/ with distill provenance
     frontmatter (staged_by: wiki-distill, distill_topic,
     distill_source_conversations, compilation_notes)

First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.

Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
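The narrow-today / wide-history rollup can be sketched in a few lines. This is an illustration only, assuming conversations are already parsed into dicts with `date`, `topics`, and `path` keys — the real wiki-distill.py reads these from summarized-conversation frontmatter:

```python
from collections import defaultdict
from datetime import date

def plan_topic_groups(conversations, today=None):
    """Sketch of the narrow-today / wide-history rollup (illustrative)."""
    today = today or date.today()

    # Narrow: the processing scope is only topics seen TODAY.
    todays_topics = set()
    for conv in conversations:
        if conv["date"] == today:
            todays_topics.update(conv["topics"])

    # Wide: for each today-topic, roll up ALL history sharing it.
    groups = defaultdict(list)
    for conv in conversations:
        for topic in todays_topics & set(conv["topics"]):
            groups[topic].append(conv["path"])
    return dict(groups)
```

A conversation from months ago lands in today's rollup the moment its topic resurfaces — which is exactly the self-triggering behavior described above.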

## Orchestration — distill BEFORE harvest

wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):

  1a. wiki-distill.py    — conversations → staging (PRIORITY)
  1b. wiki-harvest.py    — URLs → raw/harvested → staging (supplement)
  2.  wiki-hygiene.py    — decay, archive, repair, checks
  3.  qmd reindex

Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.

## Verified on real wiki

Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works.

## Code additions

- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
  + high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
  state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
  the Phase 1a/1b split

## Docs additions

- README.md: 8th row in extensions table, updated compounding-loop
  diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
  loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
  order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
  verify section test count
- .gitignore: note distill-state.json is committed (sync across
  machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
  tab with flow diagram, hall filter table, narrow-today/wide-
  history explanation, staging provenance example

## Tests

192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
This commit is contained in:
Eric Turner
2026-04-12 22:34:33 -06:00
parent 4c6b7609a1
commit 997aa837de
11 changed files with 1732 additions and 66 deletions

.gitignore vendored

@@ -31,5 +31,6 @@ __pycache__/
# NOTE: the following state files are NOT gitignored — they must sync
# across machines so both installs agree on what's been processed:
# .distill-state.json (conversation distillation: processed convs + topics)
# .harvest-state.json (URL dedup)
# .hygiene-state.json (content hashes, deferred issues)


@@ -114,13 +114,14 @@ to one of those extensions:
| What memex adds | How it works |
|-----------------|--------------|
| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. |
| **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. |
| **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. |
| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
| **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. |
| **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. |
| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. |
The short version: Karpathy shared the idea, milla-jovovich's mempalace
added the structural memory taxonomy, and memex is the automation layer
@@ -147,21 +148,28 @@ memex doesn't cover.
│ summarize-conversations.py --claude (daily)
┌─────────────────────┐
│ conversations/      │ summaries with halls + topics + related:
│   <project>/*.md    │ (status: summarized)
└──────────┬──────────┘
           │
           ├──▶ wiki-distill.py (daily Phase 1a) ──┐
           │      - rollup by today's topics       │
           │      - pull historical conversations  │
           │      - extract fact/discovery/advice  │
           │      - claude -p → new or update      │
           │                                       │
           │ wiki-harvest.py (daily Phase 1b)      │
           ▼                                       │
┌─────────────────────┐                            │
│ raw/harvested/      │ fetched URL content        │
│   *.md              │ (immutable source material)│
└──────────┬──────────┘                            │
           │ claude -p compile step                │
           ▼                                       │
┌──────────────────────────────────────────────────────┐
│ staging/<type>/      pending pages                   │◀─┘
│   *.md               (status: pending, origin: auto) │
└──────────┬───────────────────────────────────────────┘
│ human review (wiki-staging.py --review)
┌─────────────────────┐
@@ -283,6 +291,7 @@ wiki/
├── reports/ ← Hygiene operation logs
├── scripts/ ← The automation pipeline
├── tests/ ← Pytest suite (192 tests)
├── .distill-state.json ← Conversation distill state (committed, synced)
├── .harvest-state.json ← URL dedup state (committed, synced)
├── .hygiene-state.json ← Content hashes, deferred issues (committed, synced)
└── .mine-state.json ← Conversation extraction offsets (gitignored, per-machine)
@@ -333,11 +342,12 @@ Eleven scripts organized in three layers:
- `update-conversation-index.py` — Regenerate conversation index + wake-up context
**Automation layer** (maintains the wiki):
- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, hall extraction, constants
- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop)
- `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
- `wiki-staging.py` — Human review (list/promote/reject/review/sync)
- `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene
**Sync layer**:
- `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling


@@ -77,6 +77,7 @@ Automation + lifecycle management on top of both:
┌─────────────────────────────────┐
│ AUTOMATION LAYER │
│ wiki_lib.py (shared helpers) │
│ wiki-distill.py │ (conversations → staging) ← closes MemPalace loop
│ wiki-harvest.py │ (URL → raw → staging)
│ wiki-staging.py │ (human review)
│ wiki-hygiene.py │ (decay, archive, repair, checks)
@@ -169,10 +170,63 @@ Provides:
All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
### `wiki-distill.py`
**Closes the MemPalace loop.** Reads the *content* of summarized
conversations — not the URLs they cite — and compiles wiki pages from
the high-signal hall entries (`hall_facts`, `hall_discoveries`,
`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL
harvesting.
**Scope filter (deliberately narrow)**:
1. Find all summarized conversations dated TODAY
2. Extract their `topics:` — this is the "topics-of-today" set
3. For each topic in that set, pull ALL summarized conversations across
history that share that topic (full historical context via rollup)
4. Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet
content from each conversation's body
5. Send the topic group (topic + matching conversations + halls) to
`claude -p` with the current `index.md`
6. Model emits a JSON `actions` array with `new_page` / `update_page` /
`skip` verdicts; the script writes each to `staging/<type>/`
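The verdict-handling step might look like this sketch. Field names (`action`, `reason`) and the helper itself are illustrative — the actual prompt contract lives in wiki-distill.py:

```python
import json

# Illustrative: the only verdicts the script should accept from the model.
ALLOWED_ACTIONS = {"new_page", "update_page", "skip"}

def parse_actions(raw: str) -> list[dict]:
    """Validate the model's JSON verdict before writing anything to staging."""
    payload = json.loads(raw)
    actions = payload.get("actions", [])
    for act in actions:
        if act.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {act.get('action')!r}")
    return actions
```

Rejecting unknown verdicts up front keeps a malformed LLM response from silently writing garbage into `staging/<type>/`.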
**First-run bootstrap**: the very first run uses a 7-day lookback
instead of today-only, so the state file gets seeded with a reasonable
starting set. After that, daily runs stay narrow.
**Self-triggering**: dormant topics that resurface in a new
conversation automatically pull in all historical conversations on
that topic via the rollup. No manual intervention needed to
reprocess old knowledge when it becomes relevant again.
**Model routing**: haiku for short topic groups (< 15K chars prompt,
< 20 bullets), sonnet for longer ones.
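The routing rule is simple enough to state as code — a minimal sketch using the thresholds quoted above (function name is illustrative):

```python
def pick_model(prompt_chars: int, bullet_count: int) -> str:
    # Thresholds from the docs: haiku for short topic groups
    # (< 15K chars prompt AND < 20 bullets), sonnet otherwise.
    if prompt_chars < 15_000 and bullet_count < 20:
        return "haiku"
    return "sonnet"
```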
**State** lives in `.distill-state.json` — tracks processed
conversations by content hash and topics-at-distill-time. A
conversation is re-processed if its body changes OR if it gains a new
topic not seen at previous distill.
**Staging output** includes distill-specific frontmatter:
- `staged_by: wiki-distill`
- `distill_topic: <topic>`
- `distill_source_conversations: <comma-separated conversation paths>`
Commands:
- `wiki-distill.py` — today-only rollup (default mode after first run)
- `wiki-distill.py --first-run` — 7-day lookback bootstrap
- `wiki-distill.py --topic TOPIC` — explicit single-topic processing
- `wiki-distill.py --project WING` — only today-topics from this wing
- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes
- `wiki-distill.py --no-compile` — rollup only, skip claude -p step
- `wiki-distill.py --limit N` — stop after N topic groups
### `wiki-harvest.py`
Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages. Runs as Phase 1b in
`wiki-maintain.sh`, after distill — URL content is treated as a
supplement to conversation-driven knowledge, not the primary source.
URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
@@ -254,13 +308,17 @@ full-mode runs can skip unchanged pages. Reports land in
Top-level orchestrator:
```
Phase 1a: wiki-distill.py  (unless --no-distill or --harvest-only / --hygiene-only)
Phase 1b: wiki-harvest.py  (unless --distill-only / --hygiene-only)
Phase 2:  wiki-hygiene.py  (--full for the weekly pass, else quick)
Phase 3:  qmd update && qmd embed (unless --no-reindex or --dry-run)
```
Ordering is deliberate: distill runs before harvest so that
conversation content drives the page shape, and URL harvesting only
supplements what the conversations are already covering. Flags pass
through to child scripts. Error-tolerant: if one phase fails, the
others still run. Logs to `scripts/.maintain.log`.
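The error-tolerant chaining can be illustrated with a Python analogue — the real orchestrator is the shell script; `run_phases`, the `PHASES` table, and the injectable `runner` are invented here for illustration:

```python
import subprocess

# Illustrative phase table mirroring the wiki-maintain.sh order.
PHASES = [
    ("1a distill", ["python3", "scripts/wiki-distill.py"]),
    ("1b harvest", ["python3", "scripts/wiki-harvest.py"]),
    ("2  hygiene", ["python3", "scripts/wiki-hygiene.py"]),
]

def run_phases(phases, runner=subprocess.run):
    """Run every phase even if an earlier one fails; report failures."""
    failed = []
    for name, cmd in phases:
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(name)  # record and keep going
    return failed
```

The key property is that a distill failure never blocks harvest or hygiene — each phase gets its chance to run.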
---
@@ -289,6 +347,7 @@ Three JSON files track per-pipeline state:
| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
@@ -301,13 +360,15 @@ because Claude Code session files live at OS-specific paths.
## Module dependency graph
```
wiki_lib.py ─┬─> wiki-distill.py
             ├─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─> wiki-distill.py (Phase 1a — conversations → staging)
                 ─> wiki-harvest.py (Phase 1b — URLs → staging)
                 ─> wiki-hygiene.py (Phase 2)
                 ─> qmd (external)  (Phase 3)

mine-conversations.sh ─> extract-sessions.py
                      ─> summarize-conversations.py


@@ -43,10 +43,13 @@ repo preserves all of them:
Karpathy's gist is a concept pitch. He was explicit that he was sharing
an "idea file" for others to build on, not publishing a working
implementation. The analysis identified eight places where the core idea
needs an engineering layer to become practical day-to-day. The first
seven emerged from the original Signal & Noise review; the eighth
(conversation distillation) surfaced after building the other layers
and realizing that the conversations themselves were being mined,
summarized, indexed, and scanned for URLs — but the knowledge *inside*
them was never becoming wiki pages.
### 1. Claim freshness and reversibility
@@ -236,6 +239,71 @@ story. If you need any of that, you need a different architecture.
This is for the personal and small-team case where git + Tailscale is
the right amount of rigor.
### 8. Closing the MemPalace loop — conversation distillation
**The gap**: The mining pipeline extracts Claude Code sessions into
transcripts, classifies them by memory type (fact/discovery/preference/
advice/event/tooling), and tags them with topics. The URL harvester
scans them for cited links. Hygiene refreshes `last_verified` on any
wiki page that appears in a conversation's `related:` field. But none
of those steps actually *compile the knowledge inside the conversations
themselves into wiki pages.* A decision made in a session, a root cause
found during debugging, a pattern spotted in review — these stay in the
conversation summaries (searchable but not synthesized) until a human
manually writes them up. That's the last piece of the MemPalace model
that wasn't wired through: **closet content was never becoming the
source for the wiki proper**.
**How memex extends it**:
- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
URL harvesting. The ordering is deliberate: conversation content
should drive the page, and URL harvesting should only supplement
what the conversations are already covering.
- **Narrow today-filter with historical rollup** — daily runs only
look at topics appearing in TODAY's summarized conversations, but
for each such topic the script pulls in ALL historical conversations
sharing that topic. Processing scope stays small; LLM context stays
wide. Old topics that resurface in new sessions automatically
trigger a re-distillation of the full history on that topic.
- **First-run bootstrap** — the very first run uses a 7-day lookback
to seed the state. After that, daily runs stay narrow.
- **High-signal halls only** — distill reads `hall_facts`,
`hall_discoveries`, and `hall_advice` bullets. Skips `hall_events`
(temporal, not knowledge), `hall_preferences` (user working style),
and `hall_tooling` (often low-signal). These are the halls the
MemPalace taxonomy treats as "canonical knowledge" vs "context."
- **claude -p compile step** — each topic group (topic + all matching
conversations + their high-signal halls) is sent to `claude -p`
with the current wiki index. The model decides whether to create a
new page, update an existing one, emit both, or skip (topic not
substantive enough or already well-covered).
- **Staging output with distill provenance** — new/updated pages land
in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
`distill_source_conversations` frontmatter fields. Every page traces
back to the exact conversations it was distilled from.
- **State file `.distill-state.json`** tracks processed conversations
by content hash and topic set, so re-runs only process what actually
changed. A conversation gets re-distilled if its body changes OR if
it gains a new topic not seen at previous distill time.
**Why this matters**: Without distillation, the MemPalace integration
was incomplete — the closet summaries existed, the structural metadata
existed, qmd could search them, but knowledge discovered during work
never escaped the conversation archive. You could find "we had a
debugging session about X last month" but couldn't find "here's the
canonical page on X that captures what we learned." This extension
turns the MemPalace layer from a searchable archive into a proper
**ingest pipeline** for the wiki.
**Residual consideration**: Summarization quality is now load-bearing.
The distill step trusts the summarizer's classification of bullets
into halls. If the summarizer puts a debugging dead-end in
`hall_discoveries`, it may enter the wiki compilation pipeline. The
`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
substantiveness check (it can choose `skip` with a reason) together
catch most noise, and the staging review catches the rest.
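The thin-topic filter is simple to sketch — `substantive_groups` is an illustrative name, assuming a topic group has already been flattened to its high-signal bullets:

```python
MIN_BULLETS_PER_TOPIC = 2  # default per the docs

def substantive_groups(groups: dict[str, list[str]]):
    """Split topic groups into worth-distilling vs too-thin.
    `groups` maps topic -> flattened high-signal bullets (illustrative)."""
    keep, thin = {}, []
    for topic, bullets in groups.items():
        if len(bullets) >= MIN_BULLETS_PER_TOPIC:
            keep[topic] = bullets
        else:
            thin.append(topic)
    return keep, thin
```

This is the first of the three noise gates; the LLM's `skip` verdict and human staging review are the other two.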
---
## The biggest layer — active upkeep
@@ -255,7 +323,7 @@ thing the human has to think about:
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
If you disable all of these, you get the same outcome as every


@@ -281,7 +281,13 @@ python3 scripts/summarize-conversations.py --claude
# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex
# 4. First-run distill bootstrap (7-day lookback, burns claude -p calls)
# Only do this if you have summarized conversations from recent work.
# Skip it if you're starting with a fresh wiki.
python3 scripts/wiki-distill.py --first-run --dry-run # plan
python3 scripts/wiki-distill.py --first-run # actually do it
# 5. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```
@@ -322,7 +328,7 @@ PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1
# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: distill conversations + harvest URLs + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1
# Weekly Sunday at 4am: full hygiene with LLM checks
@@ -424,8 +430,8 @@ cd tests && python3 -m pytest
Expected:
- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all four phases (distill, harvest, hygiene, reindex)
- `pytest` passes all 192 tests in ~1.5 seconds
---


@@ -1209,6 +1209,7 @@
<button class="tab-btn" onclick="switchTab(this, 'tab-signals')">Signal Breakdown</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-mitigations')">Mitigations ★</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-mempalace')" style="color:var(--accent-green);font-weight:600">MemPalace ⬡</button>
<button class="tab-btn" onclick="switchTab(this, 'tab-distill')" style="color:var(--accent-amber);font-weight:600">Distill ⬣</button>
</div>
<!-- TAB: PROS & CONS -->
@@ -2259,6 +2260,255 @@
</div><!-- /tab-mempalace -->
<!-- TAB: DISTILL — the 8th extension, closing the MemPalace loop -->
<div id="tab-distill" class="tab-panel">
<div class="palace-hero" style="background:linear-gradient(135deg, #2a1810 0%, #1a1a10 50%, #0a1510 100%); border-color:#4a3a1a;">
<div class="kicker" style="color:#f0c060">⬣ The 8th Extension — Closing the MemPalace Loop</div>
<h3>Closet summaries <em>become</em> the source for the wiki itself.</h3>
<p>The first seven extensions came out of the Signal &amp; Noise review. The eighth surfaced only after the other layers were built — and it's the one that makes the MemPalace integration a real pipeline into the wiki instead of just a searchable archive beside it. The mining layer was extracting sessions, classifying bullets into halls, tagging topics, and making everything searchable via qmd. But the knowledge <em>inside</em> the conversations was never being compiled into wiki pages. A decision made in a session, a root cause found during debugging, a pattern spotted in review — these stayed in the conversation summaries forever, findable but not synthesized.</p>
<p style="color:#f0c060;font-size:12.5px;font-family:'JetBrains Mono',monospace;letter-spacing:0.05em;">This is what the <code>wiki-distill.py</code> script solves. It's Phase 1a of <code>wiki-maintain.sh</code> and runs before URL harvesting because conversation content should drive the page, not the URLs the conversation cites.</p>
<div class="hero-stats">
<div class="hstat"><span class="hval">Phase 1a</span><span class="hlbl">Runs before harvest</span></div>
<div class="hstat"><span class="hval">today</span><span class="hlbl">Narrow filter — today's topics</span></div>
<div class="hstat"><span class="hval">∀ history</span><span class="hlbl">Rollup all past conversations on each topic</span></div>
<div class="hstat"><span class="hval">3 halls</span><span class="hlbl">fact + discovery + advice</span></div>
<div class="hstat"><span class="hval">haiku/sonnet</span><span class="hlbl">Auto-routed by topic size</span></div>
</div>
</div>
<!-- FLOW DIAGRAM -->
<div class="flow-diagram">
<div class="flow-title">Distill Flow — Conversation Content → Wiki Pages</div>
<div class="flow-label">Narrow: what topics to process today</div>
<div class="flow-row">
<div class="flow-node convo">Today's<br>conversations</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Extract<br>topics[]</div>
<div class="flow-arrow">=</div>
<div class="flow-node wiki">Topics of<br>today set</div>
</div>
<div class="flow-label" style="margin-top:14px">Wide: pull full history for each today-topic</div>
<div class="flow-row">
<div class="flow-node wiki">Each<br>today-topic</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Rollup ALL<br>historical convs</div>
<div class="flow-arrow"></div>
<div class="flow-node palace">Extract<br>fact / discovery / advice</div>
<div class="flow-arrow"></div>
<div class="flow-node llm">claude -p<br>distill prompt</div>
</div>
<div class="flow-label" style="margin-top:14px">Compile: model decides new / update / skip</div>
<div class="flow-row">
<div class="flow-node llm">JSON<br>actions[]</div>
<div class="flow-arrow"></div>
<div class="flow-node wiki">new_page</div>
<div class="flow-arrow">+</div>
<div class="flow-node wiki">update_page<br>(modifies existing)</div>
<div class="flow-arrow"></div>
<div class="flow-node raw">staging/&lt;type&gt;/<br>pending review</div>
</div>
</div>
<!-- SECTION: WHY IT COMPLETES MEMPALACE -->
<div class="section-header">
<h2>Why This Completes MemPalace</h2>
<span class="section-tag" style="border-color:var(--accent-amber);color:var(--accent-amber);background:#fff8e6">Pipeline Closure</span>
</div>
<div class="palace-map">
<div class="palace-cell">
<div class="pc-icon">📦</div>
<div class="pc-term">Drawer — before</div>
<div class="pc-name">Verbatim Archive</div>
<div class="pc-desc">Full transcripts stored, searchable via qmd. No compilation — if you wanted canonical knowledge from them, you had to write it up manually.</div>
<div class="pc-wiki-map">Status: already working</div>
</div>
<div class="palace-cell">
<div class="pc-icon">🗂️</div>
<div class="pc-term">Closet — before</div>
<div class="pc-name">Summary Layer</div>
<div class="pc-desc">Summaries with hall classification (fact / discovery / preference / advice / event / tooling) and topics. Searchable. Terminal: never fed forward into the wiki compiler.</div>
<div class="pc-wiki-map">Status: terminal data, not flowing</div>
</div>
<div class="palace-cell">
<div class="pc-icon">⬣</div>
<div class="pc-term">Distill — NEW</div>
<div class="pc-name">Compiler Bridge</div>
<div class="pc-desc">Reads closet content by topic, rolls up all matching conversations across history, filters to high-signal halls only, sends to claude -p with the current wiki index, emits new or updated wiki pages to staging.</div>
<div class="pc-wiki-map">Status: wiki-distill.py</div>
</div>
<div class="palace-cell">
<div class="pc-icon">📄</div>
<div class="pc-term">Wiki Pages — NEW</div>
<div class="pc-name">Distilled Knowledge</div>
<div class="pc-desc">Pages in staging/&lt;type&gt;/ with full distill provenance: distill_topic, distill_source_conversations, compilation_notes. Promote via staging review. Session knowledge becomes canonical knowledge.</div>
<div class="pc-wiki-map">Status: origin=automated, staged_by=wiki-distill</div>
</div>
</div>
<!-- HALL FILTERING -->
<div class="section-header">
<h2>Which Halls Get Distilled</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">High Signal Only</span>
</div>
<table class="compare-table">
<thead>
<tr>
<th>Hall</th>
<th style="text-align:center">Distilled?</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td class="row-label">hall_facts</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Decisions locked in, choices made, specs agreed. Canonical knowledge.</td>
</tr>
<tr>
<td class="row-label">hall_discoveries</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Root causes, breakthroughs, non-obvious findings. The highest-signal content in any session.</td>
</tr>
<tr>
<td class="row-label">hall_advice</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Recommendations, lessons learned, "next time do X." Worth capturing as patterns.</td>
</tr>
<tr>
<td class="row-label">hall_events</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Deployments, incidents, milestones. Temporal data — belongs in logs, not the wiki.</td>
</tr>
<tr>
<td class="row-label">hall_preferences</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>User working style notes. Belong in personal configs, not the shared wiki.</td>
</tr>
<tr>
<td class="row-label">hall_tooling</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Script/command usage, failures, improvements. Usually low-signal or duplicates what's already in the wiki.</td>
</tr>
</tbody>
</table>
<!-- HOW THE NARROW-TODAY + WIDE-HISTORY FILTER WORKS -->
<div class="section-header">
<h2>The Narrow-Today / Wide-History Filter</h2>
<span class="section-tag" style="border-color:var(--accent-blue);color:var(--accent-blue);background:#e8eef5">Key Design</span>
</div>
<div class="mitigation-intro">
<strong>Processing scope stays narrow; LLM context stays wide.</strong> This is the key property that makes distill cheap enough to run daily and smart enough to produce good pages.
</div>
<div class="mitigation-steps">
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">01</span>
<span class="step-title">Daily filter: only process topics appearing in TODAY's conversations</span>
<span class="step-tool-tag">Scope</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>Each daily run only looks at conversations dated today. It extracts the <code>topics:</code> frontmatter from each — that union becomes the "topics of today" set. If you didn't discuss a topic today, it's not in the processing scope. This keeps the cron job cheap and predictable: if today was a light session day, distill runs fast. If today was a heavy architecture discussion, distill does real work.</p>
<div class="tip-box"><strong>First run only:</strong> The very first run uses a 7-day lookback instead of today-only so the state file gets seeded. After that first bootstrap, daily runs stay narrow.</div>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">02</span>
<span class="step-title">Historical rollup: for each today-topic, pull ALL matching conversations</span>
<span class="step-tool-tag">Context</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>Once the today-topic set is known, for each topic the script walks the entire conversation archive and pulls every summarized conversation that shares that topic. A discussion about <code>blue-green-deploy</code> today might roll up 16 conversations across the last 6 months. The <code>claude -p</code> call sees the full history, not just today's fragment.</p>
<p>This is what makes the distilled pages <em>good</em>. The LLM isn't guessing what a pattern looks like from one session — it's synthesizing across everything you've ever discussed on the topic.</p>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">03</span>
<span class="step-title">Self-triggering: dormant topics wake up when they resurface</span>
<span class="step-tool-tag">Emergent</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>The narrow-today/wide-history combination produces a useful emergent property: <strong>dormant topics wake up automatically.</strong> If you discussed <code>database-migrations</code> three months ago and it never came up again, it's not in the daily scope. But the day you mention it again in any new conversation, that topic enters today's set — and the rollup pulls in all three months of historical discussion. The wiki page gets updated with fresh synthesis across the full history without you having to manually trigger reprocessing.</p>
<div class="tip-box"><strong>What this means in practice:</strong> Old knowledge gets distilled <em>when it becomes relevant again</em>. You don't need to remember to ask "hey, is there a wiki page for X?" — the next time X comes up in a session, distill will check the wiki state and either create or update the page for you.</div>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">04</span>
<span class="step-title">State tracking by content hash + topic set</span>
<span class="step-tool-tag">.distill-state.json</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>A conversation counts as "already distilled" only if its body hash AND its topic set match what was recorded at the last distill. If the body changes (the summarizer re-ran and updated the bullets) or a new topic is added, the conversation is re-processed on the next run. Rejected topics are tracked as well: if the LLM decides once that a topic doesn't deserve a wiki page, it stays rejected until something meaningful changes.</p>
</div>
</div>
<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">05</span>
<span class="step-title">Distill runs BEFORE harvest — conversation content has priority</span>
<span class="step-tool-tag">Phase 1a</span>
<span class="step-arrow"></span>
</div>
<div class="mitigation-step-body">
<p>The orchestrator runs distill as Phase 1a and harvest as Phase 1b. The ordering is deliberate: if a topic is being actively discussed in your sessions, you want the wiki page to reflect <em>your</em> synthesis of what you've learned, not just the external URL cited in passing. URL harvesting then fills in the gaps — it picks up the docs pages, blog posts, and references that your sessions didn't already cover.</p>
<div class="warn-box">Both phases can produce staging pages. If distill creates <code>patterns/docker-hardening.md</code> and harvest creates <code>patterns/docker-hardening.md</code>, the staging-unique-path helper appends a short hash suffix so they don't collide. The reviewer sees both in staging and picks the better one (usually distill, since it has historical context).</div>
</div>
</div>
</div>
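The two-step scope logic from steps 01 and 02 can be simulated on toy data. Dates and topic names here are invented for illustration; conversation records are reduced to (date, topics) pairs:

```python
from datetime import date

# Toy illustration of the narrow-today / wide-history filter.
convs = [
    (date(2026, 1, 10), {"blue-green-deploy", "zoho-api"}),
    (date(2026, 3, 2),  {"blue-green-deploy"}),
    (date(2026, 4, 12), {"blue-green-deploy"}),   # "today"
    (date(2026, 2, 5),  {"database-migrations"}), # dormant: absent from today's set
]
today = date(2026, 4, 12)

# Step 01: narrow scope — topics from today's conversations only.
today_topics = set().union(*(t for d, t in convs if d == today))

# Step 02: wide context — for each today-topic, roll up ALL history, newest first.
rollup = {
    topic: sorted((d for d, t in convs if topic in t), reverse=True)
    for topic in today_topics
}
print(today_topics)  # {'blue-green-deploy'}
print(rollup)        # all three blue-green-deploy dates, newest first
```

Note that `database-migrations` is excluded from processing entirely, but the moment it appears in a future "today" it would pull its full history back in — the self-triggering property of step 03.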
<!-- STAGING FRONTMATTER -->
<div class="section-header">
<h2>Distill Staging Provenance</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">Traceable</span>
</div>
<p style="font-size:13.5px;color:var(--muted);margin-bottom:20px;line-height:1.6;">Every distilled page lands in staging with full provenance in its frontmatter. When you review a page in staging, you can see exactly which conversations it came from and jump directly to those transcripts.</p>
<div class="flow-diagram" style="background:#0d0d0d; border-color:#2a2a2a;">
<div class="flow-title" style="color:#c4b99a">Example: staging/patterns/zoho-crm-integration.md frontmatter</div>
<pre style="font-family:'JetBrains Mono',monospace;font-size:11px;color:#c4b99a;line-height:1.6;margin:0;padding:14px 0;overflow-x:auto;">---
origin: automated
status: pending
staged_date: 2026-04-12
staged_by: wiki-distill
target_path: patterns/zoho-crm-integration.md
distill_topic: zoho-api
distill_source_conversations: conversations/general/2026-04-06-73d15650.md,conversations/mc/2026-03-30-64089d1d.md
compilation_notes: Two separate incidents discovered the same Zoho CRM v2 API limitations; documenting them as a pattern page prevents re-investigation and provides a canonical reference for future Zoho integrations.
title: Zoho CRM Integration
type: pattern
confidence: high
sources: [conversations/general/2026-04-06-73d15650.md, conversations/mc/2026-03-30-64089d1d.md]
related: [database-migrations.md, activity-event-auditing.md]
last_compiled: 2026-04-12
last_verified: 2026-04-12
---</pre>
</div>
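Because the provenance is plain YAML, a reviewer tool can pull the source conversations straight out of a staged page. A hypothetical sketch using naive line splitting (a real tool would use a YAML parser); the frontmatter is abbreviated from the example above:

```python
# Hypothetical reviewer helper: extract provenance fields from a staged page.
text = """---
staged_by: wiki-distill
distill_topic: zoho-api
distill_source_conversations: conversations/general/2026-04-06-73d15650.md,conversations/mc/2026-03-30-64089d1d.md
---
# Zoho CRM Integration
"""
# Naive split: frontmatter is everything between the first two "---" fences.
front = text.split("---\n")[1]
meta = dict(line.split(": ", 1) for line in front.strip().splitlines())
sources = meta["distill_source_conversations"].split(",")
print(sources[0])  # conversations/general/2026-04-06-73d15650.md
```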
<div class="pull-quote" style="border-left-color:var(--accent-amber)">
Without distillation, MemPalace was a searchable archive sitting beside the wiki. With distillation, it's a real ingest pipeline — closet content becomes the source material for the wiki proper, completing the eight-extension story.
<span class="attribution">— memex design rationale, April 2026</span>
</div>
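The step-04 re-processing rule (a conversation is re-processed when its body hash or its topic set changes) condenses to a small predicate. This sketch mirrors the script's `conv_needs_distill`, run against an invented state entry:

```python
import hashlib

def needs_distill(entry, body, topics):
    # Re-process when: never seen, body hash changed, or a new topic appeared.
    if not entry:
        return True
    h = "sha256:" + hashlib.sha256(body.encode()).hexdigest()
    if entry["content_hash"] != h:
        return True
    return bool(set(topics) - set(entry["topics_at_distill"]))

body = "- decided X\n"
entry = {
    "content_hash": "sha256:" + hashlib.sha256(body.encode()).hexdigest(),
    "topics_at_distill": ["zoho-api"],
}

print(needs_distill(entry, body, ["zoho-api"]))               # False: nothing changed
print(needs_distill(entry, body, ["zoho-api", "crm-sync"]))   # True: new topic
print(needs_distill(entry, body + "- and Y\n", ["zoho-api"])) # True: body changed
```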
</div><!-- /tab-distill -->
</div><!-- /page -->
<footer class="page-footer">

scripts/wiki-distill.py (new file, 700 lines)

@@ -0,0 +1,700 @@
#!/usr/bin/env python3
"""Distill wiki pages from summarized conversation content.
This is the "closing the MemPalace loop" step: closet summaries become
the source material for new or updated wiki pages. It's parallel to
wiki-harvest.py (which compiles URL content into wiki pages) but operates
on the *content of the conversations themselves* rather than the URLs
they cite.
Scope filter (deliberately narrow):
1. Find all summarized conversations dated TODAY
2. Extract their `topics:` — this is the "topics-of-today" set
3. For each topic in that set, pull ALL summarized conversations across
history that share that topic (rollup for full context)
4. For each topic group, extract `hall_facts` + `hall_discoveries` +
`hall_advice` bullet content from the body
5. Send the topic group + relevant hall entries to `claude -p` with
the current index.md, ask for new_page / update_page / both / skip
6. Write result(s) to staging/<type>/ with `staged_by: wiki-distill`
First run bootstrap (--first-run or empty state):
- Instead of "topics-of-today", use "topics-from-the-last-7-days"
- This seeds the state file so subsequent runs can stay narrow
Self-triggering property:
- Old dormant topics that resurface in a new conversation will
automatically pull in all historical conversations on that topic
via the rollup — no need to manually trigger reprocessing
State: `.distill-state.json` tracks processed conversations (path +
content hash + topics seen at distill time). A conversation is
re-processed if its content hash changes OR it has a new topic not
seen during the previous distill.
Usage:
python3 scripts/wiki-distill.py # Today-only rollup
python3 scripts/wiki-distill.py --first-run # Last 7 days rollup
python3 scripts/wiki-distill.py --topic TOPIC # Process one topic explicitly
python3 scripts/wiki-distill.py --project mc # Only this wing's today topics
python3 scripts/wiki-distill.py --dry-run # Plan only, no LLM, no writes
python3 scripts/wiki-distill.py --no-compile # Parse/rollup only, skip claude -p
python3 scripts/wiki-distill.py --limit N # Cap at N topic groups processed
"""
from __future__ import annotations
import argparse
import hashlib
import json
import os
import re
import subprocess
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from pathlib import Path
from typing import Any
sys.path.insert(0, str(Path(__file__).parent))
from wiki_lib import ( # noqa: E402
CONVERSATIONS_DIR,
INDEX_FILE,
STAGING_DIR,
WIKI_DIR,
WikiPage,
high_signal_halls,
parse_date,
parse_page,
today,
)
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
DISTILL_STATE_FILE = WIKI_DIR / ".distill-state.json"
CLAUDE_HAIKU_MODEL = "haiku"
CLAUDE_SONNET_MODEL = "sonnet"
# Content size (characters) above which we route to sonnet
SONNET_CONTENT_THRESHOLD = 15_000
CLAUDE_TIMEOUT = 600
FIRST_RUN_LOOKBACK_DAYS = 7
# Minimum number of total hall bullets across the topic group to bother
# asking the LLM. A topic with only one fact/discovery across history is
# usually not enough signal to warrant a wiki page.
MIN_BULLETS_PER_TOPIC = 2
# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------
def load_state() -> dict[str, Any]:
defaults: dict[str, Any] = {
"processed_convs": {},
"processed_topics": {},
"rejected_topics": {},
"last_run": None,
"first_run_complete": False,
}
if DISTILL_STATE_FILE.exists():
try:
with open(DISTILL_STATE_FILE) as f:
state = json.load(f)
for k, v in defaults.items():
state.setdefault(k, v)
return state
except (OSError, json.JSONDecodeError):
pass
return defaults
def save_state(state: dict[str, Any]) -> None:
state["last_run"] = datetime.now(timezone.utc).isoformat()
tmp = DISTILL_STATE_FILE.with_suffix(".json.tmp")
with open(tmp, "w") as f:
json.dump(state, f, indent=2, sort_keys=True)
tmp.replace(DISTILL_STATE_FILE)
def conv_content_hash(conv: WikiPage) -> str:
return "sha256:" + hashlib.sha256(conv.body.encode("utf-8")).hexdigest()
def conv_needs_distill(state: dict[str, Any], conv: WikiPage) -> bool:
"""Return True if this conversation should be re-processed."""
rel = str(conv.path.relative_to(WIKI_DIR))
entry = state.get("processed_convs", {}).get(rel)
if not entry:
return True
if entry.get("content_hash") != conv_content_hash(conv):
return True
# New topics that weren't seen at distill time → re-process
seen_topics = set(entry.get("topics_at_distill", []))
current_topics = set(conv.frontmatter.get("topics") or [])
if current_topics - seen_topics:
return True
return False
def mark_conv_distilled(
state: dict[str, Any],
conv: WikiPage,
output_pages: list[str],
) -> None:
rel = str(conv.path.relative_to(WIKI_DIR))
state.setdefault("processed_convs", {})[rel] = {
"distilled_date": today().isoformat(),
"content_hash": conv_content_hash(conv),
"topics_at_distill": list(conv.frontmatter.get("topics") or []),
"output_pages": output_pages,
}
# ---------------------------------------------------------------------------
# Conversation discovery & topic rollup
# ---------------------------------------------------------------------------
def iter_summarized_conversations(project_filter: str | None = None) -> list[WikiPage]:
"""Walk conversations/ and return all summarized conversation pages."""
if not CONVERSATIONS_DIR.exists():
return []
results: list[WikiPage] = []
for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
if not project_dir.is_dir():
continue
if project_filter and project_dir.name != project_filter:
continue
for md in sorted(project_dir.glob("*.md")):
page = parse_page(md)
if not page:
continue
if page.frontmatter.get("status") != "summarized":
continue
results.append(page)
return results
def extract_topics_from_today(
conversations: list[WikiPage],
target_date: date,
lookback_days: int = 0,
) -> set[str]:
"""Find the set of topics appearing in conversations dated ≥ (target - lookback).
lookback_days=0 → only today
lookback_days=7 → today and the previous 7 days
"""
cutoff = target_date - timedelta(days=lookback_days)
topics: set[str] = set()
for conv in conversations:
d = parse_date(conv.frontmatter.get("date"))
if d and d >= cutoff:
for t in conv.frontmatter.get("topics") or []:
t_clean = str(t).strip()
if t_clean:
topics.add(t_clean)
return topics
def rollup_conversations_by_topic(
topic: str, conversations: list[WikiPage]
) -> list[WikiPage]:
"""Return all conversations (across all time) whose topics: list contains `topic`."""
results: list[WikiPage] = []
for conv in conversations:
conv_topics = conv.frontmatter.get("topics") or []
if topic in conv_topics:
results.append(conv)
# Most recent first so the LLM sees the current state before the backstory
results.sort(
key=lambda c: parse_date(c.frontmatter.get("date")) or date.min,
reverse=True,
)
return results
# ---------------------------------------------------------------------------
# Build the LLM input for a topic group
# ---------------------------------------------------------------------------
@dataclass
class TopicGroup:
topic: str
conversations: list[WikiPage]
halls_by_conv: list[dict[str, list[str]]]
total_bullets: int
def build_topic_group(topic: str, conversations: list[WikiPage]) -> TopicGroup:
halls_by_conv: list[dict[str, list[str]]] = []
total = 0
for conv in conversations:
halls = high_signal_halls(conv)
halls_by_conv.append(halls)
total += sum(len(v) for v in halls.values())
return TopicGroup(
topic=topic,
conversations=conversations,
halls_by_conv=halls_by_conv,
total_bullets=total,
)
def format_topic_group_for_llm(group: TopicGroup) -> str:
"""Render a topic group as a prompt-friendly markdown block."""
lines = [f"# Topic: {group.topic}", ""]
lines.append(
f"Found {len(group.conversations)} summarized conversation(s) tagged "
f"with this topic, containing {group.total_bullets} high-signal bullets "
f"across fact/discovery/advice halls."
)
lines.append("")
for conv, halls in zip(group.conversations, group.halls_by_conv):
rel = str(conv.path.relative_to(WIKI_DIR))
date_str = conv.frontmatter.get("date", "unknown")
title = conv.frontmatter.get("title", conv.path.stem)
project = conv.frontmatter.get("project", "?")
lines.append(f"## {date_str}: {title} ({project})")
lines.append(f"_Source: `{rel}`_")
lines.append("")
for hall_type in ("fact", "discovery", "advice"):
bullets = halls.get(hall_type) or []
if not bullets:
continue
label = {"fact": "Decisions", "discovery": "Discoveries", "advice": "Advice"}[hall_type]
lines.append(f"**{label}:**")
for b in bullets:
lines.append(f"- {b}")
lines.append("")
return "\n".join(lines)
# ---------------------------------------------------------------------------
# Claude compilation
# ---------------------------------------------------------------------------
DISTILL_PROMPT_TEMPLATE = """You are distilling wiki pages from summarized conversation content.
The wiki schema and conventions are defined in CLAUDE.md. The wiki has four
content directories: patterns/ (HOW), decisions/ (WHY), environments/ (WHERE),
concepts/ (WHAT). All pages require YAML frontmatter with title, type,
confidence, origin, sources, related, last_compiled, last_verified.
IMPORTANT: Do NOT include `status`, `staged_*`, `target_path`, `modifies`,
or `compilation_notes` fields in your page frontmatter — the distill script
injects those automatically.
Your task: given a topic group (all conversations across history that share
a topic, with their decisions/discoveries/advice), decide what wiki pages
should be created or updated. Emit a single JSON object with an `actions`
array. Each action is one of:
- "new_page" — create a new wiki page from the distilled knowledge
- "update_page" — update an existing live wiki page (add content, merge)
- "skip" — content isn't substantive enough for a wiki page
OR the topic is already well-covered elsewhere
Schema:
{{
"rationale": "1-2 sentences explaining your decision",
"actions": [
{{
"type": "new_page",
"directory": "patterns" | "decisions" | "environments" | "concepts",
"filename": "kebab-case-name.md",
"content": "full markdown including frontmatter"
}},
{{
"type": "update_page",
"path": "patterns/existing-page.md",
"content": "full updated markdown including frontmatter (merged)"
}},
{{
"type": "skip",
"reason": "why this topic doesn't need a wiki page"
}}
]
}}
You can emit MULTIPLE actions — e.g. a new_page for a concept and an
update_page to an existing pattern that now has new context.
Emit ONLY the JSON object. No prose, no markdown fences.
--- WIKI INDEX (existing pages) ---
{wiki_index}
--- TOPIC GROUP ---
{topic_group}
"""
def call_claude_distill(prompt: str, model: str) -> dict[str, Any] | None:
try:
result = subprocess.run(
["claude", "-p", "--model", model, "--output-format", "text", prompt],
capture_output=True,
text=True,
timeout=CLAUDE_TIMEOUT,
)
except FileNotFoundError:
print(" [warn] claude CLI not found — skipping compilation", file=sys.stderr)
return None
except subprocess.TimeoutExpired:
print(" [warn] claude -p timed out", file=sys.stderr)
return None
if result.returncode != 0:
print(f" [warn] claude -p failed: {result.stderr.strip()[:200]}", file=sys.stderr)
return None
output = result.stdout.strip()
match = re.search(r"\{.*\}", output, re.DOTALL)
if not match:
print(f" [warn] no JSON found in claude output ({len(output)} chars)", file=sys.stderr)
return None
try:
return json.loads(match.group(0))
except json.JSONDecodeError as e:
print(f" [warn] JSON parse failed: {e}", file=sys.stderr)
return None
# ---------------------------------------------------------------------------
# Staging output
# ---------------------------------------------------------------------------
STAGING_INJECT_TEMPLATE = (
"---\n"
"origin: automated\n"
"status: pending\n"
"staged_date: {staged_date}\n"
"staged_by: wiki-distill\n"
"target_path: {target_path}\n"
"{modifies_line}"
"distill_topic: {topic}\n"
"distill_source_conversations: {source_convs}\n"
"compilation_notes: {compilation_notes}\n"
)
def _inject_staging_frontmatter(
content: str,
target_path: str,
topic: str,
source_convs: list[str],
compilation_notes: str,
modifies: str | None,
) -> str:
content = re.sub(
r"^(status|origin|staged_\w+|target_path|modifies|distill_\w+|compilation_notes):.*\n",
"",
content,
flags=re.MULTILINE,
)
modifies_line = f"modifies: {modifies}\n" if modifies else ""
clean_notes = compilation_notes.replace("\n", " ").replace("\r", " ").strip()
sources_yaml = ",".join(source_convs)
injection = STAGING_INJECT_TEMPLATE.format(
staged_date=datetime.now(timezone.utc).date().isoformat(),
target_path=target_path,
modifies_line=modifies_line,
topic=topic,
source_convs=sources_yaml,
compilation_notes=clean_notes or "(distilled from conversation topic group)",
)
if content.startswith("---\n"):
return injection + content[4:]
return injection + "---\n" + content
def _unique_staging_path(base: Path) -> Path:
if not base.exists():
return base
suffix = hashlib.sha256(str(base).encode() + str(time.time()).encode()).hexdigest()[:6]
return base.with_stem(f"{base.stem}-{suffix}")
def apply_distill_actions(
result: dict[str, Any],
topic: str,
source_convs: list[str],
dry_run: bool,
) -> list[Path]:
written: list[Path] = []
actions = result.get("actions") or []
rationale = result.get("rationale", "")
for action in actions:
action_type = action.get("type")
if action_type == "skip":
reason = action.get("reason", "not substantive enough")
print(f" [skip] topic={topic!r}: {reason}")
continue
if action_type == "new_page":
directory = action.get("directory") or "patterns"
filename = action.get("filename")
content = action.get("content")
if not filename or not content:
print(f" [warn] incomplete new_page action for topic={topic!r}", file=sys.stderr)
continue
target_rel = f"{directory}/{filename}"
dest = _unique_staging_path(STAGING_DIR / target_rel)
if dry_run:
print(f" [dry-run] new_page → {dest.relative_to(WIKI_DIR)}")
continue
dest.parent.mkdir(parents=True, exist_ok=True)
injected = _inject_staging_frontmatter(
content,
target_path=target_rel,
topic=topic,
source_convs=source_convs,
compilation_notes=rationale,
modifies=None,
)
dest.write_text(injected)
written.append(dest)
print(f" [new] {dest.relative_to(WIKI_DIR)}")
continue
if action_type == "update_page":
target_rel = action.get("path")
content = action.get("content")
if not target_rel or not content:
print(f" [warn] incomplete update_page action for topic={topic!r}", file=sys.stderr)
continue
dest = _unique_staging_path(STAGING_DIR / target_rel)
if dry_run:
print(f" [dry-run] update_page → {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
continue
dest.parent.mkdir(parents=True, exist_ok=True)
injected = _inject_staging_frontmatter(
content,
target_path=target_rel,
topic=topic,
source_convs=source_convs,
compilation_notes=rationale,
modifies=target_rel,
)
dest.write_text(injected)
written.append(dest)
print(f" [upd] {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
continue
print(f" [warn] unknown action type: {action_type!r}", file=sys.stderr)
return written
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
def pick_model(topic_group: TopicGroup, prompt: str) -> str:
if len(prompt) > SONNET_CONTENT_THRESHOLD or topic_group.total_bullets > 20:
return CLAUDE_SONNET_MODEL
return CLAUDE_HAIKU_MODEL
def process_topic(
topic: str,
conversations: list[WikiPage],
state: dict[str, Any],
dry_run: bool,
compile_enabled: bool,
) -> tuple[str, list[Path]]:
"""Process a single topic group. Returns (status, written_paths)."""
group = build_topic_group(topic, conversations)
if group.total_bullets < MIN_BULLETS_PER_TOPIC:
return f"too-thin (only {group.total_bullets} bullets)", []
if topic in state.get("rejected_topics", {}):
return "previously-rejected", []
wiki_index_text = ""
try:
wiki_index_text = INDEX_FILE.read_text()[:15_000]
except OSError:
pass
topic_group_text = format_topic_group_for_llm(group)
prompt = DISTILL_PROMPT_TEMPLATE.format(
wiki_index=wiki_index_text,
topic_group=topic_group_text,
)
if dry_run:
model = pick_model(group, prompt)
return (
f"would-distill ({len(group.conversations)} convs, "
f"{group.total_bullets} bullets, {model})"
), []
if not compile_enabled:
return (
f"skipped-compile ({len(group.conversations)} convs, "
f"{group.total_bullets} bullets)"
), []
model = pick_model(group, prompt)
print(f" [compile] topic={topic!r} "
f"convs={len(group.conversations)} bullets={group.total_bullets} model={model}")
result = call_claude_distill(prompt, model)
if result is None:
return "compile-failed", []
actions = result.get("actions") or []
if not actions or all(a.get("type") == "skip" for a in actions):
reason = result.get("rationale", "AI chose to skip")
state.setdefault("rejected_topics", {})[topic] = {
"reason": reason,
"rejected_date": today().isoformat(),
}
return "rejected-by-llm", []
source_convs = [str(c.path.relative_to(WIKI_DIR)) for c in group.conversations]
written = apply_distill_actions(result, topic, source_convs, dry_run=False)
for conv in group.conversations:
mark_conv_distilled(state, conv, [str(p.relative_to(WIKI_DIR)) for p in written])
state.setdefault("processed_topics", {})[topic] = {
"distilled_date": today().isoformat(),
"conversations": source_convs,
"output_pages": [str(p.relative_to(WIKI_DIR)) for p in written],
}
return f"distilled ({len(written)} page(s))", written
def run(
*,
first_run: bool,
explicit_topic: str | None,
project_filter: str | None,
dry_run: bool,
compile_enabled: bool,
limit: int,
) -> int:
state = load_state()
if not state.get("first_run_complete"):
first_run = True
all_convs = iter_summarized_conversations(project_filter)
print(f"Scanning {len(all_convs)} summarized conversation(s)...")
# Figure out which topics to process
if explicit_topic:
topics_to_process: set[str] = {explicit_topic}
print(f"Explicit topic mode: {explicit_topic!r}")
else:
lookback = FIRST_RUN_LOOKBACK_DAYS if first_run else 0
topics_to_process = extract_topics_from_today(all_convs, today(), lookback)
if first_run:
print(f"First-run bootstrap: last {FIRST_RUN_LOOKBACK_DAYS} days → "
f"{len(topics_to_process)} topic(s)")
else:
print(f"Today-only mode: {len(topics_to_process)} topic(s) from today's conversations")
if not topics_to_process:
print("No topics to distill.")
if first_run:
state["first_run_complete"] = True
save_state(state)
return 0
# Sort for deterministic ordering
topics_ordered = sorted(topics_to_process)
stats: dict[str, int] = {}
processed = 0
total_written: list[Path] = []
for topic in topics_ordered:
convs = rollup_conversations_by_topic(topic, all_convs)
if not convs:
stats["no-matches"] = stats.get("no-matches", 0) + 1
continue
print(f"\n[{topic}] rollup: {len(convs)} conversation(s)")
status, written = process_topic(
topic, convs, state, dry_run=dry_run, compile_enabled=compile_enabled
)
stats[status.split(" ")[0]] = stats.get(status.split(" ")[0], 0) + 1
print(f" [{status}]")
total_written.extend(written)
if not dry_run:
processed += 1
save_state(state)
if limit and processed >= limit:
print(f"\nLimit reached ({limit}); stopping.")
break
if first_run and not dry_run:
state["first_run_complete"] = True
if not dry_run:
save_state(state)
print("\nSummary:")
for status, count in sorted(stats.items()):
print(f" {status}: {count}")
print(f"\n{len(total_written)} staging page(s) written")
return 0
def main() -> int:
parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
parser.add_argument("--first-run", action="store_true",
help="Bootstrap with last 7 days instead of today-only")
parser.add_argument("--topic", default=None,
help="Process one specific topic explicitly")
parser.add_argument("--project", default=None,
help="Only consider conversations under this wing")
parser.add_argument("--dry-run", action="store_true",
help="Plan only; no LLM calls, no writes")
parser.add_argument("--no-compile", action="store_true",
help="Parse + rollup only; skip claude -p step")
parser.add_argument("--limit", type=int, default=0,
help="Stop after N topic groups processed (0 = unlimited)")
args = parser.parse_args()
return run(
first_run=args.first_run,
explicit_topic=args.topic,
project_filter=args.project,
dry_run=args.dry_run,
compile_enabled=not args.no_compile,
limit=args.limit,
)
if __name__ == "__main__":
sys.exit(main())

wiki-maintain.sh (modified)

@@ -3,19 +3,26 @@ set -euo pipefail
# wiki-maintain.sh — Top-level orchestrator for wiki maintenance.
#
# Chains the three maintenance scripts in the correct order:
# 1. wiki-harvest.py (URL harvesting from summarized conversations)
# 2. wiki-hygiene.py (quick or full hygiene checks)
# 3. qmd update && qmd embed (reindex after changes)
# Chains the maintenance scripts in the correct order:
# 1a. wiki-distill.py (closet summaries → wiki pages via claude -p)
# 1b. wiki-harvest.py (URL content from conversations → wiki pages)
# 2. wiki-hygiene.py (quick or full hygiene checks)
# 3. qmd update && qmd embed (reindex after changes)
#
# Distill runs BEFORE harvest: conversation content takes priority over
# URL content. If a topic is already discussed in the conversations, we
# want the conversation rollup to drive the page, not a cited URL.
#
# Usage:
# wiki-maintain.sh # Harvest + quick hygiene
# wiki-maintain.sh --full # Harvest + full hygiene (LLM-powered)
# wiki-maintain.sh # Distill + harvest + quick hygiene + reindex
# wiki-maintain.sh --full # Everything with full hygiene (LLM)
# wiki-maintain.sh --distill-only # Conversation distillation only
# wiki-maintain.sh --harvest-only # URL harvesting only
# wiki-maintain.sh --hygiene-only # Quick hygiene only
# wiki-maintain.sh --hygiene-only --full # Full hygiene only
# wiki-maintain.sh --dry-run # Show what would run (no writes)
# wiki-maintain.sh --no-compile # Harvest without claude -p compilation step
# wiki-maintain.sh --hygiene-only # Hygiene only
# wiki-maintain.sh --no-distill # Skip distillation phase
# wiki-maintain.sh --distill-first-run # Bootstrap distill with last 7 days
# wiki-maintain.sh --dry-run # Show what would run (no writes, no LLM)
# wiki-maintain.sh --no-compile # Skip claude -p in harvest AND distill
# wiki-maintain.sh --no-reindex # Skip qmd update/embed after
#
# Log file: scripts/.maintain.log (rotated manually)
@@ -32,22 +39,28 @@ LOG_FILE="${SCRIPTS_DIR}/.maintain.log"
# -----------------------------------------------------------------------------
FULL_MODE=false
DISTILL_ONLY=false
HARVEST_ONLY=false
HYGIENE_ONLY=false
NO_DISTILL=false
DISTILL_FIRST_RUN=false
DRY_RUN=false
NO_COMPILE=false
NO_REINDEX=false
while [[ $# -gt 0 ]]; do
case "$1" in
--full) FULL_MODE=true; shift ;;
--harvest-only) HARVEST_ONLY=true; shift ;;
--hygiene-only) HYGIENE_ONLY=true; shift ;;
--dry-run) DRY_RUN=true; shift ;;
--no-compile) NO_COMPILE=true; shift ;;
--no-reindex) NO_REINDEX=true; shift ;;
--full) FULL_MODE=true; shift ;;
--distill-only) DISTILL_ONLY=true; shift ;;
--harvest-only) HARVEST_ONLY=true; shift ;;
--hygiene-only) HYGIENE_ONLY=true; shift ;;
--no-distill) NO_DISTILL=true; shift ;;
--distill-first-run) DISTILL_FIRST_RUN=true; shift ;;
--dry-run) DRY_RUN=true; shift ;;
--no-compile) NO_COMPILE=true; shift ;;
--no-reindex) NO_REINDEX=true; shift ;;
-h|--help)
sed -n '3,20p' "$0" | sed 's/^# \?//'
sed -n '3,28p' "$0" | sed 's/^# \?//'
exit 0
;;
*)
@@ -57,8 +70,13 @@ while [[ $# -gt 0 ]]; do
esac
done
if [[ "${HARVEST_ONLY}" == "true" && "${HYGIENE_ONLY}" == "true" ]]; then
echo "--harvest-only and --hygiene-only are mutually exclusive" >&2
# Mutex check — only one "only" flag at a time
only_count=0
${DISTILL_ONLY} && only_count=$((only_count + 1))
${HARVEST_ONLY} && only_count=$((only_count + 1))
${HYGIENE_ONLY} && only_count=$((only_count + 1))
if [[ $only_count -gt 1 ]]; then
echo "--distill-only, --harvest-only, and --hygiene-only are mutually exclusive" >&2
exit 1
fi
@@ -91,13 +109,36 @@ cd "${WIKI_DIR}"
for req in python3 qmd; do
if ! command -v "${req}" >/dev/null 2>&1; then
if [[ "${req}" == "qmd" && "${NO_REINDEX}" == "true" ]]; then
continue # qmd not required if --no-reindex
continue
fi
echo "Required command not found: ${req}" >&2
exit 1
fi
done
# -----------------------------------------------------------------------------
# Determine which phases to run
# -----------------------------------------------------------------------------
run_distill=true
run_harvest=true
run_hygiene=true
${NO_DISTILL} && run_distill=false
if ${DISTILL_ONLY}; then
run_harvest=false
run_hygiene=false
fi
if ${HARVEST_ONLY}; then
run_distill=false
run_hygiene=false
fi
if ${HYGIENE_ONLY}; then
run_distill=false
run_harvest=false
fi
# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------
@@ -105,18 +146,39 @@ done
START_TS="$(date '+%s')"
section "wiki-maintain.sh starting"
log "mode: $(${FULL_MODE} && echo full || echo quick)"
log "distill: $(${run_distill} && echo enabled || echo skipped)"
log "harvest: $(${run_harvest} && echo enabled || echo skipped)"
log "hygiene: $(${run_hygiene} && echo enabled || echo skipped)"
log "reindex: $(${NO_REINDEX} && echo skipped || echo enabled)"
log "dry-run: ${DRY_RUN}"
log "wiki: ${WIKI_DIR}"
# -----------------------------------------------------------------------------
# Phase 1a: Distill — conversations → wiki pages
# -----------------------------------------------------------------------------
if ${run_distill}; then
section "Phase 1a: Conversation distillation"
distill_args=()
${DRY_RUN} && distill_args+=(--dry-run)
${NO_COMPILE} && distill_args+=(--no-compile)
${DISTILL_FIRST_RUN} && distill_args+=(--first-run)
if python3 "${SCRIPTS_DIR}/wiki-distill.py" "${distill_args[@]}"; then
log "distill completed"
else
log "[error] distill failed (exit $?) — continuing to harvest"
fi
else
section "Phase 1a: Conversation distillation (skipped)"
fi
# -----------------------------------------------------------------------------
# Phase 1b: Harvest — URLs cited in conversations → raw/ → wiki pages
# -----------------------------------------------------------------------------
if ${run_harvest}; then
section "Phase 1b: URL harvesting"
harvest_args=()
${DRY_RUN} && harvest_args+=(--dry-run)
${NO_COMPILE} && harvest_args+=(--no-compile)
@@ -127,14 +189,14 @@ if [[ "${HYGIENE_ONLY}" != "true" ]]; then
log "[error] harvest failed (exit $?) — continuing to hygiene"
fi
else
section "Phase 1b: URL harvesting (skipped)"
fi
# -----------------------------------------------------------------------------
# Phase 2: Hygiene
# -----------------------------------------------------------------------------
if ${run_hygiene}; then
section "Phase 2: Hygiene checks"
hygiene_args=()
if ${FULL_MODE}; then


@@ -209,3 +209,63 @@ def iter_archived_pages() -> list[WikiPage]:
def page_content_hash(page: WikiPage) -> str:
"""Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()
# ---------------------------------------------------------------------------
# Conversation hall parsing
# ---------------------------------------------------------------------------
#
# Summarized conversations have sections in the body like:
# ## Decisions (hall: fact)
# - bullet
# - bullet
# ## Discoveries (hall: discovery)
# - bullet
#
# Hall types used by the summarizer: fact, discovery, preference, advice,
# event, tooling. Only fact/discovery/advice are high-signal enough to
# distill into wiki pages; the others are tracked but not auto-promoted.
HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}
_HALL_SECTION_RE = re.compile(
r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)
def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
"""Extract hall-bucketed bullet content from a summarized conversation body.
Returns a dict like:
{"fact": ["claim one", "claim two"],
"discovery": ["root cause X"],
"advice": ["do Y", "consider Z"], ...}
Empty hall types are omitted. Bullet lines are stripped of leading "- "
and trailing whitespace; only the first line of each bullet is captured
(continuation lines are not included).
"""
result: dict[str, list[str]] = {}
for match in _HALL_SECTION_RE.finditer(page.body):
hall_type = match.group(1).strip().lower()
section_body = match.group(2)
bullets = [
_flatten_bullet(b.group(1))
for b in _BULLET_RE.finditer(section_body)
]
bullets = [b for b in bullets if b]
if bullets:
result.setdefault(hall_type, []).extend(bullets)
return result
def _flatten_bullet(text: str) -> str:
"""Collapse a possibly-multiline bullet into a single clean line."""
return " ".join(text.split()).strip()
def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
"""Return only fact/discovery/advice content from a conversation."""
all_halls = parse_conversation_halls(page)
return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}


@@ -65,7 +65,9 @@ class TestWikiMaintainSh:
"wiki-maintain.sh", "--hygiene-only", "--dry-run", "--no-reindex"
)
assert result.returncode == 0
assert "Phase 1: URL harvesting (skipped)" in result.stdout
# Phase 1a (distill) and Phase 1b (harvest) both skipped in --hygiene-only
assert "Phase 1a: Conversation distillation (skipped)" in result.stdout
assert "Phase 1b: URL harvesting (skipped)" in result.stdout
def test_phase_3_skipped_in_dry_run(
self, run_script, tmp_wiki: Path

tests/test_wiki_distill.py (new file, 446 lines)

@@ -0,0 +1,446 @@
"""Unit + integration tests for scripts/wiki-distill.py.
Mocks claude -p; no real LLM calls during tests.
"""
from __future__ import annotations
import json
from datetime import date, timedelta
from pathlib import Path
from typing import Any
import pytest
from conftest import make_conversation
# ---------------------------------------------------------------------------
# wiki_lib hall parsing helpers
# ---------------------------------------------------------------------------
class TestParseConversationHalls:
def _make_conv_with_halls(self, tmp_wiki: Path, body: str) -> Path:
return make_conversation(
tmp_wiki,
"test",
"2026-04-12-halls.md",
status="summarized",
body=body,
)
def test_extracts_fact_bullets(self, wiki_lib: Any, tmp_wiki: Path) -> None:
body = (
"## Summary\n\nsome summary text.\n\n"
"## Decisions (hall: fact)\n\n"
"- First decision made\n"
"- Second decision\n\n"
"## Other section\n\nunrelated.\n"
)
path = self._make_conv_with_halls(tmp_wiki, body)
page = wiki_lib.parse_page(path)
halls = wiki_lib.parse_conversation_halls(page)
assert "fact" in halls
assert halls["fact"] == ["First decision made", "Second decision"]
def test_extracts_multiple_hall_types(
self, wiki_lib: Any, tmp_wiki: Path
) -> None:
body = (
"## Decisions (hall: fact)\n\n- A\n- B\n\n"
"## Discoveries (hall: discovery)\n\n- root cause X\n\n"
"## Advice (hall: advice)\n\n- try Y\n- consider Z\n"
)
path = self._make_conv_with_halls(tmp_wiki, body)
page = wiki_lib.parse_page(path)
halls = wiki_lib.parse_conversation_halls(page)
assert halls["fact"] == ["A", "B"]
assert halls["discovery"] == ["root cause X"]
assert halls["advice"] == ["try Y", "consider Z"]
def test_ignores_sections_without_hall_marker(
self, wiki_lib: Any, tmp_wiki: Path
) -> None:
body = (
"## Summary\n\n- not a hall bullet\n\n"
"## Decisions (hall: fact)\n\n- real bullet\n"
)
path = self._make_conv_with_halls(tmp_wiki, body)
page = wiki_lib.parse_page(path)
halls = wiki_lib.parse_conversation_halls(page)
assert halls == {"fact": ["real bullet"]}
def test_flattens_multiline_bullets(
self, wiki_lib: Any, tmp_wiki: Path
) -> None:
body = (
"## Decisions (hall: fact)\n\n"
"- A bullet that goes on\n and continues here\n"
"- Second bullet\n"
)
path = self._make_conv_with_halls(tmp_wiki, body)
page = wiki_lib.parse_page(path)
halls = wiki_lib.parse_conversation_halls(page)
# The simple regex captures each "- " line separately; continuation
# lines are not part of the bullet. This matches the current behavior.
assert halls["fact"][0].startswith("A bullet")
assert "Second bullet" in halls["fact"]
def test_empty_body_returns_empty(
self, wiki_lib: Any, tmp_wiki: Path
) -> None:
path = self._make_conv_with_halls(tmp_wiki, "## Summary\n\ntext.\n")
page = wiki_lib.parse_page(path)
assert wiki_lib.parse_conversation_halls(page) == {}
def test_high_signal_halls_filters_out_preference_event_tooling(
self, wiki_lib: Any, tmp_wiki: Path
) -> None:
body = (
"## Decisions (hall: fact)\n- f\n"
"## Preferences (hall: preference)\n- p\n"
"## Events (hall: event)\n- e\n"
"## Tooling (hall: tooling)\n- t\n"
"## Advice (hall: advice)\n- a\n"
)
path = self._make_conv_with_halls(tmp_wiki, body)
page = wiki_lib.parse_page(path)
halls = wiki_lib.high_signal_halls(page)
assert set(halls.keys()) == {"fact", "advice"}
# ---------------------------------------------------------------------------
# wiki-distill.py module fixture
# ---------------------------------------------------------------------------
@pytest.fixture
def wiki_distill(tmp_wiki: Path) -> Any:
from conftest import SCRIPTS_DIR, _load_script_module
_load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")
return _load_script_module("wiki_distill", SCRIPTS_DIR / "wiki-distill.py")
# ---------------------------------------------------------------------------
# Topic rollup logic
# ---------------------------------------------------------------------------
class TestTopicRollup:
def _make_summarized_conv(
self,
tmp_wiki: Path,
project: str,
filename: str,
conv_date: str,
topics: list[str],
fact_bullets: list[str] | None = None,
) -> Path:
fact_section = ""
if fact_bullets:
fact_section = "## Decisions (hall: fact)\n\n" + "\n".join(
f"- {b}" for b in fact_bullets
)
return make_conversation(
tmp_wiki,
project,
filename,
date=conv_date,
status="summarized",
related=[f"topic:{t}" for t in []],
body=f"## Summary\n\ntest.\n\n{fact_section}\n",
)
def test_extract_topics_from_today_only(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
today_date = wiki_distill.today()
yesterday = today_date - timedelta(days=1)
# Today's conversation with topics
_write_conv_with_topics(
tmp_wiki, "test", "today.md",
date_str=today_date.isoformat(), topics=["alpha", "beta"],
)
# Yesterday's conversation — should be excluded at lookback=0
_write_conv_with_topics(
tmp_wiki, "test", "yesterday.md",
date_str=yesterday.isoformat(), topics=["gamma"],
)
all_convs = wiki_distill.iter_summarized_conversations()
topics = wiki_distill.extract_topics_from_today(all_convs, today_date, 0)
assert topics == {"alpha", "beta"}
def test_extract_topics_with_lookback(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
today_date = wiki_distill.today()
day3 = today_date - timedelta(days=3)
day10 = today_date - timedelta(days=10)
_write_conv_with_topics(
tmp_wiki, "test", "today.md",
date_str=today_date.isoformat(), topics=["a"],
)
_write_conv_with_topics(
tmp_wiki, "test", "day3.md",
date_str=day3.isoformat(), topics=["b"],
)
_write_conv_with_topics(
tmp_wiki, "test", "day10.md",
date_str=day10.isoformat(), topics=["c"],
)
all_convs = wiki_distill.iter_summarized_conversations()
topics_7 = wiki_distill.extract_topics_from_today(all_convs, today_date, 7)
assert topics_7 == {"a", "b"} # day10 excluded by 7-day lookback
def test_rollup_by_topic_across_history(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
today_date = wiki_distill.today()
# Three conversations all tagged with "shared-topic", different dates
_write_conv_with_topics(
tmp_wiki, "test", "a.md",
date_str=today_date.isoformat(), topics=["shared-topic"],
)
_write_conv_with_topics(
tmp_wiki, "test", "b.md",
date_str=(today_date - timedelta(days=30)).isoformat(),
topics=["shared-topic", "other"],
)
_write_conv_with_topics(
tmp_wiki, "test", "c.md",
date_str=(today_date - timedelta(days=90)).isoformat(),
topics=["shared-topic"],
)
# One unrelated
_write_conv_with_topics(
tmp_wiki, "test", "d.md",
date_str=today_date.isoformat(), topics=["unrelated"],
)
all_convs = wiki_distill.iter_summarized_conversations()
rollup = wiki_distill.rollup_conversations_by_topic(
"shared-topic", all_convs
)
assert len(rollup) == 3
stems = [c.path.stem for c in rollup]
# Most recent first
assert stems[0] == "a"
def _write_conv_with_topics(
tmp_wiki: Path,
project: str,
filename: str,
*,
date_str: str,
topics: list[str],
) -> Path:
"""Helper — write a summarized conversation with topic frontmatter."""
proj_dir = tmp_wiki / "conversations" / project
proj_dir.mkdir(parents=True, exist_ok=True)
path = proj_dir / filename
topic_yaml = "topics: [" + ", ".join(topics) + "]"
content = (
f"---\n"
f"title: Test Conv\n"
f"type: conversation\n"
f"project: {project}\n"
f"date: {date_str}\n"
f"status: summarized\n"
f"messages: 50\n"
f"{topic_yaml}\n"
f"---\n"
f"## Summary\n\ntest.\n\n"
f"## Decisions (hall: fact)\n\n"
f"- Fact one for these topics\n"
f"- Fact two\n"
)
path.write_text(content)
return path
# ---------------------------------------------------------------------------
# Topic group building
# ---------------------------------------------------------------------------
class TestTopicGroupBuild:
def test_counts_total_bullets(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
_write_conv_with_topics(
tmp_wiki, "test", "one.md",
date_str="2026-04-12", topics=["foo"],
)
all_convs = wiki_distill.iter_summarized_conversations()
rollup = wiki_distill.rollup_conversations_by_topic("foo", all_convs)
group = wiki_distill.build_topic_group("foo", rollup)
assert group.topic == "foo"
assert group.total_bullets == 2 # the helper writes 2 fact bullets
def test_format_for_llm_includes_topic_and_sections(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
_write_conv_with_topics(
tmp_wiki, "test", "one.md",
date_str="2026-04-12", topics=["bar"],
)
all_convs = wiki_distill.iter_summarized_conversations()
rollup = wiki_distill.rollup_conversations_by_topic("bar", all_convs)
group = wiki_distill.build_topic_group("bar", rollup)
text = wiki_distill.format_topic_group_for_llm(group)
assert "# Topic: bar" in text
assert "Fact one" in text
assert "Decisions:" in text
# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------
class TestDistillState:
def test_load_returns_defaults(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
state = wiki_distill.load_state()
assert state["processed_convs"] == {}
assert state["processed_topics"] == {}
assert state["first_run_complete"] is False
def test_save_and_reload(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
state = wiki_distill.load_state()
state["first_run_complete"] = True
state["processed_topics"]["foo"] = {"distilled_date": "2026-04-12"}
wiki_distill.save_state(state)
reloaded = wiki_distill.load_state()
assert reloaded["first_run_complete"] is True
assert "foo" in reloaded["processed_topics"]
def test_conv_needs_distill_first_time(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
path = _write_conv_with_topics(
tmp_wiki, "test", "fresh.md",
date_str="2026-04-12", topics=["x"],
)
conv = wiki_distill.parse_page(path)
state = wiki_distill.load_state()
assert wiki_distill.conv_needs_distill(state, conv) is True
def test_conv_needs_distill_detects_content_change(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
path = _write_conv_with_topics(
tmp_wiki, "test", "mut.md",
date_str="2026-04-12", topics=["x"],
)
conv = wiki_distill.parse_page(path)
state = wiki_distill.load_state()
wiki_distill.mark_conv_distilled(state, conv, ["staging/patterns/x.md"])
assert wiki_distill.conv_needs_distill(state, conv) is False
# Mutate the body
text = path.read_text()
path.write_text(text + "\n- Another bullet\n")
conv2 = wiki_distill.parse_page(path)
assert wiki_distill.conv_needs_distill(state, conv2) is True
def test_conv_needs_distill_detects_new_topic(
self, wiki_distill: Any, tmp_wiki: Path
) -> None:
path = _write_conv_with_topics(
tmp_wiki, "test", "new-topic.md",
date_str="2026-04-12", topics=["original"],
)
conv = wiki_distill.parse_page(path)
state = wiki_distill.load_state()
wiki_distill.mark_conv_distilled(state, conv, [])
assert wiki_distill.conv_needs_distill(state, conv) is False
# Rewrite with a new topic added
_write_conv_with_topics(
tmp_wiki, "test", "new-topic.md",
date_str="2026-04-12", topics=["original", "freshly-added"],
)
conv2 = wiki_distill.parse_page(path)
assert wiki_distill.conv_needs_distill(state, conv2) is True
# ---------------------------------------------------------------------------
# CLI smoke tests (no real LLM calls — uses --dry-run)
# ---------------------------------------------------------------------------
class TestDistillCli:
def test_help_flag(self, run_script) -> None:
result = run_script("wiki-distill.py", "--help")
assert result.returncode == 0
assert "--first-run" in result.stdout
assert "--topic" in result.stdout
assert "--dry-run" in result.stdout
def test_dry_run_empty_wiki(self, run_script, tmp_wiki: Path) -> None:
result = run_script("wiki-distill.py", "--dry-run", "--first-run")
assert result.returncode == 0
def test_dry_run_with_topic_rollup(
self, run_script, tmp_wiki: Path
) -> None:
_write_conv_with_topics(
tmp_wiki, "test", "convA.md",
date_str="2026-04-12", topics=["rollup-test"],
)
_write_conv_with_topics(
tmp_wiki, "test", "convB.md",
date_str="2026-04-11", topics=["rollup-test"],
)
result = run_script(
"wiki-distill.py", "--dry-run", "--first-run",
)
assert result.returncode == 0
# Should mention the rollup topic
assert "rollup-test" in result.stdout
def test_topic_flag_narrow_mode(
self, run_script, tmp_wiki: Path
) -> None:
_write_conv_with_topics(
tmp_wiki, "test", "a.md",
date_str="2026-04-12", topics=["explicit-topic"],
)
result = run_script(
"wiki-distill.py", "--dry-run", "--topic", "explicit-topic",
)
assert result.returncode == 0
assert "Explicit topic mode" in result.stdout
assert "explicit-topic" in result.stdout
def test_too_thin_topic_is_skipped(
self, run_script, tmp_wiki: Path, wiki_distill: Any
) -> None:
# Write a conversation with only ONE hall bullet on this topic
proj_dir = tmp_wiki / "conversations" / "test"
proj_dir.mkdir(parents=True, exist_ok=True)
(proj_dir / "thin.md").write_text(
"---\n"
"title: Thin\n"
"type: conversation\n"
"project: test\n"
"date: 2026-04-12\n"
"status: summarized\n"
"messages: 5\n"
"topics: [thin-topic]\n"
"---\n"
"## Summary\n\n\n"
"## Decisions (hall: fact)\n\n"
"- Single bullet\n"
)
result = run_script(
"wiki-distill.py", "--dry-run", "--topic", "thin-topic",
)
assert result.returncode == 0
assert "too-thin" in result.stdout or "too-thin" in result.stderr