feat(distill): close the MemPalace loop — conversations → wiki pages
Add wiki-distill.py as Phase 1a of the maintenance pipeline. This is
the 8th extension memex adds to Karpathy's pattern and the one that
makes the MemPalace integration a real ingest pipeline instead of
just a searchable archive beside the wiki.
## The gap distill closes
The mining layer was extracting Claude Code sessions, classifying
bullets into halls (fact/discovery/preference/advice/event/tooling),
and tagging topics. The URL harvester scanned conversations for cited
links. Hygiene refreshed last_verified on wiki pages referenced in
related: fields. But none of those steps compiled the knowledge
*inside* the conversations themselves into wiki pages. Decisions,
root causes, and patterns stayed in the summaries forever — findable
via qmd but never synthesized into canonical pages.
## What distill does
Narrow today-filter with historical rollup:
1. Find all summarized conversations dated TODAY
2. Extract their topics: — this is the "topics of today" set
3. For each topic in that set, pull ALL summarized conversations
across history that share that topic (full historical context)
4. Extract hall_facts + hall_discoveries + hall_advice bullets
(the high-signal hall types — skips event/preference/tooling)
5. Send topic group + wiki index.md to claude -p
6. Model emits JSON actions[]: new_page / update_page / skip
7. Write each action to staging/<type>/ with distill provenance
frontmatter (staged_by: wiki-distill, distill_topic,
distill_source_conversations, compilation_notes)
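Steps 5–7 can be sketched in Python. The JSON shape below is an assumption for illustration — only the new_page / update_page / skip verdicts and the staging/<type>/ layout come from this commit; the real script's schema may differ:

```python
import json

# Hypothetical actions[] payload — field names beyond the verdicts are assumed.
raw = """
{
  "actions": [
    {"action": "new_page", "type": "pattern", "title": "zoho-api-patterns",
     "content": "# Zoho API patterns\\n..."},
    {"action": "skip", "title": "claude-code-statusline",
     "reason": "already well-covered by an existing live page"}
  ]
}
"""

def plan_writes(payload: str):
    """Split the model's verdicts into staging writes and skips."""
    writes, skips = [], []
    for act in json.loads(payload)["actions"]:
        if act["action"] in ("new_page", "update_page"):
            # pages land under staging/<type>/ pending human review
            writes.append(f"staging/{act['type']}/{act['title']}.md")
        else:
            skips.append((act["title"], act.get("reason", "")))
    return writes, skips

writes, skips = plan_writes(raw)
print(writes)  # ['staging/pattern/zoho-api-patterns.md']
```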
First-run bootstrap: uses 7-day lookback instead of today-only so
the state file gets seeded reasonably. After that, daily runs stay
narrow.
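The seeded state then gates later runs. A sketch of the re-process rule (distill again when the body changed OR a new topic appeared) — the in-memory layout of the state shown here is an assumption, not the actual .distill-state.json schema:

```python
import hashlib

def needs_distill(state: dict, conv_id: str, body: str, topics: list) -> bool:
    """Re-process if the conversation is new, its body changed,
    or it gained a topic not seen at the previous distill."""
    seen = state.get(conv_id)
    if seen is None:
        return True
    digest = hashlib.sha256(body.encode()).hexdigest()
    return digest != seen["hash"] or not set(topics) <= set(seen["topics"])

# hypothetical state entry keyed by conversation path
state = {
    "proj/2024-06-01.md": {
        "hash": hashlib.sha256(b"session notes").hexdigest(),
        "topics": ["zoho-api"],
    }
}
print(needs_distill(state, "proj/2024-06-01.md", "session notes", ["zoho-api"]))           # False
print(needs_distill(state, "proj/2024-06-01.md", "session notes", ["zoho-api", "oauth"]))  # True (new topic)
print(needs_distill(state, "proj/2024-07-04.md", "fresh conversation", ["qmd"]))           # True (unseen)
```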
Self-triggering: dormant topics that resurface in a new conversation
automatically pull in all historical conversations on that topic via
the rollup. Old knowledge gets distilled when it becomes relevant
again without manual intervention.
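The narrow-today / wide-history rollup in miniature — the `id`, `date`, and `topics` fields here stand in for whatever the real script reads from conversation frontmatter:

```python
def rollup(conversations, today):
    """Map each topic seen today to ALL conversations (any date) sharing it."""
    todays_topics = {
        t for c in conversations if c["date"] == today for t in c["topics"]
    }
    # Wide history per narrow topic — this is also what makes dormant
    # topics self-triggering when they resurface in a new conversation.
    return {
        topic: [c["id"] for c in conversations if topic in c["topics"]]
        for topic in todays_topics
    }

convs = [
    {"id": "2024-01-10-zoho", "date": "2024-01-10", "topics": ["zoho-api"]},
    {"id": "2024-03-02-auth", "date": "2024-03-02", "topics": ["oauth"]},
    {"id": "2024-06-01-zoho2", "date": "2024-06-01", "topics": ["zoho-api"]},
]
groups = rollup(convs, today="2024-06-01")
# "zoho-api" resurfaced today, so BOTH zoho conversations roll up;
# "oauth" did not appear today, so it is not processed at all.
print(groups)  # {'zoho-api': ['2024-01-10-zoho', '2024-06-01-zoho2']}
```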
## Orchestration — distill BEFORE harvest
wiki-maintain.sh now has Phase 1a (distill) + Phase 1b (harvest):
1a. wiki-distill.py — conversations → staging (PRIORITY)
1b. wiki-harvest.py — URLs → raw/harvested → staging (supplement)
2. wiki-hygiene.py — decay, archive, repair, checks
3. qmd reindex
Conversation content drives the page shape; URL harvesting fills
gaps for external references conversations don't cover. New flags:
--distill-only, --no-distill, --distill-first-run.
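One plausible reading of how those flags gate the phases. The flag names are real (from this commit and the architecture doc); the exact gating combinations below are an assumption — the authoritative logic lives in wiki-maintain.sh:

```python
def phases_to_run(args):
    """Illustrative flag -> phase gating for wiki-maintain.sh.

    Assumed semantics: --distill-first-run would pass --first-run through
    to wiki-distill.py; which phases each *-only flag suppresses is a guess.
    """
    run = {"distill": True, "harvest": True, "hygiene": True, "reindex": True}
    if {"--no-distill", "--harvest-only", "--hygiene-only"} & set(args):
        run["distill"] = False
    if {"--distill-only", "--hygiene-only"} & set(args):
        run["harvest"] = False
    if {"--distill-only", "--harvest-only"} & set(args):
        run["hygiene"] = False
    if {"--no-reindex", "--dry-run"} & set(args):
        run["reindex"] = False
    return [phase for phase, enabled in run.items() if enabled]

print(phases_to_run(["--distill-only"]))  # ['distill', 'reindex']
print(phases_to_run(["--hygiene-only"]))  # ['hygiene', 'reindex']
```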
## Verified on real wiki
Tested end-to-end on the production wiki with 611 summarized
conversations across 14 wings. First-run dry-run found 116 topic
groups worth distilling (+ 3 too-thin). Tested single-topic compile
with --topic zoho-api: the LLM rolled up 2 conversations (34
bullets), synthesized a proper pattern page with "What / Why /
Known Limitations" structure, linked it to existing wiki pages,
and landed it in staging with full distill provenance. The LLM
correctly rejected claude-code-statusline (already well-covered
by an existing live page) — so the "skip" path works too.
## Code additions
- scripts/wiki-distill.py (new, ~530 lines)
- scripts/wiki_lib.py: HIGH_SIGNAL_HALLS + parse_conversation_halls
+ high_signal_halls + _flatten_bullet helpers
- scripts/wiki-maintain.sh: Phase 1a distill, new flags
- tests/test_wiki_distill.py (21 new tests — hall parsing, rollup,
state management, CLI smoke tests)
- tests/test_shell_scripts.py: updated phase-name assertion for
the Phase 1a/1b split
## Docs additions
- README.md: 8th row in extensions table, updated compounding-loop
diagram, new wiki-distill.py reference in architecture overview
- docs/DESIGN-RATIONALE.md: new section 8 "Closing the MemPalace
loop" with full mempalace taxonomy mapping
- docs/ARCHITECTURE.md: wiki-distill.py section, updated phase
order, updated state file table, updated dep graph
- docs/SETUP.md: updated cron comment, first-run distill guidance,
verify section test count
- .gitignore: note distill-state.json is committed (sync across
machines), not gitignored
- docs/artifacts/signal-and-noise.html: new "Distill ⬣" top-level
tab with flow diagram, hall filter table, narrow-today/wide-
history explanation, staging provenance example
## Tests
192 tests total (+21 new, +1 regression fix), all green in ~1.5s.
---

 .gitignore | 1

@@ -31,5 +31,6 @@ __pycache__/
 
 # NOTE: the following state files are NOT gitignored — they must sync
 # across machines so both installs agree on what's been processed:
+# .distill-state.json (conversation distillation: processed convs + topics)
 # .harvest-state.json (URL dedup)
 # .hygiene-state.json (content hashes, deferred issues)
 README.md | 48

@@ -114,13 +114,14 @@ to one of those extensions:
 
 | What memex adds | How it works |
 |-----------------|--------------|
+| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. |
 | **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. |
 | **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. |
-| **Traceable sources for every claim** | Every compiled page traces back to an immutable `raw/harvested/*.md` file with a SHA-256 content hash. Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
-| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
+| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. |
+| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. |
 | **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. |
 | **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. |
-| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Harvest and hygiene state files sync across machines so both agree on what's been processed. |
+| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. |
 
 The short version: Karpathy shared the idea, milla-jovovich's mempalace
 added the structural memory taxonomy, and memex is the automation layer
@@ -147,21 +148,28 @@ memex doesn't cover.
            │ summarize-conversations.py --claude (daily)
            ▼
 ┌─────────────────────┐
-│ conversations/      │ summaries with related: wiki links
+│ conversations/      │ summaries with halls + topics + related:
 │ <project>/*.md      │ (status: summarized)
-└──────────┬──────────┘
-           │ wiki-harvest.py (daily)
-           ▼
-┌─────────────────────┐
-│ raw/harvested/      │ fetched URL content
-│ *.md                │ (immutable source material)
-└──────────┬──────────┘
-           │ claude -p compile step
-           ▼
-┌─────────────────────┐
-│ staging/<type>/     │ pending pages
-│ *.md                │ (status: pending, origin: automated)
-└──────────┬──────────┘
+└──────┬───────────┬──┘
+       │           │
+       │           └──▶ wiki-distill.py (daily Phase 1a) ──┐
+       │                - rollup by today's topics         │
+       │                - pull historical conversations    │
+       │                - extract fact/discovery/advice    │
+       │                - claude -p → new or update        │
+       │                                                   │
+       │ wiki-harvest.py (daily Phase 1b)                  │
+       ▼                                                   │
+┌─────────────────────┐                                    │
+│ raw/harvested/      │ fetched URL content                │
+│ *.md                │ (immutable source material)        │
+└──────────┬──────────┘                                    │
+           │ claude -p compile step                        │
+           ▼                                               │
+┌──────────────────────────────────────────────────────┐   │
+│ staging/<type>/     pending pages                    │◀─┘
+│ *.md                (status: pending, origin: auto)  │
+└──────────┬───────────────────────────────────────────┘
            │ human review (wiki-staging.py --review)
           ▼
 ┌─────────────────────┐
@@ -283,6 +291,7 @@ wiki/
 ├── reports/                ← Hygiene operation logs
 ├── scripts/                ← The automation pipeline
 ├── tests/                  ← Pytest suite (171 tests)
+├── .distill-state.json     ← Conversation distill state (committed, synced)
 ├── .harvest-state.json     ← URL dedup state (committed, synced)
 ├── .hygiene-state.json     ← Content hashes, deferred issues (committed, synced)
 └── .mine-state.json        ← Conversation extraction offsets (gitignored, per-machine)
@@ -333,11 +342,12 @@ Eleven scripts organized in three layers:
 - `update-conversation-index.py` — Regenerate conversation index + wake-up context
 
 **Automation layer** (maintains the wiki):
-- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants
+- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, hall extraction, constants
+- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop)
 - `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
 - `wiki-staging.py` — Human review (list/promote/reject/review/sync)
 - `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
-- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene
+- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene
 
 **Sync layer**:
 - `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling
docs/ARCHITECTURE.md

@@ -77,6 +77,7 @@ Automation + lifecycle management on top of both:
 ┌─────────────────────────────────┐
 │  AUTOMATION LAYER               │
 │  wiki_lib.py (shared helpers)   │
+│  wiki-distill.py                │ (conversations → staging) ← closes MemPalace loop
 │  wiki-harvest.py                │ (URL → raw → staging)
 │  wiki-staging.py                │ (human review)
 │  wiki-hygiene.py                │ (decay, archive, repair, checks)
@@ -169,10 +170,63 @@ Provides:
 All paths honor the `WIKI_DIR` environment variable, so tests and
 alternate installs can override the root.
 
+### `wiki-distill.py`
+
+**Closes the MemPalace loop.** Reads the *content* of summarized
+conversations — not the URLs they cite — and compiles wiki pages from
+the high-signal hall entries (`hall_facts`, `hall_discoveries`,
+`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL
+harvesting.
+
+**Scope filter (deliberately narrow)**:
+1. Find all summarized conversations dated TODAY
+2. Extract their `topics:` — this is the "topics-of-today" set
+3. For each topic in that set, pull ALL summarized conversations across
+   history that share that topic (full historical context via rollup)
+4. Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet
+   content from each conversation's body
+5. Send the topic group (topic + matching conversations + halls) to
+   `claude -p` with the current `index.md`
+6. Model emits a JSON `actions` array with `new_page` / `update_page` /
+   `skip` verdicts; the script writes each to `staging/<type>/`
+
+**First-run bootstrap**: the very first run uses a 7-day lookback
+instead of today-only, so the state file gets seeded with a reasonable
+starting set. After that, daily runs stay narrow.
+
+**Self-triggering**: dormant topics that resurface in a new
+conversation automatically pull in all historical conversations on
+that topic via the rollup. No manual intervention needed to
+reprocess old knowledge when it becomes relevant again.
+
+**Model routing**: haiku for short topic groups (< 15K chars prompt,
+< 20 bullets), sonnet for longer ones.
+
+**State** lives in `.distill-state.json` — tracks processed
+conversations by content hash and topics-at-distill-time. A
+conversation is re-processed if its body changes OR if it gains a new
+topic not seen at previous distill.
+
+**Staging output** includes distill-specific frontmatter:
+- `staged_by: wiki-distill`
+- `distill_topic: <topic>`
+- `distill_source_conversations: <comma-separated conversation paths>`
+
+Commands:
+- `wiki-distill.py` — today-only rollup (default mode after first run)
+- `wiki-distill.py --first-run` — 7-day lookback bootstrap
+- `wiki-distill.py --topic TOPIC` — explicit single-topic processing
+- `wiki-distill.py --project WING` — only today-topics from this wing
+- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes
+- `wiki-distill.py --no-compile` — rollup only, skip claude -p step
+- `wiki-distill.py --limit N` — stop after N topic groups
+
 ### `wiki-harvest.py`
 
 Scans summarized conversations for HTTP(S) URLs, classifies them,
-fetches content, and compiles pending wiki pages.
+fetches content, and compiles pending wiki pages. Runs as Phase 1b in
+`wiki-maintain.sh`, after distill — URL content is treated as a
+supplement to conversation-driven knowledge, not the primary source.
 
 URL classification:
 - **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
@@ -254,13 +308,17 @@ full-mode runs can skip unchanged pages. Reports land in
 Top-level orchestrator:
 
 ```
-Phase 1:  wiki-harvest.py  (unless --hygiene-only)
+Phase 1a: wiki-distill.py  (unless --no-distill or --harvest-only / --hygiene-only)
+Phase 1b: wiki-harvest.py  (unless --distill-only / --hygiene-only)
 Phase 2:  wiki-hygiene.py  (--full for the weekly pass, else quick)
 Phase 3:  qmd update && qmd embed  (unless --no-reindex or --dry-run)
 ```
 
-Flags pass through to child scripts. Error-tolerant: if one phase fails,
-the others still run. Logs to `scripts/.maintain.log`.
+Ordering is deliberate: distill runs before harvest so that
+conversation content drives the page shape, and URL harvesting only
+supplements what the conversations are already covering. Flags pass
+through to child scripts. Error-tolerant: if one phase fails, the
+others still run. Logs to `scripts/.maintain.log`.
 
 ---
 
@@ -289,6 +347,7 @@ Three JSON files track per-pipeline state:
 | File | Owner | Synced? | Purpose |
 |------|-------|---------|---------|
 | `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
+| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag |
 | `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
 | `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |
 
@@ -301,13 +360,15 @@ because Claude Code session files live at OS-specific paths.
 ## Module dependency graph
 
 ```
-wiki_lib.py ─┬─> wiki-harvest.py
+wiki_lib.py ─┬─> wiki-distill.py
+             ├─> wiki-harvest.py
              ├─> wiki-staging.py
              └─> wiki-hygiene.py
 
-wiki-maintain.sh ─> wiki-harvest.py
-                 ─> wiki-hygiene.py
-                 ─> qmd (external)
+wiki-maintain.sh ─> wiki-distill.py  (Phase 1a — conversations → staging)
+                 ─> wiki-harvest.py  (Phase 1b — URLs → staging)
+                 ─> wiki-hygiene.py  (Phase 2)
+                 ─> qmd (external)   (Phase 3)
 
 mine-conversations.sh ─> extract-sessions.py
                       ─> summarize-conversations.py
docs/DESIGN-RATIONALE.md

@@ -43,10 +43,13 @@ repo preserves all of them:
 
 Karpathy's gist is a concept pitch. He was explicit that he was sharing
 an "idea file" for others to build on, not publishing a working
-implementation. The analysis identified seven places where the core idea
-needs an engineering layer to become practical day-to-day — five have
-first-class answers in memex, and two remain scoped-out trade-offs that
-the architecture cleanly acknowledges.
+implementation. The analysis identified eight places where the core idea
+needs an engineering layer to become practical day-to-day. The first
+seven emerged from the original Signal & Noise review; the eighth
+(conversation distillation) surfaced after building the other layers
+and realizing that the conversations themselves were being mined,
+summarized, indexed, and scanned for URLs — but the knowledge *inside*
+them was never becoming wiki pages.
 
 ### 1. Claim freshness and reversibility
 
@@ -236,6 +239,71 @@ story. If you need any of that, you need a different architecture.
 This is for the personal and small-team case where git + Tailscale is
 the right amount of rigor.
 
+### 8. Closing the MemPalace loop — conversation distillation
+
+**The gap**: The mining pipeline extracts Claude Code sessions into
+transcripts, classifies them by memory type (fact/discovery/preference/
+advice/event/tooling), and tags them with topics. The URL harvester
+scans them for cited links. Hygiene refreshes `last_verified` on any
+wiki page that appears in a conversation's `related:` field. But none
+of those steps actually *compile the knowledge inside the conversations
+themselves into wiki pages.* A decision made in a session, a root cause
+found during debugging, a pattern spotted in review — these stay in the
+conversation summaries (searchable but not synthesized) until a human
+manually writes them up. That's the last piece of the MemPalace model
+that wasn't wired through: **closet content was never becoming the
+source for the wiki proper**.
+
+**How memex extends it**:
+
+- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before
+  URL harvesting. The ordering is deliberate: conversation content
+  should drive the page, and URL harvesting should only supplement
+  what the conversations are already covering.
+- **Narrow today-filter with historical rollup** — daily runs only
+  look at topics appearing in TODAY's summarized conversations, but
+  for each such topic the script pulls in ALL historical conversations
+  sharing that topic. Processing scope stays small; LLM context stays
+  wide. Old topics that resurface in new sessions automatically
+  trigger a re-distillation of the full history on that topic.
+- **First-run bootstrap** — the very first run uses a 7-day lookback
+  to seed the state. After that, daily runs stay narrow.
+- **High-signal halls only** — distill reads `hall_facts`,
+  `hall_discoveries`, and `hall_advice` bullets. Skips `hall_events`
+  (temporal, not knowledge), `hall_preferences` (user working style),
+  and `hall_tooling` (often low-signal). These are the halls the
+  MemPalace taxonomy treats as "canonical knowledge" vs "context."
+- **claude -p compile step** — each topic group (topic + all matching
+  conversations + their high-signal halls) is sent to `claude -p`
+  with the current wiki index. The model decides whether to create a
+  new page, update an existing one, emit both, or skip (topic not
+  substantive enough or already well-covered).
+- **Staging output with distill provenance** — new/updated pages land
+  in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and
+  `distill_source_conversations` frontmatter fields. Every page traces
+  back to the exact conversations it was distilled from.
+- **State file `.distill-state.json`** tracks processed conversations
+  by content hash and topic set, so re-runs only process what actually
+  changed. A conversation gets re-distilled if its body changes OR if
+  it gains a new topic not seen at previous distill time.
+
+**Why this matters**: Without distillation, the MemPalace integration
+was incomplete — the closet summaries existed, the structural metadata
+existed, qmd could search them, but knowledge discovered during work
+never escaped the conversation archive. You could find "we had a
+debugging session about X last month" but couldn't find "here's the
+canonical page on X that captures what we learned." This extension
+turns the MemPalace layer from a searchable archive into a proper
+**ingest pipeline** for the wiki.
+
+**Residual consideration**: Summarization quality is now load-bearing.
+The distill step trusts the summarizer's classification of bullets
+into halls. If the summarizer puts a debugging dead-end in
+`hall_discoveries`, it may enter the wiki compilation pipeline. The
+`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own
+substantiveness check (it can choose `skip` with a reason) together
+catch most noise, and the staging review catches the rest.
+
 ---
 
 ## The biggest layer — active upkeep
@@ -255,7 +323,7 @@ thing the human has to think about:
 | Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
 | Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
 | Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
-| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
+| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex |
 | Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |
 
 If you disable all of these, you get the same outcome as every
docs/SETUP.md

@@ -281,7 +281,13 @@ python3 scripts/summarize-conversations.py --claude
 # 3. Regenerate conversation index + wake-up context
 python3 scripts/update-conversation-index.py --reindex
 
-# 4. Dry-run the maintenance pipeline
+# 4. First-run distill bootstrap (7-day lookback, burns claude -p calls)
+#    Only do this if you have summarized conversations from recent work.
+#    Skip it if you're starting with a fresh wiki.
+python3 scripts/wiki-distill.py --first-run --dry-run   # plan
+python3 scripts/wiki-distill.py --first-run             # actually do it
+
+# 5. Dry-run the maintenance pipeline
 bash scripts/wiki-maintain.sh --dry-run --no-compile
 ```
 
@@ -322,7 +328,7 @@ PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/
 0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1
 
 # ─── Maintenance ───────────────────────────────────────────────────────────
-# Daily at 3am: harvest + quick hygiene + qmd reindex
+# Daily at 3am: distill conversations + harvest URLs + quick hygiene + qmd reindex
 0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1
 
 # Weekly Sunday at 4am: full hygiene with LLM checks
@@ -424,8 +430,8 @@ cd tests && python3 -m pytest
 
 Expected:
 - `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
-- `wiki-maintain.sh --dry-run` completes all three phases
-- `pytest` passes all 171 tests in ~1.3 seconds
+- `wiki-maintain.sh --dry-run` completes all four phases (distill, harvest, hygiene, reindex)
+- `pytest` passes all 192 tests in ~1.5 seconds
 
 ---
 
docs/artifacts/signal-and-noise.html

@@ -1209,6 +1209,7 @@
       <button class="tab-btn" onclick="switchTab(this, 'tab-signals')">Signal Breakdown</button>
       <button class="tab-btn" onclick="switchTab(this, 'tab-mitigations')">Mitigations ★</button>
       <button class="tab-btn" onclick="switchTab(this, 'tab-mempalace')" style="color:var(--accent-green);font-weight:600">MemPalace ⬡</button>
+      <button class="tab-btn" onclick="switchTab(this, 'tab-distill')" style="color:var(--accent-amber);font-weight:600">Distill ⬣</button>
     </div>
 
 <!-- TAB: PROS & CONS -->
@@ -2259,6 +2260,255 @@
|
||||
|
||||
</div><!-- /tab-mempalace -->
|
||||
|
||||
<!-- TAB: DISTILL — the 8th extension, closing the MemPalace loop -->
|
||||
<div id="tab-distill" class="tab-panel">
|
||||
|
||||
<div class="palace-hero" style="background:linear-gradient(135deg, #2a1810 0%, #1a1a10 50%, #0a1510 100%); border-color:#4a3a1a;">
|
||||
<div class="kicker" style="color:#f0c060">⬣ The 8th Extension — Closing the MemPalace Loop</div>
|
||||
<h3>Closet summaries <em>become</em> the source for the wiki itself.</h3>
|
||||
<p>The first seven extensions came out of the Signal & Noise review. The eighth surfaced only after the other layers were built — and it's the one that makes the MemPalace integration a real pipeline into the wiki instead of just a searchable archive beside it. The mining layer was extracting sessions, classifying bullets into halls, tagging topics, and making everything searchable via qmd. But the knowledge <em>inside</em> the conversations was never being compiled into wiki pages. A decision made in a session, a root cause found during debugging, a pattern spotted in review — these stayed in the conversation summaries forever, findable but not synthesized.</p>
|
||||
<p style="color:#f0c060;font-size:12.5px;font-family:'JetBrains Mono',monospace;letter-spacing:0.05em;">This is what the <code>wiki-distill.py</code> script solves. It's Phase 1a of <code>wiki-maintain.sh</code> and runs before URL harvesting because conversation content should drive the page, not the URLs the conversation cites.</p>
|
||||
<div class="hero-stats">
|
||||
<div class="hstat"><span class="hval">Phase 1a</span><span class="hlbl">Runs before harvest</span></div>
|
||||
<div class="hstat"><span class="hval">today</span><span class="hlbl">Narrow filter — today's topics</span></div>
|
||||
<div class="hstat"><span class="hval">∀ history</span><span class="hlbl">Rollup all past conversations on each topic</span></div>
|
||||
<div class="hstat"><span class="hval">3 halls</span><span class="hlbl">fact + discovery + advice</span></div>
|
||||
<div class="hstat"><span class="hval">haiku/sonnet</span><span class="hlbl">Auto-routed by topic size</span></div>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<!-- FLOW DIAGRAM -->
|
||||
<div class="flow-diagram">
|
||||
<div class="flow-title">Distill Flow — Conversation Content → Wiki Pages</div>
|
||||
<div class="flow-label">Narrow: what topics to process today</div>
|
||||
<div class="flow-row">
|
||||
<div class="flow-node convo">Today's<br>conversations</div>
|
||||
<div class="flow-arrow">→</div>
|
||||
<div class="flow-node palace">Extract<br>topics[]</div>
|
||||
<div class="flow-arrow">=</div>
|
||||
<div class="flow-node wiki">Topics of<br>today set</div>
|
||||
</div>
|
||||
<div class="flow-label" style="margin-top:14px">Wide: pull full history for each today-topic</div>
|
||||
<div class="flow-row">
|
||||
<div class="flow-node wiki">Each<br>today-topic</div>
|
||||
<div class="flow-arrow">→</div>
|
||||
<div class="flow-node palace">Rollup ALL<br>historical convs</div>
|
||||
<div class="flow-arrow">→</div>
|
||||
<div class="flow-node palace">Extract<br>fact / discovery / advice</div>
|
||||
<div class="flow-arrow">→</div>
|
||||
<div class="flow-node llm">claude -p<br>distill prompt</div>
|
||||
</div>
|
||||
<div class="flow-label" style="margin-top:14px">Compile: model decides new / update / skip</div>
|
||||
<div class="flow-row">
|
||||
<div class="flow-node llm">JSON<br>actions[]</div>
<div class="flow-arrow">→</div>
<div class="flow-node wiki">new_page</div>
<div class="flow-arrow">+</div>
<div class="flow-node wiki">update_page<br>(modifies existing)</div>
<div class="flow-arrow">→</div>
<div class="flow-node raw">staging/&lt;type&gt;/<br>pending review</div>
</div>
</div>

<!-- SECTION: WHY IT COMPLETES MEMPALACE -->
<div class="section-header">
<h2>Why This Completes MemPalace</h2>
<span class="section-tag" style="border-color:var(--accent-amber);color:var(--accent-amber);background:#fff8e6">Pipeline Closure</span>
</div>

<div class="palace-map">
<div class="palace-cell">
<div class="pc-icon">📦</div>
<div class="pc-term">Drawer — before</div>
<div class="pc-name">Verbatim Archive</div>
<div class="pc-desc">Full transcripts stored, searchable via qmd. No compilation — if you wanted canonical knowledge from them, you had to write it up manually.</div>
<div class="pc-wiki-map">Status: already working</div>
</div>
<div class="palace-cell">
<div class="pc-icon">🗂️</div>
<div class="pc-term">Closet — before</div>
<div class="pc-name">Summary Layer</div>
<div class="pc-desc">Summaries with hall classification (fact / discovery / preference / advice / event / tooling) and topics. Searchable. Terminal: never fed forward into the wiki compiler.</div>
<div class="pc-wiki-map">Status: terminal data, not flowing</div>
</div>
<div class="palace-cell">
<div class="pc-icon">⬣</div>
<div class="pc-term">Distill — NEW</div>
<div class="pc-name">Compiler Bridge</div>
<div class="pc-desc">Reads closet content by topic, rolls up all matching conversations across history, filters to high-signal halls only, sends to claude -p with the current wiki index, emits new or updated wiki pages to staging.</div>
<div class="pc-wiki-map">Status: wiki-distill.py</div>
</div>
<div class="palace-cell">
<div class="pc-icon">📄</div>
<div class="pc-term">Wiki Pages — NEW</div>
<div class="pc-name">Distilled Knowledge</div>
<div class="pc-desc">Pages in staging/&lt;type&gt;/ with full distill provenance: distill_topic, distill_source_conversations, compilation_notes. Promote via staging review. Session knowledge becomes canonical knowledge.</div>
<div class="pc-wiki-map">Status: origin=automated, staged_by=wiki-distill</div>
</div>
</div>
<!-- HALL FILTERING -->
<div class="section-header">
<h2>Which Halls Get Distilled</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">High Signal Only</span>
</div>

<table class="compare-table">
<thead>
<tr>
<th>Hall</th>
<th style="text-align:center">Distilled?</th>
<th>Why</th>
</tr>
</thead>
<tbody>
<tr>
<td class="row-label">hall_facts</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Decisions locked in, choices made, specs agreed. Canonical knowledge.</td>
</tr>
<tr>
<td class="row-label">hall_discoveries</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Root causes, breakthroughs, non-obvious findings. The highest-signal content in any session.</td>
</tr>
<tr>
<td class="row-label">hall_advice</td>
<td style="text-align:center" class="cell-win">✦ YES</td>
<td>Recommendations, lessons learned, "next time do X." Worth capturing as patterns.</td>
</tr>
<tr>
<td class="row-label">hall_events</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Deployments, incidents, milestones. Temporal data — belongs in logs, not the wiki.</td>
</tr>
<tr>
<td class="row-label">hall_preferences</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>User working-style notes. They belong in personal configs, not the shared wiki.</td>
</tr>
<tr>
<td class="row-label">hall_tooling</td>
<td style="text-align:center" class="cell-mid">no</td>
<td>Script/command usage, failures, improvements. Usually low-signal or duplicates what's already in the wiki.</td>
</tr>
</tbody>
</table>
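The actual hall filter lives in `wiki_lib.high_signal_halls()`, which is imported by the script below but not shown in this diff. A minimal sketch of what such a filter might look like, assuming summary bodies mark halls with headers like `## hall_facts` followed by `- ` bullets (the header format is an assumption, not confirmed by this commit):

```python
import re

# Assumed mapping from hall headers to the short hall-type keys the
# distill script iterates over ("fact" / "discovery" / "advice").
HIGH_SIGNAL = {
    "hall_facts": "fact",
    "hall_discoveries": "discovery",
    "hall_advice": "advice",
}

def high_signal_halls_sketch(body: str) -> dict[str, list[str]]:
    """Collect bullets under high-signal hall headers; drop the rest."""
    halls: dict[str, list[str]] = {}
    current = None
    for line in body.splitlines():
        m = re.match(r"^##\s+(hall_\w+)\s*$", line)
        if m:
            # None for event/preference/tooling → their bullets are skipped
            current = HIGH_SIGNAL.get(m.group(1))
            continue
        if current and line.startswith("- "):
            halls.setdefault(current, []).append(line[2:].strip())
    return halls
```

The returned keys (`fact`, `discovery`, `advice`) match what `format_topic_group_for_llm` in the script below expects.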

<!-- HOW THE NARROW-TODAY + WIDE-HISTORY FILTER WORKS -->
<div class="section-header">
<h2>The Narrow-Today / Wide-History Filter</h2>
<span class="section-tag" style="border-color:var(--accent-blue);color:var(--accent-blue);background:#e8eef5">Key Design</span>
</div>

<div class="mitigation-intro">
<strong>Processing scope stays narrow; LLM context stays wide.</strong> This is the key property that makes distill cheap enough to run daily and smart enough to produce good pages.
</div>

<div class="mitigation-steps">

<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">01</span>
<span class="step-title">Daily filter: only process topics appearing in TODAY's conversations</span>
<span class="step-tool-tag">Scope</span>
<span class="step-arrow">▶</span>
</div>
<div class="mitigation-step-body">
<p>Each daily run looks only at conversations dated today. It extracts the <code>topics:</code> frontmatter from each — that union becomes the "topics of today" set. If you didn't discuss a topic today, it's not in the processing scope. This keeps the cron job cheap and predictable: if today was a light session day, distill runs fast. If today was a heavy architecture discussion, distill does real work.</p>
<div class="tip-box"><strong>First run only:</strong> The very first run uses a 7-day lookback instead of today-only so the state file gets seeded. After that first bootstrap, daily runs stay narrow.</div>
</div>
</div>

<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">02</span>
<span class="step-title">Historical rollup: for each today-topic, pull ALL matching conversations</span>
<span class="step-tool-tag">Context</span>
<span class="step-arrow">▶</span>
</div>
<div class="mitigation-step-body">
<p>Once the today-topic set is known, the script walks the entire conversation archive for each topic and pulls every summarized conversation that shares it. A discussion about <code>blue-green-deploy</code> today might roll up 16 conversations from the last 6 months. The <code>claude -p</code> call sees the full history, not just today's fragment.</p>
<p>This is what makes the distilled pages <em>good</em>. The LLM isn't guessing what a pattern looks like from one session — it's synthesizing across everything you've ever discussed on the topic.</p>
</div>
</div>
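Steps 01 and 02 amount to a two-stage filter over (date, topics) pairs. A toy illustration, independent of the real script (which reads these values from each conversation's YAML frontmatter):

```python
from datetime import date

# Toy conversations: (date, topics). In the real pipeline these come
# from conversation frontmatter on disk.
convs = [
    (date(2026, 4, 12), {"zoho-api", "blue-green-deploy"}),  # today
    (date(2026, 3, 30), {"zoho-api"}),                       # historical
    (date(2026, 1, 8),  {"database-migrations"}),            # dormant
]

TODAY = date(2026, 4, 12)

# Step 01 — narrow scope: union of topics from conversations dated today
today_topics = set().union(*(t for d, t in convs if d == TODAY))

# Step 02 — wide context: for each today-topic, roll up ALL matching
# conversations across history, regardless of date
rollup = {topic: [d for d, t in convs if topic in t] for topic in today_topics}
```

Here `zoho-api` rolls up both the March and April conversations, while `database-migrations` stays out of scope entirely until the day it resurfaces.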

<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">03</span>
<span class="step-title">Self-triggering: dormant topics wake up when they resurface</span>
<span class="step-tool-tag">Emergent</span>
<span class="step-arrow">▶</span>
</div>
<div class="mitigation-step-body">
<p>The narrow-today/wide-history combination produces a useful emergent property: <strong>dormant topics wake up automatically.</strong> If you discussed <code>database-migrations</code> three months ago and it never came up again, it's not in the daily scope. But the day you mention it again in any new conversation, that topic enters today's set — and the rollup pulls in all three months of historical discussion. The wiki page gets updated with fresh synthesis across the full history without you having to manually trigger reprocessing.</p>
<div class="tip-box"><strong>What this means in practice:</strong> Old knowledge gets distilled <em>when it becomes relevant again</em>. You don't need to remember to ask "hey, is there a wiki page for X?" — the next time X comes up in a session, distill will check the wiki state and either create or update the page for you.</div>
</div>
</div>

<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">04</span>
<span class="step-title">State tracking by content hash + topic set</span>
<span class="step-tool-tag">.distill-state.json</span>
<span class="step-arrow">▶</span>
</div>
<div class="mitigation-step-body">
<p>A conversation counts as "already distilled" only if its body hash AND its topic set match what was recorded at the last distill. If the body changes (the summarizer re-ran and updated the bullets) OR a new topic is added, the conversation gets re-processed on the next run. Topics are tracked too, so rejected ones don't get reprocessed forever — if the LLM says "this topic doesn't deserve a wiki page" once, it stays rejected until something meaningful changes.</p>
</div>
</div>

<div class="mitigation-step" onclick="toggleStep(this)">
<div class="mitigation-step-header">
<span class="step-num">05</span>
<span class="step-title">Distill runs BEFORE harvest — conversation content has priority</span>
<span class="step-tool-tag">Phase 1a</span>
<span class="step-arrow">▶</span>
</div>
<div class="mitigation-step-body">
<p>The orchestrator runs distill as Phase 1a and harvest as Phase 1b. This ordering is deliberate: if a topic is being actively discussed in your sessions, you want the wiki page to reflect <em>your</em> synthesis of what you've learned, not just the external URL cited in passing. URL harvesting then fills in gaps — it picks up the docs pages, blog posts, and references that your sessions didn't already cover.</p>
<div class="warn-box">Both phases can produce staging pages. If distill creates <code>patterns/docker-hardening.md</code> and harvest creates <code>patterns/docker-hardening.md</code>, the staging-unique-path helper appends a short hash suffix so they don't collide. The reviewer sees both in staging and picks the better one (usually distill, since it has historical context).</div>
</div>
</div>

</div>
<!-- STAGING FRONTMATTER -->
<div class="section-header">
<h2>Distill Staging Provenance</h2>
<span class="section-tag" style="border-color:var(--accent-green);color:var(--accent-green);background:#eaf5ee">Traceable</span>
</div>

<p style="font-size:13.5px;color:var(--muted);margin-bottom:20px;line-height:1.6;">Every distilled page lands in staging with full provenance in its frontmatter. When you review a page in staging, you can see exactly which conversations it came from and jump directly to those transcripts.</p>

<div class="flow-diagram" style="background:#0d0d0d; border-color:#2a2a2a;">
<div class="flow-title" style="color:#c4b99a">Example: staging/patterns/zoho-crm-integration.md frontmatter</div>
<pre style="font-family:'JetBrains Mono',monospace;font-size:11px;color:#c4b99a;line-height:1.6;margin:0;padding:14px 0;overflow-x:auto;">---
origin: automated
status: pending
staged_date: 2026-04-12
staged_by: wiki-distill
target_path: patterns/zoho-crm-integration.md
distill_topic: zoho-api
distill_source_conversations: conversations/general/2026-04-06-73d15650.md,conversations/mc/2026-03-30-64089d1d.md
compilation_notes: Two separate incidents discovered the same Zoho CRM v2 API limitations; documenting them as a pattern page prevents re-investigation and provides a canonical reference for future Zoho integrations.
title: Zoho CRM Integration
type: pattern
confidence: high
sources: [conversations/general/2026-04-06-73d15650.md, conversations/mc/2026-03-30-64089d1d.md]
related: [database-migrations.md, activity-event-auditing.md]
last_compiled: 2026-04-12
last_verified: 2026-04-12
---</pre>
</div>
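A review tool could pull those provenance fields back out without much machinery. A hypothetical helper (not part of this commit), assuming the flat `key: value` frontmatter shown above; a real tool would use a YAML parser:

```python
# Hypothetical review helper: extract distill provenance from a staged
# page. Assumes simple one-line "key: value" frontmatter between the
# first pair of "---" delimiters, as in the example above.
def read_provenance(text: str) -> dict[str, str]:
    fields: dict[str, str] = {}
    if not text.startswith("---\n"):
        return fields
    frontmatter = text.split("---\n")[1]
    for line in frontmatter.splitlines():
        key, _, value = line.partition(":")
        if key.strip() in ("staged_by", "distill_topic", "distill_source_conversations"):
            fields[key.strip()] = value.strip()
    return fields
```

Splitting `distill_source_conversations` on commas then gives the transcript paths to jump to during staging review.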

<div class="pull-quote" style="border-left-color:var(--accent-amber)">
Without distillation, MemPalace was a searchable archive sitting beside the wiki. With distillation, it's a real ingest pipeline — closet content becomes the source material for the wiki proper, completing the eight-extension story.
<span class="attribution">— memex design rationale, April 2026</span>
</div>

</div><!-- /tab-distill -->

</div><!-- /page -->

<footer class="page-footer">
scripts/wiki-distill.py (new file, 700 lines)
#!/usr/bin/env python3
"""Distill wiki pages from summarized conversation content.

This is the "closing the MemPalace loop" step: closet summaries become
the source material for new or updated wiki pages. It's parallel to
wiki-harvest.py (which compiles URL content into wiki pages) but operates
on the *content of the conversations themselves* rather than the URLs
they cite.

Scope filter (deliberately narrow):

1. Find all summarized conversations dated TODAY
2. Extract their `topics:` — this is the "topics-of-today" set
3. For each topic in that set, pull ALL summarized conversations across
   history that share that topic (rollup for full context)
4. For each topic group, extract `hall_facts` + `hall_discoveries` +
   `hall_advice` bullet content from the body
5. Send the topic group + relevant hall entries to `claude -p` with
   the current index.md, ask for new_page / update_page / both / skip
6. Write result(s) to staging/<type>/ with `staged_by: wiki-distill`

First run bootstrap (--first-run or empty state):

- Instead of "topics-of-today", use "topics-from-the-last-7-days"
- This seeds the state file so subsequent runs can stay narrow

Self-triggering property:

- Old dormant topics that resurface in a new conversation will
  automatically pull in all historical conversations on that topic
  via the rollup — no need to manually trigger reprocessing

State: `.distill-state.json` tracks processed conversations (path +
content hash + topics seen at distill time). A conversation is
re-processed if its content hash changes OR it has a new topic not
seen during the previous distill.

Usage:
    python3 scripts/wiki-distill.py                # Today-only rollup
    python3 scripts/wiki-distill.py --first-run    # Last 7 days rollup
    python3 scripts/wiki-distill.py --topic TOPIC  # Process one topic explicitly
    python3 scripts/wiki-distill.py --project mc   # Only this wing's today topics
    python3 scripts/wiki-distill.py --dry-run      # Plan only, no LLM, no writes
    python3 scripts/wiki-distill.py --no-compile   # Parse/rollup only, skip claude -p
    python3 scripts/wiki-distill.py --limit N      # Cap at N topic groups processed
"""

from __future__ import annotations

import argparse
import hashlib
import json
import os
import re
import subprocess
import sys
import time
from dataclasses import dataclass, field
from datetime import date, datetime, timedelta, timezone
from pathlib import Path
from typing import Any

sys.path.insert(0, str(Path(__file__).parent))
from wiki_lib import (  # noqa: E402
    CONVERSATIONS_DIR,
    INDEX_FILE,
    STAGING_DIR,
    WIKI_DIR,
    WikiPage,
    high_signal_halls,
    parse_date,
    parse_page,
    today,
)

sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

DISTILL_STATE_FILE = WIKI_DIR / ".distill-state.json"

CLAUDE_HAIKU_MODEL = "haiku"
CLAUDE_SONNET_MODEL = "sonnet"
# Content size (characters) above which we route to sonnet
SONNET_CONTENT_THRESHOLD = 15_000
CLAUDE_TIMEOUT = 600

FIRST_RUN_LOOKBACK_DAYS = 7

# Minimum number of total hall bullets across the topic group to bother
# asking the LLM. A topic with only one fact/discovery across history is
# usually not enough signal to warrant a wiki page.
MIN_BULLETS_PER_TOPIC = 2
# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


def load_state() -> dict[str, Any]:
    defaults: dict[str, Any] = {
        "processed_convs": {},
        "processed_topics": {},
        "rejected_topics": {},
        "last_run": None,
        "first_run_complete": False,
    }
    if DISTILL_STATE_FILE.exists():
        try:
            with open(DISTILL_STATE_FILE) as f:
                state = json.load(f)
            for k, v in defaults.items():
                state.setdefault(k, v)
            return state
        except (OSError, json.JSONDecodeError):
            pass
    return defaults


def save_state(state: dict[str, Any]) -> None:
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    tmp = DISTILL_STATE_FILE.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)
    tmp.replace(DISTILL_STATE_FILE)


def conv_content_hash(conv: WikiPage) -> str:
    return "sha256:" + hashlib.sha256(conv.body.encode("utf-8")).hexdigest()


def conv_needs_distill(state: dict[str, Any], conv: WikiPage) -> bool:
    """Return True if this conversation should be re-processed."""
    rel = str(conv.path.relative_to(WIKI_DIR))
    entry = state.get("processed_convs", {}).get(rel)
    if not entry:
        return True
    if entry.get("content_hash") != conv_content_hash(conv):
        return True
    # New topics that weren't seen at distill time → re-process
    seen_topics = set(entry.get("topics_at_distill", []))
    current_topics = set(conv.frontmatter.get("topics") or [])
    if current_topics - seen_topics:
        return True
    return False


def mark_conv_distilled(
    state: dict[str, Any],
    conv: WikiPage,
    output_pages: list[str],
) -> None:
    rel = str(conv.path.relative_to(WIKI_DIR))
    state.setdefault("processed_convs", {})[rel] = {
        "distilled_date": today().isoformat(),
        "content_hash": conv_content_hash(conv),
        "topics_at_distill": list(conv.frontmatter.get("topics") or []),
        "output_pages": output_pages,
    }
# ---------------------------------------------------------------------------
# Conversation discovery & topic rollup
# ---------------------------------------------------------------------------


def iter_summarized_conversations(project_filter: str | None = None) -> list[WikiPage]:
    """Walk conversations/ and return all summarized conversation pages."""
    if not CONVERSATIONS_DIR.exists():
        return []
    results: list[WikiPage] = []
    for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
        if not project_dir.is_dir():
            continue
        if project_filter and project_dir.name != project_filter:
            continue
        for md in sorted(project_dir.glob("*.md")):
            page = parse_page(md)
            if not page:
                continue
            if page.frontmatter.get("status") != "summarized":
                continue
            results.append(page)
    return results


def extract_topics_from_today(
    conversations: list[WikiPage],
    target_date: date,
    lookback_days: int = 0,
) -> set[str]:
    """Find the set of topics appearing in conversations dated ≥ (target - lookback).

    lookback_days=0 → only today
    lookback_days=7 → today and the previous 7 days
    """
    cutoff = target_date - timedelta(days=lookback_days)
    topics: set[str] = set()
    for conv in conversations:
        d = parse_date(conv.frontmatter.get("date"))
        if d and d >= cutoff:
            for t in conv.frontmatter.get("topics") or []:
                t_clean = str(t).strip()
                if t_clean:
                    topics.add(t_clean)
    return topics


def rollup_conversations_by_topic(
    topic: str, conversations: list[WikiPage]
) -> list[WikiPage]:
    """Return all conversations (across all time) whose topics: list contains `topic`."""
    results: list[WikiPage] = []
    for conv in conversations:
        conv_topics = conv.frontmatter.get("topics") or []
        if topic in conv_topics:
            results.append(conv)
    # Most recent first so the LLM sees the current state before the backstory
    results.sort(
        key=lambda c: parse_date(c.frontmatter.get("date")) or date.min,
        reverse=True,
    )
    return results
# ---------------------------------------------------------------------------
# Build the LLM input for a topic group
# ---------------------------------------------------------------------------


@dataclass
class TopicGroup:
    topic: str
    conversations: list[WikiPage]
    halls_by_conv: list[dict[str, list[str]]]
    total_bullets: int


def build_topic_group(topic: str, conversations: list[WikiPage]) -> TopicGroup:
    halls_by_conv: list[dict[str, list[str]]] = []
    total = 0
    for conv in conversations:
        halls = high_signal_halls(conv)
        halls_by_conv.append(halls)
        total += sum(len(v) for v in halls.values())
    return TopicGroup(
        topic=topic,
        conversations=conversations,
        halls_by_conv=halls_by_conv,
        total_bullets=total,
    )


def format_topic_group_for_llm(group: TopicGroup) -> str:
    """Render a topic group as a prompt-friendly markdown block."""
    lines = [f"# Topic: {group.topic}", ""]
    lines.append(
        f"Found {len(group.conversations)} summarized conversation(s) tagged "
        f"with this topic, containing {group.total_bullets} high-signal bullets "
        f"across fact/discovery/advice halls."
    )
    lines.append("")
    for conv, halls in zip(group.conversations, group.halls_by_conv):
        rel = str(conv.path.relative_to(WIKI_DIR))
        date_str = conv.frontmatter.get("date", "unknown")
        title = conv.frontmatter.get("title", conv.path.stem)
        project = conv.frontmatter.get("project", "?")
        lines.append(f"## {date_str} — {title} ({project})")
        lines.append(f"_Source: `{rel}`_")
        lines.append("")
        for hall_type in ("fact", "discovery", "advice"):
            bullets = halls.get(hall_type) or []
            if not bullets:
                continue
            label = {"fact": "Decisions", "discovery": "Discoveries", "advice": "Advice"}[hall_type]
            lines.append(f"**{label}:**")
            for b in bullets:
                lines.append(f"- {b}")
            lines.append("")
    return "\n".join(lines)
# ---------------------------------------------------------------------------
# Claude compilation
# ---------------------------------------------------------------------------


DISTILL_PROMPT_TEMPLATE = """You are distilling wiki pages from summarized conversation content.

The wiki schema and conventions are defined in CLAUDE.md. The wiki has four
content directories: patterns/ (HOW), decisions/ (WHY), environments/ (WHERE),
concepts/ (WHAT). All pages require YAML frontmatter with title, type,
confidence, origin, sources, related, last_compiled, last_verified.

IMPORTANT: Do NOT include `status`, `staged_*`, `target_path`, `modifies`,
or `compilation_notes` fields in your page frontmatter — the distill script
injects those automatically.

Your task: given a topic group (all conversations across history that share
a topic, with their decisions/discoveries/advice), decide what wiki pages
should be created or updated. Emit a single JSON object with an `actions`
array. Each action is one of:

- "new_page" — create a new wiki page from the distilled knowledge
- "update_page" — update an existing live wiki page (add content, merge)
- "skip" — content isn't substantive enough for a wiki page
  OR the topic is already well-covered elsewhere

Schema:

{{
  "rationale": "1-2 sentences explaining your decision",
  "actions": [
    {{
      "type": "new_page",
      "directory": "patterns" | "decisions" | "environments" | "concepts",
      "filename": "kebab-case-name.md",
      "content": "full markdown including frontmatter"
    }},
    {{
      "type": "update_page",
      "path": "patterns/existing-page.md",
      "content": "full updated markdown including frontmatter (merged)"
    }},
    {{
      "type": "skip",
      "reason": "why this topic doesn't need a wiki page"
    }}
  ]
}}

You can emit MULTIPLE actions — e.g. a new_page for a concept and an
update_page to an existing pattern that now has new context.

Emit ONLY the JSON object. No prose, no markdown fences.

--- WIKI INDEX (existing pages) ---

{wiki_index}

--- TOPIC GROUP ---

{topic_group}
"""


def call_claude_distill(prompt: str, model: str) -> dict[str, Any] | None:
    try:
        result = subprocess.run(
            ["claude", "-p", "--model", model, "--output-format", "text", prompt],
            capture_output=True,
            text=True,
            timeout=CLAUDE_TIMEOUT,
        )
    except FileNotFoundError:
        print(" [warn] claude CLI not found — skipping compilation", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print(" [warn] claude -p timed out", file=sys.stderr)
        return None
    if result.returncode != 0:
        print(f" [warn] claude -p failed: {result.stderr.strip()[:200]}", file=sys.stderr)
        return None

    output = result.stdout.strip()
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        print(f" [warn] no JSON found in claude output ({len(output)} chars)", file=sys.stderr)
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        print(f" [warn] JSON parse failed: {e}", file=sys.stderr)
        return None
# ---------------------------------------------------------------------------
# Staging output
# ---------------------------------------------------------------------------


STAGING_INJECT_TEMPLATE = (
    "---\n"
    "origin: automated\n"
    "status: pending\n"
    "staged_date: {staged_date}\n"
    "staged_by: wiki-distill\n"
    "target_path: {target_path}\n"
    "{modifies_line}"
    "distill_topic: {topic}\n"
    "distill_source_conversations: {source_convs}\n"
    "compilation_notes: {compilation_notes}\n"
)


def _inject_staging_frontmatter(
    content: str,
    target_path: str,
    topic: str,
    source_convs: list[str],
    compilation_notes: str,
    modifies: str | None,
) -> str:
    # Strip any staging/provenance keys the model emitted despite instructions
    content = re.sub(
        r"^(status|origin|staged_\w+|target_path|modifies|distill_\w+|compilation_notes):.*\n",
        "",
        content,
        flags=re.MULTILINE,
    )

    modifies_line = f"modifies: {modifies}\n" if modifies else ""
    clean_notes = compilation_notes.replace("\n", " ").replace("\r", " ").strip()
    sources_yaml = ",".join(source_convs)
    injection = STAGING_INJECT_TEMPLATE.format(
        staged_date=datetime.now(timezone.utc).date().isoformat(),
        target_path=target_path,
        modifies_line=modifies_line,
        topic=topic,
        source_convs=sources_yaml,
        compilation_notes=clean_notes or "(distilled from conversation topic group)",
    )

    if content.startswith("---\n"):
        return injection + content[4:]
    return injection + "---\n" + content


def _unique_staging_path(base: Path) -> Path:
    if not base.exists():
        return base
    suffix = hashlib.sha256(str(base).encode() + str(time.time()).encode()).hexdigest()[:6]
    return base.with_stem(f"{base.stem}-{suffix}")


def apply_distill_actions(
    result: dict[str, Any],
    topic: str,
    source_convs: list[str],
    dry_run: bool,
) -> list[Path]:
    written: list[Path] = []
    actions = result.get("actions") or []
    rationale = result.get("rationale", "")

    for action in actions:
        action_type = action.get("type")
        if action_type == "skip":
            reason = action.get("reason", "not substantive enough")
            print(f" [skip] topic={topic!r}: {reason}")
            continue

        if action_type == "new_page":
            directory = action.get("directory") or "patterns"
            filename = action.get("filename")
            content = action.get("content")
            if not filename or not content:
                print(f" [warn] incomplete new_page action for topic={topic!r}", file=sys.stderr)
                continue
            target_rel = f"{directory}/{filename}"
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] new_page → {dest.relative_to(WIKI_DIR)}")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=None,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [new] {dest.relative_to(WIKI_DIR)}")
            continue

        if action_type == "update_page":
            target_rel = action.get("path")
            content = action.get("content")
            if not target_rel or not content:
                print(f" [warn] incomplete update_page action for topic={topic!r}", file=sys.stderr)
                continue
            dest = _unique_staging_path(STAGING_DIR / target_rel)
            if dry_run:
                print(f" [dry-run] update_page → {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            injected = _inject_staging_frontmatter(
                content,
                target_path=target_rel,
                topic=topic,
                source_convs=source_convs,
                compilation_notes=rationale,
                modifies=target_rel,
            )
            dest.write_text(injected)
            written.append(dest)
            print(f" [upd] {dest.relative_to(WIKI_DIR)} (modifies {target_rel})")
            continue

        print(f" [warn] unknown action type: {action_type!r}", file=sys.stderr)

    return written
# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------


def pick_model(topic_group: TopicGroup, prompt: str) -> str:
    if len(prompt) > SONNET_CONTENT_THRESHOLD or topic_group.total_bullets > 20:
        return CLAUDE_SONNET_MODEL
    return CLAUDE_HAIKU_MODEL


def process_topic(
    topic: str,
    conversations: list[WikiPage],
    state: dict[str, Any],
    dry_run: bool,
    compile_enabled: bool,
) -> tuple[str, list[Path]]:
    """Process a single topic group. Returns (status, written_paths)."""

    group = build_topic_group(topic, conversations)

    if group.total_bullets < MIN_BULLETS_PER_TOPIC:
        return f"too-thin (only {group.total_bullets} bullets)", []

    if topic in state.get("rejected_topics", {}):
        return "previously-rejected", []

    wiki_index_text = ""
    try:
        wiki_index_text = INDEX_FILE.read_text()[:15_000]
    except OSError:
        pass

    topic_group_text = format_topic_group_for_llm(group)
    prompt = DISTILL_PROMPT_TEMPLATE.format(
        wiki_index=wiki_index_text,
        topic_group=topic_group_text,
    )

    if dry_run:
        model = pick_model(group, prompt)
        return (
            f"would-distill ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets, {model})"
        ), []

    if not compile_enabled:
        return (
            f"skipped-compile ({len(group.conversations)} convs, "
            f"{group.total_bullets} bullets)"
        ), []

    model = pick_model(group, prompt)
    print(f" [compile] topic={topic!r} "
          f"convs={len(group.conversations)} bullets={group.total_bullets} model={model}")

    result = call_claude_distill(prompt, model)
    if result is None:
        return "compile-failed", []

    actions = result.get("actions") or []
    if not actions or all(a.get("type") == "skip" for a in actions):
        reason = result.get("rationale", "AI chose to skip")
        state.setdefault("rejected_topics", {})[topic] = {
            "reason": reason,
            "rejected_date": today().isoformat(),
        }
        return "rejected-by-llm", []

    source_convs = [str(c.path.relative_to(WIKI_DIR)) for c in group.conversations]
    written = apply_distill_actions(result, topic, source_convs, dry_run=False)

    for conv in group.conversations:
        mark_conv_distilled(state, conv, [str(p.relative_to(WIKI_DIR)) for p in written])

    state.setdefault("processed_topics", {})[topic] = {
        "distilled_date": today().isoformat(),
        "conversations": source_convs,
        "output_pages": [str(p.relative_to(WIKI_DIR)) for p in written],
    }

    return f"distilled ({len(written)} page(s))", written


def run(
    *,
    first_run: bool,
    explicit_topic: str | None,
    project_filter: str | None,
    dry_run: bool,
    compile_enabled: bool,
    limit: int,
) -> int:
    state = load_state()
    if not state.get("first_run_complete"):
        first_run = True

    all_convs = iter_summarized_conversations(project_filter)
    print(f"Scanning {len(all_convs)} summarized conversation(s)...")

    # Figure out which topics to process
    if explicit_topic:
        topics_to_process: set[str] = {explicit_topic}
        print(f"Explicit topic mode: {explicit_topic!r}")
    else:
        lookback = FIRST_RUN_LOOKBACK_DAYS if first_run else 0
        topics_to_process = extract_topics_from_today(all_convs, today(), lookback)
        if first_run:
            print(f"First-run bootstrap: last {FIRST_RUN_LOOKBACK_DAYS} days → "
                  f"{len(topics_to_process)} topic(s)")
        else:
            print(f"Today-only mode: {len(topics_to_process)} topic(s) from today's conversations")
|
||||
|
||||
if not topics_to_process:
|
||||
print("No topics to distill.")
|
||||
if first_run:
|
||||
state["first_run_complete"] = True
|
||||
save_state(state)
|
||||
return 0
|
||||
|
||||
# Sort for deterministic ordering
|
||||
topics_ordered = sorted(topics_to_process)
|
||||
|
||||
stats: dict[str, int] = {}
|
||||
processed = 0
|
||||
total_written: list[Path] = []
|
||||
|
||||
for topic in topics_ordered:
|
||||
convs = rollup_conversations_by_topic(topic, all_convs)
|
||||
if not convs:
|
||||
stats["no-matches"] = stats.get("no-matches", 0) + 1
|
||||
continue
|
||||
|
||||
print(f"\n[{topic}] rollup: {len(convs)} conversation(s)")
|
||||
status, written = process_topic(
|
||||
topic, convs, state, dry_run=dry_run, compile_enabled=compile_enabled
|
||||
)
|
||||
stats[status.split(" ")[0]] = stats.get(status.split(" ")[0], 0) + 1
|
||||
print(f" [{status}]")
|
||||
|
||||
total_written.extend(written)
|
||||
if not dry_run:
|
||||
processed += 1
|
||||
save_state(state)
|
||||
|
||||
if limit and processed >= limit:
|
||||
print(f"\nLimit reached ({limit}); stopping.")
|
||||
break
|
||||
|
||||
if first_run and not dry_run:
|
||||
state["first_run_complete"] = True
|
||||
if not dry_run:
|
||||
save_state(state)
|
||||
|
||||
print("\nSummary:")
|
||||
for status, count in sorted(stats.items()):
|
||||
print(f" {status}: {count}")
|
||||
print(f"\n{len(total_written)} staging page(s) written")
|
||||
return 0
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
|
||||
parser.add_argument("--first-run", action="store_true",
|
||||
help="Bootstrap with last 7 days instead of today-only")
|
||||
parser.add_argument("--topic", default=None,
|
||||
help="Process one specific topic explicitly")
|
||||
parser.add_argument("--project", default=None,
|
||||
help="Only consider conversations under this wing")
|
||||
parser.add_argument("--dry-run", action="store_true",
|
||||
help="Plan only; no LLM calls, no writes")
|
||||
parser.add_argument("--no-compile", action="store_true",
|
||||
help="Parse + rollup only; skip claude -p step")
|
||||
parser.add_argument("--limit", type=int, default=0,
|
||||
help="Stop after N topic groups processed (0 = unlimited)")
|
||||
args = parser.parse_args()
|
||||
|
||||
return run(
|
||||
first_run=args.first_run,
|
||||
explicit_topic=args.topic,
|
||||
project_filter=args.project,
|
||||
dry_run=args.dry_run,
|
||||
compile_enabled=not args.no_compile,
|
||||
limit=args.limit,
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
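For reference, a sketch of the JSON contract `process_topic` consumes from `call_claude_distill`: only the top-level `actions` list, each action's `type` (new_page / update_page / skip), and the top-level `rationale` are visible in this chunk, so any other per-action fields (page path, body, etc.) are deliberately left out as unknowns:

```python
import json

# Example payload in the shape process_topic() checks: an "actions" list
# whose entries carry a "type", plus an optional "rationale" used when
# every action is a skip. (Illustrative payload, not produced by the model.)
payload = json.loads("""
{
  "rationale": "topic already well covered",
  "actions": [
    {"type": "skip"}
  ]
}
""")

# Same guard as in process_topic(): an empty list or all-skip actions
# sends the topic to state["rejected_topics"] instead of the stager.
actions = payload.get("actions") or []
all_skipped = not actions or all(a.get("type") == "skip" for a in actions)
print(all_skipped)  # → True: this topic would be recorded as rejected
```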
@@ -3,19 +3,26 @@ set -euo pipefail
 
 # wiki-maintain.sh — Top-level orchestrator for wiki maintenance.
 #
-# Chains the three maintenance scripts in the correct order:
-#   1. wiki-harvest.py  (URL harvesting from summarized conversations)
-#   2. wiki-hygiene.py  (quick or full hygiene checks)
-#   3. qmd update && qmd embed (reindex after changes)
+# Chains the maintenance scripts in the correct order:
+#   1a. wiki-distill.py  (closet summaries → wiki pages via claude -p)
+#   1b. wiki-harvest.py  (URL content from conversations → wiki pages)
+#   2.  wiki-hygiene.py  (quick or full hygiene checks)
+#   3.  qmd update && qmd embed (reindex after changes)
 #
+# Distill runs BEFORE harvest: conversation content takes priority over
+# URL content. If a topic is already discussed in the conversations, we
+# want the conversation rollup to drive the page, not a cited URL.
+#
 # Usage:
-#   wiki-maintain.sh                      # Harvest + quick hygiene
-#   wiki-maintain.sh --full               # Harvest + full hygiene (LLM-powered)
+#   wiki-maintain.sh                      # Distill + harvest + quick hygiene + reindex
+#   wiki-maintain.sh --full               # Everything with full hygiene (LLM)
+#   wiki-maintain.sh --distill-only       # Conversation distillation only
 #   wiki-maintain.sh --harvest-only       # URL harvesting only
-#   wiki-maintain.sh --hygiene-only        # Quick hygiene only
-#   wiki-maintain.sh --hygiene-only --full # Full hygiene only
-#   wiki-maintain.sh --dry-run            # Show what would run (no writes)
-#   wiki-maintain.sh --no-compile         # Harvest without claude -p compilation step
+#   wiki-maintain.sh --hygiene-only       # Hygiene only
+#   wiki-maintain.sh --no-distill         # Skip distillation phase
+#   wiki-maintain.sh --distill-first-run  # Bootstrap distill with last 7 days
+#   wiki-maintain.sh --dry-run            # Show what would run (no writes, no LLM)
+#   wiki-maintain.sh --no-compile         # Skip claude -p in harvest AND distill
 #   wiki-maintain.sh --no-reindex         # Skip qmd update/embed after
 #
 # Log file: scripts/.maintain.log (rotated manually)
@@ -32,22 +39,28 @@ LOG_FILE="${SCRIPTS_DIR}/.maintain.log"
 # -----------------------------------------------------------------------------
 
 FULL_MODE=false
+DISTILL_ONLY=false
 HARVEST_ONLY=false
 HYGIENE_ONLY=false
+NO_DISTILL=false
+DISTILL_FIRST_RUN=false
 DRY_RUN=false
 NO_COMPILE=false
 NO_REINDEX=false
 
 while [[ $# -gt 0 ]]; do
   case "$1" in
-    --full)         FULL_MODE=true; shift ;;
-    --harvest-only) HARVEST_ONLY=true; shift ;;
-    --hygiene-only) HYGIENE_ONLY=true; shift ;;
-    --dry-run)      DRY_RUN=true; shift ;;
-    --no-compile)   NO_COMPILE=true; shift ;;
-    --no-reindex)   NO_REINDEX=true; shift ;;
+    --full)              FULL_MODE=true; shift ;;
+    --distill-only)      DISTILL_ONLY=true; shift ;;
+    --harvest-only)      HARVEST_ONLY=true; shift ;;
+    --hygiene-only)      HYGIENE_ONLY=true; shift ;;
+    --no-distill)        NO_DISTILL=true; shift ;;
+    --distill-first-run) DISTILL_FIRST_RUN=true; shift ;;
+    --dry-run)           DRY_RUN=true; shift ;;
+    --no-compile)        NO_COMPILE=true; shift ;;
+    --no-reindex)        NO_REINDEX=true; shift ;;
     -h|--help)
-      sed -n '3,20p' "$0" | sed 's/^# \?//'
+      sed -n '3,28p' "$0" | sed 's/^# \?//'
       exit 0
       ;;
     *)
@@ -57,8 +70,13 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
-if [[ "${HARVEST_ONLY}" == "true" && "${HYGIENE_ONLY}" == "true" ]]; then
-  echo "--harvest-only and --hygiene-only are mutually exclusive" >&2
+# Mutex check — only one "only" flag at a time
+only_count=0
+${DISTILL_ONLY} && only_count=$((only_count + 1))
+${HARVEST_ONLY} && only_count=$((only_count + 1))
+${HYGIENE_ONLY} && only_count=$((only_count + 1))
+if [[ $only_count -gt 1 ]]; then
+  echo "--distill-only, --harvest-only, and --hygiene-only are mutually exclusive" >&2
   exit 1
 fi
 
@@ -91,13 +109,36 @@ cd "${WIKI_DIR}"
 for req in python3 qmd; do
   if ! command -v "${req}" >/dev/null 2>&1; then
     if [[ "${req}" == "qmd" && "${NO_REINDEX}" == "true" ]]; then
-      continue  # qmd not required if --no-reindex
+      continue
     fi
     echo "Required command not found: ${req}" >&2
     exit 1
   fi
 done
 
+# -----------------------------------------------------------------------------
+# Determine which phases to run
+# -----------------------------------------------------------------------------
+
+run_distill=true
+run_harvest=true
+run_hygiene=true
+
+${NO_DISTILL} && run_distill=false
+
+if ${DISTILL_ONLY}; then
+  run_harvest=false
+  run_hygiene=false
+fi
+if ${HARVEST_ONLY}; then
+  run_distill=false
+  run_hygiene=false
+fi
+if ${HYGIENE_ONLY}; then
+  run_distill=false
+  run_harvest=false
+fi
+
 # -----------------------------------------------------------------------------
 # Pipeline
 # -----------------------------------------------------------------------------
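The flag resolution above reduces to a small truth table. A standalone sketch of the same logic (hypothetical `resolve_phases` helper mirroring the shell variables, and assuming the mutex check has already rejected combinations of `*-only` flags):

```python
def resolve_phases(*, distill_only: bool = False, harvest_only: bool = False,
                   hygiene_only: bool = False, no_distill: bool = False) -> dict[str, bool]:
    """Mirror of the shell resolution: start with every phase enabled,
    apply --no-distill, then let a single *-only flag narrow to one phase."""
    phases = {"distill": not no_distill, "harvest": True, "hygiene": True}
    if distill_only:
        phases.update(harvest=False, hygiene=False)
    if harvest_only:
        phases.update(distill=False, hygiene=False)
    if hygiene_only:
        phases.update(distill=False, harvest=False)
    return phases

# Default run executes all three phases; --hygiene-only narrows to one.
assert resolve_phases() == {"distill": True, "harvest": True, "hygiene": True}
assert resolve_phases(hygiene_only=True) == {"distill": False, "harvest": False, "hygiene": True}
```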
@@ -105,18 +146,39 @@ done
 START_TS="$(date '+%s')"
 section "wiki-maintain.sh starting"
 log "mode: $(${FULL_MODE} && echo full || echo quick)"
-log "harvest: $(${HYGIENE_ONLY} && echo skipped || echo enabled)"
-log "hygiene: $(${HARVEST_ONLY} && echo skipped || echo enabled)"
+log "distill: $(${run_distill} && echo enabled || echo skipped)"
+log "harvest: $(${run_harvest} && echo enabled || echo skipped)"
+log "hygiene: $(${run_hygiene} && echo enabled || echo skipped)"
 log "reindex: $(${NO_REINDEX} && echo skipped || echo enabled)"
 log "dry-run: ${DRY_RUN}"
 log "wiki: ${WIKI_DIR}"
 
 # -----------------------------------------------------------------------------
-# Phase 1: Harvest
+# Phase 1a: Distill — conversations → wiki pages
 # -----------------------------------------------------------------------------
 
-if [[ "${HYGIENE_ONLY}" != "true" ]]; then
-  section "Phase 1: URL harvesting"
+if ${run_distill}; then
+  section "Phase 1a: Conversation distillation"
+  distill_args=()
+  ${DRY_RUN} && distill_args+=(--dry-run)
+  ${NO_COMPILE} && distill_args+=(--no-compile)
+  ${DISTILL_FIRST_RUN} && distill_args+=(--first-run)
+
+  if python3 "${SCRIPTS_DIR}/wiki-distill.py" "${distill_args[@]}"; then
+    log "distill completed"
+  else
+    log "[error] distill failed (exit $?) — continuing to harvest"
+  fi
+else
+  section "Phase 1a: Conversation distillation (skipped)"
+fi
+
+# -----------------------------------------------------------------------------
+# Phase 1b: Harvest — URLs cited in conversations → raw/ → wiki pages
+# -----------------------------------------------------------------------------
+
+if ${run_harvest}; then
+  section "Phase 1b: URL harvesting"
   harvest_args=()
   ${DRY_RUN} && harvest_args+=(--dry-run)
   ${NO_COMPILE} && harvest_args+=(--no-compile)
@@ -127,14 +189,14 @@ if [[ "${HYGIENE_ONLY}" != "true" ]]; then
     log "[error] harvest failed (exit $?) — continuing to hygiene"
   fi
 else
-  section "Phase 1: URL harvesting (skipped)"
+  section "Phase 1b: URL harvesting (skipped)"
 fi
 
 # -----------------------------------------------------------------------------
 # Phase 2: Hygiene
 # -----------------------------------------------------------------------------
 
-if [[ "${HARVEST_ONLY}" != "true" ]]; then
+if ${run_hygiene}; then
   section "Phase 2: Hygiene checks"
   hygiene_args=()
   if ${FULL_MODE}; then
@@ -209,3 +209,63 @@ def iter_archived_pages() -> list[WikiPage]:
 def page_content_hash(page: WikiPage) -> str:
     """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
     return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()
+
+
+# ---------------------------------------------------------------------------
+# Conversation hall parsing
+# ---------------------------------------------------------------------------
+#
+# Summarized conversations have sections in the body like:
+#   ## Decisions (hall: fact)
+#   - bullet
+#   - bullet
+#   ## Discoveries (hall: discovery)
+#   - bullet
+#
+# Hall types used by the summarizer: fact, discovery, preference, advice,
+# event, tooling. Only fact/discovery/advice are high-signal enough to
+# distill into wiki pages; the others are tracked but not auto-promoted.
+
+HIGH_SIGNAL_HALLS = {"fact", "discovery", "advice"}
+
+_HALL_SECTION_RE = re.compile(
+    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
+    re.MULTILINE | re.DOTALL,
+)
+_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)
+
+
+def parse_conversation_halls(page: WikiPage) -> dict[str, list[str]]:
+    """Extract hall-bucketed bullet content from a summarized conversation body.
+
+    Returns a dict like:
+        {"fact": ["claim one", "claim two"],
+         "discovery": ["root cause X"],
+         "advice": ["do Y", "consider Z"], ...}
+
+    Empty hall types are omitted. Bullet lines are stripped of leading "- "
+    and trailing whitespace; multi-line bullets are joined with a space.
+    """
+    result: dict[str, list[str]] = {}
+    for match in _HALL_SECTION_RE.finditer(page.body):
+        hall_type = match.group(1).strip().lower()
+        section_body = match.group(2)
+        bullets = [
+            _flatten_bullet(b.group(1))
+            for b in _BULLET_RE.finditer(section_body)
+        ]
+        bullets = [b for b in bullets if b]
+        if bullets:
+            result.setdefault(hall_type, []).extend(bullets)
+    return result
+
+
+def _flatten_bullet(text: str) -> str:
+    """Collapse a possibly-multiline bullet into a single clean line."""
+    return " ".join(text.split()).strip()
+
+
+def high_signal_halls(page: WikiPage) -> dict[str, list[str]]:
+    """Return only fact/discovery/advice content from a conversation."""
+    all_halls = parse_conversation_halls(page)
+    return {k: v for k, v in all_halls.items() if k in HIGH_SIGNAL_HALLS}
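The two patterns above can be exercised standalone. A minimal sketch of the same parse loop (regexes copied verbatim from the hunk; the sample body is invented for illustration):

```python
import re

# Same patterns as wiki_lib's hall parser, inlined so the sketch runs alone.
_HALL_SECTION_RE = re.compile(
    r"^##\s+[^\n]*?\(hall:\s*(\w+)\s*\)\s*$(.*?)(?=^##\s|\Z)",
    re.MULTILINE | re.DOTALL,
)
_BULLET_RE = re.compile(r"^\s*-\s+(.*?)$", re.MULTILINE)

body = (
    "## Summary\n\nfree text, ignored.\n\n"
    "## Decisions (hall: fact)\n\n- chose sqlite\n- pinned qmd 0.4\n\n"
    "## Advice (hall: advice)\n\n- batch the embeds\n"
)

# Bucket bullets by hall type; sections without a "(hall: ...)" marker
# (like "## Summary") never match the section regex.
halls: dict[str, list[str]] = {}
for m in _HALL_SECTION_RE.finditer(body):
    bullets = [" ".join(b.group(1).split()) for b in _BULLET_RE.finditer(m.group(2))]
    if bullets:
        halls.setdefault(m.group(1).lower(), []).extend(bullets)

print(halls)  # → {'fact': ['chose sqlite', 'pinned qmd 0.4'], 'advice': ['batch the embeds']}
```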
@@ -65,7 +65,9 @@ class TestWikiMaintainSh:
             "wiki-maintain.sh", "--hygiene-only", "--dry-run", "--no-reindex"
         )
         assert result.returncode == 0
-        assert "Phase 1: URL harvesting (skipped)" in result.stdout
+        # Phase 1a (distill) and Phase 1b (harvest) both skipped in --hygiene-only
+        assert "Phase 1a: Conversation distillation (skipped)" in result.stdout
+        assert "Phase 1b: URL harvesting (skipped)" in result.stdout
 
     def test_phase_3_skipped_in_dry_run(
         self, run_script, tmp_wiki: Path
tests/test_wiki_distill.py (new file, 446 lines)
@@ -0,0 +1,446 @@
"""Unit + integration tests for scripts/wiki-distill.py.

Mocks claude -p; no real LLM calls during tests.
"""

from __future__ import annotations

import json
from datetime import date, timedelta
from pathlib import Path
from typing import Any

import pytest

from conftest import make_conversation


# ---------------------------------------------------------------------------
# wiki_lib hall parsing helpers
# ---------------------------------------------------------------------------


class TestParseConversationHalls:
    def _make_conv_with_halls(self, tmp_wiki: Path, body: str) -> Path:
        return make_conversation(
            tmp_wiki,
            "test",
            "2026-04-12-halls.md",
            status="summarized",
            body=body,
        )

    def test_extracts_fact_bullets(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        body = (
            "## Summary\n\nsome summary text.\n\n"
            "## Decisions (hall: fact)\n\n"
            "- First decision made\n"
            "- Second decision\n\n"
            "## Other section\n\nunrelated.\n"
        )
        path = self._make_conv_with_halls(tmp_wiki, body)
        page = wiki_lib.parse_page(path)
        halls = wiki_lib.parse_conversation_halls(page)
        assert "fact" in halls
        assert halls["fact"] == ["First decision made", "Second decision"]

    def test_extracts_multiple_hall_types(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        body = (
            "## Decisions (hall: fact)\n\n- A\n- B\n\n"
            "## Discoveries (hall: discovery)\n\n- root cause X\n\n"
            "## Advice (hall: advice)\n\n- try Y\n- consider Z\n"
        )
        path = self._make_conv_with_halls(tmp_wiki, body)
        page = wiki_lib.parse_page(path)
        halls = wiki_lib.parse_conversation_halls(page)
        assert halls["fact"] == ["A", "B"]
        assert halls["discovery"] == ["root cause X"]
        assert halls["advice"] == ["try Y", "consider Z"]

    def test_ignores_sections_without_hall_marker(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        body = (
            "## Summary\n\n- not a hall bullet\n\n"
            "## Decisions (hall: fact)\n\n- real bullet\n"
        )
        path = self._make_conv_with_halls(tmp_wiki, body)
        page = wiki_lib.parse_page(path)
        halls = wiki_lib.parse_conversation_halls(page)
        assert halls == {"fact": ["real bullet"]}

    def test_flattens_multiline_bullets(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        body = (
            "## Decisions (hall: fact)\n\n"
            "- A bullet that goes on\n and continues here\n"
            "- Second bullet\n"
        )
        path = self._make_conv_with_halls(tmp_wiki, body)
        page = wiki_lib.parse_page(path)
        halls = wiki_lib.parse_conversation_halls(page)
        # The simple regex captures each "- " line separately; continuation
        # lines are not part of the bullet. This matches the current behavior.
        assert halls["fact"][0].startswith("A bullet")
        assert "Second bullet" in halls["fact"]

    def test_empty_body_returns_empty(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        path = self._make_conv_with_halls(tmp_wiki, "## Summary\n\ntext.\n")
        page = wiki_lib.parse_page(path)
        assert wiki_lib.parse_conversation_halls(page) == {}

    def test_high_signal_halls_filters_out_preference_event_tooling(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        body = (
            "## Decisions (hall: fact)\n- f\n"
            "## Preferences (hall: preference)\n- p\n"
            "## Events (hall: event)\n- e\n"
            "## Tooling (hall: tooling)\n- t\n"
            "## Advice (hall: advice)\n- a\n"
        )
        path = self._make_conv_with_halls(tmp_wiki, body)
        page = wiki_lib.parse_page(path)
        halls = wiki_lib.high_signal_halls(page)
        assert set(halls.keys()) == {"fact", "advice"}


# ---------------------------------------------------------------------------
# wiki-distill.py module fixture
# ---------------------------------------------------------------------------


@pytest.fixture
def wiki_distill(tmp_wiki: Path) -> Any:
    from conftest import SCRIPTS_DIR, _load_script_module
    _load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")
    return _load_script_module("wiki_distill", SCRIPTS_DIR / "wiki-distill.py")


# ---------------------------------------------------------------------------
# Topic rollup logic
# ---------------------------------------------------------------------------


class TestTopicRollup:
    def _make_summarized_conv(
        self,
        tmp_wiki: Path,
        project: str,
        filename: str,
        conv_date: str,
        topics: list[str],
        fact_bullets: list[str] | None = None,
    ) -> Path:
        fact_section = ""
        if fact_bullets:
            fact_section = "## Decisions (hall: fact)\n\n" + "\n".join(
                f"- {b}" for b in fact_bullets
            )
        return make_conversation(
            tmp_wiki,
            project,
            filename,
            date=conv_date,
            status="summarized",
            related=[f"topic:{t}" for t in topics],
            body=f"## Summary\n\ntest.\n\n{fact_section}\n",
        )

    def test_extract_topics_from_today_only(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        today_date = wiki_distill.today()
        yesterday = today_date - timedelta(days=1)
        # Today's conversation with topics
        _write_conv_with_topics(
            tmp_wiki, "test", "today.md",
            date_str=today_date.isoformat(), topics=["alpha", "beta"],
        )
        # Yesterday's conversation — should be excluded at lookback=0
        _write_conv_with_topics(
            tmp_wiki, "test", "yesterday.md",
            date_str=yesterday.isoformat(), topics=["gamma"],
        )
        all_convs = wiki_distill.iter_summarized_conversations()
        topics = wiki_distill.extract_topics_from_today(all_convs, today_date, 0)
        assert topics == {"alpha", "beta"}

    def test_extract_topics_with_lookback(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        today_date = wiki_distill.today()
        day3 = today_date - timedelta(days=3)
        day10 = today_date - timedelta(days=10)
        _write_conv_with_topics(
            tmp_wiki, "test", "today.md",
            date_str=today_date.isoformat(), topics=["a"],
        )
        _write_conv_with_topics(
            tmp_wiki, "test", "day3.md",
            date_str=day3.isoformat(), topics=["b"],
        )
        _write_conv_with_topics(
            tmp_wiki, "test", "day10.md",
            date_str=day10.isoformat(), topics=["c"],
        )
        all_convs = wiki_distill.iter_summarized_conversations()
        topics_7 = wiki_distill.extract_topics_from_today(all_convs, today_date, 7)
        assert topics_7 == {"a", "b"}  # day10 excluded by 7-day lookback

    def test_rollup_by_topic_across_history(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        today_date = wiki_distill.today()
        # Three conversations all tagged with "shared-topic", different dates
        _write_conv_with_topics(
            tmp_wiki, "test", "a.md",
            date_str=today_date.isoformat(), topics=["shared-topic"],
        )
        _write_conv_with_topics(
            tmp_wiki, "test", "b.md",
            date_str=(today_date - timedelta(days=30)).isoformat(),
            topics=["shared-topic", "other"],
        )
        _write_conv_with_topics(
            tmp_wiki, "test", "c.md",
            date_str=(today_date - timedelta(days=90)).isoformat(),
            topics=["shared-topic"],
        )
        # One unrelated
        _write_conv_with_topics(
            tmp_wiki, "test", "d.md",
            date_str=today_date.isoformat(), topics=["unrelated"],
        )
        all_convs = wiki_distill.iter_summarized_conversations()
        rollup = wiki_distill.rollup_conversations_by_topic(
            "shared-topic", all_convs
        )
        assert len(rollup) == 3
        stems = [c.path.stem for c in rollup]
        # Most recent first
        assert stems[0] == "a"


def _write_conv_with_topics(
    tmp_wiki: Path,
    project: str,
    filename: str,
    *,
    date_str: str,
    topics: list[str],
) -> Path:
    """Helper — write a summarized conversation with topic frontmatter."""
    proj_dir = tmp_wiki / "conversations" / project
    proj_dir.mkdir(parents=True, exist_ok=True)
    path = proj_dir / filename
    topic_yaml = "topics: [" + ", ".join(topics) + "]"
    content = (
        f"---\n"
        f"title: Test Conv\n"
        f"type: conversation\n"
        f"project: {project}\n"
        f"date: {date_str}\n"
        f"status: summarized\n"
        f"messages: 50\n"
        f"{topic_yaml}\n"
        f"---\n"
        f"## Summary\n\ntest.\n\n"
        f"## Decisions (hall: fact)\n\n"
        f"- Fact one for these topics\n"
        f"- Fact two\n"
    )
    path.write_text(content)
    return path


# ---------------------------------------------------------------------------
# Topic group building
# ---------------------------------------------------------------------------


class TestTopicGroupBuild:
    def test_counts_total_bullets(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        _write_conv_with_topics(
            tmp_wiki, "test", "one.md",
            date_str="2026-04-12", topics=["foo"],
        )
        all_convs = wiki_distill.iter_summarized_conversations()
        rollup = wiki_distill.rollup_conversations_by_topic("foo", all_convs)
        group = wiki_distill.build_topic_group("foo", rollup)
        assert group.topic == "foo"
        assert group.total_bullets == 2  # the helper writes 2 fact bullets

    def test_format_for_llm_includes_topic_and_sections(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        _write_conv_with_topics(
            tmp_wiki, "test", "one.md",
            date_str="2026-04-12", topics=["bar"],
        )
        all_convs = wiki_distill.iter_summarized_conversations()
        rollup = wiki_distill.rollup_conversations_by_topic("bar", all_convs)
        group = wiki_distill.build_topic_group("bar", rollup)
        text = wiki_distill.format_topic_group_for_llm(group)
        assert "# Topic: bar" in text
        assert "Fact one" in text
        assert "Decisions:" in text


# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


class TestDistillState:
    def test_load_returns_defaults(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        state = wiki_distill.load_state()
        assert state["processed_convs"] == {}
        assert state["processed_topics"] == {}
        assert state["first_run_complete"] is False

    def test_save_and_reload(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        state = wiki_distill.load_state()
        state["first_run_complete"] = True
        state["processed_topics"]["foo"] = {"distilled_date": "2026-04-12"}
        wiki_distill.save_state(state)

        reloaded = wiki_distill.load_state()
        assert reloaded["first_run_complete"] is True
        assert "foo" in reloaded["processed_topics"]

    def test_conv_needs_distill_first_time(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        path = _write_conv_with_topics(
            tmp_wiki, "test", "fresh.md",
            date_str="2026-04-12", topics=["x"],
        )
        conv = wiki_distill.parse_page(path)
        state = wiki_distill.load_state()
        assert wiki_distill.conv_needs_distill(state, conv) is True

    def test_conv_needs_distill_detects_content_change(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        path = _write_conv_with_topics(
            tmp_wiki, "test", "mut.md",
            date_str="2026-04-12", topics=["x"],
        )
        conv = wiki_distill.parse_page(path)
        state = wiki_distill.load_state()
        wiki_distill.mark_conv_distilled(state, conv, ["staging/patterns/x.md"])
        assert wiki_distill.conv_needs_distill(state, conv) is False

        # Mutate the body
        text = path.read_text()
        path.write_text(text + "\n- Another bullet\n")
        conv2 = wiki_distill.parse_page(path)
        assert wiki_distill.conv_needs_distill(state, conv2) is True

    def test_conv_needs_distill_detects_new_topic(
        self, wiki_distill: Any, tmp_wiki: Path
    ) -> None:
        path = _write_conv_with_topics(
            tmp_wiki, "test", "new-topic.md",
            date_str="2026-04-12", topics=["original"],
        )
        conv = wiki_distill.parse_page(path)
        state = wiki_distill.load_state()
        wiki_distill.mark_conv_distilled(state, conv, [])
        assert wiki_distill.conv_needs_distill(state, conv) is False

        # Rewrite with a new topic added
        _write_conv_with_topics(
            tmp_wiki, "test", "new-topic.md",
            date_str="2026-04-12", topics=["original", "freshly-added"],
        )
        conv2 = wiki_distill.parse_page(path)
        assert wiki_distill.conv_needs_distill(state, conv2) is True


# ---------------------------------------------------------------------------
# CLI smoke tests (no real LLM calls — uses --dry-run)
# ---------------------------------------------------------------------------


class TestDistillCli:
    def test_help_flag(self, run_script) -> None:
        result = run_script("wiki-distill.py", "--help")
        assert result.returncode == 0
        assert "--first-run" in result.stdout
        assert "--topic" in result.stdout
        assert "--dry-run" in result.stdout

    def test_dry_run_empty_wiki(self, run_script, tmp_wiki: Path) -> None:
        result = run_script("wiki-distill.py", "--dry-run", "--first-run")
        assert result.returncode == 0

    def test_dry_run_with_topic_rollup(
        self, run_script, tmp_wiki: Path
    ) -> None:
        _write_conv_with_topics(
            tmp_wiki, "test", "convA.md",
            date_str="2026-04-12", topics=["rollup-test"],
        )
        _write_conv_with_topics(
            tmp_wiki, "test", "convB.md",
            date_str="2026-04-11", topics=["rollup-test"],
        )
        result = run_script(
            "wiki-distill.py", "--dry-run", "--first-run",
        )
        assert result.returncode == 0
        # Should mention the rollup topic
        assert "rollup-test" in result.stdout

    def test_topic_flag_narrow_mode(
        self, run_script, tmp_wiki: Path
    ) -> None:
        _write_conv_with_topics(
            tmp_wiki, "test", "a.md",
            date_str="2026-04-12", topics=["explicit-topic"],
        )
        result = run_script(
            "wiki-distill.py", "--dry-run", "--topic", "explicit-topic",
        )
        assert result.returncode == 0
        assert "Explicit topic mode" in result.stdout
        assert "explicit-topic" in result.stdout

    def test_too_thin_topic_is_skipped(
        self, run_script, tmp_wiki: Path, wiki_distill: Any
    ) -> None:
        # Write a conversation with only ONE hall bullet on this topic
        proj_dir = tmp_wiki / "conversations" / "test"
        proj_dir.mkdir(parents=True, exist_ok=True)
        (proj_dir / "thin.md").write_text(
            "---\n"
            "title: Thin\n"
            "type: conversation\n"
            "project: test\n"
            "date: 2026-04-12\n"
            "status: summarized\n"
            "messages: 5\n"
            "topics: [thin-topic]\n"
            "---\n"
            "## Summary\n\n\n"
            "## Decisions (hall: fact)\n\n"
            "- Single bullet\n"
        )
        result = run_script(
            "wiki-distill.py", "--dry-run", "--topic", "thin-topic",
        )
        assert result.returncode == 0
        assert "too-thin" in result.stdout or "too-thin" in result.stderr