diff --git a/.gitignore b/.gitignore index 9e48a58..12f3581 100644 --- a/.gitignore +++ b/.gitignore @@ -31,5 +31,6 @@ __pycache__/ # NOTE: the following state files are NOT gitignored — they must sync # across machines so both installs agree on what's been processed: +# .distill-state.json (conversation distillation: processed convs + topics) # .harvest-state.json (URL dedup) # .hygiene-state.json (content hashes, deferred issues) diff --git a/README.md b/README.md index 76eaa65..576276c 100644 --- a/README.md +++ b/README.md @@ -114,13 +114,14 @@ to one of those extensions: | What memex adds | How it works | |-----------------|--------------| +| **Conversation distillation** — your sessions become wiki pages | `wiki-distill.py` finds today's topics, rolls up ALL historical conversations sharing each topic, pulls their `hall_facts` + `hall_discoveries` + `hall_advice` content, and asks `claude -p` to create new pages or update existing ones. This is what closes the MemPalace loop — closet summaries become the source material for the wiki itself, not just the URLs cited in them. | | **Time-decaying confidence** — pages earn trust through reinforcement and fade without it | `confidence` field + `last_verified`, 6/9/12 month decay thresholds, auto-archive. Full-mode hygiene also adds LLM contradiction detection across pages. | | **Scalable search beyond the context window** | `qmd` (BM25 + vector + LLM re-ranking) from day one, with three collections (`wiki` / `wiki-archive` / `wiki-conversations`) so queries can route to the right surface. | -| **Traceable sources for every claim** | Every compiled page traces back to an immutable `raw/harvested/*.md` file with a SHA-256 content hash. Staging review is the built-in cross-check, and `compilation_notes` makes review fast. | -| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → harvest → hygiene → reindex. 
`last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. | +| **Traceable sources for every claim** | Every compiled page traces back to either an immutable `raw/harvested/*.md` file (URL-sourced) or specific conversations with a `distill_source_conversations` field (session-sourced). Staging review is the built-in cross-check, and `compilation_notes` makes review fast. | +| **Continuous feed without manual discipline** | Daily + weekly cron chains extract → summarize → distill → harvest → hygiene → reindex. `last_verified` auto-refreshes from new conversation references; decayed pages auto-archive and auto-restore when referenced again. | | **Human-in-the-loop staging** for automated content | Every automated page lands in `staging/` first with `origin: automated`, `status: pending`. Nothing bypasses human review — one promotion step and it's in the live wiki with `last_verified` set. | | **Hybrid retrieval** — structural navigation + semantic search | Wings/rooms/halls (borrowed from mempalace) give structural filtering that narrows the search space before qmd's hybrid BM25 + vector pass runs. Full-mode hygiene also auto-adds missing cross-references. | -| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Harvest and hygiene state files sync across machines so both agree on what's been processed. | +| **Cross-machine git sync** for collaborative knowledge bases | `.gitattributes` with `merge=union` on markdown so concurrent writes on different machines merge additively. Distill, harvest, and hygiene state files sync across machines so both agree on what's been processed. | The short version: Karpathy shared the idea, milla-jovovich's mempalace added the structural memory taxonomy, and memex is the automation layer @@ -147,21 +148,28 @@ memex doesn't cover. 
│ summarize-conversations.py --claude (daily) ▼ ┌─────────────────────┐ -│ conversations/ │ summaries with related: wiki links +│ conversations/ │ summaries with halls + topics + related: │ /*.md │ (status: summarized) -└──────────┬──────────┘ - │ wiki-harvest.py (daily) - ▼ -┌─────────────────────┐ -│ raw/harvested/ │ fetched URL content -│ *.md │ (immutable source material) -└──────────┬──────────┘ - │ claude -p compile step - ▼ -┌─────────────────────┐ -│ staging// │ pending pages -│ *.md │ (status: pending, origin: automated) -└──────────┬──────────┘ +└──────┬───────────┬──┘ + │ │ + │ └──▶ wiki-distill.py (daily Phase 1a) ──┐ + │ - rollup by today's topics │ + │ - pull historical conversations│ + │ - extract fact/discovery/advice│ + │ - claude -p → new or update │ + │ │ + │ wiki-harvest.py (daily Phase 1b) │ + ▼ │ +┌─────────────────────┐ │ +│ raw/harvested/ │ fetched URL content │ +│ *.md │ (immutable source material) │ +└──────────┬──────────┘ │ + │ claude -p compile step │ + ▼ │ +┌──────────────────────────────────────────────────────┐ │ +│ staging// pending pages │◀─┘ +│ *.md (status: pending, origin: auto) │ +└──────────┬───────────────────────────────────────────┘ │ human review (wiki-staging.py --review) ▼ ┌─────────────────────┐ @@ -283,6 +291,7 @@ wiki/ ├── reports/ ← Hygiene operation logs ├── scripts/ ← The automation pipeline ├── tests/ ← Pytest suite (171 tests) +├── .distill-state.json ← Conversation distill state (committed, synced) ├── .harvest-state.json ← URL dedup state (committed, synced) ├── .hygiene-state.json ← Content hashes, deferred issues (committed, synced) └── .mine-state.json ← Conversation extraction offsets (gitignored, per-machine) @@ -333,11 +342,12 @@ Eleven scripts organized in three layers: - `update-conversation-index.py` — Regenerate conversation index + wake-up context **Automation layer** (maintains the wiki): -- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants +- `wiki_lib.py` — Shared 
frontmatter parser, `WikiPage` dataclass, hall extraction, constants +- `wiki-distill.py` — Conversation distillation (closet → wiki pages via claude -p, closes the MemPalace loop) - `wiki-harvest.py` — URL classification + fetch cascade + compile to staging - `wiki-staging.py` — Human review (list/promote/reject/review/sync) - `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore -- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene +- `wiki-maintain.sh` — Top-level orchestrator chaining distill + harvest + hygiene **Sync layer**: - `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index 39811c8..35e1606 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -77,6 +77,7 @@ Automation + lifecycle management on top of both: ┌─────────────────────────────────┐ │ AUTOMATION LAYER │ │ wiki_lib.py (shared helpers) │ + │ wiki-distill.py │ (conversations → staging) ← closes MemPalace loop │ wiki-harvest.py │ (URL → raw → staging) │ wiki-staging.py │ (human review) │ wiki-hygiene.py │ (decay, archive, repair, checks) @@ -169,10 +170,63 @@ Provides: All paths honor the `WIKI_DIR` environment variable, so tests and alternate installs can override the root. +### `wiki-distill.py` + +**Closes the MemPalace loop.** Reads the *content* of summarized +conversations — not the URLs they cite — and compiles wiki pages from +the high-signal hall entries (`hall_facts`, `hall_discoveries`, +`hall_advice`). Runs as Phase 1a in `wiki-maintain.sh`, before URL +harvesting. + +**Scope filter (deliberately narrow)**: +1. Find all summarized conversations dated TODAY +2. Extract their `topics:` — this is the "topics-of-today" set +3. For each topic in that set, pull ALL summarized conversations across + history that share that topic (full historical context via rollup) +4. 
Extract `hall_facts` + `hall_discoveries` + `hall_advice` bullet + content from each conversation's body 5. Send the topic group (topic + matching conversations + halls) to + `claude -p` with the current `index.md` 6. Model emits a JSON `actions` array with `new_page` / `update_page` / + `skip` verdicts; the script writes each to `staging/<type>/` + +**First-run bootstrap**: the very first run uses a 7-day lookback +instead of today-only, so the state file gets seeded with a reasonable +starting set. After that, daily runs stay narrow. + +**Self-triggering**: dormant topics that resurface in a new +conversation automatically pull in all historical conversations on +that topic via the rollup. No manual intervention needed to +reprocess old knowledge when it becomes relevant again. + +**Model routing**: haiku for short topic groups (< 15K chars prompt, +< 20 bullets), sonnet for longer ones. + +**State** lives in `.distill-state.json` — tracks processed +conversations by content hash and topics-at-distill-time. A +conversation is re-processed if its body changes OR if it gains a new +topic not seen at previous distill. + +**Staging output** includes distill-specific frontmatter: +- `staged_by: wiki-distill` +- `distill_topic: <topic>` +- `distill_source_conversations: <comma-separated conversation paths>` + +Commands: +- `wiki-distill.py` — today-only rollup (default mode after first run) +- `wiki-distill.py --first-run` — 7-day lookback bootstrap +- `wiki-distill.py --topic TOPIC` — explicit single-topic processing +- `wiki-distill.py --project WING` — only today-topics from this wing +- `wiki-distill.py --dry-run` — plan only, no LLM calls, no writes +- `wiki-distill.py --no-compile` — rollup only, skip claude -p step +- `wiki-distill.py --limit N` — stop after N topic groups + ### `wiki-harvest.py` Scans summarized conversations for HTTP(S) URLs, classifies them, -fetches content, and compiles pending wiki pages. +fetches content, and compiles pending wiki pages. 
Runs as Phase 1b in +`wiki-maintain.sh`, after distill — URL content is treated as a +supplement to conversation-driven knowledge, not the primary source. URL classification: - **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile @@ -254,13 +308,17 @@ full-mode runs can skip unchanged pages. Reports land in Top-level orchestrator: ``` -Phase 1: wiki-harvest.py (unless --hygiene-only) -Phase 2: wiki-hygiene.py (--full for the weekly pass, else quick) -Phase 3: qmd update && qmd embed (unless --no-reindex or --dry-run) +Phase 1a: wiki-distill.py (unless --no-distill or --harvest-only / --hygiene-only) +Phase 1b: wiki-harvest.py (unless --distill-only / --hygiene-only) +Phase 2: wiki-hygiene.py (--full for the weekly pass, else quick) +Phase 3: qmd update && qmd embed (unless --no-reindex or --dry-run) ``` -Flags pass through to child scripts. Error-tolerant: if one phase fails, -the others still run. Logs to `scripts/.maintain.log`. +Ordering is deliberate: distill runs before harvest so that +conversation content drives the page shape, and URL harvesting only +supplements what the conversations are already covering. Flags pass +through to child scripts. Error-tolerant: if one phase fails, the +others still run. Logs to `scripts/.maintain.log`. --- @@ -289,6 +347,7 @@ Three JSON files track per-pipeline state: | File | Owner | Synced? 
| Purpose | |------|-------|---------|---------| | `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable | +| `.distill-state.json` | `wiki-distill.py` | Yes (committed) | Processed conversations (content hash + topics seen), rejected topics, first-run flag | | `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs | | `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps | @@ -301,13 +360,15 @@ because Claude Code session files live at OS-specific paths. ## Module dependency graph ``` -wiki_lib.py ─┬─> wiki-harvest.py +wiki_lib.py ─┬─> wiki-distill.py + ├─> wiki-harvest.py ├─> wiki-staging.py └─> wiki-hygiene.py -wiki-maintain.sh ─> wiki-harvest.py - ─> wiki-hygiene.py - ─> qmd (external) +wiki-maintain.sh ─> wiki-distill.py (Phase 1a — conversations → staging) + ─> wiki-harvest.py (Phase 1b — URLs → staging) + ─> wiki-hygiene.py (Phase 2) + ─> qmd (external) (Phase 3) mine-conversations.sh ─> extract-sessions.py ─> summarize-conversations.py diff --git a/docs/DESIGN-RATIONALE.md b/docs/DESIGN-RATIONALE.md index 791dc60..5efbbbf 100644 --- a/docs/DESIGN-RATIONALE.md +++ b/docs/DESIGN-RATIONALE.md @@ -43,10 +43,13 @@ repo preserves all of them: Karpathy's gist is a concept pitch. He was explicit that he was sharing an "idea file" for others to build on, not publishing a working -implementation. The analysis identified seven places where the core idea -needs an engineering layer to become practical day-to-day — five have -first-class answers in memex, and two remain scoped-out trade-offs that -the architecture cleanly acknowledges. +implementation. The analysis identified eight places where the core idea +needs an engineering layer to become practical day-to-day. 
The first +seven emerged from the original Signal & Noise review; the eighth +(conversation distillation) surfaced after building the other layers +and realizing that the conversations themselves were being mined, +summarized, indexed, and scanned for URLs — but the knowledge *inside* +them was never becoming wiki pages. ### 1. Claim freshness and reversibility @@ -236,6 +239,71 @@ story. If you need any of that, you need a different architecture. This is for the personal and small-team case where git + Tailscale is the right amount of rigor. +### 8. Closing the MemPalace loop — conversation distillation + +**The gap**: The mining pipeline extracts Claude Code sessions into +transcripts, classifies them by memory type (fact/discovery/preference/ +advice/event/tooling), and tags them with topics. The URL harvester +scans them for cited links. Hygiene refreshes `last_verified` on any +wiki page that appears in a conversation's `related:` field. But none +of those steps actually *compile the knowledge inside the conversations +themselves into wiki pages.* A decision made in a session, a root cause +found during debugging, a pattern spotted in review — these stay in the +conversation summaries (searchable but not synthesized) until a human +manually writes them up. That's the last piece of the MemPalace model +that wasn't wired through: **closet content was never becoming the +source for the wiki proper**. + +**How memex extends it**: + +- **`wiki-distill.py`** runs as Phase 1a of `wiki-maintain.sh`, before + URL harvesting. The ordering is deliberate: conversation content + should drive the page, and URL harvesting should only supplement + what the conversations are already covering. +- **Narrow today-filter with historical rollup** — daily runs only + look at topics appearing in TODAY's summarized conversations, but + for each such topic the script pulls in ALL historical conversations + sharing that topic. Processing scope stays small; LLM context stays + wide. 
Old topics that resurface in new sessions automatically + trigger a re-distillation of the full history on that topic. +- **First-run bootstrap** — the very first run uses a 7-day lookback + to seed the state. After that, daily runs stay narrow. +- **High-signal halls only** — distill reads `hall_facts`, + `hall_discoveries`, and `hall_advice` bullets. Skips `hall_events` + (temporal, not knowledge), `hall_preferences` (user working style), + and `hall_tooling` (often low-signal). These are the halls the + MemPalace taxonomy treats as "canonical knowledge" vs "context." +- **claude -p compile step** — each topic group (topic + all matching + conversations + their high-signal halls) is sent to `claude -p` + with the current wiki index. The model decides whether to create a + new page, update an existing one, emit both, or skip (topic not + substantive enough or already well-covered). +- **Staging output with distill provenance** — new/updated pages land + in `staging/` with `staged_by: wiki-distill`, `distill_topic`, and + `distill_source_conversations` frontmatter fields. Every page traces + back to the exact conversations it was distilled from. +- **State file `.distill-state.json`** tracks processed conversations + by content hash and topic set, so re-runs only process what actually + changed. A conversation gets re-distilled if its body changes OR if + it gains a new topic not seen at previous distill time. + +**Why this matters**: Without distillation, the MemPalace integration +was incomplete — the closet summaries existed, the structural metadata +existed, qmd could search them, but knowledge discovered during work +never escaped the conversation archive. You could find "we had a +debugging session about X last month" but couldn't find "here's the +canonical page on X that captures what we learned." This extension +turns the MemPalace layer from a searchable archive into a proper +**ingest pipeline** for the wiki. 
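
The compile step's contract with `claude -p` can be sketched as follows. This is an illustrative reconstruction, not the script's actual code: it assumes the distill prompt instructs the model to reply with bare JSON of the shape `{"actions": [...]}`, and that `claude -p` prints the reply to stdout.

```python
import json
import subprocess

def parse_actions(model_output: str) -> list:
    """Pull the `actions` array out of the model's JSON reply."""
    return json.loads(model_output).get("actions", [])

def compile_topic_group(prompt: str) -> list:
    """Send one topic group to `claude -p` and return its verdicts.

    A sketch: the real script adds retries, timeouts, and
    haiku/sonnet model routing around this call.
    """
    result = subprocess.run(["claude", "-p", prompt],
                            capture_output=True, text=True, check=True)
    return parse_actions(result.stdout)
```

Each returned action carries a `new_page` / `update_page` / `skip` verdict that the script then writes to staging.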
+ +**Residual consideration**: Summarization quality is now load-bearing. +The distill step trusts the summarizer's classification of bullets +into halls. If the summarizer puts a debugging dead-end in +`hall_discoveries`, it may enter the wiki compilation pipeline. The +`MIN_BULLETS_PER_TOPIC` filter (default 2) and the LLM's own +substantiveness check (it can choose `skip` with a reason) together +catch most noise, and the staging review catches the rest. + --- ## The biggest layer — active upkeep @@ -255,7 +323,7 @@ thing the human has to think about: | Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex | | Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) | | Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) | -| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex | +| Daily 3am | `wiki-maintain.sh` | Distill + harvest + quick hygiene + reindex | | Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection | If you disable all of these, you get the same outcome as every diff --git a/docs/SETUP.md b/docs/SETUP.md index 591fe1e..5fccfac 100644 --- a/docs/SETUP.md +++ b/docs/SETUP.md @@ -281,7 +281,13 @@ python3 scripts/summarize-conversations.py --claude # 3. Regenerate conversation index + wake-up context python3 scripts/update-conversation-index.py --reindex -# 4. Dry-run the maintenance pipeline +# 4. First-run distill bootstrap (7-day lookback, burns claude -p calls) +# Only do this if you have summarized conversations from recent work. +# Skip it if you're starting with a fresh wiki. +python3 scripts/wiki-distill.py --first-run --dry-run # plan +python3 scripts/wiki-distill.py --first-run # actually do it + +# 5. 
Dry-run the maintenance pipeline bash scripts/wiki-maintain.sh --dry-run --no-compile ``` @@ -322,7 +328,7 @@ PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/ 0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1 # ─── Maintenance ─────────────────────────────────────────────────────────── -# Daily at 3am: harvest + quick hygiene + qmd reindex +# Daily at 3am: distill conversations + harvest URLs + quick hygiene + qmd reindex 0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1 # Weekly Sunday at 4am: full hygiene with LLM checks @@ -424,8 +430,8 @@ cd tests && python3 -m pytest Expected: - `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]` -- `wiki-maintain.sh --dry-run` completes all three phases -- `pytest` passes all 171 tests in ~1.3 seconds +- `wiki-maintain.sh --dry-run` completes all four phases (distill, harvest, hygiene, reindex) +- `pytest` passes all 192 tests in ~1.5 seconds --- diff --git a/docs/artifacts/signal-and-noise.html b/docs/artifacts/signal-and-noise.html index b875a66..876405b 100644 --- a/docs/artifacts/signal-and-noise.html +++ b/docs/artifacts/signal-and-noise.html @@ -1209,6 +1209,7 @@ + @@ -2259,6 +2260,255 @@ + +
+ +
+
⬣ The 8th Extension — Closing the MemPalace Loop
+

Closet summaries become the source for the wiki itself.

+

The first seven extensions came out of the Signal & Noise review. The eighth surfaced only after the other layers were built — and it's the one that makes the MemPalace integration a real pipeline into the wiki instead of just a searchable archive beside it. The mining layer was extracting sessions, classifying bullets into halls, tagging topics, and making everything searchable via qmd. But the knowledge inside the conversations was never being compiled into wiki pages. A decision made in a session, a root cause found during debugging, a pattern spotted in review — these stayed in the conversation summaries forever, findable but not synthesized.

+

This is what the wiki-distill.py script solves. It's Phase 1a of wiki-maintain.sh and runs before URL harvesting because conversation content should drive the page, not the URLs the conversation cites.

+
+
Phase 1aRuns before harvest
+
todayNarrow filter — today's topics
+
∀ historyRollup all past conversations on each topic
+
3 hallsfact + discovery + advice
+
haiku/sonnetAuto-routed by topic size
+
+
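The model-routing badge above reduces to a simple threshold check. A minimal sketch, with the cutoffs taken from the distill docs (15K prompt chars, 20 bullets) and the function name invented here for illustration:

```python
def route_model(prompt_chars: int, bullet_count: int) -> str:
    """Short topic groups go to haiku, longer ones to sonnet.

    Thresholds mirror the documented cutoffs (< 15K chars of prompt,
    < 20 bullets); helper name is illustrative, not from the script.
    """
    if prompt_chars < 15_000 and bullet_count < 20:
        return "haiku"
    return "sonnet"
```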
+ + +
+
Distill Flow — Conversation Content → Wiki Pages
+
Narrow: what topics to process today
+
+
Today's
conversations
+
+
Extract
topics[]
+
=
+
Topics of
today set
+
+
Wide: pull full history for each today-topic
+
+
Each
today-topic
+
+
Rollup ALL
historical convs
+
+
Extract
fact / discovery / advice
+
+
claude -p
distill prompt
+
+
Compile: model decides new / update / skip
+
+
JSON
actions[]
+
+
new_page
+
+
+
update_page
(modifies existing)
+
+
staging/<type>/
pending review
+
+
+ + +
+

Why This Completes MemPalace

+ +
+ +
+
+
📦
+
Drawer — before
+
Verbatim Archive
+
Full transcripts stored, searchable via qmd. No compilation — if you wanted canonical knowledge from them, you had to write it up manually.
+
Status: already working
+
+
+
🗂️
+
Closet — before
+
Summary Layer
+
Summaries with hall classification (fact / discovery / preference / advice / event / tooling) and topics. Searchable. Terminal: never fed forward into the wiki compiler.
+
Status: terminal data, not flowing
+
+
+
+
Distill — NEW
+
Compiler Bridge
+
Reads closet content by topic, rolls up all matching conversations across history, filters to high-signal halls only, sends to claude -p with the current wiki index, emits new or updated wiki pages to staging.
+
Status: wiki-distill.py
+
+
+
📄
+
Wiki Pages — NEW
+
Distilled Knowledge
+
Pages in staging/<type>/ with full distill provenance: distill_topic, distill_source_conversations, compilation_notes. Promote via staging review. Session knowledge becomes canonical knowledge.
+
Status: origin=automated, staged_by=wiki-distill
+
+
+ + +
+

Which Halls Get Distilled

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
HallDistilled?Why
hall_facts✦ YESDecisions locked in, choices made, specs agreed. Canonical knowledge.
hall_discoveries✦ YESRoot causes, breakthroughs, non-obvious findings. The highest-signal content in any session.
hall_advice✦ YESRecommendations, lessons learned, "next time do X." Worth capturing as patterns.
hall_eventsnoDeployments, incidents, milestones. Temporal data — belongs in logs, not the wiki.
hall_preferencesnoUser working-style notes. These belong in personal configs, not the shared wiki.
hall_toolingnoScript/command usage, failures, improvements. Usually low-signal or duplicates what's already in the wiki.
+ + +
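The table above is an allowlist, and the extraction pass is correspondingly small. A sketch that assumes halls appear as `## hall_*` headings over `- ` bullets in the summary body (the real summary layout and parser in wiki_lib.py may differ):

```python
# Only these three halls feed the distill pipeline; events,
# preferences, and tooling bullets are dropped.
DISTILL_HALLS = {"hall_facts", "hall_discoveries", "hall_advice"}

def high_signal_bullets(summary_body: str) -> list:
    """Collect bullets from the three distilled halls only."""
    bullets, keep = [], False
    for line in summary_body.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Entering a new hall section: keep it only if allowlisted
            keep = stripped[3:].strip() in DISTILL_HALLS
        elif keep and stripped.startswith("- "):
            bullets.append(stripped[2:])
    return bullets
```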
+

The Narrow-Today / Wide-History Filter

+ +
+ +
+ Processing scope stays narrow; LLM context stays wide. This is the key property that makes distill cheap enough to run daily and smart enough to produce good pages. +
+ +
+ +
+
+ 01 + Daily filter: only process topics appearing in TODAY's conversations + Scope + +
+
+

Each daily run only looks at conversations dated today. It extracts the topics: frontmatter from each — that union becomes the "topics of today" set. If you didn't discuss a topic today, it's not in the processing scope. This keeps the cron job cheap and predictable: if today was a light session day, distill runs fast. If today was a heavy architecture discussion, distill does real work.

+
First run only: The very first run uses a 7-day lookback instead of today-only so the state file gets seeded. After that first bootstrap, daily runs stay narrow.
+
+
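The daily filter described above amounts to a union over today's frontmatter. A minimal sketch: the `topics:` and `status: summarized` field names come from the docs, but the function and the `(date, frontmatter)` pair representation are assumptions, not the real parser:

```python
import datetime

def topics_of_today(conversations, today=None):
    """Union the `topics:` of every conversation summarized today.

    `conversations` is a list of (date_str, frontmatter_dict) pairs;
    the real script parses these out of conversations/**/*.md.
    """
    today = today or datetime.date.today().isoformat()
    topic_set = set()
    for date_str, meta in conversations:
        if date_str == today and meta.get("status") == "summarized":
            topic_set.update(meta.get("topics", []))
    return topic_set
```

A light session day yields a small set and a cheap run; a heavy day widens the scope accordingly.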
+ +
+
+ 02 + Historical rollup: for each today-topic, pull ALL matching conversations + Context + +
+
+

Once the today-topic set is known, for each topic the script walks the entire conversation archive and pulls every summarized conversation that shares that topic. A discussion about blue-green-deploy today might roll up 16 conversations across the last 6 months. The claude -p call sees the full history, not just today's fragment.

+

This is what makes the distilled pages good. The LLM isn't guessing what a pattern looks like from one session — it's synthesizing across everything you've ever discussed on the topic.

+
+
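The rollup step is the "wide" half of the filter, and it can be sketched in a few lines (same assumed `(date, frontmatter)` representation as above, not the script's actual data model):

```python
def rollup_by_topic(today_topics, all_conversations):
    """For each today-topic, gather EVERY summarized conversation in
    the archive that shares it, regardless of date."""
    groups = {}
    for topic in sorted(today_topics):
        groups[topic] = [
            (date_str, meta)
            for date_str, meta in all_conversations
            if meta.get("status") == "summarized"
            and topic in meta.get("topics", [])
        ]
    return groups
```

One mention of a topic today is enough to pull months of prior discussion into the same `claude -p` call.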
+ +
+
+ 03 + Self-triggering: dormant topics wake up when they resurface + Emergent + +
+
+

The narrow-today/wide-history combination produces a useful emergent property: dormant topics wake up automatically. If you discussed database-migrations three months ago and it never came up again, it's not in the daily scope. But the day you mention it again in any new conversation, that topic enters today's set — and the rollup pulls in all three months of historical discussion. The wiki page gets updated with fresh synthesis across the full history without you having to manually trigger reprocessing.

+
What this means in practice: Old knowledge gets distilled when it becomes relevant again. You don't need to remember to ask "hey, is there a wiki page for X?" — the next time X comes up in a session, distill will check the wiki state and either create or update the page for you.
+
+
+ +
+
+ 04 + State tracking by content hash + topic set + .distill-state.json + +
+
+

A conversation is considered "already distilled" only if its body hash AND its topic set match what was seen at the last distill. If the body changes (summarizer re-ran and updated the bullets) OR a new topic is added, the conversation gets re-processed on the next run. Topics get tracked so rejected ones don't get reprocessed forever — if the LLM says "this topic doesn't deserve a wiki page" once, it stays rejected until something meaningful changes.

+
+
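The "body hash AND topic set" rule above is a short predicate. A sketch: the record shape and its field names (`"hash"`, `"topics"`) are assumptions about `.distill-state.json`, not its documented schema:

```python
import hashlib

def needs_redistill(body: str, topics, state_entry) -> bool:
    """True when the body hash OR the topic set changed since the
    last distill; `state_entry` mirrors one state-file record."""
    if state_entry is None:
        return True  # never distilled before
    body_hash = hashlib.sha256(body.encode()).hexdigest()
    # Any topic not seen at the previous distill forces a re-run
    new_topics = set(topics) - set(state_entry.get("topics", []))
    return body_hash != state_entry.get("hash") or bool(new_topics)
```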
+ +
+
+ 05 + Distill runs BEFORE harvest — conversation content has priority + Phase 1a + +
+
+

The orchestrator runs distill as Phase 1a and harvest as Phase 1b. Deliberate: if a topic is being actively discussed in your sessions, you want the wiki page to reflect your synthesis of what you've learned, not just the external URL cited in passing. URL harvesting then fills in gaps — it picks up the docs pages, blog posts, and references that your sessions didn't already cover.

+
Both phases can produce staging pages. If distill creates patterns/docker-hardening.md and harvest creates patterns/docker-hardening.md, the staging-unique-path helper appends a short hash suffix so they don't collide. The reviewer sees both in staging and picks the better one (usually distill, since it has historical context).
+
+
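The collision handling described above (distill and harvest staging the same target path) can be sketched as a hash-suffix helper. This models the behavior only; the actual helper in wiki_lib.py, its name, and its suffix scheme may differ:

```python
import hashlib
from pathlib import Path

def staging_unique_path(staging_dir: str, rel_path: str, seed: str) -> Path:
    """Return a collision-free staging path: if the target already
    exists, append a short hash suffix derived from `seed` so both
    pipeline phases can stage their own candidate page."""
    candidate = Path(staging_dir) / rel_path
    if not candidate.exists():
        return candidate
    suffix = hashlib.sha1(seed.encode()).hexdigest()[:8]
    return candidate.with_name(f"{candidate.stem}-{suffix}{candidate.suffix}")
```

The reviewer then sees both candidates in staging and promotes the better one.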
+ +
+ + +
+

Distill Staging Provenance

+ +
+ +

Every distilled page lands in staging with full provenance in its frontmatter. When you review a page in staging, you can see exactly which conversations it came from and jump directly to those transcripts.

+ +
+
Example: staging/patterns/zoho-crm-integration.md frontmatter
+
---
+origin: automated
+status: pending
+staged_date: 2026-04-12
+staged_by: wiki-distill
+target_path: patterns/zoho-crm-integration.md
+distill_topic: zoho-api
+distill_source_conversations: conversations/general/2026-04-06-73d15650.md,conversations/mc/2026-03-30-64089d1d.md
+compilation_notes: Two separate incidents discovered the same Zoho CRM v2 API limitations, documenting them as a pattern page prevents re-investigation and provides a canonical reference for future Zoho integrations.
+title: Zoho CRM Integration
+type: pattern
+confidence: high
+sources: [conversations/general/2026-04-06-73d15650.md, conversations/mc/2026-03-30-64089d1d.md]
+related: [database-migrations.md, activity-event-auditing.md]
+last_compiled: 2026-04-12
+last_verified: 2026-04-12
+---
+
+ +
+ Without distillation, MemPalace was a searchable archive sitting beside the wiki. With distillation, it's a real ingest pipeline — closet content becomes the source material for the wiki proper, completing the eight-extension story. + — memex design rationale, April 2026 +
+ +
+