Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
.gitignore (vendored, new file, 35 lines)
@@ -0,0 +1,35 @@
# Conversation extraction state — per-machine byte offsets, not portable
.mine-state.json

# Log files from the mining and maintenance pipelines
scripts/.mine.log
scripts/.maintain.log
scripts/.sync.log
scripts/.summarize-claude.log
scripts/.summarize-claude-retry.log

# Python bytecode and cache
__pycache__/
*.py[cod]
*$py.class
.pytest_cache/
.mypy_cache/
.ruff_cache/

# Editor / OS noise
.DS_Store
.vscode/
.idea/
*.swp
*~

# Obsidian workspace state (keep the `.obsidian/` config if you use it,
# ignore only the ephemeral bits)
.obsidian/workspace.json
.obsidian/workspace-mobile.json
.obsidian/hotkeys.json

# NOTE: the following state files are NOT gitignored — they must sync
# across machines so both installs agree on what's been processed:
#   .harvest-state.json  (URL dedup)
#   .hygiene-state.json  (content hashes, deferred issues)
LICENSE (new file, 21 lines)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Eric Turner

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md (new file, 421 lines)
@@ -0,0 +1,421 @@
# LLM Wiki — Compounding Knowledge for AI Agents

A persistent, LLM-maintained knowledge base that sits between you and the
sources it was compiled from. Unlike RAG — which re-discovers the same
answers on every query — the wiki **gets richer over time**. Facts get
cross-referenced, contradictions get flagged, stale advice ages out and
gets archived, and new knowledge discovered during a session gets written
back so it's there next time.

The agent reads the wiki at the start of every session and updates it as
new things are learned. The wiki is the long-term memory; the session is
the working memory.

> **Inspiration**: this combines the ideas from
> [Andrej Karpathy's persistent-wiki gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
> and [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace),
> and adds an automation layer on top so the wiki maintains itself.

---

## The problem with stateless RAG

Most people's experience with LLMs and documents looks like RAG: you upload
files, the LLM retrieves chunks at query time, generates an answer, done.
This works — but the LLM is rediscovering knowledge from scratch on every
question. There's no accumulation.

Ask the same subtle question twice and the LLM does all the same work twice.
Ask something that requires synthesizing five documents and the LLM has to
find and piece together the relevant fragments every time. Nothing is built
up. NotebookLM, ChatGPT file uploads, and most RAG systems work this way.

Worse, raw sources go stale. URLs rot. Documentation lags. Blog posts
get retracted. If your knowledge base is "the original documents,"
stale advice keeps showing up alongside current advice and there's no way
to know which is which.

## The core idea — a compounding wiki

Instead of retrieving from raw documents at query time, the LLM
**incrementally builds and maintains a persistent wiki** — a structured,
interlinked collection of markdown files that sits between you and the
raw sources.

When a new source shows up (a doc page, a blog post, a CLI `--help`, a
conversation transcript), the LLM doesn't just index it. It reads it,
extracts what's load-bearing, and integrates it into the existing wiki —
updating topic pages, revising summaries, noting where new data
contradicts old claims, strengthening or challenging the evolving
synthesis. The knowledge is compiled once and then *kept current*, not
re-derived on every query.

This is the key difference: **the wiki is a persistent, compounding
artifact.** The cross-references are already there. The contradictions have
already been flagged. The synthesis already reflects everything the LLM
has read. The wiki gets richer with every source added and every question
asked.

You never (or rarely) write the wiki yourself. The LLM writes and maintains
all of it. You're in charge of sourcing, exploration, and asking the right
questions. The LLM does the summarizing, cross-referencing, filing, and
bookkeeping that make a knowledge base actually useful over time.

---

## What this adds beyond Karpathy's gist

Karpathy's gist describes the *idea* — a wiki the agent maintains. This
repo is a working implementation with an automation layer that handles the
lifecycle of knowledge, not just its creation:

| Layer | What it does |
|-------|--------------|
| **Conversation mining** | Extracts Claude Code session transcripts into searchable markdown. Summarizes them via `claude -p` with model routing (haiku for short sessions, sonnet for long ones). Links summaries to wiki pages by topic. |
| **URL harvesting** | Scans summarized conversations for external reference URLs. Fetches them via `trafilatura` → `crawl4ai` → stealth mode cascade. Compiles clean markdown into pending wiki pages. |
| **Human-in-the-loop staging** | Automated content lands in `staging/` with `status: pending`. You review via CLI, interactive prompts, or an in-session Claude review. Nothing automated goes live without approval. |
| **Staleness decay** | Every page tracks `last_verified`. After 6 months without a refresh signal, confidence decays `high → medium`; 9 months → `low`; 12 months → `stale` → auto-archived. |
| **Auto-restoration** | Archived pages that get referenced again in new conversations or wiki updates are automatically restored. |
| **Hygiene** | Daily structural checks (orphans, broken cross-refs, index drift, frontmatter repair). Weekly LLM-powered checks (duplicates, contradictions, missing cross-references). |
| **Orchestrator** | One script chains all of the above into a daily cron-able pipeline. |

The result: you don't have to maintain the wiki. You just *use* it. The
automation handles harvesting new knowledge, retiring old knowledge,
keeping cross-references intact, and flagging ambiguity for review.
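
The decay schedule above can be sketched in a few lines. This is an illustration only — the real constants and function names live in `wiki-hygiene.py` and may differ:

```python
# Confidence level implied by idle time alone; None means "no decay yet".
# Thresholds mirror the 6 / 9 / 12 month schedule described above
# (names here are assumptions, not the script's actual API).
DECAY_STEPS = [(365, "stale"), (270, "low"), (180, "medium")]

def confidence_after_decay(idle_days):
    """Return the decayed confidence for a page idle this many days."""
    for threshold, level in DECAY_STEPS:
        if idle_days >= threshold:
            return level
    return None  # fresh enough; the page keeps its current confidence
```

A page verified 200 days ago would step down to `medium`; past a year it lands in `stale` and gets auto-archived.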

---

## Why each part exists

Before implementing anything, the design was worked out interactively
with Claude as a [Signal & Noise analysis of Karpathy's
pattern](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c).
That analysis found seven real weaknesses in the core pattern. This
repo exists because each weakness has a concrete mitigation — and
every component maps directly to one:

| Karpathy-pattern weakness | How this repo answers it |
|---------------------------|--------------------------|
| **Errors persist and compound** | `confidence` field with time-based decay → pages age out visibly. Staging review catches automated content before it goes live. Full-mode hygiene does LLM contradiction detection. |
| **Hard ~50K-token ceiling** | `qmd` (BM25 + vector + re-ranking) set up from day one. Wing/room structural filtering narrows search before retrieval. Archive collection is excluded from default search. |
| **Manual cross-checking returns** | Every wiki claim traces back to immutable `raw/harvested/*.md` with SHA-256 hash. Staging review IS the cross-check. `compilation_notes` field makes review fast. |
| **Knowledge staleness** (the #1 failure mode in community data) | Daily + weekly cron removes "I forgot" as a failure mode. `last_verified` auto-refreshes from conversation references. Decayed pages auto-archive. |
| **Cognitive outsourcing risk** | Staging review forces engagement with every automated page. `qmd query` makes retrieval an active exploration. The wake-up briefing is ~200 tokens that the human reads too. |
| **Weaker semantic retrieval** | `qmd` hybrid (BM25 + vector). Full-mode hygiene adds missing cross-references. Structural metadata (wings, rooms) complements semantic search. |
| **No access control** | Git sync with `merge=union` markdown handling. Network-boundary ACL via Tailscale is the suggested path. *This one is a residual trade-off — see [DESIGN-RATIONALE.md](docs/DESIGN-RATIONALE.md).* |

The short version: Karpathy published the idea, the community found the
holes, and this repo is the automation layer that plugs the holes.
See **[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md)** for the
full argument with honest residual trade-offs and what this repo
explicitly does NOT solve.

---

## Compounding loop

```
┌─────────────────────┐
│  Claude Code        │
│  sessions (.jsonl)  │
└──────────┬──────────┘
           │ extract-sessions.py (hourly, no LLM)
           ▼
┌─────────────────────┐
│  conversations/     │  markdown transcripts
│  <project>/*.md     │  (status: extracted)
└──────────┬──────────┘
           │ summarize-conversations.py --claude (daily)
           ▼
┌─────────────────────┐
│  conversations/     │  summaries with related: wiki links
│  <project>/*.md     │  (status: summarized)
└──────────┬──────────┘
           │ wiki-harvest.py (daily)
           ▼
┌─────────────────────┐
│  raw/harvested/     │  fetched URL content
│  *.md               │  (immutable source material)
└──────────┬──────────┘
           │ claude -p compile step
           ▼
┌─────────────────────┐
│  staging/<type>/    │  pending pages
│  *.md               │  (status: pending, origin: automated)
└──────────┬──────────┘
           │ human review (wiki-staging.py --review)
           ▼
┌─────────────────────┐
│  patterns/          │  LIVE wiki
│  decisions/         │  (origin: manual or promoted-from-automated)
│  concepts/          │
│  environments/      │
└──────────┬──────────┘
           │ wiki-hygiene.py (daily quick / weekly full)
           │   - refresh last_verified from new conversations
           │   - decay confidence on idle pages
           │   - auto-restore archived pages referenced again
           │   - fuzzy-fix broken cross-references
           ▼
┌─────────────────────┐
│  archive/<type>/    │  stale/superseded content
│  *.md               │  (excluded from default search)
└─────────────────────┘
```

Every arrow is automated. The only human step is staging review — and
that's quick because the AI compilation step already wrote the page; you
just approve or reject.
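
The "fuzzy-fix broken cross-references" step amounts to mapping each dangling link to its nearest existing page name. A minimal sketch with the standard library (illustrative only; `wiki-hygiene.py`'s actual logic may differ):

```python
import difflib

def fix_broken_refs(refs, existing_pages, cutoff=0.8):
    """Map each broken reference to its closest existing page, if any.

    refs: page names found in `related:` fields; existing_pages: names
    actually on disk. Returns {broken_name: suggested_name}.
    """
    fixes = {}
    for ref in refs:
        if ref in existing_pages:
            continue  # not broken, nothing to fix
        match = difflib.get_close_matches(ref, existing_pages, n=1, cutoff=cutoff)
        if match:
            fixes[ref] = match[0]  # confident fuzzy match — rewrite the link
    return fixes
```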

---

## Quick start — two paths

### Path A: just the idea (Karpathy-style)

Open a Claude Code session in an empty directory and tell it:

```
I want you to start maintaining a persistent knowledge wiki for me.
Create a directory structure with patterns/, decisions/, concepts/, and
environments/ subdirectories. Each page should have YAML frontmatter with
title, type, confidence, sources, related, last_compiled, and last_verified
fields. Create an index.md at the root that catalogs every page.

From now on, when I share a source (a doc page, a CLI --help, a conversation
I had), read it, extract what's load-bearing, and integrate it into the
wiki. Update existing pages when new knowledge refines them. Flag
contradictions between pages. Create new pages when topics aren't
covered yet. Update index.md every time you create or remove a page.

When I ask a question, read the relevant wiki pages first, then answer.
If you rely on a wiki page with `confidence: low`, flag that to me.
```

That's the whole idea. The agent will build you a growing markdown tree
that compounds over time. This is the minimum viable version.
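
A page the agent produces under this prompt might look like the following. This is a hypothetical example — the file name, field values, and body are illustrative, not output from this repo:

```markdown
---
title: Git Worktree Workflow
type: pattern
confidence: high
sources:
  - raw/harvested/git-worktree-docs.md
related:
  - decisions/monorepo-layout.md
last_compiled: 2025-01-15
last_verified: 2025-01-15
---

Use one worktree per concurrent task so branches never fight over a
single working directory. ...
```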

### Path B: the full automation (this repo)

```bash
git clone <this-repo> ~/projects/wiki
cd ~/projects/wiki

# Install the Python extraction tools
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup

# Install qmd for full-text + vector search
npm install -g @tobilu/qmd

# Configure qmd (3 collections — see docs/SETUP.md for the YAML)
# Edit scripts/extract-sessions.py with your project codes
# Edit scripts/update-conversation-index.py with matching display names

# Copy the example CLAUDE.md files (wiki schema + global instructions)
cp docs/examples/wiki-CLAUDE.md CLAUDE.md
cat docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# edit both for your conventions

# Run the full pipeline once, manually
bash scripts/mine-conversations.sh --extract-only      # Fast, no LLM
python3 scripts/summarize-conversations.py --claude    # Classify + summarize
python3 scripts/update-conversation-index.py --reindex

# Then maintain
bash scripts/wiki-maintain.sh                          # Daily hygiene
bash scripts/wiki-maintain.sh --hygiene-only --full    # Weekly deep pass
```

See [`docs/SETUP.md`](docs/SETUP.md) for complete setup including qmd
configuration (three collections: `wiki`, `wiki-archive`,
`wiki-conversations`), optional cron schedules, git sync, and the
post-merge hook. See [`docs/examples/`](docs/examples/) for starter
`CLAUDE.md` files (wiki schema + global instructions) with explicit
guidance on using the three qmd collections.

---

## Directory layout after setup

```
wiki/
├── CLAUDE.md              ← Schema + instructions the agent reads every session
├── index.md               ← Content catalog (the agent reads this first)
├── patterns/              ← HOW things should be built (LIVE)
├── decisions/             ← WHY we chose this approach (LIVE)
├── concepts/              ← WHAT the foundational ideas are (LIVE)
├── environments/          ← WHERE implementations differ (LIVE)
├── staging/               ← PENDING automated content awaiting review
│   ├── index.md
│   └── <type>/
├── archive/               ← STALE / superseded (excluded from search)
│   ├── index.md
│   └── <type>/
├── raw/                   ← Immutable source material (never modified)
│   ├── <topic>/
│   └── harvested/         ← URL harvester output
├── conversations/         ← Mined Claude Code session transcripts
│   ├── index.md
│   └── <project>/
├── context/               ← Auto-updated AI session briefing
│   ├── wake-up.md         ← Loaded at the start of every session
│   └── active-concerns.md ← Current blockers and focus areas
├── reports/               ← Hygiene operation logs
├── scripts/               ← The automation pipeline
├── tests/                 ← Pytest suite (171 tests)
├── .harvest-state.json    ← URL dedup state (committed, synced)
├── .hygiene-state.json    ← Content hashes, deferred issues (committed, synced)
└── .mine-state.json       ← Conversation extraction offsets (gitignored, per-machine)
```

---

## What's Claude-specific (and what isn't)

This repo is built around **Claude Code** as the agent. Specifically:

1. **Session mining** expects `~/.claude/projects/<hashed-path>/*.jsonl`
   files written by the Claude Code CLI. Other agents won't produce these.
2. **Summarization** uses `claude -p` (the Claude Code CLI's one-shot mode)
   with haiku/sonnet routing by conversation length. Other LLM CLIs would
   need a different wrapper.
3. **URL compilation** uses `claude -p` to turn raw harvested content into
   a wiki page with proper frontmatter.
4. **The agent itself** (the thing that reads `CLAUDE.md` and maintains the
   wiki conversationally) is Claude Code. Any agent that reads markdown
   and can write files could do this job — `CLAUDE.md` is just a text
   file telling the agent what the wiki's conventions are.

**What's NOT Claude-specific**:

- The wiki schema (frontmatter, directory layout, lifecycle states)
- The staleness decay model and archive/restore semantics
- The human-in-the-loop staging workflow
- The hygiene checks (orphans, broken cross-refs, duplicates)
- The `trafilatura` + `crawl4ai` URL fetching
- The qmd search integration
- The git-based cross-machine sync

If you use a different agent, you replace parts **1-4** above with
equivalents for your agent. The other 80% of the repo is agent-agnostic.
See [`docs/CUSTOMIZE.md`](docs/CUSTOMIZE.md) for concrete adaptation
recipes.

---

## Architecture at a glance

Eleven scripts organized in three layers:

**Mining layer** (ingests conversations):
- `extract-sessions.py` — Parse Claude Code JSONL → markdown transcripts
- `summarize-conversations.py` — Classify + summarize via `claude -p`
- `update-conversation-index.py` — Regenerate conversation index + wake-up context

**Automation layer** (maintains the wiki):
- `wiki_lib.py` — Shared frontmatter parser, `WikiPage` dataclass, constants
- `wiki-harvest.py` — URL classification + fetch cascade + compile to staging
- `wiki-staging.py` — Human review (list/promote/reject/review/sync)
- `wiki-hygiene.py` — Quick + full hygiene checks, archival, auto-restore
- `wiki-maintain.sh` — Top-level orchestrator chaining harvest + hygiene
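
The shared frontmatter parser could be sketched along these lines. This is an assumption-laden sketch, not `wiki_lib.py`'s actual code — the real parser would use a YAML library rather than naive `key: value` splitting:

```python
import re

# A page is "---\n<yaml>\n---\n<markdown body>"
FRONTMATTER_RE = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)

def parse_frontmatter(text):
    """Split a wiki page into (frontmatter dict, markdown body).

    Handles only flat `key: value` lines; nested lists would need
    a real YAML parser.
    """
    m = FRONTMATTER_RE.match(text)
    if not m:
        return {}, text  # no frontmatter block at all
    meta = {}
    for line in m.group(1).splitlines():
        if ":" in line and not line.startswith((" ", "-")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, text[m.end():]
```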

**Sync layer**:
- `wiki-sync.sh` — Git commit/pull/push with merge-union markdown handling
- `mine-conversations.sh` — Mining orchestrator
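
The merge-union handling presumably rests on a `.gitattributes` entry like the following (an assumption — see `docs/SETUP.md` for the repo's actual sync configuration):

```
# .gitattributes — on conflicting markdown edits, keep both sides'
# lines instead of emitting conflict markers
*.md merge=union
```

Union merges never block a sync with conflict markers, but they can duplicate or misorder lines, which is why they suit append-heavy files like indexes and conversation logs better than carefully ordered prose.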

See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for a deeper tour.

---

## Why markdown, not a real database?

Markdown files are:

- **Human-readable without any tooling** — you can browse in Obsidian, VS Code, or `cat`
- **Git-native** — full history, branching, rollback, cross-machine sync for free
- **Agent-friendly** — every LLM was trained on markdown, so reading and writing it is free
- **Durable** — no schema migrations, no database corruption, no vendor lock-in
- **Interoperable** — Obsidian graph view, `grep`, `qmd`, `ripgrep`, any editor

A SQLite file with the same content would be faster to query but harder
to browse, harder to merge, harder to audit, and fundamentally less
*collaborative* between you and the agent. Markdown is to knowledge
management what Postgres is to transactions.

---

## Testing

Full pytest suite in `tests/` — 171 tests across all scripts, runs in
**~1.3 seconds**, no network or LLM calls needed, works on macOS and
Linux/WSL.

```bash
cd tests && python3 -m pytest
# or
bash tests/run.sh
```

The test suite uses a disposable `tmp_wiki` fixture so no test ever
touches your real wiki.
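
Such a fixture might look like this. Everything here — the helper name, the directory tuple, the index stub — is illustrative, not the suite's actual code:

```python
from pathlib import Path

WIKI_DIRS = ("patterns", "decisions", "concepts", "environments",
             "staging", "archive", "raw", "conversations")

def make_wiki_skeleton(root):
    """Create the minimal directory tree the pipeline scripts expect.

    In conftest.py this would be wrapped as a pytest fixture:
        @pytest.fixture
        def tmp_wiki(tmp_path):
            return make_wiki_skeleton(tmp_path)
    """
    root = Path(root)
    for d in WIKI_DIRS:
        (root / d).mkdir()
    (root / "index.md").write_text("# Index\n")
    return root
```

Because every test receives a fresh skeleton under pytest's `tmp_path`, nothing can leak into (or read from) the wiki you actually use.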

---

## Credits and inspiration

This repo is a synthesis of two existing ideas with an automation layer
on top. It would not exist without either of them.

**Core pattern — [Andrej Karpathy — "Agent-Maintained Persistent Wiki" gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)**
The foundational idea of a compounding LLM-maintained wiki that moves
synthesis from query-time (RAG) to ingest-time. This repo is an
implementation of Karpathy's pattern with the community-identified
failure modes plugged.

**Structural memory taxonomy — [milla-jovovich/mempalace](https://github.com/milla-jovovich/mempalace)**
The wing/room/hall/closet/drawer/tunnel concepts that turn a flat
corpus into something you can navigate without reading everything. See
[`ARCHITECTURE.md#borrowed-concepts`](docs/ARCHITECTURE.md#borrowed-concepts)
for the explicit mapping of MemPalace terms to this repo's
implementation.

**Search layer — [qmd](https://github.com/tobi/qmd)** by Tobi Lütke
(Shopify CEO). Local BM25 + vector + LLM re-ranking on markdown files.
Chosen over ChromaDB because it uses the same storage format as the
wiki — one index to maintain, not two. Explicitly recommended by
Karpathy as well.

**URL extraction stack** — [trafilatura](https://github.com/adbar/trafilatura)
for fast static-page extraction and [crawl4ai](https://github.com/unclecode/crawl4ai)
for JS-rendered and anti-bot cases. The two-tool cascade handles
essentially any web content without needing a full browser stack for
simple pages.
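
The cascade pattern itself is tool-agnostic: try the cheap fetcher first and fall through on failure or thin output. A sketch (the function name is an assumption; the default threshold mirrors `min_content_length: 100` in `config.example.yaml`):

```python
def fetch_cascade(url, fetchers, min_length=100):
    """Try each fetcher in order; accept the first result that looks real.

    `fetchers` is an ordered list of callables — e.g. a trafilatura
    wrapper first, then crawl4ai, then a stealth-mode variant — each
    returning extracted text/markdown or None on failure.
    """
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue  # a fetcher crashing just means "fall through"
        if content and len(content) >= min_length:
            return content
    return None  # every tier failed; the harvester records a failed attempt
```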

**The agent** — [Claude Code](https://claude.com/claude-code) by Anthropic.
The repo is Claude-specific (see the section above for what that means
and how to adapt for other agents).

**Design process** — this repo was designed interactively with Claude
as a structured Signal & Noise analysis before any code was written.
The interactive design artifact is here:
[The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c).
That artifact walks through the seven real strengths and seven real
weaknesses of the core pattern, then works through concrete mitigations
for each weakness. Every component in this repo maps back to a specific
mitigation identified there.
[`docs/DESIGN-RATIONALE.md`](docs/DESIGN-RATIONALE.md) is the condensed
version of that analysis as it applies to this implementation.

---

## License

MIT — see [`LICENSE`](LICENSE).

## Contributing

This is a personal project that I'm making public in case the pattern is
useful to others. Issues and PRs welcome, but I make no promises about
response time. If you fork and make it your own, I'd love to hear how you
adapted it.
config.example.yaml (new file, 114 lines)
@@ -0,0 +1,114 @@
# Example configuration — copy to config.yaml and edit for your setup.
#
# This file is NOT currently read by any script (see docs/CUSTOMIZE.md,
# "What I'd change if starting over", #1). The scripts use inline
# constants with "CONFIGURE ME" comments instead. This file is a
# template for a future refactor and a reference for what the
# configurable surface looks like.
#
# For now, edit the constants directly in:
#   scripts/extract-sessions.py          (PROJECT_MAP)
#   scripts/update-conversation-index.py (PROJECT_NAMES, PROJECT_ORDER)
#   scripts/wiki-harvest.py              (SKIP_DOMAIN_PATTERNS)

# ─── Project / wing configuration ──────────────────────────────────────────
projects:
  # Map Claude Code directory suffixes to short project codes (wings)
  map:
    projects-wiki: wiki    # this wiki's own sessions
    -claude: cl            # ~/.claude config repo
    my-webapp: web         # your project dirs
    mobile-app: mob
    work-monorepo: work
    -home: general         # catch-all
    -Users: general

  # Display names for each project code
  names:
    wiki: WIKI — This Wiki
    cl: CL — Claude Config
    web: WEB — My Webapp
    mob: MOB — Mobile App
    work: WORK — Day Job
    general: General — Cross-Project

  # Display order (most-active first)
  order:
    - work
    - web
    - mob
    - wiki
    - cl
    - general

# ─── URL harvesting configuration ──────────────────────────────────────────
harvest:
  # Domains to always skip (internal, ephemeral, personal).
  # Patterns use re.search, so unanchored suffixes like \.example\.com$ work.
  skip_domains:
    - \.atlassian\.net$
    - ^app\.asana\.com$
    - ^(www\.)?slack\.com$
    - ^(www\.)?discord\.com$
    - ^mail\.google\.com$
    - ^calendar\.google\.com$
    - ^.+\.local$
    - ^.+\.internal$
    # Add your own:
    - \.mycompany\.com$
    - ^git\.mydomain\.com$

  # Type C URLs (issue trackers, Q&A) — only harvested if topic covered
  c_type_patterns:
    - ^https?://github\.com/[^/]+/[^/]+/issues/\d+
    - ^https?://github\.com/[^/]+/[^/]+/pull/\d+
    - ^https?://(www\.)?stackoverflow\.com/questions/\d+

  # Fetch behavior
  fetch_delay_seconds: 2
  max_failed_attempts: 3
  min_content_length: 100
  fetch_timeout: 45

# ─── Hygiene / staleness configuration ─────────────────────────────────────
hygiene:
  # Confidence decay thresholds (days since last_verified)
  decay:
    high_to_medium: 180   # 6 months
    medium_to_low: 270    # 9 months (6+3)
    low_to_stale: 365     # 12 months (6+3+3)

  # Pages with body shorter than this are flagged as stubs
  empty_stub_threshold_chars: 100

  # Version regex for technology lifecycle checks (which tools to track)
  version_regex: '\b(?:Node(?:\.js)?|Python|Docker|PostgreSQL|MySQL|Redis|Next\.js|NestJS)\s+(\d+(?:\.\d+)?)'

# ─── LLM configuration ─────────────────────────────────────────────────────
llm:
  # Which backend to use for summarization and compilation
  # Options: claude | openai | local | ollama
  backend: claude

  # Routing threshold — sessions/content above this use the larger model
  long_threshold_chars: 20000
  long_threshold_messages: 200

  # Per-backend settings
  claude:
    short_model: haiku
    long_model: sonnet
    timeout: 600

  openai:
    short_model: gpt-4o-mini
    long_model: gpt-4o
    api_key_env: OPENAI_API_KEY

  local:
    base_url: http://localhost:8080/v1
    model: Phi-4-14B-Q4_K_M

  ollama:
    base_url: http://localhost:11434/v1
    model: phi4:14b
docs/ARCHITECTURE.md (new file, 360 lines)
@@ -0,0 +1,360 @@
|
|||||||
|
# Architecture
|
||||||
|
|
||||||
|
Eleven scripts across three conceptual layers. This document walks through
|
||||||
|
what each one does, how they talk to each other, and where the seams are
|
||||||
|
for customization.
|
||||||
|
|
||||||
|
> **See also**: [`DESIGN-RATIONALE.md`](DESIGN-RATIONALE.md) — the *why*
|
||||||
|
> behind each component, with links to the interactive design artifact.
|
||||||
|
|
||||||
|
## Borrowed concepts
|
||||||
|
|
||||||
|
The architecture is a synthesis of two external ideas with an automation
|
||||||
|
layer on top. The terminology often maps 1:1, so it's worth calling out
|
||||||
|
which concepts came from where:
|
||||||
|
|
||||||
|
### From Karpathy's persistent-wiki gist

| Concept | How this repo implements it |
|---------|-----------------------------|
| Immutable `raw/` sources | `raw/` directory — never modified by the agent |
| LLM-compiled `wiki/` pages | `patterns/` `decisions/` `concepts/` `environments/` |
| Schema file disciplining the agent | `CLAUDE.md` at the wiki root |
| Periodic "lint" passes | `wiki-hygiene.py --quick` (daily) + `--full` (weekly) |
| Wiki as fine-tuning material | Clean markdown body is ready for synthetic training data |

### From [mempalace](https://github.com/milla-jovovich/mempalace)

MemPalace gave us the structural memory taxonomy that turns a flat
corpus into something you can navigate without reading everything. The
concepts map directly:

| MemPalace term | Meaning | How this repo implements it |
|----------------|---------|-----------------------------|
| **Wing** | Per-person or per-project namespace | Project code in `conversations/<code>/` (set by `PROJECT_MAP` in `extract-sessions.py`) |
| **Room** | Topic within a wing | `topics:` frontmatter field on summarized conversation files |
| **Closet** | Summary layer — high-signal compressed knowledge | The summary body written by `summarize-conversations.py --claude` |
| **Drawer** | Verbatim archive, never lost | The extracted transcript under `conversations/<wing>/*.md` (before summarization) |
| **Hall** | Memory-type corridor (fact / event / discovery / preference / advice / tooling) | `halls:` frontmatter field classified by the summarizer |
| **Tunnel** | Cross-wing connection — same topic in multiple projects | `related:` frontmatter linking conversations to wiki pages and to each other |

The key benefit of wing + room filtering is documented in MemPalace's
benchmarks as a **+34% retrieval boost** over flat search — because
`qmd` can search a pre-narrowed subset of the corpus instead of
everything. This is why the wiki scales past the Karpathy pattern's
~50K-token ceiling without needing a full vector DB rebuild.
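
The pre-narrowing idea is simple to sketch: filter the corpus by frontmatter before handing anything to search. A minimal illustration, assuming pages have already been parsed into dicts with the `project`/`topics` keys this repo's schema uses (the helper name and data shapes are illustrative, not the real `qmd` interface):

```python
def narrow(pages, wing=None, room=None):
    """Keep only pages in the given wing (project) and room (topic).

    `pages` is an iterable of dicts shaped like parsed frontmatter:
    {"project": ..., "topics": [...], "body": ...}.
    """
    out = []
    for page in pages:
        if wing and page.get("project") != wing:
            continue
        if room and room not in page.get("topics", []):
            continue
        out.append(page)
    return out

corpus = [
    {"project": "api", "topics": ["auth"], "body": "JWT rotation notes"},
    {"project": "api", "topics": ["infra"], "body": "Deploy pipeline"},
    {"project": "web", "topics": ["auth"], "body": "Login form UX"},
]
# Search only the api-wing auth-room subset instead of everything:
subset = narrow(corpus, wing="api", room="auth")
```

The retrieval win comes from the search engine scoring a handful of candidate pages rather than the whole corpus.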
### What this repo adds

Automation + lifecycle management on top of both:

- **Automation layer** — cron-friendly orchestration via `wiki-maintain.sh`
- **Staging pipeline** — human-in-the-loop checkpoint for automated content
- **Confidence decay + auto-archive + auto-restore** — the retention curve
- **`qmd` integration** — the scalable search layer (chosen over ChromaDB
  because it uses markdown storage like the wiki itself)
- **Hygiene reports** — fixed vs needs-review separation
- **Cross-machine sync** — git with markdown merge-union

---

## Overview

```
┌─────────────────────────────────┐
│          SYNC LAYER             │
│  wiki-sync.sh                   │  (git commit/pull/push, qmd reindex)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│         MINING LAYER            │
│  extract-sessions.py            │  (Claude Code JSONL → markdown)
│  summarize-conversations.py     │  (LLM classify + summarize)
│  update-conversation-index.py   │  (regenerate indexes + wake-up)
│  mine-conversations.sh          │  (orchestrator)
└─────────────────────────────────┘
                 │
┌─────────────────────────────────┐
│        AUTOMATION LAYER         │
│  wiki_lib.py (shared helpers)   │
│  wiki-harvest.py                │  (URL → raw → staging)
│  wiki-staging.py                │  (human review)
│  wiki-hygiene.py                │  (decay, archive, repair, checks)
│  wiki-maintain.sh               │  (orchestrator)
└─────────────────────────────────┘
```

Each layer is independent — you can run the mining layer without the
automation layer, or vice versa. The layers communicate through files on
disk (conversation markdown, raw harvested pages, staging pages, wiki
pages), never through in-memory state.

---

## Mining layer

### `extract-sessions.py`

Parses Claude Code JSONL session files from `~/.claude/projects/` into
clean markdown transcripts under `conversations/<project-code>/`.
Deterministic, no LLM calls. Incremental — tracks byte offsets in
`.mine-state.json` so it safely re-runs on partially-processed sessions.

Key features:
- Summarizes tool calls intelligently: full output for `Bash` and `Skill`,
  paths-only for `Read`/`Glob`/`Grep`, path + summary for `Edit`/`Write`
- Caps Bash output at 200 lines to prevent transcript bloat
- Handles session resumption — if a session has grown since last extraction,
  it appends new messages without re-processing old ones
- Maps Claude project directory names to short wiki codes via `PROJECT_MAP`
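
The byte-offset trick behind the incremental behavior can be sketched in a few lines. This is a minimal stand-in, not the real extractor: the state dict mirrors what `.mine-state.json` tracks conceptually, but the real file's shape may differ:

```python
import json
import os

def read_new_lines(path: str, state: dict) -> list:
    """Read only JSONL records appended since the last run.

    `state` maps file path -> byte offset already processed; it is
    updated in place after each read.
    """
    offset = state.get(path, 0)
    size = os.path.getsize(path)
    if size <= offset:          # nothing new since last run
        return []
    records = []
    with open(path, "rb") as f:
        f.seek(offset)          # skip everything already processed
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
        state[path] = f.tell()  # remember where we stopped
    return records
```

Because only the offset is stored, a re-run over a grown session file picks up exactly the appended messages, which is what makes session resumption cheap.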

### `summarize-conversations.py`

Sends extracted transcripts to an LLM for classification and summarization.
Supports two backends:

1. **`--claude` mode** (recommended): uses `claude -p` with
   haiku for short sessions (≤200 messages) and sonnet for longer ones.
   Runs chunked over long transcripts, keeping a rolling context window.

2. **Local LLM mode** (default, omit `--claude`): uses a local
   `llama-server` instance at `localhost:8080` (or the WSL gateway on
   port 8081 under Windows Subsystem for Linux). Requires llama.cpp
   installed and a GGUF model loaded.

Output: adds frontmatter to each conversation file — `topics`, `halls`
(fact/discovery/preference/advice/event/tooling), and `related` wiki
page links. The `related` links are load-bearing: they're what
`wiki-hygiene.py` uses to refresh `last_verified` on pages that are still
being discussed.
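
The short/long routing is a one-line decision. A sketch under the 200-message threshold described above (the function name is illustrative; the real script's helper may be named differently):

```python
def pick_model(message_count: int, long_threshold: int = 200) -> str:
    """Route short sessions to the cheap model, long ones to the big one.

    Sessions with more than `long_threshold` messages go to sonnet,
    everything else to haiku, mirroring the split described above.
    """
    return "sonnet" if message_count > long_threshold else "haiku"
```

The same shape works for any cheap/expensive model pair: swap the two return values when porting to another backend.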

### `update-conversation-index.py`

Regenerates three files from the summarized conversations:

- `conversations/index.md` — catalog of all conversations grouped by project
- `context/wake-up.md` — a ~200-token briefing the agent loads at the start
  of every session ("current focus areas, recent decisions, active
  concerns")
- `context/active-concerns.md` — longer-form current state

The wake-up file is important: it's what gives the agent *continuity*
across sessions without forcing you to re-explain context every time.

### `mine-conversations.sh`

Orchestrator chaining extract → summarize → index. Supports
`--extract-only`, `--summarize-only`, `--index-only`, `--project <code>`,
and `--dry-run`.

---

## Automation layer

### `wiki_lib.py`

The shared library. Everything in the automation layer imports from here.
Provides:

- `WikiPage` dataclass — path + frontmatter + body + raw YAML
- `parse_page(path)` — safe markdown parser with YAML frontmatter
- `parse_yaml_lite(text)` — subset YAML parser (no external deps, handles
  the frontmatter patterns we use)
- `serialize_frontmatter(fm)` — writes YAML back in canonical key order
- `write_page(page, ...)` — full round-trip writer
- `page_content_hash(page)` — body-only SHA-256 for change detection
- `iter_live_pages()` / `iter_staging_pages()` / `iter_archived_pages()`
- Shared constants: `WIKI_DIR`, `STAGING_DIR`, `ARCHIVE_DIR`, etc.

All paths honor the `WIKI_DIR` environment variable, so tests and
alternate installs can override the root.
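
To make the parse-and-hash round trip concrete, here is a minimal sketch of the idea. The real `wiki_lib.py` helpers are richer; the function names and return shapes here are simplified stand-ins:

```python
import hashlib

def parse_page_text(text):
    """Split a markdown document into (frontmatter_text, body).

    A minimal stand-in for parse_page(): frontmatter is the block
    between the leading '---' fences, body is everything after.
    """
    if text.startswith("---\n"):
        fm, _, body = text[4:].partition("\n---\n")
        return fm, body.lstrip("\n")
    return "", text

def body_hash(text):
    """Body-only SHA-256, so metadata edits don't count as content changes."""
    _, body = parse_page_text(text)
    return hashlib.sha256(body.encode()).hexdigest()
```

Hashing only the body is the design choice that matters: hygiene runs can bump `last_verified` or `confidence` without invalidating the "page unchanged, skip it" cache.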

### `wiki-harvest.py`

Scans summarized conversations for HTTP(S) URLs, classifies them,
fetches content, and compiles pending wiki pages.

URL classification:
- **Harvest** (Type A/B) — docs, articles, blogs → fetch and compile
- **Check** (Type C) — GitHub issues, Stack Overflow — only harvest if
  the topic is already covered in the wiki (to avoid noise)
- **Skip** (Type D) — internal domains, localhost, private IPs, chat tools

Fetch cascade (tries in order, validates at each step):
1. `trafilatura -u <url> --markdown --no-comments --precision`
2. `crwl <url> -o markdown-fit`
3. `crwl <url> -o markdown-fit -b "user_agent_mode=random" -c "magic=true"` (stealth)
4. Conversation-transcript fallback — pull inline content from where the
   URL was mentioned during the session

Validated content goes to `raw/harvested/<domain>-<path>.md` with
frontmatter recording source URL, fetch method, and a content hash.
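
The cascade itself is just a try-in-order loop with validation between attempts. A generic sketch, with fetcher callables standing in for the trafilatura/crwl invocations (the real script shells out to those CLIs; the validity check here is a deliberately crude length floor):

```python
def fetch_with_cascade(url, fetchers, min_chars=200):
    """Try each fetcher in order; return (method_name, content) for the
    first result that passes a basic validity check, else (None, None).

    `fetchers` is an ordered list of (name, callable) pairs, standing in
    for the trafilatura -> crwl -> crwl-stealth -> transcript chain.
    """
    for name, fetch in fetchers:
        try:
            content = fetch(url)
        except Exception:
            continue                      # fetcher crashed: fall through
        if content and len(content) >= min_chars:
            return name, content          # validated: stop the cascade
    return None, None
```

Recording which method succeeded (the `name`) is what lets the harvested file's frontmatter say how the content was obtained.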

Compilation step: sends the raw content + `index.md` + conversation
context to `claude -p`, asking for a JSON verdict:
- `new_page` — create a new wiki page
- `update_page` — update an existing page (with `modifies:` field)
- `both` — do both
- `skip` — content isn't substantive enough

Result lands in `staging/<type>/` with `origin: automated`,
`status: pending`, and all the staging-specific frontmatter that gets
stripped on promotion.

### `wiki-staging.py`

Pure file operations — no LLM calls. Human review pipeline for automated
content.

Commands:
- `--list` / `--list --json` — pending items with metadata
- `--stats` — counts by type/source + age stats
- `--review` — interactive a/r/s/q loop with preview
- `--promote <path>` — approve, strip staging fields, move to live, update
  the main index, rewrite cross-refs, preserve `origin: automated` as an
  audit trail
- `--reject <path> --reason "..."` — delete, record in the
  `.harvest-state.json` `rejected_urls` list so the harvester won't
  re-create the page
- `--promote-all` — bulk approve everything
- `--sync` — regenerate `staging/index.md`, detect drift
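
The frontmatter handling on promotion is worth a sketch: drop the staging-only keys, but keep `origin: automated` as the audit trail. The exact set of staging-only keys below is illustrative (only `status` is confirmed by the conventions described above):

```python
# Staging-only keys removed on promotion; `origin` survives as audit trail.
# This set is illustrative -- check the real script for the full list.
STAGING_ONLY_FIELDS = {"status", "staged_at"}

def promote_frontmatter(fm: dict) -> dict:
    """Return live-page frontmatter: staging fields gone, origin kept."""
    return {k: v for k, v in fm.items() if k not in STAGING_ONLY_FIELDS}
```

Keeping `origin` means you can always query later which live pages arrived via the automated pipeline.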

### `wiki-hygiene.py`

The heavy lifter. Two modes:

**Quick mode** (no LLM, ~1 second on a 100-page wiki, run daily):
- Backfill `last_verified` from `last_compiled`/git/mtime
- Refresh `last_verified` from conversation `related:` links — this is
  the "something's still being discussed" signal
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing/invalid fields get sensible defaults)
- Apply confidence decay per thresholds (6/9/12 months)
- Archive stale and superseded pages
- Detect index drift (pages on disk not in the index, stale index entries)
- Detect orphan pages (no inbound links) and auto-add them to the index
- Detect broken cross-references, fuzzy-match to the intended target
  via `difflib.get_close_matches`, fix in place
- Report empty stubs (body < 100 chars)
- Detect state-file drift (references to missing files)
- Regenerate `staging/index.md` and `archive/index.md` if out of sync
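
The decay step in the list above is a pure function of page age. A sketch using the 6/9/12-month thresholds expressed in days (the constant and function names follow the ones this repo documents, but the bodies here are simplified):

```python
DECAY_HIGH_TO_MEDIUM = 180   # ~6 months
DECAY_MEDIUM_TO_LOW = 270    # ~9 months
DECAY_LOW_TO_STALE = 365     # ~12 months

def expected_confidence(days_since_verified):
    """Map age (days since last_verified) to a confidence ceiling."""
    if days_since_verified < DECAY_HIGH_TO_MEDIUM:
        return "high"
    if days_since_verified < DECAY_MEDIUM_TO_LOW:
        return "medium"
    if days_since_verified < DECAY_LOW_TO_STALE:
        return "low"
    return "stale"
```

A page only decays; refreshing `last_verified` (via `related:` links) is what resets the clock.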

**Full mode** (LLM-powered, run weekly — extends quick mode with):
- Missing cross-references (haiku, batched 5 pages per call)
- Duplicate coverage (sonnet — the weaker page is merged into the
  stronger one, and the loser is auto-archived with
  `archived_reason: Merged into <winner>`)
- Contradictions (sonnet, **report-only** — the human decides)
- Technology lifecycle (regex + conversation comparison — flags pages
  mentioning `Node 18` when recent conversations are using `Node 20`)
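
The lifecycle check boils down to running a version regex over page bodies and comparing the hits against recent conversations. A sketch using the same pattern shown in the sample config's `version_regex` (the helper name is illustrative):

```python
import re

# Tool/runtime name followed by a major (and optional minor) version,
# matching the sample config's version_regex.
VERSION_REGEX = re.compile(
    r'\b(?:Node(?:\.js)?|Python|Docker|PostgreSQL|MySQL|Redis|Next\.js|NestJS)'
    r'\s+(\d+(?:\.\d+)?)'
)

def mentioned_versions(text):
    """Return every version number mentioned next to a tracked tool."""
    return VERSION_REGEX.findall(text)
```

Comparing `mentioned_versions(page_body)` against `mentioned_versions(recent_conversations)` is enough to flag a page that still says `Node 18` in a `Node 20` world.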

State lives in `.hygiene-state.json` — tracks content hashes per page so
full-mode runs can skip unchanged pages. Reports land in
`reports/hygiene-YYYY-MM-DD-{fixed,needs-review}.md`.

### `wiki-maintain.sh`

Top-level orchestrator:

```
Phase 1: wiki-harvest.py           (unless --hygiene-only)
Phase 2: wiki-hygiene.py           (--full for the weekly pass, else quick)
Phase 3: qmd update && qmd embed   (unless --no-reindex or --dry-run)
```

Flags pass through to child scripts. Error-tolerant: if one phase fails,
the others still run. Logs to `scripts/.maintain.log`.

---

## Sync layer

### `wiki-sync.sh`

Git-based sync for cross-machine use. Commands:

- `--commit` — stage and commit local changes
- `--pull` — `git pull` with markdown merge-union (keeps both sides on conflict)
- `--push` — push to origin
- `full` — commit + pull + push + qmd reindex
- `--status` — read-only sync state report

The `.gitattributes` file sets `*.md merge=union` so markdown conflicts
auto-resolve by keeping both versions. This works because most conflicts
are additive (two machines both adding new entries).

---

## State files

Three JSON files track per-pipeline state:

| File | Owner | Synced? | Purpose |
|------|-------|---------|---------|
| `.mine-state.json` | `extract-sessions.py`, `summarize-conversations.py` | No (gitignored) | Per-session byte offsets — local filesystem state, not portable |
| `.harvest-state.json` | `wiki-harvest.py` | Yes (committed) | URL dedup — harvested/skipped/failed/rejected URLs |
| `.hygiene-state.json` | `wiki-hygiene.py` | Yes (committed) | Page content hashes, deferred issues, last-run timestamps |

Harvest and hygiene state need to sync across machines so both
installations agree on what's been processed. Mining state is per-machine
because Claude Code session files live at OS-specific paths.

---

## Module dependency graph

```
wiki_lib.py ─┬─> wiki-harvest.py
             ├─> wiki-staging.py
             └─> wiki-hygiene.py

wiki-maintain.sh ─> wiki-harvest.py
                 ─> wiki-hygiene.py
                 ─> qmd (external)

mine-conversations.sh ─> extract-sessions.py
                      ─> summarize-conversations.py
                      ─> update-conversation-index.py

extract-sessions.py (standalone — reads Claude JSONL)
summarize-conversations.py ─> claude CLI (or llama-server)
update-conversation-index.py ─> qmd (external)
```

`wiki_lib.py` is the only shared Python module — everything else is
self-contained within its layer.

---

## Extension seams

The places to modify when customizing:

1. **`scripts/extract-sessions.py`** — `PROJECT_MAP` controls how Claude
   project directories become wiki "wings". Also `KEEP_FULL_OUTPUT_TOOLS`,
   `SUMMARIZE_TOOLS`, and `MAX_BASH_OUTPUT_LINES` to tune transcript shape.

2. **`scripts/update-conversation-index.py`** — `PROJECT_NAMES` and
   `PROJECT_ORDER` control how the index groups conversations.

3. **`scripts/wiki-harvest.py`** —
   - `SKIP_DOMAIN_PATTERNS` — your internal domains
   - `C_TYPE_URL_PATTERNS` — URL shapes that need a topic match before harvesting
   - `FETCH_DELAY_SECONDS` — rate limit between fetches
   - `COMPILE_PROMPT_TEMPLATE` — what the AI compile step tells the LLM
   - `SONNET_CONTENT_THRESHOLD` — size cutoff for haiku vs sonnet

4. **`scripts/wiki-hygiene.py`** —
   - `DECAY_HIGH_TO_MEDIUM` / `DECAY_MEDIUM_TO_LOW` / `DECAY_LOW_TO_STALE`
     — decay thresholds in days
   - `EMPTY_STUB_THRESHOLD` — what counts as a stub
   - `VERSION_REGEX` — which tools/runtimes to track for lifecycle checks
   - `REQUIRED_FIELDS` — frontmatter fields the repair step enforces

5. **`scripts/summarize-conversations.py`** —
   - `CLAUDE_LONG_THRESHOLD` — haiku/sonnet routing cutoff
   - `MINE_PROMPT_FILE` — the LLM system prompt for summarization
   - Backend selection (claude vs llama-server)

6. **`CLAUDE.md`** at the wiki root — the instructions the agent reads
   every session. This is where you tell the agent how to maintain the
   wiki, what conventions to follow, and when to flag things to you.

See [`docs/CUSTOMIZE.md`](CUSTOMIZE.md) for recipes.
432
docs/CUSTOMIZE.md
Normal file
@@ -0,0 +1,432 @@
# Customization Guide

This repo is built around Claude Code, cron-based automation, and a
specific directory layout. None of those are load-bearing for the core
idea. This document walks through adapting it for different agents,
different scheduling, and different subsets of functionality.

## What's actually required for the core idea

The minimum viable compounding wiki is:

1. A markdown directory tree
2. An agent that reads the tree at the start of a session and writes to
   it during the session
3. Some convention (a `CLAUDE.md` or equivalent) telling the agent how to
   maintain the wiki

**Everything else in this repo is optional optimization** — automated
extraction, URL harvesting, hygiene checks, cron scheduling. They're
worth the setup effort once the wiki grows past a few dozen pages, but
they're not the *idea*.

---

## Adapting for non-Claude-Code agents

Four script components are Claude-specific, plus the `CLAUDE.md` file
itself. Each has a natural replacement path:

### 1. `extract-sessions.py` — Claude Code JSONL parsing

**What it does**: Reads session files from `~/.claude/projects/` and
converts them to markdown transcripts.

**What's Claude-specific**: The JSONL format and directory structure are
specific to the Claude Code CLI. Other agents don't produce these files.

**Replacements**:

- **Cursor**: Cursor stores chat history in `~/Library/Application
  Support/Cursor/User/globalStorage/` (macOS) as SQLite. Write an
  equivalent `extract-sessions.py` that queries that SQLite database and
  produces the same markdown format.
- **Aider**: Aider stores chat history as `.aider.chat.history.md` in
  each project directory. A much simpler extractor: walk all project
  directories, read each `.aider.chat.history.md`, split on session
  boundaries, and write to `conversations/<project>/`.
- **OpenAI Codex / gemini CLI / other**: Whatever session format your
  tool uses — the target format is a markdown file with a specific
  frontmatter shape (`title`, `type: conversation`, `project`, `date`,
  `status: extracted`, `messages: N`, body of user/assistant turns).
  Anything that produces files in that shape will flow through the rest
  of the pipeline unchanged.
- **No agent at all — just manual**: Skip this script entirely. Paste
  interesting conversations into `conversations/general/YYYY-MM-DD-slug.md`
  by hand and set `status: extracted` yourself.

The pipeline downstream of `extract-sessions.py` doesn't care how the
transcripts got there, only that they exist with the right frontmatter.
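
For Aider, the extractor really is that small. A sketch of the session-splitting half; it assumes Aider's `# aider chat started at ...` header is the session boundary, which is how Aider delimits sessions in its history file (verify against your version):

```python
def split_aider_history(text):
    """Split an .aider.chat.history.md file into per-session chunks.

    Each session is assumed to start with a top-level
    '# aider chat started at' header line.
    """
    sessions, current = [], []
    for line in text.splitlines():
        if line.startswith("# aider chat started at") and current:
            sessions.append("\n".join(current))   # close previous session
            current = []
        current.append(line)
    if current:
        sessions.append("\n".join(current))       # flush the last session
    return sessions
```

Each chunk then just needs the frontmatter shape described above written on top before saving under `conversations/<project>/`.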

### 2. `summarize-conversations.py` — `claude -p` summarization

**What it does**: Classifies extracted conversations into "halls"
(fact/discovery/preference/advice/event/tooling) and writes summaries.

**What's Claude-specific**: Uses `claude -p` with haiku/sonnet routing.

**Replacements**:

- **OpenAI**: Replace the `call_claude` helper with a function that calls
  the `openai` Python SDK or a `gpt` CLI. Use gpt-4o-mini for short
  conversations (equivalent to haiku routing) and gpt-4o for long ones.
- **Local LLM**: The script already supports this path — just omit the
  `--claude` flag and run a `llama-server` on localhost:8080 (or the WSL
  gateway IP on Windows). Phi-4-14B scored 400/400 on our internal eval.
- **Ollama**: Point `AI_BASE_URL` at your Ollama endpoint (e.g.
  `http://localhost:11434/v1`). Ollama exposes an OpenAI-compatible API.
- **Any OpenAI-compatible endpoint**: The `AI_BASE_URL` and `AI_MODEL` env
  vars configure the script — no code changes needed.
- **No LLM at all — manual summaries**: Edit each conversation file by
  hand to set `status: summarized` and add your own `topics`/`related`
  frontmatter. Tedious, but it works for a small wiki.
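
A replacement helper can stay backend-agnostic by speaking the OpenAI-compatible chat API that all of the options above expose. A minimal sketch with an injectable transport (the `call_llm` name is illustrative; the real helper in the script is `call_claude`):

```python
import json
import os
import urllib.request

def call_llm(prompt, model, base_url=None, transport=None):
    """Minimal OpenAI-compatible chat call, a drop-in sketch for call_claude.

    `transport` is injectable for testing; by default it POSTs to
    {base_url}/chat/completions, which llama-server, Ollama, and the
    OpenAI API all accept.
    """
    base_url = base_url or os.environ.get("AI_BASE_URL", "http://localhost:8080/v1")
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if transport is None:
        def transport(url, body):
            req = urllib.request.Request(
                url,
                data=json.dumps(body).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())
    data = transport(f"{base_url}/chat/completions", payload)
    return data["choices"][0]["message"]["content"]
```

Real endpoints usually also want an `Authorization: Bearer ...` header read from something like `OPENAI_API_KEY`; that is a one-line addition to the default transport.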

### 3. `wiki-harvest.py` — AI compile step

**What it does**: After fetching raw URL content, sends it to `claude -p`
to get a structured JSON verdict (new_page / update_page / both / skip)
plus the page content.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Any other LLM**: Replace `call_claude_compile()` with a function that
  calls your preferred backend. The prompt template
  (`COMPILE_PROMPT_TEMPLATE`) is reusable — just swap the transport.
- **Skip AI compilation entirely**: Run `wiki-harvest.py --no-compile`
  and the harvester will save raw content to `raw/harvested/` without
  trying to compile it. You can then turn the raw content into wiki
  pages manually (or via a different script).

### 4. `wiki-hygiene.py --full` — LLM-powered checks

**What it does**: Duplicate detection, contradiction detection, missing
cross-reference suggestions.

**What's Claude-specific**: `claude -p --model haiku|sonnet`.

**Replacements**:

- **Same as #3**: Replace the `call_claude()` helper in `wiki-hygiene.py`.
- **Skip full mode entirely**: Only run `wiki-hygiene.py --quick` (the
  default). Quick mode has no LLM calls and catches 90% of structural
  issues. Contradictions and duplicates then have to be caught by human
  review during `wiki-staging.py --review` sessions.

### 5. `CLAUDE.md` at the wiki root

**What it does**: The instructions Claude Code reads at the start of
every session, explaining the wiki schema and maintenance operations.

**What's Claude-specific**: The filename. Claude Code specifically looks
for `CLAUDE.md`; other agents look for other files.

**Replacements**:

| Agent | Equivalent file |
|-------|-----------------|
| Claude Code | `CLAUDE.md` |
| Cursor | `.cursorrules` or `.cursor/rules/` |
| Aider | `CONVENTIONS.md` (read via `--read CONVENTIONS.md`) |
| Gemini CLI | `GEMINI.md` |
| Continue.dev | `config.json` prompts or `.continue/rules/` |

The content is the same — just rename the file and point your agent at
it.

---

## Running without cron

Cron is convenient but not required. Alternatives:

### Manual runs

Just call the scripts when you want the wiki updated:

```bash
cd ~/projects/wiki

# When you want to ingest new Claude Code sessions
bash scripts/mine-conversations.sh

# When you want hygiene + harvest
bash scripts/wiki-maintain.sh

# When you want the expensive LLM pass
bash scripts/wiki-maintain.sh --hygiene-only --full
```

This is arguably *better* than cron if you work in bursts — run
maintenance when you start a session, not on a schedule.

### systemd timers (Linux)

More observable than cron, with better journaling:

```ini
# ~/.config/systemd/user/wiki-maintain.service
[Unit]
Description=Wiki maintenance pipeline

[Service]
Type=oneshot
WorkingDirectory=%h/projects/wiki
ExecStart=/usr/bin/bash %h/projects/wiki/scripts/wiki-maintain.sh
```

```ini
# ~/.config/systemd/user/wiki-maintain.timer
[Unit]
Description=Run wiki-maintain daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

```bash
systemctl --user enable --now wiki-maintain.timer
journalctl --user -u wiki-maintain.service   # see logs
```

### launchd (macOS)

More native than cron on macOS:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- ~/Library/LaunchAgents/com.user.wiki-maintain.plist -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.wiki-maintain</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/projects/wiki/scripts/wiki-maintain.sh</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>3</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
  <key>StandardOutPath</key><string>/tmp/wiki-maintain.log</string>
  <key>StandardErrorPath</key><string>/tmp/wiki-maintain.err</string>
</dict>
</plist>
```

```bash
launchctl load ~/Library/LaunchAgents/com.user.wiki-maintain.plist
launchctl list | grep wiki   # verify
```

### Git hooks (pre-push)

Run hygiene before every push so the wiki is always clean when it hits
the remote:

```bash
cat > ~/projects/wiki/.git/hooks/pre-push <<'HOOK'
#!/usr/bin/env bash
set -euo pipefail
bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --no-reindex
HOOK
chmod +x ~/projects/wiki/.git/hooks/pre-push
```

Downside: every push is slow. Upside: you never push a broken wiki.

### CI pipeline

Run `wiki-hygiene.py --check-only` in a CI workflow on every PR:

```yaml
# .github/workflows/wiki-check.yml (or .gitea/workflows/...)
name: Wiki hygiene check
on: [push, pull_request]
jobs:
  hygiene:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
      - run: python3 scripts/wiki-hygiene.py --check-only
```

`--check-only` reports issues without auto-fixing them, so CI can flag
problems without modifying files.

---

## Minimal subsets

You don't have to run the whole pipeline. Pick what's useful:

### "Just the wiki" (no automation)

- Delete `scripts/wiki-*` and `scripts/*-conversations*`
- Delete `tests/`
- Keep the directory structure (`patterns/`, `decisions/`, etc.)
- Keep `index.md` and `CLAUDE.md`
- Write and maintain the wiki manually with your agent

This is the Karpathy-gist version. Works great for small wikis.

### "Wiki + mining" (no harvesting, no hygiene)

- Keep the mining layer (`extract-sessions.py`, `summarize-conversations.py`, `update-conversation-index.py`)
- Delete the automation layer (`wiki-harvest.py`, `wiki-hygiene.py`, `wiki-staging.py`, `wiki-maintain.sh`)
- The wiki grows from session mining but you maintain it manually

Useful if you want session continuity (the wake-up briefing) without
the full automation.

### "Wiki + hygiene" (no mining, no harvesting)

- Keep `wiki-hygiene.py` and `wiki_lib.py`
- Delete everything else
- Run `wiki-hygiene.py --quick` periodically to catch structural issues

Useful if you write the wiki manually but want automated checks for
orphans, broken links, and staleness.

### "Wiki + harvesting" (no session mining)

- Keep `wiki-harvest.py`, `wiki-staging.py`, `wiki_lib.py`
- Delete the mining scripts
- Source URLs manually — put them in a file and point the harvester at
  it. You'd need to write a wrapper that extracts URLs from your source
  file and feeds them into the fetch cascade.

Useful if URLs come from somewhere other than Claude Code sessions
(e.g. browser bookmarks, a Pocket export, RSS).

---

## Schema customization

The repo uses these live content types:

- `patterns/` — HOW things should be built
- `decisions/` — WHY we chose this approach
- `concepts/` — WHAT the foundational ideas are
- `environments/` — WHERE implementations differ

These reflect my engineering-focused use case. Your wiki might need
different categories. To change them:

1. Rename / add directories under the wiki root
2. Edit `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py`
3. Update the `type:` frontmatter validation in
   `scripts/wiki-hygiene.py` (`VALID_TYPES` constant)
4. Update `CLAUDE.md` to describe the new categories
5. Update `index.md` section headers to match

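As a sketch of steps 2 and 3 — assuming the constants are plain
module-level collections (the actual shapes in `wiki_lib.py` and
`wiki-hygiene.py` may differ) — switching to a research schema would
look roughly like:

```python
# Hypothetical sketch only — check the real definitions in
# scripts/wiki_lib.py and scripts/wiki-hygiene.py before editing.

# scripts/wiki_lib.py — directories scanned as live content
LIVE_CONTENT_DIRS = ["findings", "hypotheses", "methods", "literature"]

# scripts/wiki-hygiene.py — accepted `type:` frontmatter values,
# kept in lockstep with the directories above
VALID_TYPES = {"finding", "hypothesis", "method", "literature"}
```

The two constants must stay in sync by hand — that coupling is one of
the reasons the design notes below suggest a single `config.yaml`.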
Examples of alternative schemas:

**Research wiki**:
- `findings/` — experimental results
- `hypotheses/` — what you're testing
- `methods/` — how you test
- `literature/` — external sources

**Product wiki**:
- `features/` — what the product does
- `decisions/` — why we chose this
- `users/` — personas, interviews, feedback
- `metrics/` — what we measure

**Personal knowledge wiki**:
- `topics/` — general subject matter
- `projects/` — specific ongoing work
- `journal/` — dated entries
- `references/` — external links/papers

None of these are better or worse — pick what matches how you think.

---

## Frontmatter customization

The required fields are documented in `CLAUDE.md` (frontmatter spec).
You can add your own fields freely — the parser and hygiene checks
ignore unknown keys.

Useful additions you might want:

```yaml
author: alice          # who wrote or introduced the page
tags: [auth, security] # flat tag list
urgency: high          # for to-do-style wiki pages
stakeholders:          # who cares about this page
  - product-team
  - security-team
review_by: 2026-06-01  # explicit review date instead of age-based decay
```

If you want age-based decay to key off a different field than
`last_verified` (say, `review_by`), edit `expected_confidence()` in
`scripts/wiki-hygiene.py` to read from your custom field.

---

## Working across multiple wikis

The scripts all honor the `WIKI_DIR` environment variable. Run multiple
wikis against the same scripts:

```bash
# Work wiki
WIKI_DIR=~/projects/work-wiki bash scripts/wiki-maintain.sh

# Personal wiki
WIKI_DIR=~/projects/personal-wiki bash scripts/wiki-maintain.sh

# Research wiki
WIKI_DIR=~/projects/research-wiki bash scripts/wiki-maintain.sh
```

Each has its own state files, its own cron entries, its own qmd
collection. You can symlink or copy `scripts/` into each wiki, or run
all three against a single checked-out copy of the scripts.

---

## What I'd change if starting over

Honest notes on the design choices, in case you're about to fork:

1. **Config should be in YAML, not inline constants.** I bolted a
   "CONFIGURE ME" comment onto `PROJECT_MAP` and `SKIP_DOMAIN_PATTERNS`
   as a shortcut. Better: a `config.yaml` at the wiki root that all
   scripts read.

2. **The mining layer is tightly coupled to Claude Code.** A cleaner
   design would put a `Session` interface in `wiki_lib.py` and have
   extractors for each agent produce `Session` objects — the rest of the
   pipeline would be agent-agnostic.

3. **The hygiene script is a monolith.** 1100+ lines is a lot. Splitting
   it into `wiki_hygiene/checks.py`, `wiki_hygiene/archive.py`,
   `wiki_hygiene/llm.py`, etc., would be cleaner. It started as a single
   file and grew.

4. **The hyphenated filenames (`wiki-harvest.py`) make Python imports
   awkward.** Standard Python convention is underscores. I used hyphens
   for consistency with the shell scripts, and `conftest.py` has a
   module-loader workaround. A cleaner fork would use underscores
   everywhere.

5. **The wiki schema assumes you know what you want to catalog.** If
   you don't, start with a free-form `notes/` directory and let
   categories emerge organically, then refactor into `patterns/` etc.
   later.

None of these are blockers. They're all "if I were designing v2"
observations.

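The `Session` interface from point 2 doesn't exist in the repo — but as
a sketch of the v2 idea, it could be as small as a dataclass plus a
per-agent extractor protocol (all names here are hypothetical):

```python
# Hypothetical v2 sketch: an agent-agnostic Session boundary.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Iterable, Protocol

@dataclass
class Session:
    project: str                           # wing code, e.g. "web"
    started: datetime
    messages: list[str] = field(default_factory=list)

class SessionExtractor(Protocol):
    """One implementation per agent (Claude Code, other CLIs, ...)."""
    def extract(self) -> Iterable[Session]: ...
```

Everything downstream — summarization, indexing, wiki linking — would
consume `Session` objects and never touch agent-specific log formats.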
338
docs/DESIGN-RATIONALE.md
Normal file
@@ -0,0 +1,338 @@
# Design Rationale — Signal & Noise

Why each part of this repo exists. This is the "why" document; the other
docs are the "what" and "how."

Before implementing anything, the design was worked out interactively
with Claude as a structured Signal & Noise analysis of Andrej Karpathy's
original persistent-wiki pattern:

> **Interactive design artifact**: [The LLM Wiki — Karpathy's Pattern — Signal & Noise](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)

That artifact walks through the pattern's seven genuine strengths, seven
real weaknesses, and concrete mitigations for each weakness. This repo
is the implementation of those mitigations. If you want to understand
*why* a component exists, the artifact has the longer-form argument; this
document is the condensed version.

---

## Where the pattern is genuinely strong

The analysis found seven strengths that hold up under scrutiny. This
repo preserves all of them:

| Strength | How this repo keeps it |
|----------|-----------------------|
| **Knowledge compounds over time** | Every ingest adds to the existing wiki rather than restarting; conversation mining and URL harvesting continuously feed new material in |
| **Zero maintenance burden on humans** | Cron-driven harvest + hygiene; the only manual step is staging review, and that's fast because the AI already compiled the page |
| **Token-efficient at personal scale** | `index.md` fits in context; `qmd` kicks in only at 50+ articles; the wake-up briefing is ~200 tokens |
| **Human-readable & auditable** | Plain markdown everywhere; every cross-reference is visible; git history shows every change |
| **Future-proof & portable** | No vendor lock-in; you can point any agent at the same tree tomorrow |
| **Self-healing via lint passes** | `wiki-hygiene.py` runs quick checks daily and full (LLM) checks weekly |
| **Path to fine-tuning** | Wiki pages are high-quality synthetic training data once purified through hygiene |

---

## Where the pattern is genuinely weak — and how this repo answers

The analysis identified seven real weaknesses. Five have direct
mitigations in this repo; two remain open trade-offs you should be aware
of.

### 1. Errors persist and compound

**The problem**: Unlike RAG — where a hallucination is ephemeral and the
next query starts clean — an LLM wiki persists its mistakes. If the LLM
incorrectly links two concepts at ingest time, future ingests build on
that wrong prior.

**How this repo mitigates**:

- **`confidence` field** — every page carries `high`/`medium`/`low` with
  decay based on `last_verified`. Wrong claims aren't treated as
  permanent — they age out visibly.
- **Archive + restore** — decayed pages get moved to `archive/` where
  they're excluded from default search. If they get referenced again
  they're auto-restored with `confidence: medium` (never straight to
  `high` — they have to re-earn trust).
- **Raw harvested material is immutable** — `raw/harvested/*.md` files
  are the ground truth. Every compiled wiki page can be traced back to
  its source via the `sources:` frontmatter field.
- **Full-mode contradiction detection** — `wiki-hygiene.py --full` uses
  Sonnet to find conflicting claims across pages. Report-only (humans
  decide which side wins).
- **Staging review** — automated content goes to `staging/` first.
  Nothing enters the live wiki without human approval, so errors have
  two chances to get caught (AI compile + human review) before they
  become persistent.

### 2. Hard scale ceiling at ~50K tokens

**The problem**: The wiki approach stops working when `index.md` no
longer fits in context. Karpathy's own wiki was ~100 articles / 400K
words — already near the ceiling.

**How this repo mitigates**:

- **`qmd` from day one** — `qmd` (BM25 + vector + LLM re-ranking) is set
  up in the default configuration so the agent never has to load the
  full index. At 50+ pages, `qmd search` replaces `cat index.md`.
- **Wing/room structural filtering** — conversations are partitioned by
  project code (wing) and topic (room, via the `topics:` frontmatter).
  Retrieval is pre-narrowed to the relevant wing before search runs.
  This extends the effective ceiling because `qmd` works on a relevant
  subset, not the whole corpus.
- **Hygiene full mode flags redundancy** — duplicate detection auto-merges
  weaker pages into stronger ones, keeping the corpus lean.
- **Archive excludes stale content** — the `wiki-archive` collection has
  `includeByDefault: false`, so archived pages don't eat context until
  explicitly queried.

### 3. Manual cross-checking burden returns in precision-critical domains

**The problem**: For API specs, version constraints, legal records, and
medical protocols, LLM-generated content needs human verification. The
maintenance burden you thought you'd eliminated comes back as
verification overhead.

**How this repo mitigates**:

- **Staging workflow** — every automated page goes through human review.
  For precision-critical content, that review IS the cross-check. The
  AI does the drafting; you verify.
- **`compilation_notes` field** — staging pages include the AI's own
  explanation of what it did and why. Makes review faster — you can
  spot-check the reasoning rather than re-reading the whole page.
- **Immutable raw sources** — every wiki claim traces back to a specific
  file in `raw/harvested/` with a SHA-256 `content_hash`. Verification
  means comparing the claim to the source, not "trust the LLM."
- **`confidence: low` for precision domains** — the agent's instructions
  (via `CLAUDE.md`) tell it to flag low-confidence content when
  citing. Humans see the warning before acting.

**Residual trade-off**: For *truly* mission-critical data (legal,
medical, compliance), no amount of automation replaces domain-expert
review. If that's your use case, treat this repo as a *drafting* tool,
not a canonical source.

### 4. Knowledge staleness without active upkeep

**The problem**: Community analysis of 120+ comments on Karpathy's gist
found this is the #1 failure mode. Most people who try the pattern get
the folder structure right and still end up with a wiki that slowly
becomes unreliable because they stop feeding it. Six-week half-life is
typical.

**How this repo mitigates** (this is the biggest thing):

- **Automation replaces human discipline** — daily cron runs
  `wiki-maintain.sh` (harvest + hygiene + qmd reindex); weekly cron runs
  `--full` mode. You don't need to remember anything.
- **Conversation mining is the feed** — you don't need to curate sources
  manually. Every Claude Code session becomes potential ingest. The
  feed is automatic and continuous, as long as you're doing work.
- **`last_verified` refreshes from conversation references** — when the
  summarizer links a conversation to a wiki page via `related:`, the
  hygiene script picks that up and bumps `last_verified`. Pages stay
  fresh as long as they're still being discussed.
- **Decay thresholds force attention** — pages without refresh signals
  for 6/9/12 months get downgraded and eventually archived. The wiki
  self-trims.
- **Hygiene reports** — `reports/hygiene-YYYY-MM-DD-needs-review.md`
  flags the things that *do* need human judgment. Everything else is
  auto-fixed.

This is the single biggest reason this repo exists. The automation
layer is entirely about removing "I forgot to lint" as a failure mode.

### 5. Cognitive outsourcing risk

**The problem**: Hacker News critics argued that the bookkeeping
Karpathy outsources — filing, cross-referencing, summarizing — is
precisely where genuine understanding forms. Outsource it and you end up
with a comprehensive wiki you haven't internalized.

**How this repo mitigates**:

- **Staging review is a forcing function** — you see every automated
  page before it lands. Even skimming forces engagement with the
  material.
- **`qmd query "..."` for exploration** — searching the wiki is an
  active process, not passive retrieval. You're asking questions, not
  pulling a file.
- **The wake-up briefing** — `context/wake-up.md` is a 200-token digest
  the agent reads at session start. You read it too (or the agent reads
  it to you) — ongoing re-exposure to your own knowledge base.

**Residual trade-off**: This is a real concern even with mitigations.
The wiki is designed as *augmentation*, not *replacement*. If you
never read your own wiki and only consult it through the agent, you're
in the outsourcing failure mode. The fix is discipline, not
architecture.

### 6. Weaker semantic retrieval than RAG at scale

**The problem**: At large corpora, vector embeddings find semantically
related content across different wording in ways explicit wikilinks
can't match.

**How this repo mitigates**:

- **`qmd` is hybrid (BM25 + vector)** — not just keyword search. Vector
  similarity is built into the retrieval pipeline from day one.
- **Structural navigation complements semantic search** — project codes
  (wings) and topic frontmatter narrow the search space before the
  hybrid search runs. Structure + semantics is stronger than either
  alone.
- **Missing cross-reference detection** — full-mode hygiene asks the
  LLM to find pages that *should* link to each other but don't, then
  auto-adds them. This is the explicit-linking approach catching up to
  semantic retrieval over time.

**Residual trade-off**: At enterprise scale (millions of documents), a
proper vector DB with specialized retrieval wins. This repo is for
personal / small-team scale where the hybrid approach is sufficient.

### 7. No access control or multi-user support

**The problem**: It's a folder of markdown files. No RBAC, no audit
logging, no concurrency handling, no permissions model.

**How this repo mitigates**:

- **Git-based sync with merge-union** — concurrent writes on different
  machines auto-resolve because markdown is set to `merge=union` in
  `.gitattributes`. Both sides win.
- **Network boundary as soft access control** — the suggested
  deployment is over Tailscale or a VPN, so the network does the work a
  RBAC layer would otherwise do. Not enterprise-grade, but sufficient
  for personal/family/small-team use.

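The merge-union behavior is a one-line `.gitattributes` entry (the exact
path pattern this repo ships may be scoped differently):

```
*.md merge=union
```

`merge=union` concatenates both sides' added lines on conflict instead
of inserting conflict markers — reasonable for append-heavy markdown,
but it won't intelligently reconcile edits to the *same* line.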
**Residual trade-off**: **This is the big one.** The repo is not a
replacement for enterprise knowledge management. No audit trails, no
fine-grained permissions, no compliance story. If you need any of
that, you need a different architecture. This repo is explicitly
scoped to the personal/small-team use case.

---

## The #1 failure mode — active upkeep

Every other weakness has a mitigation. *Active upkeep is the one that
kills wikis in the wild.* The community data is unambiguous:

- People who automate the lint schedule → wikis healthy at 6+ months
- People who rely on "I'll remember to lint" → wikis abandoned at 6 weeks

The entire automation layer of this repo exists to remove upkeep as a
thing the human has to think about:

| Cadence | Job | Purpose |
|---------|-----|---------|
| Every 15 min | `wiki-sync.sh` | Commit/pull/push — cross-machine sync |
| Every 2 hours | `wiki-sync.sh full` | Full sync + qmd reindex |
| Every hour | `mine-conversations.sh --extract-only` | Capture new Claude Code sessions (no LLM) |
| Daily 2am | `summarize-conversations.py --claude` + index | Classify + summarize (LLM) |
| Daily 3am | `wiki-maintain.sh` | Harvest + quick hygiene + reindex |
| Weekly Sun 4am | `wiki-maintain.sh --hygiene-only --full` | LLM-powered duplicate/contradiction/cross-ref detection |

If you disable all of these, you get the same outcome as every
abandoned wiki: six-week half-life. The scripts aren't optional
convenience — they're the load-bearing answer to the pattern's primary
failure mode.

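As a rough translation of the cadence table into crontab entries —
paths assume the repo lives at `~/projects/wiki` and that each script
handles its own logging; adjust to your install:

```cron
*/15 * * * *  bash ~/projects/wiki/scripts/wiki-sync.sh
0 */2 * * *   bash ~/projects/wiki/scripts/wiki-sync.sh full
0 * * * *     bash ~/projects/wiki/scripts/mine-conversations.sh --extract-only
0 2 * * *     python3 ~/projects/wiki/scripts/summarize-conversations.py --claude && python3 ~/projects/wiki/scripts/update-conversation-index.py
0 3 * * *     bash ~/projects/wiki/scripts/wiki-maintain.sh
0 4 * * 0     bash ~/projects/wiki/scripts/wiki-maintain.sh --hygiene-only --full
```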
---

## What was borrowed from where

This repo is a synthesis of two ideas with an automation layer on top:

### From Karpathy

- The core pattern: LLM-maintained persistent wiki, compile at ingest
  time instead of retrieve at query time
- Separation of `raw/` (immutable sources) from `wiki/` (compiled pages)
- `CLAUDE.md` as the schema that disciplines the agent
- Periodic "lint" passes to catch orphans, contradictions, missing refs
- The idea that the wiki becomes fine-tuning material over time

### From mempalace

- **Wings** = per-person or per-project namespaces → this repo uses
  project codes (`mc`, `wiki`, `web`, etc.) as the same thing in
  `conversations/<project>/`
- **Rooms** = topics within a wing → the `topics:` frontmatter on
  conversation files
- **Halls** = memory-type corridors (fact / event / discovery /
  preference / advice / tooling) → the `halls:` frontmatter field
  classified by the summarizer
- **Closets** = summary layer → the summary body of each summarized
  conversation
- **Drawers** = verbatim archive, never lost → the extracted
  conversation transcripts under `conversations/<project>/*.md`
- **Tunnels** = cross-wing connections → the `related:` frontmatter
  linking conversations to wiki pages
- Wing + room structural filtering gives a documented +34% retrieval
  boost over flat search

The MemPalace taxonomy solved a problem Karpathy's pattern doesn't
address: how do you navigate a growing corpus without reading
everything? The answer is to give the corpus structural metadata at
ingest time, then filter on that metadata before doing semantic search.
This repo borrows that wholesale.

### What this repo adds

- **Automation layer** tying the pieces together with cron-friendly
  orchestration
- **Staging pipeline** as a human-in-the-loop checkpoint for automated
  content
- **Confidence decay + auto-archive + auto-restore** as the "retention
  curve" that community analysis identified as critical for long-term
  wiki health
- **`qmd` integration** as the scalable search layer (chosen over
  ChromaDB because it uses the same markdown storage as the wiki —
  one index to maintain, not two)
- **Hygiene reports** with fixed vs needs-review separation so
  automation handles mechanical fixes and humans handle ambiguity
- **Cross-machine sync** via git with markdown merge-union so the same
  wiki lives on multiple machines without merge hell

---

## Honest residual trade-offs

Five items from the analysis that this repo doesn't fully solve and
where you should know the limits:

1. **Enterprise scale** — this is a personal/small-team tool. Millions
   of documents, hundreds of users, RBAC, compliance: wrong
   architecture.
2. **True semantic retrieval at massive scale** — `qmd` hybrid search
   is great for thousands of pages, not millions.
3. **Cognitive outsourcing** — no architecture fix. Discipline
   yourself to read your own wiki, not just query it through the agent.
4. **Precision-critical domains** — for legal/medical/regulatory data,
   use this as a drafting tool, not a source of truth. Human
   domain-expert review is not replaceable.
5. **Access control** — network boundary (Tailscale) is the fastest
   path; nothing in the repo itself enforces permissions.

If any of these are dealbreakers for your use case, a different
architecture is probably what you need.

---

## Further reading

- [The original Karpathy gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
  — the concept
- [mempalace](https://github.com/milla-jovovich/mempalace) — the
  structural memory layer
- [Signal & Noise interactive analysis](https://claude.ai/public/artifacts/0f6e1d9b-3b8c-43df-99d7-3a4328a1620c)
  — the design rationale this document summarizes
- [README](../README.md) — the concept pitch
- [ARCHITECTURE.md](ARCHITECTURE.md) — component deep-dive
- [SETUP.md](SETUP.md) — installation
- [CUSTOMIZE.md](CUSTOMIZE.md) — adapting for non-Claude-Code setups

502
docs/SETUP.md
Normal file
@@ -0,0 +1,502 @@
# Setup Guide

Complete installation for the full automation pipeline. For the conceptual
version (just the idea, no scripts), see the "Quick start — Path A" section
in the [README](../README.md).

Tested on macOS (work machines) and Linux/WSL2 (home machines). Should work
on any POSIX system with Python 3.11+, Node.js 18+, and bash.

---

## 1. Prerequisites

### Required

- **git** with SSH or HTTPS access to your remote (for cross-machine sync)
- **Node.js 18+** (for `qmd` search)
- **Python 3.11+** (for all pipeline scripts)
- **`claude` CLI** with valid authentication — Max subscription OAuth or
  API key. Required for summarization and the harvester's AI compile step.
  Without `claude`, you can still use the wiki, but the automation layer
  falls back to manual or local-LLM paths.

### Python tools (recommended via `pipx`)

```bash
# URL content extraction — required for wiki-harvest.py
pipx install trafilatura
pipx install crawl4ai && crawl4ai-setup  # installs Playwright browsers
```

Verify: `trafilatura --version` and `crwl --help` should both work.

### Optional

- **`pytest`** — only needed to run the test suite (`pip install --user pytest`)
- **`llama.cpp` / `llama-server`** — only if you want the legacy local-LLM
  summarization path instead of `claude -p`

---

## 2. Clone the repo

```bash
git clone <your-gitea-or-github-url> ~/projects/wiki
cd ~/projects/wiki
```

The repo contains scripts, tests, docs, and example content — but no
actual wiki pages. The wiki grows as you use it.

---

## 3. Configure qmd search

`qmd` handles BM25 full-text search and vector search over the wiki.
The pipeline uses **three** collections:

- **`wiki`** — live content (patterns/decisions/concepts/environments),
  staging, and raw sources. The default search surface.
- **`wiki-archive`** — stale / superseded pages. Excluded from default
  search; query explicitly with `-c wiki-archive` when digging into
  history.
- **`wiki-conversations`** — mined Claude Code session transcripts.
  Excluded from default search because they'd flood results with noisy
  tool-call output; query explicitly with `-c wiki-conversations` when
  looking for "what did I discuss about X last month?"

```bash
npm install -g @tobilu/qmd
```

Configure via YAML directly — the CLI doesn't support `ignore` or
`includeByDefault`, so we edit the config file:

```bash
mkdir -p ~/.config/qmd
cat > ~/.config/qmd/index.yml <<'YAML'
collections:
  wiki:
    path: /Users/YOUR_USER/projects/wiki  # ← replace with your actual path
    pattern: "**/*.md"
    ignore:
      - "archive/**"
      - "reports/**"
      - "plans/**"
      - "conversations/**"
      - "scripts/**"
      - "context/**"

  wiki-archive:
    path: /Users/YOUR_USER/projects/wiki/archive
    pattern: "**/*.md"
    includeByDefault: false

  wiki-conversations:
    path: /Users/YOUR_USER/projects/wiki/conversations
    pattern: "**/*.md"
    includeByDefault: false
    ignore:
      - "index.md"
YAML
```

On Linux/WSL, replace `/Users/YOUR_USER` with `/home/YOUR_USER`.

Build the indexes:

```bash
qmd update  # scan files into all three collections
qmd embed   # generate vector embeddings (~2 min first run + ~30 min for conversations on CPU)
```

Verify:

```bash
qmd collection list
# Expected:
# wiki — N files
# wiki-archive — M files [excluded]
# wiki-conversations — K files [excluded]
```

The `[excluded]` tag on the non-default collections confirms
`includeByDefault: false` is honored.

**When to query which**:

```bash
# "What's the current pattern for X?"
qmd search "topic" --json -n 5

# "What was the OLD pattern, before we changed it?"
qmd search "topic" -c wiki-archive --json -n 5

# "When did we discuss this, and what did we decide?"
qmd search "topic" -c wiki-conversations --json -n 5

# Everything — history + current + conversations
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

---

## 4. Configure the Python scripts

Three scripts need per-user configuration:

### `scripts/extract-sessions.py` — `PROJECT_MAP`

This maps Claude Code project directory suffixes to short wiki codes
("wings"). Claude stores sessions under `~/.claude/projects/<hashed-path>/`
where the hashed path is derived from the absolute path to your project.

Open the script and edit the `PROJECT_MAP` dict near the top. Look for
the `CONFIGURE ME` block. Examples:

```python
PROJECT_MAP: dict[str, str] = {
    "projects-wiki": "wiki",
    "-claude": "cl",
    "my-webapp": "web",    # map "mydir/my-webapp" → wing "web"
    "mobile-app": "mob",
    "work-monorepo": "work",
    "-home": "general",    # catch-all for unmatched sessions
}
```

Run `ls ~/.claude/projects/` to see what directory names Claude is
actually producing on your machine — the suffix in `PROJECT_MAP` matches
against the end of each directory name.

### `scripts/update-conversation-index.py` — `PROJECT_NAMES` / `PROJECT_ORDER`

Matching display names for every code in `PROJECT_MAP`:

```python
PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    "web": "WEB — My Webapp",
    "mob": "MOB — Mobile App",
    "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

PROJECT_ORDER = [
    "work", "web", "mob",  # most-active first
    "wiki", "cl", "general",
]
```

### `scripts/wiki-harvest.py` — `SKIP_DOMAIN_PATTERNS`

Add your internal/personal domains so the harvester doesn't try to fetch
them. Patterns use `re.search`:

```python
SKIP_DOMAIN_PATTERNS = [
    # ... (generic ones are already there)
    r"\.mycompany\.com$",
    r"^git\.mydomain\.com$",
]
```

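Since the patterns are applied with `re.search`, the skip check reduces
to something like the following sketch (the real harvester may
normalize the hostname first):

```python
import re

SKIP_DOMAIN_PATTERNS = [
    r"\.mycompany\.com$",     # any subdomain of mycompany.com
    r"^git\.mydomain\.com$",  # exactly this host
]

def should_skip(domain: str) -> bool:
    # re.search matches anywhere in the string, so anchor with ^/$ when
    # you mean a whole-host match.
    return any(re.search(p, domain) for p in SKIP_DOMAIN_PATTERNS)
```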
---
|
||||||
|
|
||||||
|
## 5. Create the post-merge hook
|
||||||
|
|
||||||
|
The hook rebuilds the qmd index automatically after every `git pull`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cat > ~/projects/wiki/.git/hooks/post-merge <<'HOOK'
|
||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
if command -v qmd &>/dev/null; then
|
||||||
|
echo "wiki: rebuilding qmd index..."
|
||||||
|
qmd update 2>/dev/null
|
||||||
|
# WSL / Linux: no GPU, force CPU-only embeddings
|
||||||
|
if [[ "$(uname -s)" == "Linux" ]]; then
|
||||||
|
NODE_LLAMA_CPP_GPU=false qmd embed 2>/dev/null
|
||||||
|
else
|
||||||
|
qmd embed 2>/dev/null
|
||||||
|
fi
|
||||||
|
echo "wiki: qmd index updated"
|
||||||
|
fi
|
||||||
|
HOOK
|
||||||
|
chmod +x ~/projects/wiki/.git/hooks/post-merge
|
||||||
|
```
|
||||||
|
|
||||||
|
`.git/hooks/` isn't tracked by git, so this step runs on every machine
|
||||||
|
where you clone the repo.

---

## 6. Backfill frontmatter (first-time setup or fresh clone)

If you're starting with existing wiki pages that don't yet have
`last_verified` or `origin`, backfill them:

```bash
cd ~/projects/wiki

# Backfill last_verified from last_compiled/git/mtime
python3 scripts/wiki-hygiene.py --backfill

# Backfill origin: manual on pre-automation pages (one-shot inline)
python3 -c "
import sys
sys.path.insert(0, 'scripts')
from wiki_lib import iter_live_pages, write_page

changed = 0
for p in iter_live_pages():
    if 'origin' not in p.frontmatter:
        p.frontmatter['origin'] = 'manual'
        write_page(p)
        changed += 1
print(f'{changed} page(s) backfilled')
"
```

For a brand-new empty wiki, there's nothing to backfill — skip this step.

---

## 7. Run the pipeline manually once

Before setting up cron, do a full end-to-end dry run to make sure
everything's wired up:

```bash
cd ~/projects/wiki

# 1. Extract any existing Claude Code sessions
bash scripts/mine-conversations.sh --extract-only

# 2. Summarize with claude -p (will make real LLM calls — can take minutes)
python3 scripts/summarize-conversations.py --claude

# 3. Regenerate conversation index + wake-up context
python3 scripts/update-conversation-index.py --reindex

# 4. Dry-run the maintenance pipeline
bash scripts/wiki-maintain.sh --dry-run --no-compile
```

Expected output from step 4: all three phases run, phase 3 (qmd reindex)
shows as skipped in dry-run mode, and you see `finished in Ns`.

---

## 8. Cron setup (optional)

If you want full automation, add these cron jobs. **Run them on only ONE
machine** — state files sync via git, so the other machine picks up the
results automatically.

```bash
crontab -e
```

```cron
# Wiki SSH key for cron (if your remote uses SSH with a key)
GIT_SSH_COMMAND="ssh -i /path/to/wiki-key -o StrictHostKeyChecking=no"

# PATH for cron so claude, qmd, node, python3, pipx tools are findable
PATH=/home/YOUR_USER/.nvm/versions/node/v22/bin:/home/YOUR_USER/.local/bin:/usr/local/bin:/usr/bin:/bin

# ─── Sync ──────────────────────────────────────────────────────────────────
# commit/pull/push every 15 minutes (grouped so all three stages get logged)
*/15 * * * * { /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --commit && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --pull && /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh --push; } >> /tmp/wiki-sync.log 2>&1

# full sync with qmd reindex every 2 hours
0 */2 * * * /home/YOUR_USER/projects/wiki/scripts/wiki-sync.sh full >> /tmp/wiki-sync.log 2>&1

# ─── Mining ────────────────────────────────────────────────────────────────
# Extract new sessions hourly (no LLM, fast)
0 * * * * /home/YOUR_USER/projects/wiki/scripts/mine-conversations.sh --extract-only >> /tmp/wiki-mine.log 2>&1

# Summarize + index daily at 2am (uses claude -p)
0 2 * * * cd /home/YOUR_USER/projects/wiki && python3 scripts/summarize-conversations.py --claude >> /tmp/wiki-mine.log 2>&1 && python3 scripts/update-conversation-index.py --reindex >> /tmp/wiki-mine.log 2>&1

# ─── Maintenance ───────────────────────────────────────────────────────────
# Daily at 3am: harvest + quick hygiene + qmd reindex
0 3 * * * cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh >> scripts/.maintain.log 2>&1

# Weekly Sunday at 4am: full hygiene with LLM checks
0 4 * * 0 cd /home/YOUR_USER/projects/wiki && bash scripts/wiki-maintain.sh --hygiene-only --full >> scripts/.maintain.log 2>&1
```

Replace `YOUR_USER` and the node path as appropriate for your system.

**macOS note**: `cron` needs Full Disk Access if you're pointing it at
files in `~/Documents` or `~/Desktop`. Alternatively use `launchd` with
a plist — same effect, easier permission model on macOS.

**WSL note**: make sure `cron` is actually running (`sudo service cron
start`). Cron doesn't auto-start in WSL by default.

**`claude -p` in cron**: OAuth tokens must be cached before cron runs it.
Run `claude --version` once interactively as your user to prime the
token cache — cron then picks up the cached credentials.

---

## 9. Tell Claude Code about the wiki

Two separate CLAUDE.md files work together:

1. **The wiki's own `CLAUDE.md`** at `~/projects/wiki/CLAUDE.md` — the
   schema the agent reads when working INSIDE the wiki. Tells it how to
   maintain pages, apply frontmatter, and handle staging/archival.
2. **Your global `~/.claude/CLAUDE.md`** — the user-level instructions
   the agent reads on EVERY session (regardless of directory). Tells it
   when and how to consult the wiki from any other project.

Both are provided as starter templates you can copy and adapt:

### (a) Wiki schema — copy to the wiki root

```bash
cp ~/projects/wiki/docs/examples/wiki-CLAUDE.md ~/projects/wiki/CLAUDE.md
# then edit ~/projects/wiki/CLAUDE.md for your own conventions
```

This file is ~200 lines. It defines:

- Directory structure and the automated-vs-manual core rule
- Frontmatter spec (required fields, staging fields, archive fields)
- Page-type conventions (pattern / decision / environment / concept)
- Operations: Ingest, Query, Mine, Harvest, Maintain, Lint
- **Search Strategy** — which of the three qmd collections to use for
  which question type

Customize the sections marked **"Customization Notes"** at the bottom
for your own categories, environments, and cross-reference format.

### (b) Global wake-up + query instructions

Append the contents of `docs/examples/global-CLAUDE.md` to your global
Claude Code instructions:

```bash
cat ~/projects/wiki/docs/examples/global-CLAUDE.md >> ~/.claude/CLAUDE.md
# then review ~/.claude/CLAUDE.md to integrate cleanly with any existing content
```

This adds:

- **Wake-Up Context** — read `context/wake-up.md` at session start
- **LLM Wiki — When to Consult It** — query mode vs ingest mode rules
- **LLM Wiki — How to Search It** — explicit guidance for all three qmd
  collections (`wiki`, `wiki-archive`, `wiki-conversations`) with
  example queries for each
- **Rules When Citing** — flag `confidence: low`, `status: pending`,
  and archived pages to the user

Together these give the agent a complete picture: how to maintain the
wiki when working inside it, and how to consult it from anywhere else.

---

## 10. Verify

```bash
cd ~/projects/wiki

# Sync state
bash scripts/wiki-sync.sh --status

# Search
qmd collection list
qmd search "test" --json -n 3   # won't return anything if wiki is empty

# Mining
tail -20 scripts/.mine.log 2>/dev/null || echo "(no mining runs yet)"

# End-to-end maintenance dry-run (no writes, no LLM, no network)
bash scripts/wiki-maintain.sh --dry-run --no-compile

# Run the test suite
cd tests && python3 -m pytest
```

Expected:

- `qmd collection list` shows all three collections: `wiki`, `wiki-archive [excluded]`, `wiki-conversations [excluded]`
- `wiki-maintain.sh --dry-run` completes all three phases
- `pytest` passes all 171 tests in ~1.3 seconds

---

## Troubleshooting

**qmd search returns nothing**

```bash
qmd collection list           # verify path points at the right place
qmd update                    # rebuild index
qmd embed                     # rebuild embeddings
cat ~/.config/qmd/index.yml   # verify config is correct for your machine
```

**qmd collection points at the wrong path**

Edit `~/.config/qmd/index.yml` directly. Don't use `qmd collection add`
from inside the target directory — it can interpret the path oddly.

**qmd returns archived pages in default searches**

Verify `wiki-archive` has `includeByDefault: false` in the YAML and
`qmd collection list` shows `[excluded]`.

**`claude -p` fails in cron ("not authenticated")**

Cron has no browser. Run `claude --version` once as the same user
outside cron to cache OAuth tokens; cron will pick them up. Also verify
the `PATH` directive at the top of the crontab includes the directory
containing `claude`.

**`wiki-harvest.py` fetch failures**

```bash
# Verify the extraction tools work
trafilatura -u "https://example.com" --markdown --no-comments --precision
crwl "https://example.com" -o markdown-fit

# Check harvest state
python3 -c "import json; print(json.dumps(json.load(open('.harvest-state.json'))['failed_urls'], indent=2))"
```

**`wiki-hygiene.py` archived a page unexpectedly**

Check `last_verified` against the decay thresholds. If the page was never
referenced in a conversation, it decayed naturally. Restore with:

```bash
python3 scripts/wiki-hygiene.py --restore archive/patterns/foo.md
```

**Both machines ran maintenance simultaneously**

Merge conflicts on `.harvest-state.json` / `.hygiene-state.json` will
occur. Pick ONE machine for maintenance; disable the maintenance cron
on the other. Leave sync cron running on both so changes still propagate.

**Tests fail**

Run `cd tests && python3 -m pytest -v` for verbose output. If the
failure mentions `WIKI_DIR` or module loading, verify
`scripts/wiki_lib.py` exists and contains the `WIKI_DIR` env var override
near the top.

---

## Minimal install (skip everything except the idea)

If you want the conceptual wiki without any of the automation, all you
actually need is:

1. An empty directory
2. `CLAUDE.md` telling your agent the conventions (see the schema in
   [`ARCHITECTURE.md`](ARCHITECTURE.md) or Karpathy's gist)
3. `index.md` for the agent to catalog pages
4. An agent that can read and write files (any Claude Code, Cursor, Aider
   session works)

Then tell the agent: "Start maintaining a wiki here. Every time I share
a source, integrate it. When I ask a question, check the wiki first."

You can bolt on the automation layer later if/when it becomes worth
the setup effort.

161
docs/examples/global-CLAUDE.md
Normal file
@@ -0,0 +1,161 @@

# Global Claude Code Instructions — Wiki Section

**What this is**: Content to add to your global `~/.claude/CLAUDE.md`
(the user-level instructions Claude Code reads at the start of every
session, regardless of which project you're in). These instructions tell
Claude how to consult the wiki from outside the wiki directory.

**Where to paste it**: Append these sections to `~/.claude/CLAUDE.md`.
Don't overwrite the whole file — this is additive.

---

Copy everything below this line into your global `~/.claude/CLAUDE.md`:

---

## Wake-Up Context

At the start of each session, read `~/projects/wiki/context/wake-up.md`
for a briefing on active projects, recent decisions, and current
concerns. This provides conversation continuity across sessions.

## LLM Wiki — When to Consult It

**Before creating API endpoints, Docker configs, CI pipelines, or making
architectural decisions**, check the wiki at `~/projects/wiki/` for
established patterns and decisions.

The wiki captures the **why** behind patterns — not just what to do, but
the reasoning, constraints, alternatives rejected, and
environment-specific differences. It compounds over time as projects
discover new knowledge.

**When to read from the wiki** (query mode):

- Creating any operational endpoint (/health, /version, /status)
- Setting up secrets management in a new service
- Writing Dockerfiles or docker-compose configurations
- Configuring CI/CD pipelines
- Adding database users or migrations
- Making architectural decisions that should be consistent across projects

**When to write back to the wiki** (ingest mode):

- When you discover something new that should apply across projects
- When a project reveals an exception or edge case to an existing pattern
- When a decision is made that future projects should follow
- When the human explicitly says "add this to the wiki"

Human-initiated wiki writes go directly to the live wiki with
`origin: manual`. Script-initiated writes go through `staging/` first.
See the wiki's own `CLAUDE.md` for the full ingest protocol.

## LLM Wiki — How to Search It

Use the `qmd` CLI for fast, structured search. DO NOT read `index.md`
for large queries — it's only for full-catalog browsing. DO NOT grep the
wiki manually when `qmd` is available.

The wiki has **three qmd collections**. Pick the right one for the
question:

### Default collection: `wiki` (live content)

For "what's our current pattern for X?" type questions. This is the
default — no `-c` flag needed.

```bash
# Keyword search (fast, BM25)
qmd search "health endpoint version" --json -n 5

# Semantic search (finds conceptually related pages)
qmd vsearch "how should API endpoints be structured" --json -n 5

# Best quality — hybrid BM25 + vector + LLM re-ranking
qmd query "health endpoint" --json -n 5

# Then read the matched page
cat ~/projects/wiki/patterns/health-endpoints.md
```

### Archive collection: `wiki-archive` (stale / superseded)

For "what was our OLD pattern before we changed it?" questions. This is
excluded from default searches; query explicitly with `-c wiki-archive`.

```bash
# "Did we once use Alpine? Why did we stop?"
qmd search "alpine" -c wiki-archive --json -n 5

# Semantic search across archive
qmd vsearch "container base image considerations" -c wiki-archive --json -n 5
```

When you cite content from an archived page, tell the user it's
archived and may be outdated.

### Conversations collection: `wiki-conversations` (mined session transcripts)

For "when did we discuss this, and what did we decide?" questions. This
is the mined history of your actual Claude Code sessions — decisions,
debugging breakthroughs, design discussions. Excluded from default
searches because transcripts would flood results.

```bash
# "When did we decide to use staging?"
qmd search "staging review workflow" -c wiki-conversations --json -n 5

# "What debugging did we do around Docker networking?"
qmd vsearch "docker network conflicts" -c wiki-conversations --json -n 5
```

Useful for:

- Tracing the reasoning behind a decision back to the session where it
  was made
- Finding a solution to a problem you remember solving but didn't write
  up
- Context-gathering when returning to a project after time away

### Searching across all collections

Rarely needed, but for "find everything on this topic across time":

```bash
qmd search "topic" -c wiki -c wiki-archive -c wiki-conversations --json -n 10
```

## LLM Wiki — Rules When Citing

1. **Always use `--json`** for structured qmd output. Never try to parse
   prose.
2. **Flag `confidence: low` pages** to the user when citing. The content
   may be aging out.
3. **Flag `status: pending` pages** (in `staging/`) as unverified when
   citing: "Note: this is from a pending wiki page that has not been
   human-reviewed yet."
4. **Flag archived pages** as "archived and may be outdated" when citing.
5. **Use `index.md` for browsing only**, not for targeted lookups. `qmd`
   is faster and more accurate.
6. **Prefer semantic search for conceptual queries**, keyword search for
   specific names/terms.

## LLM Wiki — Quick Reference

- `~/projects/wiki/CLAUDE.md` — Full wiki schema and operations (read this when working IN the wiki)
- `~/projects/wiki/index.md` — Content catalog (browse the full wiki)
- `~/projects/wiki/patterns/` — How things should be built
- `~/projects/wiki/decisions/` — Why we chose this approach
- `~/projects/wiki/environments/` — Where environments differ
- `~/projects/wiki/concepts/` — Foundational ideas
- `~/projects/wiki/raw/` — Immutable source material (never modify)
- `~/projects/wiki/staging/` — Pending automated content (flag when citing)
- `~/projects/wiki/archive/` — Stale content (flag when citing)
- `~/projects/wiki/conversations/` — Session history (search via `-c wiki-conversations`)

---

**End of additions for `~/.claude/CLAUDE.md`.**

See also the wiki's own `CLAUDE.md` at the wiki root — that file tells
the agent how to *maintain* the wiki when working inside it. This file
(the global one) tells the agent how to *consult* the wiki from anywhere
else.

278
docs/examples/wiki-CLAUDE.md
Normal file
@@ -0,0 +1,278 @@

# LLM Wiki — Schema

This is a persistent, compounding knowledge base maintained by LLM agents.
It captures the **why** behind patterns, decisions, and implementations —
not just the what. Copy this file to the root of your wiki directory
(i.e. `~/projects/wiki/CLAUDE.md`) and edit it for your own conventions.

> This is an example `CLAUDE.md` for the wiki root. The agent reads this
> at the start of every session when working inside the wiki. It's the
> "constitution" that tells the agent how to maintain the knowledge base.

## How This Wiki Works

**You are the maintainer.** When working in this wiki directory, you read
raw sources, compile knowledge into wiki pages, maintain cross-references,
and keep everything consistent.

**You are a consumer.** When working in any other project directory, you
read wiki pages to inform your work — applying established patterns,
respecting decisions, and understanding context.

## Directory Structure

```
wiki/
├── CLAUDE.md          ← You are here (schema)
├── index.md           ← Content catalog — read this FIRST on any query
├── log.md             ← Chronological record of all operations
│
├── patterns/          ← LIVE: HOW things should be built (with WHY)
├── decisions/         ← LIVE: WHY we chose this approach (with alternatives rejected)
├── environments/      ← LIVE: WHERE implementations differ
├── concepts/          ← LIVE: WHAT the foundational ideas are
│
├── raw/               ← Immutable source material (NEVER modify)
│   └── harvested/     ← URL harvester output
│
├── staging/           ← PENDING automated content awaiting human review
│   ├── index.md
│   └── <type>/
│
├── archive/           ← STALE / superseded (excluded from default search)
│   ├── index.md
│   └── <type>/
│
├── conversations/     ← Mined Claude Code session transcripts
│   ├── index.md
│   └── <wing>/        ← per-project or per-person (MemPalace "wing")
│
├── context/           ← Auto-updated AI session briefing
│   ├── wake-up.md     ← Loaded at the start of every session
│   └── active-concerns.md
│
├── reports/           ← Hygiene operation logs
└── scripts/           ← The automation pipeline
```

**Core rule — automated vs manual content**:

| Origin | Destination | Status |
|--------|-------------|--------|
| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
| Human-reviewed from staging | Live wiki (promoted) | `verified` |

Managed via `scripts/wiki-staging.py --list / --promote / --reject / --review`.

## Page Conventions

### Frontmatter (required on all wiki pages)

```yaml
---
title: Page Title
type: pattern | decision | environment | concept
confidence: high | medium | low
origin: manual | automated    # How the page entered the wiki
sources: [list of raw/ files this was compiled from]
related: [list of other wiki pages this connects to]
last_compiled: YYYY-MM-DD     # Date this page was last (re)compiled from sources
last_verified: YYYY-MM-DD     # Date the content was last confirmed accurate
---
```
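
The required-field list above can be checked mechanically. A minimal sketch, assuming pages begin with a `---`-delimited YAML block and that the top-level values are simple scalars; a real implementation would use `yaml.safe_load`, and `read_frontmatter` / `missing_fields` are illustrative names, not `wiki_lib`'s actual API:

```python
from pathlib import Path

REQUIRED = {"title", "type", "confidence", "origin",
            "sources", "related", "last_compiled", "last_verified"}

def read_frontmatter(path: Path) -> dict:
    """Parse the leading ----delimited block into a flat key/value dict."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---\n"):
        return {}
    block = text.split("---\n", 2)[1]      # between the first two --- lines
    fm = {}
    for line in block.splitlines():
        if ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip()
    return fm

def missing_fields(fm: dict) -> set[str]:
    """Which required frontmatter keys are absent."""
    return REQUIRED - fm.keys()
```

This is the kind of check `wiki-hygiene.py`'s frontmatter repair performs; the sketch only needs key presence, not full YAML semantics.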

**`origin` values**:

- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.

**Confidence decay**: Pages with no refresh signal for 6 months decay `high → medium`; 9 months → `low`; 12 months → `stale` (auto-archived). `last_verified` drives decay, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
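
The decay schedule is plain date arithmetic. A sketch of the threshold logic only, assuming whole-month granularity; the authoritative implementation lives in `scripts/wiki-hygiene.py`:

```python
from datetime import date

def decayed_confidence(last_verified: date, today: date) -> str:
    """Map months since last verification to a confidence ceiling
    (6/9/12-month thresholds from the schema above)."""
    months = (today.year - last_verified.year) * 12 + (today.month - last_verified.month)
    if months >= 12:
        return "stale"    # auto-archived
    if months >= 9:
        return "low"
    if months >= 6:
        return "medium"
    return "high"

print(decayed_confidence(date(2024, 1, 15), date(2024, 8, 1)))  # medium (7 months)
print(decayed_confidence(date(2023, 6, 1), date(2024, 8, 1)))   # stale (14 months)
```

Note this is a ceiling: a refresh signal (a conversation referencing the page) bumps `last_verified` and resets the clock, which is why `last_verified` rather than `last_compiled` drives decay.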

### Staging Frontmatter (pages in `staging/<type>/`)

Automated-origin pages get additional staging metadata that is **stripped on promotion**:

```yaml
---
title: ...
type: ...
origin: automated
status: pending               # Awaiting review
staged_date: YYYY-MM-DD       # When the automated script staged this
staged_by: wiki-harvest       # Which script staged it (wiki-harvest, wiki-hygiene, ...)
target_path: patterns/foo.md  # Where it should land on promotion
modifies: patterns/bar.md     # Only present when this is an update to an existing live page
compilation_notes: "..."      # AI's explanation of what it did and why
harvest_source: https://...   # Only present for URL-harvested content
sources: [...]
related: [...]
last_verified: YYYY-MM-DD
---
```
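
Promotion is conceptually just "strip the staging-only keys and write the page to `target_path`". A sketch under that assumption — `wiki-staging.py --promote` is the real mechanism, and the exact field handling here is inferred from the spec above:

```python
STAGING_ONLY = {"status", "staged_date", "staged_by", "target_path",
                "modifies", "compilation_notes", "harvest_source"}

def promote(frontmatter: dict) -> tuple[str, dict]:
    """Return (destination path, cleaned frontmatter) for a staged page."""
    dest = frontmatter["target_path"]
    cleaned = {k: v for k, v in frontmatter.items() if k not in STAGING_ONLY}
    return dest, cleaned

staged = {
    "title": "Foo", "type": "pattern", "origin": "automated",
    "status": "pending", "staged_date": "2024-06-01",
    "staged_by": "wiki-harvest", "target_path": "patterns/foo.md",
    "compilation_notes": "compiled from harvested source",
    "last_verified": "2024-06-01",
}
dest, fm = promote(staged)
print(dest)             # patterns/foo.md
print("status" in fm)   # False
```

Whatever survives (`title`, `type`, `origin`, `sources`, `related`, `last_verified`) matches the required-frontmatter spec, which is the point: a promoted page looks like any other live page.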

### Pattern Pages (`patterns/`)

Structure:

1. **What** — One-paragraph description of the pattern
2. **Why** — The reasoning, constraints, and goals that led to this pattern
3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
4. **Structure** — The specification: fields, endpoints, formats, conventions
5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
6. **History** — Key changes and the decisions that drove them

### Decision Pages (`decisions/`)

Structure:

1. **Decision** — One sentence: what we decided
2. **Context** — What problem or constraint prompted this
3. **Options Considered** — What alternatives existed (with pros/cons)
4. **Rationale** — Why this option won
5. **Consequences** — What this decision enables and constrains
6. **Status** — Active | Superseded by [link] | Under Review

### Environment Pages (`environments/`)

Structure:

1. **Overview** — What this environment is (platform, CI, infra)
2. **Key Differences** — Table comparing environments for this domain
3. **Implementation Details** — Environment-specific configs, credentials, deploy method
4. **Gotchas** — Things that have bitten us

### Concept Pages (`concepts/`)

Structure:

1. **Definition** — What this concept means in our context
2. **Why It Matters** — How this concept shapes our decisions
3. **Related Patterns** — Links to patterns that implement this concept
4. **Related Decisions** — Links to decisions driven by this concept

## Operations

### Ingest (adding new knowledge)

When a new raw source is added or you learn something new:

1. Read the source material thoroughly
2. Identify which existing wiki pages need updating
3. Identify if new pages are needed
4. Update/create pages following the conventions above
5. Update cross-references (`related:` frontmatter) on all affected pages
6. Update `index.md` with any new pages
7. Set `last_verified:` to today's date on every page you create or update
8. Set `origin: manual` on any page a human directed you to create
9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`

**Where to write**:

- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.

### Query (answering questions from other projects)

When working in another project and consulting the wiki:

1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
2. Read the specific pattern/decision/concept pages
3. Apply the knowledge, respecting environment differences
4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)
|
||||||
|
|
||||||
|
### Search Strategy — which qmd collection to use
|
||||||
|
|
||||||
|
The wiki has three qmd collections. Pick the right one for the question:
|
||||||
|
|
||||||
|
| Question type | Collection | Command |
|
||||||
|
|---|---|---|
|
||||||
|
| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
|
||||||
|
| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
|
||||||
|
| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
|
||||||
|
| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
|
||||||
|
| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |
|
||||||
|
|
||||||
|
**Rules of thumb**:
|
||||||
|
- Use `qmd search` for keyword matches (BM25, fast)
|
||||||
|
- Use `qmd vsearch` for conceptual / semantically-similar queries (vector)
|
||||||
|
- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
|
||||||
|
- Always use `--json` for structured output
|
||||||
|
- Read individual matched pages with `cat` or your file tool after finding them
### Mine (conversation extraction and summarization)

Four-phase pipeline that extracts sessions into searchable conversation pages:

1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
3. **Index** (`update-conversation-index.py --reindex`) — Regenerate conversation index + `context/wake-up.md`
4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages

Full pipeline via `mine-conversations.sh`. Extraction is incremental (tracks byte offsets). Summarization is incremental (tracks message count).
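The incremental contract is simple: each phase records how far it got and resumes from there. A minimal sketch of the byte-offset half, reading the same `sessions` / `byte_offset` shape that `.mine-state.json` uses (`new_bytes_to_process` is an illustrative helper, not one of the pipeline scripts):

```python
import json
from pathlib import Path

def new_bytes_to_process(state_file: Path, session_id: str, file_size: int) -> int:
    """How many unprocessed bytes a session file has, per the saved state."""
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"sessions": {}}  # first run: nothing processed yet
    offset = state["sessions"].get(session_id, {}).get("byte_offset", 0)
    return max(0, file_size - offset)
```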
### Maintain (wiki health automation)

`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:

```bash
bash scripts/wiki-maintain.sh                 # Harvest + quick hygiene + reindex
bash scripts/wiki-maintain.sh --full          # Harvest + full hygiene (LLM) + reindex
bash scripts/wiki-maintain.sh --harvest-only  # Harvest only
bash scripts/wiki-maintain.sh --hygiene-only  # Hygiene only
bash scripts/wiki-maintain.sh --dry-run       # Show what would run
```

### Lint (periodic health check)

Automated via `scripts/wiki-hygiene.py`. Two tiers:

**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):
- Backfill missing `last_verified`
- Refresh `last_verified` from conversation `related:` references
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing required fields, invalid values)
- Confidence decay per 6/9/12-month thresholds
- Archive stale and superseded pages
- Orphan pages (auto-linked into `index.md`)
- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
- Main index drift (auto-add missing entries, remove stale ones)
- Empty stubs (report-only)
- State file drift (report-only)
- Staging/archive index resync
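The confidence-decay check can be read as a step function of page age. A minimal sketch, assuming three confidence levels plus a stale bucket (the 6/9/12-month thresholds come from the list above; the level names themselves are illustrative):

```python
from datetime import date

def decayed_confidence(last_verified: date, today: date) -> str:
    """Map months since last_verified onto the 6/9/12-month decay thresholds."""
    months = (today.year - last_verified.year) * 12 + (today.month - last_verified.month)
    if months < 6:
        return "high"    # recently verified, leave as-is
    if months < 9:
        return "medium"  # 6-9 months: first decay step
    if months < 12:
        return "low"     # 9-12 months: second decay step
    return "stale"       # 12+ months: archive candidate
```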
**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):
- Everything in quick mode, plus:
- Missing cross-references between related pages (haiku)
- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
- Contradictions between pages (sonnet, report-only)
- Technology lifecycle — flag pages referencing versions older than what's in recent conversations

**Reports** (written to `reports/`):
- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment

## Cross-Reference Conventions

- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
- Link to raw sources: `[Source](../raw/path/to/file.md)`
- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`
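The two path styles differ only in what they are relative to: `related:` entries are wiki-root-relative, body links are page-relative. A minimal sketch of converting one into the other (helper name is ours; assumes pages live exactly one directory below the wiki root, as in the examples above):

```python
from pathlib import PurePosixPath

def related_to_markdown_link(related: str, from_page: str) -> str:
    """Turn a wiki-root-relative `related:` entry into a page-relative markdown link."""
    target = PurePosixPath(related)              # e.g. patterns/secrets-at-startup.md
    source_dir = PurePosixPath(from_page).parent
    hops = "../" * len(source_dir.parts)         # climb back to the wiki root
    return f"[{target.stem}]({hops}{target})"

print(related_to_markdown_link("patterns/secrets-at-startup.md", "decisions/no-alpine.md"))
# → [secrets-at-startup](../patterns/secrets-at-startup.md)
```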
## Naming Conventions

- Filenames: `kebab-case.md`
- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)
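If you generate filenames programmatically, the kebab-case rule is easy to enforce. A minimal sketch (illustrative helper, not part of the pipeline scripts):

```python
import re

def kebab_filename(title: str) -> str:
    """Lowercase, collapse runs of non-alphanumerics to single hyphens, append .md."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}.md"

print(kebab_filename("Build Once, Deploy Many"))  # → build-once-deploy-many.md
```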
## Customization Notes

Things you should change for your own wiki:

1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.

2. **Page-type sections** — the "Structure" blocks under each page type are for my use. Define your own conventions.

3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.

4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).

5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.
810 scripts/extract-sessions.py (Executable file)
@@ -0,0 +1,810 @@
#!/usr/bin/env python3
"""Extract Claude Code session JSONL files into clean markdown transcripts.

Phase A of the conversation mining pipeline. Deterministic, no LLM dependency.
Handles incremental extraction via byte offset tracking for sessions that span
hours or days.

Usage:
    python3 extract-sessions.py                     # Extract all new sessions
    python3 extract-sessions.py --project mc        # Extract one project
    python3 extract-sessions.py --session 0a543572  # Extract specific session
    python3 extract-sessions.py --dry-run           # Show what would be extracted
"""

from __future__ import annotations

import argparse
import json
import os
import re
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

CLAUDE_PROJECTS_DIR = Path(os.environ.get("CLAUDE_PROJECTS_DIR", str(Path.home() / ".claude" / "projects")))
WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
MINE_STATE_FILE = WIKI_DIR / ".mine-state.json"

# ════════════════════════════════════════════════════════════════════════════
# CONFIGURE ME — Map Claude project directory suffixes to wiki project codes
# ════════════════════════════════════════════════════════════════════════════
#
# Claude Code stores sessions under ~/.claude/projects/<hashed-path>/. The
# directory name is derived from the absolute path of your project, so it
# looks like `-Users-alice-projects-myapp` or `-home-alice-projects-myapp`.
#
# This map tells the extractor which suffix maps to which short wiki code
# (the "wing"). More specific suffixes should appear first — the extractor
# picks the first match. Everything unmatched goes into `general/`.
#
# Examples — replace with your own projects:
PROJECT_MAP: dict[str, str] = {
    # More specific suffixes first
    "projects-wiki": "wiki",  # this wiki itself
    "-claude": "cl",          # ~/.claude config repo
    # Add your real projects here:
    # "my-webapp": "web",
    # "my-mobile-app": "mob",
    # "work-mono-repo": "work",
    # Catch-all — Claude sessions outside any tracked project
    "-home": "general",
    "-Users": "general",
}

# Tool call names to keep full output for
KEEP_FULL_OUTPUT_TOOLS = {"Bash", "Skill"}

# Tool call names to summarize (just note what was accessed)
SUMMARIZE_TOOLS = {"Read", "Glob", "Grep"}

# Tool call names to keep with path + change summary
KEEP_CHANGE_TOOLS = {"Edit", "Write"}

# Tool call names to keep description + result summary
KEEP_SUMMARY_TOOLS = {"Agent"}

# Max lines of Bash output to keep
MAX_BASH_OUTPUT_LINES = 200

# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


def load_state() -> dict[str, Any]:
    """Load mining state from .mine-state.json."""
    if MINE_STATE_FILE.exists():
        with open(MINE_STATE_FILE) as f:
            return json.load(f)
    return {"sessions": {}, "last_run": None}


def save_state(state: dict[str, Any]) -> None:
    """Save mining state to .mine-state.json."""
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    with open(MINE_STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)
# ---------------------------------------------------------------------------
# Project mapping
# ---------------------------------------------------------------------------


def resolve_project_code(dir_name: str) -> str | None:
    """Map a Claude project directory name to a wiki project code.

    Directory names look like: -Users-alice-projects-myapp or -home-alice-projects-myapp
    """
    for suffix, code in PROJECT_MAP.items():
        if dir_name.endswith(suffix):
            return code
    return None


def discover_sessions(
    project_filter: str | None = None,
    session_filter: str | None = None,
) -> list[dict[str, Any]]:
    """Discover JSONL session files from Claude projects directory."""
    sessions = []

    if not CLAUDE_PROJECTS_DIR.exists():
        print(f"Claude projects directory not found: {CLAUDE_PROJECTS_DIR}", file=sys.stderr)
        return sessions

    for proj_dir in sorted(CLAUDE_PROJECTS_DIR.iterdir()):
        if not proj_dir.is_dir():
            continue

        code = resolve_project_code(proj_dir.name)
        if code is None:
            continue

        if project_filter and code != project_filter:
            continue

        for jsonl_file in sorted(proj_dir.glob("*.jsonl")):
            session_id = jsonl_file.stem
            if session_filter and not session_id.startswith(session_filter):
                continue

            sessions.append({
                "session_id": session_id,
                "project": code,
                "jsonl_path": jsonl_file,
                "file_size": jsonl_file.stat().st_size,
            })

    return sessions
# ---------------------------------------------------------------------------
# JSONL parsing and filtering
# ---------------------------------------------------------------------------


def extract_timestamp(obj: dict[str, Any]) -> str | None:
    """Get timestamp from a JSONL record."""
    ts = obj.get("timestamp")
    if isinstance(ts, str):
        return ts
    if isinstance(ts, (int, float)):
        return datetime.fromtimestamp(ts / 1000, tz=timezone.utc).isoformat()
    return None


def extract_session_date(obj: dict[str, Any]) -> str:
    """Get date string (YYYY-MM-DD) from a JSONL record timestamp."""
    ts = extract_timestamp(obj)
    if ts:
        try:
            dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
            return dt.strftime("%Y-%m-%d")
        except (ValueError, TypeError):
            pass
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")


def truncate_lines(text: str, max_lines: int) -> str:
    """Truncate text to max_lines, adding a note if truncated."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    kept = lines[:max_lines]
    omitted = len(lines) - max_lines
    kept.append(f"\n[... {omitted} lines truncated ...]")
    return "\n".join(kept)


def format_tool_use(name: str, input_data: dict[str, Any]) -> str | None:
    """Format a tool_use content block for the transcript."""
    if name in KEEP_FULL_OUTPUT_TOOLS:
        if name == "Bash":
            cmd = input_data.get("command", "")
            desc = input_data.get("description", "")
            label = desc if desc else cmd[:100]
            return f"**[Bash]**: `{label}`"
        if name == "Skill":
            skill = input_data.get("skill", "")
            args = input_data.get("args", "")
            return f"**[Skill]**: /{skill} {args}".strip()

    if name in SUMMARIZE_TOOLS:
        if name == "Read":
            fp = input_data.get("file_path", "?")
            return f"[Read: {fp}]"
        if name == "Glob":
            pattern = input_data.get("pattern", "?")
            return f"[Glob: {pattern}]"
        if name == "Grep":
            pattern = input_data.get("pattern", "?")
            path = input_data.get("path", "")
            return f"[Grep: '{pattern}' in {path}]" if path else f"[Grep: '{pattern}']"

    if name in KEEP_CHANGE_TOOLS:
        if name == "Edit":
            fp = input_data.get("file_path", "?")
            old = input_data.get("old_string", "")[:60]
            return f"**[Edit]**: {fp} — replaced '{old}...'"
        if name == "Write":
            fp = input_data.get("file_path", "?")
            content_len = len(input_data.get("content", ""))
            return f"**[Write]**: {fp} ({content_len} chars)"

    if name in KEEP_SUMMARY_TOOLS:
        if name == "Agent":
            desc = input_data.get("description", "?")
            return f"**[Agent]**: {desc}"

    if name == "ToolSearch":
        return None  # noise
    if name == "TaskCreate":
        subj = input_data.get("subject", "?")
        return f"[TaskCreate: {subj}]"
    if name == "TaskUpdate":
        tid = input_data.get("taskId", "?")
        status = input_data.get("status", "?")
        return f"[TaskUpdate: #{tid} → {status}]"

    # Default: note the tool was called
    return f"[{name}]"


def format_tool_result(
    tool_name: str | None,
    content: Any,
    is_error: bool = False,
) -> str | None:
    """Format a tool_result content block for the transcript."""
    text = ""
    if isinstance(content, str):
        text = content
    elif isinstance(content, list):
        parts = []
        for item in content:
            if isinstance(item, dict) and item.get("type") == "text":
                parts.append(item.get("text", ""))
        text = "\n".join(parts)

    if not text.strip():
        return None

    if is_error:
        return f"**[ERROR]**:\n```\n{truncate_lines(text, MAX_BASH_OUTPUT_LINES)}\n```"

    if tool_name in KEEP_FULL_OUTPUT_TOOLS:
        return f"```\n{truncate_lines(text, MAX_BASH_OUTPUT_LINES)}\n```"

    if tool_name in SUMMARIZE_TOOLS:
        # Just note the result size
        line_count = len(text.splitlines())
        char_count = len(text)
        return f"[→ {line_count} lines, {char_count} chars]"

    if tool_name in KEEP_CHANGE_TOOLS:
        return None  # The tool_use already captured what changed

    if tool_name in KEEP_SUMMARY_TOOLS:
        # Keep a summary of agent results
        summary = text[:300]
        if len(text) > 300:
            summary += "..."
        return f"> {summary}"

    return None
def parse_content_blocks(
    content: list[dict[str, Any]],
    role: str,
    tool_id_to_name: dict[str, str],
) -> list[str]:
    """Parse content blocks from a message into transcript lines."""
    parts: list[str] = []

    for block in content:
        block_type = block.get("type")

        if block_type == "text":
            text = block.get("text", "").strip()
            if not text:
                continue
            # Skip system-reminder content
            if "<system-reminder>" in text:
                # Strip system reminder tags and their content
                text = re.sub(
                    r"<system-reminder>.*?</system-reminder>",
                    "",
                    text,
                    flags=re.DOTALL,
                ).strip()
            # Skip local-command noise
            if text.startswith("<local-command"):
                continue
            if text:
                parts.append(text)

        elif block_type == "thinking":
            # Skip thinking blocks
            continue

        elif block_type == "tool_use":
            tool_name = block.get("name", "unknown")
            tool_id = block.get("id", "")
            input_data = block.get("input", {})
            tool_id_to_name[tool_id] = tool_name
            formatted = format_tool_use(tool_name, input_data)
            if formatted:
                parts.append(formatted)

        elif block_type == "tool_result":
            tool_id = block.get("tool_use_id", "")
            tool_name = tool_id_to_name.get(tool_id)
            is_error = block.get("is_error", False)
            result_content = block.get("content", "")
            formatted = format_tool_result(tool_name, result_content, is_error)
            if formatted:
                parts.append(formatted)

    return parts
def process_jsonl(
    jsonl_path: Path,
    byte_offset: int = 0,
) -> tuple[list[str], dict[str, Any]]:
    """Process a JSONL session file and return transcript lines + metadata.

    Args:
        jsonl_path: Path to the JSONL file
        byte_offset: Start reading from this byte position (for incremental)

    Returns:
        Tuple of (transcript_lines, metadata_dict)
    """
    transcript_lines: list[str] = []
    metadata: dict[str, Any] = {
        "first_date": None,
        "last_date": None,
        "message_count": 0,
        "human_messages": 0,
        "assistant_messages": 0,
        "git_branch": None,
        "new_byte_offset": 0,
    }

    # Map tool_use IDs to tool names for correlating results
    tool_id_to_name: dict[str, str] = {}

    # Track when a command/skill was just invoked so the next user message
    # (the skill prompt injection) gets labeled correctly
    last_command_name: str | None = None

    with open(jsonl_path, "rb") as f:
        if byte_offset > 0:
            f.seek(byte_offset)

        for raw_line in f:
            try:
                obj = json.loads(raw_line)
            except json.JSONDecodeError:
                continue

            record_type = obj.get("type")

            # Skip non-message types
            if record_type not in ("user", "assistant"):
                continue

            msg = obj.get("message", {})
            role = msg.get("role", record_type)
            content = msg.get("content", "")

            # Track metadata
            date = extract_session_date(obj)
            if metadata["first_date"] is None:
                metadata["first_date"] = date
            metadata["last_date"] = date
            metadata["message_count"] += 1

            if not metadata["git_branch"]:
                metadata["git_branch"] = obj.get("gitBranch")

            if role == "user":
                metadata["human_messages"] += 1
            elif role == "assistant":
                metadata["assistant_messages"] += 1

            # Process content
            if isinstance(content, str):
                text = content.strip()
                # Skip system-reminder and local-command noise
                if "<system-reminder>" in text:
                    text = re.sub(
                        r"<system-reminder>.*?</system-reminder>",
                        "",
                        text,
                        flags=re.DOTALL,
                    ).strip()
                if text.startswith("<local-command"):
                    continue
                if text.startswith("<command-name>/exit"):
                    continue

                # Detect command/skill invocation: <command-name>/foo</command-name>
                cmd_match = re.search(
                    r"<command-name>/([^<]+)</command-name>", text,
                )
                if cmd_match:
                    last_command_name = cmd_match.group(1)
                    # Keep just a brief note about the command invocation
                    transcript_lines.append(
                        f"**Human**: /{last_command_name}"
                    )
                    transcript_lines.append("")
                    continue

                # Detect skill prompt injection (large structured text after a command)
                if (
                    last_command_name
                    and role == "user"
                    and len(text) > 500
                ):
                    # This is the skill's injected prompt — summarize it
                    transcript_lines.append(
                        f"[Skill prompt: /{last_command_name} — {len(text)} chars]"
                    )
                    transcript_lines.append("")
                    last_command_name = None
                    continue

                # Also detect skill prompts by content pattern (catches cases
                # where the command-name message wasn't separate, or where the
                # prompt arrives without a preceding command-name tag)
                if (
                    role == "user"
                    and len(text) > 500
                    and re.match(
                        r"^##\s*(Tracking|Step|Context|Instructions|Overview|Goal)",
                        text,
                    )
                ):
                    # Structured skill prompt — try to extract command name
                    cmd_in_text = re.search(
                        r'--command\s+"([^"]+)"', text,
                    )
                    prompt_label = cmd_in_text.group(1) if cmd_in_text else (last_command_name or "unknown")
                    transcript_lines.append(
                        f"[Skill prompt: /{prompt_label} — {len(text)} chars]"
                    )
                    transcript_lines.append("")
                    last_command_name = None
                    continue

                last_command_name = None  # Reset after non-matching message

                if text:
                    label = "**Human**" if role == "user" else "**Assistant**"
                    transcript_lines.append(f"{label}: {text}")
                    transcript_lines.append("")

            elif isinstance(content, list):
                # Check if this is a skill prompt in list form
                is_skill_prompt = False
                skill_prompt_name = last_command_name
                if role == "user":
                    for block in content:
                        if block.get("type") == "text":
                            block_text = block.get("text", "").strip()
                            # Detect by preceding command name
                            if last_command_name and len(block_text) > 500:
                                is_skill_prompt = True
                                break
                            # Detect by content pattern (## Tracking, etc.)
                            if (
                                len(block_text) > 500
                                and re.match(
                                    r"^##\s*(Tracking|Step|Context|Instructions|Overview|Goal)",
                                    block_text,
                                )
                            ):
                                is_skill_prompt = True
                                # Try to extract command name from content
                                cmd_in_text = re.search(
                                    r'--command\s+"([^"]+)"', block_text,
                                )
                                if cmd_in_text:
                                    skill_prompt_name = cmd_in_text.group(1)
                                break

                if is_skill_prompt:
                    total_len = sum(
                        len(b.get("text", ""))
                        for b in content
                        if b.get("type") == "text"
                    )
                    label = skill_prompt_name or "unknown"
                    transcript_lines.append(
                        f"[Skill prompt: /{label} — {total_len} chars]"
                    )
                    transcript_lines.append("")
                    last_command_name = None
                    continue

                last_command_name = None

                parts = parse_content_blocks(content, role, tool_id_to_name)
                if parts:
                    # Determine if this is a tool result message (user role but
                    # contains only tool_result blocks — these are tool outputs,
                    # not human input)
                    has_only_tool_results = all(
                        b.get("type") in ("tool_result",)
                        for b in content
                        if b.get("type") != "text" or b.get("text", "").strip()
                    ) and any(b.get("type") == "tool_result" for b in content)

                    if has_only_tool_results:
                        # Tool results — no speaker label, just the formatted output
                        for part in parts:
                            transcript_lines.append(part)
                    elif role == "user":
                        # Check if there's actual human text (not just tool results)
                        has_human_text = any(
                            b.get("type") == "text"
                            and b.get("text", "").strip()
                            and "<system-reminder>" not in b.get("text", "")
                            for b in content
                        )
                        label = "**Human**" if has_human_text else "**Assistant**"
                        if len(parts) == 1:
                            transcript_lines.append(f"{label}: {parts[0]}")
                        else:
                            transcript_lines.append(f"{label}:")
                            for part in parts:
                                transcript_lines.append(part)
                    else:
                        label = "**Assistant**"
                        if len(parts) == 1:
                            transcript_lines.append(f"{label}: {parts[0]}")
                        else:
                            transcript_lines.append(f"{label}:")
                            for part in parts:
                                transcript_lines.append(part)
                    transcript_lines.append("")

        metadata["new_byte_offset"] = f.tell()

    return transcript_lines, metadata
# ---------------------------------------------------------------------------
# Markdown generation
# ---------------------------------------------------------------------------


def build_frontmatter(
    session_id: str,
    project: str,
    date: str,
    message_count: int,
    git_branch: str | None = None,
) -> str:
    """Build YAML frontmatter for a conversation markdown file."""
    lines = [
        "---",
        f"title: Session {session_id[:8]}",
        "type: conversation",
        f"project: {project}",
        f"date: {date}",
        f"session_id: {session_id}",
        f"messages: {message_count}",
        "status: extracted",
    ]
    if git_branch:
        lines.append(f"git_branch: {git_branch}")
    lines.append("---")
    return "\n".join(lines)


def write_new_conversation(
    output_path: Path,
    session_id: str,
    project: str,
    transcript_lines: list[str],
    metadata: dict[str, Any],
) -> None:
    """Write a new conversation markdown file."""
    date = metadata["first_date"] or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    frontmatter = build_frontmatter(
        session_id=session_id,
        project=project,
        date=date,
        message_count=metadata["message_count"],
        git_branch=metadata.get("git_branch"),
    )

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        f.write(frontmatter)
        f.write("\n\n## Transcript\n\n")
        f.write("\n".join(transcript_lines))
        f.write("\n")


def append_to_conversation(
    output_path: Path,
    transcript_lines: list[str],
    new_message_count: int,
) -> None:
    """Append new transcript content to an existing conversation file.

    Updates the message count in frontmatter and appends new transcript lines.
    """
    content = output_path.read_text()

    # Update message count in frontmatter
    content = re.sub(
        r"^messages: \d+$",
        f"messages: {new_message_count}",
        content,
        count=1,
        flags=re.MULTILINE,
    )

    # Add last_updated
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    if "last_updated:" in content:
        content = re.sub(
            r"^last_updated: .+$",
            f"last_updated: {today}",
            content,
            count=1,
            flags=re.MULTILINE,
        )
    else:
        content = content.replace(
            "\nstatus: extracted",
            f"\nlast_updated: {today}\nstatus: extracted",
        )

    # Append new transcript
    with open(output_path, "w") as f:
        f.write(content)
        if not content.endswith("\n"):
            f.write("\n")
        f.write("\n".join(transcript_lines))
        f.write("\n")
# ---------------------------------------------------------------------------
# Main extraction logic
# ---------------------------------------------------------------------------


def extract_session(
    session_info: dict[str, Any],
    state: dict[str, Any],
    dry_run: bool = False,
) -> bool:
    """Extract a single session. Returns True if work was done."""
    session_id = session_info["session_id"]
    project = session_info["project"]
    jsonl_path = session_info["jsonl_path"]
    file_size = session_info["file_size"]

    # Check state for prior extraction
    session_state = state["sessions"].get(session_id, {})
    last_offset = session_state.get("byte_offset", 0)

    # Skip if no new content
    if file_size <= last_offset:
        return False

    is_incremental = last_offset > 0

    if dry_run:
        mode = "append" if is_incremental else "new"
        new_bytes = file_size - last_offset
        print(f"  [{mode}] {project}/{session_id[:8]} — {new_bytes:,} new bytes")
        return True

    # Parse the JSONL
    transcript_lines, metadata = process_jsonl(jsonl_path, byte_offset=last_offset)

    if not transcript_lines:
        # Update offset even if no extractable content
        state["sessions"][session_id] = {
            "project": project,
            "byte_offset": metadata["new_byte_offset"],
            "message_count": session_state.get("message_count", 0),
            "last_extracted": datetime.now(timezone.utc).isoformat(),
            "summarized_through_msg": session_state.get("summarized_through_msg", 0),
        }
        return False

    # Determine output path
    date = metadata["first_date"] or datetime.now(timezone.utc).strftime("%Y-%m-%d")
    if is_incremental:
        # Use existing output file
        output_file = session_state.get("output_file", "")
        output_path = WIKI_DIR / output_file if output_file else None
    else:
        output_path = None

    if output_path is None or not output_path.exists():
        filename = f"{date}-{session_id[:8]}.md"
        output_path = CONVERSATIONS_DIR / project / filename

    # Write or append
    total_messages = session_state.get("message_count", 0) + metadata["message_count"]

    if is_incremental and output_path.exists():
        append_to_conversation(output_path, transcript_lines, total_messages)
        print(f"  [append] {project}/{output_path.name} — +{metadata['message_count']} messages")
    else:
        write_new_conversation(output_path, session_id, project, transcript_lines, metadata)
        print(f"  [new] {project}/{output_path.name} — {metadata['message_count']} messages")

    # Update state
    state["sessions"][session_id] = {
        "project": project,
        "output_file": str(output_path.relative_to(WIKI_DIR)),
        "byte_offset": metadata["new_byte_offset"],
        "message_count": total_messages,
        "last_extracted": datetime.now(timezone.utc).isoformat(),
        "summarized_through_msg": session_state.get("summarized_through_msg", 0),
    }

    return True
def main() -> None:
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Extract Claude Code sessions into markdown transcripts",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--project",
|
||||||
|
help="Only extract sessions for this project code (e.g., mc, if, lp)",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--session",
|
||||||
|
help="Only extract this specific session (prefix match on session ID)",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--dry-run",
|
||||||
|
action="store_true",
|
||||||
|
help="Show what would be extracted without writing files",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--force",
|
||||||
|
action="store_true",
|
||||||
|
help="Re-extract from the beginning, ignoring saved byte offsets",
|
||||||
|
)
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
state = load_state()
|
||||||
|
|
||||||
|
if args.force:
|
||||||
|
# Reset all byte offsets
|
||||||
|
for sid in state["sessions"]:
|
||||||
|
state["sessions"][sid]["byte_offset"] = 0
|
||||||
|
|
||||||
|
# Discover sessions
|
||||||
|
sessions = discover_sessions(
|
||||||
|
project_filter=args.project,
|
||||||
|
session_filter=args.session,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not sessions:
|
||||||
|
print("No sessions found matching filters.")
|
||||||
|
return
|
||||||
|
|
||||||
|
print(f"Found {len(sessions)} session(s) to check...")
|
||||||
|
if args.dry_run:
|
||||||
|
print("DRY RUN — no files will be written\n")
|
||||||
|
|
||||||
|
extracted = 0
|
||||||
|
for session_info in sessions:
|
||||||
|
if extract_session(session_info, state, dry_run=args.dry_run):
|
||||||
|
extracted += 1
|
||||||
|
|
||||||
|
if extracted == 0:
|
||||||
|
print("No new content to extract.")
|
||||||
|
else:
|
||||||
|
print(f"\nExtracted {extracted} session(s).")
|
||||||
|
|
||||||
|
if not args.dry_run:
|
||||||
|
save_state(state)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
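The incremental-read mechanism behind `extract_session` can be sketched on its own. This is a minimal, self-contained illustration of byte-offset resumption; `read_new_lines` is a hypothetical helper for this sketch, not a function from the script above, which does its seeking inside `process_jsonl`.

```python
import json
import tempfile
from pathlib import Path


def read_new_lines(jsonl_path: Path, byte_offset: int) -> tuple[list[dict], int]:
    """Read only the JSONL records appended since the saved byte offset."""
    with open(jsonl_path, "rb") as f:
        f.seek(byte_offset)  # resume where the previous run stopped
        data = f.read()
    records = [json.loads(line) for line in data.splitlines() if line.strip()]
    return records, byte_offset + len(data)


# Demo: the first run reads everything; the second sees only the appended record.
path = Path(tempfile.mkdtemp()) / "session.jsonl"
path.write_bytes(b'{"n": 1}\n')
first, offset = read_new_lines(path, 0)
with open(path, "ab") as f:
    f.write(b'{"n": 2}\n')
second, offset = read_new_lines(path, offset)
```

Because the returned offset always equals the file size at read time, re-running the pipeline against an unchanged session file is a no-op, which is what makes the `file_size <= last_offset` early exit safe.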
118
scripts/mine-conversations.sh
Executable file
@@ -0,0 +1,118 @@
#!/usr/bin/env bash
set -euo pipefail

# mine-conversations.sh — Top-level orchestrator for conversation mining pipeline
#
# Chains: Extract (Python) → Summarize (llama.cpp) → Index (Python)
#
# Usage:
#   mine-conversations.sh                    # Full pipeline
#   mine-conversations.sh --extract-only     # Phase A only (no LLM)
#   mine-conversations.sh --summarize-only   # Phase B only (requires llama-server)
#   mine-conversations.sh --index-only       # Phase C only
#   mine-conversations.sh --project mc       # Filter to one project
#   mine-conversations.sh --dry-run          # Show what would be done

# Resolve script location first so sibling scripts are found regardless of WIKI_DIR
SCRIPTS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WIKI_DIR="${WIKI_DIR:-$(dirname "${SCRIPTS_DIR}")}"
LOG_FILE="${SCRIPTS_DIR}/.mine.log"

# ---------------------------------------------------------------------------
# Argument parsing
# ---------------------------------------------------------------------------

EXTRACT=true
SUMMARIZE=true
INDEX=true
PROJECT=""
DRY_RUN=""
EXTRA_ARGS=()

while [[ $# -gt 0 ]]; do
  case "$1" in
    --extract-only)
      SUMMARIZE=false
      INDEX=false
      shift
      ;;
    --summarize-only)
      EXTRACT=false
      INDEX=false
      shift
      ;;
    --index-only)
      EXTRACT=false
      SUMMARIZE=false
      shift
      ;;
    --project)
      PROJECT="$2"
      shift 2
      ;;
    --dry-run)
      DRY_RUN="--dry-run"
      shift
      ;;
    *)
      EXTRA_ARGS+=("$1")
      shift
      ;;
  esac
done

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

log() {
  local msg
  msg="[$(date '+%Y-%m-%d %H:%M:%S')] $*"
  echo "${msg}" | tee -a "${LOG_FILE}"
}

# ---------------------------------------------------------------------------
# Pipeline
# ---------------------------------------------------------------------------

mkdir -p "${WIKI_DIR}/scripts"

log "=== Conversation mining started ==="

# Phase A: Extract
if [[ "${EXTRACT}" == true ]]; then
  log "Phase A: Extracting sessions..."
  local_args=()
  if [[ -n "${PROJECT}" ]]; then
    local_args+=(--project "${PROJECT}")
  fi
  if [[ -n "${DRY_RUN}" ]]; then
    local_args+=(--dry-run)
  fi
  python3 "${SCRIPTS_DIR}/extract-sessions.py" "${local_args[@]}" "${EXTRA_ARGS[@]}" 2>&1 | tee -a "${LOG_FILE}"
fi

# Phase B: Summarize
if [[ "${SUMMARIZE}" == true ]]; then
  log "Phase B: Summarizing conversations..."
  local_args=()
  if [[ -n "${PROJECT}" ]]; then
    local_args+=(--project "${PROJECT}")
  fi
  if [[ -n "${DRY_RUN}" ]]; then
    local_args+=(--dry-run)
  fi
  python3 "${SCRIPTS_DIR}/summarize-conversations.py" "${local_args[@]}" "${EXTRA_ARGS[@]}" 2>&1 | tee -a "${LOG_FILE}"
fi

# Phase C: Index
if [[ "${INDEX}" == true ]]; then
  log "Phase C: Updating index and context..."
  local_args=()
  if [[ -z "${DRY_RUN}" ]]; then
    local_args+=(--reindex)
  fi
  python3 "${SCRIPTS_DIR}/update-conversation-index.py" "${local_args[@]}" 2>&1 | tee -a "${LOG_FILE}"
fi

log "=== Conversation mining complete ==="
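Since the orchestrator logs to its own file and takes no interactive input, it lends itself to scheduling. A possible crontab entry (a sketch, not shipped with the repo; the path and time are assumptions) would be:

```shell
# m  h  dom mon dow  command
# Run the full mining pipeline nightly at 03:15. The script tees its own
# output into scripts/.mine.log, so stdout can be discarded here.
15 3 * * * WIKI_DIR="$HOME/projects/wiki" "$HOME/projects/wiki/scripts/mine-conversations.sh" >/dev/null
```

Summarization requires a running llama-server (or the `claude` CLI), so a scheduled run should either ensure the server is up or use `--extract-only` and summarize interactively.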
40
scripts/mine-prompt-v2.md
Normal file
@@ -0,0 +1,40 @@
You analyze AI coding assistant conversation transcripts and produce structured JSON summaries.

Read the transcript, then output a single JSON object. No markdown fencing. No explanation. Just JSON.

REQUIRED JSON STRUCTURE:

{"trivial":false,"title":"...","summary":"...","halls":["fact"],"topics":["firebase-emulator","docker-compose"],"decisions":["..."],"discoveries":["..."],"preferences":["..."],"advice":["..."],"events":["..."],"tooling":["..."],"key_exchanges":[{"human":"...","assistant":"..."}],"related_topics":["..."]}

FIELD RULES:

title: 3-8 word descriptive title. NOT "Session XYZ". Describe what happened.

summary: 2-3 sentences. What the human wanted, what the assistant did, and what the outcome was.

topics: REQUIRED. 1-4 kebab-case tags for the main subjects. Examples: firebase-emulator, blue-green-deploy, ci-pipeline, docker-hardening, database-migration, api-key-management, git-commit, test-failures.

halls: Which knowledge types are present. Pick from: fact, discovery, preference, advice, event, tooling.
- fact = decisions made, config changed, choices locked in
- discovery = root causes, bugs found, breakthroughs
- preference = user working style or preferences
- advice = recommendations, lessons learned
- event = deployments, incidents, milestones
- tooling = scripts used, commands run, failures encountered

decisions: State each decision as a fact. "Added restart policy to firebase service."
discoveries: State the root cause clearly. "npm install failed because the working directory was wrong."
preferences: Only if explicitly expressed. Usually empty.
advice: Recommendations made during the session.
events: Notable milestones or incidents.
tooling: Scripts, commands, and tools used. Note failures especially.

key_exchanges: The 1-3 most important moments. Paraphrase each to 1 sentence.

related_topics: Secondary tags for cross-referencing to other wiki pages.

trivial: Set true ONLY if there are fewer than 3 meaningful exchanges and no decisions or discoveries.

Do NOT omit empty arrays: if no preferences were expressed, use "preferences": [].

Output ONLY valid JSON. No markdown. No explanation.
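As a quick sanity check on responses elicited by the prompt above, the required fields can be verified in a few lines of Python. The field names and constraints are taken from the prompt; the checker itself (`validate_summary`) is a hypothetical sketch, not part of this repository.

```python
import json

ALLOWED_HALLS = {"fact", "discovery", "preference", "advice", "event", "tooling"}


def validate_summary(raw: str) -> dict:
    """Parse a summarizer response and check the fields the prompt requires."""
    obj = json.loads(raw)  # must be bare JSON, no markdown fencing
    assert isinstance(obj.get("trivial"), bool)
    assert obj.get("title") and obj.get("summary")
    assert 1 <= len(obj.get("topics", [])) <= 4       # topics are REQUIRED
    assert set(obj.get("halls", [])) <= ALLOWED_HALLS  # only known hall names
    for ex in obj.get("key_exchanges", []):
        assert {"human", "assistant"} <= set(ex)       # both sides present
    return obj


example = (
    '{"trivial":false,"title":"Fix firebase emulator restart","summary":"...",'
    '"halls":["fact"],"topics":["firebase-emulator"],'
    '"decisions":["Added restart policy."],'
    '"key_exchanges":[{"human":"...","assistant":"..."}]}'
)
summary = validate_summary(example)
```

A failed assertion here corresponds to the parse-retry path in summarize-conversations.py below, which re-prompts once before giving up on a chunk.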
646
scripts/summarize-conversations.py
Executable file
@@ -0,0 +1,646 @@
#!/usr/bin/env python3
|
||||||
|
"""Summarize extracted conversation transcripts via LLM.
|
||||||
|
|
||||||
|
Phase B of the conversation mining pipeline. Sends transcripts to a local
|
||||||
|
llama-server or Claude Code CLI for classification, summarization, and
|
||||||
|
key exchange selection.
|
||||||
|
|
||||||
|
Handles chunking and incremental summarization.
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python3 summarize-conversations.py # All unsummarized (local LLM)
|
||||||
|
python3 summarize-conversations.py --claude # Use claude -p (haiku/sonnet)
|
||||||
|
python3 summarize-conversations.py --claude --long 300 # Sonnet threshold: 300 msgs
|
||||||
|
python3 summarize-conversations.py --project mc # One project only
|
||||||
|
python3 summarize-conversations.py --file path.md # One file
|
||||||
|
python3 summarize-conversations.py --dry-run # Show what would be done
|
||||||
|
|
||||||
|
Claude mode uses Haiku for short conversations (<= threshold) and Sonnet
|
||||||
|
for longer ones. Threshold default: 200 messages.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import subprocess
|
||||||
|
import sys
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
# Force unbuffered output for background/pipe usage
|
||||||
|
sys.stdout.reconfigure(line_buffering=True)
|
||||||
|
sys.stderr.reconfigure(line_buffering=True)
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Configuration
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
|
||||||
|
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
|
||||||
|
MINE_STATE_FILE = WIKI_DIR / ".mine-state.json"
|
||||||
|
# Prompt file lives next to this script, not in $WIKI_DIR
|
||||||
|
MINE_PROMPT_FILE = Path(__file__).resolve().parent / "mine-prompt-v2.md"
|
||||||
|
|
||||||
|
# Local LLM defaults (llama-server)
|
||||||
|
AI_BASE_URL = "http://localhost:8080/v1"
|
||||||
|
AI_MODEL = "Phi-4-14B-Q4_K_M"
|
||||||
|
AI_TOKEN = "dummy"
|
||||||
|
AI_TIMEOUT = 180
|
||||||
|
AI_TEMPERATURE = 0.3
|
||||||
|
|
||||||
|
# Claude CLI defaults
|
||||||
|
CLAUDE_HAIKU_MODEL = "haiku"
|
||||||
|
CLAUDE_SONNET_MODEL = "sonnet"
|
||||||
|
CLAUDE_LONG_THRESHOLD = 200 # messages — above this, use Sonnet
|
||||||
|
|
||||||
|
# Chunking parameters
|
||||||
|
# Local LLM: 8K context → ~3000 tokens content per chunk
|
||||||
|
MAX_CHUNK_CHARS_LOCAL = 12000
|
||||||
|
MAX_ROLLING_CONTEXT_CHARS_LOCAL = 6000
|
||||||
|
# Claude: 200K context → much larger chunks, fewer LLM calls
|
||||||
|
MAX_CHUNK_CHARS_CLAUDE = 80000 # ~20K tokens
|
||||||
|
MAX_ROLLING_CONTEXT_CHARS_CLAUDE = 20000
|
||||||
|
|
||||||
|
|
||||||
|
def _update_config(base_url: str, model: str, timeout: int) -> None:
|
||||||
|
global AI_BASE_URL, AI_MODEL, AI_TIMEOUT
|
||||||
|
AI_BASE_URL = base_url
|
||||||
|
AI_MODEL = model
|
||||||
|
AI_TIMEOUT = timeout
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# LLM interaction — local llama-server
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def llm_call_local(system_prompt: str, user_message: str) -> str | None:
|
||||||
|
"""Call the local LLM server and return the response content."""
|
||||||
|
import urllib.request
|
||||||
|
import urllib.error
|
||||||
|
|
||||||
|
payload = json.dumps({
|
||||||
|
"model": AI_MODEL,
|
||||||
|
"messages": [
|
||||||
|
{"role": "system", "content": system_prompt},
|
||||||
|
{"role": "user", "content": user_message},
|
||||||
|
],
|
||||||
|
"temperature": AI_TEMPERATURE,
|
||||||
|
"max_tokens": 3000,
|
||||||
|
}).encode()
|
||||||
|
|
||||||
|
req = urllib.request.Request(
|
||||||
|
f"{AI_BASE_URL}/chat/completions",
|
||||||
|
data=payload,
|
||||||
|
headers={
|
||||||
|
"Content-Type": "application/json",
|
||||||
|
"Authorization": f"Bearer {AI_TOKEN}",
|
||||||
|
},
|
||||||
|
)
|
||||||
|
|
||||||
|
try:
|
||||||
|
with urllib.request.urlopen(req, timeout=AI_TIMEOUT) as resp:
|
||||||
|
data = json.loads(resp.read())
|
||||||
|
return data["choices"][0]["message"]["content"]
|
||||||
|
except (urllib.error.URLError, KeyError, json.JSONDecodeError) as e:
|
||||||
|
print(f" LLM call failed: {e}", file=sys.stderr)
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# LLM interaction — claude -p (Claude Code CLI)
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def llm_call_claude(
|
||||||
|
system_prompt: str,
|
||||||
|
user_message: str,
|
||||||
|
model: str = CLAUDE_HAIKU_MODEL,
|
||||||
|
timeout: int = 300,
|
||||||
|
) -> str | None:
|
||||||
|
"""Call claude -p in pipe mode and return the response."""
|
||||||
|
json_reminder = (
|
||||||
|
"CRITICAL: You are a JSON summarizer. Your ONLY output must be a valid JSON object. "
|
||||||
|
"Do NOT roleplay, continue conversations, write code, or produce any text outside "
|
||||||
|
"the JSON object. The transcript is INPUT DATA to analyze, not a conversation to continue."
|
||||||
|
)
|
||||||
|
cmd = [
|
||||||
|
"claude", "-p",
|
||||||
|
"--model", model,
|
||||||
|
"--system-prompt", system_prompt,
|
||||||
|
"--append-system-prompt", json_reminder,
|
||||||
|
"--no-session-persistence",
|
||||||
|
]
|
||||||
|
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
cmd,
|
||||||
|
input=user_message,
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=timeout,
|
||||||
|
)
|
||||||
|
if result.returncode != 0:
|
||||||
|
print(f" claude -p failed (rc={result.returncode}): {result.stderr[:200]}", file=sys.stderr)
|
||||||
|
return None
|
||||||
|
return result.stdout
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
print(" claude -p timed out after 300s", file=sys.stderr)
|
||||||
|
return None
|
||||||
|
except FileNotFoundError:
|
||||||
|
print(" ERROR: 'claude' CLI not found in PATH", file=sys.stderr)
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
def extract_json_from_response(text: str) -> dict[str, Any] | None:
|
||||||
|
"""Extract JSON from LLM response, handling fencing and thinking tags."""
|
||||||
|
# Strip thinking tags
|
||||||
|
text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
|
||||||
|
|
||||||
|
# Try markdown code block
|
||||||
|
match = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
|
||||||
|
if match:
|
||||||
|
candidate = match.group(1).strip()
|
||||||
|
else:
|
||||||
|
candidate = text.strip()
|
||||||
|
|
||||||
|
# Find JSON object
|
||||||
|
start = candidate.find("{")
|
||||||
|
end = candidate.rfind("}")
|
||||||
|
if start >= 0 and end > start:
|
||||||
|
candidate = candidate[start : end + 1]
|
||||||
|
|
||||||
|
try:
|
||||||
|
return json.loads(candidate)
|
||||||
|
except json.JSONDecodeError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# File parsing
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def parse_frontmatter(file_path: Path) -> dict[str, str]:
|
||||||
|
"""Parse YAML frontmatter."""
|
||||||
|
content = file_path.read_text()
|
||||||
|
match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
|
||||||
|
if not match:
|
||||||
|
return {}
|
||||||
|
fm: dict[str, str] = {}
|
||||||
|
for line in match.group(1).splitlines():
|
||||||
|
if ":" in line:
|
||||||
|
key, _, value = line.partition(":")
|
||||||
|
fm[key.strip()] = value.strip()
|
||||||
|
return fm
|
||||||
|
|
||||||
|
|
||||||
|
def get_transcript(file_path: Path) -> str:
|
||||||
|
"""Get transcript section from conversation file."""
|
||||||
|
content = file_path.read_text()
|
||||||
|
idx = content.find("\n## Transcript\n")
|
||||||
|
if idx < 0:
|
||||||
|
return ""
|
||||||
|
return content[idx + len("\n## Transcript\n") :]
|
||||||
|
|
||||||
|
|
||||||
|
def get_existing_summary(file_path: Path) -> str:
|
||||||
|
"""Get existing summary sections (between frontmatter end and transcript)."""
|
||||||
|
content = file_path.read_text()
|
||||||
|
parts = content.split("---", 2)
|
||||||
|
if len(parts) < 3:
|
||||||
|
return ""
|
||||||
|
after_fm = parts[2]
|
||||||
|
idx = after_fm.find("## Transcript")
|
||||||
|
if idx < 0:
|
||||||
|
return ""
|
||||||
|
return after_fm[:idx].strip()
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Chunking
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def chunk_text(text: str, max_chars: int) -> list[str]:
|
||||||
|
"""Split text into chunks, breaking at paragraph boundaries."""
|
||||||
|
if len(text) <= max_chars:
|
||||||
|
return [text]
|
||||||
|
|
||||||
|
chunks: list[str] = []
|
||||||
|
current = ""
|
||||||
|
|
||||||
|
for line in text.splitlines(keepends=True):
|
||||||
|
if len(current) + len(line) > max_chars and current:
|
||||||
|
chunks.append(current)
|
||||||
|
current = line
|
||||||
|
else:
|
||||||
|
current += line
|
||||||
|
|
||||||
|
if current:
|
||||||
|
chunks.append(current)
|
||||||
|
|
||||||
|
return chunks
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Summarization
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def select_claude_model(file_path: Path, long_threshold: int) -> str:
|
||||||
|
"""Pick haiku or sonnet based on message count."""
|
||||||
|
fm = parse_frontmatter(file_path)
|
||||||
|
try:
|
||||||
|
msg_count = int(fm.get("messages", "0"))
|
||||||
|
except ValueError:
|
||||||
|
msg_count = 0
|
||||||
|
if msg_count > long_threshold:
|
||||||
|
return CLAUDE_SONNET_MODEL
|
||||||
|
return CLAUDE_HAIKU_MODEL
|
||||||
|
|
||||||
|
|
||||||
|
def summarize_file(
|
||||||
|
file_path: Path,
|
||||||
|
system_prompt: str,
|
||||||
|
dry_run: bool = False,
|
||||||
|
use_claude: bool = False,
|
||||||
|
long_threshold: int = CLAUDE_LONG_THRESHOLD,
|
||||||
|
) -> bool:
|
||||||
|
"""Summarize a single conversation file. Returns True on success."""
|
||||||
|
transcript = get_transcript(file_path)
|
||||||
|
if not transcript.strip():
|
||||||
|
print(f" [skip] {file_path.name} — no transcript")
|
||||||
|
return False
|
||||||
|
|
||||||
|
existing_summary = get_existing_summary(file_path)
|
||||||
|
is_incremental = "## Summary" in existing_summary
|
||||||
|
|
||||||
|
# Pick chunk sizes based on provider
|
||||||
|
if use_claude:
|
||||||
|
max_chunk = MAX_CHUNK_CHARS_CLAUDE
|
||||||
|
max_rolling = MAX_ROLLING_CONTEXT_CHARS_CLAUDE
|
||||||
|
else:
|
||||||
|
max_chunk = MAX_CHUNK_CHARS_LOCAL
|
||||||
|
max_rolling = MAX_ROLLING_CONTEXT_CHARS_LOCAL
|
||||||
|
|
||||||
|
chunks = chunk_text(transcript, max_chunk)
|
||||||
|
num_chunks = len(chunks)
|
||||||
|
|
||||||
|
# Pick model for claude mode
|
||||||
|
claude_model = ""
|
||||||
|
if use_claude:
|
||||||
|
claude_model = select_claude_model(file_path, long_threshold)
|
||||||
|
|
||||||
|
if dry_run:
|
||||||
|
mode = "incremental" if is_incremental else "new"
|
||||||
|
model_info = f", model={claude_model}" if use_claude else ""
|
||||||
|
print(f" [dry-run] {file_path.name} — {num_chunks} chunk(s) ({mode}{model_info})")
|
||||||
|
return True
|
||||||
|
|
||||||
|
model_label = f" [{claude_model}]" if use_claude else ""
|
||||||
|
print(f" [summarize] {file_path.name} — {num_chunks} chunk(s)"
|
||||||
|
f"{' (incremental)' if is_incremental else ''}{model_label}")
|
||||||
|
|
||||||
|
rolling_context = ""
|
||||||
|
if is_incremental:
|
||||||
|
rolling_context = f"EXISTING SUMMARY (extend, do not repeat):\n{existing_summary}\n\n"
|
||||||
|
|
||||||
|
final_json: dict[str, Any] | None = None
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
for i, chunk in enumerate(chunks, 1):
|
||||||
|
if rolling_context:
|
||||||
|
user_msg = (
|
||||||
|
f"{rolling_context}\n\n"
|
||||||
|
f"NEW CONVERSATION CONTENT (chunk {i}/{num_chunks}):\n{chunk}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
user_msg = f"CONVERSATION TRANSCRIPT (chunk {i}/{num_chunks}):\n{chunk}"
|
||||||
|
|
||||||
|
if i == num_chunks:
|
||||||
|
user_msg += "\n\nThis is the FINAL chunk. Produce the complete JSON summary now."
|
||||||
|
else:
|
||||||
|
user_msg += "\n\nMore chunks follow. Produce a PARTIAL summary JSON for what you've seen so far."
|
||||||
|
|
||||||
|
# Call the appropriate LLM (with retry on parse failure)
|
||||||
|
max_attempts = 2
|
||||||
|
parsed = None
|
||||||
|
for attempt in range(1, max_attempts + 1):
|
||||||
|
if use_claude:
|
||||||
|
# Longer timeout for sonnet / multi-chunk conversations
|
||||||
|
call_timeout = 600 if claude_model == CLAUDE_SONNET_MODEL else 300
|
||||||
|
response = llm_call_claude(system_prompt, user_msg,
|
||||||
|
model=claude_model, timeout=call_timeout)
|
||||||
|
else:
|
||||||
|
response = llm_call_local(system_prompt, user_msg)
|
||||||
|
|
||||||
|
if not response:
|
||||||
|
print(f" [error] LLM call failed on chunk {i}/{num_chunks} (attempt {attempt})")
|
||||||
|
if attempt < max_attempts:
|
||||||
|
continue
|
||||||
|
return False
|
||||||
|
|
||||||
|
parsed = extract_json_from_response(response)
|
||||||
|
if parsed:
|
||||||
|
break
|
||||||
|
|
||||||
|
print(f" [warn] JSON parse failed on chunk {i}/{num_chunks} (attempt {attempt})")
|
||||||
|
if attempt < max_attempts:
|
||||||
|
print(f" Retrying...")
|
||||||
|
else:
|
||||||
|
# Log first 200 chars for debugging
|
||||||
|
print(f" Response preview: {response[:200]}", file=sys.stderr)
|
||||||
|
|
||||||
|
if not parsed:
|
||||||
|
print(f" [error] JSON parse failed on chunk {i}/{num_chunks} after {max_attempts} attempts")
|
||||||
|
return False
|
||||||
|
|
||||||
|
final_json = parsed
|
||||||
|
|
||||||
|
# Build rolling context for next chunk
|
||||||
|
partial_summary = parsed.get("summary", "")
|
||||||
|
if partial_summary:
|
||||||
|
rolling_context = f"PARTIAL SUMMARY SO FAR:\n{partial_summary}"
|
||||||
|
decisions = parsed.get("decisions", [])
|
||||||
|
if decisions:
|
||||||
|
rolling_context += "\n\nKEY DECISIONS:\n" + "\n".join(
|
||||||
|
f"- {d}" for d in decisions[:5]
|
||||||
|
)
|
||||||
|
if len(rolling_context) > max_rolling:
|
||||||
|
rolling_context = rolling_context[:max_rolling] + "..."
|
||||||
|
|
||||||
|
if not final_json:
|
||||||
|
print(f" [error] No summary produced")
|
||||||
|
return False
|
||||||
|
|
||||||
|
elapsed = time.time() - start_time
|
||||||
|
|
||||||
|
# Apply the summary to the file
|
||||||
|
apply_summary(file_path, final_json)
|
||||||
|
|
||||||
|
halls = final_json.get("halls", [])
|
||||||
|
topics = final_json.get("topics", [])
|
||||||
|
status = "trivial" if final_json.get("trivial") else "summarized"
|
||||||
|
|
||||||
|
print(
|
||||||
|
f" [done] {file_path.name} — {status}, "
|
||||||
|
f"halls=[{', '.join(halls)}], "
|
||||||
|
f"topics=[{', '.join(topics)}] "
|
||||||
|
f"({elapsed:.0f}s)"
|
||||||
|
)
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def apply_summary(file_path: Path, summary_json: dict[str, Any]) -> None:
|
||||||
|
"""Apply LLM summary to the conversation markdown file."""
|
||||||
|
content = file_path.read_text()
|
||||||
|
|
||||||
|
# Parse existing frontmatter
|
||||||
|
fm_match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
|
||||||
|
if not fm_match:
|
||||||
|
return
|
||||||
|
|
||||||
|
fm_lines = fm_match.group(1).splitlines()
|
||||||
|
|
||||||
|
# Find transcript
|
||||||
|
transcript_idx = content.find("\n## Transcript\n")
|
||||||
|
transcript_section = content[transcript_idx:] if transcript_idx >= 0 else ""
|
||||||
|
|
||||||
|
# Update frontmatter
|
||||||
|
is_trivial = summary_json.get("trivial", False)
|
||||||
|
new_status = "trivial" if is_trivial else "summarized"
|
||||||
|
title = summary_json.get("title", "Untitled Session")
|
||||||
|
halls = summary_json.get("halls", [])
|
||||||
|
topics = summary_json.get("topics", [])
|
||||||
|
related = summary_json.get("related_topics", [])
|
||||||
|
|
||||||
|
fm_dict: dict[str, str] = {}
|
||||||
|
fm_key_order: list[str] = []
|
||||||
|
for line in fm_lines:
|
||||||
|
if ":" in line:
|
||||||
|
key = line.partition(":")[0].strip()
|
||||||
|
val = line.partition(":")[2].strip()
|
||||||
|
fm_dict[key] = val
|
||||||
|
fm_key_order.append(key)
|
||||||
|
|
||||||
|
fm_dict["title"] = title
|
||||||
|
fm_dict["status"] = new_status
|
||||||
|
if halls:
|
||||||
|
fm_dict["halls"] = "[" + ", ".join(halls) + "]"
|
||||||
|
if topics:
|
||||||
|
fm_dict["topics"] = "[" + ", ".join(topics) + "]"
|
||||||
|
if related:
|
||||||
|
fm_dict["related"] = "[" + ", ".join(related) + "]"
|
||||||
|
|
||||||
|
# Add new keys
|
||||||
|
for key in ["halls", "topics", "related"]:
|
||||||
|
if key in fm_dict and key not in fm_key_order:
|
||||||
|
fm_key_order.append(key)
|
||||||
|
|
||||||
|
new_fm = "\n".join(f"{k}: {fm_dict[k]}" for k in fm_key_order if k in fm_dict)
|
||||||
|
|
||||||
|
# Build summary sections
|
||||||
|
sections: list[str] = []
|
||||||
|
|
||||||
|
summary_text = summary_json.get("summary", "")
|
||||||
|
if summary_text:
|
||||||
|
sections.append(f"## Summary\n\n{summary_text}")
|
||||||
|
|
||||||
|
for hall_name, hall_label in [
|
||||||
|
("decisions", "Decisions (hall: fact)"),
|
||||||
|
("discoveries", "Discoveries (hall: discovery)"),
|
||||||
|
("preferences", "Preferences (hall: preference)"),
|
||||||
|
("advice", "Advice (hall: advice)"),
|
||||||
|
("events", "Events (hall: event)"),
|
||||||
|
("tooling", "Tooling (hall: tooling)"),
|
||||||
|
]:
|
||||||
|
items = summary_json.get(hall_name, [])
|
||||||
|
if items:
|
||||||
|
lines = [f"## {hall_label}\n"]
|
||||||
|
for item in items:
|
||||||
|
lines.append(f"- {item}")
|
||||||
|
sections.append("\n".join(lines))
|
||||||
|
|
||||||
|
exchanges = summary_json.get("key_exchanges", [])
|
||||||
|
if exchanges:
|
||||||
|
lines = ["## Key Exchanges\n"]
|
||||||
|
for ex in exchanges:
|
||||||
|
if isinstance(ex, dict):
|
||||||
|
human = ex.get("human", "")
|
||||||
|
assistant = ex.get("assistant", "")
|
||||||
|
lines.append(f"> **Human**: {human}")
|
||||||
|
lines.append(">")
|
||||||
|
lines.append(f"> **Assistant**: {assistant}")
|
||||||
|
lines.append("")
|
||||||
|
elif isinstance(ex, str):
|
||||||
|
lines.append(f"- {ex}")
|
||||||
|
sections.append("\n".join(lines))
|
||||||
|
|
||||||
|
# Assemble
|
||||||
|
output = f"---\n{new_fm}\n---\n\n"
|
||||||
|
if sections:
|
||||||
|
output += "\n\n".join(sections) + "\n\n---\n"
|
||||||
|
output += transcript_section
|
||||||
|
if not output.endswith("\n"):
|
||||||
|
output += "\n"
|
||||||
|
|
||||||
|
file_path.write_text(output)
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Discovery
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def find_files_to_summarize(
|
||||||
|
project_filter: str | None = None,
|
||||||
|
file_filter: str | None = None,
|
||||||
|
) -> list[Path]:
|
||||||
|
"""Find conversation files needing summarization."""
|
||||||
|
if file_filter:
|
||||||
|
p = Path(file_filter)
|
||||||
|
if p.exists():
|
||||||
|
return [p]
|
||||||
|
p = WIKI_DIR / file_filter
|
||||||
|
if p.exists():
|
||||||
|
return [p]
|
||||||
|
return []
|
||||||
|
|
||||||
|
search_dir = CONVERSATIONS_DIR
|
||||||
|
if project_filter:
|
||||||
|
search_dir = CONVERSATIONS_DIR / project_filter
|
||||||
|
|
||||||
|
files: list[Path] = []
|
||||||
|
for md_file in sorted(search_dir.rglob("*.md")):
|
||||||
|
if md_file.name in ("index.md", ".gitkeep"):
|
||||||
|
continue
|
||||||
|
fm = parse_frontmatter(md_file)
|
||||||
|
if fm.get("status") == "extracted":
|
||||||
|
files.append(md_file)
|
||||||
|
|
||||||
|
return files
|
||||||
|
|
||||||
|
|
||||||
|
def update_mine_state(session_id: str, msg_count: int) -> None:
|
||||||
|
"""Update summarized_through_msg in mine state."""
|
||||||
|
if not MINE_STATE_FILE.exists():
|
||||||
|
return
|
||||||
|
try:
|
||||||
|
with open(MINE_STATE_FILE) as f:
|
||||||
|
state = json.load(f)
|
||||||
|
if session_id in state.get("sessions", {}):
|
||||||
|
state["sessions"][session_id]["summarized_through_msg"] = msg_count
|
||||||
|
with open(MINE_STATE_FILE, "w") as f:
|
||||||
|
json.dump(state, f, indent=2)
|
||||||
|
except (json.JSONDecodeError, KeyError):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main() -> None:
    parser = argparse.ArgumentParser(description="Summarize conversation transcripts")
    parser.add_argument("--project", help="Only summarize this project code")
    parser.add_argument("--file", help="Summarize a specific file")
    parser.add_argument("--dry-run", action="store_true", help="Show what would be done")
    parser.add_argument(
        "--claude", action="store_true",
        help="Use claude -p instead of local LLM (haiku for short, sonnet for long)",
    )
    parser.add_argument(
        "--long", type=int, default=CLAUDE_LONG_THRESHOLD, metavar="N",
        help=f"Message count threshold for sonnet (default: {CLAUDE_LONG_THRESHOLD})",
    )
    parser.add_argument("--ai-url", default=AI_BASE_URL)
    parser.add_argument("--ai-model", default=AI_MODEL)
    parser.add_argument("--ai-timeout", type=int, default=AI_TIMEOUT)
    args = parser.parse_args()

    # Update module-level config from args (local LLM only)
    _update_config(args.ai_url, args.ai_model, args.ai_timeout)

    # Load system prompt
    if not MINE_PROMPT_FILE.exists():
        print(f"ERROR: Prompt not found: {MINE_PROMPT_FILE}", file=sys.stderr)
        sys.exit(1)
    system_prompt = MINE_PROMPT_FILE.read_text()

    # Find files
    files = find_files_to_summarize(args.project, args.file)
    if not files:
        print("No conversations need summarization.")
        return

    provider = "claude -p" if args.claude else f"local ({AI_MODEL})"
    print(f"Found {len(files)} conversation(s) to summarize. Provider: {provider}")

    if args.dry_run:
        for f in files:
            summarize_file(f, system_prompt, dry_run=True,
                           use_claude=args.claude, long_threshold=args.long)
        return

    # Check provider availability
    if args.claude:
        try:
            result = subprocess.run(
                ["claude", "--version"],
                capture_output=True, text=True, timeout=10,
            )
            if result.returncode != 0:
                print("ERROR: 'claude' CLI not working", file=sys.stderr)
                sys.exit(1)
            print(f"Claude CLI: {result.stdout.strip()}")
        except (FileNotFoundError, subprocess.TimeoutExpired):
            print("ERROR: 'claude' CLI not found in PATH", file=sys.stderr)
            sys.exit(1)
    else:
        import urllib.request
        import urllib.error
        health_url = AI_BASE_URL.replace("/v1", "/health")
        try:
            urllib.request.urlopen(health_url, timeout=5)
        except urllib.error.URLError:
            print(f"ERROR: LLM server not responding at {health_url}", file=sys.stderr)
            sys.exit(1)

    processed = 0
    errors = 0
    total_start = time.time()

    for i, f in enumerate(files, 1):
        print(f"\n[{i}/{len(files)}]", end=" ")
        try:
            if summarize_file(f, system_prompt, use_claude=args.claude,
                              long_threshold=args.long):
                processed += 1

                # Update mine state
                fm = parse_frontmatter(f)
                sid = fm.get("session_id", "")
                msgs = fm.get("messages", "0")
                if sid:
                    try:
                        update_mine_state(sid, int(msgs))
                    except ValueError:
                        pass
            else:
                errors += 1
        except Exception as e:
            print(f" [crash] {f.name} — {e}", file=sys.stderr)
            errors += 1

    elapsed = time.time() - total_start
    print(f"\nDone. Summarized: {processed}, Errors: {errors}, Time: {elapsed:.0f}s")


if __name__ == "__main__":
    main()
476
scripts/update-conversation-index.py
Executable file
@@ -0,0 +1,476 @@
#!/usr/bin/env python3
"""Update conversation index and context files from summarized conversations.

Phase C of the conversation mining pipeline. Reads all conversation markdown
files and regenerates:
- conversations/index.md — catalog organized by project
- context/wake-up.md — world briefing from recent conversations
- context/active-concerns.md — current blockers and open threads

Usage:
    python3 update-conversation-index.py
    python3 update-conversation-index.py --reindex   # Also triggers qmd update
"""

from __future__ import annotations

import argparse
import os
import re
import subprocess
import sys
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
CONTEXT_DIR = WIKI_DIR / "context"
INDEX_FILE = CONVERSATIONS_DIR / "index.md"
WAKEUP_FILE = CONTEXT_DIR / "wake-up.md"
CONCERNS_FILE = CONTEXT_DIR / "active-concerns.md"

# ════════════════════════════════════════════════════════════════════════════
# CONFIGURE ME — Project code to display name mapping
# ════════════════════════════════════════════════════════════════════════════
#
# Every project code you use in `extract-sessions.py`'s PROJECT_MAP should
# have a display name here. The conversation index groups conversations by
# these codes and renders them under sections named by the display name.
#
# Examples — replace with your own:
PROJECT_NAMES: dict[str, str] = {
    "wiki": "WIKI — This Wiki",
    "cl": "CL — Claude Config",
    # "web": "WEB — My Webapp",
    # "mob": "MOB — My Mobile App",
    # "work": "WORK — Day Job",
    "general": "General — Cross-Project",
}

# Order for display — put your most-active projects first
PROJECT_ORDER = [
    # "work", "web", "mob",
    "wiki", "cl", "general",
]


# ---------------------------------------------------------------------------
# Frontmatter parsing
# ---------------------------------------------------------------------------


def parse_frontmatter(file_path: Path) -> dict[str, str]:
    """Parse YAML frontmatter from a markdown file."""
    fm: dict[str, str] = {}
    content = file_path.read_text()

    # Find frontmatter between --- markers
    match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
    if not match:
        return fm

    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip()

    return fm
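The frontmatter parser is deliberately naive: one flat `key: value` per line, no nested YAML, everything returned as strings. A minimal sketch of what it produces on a typical conversation file (the frontmatter keys here match the ones the scripts read: `title`, `date`, `status`, `messages`):

```python
import re
import tempfile
from pathlib import Path

def parse_frontmatter(file_path: Path) -> dict[str, str]:
    # Mirrors the script's parser: frontmatter between --- markers,
    # split each line on the first colon, everything stays a string.
    fm: dict[str, str] = {}
    content = file_path.read_text()
    match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
    if not match:
        return fm
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip()
    return fm

sample = """---
title: Fixing the harvest pipeline
date: 2025-01-15
status: summarized
messages: 42
---

## Summary

Debugged the fetch cascade.
"""

with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write(sample)

fm = parse_frontmatter(Path(f.name))
print(fm["title"], fm["status"], fm["messages"])
```

Note that `messages` comes back as the string `"42"`, which is why callers like the summarize script wrap `int(msgs)` in a `try/except ValueError`.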
def get_summary_line(file_path: Path) -> str:
    """Extract the first sentence of the Summary section."""
    content = file_path.read_text()
    match = re.search(r"## Summary\n\n(.+?)(?:\n\n|\n##)", content, re.DOTALL)
    if match:
        summary = match.group(1).strip()
        # First sentence
        first_sentence = summary.split(". ")[0]
        if not first_sentence.endswith("."):
            first_sentence += "."
        # Truncate if too long
        if len(first_sentence) > 120:
            first_sentence = first_sentence[:117] + "..."
        return first_sentence
    return "No summary available."


def get_decisions(file_path: Path) -> list[str]:
    """Extract decisions from a conversation file."""
    content = file_path.read_text()
    decisions: list[str] = []
    match = re.search(r"## Decisions.*?\n(.*?)(?:\n##|\n---|\Z)", content, re.DOTALL)
    if match:
        for line in match.group(1).strip().splitlines():
            line = line.strip()
            if line.startswith("- "):
                decisions.append(line[2:])
    return decisions


def get_discoveries(file_path: Path) -> list[str]:
    """Extract discoveries from a conversation file."""
    content = file_path.read_text()
    discoveries: list[str] = []
    match = re.search(r"## Discoveries.*?\n(.*?)(?:\n##|\n---|\Z)", content, re.DOTALL)
    if match:
        for line in match.group(1).strip().splitlines():
            line = line.strip()
            if line.startswith("- "):
                discoveries.append(line[2:])
    return discoveries
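`get_decisions` and `get_discoveries` share the same section-scraping shape: a lazy capture from the `##` heading until the next heading, a horizontal rule, or end of file, then keep only `- ` bullet lines. A sketch of that pattern on a hypothetical summary document:

```python
import re

# Same pattern the script uses for "## Decisions": capture until the
# next heading (\n##), a horizontal rule (\n---), or end of file (\Z).
SECTION_RE = re.compile(r"## Decisions.*?\n(.*?)(?:\n##|\n---|\Z)", re.DOTALL)

doc = """## Summary

Long discussion.

## Decisions

- Use atomic writes for state files
- Keep URL dedup state in git

## Next Steps
"""

match = SECTION_RE.search(doc)
items = [line.strip()[2:] for line in match.group(1).strip().splitlines()
         if line.strip().startswith("- ")]
print(items)
```

The `\Z` alternative matters: a `## Decisions` section that ends the file (no trailing heading) is still captured.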
# ---------------------------------------------------------------------------
# Conversation discovery
# ---------------------------------------------------------------------------


def discover_conversations() -> dict[str, list[dict[str, Any]]]:
    """Discover all conversation files organized by project."""
    by_project: dict[str, list[dict[str, Any]]] = defaultdict(list)

    for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
        if not project_dir.is_dir():
            continue

        project_code = project_dir.name
        if project_code not in PROJECT_NAMES:
            continue

        for md_file in sorted(project_dir.glob("*.md"), reverse=True):
            if md_file.name == ".gitkeep":
                continue

            fm = parse_frontmatter(md_file)
            status = fm.get("status", "extracted")

            entry = {
                "file": md_file,
                "relative": md_file.relative_to(CONVERSATIONS_DIR),
                "title": fm.get("title", md_file.stem),
                "date": fm.get("date", "unknown"),
                "status": status,
                "messages": fm.get("messages", "0"),
                "halls": fm.get("halls", ""),
                "topics": fm.get("topics", ""),
                "project": project_code,
            }

            by_project[project_code].append(entry)

    return by_project


# ---------------------------------------------------------------------------
# Index generation
# ---------------------------------------------------------------------------


def generate_index(by_project: dict[str, list[dict[str, Any]]]) -> str:
    """Generate the conversations/index.md content."""
    total = sum(len(convos) for convos in by_project.values())
    summarized = sum(
        1
        for convos in by_project.values()
        for c in convos
        if c["status"] == "summarized"
    )
    trivial = sum(
        1
        for convos in by_project.values()
        for c in convos
        if c["status"] == "trivial"
    )
    extracted = total - summarized - trivial

    lines = [
        "---",
        "title: Conversation Index",
        "type: index",
        f"last_updated: {datetime.now(timezone.utc).strftime('%Y-%m-%d')}",
        "---",
        "",
        "# Conversation Index",
        "",
        "Mined conversations from Claude Code sessions, organized by project (wing).",
        "",
        f"**{total} conversations** — {summarized} summarized, {extracted} pending, {trivial} trivial.",
        "",
        "---",
        "",
    ]

    for project_code in PROJECT_ORDER:
        convos = by_project.get(project_code, [])
        display_name = PROJECT_NAMES.get(project_code, project_code.upper())

        lines.append(f"## {display_name}")
        lines.append("")

        if not convos:
            lines.append("_No conversations mined yet._")
            lines.append("")
            continue

        # Show summarized first, then extracted, skip trivial from listing
        shown = 0
        for c in convos:
            if c["status"] == "trivial":
                continue

            status_tag = ""
            if c["status"] == "extracted":
                status_tag = " _(pending summary)_"

            # Get summary line if summarized
            summary_text = ""
            if c["status"] == "summarized":
                summary_text = f" — {get_summary_line(c['file'])}"

            lines.append(
                f"- [{c['title']}]({c['relative']})"
                f" ({c['date']}, {c['messages']} msgs)"
                f"{summary_text}{status_tag}"
            )
            shown += 1

        trivial_count = len(convos) - shown
        if trivial_count > 0:
            lines.append(f"\n_{trivial_count} trivial session(s) not listed._")

        lines.append("")

    return "\n".join(lines)


# ---------------------------------------------------------------------------
# Context generation
# ---------------------------------------------------------------------------


def generate_wakeup(by_project: dict[str, list[dict[str, Any]]]) -> str:
    """Generate context/wake-up.md from recent conversations."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

    # Determine activity level per project
    project_activity: dict[str, dict[str, Any]] = {}
    for code in PROJECT_ORDER:
        convos = by_project.get(code, [])
        summarized = [c for c in convos if c["status"] == "summarized"]

        if summarized:
            latest = max(summarized, key=lambda c: c["date"])
            last_date = latest["date"]
            # Simple activity heuristic: sessions in last 7 days = active
            try:
                dt = datetime.strptime(last_date, "%Y-%m-%d")
                days_ago = (datetime.now() - dt).days
                if days_ago <= 7:
                    status = "Active"
                elif days_ago <= 30:
                    status = "Quiet"
                else:
                    status = "Inactive"
            except ValueError:
                status = "Unknown"
                last_date = "—"
        else:
            # Check extracted-only
            if convos:
                latest = max(convos, key=lambda c: c["date"])
                last_date = latest["date"]
                status = "Active" if latest["date"] >= today[:7] else "Quiet"
            else:
                status = "—"
                last_date = "—"

        project_activity[code] = {
            "status": status,
            "last_date": last_date,
            "count": len(convos),
        }

    # Gather recent decisions across all projects
    recent_decisions: list[tuple[str, str, str]] = []  # (date, project, decision)
    for code, convos in by_project.items():
        for c in convos:
            if c["status"] != "summarized":
                continue
            for decision in get_decisions(c["file"]):
                recent_decisions.append((c["date"], code, decision))

    recent_decisions.sort(key=lambda x: x[0], reverse=True)
    recent_decisions = recent_decisions[:10]  # Top 10 most recent

    # Gather recent discoveries
    recent_discoveries: list[tuple[str, str, str]] = []
    for code, convos in by_project.items():
        for c in convos:
            if c["status"] != "summarized":
                continue
            for disc in get_discoveries(c["file"]):
                recent_discoveries.append((c["date"], code, disc))

    recent_discoveries.sort(key=lambda x: x[0], reverse=True)
    recent_discoveries = recent_discoveries[:5]

    lines = [
        "---",
        "title: Wake-Up Briefing",
        "type: context",
        f"last_updated: {today}",
        "---",
        "",
        "# Wake-Up Briefing",
        "",
        "Auto-generated world state for AI session context.",
        "",
        "## Active Projects",
        "",
        "| Code | Project | Status | Last Activity | Sessions |",
        "|------|---------|--------|---------------|----------|",
    ]

    for code in PROJECT_ORDER:
        if code == "general":
            continue  # Skip general from roster
        info = project_activity.get(code, {"status": "—", "last_date": "—", "count": 0})
        name = PROJECT_NAMES.get(code, "")
        display = name.split(" — ")[1] if " — " in name else code
        lines.append(
            f"| {code.upper()} | {display} | {info['status']} | {info['last_date']} | {info['count']} |"
        )

    lines.append("")

    if recent_decisions:
        lines.append("## Recent Decisions")
        lines.append("")
        for date, proj, decision in recent_decisions[:7]:
            lines.append(f"- **[{proj.upper()}]** {decision} ({date})")
        lines.append("")

    if recent_discoveries:
        lines.append("## Recent Discoveries")
        lines.append("")
        for date, proj, disc in recent_discoveries[:5]:
            lines.append(f"- **[{proj.upper()}]** {disc} ({date})")
        lines.append("")

    if not recent_decisions and not recent_discoveries:
        lines.append("## Recent Decisions")
        lines.append("")
        lines.append("_Populated after summarization runs._")
        lines.append("")

    return "\n".join(lines)


def generate_concerns(by_project: dict[str, list[dict[str, Any]]]) -> str:
    """Generate context/active-concerns.md from recent conversations."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

    # For now, this is a template that gets populated as summaries accumulate.
    # Future enhancement: parse "blockers", "open questions" from summaries.
    lines = [
        "---",
        "title: Active Concerns",
        "type: context",
        f"last_updated: {today}",
        "---",
        "",
        "# Active Concerns",
        "",
        "Auto-generated from recent conversations. Current blockers, deadlines, and open questions.",
        "",
    ]

    # Count recent activity to give a sense of what's hot
    active_projects: list[tuple[str, int]] = []
    for code in PROJECT_ORDER:
        convos = by_project.get(code, [])
        recent = [c for c in convos if c["date"] >= today[:7]]  # This month
        if recent:
            active_projects.append((code, len(recent)))

    if active_projects:
        active_projects.sort(key=lambda x: x[1], reverse=True)
        lines.append("## Current Focus Areas")
        lines.append("")
        for code, count in active_projects[:5]:
            display = PROJECT_NAMES.get(code, code)
            lines.append(f"- **{display}** — {count} session(s) this month")
        lines.append("")

    lines.extend([
        "## Blockers",
        "",
        "_Populated from conversation analysis._",
        "",
        "## Open Questions",
        "",
        "_Populated from conversation analysis._",
        "",
    ])

    return "\n".join(lines)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Update conversation index and context files",
    )
    parser.add_argument(
        "--reindex",
        action="store_true",
        help="Also trigger qmd update and embed after updating files",
    )
    args = parser.parse_args()

    # Discover all conversations
    by_project = discover_conversations()

    total = sum(len(v) for v in by_project.values())
    print(f"Found {total} conversation(s) across {len(by_project)} projects.")

    # Generate and write index
    index_content = generate_index(by_project)
    INDEX_FILE.parent.mkdir(parents=True, exist_ok=True)
    INDEX_FILE.write_text(index_content)
    print(f"Updated {INDEX_FILE.relative_to(WIKI_DIR)}")

    # Generate and write context files (create dir if needed)
    WAKEUP_FILE.parent.mkdir(parents=True, exist_ok=True)
    wakeup_content = generate_wakeup(by_project)
    WAKEUP_FILE.write_text(wakeup_content)
    print(f"Updated {WAKEUP_FILE.relative_to(WIKI_DIR)}")

    concerns_content = generate_concerns(by_project)
    CONCERNS_FILE.write_text(concerns_content)
    print(f"Updated {CONCERNS_FILE.relative_to(WIKI_DIR)}")

    # Optionally trigger qmd reindex
    if args.reindex:
        print("Triggering qmd reindex...")
        try:
            subprocess.run(["qmd", "update"], check=True, capture_output=True)
            subprocess.run(["qmd", "embed"], check=True, capture_output=True)
            print("qmd index updated.")
        except FileNotFoundError:
            print("qmd not found — skipping reindex.", file=sys.stderr)
        except subprocess.CalledProcessError as e:
            print(f"qmd reindex failed: {e}", file=sys.stderr)


if __name__ == "__main__":
    main()
878
scripts/wiki-harvest.py
Executable file
@@ -0,0 +1,878 @@
#!/usr/bin/env python3
"""Harvest external reference URLs from summarized conversations into the wiki.

Scans summarized conversation transcripts for URLs, classifies them, fetches
the content, stores the raw source under raw/harvested/, and optionally calls
`claude -p` to compile each raw file into a staging/ wiki page.

Usage:
    python3 scripts/wiki-harvest.py               # Process all summarized conversations
    python3 scripts/wiki-harvest.py --project mc  # One project only
    python3 scripts/wiki-harvest.py --file PATH   # One conversation file
    python3 scripts/wiki-harvest.py --dry-run     # Show what would be harvested
    python3 scripts/wiki-harvest.py --no-compile  # Fetch only, skip claude -p compile step
    python3 scripts/wiki-harvest.py --limit 10    # Cap number of URLs processed

State is persisted in .harvest-state.json; existing URLs are deduplicated.
"""

from __future__ import annotations

import argparse
import hashlib
import json
import os
import re
import subprocess
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
from urllib.parse import urlparse

# Force unbuffered output for pipe usage
sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------

WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
RAW_HARVESTED_DIR = WIKI_DIR / "raw" / "harvested"
STAGING_DIR = WIKI_DIR / "staging"
INDEX_FILE = WIKI_DIR / "index.md"
CLAUDE_MD = WIKI_DIR / "CLAUDE.md"
HARVEST_STATE_FILE = WIKI_DIR / ".harvest-state.json"

# ════════════════════════════════════════════════════════════════════════════
# CONFIGURE ME — URL classification rules
# ════════════════════════════════════════════════════════════════════════════
#
# Type D: always skip. Add your own internal/ephemeral/personal domains here.
# Patterns use `re.search` so unanchored suffixes like `\.example\.com$` work.
# Private IPs (10.x, 172.16-31.x, 192.168.x, 127.x) are detected separately.
SKIP_DOMAIN_PATTERNS = [
    # Generic: ephemeral / personal / chat / internal
    r"\.atlassian\.net$",
    r"^app\.asana\.com$",
    r"^(www\.)?slack\.com$",
    r"\.slack\.com$",
    r"^(www\.)?discord\.com$",
    r"^localhost$",
    r"^0\.0\.0\.0$",
    r"^mail\.google\.com$",
    r"^calendar\.google\.com$",
    r"^docs\.google\.com$",
    r"^drive\.google\.com$",
    r"^.+\.local$",
    r"^.+\.internal$",
    # Add your own internal domains below, for example:
    # r"\.mycompany\.com$",
    # r"^git\.mydomain\.com$",
]

# Type C — issue trackers / Q&A; only harvest if topic touches existing wiki
C_TYPE_URL_PATTERNS = [
    r"^https?://github\.com/[^/]+/[^/]+/issues/\d+",
    r"^https?://github\.com/[^/]+/[^/]+/pull/\d+",
    r"^https?://github\.com/[^/]+/[^/]+/discussions/\d+",
    r"^https?://(www\.)?stackoverflow\.com/questions/\d+",
    r"^https?://(www\.)?serverfault\.com/questions/\d+",
    r"^https?://(www\.)?superuser\.com/questions/\d+",
    r"^https?://.+\.stackexchange\.com/questions/\d+",
]

# Asset/image extensions to filter out
ASSET_EXTENSIONS = {
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".bmp",
    ".css", ".js", ".mjs", ".woff", ".woff2", ".ttf", ".eot",
    ".mp4", ".webm", ".mov", ".mp3", ".wav",
    ".zip", ".tar", ".gz", ".bz2",
}

# URL regex — HTTP(S), stops at whitespace, brackets, and common markdown delimiters
URL_REGEX = re.compile(
    r"https?://[^\s<>\"')\]}\\|`]+",
    re.IGNORECASE,
)

# Claude CLI models
CLAUDE_HAIKU_MODEL = "haiku"
CLAUDE_SONNET_MODEL = "sonnet"
SONNET_CONTENT_THRESHOLD = 20_000  # chars — larger than this → sonnet

# Fetch behavior
FETCH_DELAY_SECONDS = 2
MAX_FAILED_ATTEMPTS = 3
MIN_CONTENT_LENGTH = 100
FETCH_TIMEOUT = 45

# HTML-leak detection — content containing any of these is treated as a failed extraction
HTML_LEAK_MARKERS = ["<div", "<script", "<nav", "<header", "<footer"]

# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


def load_state() -> dict[str, Any]:
    defaults: dict[str, Any] = {
        "harvested_urls": {},
        "skipped_urls": {},
        "failed_urls": {},
        "rejected_urls": {},
        "last_run": None,
    }
    if HARVEST_STATE_FILE.exists():
        try:
            with open(HARVEST_STATE_FILE) as f:
                state = json.load(f)
            for k, v in defaults.items():
                state.setdefault(k, v)
            return state
        except (OSError, json.JSONDecodeError):
            pass
    return defaults


def save_state(state: dict[str, Any]) -> None:
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    tmp = HARVEST_STATE_FILE.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)
    tmp.replace(HARVEST_STATE_FILE)
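The write-temp-then-rename idiom in `save_state` is what keeps `.harvest-state.json` from ever being observed half-written. A minimal sketch of the same pattern with the path as a parameter (hypothetical helper name; the script hardcodes `HARVEST_STATE_FILE`):

```python
import json
import tempfile
from pathlib import Path

def save_json_atomic(path: Path, data: dict) -> None:
    # Write to a sibling temp file, then rename over the target.
    # Path.replace maps to os.replace, which is atomic on POSIX when
    # source and destination are on the same filesystem.
    tmp = path.with_suffix(".json.tmp")
    with open(tmp, "w") as f:
        json.dump(data, f, indent=2, sort_keys=True)
    tmp.replace(path)

state_file = Path(tempfile.mkdtemp()) / "state.json"
save_json_atomic(state_file, {"harvested_urls": {}, "last_run": None})
print(state_file.read_text())
```

Because the temp file lives next to the target, the rename never crosses a filesystem boundary, which is what makes the replace atomic.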
# ---------------------------------------------------------------------------
# URL extraction
# ---------------------------------------------------------------------------


def extract_urls_from_file(file_path: Path) -> list[str]:
    """Extract all HTTP(S) URLs from a conversation markdown file.

    Filters:
    - Asset URLs (images, CSS, JS, fonts, media, archives)
    - URLs shorter than 20 characters
    - Duplicates within the same file
    """
    try:
        text = file_path.read_text(errors="replace")
    except OSError:
        return []

    seen: set[str] = set()
    urls: list[str] = []

    for match in URL_REGEX.finditer(text):
        url = match.group(0).rstrip(".,;:!?")  # strip trailing sentence punctuation
        # Drop trailing markdown/code artifacts
        while url and url[-1] in "()[]{}\"'":
            url = url[:-1]
        if len(url) < 20:
            continue
        try:
            parsed = urlparse(url)
        except ValueError:
            continue
        if not parsed.scheme or not parsed.netloc:
            continue
        path_lower = parsed.path.lower()
        if any(path_lower.endswith(ext) for ext in ASSET_EXTENSIONS):
            continue
        if url in seen:
            continue
        seen.add(url)
        urls.append(url)

    return urls
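The extraction loop above combines three filters: the regex match, punctuation trimming, and the length/asset checks. A condensed sketch against an inline sample (a trimmed-down `ASSET_EXTENSIONS` set for the demo):

```python
import re
from urllib.parse import urlparse

# Same regex as the script; demo-sized asset list.
URL_REGEX = re.compile(r"https?://[^\s<>\"')\]}\\|`]+", re.IGNORECASE)
ASSET_EXTENSIONS = {".png", ".css", ".js"}

text = """See [the docs](https://docs.python.org/3/library/pathlib.html),
a logo at https://example.com/assets/logo.png, and https://a.io (too short).
"""

urls = []
for m in URL_REGEX.finditer(text):
    url = m.group(0).rstrip(".,;:!?")        # trailing sentence punctuation
    while url and url[-1] in "()[]{}\"'":    # trailing markdown/code artifacts
        url = url[:-1]
    if len(url) < 20:
        continue
    p = urlparse(url)
    if any(p.path.lower().endswith(ext) for ext in ASSET_EXTENSIONS):
        continue
    urls.append(url)

print(urls)
```

The markdown link survives because `)` is in the regex's excluded character class, the `.png` is dropped by the asset filter, and the short URL falls under the 20-character floor.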
# ---------------------------------------------------------------------------
# URL classification
# ---------------------------------------------------------------------------


def _is_private_ip(host: str) -> bool:
    """Return True if host is an RFC1918 or loopback IP literal."""
    if not re.match(r"^\d+\.\d+\.\d+\.\d+$", host):
        return False
    parts = [int(p) for p in host.split(".")]
    if parts[0] == 10:
        return True
    if parts[0] == 127:
        return True
    if parts[0] == 172 and 16 <= parts[1] <= 31:
        return True
    if parts[0] == 192 and parts[1] == 168:
        return True
    return False


def classify_url(url: str) -> str:
    """Classify a URL as 'harvest' (A/B), 'check' (C), or 'skip' (D)."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return "skip"

    host = (parsed.hostname or "").lower()
    if not host:
        return "skip"

    if _is_private_ip(host):
        return "skip"

    for pattern in SKIP_DOMAIN_PATTERNS:
        if re.search(pattern, host):
            return "skip"

    for pattern in C_TYPE_URL_PATTERNS:
        if re.match(pattern, url):
            return "check"

    return "harvest"
# ---------------------------------------------------------------------------
|
||||||
|
# Filename derivation
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
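A minimal sketch of the classifier's decision order, with illustrative pattern lists (the real `SKIP_DOMAIN_PATTERNS` and `C_TYPE_URL_PATTERNS` are defined elsewhere in this module; the two patterns below are assumptions for the demo, and the private-IP check is omitted):

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-ins for the module-level pattern lists.
SKIP_DOMAIN_PATTERNS = [r"twitter\.com$"]
C_TYPE_URL_PATTERNS = [r"^https://github\.com/.+/issues/"]

def classify_url(url: str) -> str:
    try:
        parsed = urlparse(url)
    except ValueError:
        return "skip"
    host = (parsed.hostname or "").lower()
    if not host:
        return "skip"
    for pattern in SKIP_DOMAIN_PATTERNS:
        if re.search(pattern, host):
            return "skip"
    for pattern in C_TYPE_URL_PATTERNS:
        if re.match(pattern, url):
            return "check"
    return "harvest"

print(classify_url("https://github.com/foo/bar/issues/42"))   # check
print(classify_url("https://twitter.com/someone/status/1"))   # skip
print(classify_url("https://arxiv.org/abs/2401.00001"))       # harvest
```

Note the ordering: skip rules win over check rules, and anything that survives both lists defaults to harvest.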
def slugify(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


def raw_filename_for_url(url: str) -> str:
    parsed = urlparse(url)
    host = parsed.netloc.lower().replace("www.", "")
    path = parsed.path.rstrip("/")
    host_slug = slugify(host)
    path_slug = slugify(path) if path else "index"
    # Truncate overly long names
    if len(path_slug) > 80:
        path_slug = path_slug[:80].rstrip("-")
    return f"{host_slug}-{path_slug}.md"


# ---------------------------------------------------------------------------
# Fetch cascade
# ---------------------------------------------------------------------------
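The two filename helpers above are pure functions, so their behavior is easy to pin down. A standalone copy:

```python
import re
from urllib.parse import urlparse

def slugify(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse non-alphanumeric runs
    return text.strip("-")

def raw_filename_for_url(url: str) -> str:
    parsed = urlparse(url)
    host = parsed.netloc.lower().replace("www.", "")
    path = parsed.path.rstrip("/")
    host_slug = slugify(host)
    path_slug = slugify(path) if path else "index"
    if len(path_slug) > 80:
        path_slug = path_slug[:80].rstrip("-")
    return f"{host_slug}-{path_slug}.md"

print(slugify("Hello, World!"))                                   # hello-world
print(raw_filename_for_url("https://www.example.com/Blog/Post-1/"))  # example-com-blog-post-1.md
print(raw_filename_for_url("https://example.com"))                   # example-com-index.md
```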
def run_fetch_command(cmd: list[str], timeout: int = FETCH_TIMEOUT) -> tuple[bool, str]:
    """Run a fetch command and return (success, output)."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if result.returncode != 0:
            return False, result.stderr.strip() or "non-zero exit"
        return True, result.stdout
    except subprocess.TimeoutExpired:
        return False, "timeout"
    except FileNotFoundError as e:
        return False, f"command not found: {e}"
    except OSError as e:
        return False, str(e)


def validate_content(content: str) -> bool:
    if not content or len(content.strip()) < MIN_CONTENT_LENGTH:
        return False
    low = content.lower()
    if any(marker in low for marker in HTML_LEAK_MARKERS):
        return False
    return True


def fetch_with_trafilatura(url: str) -> tuple[bool, str]:
    ok, out = run_fetch_command(
        ["trafilatura", "-u", url, "--markdown", "--no-comments", "--precision"]
    )
    if ok and validate_content(out):
        return True, out
    return False, out if not ok else "content validation failed"


def fetch_with_crawl4ai(url: str, stealth: bool = False) -> tuple[bool, str]:
    cmd = ["crwl", url, "-o", "markdown-fit"]
    if stealth:
        cmd += [
            "-b", "headless=true,user_agent_mode=random",
            "-c", "magic=true,scan_full_page=true,page_timeout=20000",
        ]
    else:
        cmd += ["-c", "page_timeout=15000"]
    ok, out = run_fetch_command(cmd, timeout=90)
    if ok and validate_content(out):
        return True, out
    return False, out if not ok else "content validation failed"


def fetch_from_conversation(url: str, conversation_file: Path) -> tuple[bool, str]:
    """Fallback: scrape a block of content near where the URL appears in the transcript.

    If the assistant fetched the URL during the session, some portion of the
    content is likely inline in the transcript.
    """
    try:
        text = conversation_file.read_text(errors="replace")
    except OSError:
        return False, "cannot read conversation file"

    idx = text.find(url)
    if idx == -1:
        return False, "url not found in conversation"

    # Grab up to 2000 chars after the URL mention
    snippet = text[idx : idx + 2000]
    if not validate_content(snippet):
        return False, "snippet failed validation"
    return True, snippet


def fetch_cascade(url: str, conversation_file: Path) -> tuple[bool, str, str]:
    """Attempt the full fetch cascade. Returns (success, content, method_used)."""
    ok, out = fetch_with_trafilatura(url)
    if ok:
        return True, out, "trafilatura"

    ok, out = fetch_with_crawl4ai(url, stealth=False)
    if ok:
        return True, out, "crawl4ai"

    ok, out = fetch_with_crawl4ai(url, stealth=True)
    if ok:
        return True, out, "crawl4ai-stealth"

    ok, out = fetch_from_conversation(url, conversation_file)
    if ok:
        return True, out, "conversation-fallback"

    return False, out, "failed"


# ---------------------------------------------------------------------------
# Raw file storage
# ---------------------------------------------------------------------------
def content_hash(content: str) -> str:
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()


def write_raw_file(
    url: str,
    content: str,
    method: str,
    discovered_in: Path,
) -> Path:
    RAW_HARVESTED_DIR.mkdir(parents=True, exist_ok=True)
    filename = raw_filename_for_url(url)
    out_path = RAW_HARVESTED_DIR / filename
    # Collision: append short hash
    if out_path.exists():
        suffix = hashlib.sha256(url.encode()).hexdigest()[:8]
        out_path = RAW_HARVESTED_DIR / f"{out_path.stem}-{suffix}.md"

    rel_discovered = discovered_in.relative_to(WIKI_DIR)
    frontmatter = [
        "---",
        f"source_url: {url}",
        f"fetched_date: {datetime.now(timezone.utc).date().isoformat()}",
        f"fetch_method: {method}",
        f"discovered_in: {rel_discovered}",
        f"content_hash: {content_hash(content)}",
        "---",
        "",
    ]
    out_path.write_text("\n".join(frontmatter) + content.strip() + "\n")
    return out_path


# ---------------------------------------------------------------------------
# AI compilation via claude -p
# ---------------------------------------------------------------------------
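The hash format used in the raw-file frontmatter is just a prefixed hex digest. A standalone copy of `content_hash`, checkable against `hashlib` directly:

```python
import hashlib

def content_hash(content: str) -> str:
    # "sha256:" prefix makes the algorithm explicit in stored metadata.
    return "sha256:" + hashlib.sha256(content.encode("utf-8")).hexdigest()

h = content_hash("hello")
print(h.startswith("sha256:"))   # True
print(len(h) - len("sha256:"))   # 64 (hex digest length)
```

Keeping the algorithm name in the stored value means a future migration to another hash can distinguish old entries without a schema change.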
COMPILE_PROMPT_TEMPLATE = """You are compiling a raw harvested source document into the LLM wiki at {wiki_dir}.

The wiki schema and conventions are defined in CLAUDE.md. The wiki has four
content directories: patterns/ (how), decisions/ (why), environments/ (where),
concepts/ (what). All pages require YAML frontmatter with title, type,
confidence, sources, related, last_compiled, last_verified.

IMPORTANT: Do NOT include `status`, `origin`, `staged_*`, `target_path`,
`modifies`, `harvest_source`, or `compilation_notes` fields in your page
frontmatter — the harvest script injects those automatically.

The raw source material is below. Decide what to do with it and emit the
result as a single JSON object on stdout (nothing else). Valid actions:

- "new_page" — create a new wiki page
- "update_page" — update an existing wiki page (add source, merge content)
- "both" — create a new page AND update an existing one
- "skip" — content isn't substantive enough to warrant a wiki page

JSON schema:

{{
  "action": "new_page" | "update_page" | "both" | "skip",
  "compilation_notes": "1-3 sentences explaining what you did and why",
  "new_page": {{
    "directory": "patterns" | "decisions" | "environments" | "concepts",
    "filename": "kebab-case-name.md",
    "content": "full markdown including frontmatter"
  }},
  "update_page": {{
    "path": "patterns/existing-page.md",
    "content": "full updated markdown including frontmatter"
  }}
}}

Omit "new_page" if not applicable; omit "update_page" if not applicable. If
action is "skip", omit both. Do NOT include any prose outside the JSON.

Wiki index (so you know what pages exist):

{wiki_index}

Raw harvested source:

{raw_content}

Conversation context (the working session where this URL was cited):

{conversation_context}
"""
def call_claude_compile(
    raw_path: Path,
    raw_content: str,
    conversation_file: Path,
) -> dict[str, Any] | None:
    """Invoke `claude -p` to compile the raw source into a staging wiki page."""

    # Pick model by size
    model = CLAUDE_SONNET_MODEL if len(raw_content) > SONNET_CONTENT_THRESHOLD else CLAUDE_HAIKU_MODEL

    try:
        wiki_index = INDEX_FILE.read_text()[:20_000]
    except OSError:
        wiki_index = ""

    try:
        conversation_context = conversation_file.read_text(errors="replace")[:8_000]
    except OSError:
        conversation_context = ""

    prompt = COMPILE_PROMPT_TEMPLATE.format(
        wiki_dir=str(WIKI_DIR),
        wiki_index=wiki_index,
        raw_content=raw_content[:40_000],
        conversation_context=conversation_context,
    )

    try:
        result = subprocess.run(
            ["claude", "-p", "--model", model, "--output-format", "text", prompt],
            capture_output=True,
            text=True,
            timeout=600,
        )
    except FileNotFoundError:
        print("  [warn] claude CLI not found — skipping compilation", file=sys.stderr)
        return None
    except subprocess.TimeoutExpired:
        print("  [warn] claude -p timed out", file=sys.stderr)
        return None

    if result.returncode != 0:
        print(f"  [warn] claude -p failed: {result.stderr.strip()[:200]}", file=sys.stderr)
        return None

    # Extract JSON from output (may be wrapped in fences)
    output = result.stdout.strip()
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if not match:
        print(f"  [warn] no JSON found in claude output ({len(output)} chars)", file=sys.stderr)
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError as e:
        print(f"  [warn] JSON parse failed: {e}", file=sys.stderr)
        return None
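The JSON-extraction step above relies on a greedy `\{.*\}` match with `re.DOTALL`: it spans from the first `{` to the last `}`, which strips any surrounding code fences or prose while keeping the whole object intact. A minimal sketch of that behavior:

```python
import json
import re

# Simulated model output: a JSON object wrapped in a markdown fence.
output = '```json\n{"action": "skip", "compilation_notes": "thin content"}\n```'

# Greedy match from first "{" to last "}" (DOTALL lets "." cross newlines).
match = re.search(r"\{.*\}", output, re.DOTALL)
result = json.loads(match.group(0))
print(result["action"])  # skip
```

The greedy match works because a single well-formed JSON object ends at the last `}` in the output; trailing prose containing another `}` would break the parse, which is exactly the case the `JSONDecodeError` handler catches.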
STAGING_INJECT_TEMPLATE = (
    "---\n"
    "origin: automated\n"
    "status: pending\n"
    "staged_date: {staged_date}\n"
    "staged_by: wiki-harvest\n"
    "target_path: {target_path}\n"
    "{modifies_line}"
    "harvest_source: {source_url}\n"
    "compilation_notes: {compilation_notes}\n"
)


def _inject_staging_frontmatter(
    content: str,
    source_url: str,
    target_path: str,
    compilation_notes: str,
    modifies: str | None,
) -> str:
    """Insert staging metadata after the opening --- fence of the AI-generated content."""
    # Strip existing status/origin/staged fields the AI may have added
    content = re.sub(
        r"^(status|origin|staged_\w+|target_path|modifies|harvest_source|compilation_notes):.*\n",
        "",
        content,
        flags=re.MULTILINE,
    )

    modifies_line = f"modifies: {modifies}\n" if modifies else ""
    # Collapse multi-line compilation notes to single line for safe YAML
    clean_notes = compilation_notes.replace("\n", " ").replace("\r", " ").strip()
    injection = STAGING_INJECT_TEMPLATE.format(
        staged_date=datetime.now(timezone.utc).date().isoformat(),
        target_path=target_path,
        modifies_line=modifies_line,
        source_url=source_url,
        compilation_notes=clean_notes or "(none provided)",
    )

    if content.startswith("---\n"):
        return injection + content[4:]
    # AI forgot the fence — prepend full frontmatter
    return injection + "---\n" + content


def _unique_staging_path(base: Path) -> Path:
    """Append a short hash if the target already exists."""
    if not base.exists():
        return base
    suffix = hashlib.sha256(str(base).encode() + str(time.time()).encode()).hexdigest()[:6]
    return base.with_stem(f"{base.stem}-{suffix}")
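The strip-then-prepend dance above can be sketched end to end: reserved fields the model may have emitted are deleted with a multiline regex, and the script's own metadata block replaces the opening `---` fence. A self-contained version of just that string manipulation:

```python
import re

# AI-generated page that wrongly includes a reserved "status" field.
content = (
    "---\n"
    "title: Example Page\n"
    "status: draft\n"
    "type: pattern\n"
    "---\n"
    "Body text.\n"
)

# Remove reserved fields anywhere in the text (MULTILINE anchors ^ per line).
content = re.sub(
    r"^(status|origin|staged_\w+|target_path|modifies|harvest_source|compilation_notes):.*\n",
    "",
    content,
    flags=re.MULTILINE,
)

injection = "---\norigin: automated\nstatus: pending\n"
if content.startswith("---\n"):
    # Replace the opening fence: injection already starts with "---\n".
    result = injection + content[4:]
else:
    result = injection + "---\n" + content

print(result.splitlines()[1])  # origin: automated
```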
def apply_compile_result(
    result: dict[str, Any],
    source_url: str,
    raw_path: Path,
) -> list[Path]:
    """Write the AI compilation result into staging/. Returns paths written."""
    written: list[Path] = []
    action = result.get("action", "skip")
    if action == "skip":
        return written

    notes = result.get("compilation_notes", "")

    # New page
    new_page = result.get("new_page") or {}
    if action in ("new_page", "both") and new_page.get("filename") and new_page.get("content"):
        directory = new_page.get("directory", "patterns")
        filename = new_page["filename"]
        target_rel = f"{directory}/{filename}"
        dest = _unique_staging_path(STAGING_DIR / target_rel)
        dest.parent.mkdir(parents=True, exist_ok=True)
        content = _inject_staging_frontmatter(
            new_page["content"],
            source_url=source_url,
            target_path=target_rel,
            compilation_notes=notes,
            modifies=None,
        )
        dest.write_text(content)
        written.append(dest)

    # Update to existing page
    update_page = result.get("update_page") or {}
    if action in ("update_page", "both") and update_page.get("path") and update_page.get("content"):
        target_rel = update_page["path"]
        dest = _unique_staging_path(STAGING_DIR / target_rel)
        dest.parent.mkdir(parents=True, exist_ok=True)
        content = _inject_staging_frontmatter(
            update_page["content"],
            source_url=source_url,
            target_path=target_rel,
            compilation_notes=notes,
            modifies=target_rel,
        )
        dest.write_text(content)
        written.append(dest)

    return written


# ---------------------------------------------------------------------------
# Wiki topic coverage check (for C-type URLs)
# ---------------------------------------------------------------------------
def wiki_covers_topic(url: str) -> bool:
    """Quick heuristic: check if any wiki page mentions terms from the URL path.

    Used for C-type URLs (GitHub issues, SO questions) — only harvest if the
    wiki already covers the topic.
    """
    try:
        parsed = urlparse(url)
    except ValueError:
        return False

    # Derive candidate keywords from path
    path_terms = [t for t in re.split(r"[/\-_]+", parsed.path.lower()) if len(t) >= 4]
    if not path_terms:
        return False

    # Try qmd search if available; if it's missing or times out, treat the
    # topic as not covered rather than blocking the harvest run
    query = " ".join(path_terms[:5])
    try:
        result = subprocess.run(
            ["qmd", "search", query, "--json", "-n", "3"],
            capture_output=True,
            text=True,
            timeout=30,
        )
        if result.returncode == 0 and result.stdout.strip():
            try:
                data = json.loads(result.stdout)
                hits = data.get("results") if isinstance(data, dict) else data
                return bool(hits)
            except json.JSONDecodeError:
                return False
    except (FileNotFoundError, subprocess.TimeoutExpired):
        pass

    return False


# ---------------------------------------------------------------------------
# Conversation discovery
# ---------------------------------------------------------------------------
def parse_frontmatter(file_path: Path) -> dict[str, str]:
    fm: dict[str, str] = {}
    try:
        text = file_path.read_text(errors="replace")
    except OSError:
        return fm
    if not text.startswith("---\n"):
        return fm
    end = text.find("\n---\n", 4)
    if end == -1:
        return fm
    for line in text[4:end].splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip()
    return fm


def discover_summarized_conversations(
    project_filter: str | None = None,
    file_filter: str | None = None,
) -> list[Path]:
    if file_filter:
        path = Path(file_filter)
        if not path.is_absolute():
            path = WIKI_DIR / path
        return [path] if path.exists() else []

    files: list[Path] = []
    for project_dir in sorted(CONVERSATIONS_DIR.iterdir()):
        if not project_dir.is_dir():
            continue
        if project_filter and project_dir.name != project_filter:
            continue
        for md in sorted(project_dir.glob("*.md")):
            fm = parse_frontmatter(md)
            if fm.get("status") == "summarized":
                files.append(md)
    return files


# ---------------------------------------------------------------------------
# Main pipeline
# ---------------------------------------------------------------------------
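The frontmatter parser above is deliberately minimal: it only handles a leading `---` fence and flat `key: value` lines, with no YAML library involved. A standalone version of the same parsing logic, operating on a string instead of a file (the string-based wrapper name is an illustration, not part of the module):

```python
def parse_frontmatter_text(text: str) -> dict[str, str]:
    fm: dict[str, str] = {}
    if not text.startswith("---\n"):
        return fm
    # Closing fence must appear after the opening one (search from index 4).
    end = text.find("\n---\n", 4)
    if end == -1:
        return fm
    for line in text[4:end].splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fm[key.strip()] = value.strip()
    return fm

doc = "---\nstatus: summarized\nproject: memex\n---\n# Notes\n"
print(parse_frontmatter_text(doc))  # {'status': 'summarized', 'project': 'memex'}
```

Nested YAML values, lists, and quoted strings are out of scope by design; the pipeline only ever reads flat string fields like `status`.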
def process_url(
    url: str,
    conversation_file: Path,
    state: dict[str, Any],
    dry_run: bool,
    compile_enabled: bool,
) -> str:
    """Process a single URL. Returns a short status tag for logging."""

    rel_conv = str(conversation_file.relative_to(WIKI_DIR))
    today = datetime.now(timezone.utc).date().isoformat()

    # Already harvested?
    if url in state["harvested_urls"]:
        entry = state["harvested_urls"][url]
        if rel_conv not in entry.get("seen_in", []):
            entry.setdefault("seen_in", []).append(rel_conv)
        return "dup-harvested"

    # Already rejected by AI?
    if url in state["rejected_urls"]:
        return "dup-rejected"

    # Previously skipped?
    if url in state["skipped_urls"]:
        return "dup-skipped"

    # Previously failed too many times?
    if url in state["failed_urls"]:
        if state["failed_urls"][url].get("attempts", 0) >= MAX_FAILED_ATTEMPTS:
            return "dup-failed"

    # Classify
    classification = classify_url(url)
    if classification == "skip":
        state["skipped_urls"][url] = {
            "reason": "domain-skip-list",
            "first_seen": today,
        }
        return "skip-domain"

    if classification == "check":
        if not wiki_covers_topic(url):
            state["skipped_urls"][url] = {
                "reason": "c-type-no-wiki-match",
                "first_seen": today,
            }
            return "skip-c-type"

    if dry_run:
        return f"would-harvest ({classification})"

    # Fetch
    print(f"  [fetch] {url}")
    ok, content, method = fetch_cascade(url, conversation_file)
    time.sleep(FETCH_DELAY_SECONDS)

    if not ok:
        entry = state["failed_urls"].setdefault(url, {
            "first_seen": today,
            "attempts": 0,
        })
        entry["attempts"] += 1
        entry["last_attempt"] = today
        entry["reason"] = content[:200] if content else "unknown"
        return f"fetch-failed ({method})"

    # Save raw file
    raw_path = write_raw_file(url, content, method, conversation_file)
    rel_raw = str(raw_path.relative_to(WIKI_DIR))

    state["harvested_urls"][url] = {
        "first_seen": today,
        "seen_in": [rel_conv],
        "raw_file": rel_raw,
        "wiki_pages": [],
        "status": "raw",
        "fetch_method": method,
        "last_checked": today,
    }

    # Compile via claude -p
    if compile_enabled:
        print(f"  [compile] {rel_raw}")
        result = call_claude_compile(raw_path, content, conversation_file)
        if result is None:
            state["harvested_urls"][url]["status"] = "raw-compile-failed"
            return f"raw-saved ({method}) compile-failed"

        action = result.get("action", "skip")
        if action == "skip":
            state["rejected_urls"][url] = {
                "reason": result.get("compilation_notes", "AI rejected"),
                "rejected_date": today,
            }
            # Remove from harvested; keep raw file for audit
            state["harvested_urls"].pop(url, None)
            return f"rejected ({method})"

        written = apply_compile_result(result, url, raw_path)
        state["harvested_urls"][url]["status"] = "compiled"
        state["harvested_urls"][url]["wiki_pages"] = [
            str(p.relative_to(WIKI_DIR)) for p in written
        ]
        return f"compiled ({method}) → {len(written)} staging file(s)"

    return f"raw-saved ({method})"
def main() -> int:
    parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
    parser.add_argument("--project", help="Only process this project (wing) directory")
    parser.add_argument("--file", help="Only process this conversation file")
    parser.add_argument("--dry-run", action="store_true", help="Classify and report without fetching")
    parser.add_argument("--no-compile", action="store_true", help="Fetch raw only; skip claude -p compile")
    parser.add_argument("--limit", type=int, default=0, help="Stop after N new URLs processed (0 = no limit)")
    args = parser.parse_args()

    files = discover_summarized_conversations(args.project, args.file)
    print(f"Scanning {len(files)} summarized conversation(s) for URLs...")

    state = load_state()
    stats: dict[str, int] = {}
    processed_new = 0

    for file_path in files:
        urls = extract_urls_from_file(file_path)
        if not urls:
            continue
        rel = file_path.relative_to(WIKI_DIR)
        print(f"\n[{rel}] {len(urls)} URL(s)")

        for url in urls:
            status = process_url(
                url,
                file_path,
                state,
                dry_run=args.dry_run,
                compile_enabled=not args.no_compile,
            )
            stats[status] = stats.get(status, 0) + 1
            print(f"  [{status}] {url}")

            # Persist state after each non-dry URL
            if not args.dry_run and not status.startswith("dup-"):
                processed_new += 1
                save_state(state)

            if args.limit and processed_new >= args.limit:
                print(f"\nLimit reached ({args.limit}); stopping.")
                save_state(state)
                _print_summary(stats)
                return 0

    if not args.dry_run:
        save_state(state)

    _print_summary(stats)
    return 0


def _print_summary(stats: dict[str, int]) -> None:
    print("\nSummary:")
    for status, count in sorted(stats.items()):
        print(f"  {status}: {count}")


if __name__ == "__main__":
    sys.exit(main())
1587  scripts/wiki-hygiene.py  (executable file — diff suppressed because it is too large)

198  scripts/wiki-maintain.sh  (executable file)
@@ -0,0 +1,198 @@
#!/usr/bin/env bash
set -euo pipefail

# wiki-maintain.sh — Top-level orchestrator for wiki maintenance.
#
# Chains the three maintenance scripts in the correct order:
#   1. wiki-harvest.py    (URL harvesting from summarized conversations)
#   2. wiki-hygiene.py    (quick or full hygiene checks)
#   3. qmd update && qmd embed    (reindex after changes)
#
# Usage:
#   wiki-maintain.sh                         # Harvest + quick hygiene
#   wiki-maintain.sh --full                  # Harvest + full hygiene (LLM-powered)
#   wiki-maintain.sh --harvest-only          # URL harvesting only
#   wiki-maintain.sh --hygiene-only          # Quick hygiene only
#   wiki-maintain.sh --hygiene-only --full   # Full hygiene only
#   wiki-maintain.sh --dry-run               # Show what would run (no writes)
#   wiki-maintain.sh --no-compile            # Harvest without claude -p compilation step
#   wiki-maintain.sh --no-reindex            # Skip qmd update/embed after
#
# Log file: scripts/.maintain.log (rotated manually)

# Resolve script location first so we can find sibling scripts regardless of
# how WIKI_DIR is set. WIKI_DIR defaults to the parent of scripts/ but may be
# overridden for tests or alternate installs.
SCRIPTS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
WIKI_DIR="${WIKI_DIR:-$(dirname "${SCRIPTS_DIR}")}"
LOG_FILE="${SCRIPTS_DIR}/.maintain.log"

# -----------------------------------------------------------------------------
# Argument parsing
# -----------------------------------------------------------------------------

FULL_MODE=false
HARVEST_ONLY=false
HYGIENE_ONLY=false
DRY_RUN=false
NO_COMPILE=false
NO_REINDEX=false

while [[ $# -gt 0 ]]; do
    case "$1" in
        --full) FULL_MODE=true; shift ;;
        --harvest-only) HARVEST_ONLY=true; shift ;;
        --hygiene-only) HYGIENE_ONLY=true; shift ;;
        --dry-run) DRY_RUN=true; shift ;;
        --no-compile) NO_COMPILE=true; shift ;;
        --no-reindex) NO_REINDEX=true; shift ;;
        -h|--help)
            sed -n '3,20p' "$0" | sed 's/^# \?//'
            exit 0
            ;;
        *)
            echo "Unknown option: $1" >&2
            exit 1
            ;;
    esac
done

if [[ "${HARVEST_ONLY}" == "true" && "${HYGIENE_ONLY}" == "true" ]]; then
    echo "--harvest-only and --hygiene-only are mutually exclusive" >&2
    exit 1
fi

# -----------------------------------------------------------------------------
# Logging
# -----------------------------------------------------------------------------

log() {
    local ts
    ts="$(date '+%Y-%m-%d %H:%M:%S')"
    printf '[%s] %s\n' "${ts}" "$*"
}

section() {
    echo ""
    log "━━━ $* ━━━"
}

# -----------------------------------------------------------------------------
# Sanity checks
# -----------------------------------------------------------------------------

if [[ ! -d "${WIKI_DIR}" ]]; then
    echo "Wiki directory not found: ${WIKI_DIR}" >&2
    exit 1
fi

cd "${WIKI_DIR}"

for req in python3 qmd; do
    if ! command -v "${req}" >/dev/null 2>&1; then
        if [[ "${req}" == "qmd" && "${NO_REINDEX}" == "true" ]]; then
            continue  # qmd not required if --no-reindex
        fi
        echo "Required command not found: ${req}" >&2
        exit 1
    fi
done

# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------

START_TS="$(date '+%s')"
section "wiki-maintain.sh starting"
log "mode: $(${FULL_MODE} && echo full || echo quick)"
log "harvest: $(${HYGIENE_ONLY} && echo skipped || echo enabled)"
log "hygiene: $(${HARVEST_ONLY} && echo skipped || echo enabled)"
log "reindex: $(${NO_REINDEX} && echo skipped || echo enabled)"
log "dry-run: ${DRY_RUN}"
log "wiki: ${WIKI_DIR}"

# -----------------------------------------------------------------------------
# Phase 1: Harvest
# -----------------------------------------------------------------------------

if [[ "${HYGIENE_ONLY}" != "true" ]]; then
    section "Phase 1: URL harvesting"
    harvest_args=()
    ${DRY_RUN} && harvest_args+=(--dry-run)
    ${NO_COMPILE} && harvest_args+=(--no-compile)

    if python3 "${SCRIPTS_DIR}/wiki-harvest.py" "${harvest_args[@]}"; then
        log "harvest completed"
    else
        log "[error] harvest failed (exit $?) — continuing to hygiene"
    fi
else
    section "Phase 1: URL harvesting (skipped)"
fi

# -----------------------------------------------------------------------------
# Phase 2: Hygiene
# -----------------------------------------------------------------------------

if [[ "${HARVEST_ONLY}" != "true" ]]; then
    section "Phase 2: Hygiene checks"
    hygiene_args=()
    if ${FULL_MODE}; then
        hygiene_args+=(--full)
    fi
    ${DRY_RUN} && hygiene_args+=(--dry-run)

    if python3 "${SCRIPTS_DIR}/wiki-hygiene.py" "${hygiene_args[@]}"; then
        log "hygiene completed"
    else
        log "[error] hygiene failed (exit $?) — continuing to reindex"
    fi
else
    section "Phase 2: Hygiene checks (skipped)"
fi

# -----------------------------------------------------------------------------
# Phase 3: qmd reindex
# -----------------------------------------------------------------------------

if [[ "${NO_REINDEX}" != "true" && "${DRY_RUN}" != "true" ]]; then
    section "Phase 3: qmd reindex"

    if qmd update 2>&1 | sed 's/^/  /'; then
        log "qmd update completed"
    else
        log "[error] qmd update failed (exit $?)"
    fi

    if qmd embed 2>&1 | sed 's/^/  /'; then
        log "qmd embed completed"
    else
        log "[warn] qmd embed failed or produced warnings"
    fi
else
    section "Phase 3: qmd reindex (skipped)"
fi

# -----------------------------------------------------------------------------
# Summary
# -----------------------------------------------------------------------------

END_TS="$(date '+%s')"
DURATION=$((END_TS - START_TS))
section "wiki-maintain.sh finished in ${DURATION}s"

# Report the most recent hygiene reports, if any. Use `if` statements (not
# `[[ ]] && action`) because under `set -e` a false test at end-of-script
# becomes the process exit status.
if [[ -d "${WIKI_DIR}/reports" ]]; then
    latest_fixed="$(ls -t "${WIKI_DIR}"/reports/hygiene-*-fixed.md 2>/dev/null | head -n 1 || true)"
    latest_review="$(ls -t "${WIKI_DIR}"/reports/hygiene-*-needs-review.md 2>/dev/null | head -n 1 || true)"
    if [[ -n "${latest_fixed}" ]]; then
        log "latest fixed report: $(basename "${latest_fixed}")"
    fi
    if [[ -n "${latest_review}" ]]; then
        log "latest review report: $(basename "${latest_review}")"
    fi
fi

exit 0
639	scripts/wiki-staging.py	Executable file
@@ -0,0 +1,639 @@
#!/usr/bin/env python3
"""Human-in-the-loop staging pipeline for wiki content.

Pure file operations — no LLM calls. Moves pages between staging/ and the live
wiki, updates indexes, rewrites cross-references, and tracks rejections in
.harvest-state.json.

Usage:
    python3 scripts/wiki-staging.py --list                        # List pending items
    python3 scripts/wiki-staging.py --list --json                 # JSON output
    python3 scripts/wiki-staging.py --stats                       # Summary by type and age
    python3 scripts/wiki-staging.py --promote PATH                # Approve one page
    python3 scripts/wiki-staging.py --reject PATH --reason "..."  # Reject with reason
    python3 scripts/wiki-staging.py --promote-all                 # Approve everything
    python3 scripts/wiki-staging.py --review                      # Interactive approval loop
    python3 scripts/wiki-staging.py --sync                        # Rebuild staging/index.md

PATH may be relative to the wiki root (e.g. `staging/patterns/foo.md`) or absolute.
"""

from __future__ import annotations

import argparse
import json
import re
import sys
from datetime import date
from pathlib import Path
from typing import Any

# Import shared helpers
sys.path.insert(0, str(Path(__file__).parent))
from wiki_lib import (  # noqa: E402
    ARCHIVE_DIR,
    CONVERSATIONS_DIR,
    HARVEST_STATE_FILE,
    INDEX_FILE,
    LIVE_CONTENT_DIRS,
    REPORTS_DIR,
    STAGING_DIR,
    STAGING_INDEX,
    WIKI_DIR,
    WikiPage,
    iter_live_pages,
    iter_staging_pages,
    parse_date,
    parse_page,
    today,
    write_page,
)

sys.stdout.reconfigure(line_buffering=True)
sys.stderr.reconfigure(line_buffering=True)

# Fields stripped from frontmatter on promotion (staging-only metadata)
STAGING_ONLY_FIELDS = [
    "status",
    "staged_date",
    "staged_by",
    "target_path",
    "modifies",
    "compilation_notes",
]


# ---------------------------------------------------------------------------
# Discovery
# ---------------------------------------------------------------------------


def list_pending() -> list[WikiPage]:
    pages = [p for p in iter_staging_pages() if p.path.name != "index.md"]
    return pages


def page_summary(page: WikiPage) -> dict[str, Any]:
    rel = str(page.path.relative_to(WIKI_DIR))
    fm = page.frontmatter
    target = fm.get("target_path") or _infer_target_path(page)
    staged = parse_date(fm.get("staged_date"))
    age = (today() - staged).days if staged else None
    return {
        "path": rel,
        "title": fm.get("title", page.path.stem),
        "type": fm.get("type", _infer_type(page)),
        "status": fm.get("status", "pending"),
        "origin": fm.get("origin", "automated"),
        "staged_by": fm.get("staged_by", "unknown"),
        "staged_date": str(staged) if staged else None,
        "age_days": age,
        "target_path": target,
        "modifies": fm.get("modifies"),
        "compilation_notes": fm.get("compilation_notes", ""),
    }


def _infer_target_path(page: WikiPage) -> str:
    """Derive a target path when target_path isn't set in frontmatter."""
    try:
        rel = page.path.relative_to(STAGING_DIR)
    except ValueError:
        return str(page.path.relative_to(WIKI_DIR))
    return str(rel)


def _infer_type(page: WikiPage) -> str:
    """Infer type from the directory name when frontmatter doesn't specify it."""
    parts = page.path.relative_to(STAGING_DIR).parts
    if len(parts) >= 2 and parts[0] in LIVE_CONTENT_DIRS:
        return parts[0].rstrip("s")  # 'patterns' → 'pattern'
    return "unknown"


# ---------------------------------------------------------------------------
# Main index update
# ---------------------------------------------------------------------------


def _remove_from_main_index(rel_path: str) -> None:
    if not INDEX_FILE.exists():
        return
    text = INDEX_FILE.read_text()
    lines = text.splitlines(keepends=True)
    pattern = re.compile(rf"^- \[.+\]\({re.escape(rel_path)}\) ")
    new_lines = [line for line in lines if not pattern.match(line)]
    if len(new_lines) != len(lines):
        INDEX_FILE.write_text("".join(new_lines))


def _add_to_main_index(rel_path: str, title: str, summary: str = "") -> None:
    """Append a new entry under the appropriate section. Best-effort — operator may re-order later."""
    if not INDEX_FILE.exists():
        return
    text = INDEX_FILE.read_text()
    # Avoid duplicates
    if f"]({rel_path})" in text:
        return
    entry = f"- [{title}]({rel_path})"
    if summary:
        entry += f" — {summary}"
    entry += "\n"
    # Insert at the end of the first matching section
    ptype = rel_path.split("/")[0]
    section_headers = {
        "patterns": "## Patterns",
        "decisions": "## Decisions",
        "concepts": "## Concepts",
        "environments": "## Environments",
    }
    header = section_headers.get(ptype)
    if header and header in text:
        # Find the header and append before the next ## header or EOF
        idx = text.find(header)
        next_header = text.find("\n## ", idx + len(header))
        if next_header == -1:
            next_header = len(text)
        # Find the last non-empty line in the section
        section = text[idx:next_header]
        last_nl = section.rfind("\n", 0, len(section) - 1) + 1
        INDEX_FILE.write_text(text[: idx + last_nl] + entry + text[idx + last_nl :])
    else:
        INDEX_FILE.write_text(text.rstrip() + "\n" + entry)


# ---------------------------------------------------------------------------
# Staging index update
# ---------------------------------------------------------------------------


def regenerate_staging_index() -> None:
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    pending = list_pending()

    lines = [
        "# Staging — Pending Wiki Content",
        "",
        "Content awaiting human review. These pages were generated by automated scripts",
        "and need approval before joining the live wiki.",
        "",
        "**Review options**:",
        "- Browse in Obsidian and move files manually (then run `scripts/wiki-staging.py --sync`)",
        "- Run `python3 scripts/wiki-staging.py --list` for a summary",
        "- Start a Claude session: \"let's review what's in staging\"",
        "",
        f"**{len(pending)} pending item(s)** as of {today().isoformat()}",
        "",
        "## Pending Items",
        "",
    ]

    if not pending:
        lines.append("_No pending items._")
    else:
        lines.append("| Page | Type | Source | Staged | Age | Target |")
        lines.append("|------|------|--------|--------|-----|--------|")
        for page in pending:
            s = page_summary(page)
            title = s["title"]
            rel_in_staging = str(page.path.relative_to(STAGING_DIR))
            age = f"{s['age_days']}d" if s["age_days"] is not None else "—"
            staged = s["staged_date"] or "—"
            lines.append(
                f"| [{title}]({rel_in_staging}) | {s['type']} | "
                f"{s['staged_by']} | {staged} | {age} | `{s['target_path']}` |"
            )

    STAGING_INDEX.write_text("\n".join(lines) + "\n")


# ---------------------------------------------------------------------------
# Cross-reference rewriting
# ---------------------------------------------------------------------------


def _rewrite_cross_references(old_path: str, new_path: str) -> int:
    """Rewrite links and `related:` entries across the wiki."""
    targets: list[Path] = [INDEX_FILE]
    for sub in LIVE_CONTENT_DIRS:
        targets.extend((WIKI_DIR / sub).glob("*.md"))
    if STAGING_DIR.exists():
        for sub in LIVE_CONTENT_DIRS:
            targets.extend((STAGING_DIR / sub).glob("*.md"))
    if ARCHIVE_DIR.exists():
        for sub in LIVE_CONTENT_DIRS:
            targets.extend((ARCHIVE_DIR / sub).glob("*.md"))

    count = 0
    old_esc = re.escape(old_path)
    link_patterns = [
        (re.compile(rf"\]\({old_esc}\)"), f"]({new_path})"),
        (re.compile(rf"\]\(\.\./{old_esc}\)"), f"](../{new_path})"),
    ]
    related_patterns = [
        (re.compile(rf"^(\s*-\s*){old_esc}$", re.MULTILINE), rf"\g<1>{new_path}"),
    ]
    for target in targets:
        if not target.exists():
            continue
        try:
            text = target.read_text()
        except OSError:
            continue
        new_text = text
        for pat, repl in link_patterns + related_patterns:
            new_text = pat.sub(repl, new_text)
        if new_text != text:
            target.write_text(new_text)
            count += 1
    return count


# ---------------------------------------------------------------------------
# Promote
# ---------------------------------------------------------------------------


def promote(page: WikiPage, dry_run: bool = False) -> Path | None:
    summary = page_summary(page)
    target_rel = summary["target_path"]
    target_path = WIKI_DIR / target_rel

    modifies = summary["modifies"]
    if modifies:
        # This is an update to an existing page. Merge: keep staging content,
        # preserve the live page's origin if it was manual.
        live_path = WIKI_DIR / modifies
        if not live_path.exists():
            print(
                f" [warn] modifies target {modifies} does not exist — treating as new page",
                file=sys.stderr,
            )
            modifies = None
        else:
            live_page = parse_page(live_path)
            if live_page:
                # Warn if live page has been updated since staging
                live_compiled = parse_date(live_page.frontmatter.get("last_compiled"))
                staged = parse_date(page.frontmatter.get("staged_date"))
                if live_compiled and staged and live_compiled > staged:
                    print(
                        f" [warn] live page {modifies} was updated ({live_compiled}) "
                        f"after staging ({staged}) — human should verify merge",
                        file=sys.stderr,
                    )
                # Preserve origin from live if it was manual
                if live_page.frontmatter.get("origin") == "manual":
                    page.frontmatter["origin"] = "manual"

    rel_src = str(page.path.relative_to(WIKI_DIR))

    if dry_run:
        action = "update" if modifies else "new page"
        print(f" [dry-run] promote {rel_src} → {target_rel} ({action})")
        return target_path

    # Clean frontmatter — strip staging-only fields
    new_fm = {k: v for k, v in page.frontmatter.items() if k not in STAGING_ONLY_FIELDS}
    new_fm.setdefault("origin", "automated")
    new_fm["last_verified"] = today().isoformat()
    if "last_compiled" not in new_fm:
        new_fm["last_compiled"] = today().isoformat()

    target_path.parent.mkdir(parents=True, exist_ok=True)
    old_path = page.path
    page.path = target_path
    page.frontmatter = new_fm
    write_page(page)
    old_path.unlink()

    # Rewrite cross-references: staging/... → target_rel
    rel_staging = str(old_path.relative_to(WIKI_DIR))
    _rewrite_cross_references(rel_staging, target_rel)

    # Update main index
    summary_text = page.body.strip().splitlines()[0] if page.body.strip() else ""
    _add_to_main_index(target_rel, new_fm.get("title", page.path.stem), summary_text[:120])

    # Regenerate staging index
    regenerate_staging_index()

    # Log to hygiene report (append a line)
    _append_log(f"promote | {rel_staging} → {target_rel}" + (f" (modifies {modifies})" if modifies else ""))
    return target_path


# ---------------------------------------------------------------------------
# Reject
# ---------------------------------------------------------------------------


def reject(page: WikiPage, reason: str, dry_run: bool = False) -> None:
    rel = str(page.path.relative_to(WIKI_DIR))
    if dry_run:
        print(f" [dry-run] reject {rel} — {reason}")
        return

    # Record in harvest-state if this came from URL harvesting
    _record_rejection_in_harvest_state(page, reason)

    # Delete the file
    page.path.unlink()

    # Regenerate staging index
    regenerate_staging_index()

    _append_log(f"reject | {rel} — {reason}")
    print(f" [rejected] {rel}")


def _record_rejection_in_harvest_state(page: WikiPage, reason: str) -> None:
    """If the staged page came from wiki-harvest, add the source URL to rejected_urls."""
    if not HARVEST_STATE_FILE.exists():
        return
    # Look for the source URL in frontmatter (harvest_source) or in sources field
    source_url = page.frontmatter.get("harvest_source")
    if not source_url:
        sources = page.frontmatter.get("sources") or []
        if isinstance(sources, list):
            for src in sources:
                src_str = str(src)
                # If src is a raw/harvested/... file, look up its source_url
                if "raw/harvested/" in src_str:
                    raw_path = WIKI_DIR / src_str
                    if raw_path.exists():
                        raw_page = parse_page(raw_path)
                        if raw_page:
                            source_url = raw_page.frontmatter.get("source_url")
                            break

    if not source_url:
        return

    try:
        with open(HARVEST_STATE_FILE) as f:
            state = json.load(f)
    except (OSError, json.JSONDecodeError):
        return

    state.setdefault("rejected_urls", {})[source_url] = {
        "reason": reason,
        "rejected_date": today().isoformat(),
    }
    # Remove from harvested_urls if present
    state.get("harvested_urls", {}).pop(source_url, None)

    with open(HARVEST_STATE_FILE, "w") as f:
        json.dump(state, f, indent=2, sort_keys=True)


# ---------------------------------------------------------------------------
# Logging
# ---------------------------------------------------------------------------


def _append_log(line: str) -> None:
    REPORTS_DIR.mkdir(parents=True, exist_ok=True)
    log = REPORTS_DIR / f"staging-{today().isoformat()}.log"
    with open(log, "a") as f:
        f.write(f"{line}\n")


# ---------------------------------------------------------------------------
# Path resolution
# ---------------------------------------------------------------------------


def resolve_page(raw_path: str) -> WikiPage | None:
    path = Path(raw_path)
    if not path.is_absolute():
        # Accept "staging/..." or just "patterns/foo.md" (assumes staging)
        if not raw_path.startswith("staging/") and raw_path.split("/", 1)[0] in LIVE_CONTENT_DIRS:
            path = STAGING_DIR / raw_path
        else:
            path = WIKI_DIR / raw_path
    if not path.exists():
        print(f" [error] not found: {path}", file=sys.stderr)
        return None
    return parse_page(path)


# ---------------------------------------------------------------------------
# Commands
# ---------------------------------------------------------------------------


def cmd_list(as_json: bool = False) -> int:
    pending = list_pending()
    if as_json:
        data = [page_summary(p) for p in pending]
        print(json.dumps(data, indent=2))
        return 0

    if not pending:
        print("No pending items in staging.")
        return 0

    print(f"{len(pending)} pending item(s):\n")
    for p in pending:
        s = page_summary(p)
        age = f"{s['age_days']}d" if s["age_days"] is not None else "—"
        marker = " (update)" if s["modifies"] else ""
        print(f" {s['path']}{marker}")
        print(f" title: {s['title']}")
        print(f" type: {s['type']}")
        print(f" source: {s['staged_by']}")
        print(f" staged: {s['staged_date']} ({age} old)")
        print(f" target: {s['target_path']}")
        if s["modifies"]:
            print(f" modifies: {s['modifies']}")
        if s["compilation_notes"]:
            notes = s["compilation_notes"][:100]
            print(f" notes: {notes}")
        print()
    return 0


def cmd_stats() -> int:
    pending = list_pending()
    total = len(pending)
    if total == 0:
        print("No pending items in staging.")
        return 0

    by_type: dict[str, int] = {}
    by_source: dict[str, int] = {}
    ages: list[int] = []
    updates = 0

    for p in pending:
        s = page_summary(p)
        by_type[s["type"]] = by_type.get(s["type"], 0) + 1
        by_source[s["staged_by"]] = by_source.get(s["staged_by"], 0) + 1
        if s["age_days"] is not None:
            ages.append(s["age_days"])
        if s["modifies"]:
            updates += 1

    print(f"Total pending: {total}")
    print(f"Updates (modifies existing): {updates}")
    print(f"New pages: {total - updates}")
    print()
    print("By type:")
    for t, n in sorted(by_type.items()):
        print(f" {t}: {n}")
    print()
    print("By source:")
    for s, n in sorted(by_source.items()):
        print(f" {s}: {n}")
    if ages:
        print()
        print(f"Age (days): min={min(ages)}, max={max(ages)}, avg={sum(ages)//len(ages)}")
    return 0


def cmd_promote(path_arg: str, dry_run: bool) -> int:
    page = resolve_page(path_arg)
    if not page:
        return 1
    result = promote(page, dry_run=dry_run)
    if result and not dry_run:
        print(f" [promoted] {result.relative_to(WIKI_DIR)}")
    return 0


def cmd_reject(path_arg: str, reason: str, dry_run: bool) -> int:
    page = resolve_page(path_arg)
    if not page:
        return 1
    reject(page, reason, dry_run=dry_run)
    return 0


def cmd_promote_all(dry_run: bool) -> int:
    pending = list_pending()
    if not pending:
        print("No pending items.")
        return 0
    print(f"Promoting {len(pending)} page(s)...")
    for p in pending:
        promote(p, dry_run=dry_run)
    return 0


def cmd_review() -> int:
    """Interactive review loop. Prompts approve/reject/skip for each pending item."""
    pending = list_pending()
    if not pending:
        print("No pending items.")
        return 0

    print(f"Reviewing {len(pending)} pending item(s). (a)pprove / (r)eject / (s)kip / (q)uit\n")
    for p in pending:
        s = page_summary(p)
        print(f"━━━ {s['path']} ━━━")
        print(f" {s['title']} ({s['type']})")
        print(f" from: {s['staged_by']} ({s['staged_date']})")
        print(f" target: {s['target_path']}")
        if s["modifies"]:
            print(f" updates: {s['modifies']}")
        if s["compilation_notes"]:
            print(f" notes: {s['compilation_notes'][:150]}")
        # Show first few lines of body
        first_lines = [ln for ln in p.body.strip().splitlines() if ln.strip()][:3]
        for ln in first_lines:
            print(f" │ {ln[:100]}")
        print()

        while True:
            try:
                answer = input(" [a/r/s/q] > ").strip().lower()
            except EOFError:
                return 0
            if answer in ("a", "approve"):
                promote(p)
                break
            if answer in ("r", "reject"):
                try:
                    reason = input(" reason > ").strip()
                except EOFError:
                    return 0
                reject(p, reason or "no reason given")
                break
            if answer in ("s", "skip"):
                break
            if answer in ("q", "quit"):
                return 0
        print()
    return 0


def cmd_sync() -> int:
    """Reconcile staging index after manual operations (Obsidian moves, deletions).

    Also detects pages that were manually moved out of staging without going through
    the promotion flow and reports them.
    """
    print("Regenerating staging index...")
    regenerate_staging_index()

    # Detect pages in live directories with status: pending (manual promotion without cleanup)
    leaked: list[Path] = []
    for page in iter_live_pages():
        if str(page.frontmatter.get("status", "")) == "pending":
            leaked.append(page.path)

    if leaked:
        print("\n[warn] live pages still marked status: pending — fix manually:")
        for p in leaked:
            print(f" {p.relative_to(WIKI_DIR)}")

    pending = list_pending()
    print(f"\n{len(pending)} pending item(s) in staging.")
    return 0


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------


def main() -> int:
    parser = argparse.ArgumentParser(description="Wiki staging pipeline")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--list", action="store_true", help="List pending items")
    group.add_argument("--stats", action="store_true", help="Summary stats")
    group.add_argument("--promote", metavar="PATH", help="Approve a pending page")
    group.add_argument("--reject", metavar="PATH", help="Reject a pending page")
    group.add_argument("--promote-all", action="store_true", help="Promote every pending page")
    group.add_argument("--review", action="store_true", help="Interactive approval loop")
    group.add_argument("--sync", action="store_true", help="Regenerate staging index & detect drift")

    parser.add_argument("--json", action="store_true", help="JSON output for --list")
    parser.add_argument("--reason", default="", help="Rejection reason for --reject")
    parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
    args = parser.parse_args()

    STAGING_DIR.mkdir(parents=True, exist_ok=True)

    if args.list:
        return cmd_list(as_json=args.json)
    if args.stats:
        return cmd_stats()
    if args.promote:
        return cmd_promote(args.promote, args.dry_run)
    if args.reject:
        if not args.reason:
            print("--reject requires --reason", file=sys.stderr)
            return 2
        return cmd_reject(args.reject, args.reason, args.dry_run)
    if args.promote_all:
        return cmd_promote_all(args.dry_run)
    if args.review:
        return cmd_review()
    if args.sync:
        return cmd_sync()
    return 0


if __name__ == "__main__":
    sys.exit(main())
230	scripts/wiki-sync.sh	Executable file
@@ -0,0 +1,230 @@
#!/usr/bin/env bash
set -euo pipefail

# wiki-sync.sh — Auto-commit, pull, resolve conflicts, push, reindex
#
# Designed to run via cron on both work and home machines.
# Safe to run frequently — no-ops when nothing has changed.
#
# Usage:
#   wiki-sync.sh            # Full sync (commit + pull + push + reindex)
#   wiki-sync.sh --commit   # Only commit local changes
#   wiki-sync.sh --pull    # Only pull remote changes
#   wiki-sync.sh --push    # Only push local commits
#   wiki-sync.sh --reindex  # Only rebuild qmd index
#   wiki-sync.sh --status   # Show sync status (no changes)

WIKI_DIR="${WIKI_DIR:-${HOME}/projects/wiki}"
LOG_FILE="${WIKI_DIR}/scripts/.sync.log"
LOCK_FILE="/tmp/wiki-sync.lock"

# --- Helpers ---

log() {
  local msg
  msg="[$(date '+%Y-%m-%d %H:%M:%S')] $*"
  echo "${msg}" | tee -a "${LOG_FILE}"
}

die() {
  log "ERROR: $*"
  exit 1
}

acquire_lock() {
  if [[ -f "${LOCK_FILE}" ]]; then
    local pid
    pid=$(cat "${LOCK_FILE}" 2>/dev/null || echo "")
    if [[ -n "${pid}" ]] && kill -0 "${pid}" 2>/dev/null; then
      die "Another sync is running (pid ${pid})"
    fi
    rm -f "${LOCK_FILE}"
  fi
  echo $$ > "${LOCK_FILE}"
  trap 'rm -f "${LOCK_FILE}"' EXIT
}
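
The staleness check in `acquire_lock` (a leftover lock file whose recorded PID no longer exists is safe to clear) can be exercised on its own. A standalone sketch, not part of the script; the lock path is a temp file created just for the demo:

```shell
lock="$(mktemp)"
sh -c 'echo $$' > "${lock}"   # that shell exits immediately, so its PID is dead
pid="$(cat "${lock}")"
if kill -0 "${pid}" 2>/dev/null; then
  echo "lock held by live process ${pid}"
else
  echo "stale lock (pid ${pid} is gone), safe to remove"
fi
rm -f "${lock}"
```

`kill -0` sends no signal; it only reports whether the PID is deliverable, which is exactly the probe the script uses.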

# --- Operations ---

do_commit() {
  cd "${WIKI_DIR}"

  # Check for uncommitted changes (staged + unstaged + untracked)
  if git diff --quiet && git diff --cached --quiet && [[ -z "$(git ls-files --others --exclude-standard)" ]]; then
    return 0
  fi

  local hostname
  hostname=$(hostname -s 2>/dev/null || echo "unknown")

  git add -A
  git commit -m "$(cat <<EOF
wiki: auto-sync from ${hostname}

Automatic commit of wiki changes detected by cron.
EOF
)" 2>/dev/null || true

  log "Committed local changes from ${hostname}"
}

do_pull() {
  cd "${WIKI_DIR}"

  # Fetch first to check if there's anything to pull
  git fetch origin main 2>/dev/null || die "Failed to fetch from origin"

  local local_head remote_head
  local_head=$(git rev-parse HEAD)
  remote_head=$(git rev-parse origin/main)

  if [[ "${local_head}" == "${remote_head}" ]]; then
    return 0
  fi

  # Pull with rebase to keep history linear
  # If conflicts occur, resolve markdown files by keeping both sides
  if ! git pull --rebase origin main 2>/dev/null; then
    log "Conflicts detected, attempting auto-resolution..."
    resolve_conflicts
  fi

  log "Pulled remote changes"
}

resolve_conflicts() {
  cd "${WIKI_DIR}"

  local conflicted
  conflicted=$(git diff --name-only --diff-filter=U 2>/dev/null || echo "")

  if [[ -z "${conflicted}" ]]; then
    return 0
  fi

  while IFS= read -r file; do
    if [[ "${file}" == *.md ]]; then
      # For markdown: accept both sides (union merge)
      # Remove conflict markers, keep all content
      if [[ -f "${file}" ]]; then
        sed -i.bak \
          -e '/^<<<<<<< /d' \
          -e '/^=======/d' \
          -e '/^>>>>>>> /d' \
          "${file}"
        rm -f "${file}.bak"
        git add "${file}"
        log "Auto-resolved conflict in ${file} (kept both sides)"
      fi
    else
      # For non-markdown: keep ours (local version wins)
      git checkout --ours "${file}" 2>/dev/null
      git add "${file}"
      log "Auto-resolved conflict in ${file} (kept local)"
    fi
  done <<< "${conflicted}"

  # Continue the rebase
  git rebase --continue 2>/dev/null || git commit --no-edit 2>/dev/null || true
}
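
The union-merge sed used by `resolve_conflicts` can be tried on a synthetic conflict outside of git. A standalone sketch; the file is a temp file and the two "sides" are made-up lines:

```shell
f="$(mktemp)"
cat > "${f}" <<'EOF'
<<<<<<< HEAD
line from this machine
=======
line from the other machine
>>>>>>> origin/main
EOF
# Same three expressions as the script: delete the marker lines only
sed -i.bak -e '/^<<<<<<< /d' -e '/^=======/d' -e '/^>>>>>>> /d' "${f}"
rm -f "${f}.bak"
cat "${f}"   # both sides survive, markers are gone
rm -f "${f}"
```

The attached `-i.bak` suffix form works on both GNU and BSD sed, which is why the script uses it instead of a bare `-i`.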

do_push() {
  cd "${WIKI_DIR}"

  # Check if we have commits to push
  local ahead
  ahead=$(git rev-list --count origin/main..HEAD 2>/dev/null || echo "0")

  if [[ "${ahead}" -eq 0 ]]; then
    return 0
  fi

  git push origin main 2>/dev/null || die "Failed to push to origin"
  log "Pushed ${ahead} commit(s) to origin"
}

do_reindex() {
  if ! command -v qmd &>/dev/null; then
    return 0
  fi

  # Check if qmd collection exists
  if ! qmd collection list 2>/dev/null | grep -q "wiki"; then
    qmd collection add "${WIKI_DIR}" --name wiki 2>/dev/null
  fi

  qmd update 2>/dev/null
  qmd embed 2>/dev/null
  log "Rebuilt qmd index"
}

do_status() {
  cd "${WIKI_DIR}"

  echo "=== Wiki Sync Status ==="
  echo "Directory: ${WIKI_DIR}"
  echo "Branch: $(git branch --show-current)"
  echo "Remote: $(git remote get-url origin)"
  echo ""

  # Local changes
  local changes
  changes=$(git status --porcelain 2>/dev/null | wc -l | tr -d ' ')
  echo "Uncommitted changes: ${changes}"

  # Ahead/behind
  git fetch origin main 2>/dev/null
  local ahead behind
  ahead=$(git rev-list --count origin/main..HEAD 2>/dev/null || echo "0")
  behind=$(git rev-list --count HEAD..origin/main 2>/dev/null || echo "0")
  echo "Ahead of remote: ${ahead}"
  echo "Behind remote: ${behind}"

  # qmd status
  if command -v qmd &>/dev/null; then
    echo ""
    echo "qmd: installed"
    qmd collection list 2>/dev/null | grep wiki || echo "qmd: wiki collection not found"
  else
    echo ""
    echo "qmd: not installed"
  fi

  # Last sync
  if [[ -f "${LOG_FILE}" ]]; then
|
||||||
|
echo ""
|
||||||
|
echo "Last sync log entries:"
|
||||||
|
tail -5 "${LOG_FILE}"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# --- Main ---
|
||||||
|
|
||||||
|
main() {
|
||||||
|
local mode="${1:-full}"
|
||||||
|
|
||||||
|
mkdir -p "${WIKI_DIR}/scripts"
|
||||||
|
|
||||||
|
# Status doesn't need a lock
|
||||||
|
if [[ "${mode}" == "--status" ]]; then
|
||||||
|
do_status
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
acquire_lock
|
||||||
|
|
||||||
|
case "${mode}" in
|
||||||
|
--commit) do_commit ;;
|
||||||
|
--pull) do_pull ;;
|
||||||
|
--push) do_push ;;
|
||||||
|
--reindex) do_reindex ;;
|
||||||
|
full|*)
|
||||||
|
do_commit
|
||||||
|
do_pull
|
||||||
|
do_push
|
||||||
|
do_reindex
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
main "$@"
|
||||||
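The markdown union merge used for conflict resolution (drop the git conflict-marker lines so both sides' content survives) can be sketched in Python. This is an illustrative re-implementation of the same marker-stripping logic as the sed invocation, not code from the commit:

```python
import re

# Lines that begin a git conflict marker, matching the three sed
# expressions: "<<<<<<< " and ">>>>>>> " with a trailing space,
# "=======" as a line prefix.
_MARKER = re.compile(r"^(<<<<<<< |=======|>>>>>>> )")

def union_merge(text: str) -> str:
    """Drop conflict-marker lines, keeping the content of both sides."""
    return "\n".join(ln for ln in text.splitlines() if not _MARKER.match(ln))
```

For prose pages this biases toward keeping information and letting a later hygiene pass deduplicate, which is usually the right trade for a wiki; for code it would produce garbage, which is why the script only applies it to `*.md` files.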
211
scripts/wiki_lib.py
Normal file
@@ -0,0 +1,211 @@
"""Shared helpers for wiki maintenance scripts.

Provides frontmatter parsing/serialization, WikiPage dataclass, and common
constants used by wiki-hygiene.py, wiki-staging.py, and wiki-harvest.py.
"""

from __future__ import annotations

import hashlib
import os
import re
from dataclasses import dataclass
from datetime import date, datetime, timezone
from pathlib import Path
from typing import Any

# Wiki root — override via WIKI_DIR env var for tests / alternate installs
WIKI_DIR = Path(os.environ.get("WIKI_DIR", str(Path.home() / "projects" / "wiki")))
INDEX_FILE = WIKI_DIR / "index.md"
STAGING_DIR = WIKI_DIR / "staging"
STAGING_INDEX = STAGING_DIR / "index.md"
ARCHIVE_DIR = WIKI_DIR / "archive"
ARCHIVE_INDEX = ARCHIVE_DIR / "index.md"
REPORTS_DIR = WIKI_DIR / "reports"
CONVERSATIONS_DIR = WIKI_DIR / "conversations"
HARVEST_STATE_FILE = WIKI_DIR / ".harvest-state.json"

LIVE_CONTENT_DIRS = ["patterns", "decisions", "concepts", "environments"]

FM_FENCE = "---\n"


@dataclass
class WikiPage:
    path: Path
    frontmatter: dict[str, Any]
    fm_raw: str
    body: str
    fm_start: int


def today() -> date:
    return datetime.now(timezone.utc).date()


def parse_date(value: Any) -> date | None:
    if not value:
        return None
    if isinstance(value, date):
        return value
    s = str(value).strip()
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None


def parse_page(path: Path) -> WikiPage | None:
    """Parse a markdown page with YAML frontmatter. Returns None if no frontmatter."""
    try:
        text = path.read_text()
    except OSError:
        return None
    if not text.startswith(FM_FENCE):
        return None
    end = text.find("\n---\n", 4)
    if end == -1:
        return None
    fm_raw = text[4:end]
    body = text[end + 5 :]
    fm = parse_yaml_lite(fm_raw)
    return WikiPage(path=path, frontmatter=fm, fm_raw=fm_raw, body=body, fm_start=end + 5)


def parse_yaml_lite(text: str) -> dict[str, Any]:
    """Parse a subset of YAML used in wiki frontmatter.

    Supports:
      - key: value
      - key: [a, b, c]
      - key:
          - a
          - b
    """
    result: dict[str, Any] = {}
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if not line.strip() or line.lstrip().startswith("#"):
            i += 1
            continue
        m = re.match(r"^([\w_-]+):\s*(.*)$", line)
        if not m:
            i += 1
            continue
        key, rest = m.group(1), m.group(2).strip()
        if rest == "":
            items: list[str] = []
            j = i + 1
            while j < len(lines) and re.match(r"^\s+-\s+", lines[j]):
                items.append(re.sub(r"^\s+-\s+", "", lines[j]).strip())
                j += 1
            if items:
                result[key] = items
                i = j
                continue
            result[key] = ""
            i += 1
            continue
        if rest.startswith("[") and rest.endswith("]"):
            inner = rest[1:-1].strip()
            if inner:
                result[key] = [x.strip().strip('"').strip("'") for x in inner.split(",")]
            else:
                result[key] = []
            i += 1
            continue
        result[key] = rest.strip('"').strip("'")
        i += 1
    return result


# Canonical frontmatter key order for serialization
PREFERRED_KEY_ORDER = [
    "title", "type", "confidence",
    "status", "origin",
    "last_compiled", "last_verified",
    "staged_date", "staged_by", "target_path", "modifies", "compilation_notes",
    "archived_date", "archived_reason", "original_path",
    "sources", "related",
]


def serialize_frontmatter(fm: dict[str, Any]) -> str:
    """Serialize a frontmatter dict back to YAML in the wiki's canonical style."""
    out_lines: list[str] = []
    seen: set[str] = set()
    for key in PREFERRED_KEY_ORDER:
        if key in fm:
            out_lines.append(_format_fm_entry(key, fm[key]))
            seen.add(key)
    for key in sorted(fm.keys()):
        if key in seen:
            continue
        out_lines.append(_format_fm_entry(key, fm[key]))
    return "\n".join(out_lines)


def _format_fm_entry(key: str, value: Any) -> str:
    if isinstance(value, list):
        if not value:
            return f"{key}: []"
        lines = [f"{key}:"]
        for item in value:
            lines.append(f"  - {item}")
        return "\n".join(lines)
    return f"{key}: {value}"


def write_page(page: WikiPage, new_fm: dict[str, Any] | None = None, new_body: str | None = None) -> None:
    fm = new_fm if new_fm is not None else page.frontmatter
    body = new_body if new_body is not None else page.body
    fm_yaml = serialize_frontmatter(fm)
    text = f"---\n{fm_yaml}\n---\n{body}"
    page.path.write_text(text)


def iter_live_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    for sub in LIVE_CONTENT_DIRS:
        for md in sorted((WIKI_DIR / sub).glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def iter_staging_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    if not STAGING_DIR.exists():
        return pages
    for sub in LIVE_CONTENT_DIRS:
        d = STAGING_DIR / sub
        if not d.exists():
            continue
        for md in sorted(d.glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def iter_archived_pages() -> list[WikiPage]:
    pages: list[WikiPage] = []
    if not ARCHIVE_DIR.exists():
        return pages
    for sub in LIVE_CONTENT_DIRS:
        d = ARCHIVE_DIR / sub
        if not d.exists():
            continue
        for md in sorted(d.glob("*.md")):
            page = parse_page(md)
            if page:
                pages.append(page)
    return pages


def page_content_hash(page: WikiPage) -> str:
    """Hash of page body only (excludes frontmatter) so mechanical frontmatter fixes don't churn the hash."""
    return "sha256:" + hashlib.sha256(page.body.strip().encode("utf-8")).hexdigest()
107
tests/README.md
Normal file
@@ -0,0 +1,107 @@
# Wiki Pipeline Test Suite

Pytest-based test suite covering all 11 scripts in `scripts/`. Runs on both
macOS and Linux/WSL, uses only the Python standard library + pytest.

## Running

```bash
# Full suite (from wiki root)
bash tests/run.sh

# Single test file
bash tests/run.sh test_wiki_lib.py

# Single test class or function
bash tests/run.sh test_wiki_hygiene.py::TestArchiveRestore
bash tests/run.sh test_wiki_hygiene.py::TestArchiveRestore::test_restore_reverses_archive

# Pattern matching
bash tests/run.sh -k "archive"

# Verbose
bash tests/run.sh -v

# Stop on first failure
bash tests/run.sh -x

# Or invoke pytest directly from the tests dir
cd tests && python3 -m pytest -v
```

## What's tested

| File | Coverage |
|------|----------|
| `test_wiki_lib.py` | YAML parser, frontmatter round-trip, page iterators, date parsing, content hashing, WIKI_DIR env override |
| `test_wiki_hygiene.py` | Backfill, confidence decay math, frontmatter repair, archive/restore round-trip, orphan detection, broken-xref fuzzy matching, index drift, empty stubs, conversation refresh signals, auto-restore, staging/archive sync, state drift, hygiene state file, full quick-run idempotency |
| `test_wiki_staging.py` | List, promote, reject, promote-with-modifies, dry-run, staging index regeneration, path resolution |
| `test_wiki_harvest.py` | URL classification (harvest/check/skip), private IP detection, URL extraction + filtering, filename derivation, content validation, state management, raw file writing, dry-run CLI smoke test |
| `test_conversation_pipeline.py` | CLI smoke tests for extract-sessions, summarize-conversations, update-conversation-index; dry-run behavior; help flags; integration test with fake conversation files |
| `test_shell_scripts.py` | wiki-maintain.sh / mine-conversations.sh / wiki-sync.sh: help, dry-run, mutex flags, bash syntax check, strict-mode check, shebang check, py_compile for all .py scripts |

## How it works

**Isolation**: Every test runs against a disposable `tmp_wiki` fixture
(pytest `tmp_path`). The fixture sets the `WIKI_DIR` environment variable
so all scripts resolve paths against the tmp directory instead of the real
wiki. No test ever touches `~/projects/wiki`.

**Hyphenated filenames**: Scripts like `wiki-harvest.py` use hyphens, which
Python's `import` can't handle directly. `conftest.py` has a
`_load_script_module` helper that loads a script file by path and exposes
it as a module object.

**Clean module state**: Each test that loads a module clears any cached
import first, so `WIKI_DIR` env overrides take effect correctly between
tests.

**Subprocess tests** (for CLI smoke tests): `conftest.py` provides a
`run_script` fixture that invokes a script via `python3` or `bash` with
`WIKI_DIR` set to the tmp wiki. Uses `subprocess.run` with `capture_output`
and a timeout.

## Cross-platform

- `#!/usr/bin/env bash` shebangs (tested explicitly)
- `set -euo pipefail` in all shell scripts (tested explicitly)
- `bash -n` syntax check on all shell scripts
- `py_compile` on all Python scripts
- Uses `pathlib` everywhere — no hardcoded path separators
- Uses the Python stdlib only (except pytest itself)

## Requirements

- Python 3.11+
- `pytest` — install with `pip install --user pytest` or your distro's package manager
- `bash` (any version — scripts use only portable features)

The tests do NOT require:
- `claude` CLI (mocked / skipped)
- `trafilatura` or `crawl4ai` (only dry-run / classification paths tested)
- `qmd` (reindex phase is skipped in tests)
- Network access
- The real `~/projects/wiki` or `~/.claude/projects` directories

## Speed

Full suite runs in **~1 second** on a modern laptop. All tests are isolated
and independent so they can run in any order and in parallel.

## What's NOT tested

- **Real LLM calls** (`claude -p`): too expensive, non-deterministic.
  Tested: CLI parsing, dry-run paths, mocked error handling.
- **Real web fetches** (trafilatura/crawl4ai): too slow, non-deterministic.
  Tested: URL classification, filter logic, fetch-result validation.
- **Real git operations** (wiki-sync.sh): requires a git repo fixture.
  Tested: script loads, handles non-git dir gracefully, --status exits clean.
- **Real qmd indexing**: tested elsewhere via `qmd collection list` in the
  setup verification step.
- **Real Claude Code session JSONL parsing** with actual sessions: would
  require fixture JSONL files. Tested: CLI parsing, empty-dir behavior,
  `CLAUDE_PROJECTS_DIR` env override.

These are smoke-tested end-to-end via the integration tests in
`test_conversation_pipeline.py` and the dry-run paths in
`test_shell_scripts.py::TestWikiMaintainSh`.
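The hyphenated-filename loading the README describes can be sketched with `importlib`. This is a condensed, hypothetical variant of the conftest helper (the demo file name and `answer` function are invented for illustration):

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

def load_script(name: str, path: Path):
    """Load a .py file that normal `import` can't reach (e.g. a hyphenated name)."""
    spec = importlib.util.spec_from_file_location(name, path)
    assert spec is not None and spec.loader is not None
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod  # register before exec so self-references resolve
    spec.loader.exec_module(mod)
    return mod

# Demo: write a hyphenated script to a temp dir, then load and call it.
tmp = Path(tempfile.mkdtemp())
(tmp / "my-demo-script.py").write_text("def answer():\n    return 42\n")
mod = load_script("my_demo_script", tmp / "my-demo-script.py")
```

The module name passed to `spec_from_file_location` is arbitrary and need not match the filename, which is exactly what makes this route work for `wiki-harvest.py` and friends.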
300
tests/conftest.py
Normal file
@@ -0,0 +1,300 @@
"""Shared test fixtures for the wiki pipeline test suite.

All tests run against a disposable `tmp_wiki` directory — no test ever
touches the real ~/projects/wiki. Cross-platform: uses pathlib, no
platform-specific paths, and runs on both macOS and Linux/WSL.
"""

from __future__ import annotations

import importlib
import importlib.util
import json
import os
import sys
from pathlib import Path
from typing import Any

import pytest

SCRIPTS_DIR = Path(__file__).resolve().parent.parent / "scripts"


# ---------------------------------------------------------------------------
# Module loading helpers
# ---------------------------------------------------------------------------
#
# The wiki scripts use hyphenated filenames (wiki-hygiene.py etc.) which
# can't be imported via normal `import` syntax. These helpers load a script
# file as a module object so tests can exercise its functions directly.


def _load_script_module(name: str, path: Path) -> Any:
    """Load a Python script file as a module. Clears any cached version first."""
    # Clear cached imports so WIKI_DIR env changes take effect between tests
    for key in list(sys.modules):
        if key in (name, "wiki_lib"):
            del sys.modules[key]

    # Make sure scripts/ is on sys.path so intra-script imports (wiki_lib) work
    scripts_str = str(SCRIPTS_DIR)
    if scripts_str not in sys.path:
        sys.path.insert(0, scripts_str)

    spec = importlib.util.spec_from_file_location(name, path)
    assert spec is not None and spec.loader is not None
    mod = importlib.util.module_from_spec(spec)
    sys.modules[name] = mod
    spec.loader.exec_module(mod)
    return mod


# ---------------------------------------------------------------------------
# tmp_wiki fixture — builds a realistic wiki tree under a tmp path
# ---------------------------------------------------------------------------


@pytest.fixture
def tmp_wiki(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path:
    """Set up a disposable wiki tree with all the directories the scripts expect.

    Sets the WIKI_DIR environment variable so all imported modules resolve
    paths against this tmp directory.
    """
    wiki = tmp_path / "wiki"
    wiki.mkdir()

    # Create the directory tree
    for sub in ["patterns", "decisions", "concepts", "environments"]:
        (wiki / sub).mkdir()
        (wiki / "staging" / sub).mkdir(parents=True)
        (wiki / "archive" / sub).mkdir(parents=True)
    (wiki / "raw" / "harvested").mkdir(parents=True)
    (wiki / "conversations").mkdir()
    (wiki / "reports").mkdir()

    # Create minimal index.md
    (wiki / "index.md").write_text(
        "# Wiki Index\n\n"
        "## Patterns\n\n"
        "## Decisions\n\n"
        "## Concepts\n\n"
        "## Environments\n\n"
    )

    # Empty state files
    (wiki / ".harvest-state.json").write_text(json.dumps({
        "harvested_urls": {},
        "skipped_urls": {},
        "failed_urls": {},
        "rejected_urls": {},
        "last_run": None,
    }))

    # Point all scripts at this tmp wiki
    monkeypatch.setenv("WIKI_DIR", str(wiki))

    return wiki


# ---------------------------------------------------------------------------
# Sample page factories
# ---------------------------------------------------------------------------


def make_page(
    wiki: Path,
    rel_path: str,
    *,
    title: str | None = None,
    ptype: str | None = None,
    confidence: str = "high",
    last_compiled: str = "2026-04-01",
    last_verified: str = "2026-04-01",
    origin: str = "manual",
    sources: list[str] | None = None,
    related: list[str] | None = None,
    body: str = "# Content\n\nA substantive page with real content so it is not a stub.\n",
    extra_fm: dict[str, Any] | None = None,
) -> Path:
    """Write a well-formed wiki page with all required frontmatter fields."""
    if sources is None:
        sources = []
    if related is None:
        related = []
    path = wiki / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)

    if title is None:
        title = path.stem.replace("-", " ").title()
    if ptype is None:
        ptype = path.parent.name.rstrip("s")

    fm_lines = [
        "---",
        f"title: {title}",
        f"type: {ptype}",
        f"confidence: {confidence}",
        f"origin: {origin}",
        f"last_compiled: {last_compiled}",
        f"last_verified: {last_verified}",
    ]
    if sources:
        fm_lines.append("sources:")
        fm_lines.extend(f"  - {s}" for s in sources)
    else:
        fm_lines.append("sources: []")
    if related:
        fm_lines.append("related:")
        fm_lines.extend(f"  - {r}" for r in related)
    else:
        fm_lines.append("related: []")
    if extra_fm:
        for k, v in extra_fm.items():
            if isinstance(v, list):
                if v:
                    fm_lines.append(f"{k}:")
                    fm_lines.extend(f"  - {item}" for item in v)
                else:
                    fm_lines.append(f"{k}: []")
            else:
                fm_lines.append(f"{k}: {v}")
    fm_lines.append("---")

    path.write_text("\n".join(fm_lines) + "\n" + body)
    return path


def make_conversation(
    wiki: Path,
    project: str,
    filename: str,
    *,
    date: str = "2026-04-10",
    status: str = "summarized",
    messages: int = 100,
    related: list[str] | None = None,
    body: str = "## Summary\n\nTest conversation summary.\n",
) -> Path:
    """Write a conversation file to the tmp wiki."""
    proj_dir = wiki / "conversations" / project
    proj_dir.mkdir(parents=True, exist_ok=True)
    path = proj_dir / filename

    fm_lines = [
        "---",
        "title: Test Conversation (unknown)",
        "type: conversation",
        f"project: {project}",
        f"date: {date}",
        f"status: {status}",
        f"messages: {messages}",
    ]
    if related:
        fm_lines.append("related:")
        fm_lines.extend(f"  - {r}" for r in related)
    fm_lines.append("---")

    path.write_text("\n".join(fm_lines) + "\n" + body)
    return path


def make_staging_page(
    wiki: Path,
    rel_under_staging: str,
    *,
    title: str = "Pending Page",
    ptype: str = "pattern",
    staged_by: str = "wiki-harvest",
    staged_date: str = "2026-04-10",
    modifies: str | None = None,
    target_path: str | None = None,
    body: str = "# Pending\n\nStaged content body.\n",
) -> Path:
    path = wiki / "staging" / rel_under_staging
    path.parent.mkdir(parents=True, exist_ok=True)

    if target_path is None:
        target_path = rel_under_staging

    fm_lines = [
        "---",
        f"title: {title}",
        f"type: {ptype}",
        "confidence: medium",
        "origin: automated",
        "status: pending",
        f"staged_date: {staged_date}",
        f"staged_by: {staged_by}",
        f"target_path: {target_path}",
    ]
    if modifies:
        fm_lines.append(f"modifies: {modifies}")
    fm_lines.append("compilation_notes: test note")
    fm_lines.append("last_verified: 2026-04-10")
    fm_lines.append("---")

    path.write_text("\n".join(fm_lines) + "\n" + body)
    return path


# ---------------------------------------------------------------------------
# Module fixtures — each loads the corresponding script as a module
# ---------------------------------------------------------------------------


@pytest.fixture
def wiki_lib(tmp_wiki: Path) -> Any:
    """Load wiki_lib fresh against the tmp_wiki directory."""
    return _load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")


@pytest.fixture
def wiki_hygiene(tmp_wiki: Path) -> Any:
    """Load wiki-hygiene.py fresh. wiki_lib must be loaded first for its imports."""
    _load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")
    return _load_script_module("wiki_hygiene", SCRIPTS_DIR / "wiki-hygiene.py")


@pytest.fixture
def wiki_staging(tmp_wiki: Path) -> Any:
    _load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")
    return _load_script_module("wiki_staging", SCRIPTS_DIR / "wiki-staging.py")


@pytest.fixture
def wiki_harvest(tmp_wiki: Path) -> Any:
    _load_script_module("wiki_lib", SCRIPTS_DIR / "wiki_lib.py")
    return _load_script_module("wiki_harvest", SCRIPTS_DIR / "wiki-harvest.py")


# ---------------------------------------------------------------------------
# Subprocess helper — runs a script as if from the CLI, with WIKI_DIR set
# ---------------------------------------------------------------------------


@pytest.fixture
def run_script(tmp_wiki: Path):
    """Return a function that runs a script via subprocess with WIKI_DIR set."""
    import subprocess

    def _run(script_rel: str, *args: str, timeout: int = 60) -> subprocess.CompletedProcess:
        script = SCRIPTS_DIR / script_rel
        if script.suffix == ".py":
            cmd = ["python3", str(script), *args]
        else:
            cmd = ["bash", str(script), *args]
        env = os.environ.copy()
        env["WIKI_DIR"] = str(tmp_wiki)
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            env=env,
        )

    return _run
9
tests/pytest.ini
Normal file
@@ -0,0 +1,9 @@
[pytest]
testpaths = .
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = -ra --strict-markers --tb=short
markers =
    slow: tests that take more than 1 second
    network: tests that hit the network (skipped by default)
31
tests/run.sh
Executable file
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
set -euo pipefail

# run.sh — Convenience wrapper for running the wiki pipeline test suite.
#
# Usage:
#   bash tests/run.sh                 # Run the full suite
#   bash tests/run.sh -v              # Verbose output
#   bash tests/run.sh test_wiki_lib   # Run one file
#   bash tests/run.sh -k "parse"      # Run tests matching a pattern
#
# All arguments are passed through to pytest.

TESTS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
cd "${TESTS_DIR}"

# Verify pytest is available
if ! python3 -c "import pytest" 2>/dev/null; then
    echo "pytest not installed. Install with: pip install --user pytest"
    exit 2
fi

# Clear any previous test artifacts
rm -rf .pytest_cache 2>/dev/null || true

# Default args: short tracebacks
if [[ $# -eq 0 ]]; then
    exec python3 -m pytest --tb=short
else
    exec python3 -m pytest "$@"
fi
121
tests/test_conversation_pipeline.py
Normal file
@@ -0,0 +1,121 @@
"""Smoke + integration tests for the conversation mining pipeline.

These scripts interact with external systems (Claude Code sessions dir,
claude CLI), so tests focus on CLI parsing, dry-run behavior, and error
handling rather than exercising the full extraction/summarization path.
"""

from __future__ import annotations

import json
from pathlib import Path

import pytest


# ---------------------------------------------------------------------------
# extract-sessions.py
# ---------------------------------------------------------------------------


class TestExtractSessions:
    def test_help_exits_clean(self, run_script) -> None:
        result = run_script("extract-sessions.py", "--help")
        assert result.returncode == 0
        assert "--project" in result.stdout
        assert "--dry-run" in result.stdout

    def test_dry_run_with_empty_sessions_dir(
        self, run_script, tmp_wiki: Path, tmp_path: Path, monkeypatch
    ) -> None:
        # Point CLAUDE_PROJECTS_DIR at an empty tmp dir via env (not currently
        # supported — script reads ~/.claude/projects directly). Instead, use
        # --project with a code that has no sessions to verify clean exit.
        result = run_script("extract-sessions.py", "--dry-run", "--project", "nonexistent")
        assert result.returncode == 0

    def test_rejects_unknown_flag(self, run_script) -> None:
        result = run_script("extract-sessions.py", "--bogus-flag")
        assert result.returncode != 0
        assert "error" in result.stderr.lower() or "unrecognized" in result.stderr.lower()


# ---------------------------------------------------------------------------
# summarize-conversations.py
# ---------------------------------------------------------------------------


class TestSummarizeConversations:
    def test_help_exits_clean(self, run_script) -> None:
        result = run_script("summarize-conversations.py", "--help")
        assert result.returncode == 0
        assert "--claude" in result.stdout
        assert "--dry-run" in result.stdout
        assert "--project" in result.stdout

    def test_dry_run_empty_conversations(
        self, run_script, tmp_wiki: Path
    ) -> None:
        result = run_script("summarize-conversations.py", "--claude", "--dry-run")
        assert result.returncode == 0

    def test_dry_run_with_extracted_conversation(
        self, run_script, tmp_wiki: Path
    ) -> None:
        from conftest import make_conversation

        make_conversation(
            tmp_wiki,
            "general",
            "2026-04-10-abc.md",
            status="extracted",  # Not yet summarized
            messages=50,
        )
        result = run_script("summarize-conversations.py", "--claude", "--dry-run")
        assert result.returncode == 0
        # Should mention the file or show it would be processed
        assert "2026-04-10-abc.md" in result.stdout or "1 conversation" in result.stdout


# ---------------------------------------------------------------------------
# update-conversation-index.py
# ---------------------------------------------------------------------------


class TestUpdateConversationIndex:
    def test_help_exits_clean(self, run_script) -> None:
        result = run_script("update-conversation-index.py", "--help")
        assert result.returncode == 0

    def test_runs_on_empty_conversations_dir(
        self, run_script, tmp_wiki: Path
    ) -> None:
        result = run_script("update-conversation-index.py")
        # Should not crash even with no conversations
        assert result.returncode == 0

    def test_builds_index_from_conversations(
        self, run_script, tmp_wiki: Path
    ) -> None:
        from conftest import make_conversation

        make_conversation(
            tmp_wiki,
            "general",
            "2026-04-10-one.md",
|
status="summarized",
|
||||||
|
)
|
||||||
|
make_conversation(
|
||||||
|
tmp_wiki,
|
||||||
|
"general",
|
||||||
|
"2026-04-11-two.md",
|
||||||
|
status="summarized",
|
||||||
|
)
|
||||||
|
result = run_script("update-conversation-index.py")
|
||||||
|
assert result.returncode == 0
|
||||||
|
|
||||||
|
idx = tmp_wiki / "conversations" / "index.md"
|
||||||
|
assert idx.exists()
|
||||||
|
text = idx.read_text()
|
||||||
|
assert "2026-04-10-one.md" in text or "one.md" in text
|
||||||
|
assert "2026-04-11-two.md" in text or "two.md" in text
|
||||||
209	tests/test_shell_scripts.py	Normal file
@@ -0,0 +1,209 @@
"""Smoke tests for the bash scripts.

Bash scripts are harder to unit-test in isolation — these tests verify
CLI parsing, help text, and dry-run/safe flags work correctly and that
scripts exit cleanly in all the no-op paths.

Cross-platform note: tests invoke scripts via `bash` explicitly, so they
work on both macOS (default /bin/bash) and Linux/WSL. They avoid anything
that requires external state (network, git, LLM).
"""

from __future__ import annotations

import os
import subprocess
from pathlib import Path
from typing import Any

import pytest

from conftest import make_conversation, make_page, make_staging_page


# ---------------------------------------------------------------------------
# wiki-maintain.sh
# ---------------------------------------------------------------------------


class TestWikiMaintainSh:
    def test_help_flag(self, run_script) -> None:
        result = run_script("wiki-maintain.sh", "--help")
        assert result.returncode == 0
        assert "Usage:" in result.stdout or "usage:" in result.stdout.lower()
        assert "--full" in result.stdout
        assert "--harvest-only" in result.stdout
        assert "--hygiene-only" in result.stdout

    def test_rejects_unknown_flag(self, run_script) -> None:
        result = run_script("wiki-maintain.sh", "--bogus")
        assert result.returncode != 0
        assert "Unknown option" in result.stderr

    def test_harvest_only_and_hygiene_only_conflict(self, run_script) -> None:
        result = run_script(
            "wiki-maintain.sh", "--harvest-only", "--hygiene-only"
        )
        assert result.returncode != 0
        assert "mutually exclusive" in result.stderr

    def test_hygiene_only_dry_run_completes(
        self, run_script, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/one.md")
        result = run_script(
            "wiki-maintain.sh", "--hygiene-only", "--dry-run", "--no-reindex"
        )
        assert result.returncode == 0
        assert "Phase 2: Hygiene checks" in result.stdout
        assert "finished" in result.stdout

    def test_phase_1_skipped_in_hygiene_only(
        self, run_script, tmp_wiki: Path
    ) -> None:
        result = run_script(
            "wiki-maintain.sh", "--hygiene-only", "--dry-run", "--no-reindex"
        )
        assert result.returncode == 0
        assert "Phase 1: URL harvesting (skipped)" in result.stdout

    def test_phase_3_skipped_in_dry_run(
        self, run_script, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/one.md")
        result = run_script(
            "wiki-maintain.sh", "--hygiene-only", "--dry-run"
        )
        assert "Phase 3: qmd reindex (skipped)" in result.stdout

    def test_harvest_only_dry_run_completes(
        self, run_script, tmp_wiki: Path
    ) -> None:
        # Add a summarized conversation so harvest has something to scan
        make_conversation(
            tmp_wiki,
            "test",
            "2026-04-10-test.md",
            status="summarized",
            body="See https://docs.python.org/3/library/os.html for details.\n",
        )
        result = run_script(
            "wiki-maintain.sh",
            "--harvest-only",
            "--dry-run",
            "--no-compile",
            "--no-reindex",
        )
        assert result.returncode == 0
        assert "Phase 2: Hygiene checks (skipped)" in result.stdout


# ---------------------------------------------------------------------------
# wiki-sync.sh
# ---------------------------------------------------------------------------


class TestWikiSyncSh:
    def test_status_on_non_git_dir_exits_cleanly(self, run_script) -> None:
        """wiki-sync.sh --status against a non-git dir should fail gracefully.

        The tmp_wiki fixture is not a git repo, so git commands will fail.
        The script should report the problem without hanging or leaking stack
        traces. Any exit code is acceptable as long as it exits in reasonable
        time and prints something useful to stdout/stderr.
        """
        result = run_script("wiki-sync.sh", "--status", timeout=30)
        # Should have produced some output and exited (not hung)
        assert result.stdout or result.stderr
        assert "Wiki Sync Status" in result.stdout or "not a git" in result.stderr.lower()


# ---------------------------------------------------------------------------
# mine-conversations.sh
# ---------------------------------------------------------------------------


class TestMineConversationsSh:
    def test_extract_only_dry_run(self, run_script, tmp_wiki: Path) -> None:
        """mine-conversations.sh --extract-only --dry-run should complete without LLM."""
        result = run_script(
            "mine-conversations.sh", "--extract-only", "--dry-run", timeout=30
        )
        assert result.returncode == 0

    def test_rejects_unknown_flag(self, run_script) -> None:
        result = run_script("mine-conversations.sh", "--bogus-flag")
        assert result.returncode != 0


# ---------------------------------------------------------------------------
# Cross-platform sanity — scripts use portable bash syntax
# ---------------------------------------------------------------------------


class TestBashPortability:
    """Verify scripts don't use bashisms that break on macOS /bin/bash 3.2."""

    @pytest.mark.parametrize(
        "script",
        ["wiki-maintain.sh", "mine-conversations.sh", "wiki-sync.sh"],
    )
    def test_shebang_is_env_bash(self, script: str) -> None:
        """All shell scripts should use `#!/usr/bin/env bash` for portability."""
        path = Path(__file__).parent.parent / "scripts" / script
        first_line = path.read_text().splitlines()[0]
        assert first_line == "#!/usr/bin/env bash", (
            f"{script} has shebang {first_line!r}, expected #!/usr/bin/env bash"
        )

    @pytest.mark.parametrize(
        "script",
        ["wiki-maintain.sh", "mine-conversations.sh", "wiki-sync.sh"],
    )
    def test_uses_strict_mode(self, script: str) -> None:
        """All shell scripts should use `set -euo pipefail` for safe defaults."""
        path = Path(__file__).parent.parent / "scripts" / script
        text = path.read_text()
        assert "set -euo pipefail" in text, f"{script} missing strict mode"

    @pytest.mark.parametrize(
        "script",
        ["wiki-maintain.sh", "mine-conversations.sh", "wiki-sync.sh"],
    )
    def test_bash_syntax_check(self, script: str) -> None:
        """bash -n does a syntax-only parse and catches obvious errors."""
        path = Path(__file__).parent.parent / "scripts" / script
        result = subprocess.run(
            ["bash", "-n", str(path)],
            capture_output=True,
            text=True,
            timeout=10,
        )
        assert result.returncode == 0, f"{script} has bash syntax errors: {result.stderr}"


# ---------------------------------------------------------------------------
# Python script syntax check (smoke)
# ---------------------------------------------------------------------------


class TestPythonSyntax:
    @pytest.mark.parametrize(
        "script",
        [
            "wiki_lib.py",
            "wiki-harvest.py",
            "wiki-staging.py",
            "wiki-hygiene.py",
            "extract-sessions.py",
            "summarize-conversations.py",
            "update-conversation-index.py",
        ],
    )
    def test_py_compile(self, script: str) -> None:
        """py_compile catches syntax errors without executing the module."""
        import py_compile

        path = Path(__file__).parent.parent / "scripts" / script
        # py_compile.compile raises on error; success returns the .pyc path
        py_compile.compile(str(path), doraise=True)
323	tests/test_wiki_harvest.py	Normal file
@@ -0,0 +1,323 @@
"""Unit + integration tests for scripts/wiki-harvest.py."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any
from unittest.mock import patch

import pytest

from conftest import make_conversation


# ---------------------------------------------------------------------------
# URL classification
# ---------------------------------------------------------------------------


class TestClassifyUrl:
    def test_regular_docs_site_harvest(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("https://docs.python.org/3/library/os.html") == "harvest"
        assert wiki_harvest.classify_url("https://blog.example.com/post") == "harvest"

    def test_github_issue_is_check(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("https://github.com/foo/bar/issues/42") == "check"

    def test_github_pr_is_check(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("https://github.com/foo/bar/pull/99") == "check"

    def test_stackoverflow_is_check(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url(
            "https://stackoverflow.com/questions/12345/title"
        ) == "check"

    def test_localhost_skip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("http://localhost:3000/path") == "skip"
        assert wiki_harvest.classify_url("http://localhost/foo") == "skip"

    def test_private_ip_skip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("http://10.0.0.1/api") == "skip"
        assert wiki_harvest.classify_url("http://172.30.224.1:8080/v1") == "skip"
        assert wiki_harvest.classify_url("http://192.168.1.1/test") == "skip"
        assert wiki_harvest.classify_url("http://127.0.0.1:8080/foo") == "skip"

    def test_local_and_internal_tld_skip(self, wiki_harvest: Any) -> None:
        # `.local` and `.internal` are baked into SKIP_DOMAIN_PATTERNS
        assert wiki_harvest.classify_url("https://router.local/admin") == "skip"
        assert wiki_harvest.classify_url("https://service.internal/api") == "skip"

    def test_custom_skip_pattern_runtime(self, wiki_harvest: Any) -> None:
        # Users can append their own patterns at runtime — verify the hook works
        wiki_harvest.SKIP_DOMAIN_PATTERNS.append(r"\.mycompany\.com$")
        try:
            assert wiki_harvest.classify_url("https://git.mycompany.com/foo") == "skip"
            assert wiki_harvest.classify_url("https://docs.mycompany.com/api") == "skip"
        finally:
            wiki_harvest.SKIP_DOMAIN_PATTERNS.pop()

    def test_atlassian_skip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("https://foo.atlassian.net/browse/BAR-1") == "skip"

    def test_slack_skip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("https://myteam.slack.com/archives/C123") == "skip"

    def test_github_repo_root_is_harvest(self, wiki_harvest: Any) -> None:
        # Not an issue/pr/discussion — just a repo root, might contain docs
        assert wiki_harvest.classify_url("https://github.com/foo/bar") == "harvest"

    def test_invalid_url_skip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.classify_url("not a url") == "skip"


# ---------------------------------------------------------------------------
# Private IP detection
# ---------------------------------------------------------------------------


class TestPrivateIp:
    def test_10_range(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("10.0.0.1") is True
        assert wiki_harvest._is_private_ip("10.255.255.255") is True

    def test_172_16_to_31_range(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("172.16.0.1") is True
        assert wiki_harvest._is_private_ip("172.31.255.255") is True
        assert wiki_harvest._is_private_ip("172.15.0.1") is False
        assert wiki_harvest._is_private_ip("172.32.0.1") is False

    def test_192_168_range(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("192.168.0.1") is True
        assert wiki_harvest._is_private_ip("192.167.0.1") is False

    def test_loopback(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("127.0.0.1") is True

    def test_public_ip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("8.8.8.8") is False

    def test_hostname_not_ip(self, wiki_harvest: Any) -> None:
        assert wiki_harvest._is_private_ip("example.com") is False


# ---------------------------------------------------------------------------
# URL extraction from files
# ---------------------------------------------------------------------------


class TestExtractUrls:
    def test_finds_urls_in_markdown(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        path = make_conversation(
            tmp_wiki,
            "test",
            "test.md",
            body="See https://docs.python.org/3/library/os.html for details.\n"
            "Also https://fastapi.tiangolo.com/tutorial/.\n",
        )
        urls = wiki_harvest.extract_urls_from_file(path)
        assert "https://docs.python.org/3/library/os.html" in urls
        assert "https://fastapi.tiangolo.com/tutorial/" in urls

    def test_filters_asset_extensions(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        path = make_conversation(
            tmp_wiki,
            "test",
            "assets.md",
            body=(
                "Real: https://example.com/docs/article.html\n"
                "Image: https://example.com/logo.png\n"
                "Script: https://cdn.example.com/lib.js\n"
                "Font: https://fonts.example.com/face.woff2\n"
            ),
        )
        urls = wiki_harvest.extract_urls_from_file(path)
        assert "https://example.com/docs/article.html" in urls
        assert not any(u.endswith(".png") for u in urls)
        assert not any(u.endswith(".js") for u in urls)
        assert not any(u.endswith(".woff2") for u in urls)

    def test_strips_trailing_punctuation(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        path = make_conversation(
            tmp_wiki,
            "test",
            "punct.md",
            body="See https://example.com/foo. Also https://example.com/bar, and more.\n",
        )
        urls = wiki_harvest.extract_urls_from_file(path)
        assert "https://example.com/foo" in urls
        assert "https://example.com/bar" in urls

    def test_deduplicates_within_file(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        path = make_conversation(
            tmp_wiki,
            "test",
            "dup.md",
            body=(
                "First mention: https://example.com/same\n"
                "Second mention: https://example.com/same\n"
            ),
        )
        urls = wiki_harvest.extract_urls_from_file(path)
        assert urls.count("https://example.com/same") == 1

    def test_returns_empty_for_missing_file(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        assert wiki_harvest.extract_urls_from_file(tmp_wiki / "nope.md") == []

    def test_filters_short_urls(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        # URLs shorter than 20 chars are skipped
        path = make_conversation(
            tmp_wiki,
            "test",
            "short.md",
            body="tiny http://a.b/ and https://example.com/long-path\n",
        )
        urls = wiki_harvest.extract_urls_from_file(path)
        assert "http://a.b/" not in urls
        assert "https://example.com/long-path" in urls


# ---------------------------------------------------------------------------
# Raw filename derivation
# ---------------------------------------------------------------------------


class TestRawFilename:
    def test_basic_url(self, wiki_harvest: Any) -> None:
        name = wiki_harvest.raw_filename_for_url("https://docs.docker.com/build/multi-stage/")
        assert name.startswith("docs-docker-com-")
        assert "build" in name and "multi-stage" in name
        assert name.endswith(".md")

    def test_strips_www(self, wiki_harvest: Any) -> None:
        name = wiki_harvest.raw_filename_for_url("https://www.example.com/foo")
        assert "www" not in name

    def test_root_url_uses_index(self, wiki_harvest: Any) -> None:
        name = wiki_harvest.raw_filename_for_url("https://example.com/")
        assert name == "example-com-index.md"

    def test_long_paths_truncated(self, wiki_harvest: Any) -> None:
        long_url = "https://example.com/" + "a-very-long-segment/" * 20
        name = wiki_harvest.raw_filename_for_url(long_url)
        assert len(name) < 200


# ---------------------------------------------------------------------------
# Content validation
# ---------------------------------------------------------------------------


class TestValidateContent:
    def test_accepts_clean_markdown(self, wiki_harvest: Any) -> None:
        content = "# Title\n\n" + ("A clean paragraph of markdown content. " * 5)
        assert wiki_harvest.validate_content(content) is True

    def test_rejects_empty(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.validate_content("") is False

    def test_rejects_too_short(self, wiki_harvest: Any) -> None:
        assert wiki_harvest.validate_content("# Short") is False

    def test_rejects_html_leak(self, wiki_harvest: Any) -> None:
        content = "# Title\n\n<div class='nav'>Navigation</div>\n" + "content " * 30
        assert wiki_harvest.validate_content(content) is False

    def test_rejects_script_tag(self, wiki_harvest: Any) -> None:
        content = "# Title\n\n<script>alert()</script>\n" + "content " * 30
        assert wiki_harvest.validate_content(content) is False


# ---------------------------------------------------------------------------
# State management
# ---------------------------------------------------------------------------


class TestStateManagement:
    def test_load_returns_defaults_when_file_empty(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        (tmp_wiki / ".harvest-state.json").write_text("{}")
        state = wiki_harvest.load_state()
        assert "harvested_urls" in state
        assert "skipped_urls" in state

    def test_save_and_reload(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        state = wiki_harvest.load_state()
        state["harvested_urls"]["https://example.com"] = {
            "first_seen": "2026-04-12",
            "seen_in": ["conversations/mc/foo.md"],
            "raw_file": "raw/harvested/example.md",
            "status": "raw",
            "fetch_method": "trafilatura",
        }
        wiki_harvest.save_state(state)

        reloaded = wiki_harvest.load_state()
        assert "https://example.com" in reloaded["harvested_urls"]
        assert reloaded["last_run"] is not None


# ---------------------------------------------------------------------------
# Raw file writer
# ---------------------------------------------------------------------------


class TestWriteRawFile:
    def test_writes_with_frontmatter(
        self, wiki_harvest: Any, tmp_wiki: Path
    ) -> None:
        conv = make_conversation(tmp_wiki, "test", "source.md")
        raw_path = wiki_harvest.write_raw_file(
            "https://example.com/article",
            "# Article\n\nClean content.\n",
            "trafilatura",
            conv,
        )
        assert raw_path.exists()
        text = raw_path.read_text()
        assert "source_url: https://example.com/article" in text
        assert "fetch_method: trafilatura" in text
        assert "content_hash: sha256:" in text
        assert "discovered_in: conversations/test/source.md" in text


# ---------------------------------------------------------------------------
# Dry-run CLI smoke test (no actual fetches)
# ---------------------------------------------------------------------------


class TestHarvestCli:
    def test_dry_run_no_network_calls(
        self, run_script, tmp_wiki: Path
    ) -> None:
        make_conversation(
            tmp_wiki,
            "test",
            "test.md",
            body="See https://docs.python.org/3/ and https://github.com/foo/bar/issues/1.\n",
        )
        result = run_script("wiki-harvest.py", "--dry-run")
        assert result.returncode == 0
        # Dry-run should classify without fetching
        assert "would-harvest" in result.stdout or "Summary" in result.stdout

    def test_help_flag(self, run_script) -> None:
        result = run_script("wiki-harvest.py", "--help")
        assert result.returncode == 0
        assert "--dry-run" in result.stdout
        assert "--no-compile" in result.stdout
616	tests/test_wiki_hygiene.py	Normal file
@@ -0,0 +1,616 @@
|
|||||||
|
"""Integration tests for scripts/wiki-hygiene.py.
|
||||||
|
|
||||||
|
Uses the tmp_wiki fixture so tests never touch the real wiki.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from datetime import date, timedelta
|
||||||
|
from pathlib import Path
|
||||||
|
from typing import Any
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from conftest import make_conversation, make_page, make_staging_page
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Backfill last_verified
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
class TestBackfill:
|
||||||
|
def test_sets_last_verified_from_last_compiled(
|
||||||
|
self, wiki_hygiene: Any, tmp_wiki: Path
|
||||||
|
) -> None:
|
||||||
|
path = make_page(tmp_wiki, "patterns/foo.md", last_compiled="2026-01-15")
|
||||||
|
# Strip last_verified from the fixture-built file
|
||||||
|
text = path.read_text()
|
||||||
|
text = text.replace("last_verified: 2026-04-01\n", "")
|
||||||
|
path.write_text(text)
|
||||||
|
|
||||||
|
changes = wiki_hygiene.backfill_last_verified()
|
||||||
|
assert len(changes) == 1
|
||||||
|
assert changes[0][1] == "last_compiled"
|
||||||
|
|
||||||
|
reparsed = wiki_hygiene.parse_page(path)
|
||||||
|
assert reparsed.frontmatter["last_verified"] == "2026-01-15"
|
||||||
|
|
||||||
|
def test_skips_pages_already_verified(
|
||||||
|
self, wiki_hygiene: Any, tmp_wiki: Path
|
||||||
|
) -> None:
|
||||||
|
make_page(tmp_wiki, "patterns/done.md", last_verified="2026-04-01")
|
||||||
|
changes = wiki_hygiene.backfill_last_verified()
|
||||||
|
assert changes == []
|
||||||
|
|
||||||
|
def test_dry_run_does_not_write(
|
||||||
|
self, wiki_hygiene: Any, tmp_wiki: Path
|
||||||
|
) -> None:
|
||||||
|
path = make_page(tmp_wiki, "patterns/foo.md", last_compiled="2026-01-15")
|
||||||
|
text = path.read_text().replace("last_verified: 2026-04-01\n", "")
|
||||||
|
path.write_text(text)
|
||||||
|
|
||||||
|
changes = wiki_hygiene.backfill_last_verified(dry_run=True)
|
||||||
|
assert len(changes) == 1
|
||||||
|
|
||||||
|
reparsed = wiki_hygiene.parse_page(path)
|
||||||
|
assert "last_verified" not in reparsed.frontmatter
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Confidence decay math
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
class TestConfidenceDecay:
|
||||||
|
def test_recent_page_unchanged(self, wiki_hygiene: Any) -> None:
|
||||||
|
recent = wiki_hygiene.today() - timedelta(days=30)
|
||||||
|
assert wiki_hygiene.expected_confidence("high", recent, False) == "high"
|
||||||
|
|
||||||
|
def test_six_months_decays_high_to_medium(self, wiki_hygiene: Any) -> None:
|
||||||
|
old = wiki_hygiene.today() - timedelta(days=200)
|
||||||
|
assert wiki_hygiene.expected_confidence("high", old, False) == "medium"
|
||||||
|
|
||||||
|
def test_nine_months_decays_medium_to_low(self, wiki_hygiene: Any) -> None:
|
||||||
|
old = wiki_hygiene.today() - timedelta(days=280)
|
||||||
|
assert wiki_hygiene.expected_confidence("medium", old, False) == "low"
|
||||||
|
|
||||||
|
def test_twelve_months_decays_to_stale(self, wiki_hygiene: Any) -> None:
|
||||||
|
old = wiki_hygiene.today() - timedelta(days=400)
|
||||||
|
assert wiki_hygiene.expected_confidence("high", old, False) == "stale"
|
||||||
|
|
||||||
|
def test_superseded_is_always_stale(self, wiki_hygiene: Any) -> None:
|
||||||
|
recent = wiki_hygiene.today() - timedelta(days=1)
|
||||||
|
assert wiki_hygiene.expected_confidence("high", recent, True) == "stale"
|
||||||
|
|
||||||
|
def test_none_date_leaves_confidence_alone(self, wiki_hygiene: Any) -> None:
|
||||||
|
assert wiki_hygiene.expected_confidence("medium", None, False) == "medium"
|
||||||
|
|
||||||
|
def test_bump_confidence_ladder(self, wiki_hygiene: Any) -> None:
|
||||||
|
assert wiki_hygiene.bump_confidence("stale") == "low"
|
||||||
|
assert wiki_hygiene.bump_confidence("low") == "medium"
|
||||||
|
assert wiki_hygiene.bump_confidence("medium") == "high"
|
||||||
|
assert wiki_hygiene.bump_confidence("high") == "high"
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
# Frontmatter repair
# ---------------------------------------------------------------------------


class TestFrontmatterRepair:
    def test_adds_missing_confidence(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = tmp_wiki / "patterns" / "no-conf.md"
        path.write_text(
            "---\ntitle: No Confidence\ntype: pattern\n"
            "last_compiled: 2026-04-01\nlast_verified: 2026-04-01\n---\n"
            "# Body\n\nSubstantive content here for testing purposes.\n"
        )
        changes = wiki_hygiene.repair_frontmatter()
        assert any("confidence" in fields for _, fields in changes)

        reparsed = wiki_hygiene.parse_page(path)
        assert reparsed.frontmatter["confidence"] == "medium"

    def test_fixes_invalid_confidence(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/bad-conf.md", confidence="wat")
        changes = wiki_hygiene.repair_frontmatter()
        assert any(p == path for p, _ in changes)

        reparsed = wiki_hygiene.parse_page(path)
        assert reparsed.frontmatter["confidence"] == "medium"

    def test_leaves_valid_pages_alone(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/good.md")
        changes = wiki_hygiene.repair_frontmatter()
        assert changes == []


# ---------------------------------------------------------------------------
# Archive and restore round-trip
# ---------------------------------------------------------------------------


class TestArchiveRestore:
    def test_archive_moves_file_and_updates_frontmatter(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/doomed.md")
        page = wiki_hygiene.parse_page(path)

        wiki_hygiene.archive_page(page, "test archive")

        assert not path.exists()
        archived = tmp_wiki / "archive" / "patterns" / "doomed.md"
        assert archived.exists()

        reparsed = wiki_hygiene.parse_page(archived)
        assert reparsed.frontmatter["archived_reason"] == "test archive"
        assert reparsed.frontmatter["original_path"] == "patterns/doomed.md"
        assert reparsed.frontmatter["confidence"] == "stale"

    def test_restore_reverses_archive(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        original = make_page(tmp_wiki, "patterns/zombie.md")
        page = wiki_hygiene.parse_page(original)
        wiki_hygiene.archive_page(page, "test")

        archived = tmp_wiki / "archive" / "patterns" / "zombie.md"
        archived_page = wiki_hygiene.parse_page(archived)
        wiki_hygiene.restore_page(archived_page)

        assert original.exists()
        assert not archived.exists()

        reparsed = wiki_hygiene.parse_page(original)
        assert reparsed.frontmatter["confidence"] == "medium"
        assert "archived_date" not in reparsed.frontmatter
        assert "archived_reason" not in reparsed.frontmatter
        assert "original_path" not in reparsed.frontmatter

    def test_archive_rejects_non_live_pages(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        # Page outside the live content dirs — should refuse to archive
        weird = tmp_wiki / "raw" / "weird.md"
        weird.parent.mkdir(parents=True, exist_ok=True)
        weird.write_text("---\ntitle: Weird\n---\nBody\n")
        page = wiki_hygiene.parse_page(weird)
        result = wiki_hygiene.archive_page(page, "test")
        assert result is None

    def test_archive_dry_run_does_not_move(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/safe.md")
        page = wiki_hygiene.parse_page(path)
        wiki_hygiene.archive_page(page, "test", dry_run=True)
        assert path.exists()
        assert not (tmp_wiki / "archive" / "patterns" / "safe.md").exists()


# ---------------------------------------------------------------------------
# Orphan detection
# ---------------------------------------------------------------------------


class TestOrphanDetection:
    def test_finds_orphan_page(self, wiki_hygiene: Any, tmp_wiki: Path) -> None:
        make_page(tmp_wiki, "patterns/lonely.md")
        orphans = wiki_hygiene.find_orphan_pages()
        assert len(orphans) == 1
        assert orphans[0].path.stem == "lonely"

    def test_page_referenced_in_index_is_not_orphan(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/linked.md")
        idx = tmp_wiki / "index.md"
        idx.write_text(idx.read_text() + "- [Linked](patterns/linked.md) — desc\n")
        orphans = wiki_hygiene.find_orphan_pages()
        assert not any(p.path.stem == "linked" for p in orphans)

    def test_page_referenced_in_related_is_not_orphan(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/referenced.md")
        make_page(
            tmp_wiki,
            "patterns/referencer.md",
            related=["patterns/referenced.md"],
        )
        orphans = wiki_hygiene.find_orphan_pages()
        stems = {p.path.stem for p in orphans}
        assert "referenced" not in stems

    def test_fix_orphan_adds_to_index(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/orphan.md", title="Orphan Test")
        page = wiki_hygiene.parse_page(path)
        wiki_hygiene.fix_orphan_page(page)
        idx_text = (tmp_wiki / "index.md").read_text()
        assert "patterns/orphan.md" in idx_text


# ---------------------------------------------------------------------------
# Broken cross-references
# ---------------------------------------------------------------------------


class TestBrokenCrossRefs:
    def test_detects_broken_link(self, wiki_hygiene: Any, tmp_wiki: Path) -> None:
        make_page(
            tmp_wiki,
            "patterns/source.md",
            body="See [nonexistent](patterns/does-not-exist.md) for details.\n",
        )
        broken = wiki_hygiene.find_broken_cross_refs()
        assert len(broken) == 1
        target, bad, suggested = broken[0]
        assert bad == "patterns/does-not-exist.md"

    def test_fuzzy_match_finds_near_miss(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/health-endpoint.md")
        make_page(
            tmp_wiki,
            "patterns/source.md",
            body="See [H](patterns/health-endpoints.md) — typo.\n",
        )
        broken = wiki_hygiene.find_broken_cross_refs()
        assert len(broken) >= 1
        _, bad, suggested = broken[0]
        assert suggested == "patterns/health-endpoint.md"

    def test_fix_broken_xref(self, wiki_hygiene: Any, tmp_wiki: Path) -> None:
        make_page(tmp_wiki, "patterns/health-endpoint.md")
        src = make_page(
            tmp_wiki,
            "patterns/source.md",
            body="See [H](patterns/health-endpoints.md).\n",
        )
        broken = wiki_hygiene.find_broken_cross_refs()
        for target, bad, suggested in broken:
            wiki_hygiene.fix_broken_cross_ref(target, bad, suggested)
        text = src.read_text()
        assert "patterns/health-endpoints.md" not in text
        assert "patterns/health-endpoint.md" in text

    def test_archived_link_triggers_restore(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        # Page in archive, referenced by a live page
        make_page(
            tmp_wiki,
            "archive/patterns/ghost.md",
            confidence="stale",
            extra_fm={
                "archived_date": "2026-01-01",
                "archived_reason": "test",
                "original_path": "patterns/ghost.md",
            },
        )
        make_page(
            tmp_wiki,
            "patterns/caller.md",
            body="See [ghost](patterns/ghost.md).\n",
        )
        broken = wiki_hygiene.find_broken_cross_refs()
        assert len(broken) >= 1
        for target, bad, suggested in broken:
            if suggested and suggested.startswith("__RESTORE__"):
                wiki_hygiene.fix_broken_cross_ref(target, bad, suggested)
        # After restore, ghost should be live again
        assert (tmp_wiki / "patterns" / "ghost.md").exists()


# ---------------------------------------------------------------------------
# Index drift
# ---------------------------------------------------------------------------


class TestIndexDrift:
    def test_finds_page_missing_from_index(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/missing.md")
        missing, stale = wiki_hygiene.find_index_drift()
        assert "patterns/missing.md" in missing
        assert stale == []

    def test_finds_stale_index_entry(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        idx = tmp_wiki / "index.md"
        idx.write_text(
            idx.read_text()
            + "- [Ghost](patterns/ghost.md) — page that no longer exists\n"
        )
        missing, stale = wiki_hygiene.find_index_drift()
        assert "patterns/ghost.md" in stale

    def test_fix_adds_missing_and_removes_stale(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/new.md")
        idx = tmp_wiki / "index.md"
        idx.write_text(
            idx.read_text()
            + "- [Gone](patterns/gone.md) — deleted page\n"
        )
        missing, stale = wiki_hygiene.find_index_drift()
        wiki_hygiene.fix_index_drift(missing, stale)
        idx_text = idx.read_text()
        assert "patterns/new.md" in idx_text
        assert "patterns/gone.md" not in idx_text


# ---------------------------------------------------------------------------
# Empty stubs
# ---------------------------------------------------------------------------


class TestEmptyStubs:
    def test_flags_small_body(self, wiki_hygiene: Any, tmp_wiki: Path) -> None:
        make_page(tmp_wiki, "patterns/stub.md", body="# Stub\n\nShort.\n")
        stubs = wiki_hygiene.find_empty_stubs()
        assert len(stubs) == 1
        assert stubs[0].path.stem == "stub"

    def test_ignores_substantive_pages(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        body = "# Full\n\n" + ("This is substantive content. " * 20) + "\n"
        make_page(tmp_wiki, "patterns/full.md", body=body)
        stubs = wiki_hygiene.find_empty_stubs()
        assert stubs == []


# ---------------------------------------------------------------------------
# Conversation refresh signals
# ---------------------------------------------------------------------------


class TestConversationRefreshSignals:
    def test_picks_up_related_link(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/hot.md", last_verified="2026-01-01")
        make_conversation(
            tmp_wiki,
            "test",
            "2026-04-11-abc.md",
            date="2026-04-11",
            related=["patterns/hot.md"],
        )
        refs = wiki_hygiene.scan_conversation_references()
        assert "patterns/hot.md" in refs
        assert refs["patterns/hot.md"] == date(2026, 4, 11)

    def test_apply_refresh_updates_last_verified(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/hot.md", last_verified="2026-01-01")
        make_conversation(
            tmp_wiki,
            "test",
            "2026-04-11-abc.md",
            date="2026-04-11",
            related=["patterns/hot.md"],
        )
        refs = wiki_hygiene.scan_conversation_references()
        changes = wiki_hygiene.apply_refresh_signals(refs)
        assert len(changes) == 1

        reparsed = wiki_hygiene.parse_page(path)
        assert reparsed.frontmatter["last_verified"] == "2026-04-11"

    def test_bumps_low_confidence_to_medium(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(
            tmp_wiki,
            "patterns/reviving.md",
            confidence="low",
            last_verified="2026-01-01",
        )
        make_conversation(
            tmp_wiki,
            "test",
            "2026-04-11-ref.md",
            date="2026-04-11",
            related=["patterns/reviving.md"],
        )
        refs = wiki_hygiene.scan_conversation_references()
        wiki_hygiene.apply_refresh_signals(refs)
        reparsed = wiki_hygiene.parse_page(path)
        assert reparsed.frontmatter["confidence"] == "medium"


# ---------------------------------------------------------------------------
# Auto-restore
# ---------------------------------------------------------------------------


class TestAutoRestore:
    def test_restores_page_referenced_in_conversation(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        # Archive a page
        path = make_page(tmp_wiki, "patterns/returning.md")
        page = wiki_hygiene.parse_page(path)
        wiki_hygiene.archive_page(page, "aging out")
        assert (tmp_wiki / "archive" / "patterns" / "returning.md").exists()

        # Reference it in a conversation
        make_conversation(
            tmp_wiki,
            "test",
            "2026-04-12-ref.md",
            related=["patterns/returning.md"],
        )

        # Auto-restore
        restored = wiki_hygiene.auto_restore_archived()
        assert len(restored) == 1
        assert (tmp_wiki / "patterns" / "returning.md").exists()
        assert not (tmp_wiki / "archive" / "patterns" / "returning.md").exists()


# ---------------------------------------------------------------------------
# Staging / archive index sync
# ---------------------------------------------------------------------------


class TestIndexSync:
    def test_staging_sync_regenerates_index(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/pending.md")
        changed = wiki_hygiene.sync_staging_index()
        assert changed is True
        text = (tmp_wiki / "staging" / "index.md").read_text()
        assert "pending.md" in text

    def test_staging_sync_idempotent(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/pending.md")
        wiki_hygiene.sync_staging_index()
        changed_second = wiki_hygiene.sync_staging_index()
        assert changed_second is False

    def test_archive_sync_regenerates_index(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(
            tmp_wiki,
            "archive/patterns/old.md",
            confidence="stale",
            extra_fm={
                "archived_date": "2026-01-01",
                "archived_reason": "test",
                "original_path": "patterns/old.md",
            },
        )
        changed = wiki_hygiene.sync_archive_index()
        assert changed is True
        text = (tmp_wiki / "archive" / "index.md").read_text()
        assert "old" in text.lower()


# ---------------------------------------------------------------------------
# State drift detection
# ---------------------------------------------------------------------------


class TestStateDrift:
    def test_detects_missing_raw_file(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        import json
        state = {
            "harvested_urls": {
                "https://example.com": {
                    "raw_file": "raw/harvested/missing.md",
                    "wiki_pages": [],
                }
            }
        }
        (tmp_wiki / ".harvest-state.json").write_text(json.dumps(state))
        issues = wiki_hygiene.find_state_drift()
        assert any("missing.md" in i for i in issues)

    def test_empty_state_has_no_drift(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        # Fixture already creates an empty .harvest-state.json
        issues = wiki_hygiene.find_state_drift()
        assert issues == []


# ---------------------------------------------------------------------------
# Hygiene state file
# ---------------------------------------------------------------------------


class TestHygieneState:
    def test_load_returns_defaults_when_missing(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        state = wiki_hygiene.load_hygiene_state()
        assert state["last_quick_run"] is None
        assert state["pages_checked"] == {}

    def test_save_and_reload(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        state = wiki_hygiene.load_hygiene_state()
        state["last_quick_run"] = "2026-04-12T00:00:00Z"
        wiki_hygiene.save_hygiene_state(state)

        reloaded = wiki_hygiene.load_hygiene_state()
        assert reloaded["last_quick_run"] == "2026-04-12T00:00:00Z"

    def test_mark_page_checked_stores_hash(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/tracked.md")
        page = wiki_hygiene.parse_page(path)
        state = wiki_hygiene.load_hygiene_state()
        wiki_hygiene.mark_page_checked(state, page, "quick")
        entry = state["pages_checked"]["patterns/tracked.md"]
        assert entry["content_hash"].startswith("sha256:")
        assert "last_checked_quick" in entry

    def test_page_changed_since_detects_body_change(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/mutable.md", body="# One\n\nOne body.\n")
        page = wiki_hygiene.parse_page(path)
        state = wiki_hygiene.load_hygiene_state()
        wiki_hygiene.mark_page_checked(state, page, "quick")

        assert not wiki_hygiene.page_changed_since(state, page, "quick")

        # Mutate the body
        path.write_text(path.read_text().replace("One body", "Two body"))
        new_page = wiki_hygiene.parse_page(path)
        assert wiki_hygiene.page_changed_since(state, new_page, "quick")


# ---------------------------------------------------------------------------
# Full quick-hygiene run end-to-end (dry-run, idempotent)
# ---------------------------------------------------------------------------


class TestRunQuickHygiene:
    def test_empty_wiki_produces_empty_report(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        report = wiki_hygiene.run_quick_hygiene(dry_run=True)
        assert report.backfilled == []
        assert report.archived == []

    def test_real_run_is_idempotent(
        self, wiki_hygiene: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/one.md")
        make_page(tmp_wiki, "patterns/two.md")

        report1 = wiki_hygiene.run_quick_hygiene()
        # Second run should have 0 work
        report2 = wiki_hygiene.run_quick_hygiene()
        assert report2.backfilled == []
        assert report2.decayed == []
        assert report2.archived == []
        assert report2.frontmatter_fixes == []
314 tests/test_wiki_lib.py Normal file
@@ -0,0 +1,314 @@
"""Unit tests for scripts/wiki_lib.py — the shared frontmatter library."""

from __future__ import annotations

from datetime import date
from pathlib import Path
from typing import Any

import pytest

from conftest import make_page, make_staging_page


# ---------------------------------------------------------------------------
# parse_yaml_lite
# ---------------------------------------------------------------------------


class TestParseYamlLite:
    def test_simple_key_value(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("title: Hello\ntype: pattern\n")
        assert result == {"title": "Hello", "type": "pattern"}

    def test_quoted_values_are_stripped(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite('title: "Hello"\nother: \'World\'\n')
        assert result["title"] == "Hello"
        assert result["other"] == "World"

    def test_inline_list(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("tags: [a, b, c]\n")
        assert result["tags"] == ["a", "b", "c"]

    def test_empty_inline_list(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("sources: []\n")
        assert result["sources"] == []

    def test_block_list(self, wiki_lib: Any) -> None:
        yaml = "related:\n - foo.md\n - bar.md\n - baz.md\n"
        result = wiki_lib.parse_yaml_lite(yaml)
        assert result["related"] == ["foo.md", "bar.md", "baz.md"]

    def test_mixed_keys(self, wiki_lib: Any) -> None:
        yaml = (
            "title: Mixed\n"
            "type: pattern\n"
            "related:\n"
            " - one.md\n"
            " - two.md\n"
            "confidence: high\n"
        )
        result = wiki_lib.parse_yaml_lite(yaml)
        assert result["title"] == "Mixed"
        assert result["related"] == ["one.md", "two.md"]
        assert result["confidence"] == "high"

    def test_empty_value(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("empty: \n")
        assert result["empty"] == ""

    def test_comment_lines_ignored(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("# this is a comment\ntitle: X\n")
        assert result == {"title": "X"}

    def test_blank_lines_ignored(self, wiki_lib: Any) -> None:
        result = wiki_lib.parse_yaml_lite("\ntitle: X\n\ntype: pattern\n\n")
        assert result == {"title": "X", "type": "pattern"}

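The behaviors asserted above are compact enough to sketch in one function. This is a hypothetical illustration written to match the tests, not the real `parse_yaml_lite` in `scripts/wiki_lib.py`:

```python
# Hypothetical sketch of a frontmatter-only YAML subset parser, written to
# match the behaviors the tests above assert: scalar key/value pairs,
# quote stripping, inline lists, block lists, comments, and blank lines.
def parse_yaml_lite(text: str) -> dict:
    result: dict = {}
    key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith("- ") and key is not None:
            # Item belonging to a block list under the most recent key
            if not isinstance(result[key], list):
                result[key] = []
            result[key].append(line[2:].strip())
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            inner = value[1:-1].strip()
            result[key] = [v.strip() for v in inner.split(",")] if inner else []
        else:
            result[key] = value.strip("'\"")  # quoted scalars are unwrapped
    return result
```

A full YAML library would handle nesting and typing; keeping the parser this small is what lets the frontmatter format stay dependency-free.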
# ---------------------------------------------------------------------------
# parse_page
# ---------------------------------------------------------------------------


class TestParsePage:
    def test_parses_valid_page(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        path = make_page(tmp_wiki, "patterns/foo.md", title="Foo", confidence="high")
        page = wiki_lib.parse_page(path)
        assert page is not None
        assert page.frontmatter["title"] == "Foo"
        assert page.frontmatter["confidence"] == "high"
        assert "# Content" in page.body

    def test_returns_none_without_frontmatter(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        path = tmp_wiki / "patterns" / "no-fm.md"
        path.write_text("# Just a body\n\nNo frontmatter.\n")
        assert wiki_lib.parse_page(path) is None

    def test_returns_none_for_missing_file(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        assert wiki_lib.parse_page(tmp_wiki / "nonexistent.md") is None

    def test_returns_none_for_truncated_frontmatter(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        path = tmp_wiki / "patterns" / "broken.md"
        path.write_text("---\ntitle: Broken\n# never closed\n")
        assert wiki_lib.parse_page(path) is None

    def test_preserves_body_exactly(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        body = "# Heading\n\nLine 1\nLine 2\n\n## Sub\n\nMore.\n"
        path = make_page(tmp_wiki, "patterns/body.md", body=body)
        page = wiki_lib.parse_page(path)
        assert page.body == body


# ---------------------------------------------------------------------------
# serialize_frontmatter
# ---------------------------------------------------------------------------


class TestSerializeFrontmatter:
    def test_preferred_key_order(self, wiki_lib: Any) -> None:
        fm = {
            "related": ["a.md"],
            "sources": ["raw/x.md"],
            "title": "T",
            "confidence": "high",
            "type": "pattern",
        }
        yaml = wiki_lib.serialize_frontmatter(fm)
        lines = yaml.split("\n")
        # title/type/confidence should come before sources/related
        assert lines[0].startswith("title:")
        assert lines[1].startswith("type:")
        assert lines[2].startswith("confidence:")
        assert "sources:" in yaml
        assert "related:" in yaml
        # sources must come before related (both are in PREFERRED_KEY_ORDER)
        assert yaml.index("sources:") < yaml.index("related:")

    def test_list_formatted_as_block(self, wiki_lib: Any) -> None:
        fm = {"title": "T", "related": ["one.md", "two.md"]}
        yaml = wiki_lib.serialize_frontmatter(fm)
        assert "related:\n - one.md\n - two.md" in yaml

    def test_empty_list(self, wiki_lib: Any) -> None:
        fm = {"title": "T", "sources": []}
        yaml = wiki_lib.serialize_frontmatter(fm)
        assert "sources: []" in yaml

    def test_unknown_keys_appear_alphabetically_at_end(self, wiki_lib: Any) -> None:
        fm = {"title": "T", "type": "pattern", "zoo": "z", "alpha": "a"}
        yaml = wiki_lib.serialize_frontmatter(fm)
        # alpha should come before zoo (alphabetical)
        assert yaml.index("alpha:") < yaml.index("zoo:")


# ---------------------------------------------------------------------------
# Round-trip: parse_page → write_page → parse_page
# ---------------------------------------------------------------------------


class TestRoundTrip:
    def test_round_trip_preserves_core_fields(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(
            tmp_wiki,
            "patterns/rt.md",
            title="Round Trip",
            sources=["raw/a.md", "raw/b.md"],
            related=["patterns/other.md"],
        )
        page1 = wiki_lib.parse_page(path)
        wiki_lib.write_page(page1)
        page2 = wiki_lib.parse_page(path)
        assert page2.frontmatter["title"] == "Round Trip"
        assert page2.frontmatter["sources"] == ["raw/a.md", "raw/b.md"]
        assert page2.frontmatter["related"] == ["patterns/other.md"]
        assert page2.body == page1.body

    def test_round_trip_preserves_mutation(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        path = make_page(tmp_wiki, "patterns/rt.md", confidence="high")
        page = wiki_lib.parse_page(path)
        page.frontmatter["confidence"] = "low"
        wiki_lib.write_page(page)
        page2 = wiki_lib.parse_page(path)
        assert page2.frontmatter["confidence"] == "low"


# ---------------------------------------------------------------------------
# parse_date
# ---------------------------------------------------------------------------


class TestParseDate:
    def test_iso_format(self, wiki_lib: Any) -> None:
        assert wiki_lib.parse_date("2026-04-10") == date(2026, 4, 10)

    def test_empty_string_returns_none(self, wiki_lib: Any) -> None:
        assert wiki_lib.parse_date("") is None

    def test_none_returns_none(self, wiki_lib: Any) -> None:
        assert wiki_lib.parse_date(None) is None

    def test_invalid_format_returns_none(self, wiki_lib: Any) -> None:
        assert wiki_lib.parse_date("not-a-date") is None
        assert wiki_lib.parse_date("2026/04/10") is None
        assert wiki_lib.parse_date("04-10-2026") is None

    def test_date_object_passthrough(self, wiki_lib: Any) -> None:
        d = date(2026, 4, 10)
        assert wiki_lib.parse_date(d) == d


# ---------------------------------------------------------------------------
# page_content_hash
# ---------------------------------------------------------------------------


class TestPageContentHash:
    def test_deterministic(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        path = make_page(tmp_wiki, "patterns/h.md", body="# Same body\n\nLine.\n")
        page = wiki_lib.parse_page(path)
        h1 = wiki_lib.page_content_hash(page)
        h2 = wiki_lib.page_content_hash(page)
        assert h1 == h2
        assert h1.startswith("sha256:")

    def test_different_bodies_yield_different_hashes(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        p1 = make_page(tmp_wiki, "patterns/a.md", body="# A\n\nAlpha.\n")
        p2 = make_page(tmp_wiki, "patterns/b.md", body="# B\n\nBeta.\n")
        h1 = wiki_lib.page_content_hash(wiki_lib.parse_page(p1))
        h2 = wiki_lib.page_content_hash(wiki_lib.parse_page(p2))
        assert h1 != h2

    def test_frontmatter_changes_dont_change_hash(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        """Hash is body-only so mechanical frontmatter fixes don't churn it."""
        path = make_page(tmp_wiki, "patterns/f.md", confidence="high")
        page = wiki_lib.parse_page(path)
        h1 = wiki_lib.page_content_hash(page)

        page.frontmatter["confidence"] = "medium"
        wiki_lib.write_page(page)
        page2 = wiki_lib.parse_page(path)
        h2 = wiki_lib.page_content_hash(page2)
        assert h1 == h2

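A minimal sketch consistent with these hashing assertions. The real helper takes a parsed page object; this hypothetical version takes the body string directly for brevity:

```python
import hashlib


def page_content_hash(body: str) -> str:
    """Hash the body only, so mechanical frontmatter edits never churn it."""
    return "sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Hashing the body rather than the whole file is what keeps frontmatter repairs from invalidating every page's "already checked" record in the hygiene state.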
# ---------------------------------------------------------------------------
# Iterators
# ---------------------------------------------------------------------------


class TestIterators:
    def test_iter_live_pages_finds_all_types(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/p1.md")
        make_page(tmp_wiki, "patterns/p2.md")
        make_page(tmp_wiki, "decisions/d1.md")
        make_page(tmp_wiki, "concepts/c1.md")
        make_page(tmp_wiki, "environments/e1.md")
        pages = wiki_lib.iter_live_pages()
        assert len(pages) == 5
        stems = {p.path.stem for p in pages}
        assert stems == {"p1", "p2", "d1", "c1", "e1"}

    def test_iter_live_pages_empty_wiki(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        assert wiki_lib.iter_live_pages() == []

    def test_iter_staging_pages(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        make_staging_page(tmp_wiki, "patterns/s1.md")
        make_staging_page(tmp_wiki, "decisions/s2.md", ptype="decision")
        pages = wiki_lib.iter_staging_pages()
        assert len(pages) == 2
        assert all(p.frontmatter.get("status") == "pending" for p in pages)

    def test_iter_archived_pages(self, wiki_lib: Any, tmp_wiki: Path) -> None:
        make_page(
            tmp_wiki,
            "archive/patterns/old.md",
            confidence="stale",
            extra_fm={
                "archived_date": "2026-01-01",
                "archived_reason": "test",
                "original_path": "patterns/old.md",
            },
        )
        pages = wiki_lib.iter_archived_pages()
        assert len(pages) == 1
        assert pages[0].frontmatter["archived_reason"] == "test"

    def test_iter_skips_malformed_pages(
        self, wiki_lib: Any, tmp_wiki: Path
    ) -> None:
        make_page(tmp_wiki, "patterns/good.md")
        (tmp_wiki / "patterns" / "no-fm.md").write_text("# Just a body\n")
        pages = wiki_lib.iter_live_pages()
        assert len(pages) == 1
        assert pages[0].path.stem == "good"


# ---------------------------------------------------------------------------
# WIKI_DIR env var override
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
class TestWikiDirEnvVar:
|
||||||
|
def test_honors_env_var(self, wiki_lib: Any, tmp_wiki: Path) -> None:
|
||||||
|
"""The tmp_wiki fixture sets WIKI_DIR — verify wiki_lib picks it up."""
|
||||||
|
assert wiki_lib.WIKI_DIR == tmp_wiki
|
||||||
|
assert wiki_lib.STAGING_DIR == tmp_wiki / "staging"
|
||||||
|
assert wiki_lib.ARCHIVE_DIR == tmp_wiki / "archive"
|
||||||
|
assert wiki_lib.INDEX_FILE == tmp_wiki / "index.md"
|
||||||
267
tests/test_wiki_staging.py
Normal file
@@ -0,0 +1,267 @@
"""Integration tests for scripts/wiki-staging.py."""

from __future__ import annotations

import json
from pathlib import Path
from typing import Any

import pytest

from conftest import make_page, make_staging_page


# ---------------------------------------------------------------------------
# List + page_summary
# ---------------------------------------------------------------------------


class TestListPending:
    def test_empty_staging(self, wiki_staging: Any, tmp_wiki: Path) -> None:
        assert wiki_staging.list_pending() == []

    def test_finds_pages_in_all_type_subdirs(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/p.md", ptype="pattern")
        make_staging_page(tmp_wiki, "decisions/d.md", ptype="decision")
        make_staging_page(tmp_wiki, "concepts/c.md", ptype="concept")
        pending = wiki_staging.list_pending()
        assert len(pending) == 3

    def test_skips_staging_index_md(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        (tmp_wiki / "staging" / "index.md").write_text(
            "---\ntitle: Index\n---\n# staging index\n"
        )
        make_staging_page(tmp_wiki, "patterns/real.md")
        pending = wiki_staging.list_pending()
        assert len(pending) == 1
        assert pending[0].path.stem == "real"

    def test_page_summary_populates_all_fields(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(
            tmp_wiki,
            "patterns/sample.md",
            title="Sample",
            staged_by="wiki-harvest",
            staged_date="2026-04-10",
            target_path="patterns/sample.md",
        )
        pending = wiki_staging.list_pending()
        summary = wiki_staging.page_summary(pending[0])
        assert summary["title"] == "Sample"
        assert summary["type"] == "pattern"
        assert summary["staged_by"] == "wiki-harvest"
        assert summary["target_path"] == "patterns/sample.md"
        assert summary["modifies"] is None


# ---------------------------------------------------------------------------
# Promote
# ---------------------------------------------------------------------------


class TestPromote:
    def test_moves_file_to_live(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/new.md", title="New Page")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "new.md")
        result = wiki_staging.promote(page)
        assert result is not None
        assert (tmp_wiki / "patterns" / "new.md").exists()
        assert not (tmp_wiki / "staging" / "patterns" / "new.md").exists()

    def test_strips_staging_only_fields(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/clean.md")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "clean.md")
        wiki_staging.promote(page)

        promoted = wiki_staging.parse_page(tmp_wiki / "patterns" / "clean.md")
        for field in ("status", "staged_date", "staged_by", "target_path", "compilation_notes"):
            assert field not in promoted.frontmatter

    def test_preserves_origin_automated(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/auto.md")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "auto.md")
        wiki_staging.promote(page)
        promoted = wiki_staging.parse_page(tmp_wiki / "patterns" / "auto.md")
        assert promoted.frontmatter["origin"] == "automated"

    def test_updates_main_index(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/indexed.md", title="Indexed Page")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "indexed.md")
        wiki_staging.promote(page)

        idx = (tmp_wiki / "index.md").read_text()
        assert "patterns/indexed.md" in idx

    def test_regenerates_staging_index(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/one.md")
        make_staging_page(tmp_wiki, "patterns/two.md")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "one.md")
        wiki_staging.promote(page)

        idx = (tmp_wiki / "staging" / "index.md").read_text()
        assert "two.md" in idx
        assert "1 pending" in idx

    def test_dry_run_does_not_move(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/safe.md")
        page = wiki_staging.parse_page(tmp_wiki / "staging" / "patterns" / "safe.md")
        wiki_staging.promote(page, dry_run=True)
        assert (tmp_wiki / "staging" / "patterns" / "safe.md").exists()
        assert not (tmp_wiki / "patterns" / "safe.md").exists()


# ---------------------------------------------------------------------------
# Promote with modifies field
# ---------------------------------------------------------------------------


class TestPromoteUpdate:
    def test_update_overwrites_existing_live_page(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        # Existing live page
        make_page(
            tmp_wiki,
            "patterns/existing.md",
            title="Old Title",
            last_compiled="2026-01-01",
        )
        # Staging update with `modifies`
        make_staging_page(
            tmp_wiki,
            "patterns/existing.md",
            title="New Title",
            modifies="patterns/existing.md",
            target_path="patterns/existing.md",
        )
        page = wiki_staging.parse_page(
            tmp_wiki / "staging" / "patterns" / "existing.md"
        )
        wiki_staging.promote(page)

        live = wiki_staging.parse_page(tmp_wiki / "patterns" / "existing.md")
        assert live.frontmatter["title"] == "New Title"


# ---------------------------------------------------------------------------
# Reject
# ---------------------------------------------------------------------------


class TestReject:
    def test_deletes_file(self, wiki_staging: Any, tmp_wiki: Path) -> None:
        path = make_staging_page(tmp_wiki, "patterns/bad.md")
        page = wiki_staging.parse_page(path)
        wiki_staging.reject(page, "duplicate")
        assert not path.exists()

    def test_records_rejection_in_harvest_state(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        # Create a raw harvested file with a source_url
        raw = tmp_wiki / "raw" / "harvested" / "example-com-test.md"
        raw.parent.mkdir(parents=True, exist_ok=True)
        raw.write_text(
            "---\n"
            "source_url: https://example.com/test\n"
            "fetched_date: 2026-04-10\n"
            "fetch_method: trafilatura\n"
            "discovered_in: conversations/mc/test.md\n"
            "content_hash: sha256:abc\n"
            "---\n"
            "# Example\n"
        )

        # Create a staging page that references it
        make_staging_page(tmp_wiki, "patterns/reject-me.md")
        staging_path = tmp_wiki / "staging" / "patterns" / "reject-me.md"
        # Inject sources so reject() finds the harvest_source
        page = wiki_staging.parse_page(staging_path)
        page.frontmatter["sources"] = ["raw/harvested/example-com-test.md"]
        wiki_staging.write_page(page)

        page = wiki_staging.parse_page(staging_path)
        wiki_staging.reject(page, "test rejection")

        state = json.loads((tmp_wiki / ".harvest-state.json").read_text())
        assert "https://example.com/test" in state["rejected_urls"]
        assert state["rejected_urls"]["https://example.com/test"]["reason"] == "test rejection"

    def test_reject_dry_run_keeps_file(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        path = make_staging_page(tmp_wiki, "patterns/kept.md")
        page = wiki_staging.parse_page(path)
        wiki_staging.reject(page, "test", dry_run=True)
        assert path.exists()


# ---------------------------------------------------------------------------
# Staging index regeneration
# ---------------------------------------------------------------------------


class TestStagingIndexRegen:
    def test_empty_index_shows_none(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        wiki_staging.regenerate_staging_index()
        idx = (tmp_wiki / "staging" / "index.md").read_text()
        assert "0 pending" in idx
        assert "No pending items" in idx

    def test_lists_pending_items(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/a.md", title="A")
        make_staging_page(tmp_wiki, "decisions/b.md", title="B", ptype="decision")
        wiki_staging.regenerate_staging_index()
        idx = (tmp_wiki / "staging" / "index.md").read_text()
        assert "2 pending" in idx
        assert "A" in idx and "B" in idx


# ---------------------------------------------------------------------------
# Path resolution
# ---------------------------------------------------------------------------


class TestResolvePage:
    def test_resolves_staging_relative_path(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/foo.md")
        page = wiki_staging.resolve_page("staging/patterns/foo.md")
        assert page is not None
        assert page.path.name == "foo.md"

    def test_returns_none_for_missing(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        assert wiki_staging.resolve_page("staging/patterns/does-not-exist.md") is None

    def test_resolves_bare_patterns_path_as_staging(
        self, wiki_staging: Any, tmp_wiki: Path
    ) -> None:
        make_staging_page(tmp_wiki, "patterns/bare.md")
        page = wiki_staging.resolve_page("patterns/bare.md")
        assert page is not None
        assert "staging" in str(page.path)