Initial commit — memex

A compounding LLM-maintained knowledge wiki. Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's mempalace, with an automation layer on top for conversation mining, URL harvesting, human-in-the-loop staging, staleness decay, and hygiene. Includes: - 11 pipeline scripts (extract, summarize, index, harvest, stage, hygiene, maintain, sync, + shared library) - Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE - Example CLAUDE.md files (wiki schema + global instructions) tuned for the three-collection qmd setup - 171-test pytest suite (cross-platform, runs in ~1.3s) - MIT licensed
2026-04-12 21:16:02 -06:00
commit ee54a2f5d4
31 changed files with 10792 additions and 0 deletions
--- a/docs/examples/wiki-CLAUDE.md
+++ b/docs/examples/wiki-CLAUDE.md
@@ -0,0 +1,278 @@
+# LLM Wiki — Schema
+
+This is a persistent, compounding knowledge base maintained by LLM agents.
+It captures the **why** behind patterns, decisions, and implementations —
+not just the what. Copy this file to the root of your wiki directory
+(i.e. `~/projects/wiki/CLAUDE.md`) and edit for your own conventions.
+
+> This is an example `CLAUDE.md` for the wiki root. The agent reads this
+> at the start of every session when working inside the wiki. It's the
+> "constitution" that tells the agent how to maintain the knowledge base.
+
+## How This Wiki Works
+
+**You are the maintainer.** When working in this wiki directory, you read
+raw sources, compile knowledge into wiki pages, maintain cross-references,
+and keep everything consistent.
+
+**You are a consumer.** When working in any other project directory, you
+read wiki pages to inform your work — applying established patterns,
+respecting decisions, and understanding context.
+
+## Directory Structure
+
+```
+wiki/
+├── CLAUDE.md              ← You are here (schema)
+├── index.md               ← Content catalog — read this FIRST on any query
+├── log.md                 ← Chronological record of all operations
+│
+├── patterns/              ← LIVE: HOW things should be built (with WHY)
+├── decisions/             ← LIVE: WHY we chose this approach (with alternatives rejected)
+├── environments/          ← LIVE: WHERE implementations differ
+├── concepts/              ← LIVE: WHAT the foundational ideas are
+│
+├── raw/                   ← Immutable source material (NEVER modify)
+│   └── harvested/         ← URL harvester output
+│
+├── staging/               ← PENDING automated content awaiting human review
+│   ├── index.md
+│   └── <type>/
+│
+├── archive/               ← STALE / superseded (excluded from default search)
+│   ├── index.md
+│   └── <type>/
+│
+├── conversations/         ← Mined Claude Code session transcripts
+│   ├── index.md
+│   └── <wing>/            ← per-project or per-person (MemPalace "wing")
+│
+├── context/               ← Auto-updated AI session briefing
+│   ├── wake-up.md         ← Loaded at the start of every session
+│   └── active-concerns.md
+│
+├── reports/               ← Hygiene operation logs
+└── scripts/               ← The automation pipeline
+```
+
+**Core rule — automated vs manual content**:
+
+| Origin | Destination | Status |
+|--------|-------------|--------|
+| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
+| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
+| Human-reviewed from staging | Live wiki (promoted) | `verified` |
+
+Managed via `scripts/wiki-staging.py --list / --promote / --reject / --review`.
+
+## Page Conventions
+
+### Frontmatter (required on all wiki pages)
+
+```yaml
+---
+title: Page Title
+type: pattern | decision | environment | concept
+confidence: high | medium | low
+origin: manual | automated    # How the page entered the wiki
+sources: [list of raw/ files this was compiled from]
+related: [list of other wiki pages this connects to]
+last_compiled: YYYY-MM-DD     # Date this page was last (re)compiled from sources
+last_verified: YYYY-MM-DD     # Date the content was last confirmed accurate
+---
+```
+
+**`origin` values**:
+- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
+- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.
+
+**Confidence decay**: Pages with no refresh signal for 6 months decay `high → medium`; 9 months → `low`; 12 months → `stale` (auto-archived). `last_verified` drives decay, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
+
+### Staging Frontmatter (pages in `staging/<type>/`)
+
+Automated-origin pages get additional staging metadata that is **stripped on promotion**:
+
+```yaml
+---
+title: ...
+type: ...
+origin: automated
+status: pending              # Awaiting review
+staged_date: YYYY-MM-DD      # When the automated script staged this
+staged_by: wiki-harvest      # Which script staged it (wiki-harvest, wiki-hygiene, ...)
+target_path: patterns/foo.md # Where it should land on promotion
+modifies: patterns/bar.md    # Only present when this is an update to an existing live page
+compilation_notes: "..."     # AI's explanation of what it did and why
+harvest_source: https://...  # Only present for URL-harvested content
+sources: [...]
+related: [...]
+last_verified: YYYY-MM-DD
+---
+```
+
+### Pattern Pages (`patterns/`)
+
+Structure:
+1. **What** — One-paragraph description of the pattern
+2. **Why** — The reasoning, constraints, and goals that led to this pattern
+3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
+4. **Structure** — The specification: fields, endpoints, formats, conventions
+5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
+6. **History** — Key changes and the decisions that drove them
+
+### Decision Pages (`decisions/`)
+
+Structure:
+1. **Decision** — One sentence: what we decided
+2. **Context** — What problem or constraint prompted this
+3. **Options Considered** — What alternatives existed (with pros/cons)
+4. **Rationale** — Why this option won
+5. **Consequences** — What this decision enables and constrains
+6. **Status** — Active | Superseded by [link] | Under Review
+
+### Environment Pages (`environments/`)
+
+Structure:
+1. **Overview** — What this environment is (platform, CI, infra)
+2. **Key Differences** — Table comparing environments for this domain
+3. **Implementation Details** — Environment-specific configs, credentials, deploy method
+4. **Gotchas** — Things that have bitten us
+
+### Concept Pages (`concepts/`)
+
+Structure:
+1. **Definition** — What this concept means in our context
+2. **Why It Matters** — How this concept shapes our decisions
+3. **Related Patterns** — Links to patterns that implement this concept
+4. **Related Decisions** — Links to decisions driven by this concept
+
+## Operations
+
+### Ingest (adding new knowledge)
+
+When a new raw source is added or you learn something new:
+
+1. Read the source material thoroughly
+2. Identify which existing wiki pages need updating
+3. Identify if new pages are needed
+4. Update/create pages following the conventions above
+5. Update cross-references (`related:` frontmatter) on all affected pages
+6. Update `index.md` with any new pages
+7. Set `last_verified:` to today's date on every page you create or update
+8. Set `origin: manual` on any page you create when a human directed you to
+9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`
+
+**Where to write**:
+- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
+- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.
+
+### Query (answering questions from other projects)
+
+When working in another project and consulting the wiki:
+
+1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
+2. Read the specific pattern/decision/concept pages
+3. Apply the knowledge, respecting environment differences
+4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
+5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
+6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
+7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)
+
+### Search Strategy — which qmd collection to use
+
+The wiki has three qmd collections. Pick the right one for the question:
+
+| Question type | Collection | Command |
+|---|---|---|
+| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
+| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
+| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
+| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
+| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |
+
+**Rules of thumb**:
+- Use `qmd search` for keyword matches (BM25, fast)
+- Use `qmd vsearch` for conceptual / semantically-similar queries (vector)
+- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
+- Always use `--json` for structured output
+- Read individual matched pages with `cat` or your file tool after finding them
+
+### Mine (conversation extraction and summarization)
+
+Four-phase pipeline that extracts sessions into searchable conversation pages:
+
+1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
+2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
+3. **Index** (`update-conversation-index.py --reindex`) — Regenerate conversation index + `context/wake-up.md`
+4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages
+
+Full pipeline via `mine-conversations.sh`. Extraction is incremental (tracks byte offsets). Summarization is incremental (tracks message count).
+
+### Maintain (wiki health automation)
+
+`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:
+
+```bash
+bash scripts/wiki-maintain.sh                 # Harvest + quick hygiene + reindex
+bash scripts/wiki-maintain.sh --full          # Harvest + full hygiene (LLM) + reindex
+bash scripts/wiki-maintain.sh --harvest-only  # Harvest only
+bash scripts/wiki-maintain.sh --hygiene-only  # Hygiene only
+bash scripts/wiki-maintain.sh --dry-run       # Show what would run
+```
+
+### Lint (periodic health check)
+
+Automated via `scripts/wiki-hygiene.py`. Two tiers:
+
+**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):
+- Backfill missing `last_verified`
+- Refresh `last_verified` from conversation `related:` references
+- Auto-restore archived pages that are referenced again
+- Repair frontmatter (missing required fields, invalid values)
+- Confidence decay per 6/9/12-month thresholds
+- Archive stale and superseded pages
+- Orphan pages (auto-linked into `index.md`)
+- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
+- Main index drift (auto add missing entries, remove stale ones)
+- Empty stubs (report-only)
+- State file drift (report-only)
+- Staging/archive index resync
+
+**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):
+- Everything in quick mode, plus:
+- Missing cross-references between related pages (haiku)
+- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
+- Contradictions between pages (sonnet, report-only)
+- Technology lifecycle — flag pages referencing versions older than what's in recent conversations
+
+**Reports** (written to `reports/`):
+- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
+- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment
+
+## Cross-Reference Conventions
+
+- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
+- Link to raw sources: `[Source](../raw/path/to/file.md)`
+- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`
+
+## Naming Conventions
+
+- Filenames: `kebab-case.md`
+- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
+- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
+- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
+- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)
+
+## Customization Notes
+
+Things you should change for your own wiki:
+
+1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.
+
+2. **Page page-type sections** — the "Structure" blocks under each page type are for my use. Define your own conventions.
+
+3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.
+
+4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).
+
+5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.