Initial commit — memex

A compounding LLM-maintained knowledge wiki.

Synthesis of Andrej Karpathy's persistent-wiki gist and milla-jovovich's
mempalace, with an automation layer on top for conversation mining, URL
harvesting, human-in-the-loop staging, staleness decay, and hygiene.

Includes:
- 11 pipeline scripts (extract, summarize, index, harvest, stage,
  hygiene, maintain, sync, + shared library)
- Full docs: README, SETUP, ARCHITECTURE, DESIGN-RATIONALE, CUSTOMIZE
- Example CLAUDE.md files (wiki schema + global instructions) tuned for
  the three-collection qmd setup
- 171-test pytest suite (cross-platform, runs in ~1.3s)
- MIT licensed
Eric Turner
2026-04-12 21:16:02 -06:00
commit ee54a2f5d4
31 changed files with 10792 additions and 0 deletions

@@ -0,0 +1,278 @@
# LLM Wiki — Schema
This is a persistent, compounding knowledge base maintained by LLM agents.
It captures the **why** behind patterns, decisions, and implementations —
not just the what. Copy this file to the root of your wiki directory
(e.g. `~/projects/wiki/CLAUDE.md`) and edit it for your own conventions.
> This is an example `CLAUDE.md` for the wiki root. The agent reads this
> at the start of every session when working inside the wiki. It's the
> "constitution" that tells the agent how to maintain the knowledge base.
## How This Wiki Works
**You are the maintainer.** When working in this wiki directory, you read
raw sources, compile knowledge into wiki pages, maintain cross-references,
and keep everything consistent.
**You are a consumer.** When working in any other project directory, you
read wiki pages to inform your work — applying established patterns,
respecting decisions, and understanding context.
## Directory Structure
```
wiki/
├── CLAUDE.md ← You are here (schema)
├── index.md ← Content catalog (search with qmd first; read this to browse the full catalog)
├── log.md ← Chronological record of all operations
├── patterns/ ← LIVE: HOW things should be built (with WHY)
├── decisions/ ← LIVE: WHY we chose this approach (with alternatives rejected)
├── environments/ ← LIVE: WHERE implementations differ
├── concepts/ ← LIVE: WHAT the foundational ideas are
├── raw/ ← Immutable source material (NEVER modify)
│ └── harvested/ ← URL harvester output
├── staging/ ← PENDING automated content awaiting human review
│ ├── index.md
│ └── <type>/
├── archive/ ← STALE / superseded (excluded from default search)
│ ├── index.md
│ └── <type>/
├── conversations/ ← Mined Claude Code session transcripts
│ ├── index.md
│ └── <wing>/ ← per-project or per-person (MemPalace "wing")
├── context/ ← Auto-updated AI session briefing
│ ├── wake-up.md ← Loaded at the start of every session
│ └── active-concerns.md
├── reports/ ← Hygiene operation logs
└── scripts/ ← The automation pipeline
```
**Core rule — automated vs manual content**:
| Origin | Destination | Status |
|--------|-------------|--------|
| Script-generated (harvester, hygiene, URL compile) | `staging/` | `pending` |
| Human-initiated ("add this to the wiki" in a Claude session) | Live wiki (`patterns/`, etc.) | `verified` |
| Human-reviewed from staging | Live wiki (promoted) | `verified` |
Staged pages are managed with `scripts/wiki-staging.py`, using `--list`, `--promote`, `--reject`, or `--review`.
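
The routing rule in the table above can be sketched as a small function. This is a minimal illustration, not the code in `scripts/`: the names `route`, `Destination`, and `LIVE_DIRS` are hypothetical, and it assumes the `<type>` subdirectory under `staging/` mirrors the live directory name.

```python
from dataclasses import dataclass

# Hypothetical mapping from the frontmatter `type` value to its live directory.
LIVE_DIRS = {
    "pattern": "patterns",
    "decision": "decisions",
    "environment": "environments",
    "concept": "concepts",
}


@dataclass(frozen=True)
class Destination:
    directory: str
    status: str


def route(page_type: str, script_generated: bool) -> Destination:
    """Apply the core rule: scripts write to staging/, humans write live."""
    subdir = LIVE_DIRS[page_type]
    if script_generated:
        # Assumption: staging/<type>/ uses the same name as the live directory.
        return Destination(f"staging/{subdir}", "pending")
    return Destination(subdir, "verified")
```

For example, `route("pattern", script_generated=True)` yields a `staging/patterns` destination with status `pending`, while a human-initiated decision page goes straight to `decisions` as `verified`.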
## Page Conventions
### Frontmatter (required on all wiki pages)
```yaml
---
title: Page Title
type: pattern | decision | environment | concept
confidence: high | medium | low
origin: manual | automated # How the page entered the wiki
sources: [list of raw/ files this was compiled from]
related: [list of other wiki pages this connects to]
last_compiled: YYYY-MM-DD # Date this page was last (re)compiled from sources
last_verified: YYYY-MM-DD # Date the content was last confirmed accurate
---
```
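As a rough sketch of what checking this frontmatter might look like, the function below validates an already-parsed frontmatter mapping. The function name and the error strings are hypothetical, and the allowed values are taken only from the block above; the real repair logic lives in `scripts/wiki-hygiene.py` and may differ.

```python
# Field names and allowed values copied from the frontmatter block above.
REQUIRED_FIELDS = (
    "title", "type", "confidence", "origin",
    "sources", "related", "last_compiled", "last_verified",
)
ALLOWED_VALUES = {
    "type": ("pattern", "decision", "environment", "concept"),
    "confidence": ("high", "medium", "low"),
    "origin": ("manual", "automated"),
}


def frontmatter_problems(frontmatter: dict) -> list[str]:
    """Return human-readable problems; an empty list means the page passes."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in frontmatter
    ]
    for name, allowed in ALLOWED_VALUES.items():
        value = frontmatter.get(name)
        if value is not None and value not in allowed:
            problems.append(f"invalid {name}: {value!r}")
    return problems
```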
**`origin` values**:
- `manual` — Created by a human in a Claude session. Goes directly to the live wiki, no staging.
- `automated` — Created by a script (harvester, hygiene, etc.). Must pass through `staging/` for human review before promotion.
**Confidence decay**: A page with no refresh signal decays from `high` to `medium` after 6 months, to `low` after 9 months, and to `stale` (auto-archived) after 12 months. Decay is measured from `last_verified`, not `last_compiled`. See `scripts/wiki-hygiene.py` and `archive/index.md`.
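The decay thresholds can be illustrated with a short sketch. It assumes whole calendar months counted from `last_verified` and assumes decay never raises a page's confidence; the helper names are hypothetical and the actual rules in `scripts/wiki-hygiene.py` may count time differently.

```python
from datetime import date

# Ordered from strongest to weakest; "stale" pages are the ones auto-archived.
LEVELS = ["high", "medium", "low", "stale"]


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months elapsed between two dates (never negative)."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not complete yet
    return max(months, 0)


def decayed_confidence(current: str, last_verified: date, today: date) -> str:
    """Apply the 6/9/12-month thresholds without ever raising confidence."""
    age = months_between(last_verified, today)
    if age >= 12:
        ceiling = "stale"
    elif age >= 9:
        ceiling = "low"
    elif age >= 6:
        ceiling = "medium"
    else:
        ceiling = "high"
    # Pick whichever of the two levels is weaker (further down LEVELS).
    return max(current, ceiling, key=LEVELS.index)
```

A page already at `low` stays `low` at seven months, because the sketch only ever moves confidence downward.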
### Staging Frontmatter (pages in `staging/<type>/`)
Automated-origin pages get additional staging metadata that is **stripped on promotion**:
```yaml
---
title: ...
type: ...
origin: automated
status: pending # Awaiting review
staged_date: YYYY-MM-DD # When the automated script staged this
staged_by: wiki-harvest # Which script staged it (wiki-harvest, wiki-hygiene, ...)
target_path: patterns/foo.md # Where it should land on promotion
modifies: patterns/bar.md # Only present when this is an update to an existing live page
compilation_notes: "..." # AI's explanation of what it did and why
harvest_source: https://... # Only present for URL-harvested content
sources: [...]
related: [...]
last_verified: YYYY-MM-DD
---
```
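Stripping the staging metadata on promotion might look roughly like the sketch below. Which keys count as staging-only is an assumption here (the fields in the block above that do not appear in the regular page frontmatter), and `promote_frontmatter` is a hypothetical name, not the function in `scripts/wiki-staging.py`.

```python
# Assumption: these are the staging-only keys removed on promotion.
STAGING_ONLY_KEYS = (
    "status", "staged_date", "staged_by", "target_path",
    "modifies", "compilation_notes", "harvest_source",
)


def promote_frontmatter(staged: dict) -> tuple[str, dict]:
    """Return (target path, live frontmatter) for a staged page."""
    target = staged["target_path"]
    live = {
        key: value
        for key, value in staged.items()
        if key not in STAGING_ONLY_KEYS
    }
    return target, live
```

The input mapping is left untouched, so a rejected promotion can simply discard the result.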
### Pattern Pages (`patterns/`)
Structure:
1. **What** — One-paragraph description of the pattern
2. **Why** — The reasoning, constraints, and goals that led to this pattern
3. **Canonical Example** — A concrete implementation (link to raw/ source or inline)
4. **Structure** — The specification: fields, endpoints, formats, conventions
5. **When to Deviate** — Known exceptions or conditions where the pattern doesn't apply
6. **History** — Key changes and the decisions that drove them
### Decision Pages (`decisions/`)
Structure:
1. **Decision** — One sentence: what we decided
2. **Context** — What problem or constraint prompted this
3. **Options Considered** — What alternatives existed (with pros/cons)
4. **Rationale** — Why this option won
5. **Consequences** — What this decision enables and constrains
6. **Status** — Active | Superseded by [link] | Under Review
### Environment Pages (`environments/`)
Structure:
1. **Overview** — What this environment is (platform, CI, infra)
2. **Key Differences** — Table comparing environments for this domain
3. **Implementation Details** — Environment-specific configs, credentials, deploy method
4. **Gotchas** — Things that have bitten us
### Concept Pages (`concepts/`)
Structure:
1. **Definition** — What this concept means in our context
2. **Why It Matters** — How this concept shapes our decisions
3. **Related Patterns** — Links to patterns that implement this concept
4. **Related Decisions** — Links to decisions driven by this concept
## Operations
### Ingest (adding new knowledge)
When a new raw source is added or you learn something new:
1. Read the source material thoroughly
2. Identify which existing wiki pages need updating
3. Identify if new pages are needed
4. Update/create pages following the conventions above
5. Update cross-references (`related:` frontmatter) on all affected pages
6. Update `index.md` with any new pages
7. Set `last_verified:` to today's date on every page you create or update
8. Set `origin: manual` on any page you create at a human's direction
9. Append to `log.md`: `## [YYYY-MM-DD] ingest | Source Description`
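
The log entry format in step 9 is simple enough to generate mechanically. A minimal sketch, with `log_heading` as a hypothetical helper name:

```python
from datetime import date


def log_heading(operation: str, description: str, when: date) -> str:
    """Build a log.md heading such as '## [2026-04-12] ingest | Some source'."""
    return f"## [{when.isoformat()}] {operation} | {description}"
```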
**Where to write**:
- **Human-initiated** ("add this to the wiki", "create a pattern for X") — write directly to the live directory (`patterns/`, `decisions/`, etc.) with `origin: manual`. The human's instruction IS the approval.
- **Script-initiated** (harvest, auto-compile, hygiene auto-fix) — write to `staging/<type>/` with `origin: automated`, `status: pending`, plus `staged_date`, `staged_by`, `target_path`, and `compilation_notes`. For updates to existing live pages, also set `modifies: <live-page-path>`.
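
For the script-initiated case, the required staging fields could be assembled as in the sketch below. The function name and argument names are hypothetical; only the frontmatter keys come from this schema.

```python
from datetime import date
from typing import Optional


def staging_frontmatter(
    title: str,
    page_type: str,
    staged_by: str,
    target_path: str,
    compilation_notes: str,
    today: date,
    modifies: Optional[str] = None,
) -> dict:
    """Frontmatter for an automated page headed to staging/<type>/."""
    frontmatter = {
        "title": title,
        "type": page_type,
        "origin": "automated",
        "status": "pending",
        "staged_date": today.isoformat(),
        "staged_by": staged_by,
        "target_path": target_path,
        "compilation_notes": compilation_notes,
    }
    if modifies is not None:
        # Only present when the staged page updates an existing live page.
        frontmatter["modifies"] = modifies
    return frontmatter
```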
### Query (answering questions from other projects)
When working in another project and consulting the wiki:
1. Use `qmd` to search first (see Search Strategy below). Read `index.md` only when browsing the full catalog.
2. Read the specific pattern/decision/concept pages
3. Apply the knowledge, respecting environment differences
4. If a page's `confidence` is `low`, flag that to the user — the content may be aging out
5. If a page has `status: pending` (it's in `staging/`), flag that to the user: "Note: this is from a pending wiki page in staging, not yet verified." Use the content but make the uncertainty visible.
6. If you find yourself consulting a page under `archive/`, mention it's archived and may be outdated
7. If your work reveals new knowledge, **file it back** — update the wiki (and bump `last_verified`)
### Search Strategy — which qmd collection to use
The wiki has three qmd collections. Pick the right one for the question:
| Question type | Collection | Command |
|---|---|---|
| "What's our current pattern for X?" | `wiki` (default) | `qmd search "X" --json -n 5` |
| "What's the rationale behind decision Y?" | `wiki` (default) | `qmd vsearch "why did we choose Y" --json -n 5` |
| "What was our OLD approach before we changed it?" | `wiki-archive` | `qmd search "X" -c wiki-archive --json -n 5` |
| "When did we discuss this, and what did we decide?" | `wiki-conversations` | `qmd search "X" -c wiki-conversations --json -n 5` |
| "Find everything across time" | all three | `qmd search "X" -c wiki -c wiki-archive -c wiki-conversations --json -n 10` |
**Rules of thumb**:
- Use `qmd search` for keyword matches (BM25, fast)
- Use `qmd vsearch` for conceptual / semantically-similar queries (vector)
- Use `qmd query` for the best quality — hybrid BM25 + vector + LLM re-ranking
- Always use `--json` for structured output
- Read individual matched pages with `cat` or your file tool after finding them
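
The command shapes in the table can be composed programmatically. This sketch only builds the argument list, using the flags exactly as they appear above (`-c`, `--json`, `-n`); the helper name is hypothetical, and it assumes `vsearch` accepts `-c` the same way `search` does.

```python
def qmd_command(query, collections=(), mode="search", limit=5):
    """Build an argv list for the qmd invocations shown in the table."""
    if mode not in ("search", "vsearch"):
        raise ValueError(f"unsupported mode: {mode}")
    argv = ["qmd", mode, query]
    for name in collections:
        # An empty collection list means the default `wiki` collection.
        argv += ["-c", name]
    argv += ["--json", "-n", str(limit)]
    return argv
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)` and the output parsed as JSON.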
### Mine (conversation extraction and summarization)
Four-phase pipeline that extracts sessions into searchable conversation pages:
1. **Extract** (`extract-sessions.py`) — Parse session files into markdown transcripts
2. **Summarize** (`summarize-conversations.py --claude`) — Classify + summarize via `claude -p` with haiku/sonnet routing
3. **Index** (`update-conversation-index.py --reindex`) — Regenerate conversation index + `context/wake-up.md`
4. **Harvest** (`wiki-harvest.py`) — Scan summarized conversations for external reference URLs and compile them into wiki pages
Run the full pipeline with `mine-conversations.sh`. Extraction is incremental (it tracks byte offsets), and so is summarization (it tracks message count).
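Byte-offset tracking is a common way to make extraction incremental. The sketch below shows the general idea under stated assumptions: a JSON state file mapping each session file to the last offset read. The state format and the function name are hypothetical and not taken from `extract-sessions.py`.

```python
import json
from pathlib import Path


def read_new_bytes(session_file: Path, state_file: Path) -> bytes:
    """Return only the bytes appended since the previous call."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    key = str(session_file)
    offset = state.get(key, 0)
    with session_file.open("rb") as handle:
        handle.seek(offset)      # skip everything already processed
        new_data = handle.read()
    state[key] = offset + len(new_data)
    state_file.write_text(json.dumps(state))
    return new_data
```

Calling it twice in a row on an unchanged file returns an empty result the second time, which is what lets repeated runs stay cheap.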
### Maintain (wiki health automation)
`scripts/wiki-maintain.sh` chains harvest + hygiene + qmd reindex:
```bash
bash scripts/wiki-maintain.sh # Harvest + quick hygiene + reindex
bash scripts/wiki-maintain.sh --full # Harvest + full hygiene (LLM) + reindex
bash scripts/wiki-maintain.sh --harvest-only # Harvest only
bash scripts/wiki-maintain.sh --hygiene-only # Hygiene only
bash scripts/wiki-maintain.sh --dry-run # Show what would run
```
### Lint (periodic health check)
Automated via `scripts/wiki-hygiene.py`. Two tiers:
**Quick mode** (no LLM, run daily — `python3 scripts/wiki-hygiene.py`):
- Backfill missing `last_verified`
- Refresh `last_verified` from conversation `related:` references
- Auto-restore archived pages that are referenced again
- Repair frontmatter (missing required fields, invalid values)
- Confidence decay per 6/9/12-month thresholds
- Archive stale and superseded pages
- Orphan pages (auto-linked into `index.md`)
- Broken cross-references (fuzzy-match fix via `difflib`, or restore from archive)
- Main index drift (auto add missing entries, remove stale ones)
- Empty stubs (report-only)
- State file drift (report-only)
- Staging/archive index resync
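
The fuzzy-match repair for broken cross-references can be approximated with `difflib.get_close_matches` from the standard library. This is an illustrative sketch only: the function name and the `0.8` cutoff are assumptions, not values taken from `scripts/wiki-hygiene.py`.

```python
import difflib


def suggest_fix(broken_ref, known_pages, cutoff=0.8):
    """Return the closest existing page path, or None if nothing is close."""
    matches = difflib.get_close_matches(broken_ref, known_pages, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A reference like `patterns/secret-at-startup.md` would resolve to `patterns/secrets-at-startup.md` if that page exists, while an unrelated name returns `None` and can be left for the needs-review report.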
**Full mode** (LLM, run weekly — `python3 scripts/wiki-hygiene.py --full`):
- Everything in quick mode, plus:
- Missing cross-references between related pages (haiku)
- Duplicate coverage — weaker page auto-merged into stronger (sonnet)
- Contradictions between pages (sonnet, report-only)
- Technology lifecycle — flag pages referencing versions older than what's in recent conversations
**Reports** (written to `reports/`):
- `hygiene-YYYY-MM-DD-fixed.md` — what was auto-fixed
- `hygiene-YYYY-MM-DD-needs-review.md` — what needs human judgment
## Cross-Reference Conventions
- Link between wiki pages using relative markdown links: `[Pattern Name](../patterns/file.md)`
- Link to raw sources: `[Source](../raw/path/to/file.md)`
- In frontmatter `related:` use the relative filename: `patterns/secrets-at-startup.md`
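
Relative links between pages can be computed rather than typed by hand. A small sketch, assuming both paths are given relative to the wiki root in the same form used by `related:`; the function name is hypothetical.

```python
import posixpath


def relative_link(title: str, from_page: str, to_page: str) -> str:
    """Markdown link from one wiki page to another, relative to the source page."""
    start = posixpath.dirname(from_page)
    target = posixpath.relpath(to_page, start=start)
    return f"[{title}]({target})"
```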
## Naming Conventions
- Filenames: `kebab-case.md`
- Patterns: named by what they standardize (e.g., `health-endpoints.md`, `secrets-at-startup.md`)
- Decisions: named by what was decided (e.g., `no-alpine.md`, `dhi-base-images.md`)
- Environments: named by domain (e.g., `docker-registries.md`, `ci-cd-platforms.md`)
- Concepts: named by the concept (e.g., `two-user-database-model.md`, `build-once-deploy-many.md`)
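
A `kebab-case.md` filename can be derived from a page title with a short helper. This is a sketch with a hypothetical function name; it handles ASCII titles only.

```python
import re


def kebab_filename(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{slug}.md"
```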
## Customization Notes
Things you should change for your own wiki:
1. **Directory structure** — the four live dirs (`patterns/`, `decisions/`, `concepts/`, `environments/`) reflect engineering use cases. Pick categories that match how you think — research wikis might use `findings/`, `hypotheses/`, `methods/`, `literature/` instead. Update `LIVE_CONTENT_DIRS` in `scripts/wiki_lib.py` to match.
2. **Page-type sections** — the "Structure" blocks under each page type are for my use. Define your own conventions.
3. **`status` field** — if you want to track Superseded/Active/Under Review explicitly, this is a natural add. The hygiene script already checks for `status: Superseded by ...` and archives those automatically.
4. **Environment Detection** — if you don't have multiple environments, remove the section. If you do, update it for your own environments (work/home, dev/prod, mac/linux, etc.).
5. **Cross-reference path format** — I use `patterns/foo.md` in the `related:` field. Obsidian users might prefer `[[foo]]` wikilink format. The hygiene script handles standard markdown links; adapt as needed.
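
For the Obsidian case in item 5, converting a `related:` path to a wikilink is a one-line transformation. A minimal sketch with a hypothetical helper name; it drops the directory and extension, so it assumes page filenames are unique across the wiki.

```python
from pathlib import PurePosixPath


def to_wikilink(related_path: str) -> str:
    """Turn 'patterns/foo.md' into '[[foo]]'."""
    return f"[[{PurePosixPath(related_path).stem}]]"
```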