Architecture · Free the Beagles


Pipeline architecture

The data pipeline behind this site, in technical detail.

Data flow

Crawl DATCP — Python script walks the public DATCP index page and the 37 monthly buckets at cvicentral.com. Output: manifest.csv (171,686 rows across all species).
Reconcile — For each monthly bucket, compare row count against the server's stated total; re-fetch any missing pages. 37/37 reconciled exactly.
Download PDFs — One per canine CVI: 38,128 files / 7.4 GB. Filename-collision aware.
Tier 1 (Haiku 4.5) — Read every canine PDF; return verbatim consignor name/address/USDA license. Cached per shrine_id. 37,960 unique extractions, $158.
Match (Python) — Classify each Tier-1 extraction into target sets (Ridglan / Marshall / Envigo / Inotiv / Class B). Pure-code matching against verbatim fields — recall does not depend on model classification.
Tier 2 (Sonnet 4.6) — Deep structured extraction on matched candidates only. 294 extractions, $23.
Tier 2b (Sonnet 4.6, ditto-aware) — Re-extract Ridglan rows with a prompt that explicitly handles the form's vertical-line/ditto convention and pulls vaccinations + treatments + health declarations. 77 of 211 done so far, ~$7.
Tier 3 (Opus 4.7) — Independent re-read of every Ridglan candidate with a paraphrased prompt. Cross-tier agreement score computed against Tier 1 and Tier 2. 211 extractions, $21.
Load → Postgres — Every extraction is loaded into the extractions table with full provenance (run_id, model, prompt_version, tokens, cost, timestamp). The canonical view in shipments picks the best extraction per CVI.
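The Match step can be illustrated with a minimal sketch of pure-code classification over the verbatim Tier 1 fields. Everything specific here is an assumption made for illustration: the regular expressions, the Tier1Extraction field names (other than shrine_id), and the NN-B-NNNN license pattern used to flag Class B dealers are not taken from the actual pipeline.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns. The real target lists and name variants are not
# given in this document; these only illustrate the approach.
TARGET_PATTERNS = {
    "Ridglan": re.compile(r"\bridglan\b", re.IGNORECASE),
    "Marshall": re.compile(r"\bmarshall\b", re.IGNORECASE),
    "Envigo": re.compile(r"\benvigo\b", re.IGNORECASE),
    "Inotiv": re.compile(r"\binotiv\b", re.IGNORECASE),
}

# Assumed license shape for Class B dealers: two digits, "B", four digits.
CLASS_B_LICENSE = re.compile(r"\b\d{2}-B-\d{4}\b")


@dataclass(frozen=True)
class Tier1Extraction:
    shrine_id: str
    consignor_name: str
    consignor_address: str
    usda_license: str


def match_targets(ext: Tier1Extraction) -> set[str]:
    """Return every target set whose pattern appears in the verbatim fields."""
    haystack = f"{ext.consignor_name} {ext.consignor_address}"
    hits = {name for name, pat in TARGET_PATTERNS.items() if pat.search(haystack)}
    if CLASS_B_LICENSE.search(ext.usda_license.upper()):
        hits.add("Class B")
    return hits
```

Because the rules are ordinary code, widening a pattern and re-running the match over cached Tier 1 output costs nothing in model calls. A bare name pattern like the one for Marshall would also hit street names, so a real rule set would need to be stricter than this sketch.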

Why three (now four) tiers

Haiku is cheap enough to read every PDF in the corpus.
Putting matching logic in Python (not in the LLM) means recall doesn't bottleneck on classification quality.
Sonnet handles structured extraction well at ~$0.08 per CVI.
Opus on a paraphrased prompt gives an independent reading for cross-validation. Disagreements get flagged, not silently discarded.
Tier 2b corrects a specific failure mode (per-animal columns where the form uses a vertical "ditto" line for repeated values) and pulls additional fields the original prompt skipped.
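How the cross-tier agreement score is computed is not specified here. One plausible shape, shown purely as a sketch, is the fraction of shared fields on which two readings agree after light normalization, with the disagreeing field names returned so they can be flagged. The function names and the normalization rule are assumptions.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace so formatting noise is not a disagreement."""
    return " ".join((value or "").lower().split())


def agreement(
    a: dict[str, Optional[str]], b: dict[str, Optional[str]]
) -> tuple[float, list[str]]:
    """Compare two readings of one CVI over the fields both contain.

    Returns (score in [0, 1], sorted list of fields that disagree).
    """
    shared = sorted(a.keys() & b.keys())
    if not shared:
        return 0.0, []
    disagreements = [f for f in shared if normalize(a[f]) != normalize(b[f])]
    score = 1 - len(disagreements) / len(shared)
    return score, disagreements
```

Returning the list of disagreeing fields alongside the score keeps the disagreement visible for review instead of collapsing it into a single number.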

Provenance

Every extracted value in the database carries a foreign key to a specific extraction row, which in turn carries a foreign key to an extraction_run (model + prompt version + timestamp). Re-extraction is additive: new runs create new extraction rows, and older readings are preserved. The shipments view picks the "best available" extraction by phase priority (tier2b > tier2 > tier3 > tier1) at query time, so the picker can be swapped without re-running any extraction.
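The phase-priority rule can be sketched in a few lines of Python. The priority order is the one stated above; the row shape and the column names cvi_id and phase are hypothetical, and the real picker presumably lives in SQL rather than application code.

```python
# Priority order as stated above: earlier in the list wins.
PHASE_PRIORITY = ["tier2b", "tier2", "tier3", "tier1"]
RANK = {phase: i for i, phase in enumerate(PHASE_PRIORITY)}


def pick_best(extractions: list[dict]) -> dict[str, dict]:
    """Return the highest-priority extraction row for each CVI.

    Each row is assumed to carry "cvi_id" and "phase" keys. When two rows
    share a phase, the first one seen is kept; the tie-break rule is not
    specified in this document.
    """
    best: dict[str, dict] = {}
    for row in extractions:
        current = best.get(row["cvi_id"])
        if current is None or RANK[row["phase"]] < RANK[current["phase"]]:
            best[row["cvi_id"]] = row
    return best
```

Since nothing is overwritten at load time, changing the order of PHASE_PRIORITY changes which reading is shown without touching any stored extraction.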

Stack

Web: Next.js 16 · React 19 · Tailwind 4 · TypeScript
DB: PostgreSQL 16 (Docker container on droplet, port 5435 local-only)
Server: systemd + Caddy 2 (auto-TLS via Let's Encrypt)
Extraction: Python 3.14 · Anthropic SDK · async (concurrency 5–16)
Exports: exceljs for XLSX, pdfkit for PDF
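A common way to cap async concurrency in the 5–16 range is an asyncio.Semaphore around each call. The sketch below assumes that approach; it is not confirmed to be what the extraction scripts do, and run_bounded and worker are hypothetical names, with worker standing in for whatever coroutine sends one PDF to the model.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    concurrency: int = 5,
) -> list[R]:
    """Run worker over items with at most `concurrency` calls in flight.

    Results come back in the same order as the input items.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```

A script would call this as asyncio.run(run_bounded(pdf_paths, extract_one, concurrency=8)), where pdf_paths and extract_one are placeholders for the real inputs.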

For prose-friendly framing, see /methodology. Live data-quality stats are maintained internally by Dane4Dogs.