Pipeline architecture
The data pipeline behind this site, in technical detail.
Data flow
- Crawl DATCP — Python script walks the public DATCP index page and the 37 monthly buckets at cvicentral.com. Output: manifest.csv (171,686 rows across all species).
- Reconcile — For each monthly bucket, compare row count against the server's stated total; re-fetch any missing pages. 37/37 reconciled exactly.
- Download PDFs — One per canine CVI: 38,128 files / 7.4 GB. Filename-collision aware.
- Tier 1 (Haiku 4.5) — Read every canine PDF; return verbatim consignor name/address/USDA license. Cached per shrine_id. 37,960 unique extractions, $158.
- Match (Python) — Classify each Tier-1 extraction into target sets (Ridglan / Marshall / Envigo / Inotiv / Class B). Pure-code matching against verbatim fields — recall does not depend on model classification.
- Tier 2 (Sonnet 4.6) — Deep structured extraction on matched candidates only. 294 extractions, $23.
- Tier 2b (Sonnet 4.6, ditto-aware) — Re-extract Ridglan rows with a prompt that explicitly handles the form's vertical-line/ditto convention and pulls vaccinations + treatments + health declarations. 77 of 211 done so far, ~$7.
- Tier 3 (Opus 4.7) — Independent re-read of every Ridglan candidate with a paraphrased prompt. Cross-tier agreement score computed against Tier 1 and Tier 2. 211 extractions, $21.
- Load → Postgres — All extractions are loaded into the extractions table with full provenance (run_id, model, prompt_version, tokens, cost, timestamp). The canonical view in shipments picks the best extraction per CVI.
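The Match step can be sketched as plain string rules over the verbatim Tier-1 fields. This is a minimal illustration, not the production matcher: the regex patterns, the Tier1Extraction class, the match_targets function, and the assumed "NN-B-NNNN" shape of a Class B license number are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical name patterns; the real target definitions are not shown here.
NAME_PATTERNS = {
    "Ridglan": re.compile(r"\bridglan\b", re.IGNORECASE),
    "Marshall": re.compile(r"\bmarshall\b", re.IGNORECASE),
    "Envigo": re.compile(r"\benvigo\b", re.IGNORECASE),
    "Inotiv": re.compile(r"\binotiv\b", re.IGNORECASE),
}

# Assumption: a Class B license number looks like "NN-B-NNNN".
CLASS_B_LICENSE = re.compile(r"\b\d{2}-B-\d{4}\b", re.IGNORECASE)


@dataclass
class Tier1Extraction:
    consignor_name: str
    consignor_address: str
    usda_license: str


def match_targets(ex: Tier1Extraction) -> set[str]:
    """Return every target set whose rule fires on the verbatim fields."""
    text = f"{ex.consignor_name} {ex.consignor_address}"
    hits = {label for label, pattern in NAME_PATTERNS.items() if pattern.search(text)}
    if CLASS_B_LICENSE.search(ex.usda_license):
        hits.add("Class B")
    return hits
```

With rules of this kind, widening a pattern and re-running it over the cached Tier-1 output would need no further model calls.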
Why three (now four) tiers
- Haiku is cheap enough to read every PDF in the corpus.
- Putting matching logic in Python (not in the LLM) means recall doesn't bottleneck on classification quality.
- Sonnet handles structured extraction well at ~$0.08 per CVI.
- Opus on a paraphrased prompt gives an independent reading for cross-validation. Disagreements get flagged, not silently discarded.
- Tier 2b corrects a specific failure mode (per-animal columns where the form uses a vertical "ditto" line for repeated values) and pulls additional fields the original prompt skipped.
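One plausible shape for the cross-tier agreement score is the fraction of compared fields on which two readings agree after light normalization. The sketch below assumes that definition; the function names, the field names, and the normalization rule are illustrative, and the actual scoring may differ.

```python
from typing import Optional


def normalize(value: Optional[str]) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join((value or "").lower().split())


def agreement(a: dict, b: dict, fields: list) -> tuple:
    """Return (score, disagreeing_fields) for two readings of the same CVI."""
    disagreements = [f for f in fields if normalize(a.get(f)) != normalize(b.get(f))]
    score = 1.0 - len(disagreements) / len(fields) if fields else 1.0
    return score, disagreements


# Made-up readings: the names differ only in case and spacing, the counts differ.
tier2 = {"consignor_name": "Example  Kennels", "dog_count": "12"}
tier3 = {"consignor_name": "example kennels", "dog_count": "21"}
score, flagged = agreement(tier2, tier3, ["consignor_name", "dog_count"])
```

Returning the list of disagreeing fields alongside the score keeps disagreements visible for review instead of folding them into a single number.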
Provenance
Every extracted value in the database carries a foreign key to a specific extraction row, which carries a foreign key to an extraction_run (model + prompt version + timestamp). Re-extraction is additive: new runs create new extraction rows; older readings are preserved. The shipments table picks "best available" based on phase priority (tier2b > tier2 > tier3 > tier1) at query time, so we can swap the picker without re-running anything.
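The picker's ordering can be illustrated in a few lines. The text says the pick happens at query time, so the real logic presumably lives in the database; this Python version only mirrors the stated priority. The row keys cvi_id and phase are assumed names, and ties within a phase are resolved here by keeping the first row seen, which may not match the real tie-break.

```python
# Priority order taken from the text: tier2b > tier2 > tier3 > tier1.
PHASE_PRIORITY = ["tier2b", "tier2", "tier3", "tier1"]


def pick_best(extractions: list) -> dict:
    """Map each CVI to its highest-priority extraction row."""
    rank = {phase: i for i, phase in enumerate(PHASE_PRIORITY)}
    best = {}
    for row in extractions:
        current = best.get(row["cvi_id"])
        # A lower rank index means a higher-priority phase.
        if current is None or rank[row["phase"]] < rank[current["phase"]]:
            best[row["cvi_id"]] = row
    return best


# Made-up rows for illustration.
rows = [
    {"cvi_id": "A", "phase": "tier1"},
    {"cvi_id": "A", "phase": "tier3"},
    {"cvi_id": "A", "phase": "tier2"},
    {"cvi_id": "B", "phase": "tier1"},
]
best = pick_best(rows)
```

Because the priority is a single list, swapping the picker amounts to reordering PHASE_PRIORITY; no stored extraction changes.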
Stack
- Web: Next.js 16 · React 19 · Tailwind 4 · TypeScript
- DB: PostgreSQL 16 (Docker container on droplet, port 5435 local-only)
- Server: systemd + Caddy 2 (auto-TLS via Let's Encrypt)
- Extraction: Python 3.14 · Anthropic SDK · async (concurrency 5–16)
- Exports: exceljs for XLSX, pdfkit for PDF
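The "async (concurrency 5–16)" entry suggests a cap on in-flight extraction calls. One common way to get that in Python is an asyncio.Semaphore around each call, sketched below. The run_bounded helper is hypothetical, and the worker is any coroutine function; nothing here reflects how the Anthropic SDK is actually invoked in the pipeline.

```python
import asyncio


async def run_bounded(items, worker, limit=5):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:  # waits here once `limit` workers are already running
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))
```

A call such as asyncio.run(run_bounded(pdf_paths, extract_one, limit=16)) would then process the whole batch, where pdf_paths and extract_one are placeholders for the real inputs and extraction coroutine.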
For a plain-language overview, see /methodology. Live data-quality stats are maintained internally by Dane4Dogs.