Free the Beagles

A Ridglan Farms public-records archive


Pipeline architecture

The data pipeline behind this site, in technical detail.

Data flow

  1. Crawl DATCP — Python script walks the public DATCP index page and the 37 monthly buckets at cvicentral.com. Output: manifest.csv (171,686 rows across all species).
  2. Reconcile — For each monthly bucket, compare row count against the server's stated total; re-fetch any missing pages. 37/37 reconciled exactly.
  3. Download PDFs — One per canine CVI: 38,128 files / 7.4 GB. Filename-collision aware.
  4. Tier 1 (Haiku 4.5) — Read every canine PDF; return verbatim consignor name/address/USDA license. Cached per shrine_id. 37,960 unique extractions, $158.
  5. Match (Python) — Classify each Tier-1 extraction into target sets (Ridglan / Marshall / Envigo / Inotiv / Class B). Pure-code matching against verbatim fields — recall does not depend on model classification.
  6. Tier 2 (Sonnet 4.6) — Deep structured extraction on matched candidates only. 294 extractions, $23.
  7. Tier 2b (Sonnet 4.6, ditto-aware) — Re-extract Ridglan rows with a prompt that explicitly handles the form's vertical-line/ditto convention and pulls vaccinations + treatments + health declarations. 77 of 211 done so far, ~$7.
  8. Tier 3 (Opus 4.7) — Independent re-read of every Ridglan candidate with a paraphrased prompt. Cross-tier agreement score computed against Tier 1 and Tier 2. 211 extractions, $21.
  9. Load → Postgres — All extractions are loaded into the extractions table with full provenance (run_id, model, prompt_version, tokens, cost, timestamp). The canonical shipments view picks the best extraction per CVI.
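
The matching in step 5 can be sketched as plain pattern matching over the verbatim Tier 1 fields. This is an illustration only: the field names (consignor_name, usda_license), the keyword patterns, the Class B license format, and the classify helper are all assumptions, not the pipeline's actual rules.

```python
import re

# Hypothetical name patterns per target set; the real lists are not
# given in this document.
TARGET_PATTERNS = {
    "ridglan": re.compile(r"\bridglan\b", re.IGNORECASE),
    "marshall": re.compile(r"\bmarshall\b", re.IGNORECASE),
    "envigo": re.compile(r"\benvigo\b", re.IGNORECASE),
    "inotiv": re.compile(r"\binotiv\b", re.IGNORECASE),
}
# Assumed convention: Class B licenses are written like "NN-B-NNNN".
CLASS_B_LICENSE = re.compile(r"\b\d{2}-B-\d{4}\b", re.IGNORECASE)


def classify(extraction: dict) -> set[str]:
    """Return every target set matched by the verbatim Tier 1 fields."""
    # Only the consignor name is searched for company keywords, so a
    # town called "Marshall" in an address does not trigger a match.
    name = extraction.get("consignor_name") or ""
    license_no = extraction.get("usda_license") or ""

    matches = {label for label, pat in TARGET_PATTERNS.items() if pat.search(name)}
    if CLASS_B_LICENSE.search(license_no):
        matches.add("class_b")
    return matches
```

Because the rules live in code, widening a pattern and re-running the match over the cached Tier 1 output needs no further model calls.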

Why three (now four) tiers

  • Haiku is cheap enough to read every PDF in the corpus.
  • Putting matching logic in Python (not in the LLM) means recall doesn't bottleneck on classification quality.
  • Sonnet handles structured extraction well at ~$0.08 per CVI.
  • Opus on a paraphrased prompt gives an independent reading for cross-validation. Disagreements get flagged, not silently discarded.
  • Tier 2b corrects a specific failure mode (per-animal columns where the form uses a vertical "ditto" line for repeated values) and pulls additional fields the original prompt skipped.
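
One way to compute a cross-tier agreement score is a field-by-field comparison after light normalization. In the sketch below, the field names, the sample values, the normalization, and the agreement helper are assumptions; this page does not specify how the score is actually computed.

```python
def normalize(value) -> str:
    """Collapse case and whitespace so cosmetic differences don't count."""
    text = "" if value is None else str(value)
    return " ".join(text.lower().split())


def agreement(a: dict, b: dict, fields: list[str]) -> tuple[float, list[str]]:
    """Return (share of fields that agree, names of fields that disagree)."""
    disagreements = [f for f in fields if normalize(a.get(f)) != normalize(b.get(f))]
    score = 1 - len(disagreements) / len(fields) if fields else 1.0
    return score, disagreements


# Hypothetical field list and made-up readings of the same CVI.
FIELDS = ["consignor_name", "consignee_name", "animal_count"]
tier2 = {"consignor_name": "Ridglan Farms", "consignee_name": "Example Lab", "animal_count": 12}
tier3 = {"consignor_name": "RIDGLAN  FARMS", "consignee_name": "Example Lab", "animal_count": 10}

score, flagged = agreement(tier2, tier3, FIELDS)
```

Here the two readings agree on two of three fields, and animal_count is returned in the flagged list rather than being dropped, which mirrors the "flagged, not silently discarded" rule.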

Provenance

Every extracted value in the database carries a foreign key to a specific extraction row, which in turn carries a foreign key to an extraction_run (model + prompt version + timestamp). Re-extraction is additive: new runs create new extraction rows, and older readings are preserved. The shipments view picks the "best available" extraction by phase priority (tier2b > tier2 > tier3 > tier1) at query time, so the picker can be swapped without re-running any extraction.
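
The phase-priority picker can be sketched in a few lines of Python. The priority order comes from the paragraph above; the row shape (cvi_id, phase, extraction_id) and the pick_best helper are assumptions for illustration, and the real picker presumably lives in SQL.

```python
PHASE_PRIORITY = ["tier2b", "tier2", "tier3", "tier1"]  # order stated above
RANK = {phase: i for i, phase in enumerate(PHASE_PRIORITY)}


def pick_best(extractions: list[dict]) -> dict[str, dict]:
    """Map each CVI id to its highest-priority extraction row.

    When two rows share a phase, the first one seen is kept; a real
    picker would need an explicit tie-break (for example, newest run).
    """
    best: dict[str, dict] = {}
    for row in extractions:
        current = best.get(row["cvi_id"])
        if current is None or RANK[row["phase"]] < RANK[current["phase"]]:
            best[row["cvi_id"]] = row
    return best


# Made-up rows: CVI "A" has three readings, CVI "B" has only Tier 1.
rows = [
    {"cvi_id": "A", "phase": "tier1", "extraction_id": 1},
    {"cvi_id": "A", "phase": "tier3", "extraction_id": 2},
    {"cvi_id": "A", "phase": "tier2b", "extraction_id": 3},
    {"cvi_id": "B", "phase": "tier1", "extraction_id": 4},
]
best = pick_best(rows)
```

The input rows are never modified or deleted, so changing PHASE_PRIORITY changes only which reading is surfaced.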

Stack

  • Web: Next.js 16 · React 19 · Tailwind 4 · TypeScript
  • DB: PostgreSQL 16 (Docker container on droplet, port 5435 local-only)
  • Server: systemd + Caddy 2 (auto-TLS via Let's Encrypt)
  • Extraction: Python 3.14 · Anthropic SDK · async (concurrency 5–16)
  • Exports: exceljs for XLSX, pdfkit for PDF
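
The bounded concurrency noted for the extraction step could look roughly like the following. This is a sketch under assumptions: extract_one is a stand-in that makes no network call (the real pipeline would call the Anthropic SDK there), and the function names and default limit are hypothetical.

```python
import asyncio


async def extract_one(pdf_path: str) -> dict:
    """Stand-in for a model call; no request is actually sent."""
    await asyncio.sleep(0)  # placeholder for network latency
    return {"pdf": pdf_path, "status": "ok"}


async def extract_all(pdf_paths: list[str], concurrency: int = 8) -> list[dict]:
    """Run extractions with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(path: str) -> dict:
        async with semaphore:
            return await extract_one(path)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(p) for p in pdf_paths))


results = asyncio.run(extract_all(["a.pdf", "b.pdf", "c.pdf"], concurrency=2))
```

A semaphore keeps the limit adjustable per run, which fits a concurrency range such as 5–16 depending on rate limits.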

For a plain-language overview, see /methodology. Live data-quality stats are maintained internally by Dane4Dogs.