01.The Problem With Filename-Based Classification
Most document processing pipelines make a naive assumption: that the filename tells you what's inside. In practice, brokers upload files named 'scan001.pdf', 'final_FINAL_v3.pdf', or just 'doc'. Filename heuristics fail catastrophically in production. We needed a system that reads the document content itself to decide what it is.
02.Stage 1 — PDF Parsing & Text Extraction
Every incoming document hits a Lambda function that attempts text extraction via pdf-parse. If the extracted text is under 100 characters, which usually indicates a scanned or image-only PDF, we escalate to Gemini Vision for OCR. This gives us a reliable text payload regardless of document origin.
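    // pdf-parse is a CommonJS module; the default import assumes esModuleInterop is enabled.
    import pdfParse from "pdf-parse";

    async function extractText(buffer: Buffer): Promise<string> {
      const parsed = await pdfParse(buffer);
      // A usable text layer has at least 100 characters; anything shorter is treated as a scan.
      if (parsed.text.trim().length >= 100) return parsed.text;
      // Fallback: Gemini Vision OCR (geminiVisionOcr is the pipeline's own helper, defined elsewhere)
      return await geminiVisionOcr(buffer);
    }
03.Stage 2 — Deterministic Form Registry Matching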
We built a FormRegistry mapping OREA form numbers and title keywords to canonical document types. The extracted text is scanned against this registry first. If a match is found with high confidence, we skip the AI call entirely — reducing cost and latency significantly.
- ▸OREA Form 100 → Agreement of Purchase and Sale
- ▸OREA Form 200 → Listing Agreement
- ▸OREA Form 320 → Seller Property Information Statement
- ▸Confidence threshold: 0.85 for deterministic classification
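The registry lookup can be sketched roughly as follows. This is an illustration rather than the production FormRegistry: the entry shape, the `matchRegistry` name, and the 0.6/0.4 scoring weights are all assumptions made for the example. Only the three form mappings and the 0.85 threshold come from the list above.

```typescript
// Hypothetical registry shape; the real FormRegistry is not shown in this article.
type RegistryDocType = "APS" | "LISTING" | "SPIS";

interface RegistryEntry {
  docType: RegistryDocType;
  formNumber: number;
  titleKeywords: string[];
}

const FORM_REGISTRY: RegistryEntry[] = [
  { docType: "APS", formNumber: 100, titleKeywords: ["agreement of purchase and sale"] },
  { docType: "LISTING", formNumber: 200, titleKeywords: ["listing agreement"] },
  { docType: "SPIS", formNumber: 320, titleKeywords: ["seller property information statement"] },
];

// Threshold taken from the article.
const CONFIDENCE_THRESHOLD = 0.85;

interface RegistryMatch {
  docType: RegistryDocType;
  confidence: number;
}

// Assumed scoring: a form-number hit contributes 0.6 and a title hit 0.4,
// so only documents showing both signals clear the 0.85 threshold.
function matchRegistry(text: string): RegistryMatch | null {
  // Normalise case and whitespace so line breaks inside a title do not block a match.
  const haystack = text.toLowerCase().replace(/\s+/g, " ");
  let best: RegistryMatch | null = null;

  for (const entry of FORM_REGISTRY) {
    const formPattern = new RegExp(`\\bform\\s*${entry.formNumber}\\b`);
    const hasNumber = formPattern.test(haystack);
    const hasTitle = entry.titleKeywords.some((k) => haystack.indexOf(k) !== -1);
    const confidence = (hasNumber ? 0.6 : 0) + (hasTitle ? 0.4 : 0);
    if (confidence > (best?.confidence ?? 0)) {
      best = { docType: entry.docType, confidence };
    }
  }

  // Returning null means "no deterministic match": escalate to the AI path.
  return best !== null && best.confidence >= CONFIDENCE_THRESHOLD ? best : null;
}
```

With these assumed weights, a document containing only a title or only a form number falls through to the escalation path instead of being classified deterministically.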
04.Stage 3 — Vertex AI Inference (Escalation Path)
Documents that don't match the registry are sent to Vertex AI with a structured prompt requesting JSON output. We enforce strict schema validation on the response. If Gemini returns malformed JSON or a confidence below the threshold, the document is flagged for human review rather than silently misclassified.
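A minimal sketch of that validation step, applied to the model's raw reply to the prompt in this section. The field names and allowed values mirror the prompt. The `validateClassification` name, the `Outcome` type, and the `minConfidence` parameter are assumptions for illustration, since the article does not state the threshold used on the AI path.

```typescript
const DOC_TYPES = ["APS", "LISTING", "SPIS", "OTHER"] as const;
const REPRESENTERS = ["BUYER", "SELLER", "DUAL", "UNKNOWN"] as const;
const DEAL_TYPES = ["FREEHOLD", "CONDO", "UNKNOWN"] as const;

interface Classification {
  docType: (typeof DOC_TYPES)[number];
  confidence: number;
  representer: (typeof REPRESENTERS)[number];
  dealType: (typeof DEAL_TYPES)[number];
}

// Hypothetical result type: either an accepted classification or a human-review flag.
type Outcome =
  | { status: "accepted"; result: Classification }
  | { status: "needs_review"; reason: string };

// Type guard: true only when value is one of the allowed string literals.
function isOneOf<T extends string>(allowed: readonly T[], value: unknown): value is T {
  return typeof value === "string" && (allowed as readonly string[]).indexOf(value) !== -1;
}

// minConfidence is an assumed parameter, not a value from the article.
function validateClassification(raw: string, minConfidence: number): Outcome {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { status: "needs_review", reason: "malformed JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { status: "needs_review", reason: "not a JSON object" };
  }

  const { docType, confidence, representer, dealType } = data as Record<string, unknown>;
  if (
    !isOneOf(DOC_TYPES, docType) ||
    !isOneOf(REPRESENTERS, representer) ||
    !isOneOf(DEAL_TYPES, dealType) ||
    typeof confidence !== "number" ||
    !(confidence >= 0 && confidence <= 1)
  ) {
    return { status: "needs_review", reason: "schema violation" };
  }
  if (confidence < minConfidence) {
    return { status: "needs_review", reason: "low confidence" };
  }
  return { status: "accepted", result: { docType, confidence, representer, dealType } };
}
```

Every failure branch returns a `needs_review` outcome instead of throwing or guessing, which matches the rule that a document is never silently misclassified.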
    const prompt = `
    Classify this real estate document. Return ONLY valid JSON:
    {
      "docType": "APS | LISTING | SPIS | OTHER",
      "confidence": 0.0-1.0,
      "representer": "BUYER | SELLER | DUAL | UNKNOWN",
      "dealType": "FREEHOLD | CONDO | UNKNOWN"
    }
    Document text: ${text.slice(0, 4000)}
    `;
05.Stage 4 — SQS Event Sourcing & DLQ Handling
Each classification result is published as an event to SQS. Workers consume these events and persist the structured data. Messages that still fail after 3 retries land in a Dead Letter Queue (DLQ), which triggers a CloudWatch alarm for manual triage. Failures therefore surface for triage instead of disappearing silently.
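On the queue side, the retry limit is expressed through the SQS `RedrivePolicy` attribute, a JSON-encoded string holding `deadLetterTargetArn` and `maxReceiveCount`. The sketch below only builds that attribute and a message body. The `ClassificationEvent` shape, the helper names, and the ARN in the usage note are hypothetical, and how "3 retries" maps to `maxReceiveCount` (3, or 4 if the first delivery is not counted as a retry) is an assumption left to the caller.

```typescript
// Hypothetical event shape; the article does not show the real payload.
interface ClassificationEvent {
  documentId: string;
  docType: string;
  confidence: number;
  classifiedAt: string; // ISO 8601 timestamp
}

// Builds the RedrivePolicy queue attribute. SQS moves a message to the DLQ
// once it has been received more than maxReceiveCount times without being deleted.
function buildRedriveAttributes(
  dlqArn: string,
  maxReceiveCount: number
): { RedrivePolicy: string } {
  if (Math.floor(maxReceiveCount) !== maxReceiveCount || maxReceiveCount < 1) {
    throw new RangeError("maxReceiveCount must be a positive integer");
  }
  return {
    RedrivePolicy: JSON.stringify({ deadLetterTargetArn: dlqArn, maxReceiveCount }),
  };
}

// Serialises an event for use as the SQS message body.
function toMessageBody(event: ClassificationEvent): string {
  return JSON.stringify(event);
}
```

For example, `buildRedriveAttributes("arn:aws:sqs:us-east-1:123456789012:classification-dlq", 3)` yields an object that could be passed as queue attributes when creating or updating the main queue. The CloudWatch alarm would then watch the DLQ's visible-message count.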
06.Results
Switching from filename heuristics to content-based analysis brought extraction accuracy from ~62% to ~95%. The deterministic registry handles 70% of documents without touching Vertex AI, cutting inference costs by over 60%. The human review queue shrank by 80%.