Building AI-Powered Document Extraction Pipelines

01.The Problem With Filename-Based Classification

Most document processing pipelines make a naive assumption: that the filename tells you what's inside. In practice, brokers upload files named 'scan001.pdf', 'final_FINAL_v3.pdf', or just 'doc'. Filename heuristics fail catastrophically in production. We needed a system that reads the document content itself to decide what it is.

02.Stage 1 — PDF Parsing & Text Extraction

Every incoming document hits a Lambda function that attempts text extraction via pdf-parse. If the extracted text, after trimming whitespace, is under 100 characters, we treat the file as a scanned or image-only PDF and escalate to Gemini Vision for OCR. Either way, downstream stages receive a usable text payload regardless of how the document was produced.

typescript

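import pdfParse from "pdf-parse";

// geminiVisionOcr is assumed to be defined elsewhere; it sends the PDF to
// Gemini Vision and resolves with the recognized text.
async function extractText(buffer: Buffer): Promise<string> {
  const parsed = await pdfParse(buffer);
  // 100+ characters of embedded text: treat as a digital PDF and use it as-is.
  if (parsed.text.trim().length >= 100) return parsed.text;
  // Otherwise assume a scanned/image PDF and fall back to Gemini Vision OCR.
  return geminiVisionOcr(buffer);
}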

03.Stage 2 — Deterministic Form Registry Matching

We built a FormRegistry mapping OREA form numbers and title keywords to canonical document types. The extracted text is scanned against this registry first. If a match scores at or above the confidence threshold, we skip the AI call entirely, which saves both the inference cost and the model round trip for that document.

▸OREA Form 100 → Agreement of Purchase and Sale
▸OREA Form 200 → Listing Agreement
▸OREA Form 220 → Seller Property Information Statement
▸Confidence threshold: 0.85 for deterministic classification
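
To make the lookup concrete, here is a minimal sketch of a registry match. It is illustrative, not the production FormRegistry: the `RegistryEntry` shape, the `scoreEntry` and `matchRegistry` names, and the 0.6/0.4 scoring weights are all assumptions. Only the form numbers, document types, and the 0.85 threshold come from the list above, and the registry is abbreviated to two entries.

```typescript
// Hypothetical entry shape; the real FormRegistry is not shown in this article.
interface RegistryEntry {
  formNumber: string;      // OREA form number, e.g. "100"
  docType: string;         // canonical document type
  titleKeywords: string[]; // lowercase phrases expected in the form title
}

interface RegistryMatch {
  docType: string;
  confidence: number;
}

const CONFIDENCE_THRESHOLD = 0.85;

const FORM_REGISTRY: RegistryEntry[] = [
  {
    formNumber: "100",
    docType: "Agreement of Purchase and Sale",
    titleKeywords: ["agreement of purchase and sale"],
  },
  {
    formNumber: "200",
    docType: "Listing Agreement",
    titleKeywords: ["listing agreement"],
  },
];

// Assumed weights: a form-number hit is worth 0.6, a title-keyword hit 0.4.
function scoreEntry(text: string, entry: RegistryEntry): number {
  // Lowercase and collapse whitespace so line breaks in the PDF text don't matter.
  const normalized = text.toLowerCase().replace(/\s+/g, " ");
  const formPattern = new RegExp(`\\bform\\s+${entry.formNumber}\\b`);
  const formScore = formPattern.test(normalized) ? 0.6 : 0;
  const keywordScore = entry.titleKeywords.some((k) => normalized.includes(k)) ? 0.4 : 0;
  return formScore + keywordScore;
}

// Returns the best entry at or above the threshold, or null to signal escalation.
function matchRegistry(text: string): RegistryMatch | null {
  let best: RegistryMatch | null = null;
  for (const entry of FORM_REGISTRY) {
    const confidence = scoreEntry(text, entry);
    if (confidence >= CONFIDENCE_THRESHOLD && (best === null || confidence > best.confidence)) {
      best = { docType: entry.docType, confidence };
    }
  }
  return best;
}
```

With these assumed weights, a form number on its own scores 0.6 and stays below the threshold, so only documents showing both the form number and the expected title are classified deterministically. A `null` result means the document moves on to the escalation path.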

04.Stage 3 — Vertex AI Inference (Escalation Path)

Documents that don't match the registry are sent to Vertex AI with a structured prompt requesting JSON output. We enforce strict schema validation on the response. If Gemini returns malformed JSON or a confidence below the threshold, the document is flagged for human review rather than silently misclassified.

typescript

const prompt = `
Classify this real estate document. Return ONLY valid JSON:
{
  "docType": "APS | LISTING | SPIS | OTHER",
  "confidence": 0.0-1.0,
  "representer": "BUYER | SELLER | DUAL | UNKNOWN",
  "dealType": "FREEHOLD | CONDO | UNKNOWN"
}
Document text: ${text.slice(0, 4000)}
`;
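
One way the validation step could look, sketched under assumptions: the field names and allowed values mirror the prompt above, but `validateModelResponse`, the `ValidationResult` shape, and the reuse of 0.85 as the threshold on the AI path are hypothetical.

```typescript
type DocType = "APS" | "LISTING" | "SPIS" | "OTHER";
type Representer = "BUYER" | "SELLER" | "DUAL" | "UNKNOWN";
type DealType = "FREEHOLD" | "CONDO" | "UNKNOWN";

interface Classification {
  docType: DocType;
  confidence: number;
  representer: Representer;
  dealType: DealType;
}

type ValidationResult =
  | { status: "accepted"; classification: Classification }
  | { status: "needs_review"; reason: string };

const DOC_TYPES: readonly string[] = ["APS", "LISTING", "SPIS", "OTHER"];
const REPRESENTERS: readonly string[] = ["BUYER", "SELLER", "DUAL", "UNKNOWN"];
const DEAL_TYPES: readonly string[] = ["FREEHOLD", "CONDO", "UNKNOWN"];

// Assumption: the AI path reuses the 0.85 threshold quoted for the registry.
const MIN_AI_CONFIDENCE = 0.85;

function review(reason: string): ValidationResult {
  return { status: "needs_review", reason };
}

// Every failure mode routes to human review; nothing is coerced or defaulted.
function validateModelResponse(raw: string): ValidationResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return review("malformed JSON");
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return review("response is not a JSON object");
  }
  const { docType, confidence, representer, dealType } = parsed as Record<string, unknown>;
  if (typeof docType !== "string" || !DOC_TYPES.includes(docType)) {
    return review("invalid docType");
  }
  if (typeof representer !== "string" || !REPRESENTERS.includes(representer)) {
    return review("invalid representer");
  }
  if (typeof dealType !== "string" || !DEAL_TYPES.includes(dealType)) {
    return review("invalid dealType");
  }
  if (typeof confidence !== "number" || !Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    return review("invalid confidence");
  }
  if (confidence < MIN_AI_CONFIDENCE) {
    return review("confidence below threshold");
  }
  return {
    status: "accepted",
    classification: {
      docType: docType as DocType,
      confidence,
      representer: representer as Representer,
      dealType: dealType as DealType,
    },
  };
}
```

Note that this sketch treats any wrapper text around the JSON (for example, a leading sentence from the model) as malformed, which errs on the side of sending the document to review.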

05.Stage 4 — SQS Event Sourcing & DLQ Handling

Each classification result is published as an event to SQS. Workers consume these events and persist the structured data. Messages that still fail after 3 retries land in a Dead Letter Queue (DLQ), which triggers a CloudWatch alarm for manual triage. Failures therefore surface for review instead of disappearing silently.
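
A rough sketch of the event envelope and the queue's redrive configuration. `RedrivePolicy`, with its `deadLetterTargetArn` and `maxReceiveCount` fields, is the standard SQS queue attribute for routing to a DLQ. Everything else is a placeholder: the `ClassificationEvent` fields, the helper names, and the ARN are hypothetical, and mapping "3 retries" to `maxReceiveCount: 3` is an assumption.

```typescript
// Hypothetical event envelope; field names are illustrative only.
interface ClassificationEvent {
  eventType: "DocumentClassified";
  documentId: string;
  docType: string;
  confidence: number;
  source: "REGISTRY" | "VERTEX_AI";
  occurredAt: string; // ISO 8601 timestamp
}

// Serializes an event into the string used as the SQS message body.
function toMessageBody(event: ClassificationEvent): string {
  return JSON.stringify(event);
}

// Parses a message body on the worker side. It throws on anything unexpected
// so the message is not acknowledged; SQS then redelivers it and eventually
// moves it to the DLQ instead of the worker swallowing the error.
function parseEvent(body: string): ClassificationEvent {
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("Event payload is not an object");
  }
  const candidate = parsed as Partial<ClassificationEvent>;
  if (candidate.eventType !== "DocumentClassified" || typeof candidate.documentId !== "string") {
    throw new Error("Unrecognized event payload");
  }
  return candidate as ClassificationEvent;
}

// Builds the JSON string SQS expects for the RedrivePolicy queue attribute.
function buildRedrivePolicy(deadLetterTargetArn: string, maxReceiveCount: number): string {
  return JSON.stringify({ deadLetterTargetArn, maxReceiveCount });
}

// Placeholder ARN for illustration only.
const redrivePolicy = buildRedrivePolicy(
  "arn:aws:sqs:ca-central-1:123456789012:classification-dlq",
  3,
);
```

The design point is that the worker fails loudly: a message it cannot parse or persist is left on the queue, so the redrive policy, not application code, decides when to give up on it.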

06.Results

Switching from filename heuristics to content-based analysis raised extraction accuracy from ~62% to ~95%. The deterministic registry handles 70% of documents without touching Vertex AI, cutting inference costs by over 60%. The human review queue shrank by 80%.