PDF Data Extraction: Turning Documents into Structured, Queryable Data

PDFs are where data goes to die.

Financial filings, supplier invoices, legal contracts, research reports, government datasets — some of the most valuable structured information in the world is locked inside PDF files that can’t be queried, joined, or fed into a pipeline.

The traditional approach involves a mix of pdf-parse, tabula-py, PDF.co, and significant manual cleanup. It’s slow, brittle, and hard to scale.

This guide covers how to extract structured data from PDFs at scale using workers — with no preprocessing infrastructure.

What PDF extraction handles

A good PDF extraction worker can handle:

Multi-page documents: Maintain context across hundreds of pages
Tables: Extract rows/columns correctly even from complex layouts
Scanned PDFs: OCR converts image-based pages to text
Forms: Extract named field values from fillable PDFs
Metadata: Author, creation date, application name, page count
Embedded data: Sometimes PDFs have hidden structured elements

Basic extraction call

curl -X POST https://api.seek-api.com/v1/workers/pdf-extractor/jobs \
  -H "X-Api-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/reports/annual-report-2025.pdf"}'

Or upload a base64-encoded file:

{
  "file": "<base64-encoded-pdf>",
  "extractTables": true,
  "ocrEnabled": true
}
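To build that body from a local file, the PDF bytes first have to be base64-encoded into a string. Here is a minimal sketch using only the standard library. The helper names `build_upload_payload` and `payload_from_path` are made up for illustration; the `file`, `extractTables`, and `ocrEnabled` keys are the ones shown in the request above.

```python
import base64
import json
from pathlib import Path


def build_upload_payload(pdf_bytes: bytes, extract_tables: bool = True, ocr: bool = True) -> dict:
    """Return a JSON-serializable request body for an inline PDF upload."""
    return {
        # b64encode returns bytes; decode to str so json.dumps accepts it
        "file": base64.b64encode(pdf_bytes).decode("ascii"),
        "extractTables": extract_tables,
        "ocrEnabled": ocr,
    }


def payload_from_path(path: str) -> dict:
    """Read a PDF from disk and wrap it in the upload body."""
    return build_upload_payload(Path(path).read_bytes())


if __name__ == "__main__":
    # Stand-in bytes so the sketch runs without a real PDF on disk
    body = build_upload_payload(b"%PDF-1.7 demo bytes")
    print(json.dumps(body)[:60])
```

Base64 inflates the payload by roughly a third, so the `url` form is probably the better fit for large documents.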

Response:

{
  "pages": 48,
  "title": "Annual Report 2025",
  "author": "Finance Department",
  "createdAt": "2026-01-15",
  "textByPage": {
    "1": "To our shareholders...",
    "2": "Revenue grew 24% year-over-year..."
  },
  "tables": [
    {
      "page": 12,
      "rows": [
        ["Segment", "Revenue", "YoY Change"],
        ["North America", "$124M", "+18%"],
        ["Europe", "$87M", "+31%"],
        ["APAC", "$43M", "+52%"]
      ]
    }
  ]
}
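Each entry in `tables` is a list of rows, which is easier to query once every row is keyed by its column name. The sketch below assumes the first row is always the header, as it is in the sample response, and that money cells look like "$124M". Both are assumptions about the output rather than documented guarantees, and the function names are illustrative.

```python
import re

# Multipliers for the K/M/B suffixes seen in financial tables
_SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def table_to_records(table: dict) -> list[dict]:
    """Turn {"rows": [header, row, row, ...]} into a list of dicts keyed by header."""
    header, *body = table["rows"]
    return [dict(zip(header, row)) for row in body]


def parse_money(text: str) -> float:
    """Parse strings such as "$124M" or "$4,950.00" into a plain number."""
    match = re.fullmatch(r"\$([\d,]+(?:\.\d+)?)([KMB]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized money format: {text!r}")
    number = float(match.group(1).replace(",", ""))
    return number * _SUFFIX.get(match.group(2), 1)


# The table from the sample response above
segment_table = {
    "page": 12,
    "rows": [
        ["Segment", "Revenue", "YoY Change"],
        ["North America", "$124M", "+18%"],
        ["Europe", "$87M", "+31%"],
        ["APAC", "$43M", "+52%"],
    ],
}

records = table_to_records(segment_table)
total_revenue = sum(parse_money(r["Revenue"]) for r in records)
```

From here the records can go straight into a DataFrame or a database insert.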

Invoice processing

Invoice processing is one of the highest-value use cases. Many companies still receive supplier invoices as PDF attachments, manually enter the data into their ERP, and reconcile line items by hand.

Configure extraction with invoice-specific fields:

{
  "url": "...",
  "template": "invoice",
  "extractFields": ["invoice_number", "date", "vendor_name", "subtotal", "tax", "total", "line_items"]
}

Result:

{
  "invoice_number": "INV-2025-00482",
  "date": "2025-12-30",
  "vendor_name": "Acme Supplies Ltd.",
  "subtotal": 4500.00,
  "tax": 450.00,
  "total": 4950.00,
  "line_items": [
    { "description": "Server Rack Unit", "qty": 3, "unit_price": 1200.00, "total": 3600.00 },
    { "description": "Cable Management", "qty": 5, "unit_price": 180.00, "total": 900.00 }
  ]
}

Automate the routing logic: post extracted data to your ERP API, flag anomalies, fire approval workflows in Slack.
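A reasonable first pass at "flag anomalies" is plain arithmetic: does each line multiply out, do the lines add up to the subtotal, and does subtotal plus tax equal the total? The sketch below uses the field names from the sample result. The function name and the one-cent tolerance are illustrative choices, not part of the API.

```python
from decimal import Decimal


def _dec(value) -> Decimal:
    # Go through str() so 4950.0 becomes Decimal("4950.0") rather than a binary-float expansion
    return Decimal(str(value))


def find_anomalies(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return human-readable descriptions of totals that do not add up."""
    issues = []
    items = invoice.get("line_items", [])

    # Check 1: qty * unit_price matches each line total
    for number, item in enumerate(items, start=1):
        expected = _dec(item["qty"]) * _dec(item["unit_price"])
        if abs(expected - _dec(item["total"])) > tolerance:
            issues.append(
                f"line {number}: qty x unit_price is {expected}, extracted total is {item['total']}"
            )

    # Check 2: line totals add up to the subtotal
    line_sum = sum((_dec(item["total"]) for item in items), Decimal("0"))
    if "subtotal" in invoice and abs(line_sum - _dec(invoice["subtotal"])) > tolerance:
        issues.append(f"line items sum to {line_sum}, extracted subtotal is {invoice['subtotal']}")

    # Check 3: subtotal + tax equals the invoice total
    if all(key in invoice for key in ("subtotal", "tax", "total")):
        expected_total = _dec(invoice["subtotal"]) + _dec(invoice["tax"])
        if abs(expected_total - _dec(invoice["total"])) > tolerance:
            issues.append(
                f"subtotal + tax is {expected_total}, extracted total is {invoice['total']}"
            )

    return issues
```

An empty list means the numbers are internally consistent. Anything else is a candidate for the approval workflow rather than automatic posting to the ERP.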

Processing a folder of SEC filings

SEC filings (10-K, 10-Q, 8-K) are public documents that often circulate as PDFs. Their financial tables contain structured data that analysts normally extract by hand.

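import os

import httpx

# Assumed environment variable name; load your key however you prefer
API_KEY = os.environ["SEEK_API_KEY"]
JOBS_URL = "https://api.seek-api.com/v1/workers/pdf-extractor/jobs"

FILINGS = [
    "https://www.sec.gov/Archives/edgar/data/1652044/000165204425000010/goog-20241231.pdf",
    # ... hundreds more
]

# Submit every extraction up front; each POST returns a job UUID
jobs = []
for url in FILINGS:
    response = httpx.post(
        JOBS_URL,
        headers={"X-Api-Key": API_KEY},
        json={"url": url, "extractTables": True},
    )
    response.raise_for_status()  # fail fast on auth or quota errors
    jobs.append(response.json()["job_uuid"])

# Collect results. wait_for_job is a helper you supply: it polls a job
# until it finishes and returns the extraction JSON.
results = [wait_for_job(j) for j in jobs]

# Keep every table that mentions revenue (a rough proxy for income statements)
income_statements = [
    {"url": url, "table": table}
    for url, result in zip(FILINGS, results)
    for table in result.get("tables", [])
    if any("revenue" in str(row).lower() for row in table["rows"])
]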
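The script calls `wait_for_job` without defining it. One plausible shape is a polling loop, sketched below under loud assumptions: that a job can be fetched at `/jobs/{job_uuid}`, that the record carries a `status` of `"completed"` or `"failed"`, and that the extraction lives under `result`. None of that is confirmed by this guide, so check the API reference before relying on it. The `SEEK_API_KEY` variable name is also an assumption. The fetch step uses `urllib` to keep the sketch dependency-free and is injectable so the loop can be tested without network access.

```python
import json
import os
import time
import urllib.request

JOBS_URL = "https://api.seek-api.com/v1/workers/pdf-extractor/jobs"


def fetch_job(job_uuid: str) -> dict:
    """GET one job record. The /jobs/{uuid} path is an assumption, not a documented endpoint."""
    request = urllib.request.Request(
        f"{JOBS_URL}/{job_uuid}",
        headers={"X-Api-Key": os.environ["SEEK_API_KEY"]},  # assumed env var name
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(job_uuid, fetch=fetch_job, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Poll until the job finishes, then return its result payload.

    Assumed record shape: {"status": "completed" | "failed" | ..., "result": {...}}.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(job_uuid)
        status = job.get("status")
        if status == "completed":
            return job["result"]
        if status == "failed":
            raise RuntimeError(f"job {job_uuid} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_uuid} still {status!r} after {timeout}s")
        sleep(interval)
```

With the default arguments this matches the `wait_for_job(j)` call in the script. Because the list comprehension waits on jobs one at a time, total wall time is bounded by the slowest job rather than the sum, since all jobs were submitted before any waiting started.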

Legal contract review

Teams doing due diligence at scale (M&A screening, vendor reviews) can configure a worker to detect key contract clauses:

Governing law clause
Liability cap amount
Termination conditions
Non-compete duration
Payment terms

The extracted JSON can be routed to a diff system that flags unusual provisions against your standard template.
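The diff step itself can be very small. The sketch below compares an extracted clause dictionary against a standard template and reports what is missing or different. Every key and value here (`governing_law`, `liability_cap`, and so on) is hypothetical, since the guide does not specify the worker's output schema for contracts.

```python
# Hypothetical house template; keys and values are placeholders, not real API fields
STANDARD_TERMS = {
    "governing_law": "Delaware",
    "liability_cap": "12 months of fees",
    "termination_notice_days": 30,
    "non_compete_months": 0,
    "payment_terms_days": 30,
}


def flag_deviations(extracted: dict, standard: dict) -> list[dict]:
    """List every clause that is absent from, or differs from, the standard template."""
    flags = []
    for clause, expected in standard.items():
        if clause not in extracted:
            flags.append({"clause": clause, "issue": "missing", "expected": expected, "found": None})
        elif extracted[clause] != expected:
            flags.append(
                {"clause": clause, "issue": "differs", "expected": expected, "found": extracted[clause]}
            )
    return flags


# Example extraction with one changed clause and one missing clause
contract = {
    "governing_law": "Delaware",
    "liability_cap": "12 months of fees",
    "termination_notice_days": 90,
    "payment_terms_days": 30,
}
deviations = flag_deviations(contract, STANDARD_TERMS)
```

Exact equality is deliberately strict. Real clause text would likely need normalization (casing, units, synonyms) before comparison, which is left out here.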

When OCR is needed

Contracts and invoices received as scanned images (rather than digital PDFs) require OCR. Set "ocrEnabled": true — the worker detects whether a page is a raster image and applies OCR automatically.

Accuracy is high on clean scans (~98%) and drops on handwritten annotations or very low-DPI scans. A raw confidence score is returned for each page, so your pipeline can route uncertain pages to human review.
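Routing on confidence comes down to a threshold check. The guide does not name the field that carries the per-page score, so the sketch below assumes a `confidenceByPage` object keyed by page number, mirroring `textByPage`, with scores between 0 and 1. Treat both the field name and the 0.90 cutoff as placeholders.

```python
def pages_needing_review(result: dict, threshold: float = 0.90) -> list[int]:
    """Return page numbers whose OCR confidence falls below the threshold.

    Assumes result["confidenceByPage"] maps page-number strings to 0..1 scores;
    that field name is hypothetical.
    """
    scores = result.get("confidenceByPage", {})
    return sorted(int(page) for page, score in scores.items() if score < threshold)


# Example: pages 2 and 10 fall below the cutoff
sample_result = {"confidenceByPage": {"1": 0.99, "2": 0.71, "3": 0.95, "10": 0.88}}
review_queue = pages_needing_review(sample_result)
```

Converting the keys to integers before sorting keeps page 10 after page 2, which a plain string sort would get wrong.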

Pricing

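PDF extraction runs at approximately 1 credit per page, or 2 credits for pages that need OCR. A 10-page digital invoice uses 10 credits and costs $0.02, which implies about $0.002 per credit. Processing 100 such invoices costs $2; the same batch fully scanned would use twice the credits.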
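For budgeting a batch, the arithmetic can be wrapped in a small estimator. The per-credit price below is inferred from the "10 pages costs $0.02" example rather than taken from a published rate card, and the sketch assumes an OCR page costs 2 credits in total, not 2 on top of the base credit.

```python
from decimal import Decimal

CREDIT_PRICE_USD = Decimal("0.002")  # inferred from the 10-page example; not an official rate
CREDITS_PER_PAGE = 1
CREDITS_PER_OCR_PAGE = 2  # assumed to be the total for an OCR page


def estimate_cost(pages: int, ocr_pages: int = 0) -> Decimal:
    """Estimate the USD cost of one document; ocr_pages is the subset of pages needing OCR."""
    if not 0 <= ocr_pages <= pages:
        raise ValueError("ocr_pages must be between 0 and pages")
    credits = (pages - ocr_pages) * CREDITS_PER_PAGE + ocr_pages * CREDITS_PER_OCR_PAGE
    return credits * CREDIT_PRICE_USD


one_invoice = estimate_cost(10)              # 10 digital pages
batch = 100 * estimate_cost(10)              # 100 such invoices
scanned_batch = 100 * estimate_cost(10, 10)  # same batch, every page scanned
```

`Decimal` avoids the rounding drift that floats would introduce when costs are summed across thousands of documents.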