PDFs are where data goes to die.
Financial filings, supplier invoices, legal contracts, research reports, government datasets — some of the most valuable structured information in the world is locked inside PDF files that can’t be queried, joined, or fed into a pipeline.
The traditional approach involves a mix of pdfparse, tabula-py, PDF.co, and significant manual cleanup. It’s slow, brittle, and hard to scale.
This guide covers how to extract structured data from PDFs at scale using workers — with no preprocessing infrastructure.
What PDF extraction handles
A good PDF extraction worker can handle:
- Multi-page documents: Maintain context across hundreds of pages
- Tables: Extract rows/columns correctly even from complex layouts
- Scanned PDFs: OCR converts image-based pages to text
- Forms: Extract named field values from fillable PDFs
- Metadata: Author, creation date, application name, page count
- Embedded data: Sometimes PDFs have hidden structured elements
Basic extraction call
curl -X POST https://api.seek-api.com/v1/workers/pdf-extractor/jobs \
  -H "X-Api-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/reports/annual-report-2025.pdf"}'
Or upload a base64-encoded file:
{
  "file": "<base64-encoded-pdf>",
  "extractTables": true,
  "ocrEnabled": true
}
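The `file` value is the PDF's raw bytes, base64-encoded as text. A minimal sketch of building that body in Python, assuming the payload keys shown above; the helper name `build_upload_payload` and the stand-in bytes are illustrative only.

```python
import base64
import json


def build_upload_payload(pdf_bytes: bytes, extract_tables: bool = True,
                         ocr_enabled: bool = True) -> dict:
    """Build the JSON body for a base64 upload (helper name is illustrative)."""
    return {
        # b64encode returns bytes; decode to str so the value is JSON-serializable.
        "file": base64.b64encode(pdf_bytes).decode("ascii"),
        "extractTables": extract_tables,
        "ocrEnabled": ocr_enabled,
    }


# Stand-in bytes; in practice read them with open("invoice.pdf", "rb").read()
payload = build_upload_payload(b"%PDF-1.7 example bytes")
body = json.dumps(payload)
```

The resulting `body` string can be sent as the request body in place of the `url` form.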
Response:
{
  "pages": 48,
  "title": "Annual Report 2025",
  "author": "Finance Department",
  "createdAt": "2026-01-15",
  "textByPage": {
    "1": "To our shareholders...",
    "2": "Revenue grew 24% year-over-year..."
  },
  "tables": [
    {
      "page": 12,
      "rows": [
        ["Segment", "Revenue", "YoY Change"],
        ["North America", "$124M", "+18%"],
        ["Europe", "$87M", "+31%"],
        ["APAC", "$43M", "+52%"]
      ]
    }
  ]
}
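Tables arrive as lists of string cells, so they usually need one normalization pass before they can be joined or summed. The sketch below assumes the first row is a header, as in the sample response, and that money cells follow the `$124M` style; both helper names are illustrative.

```python
import re


def table_to_records(table: dict) -> list[dict]:
    """Turn a rows-of-cells table into one dict per data row (first row = header)."""
    header, *rows = table["rows"]
    return [dict(zip(header, row)) for row in rows]


# Assumed money format: "$" + number + optional K/M/B suffix, e.g. "$124M".
_MONEY = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")
_SCALE = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_money(text: str) -> float:
    """Convert a cell like "$124M" into a plain number of dollars."""
    match = _MONEY.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized money format: {text!r}")
    number, suffix = match.groups()
    return float(number) * _SCALE[suffix]


table = {
    "page": 12,
    "rows": [
        ["Segment", "Revenue", "YoY Change"],
        ["North America", "$124M", "+18%"],
        ["Europe", "$87M", "+31%"],
        ["APAC", "$43M", "+52%"],
    ],
}
records = table_to_records(table)
total_revenue = sum(parse_money(r["Revenue"]) for r in records)
```

Cells that do not match the assumed format raise a `ValueError` rather than being silently skipped, which makes layout surprises visible early.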
Invoice processing
Invoice processing is one of the highest-value use cases. Many companies still receive supplier invoices as PDF attachments, enter the data into their ERP manually, and reconcile line items by hand.
Configure extraction with invoice-specific fields:
{
  "url": "...",
  "template": "invoice",
  "extractFields": ["invoice_number", "date", "total", "vendor_name", "line_items"]
}
Result:
{
  "invoice_number": "INV-2025-00482",
  "date": "2025-12-30",
  "vendor_name": "Acme Supplies Ltd.",
  "subtotal": 4500.00,
  "tax": 450.00,
  "total": 4950.00,
  "line_items": [
    { "description": "Server Rack Unit", "qty": 3, "unit_price": 1200.00, "total": 3600.00 },
    { "description": "Cable Management", "qty": 5, "unit_price": 180.00, "total": 900.00 }
  ]
}
Automate the routing logic: post extracted data to your ERP API, flag anomalies, and trigger approval workflows in Slack.
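One simple way to flag anomalies is an arithmetic consistency check on the extracted invoice before it reaches the ERP. The sketch below assumes the field names from the sample result; the function name and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal


def d(value) -> Decimal:
    """Convert a JSON number to Decimal via str to avoid float artifacts."""
    return Decimal(str(value))


def find_invoice_anomalies(invoice: dict,
                           tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return human-readable problems; an empty list means the totals agree."""
    problems = []
    line_sum = Decimal("0")
    for item in invoice["line_items"]:
        expected = d(item["qty"]) * d(item["unit_price"])
        actual = d(item["total"])
        if abs(expected - actual) > tolerance:
            problems.append(
                f"line '{item['description']}': qty x unit_price = {expected}, total = {actual}"
            )
        line_sum += actual
    if abs(line_sum - d(invoice["subtotal"])) > tolerance:
        problems.append(f"line items sum to {line_sum}, subtotal is {invoice['subtotal']}")
    if abs(d(invoice["subtotal"]) + d(invoice["tax"]) - d(invoice["total"])) > tolerance:
        problems.append("subtotal + tax does not equal total")
    return problems


invoice = {
    "subtotal": 4500.00,
    "tax": 450.00,
    "total": 4950.00,
    "line_items": [
        {"description": "Server Rack Unit", "qty": 3, "unit_price": 1200.00, "total": 3600.00},
        {"description": "Cable Management", "qty": 5, "unit_price": 180.00, "total": 900.00},
    ],
}
clean = find_invoice_anomalies(invoice)
tampered = find_invoice_anomalies({**invoice, "total": 5000.00})
```

Invoices that return an empty list can be posted straight through, while anything else can go to an approval queue along with the list of problems.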
Processing a folder of SEC filings
SEC filings (10-K, 10-Q, 8-K) are public documents, and many circulate as PDFs. Their financial tables contain structured data that analysts often extract by hand.
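import os

import httpx

# Read the key from the environment (the variable name is an assumption)
API_KEY = os.environ["SEEK_API_KEY"]

FILINGS = [
    "https://www.sec.gov/Archives/edgar/data/1652044/000165204425000010/goog-20241231.pdf",
    # ... hundreds more
]

# Submit every extraction before waiting on any result
with httpx.Client(headers={"X-Api-Key": API_KEY}) as client:
    jobs = [
        client.post(
            "https://api.seek-api.com/v1/workers/pdf-extractor/jobs",
            json={"url": url, "extractTables": True},
        ).json()["job_uuid"]
        for url in FILINGS
    ]

# Collect results in submission order.
# wait_for_job is a helper you supply (not defined here): it should block
# until the job finishes and return the parsed extraction response.
results = [wait_for_job(j) for j in jobs]

# Keep tables that mention revenue (a rough filter for income statements)
income_statements = [
    {"url": url, "table": t}
    for url, r in zip(FILINGS, results)
    for t in r.get("tables", [])
    if any("revenue" in str(row).lower() for row in t["rows"])
]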
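The snippet above calls `wait_for_job` without defining it. One possible shape is a polling loop, sketched here with the status lookup passed in as a function so that no endpoint is hard-coded. The `status` values (`"succeeded"`, `"failed"`) and the `result` key are assumptions for illustration, not documented API fields; adjust them to whatever the job status response actually contains.

```python
import time
from typing import Callable


def wait_for_job(job_uuid: str,
                 fetch_status: Callable[[str], dict],
                 interval: float = 2.0,
                 timeout: float = 300.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll fetch_status(job_uuid) until the job finishes, then return its result.

    Assumed (hypothetical) response shape: {"status": ..., "result": {...}}.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_uuid)
        status = job.get("status")
        if status == "succeeded":
            return job["result"]
        if status == "failed":
            raise RuntimeError(f"job {job_uuid} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_uuid} still {status!r} after {timeout}s")
        sleep(interval)


# Demo with a fake status source instead of a real HTTP call.
_responses = iter([
    {"status": "running"},
    {"status": "succeeded", "result": {"pages": 48, "tables": []}},
])
result = wait_for_job(
    "demo-uuid",
    fetch_status=lambda _uuid: next(_responses),
    sleep=lambda _seconds: None,  # skip real waiting in the demo
)
```

To get the one-argument form used in the snippet, bind a real status-fetching function with `functools.partial(wait_for_job, fetch_status=...)`.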
Legal contract review
For teams doing due diligence at scale (M&A screening, vendor reviews), a worker can be configured to detect key contract clauses:
- Governing law clause
- Liability cap amount
- Termination conditions
- Non-compete duration
- Payment terms
The extracted JSON can be routed to a diff system that flags unusual provisions against your standard template.
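A diff of this kind can start as a field-by-field comparison. In the sketch below, every field name and template value is hypothetical, chosen only to mirror the clause list above; a real template would come from your legal team.

```python
def flag_deviations(extracted: dict, standard: dict) -> dict:
    """Compare extracted clause values against a standard template.

    Returns {field: reason} for every clause that is missing or different.
    """
    flags = {}
    for field, expected in standard.items():
        actual = extracted.get(field)
        if actual is None:
            flags[field] = "missing"
        elif actual != expected:
            flags[field] = f"expected {expected!r}, found {actual!r}"
    return flags


# Hypothetical standard terms and one hypothetical extracted contract.
standard = {
    "governing_law": "Delaware",
    "liability_cap": "12 months of fees",
    "termination_notice_days": 30,
    "non_compete_months": 12,
    "payment_terms_days": 30,
}
extracted = {
    "governing_law": "Delaware",
    "liability_cap": "unlimited",
    "termination_notice_days": 30,
    "payment_terms_days": 60,
}
flags = flag_deviations(extracted, standard)
```

Exact equality is deliberately strict here; free-text clauses would likely need normalization or a reviewer in the loop before a mismatch means anything.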
When OCR is needed
Contracts and invoices received as scanned images (rather than digital PDFs) require OCR. Set "ocrEnabled": true — the worker detects whether a page is a raster image and applies OCR automatically.
Accuracy is high on clean scans (~98%) and drops on handwritten annotations or very low-DPI images. A raw confidence score is returned for each page, so your pipeline can route uncertain pages to human review.
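Routing on the per-page score can be a simple threshold split. The sketch assumes the scores arrive as a mapping from page number (as a string) to a value between 0 and 1; that shape, the function name, and the 0.90 cutoff are all assumptions to tune against your own documents.

```python
def split_by_confidence(confidence_by_page: dict[str, float],
                        threshold: float = 0.90) -> tuple[list[int], list[int]]:
    """Split pages into (auto_accept, needs_review) by OCR confidence."""
    auto_accept, needs_review = [], []
    # Sort numerically so page "10" comes after page "3".
    for page, score in sorted(confidence_by_page.items(), key=lambda kv: int(kv[0])):
        target = auto_accept if score >= threshold else needs_review
        target.append(int(page))
    return auto_accept, needs_review


# Hypothetical scores for a four-page scan.
scores = {"1": 0.99, "2": 0.97, "3": 0.62, "10": 0.88}
auto_accept, needs_review = split_by_confidence(scores)
```

Pages in `needs_review` can be queued for a person, while the rest flow through unattended.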
Pricing
PDF extraction runs at approximately 1 credit per page, with OCR pages at 2 credits. A 10-page digital invoice uses 10 credits and costs $0.02, which works out to about $0.002 per credit. Processing 100 such invoices costs $2.
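For budgeting, the arithmetic can be wrapped in a small estimator. The per-credit price below is not stated directly; it is inferred from the "10 pages costs $0.02" example (10 credits for $0.02, so $0.002 per credit) and should be checked against your actual plan.

```python
from decimal import Decimal

CREDITS_PER_PAGE = 1
CREDITS_PER_OCR_PAGE = 2
# Inferred from the example above (10 credits = $0.02); an assumption.
USD_PER_CREDIT = Decimal("0.002")


def estimate_cost(digital_pages: int, ocr_pages: int = 0) -> tuple[int, Decimal]:
    """Return (credits, dollars) for a batch of digital and OCR pages."""
    credits = digital_pages * CREDITS_PER_PAGE + ocr_pages * CREDITS_PER_OCR_PAGE
    return credits, credits * USD_PER_CREDIT


one_invoice = estimate_cost(10)                 # one 10-page digital invoice
hundred_invoices = estimate_cost(100 * 10)      # 100 such invoices
hundred_scanned = estimate_cost(0, 100 * 10)    # same batch, every page OCR'd
```

Under these assumptions the same 100 invoices cost twice as much when every page needs OCR, so it is worth knowing the scanned share of a batch before estimating.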