Extract Structured Data
Extract structured data from a file or URL using AI.
Upload a file (or provide a URL) along with a typed JSON schema describing the
fields you want extracted. Each field must have a type and description.
Returns structured data matching your schema with guaranteed type conformance.
Example schema:
{
  "invoice_number": {"type": "string", "description": "The invoice number"},
  "total_amount": {"type": "number", "description": "Total amount due", "required": true},
  "status": {"type": "string", "enum": ["paid", "unpaid", "overdue"]},
  "line_items": {
    "type": "array",
    "description": "Each line item",
    "items": {
      "type": "object",
      "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "unit_price": {"type": "number"}
      }
    }
  }
}
Supported types: string, number, integer, boolean, array, object
Supported formats: PDF, DOCX, XLSX, PPTX, CSV, images (JPG, PNG), and any file type supported by the markdown conversion endpoint.
Vision mode: For images or scanned documents, set use_vision=true
for highest accuracy on visual content.
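The schema rules above (a supported type and a description on each field) can be checked locally before uploading. The following is a minimal sketch, not the service's own validation: the helper name schema_problems is hypothetical, it inspects top-level fields only, and the service may apply different or looser rules to nested properties.

```python
# Types listed as supported in this document.
SUPPORTED_TYPES = {"string", "number", "integer", "boolean", "array", "object"}


def schema_problems(schema: dict) -> list[str]:
    """Return human-readable problems found in the top-level schema fields.

    Hypothetical client-side helper; the service may validate differently.
    """
    problems = []
    for name, spec in schema.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: field spec must be an object")
            continue
        if spec.get("type") not in SUPPORTED_TYPES:
            problems.append(f"{name}: unsupported or missing type {spec.get('type')!r}")
        if not spec.get("description"):
            problems.append(f"{name}: missing description")
    return problems


schema = {
    "invoice_number": {"type": "string", "description": "The invoice number"},
    "total_amount": {"type": "number", "description": "Total amount due", "required": True},
    "issued_on": {"type": "date", "description": "Issue date"},  # "date" is not a supported type
}

for problem in schema_problems(schema):
    print(problem)
```

Running a check like this first catches typos in type names without spending a request on them.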
Headers
Body
JSON schema defining fields to extract
File to extract data from
URL to fetch file from
JSON-encoded headers for URL auth
Model: 'fast', 'accurate', or 'auto'
Use vision mode for images and scanned documents
Follow internal links (e.g. /about, /contact) to find more data. URL inputs only.
Set to false to process asynchronously. Returns a task_id to poll.
URL to POST the result to when async processing completes.
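When processing asynchronously, the returned task_id has to be polled until the result is ready. The sketch below shows one way to structure that loop. The status endpoint and its response shape are not described here, so the lookup is passed in as a caller-supplied function (fetch, hypothetical) that returns None while the task is still running and the result dict once it has finished.

```python
import time
from typing import Callable, Optional


def poll_task(
    fetch: Callable[[str], Optional[dict]],
    task_id: str,
    interval: float = 1.0,
    max_attempts: int = 30,
) -> dict:
    """Call fetch(task_id) until it returns a result or attempts run out.

    fetch is an assumption: a function you write around the real status
    request, returning None while the task is pending.
    """
    for _ in range(max_attempts):
        result = fetch(task_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} attempts")


# Stand-in fetcher for illustration: reports "pending" twice, then finishes.
calls = {"n": 0}


def fake_fetch(task_id: str) -> Optional[dict]:
    calls["n"] += 1
    if calls["n"] < 3:
        return None
    return {"task_id": task_id, "data": {}}


result = poll_task(fake_fetch, "task-123", interval=0.0)
```

Supplying a callback URL instead avoids polling altogether; the loop is mainly useful where the client cannot receive inbound requests.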
Response
Successful Response
Response for structured data extraction.
Extracted structured data matching the provided schema. Types are guaranteed when using a typed schema.
Original filename or URL
Model used for extraction (fast or accurate)
How text was extracted: text, ocr, ocr+vision, vision, or dom+text
Length of extracted text processed
True if all required fields were found. False if any required field is missing (partial data still returned).
Per-field calibrated confidence scores (0.0–1.0). Composite of LLM self-report, token logprobs, and OCR quality weighted by extraction method.
Per-field source text snippets from the document
Detected MIME type
Number of document pages actually read (0 for non-paginated files)
Total pages in the document (0 for non-paginated files)
Credits consumed for this extraction
Schema format: always 'typed'. Each field has an explicit type for guaranteed type conformance.
Per-field status (typed schemas only).
List of required fields that could not be found (typed schemas only). Present only when success=false.
Raw quality signals used for confidence calibration. 'ocr_quality': blended OCR quality (0.7 × mean + 0.3 × p10, 0–1, absent for non-OCR). 'doc_quality': geometric mean of LLM token probs (0–1). 'signal_disagreement': mean |llm_self_report - logprob| across fields. Values above 0.3 indicate the signals disagree: the model may be confidently wrong (high logprob, low self-report) or self-deprecating.
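To make the three quality signals concrete, the sketch below recomputes them from raw inputs. It illustrates the formulas as stated, not the service's implementation. Several details are assumptions: p10 is read as the 10th percentile of per-word OCR confidences (with linear interpolation), the per-field logprob signal is taken as a probability in the 0–1 range, and all function and variable names are hypothetical. Inputs must be non-empty.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Percentile (q in 0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def ocr_quality(word_confidences: list[float]) -> float:
    """Blend of mean and 10th percentile: 0.7 * mean + 0.3 * p10."""
    mean = sum(word_confidences) / len(word_confidences)
    return 0.7 * mean + 0.3 * percentile(word_confidences, 10)


def doc_quality(token_probs: list[float]) -> float:
    """Geometric mean of token probabilities, computed in log space."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def signal_disagreement(self_report: dict[str, float], logprob: dict[str, float]) -> float:
    """Mean absolute gap between the two signals over the shared fields."""
    fields = self_report.keys() & logprob.keys()
    return sum(abs(self_report[f] - logprob[f]) for f in fields) / len(fields)


# Illustrative values only, not taken from a real response.
self_report = {"invoice_number": 0.9, "total_amount": 0.3}
token_prob = {"invoice_number": 0.8, "total_amount": 0.9}

gap = signal_disagreement(self_report, token_prob)
needs_review = gap > 0.3  # threshold quoted in the field description above
```

Blending the mean with a low percentile means a few badly recognized words pull the OCR score down more than a plain average would, which suits documents where a single misread digit matters.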