POST /api/v1/extract
Extract Structured Data
curl --request POST \
  --url https://api.example.com/api/v1/extract \
  --header 'Content-Type: multipart/form-data' \
  --form 'schema=<string>' \
  --form 'file=<string>' \
  --form 'url=<string>' \
  --form 'url_headers=<string>' \
  --form model=auto \
  --form use_vision=false \
  --form follow_links=false \
  --form sync=true \
  --form 'webhook_url=<string>'
{
  "citations": {
    "invoice_number": "Invoice #: INV-2024-0042",
    "total_amount": "Total Due: $1,234.56"
  },
  "confidence": {
    "invoice_number": 0.96,
    "total_amount": 0.91,
    "vendor_name": 0.94
  },
  "content_type": "application/pdf",
  "credits_used": 7,
  "data": {
    "invoice_number": "INV-2024-0042",
    "total_amount": 1234.56,
    "vendor_name": "Acme Corp"
  },
  "extraction_method": "text",
  "field_status": {
    "invoice_number": {
      "confidence": 0.96,
      "required": false,
      "status": "found",
      "type": "string"
    },
    "total_amount": {
      "confidence": 0.91,
      "required": true,
      "status": "found",
      "type": "number"
    },
    "vendor_name": {
      "confidence": 0.94,
      "required": false,
      "status": "found",
      "type": "string"
    }
  },
  "filename": "invoice.pdf",
  "model_used": "fast",
  "pages_processed": 50,
  "schema_format": "typed",
  "success": true,
  "text_length": 2450,
  "total_pages": 1200
}

Headers

X-API-Key
string

Body

multipart/form-data
schema
string
required

JSON schema defining fields to extract

file
string | null

File to extract data from

url
string | null

URL to fetch file from

url_headers
string | null

JSON-encoded headers for URL auth
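
A minimal sketch of how the form fields for a URL input might be assembled in Python. `build_url_form_fields` is a hypothetical helper, not part of any SDK, and the `{}` schema string is only a placeholder. The one documented detail it illustrates is that `url_headers` travels as a single JSON-encoded string.

```python
import json
from typing import Dict, Optional


def build_url_form_fields(
    schema_json: str,
    url: str,
    auth_headers: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Assemble form fields for a URL-based extract request (hypothetical helper)."""
    fields = {"schema": schema_json, "url": url}
    if auth_headers:
        # url_headers is one form field holding a JSON-encoded object.
        fields["url_headers"] = json.dumps(auth_headers)
    return fields


fields = build_url_form_fields(
    schema_json="{}",  # placeholder only; supply your real schema here
    url="https://example.com/invoice.pdf",
    auth_headers={"Authorization": "Bearer <token>"},
)
```

The resulting dictionary can be passed to whichever HTTP client you use as multipart form data, together with the `X-API-Key` header.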

model
string
default:auto

Model: 'fast', 'accurate', or 'auto'

use_vision
boolean
default:false

Use vision mode for images and scanned documents

follow_links
boolean
default:false

Follow internal links (e.g. /about, /contact) to find more data. URL inputs only.

sync
boolean
default:true

Set to false to process asynchronously. Returns a task_id to poll.

webhook_url
string | null

URL to POST the result to when async processing completes.
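
When `sync` is false and no webhook is configured, the client has to poll using the returned `task_id`. This page does not document the polling endpoint or the shape of its response, so the sketch below takes both as caller-supplied callables: `fetch_status` and `is_done` are hypothetical, as are the interval and attempt limits.

```python
import time
from typing import Any, Callable


def poll_task(
    task_id: str,
    fetch_status: Callable[[str], Any],
    is_done: Callable[[Any], bool],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch_status(task_id) until is_done(result) is true or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch_status(task_id)
        if is_done(result):
            return result
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Injecting `sleep` keeps the loop easy to test. A webhook avoids polling altogether when your service can receive inbound requests.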

Response

Successful Response

Response for structured data extraction.

data
Data · object
required

Extracted structured data matching the provided schema. Types are guaranteed when using a typed schema.

filename
string
required

Original filename or URL

model_used
string
required

Model used for extraction (fast or accurate)

extraction_method
string
required

How text was extracted: text, ocr, ocr+vision, vision, or dom+text

text_length
integer
required

Length of extracted text processed

success
boolean
default:true

True if all required fields were found. False if any required field is missing (partial data still returned).

confidence
Confidence · object

Per-field calibrated confidence scores (0.0–1.0). Composite of LLM self-report, token logprobs, and OCR quality weighted by extraction method.
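
One way to use these scores is to route low-confidence fields to manual review. The sketch below assumes a parsed response dictionary. The threshold is an arbitrary illustration, not a documented recommendation, and `low_confidence_fields` is a hypothetical helper.

```python
from typing import Any, Dict, List


def low_confidence_fields(response: Dict[str, Any], threshold: float = 0.9) -> List[str]:
    """Return the names of fields whose confidence is below the threshold, sorted."""
    # confidence is optional in the response, so treat a missing value as empty.
    confidence = response.get("confidence") or {}
    return sorted(name for name, score in confidence.items() if score < threshold)


sample = {
    "confidence": {"invoice_number": 0.96, "total_amount": 0.91, "vendor_name": 0.94}
}
print(low_confidence_fields(sample, threshold=0.95))  # ['total_amount', 'vendor_name']
```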

citations
Citations · object

Per-field source text snippets from the document

content_type
string | null

Detected MIME type

pages_processed
integer
default:0

Number of document pages actually read (0 for non-paginated files)

total_pages
integer
default:0

Total pages in the document (0 for non-paginated files)
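
Comparing `pages_processed` with `total_pages` shows whether the whole document was read. A minimal sketch, assuming a parsed response dictionary (`page_coverage` is a hypothetical helper name):

```python
from typing import Any, Dict, Optional


def page_coverage(response: Dict[str, Any]) -> Optional[float]:
    """Return the fraction of pages read, or None for non-paginated files."""
    total = response.get("total_pages", 0)
    processed = response.get("pages_processed", 0)
    if total == 0:
        # Both counters are 0 for non-paginated files, so no ratio exists.
        return None
    return processed / total


coverage = page_coverage({"pages_processed": 50, "total_pages": 1200})
```

In the sample response above, 50 of 1200 pages were read, so fields that appear only on later pages could not have been found.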

credits_used
integer
default:0

Credits consumed for this extraction

schema_format
enum<string> | null

Schema format: always 'typed'. Each field has an explicit type for guaranteed type conformance.

Available options:
typed

field_status
Field Status · object

Per-field status (typed schemas only).

missing_required
string[] | null

List of required fields that could not be found (typed schemas only). Present only when success=false.
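
Because partial data is still returned when `success` is false, a client usually wants both the data and the list of gaps. The sketch below assumes a parsed response dictionary, and `split_result` is a hypothetical helper name.

```python
from typing import Any, Dict, List, Tuple


def split_result(response: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Return (data, missing_required); the list is empty when success is true."""
    data = response.get("data", {})
    if response.get("success", True):
        return data, []
    # missing_required may be null or absent, so fall back to an empty list.
    return data, list(response.get("missing_required") or [])


partial = {
    "success": False,
    "data": {"vendor_name": "Acme Corp"},
    "missing_required": ["total_amount"],
}
data, missing = split_result(partial)
```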

quality_signals
Quality Signals · object

Raw quality signals used for confidence calibration. 'ocr_quality': blended OCR quality (0.7 × mean + 0.3 × p10, range 0–1, absent for non-OCR). 'doc_quality': geometric mean of LLM token probabilities (0–1). 'signal_disagreement': mean |llm_self_report - logprob| across fields. Values above 0.3 indicate that the signals disagree: the model may be confidently wrong (high logprob, low self-report) or self-deprecating.
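
The three formulas above can be sketched as follows. This is not the service's implementation. It assumes that the OCR blend is computed over per-word confidences, that p10 is a nearest-rank 10th percentile, and that the logprob signal has already been converted to a 0–1 score. All function names are hypothetical, and inputs must be non-empty.

```python
import math
from typing import Dict, List


def percentile_nearest_rank(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def blended_ocr_quality(word_confidences: List[float]) -> float:
    """0.7 * mean + 0.3 * p10, so a weak tail pulls the score down."""
    mean = sum(word_confidences) / len(word_confidences)
    return 0.7 * mean + 0.3 * percentile_nearest_rank(word_confidences, 10)


def geometric_mean(token_probs: List[float]) -> float:
    """Geometric mean of probabilities, each assumed to be greater than 0."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))


def signal_disagreement(self_report: Dict[str, float], logprob: Dict[str, float]) -> float:
    """Mean absolute gap between the two signals over the fields they share."""
    shared = self_report.keys() & logprob.keys()
    return sum(abs(self_report[f] - logprob[f]) for f in shared) / len(shared)


gap = signal_disagreement({"a": 0.9, "b": 0.5}, {"a": 0.6, "b": 0.9})
```

Here `gap` is about 0.35, which is above the 0.3 level the description treats as disagreement.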