Extract Structured Data

curl --request POST \
  --url https://api.example.com/api/v1/extract \
  --header 'Content-Type: multipart/form-data' \
  --form 'schema=<string>' \
  --form 'file=<string>' \
  --form 'url=<string>' \
  --form 'url_headers=<string>' \
  --form model=auto \
  --form use_vision=false \
  --form follow_links=false \
  --form sync=true \
  --form 'webhook_url=<string>'

{
  "citations": {
    "invoice_number": "Invoice #: INV-2024-0042",
    "total_amount": "Total Due: $1,234.56"
  },
  "confidence": {
    "invoice_number": 0.96,
    "total_amount": 0.91,
    "vendor_name": 0.94
  },
  "content_type": "application/pdf",
  "credits_used": 7,
  "data": {
    "invoice_number": "INV-2024-0042",
    "total_amount": 1234.56,
    "vendor_name": "Acme Corp"
  },
  "extraction_method": "text",
  "field_status": {
    "invoice_number": {
      "confidence": 0.96,
      "required": false,
      "status": "found",
      "type": "string"
    },
    "total_amount": {
      "confidence": 0.91,
      "required": true,
      "status": "found",
      "type": "number"
    },
    "vendor_name": {
      "confidence": 0.94,
      "required": false,
      "status": "found",
      "type": "string"
    }
  },
  "filename": "invoice.pdf",
  "model_used": "fast",
  "pages_processed": 50,
  "schema_format": "typed",
  "success": true,
  "text_length": 2450,
  "total_pages": 1200
}

Extraction

Extract Structured Data

Extract structured data from a file or URL using AI.

Upload a file (or provide a URL) along with a typed JSON schema describing the fields you want extracted. Each field must have a type and description. Returns structured data matching your schema with guaranteed type conformance.

Example schema:

{
  "invoice_number": {"type": "string", "description": "The invoice number"},
  "total_amount": {"type": "number", "description": "Total amount due", "required": true},
  "status": {"type": "string", "enum": ["paid", "unpaid", "overdue"]},
  "line_items": {
    "type": "array",
    "description": "Each line item",
    "items": {
      "type": "object",
      "properties": {
        "description": {"type": "string"},
        "quantity": {"type": "integer"},
        "unit_price": {"type": "number"}
      }
    }
  }
}

Supported types: string, number, integer, boolean, array, object

Supported formats: PDF, DOCX, XLSX, PPTX, CSV, images (JPG, PNG), and any file type supported by the markdown conversion endpoint.

Vision mode: For images or scanned documents, set use_vision=true for highest accuracy on visual content.

POST

api

extract

Extract Structured Data

curl --request POST \
  --url https://api.example.com/api/v1/extract \
  --header 'Content-Type: multipart/form-data' \
  --form 'schema=<string>' \
  --form 'file=<string>' \
  --form 'url=<string>' \
  --form 'url_headers=<string>' \
  --form model=auto \
  --form use_vision=false \
  --form follow_links=false \
  --form sync=true \
  --form 'webhook_url=<string>'

{
  "citations": {
    "invoice_number": "Invoice #: INV-2024-0042",
    "total_amount": "Total Due: $1,234.56"
  },
  "confidence": {
    "invoice_number": 0.96,
    "total_amount": 0.91,
    "vendor_name": 0.94
  },
  "content_type": "application/pdf",
  "credits_used": 7,
  "data": {
    "invoice_number": "INV-2024-0042",
    "total_amount": 1234.56,
    "vendor_name": "Acme Corp"
  },
  "extraction_method": "text",
  "field_status": {
    "invoice_number": {
      "confidence": 0.96,
      "required": false,
      "status": "found",
      "type": "string"
    },
    "total_amount": {
      "confidence": 0.91,
      "required": true,
      "status": "found",
      "type": "number"
    },
    "vendor_name": {
      "confidence": 0.94,
      "required": false,
      "status": "found",
      "type": "string"
    }
  },
  "filename": "invoice.pdf",
  "model_used": "fast",
  "pages_processed": 50,
  "schema_format": "typed",
  "success": true,
  "text_length": 2450,
  "total_pages": 1200
}

Headers

X-API-Key

string

Body

multipart/form-data

schema

string

required

JSON schema defining fields to extract

file

string | null

File to extract data from

url

string | null

URL to fetch file from

url_headers

string | null

JSON-encoded headers for URL auth

model

string

default:auto

Model: 'fast', 'accurate', or 'auto'

use_vision

boolean

default:false

Use vision mode for images and scanned documents

follow_links

boolean

default:false

Follow internal links (e.g. /about, /contact) to find more data. URL inputs only.

sync

boolean

default:true

Set to false to process asynchronously. Returns a task_id to poll.

webhook_url

string | null

URL to POST the result to when async processing completes.

Response

Successful Response

Response for structured data extraction.

data

Data · object

required

Extracted structured data matching the provided schema. Types are guaranteed when using typed schema.

filename

string

required

Original filename or URL

model_used

string

required

Model used for extraction (fast or accurate)

extraction_method

string

required

How text was extracted: text, ocr, ocr+vision, vision, or dom+text

text_length

integer

required

Length of extracted text processed

success

boolean

default:true

True if all required fields were found. False if any required field is missing (partial data still returned).

confidence

Confidence · object

Per-field calibrated confidence scores (0.0–1.0). Composite of LLM self-report, token logprobs, and OCR quality weighted by extraction method.

Show child attributes

citations

Citations · object

Per-field source text snippets from the document

content_type

string | null

Detected MIME type

pages_processed

integer

default:0

Number of document pages actually read (0 for non-paginated files)

total_pages

integer

default:0

Total pages in the document (0 for non-paginated files)

credits_used

integer

default:0

Credits consumed for this extraction

schema_format

enum<string> | null

Schema format: always 'typed'. Each field has an explicit type for guaranteed type conformance.

Available options:

typed

field_status

Field Status · object

Per-field status (typed schemas only).

Show child attributes

missing_required

string[] | null

List of required fields that could not be found (typed schemas only). Present only when success=false.

quality_signals

Quality Signals · object

Raw quality signals used for confidence calibration. 'ocr_quality': blended OCR quality (0.7mean + 0.3p10, 0–1, absent for non-OCR). 'doc_quality': geometric mean of LLM token probs (0–1). 'signal_disagreement': mean |llm_self_report - logprob| across fields. Values above 0.3 indicate the signals disagree — the model may be confidently wrong (high logprob, low self-report) or self-deprecating.

Show child attributes

Async & Webhooks Batch Extract