Schema Authoring Guide¶
Schemas are the core of Koji. A schema tells the extraction pipeline exactly what data to pull from your documents and where to find it. This guide covers everything from basic field definitions to advanced hint-driven routing.
Schema basics¶
A schema is a YAML file with a name, description, and a set of fields:
name: purchase_order
description: Standard purchase order extraction
fields:
po_number:
type: string
required: true
description: The purchase order number
vendor:
type: string
description: Vendor or supplier name
total:
type: number
description: Total order amount
name-- identifies the schema in logs and outputdescription-- helps you remember what this schema targets (not used by extraction)fields-- the data you want extracted, keyed by field name
Field names become the keys in your output JSON. Use snake_case -- these show up in your downstream systems.
The description on each field is sent to the extraction model. Be specific. "The invoice number, usually in the top-right header" is better than "invoice number".
Field types¶
string¶
The default. Use for names, IDs, addresses, free-text values.
Output: "company_name": "Acme Corp"
number¶
Numeric values. Koji strips currency symbols and commas automatically -- $1,234.56 becomes 1234.56.
Output: "total_amount": 1234.56
Integer values stay as integers (no .0 suffix).
date¶
Dates are normalized to ISO 8601 (YYYY-MM-DD) regardless of the source format.
Input document might say "March 15, 2026" or "03/15/2026" or "2026-03-15" -- all produce:
Output: "invoice_date": "2026-03-15"
enum¶
A constrained set of allowed values. The extraction model picks the closest match.
policy_type:
type: enum
description: Type of insurance policy
options:
- General Liability
- Workers Compensation
- Commercial Property
- Commercial Auto
- Umbrella
- Professional Liability
- Cyber Liability
- Other
Output: "policy_type": "General Liability"
Enum matching is fuzzy — if the document says "Gen. Liability" or "GL", Koji matches it to "General Liability". See Enum matching for details.
mapping¶
Like enum, but with explicit aliases for normalization. Use this when real-world documents have many different ways of writing the same canonical value.
policy_type:
type: mapping
description: Type of insurance policy
mappings:
BOP: ["Business Owners Policy", "Businessowners", "Bus. Owners", "BOP"]
GL: ["General Liability", "CGL", "Commercial General Liability"]
WC: ["Workers Compensation", "Workers Comp", "Work Comp", "WC"]
Each canonical key has a list of aliases. The extracted value is normalized to the canonical key:
- "Business Owners Policy" →
"BOP" - "CGL" →
"GL" - "Workers Comp" →
"WC"
Matching is case-insensitive, with fuzzy substring fallback. Use mapping when downstream systems expect a fixed set of identifiers (e.g., insurance product codes, country codes, currency codes) rather than the raw text the document uses.
array¶
Lists of items. Define the shape of each item with items:
line_items:
type: array
items:
type: object
properties:
description:
type: string
quantity:
type: number
unit_price:
type: number
total:
type: number
Output:
"line_items": [
{
"description": "Consulting services",
"quantity": 40,
"unit_price": 150.00,
"total": 6000.00
},
{
"description": "Travel expenses",
"quantity": 1,
"unit_price": 450.00,
"total": 450.00
}
]
Arrays can also hold simple values:
Output: "tags": ["urgent", "reviewed", "approved"]
boolean¶
True/false values. Koji normalizes common representations automatically:
The following are all recognized as true: "true", "yes", "Y", "X", "1", "checked". And as false: "false", "no", "N", "0", "", "unchecked".
Output: "gl_claims_made": true
Booleans are especially useful with form mappings -- checkbox mapping types detect whether a checkbox is marked and return the boolean value directly.
Required fields¶
Mark fields as required: true when the extraction is incomplete without them:
When a required field is not found:
- The field appears as
nullin the output - Its confidence is marked as
not_found - Koji logs a warning:
Missing required fields: [invoice_number] - Future: gap-filling will broaden the search automatically
Use required sparingly. Not every field needs it -- only fields where a missing value means the extraction failed.
Intake limits¶
Before Koji parses a document or sends a single token to an LLM, an intake integrity check runs. Header validation (MIME matches extension, PDF magic bytes are valid) and "at least one page was produced" are always on and require no configuration. Size, page, and type limits are opt-in per schema via the top-level intake: block:
name: invoice
description: Standard invoice extraction
intake:
max_size_mb: 25 # reject files bigger than 25 MB
max_pages: 50 # reject documents longer than 50 pages
allowed_types: [pdf] # only accept PDFs — block docx, images, etc.
fields:
invoice_number: ...
All three fields are optional. Any integrity failure is surfaced to the caller as an HTTP 400 with a clear reason (e.g. "File is 34.2 MB, exceeds schema limit of 25 MB."). Use limits to protect yourself from runaway cost, oversize uploads, or wrong-type files hitting a pipeline tuned for a specific format.
Recognized canonical types for allowed_types: pdf, docx, xlsx, pptx, png, jpg, tiff, html, md, txt.
Normalization¶
Extraction gives you the values the LLM pulled from the document. Normalization turns those values into the shape your downstream systems actually want — currency as minor units, dates as ISO 8601, phone numbers as E.164, strings trimmed and slugified — without any LLM calls. It's pure Python running after the model returns.
Declare transforms per-field with a normalize: directive:
fields:
vendor_name:
type: string
normalize: [trim, lowercase] # chain transforms, applied in order
total_amount:
type: number
normalize: minor_units # $1,234.56 → 123456 (cents)
invoice_date:
type: date
normalize: iso8601 # "4/3/26" → "2026-04-03"
contact_phone:
type: string
normalize: e164 # (555) 123-4567 → +15551234567
status:
type: enum
normalize: slugify # "Active Status" → "active_status"
Transforms also apply to array-of-object rows — declare them on the inner property:
line_items:
type: array
items:
type: object
properties:
description:
type: string
normalize: trim
total:
type: number
normalize: minor_units
Built-in transforms¶
| Name | Effect |
|---|---|
trim |
Strip leading/trailing whitespace |
lowercase |
ASCII-insensitive lowercasing |
uppercase |
ASCII-insensitive uppercasing |
slugify |
Lowercase + replace non-alphanumerics with _ + strip underscores at edges |
iso8601 |
Parse common date formats (ISO, MM/DD/YYYY, MM-DD-YY, etc.) to YYYY-MM-DD |
minor_units |
Parse currency strings or numbers to integer minor units (cents). ($50.00) → -5000 |
e164 |
Strip phone formatting; prefix +1 for bare 10-digit US numbers |
Transforms are deterministic and fault-tolerant: if a value can't be parsed (e.g. "next Tuesday" through iso8601), the original value is passed through unchanged. Unknown transform names are recorded as warnings in the response's normalization.warnings list rather than raising.
Resolve (field reference lookup)¶
Use resolve to populate a field by looking up another field's value. The template string uses {field_name} syntax to interpolate extracted values into a field name, then returns the value of that field.
fields:
insurer_a:
type: string
insurer_b:
type: string
gl_insurer_letter:
type: string
description: "Letter (A-E) identifying which insurer covers General Liability"
gl_insurer_name:
type: string
resolve: "insurer_{gl_insurer_letter}"
If gl_insurer_letter extracts as "A", the template resolves to "insurer_a", and gl_insurer_name is set to whatever insurer_a contains (e.g. "Trisura Insurance Company").
Resolve runs after all other normalization. It only fills fields that are null or empty -- it won't overwrite a value that was already extracted. This makes it safe to use alongside form mappings where some fields come from coordinates and others from LLM interpretation.
Validation rules¶
Validation runs immediately after normalization. It evaluates a list of schema-declared rules against the extracted output and returns a report indicating which rules passed and which failed. Think of it as "turn extracted JSON into extracted and verified JSON" — the schema is the contract, and validation is the enforcement point before the data leaves Koji for downstream systems.
Declare rules at the top level of the schema:
validation:
- required: [invoice_number, total_amount, invoice_date]
- not_empty: [line_items]
- enum_in:
field: currency
allowed: [USD, EUR, GBP]
- date_order: [issue_date, due_date]
- sum_equals:
field: total_amount
sum_of: line_items.total
tolerance: 0.01
- regex:
field: invoice_number
pattern: "^INV-\\d+$"
Each list entry is a single-key dict. The key is the rule type; the value is either a list of field names (for rules that just reference fields) or a dict of parameters.
Built-in rules¶
| Rule | Shape | Description |
|---|---|---|
required |
list of fields | Fail if any listed field is null, empty string, or empty list |
not_empty |
list of fields | Fail if any listed field has zero length |
enum_in |
{field, allowed} |
Fail if the field's value is not in the allowed list |
date_order |
list of date fields | Fail if the dates are not in ascending order (ties allowed) |
sum_equals |
{field, sum_of, tolerance?} |
Fail if field != sum of the dotted sum_of path (default tolerance 0.01) |
regex |
{field, pattern} |
Fail if the field's value does not match the regex |
sum_of accepts a dotted path that crosses arrays: line_items.total sums row.total across every row.
Response shape¶
Every extraction response now includes a validation block alongside extracted:
{
"extracted": { ... },
"validation": {
"ok": false,
"issues": [
{"rule": "sum_equals", "field": "total_amount", "message": "total_amount=100 but sum of line_items.total=95 (tolerance 0.01)"}
]
},
"normalization": {
"applied": [
{"field": "vendor_name", "transform": "trim"},
{"field": "total_amount", "transform": "minor_units"}
],
"warnings": []
}
}
validation.ok is true only when every rule passes. The issues list contains only failures, so a passing run returns an empty list. Use this to decide whether to accept the extraction into your downstream system, hand it off for human review, or reject it outright.
Validation never raises on malformed rule entries: a bad rule definition surfaces as an issue in the report so your pipeline keeps running and you can see exactly which rule is broken.
Targeting specific document types with apply_to¶
When you run Koji against a packet — a single upload containing multiple stapled-together documents (an invoice + a certificate of insurance + a policy declaration, say) — you usually want each schema to extract only from the section that contains its type of data. The classifier stage in the pipeline can split a packet into typed sections, and the apply_to schema key tells the router which of those sections this schema should run against.
name: insurance_policy
description: Commercial insurance policy extraction
apply_to: [policy] # only run against sections classified as "policy"
fields:
policy_number: ...
The type IDs in apply_to must match the ones declared in your koji.yaml classifier config (see docs/configuration.md for the classify block). You can target multiple types in one schema:
When the classifier is disabled (the default, and the state of every Koji install that hasn't opted in), apply_to is ignored. Adding it to a schema is a no-op under a single-document pipeline — safe to sprinkle in now and activate later.
When the classifier is enabled and a schema has no apply_to, behavior depends on the require_apply_to flag in koji.yaml:
require_apply_to: false(default, forgiving): the schema runs against every section the classifier produces, regardless of type. Good for migration.require_apply_to: true(strict): missingapply_tois a config error and extraction raises a clear message at call time. Turn this on once you have more than a few schemas and want to prevent accidental cross-section extraction.
When apply_to matches multiple sections — a packet with three stapled invoices and an invoice schema — extraction runs once per matching section and each result comes back as its own entry in the output. When it matches zero sections, extraction returns an empty list and an explicit no_matching_section reason. See the classify-split design doc for the full output shape and pipeline contract.
Schema hints¶
Hints are the key differentiator in Koji's extraction pipeline. Instead of sending the entire document to the model and hoping it finds your fields, hints tell the router exactly where to look.
Without hints, Koji uses generic inference -- matching field names and types against document content. This works for simple documents. For complex multi-section documents (insurance policies, contracts, regulatory filings), hints dramatically improve accuracy and reduce token usage.
policy_number:
type: string
required: true
description: Policy number or ID
hints:
look_in: [declarations]
patterns: ["policy.*(?:number|no|#)", "[A-Z]{2,5}\\d{5,}"]
signals: [has_key_value_pairs]
Want a domain-specific signal like
has_policy_numbers? Define it as a custom signal in your schema. See the signals section below.
look_in¶
Routes the field to specific document categories. look_in is a hard filter: when any chunk matches one of the listed categories, the router only considers those chunks for this field. Patterns and signals then rank within that filtered pool. If no chunk matches the listed categories, the router falls back to scoring the full document with the remaining hints so the field still gets routed.
Categories are defined entirely by your schema. Koji ships with no built-in categories — instead, the schema's categories.keywords block tells the mapper which keywords identify which sections of your documents. Without category definitions, every chunk is other and look_in has nothing to match against.
Define categories at the top of your schema:
name: invoice
description: Commercial invoice extraction
categories:
keywords:
header: ["invoice", "bill to", "ship to", "invoice number"]
line_items: ["description", "quantity", "unit price"]
totals: ["subtotal", "tax", "total due", "balance"]
fields:
invoice_number:
type: string
required: true
hints:
look_in: [header]
...
Categories are detected from section titles (strong signal — one keyword in the title matches) and content keywords (weaker signal — requires 2+ keyword matches in the body). Sections that don't match any defined category are labeled other.
For an insurance schema, you might define categories like declarations, endorsement, conditions, exclusions, etc. For a contract, parties, term, compensation, termination. The right categories are the ones that match your document type.
Tuning classification¶
The defaults work well for most documents, but long-section or sparse-keyword documents sometimes need different tradeoffs. Override them under a top-level classification block:
classification:
window: 1500 # chars of chunk content scanned for keywords (default 500)
threshold: 1 # min keyword hits required to match a category (default 2)
scan: head_and_tail # head | all | head_and_tail (default head)
title_priority: true # title keyword match short-circuits content scan (default true)
window— how much of each chunk's body is scanned. Raise it for long sections where the classifying keywords live deep in the body. Lower it on short documents where you want to avoid incidental matches.threshold— how many distinct keywords from the category must appear in the scanned text. The default (2) reduces false positives from single-word overlap. Drop it to 1 when your categories are already specific enough that a single keyword is unambiguous.scan— how the window is sampled from the content:head(default): firstwindowcharactersall: the entire chunk, regardless ofwindowhead_and_tail: firstwindow/2+ lastwindow/2characters — useful when a category's keywords consistently cluster at the top or bottom of a long section
title_priority— whentrue(default), a title match short-circuits the content scan. Set tofalseif your document titles are generic (e.g. "Section 1", "Page 2") and misleading.
Unknown or invalid values silently fall back to defaults, so you can add these knobs without worrying about breaking the schema.
prefer_contains¶
A list of case-insensitive phrases. Chunks whose title or content contains any of the phrases get a strong score bonus (below look_in but above patterns). Use it when the right chunk for a field is reliably identified by a distinctive phrase that regex patterns can't easily express — or when body chunks with generic keyword matches would otherwise outscore the chunk that actually holds the value.
Common pattern: the real value lives in a signature block at the bottom of the document (e.g. SEC filings, contracts), while the body text is full of matches for "filing date" / "dated" that score higher under category and pattern hints alone. prefer_contains boosts the signature chunk so it wins.
The bonus is applied at most once per chunk no matter how many phrases match — if you want an additional bump for stronger matches, use patterns or signals as well.
patterns¶
Regex patterns matched against chunk titles and content. Medium priority -- patterns score below look_in but above signals.
Patterns are matched case-insensitively against the first 1500 characters of each chunk (title + content). Use them to:
- Match labels near your target value:
"effective.*date","total.*premium" - Match the value format itself:
"[A-Z]{2,5}\\d{5,}"for policy numbers - Match section indicators:
"schedule of.*coverage"
Tips:
- Use .* for flexible spacing between words
- Use (?:...) for non-capturing groups
- Keep patterns broad enough to match variations (abbreviations, different formatting)
- One matching pattern is enough -- you don't need all patterns to match
signals¶
Content signals detected automatically by the document mapper. Lowest priority among hints, but useful for disambiguation when multiple chunks could match a field.
Built-in signals:
| Signal | Detects |
|---|---|
has_dollar_amounts |
Currency amounts: $1,234.56, €500, £200, ¥1000, 1234.56 USD, etc. |
has_dates |
Date patterns (MM/DD/YYYY, YYYY-MM-DD, DD.MM.YYYY, etc.) |
has_key_value_pairs |
Lines formatted as Key: Value |
has_tables |
Pipe-delimited table rows (\| ... \| ... \|) |
Signals are boolean — either the chunk has the signal or it doesn't. Each matching signal adds a small score boost.
Custom signals
Built-in signals are purely structural. For domain-specific patterns (policy numbers, invoice numbers, named insured references, etc.), define custom signals in your schema:
signals:
has_policy_numbers:
pattern: "[A-Z]{2,5}\\d{5,}"
has_named_insured:
pattern: "(?:named\\s+insured|policyholder)\\s*[:.]"
flags: "i"
has_invoice_id:
pattern: "INV[\\s-]?\\d{4,}"
flags: "i"
Each custom signal needs a pattern (a regex). Optional flags accept i (case-insensitive), m (multiline), and s (dotall). If the regex matches anywhere in a chunk's content, the signal is set to true and a <name>_count is set to the number of matches.
Once defined, custom signals can be referenced in field hints just like built-in ones:
This is how Koji stays domain-agnostic: structural signals are built in, anything insurance-specific (or invoice-specific, or contract-specific) lives in your schema.
max_chunks¶
By default, each field is routed to the top 3 scoring chunks. Override this for fields that legitimately need to aggregate data from many chunks:
Use this for arrays of objects that span the document. Example: an insurance certificate's policies array, where each policy's detail lives in its own H3 section. The default cap of 3 misses most of the policies; setting max_chunks: 12 lets the router pull from every detail section.
Don't set this for simple scalar fields — it just wastes tokens.
How hints interact¶
look_in is a hard filter. If any chunk matches one of the listed categories, the router considers only those chunks for the field — other chunks are excluded entirely, even if their patterns or signals would have scored higher. Declaring look_in: [declarations] is a promise from the schema author that the value lives in declarations; the router takes the promise at face value.
Within the filtered pool, prefer_contains, patterns, and signals rank which chunks win the slots:
- prefer_contains — +15 points if any phrase is found (applied at most once; matches the
look_inweight so a distinctive phrase is decisive against a body chunk that only matches broad patterns + signals) - patterns — +8 points if any regex pattern matches (only the first match counts)
- signals — +4 points per matching signal
If look_in is set but no chunks match the listed categories (e.g., the schema author referenced a category the document doesn't have), the router falls back to scoring every chunk with patterns + signals so the field still gets routed somewhere. Generic inference (field name matching, type-based signals) is skipped whenever any hint is defined — hints are authoritative.
The top 3 scoring chunks are selected for each field by default (or up to max_chunks if you've set it). Fields that share the same top chunks are grouped into a single extraction call to minimize LLM usage.
When to use hints vs. letting the router infer¶
Skip hints when: - Your document is short (1-3 pages) - Field names are descriptive and match how they appear in the document - There's only one place a value could be
Add hints when: - Documents have multiple sections where a value could appear but only one is correct - The same term appears in different contexts (e.g., "date" appears in 10 places) - You need precision on complex documents (20+ pages) - Extraction is returning values from the wrong section
Start without hints, test extraction, and add hints where accuracy is poor.
Extraction hints¶
description on a field tells Koji (and the reader) what a field means. For tricky fields you also need to tell the model how to pick the right value — especially when the document has many plausible candidates and simple keyword matching isn't enough. That's what extraction_hint is for:
fields:
filing_date:
type: date
required: true
description: Date the filing was submitted to SEC.
extraction_hint: |
The authoritative filing date is in the signature block at the
bottom of the document — look for lines like
"/s/ Officer Name ... Dated: April 9, 2026".
For AMENDMENT forms (10-K/A, 10-Q/A, 8-K/A), the EXPLANATORY NOTE
may reference the ORIGINAL filing date. Do NOT use that — the
filing_date is the date the AMENDMENT was filed, which appears
in the signature block.
period_of_report:
type: date
description: Fiscal period the filing covers.
extraction_hint: |
period_of_report is the fiscal period the filing covers — NOT the
submission date, signature date, or preparer date. Look on the
COVER PAGE for the form-specific label:
- 10-K: "For the fiscal year ended <date>"
- 10-Q: "For the quarterly period ended <date>"
- 8-K: "Date of Report (Date of earliest event reported): <date>"
- DEF 14A: the scheduled meeting date ("to be held on <date>")
Extraction hints are rendered into a dedicated ## Extraction notes block in the prompt the LLM sees, right under the field list. The wording is free-form — write whatever the model needs to disambiguate.
When to use extraction_hint instead of description:
descriptionis a short, reader-facing summary of what the field means. It ends up in the field list line (- filing_date: date — Date the filing was submitted). Keep it under one sentence.extraction_hintis multi-line model-facing guidance about which of several candidates to pick and why. Use it for fields where the document has obvious-looking distractors (e.g. an amendment form's EXPLANATORY NOTE references both the original and current dates).
Hints also flow into the gap-fill pass, so fields that time out on the main extraction attempt still get the guidance on retry. Fields without an extraction_hint don't get an "Extraction notes" block — it's only rendered when at least one field in the group provides one.
Conditional hints based on other fields¶
Sometimes the right guidance for a field depends on another field's value. Classic SEC example: period_of_report means different things across form types — "fiscal year ended" for a 10-K, "quarterly period ended" for a 10-Q, "Date of Report" for an 8-K, and the annual meeting date for a DEF 14A. Writing one extraction_hint covering all of them would overwhelm the model; writing a narrow one would only help for one form.
Two things make this work: depends_on declares that a field's extraction should run after another field, and extraction_hint_by maps the parent field's value to a specific hint:
fields:
form_type:
type: enum
required: true
options: [10-K, 10-K/A, 10-Q, 8-K, DEF 14A]
hints:
look_in: [header]
period_of_report:
type: date
required: true
depends_on: [form_type]
extraction_hint: |
Fallback: the fiscal period the filing covers, on the cover page.
extraction_hint_by:
form_type:
"10-K": "Look for 'For the fiscal year ended <date>' on the cover page."
"10-K/A": "Same fiscal period as the ORIGINAL 10-K this amends — NOT the amendment filing date."
"10-Q": "Look for 'For the quarterly period ended <date>' on the cover page."
"8-K": "Use the 'Date of Report (Date of earliest event reported)' from the cover."
"DEF 14A": "The scheduled annual meeting date ('to be held on <date>' near the top)."
Under the hood, Koji topologically sorts fields into extraction waves. form_type has no depends_on, so it lands in wave 0 and extracts normally. period_of_report depends on form_type, so it lands in wave 1 and only routes/extracts after wave 0 completes. Before wave 1 runs, Koji resolves every dependent field's extraction_hint_by against the values already extracted — in this example, period_of_report's extraction_hint becomes the 10-K/A line if that's what the document turned out to be.
Fallback behavior:
- If the parent field is still null after its wave (extraction failed, optional and missing), the dependent field falls back to its unconditional extraction_hint.
- If the parent's extracted value isn't in the extraction_hint_by map, same fallback.
- Empty or whitespace-only hint strings are ignored — also a fallback.
Cost: within a wave, field grouping still minimizes LLM calls the same way as before. Across waves, dependent fields can't group with their parents, so you pay one extra LLM call per dependent wave. For a typical SEC schema that's 1 extra call per document — worth it for targeted per-form guidance.
Rules:
- depends_on must reference fields defined in the same schema — unknown names raise a schema error.
- Circular dependencies (a depends on b, b depends on a) raise an error at extraction time.
- Self-dependencies are rejected.
- depends_on applies the ordering constraint regardless of whether extraction_hint_by is present, so you can use it just to sequence extraction if that's useful on its own.
If depends_on becomes too heavy for your schema, the alternative is to split the polymorphic field into form-specific fields (period_fiscal_year_end, period_quarter_end, period_date_of_report, period_meeting_date) with narrow hints each, and normalize them at a later layer. Both approaches are supported.
Heading inference¶
The document mapper splits parsed markdown into chunks at # headings. For clean PDFs with structured layout, docling emits headings just fine. For OCR'd scans, invoices, and table-heavy forms, the parsed markdown often comes out with no # markers at all — and the chunker collapses the whole document into one giant chunk.
When that happens, Koji runs a heading inference pass before chunking. It promotes visually prominent standalone lines to ## headings so the chunker has something to split on:
- Bold lines on their own paragraph:
**Bill To**,**Invoice Summary:** - ALL CAPS short lines above content:
INVOICE,SOLD TO:,SECTION 1 - Schema-defined regex patterns (see below)
Inference only runs when the parsed markdown contains zero # headings — well-structured input is left untouched. Lines must start a fresh paragraph (blank line above) to be promoted, which avoids over-promoting bold spans inside flowing prose.
Consecutive bold or ALL CAPS lines separated only by blanks are treated as a single stanza — think cover pages, title blocks, multi-line company names. Short stanzas (up to four lines) are merged into one heading so multi-line titles like **CXJ** / **GROUP CO., Limited** stay intact as a single chunk anchor. Longer stanzas (five or more lines) are assumed to be word-wrapped boilerplate — common when parsers bold every word on an SEC cover page or legal front matter — and nothing is promoted; the whole block falls through to Document Start instead. The stanza resets on any non-heuristic content, so a real chapter heading after the stanza is still detected.
Bold lines whose content is mostly digits or punctuation (phone numbers, ZIP codes, registration IDs) are skipped entirely — they aren't semantic headings even when the parser marks them bold.
Custom heading patterns¶
If your documents have structural markers that don't fit the bold / ALL CAPS heuristics, declare them explicitly:
Patterns must fullmatch the line. They take priority over the generic heuristics and are matched even on short lines that the all-caps rule would skip. A pattern match also breaks out of a stanza, so you can use patterns to carve up sections that the bold/ALL CAPS heuristics would otherwise merge.
Patterns-only mode¶
If your documents have stylistic bold or ALL CAPS lines that aren't actually structural (marketing copy, emphasized phrases, legalese boilerplate), you can disable the generic heuristics while keeping explicit schema patterns:
With generic: false, bold and ALL CAPS lines are left alone and only your declared patterns produce synthetic headings.
Disabling inference¶
If your parser already produces clean headings and you'd rather skip the whole pass:
infer: false is the master kill-switch — it disables both generic heuristics and schema patterns.
Arrays and nested objects¶
Arrays extract repeated structures -- tables, line items, coverage lists, anything that appears multiple times.
Table extraction¶
The most common array pattern extracts tabular data:
coverages:
type: array
description: List of coverages with limits
items:
type: object
properties:
coverage_name:
type: string
limit:
type: string
deductible:
type: string
hints:
look_in: [schedule_of_coverages, declarations]
signals: [has_tables, has_dollar_amounts]
patterns: ["coverage", "limit", "deductible"]
The extraction model identifies rows in tables, bulleted lists, or repeated structures and returns them as an array of objects.
Arrays with hints¶
Hints on array fields route to chunks containing the tabular/repeated data. The has_tables signal is particularly useful -- it fires on any chunk with pipe-delimited markdown tables.
line_items:
type: array
items:
type: object
properties:
description:
type: string
amount:
type: number
hints:
signals: [has_tables, has_dollar_amounts]
patterns: ["item", "description", "amount"]
Reconciliation for arrays¶
When multiple extraction groups return results for the same array field, Koji concatenates and deduplicates them. This means array fields spanning multiple pages or sections are merged automatically.
Enum matching¶
Enum fields constrain extraction to a predefined set of values. Koji applies fuzzy matching in this order:
- Exact match -- value matches an option exactly
- Case-insensitive match --
"general liability"matches"General Liability" - Substring match --
"Gen. Liability"matches"General Liability"(option contains value or value contains option)
If no match is found, the raw extracted value is returned with a validation issue logged.
Best practices for enum options: - Use the full, unabbreviated form as the option value - Include an "Other" option as a catch-all - Keep the list to 15 or fewer options (more options = more ambiguity for the model)
Tips and patterns¶
Invoices¶
Invoices are usually short, well-structured documents. Hints are often unnecessary.
name: invoice
description: Standard invoice extraction
fields:
invoice_number:
type: string
required: true
description: The invoice or reference number
date:
type: date
required: true
description: Invoice issue date
due_date:
type: date
description: Payment due date
vendor:
type: string
description: Vendor or supplier name
bill_to:
type: string
description: Customer or recipient name
line_items:
type: array
items:
type: object
properties:
description:
type: string
quantity:
type: number
unit_price:
type: number
total:
type: number
subtotal:
type: number
description: Subtotal before tax
tax:
type: number
description: Tax amount
total_amount:
type: number
required: true
description: Total amount due
Contracts¶
Contracts are long and multi-section. Use look_in heavily.
name: contract
description: Commercial contract extraction
categories:
keywords:
parties: ["party", "parties", "between", "by and between"]
terms: ["term", "effective date", "commencement", "duration"]
payment: ["payment", "compensation", "fee", "invoice"]
termination: ["termination", "cancel", "expir"]
fields:
party_a:
type: string
required: true
description: First party name
hints:
look_in: [parties]
patterns: ["(?:party|first party|between).*?(?:,|and)"]
party_b:
type: string
required: true
description: Second party name
hints:
look_in: [parties]
patterns: ["(?:and|second party)"]
effective_date:
type: date
hints:
look_in: [terms]
patterns: ["effective.*date", "commenc"]
signals: [has_dates]
termination_date:
type: date
hints:
look_in: [terms, termination]
patterns: ["terminat", "expir", "end.*date"]
signals: [has_dates]
contract_value:
type: number
description: Total contract value or annual fee
hints:
look_in: [payment]
patterns: ["(?:total|contract).*(?:value|amount|fee)"]
signals: [has_dollar_amounts]
Insurance policies¶
Policies are the most complex -- many sections, many fields, overlapping terminology. Use the full hint system.
See schemas/examples/insurance_policy.yaml for a complete working example with custom categories, pattern matching, and signal routing.
Key patterns:
- Define custom categories matching your document's section structure
- Use look_in: [declarations] for most identifying fields (policy number, dates, insured name)
- Use look_in: [schedule_of_coverages] for coverage arrays
- Combine has_dollar_amounts with patterns to distinguish premium from limits
Forms and applications¶
Forms have dense key-value pairs. The has_key_value_pairs signal is your friend.
name: application
description: Insurance application form
fields:
applicant_name:
type: string
required: true
description: Applicant full name
hints:
patterns: ["(?:applicant|insured).*name"]
signals: [has_key_value_pairs]
business_type:
type: enum
options:
- Corporation
- LLC
- Partnership
- Sole Proprietor
- Non-Profit
- Other
hints:
patterns: ["(?:business|entity|organization).*(?:type|form)"]
annual_revenue:
type: number
description: Annual revenue or gross sales
hints:
patterns: ["(?:annual|gross).*(?:revenue|sales|receipts)"]
signals: [has_dollar_amounts]
employee_count:
type: number
description: Number of employees
hints:
patterns: ["(?:number|#|num).*(?:employee|staff|worker)"]
Full example: building a schema from scratch¶
Let's build a schema for medical bills. Walk through the process step by step.
Step 1: Identify the fields you need.
Look at a sample document. What data do you need in your system? Start with the obvious fields:
name: medical_bill
description: Medical bill / explanation of benefits
fields:
patient_name:
type: string
required: true
description: Patient full name
provider_name:
type: string
required: true
description: Healthcare provider or facility name
date_of_service:
type: date
description: Date services were rendered
total_charges:
type: number
description: Total billed charges
amount_due:
type: number
required: true
description: Amount the patient owes
Step 2: Test extraction without hints.
Check the output. Are fields correct? Missing? Pulled from the wrong section?
Step 3: Add arrays for line items.
Medical bills have procedure line items. Add them:
procedures:
type: array
description: List of procedures / services billed
items:
type: object
properties:
cpt_code:
type: string
description:
type: string
charges:
type: number
adjustments:
type: number
patient_responsibility:
type: number
Step 4: Add hints where extraction was inaccurate.
Say date_of_service was pulling the statement date instead of the service date. Add hints:
date_of_service:
type: date
description: Date services were rendered
hints:
patterns: ["(?:date of|dos|service.*date)", "(?:from|through)"]
signals: [has_dates]
Say amount_due was pulling total charges instead of patient responsibility:
amount_due:
type: number
required: true
description: Amount the patient owes after insurance
hints:
patterns: ["(?:amount|balance).*(?:due|owe)", "patient.*(?:responsib|pay)"]
signals: [has_dollar_amounts]
Step 5: Add an enum for categorization.
bill_type:
type: enum
description: Type of medical bill
options:
- Hospital
- Physician
- Laboratory
- Pharmacy
- Dental
- Vision
- Other
Step 6: Final schema.
name: medical_bill
description: Medical bill / explanation of benefits
fields:
patient_name:
type: string
required: true
description: Patient full name
hints:
patterns: ["patient.*name", "member.*name"]
signals: [has_key_value_pairs]
provider_name:
type: string
required: true
description: Healthcare provider or facility name
hints:
patterns: ["provider", "facility", "physician", "hospital"]
date_of_service:
type: date
description: Date services were rendered
hints:
patterns: ["(?:date of|dos|service.*date)", "(?:from|through)"]
signals: [has_dates]
statement_date:
type: date
description: Date the bill/statement was generated
hints:
patterns: ["statement.*date", "bill.*date", "printed"]
signals: [has_dates]
bill_type:
type: enum
options:
- Hospital
- Physician
- Laboratory
- Pharmacy
- Dental
- Vision
- Other
procedures:
type: array
description: List of procedures / services billed
items:
type: object
properties:
cpt_code:
type: string
description:
type: string
charges:
type: number
adjustments:
type: number
patient_responsibility:
type: number
hints:
signals: [has_tables, has_dollar_amounts]
patterns: ["procedure", "service", "cpt", "charge"]
total_charges:
type: number
description: Total billed charges before adjustments
hints:
patterns: ["total.*charge", "gross.*charge"]
signals: [has_dollar_amounts]
insurance_paid:
type: number
description: Amount paid by insurance
hints:
patterns: ["(?:insurance|plan).*paid", "(?:allowed|covered).*amount"]
signals: [has_dollar_amounts]
amount_due:
type: number
required: true
description: Amount the patient owes after insurance
hints:
patterns: ["(?:amount|balance).*(?:due|owe)", "patient.*(?:responsib|pay)"]
signals: [has_dollar_amounts]
Run extraction again. Iterate until accuracy is where you need it. Hints are surgical -- add them only where the router needs guidance.
Key-value pair scanning¶
For documents with structured label-value data (forms, certificates, declarations pages), you can enable automatic key-value pair extraction alongside the schema-driven extraction:
name: insurance_universal_scan
include_kv_pairs: true
fields:
policy_number:
type: string
nullable: true
description: Any policy or certificate number
named_insured:
type: string
nullable: true
description: Primary named insured or policyholder
When include_kv_pairs: true is set, the extraction result includes a kv_pairs array alongside the schema-defined fields:
{
"extracted": {
"policy_number": "BKS-123456-78",
"named_insured": "ABC Corporation"
},
"kv_pairs": [
{ "label": "Policy Number", "value": "BKS-123456-78" },
{ "label": "Named Insured", "value": "ABC Corporation" },
{ "label": "Effective Date", "value": "04/01/2026" },
{ "label": "General Aggregate Limit", "value": "$2,000,000" },
{ "label": "Each Occurrence", "value": "$1,000,000" },
...
]
}
Key differences from schema-driven extraction:
| Schema fields | KV pairs | |
|---|---|---|
| Precision | High — you define exactly what to extract | Lower — finds all label-value patterns |
| Cost | LLM API call per extraction | Zero — pure pattern matching on parsed markdown |
| Coverage | Only the fields you define | Everything that looks like Label: Value |
| Use case | Production pipelines with known document types | Document triage, universal scanning, discovery |
When to use KV pairs:
- Document discovery — "what's in this document?" before writing a schema
- Universal scanning — extract common identifiers (policy numbers, names, dates) from any document type without a specific schema
- Triage and routing — use KV pair content to classify documents and route them to specialized schemas
- Audit — capture everything the document says alongside the schema-driven extraction for compliance
KV pairs detect these patterns:
- Label: Value (colon-separated, including multi-word labels and values)
- **Bold Label**: Value (markdown bold labels)
- | Label | Value | (markdown table rows)
The default is include_kv_pairs: false — KV pairs are not included unless the schema opts in.