Getting Started¶
Zero to structured data in five minutes. This guide walks you through installing Koji, starting a processing cluster, and extracting data from a document.
Prerequisites¶
- Docker Desktop (or Docker Engine with Compose v2) — running, with 8GB+ RAM allocated
- Python 3.11+
- An OpenAI API key (or ollama installed for fully local processing)
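The prerequisites above can be checked from Python before installing anything. This is a minimal sketch, not part of Koji: the function name `check_prerequisites` and its result keys are made up for illustration, and it only confirms that a `docker` executable is on your PATH, not that the daemon is running or how much RAM it has.

```python
import os
import shutil
import sys


def check_prerequisites() -> dict[str, bool]:
    """Return a pass/fail flag for each prerequisite (hypothetical helper)."""
    return {
        # Python 3.11 or newer
        "python_3_11_plus": sys.version_info >= (3, 11),
        # A docker executable on PATH (does not prove the daemon is running)
        "docker_on_path": shutil.which("docker") is not None,
        # Either an OpenAI API key in the environment or a local ollama binary
        "model_backend": bool(os.environ.get("OPENAI_API_KEY"))
        or shutil.which("ollama") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8} {name}")
```

Once Koji is installed, koji doctor (covered below) is the authoritative check; this sketch only mirrors the list above.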
Install¶
Or with pipx:
Verify it worked:
Updating? uv tool upgrade koji pulls the latest.
Initialize a project¶
Koji ships with a set of domain templates so you can scaffold a project with a working schema in one command:
This creates:
Available templates¶
| Template | What you get |
|---|---|
| invoice | Invoice number, vendor, dates, totals, line items |
| receipt | POS receipt: merchant, items, tax, tip, payment method |
| contract | Contract: parties, term, effective/expiration dates, governing law |
| insurance | Commercial insurance policy with category routing and hints |
| form | Government form: name, DOB, address, checkboxes |
List them at any time:
Plain koji init myproject (no template) creates just a koji.yaml so you can define your own schema from scratch. --quickstart still works and is an alias for --template invoice.
Configure¶
Set your OpenAI API key in the OPENAI_API_KEY environment variable so Koji can pass it through to containers:
Templates ship with a working pipeline already wired up. The generated koji.yaml from --template invoice looks like this:
project: myproject
cluster:
  base_port: 9400
pipeline:
  - step: parse
    engine: docling
  - step: extract
    model: openai/gpt-4o-mini
    schemas:
      - ./schemas/invoice.yaml
output:
  structured: ./output/
Using ollama instead? Set model: llama3.2 and make sure ollama is running locally. No API key needed.
Running plain koji init with no template? You'll get just the project, cluster, and output sections. Add a pipeline: block yourself when you're ready to wire up extraction.
Start the cluster¶
This pulls Docker images and starts the processing services: a parse engine (docling), extraction service, API server, and dashboard. First run takes a minute or two for image pulls. Subsequent starts are fast.
The dashboard is at http://127.0.0.1:9400.
Check that everything came up with koji status:
If something looks wrong, run the diagnostic tool, koji doctor:
Koji Doctor
✓ Docker installed (Docker version 27.x.x)
✓ Docker Compose available
✓ Docker daemon running
✓ koji.yaml found
✓ koji.yaml valid (project: myproject)
✓ Ports available (base: 9400)
✓ OPENAI_API_KEY set
7 passed, 0 warning, 0 failed
koji doctor checks Docker, your config file, port availability, and API keys. Fix anything marked with a failure before proceeding.
Process your first document¶
Run the full pipeline (parse + extract) on a document:
This sends the document through the parse step (PDF to markdown), then extracts structured data using your schema. Results are written to ./output/:
{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 40,
      "unit_price": 150.00,
      "total": 6000.00
    }
  ],
  "total_amount": 6000.00,
  "currency": "USD"
}
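Because the result is plain JSON, it is easy to sanity-check downstream. The sketch below is not a Koji feature: `find_total_mismatches` is a hypothetical helper, the JSON is the sample result from above embedded as a string (in practice you would read whichever file Koji wrote under ./output/), and it assumes totals carry no tax or discount.

```python
import json
import math

# Sample result copied from the example above.
RESULT_JSON = """
{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "line_items": [
    {"description": "Consulting services", "quantity": 40,
     "unit_price": 150.00, "total": 6000.00}
  ],
  "total_amount": 6000.00,
  "currency": "USD"
}
"""


def find_total_mismatches(result: dict) -> list[str]:
    """Return a message for every arithmetic inconsistency (hypothetical check)."""
    problems = []
    items = result.get("line_items", [])
    for i, item in enumerate(items):
        # Each line total should equal quantity * unit_price.
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=0.01):
            problems.append(f"line_items[{i}]: expected {expected}, got {item['total']}")
    # The invoice total should equal the sum of line totals (assumes no tax).
    line_sum = sum(item["total"] for item in items)
    if not math.isclose(line_sum, result["total_amount"], abs_tol=0.01):
        problems.append(f"total_amount: expected {line_sum}, got {result['total_amount']}")
    return problems


result = json.loads(RESULT_JSON)
print(find_total_mismatches(result))  # prints [] for the sample above
```

A non-empty list is a cue to inspect the source document or tighten the field descriptions in your schema.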
You can also process an entire directory:
Extract from existing markdown¶
Already have parsed markdown from a previous run? Skip the slow parse step and go straight to extraction:
This is much faster and useful for iterating on your schema. The --model flag lets you override the model from your config on the fly.
Options:
| Flag | Description |
|---|---|
| --schema, -s | Path to extraction schema (required) |
| --model, -m | Model override (e.g., openai/gpt-4o-mini, llama3.2) |
| --output, -o | Output directory (default: ./output/) |
| --strategy | Extraction strategy: parallel (default) or agent |
Write your own schema¶
A schema is a YAML file that tells Koji what to extract. Here's the structure:
name: purchase_order
description: Purchase order extraction
fields:
  po_number:
    type: string
    required: true
    description: The purchase order number
  vendor:
    type: string
    description: Vendor or supplier name
  items:
    type: array
    items:
      type: object
      properties:
        description:
          type: string
        quantity:
          type: number
        unit_price:
          type: number
Field types: string, number, date, enum, mapping, array. Arrays can hold nested objects with their own properties — see the Schema Authoring Guide for the full reference.
The description on each field matters because it guides the extraction model. Be specific about what the field represents and where it typically appears in the document.
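Since descriptions carry so much weight, it can help to flag fields that lack one before running an extraction. The following is a rough sketch, not something Koji provides: `lint_schema` and `KNOWN_TYPES` are hypothetical names, and the schema is the purchase_order example written out as the dict a YAML loader would return.

```python
# Field types named in this guide.
KNOWN_TYPES = {"string", "number", "date", "enum", "mapping", "array"}

# The purchase_order schema from above, as a parsed dict.
SCHEMA = {
    "name": "purchase_order",
    "description": "Purchase order extraction",
    "fields": {
        "po_number": {
            "type": "string",
            "required": True,
            "description": "The purchase order number",
        },
        "vendor": {"type": "string", "description": "Vendor or supplier name"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
            },
        },
    },
}


def lint_schema(schema: dict) -> list[str]:
    """Warn about top-level fields with an unknown type or no description."""
    warnings = []
    for name, spec in schema.get("fields", {}).items():
        if spec.get("type") not in KNOWN_TYPES:
            warnings.append(f"{name}: unknown type {spec.get('type')!r}")
        if not spec.get("description"):
            warnings.append(f"{name}: no description")
    return warnings


print(lint_schema(SCHEMA))  # prints ['items: no description']
```

The sketch only inspects top-level fields; nested properties inside arrays would need a recursive walk.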
Schema hints¶
For complex documents, add hints to improve extraction accuracy. Hints tell the extraction pipeline where to look and what patterns to expect:
fields:
  invoice_number:
    type: string
    required: true
    description: The invoice number
    hints:
      look_in: [header]
      patterns: ["invoice\\s*(?:number|no|#)"]
      signals: [has_key_value_pairs]
- look_in: which document sections to search (sections you define yourself in categories.keywords)
- patterns: regex patterns that indicate where the value lives
- signals: structural cues like has_dollar_amounts, has_dates, has_tables, has_key_value_pairs. You can also define your own custom signals via regex.
See schemas/examples/insurance_policy.yaml for a complete working example with custom categories, hints, and patterns. The Schema Authoring Guide has the complete reference.
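It is worth trying a hint pattern against a few sample lines before putting it in a schema. The sketch below uses Python's `re` module on the pattern from the example above. How Koji itself applies patterns is not described here, so the case-insensitive flag and the sample lines are assumptions made for illustration.

```python
import re

# The YAML string "invoice\\s*(?:number|no|#)" unescapes to this regex.
# re.IGNORECASE is an assumption, not documented Koji behavior.
PATTERN = re.compile(r"invoice\s*(?:number|no|#)", re.IGNORECASE)

# Made-up header lines to try the pattern against.
SAMPLES = [
    "Invoice Number: INV-2026-0042",
    "INVOICE # 0042",
    "Invoice No. 17",
    "Purchase Order 991",
]

for line in SAMPLES:
    label = "match" if PATTERN.search(line) else "no match"
    print(f"{label:9} {line}")
```

Note that the `no` alternative also matches the start of words such as "note", so a line like "Invoice notes" would match too; adding a word boundary (`\b`) after the group is one way to tighten it.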
What's next¶
Useful commands while you work:
koji status # cluster health
koji logs # all service logs
koji logs extract -f # follow extraction service logs
koji stop # shut down the cluster
Once you have a schema you trust, you can lock it in with regression tests and benchmarks:
koji test --schema schemas/invoice.yaml # run schema regression tests
koji bench --corpus ./corpus --model openai/gpt-4o-mini # benchmark across a corpus
Further reading:
- Schemas: full schema authoring guide covering field types, hints, arrays, enums, custom signals
- Configuration Reference: every koji.yaml option
- CLI Reference: every command and flag
- Architecture: how the pipeline works