Getting Started

Zero to structured data in five minutes. This guide walks you through installing Koji, starting a processing cluster, and extracting data from a document.

Prerequisites

  • Docker Desktop (or Docker Engine with Compose v2) — running, with 8GB+ RAM allocated
  • Python 3.11+
  • An OpenAI API key (or ollama installed for fully local processing)

Install

uv tool install git+https://github.com/getkoji/koji.git

Or with pipx:

pipx install git+https://github.com/getkoji/koji.git

Verify it worked:

koji version
# koji 0.9.0

Updating? uv tool upgrade koji pulls the latest.

Initialize a project

Koji ships with a set of domain templates so you can scaffold a project with a working schema in one command:

koji init myproject --template invoice
cd myproject

This creates:

myproject/
  koji.yaml              # pipeline configuration
  schemas/
    invoice.yaml         # extraction schema

Available templates

Template    What you get
invoice     Invoice number, vendor, dates, totals, line items
receipt     POS receipt: merchant, items, tax, tip, payment method
contract    Contract: parties, term, effective/expiration dates, governing law
insurance   Commercial insurance policy with category routing and hints
form        Government form: name, DOB, address, checkboxes

List them at any time:

koji init --list-templates

Plain koji init myproject (no template) creates just a koji.yaml so you can define your own schema from scratch. --quickstart still works and is an alias for --template invoice.

Configure

Set your OpenAI API key so Koji can pass it through to containers:

export OPENAI_API_KEY="sk-..."

Templates ship with a working pipeline already wired up. The generated koji.yaml from --template invoice looks like this:

project: myproject

cluster:
  base_port: 9400

pipeline:
  - step: parse
    engine: docling

  - step: extract
    model: openai/gpt-4o-mini
    schemas:
      - ./schemas/invoice.yaml

output:
  structured: ./output/

Using ollama instead? Set model: llama3.2 and make sure ollama is running locally. No API key needed.

Running plain koji init with no template? You'll get just the project, cluster, and output sections — add a pipeline: block yourself when you're ready to wire up extraction.

Start the cluster

koji start

This pulls Docker images and starts the processing services: a parse engine (docling), extraction service, API server, and dashboard. First run takes a minute or two for image pulls. Subsequent starts are fast.

The dashboard is at http://127.0.0.1:9400.

Check that everything came up:

koji status

If something looks wrong, run the diagnostic tool:

koji doctor
Koji Doctor

  ✓ Docker installed (Docker version 27.x.x)
  ✓ Docker Compose available
  ✓ Docker daemon running
  ✓ koji.yaml found
  ✓ koji.yaml valid (project: myproject)
  ✓ Ports available (base: 9400)
  ✓ OPENAI_API_KEY set

7 passed, 0 warning, 0 failed

koji doctor checks Docker, your config file, port availability, and API keys. Fix anything marked with a failure before proceeding.

Process your first document

Run the full pipeline (parse + extract) on a document:

koji process ./invoice.pdf --schema schemas/invoice.yaml

This sends the document through the parse step (PDF to markdown), then extracts structured data using your schema. Results are written to ./output/:

{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 40,
      "unit_price": 150.00,
      "total": 6000.00
    }
  ],
  "total_amount": 6000.00,
  "currency": "USD"
}
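Before trusting extracted numbers downstream, it can help to sanity-check them. The following is a minimal sketch, not part of Koji: it parses the sample result shown above from an inline string and checks that each line item and the grand total add up. The function name check_invoice_totals and the 0.01 tolerance are illustrative choices; in practice you would load the JSON file Koji wrote under ./output/ instead of the inline sample.

```python
import json
import math

# Sample result copied from the example above. In a real run, read the JSON
# file from ./output/ instead (its exact filename is not specified here).
SAMPLE = """
{
  "invoice_number": "INV-2026-0042",
  "date": "2026-03-15",
  "vendor": "Acme Corp",
  "line_items": [
    {
      "description": "Consulting services",
      "quantity": 40,
      "unit_price": 150.00,
      "total": 6000.00
    }
  ],
  "total_amount": 6000.00,
  "currency": "USD"
}
"""


def check_invoice_totals(result: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the totals agree."""
    problems = []
    running = 0.0
    for i, item in enumerate(result.get("line_items", [])):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(
                f"line {i}: {item['quantity']} x {item['unit_price']} != {item['total']}"
            )
        running += item["total"]
    if not math.isclose(running, result["total_amount"], abs_tol=tolerance):
        problems.append(
            f"line items sum to {running}, total_amount is {result['total_amount']}"
        )
    return problems


result = json.loads(SAMPLE)
problems = check_invoice_totals(result)
print(problems or "totals are consistent")
```

A check like this catches the common case where the model misreads a digit in one place but not another.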

You can also process an entire directory:

koji process ./documents/ --schema schemas/invoice.yaml

Extract from existing markdown

Already have parsed markdown from a previous run? Skip the slow parse step and go straight to extraction:

koji extract ./output/invoice.md --schema schemas/invoice.yaml --model openai/gpt-4o-mini

Skipping the parse step makes each run much faster, which helps when you are iterating on a schema. The --model flag lets you override the model from your config on the fly.

Options:

Flag            Description
--schema, -s    Path to extraction schema (required)
--model, -m     Model override (e.g., openai/gpt-4o-mini, llama3.2)
--output, -o    Output directory (default: ./output/)
--strategy      Extraction strategy: parallel (default) or agent
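When iterating on a schema with koji extract, it is handy to see exactly which fields changed between two runs. The sketch below is a hypothetical helper, not a Koji feature: it compares two result dictionaries and lists added, removed, and changed fields. The sample dictionaries are made up for illustration; in practice you would load two JSON results saved from successive runs.

```python
def diff_results(before: dict, after: dict, prefix: str = "") -> list[str]:
    """List field-level differences between two extraction results."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        path = f"{prefix}{key}"
        if key not in before:
            changes.append(f"+ {path} = {after[key]!r}")
        elif key not in after:
            changes.append(f"- {path} (was {before[key]!r})")
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Recurse into nested objects so the report names the inner field.
            changes.extend(diff_results(before[key], after[key], prefix=f"{path}."))
        elif before[key] != after[key]:
            changes.append(f"~ {path}: {before[key]!r} -> {after[key]!r}")
    return changes


# Illustrative data: the second run picked up the full vendor name and a total.
run_1 = {"invoice_number": "INV-2026-0042", "vendor": "Acme", "currency": "USD"}
run_2 = {
    "invoice_number": "INV-2026-0042",
    "vendor": "Acme Corp",
    "total_amount": 6000.0,
    "currency": "USD",
}

for line in diff_results(run_1, run_2):
    print(line)
```

Lists are compared as whole values here, which keeps the helper short; a per-item diff of line_items would need extra logic.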

Write your own schema

A schema is a YAML file that tells Koji what to extract. Here's the structure:

name: purchase_order
description: Purchase order extraction

fields:
  po_number:
    type: string
    required: true
    description: The purchase order number

  vendor:
    type: string
    description: Vendor or supplier name

  items:
    type: array
    items:
      type: object
      properties:
        description:
          type: string
        quantity:
          type: number
        unit_price:
          type: number

Field types: string, number, date, enum, mapping, array. Arrays can hold nested objects with their own properties — see the Schema Authoring Guide for the full reference.

The description on each field matters: it guides the extraction model. Be specific about what the field represents and where it typically appears in the document.
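To make the meaning of required and type concrete, here is a small sketch that checks an extraction result against a hand-written Python mirror of the purchase_order schema above. It is an illustration only and does not reproduce Koji's own validation. The names SCHEMA_FIELDS, PYTHON_TYPES, and check_result are hypothetical, the sample values are made up, and it assumes a missing field shows up as absent or null in the result.

```python
# Hand-written mirror of the purchase_order schema (top-level fields only).
SCHEMA_FIELDS = {
    "po_number": {"type": "string", "required": True},
    "vendor": {"type": "string"},
    "items": {"type": "array"},
}

# Mapping from schema type names to Python types, limited to the types used above.
PYTHON_TYPES = {"string": str, "number": (int, float), "array": list}


def check_result(result: dict, fields: dict) -> list[str]:
    """Return a list of problems found in an extraction result."""
    problems = []
    for name, spec in fields.items():
        value = result.get(name)
        if value is None:
            # Assumption: absent and null both mean "not extracted".
            if spec.get("required"):
                problems.append(f"missing required field: {name}")
            continue
        expected = PYTHON_TYPES[spec["type"]]
        if not isinstance(value, expected):
            problems.append(
                f"{name}: expected {spec['type']}, got {type(value).__name__}"
            )
    return problems


good = {"po_number": "PO-1001", "vendor": "Acme Corp", "items": []}
bad = {"vendor": 42}

print(check_result(good, SCHEMA_FIELDS))
print(check_result(bad, SCHEMA_FIELDS))
```

The second call reports two problems: po_number is required but missing, and vendor is a number rather than a string. The optional items field is skipped silently.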

Schema hints

For complex documents, add hints to improve extraction accuracy. Hints tell the extraction pipeline where to look and what patterns to expect:

fields:
  invoice_number:
    type: string
    required: true
    description: The invoice number
    hints:
      look_in: [header]
      patterns: ["invoice\\s*(?:number|no|#)"]
      signals: [has_key_value_pairs]

  • look_in: which document sections to search (sections you define yourself in categories.keywords)
  • patterns: regex patterns that indicate where the value lives
  • signals: structural cues like has_dollar_amounts, has_dates, has_tables, has_key_value_pairs. You can also define your own custom signals via regex.
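It is worth testing a hint pattern on sample text before putting it in a schema. The sketch below uses Python's re module to try the invoice_number pattern from the example above. Two assumptions are labeled in the code: that the pattern is applied case-insensitively, and that Python's regex syntax is close enough to what Koji uses. Neither is stated in this guide. The sample strings are made up.

```python
import re

# In YAML, the double-quoted string "invoice\\s*(?:number|no|#)" unescapes to
# the regex below, with a single backslash before the s.
PATTERN = r"invoice\s*(?:number|no|#)"

# Assumption: case-insensitive matching with Python regex semantics.
# How Koji actually applies patterns is not documented in this guide.
matcher = re.compile(PATTERN, re.IGNORECASE)

samples = [
    "Invoice Number: INV-2026-0042",
    "INVOICE # 0042",
    "Invoice No. 17",
    "Invoice date: 2026-03-15",
    "Invoice notes: net 30",
]

for text in samples:
    print(f"{text!r}: {bool(matcher.search(text))}")
```

The first three samples match and "Invoice date" does not. The last sample also matches, because "no" is a prefix of "notes"; if that is too loose for your documents, tightening the alternation (for example with a word boundary, \b, after it) is one option.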

See schemas/examples/insurance_policy.yaml for a complete working example with custom categories, hints, and patterns. The Schema Authoring Guide has the complete reference.

What's next

Useful commands while you work:

koji status          # cluster health
koji logs            # all service logs
koji logs extract -f # follow extraction service logs
koji stop            # shut down the cluster

Once you have a schema you trust, you can lock it in with regression tests and benchmarks:

koji test --schema schemas/invoice.yaml      # run schema regression tests
koji bench --corpus ./corpus --model openai/gpt-4o-mini   # benchmark across a corpus

Further reading: