Skip to content

Configuration Reference

Koji is configured through a single koji.yaml file in your project root. Every option has a sensible default -- a minimal config can be just a project name.

For a walkthrough of setting up your first config, see Getting Started.

Minimal config

project: myproject

Full example

project: myproject

cluster:
  name: default
  base_port: 9400

services:
  parse: true
  ollama: true

pipeline:
  - step: parse
    engine: docling

  - step: extract
    model: openai/gpt-4o-mini
    schemas:
      - ./schemas/invoice.yaml

models:
  providers:
    openai:
      backend: openai
      api_key: ${OPENAI_API_KEY}
    local:
      backend: ollama
      endpoint: http://localhost:11434

output:
  structured: ./output/
  vectors: ./vectors/
  raw_markdown: ./markdown/

webhooks:
  - url: https://my-app.com/api/koji-callback
    events: [job.completed, job.failed]
    secret: my-hmac-secret

project

Type string
Default "koji"
Required No

The project name. Used as a namespace for Docker containers, logs, and the dashboard.

project: invoice-processing

cluster

Cluster-level settings that control networking and service identity.

cluster.name

Type string
Default "default"
Required No

Name of the cluster. Useful when running multiple Koji clusters on the same machine.

cluster.base_port

Type integer
Default 9400
Required No

Base port for the cluster. All service ports are derived from this value using fixed offsets:

Service Offset Default port
UI (dashboard) +0 9400
API server +1 9401
Ollama +10 9410
Parse +11 9411
Extract +12 9412

To run a second cluster on the same machine, set a different base_port:

cluster:
  name: production
  base_port: 9500
# dashboard at :9500, server at :9501, parse at :9511, etc.

cluster.version

Type string
Default "latest"
Required No

The image tag to pull from ghcr.io/getkoji. Defaults to latest. Pin a specific release for reproducible deployments:

cluster:
  version: v0.2.0

cluster.dev

Type boolean
Default false
Required No

Build images from local source instead of pulling from ghcr.io/getkoji. Required when developing on Koji itself. The koji start --dev CLI flag sets this for one invocation.

cluster:
  dev: true

Most users should leave this false and let Koji pull pre-built images.


services

Toggle optional services on or off. Disabling a service prevents Koji from starting its container.

services.parse

Type boolean
Default true
Required No

Enable the parse service. Set to false if you only need extraction from pre-parsed markdown (via koji extract).

services.ollama

Type boolean
Default true
Required No

Enable the bundled ollama service for local model inference. Set to false if you are using only API-based providers (e.g., OpenAI) and don't need local models.

services:
  parse: true
  ollama: false  # using OpenAI only, no local models needed

pipeline

Type list[PipelineStep]
Default []
Required No (but nothing processes without it)

An ordered list of processing steps. Each step defines one stage of the document processing pipeline. Steps are independent services -- use the full pipeline or any subset.

Pipeline step fields

Field Type Default Description
step string Required. Step type: parse or extract.
engine string null Processing engine for parsing (e.g., docling).
model string null Model in provider/model-name format (e.g., openai/gpt-4o-mini, ollama/llama3.2).
schemas list[string] null Paths to schema YAML files for extraction.
ocr string null OCR engine for the parse step (engine-specific).
strategy string "intelligent" Extraction strategy: intelligent (default), parallel, or agent.
categories list[string] null (parallel strategy only) Restrict extraction to these chunk categories. Ignored by the default intelligent strategy, which routes via schema hints.
max_tokens integer null Maximum token limit for model calls in this step.

Parse step

Converts documents (PDF, Word, images) into clean markdown.

pipeline:
  - step: parse
    engine: docling

Extract step

Extracts structured data from markdown using schemas and an LLM.

pipeline:
  - step: extract
    model: openai/gpt-4o-mini
    strategy: parallel
    schemas:
      - ./schemas/invoice.yaml
      - ./schemas/receipt.yaml
    max_tokens: 4096

Reference models as provider/model-name. The provider name must match a key under models.providers, or use a well-known provider prefix like openai/ or ollama/.

Classify step (optional)

Inserts a document-type classifier between parse and extract. This is the stage that handles packets — a single upload containing multiple stapled-together documents — by splitting the input into typed sections that downstream schemas can target via their apply_to field.

pipeline:
  - step: parse
    engine: docling

  - step: classify
    model: openai/gpt-4o-mini         # cheap model recommended; separate from extract model
    require_apply_to: false           # default false; set true to error on schemas missing apply_to
    short_doc_chunks: 2               # docs at or below this chunk count skip the classifier
    coalesce_other_threshold: 0.5     # if one type covers ≥50% of chunks + rest are `other`, collapse to a single typed section
    types:
      - id: invoice
        description: Commercial invoice with line items, bill-to, and totals.
      - id: coi
        description: Certificate of insurance showing coverage, policyholder, and insurer.
      - id: policy
        description: Insurance policy document with declarations, coverages, endorsements.
      - id: sec_filing
        description: SEC EDGAR filing (10-K, 10-Q, 8-K, DEF 14A, or amendment variants).

  - step: extract
    model: openai/gpt-4o-mini
    schemas:
      - ./schemas/invoice.yaml
      - ./schemas/insurance_policy.yaml

Presence of the classify step is the only way to turn classification on. When it's absent, the pipeline is byte-identical to single-document processing — no behavior change for existing deployments.

Each entry in types declares a document type the classifier is allowed to emit. Koji ships with no built-in type taxonomy; you declare the types your pipeline cares about. The reserved type other is always valid (used as a catch-all) and document is used internally as a fallback when the classifier has no useful answer. Type IDs are referenced from schema files via apply_to — see schema-guide.md for how schemas opt into a specific type.

require_apply_to: when false (default), a schema without apply_to runs against every section the classifier produces. When true, such a schema is a config error at extraction time. Forgiving is fine for small deployments; strict mode is safer once you have many schemas and want to prevent accidental cross-section extraction.

short_doc_chunks: the inclusive chunk-count threshold below which the classifier is bypassed entirely. Documents at or below this size return a single fallback document section with no LLM call — useful for tiny synthetic fixtures, truncated cover pages, and any upload where there's too little structure for a classifier to help. The default is 2. Set to 0 to disable the fast path and force the classifier on every document. The fallback document section is specifically designed to match any schema's apply_to list, so you won't silently lose extraction when the classifier has no opinion.

coalesce_other_threshold: a post-classify safety net that undoes LLM over-splitting on multi-section single documents (the motivating case: a DEF 14A proxy statement where the cover and voting-rights section get labeled sec_filing but the comp-table and signature sections get labeled other, halving extraction). If any other sections are present and one non-other type covers at least this fraction of chunks, the classifier output is collapsed into a single section of the dominant type covering the whole document. Defaults to 0.5. Set to 0 to disable — which you'd want only if your pipeline regularly processes genuine packets where a small minority section of type X is surrounded by a majority of type Y and both need independent extraction. When coalesce fires, classifier.coalesced_type in the response records the type that won.

The classify step issues one extra LLM call per document (unless the short-doc fast path applies). That call shows up in koji bench output as its own cost line alongside the extract call, so you can measure the overhead on your corpus before deciding whether to enable it in production.


models

Configuration for model providers. Mix local and API providers freely.

models.providers

Type dict[string, ModelProviderConfig]
Default {}
Required No

A map of provider names to their configuration. The key is a label you choose (e.g., openai, local, anthropic).

Provider config fields

Field Type Default Description
backend string null Provider backend: openai, ollama, anthropic, etc.
api_key string null API key. Supports ${VAR} environment variable syntax.
endpoint string null Custom API endpoint URL. Required for self-hosted providers.
format string null Response format hint (provider-specific).
models:
  providers:
    openai:
      backend: openai
      api_key: ${OPENAI_API_KEY}
    local:
      backend: ollama
      endpoint: http://localhost:11434
    custom:
      backend: openai
      api_key: ${CUSTOM_API_KEY}
      endpoint: https://my-inference-server.com/v1

Reference models in pipeline steps as provider/model-name:

pipeline:
  - step: extract
    model: openai/gpt-4o-mini   # uses the openai provider
  - step: classify
    model: local/llama3.2       # uses the local (ollama) provider

output

Controls where processed results are written.

output.structured

Type string
Default "./output/"
Required No

Directory for structured extraction output (JSON files).

output.vectors

Type string
Default null (disabled)
Required No

Directory for vector embeddings output. When set, Koji writes vector representations alongside structured output.

output.raw_markdown

Type string
Default null (disabled)
Required No

Directory for raw markdown from the parse step. Useful for debugging or re-running extraction without re-parsing.

output:
  structured: ./output/
  vectors: ./vectors/
  raw_markdown: ./markdown/

webhooks

Type list[WebhookConfig]
Default []
Required No

Webhooks receive HTTP POST notifications when processing events occur. Each webhook is delivered asynchronously and does not block the pipeline.

Webhook config fields

Field Type Default Description
url string -- Required. Endpoint URL to receive webhook deliveries.
events list[string] ["job.completed", "job.failed"] Events that trigger this webhook.
secret string null HMAC-SHA256 secret for signing payloads. When set, deliveries include an X-Koji-Signature header.

Supported events

Event Fired when
job.completed A processing job finishes successfully.
job.failed A processing job fails.

Webhook payload format

{
  "event": "job.completed",
  "timestamp": "2026-04-11T12:00:00+00:00",
  "data": {
    "filename": "invoice.pdf",
    "schema": "invoice",
    "extracted": { "...": "..." },
    "elapsed_ms": 2340
  }
}

Webhook delivery headers

Header Description
Content-Type application/json
X-Koji-Event Event name (e.g., job.completed)
X-Koji-Signature HMAC-SHA256 hex digest of the raw JSON body (only when secret is set)
webhooks:
  - url: https://my-app.com/api/koji-callback
    events: [job.completed, job.failed]
    secret: my-hmac-secret

Environment variables

These environment variables affect Koji at runtime:

Variable Description Default
KOJI_CONFIG_PATH Path to koji.yaml inside the server container. /etc/koji/koji.yaml
KOJI_SCHEMAS_DIR Directory where schema YAML files are stored. ./schemas/
OPENAI_API_KEY OpenAI API key. Must be set before koji start to pass through to containers. --

Use ${VAR_NAME} syntax anywhere in koji.yaml to reference environment variables:

models:
  providers:
    openai:
      api_key: ${OPENAI_API_KEY}