Skip to content

Koji Documentation

Koji is a self-hosted, config-driven document processing platform. Parse and extract structured data from any document — PDFs, Word, images, scans — using local models or any OpenAI-compatible API provider.

Get started

New to Koji? Start here:

  • Getting Started — install, configure, and extract data from your first document in five minutes

Core concepts

  • Schemas — define what to extract: field types, hints, arrays, enums, custom signals
  • Form Mappings — extract from fixed-layout PDFs by position: draw boxes, map to fields, extract at near-zero cost
  • Configuration — full koji.yaml reference
  • Architecture — how the pipeline works (map → route → extract → validate → reconcile)

Reference

  • CLI Reference — all commands, flags, and usage examples
  • API Reference — every HTTP endpoint, request/response shape, and example

Key ideas

Config-driven. One YAML schema defines what to extract. The pipeline is generic — domain knowledge lives in your schema, not in the engine.

Self-hosted. Runs on your infrastructure. docker compose for development, the same containers for production. Documents never leave the network you control unless you want them to.

Model-agnostic. Use any OpenAI-compatible API provider, local models via ollama, or your own inference endpoint. Mix and match per pipeline step.

Open source. Apache 2.0. Read the code, fork it, run it anywhere.