Document Conversion

Convert PDF, HWPX, DOCX, XLSX, and text files into Markdown for workflows and RAG

Gencow document conversion turns uploaded files into faithful Markdown for retrieval pipelines, citations, and long-running workflow jobs.

The public app code calls wf.services.document.convert() from a workflow(). The platform owns provider routing, cost guards, provider secrets, and self-hosted converter endpoints.

When to use it

Use document conversion when an app needs to:

  • ingest uploaded files into RAG corpora
  • preserve document headings, lists, tables, and reading order as Markdown
  • handle Korean office formats such as HWP, HWPX, and HWPML
  • run OCR or VLM conversion only when explicitly requested

Do not call document conversion from a regular procedure.query or procedure.mutation. Conversion can call external services, so it is available only inside a workflow.

Basic workflow

import { workflow, v } from "@gencow/core";

export const convertDocument = workflow("documents.convert", {
    args: {
        storageId: v.string(),
        filename: v.string(),
        mimeType: v.optional(v.string()),
    },
    handler: async (wf, { storageId, filename, mimeType }) => {
        return wf.services.document.convert({
            storageId,
            filename,
            mimeType,
            corpus: "knowledge-base",
            visibility: "shared",
            provider: "auto",
        });
    },
});

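The result contains the Markdown and plain-text output, along with sections, pages, warnings, and provider trace metadata:

// Inside a workflow handler, where wf and storageId are in scope.
const result = await wf.services.document.convert({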
    storageId,
    filename: "policy.hwpx",
    corpus: "policies",
    visibility: "shared",
    provider: "auto",
});

result.markdown;
result.text;
result.sections;
result.pages;
result.warnings;
result.providerTrace.provider; // "local_text", "kordoc", "opendataloader", "openai", ...
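
As a minimal sketch of how a workflow might consume these fields before indexing, the helpers below prefer Markdown and fall back to plain text. The ConvertResultSketch interface and both helper names are hypothetical; the field types are assumptions based only on the names shown above, not the published SDK types.

```typescript
// Hypothetical, partial view of the conversion result. Only fields named on
// this page are listed, and their exact types are assumptions.
interface ConvertResultSketch {
    markdown: string;
    text: string;
    warnings: string[];
    providerTrace: { provider: string };
}

// Prefer Markdown for indexing; fall back to plain text when Markdown is empty.
function pickIndexableBody(result: ConvertResultSketch): string {
    const markdown = result.markdown.trim();
    return markdown.length > 0 ? markdown : result.text.trim();
}

// Build a one-line summary suitable for workflow logs (hypothetical helper).
function summarizeConversion(result: ConvertResultSketch): string {
    return `${result.providerTrace.provider}: ${result.warnings.length} warning(s)`;
}
```

Keeping the fallback in one place means downstream chunking code never has to check which representation the provider produced.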

Supported formats

| Format | Support | Default auto path |
|---|---|---|
| TXT, Markdown, HTML, CSV | Supported | local_text |
| PDF | Supported | kordoc quality gate -> opendataloader when configured |
| HWP, HWPX, HWPML | Supported | kordoc -> opendataloader when configured |
| DOCX | Supported | kordoc -> opendataloader when configured, with local fallback where available |
| XLSX | Supported | kordoc -> opendataloader when configured |
| PPTX | Not supported yet | Planned separately |
| Legacy DOC, PPT, XLS | Not supported | Convert to modern Office formats first |

If browsers upload HWPX as application/octet-stream, Gencow infers the document type from the filename extension before routing.
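
The extension fallback can be pictured roughly as follows. This is an illustrative sketch, not the platform's routing code: the DocumentKind names, the extension and MIME maps, and the rule that a specific MIME type takes precedence over the extension are all assumptions.

```typescript
// Hypothetical document kinds used only for this illustration.
type DocumentKind = "text" | "pdf" | "hwp" | "docx" | "xlsx" | "unknown";

// Assumed extension map covering the supported formats listed above.
const KIND_BY_EXTENSION = new Map<string, DocumentKind>([
    ["txt", "text"], ["md", "text"], ["html", "text"], ["csv", "text"],
    ["pdf", "pdf"],
    ["hwp", "hwp"], ["hwpx", "hwp"], ["hwpml", "hwp"],
    ["docx", "docx"],
    ["xlsx", "xlsx"],
]);

// Assumed MIME map; only a few unambiguous types are included.
const KIND_BY_MIME = new Map<string, DocumentKind>([
    ["text/plain", "text"], ["text/markdown", "text"],
    ["text/html", "text"], ["text/csv", "text"],
    ["application/pdf", "pdf"],
]);

function inferDocumentKind(filename: string, mimeType?: string): DocumentKind {
    // A recognized, specific MIME type wins (an assumption of this sketch).
    const byMime = KIND_BY_MIME.get((mimeType ?? "").toLowerCase());
    if (byMime !== undefined) return byMime;

    // Generic or missing MIME types such as application/octet-stream fall
    // through to the filename extension.
    const dot = filename.lastIndexOf(".");
    const extension = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
    return KIND_BY_EXTENSION.get(extension) ?? "unknown";
}
```

Passing the original filename along with storageId is what makes this fallback possible, which is why the workflow arguments above include it.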

Safe auto routing

provider: "auto" is the recommended default. It avoids accidental AI spend by trying local or self-hosted providers first and stopping before paid providers unless you opt in.

Text-like files:
  local_text

PDF/DOCX/XLSX/HWP/HWPX/HWPML:
  kordoc
  -> opendataloader
  -> stop with DOCUMENT_PAID_FALLBACK_REQUIRED

auto does not silently call Gemini, OpenAI, OCR, or custom VLM providers. To allow paid fallback, set paidFallback: true and cap the request:

await wf.services.document.convert({
    storageId,
    filename: "scan.pdf",
    mimeType: "application/pdf",
    corpus: "knowledge-base",
    visibility: "shared",
    provider: "auto",
    paidFallback: true,
    maxServiceCredits: 500,
});

If paidFallback: true is set without maxServiceCredits, the platform default cap still applies before any paid provider call.
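
The ordering described in this section can be modeled as a small pure function. This is a sketch of the documented order only, not the platform router: the option names other than paidFallback are hypothetical, and "paid_provider" is a placeholder for whichever paid provider the platform would select.

```typescript
// Steps a request may pass through under provider: "auto".
type RouteStep =
    | "local_text"
    | "kordoc"
    | "opendataloader"
    | "paid_provider" // placeholder, not a real provider name
    | "DOCUMENT_PAID_FALLBACK_REQUIRED";

// Hypothetical inputs for the model.
interface AutoRouteOptions {
    textLike: boolean;                 // TXT, Markdown, HTML, CSV
    opendataloaderConfigured: boolean; // platform-side configuration
    paidFallback?: boolean;            // caller opt-in
}

function planAutoRoute(options: AutoRouteOptions): RouteStep[] {
    // Text-like files never leave the local parser.
    if (options.textLike) return ["local_text"];

    const steps: RouteStep[] = ["kordoc"];
    if (options.opendataloaderConfigured) steps.push("opendataloader");

    // Without an explicit opt-in the chain ends with an error code instead
    // of a paid provider call.
    steps.push(options.paidFallback === true ? "paid_provider" : "DOCUMENT_PAID_FALLBACK_REQUIRED");
    return steps;
}
```

The list is a worst-case path: a real request stops at the first provider that returns a usable result.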

Explicit providers

Use an explicit provider when you want a specific route and do not want auto selection.

await wf.services.document.convert({
    storageId,
    filename: "contract.pdf",
    mimeType: "application/pdf",
    corpus: "contracts",
    visibility: "shared",
    provider: "kordoc",
});

Supported provider names are:

auto
local_text
kordoc
opendataloader
gemini
openai
ocr
custom_vlm

local_text is a runtime-local parser for text-like files. Platform document proxy requests reject it because the platform proxy only owns external and self-hosted providers.
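
One way to keep provider names honest in app code is a literal union with a type guard, sketched below. The identifiers are hypothetical and not part of @gencow/core. The proxy check encodes only the local_text rule stated above; treating every other name as acceptable is an assumption.

```typescript
// Provider names exactly as listed above.
const DOCUMENT_PROVIDERS = [
    "auto",
    "local_text",
    "kordoc",
    "opendataloader",
    "gemini",
    "openai",
    "ocr",
    "custom_vlm",
] as const;

type DocumentProvider = (typeof DOCUMENT_PROVIDERS)[number];

// Narrow an arbitrary string (for example, from user settings) to a provider.
function isDocumentProvider(value: string): value is DocumentProvider {
    return (DOCUMENT_PROVIDERS as readonly string[]).indexOf(value) !== -1;
}

// Mirrors the rule that platform document proxy requests reject local_text.
function isAllowedOnPlatformProxy(provider: DocumentProvider): boolean {
    return provider !== "local_text";
}
```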

Modes

| Mode | Behavior |
|---|---|
| auto | Default. Uses safe auto routing. |
| text-only | Uses local extraction only and fails if no usable local text exists. |
| no-external-provider | Same safety intent as text-only for external denial paths. |
| prefer-external | Tries external/self-hosted providers first, then local text when allowed. |
| force-external | Requires external provider routing. Paid providers are allowed. |
| force-ocr | Uses OCR-oriented provider routing and prompt selection. |
Kordoc operations

Kordoc-backed formats require platform-side configuration:

{
  "document": {
    "providers": {
      "kordoc": {
        "url": "http://kordoc.internal:5004",
        "token": "replace-me",
        "timeoutMs": 120000
      }
    }
  }
}

The platform process does not import or run Kordoc directly. It calls the internal infra/kordoc-server HTTP service with X-Internal-Token. Production and non-loopback Kordoc servers must require INTERNAL_TOKEN or KORDOC_INTERNAL_TOKEN.
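
The token rule can be expressed as a startup check along these lines. This is a sketch under assumptions, not the infra/kordoc-server implementation: the function names, the loopback test, and the precedence of INTERNAL_TOKEN over KORDOC_INTERNAL_TOKEN are all hypothetical.

```typescript
// True for hosts that only accept connections from the same machine.
function isLoopbackHost(host: string): boolean {
    const h = host.toLowerCase();
    return h === "localhost" || h === "::1" || h === "[::1]" || /^127(\.\d{1,3}){3}$/.test(h);
}

// Read the token from either variable; the precedence is an assumption.
function resolveInternalToken(env: Record<string, string | undefined>): string | undefined {
    return env["INTERNAL_TOKEN"] || env["KORDOC_INTERNAL_TOKEN"] || undefined;
}

// Returns an error message when the policy is violated, otherwise null.
function checkTokenPolicy(
    host: string,
    production: boolean,
    env: Record<string, string | undefined>,
): string | null {
    const tokenRequired = production || !isLoopbackHost(host);
    if (tokenRequired && resolveInternalToken(env) === undefined) {
        return "INTERNAL_TOKEN or KORDOC_INTERNAL_TOKEN must be set";
    }
    return null;
}

// Header the platform would attach when calling the service.
function internalTokenHeader(token: string): Record<string, string> {
    return { "X-Internal-Token": token };
}
```

Failing at startup rather than at request time keeps a misconfigured server from ever listening without authentication.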

Provider URLs, tokens, prompts, models, and paid fallback caps are platform-owned configuration. They must not be injected into tenant app environment variables.

See Deployment for platform runtime config.