Document Conversion
Convert PDF, HWPX, DOCX, XLSX, and text files into Markdown for workflows and RAG
Gencow document conversion turns uploaded files into faithful Markdown for retrieval pipelines, citations, and long-running workflow jobs.
The public app code calls wf.services.document.convert() from a workflow(). The platform owns provider routing, cost guards, provider secrets, and self-hosted converter endpoints.
When to use it
Use document conversion when an app needs to:
- ingest uploaded files into RAG corpora
- preserve document headings, lists, tables, and reading order as Markdown
- handle Korean office formats such as HWP, HWPX, and HWPML
- run OCR or VLM conversion only when explicitly requested
Do not call document conversion from a normal procedure.query or procedure.mutation. Conversion can call external services, so it is workflow-only.
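Because the converter preserves headings as Markdown, a RAG ingester can split the output into heading-scoped chunks before embedding. The following is a minimal sketch of that idea, not part of the Gencow API: chunkByHeading and MarkdownChunk are hypothetical names, and the function only assumes it receives the Markdown string returned by the conversion.

```typescript
// Hypothetical helper (not a Gencow API): split converted Markdown into
// heading-scoped chunks that a RAG pipeline could embed separately.
interface MarkdownChunk {
  heading: string; // heading text, or "" for content before the first heading
  level: number;   // 1-6 for ATX headings, 0 for the preamble
  body: string;    // text under the heading, trimmed
}

function chunkByHeading(markdown: string): MarkdownChunk[] {
  const chunks: MarkdownChunk[] = [];
  let current: MarkdownChunk = { heading: "", level: 0, body: "" };
  let inFence = false;

  const flush = (): void => {
    // Skip an empty preamble; keep headings even when their body is empty.
    if (current.heading !== "" || current.body.trim() !== "") chunks.push(current);
  };

  for (const line of markdown.split(/\r?\n/)) {
    // Track fenced code so "# comment" lines inside code are not read as headings.
    if (/^\s*(```|~~~)/.test(line)) inFence = !inFence;
    const match = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      const [, hashes = "", text = ""] = match;
      flush();
      current = { heading: text.trim(), level: hashes.length, body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  flush();
  return chunks.map((c) => ({ ...c, body: c.body.trim() }));
}
```

Chunking on headings keeps each embedded passage aligned with a citable section of the source document.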
Basic workflow
import { workflow, v } from "@gencow/core";
export const convertDocument = workflow("documents.convert", {
  args: {
    storageId: v.string(),
    filename: v.string(),
    mimeType: v.optional(v.string()),
  },
  handler: async (wf, { storageId, filename, mimeType }) => {
    return wf.services.document.convert({
      storageId,
      filename,
      mimeType,
      corpus: "knowledge-base",
      visibility: "shared",
      provider: "auto",
    });
  },
});
The result includes both plain text and Markdown, plus provider trace metadata:
const result = await wf.services.document.convert({
  storageId,
  filename: "policy.hwpx",
  corpus: "policies",
  visibility: "shared",
  provider: "auto",
});
result.markdown;
result.text;
result.sections;
result.pages;
result.warnings;
result.providerTrace.provider; // "local_text", "kordoc", "opendataloader", "openai", ...
Supported formats
| Format | Support | Default auto path |
|---|---|---|
| TXT, Markdown, HTML, CSV | Supported | local_text |
| PDF | Supported | kordoc quality gate -> opendataloader when configured |
| HWP, HWPX, HWPML | Supported | kordoc -> opendataloader when configured |
| DOCX | Supported | kordoc -> opendataloader when configured, with local fallback where available |
| XLSX | Supported | kordoc -> opendataloader when configured |
| PPTX | Not supported yet | planned separately |
| legacy DOC, PPT, XLS | Not supported | convert to modern Office formats first |
If browsers upload HWPX as application/octet-stream, Gencow infers the document type from the filename extension before routing.
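The extension fallback described above can be sketched as follows. This is an illustration under stated assumptions, not Gencow's actual router: inferDocumentKind, DocumentKind, and both lookup tables are hypothetical, and the exact extensions and MIME types the platform recognizes may differ.

```typescript
// Hypothetical sketch of extension-based type inference; not Gencow's router.
type DocumentKind = "text" | "pdf" | "hwp" | "docx" | "xlsx" | "unsupported";

// Assumed extension list, based on the formats in the table above.
const EXTENSION_KINDS = new Map<string, DocumentKind>([
  ["txt", "text"], ["md", "text"], ["html", "text"], ["csv", "text"],
  ["pdf", "pdf"],
  ["hwp", "hwp"], ["hwpx", "hwp"], ["hwpml", "hwp"],
  ["docx", "docx"],
  ["xlsx", "xlsx"],
]);

// Assumed MIME types that are specific enough to trust directly.
const MIME_KINDS = new Map<string, DocumentKind>([
  ["text/plain", "text"], ["text/markdown", "text"],
  ["text/html", "text"], ["text/csv", "text"],
  ["application/pdf", "pdf"],
]);

function inferDocumentKind(filename: string, mimeType?: string): DocumentKind {
  const mime = mimeType?.toLowerCase();
  // A generic or missing MIME type carries no signal, so fall through to the extension.
  if (mime !== undefined && mime !== "application/octet-stream") {
    const fromMime = MIME_KINDS.get(mime);
    if (fromMime !== undefined) return fromMime;
  }
  const dot = filename.lastIndexOf(".");
  const ext = dot >= 0 ? filename.slice(dot + 1).toLowerCase() : "";
  return EXTENSION_KINDS.get(ext) ?? "unsupported";
}
```

For example, inferDocumentKind("policy.hwpx", "application/octet-stream") resolves to "hwp" because the generic MIME type is ignored in favor of the extension.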
Safe auto routing
provider: "auto" is the recommended default. It avoids accidental AI spend by trying local or self-hosted providers first and stopping before paid providers unless you opt in.
Text-like files:
  local_text
PDF/DOCX/XLSX/HWP/HWPX/HWPML:
  kordoc
  -> opendataloader
  -> stop with DOCUMENT_PAID_FALLBACK_REQUIRED
auto does not silently call Gemini, OpenAI, OCR, or custom VLM providers. To allow paid fallback, set paidFallback: true and cap the request:
await wf.services.document.convert({
  storageId,
  filename: "scan.pdf",
  mimeType: "application/pdf",
  corpus: "knowledge-base",
  visibility: "shared",
  provider: "auto",
  paidFallback: true,
  maxServiceCredits: 500,
});
If paidFallback: true is set without maxServiceCredits, the platform default cap still applies before any paid provider call.
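One way to use this opt-in is to try the free route first and retry with a capped paid fallback only when the safe route stops. The sketch below illustrates that pattern under assumptions that this page does not confirm: that DOCUMENT_PAID_FALLBACK_REQUIRED surfaces as a thrown error carrying a code property, and that the argument shape matches the examples above. ConvertArgs, Convert, isPaidFallbackRequired, and convertWithCappedFallback are hypothetical names.

```typescript
// Assumed argument shape, taken from the field names used on this page.
interface ConvertArgs {
  storageId: string;
  filename: string;
  mimeType?: string;
  corpus: string;
  visibility: string;
  provider: string;
  paidFallback?: boolean;
  maxServiceCredits?: number;
}

// The converter is injected so the helper does not depend on a specific runtime.
type Convert<R> = (args: ConvertArgs) => Promise<R>;

// ASSUMPTION: the stop condition is thrown as an object with a `code` property.
function isPaidFallbackRequired(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "DOCUMENT_PAID_FALLBACK_REQUIRED"
  );
}

async function convertWithCappedFallback<R>(
  convert: Convert<R>,
  args: ConvertArgs,
  maxServiceCredits: number,
  allowPaid: boolean,
): Promise<R> {
  try {
    // First attempt: safe auto routing with no paid providers.
    return await convert({ ...args, provider: "auto" });
  } catch (err) {
    // Rethrow anything that is not the paid-fallback stop, or when paid use is not approved.
    if (!allowPaid || !isPaidFallbackRequired(err)) throw err;
    // Second attempt: explicit opt-in with a per-request credit cap.
    return convert({ ...args, provider: "auto", paidFallback: true, maxServiceCredits });
  }
}
```

Inside a workflow handler, the injected function would typically be (a) => wf.services.document.convert(a), with allowPaid driven by an app-level setting so that paid spend stays an explicit decision.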
Explicit providers
Use an explicit provider when you want a specific route and do not want auto selection.
await wf.services.document.convert({
  storageId,
  filename: "contract.pdf",
  mimeType: "application/pdf",
  corpus: "contracts",
  visibility: "shared",
  provider: "kordoc",
});
Supported provider names are:
auto
local_text
kordoc
opendataloader
gemini
openai
ocr
custom_vlm
local_text is a runtime-local parser for text-like files. Platform document proxy requests reject it because the platform proxy only owns external and self-hosted providers.
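When a provider name comes from user input or configuration, it can help to validate it against the list above before calling the converter. The following is a minimal sketch: PROVIDER_NAMES mirrors the names on this page, while ProviderName, isProviderName, and isRejectedByPlatformProxy are hypothetical helpers rather than exports of @gencow/core.

```typescript
// Provider names as listed on this page; the helpers below are hypothetical.
const PROVIDER_NAMES = [
  "auto",
  "local_text",
  "kordoc",
  "opendataloader",
  "gemini",
  "openai",
  "ocr",
  "custom_vlm",
] as const;

type ProviderName = (typeof PROVIDER_NAMES)[number];

// Type guard: narrows an arbitrary string to a known provider name.
function isProviderName(value: string): value is ProviderName {
  return (PROVIDER_NAMES as readonly string[]).includes(value);
}

// Mirrors the rule above: the platform proxy rejects the runtime-local parser.
function isRejectedByPlatformProxy(provider: ProviderName): boolean {
  return provider === "local_text";
}
```

Keeping the list in a single const array means the union type and the runtime check cannot drift apart.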
Modes
| Mode | Behavior |
|---|---|
| auto | Default. Uses safe auto routing. |
| text-only | Uses local extraction only and fails if no usable local text exists. |
| no-external-provider | Same safety intent as text-only for external denial paths. |
| prefer-external | Tries external/self-hosted providers first, then local text when allowed. |
| force-external | Requires external provider routing. Paid providers are allowed. |
| force-ocr | Uses OCR-oriented provider routing and prompt selection. |
Kordoc operations
Kordoc-backed formats require platform-side configuration:
{
  "document": {
    "providers": {
      "kordoc": {
        "url": "http://kordoc.internal:5004",
        "token": "replace-me",
        "timeoutMs": 120000
      }
    }
  }
}
The platform process does not import or run Kordoc directly. It calls the internal infra/kordoc-server HTTP service with X-Internal-Token. Production and non-loopback Kordoc servers must require INTERNAL_TOKEN or KORDOC_INTERNAL_TOKEN.
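A platform operator might want a pre-flight check that catches a missing or placeholder token before deployment. The sketch below is one possible check, assuming the config shape shown above; KordocConfig, isLoopbackHost, validateKordocConfig, and kordocHeaders are hypothetical names, and treating "replace-me" as an unset token is an assumption based on the sample value.

```typescript
// Hypothetical pre-flight check for the kordoc provider block shown above.
interface KordocConfig {
  url: string;
  token?: string;
  timeoutMs?: number;
}

function isLoopbackHost(hostname: string): boolean {
  // URL.hostname keeps brackets around IPv6 literals, so strip them first.
  const h = hostname.replace(/^\[|\]$/g, "");
  return h === "localhost" || h === "::1" || /^127(\.\d{1,3}){3}$/.test(h);
}

// Returns a list of problems; an empty list means the config passed the check.
function validateKordocConfig(cfg: KordocConfig, production: boolean): string[] {
  let host: string;
  try {
    host = new URL(cfg.url).hostname;
  } catch {
    return ["url is not a valid absolute URL"];
  }
  const problems: string[] = [];
  // ASSUMPTION: the sample value "replace-me" counts as an unset token.
  const tokenMissing = !cfg.token || cfg.token === "replace-me";
  if (tokenMissing && (production || !isLoopbackHost(host))) {
    problems.push("token is required for production and non-loopback Kordoc servers");
  }
  if (cfg.timeoutMs !== undefined && (!Number.isInteger(cfg.timeoutMs) || cfg.timeoutMs <= 0)) {
    problems.push("timeoutMs must be a positive integer");
  }
  return problems;
}

// Builds the internal auth header named in the paragraph above.
function kordocHeaders(cfg: KordocConfig): Record<string, string> {
  return cfg.token ? { "X-Internal-Token": cfg.token } : {};
}
```

Running a check like this at startup fails fast on a misconfigured token instead of surfacing the problem as a rejected conversion request later.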
Provider URLs, tokens, prompts, models, and paid fallback caps are platform-owned configuration. They must not be injected into tenant app environment variables.
See Deployment for platform runtime config.