RAG & Memory

Document ingestion, semantic search, and agent memory systems

This guide covers the RAG (Retrieval-Augmented Generation) and Memory components in detail.

RAG — Document Search Pipeline

RAG enables your AI to answer questions based on your own documents.

Setup

gencow add RAG

This creates:

gencow/rag.ts — createRag({ embeddingModel, answerModel }), the rag singleton, legacy local RAG helpers, and canonical grounded-answer facades
gencow/schema-rag.ts — local rag_documents table for the legacy helpers

Important: Re-export schema-rag.ts from your schema so the tables are created:
// gencow/schema.ts
export * from "./schema-rag";

Grounded answers use a different corpus path. rag.askGrounded(), rag.compareCorpus(), and rag.extractTopics() read canonical Phase 2 rag_* tables populated through documents.ingest.*; documents inserted by rag.ingest() are only available to rag.retrieve() / rag.ask().

Two RAG Surfaces

	Provider API	Facade API
Entry point	`createRag({ embeddingModel, answerModel })`	`import { rag } from "./rag"`
Best for	Explicit model injection, tests, AI SDK-native composition	Existing generated starter code and the default singleton
Local helpers	`ingest()`, `retrieve()`, `ask()`	`ingest()`, `retrieve()`, `ask()`, `askGrounded()`, `compareCorpus()`, `extractTopics()`

Provider API

Use the factory when you want explicit model selection:

import { createGencowAI } from "./ai";
import { createRag } from "./rag";

const gencow = createGencowAI();
const customRag = createRag({
    embeddingModel: gencow.embeddingModel("text-embedding-3-small"),
    answerModel: gencow.languageModel("gpt-5.4-mini"),
});

await customRag.ingest(ctx, "manual.md", documentText);

const hits = await customRag.retrieve(ctx, "refund policy?");
// → [{ chunk, source, similarity, metadata }]

const answer = await customRag.ask(ctx, "refund policy?");

The generated schema-rag.ts starter stores embeddings in rag_documents.embedding as vector(1536). If you switch to an embedding model with a different output dimension, update the schema and re-index existing vectors before mixing old and new rows.
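
A dimension mismatch is easier to catch before a row is written than after old and new vectors are mixed. The following is a minimal sketch, assuming you add your own guard: `assertEmbeddingDimension` and `EXPECTED_DIM` are hypothetical names and are not part of the generated starter.

```typescript
// Hypothetical guard, not part of the generated gencow/rag.ts.
// EXPECTED_DIM mirrors the vector(1536) column described for schema-rag.ts.
const EXPECTED_DIM = 1536;

function assertEmbeddingDimension(
    embedding: number[],
    expected: number = EXPECTED_DIM,
): number[] {
    if (embedding.length !== expected) {
        throw new Error(
            `Embedding has ${embedding.length} dimensions but the column expects ${expected}; ` +
                "update the schema and re-index before mixing vectors.",
        );
    }
    return embedding;
}

// Usage: validate a vector before it is written to the table.
const checked = assertEmbeddingDimension(new Array<number>(1536).fill(0));
console.log(checked.length); // 1536
```

Failing loudly here keeps a model swap from silently producing rows that cannot be compared with the existing index.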

Facade API

Use the generated singleton when you want the default compatibility surface:

import { rag } from "./rag";

await rag.ingest(ctx, "manual.md", documentText);
const hits = await rag.retrieve(ctx, "refund policy?");
const answer = await rag.ask(ctx, "refund policy?");
const grounded = await rag.askGrounded(ctx, "refund policy?", {
    corpus: "default",
    visibility: "shared",
});

rag.askGrounded() does not read rows inserted by rag.ingest(). Use the canonical ingest pipeline below when you need grounded citations.

Canonical RAG Foundation (Recommended)

For production RAG and grounded answers, use the built-in canonical pipeline:

storage file -> documents.ingest.start -> rag_* tables -> ctx.search() / ctx.grounding.answer()

The canonical tables are rag_corpora, rag_sources, rag_sections, rag_chunks, rag_ingest_jobs, and rag_operation_metrics. This path is tenant-scoped and is the only path used by grounded citations.

In deployed apps, canonical ingest creates chunk embeddings through the OpenAI-compatible /platform/ai/v1/embeddings route. The runtime no longer uses ctx.ai for document ingest. In local development, if no embedding target is configured, chunks are still stored for keyword search and their embedding remains null. If a configured proxy returns a non-JSON response, ingest fails with an error that includes the upstream status, route, and content type.
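
The non-JSON failure mode can be pictured with a small stand-alone sketch. Everything here is an assumption made for illustration: `UpstreamReply`, `parseEmbeddingsReply`, and the error wording are hypothetical, and the runtime's actual internals are not shown in this guide. The only external fact relied on is that OpenAI-compatible embeddings responses carry vectors under `data[].embedding`.

```typescript
// Hypothetical sketch of the documented failure mode; names and error
// wording are assumptions, not the runtime's real implementation.
interface UpstreamReply {
    status: number;
    contentType: string;
    body: string;
}

function parseEmbeddingsReply(route: string, reply: UpstreamReply): number[][] {
    // A proxy error page (for example HTML) is rejected before JSON parsing,
    // and the error reports status, route, and content type.
    if (!reply.contentType.toLowerCase().includes("application/json")) {
        throw new Error(
            `Embedding request failed: status=${reply.status} route=${route} ` +
                `content-type=${reply.contentType}`,
        );
    }
    // OpenAI-compatible embeddings responses list vectors under data[].embedding.
    const parsed = JSON.parse(reply.body) as { data: { embedding: number[] }[] };
    return parsed.data.map((item) => item.embedding);
}
```

Surfacing the status, route, and content type together usually makes it clear whether the problem is a misconfigured proxy URL or an upstream outage.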

Start Ingest

Enable the generated Cloud RAG client surface in gencow.config:

export default {
    cloudFeatures: {
        rag: true,
    },
};

Then start ingest from a client component with the generated mutation:

import { useMutation } from "@gencow/react";
import { api } from "./gencow/api";

// inside a React component
const { mutate: startIngest } = useMutation(api.cloud.ragIngest.start);

async function handleIngest(storageId: string) {
    await startIngest({
        storageId,
        corpus: "manuals",
        visibility: "shared",
        sourceKey: "refund-policy.pdf",
        mode: "auto",
        provider: "auto",
    });
}

Search Canonical Chunks

import { v } from "@gencow/core";
import { procedure } from "./runtime";

export const searchManuals = procedure.query
    .name("manuals.search")
    .input(v.object({ question: v.string() }))
    .handler(async ({ context: ctx, input }) => {
        ctx.auth.requireAuth();

        return ctx.search("rag_chunks", input.question, {
            fields: ["chunk_text", "lexical_text"],
            scope: { corpus: "manuals", visibility: "shared" },
            limit: 10,
        });
    });

Rerank Retrieved Candidates

Reranking runs after retrieval has already filtered candidates by tenant, owner, corpus, visibility, and read grants. Use the AI SDK rerank() path or createReranker() when you want a dedicated cross-encoder style ordering step before answer generation:

import { rerank } from "ai";
import { createGencowAI } from "./ai";
import { createReranker } from "./reranker";

const gencow = createGencowAI();

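// Assumes `searchResults` comes from an earlier ctx.search() call and
// `input.question` is the user's query from the surrounding handler.
const ranked = await rerank({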
    model: gencow.rerankingModel("Cohere-rerank-v4.0-fast"),
    query: input.question,
    documents: searchResults.items.map((item) => item.row.text),
    topN: 8,
});

// Alternatively, build a reusable reranker with an explicit LLM fallback.
const reranker = createReranker({
    rerankingModel: gencow.rerankingModel("Cohere-rerank-v4.0-fast"),
    fallbackModel: gencow.languageModel("gpt-5.4-mini"),
});

Cloud rerank defaults to the Azure-hosted Cohere model Cohere-rerank-v4.0-fast. Local direct mode does not currently provide a real reranking model, so gencow.rerankingModel should fail fast unless the Gencow proxy is configured. The generated reranker starter may still use its explicit LLM fallback path for compatibility, but production RAG should prefer the proxy-backed reranker so that usage is metered as ai_rerank and provider keys stay on the platform.
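
The ordering of the two stages, access filtering first and reranking second, can be sketched without any model at all. This is illustrative only: a simple token-overlap score stands in for a real reranking model, and `Candidate`, `overlapScore`, and `filterThenRerank` are hypothetical names rather than Gencow APIs.

```typescript
// Illustrative only: a token-overlap scorer stands in for a real reranking model.
interface Candidate {
    text: string;
    corpus: string;
    visibility: "shared" | "private";
}

function tokenize(value: string): string[] {
    return value.toLowerCase().split(/\W+/).filter(Boolean);
}

// Fraction of distinct query tokens that also appear in the candidate text.
function overlapScore(query: string, text: string): number {
    const queryTokens = new Set(tokenize(query));
    if (queryTokens.size === 0) return 0;
    const textTokens = Array.from(new Set(tokenize(text)));
    const hits = textTokens.filter((token) => queryTokens.has(token)).length;
    return hits / queryTokens.size;
}

function filterThenRerank(
    query: string,
    candidates: Candidate[],
    scope: { corpus: string; visibility: "shared" | "private" },
    topN: number,
): Candidate[] {
    return candidates
        // Stage 1: scope filtering happens before any ranking is attempted.
        .filter((c) => c.corpus === scope.corpus && c.visibility === scope.visibility)
        // Stage 2: score the survivors and keep the best topN.
        .map((candidate) => ({ candidate, score: overlapScore(query, candidate.text) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topN)
        .map((scored) => scored.candidate);
}
```

Because filtering runs first, the ranking step never sees text the caller is not allowed to read, whichever scorer is plugged in.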

Grounded Answer

import { v } from "@gencow/core";
import { procedure } from "./runtime";

export const askManuals = procedure.query
    .name("manuals.ask")
    .input(v.object({ question: v.string() }))
    .handler(async ({ context: ctx, input }) => {
        ctx.auth.requireAuth();
        if (!ctx.grounding) throw new Error("Grounding runtime is not available");

        return ctx.grounding.answer({
            question: input.question,
            scope: { corpus: "manuals", visibility: "shared" },
            mode: "qa",
            budget: {
                maxVerifyLoops: 2,
                maxResearchQueriesPerLoop: 3,
                maxCitationsPerClaim: 3,
            },
        });
    });

Operations Surface

import { useMutation, useQuery } from "@gencow/react";
import { api } from "./gencow/api";

const summary = useQuery(api.cloud.ragOps.summary, { corpus: "manuals" });
const metrics = useQuery(api.cloud.ragOps.metrics, { corpus: "manuals", limit: 20 });
const { mutate: evaluate } = useMutation(api.cloud.ragOps.evaluate);
const reindexPlan = useQuery(api.cloud.ragOps.reindexPlan, {
    corpus: "manuals",
    visibility: "shared",
    mode: "corpus-policy-changed",
    reason: "policy refresh",
});

Use these operations to inspect source/chunk counts, ingest status, grounded-answer metrics, evaluation fixtures, and reindex candidates.

Facade API — Parsers + Reranker Example

Production file ingest should use documents.ingest.* and workflow document conversion rather than calling legacy parsers directly. See Document Conversion for PDF, HWPX, DOCX, and XLSX routing.

import { v } from "@gencow/core";
import { procedure } from "./runtime";
import { parsers } from "./parsers";
import { rag } from "./rag";
import { reranker } from "./reranker";

export const ingestPdf = procedure.mutation
    .name("docs.ingestPdf")
    .handler(async ({ context: ctx, input }) => {
        ctx.auth.requireAuth();

        const file = input?.["file"] as File;
        if (!file || typeof file === "string") throw new Error("No file");

        const text = await parsers.pdf(Buffer.from(await file.arrayBuffer()));
        await rag.ingest(ctx, file.name, text);
    });

export const smartSearch = procedure.query
    .name("docs.smartSearch")
    .input(v.object({ question: v.string() }))
    .handler(async ({ context: ctx, input }) => {
        ctx.auth.requireAuth();
        return reranker.searchAndRerank(ctx, rag, input.question);
    });

export const askGroundedDocs = procedure.query
    .name("docs.askGrounded")
    .input(v.object({ question: v.string() }))
    .handler(async ({ context: ctx, input }) => {
        ctx.auth.requireAuth();

        return await rag.askGrounded(ctx, input.question, {
            corpus: "default",
            visibility: "shared",
        });
    });

Memory — Agent Memory System

Memory gives your AI agents persistent context across conversations.

Setup

gencow add Memory

This creates:

gencow/memory.ts — createMemory({ extractionModel, embeddingModel }), the memory singleton, and memory helpers
gencow/schema-memory.ts — Database tables for memory storage

Important: Re-export schema-memory.ts from your schema to create the tables:
// gencow/schema.ts
export * from "./schema-memory";

Two Memory Surfaces

	Provider API	Facade API
Entry point	`createMemory({ extractionModel, embeddingModel })`	`import { memory } from "./memory"`
Best for	Explicit model injection, tests, AI SDK-native composition	Existing starter code and default singleton usage
Core helpers	`extract()`, `search()`, `buildContext()`, `loadSession()`, `saveSession()`, `applyDecay()`	Same helpers on the singleton

Provider API

Use the factory when you want explicit extraction/embedding model selection:

import { generateText } from "ai";
import { createGencowAI } from "./ai";
import { createMemory } from "./memory";

const gencow = createGencowAI();
const customMemory = createMemory({
    extractionModel: gencow.languageModel("gpt-5.4-mini"),
    embeddingModel: gencow.embeddingModel("text-embedding-3-small"),
});

const memCtx = await customMemory.buildContext(ctx, userId, sessionId, input.message);

const result = await generateText({
    model: gencow.languageModel("gpt-5.4-mini"),
    system: `You are a helpful assistant.

${memCtx.toSystemPrompt()}`,
    messages: [...memCtx.recentMessages, { role: "user", content: input.message }],
});

await customMemory.extract(
    ctx,
    userId,
    `User: ${input.message}
Assistant: ${result.text}`,
);

await customMemory.saveSession(ctx, sessionId, userId, [
    ...memCtx.recentMessages,
    { role: "user", content: input.message },
    { role: "assistant", content: result.text },
]);

The generated schema-memory.ts starter stores embeddings in agent_memories.embedding as vector(1536). Keep the embedding model dimension aligned with that column size, or change the schema and migrate/rebuild stored vectors.

Facade API

Use the singleton when you want the default generated surface:

import { ai } from "./ai";
import { memory } from "./memory";

const memCtx = await memory.buildContext(ctx, userId, sessionId, input.message);

const result = await ai.chat({
    system: `You are a helpful assistant.

${memCtx.toSystemPrompt()}`,
    messages: [...memCtx.recentMessages, { role: "user", content: input.message }],
});

await memory.extract(
    ctx,
    userId,
    `User: ${input.message}
Assistant: ${result.text}`,
);

buildContext() returns:

recentMessages — the latest saved session messages
longTermFacts — ranked memory facts formatted for prompts
toSystemPrompt() — a helper that renders those facts into a system string
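
To make the three members concrete, here is a rough sketch of an object with the same outline. The field types, the `makeContextSketch` helper, and the prompt wording are all assumptions made for illustration; the generated gencow/memory.ts may shape and render these differently.

```typescript
// Assumed shapes for illustration; the generated gencow/memory.ts may differ.
interface ChatMessage {
    role: "user" | "assistant";
    content: string;
}

interface MemoryContextSketch {
    recentMessages: ChatMessage[];
    longTermFacts: string[];
    toSystemPrompt(): string;
}

function makeContextSketch(
    recentMessages: ChatMessage[],
    longTermFacts: string[],
): MemoryContextSketch {
    return {
        recentMessages,
        longTermFacts,
        // Renders the facts as a bullet list; the heading text is invented here.
        toSystemPrompt(): string {
            if (longTermFacts.length === 0) return "";
            const lines = longTermFacts.map((fact) => `- ${fact}`);
            return ["Known facts about the user:", ...lines].join("\n");
        },
    };
}
```

Keeping the facts separate from the recent messages lets the caller place them in the system prompt while the messages go into the conversation history.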

Three Memory Types

Type	Purpose	Example
Episodic	Recent conversation history	"5 minutes ago you asked about…"
Semantic	Facts learned about the user	"User prefers Korean language"
Procedural	Learned behaviors and rules	"When user says 'report', generate PDF"
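
As a data structure, the three types can be modeled as a tagged record. This is a sketch only: `MemoryType`, `MemoryRecord`, and `byType` are hypothetical names, and the real columns in schema-memory.ts may be named and typed differently. The example contents are the ones from the table.

```typescript
// Illustrative types only; the columns in schema-memory.ts may differ.
type MemoryType = "episodic" | "semantic" | "procedural";

interface MemoryRecord {
    type: MemoryType;
    content: string;
}

// Example contents taken from the table of memory types.
const exampleMemories: MemoryRecord[] = [
    { type: "episodic", content: "5 minutes ago you asked about…" },
    { type: "semantic", content: "User prefers Korean language" },
    { type: "procedural", content: "When user says 'report', generate PDF" },
];

// Select the contents of one memory type.
function byType(records: MemoryRecord[], type: MemoryType): string[] {
    return records.filter((record) => record.type === type).map((record) => record.content);
}

console.log(byType(exampleMemories, "semantic")); // [ 'User prefers Korean language' ]
```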

Memory Helpers

memory.search(ctx, userId, query) — vector search over stored facts, ranked by relevance and importance.
memory.loadSession(ctx, sessionId) / memory.saveSession(...) — short-term conversation history storage.
memory.applyDecay(ctx) — reduces importance for memories that have not been accessed recently; schedule it with cron if needed.
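
The guide does not state how relevance and importance are combined or how decay is computed, so the sketch below uses assumed formulas: a weighted sum of cosine similarity and importance for ranking, and an exponential half-life for decay. `MemoryFact`, `rankFacts`, `decayImportance`, the 0.7 weight, and the 30-day half-life are all hypothetical.

```typescript
// Assumed formulas for illustration; the generated helpers may score and decay differently.
interface MemoryFact {
    text: string;
    embedding: number[];
    importance: number; // assumed range 0..1
    lastAccessedAt: number; // epoch milliseconds
}

// Cosine similarity; assumes both vectors have the same length.
function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    a.forEach((value, i) => {
        const other = b[i] ?? 0;
        dot += value * other;
        normA += value * value;
        normB += other * other;
    });
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom === 0 ? 0 : dot / denom;
}

// Rank by a weighted sum of relevance (similarity) and stored importance.
function rankFacts(
    queryEmbedding: number[],
    facts: MemoryFact[],
    relevanceWeight = 0.7,
): MemoryFact[] {
    const score = (fact: MemoryFact): number =>
        relevanceWeight * cosineSimilarity(queryEmbedding, fact.embedding) +
        (1 - relevanceWeight) * fact.importance;
    return [...facts].sort((x, y) => score(y) - score(x));
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Halve importance for every halfLifeDays the fact has gone unaccessed.
function decayImportance(fact: MemoryFact, now: number, halfLifeDays = 30): MemoryFact {
    const idleDays = Math.max(0, (now - fact.lastAccessedAt) / DAY_MS);
    return { ...fact, importance: fact.importance * Math.pow(0.5, idleDays / halfLifeDays) };
}
```

With these assumptions, a fact with importance 0.8 that has not been accessed for 30 days drops to 0.4, so stale memories gradually lose rank to fresher ones.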

Memory Lifecycle

User: "I prefer responses in Korean"
        │
        ├── memory.extract() → stores semantic memory:
        │   "User prefers Korean language responses"
        │
        [Next conversation]
        │
        ├── memory.buildContext() → retrieves recent messages + ranked long-term facts
        │
        └── AI responds in Korean ✅

Next Steps

CLI Reference — All CLI commands
React Hooks — Frontend API reference
Core API — Backend API reference
