From Documents to Answers
An overview of how foundation4.ai securely processes your organization's documents and text so your team can get answers quickly.
From Documents to Answers:
How foundation4.ai Processes Enterprise Knowledge End to End
Technical Deep Dive • foundation4.ai
Most enterprise AI initiatives stall at the same point: the organization has mountains of valuable knowledge scattered across wikis, ticketing systems, contracts, and documentation repos - but no reliable way to make it queryable by an LLM. Bolting a retrieval step onto a general-purpose model rarely works cleanly. The context is noisy, the permissions are wrong, and the results are inconsistent.
foundation4.ai is built to close that gap. It provides a complete, self-managed RAG server: you push documents in, configure how they should be indexed and who can access them, and a consistent API returns grounded, permission-aware answers. But the real value is in understanding what happens between those two endpoints. This post traces the full journey of a piece of enterprise knowledge - from the moment it enters the system to the moment an LLM cites it in a response.
The Pipeline at a Glance
Before diving into each step, here is the full data flow:
End-to-End Data Flow
POST document → NATS queue → Background worker → Text splitting → Embedding → pgvector storage → Agent execution → LLM generation → Streaming SSE response
Each stage is deliberately decoupled - ingestion is non-blocking, processing is asynchronous, and LLM selection is deferred to query time. That separation is what makes the system both scalable and flexible. Let's walk through it.
Step 1: Document Ingestion
Everything begins with a POST /pipelines/{id}/documents request. You send the document's text content, a classification (a hierarchical label like support/technical or legal/contracts/vendor), an optional external identifier linking back to your source system, and any metadata fields you want to filter on later (author, department, date, priority, and so on).
The API response comes back immediately with a status: "pending" - the document has been accepted but not yet processed. This matters at scale: you can submit thousands of documents in parallel without waiting for any single one to complete. The external identifier is particularly useful here; re-submitting the same ID later simply creates a new version and optionally expires the old one, making updates idempotent.
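In client code, an ingestion request might look like the following sketch. The endpoint path comes from the description above, but the field names (content, external_id, and so on) are assumptions for illustration, not the documented schema:

```python
import json

PIPELINE_ID = "pl-123"  # placeholder pipeline identifier

# Assumed request body shape for POST /pipelines/{id}/documents.
payload = {
    "content": "Refunds are processed within 5 business days.",
    "classification": "support/technical",
    "external_id": "kb-article-42",  # re-submitting this ID later versions the document
    "metadata": {"author": "jdoe", "department": "support"},
}

# A real client would send it with an HTTP library, e.g.:
#   requests.post(f"https://api.example.com/pipelines/{PIPELINE_ID}/documents",
#                 json=payload)
# and receive {"status": "pending"} back immediately.
print(json.dumps(payload, indent=2))
```

Because the response returns before processing starts, a bulk importer can fire thousands of these requests concurrently and poll (or listen) for completion separately.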
Step 2: The NATS Queue
Accepted documents are immediately published to a NATS JetStream queue. NATS is a high-throughput distributed messaging system, and foundation4.ai deploys it as a three-node cluster for resilience. This queue is the key architectural seam: the API server's only job is to persist the document record and publish the processing event. Everything downstream from there happens asynchronously.
This design means the ingestion path scales independently from the processing path. During a large initial data import you can run multiple workers in parallel; under normal operation, a smaller worker pool keeps up with incremental updates. Neither path blocks the other.
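The decoupling can be illustrated with a minimal stand-in. Real deployments use a NATS JetStream cluster rather than an in-process queue, but the seam is the same: the API path only enqueues an event, and workers consume it asynchronously:

```python
import queue
import threading

events = queue.Queue()   # stand-in for the NATS JetStream subject
processed = []           # what the worker has finished

def api_ingest(doc_id: str) -> dict:
    """API path: persist the record (omitted here) and publish the event."""
    events.put({"doc_id": doc_id, "action": "process"})
    return {"status": "pending"}  # returns before any processing happens

def worker():
    """Background worker: splitting and embedding would happen here."""
    while True:
        event = events.get()
        if event is None:  # sentinel: shut down
            break
        processed.append(event["doc_id"])

t = threading.Thread(target=worker)
t.start()
for i in range(3):
    api_ingest(f"doc-{i}")
events.put(None)
t.join()
print(processed)
```

Scaling the processing path then just means starting more worker threads (or, in the real system, more worker processes subscribed to the queue) without touching the ingestion path.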
Step 3: Text Splitting
A background worker picks up the event and begins processing. The first real transformation is text splitting: the document's content is divided into smaller, semantically coherent fragments (also called chunks). This step is more consequential than it looks.
foundation4.ai ships three splitters. The RecursiveCharacterTextSplitter is the right default for most content: it tries successively smaller separators (double newline, single newline, space) until each chunk is under the configured size limit. The CharacterTextSplitter is simpler and faster for short-form content, while the TokenTextSplitter respects token boundaries - useful when you need precise control over context window usage downstream.
Two parameters govern every splitter: chunk_size (how large each fragment can be) and chunk_overlap (how many characters are shared between adjacent fragments). Overlap is what prevents relevant context from falling into the gap between two chunks. For general documentation, a chunk size of 1,000 characters with 200 characters of overlap is a reliable starting point; code documentation often benefits from larger chunks around 1,500 characters. Each fragment inherits the parent document's metadata and classification, so no context is lost during splitting.
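The interaction of the two parameters is easiest to see in a toy sliding-window splitter. This is a deliberate simplification - the shipped splitters also respect separators and token boundaries - but the overlap mechanics are the same:

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy character splitter: fixed-size windows that share
    `chunk_overlap` characters with their neighbor."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Adjacent chunks share two characters, so content near a boundary
# still appears whole in at least one chunk.
chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```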
Step 4: Embedding
Each fragment is then passed to the pipeline's configured embedding model, which converts the text into a dense numerical vector. This vector captures semantic meaning: fragments about similar topics end up close together in the vector space, even if they share no keywords.
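"Close together in the vector space" is typically measured with cosine similarity. A toy example with made-up 3-dimensional vectors (real models like all-MiniLM-L6-v2 emit 384 dimensions) shows the idea:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" - the values are invented for illustration.
refund_policy = [0.9, 0.1, 0.0]
money_back    = [0.8, 0.2, 0.1]  # similar topic, no shared keywords needed
gpu_drivers   = [0.0, 0.1, 0.9]  # unrelated topic

print(cosine_similarity(refund_policy, money_back))   # high
print(cosine_similarity(refund_policy, gpu_drivers))  # low
```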
Out of the box, foundation4.ai ships with all-MiniLM-L6-v2 as the default embedding model - a fast, lightweight model producing 384-dimensional vectors that works well for most use cases. But the platform is designed to work with any embedding provider - from cloud-hosted OpenAI embeddings to fully self-hosted models running on HuggingFace's Text Embeddings Inference (TEI) server. Teams that need higher-fidelity retrieval can swap in a model like Qwen3-Embedding-8B (4096 dimensions) for maximum quality, use a multilingual model like multilingual-e5-large for non-English content, or deploy a self-hosted EmbeddingGemma-300m via TEI for air-gapped environments where no data can leave the cluster. The choice is yours, and you configure it at the pipeline level.
One important constraint: the embedding model is bound to the pipeline at creation time and cannot be changed later without re-ingesting all documents. That makes the initial model selection consequential - in practice, benchmarking candidate models on a sample of your actual content is the most reliable guide.
Step 5: Storage in pgvector
The embedded fragments are stored in PostgreSQL with the pgvector extension. The document status transitions to success and the fragments become searchable. Choosing PostgreSQL rather than a dedicated vector database was a deliberate architectural decision: it means rich metadata filtering (using full SQL semantics), ACID compliance for versioning, point-in-time queries via the as_of parameter, and a single operational system to manage rather than two. The entire document history - every version, every expiry timestamp - lives in the same database as the vectors.
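The "metadata filtering plus vector search in one query" point is concrete in SQL. The query below is purely illustrative - the table and column names are invented, not the platform's internal schema - but it uses pgvector's real cosine-distance operator and shows how classification, metadata, and similarity combine in a single statement:

```python
# Illustrative query against an assumed schema (table "fragments" with
# columns content, classification, metadata JSONB, embedding vector).
query = """
SELECT content, metadata
FROM fragments
WHERE classification LIKE 'support/%%'
  AND metadata->>'department' = %(dept)s
ORDER BY embedding <=> %(query_vec)s   -- pgvector cosine-distance operator
LIMIT 5;
"""
params = {"dept": "support", "query_vec": "[0.9, 0.1, 0.0]"}
# A real client would run this via psycopg:
#   cur.execute(query, params)
print(query.strip())
```

Because the filter and the vector ordering live in one planner, PostgreSQL can combine ordinary B-tree/GIN indexes with the vector index rather than filtering results after an external vector lookup.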
Step 6: Agent Execution and the LLM
Retrieval is triggered by a POST /agents/{id}/execute request. The caller specifies a pipeline (which knowledge base to search), an agent (which prompt template and retrieval parameters to use), an LLM (which model to generate with), and optionally a classification and metadata filters to narrow the search space.
The agent's prompt template defines placeholders that get resolved at runtime. A similarity placeholder fires a vector search - the user's query is embedded with the same model used during ingestion, and the k most similar fragments are retrieved. An mmr (Maximal Marginal Relevance) placeholder does the same but re-ranks results to balance relevance with diversity - useful when documents tend to be repetitive. The retrieved fragments, along with any conversation context from prior turns, are assembled into the final prompt and sent to the LLM.
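Conceptually, placeholder resolution looks like the following sketch. The {similarity} syntax and the retrieval callback are assumptions for illustration - the real template format is defined by the agent configuration:

```python
# Hypothetical agent template; {similarity} marks where retrieved
# fragments are injected, {question} where the user's query goes.
template = (
    "Answer using only the context below.\n"
    "Context:\n{similarity}\n\n"
    "Question: {question}"
)

def resolve(template: str, question: str, retrieve) -> str:
    """Fill the template: retrieve() stands in for the vector search
    (embed the question with the ingestion model, return top-k fragments)."""
    fragments = retrieve(question)
    return template.format(similarity="\n---\n".join(fragments),
                           question=question)

prompt = resolve(
    template,
    "How long do refunds take?",
    retrieve=lambda q: ["Refunds are processed within 5 business days."],
)
print(prompt)
```

An mmr placeholder would plug in at the same point; only the retrieval function changes, not the assembly.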
Critically, the LLM is selected at query time via an X-LLM-ID header - not baked into the agent. You can route the same agent to a fast, cheap model for development and a high-quality model for production without changing any configuration. foundation4.ai supports any OpenAI-compatible endpoint, so the model can be GPT-5, Claude Sonnet, a locally hosted open-weight model, or anything in between.
Responses stream back via Server-Sent Events, so the first tokens reach the client before generation finishes. When tracing is enabled (options.tracing: true), the response also includes a trace record showing exactly which fragments were retrieved, what the assembled prompt looked like, and how long each stage took - which makes it straightforward to debug retrieval quality without guesswork.
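Putting the execute call together: the X-LLM-ID header and options.tracing flag are described above, while the payload field names and the exact SSE event shape are assumptions for this sketch. The SSE wire format itself is standard - each event's payload arrives on a `data:` line:

```python
# Assumed request for POST /agents/{id}/execute.
headers = {
    "X-LLM-ID": "llm-fast-dev",        # LLM chosen at query time, not in the agent
    "Accept": "text/event-stream",
}
payload = {
    "pipeline_id": "pl-123",           # assumed field name
    "input": "How long do refunds take?",
    "options": {"tracing": True},      # include the retrieval trace in the response
}

def parse_sse(stream_lines):
    """Yield the payload of each Server-Sent Event data line."""
    for line in stream_lines:
        if line.startswith("data: "):
            yield line[len("data: "):]

# Simulated stream chunks; a real client would read these off the socket.
sample = ["data: Refunds", "data:  are processed", "", "data:  in 5 days"]
answer = "".join(parse_sse(sample))
print(answer)  # Refunds are processed in 5 days
```

Swapping llm-fast-dev for a production model ID is the only change needed to re-route the same agent to a different LLM.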
Why This Architecture Matters
The pipeline described above might look like a series of technical steps, but each decision reflects a production reality. Async ingestion means a nightly sync of 50,000 documents doesn't block your API. Classification-aware retrieval means a legal document can't surface in a customer-facing chatbot. Version history means a compliance audit can answer "what did our AI know, and when did it know it?" Runtime LLM selection means you're never locked into a single provider.
Together, these decisions add up to something that's genuinely rare in the AI infrastructure space: a RAG system that is designed for enterprise from the ground up - not retrofitted for it after the fact. The journey from document to answer is a long one. foundation4.ai makes every step of it reliable, auditable, and yours to control.