RAG Pipeline for Proposal Automation: Hybrid Retrieval, HyDE, and Reranking

A RAG pipeline for proposal automation accelerates the journey from an incoming RFQ to a structured, accurate draft proposal by retrieving the right past content — capability statements, pricing templates, case studies — and grounding the LLM's output in retrieved evidence rather than hallucinated detail. When the retrieval layer is built correctly, with hybrid search, HyDE query expansion, and cross-encoder reranking, the generated proposal reflects your actual deliverables with measurable faithfulness. When it is not, the LLM fills gaps with confident-sounding fiction.

This guide is a practitioner's walkthrough of the full pipeline: document ingestion via the Unstructured API, dense indexing in ChromaDB alongside a sparse BM25 index, Hypothetical Document Embeddings (HyDE), Reciprocal Rank Fusion (RRF), cross-encoder reranking, faithfulness scoring, and DOCX generation. It also includes notes on running the embedding and inference steps on Apple Silicon with MPS acceleration.

What Is a RAG Pipeline and Why Proposal Automation Needs It

Retrieval-Augmented Generation (RAG) is an architecture where an LLM's response is grounded by a retrieval step that fetches relevant documents from a knowledge base before generation begins. The retrieved chunks become part of the context window, reducing the model's reliance on parametric memory (where hallucinations live) and instead anchoring answers in documents you control.

For proposal automation, the knowledge base is the company's own corpus: past proposals, SOW templates, pricing schedules, capability decks, and case studies. When a new RFQ arrives, the pipeline retrieves the most relevant sections from prior work and instructs the LLM to draft the new proposal section-by-section from that evidence.

Without retrieval, the LLM tends to invent delivery timelines, fabricate credentials, and fill scope sections with generic language. With a well-tuned retrieval layer, it reproduces your actual methodology with correct project references. That is the difference between a proposal that wins and one that is disqualified.

Document Ingestion with the Unstructured API

Before any retrieval can happen, source documents must be parsed into clean, semantically coherent chunks. PDFs, DOCX files, and scanned archives are structurally messy: headers nest inside tables, footnotes interrupt body paragraphs, and column layouts confuse naive text extraction.

The Unstructured API, used below through the open-source unstructured Python library, solves this by performing layout analysis and element classification before chunking. It returns typed elements (Title, NarrativeText, Table, ListItem) rather than a raw stream of characters. This matters for proposal content because a pricing table serialized incorrectly becomes unrecoverable noise in the embedding space.

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

elements = partition(filename="past_proposal.pdf", strategy="hi_res")
chunks = chunk_by_title(elements, max_characters=800, overlap=100)

chunk_by_title starts a new chunk at each section title, so a capabilities section stays together rather than being merged with the pricing or delivery-timeline section that follows it. Set max_characters to 600–900 to stay comfortably under the input limit of most embedding models, and include a small overlap to preserve context across chunk boundaries. By default, overlap is applied only where an oversized element has to be split; pass overlap_all=True if you also want overlap between ordinary adjacent chunks.

Store metadata — source filename, page number, section title, document date — alongside each chunk. You will need it for citation in the final DOCX output.

Hybrid Retrieval: ChromaDB Dense + BM25 Sparse

Dense vector retrieval and sparse keyword retrieval have complementary failure modes. Dense retrieval (embedding similarity) handles semantic paraphrase well — "headcount for the project" matches "team size and staffing" — but struggles with rare terms, product names, and specific numbers. Sparse retrieval (BM25) handles exact lexical matches precisely — "ISO 27001" finds "ISO 27001" — but fails on synonyms and conceptual queries.

Hybrid retrieval runs both in parallel and fuses the ranked lists. For proposal retrieval, this is not optional: RFQs contain a mixture of conceptual requirements ("cloud-native architecture") and precise specifications ("delivery within 12 weeks of contract signature").

ChromaDB stores and serves the dense embeddings. Use a model that performs well on domain text — all-mpnet-base-v2 or BAAI/bge-large-en-v1.5 are reliable starting points for English proposal text.

import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma_store")
emb_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-large-en-v1.5"
)
collection = client.get_or_create_collection("proposals", embedding_function=emb_fn)

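# Chroma metadata values must be str, int, float, or bool, so pick flat fields
# instead of passing the ElementMetadata object directly.
collection.add(
    documents=[c.text for c in chunks],
    metadatas=[
        {"source": c.metadata.filename or "", "page": c.metadata.page_number or 0}
        for c in chunks
    ],
    ids=[c.id for c in chunks],
)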

BM25 operates over the same chunks using rank_bm25:

from rank_bm25 import BM25Okapi

tokenized_corpus = [chunk.text.lower().split() for chunk in chunks]
bm25 = BM25Okapi(tokenized_corpus)

Run both retrievers against the query and collect the top-K results from each before fusion.
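As a rough sketch of that collection step, the helpers below turn each retriever's raw output into a ranked list of chunk ids. The helper names top_k_ids and chroma_ids are hypothetical, and the toy values stand in for the output of bm25.get_scores(...) and collection.query(...).

```python
from typing import Sequence

def top_k_ids(scores: Sequence[float], ids: Sequence[str], k: int = 20) -> list[str]:
    """Return the ids of the k highest-scoring chunks, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [ids[i] for i in order[:k]]

def chroma_ids(result: dict) -> list[str]:
    """Chroma returns one id list per query text; take the list for the first query."""
    return list(result["ids"][0])

# Toy stand-ins (assumptions) for bm25.get_scores(...) and collection.query(...)
chunk_ids = ["c1", "c2", "c3", "c4"]
bm25_scores = [0.0, 2.7, 1.1, 0.4]
dense_result = {"ids": [["c3", "c1", "c2"]]}

bm25_ids = top_k_ids(bm25_scores, chunk_ids, k=3)
dense_original_ids = chroma_ids(dense_result)
```

In the real pipeline the BM25 query should be tokenized the same way as the corpus, for example bm25.get_scores(rfq_query.lower().split()).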

Apple Silicon MPS Acceleration

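If you are running embedding and reranking models locally on a Mac with Apple Silicon, use MPS (Metal Performance Shaders) instead of CPU. Expect a speedup of roughly 3–5× for sentence-transformer inference on M1/M2/M3 at typical batch sizes, though the exact figure depends on the model, batch size, and sequence length, so benchmark on your own hardware.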

import torch
from sentence_transformers import SentenceTransformer

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = SentenceTransformer("BAAI/bge-large-en-v1.5", device=device)

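Note that not every operation is implemented for MPS in every framework version. If you hit a NotImplementedError saying an operator is not currently implemented for the MPS device, upgrade to torch>=2.1 and sentence-transformers>=2.6, or set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 so that unsupported operators fall back to CPU. Cross-encoder models (for reranking) also benefit from MPS: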

from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", device=device)
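If an operator is missing on MPS, PyTorch's CPU fallback can be switched on through an environment variable. The following is a minimal sketch of device selection with that fallback enabled. The helper name pick_device is hypothetical, and setting the variable before importing torch is a precaution assumed here rather than a documented requirement.

```python
import os

# Assumption: set this before torch is imported so the fallback is picked up at load time.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

def pick_device() -> str:
    """Prefer MPS when PyTorch reports it as available, otherwise use CPU."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed in this environment
    return "mps" if torch.backends.mps.is_available() else "cpu"

device = pick_device()
```

The fallback keeps the pipeline running, but operators that fall back to CPU run at CPU speed, so it is worth checking whether a framework upgrade removes the need for it.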

HyDE: Hypothetical Document Embeddings for Query Expansion

Standard dense retrieval embeds the raw query and finds similar document chunks. The problem: a short, keyword-heavy RFQ question ("What annotation tooling do you use for NLP tasks?") occupies a very different region of embedding space from a long, narrative-style proposal answer. The semantic gap reduces recall.

HyDE (Hypothetical Document Embeddings) bridges this gap by using the LLM to generate a hypothetical answer to the query, then embedding that answer rather than the original question. The hypothetical answer may contain invented details, but it has the length, vocabulary, and style of real proposal text, so its embedding lands closer to the relevant chunks and recall typically improves.

from anthropic import Anthropic

# Note: this rebinds `client`, which earlier held the Chroma client; `collection` keeps working.
client = Anthropic()

def hypothetical_answer(query: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                f"Write a concise, factual paragraph that would appear in a "
                f"professional proposal answering this question:\n\n{query}"
            )
        }]
    )
    return response.content[0].text

hyde_text = hypothetical_answer(rfq_query)
hyde_results = collection.query(query_texts=[hyde_text], n_results=20)

Run HyDE alongside the original query embedding and BM25, then merge all three ranked lists. In practice, HyDE contributes most for conceptual or process-oriented questions; BM25 contributes most for named entities and precise requirements; the original embedding splits the difference.

Reciprocal Rank Fusion (RRF)

With three ranked lists — dense (original query), dense (HyDE), sparse (BM25) — you need a fusion method that combines rankings without requiring score normalization across incompatible scoring functions. Reciprocal Rank Fusion does exactly this.

For each document, its RRF score is the sum of 1 / (k + rank) across all lists where it appears, where k is a smoothing constant (typically 60). Documents that rank highly across multiple retrieval methods score highest:

from collections import defaultdict

def rrf_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fusion([
    dense_original_ids,
    dense_hyde_ids,
    bm25_ids,
])
top_candidates = fused[:40]  # pass top-40 to the reranker

RRF is robust to retriever quality imbalance — if BM25 occasionally surfaces noise, its RRF contribution is bounded by rank rather than amplified by a raw score spike.
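A small worked example makes the rank-bounded behaviour concrete. The sketch below uses three toy ranked lists (invented ids, not real retrieval output) and a hypothetical rrf_scores helper that returns the per-document scores instead of only the ordering.

```python
from collections import defaultdict

def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) for every list in which a document appears."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

toy_lists = [
    ["a", "b", "c"],  # dense, original query
    ["b", "a", "d"],  # dense, HyDE
    ["d", "b", "e"],  # BM25
]
scores = rrf_scores(toy_lists)
ranking = sorted(scores, key=scores.get, reverse=True)
# b appears in all three lists: 1/62 + 1/61 + 1/62, about 0.0487
# a appears at ranks 1 and 2:   1/61 + 1/62,        about 0.0325
# d appears at ranks 3 and 1:   1/63 + 1/61,        about 0.0323
```

Document d is BM25's top hit, yet it finishes behind b and a because a single first place adds at most 1/61 to the total, however large the underlying BM25 score was.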

Cross-Encoder Reranking

The fused candidate set (typically 30–50 chunks) is more than you want to hand to the LLM in full: much of it is only marginally relevant, and every extra chunk costs tokens and dilutes the evidence. Cross-encoder reranking scores each (query, chunk) pair jointly. A bi-encoder, by contrast, embeds the query and each document independently and compares the vectors afterwards. Joint scoring is slower but typically more accurate at identifying genuinely relevant chunks.

# chunk_texts is assumed to be a dict mapping chunk id -> chunk text, built at ingestion.
query_chunk_pairs = [(rfq_query, chunk_texts[cid]) for cid in top_candidates]
scores = reranker.predict(query_chunk_pairs)

reranked = sorted(
    zip(top_candidates, scores),
    key=lambda x: x[1],
    reverse=True
)
final_context_ids = [doc_id for doc_id, _ in reranked[:8]]

Pass only the top 6–10 reranked chunks as context to the generation step. This keeps the context window focused and reduces the likelihood that an irrelevant chunk contaminates the generated text.
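One possible way to assemble that context is sketched below. The function build_context and the chunk_sources mapping (chunk id to source filename) are assumptions made for illustration, and the chunk texts are toy data.

```python
def build_context(
    ranked_ids: list[str],
    chunk_texts: dict[str, str],
    chunk_sources: dict[str, str],
    max_chunks: int = 8,
) -> str:
    """Join the top reranked chunks into one numbered, source-tagged context string."""
    blocks = []
    for n, cid in enumerate(ranked_ids[:max_chunks], start=1):
        blocks.append(f"[{n}] source: {chunk_sources[cid]}\n{chunk_texts[cid]}")
    return "\n\n".join(blocks)

# Toy data (assumption): two reranked chunks and their source files
context = build_context(
    ["c7", "c2"],
    {"c7": "We deliver in 12 weeks.", "c2": "Team of six engineers."},
    {"c7": "proposal_a.pdf", "c2": "proposal_b.pdf"},
)
```

Numbering the chunks lets the generation prompt ask the model to cite [1], [2], and so on, which in turn makes the later faithfulness check and the citations in the DOCX output easier to trace.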

Faithfulness Scoring

Before the generated proposal section is accepted, score it for faithfulness — whether the claims in the generated text are grounded in the retrieved context. Unfaithful generation is the main residual failure mode after retrieval is tuned: the LLM occasionally extrapolates beyond the retrieved evidence.

A lightweight faithfulness scorer uses the LLM itself as a judge:

import json

def score_faithfulness(generated_text: str, context_chunks: list[str]) -> dict:
    context = "\n\n".join(context_chunks)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": (
                "Given the following context and generated text, identify any claims "
                "in the generated text that are NOT supported by the context. "
                "Return JSON: {\"faithful\": true/false, \"unsupported_claims\": [...]}\n\n"
                f"CONTEXT:\n{context}\n\nGENERATED:\n{generated_text}"
            )
        }]
    )
    raw = response.content[0].text
    # The model may wrap the JSON in prose or code fences, so parse only the outermost {...} span.
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

Set a faithfulness gate: if faithful is false and unsupported_claims is non-empty, retry generation with an explicit instruction to restrict output to the provided context. Two retries cover the vast majority of cases; if faithfulness fails on the third attempt, flag the section for human review rather than silently passing unfaithful content.
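The gate can be sketched as a small retry loop. In this sketch, generate and judge are hypothetical callables standing in for the generation call and for a faithfulness scorer that returns the JSON structure described above. The stubs at the bottom are toy stand-ins that make the control flow visible.

```python
from typing import Callable

def generate_with_gate(
    generate: Callable[[bool], str],
    judge: Callable[[str], dict],
    max_attempts: int = 3,
) -> dict:
    """generate(strict) drafts a section; judge(text) returns the faithfulness dict."""
    text, verdict = "", {}
    for attempt in range(1, max_attempts + 1):
        # The first attempt uses the normal prompt; retries request context-only output.
        text = generate(attempt > 1)
        verdict = judge(text)
        # Retry only when faithful is false AND unsupported claims were listed.
        if verdict.get("faithful") or not verdict.get("unsupported_claims"):
            return {"text": text, "attempts": attempt, "needs_review": False}
    return {
        "text": text,
        "attempts": max_attempts,
        "needs_review": True,  # flag for human review instead of passing silently
        "unsupported_claims": verdict.get("unsupported_claims", []),
    }

# Toy stubs (assumptions): the strict prompt yields a grounded draft
def stub_generate(strict: bool) -> str:
    return "grounded draft" if strict else "draft with invented claim"

def stub_judge(text: str) -> dict:
    ok = text == "grounded draft"
    return {"faithful": ok, "unsupported_claims": [] if ok else ["invented claim"]}

result = generate_with_gate(stub_generate, stub_judge)
flagged = generate_with_gate(lambda strict: "draft with invented claim", stub_judge)
```

With max_attempts=3 the loop allows the two retries described above before the section is flagged.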

DOCX Generation

The final output of a proposal pipeline is a formatted Word document — not markdown, not a chat response. Use python-docx to assemble the generated sections into a structured DOCX with correct heading hierarchy, section numbering, and your organisation's style template.

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH

def build_proposal_docx(sections: list[dict], template_path: str = "template.docx") -> Document:
    doc = Document(template_path)

    for section in sections:
        heading = doc.add_heading(section["title"], level=1)
        heading.alignment = WD_ALIGN_PARAGRAPH.LEFT

        for para_text in section["paragraphs"]:
            p = doc.add_paragraph(para_text)
            p.style = doc.styles["Body Text"]

        if section.get("source_citations"):
            doc.add_paragraph(
                "Sources: " + "; ".join(section["source_citations"]),
                style="Caption"
            )

    return doc

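Passing template_path applies your organisation's existing styles (fonts, colours, header and footer) automatically. The template must define the "Body Text" and "Caption" styles used above, otherwise the style lookups fail. The source citations, populated from chunk metadata, create an internal audit trail showing which past proposals each generated section drew from. Call save() on the returned Document to write the file to disk.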
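The source_citations list can be derived from the metadata of the chunks that were passed to generation. The sketch below assumes each metadata dict carries a "source" filename and an optional "page" number. Those key names, the format_citations helper, and the sample filenames are illustrative only.

```python
def format_citations(metadatas: list[dict]) -> list[str]:
    """Build de-duplicated 'file, p. N' strings, preserving first-seen order."""
    citations: list[str] = []
    for meta in metadatas:
        source = meta.get("source") or "unknown source"
        page = meta.get("page")
        label = f"{source}, p. {page}" if page else source
        if label not in citations:
            citations.append(label)
    return citations

# Toy metadata (assumption) for three context chunks, two from the same page
citations = format_citations([
    {"source": "proposal_2023.pdf", "page": 4},
    {"source": "proposal_2023.pdf", "page": 4},
    {"source": "case_study.docx"},
])
```

The resulting list can be stored under the "source_citations" key of each section dict before calling build_proposal_docx.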

What Does RAG-Powered Proposal Automation Mean for the Full Workflow?

A mature RAG pipeline for proposal automation compresses RFQ-to-draft time from days to hours. More importantly, it standardises quality: every proposal section is grounded in the company's best prior work, retrieved systematically rather than recalled from memory by whoever is available to write.

The pipeline described here handles the retrieval and generation layer. The upstream layer — maintaining a clean, well-structured proposal knowledge base — requires systematic document ingestion, metadata discipline, and periodic review to remove outdated or superseded content. The downstream layer — review, approval, and final editing — remains human, but operates on a draft that is already 70–80% complete and verifiably grounded.

What Are the Most Common RAG Pipeline Failure Modes?

Retrieval gaps are the leading failure mode. The correct source document is in the corpus but is not retrieved because the query and document use different terminology. Fix: add HyDE and ensure the corpus is chunked at the right granularity — chunks that are too large dilute relevance signals.

Chunk boundary splits break coherent reasoning. A pricing table split across two chunks becomes unusable. Fix: use layout-aware parsing (Unstructured API) and chunk by logical section rather than by character count alone.

Faithfulness failures are rarer with good retrieval but occur when the LLM is asked to synthesise across many weak-signal chunks. Fix: rerank aggressively to a small, high-confidence context set, and gate on faithfulness before accepting output.

Index staleness means the pipeline retrieves outdated pricing or superseded capability claims. Fix: track document version and effective date in metadata; filter out documents past their review date at query time.
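A query-time staleness filter might look like the sketch below. It assumes each chunk's metadata stores a review date as a YYYYMMDD integer under a hypothetical "review_by" key, because Chroma's range operators such as $gte compare numeric values. The helper names are illustrative.

```python
from datetime import date

def as_int(d: date) -> int:
    """Encode a date as a sortable YYYYMMDD integer."""
    return d.year * 10000 + d.month * 100 + d.day

def fresh_where(today: date, field: str = "review_by") -> dict:
    """Chroma `where` clause keeping chunks whose review date has not passed."""
    return {field: {"$gte": as_int(today)}}

def is_fresh(meta: dict, today: date, field: str = "review_by") -> bool:
    """Same rule for the BM25 side; chunks with no review date count as stale."""
    return meta.get(field, 0) >= as_int(today)

where = fresh_where(date(2025, 3, 1))
```

The dictionary can be passed as the where argument of collection.query, while is_fresh can filter BM25 hits before fusion so both retrievers apply the same cut-off.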

How Does ASPL Approach RAG for Enterprise AI?

ASPL builds agentic AI systems for enterprise clients — including document intelligence pipelines, RAG-powered proposal and contract workflows, and autonomous data extraction systems. Our AI development services cover the full stack: data ingestion and annotation, retrieval architecture, agent orchestration, and production deployment.

The RAG pipeline described in this guide is representative of what we build for clients who need to automate high-value, knowledge-intensive document workflows. If your team is evaluating RAG for proposal automation, contract review, or technical documentation, contact us to discuss an architecture review and scoping engagement — or explore how Pixeal, our data annotation platform, can help build the high-quality retrieval corpus your pipeline needs.

Building a Production RAG Pipeline: Where to Start

The stack described here — Unstructured API for ingestion, ChromaDB + BM25 for hybrid retrieval, HyDE for query expansion, RRF for fusion, cross-encoder reranking, and faithfulness scoring — is production-tested and runs efficiently on both GPU and Apple Silicon MPS. Each component is independently replaceable: swap ChromaDB for Qdrant, swap BM25 for Elasticsearch, swap the cross-encoder model for a domain-fine-tuned variant as your corpus grows.

Start with the retrieval layer before optimising generation. A well-tuned hybrid retrieval pipeline with reranking consistently outperforms a sophisticated generation prompt built on weak retrieval. Ground truth first; generation second. That principle is what makes RAG pipelines work in production, and it is the same principle behind every reliable agentic AI system.