Speculative RAG: Speed and Accuracy with Specialist–Generalist LLMs
TL;DR
Speculative RAG pairs a smaller, faster specialist LLM with a larger, more capable generalist LLM. The specialist drafts candidate answers in parallel over subsets of the retrieved documents, and the generalist verifies them and fuses them into one final, accurate response. The result: faster generation, lower cost per query, and improved reliability.
What Is Speculative RAG?
Speculative RAG is an advanced RAG framework in which two models collaborate: a lightweight specialist generates draft answers from subsets of the retrieved documents, while a larger generalist verifies and refines them into the final output. The approach was introduced by Google Research and borrows its draft-then-verify structure from speculative decoding.
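As a rough illustration, the sketch below wires the two roles together with the OpenAI Node SDK. The model names (gpt-4o-mini as the specialist drafter, gpt-4o as the generalist verifier), the prompts, and the function names are stand-ins chosen for this example, not the models used in the original paper:

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Specialist: draft a candidate answer from one subset of retrieved documents.
async function draftAnswer(question: string, docs: string[]): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // stand-in for the small, fast drafter
    messages: [
      { role: "system", content: "Answer using ONLY the documents provided, and add a one-sentence rationale." },
      { role: "user", content: `Question: ${question}\n\nDocuments:\n${docs.join("\n---\n")}` },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Generalist: check the drafts against each other and fuse them into one answer.
async function verifyAndFuse(question: string, drafts: string[]): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4o", // stand-in for the larger verifier
    messages: [
      { role: "system", content: "You receive several candidate answers with rationales. Keep only well-supported claims and produce one final answer." },
      { role: "user", content: `Question: ${question}\n\nDrafts:\n${drafts.map((d, i) => `[${i + 1}] ${d}`).join("\n\n")}` },
    ],
  });
  return res.choices[0].message.content ?? "";
}

// Drafts run in parallel (one per document subset), followed by a single verification pass.
export async function speculativeRag(question: string, subsets: string[][]): Promise<string> {
  const drafts = await Promise.all(subsets.map((docs) => draftAnswer(question, docs)));
  return verifyAndFuse(question, drafts);
}
```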
When to Use Speculative RAG
- Scientific and technical QA — literature review, patent search, troubleshooting.
- Customer support automation — fast initial answers validated by a stronger LLM.
- Document summarization — smaller model drafts summaries, larger model ensures accuracy.
Example N8N Workflow
- Retrieve documents via vector search (Supabase, Pinecone, etc.).
- Split or cluster results using a Function node (KMeans or simple batching; a minimal batching sketch follows this list).
- Parallel branches — each batch is sent to a smaller LLM node (e.g., GPT-4o mini or GPT-3.5 Turbo) to generate a draft.
- Merge drafts using a Merge node.
- Verification pass — send all drafts to larger LLM (GPT-4 or Claude 3) for final answer synthesis.
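If full KMeans clustering is overkill, a simple round-robin split is enough to feed the parallel drafting branches. The helper below is a minimal TypeScript sketch; the same logic can be pasted into an n8n Function/Code node (minus the type annotations), assuming the retrieved chunks arrive as an array of strings:

```typescript
// Round-robin split of retrieved chunks into N subsets, one per parallel drafting branch.
export function splitIntoBatches(docs: string[], batchCount = 3): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < batchCount; i++) batches.push([]);
  docs.forEach((doc, i) => batches[i % batchCount].push(doc)); // i-th chunk goes to branch i mod N
  return batches;
}

// Example: 7 retrieved chunks split across 3 branches -> sizes [3, 2, 2].
const subsets = splitIntoBatches(["c1", "c2", "c3", "c4", "c5", "c6", "c7"], 3);
console.log(subsets.map((s) => s.length)); // [3, 2, 2]
```

Round-robin keeps the subsets roughly equal in size; clustering instead groups semantically similar chunks, which tends to give the drafter more coherent evidence per branch.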
Implementation Patterns
- Cluster retrieval outputs → run small LLMs independently per cluster.
- Use generalist LLM for cross-draft comparison and synthesis.
- Evaluate draft confidence via self-consistency or citation matching, either inside the verification prompt or with a lightweight heuristic (see the sketch after this list).
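A cheap way to approximate the citation-matching check before (or alongside) the verification prompt is to measure how much of each draft is actually grounded in its source subset. The token-overlap heuristic and the 0.6 threshold below are illustrative assumptions, not part of the original framework:

```typescript
// Fraction of a draft's content words that also appear in its source documents;
// low overlap is a signal of unsupported claims.
function contentWords(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((w) => w.length > 3) // drop short / stop-ish words
  );
}

export function supportScore(draft: string, sourceDocs: string[]): number {
  const draftWords = contentWords(draft);
  const sourceWords = contentWords(sourceDocs.join(" "));
  if (draftWords.size === 0) return 0;
  let supported = 0;
  for (const w of draftWords) {
    if (sourceWords.has(w)) supported += 1;
  }
  return supported / draftWords.size; // 0 = no overlap, 1 = fully grounded
}

// Example: keep only drafts above an (illustrative) threshold before verification.
// const kept = drafts.filter((d, i) => supportScore(d, subsets[i]) >= 0.6);
```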
Strengths & Weaknesses
Strengths:
- Up to 50% lower latency through parallel draft generation.
- Improved factual robustness from multiple document perspectives.
- Cheaper overall for large query loads.

Weaknesses:
- Pipeline complexity and orchestration overhead from coordinating two models.
- Degraded output if the small model's drafts are poor.
- Increased maintenance burden.
Did you enjoy this article?
Follow me for more resources on RAG and N8N workflows.
Contact me