How to Increase the Accuracy of Enterprise RAG Using Gemini Flash 2.0
Semantic chunking and knowledge extraction techniques
The typical approach to RAG (Retrieval-Augmented Generation) is pretty blunt — chop up massive unstructured documents into 1000-token chunks, sprinkle in some chunk overlap for good measure, embed everything into vectors, and call it a day. This brute-force method is neither accurate nor complete.
So, how do we fix it? This question has been stuck in my head for a while. After a fair bit of reading, tinkering with different repos, and experimentation, I think I have finally landed on a better way that I can stand behind.
First, to get on the same page, let’s try to understand how naive RAG works. Say you have a PDF document. You pick an arbitrary chunk size, say 512 tokens, chop the entire doc into chunks of that size, and add some overlap. You then convert these chunks into embeddings (vectors), and now you are ready for semantic search. I am trivializing this, of course, since there are different chunking strategies, but you get the point. But what if, I thought (and read somewhere), we first identified related content in a document by topic or entity and only then created the embeddings?
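For concreteness, here is roughly what that naive fixed-size chunking looks like. This is a minimal sketch of my own, not code from the repo: it splits on whitespace as a stand-in for real tokens, with the arbitrary 512-size and a small overlap from the description above.

import math

def naive_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    # Split on whitespace as a rough proxy for tokens
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

This is exactly the bluntness the rest of the article tries to move away from: the boundaries fall wherever the counter happens to land, not where the topics change.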
The biggest issue with that idea is cost: if your documents are large and you have, say, 100K of them, running each one through an LLM is an expensive proposition. Or it was, until Google launched Gemini Flash 2.0, which is orders of magnitude cheaper and better than many previous models.
So I had to try it, but I also wanted to combine it with another strategy I had read about — Knowledge Augmented Generation (KAG).
But here’s the challenge — I wanted to use ONE database. Not a vector database plus a graph database, and not Postgres with a graph add-on and multiple extensions. Just ONE database, preferably one that could handle SQL, JSON, vectors, and basic graph queries.
So I chose SingleStore for the database, Python (FastAPI) for the backend, and Next.js for the front end. At a high level, this was my plan:
Ditch arbitrary chunking. Instead, run the document through an LLM to identify semantically coherent sections; I chose Gemini Flash 2.0 for its low-latency, budget-friendly API.
Extract structured knowledge. On top of retrieving text, use Gemini to extract key entities and their relationships from the documents, and store this structured data in relational form (SingleStore) for quick lookup.
Go hybrid with retrieval. Instead of relying solely on vector search, blend semantically chunked retrieval with knowledge graph lookups. Rank the results logically, then hand over the best context to the LLM.
tl;dr — How was the accuracy when I did this? Let’s just say I was pleasantly surprised.
In this article, I’ll walk through the entire approach step by step. You can also grab my code from the repo and test it on your own dataset.
First, let’s look at the simple database schema — a Documents table to keep records of the documents, a Document_Embeddings table to store the semantic chunks and their corresponding vectors, an Entities table, and a Relationships table.
Here is the SQL to create these tables with both vector and keyword (full-text) match indices:
CREATE TABLE Document_Embeddings (
embedding_id BIGINT PRIMARY KEY AUTO_INCREMENT,
doc_id BIGINT NOT NULL,
content TEXT,
embedding VECTOR(1536),
SORT KEY(),
FULLTEXT USING VERSION 2 content_ft_idx (content), -- Full-Text index (v2) on content
VECTOR INDEX embedding_vec_idx (embedding) -- Vector index on embedding column
INDEX_OPTIONS '{ "index_type": "HNSW_FLAT", "metric_type": "DOT_PRODUCT" }'
);
CREATE TABLE Documents (
doc_id BIGINT PRIMARY KEY AUTO_INCREMENT,
title VARCHAR(255),
author VARCHAR(100),
publish_date DATE,
source JSON -- Other metadata fields (e.g. summary, URL) can be added as needed
);
CREATE TABLE Relationships (
relationship_id BIGINT PRIMARY KEY AUTO_INCREMENT,
source_entity_id BIGINT NOT NULL,
target_entity_id BIGINT NOT NULL,
relation_type VARCHAR(100),
doc_id BIGINT, -- reference to Documents.doc_id (not an enforced foreign key)
KEY (source_entity_id) USING HASH, -- index for quickly finding relationships by source
KEY (target_entity_id) USING HASH, -- index for quickly finding relationships by target
KEY (doc_id) -- index for querying relationships by document
);
CREATE TABLE Entities (
entity_id BIGINT NOT NULL AUTO_INCREMENT,
name VARCHAR(255) NOT NULL,
description TEXT,
aliases JSON,
category VARCHAR(100),
PRIMARY KEY (entity_id, name), -- Composite primary key including the shard key columns
SHARD KEY (entity_id, name), -- Enforce local uniqueness on the shard key
FULLTEXT USING VERSION 2 name_ft_idx (name) -- Full-text index for name search
);
Now that we have the tables set up, populating them with a PDF as the document source is not that complex or interesting. I have separate methods for each step: get the semantic chunks and create embeddings, insert into Documents and Document_Embeddings, then extract entities and relationships and populate the Entities and Relationships tables.
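To make the ingestion pass concrete, here is a condensed sketch of what it can look like. This is not the exact code from my repo: it assumes the google-genai SDK for Gemini, the singlestoredb Python client, and some embedding function that returns 1536-dimensional vectors to match the VECTOR(1536) column; the prompt and helper names are illustrative.

import json

import singlestoredb as s2
from google import genai

gemini = genai.Client()  # assumes the Gemini API key is set in the environment
db = s2.connect("user:password@host:3306/ragdb")  # placeholder connection string

CHUNK_PROMPT = (
    "Split the following document into semantically coherent sections. "
    "Return a JSON array of strings, one string per section.\n\n"
)

def semantic_chunks(text: str) -> list[str]:
    # Ask Gemini Flash for topic boundaries instead of cutting every N tokens
    resp = gemini.models.generate_content(
        model="gemini-2.0-flash",
        contents=CHUNK_PROMPT + text,
        config=genai.types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return json.loads(resp.text)

def ingest_document(doc_id: int, text: str, embed) -> None:
    # embed(chunk) should return 1536 floats to match the VECTOR(1536) column
    cur = db.cursor()
    for chunk in semantic_chunks(text):
        cur.execute(
            "INSERT INTO Document_Embeddings (doc_id, content, embedding) "
            "VALUES (%s, %s, %s)",
            (doc_id, chunk, json.dumps(embed(chunk))),  # vector columns accept JSON arrays
        )
    db.commit()
    # A second Gemini call with an extraction prompt populates Entities and Relationships

A separate extraction prompt (asking Gemini for entities, categories, aliases, and subject-relation-object triples as JSON) feeds the Entities and Relationships tables in the same way.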
Now, let’s look at the retrieval strategy, because this can be done in a few different ways depending on whether you are optimizing for accuracy or speed. I chose the strategy below.
Let’s walk through the strategy with a simple example. Say we are searching for a simple term — “hello world”:
Step 1 — Convert the query into an embedding using the same model we used to create embeddings for our documents, and run a hybrid search against the Document_Embeddings table.
-- Simplified query combining vector and full-text search
-- (query_embedding is a placeholder for the vector computed from the query text)
SELECT content,
       DOT_PRODUCT(embedding, query_embedding) AS vector_score,
       MATCH(content) AGAINST ('hello world') AS text_score
FROM Document_Embeddings
ORDER BY (0.7 * vector_score + 0.3 * text_score) DESC
LIMIT 5;
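In the app itself this query is parameterized rather than hard-coded. A rough sketch of the backend-side call, assuming the singlestoredb client and the same embedding function used at ingestion time (the helper name and the connection string are mine, not from the repo):

import json

import singlestoredb as s2

db = s2.connect("user:password@host:3306/ragdb")  # placeholder connection string

def hybrid_search(query_text: str, query_embedding: list[float], k: int = 5):
    # Blend vector similarity (70%) with full-text relevance (30%), as in the SQL above
    sql = """
        SELECT content,
               DOT_PRODUCT(embedding, %s :> VECTOR(1536)) AS vector_score,
               MATCH(content) AGAINST (%s) AS text_score
        FROM Document_Embeddings
        ORDER BY (0.7 * vector_score + 0.3 * text_score) DESC
        LIMIT %s
    """
    cur = db.cursor()
    cur.execute(sql, (json.dumps(query_embedding), query_text, k))
    return cur.fetchall()  # rows of (content, vector_score, text_score)

The 0.7/0.3 weighting is just the split I settled on; it is worth tuning for your own corpus.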
Step 2 — We now take the reranked results from the previous step and run queries to find the entities that appear in them.
Extract potential entities:
- "Hello World" (programming concept)
- "Brian Kernighan" (person)
- "C Programming Language" (programming language)
- "Java" (programming language)
- "Python" (programming language)
Next, we get the relationships, for example:
- "Brian Kernighan" -> "created" -> "Hello World"
- "C Programming Language" -> "introduced" -> "Hello World"
- "Brian Kernighan" -> "authored" -> "C Programming Language"
Step 3 — We now merge and enrich the results into the final context for the LLM, using the following sub-steps:
- Sort chunks by relevance score
- Enrich with entity information
- Add relationship context
- Format for the LLM prompt
Our example now looks like the following, which we then pass to the LLM for a response:
context = f"""
Relevant Information:
{top_chunks_with_scores}
Key Entities:
- Hello World (Programming Concept)
- Brian Kernighan (Person, Creator)
Important Relationships:
- Brian Kernighan created Hello World in 1972
- Hello World was first introduced in C Programming Language
"""
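The last mile is assembling that block from the rows returned in Steps 1 and 2 and handing it to the model together with the user's question. A minimal sketch, again assuming the google-genai SDK and reusing the row shapes from the earlier snippets (the prompt wording is my own, not from the repo):

from google import genai

gemini = genai.Client()  # assumes the Gemini API key is set in the environment

def assemble_context(chunks, entities, relationships) -> str:
    # chunks: (content, vector_score, text_score) rows from hybrid_search
    ranked = sorted(chunks, key=lambda row: 0.7 * row[1] + 0.3 * row[2], reverse=True)
    return (
        "Relevant Information:\n"
        + "\n".join(row[0] for row in ranked)
        + "\n\nKey Entities:\n"
        + "\n".join(f"- {name} ({category})" for _, name, category in entities)
        + "\n\nImportant Relationships:\n"
        + "\n".join(f"- {src} {rel} {tgt}" for src, rel, tgt in relationships)
    )

def answer(question: str, context: str) -> str:
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    resp = gemini.models.generate_content(model="gemini-2.0-flash", contents=prompt)
    return resp.text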
Here is a more detailed view of the flow, for the more visual folks:
Conclusion
RAG has come a long way since it was introduced in 2020, but in my mind we have now made another huge leap with the availability of newer, faster, cheaper, and better-reasoning models. Yes, context windows have also grown, so in some cases it may be easier to just send the whole document to the model, but that still will not replace retrieval over domain-specific knowledge spread across hundreds of thousands of unstructured and structured sources. For that, we will still need some form of Retrieval Augmentation.
If you are building an enterprise AI app, hopefully this gives you a starting point for something that is more enterprise-ready than simple document-only parsing and information synthesis.
✌️