Simple RAG

Retrieval-augmented generation (RAG) is a powerful framework that combines the strengths of Large Language Models (LLMs) with information retrieval systems. At its core, RAG enables AI systems to generate more accurate, contextual, and factual responses by accessing and leveraging external knowledge bases. This approach addresses one of the key limitations of traditional LLMs: their inability to access up-to-date or specific information beyond their training data.

How RAG Works

RAG operates through two fundamental steps:

Retrieval Phase: Documents are converted into vector embeddings. These embeddings are stored in a vector database. When a query is received, the relevant information is retrieved using semantic search. Then, the system identifies and extracts the most pertinent pieces of information.
Generation Phase: Retrieved information is intelligently incorporated into the prompt. The LLM uses this context to generate informed, accurate responses. The response combines the model's inherent knowledge with the retrieved information.

In this notebook, we explore multiple ways in which we can use Arcee's SLMs (Small Language Models) in aiding us to implement efficient Retrieval-Augmented Generation (RAG) pipelines.

First, let's install the necessary packages for downloading data from the web. We first have to ensure that this data is not present in the training corpus of the SLMs (Small Language Models) we're using for RAG (Retrieval-Augmented Generation). This verification is important to confirm that the RAG system is functioning properly and that the LLM is truly retrieving information rather than responding with knowledge from its parameters.

Below is the list of packages that need to be installed :

httpx
openai
requests
python-dotenv
voyageai
trafilatura
lxml_html_clean

! pip install 'httpx[http2]'
! pip install openai requests python-dotenv voyageai trafilatura lxml_html_clean

Downloading the Documents

We first download the document that has been recently published on reasoning models and save this document for further processing.

import requests
from lxml_html_clean import Cleaner
from lxml import html

# Example usage
url = "https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html"
cleaner = Cleaner(
        style=True,
        links=True,
        add_nofollow=True,
        page_structure=False,
        safe_attrs_only=True,
        remove_tags=['span', 'div', 'aside', 'nav']
    )

def clean_webpage(url, cleaner):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        
        doc = html.fromstring(response.content)
    
        cleaned_doc = cleaner.clean_html(doc)
        text = cleaned_doc.text_content()
        text = ' '.join(text.split())
        
        return text
        
    except Exception as e:
        print(f"Error: {e}")
        return None


cleaned_text = clean_webpage(url, cleaner)

if cleaned_text:
    with open('cleaned_article.txt', 'w', encoding='utf-8') as f:
        f.write(cleaned_text)

# Let us print the first 200 characters for verifying.
print(cleaned_text[:200])
print("Length of the text : ", len(cleaned_text))

Then, import the libraries needed for the API call. Also, please save your OPENAI_API_KEY and OPENAI_BASE_URL and your VOYAGE_API_KEY in your .env file

import os
from dotenv import load_dotenv
import numpy as np
import os

import httpx
from openai import OpenAI

load_dotenv()

client = OpenAI(
  http_client=httpx.Client(http2=True),
  base_url=os.environ["OPENAI_BASE_URL"],
  api_key=os.environ["OPENAI_API_KEY"]
)

RAG from Scratch

In this section, we explore how to implement RAG using manual document chunking and by using Faiss as the backbone to retrieve relevant documents for a question. In this example, we are using Voyage for extracting the embeddings from documents and later, we use it for retrieving relevant documents.

For more information: https://docs.voyageai.com/docs/introduction

Split the Document into Chunks

In a RAG system, it is crucial to split the document into smaller chunks so that it’s more effective to identify and retrieve the most relevant information in the retrieval process later. In this example, we simply split our text by character, combine 2048 characters into each chunk, and we get 37 chunks in total.

chunk_size = 2048
chunks = [cleaned_text[i:i + chunk_size] for i in range(0, len(cleaned_text), chunk_size)]

Considerations:

Chunk size: Depending on your specific use case, it may be necessary to customize or experiment with different chunk sizes and chunk overlap to achieve optimal performance in RAG. For example, smaller chunks can be more beneficial in retrieval processes, as larger text chunks often contain filler text that can obscure the semantic representation. As such, using smaller text chunks in the retrieval process can enable the RAG system to identify and extract relevant information more effectively and accurately. However, it’s worth considering the trade-offs that come with using smaller chunks, such as increasing processing time and computational resources.
How to split: While the simplest method is to split the text by character, there are other options depending on the use case and document structure. For example, to avoid exceeding token limits in API calls, it may be necessary to split the text by tokens. To maintain the cohesiveness of the chunks, it can be useful to split the text into sentences, paragraphs, or HTML headers. If working with code, it’s often recommended to split by meaningful code chunks for example using an Abstract Syntax Tree (AST) parser.

Create Embeddings for Each Text Chunk

For each text chunk, we then need to create text embeddings, which are numeric representations of the text in the vector space. Words with similar meanings are expected to be in closer proximity or have a shorter distance in the vector space. To create an embedding, we will use Voyage AI's API endpoint and the embedding model voyage-3-large. We create a get_text_embedding to get the embedding from a single text chunk and then, we use list comprehension to get text embeddings for all text chunks.

import voyageai
vo = voyageai.Client()
sentences = [
    "Renewable energy sources are becoming increasingly important",
    "Multi agent systems combined with reasoning models would be able to transform the Tech landscape ",
    "Quantum mechanics revolutionized our understanding of atomic behavior.",
]
result = vo.embed(sentences, model="voyage-3")

result_embeddings = np.array(result.embeddings)
print(result_embeddings.shape)

def get_text_embedding(text):
    result = vo.embed([text], model="voyage-3")
    vec_array = np.array(result.embeddings)
    return vec_array

text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])

text_embeddings = text_embeddings.squeeze()
text_embeddings.shape

Load into a Vector Database

Once we get the text embeddings, a common practice is to store them in a vector database for efficient processing and retrieval. There are several vector databases to choose from. In our simple example, we are using an open-source vector database Faiss, which allows for efficient similarity search.

With Faiss, we instantiate an instance of the Index class, which defines the indexing structure of the vector database. We then add the text embeddings to this indexing structure.

Please install faiss-gpu using the following command:

! conda install -c pytorch -c nvidia faiss-gpu=1.9.0

Refer to the Faiss documentation for more information.

import faiss

d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

Create Embeddings for a Question

Whenever users ask a question, we also need to create embeddings for this question using the same embedding models as before.

question = "Explain about the DeepSeek-R1-Zero Large language model."
question_embeddings = np.array([get_text_embedding(question)])
# squeeze the array
question_embeddings = np.squeeze(question_embeddings, axis=0)
question_embeddings.shape
question_embeddings

Considerations:

Hypothetical Document Embeddings (HyDE): In some cases, the user’s question might not be the most relevant query to use for identifying the relevant context. Instead, it may be more effective to generate a hypothetical answer or a hypothetical document based on the user’s query and use the embeddings of the generated text to retrieve similar text chunks.

Retrieve Similar Chunks from the Vector Database

We can perform a search on the vector database with index.search, which takes two arguments: the first is the vector embeddings of the question, and the second is the number of similar vectors to retrieve. This function returns the distances and the indices of the most similar vectors to the question vector in the vector database. Then based on the returned indices, we can retrieve the actual relevant text chunks that correspond to those indices.

D, I = index.search(question_embeddings, k=2)
print(I)

retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

Considerations:

Retrieval Methods: There are a lot of different retrieval strategies. In our example, we are showing a simple similarity search with embeddings. Sometimes when there is metadata available for the data, it’s better to filter the data based on the metadata first before performing similarity search. There are also other statistical retrieval methods like TF-IDF and BM25 that use frequency and distribution of terms in the document to identify relevant text chunks.
Retrieved Document: Do we always retrieve individual text chunks as it is? Not always.
- Sometimes, we would like to include more context around the actual retrieved text chunk. We call the actual retrieve text chunk "child chunk" and our goal is to retrieve a larger "parent chunk" that the "child chunk" belongs to.
- On occasion, we might also want to provide weights to our retrieved documents. For example, a time-weighted approach would help us retrieve the most recent document.
- One common issue in the retrieval process is the "lost in the middle" problem where the information in the middle of a long context gets lost. Our models have tried to mitigate this issue. For example, in the passkey task, our models have demonstrated the ability to find a "needle in a haystack" by retrieving a randomly inserted passkey within a long prompt, up to 32k context length. However, it is worth considering experimenting with reordering the document to determine if placing the most relevant chunks at the beginning and end leads to improved results.

Combine Context and Question in a Prompt and Generate a Response

Finally, we can offer the retrieved text chunks as the context information within the prompt. Here is a prompt template where we can include both the retrieved text and user questions in the prompt.

prompt = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {question}
Answer:
"""

import textwrap


def run_virtuoso(prompt):

  response = client.chat.completions.create(
    model='virtuoso-small',
    messages=[{'role': 'user', 'content': prompt}],
    temperature=0.4,
  )
  return response


response = run_virtuoso(prompt)
answer_text = response.choices[0].message.content


print(textwrap.fill(answer_text, width=80))

PreviousText Generation and Analysis NextFunction Calling

Last updated 13 days ago

How RAG Works

RAG operates through two fundamental steps:

Retrieval Phase: Documents are converted into vector embeddings. These embeddings are stored in a vector database. When a query is received, the relevant information is retrieved using semantic search. Then, the system identifies and extracts the most pertinent pieces of information.
Generation Phase: Retrieved information is intelligently incorporated into the prompt. The LLM uses this context to generate informed, accurate responses. The response combines the model's inherent knowledge with the retrieved information.

In this notebook, we explore multiple ways in which we can use Arcee's SLMs (Small Language Models) in aiding us to implement efficient Retrieval-Augmented Generation (RAG) pipelines.

Below is the list of packages that need to be installed :

httpx
openai
requests
python-dotenv
voyageai
trafilatura
lxml_html_clean

! pip install 'httpx[http2]'
! pip install openai requests python-dotenv voyageai trafilatura lxml_html_clean

Downloading the Documents

We first download the document that has been recently published on reasoning models and save this document for further processing.

import requests
from lxml_html_clean import Cleaner
from lxml import html

# Example usage
url = "https://sebastianraschka.com/blog/2025/understanding-reasoning-llms.html"
cleaner = Cleaner(
        style=True,
        links=True,
        add_nofollow=True,
        page_structure=False,
        safe_attrs_only=True,
        remove_tags=['span', 'div', 'aside', 'nav']
    )

def clean_webpage(url, cleaner):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
    }
    
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        
        doc = html.fromstring(response.content)
    
        cleaned_doc = cleaner.clean_html(doc)
        text = cleaned_doc.text_content()
        text = ' '.join(text.split())
        
        return text
        
    except Exception as e:
        print(f"Error: {e}")
        return None


cleaned_text = clean_webpage(url, cleaner)

if cleaned_text:
    with open('cleaned_article.txt', 'w', encoding='utf-8') as f:
        f.write(cleaned_text)

# Let us print the first 200 characters for verifying.
print(cleaned_text[:200])
print("Length of the text : ", len(cleaned_text))

Then, import the libraries needed for the API call. Also, please save your OPENAI_API_KEY and OPENAI_BASE_URL and your VOYAGE_API_KEY in your .env file

import os
from dotenv import load_dotenv
import numpy as np
import os

import httpx
from openai import OpenAI

load_dotenv()

client = OpenAI(
  http_client=httpx.Client(http2=True),
  base_url=os.environ["OPENAI_BASE_URL"],
  api_key=os.environ["OPENAI_API_KEY"]
)

RAG from Scratch

For more information: https://docs.voyageai.com/docs/introduction

Split the Document into Chunks

chunk_size = 2048
chunks = [cleaned_text[i:i + chunk_size] for i in range(0, len(cleaned_text), chunk_size)]

Considerations:

Chunk size: Depending on your specific use case, it may be necessary to customize or experiment with different chunk sizes and chunk overlap to achieve optimal performance in RAG. For example, smaller chunks can be more beneficial in retrieval processes, as larger text chunks often contain filler text that can obscure the semantic representation. As such, using smaller text chunks in the retrieval process can enable the RAG system to identify and extract relevant information more effectively and accurately. However, it’s worth considering the trade-offs that come with using smaller chunks, such as increasing processing time and computational resources.
How to split: While the simplest method is to split the text by character, there are other options depending on the use case and document structure. For example, to avoid exceeding token limits in API calls, it may be necessary to split the text by tokens. To maintain the cohesiveness of the chunks, it can be useful to split the text into sentences, paragraphs, or HTML headers. If working with code, it’s often recommended to split by meaningful code chunks for example using an Abstract Syntax Tree (AST) parser.

Create Embeddings for Each Text Chunk

import voyageai
vo = voyageai.Client()
sentences = [
    "Renewable energy sources are becoming increasingly important",
    "Multi agent systems combined with reasoning models would be able to transform the Tech landscape ",
    "Quantum mechanics revolutionized our understanding of atomic behavior.",
]
result = vo.embed(sentences, model="voyage-3")

result_embeddings = np.array(result.embeddings)
print(result_embeddings.shape)

def get_text_embedding(text):
    result = vo.embed([text], model="voyage-3")
    vec_array = np.array(result.embeddings)
    return vec_array

text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])

text_embeddings = text_embeddings.squeeze()
text_embeddings.shape

Load into a Vector Database

With Faiss, we instantiate an instance of the Index class, which defines the indexing structure of the vector database. We then add the text embeddings to this indexing structure.

Please install faiss-gpu using the following command:

! conda install -c pytorch -c nvidia faiss-gpu=1.9.0

Refer to the Faiss documentation for more information.

import faiss

d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

Create Embeddings for a Question

Whenever users ask a question, we also need to create embeddings for this question using the same embedding models as before.

question = "Explain about the DeepSeek-R1-Zero Large language model."
question_embeddings = np.array([get_text_embedding(question)])
# squeeze the array
question_embeddings = np.squeeze(question_embeddings, axis=0)
question_embeddings.shape
question_embeddings

Considerations:

Retrieve Similar Chunks from the Vector Database

D, I = index.search(question_embeddings, k=2)
print(I)

retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
print(retrieved_chunk)

Considerations:

Retrieval Methods: There are a lot of different retrieval strategies. In our example, we are showing a simple similarity search with embeddings. Sometimes when there is metadata available for the data, it’s better to filter the data based on the metadata first before performing similarity search. There are also other statistical retrieval methods like TF-IDF and BM25 that use frequency and distribution of terms in the document to identify relevant text chunks.
Retrieved Document: Do we always retrieve individual text chunks as it is? Not always.
- Sometimes, we would like to include more context around the actual retrieved text chunk. We call the actual retrieve text chunk "child chunk" and our goal is to retrieve a larger "parent chunk" that the "child chunk" belongs to.
- On occasion, we might also want to provide weights to our retrieved documents. For example, a time-weighted approach would help us retrieve the most recent document.
- One common issue in the retrieval process is the "lost in the middle" problem where the information in the middle of a long context gets lost. Our models have tried to mitigate this issue. For example, in the passkey task, our models have demonstrated the ability to find a "needle in a haystack" by retrieving a randomly inserted passkey within a long prompt, up to 32k context length. However, it is worth considering experimenting with reordering the document to determine if placing the most relevant chunks at the beginning and end leads to improved results.

Combine Context and Question in a Prompt and Generate a Response

Finally, we can offer the retrieved text chunks as the context information within the prompt. Here is a prompt template where we can include both the retrieved text and user questions in the prompt.

prompt = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {question}
Answer:
"""

import textwrap


def run_virtuoso(prompt):

  response = client.chat.completions.create(
    model='virtuoso-small',
    messages=[{'role': 'user', 'content': prompt}],
    temperature=0.4,
  )
  return response


response = run_virtuoso(prompt)
answer_text = response.choices[0].message.content


print(textwrap.fill(answer_text, width=80))