Why my AI chatbot kept forgetting things (and how I fixed it)

#ai #python #tutorial #discuss

I spent the last two weekends building a customer support chatbot for my side project. It was supposed to answer questions from our documentation. The first day was magic – it answered simple questions perfectly. Then came the hard ones.

A user asked "How do I reset my password using the recovery email option, because my old method from last year isn't working?" The chatbot replied with a generic link to the password reset page. Completely useless. The problem wasn't the language model – it was that the relevant context was scattered across three different documents, and my naive retrieval setup couldn't connect the dots.

The naive approach that failed

My first attempt was simple: break all documentation into fixed-size chunks (512 tokens), embed them with OpenAI embeddings, and stuff the top-3 chunks into the prompt. This works fine for short, isolated answers. But when a user asks a multi-step question that references prior context ("that old method from last year"), the fixed chunks often lack the necessary background.
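
For reference, a minimal sketch of that naive setup might look like the following. It is an illustration under assumptions, not the original code: whitespace-separated words stand in for real tokens, and `toy_embed` is a hypothetical bag-of-words stand-in for an OpenAI embedding call so the example runs without an API key. All function names are made up for this sketch.

```python
import math
from collections import Counter


def fixed_size_chunks(text, chunk_size=512):
    """Split text into consecutive chunks of at most chunk_size whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query, chunks, k=3):
    """Rank every chunk against the query and keep the k most similar."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:k]


def build_prompt(query, chunks, k=3):
    """Stuff the top-k chunks into a single prompt string."""
    context = "\n---\n".join(top_k_chunks(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Each chunk is scored in isolation here, which is why a question spanning several sections can only be answered if all the pieces happen to land in the top k.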

I tried a sliding window – overlapping chunks with 50% overlap. That helped a little, but I was still losing information when the relevant data lived in different sections. Worse, as the conversation history grew, the prompt ballooned in size. I was paying for thousands of tokens just to keep the chatbot from saying "I don't know" to the next question.
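
A sliding window with 50% overlap can be sketched roughly as below. As before, this assumes whitespace tokens instead of a real tokenizer, and the function name and parameters are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=512, overlap=0.5):
    """Split text into chunks of chunk_size tokens, each sharing a fraction
    `overlap` of its tokens with the previous chunk."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    tokens = text.split()
    # With 50% overlap the window advances by half a chunk each step.
    step = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks
```

With a step of half the chunk size, roughly twice as many chunks get embedded and stored, and two pieces of information that sit far apart in the docs still never end up in the same chunk.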

What actually worked: recursive retrieval with hierarchical summarization

The breakthrough came when I stopped thinking about "chunking" and started thinking about "building a context hierarchy". Here’s the idea:

Split documents into coarse sections (by H2 headers or logical breaks).
For each section, generate a short summary (using a cheaper LLM like GPT-3.5-turbo).
Embed both the summaries and the full text of each section.
On query, first retrieve the top-K summaries. Use those to decide which full sections to pull in.
Then do a second retrieval inside those selected sections to find the exact chunks.

This two-stage approach let me handle questions that required info from different parts of the docs without blowing up the prompt. The summaries act as a table of contents, so the retriever knows where to look before any full text is committed to the prompt.
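
Stripped of any framework, the two stages can be sketched like this. It is a toy illustration under assumptions: the summaries are passed in as plain strings (in practice they would come from an LLM), `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and every name here is invented for the example.

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, texts, k):
    """Return the indices of the k texts most similar to the query."""
    q = embed(query)
    order = sorted(range(len(texts)), key=lambda i: cosine(q, embed(texts[i])), reverse=True)
    return order[:k]


def two_stage_retrieve(query, sections, summaries, k_summaries=2, k_chunks=2, chunk_size=8):
    """Stage 1 picks sections via their summaries; stage 2 searches chunks of those sections only."""
    section_ids = rank(query, summaries, k_summaries)
    chunks = []
    for i in section_ids:
        tokens = sections[i].split()
        chunks += [" ".join(tokens[j:j + chunk_size]) for j in range(0, len(tokens), chunk_size)]
    return [chunks[i] for i in rank(query, chunks, k_chunks)]
```

Only the sections whose summaries match are ever chunked and searched, so unrelated sections cannot crowd out the relevant ones in the final top k.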

The code (simplified, but real)

Here’s a Python snippet that does the hierarchical retrieval. I’m using LangChain for orchestration, but the pattern is tool-agnostic.

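from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import BaseRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS

# `docs` is assumed to be a list of LangChain Document objects loaded elsewhere.

# Step 1: Coarse splitting by headers (chunk_size is measured in characters)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=200, separators=["\n## ", "\n# ", "\n\n"]
)
sections = splitter.split_documents(docs)

# Step 2: Generate summaries for each section
# gpt-3.5-turbo is a chat model, so it goes through ChatOpenAI rather than OpenAI
summary_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
section_summaries = []
for section in sections:
    summary = summary_llm.predict(
        f"Summarize this document section in 2-3 sentences:\n{section.page_content}"
    )
    section_summaries.append(summary)

# Step 3: Embed summaries and section texts separately
embeddings = OpenAIEmbeddings()
# Store each summary's section index as metadata, so mapping a hit back to its
# section does not depend on the summary text being unique
summary_vectorstore = FAISS.from_texts(
    section_summaries,
    embeddings,
    metadatas=[{"section_index": i} for i in range(len(sections))],
)
full_text_vectorstore = FAISS.from_documents(sections, embeddings)

# Splitter for the fine-grained second stage
fine_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Step 4: Two-stage retrieval
def hierarchical_retrieval(query, k_summaries=3, k_chunks=2):
    # First retrieve top-k summaries
    summary_results = summary_vectorstore.similarity_search(query, k=k_summaries)
    # Identify which sections those summaries belong to (via stored metadata)
    indices = [s.metadata["section_index"] for s in summary_results]
    # Gather the full sections for those indices
    candidate_sections = [sections[i] for i in indices]
    # Second retrieval within those sections (or you can just use them directly):
    # split them into smaller chunks and search only that subset
    fine_chunks = fine_splitter.split_documents(candidate_sections)
    local_vs = FAISS.from_documents(fine_chunks, embeddings)
    return local_vs.similarity_search(query, k=k_chunks)

# RetrievalQA expects a retriever object, not a bare function, so wrap it.
# Which methods must be overridden depends on your LangChain version.
class HierarchicalRetriever(BaseRetriever):
    def _get_relevant_documents(self, query, *, run_manager=None):
        return hierarchical_retrieval(query)

    async def _aget_relevant_documents(self, query, *, run_manager=None):
        return hierarchical_retrieval(query)

# Use in a QA chain
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4", temperature=0),
    retriever=HierarchicalRetriever(),
)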

This is a minimal example; in production you’d cache vectors and handle chunk indices properly. The key takeaway is the two-stage idea.

Lessons learned

Summaries are cheap insurance. Generating them cost me about $0.02 per document – a tiny upfront cost that saved massive token waste later.
Chunk size is a trade-off. I settled on sections of about 2000 characters for coarse granularity (the chunk_size in the code above counts characters, not words), and then chunks of about 500 characters for fine retrieval. Your mileage depends on your document structure.
This pattern is not perfect for everything. If your knowledge base is a single long article (like a novel), hierarchical retrieval doesn’t help much – you might need a sliding window with topic tracking.
Alternatives exist. You could fine-tune a small model on your docs, or use a reranker like Cohere. But hierarchical retrieval is simpler to maintain.

When to avoid this

If your queries are always single-fact ("What's the refund policy?"), the naive chunking works fine. Over-engineering with two-stage retrieval adds complexity and latency. Also, if your docs are constantly changing, re-generating summaries becomes a chore. In that case, consider on-the-fly summarization of retrieved chunks instead.

What I'd do differently next time

Instrument everything. I should have added logging for which chunks were retrieved for each query – would have sped up debugging.
Use a cheaper summary model. GPT-3.5-turbo is fine, but even cheaper models like Llama-3-8B via an API would work. Next time I’ll try a local model.
Test edge cases. I didn’t think about questions that needed info from four docs – my k_summaries=3 missed one. Choosing k dynamically based on question complexity would be an improvement.
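
As one possible shape for the instrumentation point, a small wrapper can log what each retrieval call returned. This is a hedged sketch using only Python's standard `logging` module; `with_retrieval_logging` and the logger name are hypothetical, and it assumes the wrapped function takes a query string and returns a list of text chunks.

```python
import functools
import logging

logger = logging.getLogger("retrieval")


def with_retrieval_logging(retrieve):
    """Wrap a retrieve(query) -> list[str] function so every call logs its results."""
    @functools.wraps(retrieve)
    def wrapped(query):
        results = retrieve(query)
        logger.info("query=%r returned %d chunks", query, len(results))
        for position, chunk in enumerate(results, start=1):
            # Log only a short preview so long chunks do not flood the log
            logger.info("  #%d: %s", position, chunk[:80])
        return results
    return wrapped
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup makes these lines visible, which shows at a glance whether a bad answer came from bad retrieval or bad generation.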

Now my chatbot remembers that "old method from last year" refers to the deprecated email recovery flow, and pulls in both the old and new documentation. The user got a helpful answer. Victory.

This whole journey taught me that retrieval is harder than generation. The model is smart; the real work is feeding it the right context.

What’s your approach to context management in LLM applications? Do you use chunking, summarization, or something else entirely?

P.S. If you're curious about the actual tooling I used beyond LangChain, I experimented with a service at https://ai.interwestinfo.com/ for the vector store hosting, but the technique itself is framework-agnostic.

DEV Community