Build Your First Q&A Bot with Langchain - A Beginner's Guide

Feb 20 2024

Large language models (LLMs) excel at answering questions about topics covered in their training data. However, they cannot answer questions about personal data, proprietary company documents, or articles published after their training cutoff. Being able to converse with our own documents and answer questions over them with an LLM would therefore be incredibly valuable.

# Introduction

Let's first understand LLMs. LLMs, or Large Language Models, are machine learning models trained on massive amounts of data to understand and generate human language. Some examples of LLMs are GPT, LaMDA, and LLaMA.

Next, what is LangChain? LangChain is a framework for developing applications powered by language models. It lets developers harness the power of LLMs for a wide range of applications, whether you're working on chatbots, document analysis, or code-related tasks.

# Extracting Text from URLs

First, we'll extract text from web pages using the popular libraries requests, BeautifulSoup, and xmltodict. We'll use the sitemap.xml of the site we are building the Q&A bot for. As an example, let's use my own website, whose sitemap is located at https://techknowcrave.com/sitemap.xml.

import requests
import xmltodict
from bs4 import BeautifulSoup

# Download the sitemap and parse it into a Python dictionary
sitemap = requests.get("https://techknowcrave.com/sitemap.xml")
sitemap_xml = sitemap.text
raw_data = xmltodict.parse(sitemap_xml)

pages = []

# Only index this post; drop the filter in the loop below to index the whole site
my_url = (
    "https://techknowcrave.com/post/why-python-as-your-first-programming-language/"
)

# Extract text from each URL in the sitemap and store it in the pages list
for info in raw_data["urlset"]["url"]:
    url = info["loc"]
    if my_url in url:
        html = requests.get(url).text
        text = BeautifulSoup(html, features="html.parser").get_text()

        # Strip whitespace and drop empty lines before joining the text back together
        lines = (line.strip() for line in text.splitlines())
        cleaned_text = "\n".join(line for line in lines if line)
        pages.append({"text": cleaned_text, "source": url})

In the provided code, we start by making a GET request with the requests library to retrieve the sitemap.xml file, then parse it into a Python dictionary with xmltodict.

We then iterate through the URLs extracted from the sitemap, keeping only the URLs that match my_url. For each matching URL, we fetch the HTML, parse it with BeautifulSoup, and extract its plain text. We then clean up the extracted text by stripping leading and trailing whitespace from each line and joining the non-empty lines into a single newline-separated string.

If you want to use all of the site's content, simply remove the if condition that filters on my_url.

Finally, we append the cleaned text along with the source URL to a list named pages.
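
As mentioned above, if you would rather index every page in the sitemap, the same loop without the my_url filter looks like this:

# Variant: extract text from every URL in the sitemap (no my_url filter)
for info in raw_data["urlset"]["url"]:
    url = info["loc"]
    html = requests.get(url).text
    text = BeautifulSoup(html, features="html.parser").get_text()

    lines = (line.strip() for line in text.splitlines())
    cleaned_text = "\n".join(line for line in lines if line)
    pages.append({"text": cleaned_text, "source": url})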

# Split content of each page into multiple documents

Although we have collected all the data from the blog pages, we must consider the LLM's context limit. To avoid passing excessively long documents to the model, we will use the CharacterTextSplitter from LangChain.

from langchain.text_splitter import CharacterTextSplitter

# Split the text into chunks of roughly 1500 characters and store them in the docs list
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs, metadatas = [], []
for page in pages:
    splits = text_splitter.split_text(page["text"])
    docs.extend(splits)
    # Keep the source URL of every chunk so we can trace answers back to it
    metadatas.extend([{"source": page["source"]}] * len(splits))

In this code, we use the CharacterTextSplitter to break the content of each page into smaller documents of roughly 1,500 characters, splitting on newlines, so that each one fits comfortably within the LLM's context limit. The resulting docs list contains these chunks, and the metadatas list keeps track of the source URL for each one.
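
To sanity-check the split before moving on, you can print how many chunks were produced and preview the first one (a quick inspection, not part of the bot itself):

# Quick sanity check on the split results
print(f"Number of chunks: {len(docs)}")
print(f"First chunk preview: {docs[0][:200]}")
print(f"Source of first chunk: {metadatas[0]['source']}")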

# Create a vector store

Now that we have well-organized documents and their source URLs, we can generate OpenAI embeddings for each chunk and store them in a vector store.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embed every chunk with OpenAI and index the vectors with FAISS
vectorstore = FAISS.from_texts(
    docs,
    OpenAIEmbeddings(),
    metadatas=metadatas,
)

# save the vectorstore to a local file
vectorstore.save_local("vectorstore")

The resulting embeddings are saved as a FAISS index in the current directory, in a folder named vectorstore.
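
Before wiring up the question-answering chain, you can verify the index with a quick similarity search. The query string here is just an example; any phrase related to your own content will do:

# Quick check: fetch the two chunks most similar to a sample query
results = vectorstore.similarity_search("why learn python", k=2)
for doc in results:
    print(doc.metadata["source"])
    print(doc.page_content[:150])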

# Asking questions

We can now ask questions about our documents:

import sys

from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAI, OpenAIEmbeddings

# Load OPENAI_API_KEY from a .env file
load_dotenv()

if len(sys.argv) < 2:
    print('Usage: python ask_question.py "<your question>"')
    sys.exit(1)

query = sys.argv[1]

# Load the FAISS index we saved earlier
vectorstore = FAISS.load_local(
    "vectorstore",
    OpenAIEmbeddings(),
)

# Expose the vector store as a retriever
retriever = vectorstore.as_retriever()

# Build a chain that retrieves relevant chunks and passes them to the LLM
retrievalQA = RetrievalQA.from_llm(llm=OpenAI(), retriever=retriever)

response = retrievalQA.invoke(query)

# print(f"Question: {response['query']}")
print(f"Answer: {response['result']}")

You can run the script like this: python ask_question.py "Why is Python so popular?"

Answer:

Python is popular for a variety of reasons, including its straightforward syntax, versatility, extensive collection of libraries and frameworks, compatibility, and vibrant open-source community. Additionally, Python's ease of use and power make it an excellent choice for beginners and experienced programmers alike.
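
Since each chunk was stored with its source URL, you can also ask the chain to return the documents it used to answer. A small sketch, assuming the same setup as in the script above; return_source_documents adds a source_documents key to the response:

# Sketch: have the chain return the retrieved chunks alongside the answer
retrievalQA = RetrievalQA.from_llm(
    llm=OpenAI(),
    retriever=retriever,
    return_source_documents=True,
)

response = retrievalQA.invoke(query)
print(f"Answer: {response['result']}")
for doc in response["source_documents"]:
    print(f"Source: {doc.metadata['source']}")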

You can find the complete code on GitHub.

Tags :
# AI # Langchain