Large language models (LLMs) excel at answering questions on the topics they were trained on. However, they cannot answer questions about personal data, proprietary company documents, or articles written after their training cutoff. Being able to converse with our own documents and have an LLM answer questions about them would be incredibly valuable.
# Introduction
Let's first understand LLMs. LLMs, or Large Language Models, are machine learning models trained on massive amounts of data to understand and generate human language. Some examples of LLMs are GPT, LaMDA, and LLaMA.
"Next, what is LangChain? LangChain is a framework for developing applications powered by language models. LangChain empowers developers to harness the power of language models for a wide range of applications, whether you're working on chatbots, document analysis, or code-related tasks.
# Extract text from web pages
First, we'll extract text from web pages using popular libraries such as requests, BeautifulSoup, and xmltodict. We'll use the sitemap.xml of the site we're building the Q&A bot for to discover its pages. As an example, let's use my own website, whose sitemap lives at https://techknowcrave.com/sitemap.xml.
import requests
import xmltodict
from bs4 import BeautifulSoup
sitemap = requests.get("https://techknowcrave.com/sitemap.xml")
sitemap_xml = sitemap.text
raw_data = xmltodict.parse(sitemap_xml)
pages = []
# Extract text from each URL in the sitemap and store it in the pages list
my_url = "https://techknowcrave.com/post/why-python-as-your-first-programming-language/"
for info in raw_data["urlset"]["url"]:
    url = info["loc"]
    if my_url in url:
        html = requests.get(url).text
        text = BeautifulSoup(html, features="html.parser").get_text()
        lines = (line.strip() for line in text.splitlines())
        cleaned_text = "\n".join(line for line in lines if line)
        pages.append({"text": cleaned_text, "source": url})
In the provided code, we start by making a GET request with the requests library to retrieve the content of the sitemap.xml file, which we then parse into a Python dictionary with xmltodict.
We then iterate through the URLs extracted from the sitemap, keeping only those that match a specific prefix (my_url). For each matching URL, we fetch the HTML content, parse it with BeautifulSoup, and extract the text. We then clean up the extracted text by stripping leading and trailing whitespace from each line and joining the non-empty lines into a single newline-separated string.
If you want to use all of the site's data, simply remove the if condition that filters on my_url.
Finally, we append the cleaned text along with its source URL to a list named pages.
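As a quick sanity check (purely illustrative, not part of the pipeline), you can confirm what ended up in pages:
# Hypothetical check: confirm what was collected
print(f"Collected {len(pages)} page(s)")
print(pages[0]["source"])
print(pages[0]["text"][:200])  # preview the first 200 characters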
# Split content of each page into multiple documents
Although we have collected all the data from the blog pages, we must keep the LLM's context limit in mind. To avoid excessively long documents, we will use the CharacterTextSplitter from langchain.
from langchain.text_splitter import CharacterTextSplitter
# Split the text into chunks of 1500 characters and store the chunks in the docs list
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs, metadatas = [], []
for page in pages:
    splits = text_splitter.split_text(page["text"])
    docs.extend(splits)
    metadatas.extend([{"source": page["source"]}] * len(splits))
In this code, we use the CharacterTextSplitter to break the content of each page into smaller documents that fit within the LLM's context limit. The resulting docs list contains these split documents, and the metadatas list keeps track of the source URL for each document.
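Before moving on, it can help to verify the split with a quick, illustrative check:
# Hypothetical check: how many chunks did the splitter produce?
print(f"Split {len(pages)} page(s) into {len(docs)} chunk(s)")
print(metadatas[0])  # each entry holds the source URL for one chunk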
# Create a vector store
Now that we have well-organized documents and their source URLs, we will generate OpenAI embeddings for the text and create a vector store.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
vectorstore = FAISS.from_texts(
    docs,
    OpenAIEmbeddings(),
    metadatas=metadatas,
)
# Save the vector store to a local file
vectorstore.save_local("vectorstore")
This saves the resulting embeddings in a FAISS store named vectorstore in the current directory.
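At this point you can already query the store directly, without an LLM. A minimal sketch, assuming OPENAI_API_KEY is set and using a made-up query:
# Sketch: similarity search against the new store (query is illustrative)
results = vectorstore.similarity_search("Why learn Python first?", k=2)
for doc in results:
    print(doc.metadata["source"])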
# Asking questions
We can now ask questions about our documents:
import sys
from dotenv import load_dotenv
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_core.vectorstores import VectorStoreRetriever
from langchain_openai import OpenAI, OpenAIEmbeddings
load_dotenv()
query = None
if len(sys.argv) > 1:
    query = sys.argv[1]
vectorstore = FAISS.load_local(
    "vectorstore",
    OpenAIEmbeddings(),
    # Note: recent langchain-community releases may also require
    # allow_dangerous_deserialization=True here.
)
retriever = VectorStoreRetriever(vectorstore=vectorstore)
retrievalQA = RetrievalQA.from_llm(llm=OpenAI(), retriever=retriever)
response = retrievalQA.invoke(query)
# print(f"Question: {response['query']}")
print(f"Answer: {response['result']}")
You can run it like this: python ask_question.py "why python is so popular?"
Answer:
Python is popular for a variety of reasons, including its straightforward syntax, versatility, extensive collection of libraries and frameworks, compatibility, and vibrant open-source community. Additionally, Python's ease of use and power make it an excellent choice for beginners and experienced programmers alike.
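If you also want to see which pages the answer was drawn from, RetrievalQA can return its source documents. A minimal sketch of that variant:
# Sketch: also return the source documents used to build the answer
retrievalQA = RetrievalQA.from_llm(
    llm=OpenAI(), retriever=retriever, return_source_documents=True
)
response = retrievalQA.invoke(query)
print(f"Answer: {response['result']}")
for doc in response["source_documents"]:
    print(f"Source: {doc.metadata['source']}")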
You can find the complete code on GitHub.