Build Chatbots: Your Own Multi-Document Assistant

Welcome to this tutorial, where we’ll build a chatbot that answers questions from multiple documents (PDF, DOC, TXT). I’ve used LangChain, the OpenAI API, and large language models from Hugging Face to create a question-answering pipeline, and employed Streamlit to craft a user-friendly web interface. In this blog, I will introduce LangChain, a framework for developing applications on top of LLMs.

1. Single Document vs Multiple Documents

When dealing with a single document, whether it’s a PDF, a Microsoft Word file, or a plain text file, the process remains fairly straightforward: extract all the text from the document, feed it into the prompt of an LLM such as ChatGPT, and pose questions about the content. This mirrors the conventional way of using ChatGPT.

However, the scenario becomes more intricate when handling multiple documents. Due to the token limits inherent in LLMs, we confront the challenge of not being able to ingest all the information from these documents in a single request. As a result, our strategy shifts toward sending only the pertinent information to the LLM prompt to circumvent this limitation. But the question arises: How do we isolate and retrieve only the relevant information from our multitude of documents? This is where embeddings and vector stores become pivotal.

2. Embeddings and Vector Stores

Embeddings and vector stores play a crucial role in distilling relevant information from multiple documents. These tools aid in transforming text into numerical representations that capture semantic relationships and contextual meanings. By converting textual information into high-dimensional vectors, we can effectively organize and index the content. This allows us to efficiently retrieve and present only the most relevant information to the LLM prompt, thereby overcoming the size limitation and ensuring that our queries are focused and contextually aligned.
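
To make this concrete, here is a minimal sketch of how cosine similarity scores the closeness of two vectors. The four-dimensional vectors below are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.

import math

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings for illustration only.
question_vec = [0.9, 0.1, 0.3, 0.0]
on_topic_chunk = [0.8, 0.2, 0.4, 0.1]    # semantically close to the question
off_topic_chunk = [0.0, 0.9, 0.0, 0.8]   # semantically distant

print(cosine_similarity(question_vec, on_topic_chunk))   # high score -> retrieve
print(cosine_similarity(question_vec, off_topic_chunk))  # low score -> skip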

3. Introduction to LangChain

LangChain is a robust framework that equips us with the tools and methodologies needed to harness the capabilities of large language models effectively. It simplifies working with LLMs, providing a user-friendly interface that accelerates the creation of diverse applications built on these models. Moreover, LangChain supports various LLMs, including models from Hugging Face, the OpenAI API, and others.

from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings #HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

Document Processing

Text Splitter

The text_splitter module within LangChain, exemplified here by the CharacterTextSplitter class, provides a valuable utility for breaking down extensive text into manageable chunks. This functionality becomes particularly useful when dealing with large volumes of text, such as multiple documents, enabling efficient processing and analysis. The parameters defined within CharacterTextSplitter, including the separator, chunk size, and overlap, allow customization to suit specific requirements. By employing this module, developers can segment text intelligently, enhancing the overall workflow efficiency.

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks
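
As a quick usage sketch (the sample text is made up), splitting a long string yields chunks of at most roughly 1,000 characters, with a 200-character overlap that helps preserve context across chunk boundaries:

long_text = "\n".join(f"Paragraph {i}: some sample sentence about our documents." for i in range(500))
chunks = get_text_chunks(long_text)
print(len(chunks))     # number of chunks produced
print(chunks[0][:80])  # preview the start of the first chunk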

Embeddings

The utilization of embeddings is integral to LangChain’s functionality. Embeddings capture the semantic nuances and contextual relationships within the text, producing high-dimensional vectors that encapsulate the essence of the content. The choice of embeddings can significantly impact the performance of downstream tasks, and LangChain’s flexibility in accommodating various embedding methods ensures adaptability to diverse use cases. In our case, we use the OpenAI embeddings model; cosine similarity between the resulting vectors measures how closely each document chunk relates to a question.
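
As a quick illustration, and a minimal sketch assuming an OPENAI_API_KEY is set in your environment, you can embed a piece of text directly and inspect the resulting vector:

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vector = embeddings.embed_query("What is LangChain?")
print(len(vector))  # embedding dimensionality, e.g. 1536 for OpenAI's text-embedding-ada-002
print(vector[:5])   # first few components of the vector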

Vector Stores

The vector store, exemplified by the FAISS class, serves as a repository for the high-dimensional vectors generated from the text chunks using the specified embeddings. This component enables efficient storage, indexing, and retrieval of vectors, optimizing the process of accessing relevant information. By organizing these vectors in a manner conducive to rapid search and retrieval, LangChain’s vector stores empower developers to efficiently navigate through vast volumes of textual data, retrieving targeted information while mitigating the challenges posed by token limits in Language Models.

def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    # embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore
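
Once the store is built, retrieval is straightforward. A minimal sketch (the query string is just an example): similarity_search returns the k chunks whose embeddings are closest to the query’s embedding.

vectorstore = get_vectorstore(text_chunks)
relevant_docs = vectorstore.similarity_search("What is the refund policy?", k=3)
for doc in relevant_docs:
    print(doc.page_content[:100])  # preview each retrieved chunk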

LLM Deployment

Let’s take OpenAI as an example. How do we integrate it into our LangChain?

from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# 'documents' is a list of LangChain Document objects, e.g. the chunks
# returned by vectorstore.similarity_search(query)
chain = load_qa_chain(llm=OpenAI())
query = 'Hi, OpenAI.'
response = chain.run(input_documents=documents, question=query)
print(response)

Under the hood, load_qa_chain stuffs the provided documents into a prompt as context and sends it to OpenAI; the prompt resembles the following:

Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to
make up an answer.

{context}
Question: {question}
Helpful Answer:

This code snippet demonstrates how we can employ the LangChain framework to load the question-answering chain and utilize an LLM (in this case, OpenAI) to respond to queries based on provided documents.
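
You can also supply your own prompt. Here is a minimal sketch, assuming the default 'stuff' chain type, that passes a custom PromptTemplate to load_qa_chain; the template must expose the context and question input variables:

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.question_answering import load_qa_chain

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know.

{context}
Question: {question}
Helpful Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
chain = load_qa_chain(llm=OpenAI(), chain_type="stuff", prompt=prompt)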

4. Making the Chatbot Remember Conversation History

To elevate the capabilities of our chatbot, we can implement a feature that allows it to retain and recall previous conversation records.

LangChain provides the ConversationBufferMemory class to manage conversation history. This class effectively stores and retrieves dialogue records, passing the history to the model with each request.

from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

def get_conversation_chain(vectorstore):
    llm = ChatOpenAI(temperature=0.5, max_tokens=512)  # placeholder for your chosen LLM
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain
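
As a quick usage sketch (the questions are made up, and vectorstore comes from get_vectorstore above), the chain carries the history along, so a follow-up question can refer back to the previous turn:

chain = get_conversation_chain(vectorstore)
print(chain({'question': 'What is the document about?'})['answer'])
# Thanks to the memory, the model can resolve "it" from the previous turn:
print(chain({'question': 'Summarize it in one sentence.'})['answer'])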

5. Streamlit for Web App Development

Streamlit simplifies web app development by enabling users to create interactive applications with ease. Here’s an overview of how you can build a web app using Streamlit:

  1. Setting up Streamlit: Install Streamlit using pip install streamlit and launch your app with streamlit run app.py.

  2. Creating the Interface:

    • Use st.write() to display text, charts, images, or other content.
    • Leverage interactive components like st.button(), st.slider(), or st.text_input() for user interaction.
    • Utilize st.sidebar to create a sidebar for additional controls or information.
  3. Handling User Inputs:

    • Capture user inputs using functions like st.text_input() or st.file_uploader().
    • Process and respond to user queries or actions based on the inputs received.
  4. Real-time Updates:

    • Streamlit automatically updates the app in real-time as you modify the code, providing instant previews without manual refreshing.
  5. Integration with Data Visualization:

    • Integrate popular data visualization libraries such as Matplotlib or Plotly to visualize data within your app using st.pyplot() or st.plotly_chart().
  6. Deployment:

    • Streamlit offers straightforward deployment options for sharing your app, making it accessible to others via a URL.

Example:

import streamlit as st

# App setup
st.title('My Streamlit Web App')
user_input = st.text_input('Enter text here:')
st.write('You entered:', user_input)

Streamlit’s simplicity and integration with Python make it an excellent choice for quickly building and deploying web apps, especially for data-driven applications.

6. Entire Code

import streamlit as st
from dotenv import load_dotenv
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings  # HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, bot_template, user_template
from langchain.llms import HuggingFaceHub

def get_text(docs):
    # Concatenate the uploaded text files into a single string,
    # since CharacterTextSplitter.split_text expects a string.
    text = ""
    for file in docs:
        text += file.read().decode('utf-8') + "\n"
    return text

def get_text_chunks(text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    return chunks

def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    # embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorstore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorstore

def get_conversation_chain(vectorstore):
    llm = ChatOpenAI(temperature=0.5, max_tokens=512)
    # llm = HuggingFaceHub(repo_id="google/mt5-base", model_kwargs={"temperature": 0.5, "max_length": 512})
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain

def handle_userinput(user_question):
    response = st.session_state.conversation({'question': user_question})
    st.session_state.chat_history = response['chat_history']

    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace(
                "{{MSG}}", message.content), unsafe_allow_html=True)

def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with docs", page_icon=':books:')
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with docs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your docs")
        docs = st.file_uploader("Upload here", accept_multiple_files=True)
        if st.button("Process"):
            with st.spinner("Processing"):
                # get the text
                raw_text = get_text(docs)
                # get the chunks
                text_chunks = get_text_chunks(raw_text)
                # st.write(text_chunks)
                # create the vector store
                vectorstore = get_vectorstore(text_chunks)
                # create the conversation chain
                st.session_state.conversation = get_conversation_chain(vectorstore)

if __name__ == '__main__':
    main()

This code specifically handles text files (.txt). For PDF or DOC files, you’d need to utilize specific libraries to extract text content from them.
For PDF files, libraries like PyPDF2, pdfplumber, or PyMuPDF can be used to extract text content.
For DOC files, libraries such as python-docx, pywin32, or textract can assist in obtaining text content.
For example:

import os
from langchain.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader

documents = []
for file in os.listdir('docs'):
    if file.endswith('.pdf'):
        pdf_path = './docs/' + file
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())
    elif file.endswith('.docx') or file.endswith('.doc'):
        doc_path = './docs/' + file
        loader = Docx2txtLoader(doc_path)
        documents.extend(loader.load())
    elif file.endswith('.txt'):
        text_path = './docs/' + file
        loader = TextLoader(text_path)
        documents.extend(loader.load())
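
Note that these loaders return LangChain Document objects rather than raw strings, so to plug them into the pipeline above you would use the document-based variants of the splitter and vector store; a minimal sketch, reusing the splitter settings from earlier:

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
vectorstore = FAISS.from_documents(documents=chunks, embedding=OpenAIEmbeddings())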

Additionally, you need to input your OPENAI_API_KEY and HUGGINGFACEHUB_API_TOKEN in the .env file. You’ll also require an htmlTemplates.py. For specific project details, please refer to my Chatbot repository on GitHub (publicly available).
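
For reference, the .env file is just two key/value lines; the values below are placeholders for your own credentials:

OPENAI_API_KEY=your-openai-api-key
HUGGINGFACEHUB_API_TOKEN=your-huggingfacehub-token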

7. Running Examples

Open your terminal and input streamlit run followed by the file path. For instance, on my computer:

streamlit run /Users/jenny/Documents/chatbot/app.py

This command starts the application by executing the file named app.py located at the specified file path.

After uploading your files, click on the ‘Process’ button, then you can ask your questions!

This is a screenshot of my running example:

[Screenshot: Running Example]

Happy coding!

