Learn to use DocsGPT, a tool that leverages OpenAI's GPT to query PDF, PowerPoint, and Word documents in Google Colaboratory
A couple of months ago, my colleague Farhad and I came up with an idea that made PDF queries using GPT-3 more accessible. We wrote some simple, functional code, created a repo, then shared it on LinkedIn and got lots of feedback:
Based on that feedback, we gradually updated the code so it can also query PowerPoint and Word files! In this tutorial, I will explain how to use this tool and how it works under the hood. The great benefit of this tool is that everything is stored in your Google Drive, and you don't need to install Python locally or worry about package compatibility: everything runs in Google Colab!
Here is the GitHub repo: https://github.com/Farhad-Davaripour/DocsGPT
Here is a tutorial on how to use this tool on YouTube:
I was motivated to write this article to explain, for the more curious, how the tool works. In this tutorial, we will explore how to use DocsGPT in detail. Here is the table of contents to help you navigate the different sections:
- Introduction to DocsGPT using Google Colab
- How to Use DocsGPT
- Understanding the Code: Jupyter Notebook and Running Query
- Understanding the Code: Main DocsGPT Code
- Conclusion
Introduction
Using the GPT-3 language model from OpenAI, DocsGPT enables users to quickly query PDF, Word, and PowerPoint documents. The application is designed to run on Google Colab, a free cloud service that provides a Jupyter Notebook environment with free GPU access. Our version of DocsGPT has the advantage of using Google Drive to store and access your PDF, Word, and PowerPoint files.
In this tutorial, we'll walk you through using DocsGPT; the next section covers each step in detail.
How to Use DocsGPT
This section will guide you through querying your PDF, Word, or PowerPoint documents with DocsGPT. To see the tool in action, refer to the YouTube video lesson linked above.
To start, follow the steps below:
1. Start DocsGPT: You can use this link to run it on Google Colab:
Or run the following code on Google Colab:
#@title
# Connect your Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import os

drive_path = '/content/drive/MyDrive'
folder_name = 'git_repo'
folder_path = os.path.join(drive_path, folder_name)

if not os.path.exists(folder_path):
    os.makedirs(folder_path)

repo_path = os.path.join(folder_path, "DocsGPT")
if not os.path.exists(repo_path):
    os.chdir(folder_path)
    !git clone https://github.com/Farhad-Davaripour/DocsGPT.git
    os.chdir(repo_path)
else:
    os.chdir(repo_path)
    !git pull

# Import docsGpt
import docsGpt

# Run the conversation loop over the uploaded document
docsGpt.run_conversation(folder_path)
2. Provide OpenAI API Key: This is required. You can obtain an API key from your OpenAI account page if you don't already have one.
3. Upload Your File: After providing your API key, you will be asked to upload your PDF, Word, or PowerPoint file. The file will be saved on your Google Drive in the folder passed to run_conversation (/content/drive/MyDrive/git_repo).
4. Submit Your Queries: Last but not least, you can submit your queries (questions) about the uploaded document. Once you are done asking questions, type "stop."
That's it! You have successfully set up and run DocsGPT and can now start querying your documents using the power of GPT-3.
But wait, how does this work? Let's move on to the next section, where we will walk through the code behind DocsGPT.
Understanding the Code: Jupyter Notebook and Running Query
In this part, we will dig into the code that underlies DocsGPT, explaining what each piece does and how it contributes to the application's overall functionality.
Let’s start with the Jupyter Notebook cell we ran in the previous section. The first part of the code is responsible for setting up the environment and preparing for the execution of DocsGPT. It does the following:
- Connect to Google Drive: The drive.mount('/content/drive', force_remount=True) command is used to connect your Google Drive to the Google Colab environment. This allows the notebook to access and manipulate files in your Google Drive.
- Set Up Paths: The next few lines of code set up the necessary paths for the operation of DocsGPT. The drive_path is set to the root of your Google Drive. The folder_name is set to 'git_repo', and the folder_path is created by joining the drive_path and folder_name. If the folder_path does not exist, it is created using the os.makedirs(folder_path) command.
- Clone or Pull the DocsGPT Repository: The code then checks if the DocsGPT repository exists in the specified folder_path. If it does not exist, the repository is cloned from GitHub using the !git clone https://github.com/Farhad-Davaripour/DocsGPT.git command. If it does exist, the code navigates to the repository’s directory and pulls the latest changes from GitHub using the !git pull command.
- Import DocsGPT: The import docsGpt line is where the magic starts. This line of code imports the DocsGPT module, which contains the functionality for querying documents using OpenAI’s GPT-3.
In the next section, we will delve deeper into the docsGpt module and explain how it works.
Understanding the Code: Main DocsGPT Code
In this section, we will delve into the docsGpt.py module, which is the heart of the DocsGPT application. You can check it in the repo, but we will review it more thoroughly here.
DocsGPT/docsGpt.py at main · Farhad-Davaripour/DocsGPT
This module contains the functions that enable querying PDF, Word, and PowerPoint documents using OpenAI's GPT.
Let’s break down the main components of the docsGpt.py module:
Importing required modules
The module begins by importing necessary Python libraries, such as sys, subprocess, os, shutil, time, and tempfile. These provide various functionalities used throughout the module. Since they are part of the standard library (or, in the case of google.colab, preinstalled on Colab), nothing needs to be installed.
# Import required modules
import sys
import subprocess
from google.colab import files
import os
import shutil
import time
import tempfile
The module then defines a list of the library names required to run DocsGPT and imports them dynamically in a for loop. If a library is not found (which will be the case on a fresh Colab runtime), it is installed with pip install. This way, you won't hit an ImportError just because a library isn't installed yet.
# List of library names to import
library_names = ['langchain', 'openai', 'PyPDF2',
                 'tiktoken', 'faiss-cpu', 'textwrap',
                 'python-docx', 'python-pptx']

# Dynamically import libraries from list
for name in library_names:
    try:
        __import__(name)
    except ImportError:
        print(f"{name} not found. Installing {name}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', name])
# Import required modules
from PyPDF2 import PdfReader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch, Pinecone, Weaviate, FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from getpass import getpass
import textwrap
import os
import docx
import pptx
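One subtlety worth noting: for a few packages, the pip name differs from the import name — faiss-cpu is imported as faiss, python-docx as docx, and python-pptx as pptx. Because of that, __import__('faiss-cpu') raises ImportError even when the library is installed, so the loop above re-runs pip install on every execution (harmless, since pip detects the existing install, but slower). A minimal sketch of an alternative that maps pip names to import names (the mapping dict is our own addition, not part of the repo):

import importlib
import subprocess
import sys

# pip package name -> module name used in `import` (our own mapping, not in the repo)
PIP_TO_MODULE = {
    'langchain': 'langchain',
    'openai': 'openai',
    'PyPDF2': 'PyPDF2',
    'tiktoken': 'tiktoken',
    'faiss-cpu': 'faiss',
    'python-docx': 'docx',
    'python-pptx': 'pptx',
}

for pip_name, module_name in PIP_TO_MODULE.items():
    try:
        importlib.import_module(module_name)
    except ImportError:
        print(f"{module_name} not found. Installing {pip_name}...")
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', pip_name])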
Let's see what these libraries are and why we need them:
- langchain: This is an open-source Python framework for building applications around language models, including GPT-3. In DocsGPT, it is used to create the QA chain and to split the text into chunks that the GPT-3 model can process.
- openai: This is the official Python client library for the OpenAI API. It provides a convenient way to interact with the OpenAI API and use its features, including the GPT-3 model. In DocsGPT, it is used to make requests to the OpenAI API.
- PyPDF2: This is a Python library for reading and manipulating PDF files. In DocsGPT, it is used for extracting text from PDF files.
- tiktoken: This is a Python library developed by OpenAI. It provides a way to count the number of tokens in a text string without making an API call (see the short sketch after this list). In DocsGPT, it is used to ensure that the text chunks do not exceed the token size limits of the GPT-3 model.
- faiss-cpu: This is a Python library developed by Facebook AI Research. It provides a set of algorithms for efficient similarity search and clustering of dense vectors. In DocsGPT, it is used to create a FAISS index from the text chunks, which is used for similarity search in the run_query function.
- textwrap: This is a standard Python library used to wrap and format plain text. In DocsGPT, it is used to format the output of the run_query function.
- python-docx: This is a Python library to create and update Microsoft Word (.docx) files. In DocsGPT, it is used to extract text from .docx files.
- python-pptx: This is a Python library to create and update Microsoft PowerPoint (.pptx) files. In DocsGPT, it is used to extract text from .pptx files.
Each of these libraries plays a crucial role in the functioning of DocsGPT, providing the tools needed to query PDF, Word, and PowerPoint documents using OpenAI's GPT-3.
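To make tiktoken's role concrete, here is a minimal sketch of counting tokens locally, as flagged in the list above (the encoding name below is illustrative; the right one depends on the exact model):

import tiktoken

# Count tokens locally, without an API call. The encoding to use depends on
# the model; "cl100k_base" is shown here purely as an illustration.
encoding = tiktoken.get_encoding("cl100k_base")
text = "DocsGPT splits documents into chunks before querying GPT-3."
print(len(encoding.encode(text)), "tokens")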
Setting Up OpenAI API key
The module then prompts users to enter their OpenAI API key. This key is then set as an environment variable, which allows it to be used by the OpenAI library.
# Adding the API token
# Get your key at: https://platform.openai.com/account/billing/overview
token = getpass("Enter your OpenAI token: ")
os.environ["OPENAI_API_KEY"] = str(token)
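A tiny sanity check you might add right after this step (our own addition, not part of the repo): confirm the variable is actually set before continuing.

import os

# Fail fast if the key was not provided (our own sanity check, not in the repo).
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"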
Initializing OpenAI embeddings and QA chain
The module then initializes the OpenAI embeddings and the QA chain. The embeddings convert text into vectors the model can compare, and the QA chain processes the queries and returns the answers. The chain_type="stuff" setting simply "stuffs" all retrieved chunks into a single prompt for the model.
# Download embeddings from OpenAI
embeddings = OpenAIEmbeddings()
chain = load_qa_chain(OpenAI(), chain_type="stuff")
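To see what the embeddings object does in isolation, here is a minimal sketch (it uses the OPENAI_API_KEY set earlier and makes a billable API call; the sample string is ours):

# Embed a single string into a dense vector via the OpenAI API.
vector = embeddings.embed_query("What is DocsGPT?")
print(len(vector))  # dimensionality of the returned embedding

Vectors like this are what FAISS compares when it searches for the chunks most similar to a query.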
The module then defines four main functions that drive DocsGPT:
- extract_texts(root_files): This function extracts text from the uploaded files and puts it in a list. It supports .pdf, .docx, and .pptx file types. If multiple files are provided, their contents are concatenated.
- run_query(query, docsearch): This function runs a query on a document file using the docsearch and chain libraries. It returns a string containing the output of the chain library run on the documents returned by the docsearch similarity search.
- upload_file(folder_path): This function uploads one or more files from the local file system and saves them to a specified folder path. It returns a list of paths of the uploaded files.
- run_conversation(folder_path): This function initiates a conversation with the user by repeatedly asking for input queries and running them on a document file.
Now, let’s go through each function in detail:
1. Function: extract_texts(root_files)
def extract_texts(root_files):
    """
    Extracts text from the uploaded files and concatenates it.
    Supported file types: .pdf, .docx, .pptx
    If multiple files are provided, their contents are concatenated.

    Args:
        root_files: A list of file paths to be processed.

    Returns:
        A FAISS index object containing the embeddings of the
        text chunks.
    """
    raw_text = ''
    for root_file in root_files:
        _, ext = os.path.splitext(root_file)
        if ext == '.pdf':
            with open(root_file, 'rb') as f:
                reader = PdfReader(f)
                for i in range(len(reader.pages)):
                    page = reader.pages[i]
                    raw_text += page.extract_text()
        elif ext == '.docx':
            doc = docx.Document(root_file)
            for paragraph in doc.paragraphs:
                raw_text += paragraph.text
        elif ext == '.pptx':
            ppt = pptx.Presentation(root_file)
            for slide in ppt.slides:
                for shape in slide.shapes:
                    if hasattr(shape, 'text'):
                        raw_text += shape.text

    # Split the text into chunks so that during retrieval
    # we don't hit the token size limits.
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    texts = text_splitter.split_text(raw_text)

    docsearch = FAISS.from_texts(texts, embeddings)
    return docsearch
This function extracts text from the uploaded files, concatenates it, and indexes it. It supports .pdf, .docx, and .pptx file types. If multiple files are provided, their contents are concatenated.
Let's break down the code:
- The function begins by initializing an empty string, raw_text, which will store the extracted text from the files.
- It then enters a loop where it iterates over each file in the root_files list. For each file, it determines the file extension using the os.path.splitext function.
- Depending on the file extension, it uses different methods to extract the text:
– For .pdf files, it uses the PdfReader class from the PyPDF2 library to read the file and extract the text from each page.
– For .docx files, it uses the Document class from the python-docx library to read the file and extract the text from each paragraph.
– For .pptx files, it uses the Presentation class from the python-pptx library to read the file and extract the text from each shape in each slide.
- After extracting the text, it uses the CharacterTextSplitter class from the langchain library to split the raw_text into chunks, so the chunks do not exceed the token size limits of the GPT-3 model. The CharacterTextSplitter is initialized with a separator of "\n", a chunk size of 1000, a chunk overlap of 200, and len as the length function (a standalone illustration follows this list).
- Finally, it uses the FAISS.from_texts method from the langchain library to create a FAISS index object from the text chunks and the embeddings. This index object is used for similarity searches in the run_query function.
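As a standalone illustration of the chunking step (the separator, sizes, and sample text here are our own, smaller than DocsGPT's defaults so the splitting is easy to see):

from langchain.text_splitter import CharacterTextSplitter

# Small chunk sizes (vs. DocsGPT's 1000/200) so the splitting is visible.
splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)
sample = "\n".join(f"Line {i} of a long document." for i in range(20))
chunks = splitter.split_text(sample)
print(f"{len(chunks)} chunks; first chunk starts with: {chunks[0][:40]!r}")

The overlap between consecutive chunks helps preserve context that would otherwise be cut at a chunk boundary.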
2. Function: run_query(query, docsearch)
def run_query(query, docsearch):
    """
    Runs a query on the indexed document using the docsearch and
    chain objects.

    Args:
        query: A string representing the query to be run.
        docsearch: A FAISS index object over the document to be
            searched.

    Returns:
        A string containing the output of the chain run on the
        documents returned by the docsearch similarity search.
    """
    docs = docsearch.similarity_search(query)
    return chain.run(input_documents=docs, question=query)
This function runs a query against the indexed document using the docsearch object and the QA chain. It returns a string containing the output of the chain run on the documents returned by the docsearch similarity search.
Let's break down the code:
- The function begins by running a similarity search on the docsearch object using the provided query. The docsearch object is a FAISS index object that was created in the extract_texts function. The similarity search returns a list of documents similar to the query.
- It then uses the chain.run method to run the chain on the documents returned by the similarity search and the query. The chain is a QA chain that was loaded from the langchain library. The chain.run method returns a string containing the output of the chain library run on the documents.
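If you want to inspect the retrieval step on its own, here is a minimal sketch (it assumes docsearch was built by extract_texts as above; k is langchain's number-of-results parameter, which defaults to 4):

# Retrieve the 4 chunks most similar to the query (no LLM call involved).
docs = docsearch.similarity_search("What is the main conclusion?", k=4)
for doc in docs:
    print(doc.page_content[:80])  # preview each retrieved chunk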
3. Function: upload_file(folder_path)
def upload_file(folder_path):
    """
    Uploads one or more files from the local file system and saves
    them to a folder path.

    Args:
        folder_path: A string representing the folder path where
            the files will be saved.

    Returns:
        A list of paths of the uploaded files.
    """
    uploaded = files.upload()
    root_file = []
    for filename, data in uploaded.items():
        # Write the uploaded bytes to the local Colab file system
        with open(filename, 'wb') as f:
            f.write(data)
        # Copy the file to Google Drive, record its path, then
        # delete the local copy (an effective "move")
        shutil.copy(filename, folder_path + "/")
        root_file.append(folder_path + "/" + filename)
        os.remove(filename)
    return root_file
This function uploads one or more files from the local file system and saves them to the specified folder path. It returns a list of paths of the uploaded files.
Let's break down the code:
- The function begins by calling the files.upload method from the google.colab library. This method opens a file dialog that allows the user to select a file from their local file system. The method returns a dictionary where the keys are the filenames and the values are the binary data of the files.
- It then enters a loop where it iterates over each item in the dictionary. For each item, it opens a new file in the local file system with the same name as the uploaded file and writes the binary data to the new file. This effectively saves the uploaded file to the local file system.
- After saving the file locally, it uses the shutil.copy function to copy the file into the specified folder_path on Google Drive, appends the new path to the root_file list, and then deletes the local copy with os.remove. The net effect is to move the file from the Colab runtime to your Drive.
- Finally, it returns the list of Drive paths. These paths are used in the run_conversation function to extract the text from the uploaded files (see the sketch below).
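Because extract_texts only needs a list of file paths, you can also skip the upload dialog entirely when the document already lives on your Drive. A minimal sketch (the file path below is a placeholder, and we assume the docsGpt functions are importable from the cloned repo):

# Build the index directly from a file that already lives on Google Drive.
from docsGpt import extract_texts, run_query  # assumes the repo dir is on sys.path

root_files = ['/content/drive/MyDrive/git_repo/my_report.pdf']  # placeholder path
docsearch = extract_texts(root_files)
print(run_query("Summarize the document in two sentences.", docsearch))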
4. Function: run_conversation(folder_path)
def run_conversation(folder_path):
    """
    Initiates a conversation with the user by repeatedly asking for
    input queries and running them on the uploaded document.

    Args:
        folder_path: A string representing the folder path where the
            uploaded files are saved.

    Returns:
        None. Runs the interactive query loop until the user types
        "stop".
    """
    root_files = upload_file(folder_path)
    # Build the FAISS index from the uploaded file(s)
    docsearch = extract_texts(root_files)
    count = 0
    while True:
        print("Question ", count + 1)
        query = input(" Ask your question or if you have no further question type stop:\n ")
        if query.lower() == "stop":
            print("### Thanks for using the app! ###")
            break
        elif query == "":
            print("### Your input is empty! Try again! ###")
            continue
        else:
            wrapped_text = textwrap.wrap(run_query(query, docsearch), width=100)
            print("Answer:")
            for line in wrapped_text:
                print(line)
            count += 1
This function initiates a conversation with the user by repeatedly asking for input queries and running them on the uploaded PDF, Word, or PowerPoint file.
Let's break down the code:
- The function begins by calling the upload_file function to upload a file and get its path. The upload_file function returns a list of paths of the uploaded files.
- It then calls the extract_texts function to extract the text from the uploaded file and create a FAISS index object. The extract_texts function returns a FAISS index object that is used for similarity search in the run_query function.
- Finally, it enters a loop where it repeatedly asks the user for a query using the built-in input function, with the prompt "Ask your question or if you have no further question type stop:". The input function reads a line from input and returns it as a string.
- After getting the query from the user, it checks whether the query is "stop" (breaking the loop and ending the conversation) or empty (prompting again). Otherwise, it calls the run_query function to run the query on the FAISS index object, wraps the result with textwrap, and prints it. The run_query function returns a string containing the output of the chain run on the documents returned by the docsearch similarity search.
- The loop continues until the user enters "stop," at which point the conversation ends.
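The interactive loop is only one way to use these building blocks; the same functions support batch querying. A minimal sketch (the question list is our own, and docsearch is assumed to be built as above):

# Run a fixed list of questions against an already-built index (batch mode).
questions = [
    "What is the document about?",
    "List the key recommendations.",
]
for q in questions:
    answer = run_query(q, docsearch)
    print(f"Q: {q}\nA: {answer}\n")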
Through these functions, DocsGPT provides a powerful and user-friendly interface for querying PDF, Word, and PowerPoint documents using OpenAI's GPT-3. Understanding how they work helps you see how DocsGPT operates and how to use it to its full potential.
Conclusion
In this tutorial, we have walked through the process of using DocsGPT, an application that leverages the power of OpenAI's GPT-3 to query PDF, Word, and PowerPoint documents. We have explored the step-by-step process of setting up and running DocsGPT in Google Colab and delved into the code behind DocsGPT to understand how it works.
Our version of DocsGPT, by harnessing the power of GPT-3, allows you to ask questions about your documents and get answers in real time. Whether you’re a student researching for a project, a professional looking for specific information in a report, or just someone who wants to get the most out of their PDFs, DocsGPT can be a game-changer.
If you found this tutorial helpful, please give the DocsGPT repository a star on GitHub. Your support helps to keep the project going and encourages further development. Thank you for reading, and happy querying!
If you want to go deeper and learn what's behind GPT and large language models, I suggest checking out this article:
LLM Behind the Scenes: Exploring Transformers with TensorFlow
As usual, thank you for reading my article, and I hope it was useful for you. If you enjoyed the article and would like to show your support, please consider taking the following actions:
📖 Follow me on Medium to access more of the content on my profile. Follow Now
🔔 Subscribe to the newsletter to not miss my latest posts: Subscribe Now
🛎 Connect with me on LinkedIn for updates.