
Efficiently Navigate Massive Documentations: AI-Powered Natural Language Queries for Knowledge Discovery

Photo by Marcel Strauß on Unsplash

In this article, I want to share a proof-of-concept project I’ve been working on called UE5_documentalist. It uses Natural Language Processing (NLP) to make navigating massive documentation sets considerably easier.

While I built this project around the Unreal Engine 5 documentation, the approach applies to any similar use case, such as your company’s in-house documentation.

What is UE5_documentalist?

UE5_documentalist is an intelligent assistant designed to simplify your navigation through Unreal Engine 5.1’s documentation (or any other). By leveraging NLP techniques, it lets you make natural language queries and effortlessly find the most relevant sections across more than 1,700 web pages.

For instance, you can query, ‘what system can I use to avoid collision between agents?’ and be redirected to the documentation that best suits your needs.

You can check out the code in this repo.

Demo

To give you a better idea of what UE5_documentalist can do, here’s a quick demo showcasing its capabilities:

Demo (Image by author)

How Does It Work?

Step 1 — Scraping

I first scraped Unreal Engine 5.1’s web documentation, converted the HTML output to markdown, and saved the text into a dictionary.

The scraping function looks like this:

import json
import urllib.request
from urllib.error import HTTPError, URLError

from markdownify import markdownify as md


def main(limit, urls_registry, subsections_path):
    # urls_registry points to the list of pages to scrape,
    # e.g. "./src/utils/urls.txt" in the repo
    with open(urls_registry, 'r') as f:
        urls = f.read()
    urls = urls.split('\n')

    # initialize dictionary to store subsections
    subsections = {}
    for idx, url in enumerate(urls):
        # stop if limit is reached
        if limit is not None and idx > limit:
            break
        if idx % 100 == 0:
            print(f"Processing url {idx}")
        # make request
        try:
            with urllib.request.urlopen(url) as f:
                content = f.read()
        except HTTPError as e:
            print(f"Error with url {url}")
            print("The server couldn't fulfill the request.")
            print('Error code: ', e.code)
            continue
        except URLError as e:
            print(f"Error with url {url}")
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
            continue
        # convert the HTML to markdown, then strip the markup
        # (split_text_into_components and extract_info_from_url are repo helpers)
        md_content = md(content.decode('utf-8'))
        preproc_content = split_text_into_components(md_content)
        # extract info from url name
        subsection_title = extract_info_from_url(url)
        # add to dictionary
        subsections[url] = {
            "title": subsection_title,
            "content": preproc_content
        }
    # save dictionary
    with open(subsections_path, 'w') as f:
        json.dump(subsections, f)

The function calls another function, split_text_into_components, which applies a sequence of preprocessing steps that strip markdown markup such as links, images, bold markers, and headers.
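
The exact implementation lives in the repo; a minimal sketch of the idea, using regular expressions (and much simpler than the repo’s version), could look like this:

import re

def split_text_into_components(md_content):
    # A simplified sketch: the repo's version applies more preprocessing steps.
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', md_content)    # drop images
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)      # keep link text, drop URLs
    text = re.sub(r'\*\*([^*]*)\*\*', r'\1', text)            # remove bold markers
    text = re.sub(r'^#+\s*', '', text, flags=re.MULTILINE)    # remove header markers
    return text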

In the end, the scraping function produces a dictionary that looks like this:

{
    url: {
        'title': 'some_contextual_title',
        'content': 'content of the page'
    },
    ...
}

Step 2 — Embeddings

From there, I generated embeddings using the Instructor-XL model. At first, I used OpenAI’s text-embedding-ada-002 model. It was extremely cheap (less than $2 for more than 170 million characters), fast, and generally well-suited for my use case. However, I switched to the Instructor-XL model, which runs slightly slower but works locally without communicating with an online API. This is the better option if you are concerned about privacy or work with sensitive data.

The embedding function looks like this:

import json
import os
import pickle

import openai
import torch
from InstructorEmbedding import INSTRUCTOR
from openai.error import InvalidRequestError
from tqdm import tqdm


def embed(subsection_dict_path, embedder, security):
    """Embed the files in the directory.

    Args:
        subsection_dict_path (str): Path to the dictionary containing the subsections.
        embedder (str): Embedding backend. Either "openai" or "instructor".
        security (str): Security setting. Either "activated" or "deactivated".
            Prevents the function from running if not "deactivated"
            and avoids unexpected costs.

    Returns:
        embeddings (dict): Dictionary containing the embeddings.
    """
    # If embeddings already exist, load them
    if os.path.exists(os.path.join("./embeddings", f'{embedder}_embeddings.json')):
        print("Embeddings already exist. Loading them.")
        with open(os.path.join("./embeddings", f'{embedder}_embeddings.json'), 'r') as f:
            embeddings = json.load(f)
    else:
        # initialize dictionary to store embeddings
        embeddings = {}

    # check security if embedder is openai (avoids spending $$$ by mistake)
    if security != "deactivated" and embedder == 'openai':
        raise Exception("Security is not deactivated.")

    # load subsections
    with open(subsection_dict_path, 'r') as f:
        subsection_dict = json.load(f)

    # For debugging purposes only:
    # compute the average text length to embed
    dict_len = len(subsection_dict)
    total_text_len = sum(len(s['content']) for s in subsection_dict.values())
    avg_text_len = total_text_len / dict_len

    # initialize openai api if embedder is 'openai'
    if embedder == "openai":
        openai_model = "text-embedding-ada-002"
        # Fetch API key from environment variable or prompt user for it
        api_key = os.getenv('API_KEY')
        if api_key is None:
            api_key = input("Please enter your OpenAI API key: ")
        openai.api_key = api_key
    # initialize instructor model if embedder is 'instructor'
    elif embedder == "instructor":
        instructor_model = INSTRUCTOR('hkunlp/instructor-xl')
        # set device to gpu if available
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            device = torch.device("mps")
        elif torch.cuda.is_available():
            device = torch.device("cuda")
        else:
            device = torch.device("cpu")
    else:
        raise ValueError(f"Embedder must be 'openai' or 'instructor'. Not {embedder}")

    # loop through subsections
    for url, subsection in tqdm(subsection_dict.items()):
        subsection_name = subsection['title']
        text_to_embed = subsection['content']
        # skip if already embedded
        if url in embeddings:
            continue
        # make request for embedding
        # case 1: openai
        if embedder == 'openai':
            try:
                response = openai.Embedding.create(
                    input=text_to_embed,
                    model=openai_model
                )
                embedding = response['data'][0]['embedding']
            except InvalidRequestError as e:
                print(f"Error with url {url}")
                print("The server couldn't fulfill the request.")
                print('Error code: ', e.code)
                print(f'Tried to embed {len(text_to_embed)} characters while average is {avg_text_len}')
                continue
        # case 2: instructor
        elif embedder == 'instructor':
            instruction = "Represent the UnrealEngine documentation for retrieval:"
            embedding = instructor_model.encode([[instruction, text_to_embed]], device=device)
            embedding = [float(x) for x in embedding.squeeze().tolist()]

        # add embedding to dictionary
        embeddings[url] = {
            "title": subsection_name,
            "embedding": embedding
        }
        # save dictionary to a pickle file every 100 iterations
        if len(embeddings) % 100 == 0:
            print(f"Saving embeddings after {len(embeddings)} iterations.")
            with open(os.path.join("./embeddings", f'{embedder}_embeddings.pkl'), 'wb') as f:
                pickle.dump(embeddings, f)

    # save embeddings to json file
    with open(os.path.join("./embeddings", f'{embedder}_embeddings.json'), 'w') as f:
        json.dump(embeddings, f)
    return embeddings

This function returns a dictionary that looks like this:

{
    url: {
        'title': 'some_contextual_title',
        'embedding': [vector representation of the content]
    },
    ...
}

Step 3 — Vector index database

These embeddings were then uploaded to a Qdrant vector index database running on a Docker container.

The database is launched with the following docker commands:

docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant

It is populated with the following code (you will need the qdrant_client package: pip install qdrant-client):

import uuid

import qdrant_client as qc
import qdrant_client.http.models as qmodels
from tqdm import tqdm

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
COLLECTION_NAME = "ue5_docs"
# dimensionality of the embeddings:
# 768 for instructor-xl, 1536 for text-embedding-ada-002
DIMENSION = 768


def create_index():
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        )
    )


def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url
):
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "block_type": 'text'
    }
    return id, payload


def add_doc_to_index(embeddings, content_dict):
    for url, content in tqdm(embeddings.items()):
        section_anchor = content['title']
        section_vector = content['embedding']
        section_content = content_dict[url]['content']
        id, payload = create_subsection_vector(
            section_content,
            section_anchor,
            url
        )
        # add the vector to the collection
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=qmodels.Batch(
                ids=[id],
                vectors=[section_vector],
                payloads=[payload]
            ),
        )

The add_doc_to_index function takes each web page URL with its precomputed embedding and associated content, and uploads them to the Qdrant database.

For each entry in a vector database, you upload an ID, a vector, and a payload. Here, the IDs are procedurally generated, the vectors are the embeddings (which will be matched against the queries later on), and the payload carries the additional information.

In our case, the main information to include in the payload is the web URL and the web page’s content. That way, when we match a query to the most relevant entry in our Qdrant database, we can print the documentation and open the web page.
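
For instance, once the scraped content and the embeddings are on disk, populating the index might look like this (a minimal sketch: the subsections path is a placeholder, while the embeddings path follows the convention used in the embed function above):

import json

# load the scraped pages and their precomputed embeddings
with open('./data/subsections.json', 'r') as f:  # placeholder path
    content_dict = json.load(f)
with open('./embeddings/instructor_embeddings.json', 'r') as f:
    embeddings = json.load(f)

create_index()                              # (re)create the "ue5_docs" collection
add_doc_to_index(embeddings, content_dict)  # upload ids, vectors, and payloads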

Final step — Query

We can now query the database.

The query must first be embedded with the same embedder used on the documentation. We do that with the following function:

import os

import openai
import torch
from InstructorEmbedding import INSTRUCTOR


def embed_query(query, embedder):
    if embedder == "openai":
        # Fetch API key from environment variable or prompt user for it
        api_key = os.getenv('API_KEY')
        if api_key is None:
            api_key = input("Please enter your OpenAI API key: ")

        openai_model = "text-embedding-ada-002"
        openai.api_key = api_key
        response = openai.Embedding.create(
            input=query,
            model=openai_model
        )
        embedding = response['data'][0]['embedding']
    elif embedder == "instructor":
        instructor_model = INSTRUCTOR('hkunlp/instructor-xl')
        # set device to gpu if available
        if torch.backends.mps.is_available() and torch.backends.mps.is_built():
            device = torch.device("mps")
        elif torch.cuda.is_available():
            device = torch.device("cuda")
        else:
            device = torch.device("cpu")

        instruction = "Represent the UnrealEngine query for retrieving supporting documents:"
        embedding = instructor_model.encode([[instruction, query]], device=device)
        embedding = [float(x) for x in embedding.squeeze().tolist()]
    else:
        raise ValueError("Embedder must be 'openai' or 'instructor'")
    return embedding

The embedded query is then used to search the Qdrant database:

import webbrowser

from qdrant_client.http import models

# CLIENT is the module-level QdrantClient instance defined in the repo;
# get_collection_name, collection_exists, list_collections, parse_block_types
# and print_results are helper functions from the repo as well.


def query_index(query, embedder, top_k=10, block_types=None):
    """
    Queries the Qdrant vector index DB for documents that match the given query.

    Args:
        query (str): The query to search for.
        embedder (str): The embedder to use. Must be either "openai" or "instructor".
        top_k (int, optional): The maximum number of documents to return. Defaults to 10.
        block_types (str or list of str, optional): The types of document blocks to search in. Defaults to "text".

    Returns:
        A list of tuples representing the matching documents, sorted by relevance. Each tuple contains:
        - the URL of the document (with its section anchor),
        - the text content of the document,
        - the relevance score of the document.
    """
    collection_name = get_collection_name()

    if not collection_exists(collection_name):
        raise Exception(f"Collection {collection_name} does not exist. Existing collections are: {list_collections()}")

    vector = embed_query(query, embedder)

    _search_params = models.SearchParams(
        hnsw_ef=128,
        exact=False
    )

    block_types = parse_block_types(block_types)

    _filter = models.Filter(
        must=[
            models.Filter(
                should=[
                    models.FieldCondition(
                        key="block_type",
                        match=models.MatchValue(value=bt),
                    )
                    for bt in block_types
                ]
            )
        ]
    )

    results = CLIENT.search(
        collection_name=collection_name,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
        search_params=_search_params,
    )

    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score
        )
        for res in results
    ]

    return results


def ue5_docs_search(
    query,
    embedder=None,
    top_k=10,
    block_types=None,
    score=False,
    open_url=True
):
    """
    Searches the Qdrant vector index DB for documents related to the given query and prints the top results.

    Args:
        query (str): The query to search for.
        embedder (str): The embedder to use. Must be either "openai" or "instructor".
        top_k (int, optional): The maximum number of documents to return. Defaults to 10.
        block_types (str or list of str, optional): The types of document blocks to search in. Defaults to "text".
        score (bool, optional): Whether to include the relevance score in the output. Defaults to False.
        open_url (bool, optional): Whether to open the top URL in a web browser. Defaults to True.

    Returns:
        None
    """
    # check that embedder is 'openai' or 'instructor'; raise an error if not
    assert embedder in ['openai', 'instructor'], f"Embedder must be 'openai' or 'instructor'. Not {embedder}"

    results = query_index(
        query,
        embedder=embedder,
        top_k=top_k,
        block_types=block_types
    )

    print_results(query, results, score=score)
    if open_url:
        top_url = results[0][0]
        webbrowser.open(top_url)

These functions call other internal processing functions found in my repo.

And there you have it! The query is embedded and matched to the closest documentation embedding (you can choose the distance metric for this; for this use case, I chose the dot product). The results are printed in the terminal, and the most relevant web page opens automatically!
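
Putting it all together, a search from a Python session can look like this, reusing the example query from earlier with the local Instructor embedder:

# ask a question in plain English; the top match opens in the browser
ue5_docs_search(
    "what system can I use to avoid collision between agents?",
    embedder="instructor",
    top_k=5,
    score=True
)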

Want To Try It Out Yourself?

If you’re interested in exploring UE5_documentalist, I’ve prepared a comprehensive step-by-step guide on how to set it up on your local machine. The embeddings are already computed, so you can start populating the Qdrant database right away, and you’ll find all the necessary resources and detailed instructions in my GitHub repo.

For now, it runs with Python commands.

Why Give It a Shot?

UE5_documentalist aims to simplify searching through lengthy documentation, potentially saving you valuable development time. By asking questions in plain English, you are directed to the specific sections that address your query. This tool might improve your experience and let you focus more on building amazing projects with Unreal Engine 5.1, or on working with your own documentation.

I’m proud of my progress with UE5_documentalist so far, and I’m eager to hear your feedback and suggestions for improvement.

TL;DR:

Introducing UE5_documentalist — an intelligent documentation assistant powered by NLP. It allows developers to navigate Unreal Engine 5.1’s documentation effortlessly using natural language queries. Check out the demo and find instructions to set it up on your machine in the GitHub repo. Who knows, if it works well, you might say goodbye to tedious documentation searches and potentially improve your experience with knowledge discovery!

Links

GitHub repo

Example use (Gif)

Credits

I got the idea of simplifying the UE5 documentation from a Reddit user who started a project using LLMs (I forgot his name, but here’s a link to his repo).

I based my implementation on this TDS article by Jacob Marks. His code was extremely useful for most of the parsing and indexing steps, and I couldn’t have succeeded without his detailed article. Don’t hesitate to take a look at his work!


