Understanding AI’s talkative cousin
Wished lately that you had a buddy who could answer your deepest, most random questions at the right time, make up a captivating story on the spot, or maybe even keep you company during an insomnia-ridden night? Well, in comes ChatGPT, the charismatic AI that’s been taking the digital world by storm. It’s like the Siri of texting but with the verve of Shakespeare and the knowledge of a trivia champion.
The Rise of ChatGPT
Let’s rewind to, well, just a few months ago when OpenAI introduced the wider public to this gem in the AI world, ChatGPT. It’s been a fascinating journey of discovery and learning ever since. ChatGPT, now a trusted friend to many, has been lending a helping hand to students, aiding professionals, and providing company to those in need. Imagine a portable, digital mentor with an excellent sense of humor, tailored for you and right at your service.
What’s behind the name?
ChatGPT might sound like a cryptic codename, but it’s actually quite straightforward. The “Chat” portion naturally refers to a conversational interaction. “GPT” stands for “Generative Pre-trained Transformer,” which is just a fancy name for the way the AI learns and generates text. (Oh yeah, I’m funny).
No, but it’s far less complicated than it sounds, and it definitely doesn’t involve any Autobots or Decepticons! (Again, so hilarious).
LLMs and Generative Transformers
Allow me to delve a bit into the technical side quickly, but fear not, we’re keeping it simple. You might have heard of AI models like ChatGPT referred to as “Large Language Models” (LLMs) or “Generative Transformers,” which sounds like an early dose of jargon jambalaya with a sprinkle of acronym gumbo, right? Let’s simplify.
Language models
First, we should perhaps begin by understanding language models more comprehensively. Think of a language model as an expansive word puzzle solver. It takes a prompt (an unfinished puzzle) and uses its training (puzzle rules) to fill in the blanks.
When you provide it a prompt or ask a question, this serves as the initial part of a puzzle. The language model then uses its vast training data, which mostly consists of a wide range of texts, to understand the context and rules of language. But what does “context” truly mean for an AI model, and what are these “rules of language” I have just made up?
To leverage AI language models the way we currently do, they first go through a training phase, where data is introduced and presented in a particular manner to teach the model statistical patterns, linguistic structures, grammar, and semantic relationships between words in a specific natural language. These effectively serve as the original source of truth, or rules, that the model learns from, since there is no other existing data to base its responses on.
By analyzing the co-occurrence of words and phrases it was trained on, the model determines and remembers the probabilities of certain words appearing in specific contexts. For example, it learns that “cat” is more likely to follow “the” than “banana.” Not that it wouldn’t sound really cool… (I’ll stop).
Then, based on these rules, the language model generates predictions and probabilities for each possible word or phrase that could fit in the provided puzzle, taking into account the words and context preceding the prompt. It assigns higher weight values (i.e., probabilities) to more likely answers and lower weights to less likely ones, drawing on the same contextual patterns it learned during training and recalculating them as needed against the values from the user’s input.
Lastly, once everything has been weighed and assigned, the language model uses sophisticated algorithms and techniques to combine these probabilities and select the most appropriate completion for the prompt. Although the model doesn’t learn from individual interactions, its extensive pre-training lets it generate accurate and relevant completions.
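To make that a bit more concrete, here’s a minimal, purely illustrative sketch of the idea. The tiny corpus, the counts, and the candidate words are all made up, and real models learn far richer patterns than these simple bigram counts, but the principle of turning observed co-occurrences into next-word probabilities is the same:

```python
from collections import Counter

# A tiny, made-up "training corpus" -- real models see hundreds of billions of words.
corpus = "the cat sat on the mat . the cat chased the mouse . the dog saw the cat".split()

# Count which words follow "the" in the corpus (simple bigram statistics).
followers = Counter(nxt for prev, nxt in zip(corpus, corpus[1:]) if prev == "the")
total = sum(followers.values())

# Turn counts into probabilities: P(next word | previous word = "the").
probabilities = {word: count / total for word, count in followers.items()}
print(probabilities)  # {'cat': 0.5, 'mat': 0.166..., 'mouse': 0.166..., 'dog': 0.166...}

# "cat" is far more likely to follow "the" than "banana", which was never observed.
print(probabilities.get("cat", 0) > probabilities.get("banana", 0))  # True

# Greedy completion: pick the highest-probability follower.
print("the", max(probabilities, key=probabilities.get))  # -> the cat
```

Greedy picking like this is only one option; as we’ll see later with temperature and sampling, the model can also draw from those probabilities with a bit of controlled randomness.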
So, “large” language models?
The “large” portion of LLMs refers to the vast amount of data the model was trained on and to the size of the model itself (i.e., its number of parameters, which I’ll explain below). And you would think this is the extent of that adjective in the name, but it dives a little deeper.
It emphasizes that the model can capture a lot more nuanced and complex patterns in the data, which enables it to better understand the context (remember, weighted rules), generate more accurate responses, and cover a wider range of topics in the same conversation. However, this doesn’t specify any underlying architecture or mechanism to achieve this goal. For instance, a large language model could, in theory, be based on various architectures, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), or Transformers (like our hot tool, ChatGPT).
Transformer-based generative models
While the term “language model” speaks more to general functionality and scalability, “generative transformer,” on the other hand, speaks more to architecture and behavior.
A transformer architecture is a specific type of neural network design that’s particularly good at processing sequential data and capturing long-range dependencies in the data. In other words, imagine a machine that’s good at interpreting and remembering the order of things, like words in a sentence or events in a story, and can further figure out how different parts of a conversation or story are related to each other, even when they’re far apart.
The term “generative” indicates that the model can produce new outputs, like text, based on the patterns it learned during training. After encoding input data, the transformer model generates an output, one element at a time, with each new element being influenced by all the previous ones. It’s like a movie or TV series director carefully selecting each actor’s entry and dialogue, one at a time, based on the scene that has already unfolded. Then, this process continues until the model generates an end-of-sentence token (i.e., end-of-interaction identifier) or until a specified maximum length has been reached.
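As a rough sketch of that one-element-at-a-time process, here is what such a decoding loop could look like. The `model` function, the end-of-text token, and the length cap are placeholders for illustration, not OpenAI’s actual implementation:

```python
END_OF_TEXT = "<|end|>"  # placeholder end-of-interaction token
MAX_LENGTH = 50          # safety cap on how long the output may grow

def generate(model, prompt_tokens):
    """Generate one token at a time, each conditioned on everything produced so far."""
    tokens = list(prompt_tokens)
    for _ in range(MAX_LENGTH):
        # The model looks at the *entire* sequence so far (the scene that has
        # already unfolded) and proposes the single next token (the next entry).
        next_token = model(tokens)
        if next_token == END_OF_TEXT:
            break                    # the model decided the response is complete
        tokens.append(next_token)    # the new token now influences every later one
    return tokens

# Example with a trivial stand-in "model" that just emits words until an end token:
toy_model = lambda tokens: "word" if len(tokens) < 5 else END_OF_TEXT
print(generate(toy_model, ["Hello"]))  # ['Hello', 'word', 'word', 'word', 'word']
```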
However, the transformer architecture by itself doesn’t convey the scale of the model or the scope of its capabilities, which is what the “large” part we explored earlier covers. It’s all about transforming something basic into something insightful and meaningful. The combination of both scopes ultimately makes up our friendly neighborhood ChatGPT.
Tokenization
Now, we’ve addressed LLMs and GPT (look at that, vocabulary) as puzzle solvers of the incomplete puzzles we provide through our prompts. Naturally, we would think each word from our input serves as an individual piece of this puzzle. But it’s truly far more granular than that. Let’s briefly dive into what this means.
Tokenization is a process by which text is split into smaller units called tokens. These tokens can be as short as one character or as long as one word. This is a critical step in language model processing because it represents an efficient and effective way for the model to learn the relationships and patterns of words at a very granular level. How so? Why? We can go over a very quick example.
Consider the word “unrecognizable.” What are potential variations? Well, unrecognized, recognizable, recognizability, recognize, recognizes, recognized, recognizing, and probably quite a few more. Now imagine extending the possibilities by considering each variation’s potential typos. Without tokenization, ChatGPT would need to fully remember each of these entries separately, including typos, and more granularly understand their relationship during its training process to determine proper usage at the right interpreted context (i.e., the right variation at the right time). Yikes, that was probably the longest sentence in this whole article.
However, by “tokenizing” the original word into smaller chunks (“un”, “recogn”, “izable”) and recombining those tokens as needed, the model follows a significantly more performant strategy than evaluating all the potential entries individually, for ultimately the same expected result.
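If you’d like to see real tokenization in action, OpenAI’s open-source tiktoken library exposes the encodings its models use. A quick sketch, with the caveat that the exact splits you get may differ from the illustrative chunks above:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by the gpt-3.5-turbo / gpt-4 family of models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("unrecognizable")
print(tokens)                              # a short list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the sub-word chunks those IDs map to

# A related variation reuses overlapping chunks instead of being memorized whole.
print([enc.decode([t]) for t in enc.encode("recognizing")])
```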
Tokenization also helps with understanding the structure of a sentence. For example, the sentence “The cat chased the mouse” has a different meaning than “The mouse chased the cat,” even though it contains the same words. So this process allows the AI to analyze the order and relationship of the words, enabling it to better understand a sentence’s meaning.
Embeddings
Embeddings are arguably the last layer of input processing and tokenization. They could well deserve their own scope and a full article, but I believe they should be briefly mentioned for the sake of completing the foundational circle.
Think of embeddings like a magical dictionary that turns words, or in this case, our tokens, into numbers. Now, this is not your regular number line. It’s more like a multidimensional “space” where each word gets a specific spot to hang out in based on its structure, meaning, usage, and other linguistic features. Or how about a quick example?
Imagine you’re at a party where groups of friends are scattered around the room. There’s the sports fanatics in one corner, the movie buffs in another, the tech enthusiasts by the window, and so on. In this room, people who share common interests tend to cluster together. Similarly, tokens that share common ‘interests’ (or linguistic properties) hang out in the multi-dimensional space of embeddings.
The model then finds tokens within this ‘party’ as needed. So, if it’s looking for words similar to ‘king,’ it knows to go to the royalty cluster. This strategy helps ChatGPT determine context and semantic relationships between words (or tokens) which, in turn, helps it generate more contextually relevant and accurate responses.
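Here’s a toy sketch of that party layout. The three-dimensional vectors below are invented purely for demonstration; real embeddings are learned by the model and typically span hundreds or thousands of dimensions:

```python
import math

# Made-up 3-D embeddings; real models learn much higher-dimensional vectors.
embeddings = {
    "king":    [0.90, 0.80, 0.10],
    "queen":   [0.88, 0.82, 0.12],
    "royalty": [0.85, 0.75, 0.15],
    "banana":  [0.10, 0.05, 0.90],
}

def cosine_similarity(a, b):
    """How closely two tokens 'stand together' in the embedding space (1.0 = same spot)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

for word in ("queen", "royalty", "banana"):
    print(word, round(cosine_similarity(embeddings["king"], embeddings[word]), 3))
# 'queen' and 'royalty' score close to 1.0 (same corner of the party),
# while 'banana' scores much lower (it's hanging out with the fruit crowd).
```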
So, What Makes ChatGPT Seem So… Human?
Still wondering why ChatGPT comes across as more of an articulate, human-like companion than other chatbots you’ve likely interacted with before? ChatGPT doesn’t “know” words in the human sense, but we’ve learned how adept it is at applying patterns learned during training to generate responses. Yet, it seems every new version (e.g., GPT-3.5 to GPT-4) crafts even more human-like responses and contextual phrasing than the one before it. How?
Well, in a sense, we danced around this earlier when breaking down language models and the transformer architecture, but we may have missed establishing the particular reasons concisely:
- Increased Training Data: GPT-4 is trained on a significantly larger dataset than its predecessor. This larger dataset includes more diverse and nuanced examples of human language, which allows it to generate responses that better mimic the complexity and variability of human communication.
- Improved Language Modeling: GPT-4 benefits from advancements in language modeling techniques and algorithms. These improvements allow the model to understand context at different levels and respond in ways that are not just syntactically correct but also more semantically coherent, further contributing to the human-like quality of its text generation.
- More Powerful Computing Resources: With the growth of OpenAI and greater computational resources at its disposal, GPT-4 can model larger and more complex patterns in data. This increased capability allows GPT to more effectively understand some of the subtleties of human language and produce more contextually nuanced responses.
- Better Handling of Long Conversations: GPT-4 also shows improvements in not only maintaining context over longer stretches of conversation but also incorporating prior parts of the conversation into its responses, which leads to a more coherent and engaging exchange.
The Famous “Fine-Tuning” Process
GPT-4 has further benefited from advancements in fine-tuning. Fine-tuning is an additional stage within the model’s training process where it is explicitly refined for specific tasks or domains, resulting in more accurate, relevant, and contextually-fitting responses. Now, it sounds pretty, but what exactly does “refining” mean for a Generative (Already) Pre-trained Transformer?
Excellent question there, my friend. (Do I need you for this conversation..?)
We can consider fine-tuning or this refining process to be a form of continued training, where the model, after having been pre-trained on a broad dataset, is now trained on a more specific dataset or for a particular task. Let’s briefly go over how this works.
The main idea is to slightly adjust the existing learned parameters to better suit the new task without drastically changing the foundational knowledge the model has already gained during pre-training, thus allowing it to specialize its understanding and generation capabilities for those tasks. To achieve this, the model is exposed to the new training data at a lower learning rate than during its pre-training phase.
Depending on how satisfactory the model’s performance then becomes on the specific task after the first round, adjustments can always be made to the fine-tuning process. These can range from lowering the learning rate, increasing its fine-tuning data, or even changing the architecture and amount of additional fine-tuning layers being used.
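For the curious, here is a minimal, non-authoritative sketch of what “continued training at a lower learning rate” can look like in code. It uses PyTorch with a tiny stand-in model and random data; the learning rates, model, and dataset are all placeholders, not OpenAI’s actual fine-tuning setup:

```python
import torch
import torch.nn as nn

# Stand-in for a "pre-trained" model: a tiny network whose weights we pretend
# were already learned on a broad dataset.
model = nn.Linear(16, 4)

PRETRAIN_LR = 1e-3   # hypothetical learning rate used during pre-training
FINETUNE_LR = 1e-5   # deliberately much smaller for fine-tuning

optimizer = torch.optim.AdamW(model.parameters(), lr=FINETUNE_LR)
loss_fn = nn.CrossEntropyLoss()

# Tiny, made-up "task-specific" dataset (random numbers, purely for illustration).
inputs = torch.randn(32, 16)
targets = torch.randint(0, 4, (32,))

for epoch in range(3):                  # a short round of continued training
    logits = model(inputs)
    loss = loss_fn(logits, targets)     # how wrong the model is on the new task
    optimizer.zero_grad()
    loss.backward()                     # compute gradients
    optimizer.step()                    # nudge the existing weights only slightly
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

The deliberately small learning rate is what keeps the pre-trained knowledge largely intact while the model adapts, which is a convenient segue into what a learning rate actually is.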
Learning rate
This arguably may border on being off-topic, but I don’t enjoy leaving needed but unexplained terms thrown around. Learning rate is one of the most crucial hyperparameters in machine learning and deep learning. It determines how much an algorithm changes its internal parameters, such as the weights and biases in a neural network, in response to the error observed in its output during training.
Let’s imagine a typical pinball game as a main example, where a player uses paddles (flippers) to manipulate a metal ball inside a glass-covered cabinet, scoring points by hitting various targets (bumpers). Here, the ball’s path changes depending on which bumpers and flippers it hits and how it hits them.
Weights
In this analogy, the weights in a neural network can be compared to the positions, sizes, shapes, and angles of the bumpers and flippers in the pinball machine. Each bumper or flipper (weight) affects the ball’s (data’s) trajectory in a specific way when it hits. Bigger bumpers or flippers might push the ball more dramatically (larger weights), while smaller ones have a more subtle effect (smaller weights). Then, the positions and angles of the bumpers determine which direction the ball will go after it hits, much like how weights in a neural network influence the direction of the input data as it moves toward the output.
Biases
Now, let’s consider the overall slope or tilt of the pinball machine. This slope makes the pinball move downwards every single time, regardless of which bumpers or flippers it hits. The ball doesn’t stay still; it always rolls down due to gravity, driven by this slope.
This is analogous to a bias in a neural network. It acts as a constant force or baseline, subtly influencing the network’s output irrespective of the particular input values it receives. The bias effectively adjusts and guides the output along a certain direction, just like the slope of a pinball machine inherently guides the ball downwards, giving each part of the network a baseline, or starting point, to adjust from.
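To tie the pinball analogy back to actual numbers, here is a minimal single-neuron sketch (the data, starting values, and learning rate are all made up) showing how the weight, the bias, and the learning rate interact over a few training steps:

```python
# A single "neuron": prediction = weight * x + bias
weight, bias = 0.5, 0.1    # a bumper's effect and the table's constant tilt
learning_rate = 0.01       # how big a nudge each correction is allowed to make

x, target = 2.0, 3.0       # one made-up training example

for step in range(5):
    prediction = weight * x + bias
    error = prediction - target          # how far off we are
    # Gradients of the squared error with respect to each parameter:
    grad_weight = 2 * error * x
    grad_bias = 2 * error
    # The learning rate scales how far each parameter moves per step.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias
    print(f"step {step}: prediction={prediction:.3f}, weight={weight:.3f}, bias={bias:.3f}")
# A larger learning rate would chase the target faster but risks overshooting it;
# a smaller one moves cautiously, much like fine-tuning does.
```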
GPT-4’s Parameter Mystery
The last technical bit I wanted to address is the rampant speculation surrounding the size of OpenAI’s GPT-4 technology in terms of parameters. Currently, the leading claim puts GPT-4 at one hundred trillion parameters (100T), although estimates have ranged anywhere between 1 trillion and 170 trillion parameters. And if the number by itself is enough to make anyone’s head spin, imagine being able to potentially handle that many entries at the same time.
Why is this important? Well, for one, let’s consider that the largest model of its GPT-3.5 predecessor tops out at 175 billion parameters, so 100 trillion would represent an increase of over 57,000% in capacity in just about half a year. As we’ve learned so far, such a dramatic increase in parameter capacity would allow for a much deeper understanding and generation of complex content.
In this sense, the language model side would benefit from an improved performance in tasks like translation, summarization, or sentiment analysis, where understanding the intricacies and subtleties of a language is crucial, while the generative transformer side could result in more coherent and engaging text generation, empowering it for tasks like content creation, dialogue generation, and even creative writing.
The problem
However, it’s important to apply a cautious note of skepticism to bold claims thrown around in such an early phase of the technology’s development. While all of these observations would be true with such a parameter increase, as of now, OpenAI has not officially released any specific details regarding the exact parameter count of their GPT-4 model.
This information vacuum has led to the emergence of various conjectures thrown around, likely fueled by some offhand remarks seen or heard from leading sources, such as Bing’s new AI being the primary source behind the 100-T proposition. Until a more official statement is released, speculations might well just be our fool’s game for now.
Temperature
This article probably wouldn’t be complete if I didn’t expand on those moments of randomness and speculation when ChatGPT goes off-topic and out-of-context, leaving you wondering if it has finally come alive and begun trolling your inputs.
More specifically, let’s talk a bit about a particular property you might have heard thrown around GPT and OpenAI’s API usage, temperature. And no, not the Sean Paul song. (OK, this was my last one).
In the context of language models like GPT, “temperature” is a term borrowed from physics, and specifically statistical mechanics, and is used in the softmax function (also called softargmax) for scaling the output distribution. I know, I know. What does this jargon mean?
Softmax function in language models
Well, imagine you have a group of kids, and you’ve asked each one to guess how many candies are in a jar. Some will guess high, some low, and perhaps most of the kids will guess somewhere in the middle. Now, you want to pick one guess to bet on, but instead of just picking the most common guess, you decide to consider all of the guesses. By probability, you’re most likely to pick guesses that are close to the most common guess and less likely to pick guesses that are far off.
Similarly, a softmax function within a language model looks at a bunch of “guesses” (raw scores for different words being the next word in a sentence) and turns them into a ranking of how likely each option is. It does this by converting the scores into probabilities that all sum to 1, which can be interpreted as the model’s confidence in each potential next word.
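For the curious, here’s what that candy-jar vote looks like as code: a plain softmax that turns the model’s raw scores (often called logits) into probabilities summing to 1. The candidate words and scores are invented for illustration:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw scores for candidate next words after "The cat sat on the ..."
candidates = ["mat", "sofa", "moon", "banana"]
scores = [4.0, 2.5, 1.0, 0.2]

for word, prob in zip(candidates, softmax(scores)):
    print(f"{word}: {prob:.3f}")
# "mat" ends up with the lion's share of the probability mass,
# but every candidate keeps a small, non-zero chance of being picked.
```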
Temperature and randomness
Temperature, being an influential element within the softmax function’s expression, plays a significant role in controlling how random an output ends up being.
For example, providing a high temperature value effectively smooths the “distribution of probabilities” (i.e., how likely different outcomes are within a set of possibilities). This means that the highest and lowest probabilities sit much closer to each other, ultimately making the selection of words more randomized and diverse. Conversely, providing a low value makes for a sharper distribution of these probabilities and a stronger distinction between the most probable and least probable words during the selection process.
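Building on the softmax sketch above, dividing the raw scores by a temperature value before converting them shows exactly that flattening and sharpening effect (again, the scores are made up):

```python
import math

def softmax_with_temperature(scores, temperature):
    """Divide the raw scores by the temperature, then apply a regular softmax."""
    scaled = [s / temperature for s in scores]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [4.0, 2.5, 1.0, 0.2]   # the same made-up candidate scores as before

for t in (0.5, 1.0, 2.0):
    probs = [round(p, 3) for p in softmax_with_temperature(scores, t)]
    print(f"temperature {t}: {probs}")
# At 0.5 the top word dominates even more (a sharper distribution);
# at 2.0 the probabilities pull closer together (flatter, so picks get more random).
```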
Temperature and ChatGPT
Now, temperature’s mention and explanation here beg the question: how does this property affect us as users? Well, if you’ve only been using the web version of ChatGPT, this setting is essentially hidden from you, albeit still worth understanding.
The present web version of ChatGPT, as of this date, has made progress in allowing users to leverage features such as Web Browsing and Plugins (sorry if you’re not there yet!), but it is still far from allowing modifications to some of its more granular properties, like temperature, maximum length, stop sequences, presence and frequency penalties, text injection, and probability thresholds, among others. (How about we cover these in another article?).
In a general sense, the temperature value within a softmax function can theoretically be any positive number, keeping in mind our discussion that the higher the number, the more random the output, due to a smoother distribution. In practice, this number often tends to fall between 0 and 2, although there is no strict rule requiring temperature values to stay within a certain range or this low.
OpenAI’s API assigns its models a default value of 1 when developers don’t set one, purposely or not. So I have to think that, pending a more official statement from OpenAI, ChatGPT’s web version uses the same default value of 1. (If there is a more official memo and I’ve missed it, let me know!).
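If you’re using the API directly rather than the web interface, temperature is simply a parameter on the request. Here’s a minimal sketch using the openai Python package as it existed at the time of writing; the model name, prompt, and key are arbitrary placeholders:

```python
# pip install openai
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Give me a one-line bedtime story."}],
    temperature=0.2,    # low: focused, predictable phrasing
    # temperature=1.5,  # high: looser, more surprising (and occasionally odd) phrasing
)
print(response.choices[0].message.content)
```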
Nucleus sampling (top_p)
Why talk about top_p? I’ll tell you why… But first, some formalities.
The property top_p, or “nucleus sampling,” is a different sampling strategy altogether from providing a specific temperature value to a model. Instead of considering all possible next words in the distribution, it only considers the smallest set of top words whose cumulative probability exceeds a certain provided threshold p (hence top_p). The key effect is that it still adds an element of randomness to the output, but it can often lead to more diverse and creative outputs than just scaling temperature. Although this might be pending more experimentation on my part.
But how about we break this down slightly more? From the top.
What is meant by sampling strategy? In machine learning, and particularly in language models, a sampling strategy refers to the method used to generate outputs (like words or sentences) based on the probabilities provided by the model. Some common strategies include greedy sampling, top-k sampling, and the nucleus (top-p) sampling our GPT uses. Now, why was top-p sampling chosen over others? Perhaps that deserves its own separate scope and post as well. (I’ll update!).
Now, let’s mitigate the second layer of confusion. Both top-p sampling and “nucleus sampling” are the same picture. Top-p sampling is the formal term that refers to the way the strategy works, which is selecting from the smallest set of top words (i.e., higher weights) whose combined probabilities meet or exceed a certain provided threshold (p).
The term “nucleus” was simply adopted as a more metaphorical representation of how it works, where the nucleus of a cell is the core part that contains the essential components for the cell’s functions.
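Here’s a simplified, hedged sketch of what nucleus sampling does under the hood. The word probabilities are made up, and real implementations operate over tens of thousands of tokens at once, but the threshold logic is the core idea:

```python
import random

def nucleus_sample(word_probs, p=0.9):
    """Keep the smallest set of top words whose probabilities sum to at least p,
    then sample the next word only from that 'nucleus'."""
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break                      # stop as soon as the threshold is met
    words, weights = zip(*nucleus)
    return random.choices(words, weights=weights, k=1)[0]

# Made-up next-word probabilities after "The cat sat on the ..."
word_probs = {"mat": 0.55, "sofa": 0.25, "floor": 0.12, "moon": 0.05, "banana": 0.03}

print(nucleus_sample(word_probs, p=0.9))  # only "mat", "sofa", or "floor" can be picked
print(nucleus_sample(word_probs, p=0.5))  # tighter nucleus: "mat" is the only candidate
```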
Temperature vs top_p
Now, you might be thinking: how about we just use a combination of temperature and top_p to experiment with different “strong” varieties of creativity in outputs? And well, technically, you can. But let’s define what this “strong” output would mean.
When you increase the top_p value, you’re allowing more words into the consideration set during the model’s selection process. When you increase temperature, you give each word a more equal chance of being chosen. So if you truly know the granular effect each has on the model for your specific use case and how to adjust them both simultaneously, then by all means, go have your fun!
But more often than not, increasing the randomness from both sides results in highly diverse output, but potentially so random that it’s nonsensical. Similarly, setting both temperature and top_p too low can make the output too deterministic, potentially causing the model to repeat itself or produce overly generic responses.
As cheesy as this undoubtedly sounds, the key lies in finding the appropriate balance between them. The “quick and easy” suggestion? Stick with using just temperature for now. It has a more intuitive and relatable mental model for most use cases. But if you insist top_p is definitely more for you, or simply wish to explore it further on purpose, there’s a good community post with examples that may send you on the right path.
Prompt Engineering
As we conclude this article, I want to briefly mention a fascinating concept that has been making waves in the AI industry: prompt engineering. Having just discussed the intricate workings of ChatGPT, it is perhaps an opportune moment to pivot from theory to practice as we explore how to optimize our interactions with artificial intelligence in general.
You may ask, “How can I improve my conversations with ChatGPT or make them more efficient?” That is precisely where the art of prompt engineering comes into play. In short, prompt engineering is a strategy that revolves around meticulously crafting your questions, statements, and, well, any prompt in general to extract the most accurate and beneficial responses from the AI tailored to your intentions.
Think of interacting with ChatGPT like trying to get a recipe from a chef. If you vaguely ask the chef, “How do I cook something good?” you might get a general answer like, “Use fresh ingredients, use clean and appropriate utensils, and make sure you don’t burn anything.” Useful advice, no doubt, but not particularly helpful if you didn’t even know what dish you wanted to prepare.
Now, let’s say you use your prompt engineering skills and ask, “Can you act as a chef and describe how I can make the perfect spaghetti carbonara in steps?” The chef, in this case, ChatGPT, would provide a more specific answer, including the list of ingredients, cooking steps, and maybe even some chef’s secrets like the perfect cheese-to-bacon ratio, or how to get that creamy texture without using cream.
In a sense, prompt engineering often requires us to better understand the details of our own requests. When I started out with ChatGPT, I realized I had never even given thought to most of the minor details of my inquiries, that is, until countless unsatisfying responses forced me to define them explicitly. Now, it’s more often a matter of knowing how to phrase those details properly within a first or second try. It’s a process, but an arguably very rewarding one.
And, there you have it — the end of today’s AI exploration! I hope this has sparked some intriguing thoughts and inspired your curiosity.
Happy chatting!