A Brief History of GPT Models

This article provides a brief history of GPT Models.
By Boris Delovski • Updated on Mar 26, 2024

The Turing test was introduced by Alan Turing in his seminal 1950 paper "Computing Machinery and Intelligence" as a proposed measure of machine intelligence. In it, Turing argued that a machine can safely be said to exhibit intelligence when a human can no longer reliably distinguish whether they are talking to another human or to a machine. Initially, the test was perceived as a reliable metric. However, there has since been ongoing debate about whether it truly captures the essence of intelligence.

The realm of conversational AI was significantly transformed with the advent of ChatGPT, launched by OpenAI on November 30, 2022. This marked a pivotal moment when the public was introduced to a chatbot capable of engaging in conversations remarkably similar to human interaction. From its launch, ChatGPT, based on the GPT-3.5 model, showcased an ability to conduct dialogues that far surpassed prior chatbots. The evolution of ChatGPT, now also leveraging the GPT-4 model, illustrates the progressive advancement of its capabilities. Such progress promises a future where it becomes increasingly challenging to distinguish between conversations with ChatGPT and those with humans.

In this article, we are going to navigate through the history of ChatGPT. More specifically, we will shed light on the development of the GPT architecture and its iterations. This should provide some much-needed context, helping readers understand what is happening "under the hood" of today's most popular deep learning models.

What is the Transformer Model

To comprehend the evolution leading to modern GPT models, like the ones used by ChatGPT, we must rewind to 2017. A new architecture was released that year, known as the Transformer neural network architecture. It marked a significant departure from the previous architectures used in the field of NLP, and it demonstrated the remarkable capacity of language models in a way never seen before.

The technical specifics of this architecture were already discussed in another piece on our website. Here, we can briefly understand how the model operates through a simple example. Since the original architecture was designed for translating from one language to another, let us consider a similar scenario: two groups of translators collaborating to translate a book from one language to another. Think of the original text as a series of messages that need to be translated. Each message is complex, with nuances and subtleties that need to be preserved in the translation.

First, a group of translators (the encoder part of the Transformer) reads through the messages. They read everything at once, instead of translating word by word in sequence. They can focus on specific words or phrases that are crucial for understanding the context and nuances of the message. Each translator in the group has a distinct specialty: some are better at understanding emotions, others at cultural references, and some at technical details. Together, they create a comprehensive map of the message, highlighting all the important parts and how they relate to each other. This work takes place in the first of the two main parts of the Transformer model, the encoder.

In the other part of the model, the decoder, the detailed map created by the encoder is used by the second team of translators. They craft the translated message in the target language while drawing on the insights acquired by the first team, so that every nuance of the original text is captured in the translation. Like the first team, these translators can focus on different parts of the message as needed. This ensures that the translation is not merely word for word but captures the essence of the original message.

As the translation progresses, there is a continuous exchange of information between the two groups of translators, the ones crafting the translation and the first team with the initial insights. This ensures that the translation remains faithful to the original message throughout its progression. The result is a translated message that accurately reflects the original's meaning, tone, and nuances.
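
To make this concrete, here is a minimal sketch of an encoder-decoder Transformer translating a sentence, using the Hugging Face transformers library and the small pretrained "t5-small" checkpoint. Both the library and the model are our choice for illustration; they are not part of the original Transformer paper.

    # A minimal sketch of an encoder-decoder Transformer used for translation.
    # Requires: pip install transformers sentencepiece torch
    from transformers import pipeline

    # "t5-small" is a compact encoder-decoder model, chosen purely for illustration.
    translator = pipeline("translation_en_to_de", model="t5-small")

    # The encoder reads the whole English sentence at once and builds a contextual
    # representation (the "map" from the analogy); the decoder then generates the
    # German translation token by token while attending back to that representation.
    result = translator("The book captures subtle emotions and cultural references.")
    print(result[0]["translation_text"])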

What is the GPT-1 Model

The first GPT model, now known as GPT-1, was introduced by OpenAI in 2018. It is a variation of the original Transformer architecture that keeps only the decoder part. This focus enables GPT to excel at text generation, as it is trained to predict the next word in a sequence given all the previous words, leveraging a vast corpus of text to do so.

GPT models are trained through a two-step process. First, they undergo pre-training, during which they learn from a large amount of text data gathered mainly from the internet. This phase teaches the model the basics of how human language works: language patterns, grammar, context, style, and so on. After pre-training, a GPT model is able to write original text and hold conversations. In addition, it can be fine-tuned to perform more specialized tasks such as summarization, translation, and much more.
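
The pre-training objective itself is simple to demonstrate. The sketch below uses the publicly available GPT-2 weights through the Hugging Face transformers library as a stand-in, since GPT-1 is rarely distributed today; the library and checkpoint are our choice for illustration.

    # A minimal sketch of the "predict the next word" objective GPT models are
    # pre-trained on, using GPT-2 as a freely available stand-in.
    # Requires: pip install transformers torch
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    prompt = "The history of language models began"
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits     # a score for every token in the vocabulary

    next_token_id = logits[0, -1].argmax()  # the single most likely next token
    print(tokenizer.decode(next_token_id))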

More modern versions, such as GPT-3 and its successors, can generate code across a diverse range of programming languages and for various purposes. This encompasses tasks such as writing functions, fixing bugs, generating algorithms, and even developing entire software applications or websites, provided sufficiently detailed specifications are given. Once again, let us look at an example instead of digging deep into the inner workings of GPT. For this example, let us employ the metaphor of a talented composer to illustrate how the GPT model works.

Before a composer can compose their own music, they must first be inspired by the great examples of the past. They immerse themselves in music from all over the world, extending across centuries of history. They study the big names like Bach, Beethoven, and Mozart; they dive into blues, jazz, and rock; they explore music from every culture and era. This phase is akin to GPT's pre-training, during which the model absorbs an extensive corpus of human-written text. It learns patterns, styles, nuances, and the myriad ways ideas can be expressed.

After this extensive training, our composer is now ready to create their music. Given a theme, emotion, or even a single note as a source of inspiration, they can compose a piece that resonates with listeners. This can be done by weaving together melodies and harmonies in a way that feels both familiar and astonishingly new.

Likewise, when GPT receives a prompt, it generates text that aligns with the input, drawing on its vast pre-training to produce content that is coherent, relevant, and sometimes remarkably creative. Whether it is continuing a story, answering a question, or writing a poem, GPT composes text in a way that mimics human creativity and understanding. Similar to our composer tailoring their compositions to fit a specific genre or mood, GPT can be fine-tuned for specialized tasks. If you want the composer to write music in the style of the Romantic era, you provide them, after their initial training, with additional examples of music from that period; in doing so, they become adept at replicating that specific style.
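
Prompt-driven generation is easy to try for yourself. Here is a minimal sketch, again using the freely available GPT-2 checkpoint through the Hugging Face transformers library as a stand-in for the larger GPT models; the prompt and sampling settings are arbitrary choices for illustration.

    # A minimal sketch of prompt-based text generation with a GPT-style model.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    output = generator(
        "Once upon a time in a quiet seaside town,",
        max_new_tokens=40,   # how much text to generate beyond the prompt
        do_sample=True,      # sample instead of always picking the top token
        temperature=0.8,     # lower values make the output more predictable
    )
    print(output[0]["generated_text"])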

The composer metaphor above captures what GPT models are designed to do. However, it may not seem entirely accurate if you were to revisit GPT-1 today. As the first iteration in the Generative Pre-trained Transformer series, GPT-1 introduced the foundational concept of using a transformer-based model for natural language understanding and generation. While it was a significant step forward in text generation capabilities, it had limitations. A more accurate statement would therefore be that GPT-1 "tries" to do everything we described above but falls short in many respects by today's expectations. Nevertheless, for its time, it was an exceptionally capable model, and it only became outdated with the introduction of more advanced versions, such as GPT-2, GPT-3, and beyond.


What is the GPT-2 Model

The GPT-2 model was a variant of the GPT-1 model that introduced several changes, bringing the model much closer to achieving what was illustrated in the previous chapter's example. Released in 2019, it was controversial at the time. Even back then, the creators of the model were aware of its capabilities, so they withheld its full version due to concerns about potential malicious use.

There are a few key differences that made GPT-2 significantly more powerful than GPT-1:

  • Ten times the size of GPT-1
  • Trained on a much larger dataset
  • More advanced training techniques

The original model, GPT-1, had 117 million parameters, which made it a groundbreaking model for its time. GPT-2, however, boasted 1.5 billion parameters, significantly surpassing the learning capacity of GPT-1. Architecturally, the increased parameter count means GPT-2 has more layers, wider layers, or both, which allows it to process information more thoroughly.
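
If you are curious how such counts are obtained, the parameters of a released checkpoint can be tallied directly. The sketch below uses the smallest public GPT-2 checkpoint from Hugging Face (about 124 million parameters, close to GPT-1's scale); the full 1.5-billion-parameter model is published there as "gpt2-xl".

    # Counting the parameters of a publicly released GPT-2 checkpoint.
    from transformers import GPT2LMHeadModel

    model = GPT2LMHeadModel.from_pretrained("gpt2")  # smallest released GPT-2 size
    num_params = sum(p.numel() for p in model.parameters())
    print(f"{num_params:,} parameters")              # roughly 124 million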

GPT-2 was also trained on a much larger dataset. We already mentioned that GPT-1 was trained on an abundance of data collected from websites and books. GPT-2, however, was not only trained on an even larger amount of data, but that data was also more carefully chosen, with the goal of including more varied and nuanced examples of language use.

Ultimately, while both models were trained using unsupervised learning, learning to predict the next word in a sentence given all the previous words, GPT-2's training was further optimized to accommodate its larger scale. This involved incorporating more advanced techniques to train such a large model efficiently without compromising learning quality, such as gradient checkpointing, mixed precision training, and dynamic batch sizing. This ensured not only that the final model would be better, but also that the training procedure would be better optimized.
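
To give a flavor of what two of these techniques look like in practice, here is a minimal sketch of a single training step with mixed precision and gradient checkpointing, written in PyTorch against the public GPT-2 checkpoint. This is purely illustrative and is not OpenAI's actual training code.

    # One illustrative training step with mixed precision and gradient checkpointing.
    # Requires: pip install transformers torch
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
    model.train()
    model.gradient_checkpointing_enable()      # trade extra compute for less memory

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

    batch = tokenizer("An example training sentence.", return_tensors="pt").to(device)

    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        # With labels equal to the inputs, the model computes the usual
        # next-token prediction loss used during pre-training.
        loss = model(**batch, labels=batch["input_ids"]).loss

    scaler.scale(loss).backward()              # scaled backward pass keeps fp16 gradients stable
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()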

What is the GPT-3 Model

GPT-3 follows the trend of increasing the size of language models. The previous iteration, GPT-2, with its 1.5 billion parameters, was 10 times bigger than GPT-1. However, OpenAI made an even bigger leap when moving from GPT-2 to GPT-3, with the latter being over 100 times bigger than its predecessor. Next to the 175 billion parameters of GPT-3, even the 1.5 billion parameters of GPT-2 appear relatively small.

This vast increase in parameters allows GPT-3 to have a significantly deeper understanding of language and context. The effect is reinforced by the fact that GPT-3 was also trained on a larger corpus of text data than GPT-2, incorporating more diverse sources and a broader spectrum of language use. Together, the increase in size and training data made GPT-3 much better at understanding context and generating coherent, contextually relevant responses.

However, the difference between GPT-2 and GPT-3 is not solely attributable to parameter count and dataset size. One of the hallmark features of GPT-3 is its ability to perform "few-shot learning," where it can pick up a task from only a handful of examples. This is a significant improvement over GPT-2, which generally required more explicit instructions or additional training data to perform specific tasks.
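
In practice, few-shot learning simply means demonstrating the task inside the prompt. The example below shows a translation pattern similar to the one illustrated in OpenAI's GPT-3 paper, with our own wording around it; no fine-tuning is involved.

    # A minimal sketch of a few-shot prompt: the task is demonstrated with two
    # examples, and the model is expected to infer the pattern for the last line.
    few_shot_prompt = (
        "Translate English to French.\n"
        "sea otter -> loutre de mer\n"
        "peppermint -> menthe poivrée\n"
        "cheese -> "
    )

    # Sent to a GPT-3-class model (for example through OpenAI's API), a prompt
    # like this is typically enough for the model to answer "fromage" without
    # any task-specific fine-tuning.
    print(few_shot_prompt)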

Naturally, this also meant that GPT-3 was the first model whose behavior could, in certain cases, be mistaken for that of a human. Humans, after all, do not require complex and explicit instructions to perform simple tasks. GPT-3's ability to mimic this behavior contributed to it feeling much more "human" than any preceding model.

However, while GPT-3 proved to be remarkably good, it was not the breakthrough model for OpenAI. That came in the form of a slightly modified version of this model, known as GPT-3.5, which later served as the foundation for the first widely popular and freely accessible chatbot: ChatGPT.

What is GPT-3.5 and ChatGPT

For starters, it is important to note that, while GPT-3.5 and ChatGPT are related, they are distinct in their design and intended use. Essentially, GPT-3.5 is a general-purpose language model, while ChatGPT is an application of a GPT model that has been specifically optimized for conversational tasks. Let us break this down further.

GPT-3.5 refers to a version of the Generative Pre-trained Transformer (GPT) models developed by OpenAI. It's a large-scale language model trained on a diverse range of internet text. GPT-3.5, like its predecessors, is designed to generate text based on the input it receives. It can be used for a wide range of tasks, including answering questions, writing essays, generating code, and more, based on the patterns and information it acquired during training.

ChatGPT, on the other hand, is a specialized application of that GPT model fine-tuned specifically for generating conversational responses. This fine-tuning process encompasses additional training steps, during which the model learns from a dataset of human-like conversations. This helps the model generate responses that are more contextually relevant and coherent within a chat format. ChatGPT is designed to simulate a human-like conversation and can be used for various applications, including customer service, tutoring, and casual conversation. Therefore, contrary to popular belief, the transition from GPT-3 to GPT-3.5 was marked by significant refinements rather than just scaling up the model's size.
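
Interacting with a ChatGPT-style model programmatically makes this conversational framing visible: the input is a list of role-tagged messages rather than a single block of text. The sketch below assumes OpenAI's official openai Python package (version 1.x), a valid API key in the OPENAI_API_KEY environment variable, and a model name that was current at the time of writing; treat it as an illustration rather than a definitive recipe.

    # A minimal sketch of a chat-formatted request to a GPT-3.5-class model.
    # Requires: pip install openai, plus an API key in OPENAI_API_KEY.
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model name; check OpenAI's docs for current options
        messages=[
            {"role": "system", "content": "You are a helpful tutoring assistant."},
            {"role": "user", "content": "Can you explain what fine-tuning means?"},
        ],
    )
    print(response.choices[0].message.content)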

The launch of ChatGPT had a profound impact on public perception and utilization of AI. It demonstrated the practical applications of conversational AI in daily life, offering many people who were previously uninterested in AI a glimpse of its capabilities. At first, only the free version of the model was available, but nowadays you can subscribe to a premium version called ChatGPT Plus. Subscribing to ChatGPT Plus offers numerous benefits, one of the most significant being access, albeit somewhat limited, to the newest member of the GPT family: GPT-4.

What is GPT-4 and the Future of GPT Models

The evolution of Generative Pre-trained Transformer models took another monumental leap with the introduction of GPT-4. This version represents a significant milestone in AI development, pushing the boundaries of what conversational AI can achieve. While GPT-3.5 and ChatGPT established the groundwork for human-like conversation, GPT-4 builds upon this foundation to provide even greater capabilities in understanding and generating human language.

What distinguishes GPT-4 from previous models, aside from its increased number of parameters and training data, is the quality of its understanding and generation of natural language. The model is trained on an even more diverse and extensive dataset, and its "zero-shot" and "one-shot" learning capabilities are markedly stronger, allowing it to perform tasks with little or even no examples provided. This makes GPT-4 more similar to humans than any preceding GPT model.
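
Zero-shot and one-shot prompting follow the same idea as the few-shot prompt shown earlier, only with fewer demonstrations. The strings below are our own illustrative examples.

    # Zero-shot: the task is described but not demonstrated.
    zero_shot_prompt = "Translate English to French: cheese -> "

    # One-shot: a single demonstration precedes the query.
    one_shot_prompt = (
        "Translate English to French.\n"
        "sea otter -> loutre de mer\n"
        "cheese -> "
    )

    # A sufficiently capable model is expected to answer "fromage" in both cases.
    print(zero_shot_prompt)
    print(one_shot_prompt)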

GPT-4 is proficient to the point that it is hard to imagine a model getting much better. Even so, OpenAI continues to pursue the development of increasingly advanced models. While the difference in quality between the current version and the next one might not be as substantial as the one between GPT-3.5 and GPT-4, one thing remains clear: given the current pace of progress, it may only be a matter of time before a model approaching artificial general intelligence is created. Such a machine would not only be adaptable and capable of solving almost any task without specific training but might also exhibit signs of self-awareness.

In this article, we have delved into the evolution of GPT models, providing a comprehensive overview of their development and functionality. From the foundational Transformer model to the latest iteration, GPT-4, we have explored how these models have become increasingly adept at understanding and generating human language. In future articles, we will focus on using ChatGPT, demonstrating how to take full advantage of the most powerful large language model available to the general public.

Boris Delovski

Data Science Trainer


Boris is a data science trainer and consultant who is passionate about sharing his knowledge with others.

Before Edlitera, Boris applied his skills in several industries, including neuroimaging and metallurgy, using data science and deep learning to analyze images.