Andrej Karpathy – CraftWithCodeWiz

January 7 2024

Unorthodox Onomastics: How AI Can Assist You in Discovering the Perfect Name for Your Child

sanjaybhatikar PyTorch Andrej Karpathy, Artificial Intelligence, Bigram model, Deep Learning, Embeddings, Large Language Models, Onomastics 0

What is notable about this collection of names?

ann.
akela.
az.
arileri.
chaiadayra

They share a common origin – each one was generated by a Deep Learning model. Intrigued to understand how? Large Language Models (LLMs) are multifaceted, handling complex tasks such as sentence completion, Q&A, text summarization, and sentiment analysis. LLMs, emphasizing their substantial size, are intricate models with tens or hundreds of billions of parameters, honed on vast datasets totaling 10 terabytes. However, it is possible to appreciate the foundation of how machines learn meaning from text starting from a seemingly straightforward concept – the bigram model.

The bigram model operates on the principle of predicting one token from another. For simplicity, let’s consider tokens as characters in the English alphabet. This principle closely aligns with the essence of LLMs like ChatGPT, which predict subsequent tokens based on preceding ones, iteratively generating coherent text and even entire computer programs. In our bigram model, however, we predict one character from the next, utilizing a 26×26 matrix of probabilities. Each entry in the matrix represents the probability of a particular character appearing after another. This matrix, with some modifications, constitutes our model. Our goal? To generate names.

Bigram Matrix

The bigram matrix shows the frequency of occurrence of one token following another (“bigram”) in a given dataset. The tokens here are characters of the English alphabet plus one additional token to mark the start or end of a word. The dataset is a collection of 30,000+ names from a public database. The entry in a cell is the count of occurrences of the character in the column following the character in the row.

We introduce an extra character to mark the start or end of a word, expanding from a 26×26 matrix to a 27×27 matrix. The matrix entries arise from patterns observed in a training dataset comprising over 30,000 names from a public database. Raw occurrence counts shown are transformed into probabilities for sampling. Generating a name involves starting with the character that marks the start of a word, sampling the 1st character from the multinomial probability distribution in the 1st row, recycling that character as input to predict the 2nd character, and so forth until reaching the end character. The resulting names, like junide, janasah, p, cony, and a, showcase the model’s unique outputs.

Considering these names, one might favor Janasah! But there’s room for enhancement. Enter the neural network! How would this transition occur? Instead of relying on a lookup matrix, the neural network would predict one character from another. Here’s how:

Representation: Numerically represent each character for input and output with vectors of length 27, accounting for the extra character.
Data Sets: Divide the data into training, validation, and testing sets to train the model, guard against overfitting, and assess performance.
Loss Function: Utilize negative log-likelihood, common in such scenarios, calculated through a softmax layer to generate a probability distribution.
Training: Adjust model parameters using calculated gradients and backpropagation through the neural network.

Refer to the Colab notebook for the implementation with detailed notes. So we have trained a neural network to do what we could do with a matrix. What’s the big deal?

For one, we can use a longer sequence of characters as input to the neural network, giving the model more material to work with to make better predictions. This block of characters provides not just one sequence, but all sequences including and up to the last character as context to the neural network. This already goes beyond what we can do with matrices with counts of occurrences of bigrams.

But how does a neural network learn meaning in text? Part of the answer lies in embeddings. Every token is converted into a numerical vector of fixed size, thus allowing a spatial representation in which meaningful associations can take shape. We allow the embeddings to emerge as properties of a neural network during the training process. The deeper layers of the neural network use these associations as stepping stones to enrich structure in keeping with the nuances and intricacies of linguistic constructs.

Talk about layered meaning!

Wrapping up our baby steps in language models, we’ve transitioned from basic bigram models to deep neural networks, exploring the evolution from mechanical predictions to embeddings that allow associations that capture primitives of nuanced linguistic structure. We get a glimpse into the potential of these models to grasp the intricacies of language, beyond generating names. As we take these initial steps, the horizon of possibilities widens, promising not only enhanced language generation but also advancements in diverse applications, hinting at a future where machines engage with human communication in increasingly sophisticated ways.

Explore the fascinating world of Artificial Intelligence in my upcoming class, powered by FastAI! We’ll embark on a hands-on journey through the evolving landscape of AI, building models with state-of-the-art architecture and learning to wield the power of Large Language Models (LLMs). Whether you’re a beginner or seasoned enthusiast, this class promises a dynamic and engaging exploration into the realm of AI, equipping you with the skills to navigate and innovate in this rapidly evolving field. Join me for an exciting learning experience that goes beyond theory, fueled by the practical insights and advancements offered by FastAI.