## Unlocking the Power of Language: Leveraging Large Language Models for Next-Gen Semantic Search and Real-World Applications

Invited talk at Calfus, Pune, June 20, 2024.

What is notable about this collection of names?

ann.

akela.

az.

arileri.

chaiadayra

They share a common origin: each one was generated by a Deep Learning model. Intrigued to understand how? Large Language Models (LLMs) are multifaceted, handling complex tasks such as sentence completion, Q&A, text summarization, and sentiment analysis. As the name suggests, LLMs are very large: intricate models with tens or hundreds of billions of parameters, trained on vast datasets on the order of 10 terabytes of text. However, it is possible to appreciate the foundation of how machines learn meaning from text starting from a seemingly straightforward concept, the bigram model.

The bigram model operates on the principle of predicting one token from another. For simplicity, let’s consider tokens as characters in the English alphabet. This principle closely aligns with the essence of LLMs like ChatGPT, which predict subsequent tokens based on preceding ones, iteratively generating coherent text and even entire computer programs. In our bigram model, however, we predict the next character from the current one alone, utilizing a 26×26 matrix of probabilities. Each entry in the matrix represents the probability of a particular character appearing after another. This matrix, with some modifications, constitutes our model. Our goal? To generate names.

We introduce an extra character to mark the start or end of a word, expanding from a 26×26 matrix to a 27×27 matrix. The matrix entries arise from patterns observed in a training dataset comprising over 30,000 names from a public database. Raw occurrence counts are normalized, row by row, into probabilities for sampling. Generating a name involves starting with the character that marks the start of a word, sampling the 1st character from the multinomial probability distribution in that marker’s row, recycling that character as input to predict the 2nd character, and so forth until the end character is drawn. The resulting names, like junide, janasah, p, cony, and a, showcase the model’s unique outputs.
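The counting and sampling steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook from the talk: the eight-name list is a hypothetical stand-in for the 30,000-name dataset, `.` is assumed as the start/end marker, and `sample_name` is a name chosen here for illustration.

```python
import random

# Hypothetical toy dataset; the talk used over 30,000 names from a public database.
names = ["anna", "akela", "maria", "nina", "liam", "noah", "emma", "olivia"]

chars = "." + "abcdefghijklmnopqrstuvwxyz"  # index 0 is the assumed start/end marker
stoi = {ch: i for i, ch in enumerate(chars)}

# 27x27 matrix of raw bigram counts: counts[i][j] = times character j followed character i.
counts = [[0] * 27 for _ in range(27)]
for name in names:
    padded = "." + name + "."
    for a, b in zip(padded, padded[1:]):
        counts[stoi[a]][stoi[b]] += 1


def sample_name(rng, max_len=20):
    """Sample one character at a time until the end marker is drawn."""
    out, row = [], 0  # begin in the row of the start marker
    while len(out) < max_len:
        # random.choices normalizes the weights, turning counts into probabilities.
        row = rng.choices(range(27), weights=counts[row])[0]
        if row == 0:  # end marker reached
            break
        out.append(chars[row])
    return "".join(out)


rng = random.Random(0)
samples = [sample_name(rng) for _ in range(5)]
print(samples)
```

With so little data the samples stay close to the training names; the variety seen in the talk comes from the much larger dataset.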

Considering these names, one might favor Janasah! But there’s room for enhancement. Enter the neural network! How would this transition occur? Instead of relying on a lookup matrix, the neural network would predict one character from another. Here’s how:

- Representation: Numerically represent each character for input and output with vectors of length 27, accounting for the extra character.
- Data Sets: Divide the data into training, validation, and testing sets to train the model, guard against overfitting, and assess performance.
- Loss Function: Utilize negative log-likelihood, common in such scenarios, computed on the probability distribution that a softmax layer produces from the network’s outputs.
- Training: Adjust model parameters using calculated gradients and backpropagation through the neural network.
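The four steps above might look as follows in PyTorch. This is a sketch under stated assumptions rather than the notebook’s implementation: the name list is a hypothetical toy dataset, the network is a single 27×27 linear layer, the learning rate and step count are arbitrary, and the train/validation/test split is omitted for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy dataset standing in for the real list of names.
names = ["anna", "akela", "maria", "nina", "liam", "noah", "emma", "olivia"]
chars = "." + "abcdefghijklmnopqrstuvwxyz"  # "." is the assumed start/end marker
stoi = {ch: i for i, ch in enumerate(chars)}

# Build (current character, next character) index pairs.
xs, ys = [], []
for name in names:
    padded = "." + name + "."
    for a, b in zip(padded, padded[1:]):
        xs.append(stoi[a])
        ys.append(stoi[b])
xs, ys = torch.tensor(xs), torch.tensor(ys)

# Representation: one-hot vectors of length 27.
x_onehot = F.one_hot(xs, num_classes=27).float()

# Model: a single linear layer mapping 27 inputs to 27 logits.
W = torch.randn(27, 27, requires_grad=True)

losses = []
for step in range(100):
    logits = x_onehot @ W
    # cross_entropy applies softmax and then the negative log-likelihood.
    loss = F.cross_entropy(logits, ys)
    losses.append(loss.item())
    loss.backward()  # backpropagation fills W.grad
    with torch.no_grad():
        W -= 5.0 * W.grad  # plain gradient descent; the rate is an arbitrary choice
        W.grad.zero_()

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, applying softmax to a row of `W` plays the same role as a row of the normalized count matrix.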

Refer to the Colab notebook for the implementation with detailed notes. So we have trained a neural network to do what we could do with a matrix. What’s the big deal?

For one, we can use a longer sequence of characters as input to the neural network, giving the model more material to work with to make better predictions. This block of characters provides not just one sequence, but all sequences including and up to the last character as context to the neural network. This already goes beyond what we can do with matrices with counts of occurrences of bigrams.
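To make the idea of a block supplying several contexts concrete, here is a small sketch. The helper name `contexts_from_block` and the `.` marker are assumptions made for illustration.

```python
def contexts_from_block(block):
    """Return every (context, target) pair contained in one block of characters."""
    pairs = []
    for t in range(1, len(block)):
        # Everything up to position t is the context; the character at t is the target.
        pairs.append((block[:t], block[t]))
    return pairs


pairs = contexts_from_block(".anna.")
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
# '.' -> 'a'
# '.a' -> 'n'
# '.an' -> 'n'
# '.ann' -> 'a'
# '.anna' -> '.'
```

A single block of six characters therefore yields five training examples, each with a longer context than the last.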

But how does a neural network learn meaning in text? Part of the answer lies in embeddings. Every token is converted into a numerical vector of fixed size, thus allowing a spatial representation in which meaningful associations can take shape. We allow the embeddings to emerge as properties of a neural network during the training process. The deeper layers of the neural network use these associations as stepping stones to enrich structure in keeping with the nuances and intricacies of linguistic constructs.

*Talk about layered meaning!*
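As a rough illustration of embeddings, the sketch below uses PyTorch’s `nn.Embedding` to map each character index to a trainable vector. The embedding size of 8 and the `.` marker are arbitrary assumptions; the vectors here are freshly initialized, so they carry no meaning until training adjusts them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim = 27, 8  # embed_dim is an arbitrary illustrative choice
embedding = nn.Embedding(vocab_size, embed_dim)

chars = "." + "abcdefghijklmnopqrstuvwxyz"  # "." is the assumed start/end marker
stoi = {ch: i for i, ch in enumerate(chars)}

token_ids = torch.tensor([stoi[ch] for ch in ".anna"])
vectors = embedding(token_ids)  # one row of the embedding table per token

# The same token always maps to the same vector: "a" sits at positions 1 and 4.
same_vector = torch.equal(vectors[1], vectors[4])
print(vectors.shape, same_vector)
```

Because `embedding.weight` requires gradients, the table is updated by backpropagation along with the rest of the network.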

Wrapping up our baby steps in language models, we’ve transitioned from basic bigram models to deep neural networks, exploring the evolution from mechanical predictions to embeddings that allow associations that capture primitives of nuanced linguistic structure. We get a glimpse into the potential of these models to grasp the intricacies of language, beyond generating names. As we take these initial steps, the horizon of possibilities widens, promising not only enhanced language generation but also advancements in diverse applications, hinting at a future where machines engage with human communication in increasingly sophisticated ways.

*Explore the fascinating world of Artificial Intelligence in my upcoming class, powered by FastAI! We’ll embark on a hands-on journey through the evolving landscape of AI, building models with state-of-the-art architecture and learning to wield the power of Large Language Models (LLMs). Whether you’re a beginner or seasoned enthusiast, this class promises a dynamic and engaging exploration into the realm of AI, equipping you with the skills to navigate and innovate in this rapidly evolving field. Join me for an exciting learning experience that goes beyond theory, fueled by the practical insights and advancements offered by FastAI.*

In Building a Simple Neural Network From Scratch in PyTorch, we described a recipe with 6 functions as follows:

- `train_model(epochs=30, lr=0.1)`: This function acts as the outer wrapper of our training process. It requires access to the training data, `trainingIn` and `trainingOut`, which should be defined in the environment. `train_model` orchestrates the training process by calling the `execute_epoch` function for a specified number of epochs.
- `execute_epoch(coeffs, lr)`: Serving as the inner wrapper, this function carries out one complete training epoch. It takes the current coefficients (weights and biases) and a learning rate as input. Within an epoch, it calculates the loss and updates the coefficients. To estimate the loss, it calls `calc_loss`, which compares the predicted output generated by `calc_preds` with the target output. After this, `execute_epoch` performs a backward pass to compute the gradients of the loss, storing these gradients in the `grad` attribute of each coefficient tensor.
- `calc_loss(coeffs, indeps, deps)`: This function calculates the loss using the given coefficients, input predictors `indeps`, and target output `deps`. It relies on `calc_preds` to obtain the predicted output, which is then compared to the target output to compute the loss. The backward pass is subsequently invoked on this loss to compute the gradients, which are stored within the `grad` attribute of the coefficient tensors for further optimization.
- `calc_preds(coeffs, indeps)`: Responsible for computing the predicted output based on the given coefficients and input predictors `indeps`. This function follows the forward pass logic and applies activation functions where necessary to produce the output.
- `update_coeffs(coeffs, lr)`: This function plays a pivotal role in updating the coefficients. It iterates through the coefficient tensors, applying gradient descent with the specified learning rate `lr`. After each update, it resets the gradients to zero using the `zero_` function, ensuring the gradients are fresh for the next iteration.
- `init_coeffs(n_hidden=20)`: The initialization function is responsible for setting up the initial coefficients. It shapes each coefficient tensor based on the number of neurons specified for the sole hidden layer.
- `model_accuracy(coeffs)`: An optional function that evaluates the prediction accuracy on the validation set, providing insights into how well the trained model generalizes to unseen data.

In this blog post, we’ll take a deep dive into constructing a powerful deep learning neural network from the ground up using PyTorch. Building upon the foundations of the previous simple neural network, we’ll refactor some of these functions for deep learning.

**Initializing Weights and Biases**

To prepare our neural network for deep learning, we’ve revamped the weight and bias initialization process. The `init_coeffs` function now allows for specifying the number of neurons in each hidden layer, making it flexible for different network configurations. We generate a weight matrix and a bias for each layer; in the code below, each layer’s bias is a single scalar shared by all neurons in that layer.

```python
import torch

def init_coeffs(hiddens=[10, 10]):
    # Layer sizes: input features, each hidden layer, then a single output.
    sizes = [trainingIn.shape[1]] + hiddens + [1]
    n = len(sizes)
    # Weight initialization: one matrix per pair of adjacent layers.
    weights = [(torch.rand(sizes[i], sizes[i+1]) - 0.3) / sizes[i+1] * 4 for i in range(n-1)]
    # Bias initialization: one scalar per layer, close to zero.
    biases = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n-1)]
    for wt in weights: wt.requires_grad_()
    for bs in biases: bs.requires_grad_()
    return weights, biases
```

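We define the architecture’s structure using `sizes`, where `hiddens` specifies the number of neurons in each hidden layer. The random weights are divided by the size of the following layer and the biases start near zero, with the aim of keeping the initial activations in a reasonable range as layers are stacked.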
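To check what `init_coeffs` produces, the sketch below repeats the function and inspects the resulting shapes. The `trainingIn` tensor here is a hypothetical stand-in with 12 predictor columns; the real training data is not shown in this post.

```python
import torch

torch.manual_seed(0)
trainingIn = torch.rand(100, 12)  # hypothetical stand-in: 100 rows, 12 predictors


def init_coeffs(hiddens=[10, 10]):
    sizes = [trainingIn.shape[1]] + hiddens + [1]
    n = len(sizes)
    weights = [(torch.rand(sizes[i], sizes[i+1]) - 0.3) / sizes[i+1] * 4 for i in range(n-1)]
    biases = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n-1)]
    for wt in weights: wt.requires_grad_()
    for bs in biases: bs.requires_grad_()
    return weights, biases


weights, biases = init_coeffs(hiddens=[10, 10])
weight_shapes = [tuple(wt.shape) for wt in weights]
bias_shapes = [tuple(bs.shape) for bs in biases]
print(weight_shapes)  # [(12, 10), (10, 10), (10, 1)]
print(bias_shapes)    # [(), (), ()]
```

Two hidden layers give three weight matrices, one for each hop between adjacent layers, and three scalar biases.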

**Forward Propagation With Multiple Hidden Layers**

Our revamped `calc_preds` function accommodates multiple hidden layers in the network. It iterates through the layers, applying weight matrices and biases at each step and introducing non-linearity using the ReLU activation function in the hidden layers and the sigmoid activation in the output layer. This enables our deep learning network to capture complex patterns in the data.

```python
import torch
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    weights, biases = coeffs
    res = indeps
    n = len(weights)
    for i, wt in enumerate(weights):
        res = res @ wt + biases[i]
        if i != n-1:
            res = F.relu(res)  # Apply ReLU activation in hidden layers
    return torch.sigmoid(res)  # Sigmoid activation in the output layer
```

Note that `weights` is now a list of tensors containing layer-wise weights and, correspondingly, `biases` is the list of tensors containing layer-wise biases.

**Backward Propagation With Multiple Hidden Layers**

Loss calculation and gradient descent remain consistent with the simple neural network implementation. We use the mean absolute error (MAE) for loss as before and tweak the `update_coeffs` function to apply gradient descent to the weights and biases of every layer.

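```python
def update_coeffs(coeffs, lr):
    # Assumes the caller wraps this call in torch.no_grad(), since the
    # tensors are updated in place.
    weights, biases = coeffs
    for layer in weights + biases:
        layer.sub_(layer.grad * lr)  # gradient descent step
        layer.grad.zero_()           # reset gradients for the next epoch
```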
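A single update step can be traced by hand on a toy loss. The sketch below is an illustration, not part of the original recipe: the tensors `w` and `b` and the quadratic loss are made up, and the call is wrapped in `torch.no_grad()` on the assumption that `execute_epoch` does the same, since PyTorch rejects in-place updates on tracked leaf tensors.

```python
import torch


def update_coeffs(coeffs, lr):
    weights, biases = coeffs
    for layer in weights + biases:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()


# Made-up coefficients: one "weight" tensor and one scalar "bias".
w = torch.tensor([1.0, -2.0], requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
coeffs = ([w], [b])

# Toy loss whose gradients are easy to verify: d/dw = 2w, d/db = 2b.
loss = (w ** 2).sum() + b ** 2
loss.backward()  # w.grad = [2, -4], b.grad = 1

with torch.no_grad():  # in-place updates must not be tracked by autograd
    update_coeffs(coeffs, lr=0.1)

print(w)       # values [0.8, -1.6]
print(b)       # value 0.4
print(w.grad)  # zeros, ready for the next backward pass
```

Each coefficient moves against its gradient by `lr` times the gradient, and the stored gradients are cleared so they do not accumulate across epochs.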

**Putting It All Together in Wrapper Functions**

Our `train_model` function can be used ‘as is’ to orchestrate the training process, with the `execute_epoch` wrapper function helping as before. The `model_accuracy` function also does not change.
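For readers without the earlier post at hand, here is one way the pieces might fit together end to end. The bodies of `calc_loss`, `execute_epoch`, and `train_model` are not shown in this post, so the versions below are assumptions reconstructed from their descriptions; the synthetic `trainingIn` and `trainingOut` tensors and the returned loss history are also illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)

# Hypothetical synthetic data standing in for the real training set.
trainingIn = torch.rand(200, 4)
trainingOut = (trainingIn.sum(dim=1, keepdim=True) > 2).float()


def init_coeffs(hiddens=[10, 10]):
    sizes = [trainingIn.shape[1]] + hiddens + [1]
    n = len(sizes)
    weights = [(torch.rand(sizes[i], sizes[i+1]) - 0.3) / sizes[i+1] * 4 for i in range(n-1)]
    biases = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n-1)]
    for wt in weights: wt.requires_grad_()
    for bs in biases: bs.requires_grad_()
    return weights, biases


def calc_preds(coeffs, indeps):
    weights, biases = coeffs
    res = indeps
    n = len(weights)
    for i, wt in enumerate(weights):
        res = res @ wt + biases[i]
        if i != n-1:
            res = F.relu(res)
    return torch.sigmoid(res)


def calc_loss(coeffs, indeps, deps):
    # Assumed form of the loss: mean absolute error, as named in the text.
    return torch.abs(calc_preds(coeffs, indeps) - deps).mean()


def update_coeffs(coeffs, lr):
    weights, biases = coeffs
    for layer in weights + biases:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()


def execute_epoch(coeffs, lr):
    # Assumed body: forward pass, backward pass, then an untracked update.
    loss = calc_loss(coeffs, trainingIn, trainingOut)
    loss.backward()
    with torch.no_grad():
        update_coeffs(coeffs, lr)
    return loss.item()


def train_model(epochs=30, lr=0.1):
    # Assumed body; returning the loss history is a convenience added here.
    coeffs = init_coeffs()
    losses = [execute_epoch(coeffs, lr) for _ in range(epochs)]
    return coeffs, losses


coeffs, losses = train_model(epochs=30, lr=0.1)
preds = calc_preds(coeffs, trainingIn)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

How quickly the loss falls depends on the random initialization, the learning rate, and the data, so treat the printed numbers as indicative only.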

With these modifications, we’ve refactored our simple neural network into a deep learning model that has greater capacity for learning. The beauty of it is we have retained the same set of functions and interfaces that we implemented in a simple neural network, refactoring the code to scale with multiple hidden layers.

- `train_model(epochs=30, lr=0.1)`: No change!
- `execute_epoch(coeffs, lr)`: No change!
- `calc_loss(coeffs, indeps, deps)`: No change!
- `calc_preds(coeffs, indeps)`: Tweak to use the set of weights and corresponding set of biases in each hidden layer, iterating over all layers from input to output.
- `update_coeffs(coeffs, lr)`: Tweak to iterate over the set of weights and accompanying set of biases in each layer.
- `init_coeffs(hiddens=[10, 10])`: Tweak for compatibility with an architecture that can potentially have any number of hidden layers of any size.
- `model_accuracy(coeffs)`: No change!

Such a deep learning model has greater capacity for learning. However, it is also hungrier for training data! In subsequent posts, we will examine the breakthroughs that have made deep learning models practically feasible and reliable. These include advancements such as:

- Batch Normalization
- Residual Connections
- Dropout