March 31, 2023. 15 min

Generative AI and Large Language Models

Generative models are all the rage lately and have captured everyone's imagination. The field witnessed significant advances and applications over the past year, starting with image generation from text prompts using tools such as DALL-E by OpenAI, Stable Diffusion by Stability AI, and Midjourney, and ending with the explosion of ChatGPT for text generation.

In this blog, you will learn about large language models (LLMs), how you can solve problems using LLMs, and finally why this is different and worthy of your attention. Let's jump in!

What is a Generative Model?

Generative models are not a new phenomenon; they have been around for many years. These are statistical models that aim to learn the distribution of the training data and can generate new data by sampling from this learned distribution. Traditional machine learning models like Naïve Bayes and Hidden Markov Models are examples of generative models. Since the deep learning era, there has been constant development of generative models such as Variational Autoencoders, Generative Adversarial Networks, and, more recently, diffusion models. The space of generative AI is rapidly evolving, with new announcements every week. Here are some interesting recent ones:

  1. OpenAI's GPT-4, a larger model than GPT-3.5 trained on a larger dataset
  2. Google Bard, now open for experimental usage
  3. Meta's LLaMA, a 65-billion-parameter foundation model
  4. Microsoft Teams Copilot, an AI-based productivity tool for Microsoft 365
  5. Amazon CodeWhisperer, an ML-powered code generation tool
  6. Adobe Firefly, a generative AI tool for image editing and image content creation

What are Large Language Models (LLMs)?

Language models are generative models that learn the distribution of natural language (text) data. Until 2017, the popular methods for language modelling were recurrent neural networks such as LSTMs and GRUs, but the introduction of the transformer architecture in Attention Is All You Need at NeurIPS 2017 changed the entire landscape and gave birth to large language models. Large language models are based on the transformer architecture with some variations. There are a few popular variants of this architecture used by different LLMs (a short loading sketch follows the list):

  1. Encoder-only models, for example BERT, pretrained on tasks such as Masked Language Modelling (some of the input tokens are masked and the model predicts those tokens) and Next Sentence Prediction (the model takes two sentences as input and is trained to classify whether the second sentence follows the first)
  2. Decoder-only models like GPT, pretrained on an autoregressive text generation task (the model is trained to predict the next token given a sequence of tokens; here 'autoregressive' means that the tokens generated by the model are fed back as input for predicting the next token)
  3. Complete (encoder-decoder) transformer architectures like T5 and BART, pretrained on a text denoising objective (the input contains a noisy sequence of tokens, i.e., some tokens may be deleted or replaced by random tokens, and the model predicts the original sequence, i.e., removes the noise)
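
To make the distinction concrete, here is a minimal loading sketch (assuming the Hugging Face transformers library, which is also used in the code example later in this post; the model names are illustrative):


from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only (BERT-style): fills in masked tokens
encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
# Decoder-only (GPT-style): autoregressive next-token prediction
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
# Encoder-decoder (T5/BART-style): denoising and sequence-to-sequence generation
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
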
What can LLMs do?

Since these models were trained on huge corpora of text, they have learnt the distribution of natural language (or code) and hence can be used for generating text. They are not only good for text generation; they can also be used for a variety of downstream tasks such as text translation, text classification, text summarization, question answering, natural language inference, etc. The papers introducing the language models mentioned above report performance on benchmarks for these downstream tasks.

You can use these models in different ways depending on your use case. If your task is text generation, you may directly use any autoregressive model; you can also use the same model for a different task, such as text summarization, with some modifications to the prompt. If you want to use a BERT model, pretrained on the masked language modelling task, for a sequence classification task, you have to finetune the model. Finetuning is a technique in which we start from the pretrained model and train it further on a new task using a task-specific dataset, which may be labelled and is usually much smaller than the pretraining dataset. The benefit of this technique is that we start from the language understanding the LLM acquired from a huge unlabelled dataset, so the model can learn the new task in that language faster than if it started from random weights.
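
As a minimal sketch of the prompt-modification approach (the model choice and the "TL;DR:" suffix are illustrative; the suffix is the summarization prompt trick popularized by the GPT-2 paper, not the only option):


from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

article = "The city council met on Tuesday and approved a new budget that increases funding for public transport."
prompt = article + "\nTL;DR:"  # the suffix nudges the model towards summarization
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)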

Why is this Important?

In a recent note, Bill Gates stated that Artificial intelligence is as revolutionary as mobile phones and the Internet.

In a recent paper, Microsoft Research described how GPT-4 is showing early "sparks" of Artificial General Intelligence (AGI):

GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.

Fearing a loss of control over our civilization, more than 1,100 signatories, including Elon Musk and Steve Wozniak, signed an open letter calling for a pause of at least 6 months on the training of AI systems more powerful than GPT-4.

How to build something using LLMs?

Large models (GPT-3 and beyond) can be used for tasks other than text generation even without finetuning. LLMs can be considered few-shot learners, i.e., they can learn a new concept from only a few examples without being finetuned (see Language Models are Few-Shot Learners). This is achieved by designing the right prompt. Selecting the prompt for your problem requires a fair bit of experimentation; small variations in the prompt can lead to major changes in the model output.
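
For example, a few-shot prompt for sentiment classification might look like the following (the format and examples below are just one possible design, not a prescribed template):


Review: The food was delicious and the staff were friendly.
Sentiment: positive
Review: The package arrived late and the box was damaged.
Sentiment: negative
Review: I love the sunny weather today.
Sentiment:

The model is expected to continue the pattern and complete the last line with one of the labels it has seen.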

Here are five references:
  1. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting...
  2. The Power of Scale for Parameter-Efficient Prompt Tuning
  3. Prefix-Tuning: Optimizing Continuous Prompts for Generation
  4. Language Models are Unsupervised Multitask Learners
  5. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Sometimes it may not even be necessary to give examples for your task; you can use the model for zero-shot inference, i.e., inference on a task the model was not explicitly trained on. This too is done by designing an appropriate prompt (see Large Language Models are Zero-Shot Reasoners).

Prompt Engineering

Prompt engineering refers to methods for controlling the output of generative models by designing input prompts (see Prompt Engineering). Here are a few prompt design techniques in brief:

  1. Instruction prompting: The prompt should contain a specific and precise description of the task and describe the requirements in detail. For example: Summarize the following French text in English in less than 100 words: [French text]
  2. Chain-of-thought prompting: This is useful for tasks that involve reasoning ability, for example solving a puzzle. For few-shot inference, the prompt should contain step-by-step reasoning logic (known as reasoning chains). For zero-shot, the prompt should include a statement like Let's think step by step, which nudges the model to follow a reasoning approach (see the sketch after this list)
  3. Automatic prompt design: The prompt is treated as trainable parameters and is optimized using gradient descent. This results in a sequence of tokens (i.e., the prompt) that increases the probability of the desired output given the input
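
For example, a zero-shot chain-of-thought prompt simply appends the trigger phrase to the question (this arithmetic question is a commonly used illustration from the literature, not a requirement):


Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.

For few-shot chain-of-thought prompting, the prompt would instead contain worked examples whose answers spell out the intermediate reasoning before the final answer.
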
Leveraging Pretrained Models

Some pretrained models, such as BERT, RoBERTa, etc., need to be finetuned for different tasks. To finetune a model for a text classification task, an output layer (or a set of layers followed by the output layer) can be added on top of the embeddings the model produces for the input sequence, and the model can be trained on a labelled dataset using an appropriate loss function, such as cross entropy loss (a rough sketch follows below). Finetuning can improve performance over the few-shot approach in cases where the target task is quite different from the training distribution of the model. For example, if you are using a model trained on English text and want to classify German text, it won't work without finetuning. Similarly, if your model was trained on English text taken from Wikipedia, books, articles, and blogs, but your use case is classifying social media data like Twitter, the training and target distributions are different enough that finetuning is necessary.
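
Here is a rough sketch of this setup (assuming a BERT-style encoder loaded with the Hugging Face transformers library; the model name, input sentence, and number of classes are illustrative):


import torch
from transformers import AutoModel, AutoTokenizer

encoder = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# A single linear layer on top of the [CLS] embedding; 2 classes here
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("The service was excellent.", return_tensors="pt")
cls_embedding = encoder(**inputs).last_hidden_state[:, 0]  # embedding of the [CLS] token
logits = classifier(cls_embedding)
# During finetuning, both the encoder and the classifier are trained
# on the labelled dataset with a cross entropy loss over these logits.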

Code Example

Let's see an example of how you can use an LLM for zero-shot classification. We'll run inference using the Hugging Face transformers library and the bloomz-560m model, which is an autoregressive LLM with 560 million parameters [Crosslingual Generalization through Multitask Finetuning, bigscience/bloomz-560m].

We first need to install huggingface transformers and torch


pip install torch
pip install transformers	

We need to import torch, the model class AutoModelForCausalLM, and the tokenizer class AutoTokenizer


import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

Let's initialize the model and tokenizer


model_name = "bigscience/bloomz-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)	

Let's use this model for sentiment classification. We'll do this by designing a prompt. After some trial and error, one suitable input prompt for this task is:


I love the sunny weather today. # Sentence you want to get prediction for
What is the tone of this sentence positive, negative, or angry?

When given a prompt in the above format, the model generates output from the given options, followed by an end-of-string (EOS) token. For the particular example above, the model outputs positive

The model expects the input to be a list of tokens, not a string, so we first tokenize our input string. Since we are using PyTorch for inference, we pass return_tensors="pt" to get PyTorch tensors.


prompt_str = "I love the sunny weather today.\nWhat is the tone of this sentence positive, negative, or angry?"
input_tokens = tokenizer.encode(prompt_str, return_tensors="pt")

We can pass this as the input to the model to get generated output tokens. Specifying max_length may be required in some cases where the default length is shorter than the combined number of input and output tokens.


output_tokens = model.generate(input_tokens, max_length=60)

To get the output in human readable format we decode the output tokens.


output_str = tokenizer.decode(output_tokens[0])  # generate returns a batch, so take the first sequence

When we are sure that our input prompt behaves in the way we want, we can write a function that takes a text input and returns the predicted class.


def find_sentiment(input_str, labels):
	# Build an option string like "positive, or negative" without modifying the caller's list
	label_str = ", ".join(labels[:-1]) + ", or " + labels[-1]
	prompt_str = "{}\nWhat is the tone of this sentence {}?".format(input_str, label_str)
	input_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
	input_len = input_tokens[0].shape[0]
	output_tokens = model.generate(input_tokens, max_length=60)
	# Decode only the newly generated tokens, skipping the echoed prompt
	label = tokenizer.decode(output_tokens[0, input_len:], skip_special_tokens=True)
	return label.strip()

We can now use this function for sentiment classification.


find_sentiment("I love the sunny weather today", ["positive", "negative"])
# positive
find_sentiment("I don't love the rainy weather today", ["positive", "negative"])
# negative

Let's use the same model for classifying programming languages. We want to tell which language a piece of code is written in, and we design a prompt in a similar way.


def classify_code(code_str, labels):
	prompt_str = "{}\nThe above function is in which of these languages: {}?".format(code_str, " or ".join(labels))
	input_tokens = tokenizer.encode(prompt_str, return_tensors="pt")
	input_len = input_tokens[0].shape[0]
	output_tokens = model.generate(input_tokens, max_length=100)
	# Decode only the newly generated tokens
	label = tokenizer.decode(output_tokens[0, input_len:], skip_special_tokens=True)
	return label.strip()

We can use this function for code classification as shown below.


code_str = "int findMax(int a, int b){\n\tif (a > b) ans = a;\n\telse ans = b;\n\treturn ans;\n}"
language_labels = ["C++", "Python"]
classify_code(code_str, language_labels)

Input:
int findMax(int a, int b) {
	if (a > b) ans = a;
	else ans = b;
	return ans;
}
Output:
C++

code_str = "def find_max(a, b, c):\n\tans = max(c, max(a,b))\n\treturn ans"
language_labels = ["C++", "Python"]
classify_code(code_str, language_labels)

Input:
def find_max(a, b, c):
	ans = max(c, max(a,b))
	return ans
Output:
Python

Finetuning

Let's say you are not getting the desired accuracy on this classification task and you have a large labelled dataset; you can then finetune the model for classification. In Hugging Face, you can use AutoModelForSequenceClassification for text classification, which works with many model types. You specify the number of classes in the num_labels parameter and then train the model. If the model was not pretrained for classification, or the specified num_labels differs from the pretrained model, a classification layer with randomly initialized weights is added on top, so without training the output will be random.


from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # 2 for the classes Python and C++

Following is a simple code snippet for training this classification model


# Specify the loss function
loss_func = torch.nn.CrossEntropyLoss()
# Specify the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Specify which device to use. This will use a GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # Copy the model to the device
for epoch in range(num_epochs):
	for step, (code_snippets, labels) in enumerate(labelled_training_dataset):
		# Here code_snippets is a list of strings and labels is a list of integers
		# The batches were already created and the length of each list is batch_size
		code_tokens = tokenizer(code_snippets,
			truncation=True,  # Truncate sequences longer than max_length
			padding="max_length",  # Pad sequences shorter than max_length
			return_tensors="pt",  # We want PyTorch tensors
			# max_length=512,  # You can change the max length
		)
		# Convert the labels to a PyTorch tensor
		labels = torch.LongTensor(labels)
		# The data and the model must be on the same device
		code_tokens = code_tokens.to(device)
		labels = labels.to(device)
		# Zero the gradients every step
		optimizer.zero_grad()
		# Pass the data through the model and take the classification logits
		logits = model(**code_tokens).logits
		# Calculate the train step loss
		loss = loss_func(logits, labels)
		# Compute gradients for backpropagation
		loss.backward()
		# Update the model parameters
		optimizer.step()

Limitations of LLMs

Even though LLMs are full of potential and have displayed state-of-the-art performance on many problems, they come with risks that one must be aware of, especially if your use case is sensitive. One of the most dangerous issues is hallucination: confident statements generated by the model that are incorrect, for example wrong facts, numbers, or dates produced in answers to questions about real incidents. These issues also exist in the new GPT-4 models, as described in their technical report [GPT-4 Technical Report]. The authors also warn against over-reliance on the models, as these "made up facts" can appear interleaved with actual facts, which can lead a person to believe them.

These models can also exhibit social biases, arising from the training data, which can cause harm in sensitive use cases [RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models]. There is ongoing research on understanding and removing these potential biases [Red Teaming Language Models with Language Models]. There are also gaps in studies of LLM ethics across languages: model bias is not well studied in low-resource languages [Fairness in Language Models Beyond English: Gaps and Challenges]. Research on the biases, issues, and potential harms of LLMs is ongoing, and using them carries risks similar to those faced by early adopters of any new technology.

Conclusion

LLMs are useful for a variety of tasks. We've all seen examples of ChatGPT generating convincing stories, producing lists of ideas, or answering questions you would usually do a web search for. You can generate summaries of long texts or translate text into various languages. For all of these different tasks, LLMs can be used directly for inference even if the model was not trained explicitly on the inference task, though this may require some prompt engineering. They can also be finetuned to perform a specific task using appropriate datasets.

We, at CloudAEye, are also working on exciting products using generative models and LLMs. Feel free to ping us if you are curious.

Curious about AIOps?

Did you know that CloudAEye offers the most advanced AIOps solution for AWS Lambda? Request a free demo today!