From the course: Introduction to LLM Vulnerabilities

How are LLMs created

- [Instructor] I want to show you a little bit, and this might be confusing if you've never seen anything related to machine learning, about large language models and how they are created. Now this is very foundational, it's a very high-level overview, and I want to start here with Wikipedia. The reason I want to start with Wikipedia is to show you that you're not going to get a straight answer. So let's take a look at what we're dealing with here. It says a large language model is a language model notable for its ability, and so on. So great, but it's not really telling me how one would go ahead and create one. Now usually you don't, right? So let me make that very clear. You don't go ahead and say, "Hey, I'm going to build a large language model." That's not very common. So that's useful to know. Essentially, what you're doing is getting data, using that data to train the model, and then having the model produce an output. Now Wikipedia will always be very wide and broad and might not get into enough detail. But basically you're dealing with a dataset, that is, the data source, and you'll have some sort of initial place where you're going to get started. And that data will have to go through some processing, perhaps in this case, removal of toxic passages. Imagine grabbing every single text available on the internet. You're probably going to get a lot of toxicity there. So you'll want some cleaning, and you'll want the data to be exactly what you need. So say, for example, you are in the medical field and you want to collect only medical data, then good. And you want to make sure, for example, that the data is anonymized and that no personally identifiable information is present. So there are a lot of things related to data. And then on the training side and how exactly it works, you'll have different types of training, and these approaches will allow you to get to an actual large language model.
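To make the cleaning step a bit more concrete, here is a minimal sketch in Python of dropping toxic passages and redacting personally identifiable information before training. Everything in it is an assumption for illustration: the `BLOCKLIST` words, the two regular expressions, and the function names are hypothetical, and a real pipeline would likely rely on trained toxicity classifiers and much more thorough PII detection.

```python
import re

# Hypothetical blocklist, for illustration only. Real pipelines typically
# use a trained toxicity classifier rather than a fixed word list.
BLOCKLIST = {"idiot", "stupid"}

# Simplified patterns (assumptions): they catch common email and
# US-style phone formats, not every possible form of PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def is_toxic(passage: str) -> bool:
    """Return True if the passage contains any blocklisted word."""
    words = set(re.findall(r"[a-z']+", passage.lower()))
    return not words.isdisjoint(BLOCKLIST)


def redact_pii(passage: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    passage = EMAIL_RE.sub("[EMAIL]", passage)
    return PHONE_RE.sub("[PHONE]", passage)


def clean_corpus(passages):
    """Drop toxic passages, then redact PII in the ones that remain."""
    return [redact_pii(p) for p in passages if not is_toxic(p)]


raw = [
    "The patient reported mild headaches after treatment.",
    "You are an idiot for asking that.",
    "Contact the nurse at jane.doe@example.com or 555-123-4567.",
]
cleaned = clean_corpus(raw)
for passage in cleaned:
    print(passage)
```

The toxic passage is dropped entirely, while the passage with contact details is kept with the email and phone number replaced by placeholders.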
So I want to move on now to a slightly different description, this time based on Microsoft Learn: "A large language model, LLM, is a type of AI that can process and produce natural language text. It learns from a massive amount of data gathered from sources like books, articles, web pages, and images to discover patterns and rules of language." So this is the actual part that I want to mention here, on how exactly that is accomplished. It's based on enormous amounts of text, and this is why books, articles, and web pages are listed here, and in some cases images when we're talking about generative AI with images, to discover those patterns and rules of language. And that training is done for a model. Now I am not going to actually create a large language model, but I want to use a cloud service here. In this case, it's going to be Azure AI Machine Learning Studio, which is a cloud service from Azure that allows you to do some training. So I'm not going to go ahead and create anything, but I'm going to show you what this would look like when you're actually trying to train machine learning models. So in this case, we're going to start here by using this automated ML job. I know I haven't gotten into specifics on what cloud platforms are, what we're doing there, and how to get here, but let's just focus a little bit on the concepts. So what you have is basically the training method. You're going to have some settings, the type of task and the data, and then you're going to have some sort of compute system. And then you're going to start this job, this training. So let's see what we have here. We're going to do this training automatically. We're going to create an automated machine learning job. That means that we're going to let the machine decide when the model is good enough and good to go.
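The idea of letting the machine decide when a model is good enough can be sketched as a loop that scores candidate models and stops once one reaches a target. This is a toy illustration under stated assumptions, not how Azure's automated ML works internally: `JobSettings`, `run_automated_job`, the candidate rules, and the `target_score` threshold are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class JobSettings:
    """Hypothetical job settings mirroring the concepts in the walkthrough."""
    name: str            # name of the job
    task_type: str       # e.g. "classification"
    compute: str         # where the training runs
    target_score: float  # stop once a candidate reaches this accuracy


def accuracy(predict, rows):
    """Fraction of (input, label) rows that the candidate predicts correctly."""
    hits = sum(1 for x, label in rows if predict(x) == label)
    return hits / len(rows)


def run_automated_job(settings, candidates, validation_rows):
    """Try candidates in order; stop early once one is good enough."""
    best_name, best_score = None, -1.0
    for name, predict in candidates:
        score = accuracy(predict, validation_rows)
        if score > best_score:
            best_name, best_score = name, score
        if best_score >= settings.target_score:
            break  # good enough, no need to try the remaining candidates
    return best_name, best_score


# Toy validation data: the label is 1 when the input is 5 or more.
rows = [(1, 0), (3, 0), (4, 0), (5, 1), (7, 1), (9, 1)]

# Toy "models": simple rules standing in for trained candidates.
candidates = [
    ("always_zero", lambda x: 0),
    ("threshold_8", lambda x: int(x >= 8)),
    ("threshold_5", lambda x: int(x >= 5)),
    ("threshold_2", lambda x: int(x >= 2)),
]

settings = JobSettings(
    name="demo-job",
    task_type="classification",
    compute="local-cpu",
    target_score=0.95,
)
best_name, best_score = run_automated_job(settings, candidates, rows)
print(best_name, best_score)
```

In this toy run the loop stops at the third candidate, because it already meets the target, and never evaluates the fourth.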
So let's go into basic settings. In here, we would actually set the name of our job, what we're going to do. And then the task type and data, this is where I want to focus. You are always going to have to understand what type of training you're going to do. So if I click here on this dropdown, you'll see that there are different types of training: classification, regression, time series forecasting. So for example, classification would allow you to predict either yes or no, blue, red, green, one, zero, on or off. Imagine you want to create a machine learning model that allows you to understand whether a store is open given certain text. So you have several different pieces of data that allow you to define when it's open and when it's closed, and you would train based on that data. Now for machine learning models based on, or creating, large language models, you will be doing natural language processing, or NLP, which predicts based on text-only data types using multi-class or multi-label classification. Now these are big words and big terms. Essentially what it means is you're going to be using a lot of plain text to train your model so that it understands what to do with the language. So once you decide, then you're going to choose what type of classification you want. You're going to have multi-class classification, predict the correct class for each sample; multi-label classification, predict all the classes for each sample; or named entity recognition, so extracting domain-specific entities from unstructured text. So there are a lot of things going on here, but at the end of the day, that is how you would set up something that would allow you to create something like a large language model. This is not exactly it. Note that I'm just giving you an example of how you would create something like a machine learning model. For a large language model, it would be much more involved. It would definitely start here with the data.
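The store example can be sketched as a tiny text classifier. The Python below is a minimal, assumption-laden illustration: it uses a small hand-written Naive Bayes model with made-up training sentences, and `TinyNaiveBayes` and `tokenize` are hypothetical names. It is not what the cloud service does internally, and it is far from how a large language model is trained, but it shows the idea of learning from labeled text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes text classifier (illustrative only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of examples
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        best_label, best_logp = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Start from the prior probability of the label.
            logp = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                # Add-one (Laplace) smoothing so unseen words do not zero out.
                count = self.word_counts[label][tok] + 1
                logp += math.log(count / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Made-up training data: text about a store, labeled open or closed.
texts = [
    "we are open until nine tonight",
    "doors open at eight come on in",
    "open for business all day",
    "sorry we are closed for the holiday",
    "closed today due to weather",
    "the shop is closed until monday",
]
labels = ["open", "open", "open", "closed", "closed", "closed"]

model = TinyNaiveBayes().fit(texts, labels)
print(model.predict("come in we are open today"))
print(model.predict("closed for the holiday weekend"))
```

With only two possible labels this is binary classification. Multi-class classification follows the same pattern with more labels, while multi-label classification and named entity recognition need different model outputs than this sketch provides.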
I don't have enormous amounts of text data to train these, so this would actually not be a good use case. The right way would be to assemble a whole team of data scientists who understand deep machine learning concepts like neural networks, how they work, and how to manipulate data in a way that lets them actually create these models, and who have access to several GPUs and the infrastructure to train on. But this should give you a good idea of how that might work and what other steps might be involved. And again, you will hear me mention data a lot as we make progress in the course, because we need an enormous amount of data that is not only large, but also clean, as we saw in the previous descriptions.
