Vertex AI Guide: Deploying Llama 2 on Google Cloud

Vertex AI is the umbrella term for everything related to AI solutions and applications in the Google Cloud. In line with the strong demand for accessible machine learning applications and tools - cynics might call it a "hype" - Google Cloud offers Vertex AI, a service for building your own models, browsing ready-made models, and putting them to work in your applications.

Similar to SageMaker on AWS, Vertex AI is designed to support users throughout the entire machine learning (ML) workflow, especially when training, deploying and, increasingly important, managing their models. Vertex AI offers sophisticated MLOps features, including alerting and experiment tracking. The platform also provides ready-made models and tools, such as image recognition, which can be trained on your own data. For example, a company can train a chatbot on its internal documentation so that it performs actions precisely and helps users resolve their problems faster.

Which brings us to the topic of this article: Llama 2 is exactly the kind of model that can power such a chatbot under the hood.

Using Llama 2 for applications - a how-to

If you want to run Llama 2 behind your application, the process is pleasantly simple. For in-demand production models, Vertex AI includes a collection of out-of-the-box models called the Model Garden. This collection contains turnkey models from Google, the open-source community and third-party vendors.

You will notice that Llama 2 is already listed there, i.e. it is available in the Model Garden. It is correspondingly easy to provision an endpoint that serves Llama 2 predictions. I assume that you already have a Google Cloud project with a billing account attached. Open that project and follow these steps:

Step 1

First, you should enable the Compute Engine API, because running a model requires a virtual machine, and the Compute Engine API is responsible for virtual machines and related services in the Google Cloud.

To do this, open the APIs & Services menu of the platform, click the Enable APIs and Services button, and search for the Compute Engine API. Once you have found it, click the blue "Enable" button.

[Screenshots: enabling the Compute Engine API in the Google Cloud Console]

Step 2

As soon as the Compute Engine API is enabled, you can open the Model Garden. You will find it as a sub-item of Vertex AI in the navigation ("hamburger") menu of the Google Cloud Console.

[Screenshot: navigating to the Model Garden in the Google Cloud Console]

From there, you can search for Llama 2 among the models and solutions if you do not see it directly among the tiles, as shown in the screenshot.

[Screenshot: searching for Llama 2 in the Model Garden]

Open the model details, and you can deploy the model from there.

[Screenshot: Llama 2 model details in the Model Garden]

Step 3

The deployment pane shows you the configuration of the virtual machine. Under Machine type you will see g2-standard-96. This matters because the selected virtual machine determines the cost.

[Screenshot: machine-type configuration in the deployment pane]

A look at the pricing information for the machine type "g2" in the "standard" version with configuration level "96" reveals that running this machine will cost you around $10 per hour, or about $240 per day.
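You can check the arithmetic yourself. A quick sketch (the $10/hour figure is an approximation of the list price and varies by region):

```python
# Rough cost estimate for keeping a g2-standard-96 machine running.
# The hourly rate is an approximation; actual prices depend on the region.
HOURLY_RATE_USD = 10.0

def running_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Return the cost in USD of keeping the machine up for `hours` hours."""
    return hours * hourly_rate

print(f"One day:   ${running_cost(24):,.2f}")        # $240.00
print(f"One week:  ${running_cost(24 * 7):,.2f}")    # $1,680.00
print(f"One month: ${running_cost(24 * 30):,.2f}")   # $7,200.00
```

A forgotten endpoint therefore adds up to thousands of dollars per month, which is why the warnings below are worth taking seriously.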

These costs will be charged to your credit card for as long as the machine is running.

So be careful with what you deploy, and make sure you know how to shut the machine down again. If you get stuck with the steps described here, you can always delete the Google Cloud project and all associated resources. Google's chat support is also available if you have already deployed a model and are unsure how to shut it down.

[Screenshot: pricing for the g2-standard-96 machine type]

Step 4

As you have seen, the process is very accessible. Select Llama 2 from the list and follow the deployment steps (you may need to enable the Vertex AI API along the way). What happens next: a copy of the model is placed in your own Vertex AI environment, the Model Registry - more on that later - and the model is deployed together with a machine to an endpoint on the internet. You can inspect this endpoint in the Vertex AI interface under "Online Prediction".
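Once the endpoint exists, you can also query it programmatically, for example with the google-cloud-aiplatform Python SDK. The following is only a sketch: the project, region, endpoint ID and the exact instance fields (prompt, max_tokens, temperature) are assumptions you must adapt to your own deployment - check the model card of your Llama 2 deployment for the schema it actually expects.

```python
def build_instances(prompt, max_tokens=256, temperature=0.7):
    """Build the request payload. The field names are an assumption -
    check your deployment's model card for the exact schema."""
    return [{"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}]

def query_llama2(endpoint_id, prompt, project, location="us-central1"):
    """Send one online prediction request to a deployed endpoint.
    Requires `pip install google-cloud-aiplatform` and credentials."""
    from google.cloud import aiplatform  # imported lazily so the sketch stays self-contained
    aiplatform.init(project=project, location=location)
    endpoint = aiplatform.Endpoint(endpoint_id)
    return endpoint.predict(instances=build_instances(prompt)).predictions

# Example call (needs a real project and endpoint ID):
# query_llama2("1234567890", "What is Vertex AI?", project="my-project")
```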

During the process, a virtual machine is started in the Google Cloud to host the model.

Please keep an eye on the costs here: the machine keeps running the whole time, and the fees mentioned above are charged for the entire period.
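To stop the charges, undeploy the model and delete the endpoint. A minimal sketch with the Python SDK (the endpoint ID and project in the example usage are placeholders); the same can be done by hand in the console under Online Prediction:

```python
def shutdown_endpoint(endpoint):
    """Undeploy all models from the endpoint, then delete it.
    Intended for a google.cloud.aiplatform.Endpoint instance."""
    endpoint.undeploy_all()  # removes the deployed model(s) and frees the machine
    endpoint.delete()        # deletes the (now empty) endpoint resource

# Example usage (requires google-cloud-aiplatform and credentials):
# from google.cloud import aiplatform
# aiplatform.init(project="my-project", location="us-central1")
# shutdown_endpoint(aiplatform.Endpoint("1234567890"))
```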

Deploying AI models in the Google Cloud with Vertex AI

Regardless of whether you want to use a ready-made model - as in the previous section - or deploy your own models, it is worth digging deeper. The basis is Vertex AI's training and deployment tooling, which comes in flavors that differ in their degree of freedom:

  • AutoML lets you bring your own data for specific tasks, such as image recognition, and train on it. No coding or special data science knowledge is required.
  • At the other extreme is Custom Training, where you do everything yourself: from the training configuration to hyperparameter tuning, everything is in your hands.
  • In between sits the Model Garden, where you select models from Google, the open-source community or third parties and deploy them directly to an endpoint in your Vertex AI environment.

If you're unsure which option is best for you (or possibly your business), Google provides an overview to help you decide.

Vertex AI deployment process - a how-to

With this knowledge, you can now tackle the next step: deploying your own model in the Google Cloud via Vertex AI. The following graphic illustrates the individual steps; the Google Cloud services involved are marked with the Google Cloud logo.

[Graphic: Vertex AI deployment process]

In general, Vertex AI works as follows for your models:

Step 1

Roughly speaking, you first upload your model artifacts, usually packaged as a Python distribution, to Google Cloud Storage, the object storage solution of the Google Cloud. Once that is done, you can register the model in the Vertex AI Model Registry. You have now brought your first model to the Google Cloud platform.
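With the google-cloud-aiplatform SDK, this step could look roughly like the sketch below. The bucket name, display name and serving container image in the example usage are placeholders, not fixed values; the prediction container itself is discussed in step 3.

```python
def make_artifact_uri(bucket, path):
    """Build the Cloud Storage URI under which the model artifacts live."""
    return f"gs://{bucket}/{path.strip('/')}"

def register_model(display_name, artifact_uri, serving_image, project, location="us-central1"):
    """Upload a model to the Vertex AI Model Registry. Requires
    `pip install google-cloud-aiplatform` and credentials."""
    from google.cloud import aiplatform  # imported lazily so the sketch stays self-contained
    aiplatform.init(project=project, location=location)
    return aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,                   # where the artifacts sit in Cloud Storage
        serving_container_image_uri=serving_image,   # the prediction container
    )

# Example usage (all IDs are placeholders):
# register_model("my-model",
#                make_artifact_uri("my-bucket", "models/v1"),
#                "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest",
#                project="my-project")
```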

Step 2

Once your model appears in the Vertex AI Model Registry, you can select the compute resources you want to train it with - and train it. You can either bring your data along, for example as part of the model package, or manage it in storage services on the Google Cloud. For the latter, it makes sense to connect a BigQuery table to Vertex AI as a Feature Store resource and use that data in the training process. The exact training procedure depends on your model type.

Step 3

You have now registered a model in Vertex AI. For the final flourish, you probably want to generate - or rather serve - predictions. There are batch and "online" (live) predictions. Batch predictions are executed much like a training job: you need your model and have to rent compute resources to produce the batch results. For generative AI models like Llama 2, online predictions are clearly the more relevant option. To serve online predictions, you also need a prediction container. This comes ready-made with Google's AutoML models and the models from the Model Garden.

However, if you want predictions from your own custom model, you have to provide the prediction container yourself. It must meet certain requirements: in particular, the endpoint that Vertex AI provisions for you accepts only a fixed schema for requests and responses. In other words, your prediction container may only receive and produce input and output data in exactly this format.
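The envelope of that contract is plain JSON: requests wrap the payload in an "instances" array (plus an optional "parameters" object), and responses wrap the results in a "predictions" array. A minimal sketch of both sides - the fields inside the instances and predictions (prompt, generated_text) are illustrative assumptions, only the envelope is fixed:

```python
import json

def make_request(instances, parameters=None):
    """Serialize a request body in the shape Vertex AI sends to the container."""
    body = {"instances": instances}
    if parameters is not None:
        body["parameters"] = parameters
    return json.dumps(body)

def make_response(predictions):
    """Serialize the response body a prediction container must return."""
    return json.dumps({"predictions": predictions})

print(make_request([{"prompt": "Hello"}], {"temperature": 0.7}))
print(make_response([{"generated_text": "Hi there!"}]))
```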

Step 4

Once your prediction container is built, you are ready to put it on Vertex AI. This is where you provision your first cost-intensive service; everything used so far has been very inexpensive, if not free.

For example: the Cloud Storage bucket holding your model only costs around 2 cents per GB per month. Your model will probably not be large enough for that to get really expensive.

To serve your prediction model, however, you must create a Deployment Resource Pool. That is a long word for one - or several - machines that host your prediction container. Depending on the complexity of your model and the number of requests it receives, this machine should be sized accordingly. You may also want to add an accelerator, which means renting a graphics card for the machine.
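In the Python SDK, this sizing corresponds to the arguments of Model.deploy. A hedged sketch - the machine type, replica counts and accelerator in the example are illustrative values you should size to your own traffic:

```python
def deploy_kwargs(machine_type="n1-standard-4", min_replicas=1, max_replicas=1,
                  accelerator_type=None, accelerator_count=0):
    """Collect deployment arguments for google.cloud.aiplatform Model.deploy()."""
    kwargs = {
        "machine_type": machine_type,
        "min_replica_count": min_replicas,  # machines that run permanently
        "max_replica_count": max_replicas,  # upper bound for autoscaling
    }
    if accelerator_type:  # e.g. "NVIDIA_TESLA_T4" if the model needs a GPU
        kwargs["accelerator_type"] = accelerator_type
        kwargs["accelerator_count"] = accelerator_count or 1
    return kwargs

# Example usage (requires google-cloud-aiplatform, a registered model and credentials):
# from google.cloud import aiplatform
# model = aiplatform.Model("projects/my-project/locations/us-central1/models/123")
# endpoint = model.deploy(**deploy_kwargs("n1-standard-4"))
# ...and when you are done, to stop the charges:
# endpoint.undeploy_all(); endpoint.delete()
```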

What is important here: depending on your scaling settings, this machine runs in the background the whole time.

So if you follow this tutorial, please do not forget to delete the Deployment Resource Pool.

Tip: choose an inexpensive machine to start with. That is by far the safest way to keep costs under control.

Summary - Vertex AI Guide

You now have a better understanding of Vertex AI in the Google Cloud. You know how to use the platform's Model Garden to spin up a ready-made prediction endpoint with just a few clicks, and you know that the Google Cloud distinguishes different model types, ranging from AutoML to custom models.

Have fun with the implementation!

Do you have questions about deploying Llama 2 in the Google Cloud with Vertex AI? Write me a message:
