Welcome to Decoding ML 👋🏼
The one place where we will help you decode complex topics about ML & MLOps one week at a time 🔥
Hello everyone, Alex V. here 👋
One of the new DML team members. Excited to write my first article. In case you haven’t read the latest rebranding article, here are a few words about me:
I started my career in software engineering in 2015, specializing in Python and AI technologies, and rapidly advanced to take on challenging AI projects. As a CTO in various startups, I led teams in creating innovative software across the healthcare and automotive sectors, implementing AI systems to boost efficiency.
To read more about Decoding ML’s new rebranding plans, check out our latest “New year, the new & improved Decoding ML - What to expect?” article, where Paul explained in detail the new bright future of Decoding ML.
Today, I’m breaking the ice with my first DML article, practically showing you how to deploy Private LLMs with AWS SageMaker.
OpenAI has been a pioneer in the LLM space, inspiring the open-source community to develop its own models, such as Llama2, BLOOM, MPT-7B, Falcon, Mistral, and Alpaca.
This guide aims to delve into the interplay between LLMs and cloud technology, underscoring the importance of private and secure AI solutions in today's data-centric world.
Whether you're an AI professional or just curious about the latest in tech, this overview provides valuable insights into this dynamic and rapidly evolving field.
Here’s what we’re going to deep dive into:
In this blog, we dive into the process of deploying the Llama 2 model on Amazon SageMaker.
You'll learn how to leverage the Hugging Face LLM DLC, a specialized container designed for easy and secure deployment of Large Language Models (LLMs). This DLC is equipped with the Text Generation Inference (TGI) system, offering a scalable and optimized environment for LLMs.
Additionally, the blog will guide you through the hardware requirements necessary for various sizes of the Llama 2 model.
See what I have in store for you today. 👇
Table of Contents
1. AWS SageMaker: A Comprehensive Overview
Introduction to AWS SageMaker
Key Features and Capabilities
Importance in AI and Machine Learning
2. Llama2-7B: Unveiling the Model
Understanding Llama2-7B: What It Is and Why It Matters
Capabilities and Use Cases of Llama2-7B
3. Deploying Llama2-7B on AWS SageMaker Using HuggingFace: A Hands-On Experience
Preparation and Requirements for Deployment
Step-by-Step Guide to Deploying LLama2-7B
Best Practices and Troubleshooting Tips
4. Inference with LLama2-7B on AWS SageMaker Using boto3
Setting Up for Inference: Pre-requisites and Configuration
Executing Inference: Methods and Strategies
Analyzing and Utilizing Inference Results
5. Clean up SageMaker resources
1. AWS SageMaker - A Comprehensive Overview
SageMaker offers a comprehensive, managed service for ML model development, training, and deployment, optimized for scalability and security. It simplifies fine-tuning with pre-configured algorithms and supports various ML frameworks, easing deployment. A standout feature is its auto-scaling, adapting resources to demand changes, ensuring efficiency and responsiveness.
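To make the auto-scaling feature concrete: SageMaker endpoints scale per production variant through the Application Auto Scaling service. The snippet below is a minimal sketch (not a step in this guide) that attaches a target-tracking policy to an already-deployed endpoint; the endpoint name, capacity bounds, and target value are placeholder assumptions.
import boto3

# Hypothetical endpoint name -- replace with the endpoint you deploy later in this guide.
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-example"

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint's production variant as a scalable target (1 to 3 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=3,
)

# Scale in/out based on the number of invocations per instance.
autoscaling.put_scaling_policy(
    PolicyName="llm-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)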
2. Llama2-7B: Unveiling the Model
Llama2, succeeding the original Llama, is trained on 2T tokens and handles up to 4K token contexts. Enhanced by Reinforcement Learning and over 1 million human annotations, it comes in two versions: Llama2-7B-chat, for AI conversations, and Llama2-7B, a general-purpose model for various language tasks.
Hardware requirements
As a rough guide, the 7B variant fits on a single-GPU instance such as ml.g5.2xlarge (the instance type used in this guide), while the larger 13B and 70B variants require multi-GPU instances (e.g., ml.g5.12xlarge and ml.p4d.24xlarge, respectively).
3. Deploying Llama2-7B on AWS SageMaker Using HuggingFace: A Hands-On Experience
Step 1: Create settings.py with Pydantic
Install Dependencies: First, ensure you have pydantic and python-dotenv installed in your Python environment. You can install these packages using pip:
pip install pydantic python-dotenv
Create the settings.py File: In your project directory, create a file named settings.py.
Set Up the Code: Inside settings.py, write the following code. All the variables will be explained later in the deployment process.
import os

from dotenv import load_dotenv
from pydantic import BaseSettings  # Pydantic v1. On Pydantic v2, install pydantic-settings and use: from pydantic_settings import BaseSettings

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
load_dotenv(dotenv_path=os.path.join(BASE_DIR, '.env'))


class Settings(BaseSettings):
    SM_NUM_GPUS: int = 1
    MAX_INPUT_LENGTH: int = 2048
    MAX_TOTAL_TOKENS: int = 4096
    MAX_BATCH_TOTAL_TOKENS: int = 8192
    HUGGING_FACE_HUB_TOKEN: str
    HF_MODEL_ID: str = 'meta-llama/Llama-2-7b-chat-hf'
    GPU_INSTANCE_TYPE: str = "ml.g5.2xlarge"
    TEMPERATURE: float = 0.8
    TOP_P: float = 0.9
    MAX_NEW_TOKENS: int = 200
    RETURN_FULL_TEXT: bool = False
    DYNAMO_TABLE: str = ""  # Optional; not used in this guide

    class Config:
        env_file = os.path.join(os.path.dirname(__file__), '.env')


settings = Settings()
Create a .env File: Create a .env file in the same directory as your settings.py. Inside the .env file, define the values for the environment variables:
#Deploy Parameters
SM_NUM_GPUS=1
MAX_INPUT_LENGTH=2048
MAX_TOTAL_TOKENS=4096
MAX_BATCH_TOTAL_TOKENS=8192
HUGGING_FACE_HUB_TOKEN=<your_hugging_face_hub_token>
HF_MODEL_ID=meta-llama/Llama-2-7b-chat-hf
GPU_INSTANCE_TYPE="ml.g5.2xlarge"
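To confirm the settings load as expected, a quick sanity check (assuming settings.py and .env live in the same directory, as set up above) is:
from settings import settings

print(settings.HF_MODEL_ID)        # meta-llama/Llama-2-7b-chat-hf
print(settings.GPU_INSTANCE_TYPE)  # ml.g5.2xlarge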
Step 2: Define a deployment configuration
import json

from settings import settings
deploy_config = {
'HF_MODEL_ID': settings.HF_MODEL_ID,
'SM_NUM_GPUS': json.dumps(settings.SM_NUM_GPUS), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(settings.MAX_INPUT_LENGTH), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(settings.MAX_TOTAL_TOKENS), # Max length of the generation (including input text)
'MAX_BATCH_TOTAL_TOKENS': json.dumps(settings.MAX_BATCH_TOTAL_TOKENS),
'HUGGING_FACE_HUB_TOKEN': settings.HUGGING_FACE_HUB_TOKEN
}
Step 3: AWS Role ARN
An AWS Role ARN (Amazon Resource Name) is a unique identifier assigned to an AWS Identity and Access Management (IAM) role. In AWS, an IAM role is an entity within the service that defines a set of permissions for making AWS service requests. Roles are similar to users, but unlike users, they are not associated with a specific person.
import logging

import boto3


def get_role_arn(role_name: str):
    """
    The get_role_arn function takes a role name as input and returns the ARN for that role.

    :param role_name: str: Specify the name of the role that we want to retrieve
    :return: The ARN of the role
    """
    iam = boto3.client('iam')
    try:
        role_details = iam.get_role(RoleName=role_name)
        logging.info(f"Retrieved role ARN for {role_name}")
        return role_details['Role']['Arn']
    except Exception as e:
        logging.error(f"Error obtaining role ARN: {e}")
        raise
Step 4: Deploy the Hugging Face Model (Llama2-7b-chat) on AWS SageMaker
Retrieve the new Hugging Face LLM DLC
""" | |
1. Initializing the Hugging Face Model: We create an instance of HuggingFaceModel from SageMaker's Python SDK. This object requires the role ARN, the image URI of the model (which we can obtain from Hugging Face's model hub), and the environment configuration we defined earlier. | |
2. Deploying the Model: Next, we call the deploy method on our llm_model object. Here, we specify the number of instances (initial_instance_count), the type of instance to deploy on (instance_type), and a timeout for the container startup health check. This timeout is crucial as larger models might take more time to load. | |
""" | |
from sagemaker.huggingface import HuggingFaceModel | |
from sagemaker.huggingface import get_huggingface_llm_image_uri | |
def deploy_huggingface_model(role_arn: str, llm_image: str, config: dict) -> None: | |
try: | |
llm_model = HuggingFaceModel(role=role_arn, image_uri=llm_image, env=config) | |
deployment = llm_model.deploy( | |
initial_instance_count=1, | |
instance_type=settings.GPU_INSTANCE_TYPE, | |
container_startup_health_check_timeout=300 # 10 minutes to load the model | |
) | |
logging.info("Successfully deployed HuggingFace model.") | |
return deployment | |
except Exception as e: | |
logging.error(f"Error deploying HuggingFace model: {e}") | |
raise | |
Step 5: Run the deploy.py script
role_arn = get_role_arn('your-role-name')  # the name of your SageMaker execution role
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")
deploy_huggingface_model(role_arn, llm_image, deploy_config)
SageMaker will now create our endpoint and deploy the model to it. This takes 10-15 minutes.
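If you prefer to monitor the deployment from code rather than the console, here is a small sketch using boto3 (the endpoint name is a placeholder; use the one SageMaker creates for you):
import boto3

sagemaker_client = boto3.client("sagemaker")

# Replace with your endpoint name (visible in the SageMaker console after deployment).
endpoint_name = "your-endpoint-name"

# Block until the endpoint reaches the InService status.
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")  # Expected: InService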
4. Run Inference with the model
Before running the inference, let’s analyze the most important parameters that affect the output of the model:
temperature: Controls randomness in the model. Lower values make the model more deterministic, while higher values make it more random. The default value is 1.0.
max_new_tokens: The maximum number of tokens to generate. The default value is 20, and the max value is 512.
repetition_penalty: Controls the likelihood of repetition. Defaults to null.
top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Defaults to null.
return_full_text: Whether to return the full text or only the generated part. The default value is false.
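Put together, these parameters travel in the request body alongside the prompt. A minimal example of the JSON payload the TGI endpoint expects (the values here are illustrative, not recommendations) looks like this:
payload = {
    "inputs": "Explain MLOps in one sentence.",
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.9,
        "max_new_tokens": 200,
        "repetition_penalty": 1.1,
        "return_full_text": False
    }
}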
Create the inference.py script
import json
import logging

import boto3

from settings import settings


class Inference:
    """
    A class to handle inference requests for a Large Language Model using AWS SageMaker.
    """

    def __init__(self, endpoint_name, initial_payload=None):
        """
        Initializes the Inference object.

        :param endpoint_name: The default endpoint name for the SageMaker inference.
        :param initial_payload: The default payload for inference requests. If None, a default payload is used.
        """
        self.sagemaker_client = boto3.client("sagemaker-runtime")
        self.default_endpoint = endpoint_name
        self.inference_payload = initial_payload if initial_payload else self._create_default_payload()
        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    def _create_default_payload(self):
        """
        Creates a default payload for inference if none is provided during initialization.
        """
        return {
            "inputs": "How is the weather?",
            "parameters": {
                "max_new_tokens": settings.MAX_NEW_TOKENS,
                "top_p": settings.TOP_P,
                "temperature": settings.TEMPERATURE,
                "return_full_text": False
            }
        }

    def update_payload(self, inputs, additional_parameters=None):
        """
        Updates the inference payload with new inputs and parameters.

        :param inputs: The new input text for the model.
        :param additional_parameters: Additional parameters to control the model's response. Optional.
        """
        self.inference_payload['inputs'] = inputs
        if additional_parameters:
            self.inference_payload['parameters'].update(additional_parameters)

    def run_inference(self, specific_endpoint=None):
        """
        Sends an inference request to the specified or default SageMaker endpoint.

        :param specific_endpoint: The name of a specific SageMaker endpoint to use for inference. Optional.
        :return: The response from the model.
        """
        endpoint_to_use = specific_endpoint if specific_endpoint else self.default_endpoint
        try:
            logging.info(f"Inference request sent with parameters: {self.inference_payload['parameters']}")
            response = self.sagemaker_client.invoke_endpoint(
                EndpointName=endpoint_to_use,
                ContentType="application/json",
                Body=json.dumps(self.inference_payload),
            )
            response_body = response["Body"].read().decode("utf8")
            return json.loads(response_body)
        except Exception as e:
            logging.error(f"An error occurred during inference: {e}")
            raise
Initialization: An Inference class is created with a SageMaker endpoint and optional initial payload. Without a payload, a default is generated.
Default Payload: A method _create_default_payload forms a basic payload with a sample input and default settings like max_new_tokens and temperature.
Payload Update: The update_payload method modifies the input and parameters for various inference needs.
Running Inference: run_inference executes the inference, using a specific or default endpoint. It logs requests, sends the payload to SageMaker, and retrieves responses.
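As a quick usage sketch of the class (the endpoint name below is a placeholder; use the one SageMaker created for you):
inference_engine = Inference(endpoint_name="your-endpoint-name")
inference_engine.update_payload(inputs="What is MLOps?")
response = inference_engine.run_inference()
print(response[0]["generated_text"])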
Chat with your model
The meta-llama/Llama-2-7b-chat-hf model we deployed is a conversational chat model, meaning you can chat with it using the following prompt format:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
We create a small helper method build_llama2_prompt, which converts a list of "messages" into the prompt format. We also define a system_prompt, which is used to start the conversation. You will use the system_prompt to ask the model about MLOps.
Note that <s> and </s> are special tokens for the beginning of the string (BOS) and end of the string (EOS), while [INST] and [/INST] are regular strings.
def build_llama2_prompt(messages):
    startPrompt = "<s>[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")
    return startPrompt + "".join(conversation) + endPrompt
messages = [
    {"role": "system", "content": "You are a knowledgeable MLOps engineer named Dobby. Your goal is to have natural conversations with users to help them understand MLOps concepts."}
]
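To make the template concrete, here is what build_llama2_prompt produces for this system message plus one hypothetical user turn (the rendered prompt is shown as comments):
example_messages = messages + [{"role": "user", "content": "What is MLOps?"}]
print(build_llama2_prompt(example_messages))
# <s>[INST] <<SYS>>
# You are a knowledgeable MLOps engineer named Dobby. ...
# <</SYS>>
#
# What is MLOps? [/INST]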
Let’s see if Dobby knows how Sagemaker helps the MLOps community.
default_endpoint = 'huggingface-pytorch-tgi-inference-2024-01-11-23-42-3-211'  # replace with your endpoint name

instruction = "How does AWS SageMaker help the MLOps community? Be short and concise."
messages.append({"role": "user", "content": instruction})
prompt = build_llama2_prompt(messages)

inference_engine = Inference(endpoint_name=default_endpoint)
inference_engine.update_payload(
    inputs=prompt,
    additional_parameters={
        "max_new_tokens": 400,
        "top_p": 0.9,
        "temperature": 0.8
    }
)
results = inference_engine.run_inference()[0]['generated_text']
5. Clean up SageMaker resources
It’s very important to clean up the Sagemaker resources if you don’t want to pay unnecessary bills.
Create delete.sh bash script
#!/bin/bash

# Show the endpoints that are about to be deleted.
aws sagemaker list-endpoints --query "Endpoints[*].[EndpointName]" --output text

# Delete every endpoint in the current account/region.
for endpoint in $(aws sagemaker list-endpoints --query "Endpoints[*].[EndpointName]" --output text); do
    echo "Deleting endpoint: $endpoint"
    aws sagemaker delete-endpoint --endpoint-name "$endpoint"
done
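Deleting the endpoints stops the billing, but the associated endpoint configurations and models remain in your account. If you want a fully clean slate, you can extend the script along these lines (same CLI pattern, applied to the other resource types):
# Delete all endpoint configurations.
for config in $(aws sagemaker list-endpoint-configs --query "EndpointConfigs[*].[EndpointConfigName]" --output text); do
    echo "Deleting endpoint config: $config"
    aws sagemaker delete-endpoint-config --endpoint-config-name "$config"
done

# Delete all models.
for model in $(aws sagemaker list-models --query "Models[*].[ModelName]" --output text); do
    echo "Deleting model: $model"
    aws sagemaker delete-model --model-name "$model"
done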
Conclusion
In summary, leveraging AWS SageMaker in conjunction with Large Language Models (LLMs) significantly accelerates the process of experimentation and prototyping, enabling a more efficient and rapid transition to the production stage. This integration streamlines workflows and reduces development time, facilitating quicker deployment and realization of AI-driven solutions.
That’s it for today 👾
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
Signed: Alex V. from the Decoding ML team
Whenever you’re ready, here is how we can help you:
→ Follow Decoding ML’s Medium publication for more in-depth, hands-on articles about production-ready ML & MLOps.