Welcome to Decoding ML 👋🏼
The one place where we will help you decode complex topics about ML & MLOps one week at a time 🔥
Hello everyone, Alex V. here 👋
One of the new DML team members. Excited to write my first article. In case you haven’t read the latest rebranding article, here are a few words about me:
I started my career in software engineering in 2015, specializing in Python and AI technologies, and rapidly advanced to take on challenging AI projects. As a CTO in various startups, I led teams in creating innovative software across the healthcare and automotive sectors, implementing AI systems to boost efficiency.
To read more about Decoding ML’s new rebranding plans, check out our latest “New year, the new & improved Decoding ML - What to expect?” article, where Paul explained in detail the new bright future of Decoding ML.
Today, I’m breaking the ice with my first DML article, practically showing you how to deploy Private LLMs with AWS SageMaker.
OpenAI has been a pioneer in the LLM space, inspiring the open-source community to develop its own models, such as Llama2, BLOOM, MPT-7B, Falcon, Mistral, and Alpaca.
This guide aims to delve into the interplay between LLMs and cloud technology, underscoring the importance of private and secure AI solutions in today's data-centric world.
Whether you're an AI professional or just curious about the latest in tech, this overview provides valuable insights into this dynamic and rapidly evolving field.
Here’s what we’re going to deep dive into:
In this blog, we dive into the process of deploying the Llama 2 model on Amazon SageMaker.
You'll learn how to leverage the Hugging Face LLM DLC, a specialized container designed for easy and secure deployment of Large Language Models (LLMs). This DLC is equipped with the Text Generation Inference (TGI) system, offering a scalable and optimized environment for LLMs.
Additionally, the blog will guide you through the hardware requirements necessary for various sizes of the Llama 2 model.
See what I have in store for you today. 👇
Table of Contents
1. AWS SageMaker: A Comprehensive Overview
Introduction to AWS SageMaker
Key Features and Capabilities
Importance in AI and Machine Learning
2. Llama2-7B: Unveiling the Model
Understanding Llama2-7B: What It Is and Why It Matters
Capabilities and Use Cases of Llama2-7B
3. Deploying Llama2-7B on AWS SageMaker Using HuggingFace: A Hands-On Experience
Preparation and Requirements for Deployment
Step-by-Step Guide to Deploying LLama2-7B
Best Practices and Troubleshooting Tips
4. Inference with LLama2-7B on AWS SageMaker Using boto3
Setting Up for Inference: Pre-requisites and Configuration
Executing Inference: Methods and Strategies
Analyzing and Utilizing Inference Results
5. Clean up SageMaker resources
1. AWS SageMaker - A Comprehensive Overview
SageMaker offers a comprehensive, managed service for ML model development, training, and deployment, optimized for scalability and security. It simplifies fine-tuning with pre-configured algorithms and supports various ML frameworks, easing deployment. A standout feature is its auto-scaling, adapting resources to demand changes, ensuring efficiency and responsiveness.
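To make the auto-scaling feature concrete: SageMaker endpoints scale per production variant through the Application Auto Scaling service. The snippet below is a minimal sketch (not a step in this guide) that attaches a target-tracking policy to an already-deployed endpoint; the endpoint name, capacity bounds, and target value are placeholder assumptions.
import boto3

# Hypothetical endpoint name -- replace with the endpoint you deploy later in this guide.
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-example"

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint's production variant as a scalable target (1 to 3 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=3,
)

# Scale in/out based on the number of invocations per instance.
autoscaling.put_scaling_policy(
    PolicyName="llm-invocations-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=f"endpoint/{ENDPOINT_NAME}/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)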
2. Llama2-7B: Unveiling the Model
Llama2, succeeding the original Llama, is trained on 2T tokens and handles up to 4K token contexts. Enhanced by Reinforcement Learning and over 1 million human annotations, it comes in two versions: Llama2-7B-chat, for AI conversations, and Llama2-7B, a general-purpose model for various language tasks.
Hardware requirements
As a rough guide, the 7B variant fits on a single-GPU instance such as ml.g5.2xlarge (the instance type used in this guide), while the larger 13B and 70B variants require multi-GPU instances (e.g., ml.g5.12xlarge and ml.p4d.24xlarge, respectively).
3. Deploying Llama2-7B on AWS SageMaker Using HuggingFace: A Hands-On Experience
Step 1: Create settings.py with Pydantic
Install Dependencies: First, ensure you have pydantic and python-dotenv installed in your Python environment. You can install these packages using pip:
pip install pydantic python-dotenv
Create the settings.py File: In your project directory, create a file named settings.py.
Set Up the Code: Inside settings.py, write the following code. All the variables will be explained later in the deployment process.
import os

from dotenv import load_dotenv
from pydantic import BaseSettings  # Pydantic v1. On Pydantic v2, install pydantic-settings and use: from pydantic_settings import BaseSettings

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
load_dotenv(dotenv_path=os.path.join(BASE_DIR, '.env'))


class Settings(BaseSettings):
    SM_NUM_GPUS: int = 1
    MAX_INPUT_LENGTH: int = 2048
    MAX_TOTAL_TOKENS: int = 4096
    MAX_BATCH_TOTAL_TOKENS: int = 8192
    HUGGING_FACE_HUB_TOKEN: str
    HF_MODEL_ID: str = 'meta-llama/Llama-2-7b-chat-hf'
    GPU_INSTANCE_TYPE: str = "ml.g5.2xlarge"
    TEMPERATURE: float = 0.8
    TOP_P: float = 0.9
    MAX_NEW_TOKENS: int = 200
    RETURN_FULL_TEXT: bool = False
    DYNAMO_TABLE: str = ""  # Optional; not used in this guide

    class Config:
        env_file = os.path.join(os.path.dirname(__file__), '.env')


settings = Settings()
Create a .env File: Create a .env file in the same directory as your settings.py. Inside the .env file, define the values for the environment variables:
#Deploy Parameters
SM_NUM_GPUS=1
MAX_INPUT_LENGTH=2048
MAX_TOTAL_TOKENS=4096
MAX_BATCH_TOTAL_TOKENS=8192
HUGGING_FACE_HUB_TOKEN=<your_hugging_face_hub_token>
HF_MODEL_ID=meta-llama/Llama-2-7b-chat-hf
GPU_INSTANCE_TYPE="ml.g5.2xlarge"
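To confirm the settings load as expected, a quick sanity check (assuming settings.py and .env live in the same directory, as set up above) is:
from settings import settings

print(settings.HF_MODEL_ID)        # meta-llama/Llama-2-7b-chat-hf
print(settings.GPU_INSTANCE_TYPE)  # ml.g5.2xlarge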
Step 2: Define a deployment configuration
import json

from settings import settings
deploy_config = {
'HF_MODEL_ID': settings.HF_MODEL_ID,
'SM_NUM_GPUS': json.dumps(settings.SM_NUM_GPUS), # Number of GPU used per replica
'MAX_INPUT_LENGTH': json.dumps(settings.MAX_INPUT_LENGTH), # Max length of input text
'MAX_TOTAL_TOKENS': json.dumps(settings.MAX_TOTAL_TOKENS), # Max length of the generation (including input text)
'MAX_BATCH_TOTAL_TOKENS': json.dumps(settings.MAX_BATCH_TOTAL_TOKENS),
'HUGGING_FACE_HUB_TOKEN': settings.HUGGING_FACE_HUB_TOKEN
}
Step 3: AWS Role ARN
An AWS Role ARN (Amazon Resource Name) is a unique identifier assigned to an AWS Identity and Access Management (IAM) role. In AWS, an IAM role is an entity within the service that defines a set of permissions for making AWS service requests. Roles are similar to users, but unlike users, they are not associated with a specific person.
import logging

import boto3


def get_role_arn(role_name: str):
    """
    The get_role_arn function takes a role name as input and returns the ARN for that role.

    :param role_name: str: Specify the name of the role that we want to retrieve
    :return: The ARN of the role
    """
    iam = boto3.client('iam')
    try:
        role_details = iam.get_role(RoleName=role_name)
        logging.info(f"Retrieved role ARN for {role_name}")
        return role_details['Role']['Arn']
    except Exception as e:
        logging.error(f"Error obtaining role ARN: {e}")
        raise
Step 4: Deploy the Hugging Face Model (Llama2-7b-chat) on AWS SageMaker
Retrieve the new Hugging Face LLM DLC
""" | |
1. Initializing the Hugging Face Model: We create an instance of HuggingFaceModel from SageMaker's Python SDK. This object requires the role ARN, the image URI of the model (which we can obtain from Hugging Face's model hub), and the environment configuration we defined earlier. | |
2. Deploying the Model: Next, we call the deploy method on our llm_model object. Here, we specify the number of instances (initial_instance_count), the type of instance to deploy on (instance_type), and a timeout for the container startup health check. This timeout is crucial as larger models might take more time to load. | |
""" | |
from sagemaker.huggingface import HuggingFaceModel | |
from sagemaker.huggingface import get_huggingface_llm_image_uri | |
def deploy_huggingface_model(role_arn: str, llm_image: str, config: dict) -> None: | |
try: | |
llm_model = HuggingFaceModel(role=role_arn, image_uri=llm_image, env=config) | |
deployment = llm_model.deploy( | |
initial_instance_count=1, | |
instance_type=settings.GPU_INSTANCE_TYPE, | |
container_startup_health_check_timeout=300 # 10 minutes to load the model | |
) | |
logging.info("Successfully deployed HuggingFace model.") | |
return deployment | |
except Exception as e: | |
logging.error(f"Error deploying HuggingFace model: {e}") | |
raise | |
Step 5: Run the deploy.py script
role_arn = get_role_arn('your-role-name')  # the name of your SageMaker execution role
llm_image = get_huggingface_llm_image_uri("huggingface", version="0.9.3")
deploy_huggingface_model(role_arn, llm_image, deploy_config)
SageMaker will now create our endpoint and deploy the model to it. This takes 10-15 minutes.
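If you prefer to monitor the deployment from code rather than the console, here is a small sketch using boto3 (the endpoint name is a placeholder; use the one SageMaker creates for you):
import boto3

sagemaker_client = boto3.client("sagemaker")

# Replace with your endpoint name (visible in the SageMaker console after deployment).
endpoint_name = "your-endpoint-name"

# Block until the endpoint reaches the InService status.
waiter = sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")  # Expected: InService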
4. Run Inference with the model
Before running the inference, let’s analyze the most important parameters that affect the output of the model:
temperature: Controls randomness in the model. Lower values make the model more deterministic, while higher values make it more random. The default value is 1.0.
max_new_tokens: The maximum number of tokens to generate. The default value is 20, and the max value is 512.
repetition_penalty: Controls the likelihood of repetition. Defaults to null.
top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Defaults to null.
return_full_text: Whether to return the full text or only the generated part. The default value is false.
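Put together, these parameters travel in the request body alongside the prompt. A minimal example of the JSON payload the TGI endpoint expects (the values here are illustrative, not recommendations) looks like this:
payload = {
    "inputs": "Explain MLOps in one sentence.",
    "parameters": {
        "temperature": 0.8,
        "top_p": 0.9,
        "max_new_tokens": 200,
        "repetition_penalty": 1.1,
        "return_full_text": False
    }
}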
Create the inference.py script
import json
import logging

import boto3

from settings import settings


class Inference:
    """
    A class to handle inference requests for a Large Language Model using AWS SageMaker.
    """

    def __init__(self, endpoint_name, initial_payload=None):
        """
        Initializes the Inference object.

        :param endpoint_name: The default endpoint name for the SageMaker inference.
        :param initial_payload: The default payload for inference requests. If None, a default payload is used.
        """
        self.sagemaker_client = boto3.client("sagemaker-runtime")
        self.default_endpoint = endpoint_name
        self.inference_payload = initial_payload if initial_payload else self._create_default_payload()
        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

    def _create_default_payload(self):
        """
        Creates a default payload for inference if none is provided during initialization.
        """
        return {
            "inputs": "How is the weather?",
            "parameters": {
                "max_new_tokens": settings.MAX_NEW_TOKENS,
                "top_p": settings.TOP_P,
                "temperature": settings.TEMPERATURE,
                "return_full_text": False
            }
        }

    def update_payload(self, inputs, additional_parameters=None):
        """
        Updates the inference payload with new inputs and parameters.

        :param inputs: The new input text for the model.
        :param additional_parameters: Additional parameters to control the model's response. Optional.
        """
        self.inference_payload['inputs'] = inputs
        if additional_parameters:
            self.inference_payload['parameters'].update(additional_parameters)

    def run_inference(self, specific_endpoint=None):
        """
        Sends an inference request to the specified or default SageMaker endpoint.

        :param specific_endpoint: The name of a specific SageMaker endpoint to use for inference. Optional.
        :return: The response from the model.
        """
        endpoint_to_use = specific_endpoint if specific_endpoint else self.default_endpoint
        try:
            logging.info(f"Inference request sent with parameters: {self.inference_payload['parameters']}")
            response = self.sagemaker_client.invoke_endpoint(
                EndpointName=endpoint_to_use,
                ContentType="application/json",
                Body=json.dumps(self.inference_payload),
            )
            response_body = response["Body"].read().decode("utf8")
            return json.loads(response_body)
        except Exception as e:
            logging.error(f"An error occurred during inference: {e}")
            raise
Initialization: An Inference class is created with a SageMaker endpoint and optional initial payload. Without a payload, a default is generated.
Default Payload: A method _create_default_payload forms a basic payload with a sample input and default settings like max_new_tokens and temperature.
Payload Update: The update_payload method modifies the input and parameters for various inference needs.
Running Inference: run_inference executes the inference, using a specific or default endpoint. It logs requests, sends the payload to SageMaker, and retrieves responses.
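As a quick usage sketch of the class (the endpoint name below is a placeholder; use the one SageMaker created for you):
inference_engine = Inference(endpoint_name="your-endpoint-name")
inference_engine.update_payload(inputs="What is MLOps?")
response = inference_engine.run_inference()
print(response[0]["generated_text"])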
Chat with your model
The meta-llama/Llama-2-7b-chat-hf model we deployed is a conversational chat model, meaning you can chat with it using the following prompt format:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]
We create a small helper method build_llama2_prompt, which converts a list of "messages" into the prompt format. We also define a system_prompt, which is used to start the conversation. You will use the system_prompt to ask the model about MLOps.
Note that <s> and </s> are special tokens for the beginning of the string (BOS) and end of the string (EOS), while [INST] and [/INST] are regular strings.
def build_llama2_prompt(messages):
    startPrompt = "<s>[INST] "
    endPrompt = " [/INST]"
    conversation = []
    for index, message in enumerate(messages):
        if message["role"] == "system" and index == 0:
            conversation.append(f"<<SYS>>\n{message['content']}\n<</SYS>>\n\n")
        elif message["role"] == "user":
            conversation.append(message["content"].strip())
        else:
            conversation.append(f" [/INST] {message['content'].strip()}</s><s>[INST] ")
    return startPrompt + "".join(conversation) + endPrompt
messages = [
    {"role": "system", "content": "You are a knowledgeable MLOps engineer named Dobby. Your goal is to have natural conversations with users to help them understand MLOps concepts."}
]
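To make the template concrete, here is what build_llama2_prompt produces for this system message plus one hypothetical user turn (the rendered prompt is shown as comments):
example_messages = messages + [{"role": "user", "content": "What is MLOps?"}]
print(build_llama2_prompt(example_messages))
# <s>[INST] <<SYS>>
# You are a knowledgeable MLOps engineer named Dobby. ...
# <</SYS>>
#
# What is MLOps? [/INST]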
Let’s see if Dobby knows how Sagemaker helps the MLOps community.
default_endpoint = 'huggingface-pytorch-tgi-inference-2024-01-11-23-42-3-211'  # replace with your endpoint name

instruction = "How does AWS SageMaker help the MLOps community? Be short and concise."
messages.append({"role": "user", "content": instruction})
prompt = build_llama2_prompt(messages)

inference_engine = Inference(endpoint_name=default_endpoint)
inference_engine.update_payload(
    inputs=prompt,
    additional_parameters={
        "max_new_tokens": 400,
        "top_p": 0.9,
        "temperature": 0.8
    }
)
results = inference_engine.run_inference()[0]['generated_text']
5. Clean up SageMaker resources
It’s very important to clean up the Sagemaker resources if you don’t want to pay unnecessary bills.
Create delete.sh bash script
#!/bin/bash

# Show the endpoints that are about to be deleted.
aws sagemaker list-endpoints --query "Endpoints[*].[EndpointName]" --output text

# Delete every endpoint in the current account/region.
for endpoint in $(aws sagemaker list-endpoints --query "Endpoints[*].[EndpointName]" --output text); do
    echo "Deleting endpoint: $endpoint"
    aws sagemaker delete-endpoint --endpoint-name "$endpoint"
done
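Deleting the endpoints stops the billing, but the associated endpoint configurations and models remain in your account. If you want a fully clean slate, you can extend the script along these lines (same CLI pattern, applied to the other resource types):
# Delete all endpoint configurations.
for config in $(aws sagemaker list-endpoint-configs --query "EndpointConfigs[*].[EndpointConfigName]" --output text); do
    echo "Deleting endpoint config: $config"
    aws sagemaker delete-endpoint-config --endpoint-config-name "$config"
done

# Delete all models.
for model in $(aws sagemaker list-models --query "Models[*].[ModelName]" --output text); do
    echo "Deleting model: $model"
    aws sagemaker delete-model --model-name "$model"
done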
Conclusion
In summary, leveraging AWS SageMaker in conjunction with Large Language Models (LLMs) significantly accelerates the process of experimentation and prototyping, enabling a more efficient and rapid transition to the production stage. This integration streamlines workflows and reduces development time, facilitating quicker deployment and realization of AI-driven solutions.
That’s it for today 👾
See you next Thursday at 9:00 a.m. CET.
Have a fantastic weekend!
Signed: Alex V. from the Decoding ML team
Whenever you’re ready, here is how we can help you:
→ Follow Decoding ML’s Medium publication for more in-depth, hands-on articles about production-ready ML & MLOps.