Saturday, December 2, 2023

Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 2: Interactive User Experiences in SageMaker Studio

 Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and easily build, train, and deploy machine learning (ML) models at scale. SageMaker makes it easy to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments.

SageMaker provides a variety of options to deploy models. These options vary in the amount of control you have and the work needed at your end. The AWS SDK gives you the most control and flexibility. It’s a low-level API available for Java, C++, Go, JavaScript, Node.js, PHP, Ruby, and Python. The SageMaker Python SDK is a high-level Python API that abstracts some of the steps and configuration, making it easier to deploy models. The AWS Command Line Interface (AWS CLI) is another high-level tool that you can use to work interactively with SageMaker to deploy models without writing your own code.

We are launching two new options that further simplify the process of packaging and deploying models using SageMaker. The first is for programmatic deployment: we are offering improvements in the Python SDK. For more information, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. The second is for interactive deployment: we are launching a new interactive experience in Amazon SageMaker Studio. It helps you quickly deploy your own trained models or foundation models (FMs) from Amazon SageMaker JumpStart with optimized configuration, and achieve predictable performance at the lowest cost. Read on to see what the new interactive experience looks like.

New interactive experience in SageMaker Studio

This post assumes that you have trained one or more ML models or are using FMs from the SageMaker JumpStart model hub and are ready to deploy them. Training a model using SageMaker is not a prerequisite for deploying a model using SageMaker. Some familiarity with SageMaker Studio is also assumed.

We walk you through how to do the following:

  • Create a SageMaker model
  • Deploy a SageMaker model
  • Deploy a SageMaker JumpStart large language model (LLM)
  • Deploy multiple models behind one endpoint
  • Test model inference
  • Troubleshoot errors

Create a SageMaker model

The first step in setting up a SageMaker endpoint for inference is to create a SageMaker model object. This model object is made up of two things: a container for the model, and the trained model that will be used for inference. The new interactive UI experience makes the SageMaker model creation process straightforward. If you’re new to SageMaker Studio, refer to the Developer Guide to get started.

  1. In the SageMaker Studio interface, choose Models in the navigation pane.
  2. On the Deployable models tab, choose Create.

Now all you need to do is provide the model container details, the location of your model data, and an AWS Identity and Access Management (IAM) role for SageMaker to assume on your behalf.

  1. For the model container, you can use one of the pre-built Docker images that SageMaker provides for popular frameworks and libraries. If you choose this option, choose a container framework, a corresponding framework version, and a hardware type from the list of supported types.

Alternatively, you can specify a path to your own container stored in Amazon Elastic Container Registry (Amazon ECR).

  1. Next, upload your model artifacts. SageMaker Studio provides two ways to upload model artifacts:
    • First, you can specify a model.tar.gz either in an Amazon Simple Storage Service (Amazon S3) bucket or in your local path. This model.tar.gz must be structured in a format that is compliant with the container that you are using.
    • Alternatively, SageMaker Studio supports raw artifact upload for PyTorch and XGBoost models. For these two frameworks, provide the model artifacts in the format the container expects. For example, for PyTorch this would be a model.pth. Note that your model artifacts can also include an inference script for preprocessing and postprocessing. If you don’t provide an inference script, the default inference handlers for the container you have chosen will be used.
  2. After you select your container and artifact, specify an IAM role.
  3. Choose Create deployable model to create a SageMaker model.

The preceding steps demonstrate the simplest workflow. You can further customize the model creation process. For example, you can specify VPC details and enable network isolation to make sure that the container can’t make outbound calls on the public internet. You can expand the Advanced options section to see more options.
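
Under the hood, this UI flow creates the same SageMaker model object that you could create programmatically. The following is a minimal boto3 sketch of the equivalent call; the model name, image URI, S3 path, and role ARN are placeholders you would replace with your own values:

import boto3

sm = boto3.client("sagemaker")

# Placeholder values - replace with your container image, model data location, and execution role
sm.create_model(
    ModelName="my-deployable-model",
    PrimaryContainer={
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/<framework-image>:<tag>",
        "ModelDataUrl": "s3://<your-bucket>/path/to/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
)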

You can get guidance on the hardware with the best price/performance ratio for your endpoint by running a SageMaker Inference Recommender benchmarking job. To further customize the SageMaker model, you can pass in tunable environment variables at the container level. Inference Recommender will also take a range of these variables to find the optimal configuration for your model serving container.
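
If you prefer to start such a benchmarking job programmatically, a minimal boto3 sketch might look like the following, assuming the model already exists as a SageMaker model object; the job name, model name, and role ARN are placeholders:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names - "Default" jobs use standard traffic patterns; "Advanced" allows custom ones
sm.create_inference_recommendations_job(
    JobName="my-model-benchmark",
    JobType="Default",
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    InputConfig={"ModelName": "my-deployable-model"},
)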

After you create your model, you can see it on the Deployable models tab. If any issue was found during model creation, you will see the status in the Monitor status column. Choose the model’s name to view the details.

Deploy a SageMaker model

In the most basic scenario, all you need to do is select a deployable model from the Models page or an LLM from the SageMaker JumpStart page, select an instance type, set the initial instance count, and deploy the model. Let’s see what this process looks like in SageMaker Studio for your own SageMaker model. We discuss using LLMs later in this post.

  1. On the Models page, choose the Deployable models tab.
  2. Select the model to deploy and choose Deploy.
  3. The next step is to select an instance type that SageMaker will put behind the inference endpoint.

You want an instance that delivers the best performance at the lowest cost. SageMaker makes it straightforward for you to make this decision by showing recommendations. If you had benchmarked your model using SageMaker Inference Recommender during the SageMaker model creation step, you will see the recommendations from that benchmark on the drop-down menu.

Otherwise, you will see a list of prospective instances on the menu. SageMaker uses its own heuristics to populate the list in that case.

  1. Specify the initial instance count, then choose Deploy.

SageMaker will create an endpoint configuration and deploy your model behind that endpoint. After the model is deployed, you will see the endpoint and model status as In service. Note that the endpoint may be ready before the model.
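
Behind the scenes, this corresponds to creating an endpoint configuration and then an endpoint. A minimal boto3 sketch of those two calls follows; the names and instance type are placeholder assumptions:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names and instance type
sm.create_endpoint_config(
    EndpointConfigName="my-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-deployable-model",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)
sm.create_endpoint(
    EndpointName="my-endpoint",
    EndpointConfigName="my-endpoint-config",
)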

This is also the place in SageMaker Studio where you will manage the endpoint. You can navigate to the endpoint details page by choosing Endpoints under Deployments in the navigation pane. Use the Add model and Delete buttons to change the models behind the endpoint without needing to recreate an endpoint. The Test inference tab enables you to test your model by sending test requests to one of the in-service models directly from the SageMaker Studio interface. You can also edit the auto scaling policy on the Auto-scaling tab on this page. More details on adding, removing, and testing models are covered in the following sections. You can see the network, security, and compute information for this endpoint on the Settings tab.
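
If you prefer to configure auto scaling in code rather than on the Auto-scaling tab, a sketch with the Application Auto Scaling API might look like the following; the endpoint name, capacity limits, and target value are assumptions you would adjust:

import boto3

aas = boto3.client("application-autoscaling")
resource_id = "endpoint/my-endpoint/variant/AllTraffic"  # placeholder endpoint and variant names

# Register the endpoint variant as a scalable target
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out when invocations per instance exceed the target value
aas.put_scaling_policy(
    PolicyName="InvocationsTargetTracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)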

Customize the deployment

The preceding example showed how straightforward it is to deploy a single model with minimum configuration required from your side. SageMaker populates most of the fields for you, but you can customize the configuration. For example, it automatically generates a name for the endpoint. However, you can name the endpoint according to your preference, or use an existing endpoint on the Endpoint name drop-down menu. For existing endpoints, you will see only the endpoints that are in service at that time. You can use the Advanced options section to specify an IAM role, VPC details, and tags.

Deploy a SageMaker JumpStart LLM

To deploy a SageMaker JumpStart LLM, complete the following steps:

  1. Navigate to the JumpStart page in SageMaker Studio.
  2. Choose one of the partner names to browse the list of models available from that partner, or use the search feature to get to the model page if you know the name of the model.
  3. Choose the model you want to deploy.
  4. Choose Deploy.

Note that use of LLMs is subject to the provider’s EULA and terms and conditions.

  1. Accept the license and terms.
  2. Specify an instance type.

Many models from the JumpStart model hub come with a default instance type for deployment that is optimized for price-performance. For models that don’t have a default, you will see a list of supported instance types on the Instance type drop-down menu. For benchmarked models, if you want to optimize the deployment specifically for cost or for performance, you can choose Alternate configurations to view more options that have been benchmarked with different combinations of total tokens, input length, and max concurrency. You can also select from the other supported instances for that model.

  1. If using an alternate configuration, select your instance and choose Select.
  2. Choose Deploy to deploy the model.

You will see the endpoint and model status change to In service. You also have options to customize the deployment to meet your requirements in this case.
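
For reference, the same JumpStart deployment can be done with a few lines of the SageMaker Python SDK. The following is a minimal sketch; the model ID is a placeholder that you would replace with the ID of the model you chose:

from sagemaker.jumpstart.model import JumpStartModel

# Placeholder JumpStart model ID
model = JumpStartModel(model_id="meta-textgeneration-llama-2-7b")

# accept_eula acknowledges the model provider's end-user license agreement
predictor = model.deploy(accept_eula=True)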

Deploy multiple models behind one endpoint

SageMaker enables you to deploy multiple models behind a single endpoint. This reduces hosting costs by improving endpoint utilization compared to using endpoints with only one model behind them. It also reduces deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint. SageMaker Studio now makes it straightforward to do this.

  1. Get started by selecting the models that you want to deploy, then choose Deploy.
  2. Then you can create an endpoint with multiple models, each with an allocated amount of compute that you define.

In this case, we use an ml.p4d.24xlarge instance for the endpoint and allocate the necessary resources for our two different models. Note that your endpoint is constrained to the instance types that are supported by this feature.

  1. If you start the flow from the Deployable models tab and want to add a SageMaker JumpStart LLM, or vice versa, you can make it an endpoint fronting multiple models by choosing Add model after starting the deployment workflow.
  2. Here, you can choose another FM from the SageMaker JumpStart model hub or a model using the Deployable Models option, which refers to models that you have saved as SageMaker model objects.
  3. Choose your model settings:
    • If the model uses a CPU instance, choose the number of CPUs and minimum number of copies for the model.
    • If the model uses a GPU instance, choose the number of accelerators and minimum number of copies for the model.
  4. Choose Add model.
  5. Choose Deploy to deploy these models to a SageMaker endpoint.

When the endpoint is up and ready (In service status), you’ll have two models deployed behind a single endpoint.
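
Each model behind such an endpoint is hosted as an inference component with its own compute allocation. As a point of reference, the following is a minimal boto3 sketch of adding one; all names and resource numbers are placeholder assumptions:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names and resource requirements
sm.create_inference_component(
    InferenceComponentName="my-model-component",
    EndpointName="my-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "my-deployable-model",
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},  # minimum number of copies of the model
)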

Test model inference

SageMaker Studio now makes it straightforward to test model inference requests. You can send the payload data directly using a supported content type, such as application/json or text/csv, or use Python SDK sample code to make an invocation request from your programming environment, such as a notebook or local integrated development environment (IDE).

Note that the Python SDK example code option is available only for SageMaker JumpStart models, and it’s tailored for the specific model use case with input/output data transformation.
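
Outside of Studio, you can send the same kind of test request with the SageMaker runtime API. The following is a minimal sketch; the endpoint name, payload shape, and optional inference component name are assumptions:

import json
import boto3

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint(
    EndpointName="my-endpoint",                     # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "How is the demo going?"}),
    # InferenceComponentName="my-model-component", # uncomment to target a specific model on a multi-model endpoint
)
print(response["Body"].read().decode("utf-8"))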

Troubleshoot errors

To help troubleshoot and look deeper into model deployment, there are tooltips on the resource Status label to show corresponding error and reason messages. There are also links to Amazon CloudWatch log groups on the endpoint details page. For single-model endpoints, the link to the CloudWatch container logs is conveniently located in the Summary section of the endpoint details. For endpoints with multiple models, the links to the CloudWatch logs are located on each row of the Models table view. The following are some common error scenarios for troubleshooting:

  • Model ping health check failure – The model deployment could fail because the serving container didn’t pass the model ping health check. To debug the issue, refer to the following container logs published by the CloudWatch log groups:
    /aws/sagemaker/Endpoints/[EndpointName]
    /aws/sagemaker/InferenceComponents/[InferenceComponentName]
  • Inconsistent model and endpoint configuration caused deployment failures – If the deployment failed with one of the following error messages, it means the selected model used a different IAM role, VPC configuration, or network isolation configuration. The remediation is to update the model details to use the same IAM role, VPC configuration, and network isolation configuration during the deployment flow. If you’re adding a model to an existing endpoint, you can recreate the model object to match the target endpoint configuration.
    Model and endpoint config have different execution roles. Please ensure the execution roles are consistent.
    Model and endpoint config have different VPC configurations. Please ensure the VPC configurations are consistent.
    Model and endpoint config have different network isolation configurations. Please ensure the network isolation configurations are consistent.
  • Not enough capacity to deploy more models on the existing endpoint infrastructure – If the deployment failed with the following error message, it means the current endpoint infrastructure doesn’t have enough compute or memory resources to deploy the model. The remediation is to increase the maximum instance count on the endpoint or delete any existing models deployed on the endpoint to make room for the new model.
    There is not enough hardware resources on the instances for this endpoint to create a copy of the inference component. Please update resource requirements for this inference component, remove existing inference components, or increase the number of instances for this endpoint.
  • Unsupported instance type for multiple model endpoint deployment – If the deployment failed with the following error message, it means the selected instance type is currently not supported for the multiple model endpoint deployment. The remediation is to change the instance type to an instance that supports this feature and retry the deployment.
    The instance type is not supported for multiple models endpoint. Please choose a different instance type.

For other model deployment issues, refer to Supported features.
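
To pull the container logs mentioned above without leaving your notebook, a minimal sketch with the CloudWatch Logs API could look like the following; the log group name uses a placeholder endpoint name:

import boto3

logs = boto3.client("logs")

# Placeholder endpoint name in the log group path
response = logs.filter_log_events(
    logGroupName="/aws/sagemaker/Endpoints/my-endpoint",
    limit=50,
)
for event in response["events"]:
    print(event["message"])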

Clean up

Cleanup is also straightforward. You can remove one or more models from your existing SageMaker endpoint by selecting the specific model on the SageMaker console. To delete the whole endpoint, navigate to the Endpoints page, select the desired endpoint, choose Delete, and accept the disclaimer to proceed with deletion.
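
If you prefer to clean up programmatically, the following sketch deletes the endpoint, its configuration, and the model; all names are placeholders:

import boto3

sm = boto3.client("sagemaker")

# Placeholder names - delete the endpoint first, then its configuration and the model
sm.delete_endpoint(EndpointName="my-endpoint")
sm.delete_endpoint_config(EndpointConfigName="my-endpoint-config")
sm.delete_model(ModelName="my-deployable-model")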


Conclusion

The enhanced interactive experience in SageMaker Studio allows data scientists to focus on model building and bringing their artifacts to SageMaker while abstracting out the complexities of deployment. For those who prefer a code-based approach, check out the low-code equivalent with the ModelBuilder class.

Friday, December 1, 2023

Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements

 Amazon SageMaker is a fully managed service that enables developers and data scientists to quickly and effortlessly build, train, and deploy machine learning (ML) models at any scale. SageMaker makes it straightforward to deploy models into production directly through API calls to the service. Models are packaged into containers for robust and scalable deployments. Although it provides various entry points like the SageMaker Python SDK, AWS SDKs, the SageMaker console, and Amazon SageMaker Studio notebooks to simplify the process of training and deploying ML models at scale, customers are still looking for better ways to deploy their models for playground testing and to optimize production deployments.

We are launching two new ways to simplify the process of packaging and deploying models using SageMaker.

In this post, we introduce the new SageMaker Python SDK ModelBuilder experience, which aims to minimize the learning curve for new SageMaker users like data scientists, while also helping experienced MLOps engineers maximize utilization of SageMaker hosting services. It reduces the complexity of initial setup and deployment, and provides guidance on best practices for taking advantage of the full capabilities of SageMaker. We provide detailed information and GitHub examples for this new SageMaker capability.

The other new launch is the interactive deployment experience in SageMaker Studio, which we discuss in Part 2.

Deploying models to a SageMaker endpoint entails a series of steps to get the model ready to be hosted on a SageMaker endpoint. This involves getting the model artifacts in the correct format and structure, creating inference code, and specifying essential details like the model image URL, Amazon Simple Storage Service (Amazon S3) location of model artifacts, serialization and deserialization steps, and necessary AWS Identity and Access Management (IAM) roles to facilitate appropriate access permissions. Following this, an endpoint configuration requires determining the inference type and configuring respective parameters such as instance types, counts, and traffic distribution among model variants.

To further help our customers when using SageMaker hosting, we introduced the new ModelBuilder class in the SageMaker Python SDK, which brings the following key benefits when deploying models to SageMaker endpoints:

  • Unifies the deployment experience across frameworks – The new experience provides a consistent workflow for deploying models built using different frameworks like PyTorch, TensorFlow, and XGBoost. This simplifies the deployment process.
  • Automates model deployment – Tasks like selecting appropriate containers, capturing dependencies, and handling serialization/deserialization are automated, reducing manual effort required for deployment.
  • Provides a smooth transition from local to SageMaker hosted endpoint – With minimal code changes, models can be easily transitioned from local testing to deployment on a SageMaker endpoint. Live logs make debugging seamless.

Overall, SageMaker ModelBuilder simplifies and streamlines the model packaging process for SageMaker inference by handling low-level details and providing tools for testing, validation, and optimization of endpoints. This improves developer productivity and reduces errors.

In the following sections, we deep dive into the details of this new feature. We also discuss how to deploy models to SageMaker hosting using ModelBuilder, which simplifies the process. Then we walk you through a few examples for different frameworks to deploy both traditional ML models and the foundation models that power generative AI use cases.

Getting to know SageMaker ModelBuilder

The new ModelBuilder is a Python class that takes ML models built using frameworks like XGBoost or PyTorch and converts them into models that are ready for deployment on SageMaker. ModelBuilder provides a build() function, which generates the artifacts according to the model server, and a deploy() function to deploy locally or to a SageMaker endpoint. The introduction of this feature simplifies the integration of models with the SageMaker environment, optimizing them for performance and scalability. The following diagram shows how ModelBuilder works at a high level.

ModelBuilder class

The ModelBuilder class provides different options for customization. However, to deploy a framework model, the model builder just expects the model, input, output, and role:

class ModelBuilder(
    model, # model id or model object
    role_arn, # IAM role
    schema_builder, # defines the input and output
    mode, # select between local deployment and deployment to SageMaker endpoints
    ...
)

SchemaBuilder

The SchemaBuilder class enables you to define the input and output for your endpoint. The schema builder generates the corresponding marshaling functions for serializing and deserializing the input and output. The following class file provides all the options for customization:

class SchemaBuilder(
    sample_input: Any,
    sample_output: Any,
    input_translator: CustomPayloadTranslator = None,
    output_translator: CustomPayloadTranslator = None
)

However, in most cases, just sample input and output would work. For example:

input = "How is the demo going?"
output = "Comment la démo va-t-elle?"
schema = SchemaBuilder(input, output)

By providing sample input and output, SchemaBuilder can automatically determine the necessary transformations, making the integration process more straightforward. For more advanced use cases, there’s flexibility to provide custom translation functions for both input and output, ensuring that more complex data structures can also be handled efficiently. We demonstrate this in the following sections by deploying different models with various frameworks using ModelBuilder.

Local mode experience

In this example, we use ModelBuilder to deploy an XGBoost model locally. You can use Mode to switch between local testing and deploying to a SageMaker endpoint. We first train the XGBoost model (locally or in SageMaker) and store the model artifacts in the working directory:

# Train the model
model = XGBClassifier()
model.fit(X_train, y_train)
model.save_model(model_dir + "/my_model.xgb")

Then we create a ModelBuilder object, passing the actual model object and a SchemaBuilder that uses the sample test input and output objects (the same input and output we used when training and testing the model) to infer the serialization needed. Note that we use Mode.LOCAL_CONTAINER to specify a local deployment. After that, we call the build function to automatically identify the supported framework container image and scan for dependencies. See the following code:

model_builder_local = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(X_test, y_pred), 
    role_arn=execution_role, 
    mode=Mode.LOCAL_CONTAINER
)
xgb_local_builder = model_builder_local.build()

Finally, we can call the deploy function on the model object, which also provides live logging for easier debugging. You don’t need to specify the instance type or count because the model will be deployed locally; if you provide these parameters, they will be ignored. This function returns the predictor object that we can use to make predictions with the test data:

# note: all the serialization and deserialization is handled by the model builder.
predictor_local = xgb_local_builder.deploy(
# instance_type='ml.c5.xlarge',
# initial_instance_count=1
)

# Make prediction for test data. 
predictor_local.predict(X_test)

Optionally, you can also control the loading of the model and preprocessing and postprocessing using InferenceSpec. We provide more details later in this post. Using LOCAL_CONTAINER is a great way to test out your script locally before deploying to a SageMaker endpoint.

Refer to the model-builder-xgboost.ipynb example to test out deploying both locally and to a SageMaker endpoint using ModelBuilder.

Deploy traditional models to SageMaker endpoints

In the following examples, we showcase how to use ModelBuilder to deploy traditional ML models.

XGBoost models

Similar to the previous section, you can deploy an XGBoost model to a SageMaker endpoint by changing the mode parameter when creating the ModelBuilder object:

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output), 
    role_arn=execution_role, 
    mode=Mode.SAGEMAKER_ENDPOINT
)
xgb_builder = model_builder.build()
predictor = xgb_builder.deploy(
    instance_type='ml.c5.xlarge',
    initial_instance_count=1
)

Note that when deploying to SageMaker endpoints, you need to specify the instance type and instance count when calling the deploy function.

Refer to the model-builder-xgboost.ipynb example to deploy an XGBoost model.

Triton models

You can use ModelBuilder to serve PyTorch models on Triton Inference Server. For that, you need to specify the model_server parameter as ModelServer.TRITON, pass a model, and provide a SchemaBuilder object, which requires sample inputs and outputs from the model. ModelBuilder takes care of the rest for you.

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output), 
    role_arn=execution_role,
    model_server=ModelServer.TRITON, 
    mode=Mode.SAGEMAKER_ENDPOINT
)

triton_builder = model_builder.build()

predictor = triton_builder.deploy(
    instance_type='ml.g4dn.xlarge',
    initial_instance_count=1
)

Refer to model-builder-triton.ipynb to deploy a model with Triton.

Hugging Face models

In this example, we show you how to deploy a pre-trained transformer model provided by Hugging Face to SageMaker. We want to use the Hugging Face pipeline to load the model, so we create a custom inference spec for ModelBuilder:

# custom inference spec with hugging face pipeline
class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        return pipeline("translation_en_to_fr", model="t5-small")
        
    def invoke(self, input, model):
        return model(input)
    
inf_spec = MyInferenceSpec()

We also define the input and output of the inference workload by defining the SchemaBuilder object based on the model input and output:

value = "How is the demo going?"        # sample input for the translation pipeline
output = "Comment la démo va-t-elle?"   # corresponding sample output
schema = SchemaBuilder(value, output)

Then we create the ModelBuilder object and deploy the model onto a SageMaker endpoint following the same logic as shown in the other example:

builder = ModelBuilder(
    inference_spec=inf_spec,
    mode=Mode.SAGEMAKER_ENDPOINT,  # you can change it to Mode.LOCAL_CONTAINER for local testing
    schema_builder=schema,
    image_uri=image,
)
model = builder.build(
    role_arn=execution_role,
    sagemaker_session=sagemaker_session,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge'
)

Refer to model-builder-huggingface.ipynb to deploy a Hugging Face pipeline model.

Deploy foundation models to SageMaker endpoints

In the following examples, we showcase how to use ModelBuilder to deploy foundation models. Just like the models mentioned earlier, all that is required is the model ID.

Hugging Face Hub

If you want to deploy a foundation model from Hugging Face Hub, all you need to do is pass the pre-trained model ID. For example, the following code snippet deploys the meta-llama/Llama-2-7b-hf model locally. You can change the mode to Mode.SAGEMAKER_ENDPOINT to deploy to SageMaker endpoints.

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    model_path="/home/ec2-user/SageMaker/LoadTestResources/meta-llama2-7b", #local path where artifacts will be saved
    mode=Mode.LOCAL_CONTAINER,
    env_vars={
        # Llama 2 is a gated model and requires a Hugging Face Hub token.
        "HUGGING_FACE_HUB_TOKEN": "<YourHuggingFaceToken>"
 
    }
)
model = model_builder.build()
local_predictor = model.deploy()

For gated models on Hugging Face Hub, you need to request access via Hugging Face Hub and use the associated key by passing it as the environment variable HUGGING_FACE_HUB_TOKEN. Some Hugging Face models may require trusting remote code, which can also be set as an environment variable using HF_TRUST_REMOTE_CODE. By default, ModelBuilder uses a Hugging Face Text Generation Inference (TGI) container as the underlying container for Hugging Face models. If you would like to use AWS Large Model Inference (LMI) containers, you can set the model_server parameter to ModelServer.DJL_SERVING when you configure the ModelBuilder object, as in the sketch that follows.
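
The following is a minimal sketch combining these options, reusing the gated Llama 2 example above; the token value and the HF_TRUST_REMOTE_CODE setting are placeholder assumptions:

model_builder = ModelBuilder(
    model="meta-llama/Llama-2-7b-hf",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    mode=Mode.LOCAL_CONTAINER,
    model_server=ModelServer.DJL_SERVING,  # use an LMI (DJL Serving) container instead of the default TGI container
    env_vars={
        "HUGGING_FACE_HUB_TOKEN": "<YourHuggingFaceToken>",  # placeholder token for the gated model
        "HF_TRUST_REMOTE_CODE": "True",                      # only needed for models that require remote code
    },
)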

A neat feature of ModelBuilder is the ability to run local tuning of the container parameters when you use LOCAL_CONTAINER mode. This feature can be used by simply running tuned_model = model.tune().

Refer to demo-model-builder-huggingface-llama2.ipynb to deploy a Hugging Face Hub model.

SageMaker JumpStart

Amazon SageMaker JumpStart also offers a number of pre-trained foundation models. Just like the process of deploying a model from Hugging Face Hub, the model ID is required. Deploying a SageMaker JumpStart model to a SageMaker endpoint is as straightforward as running the following code:

model_builder = ModelBuilder(
    model="huggingface-llm-falcon-7b-bf16",
    schema_builder=SchemaBuilder(sample_input, sample_output),
    role_arn=execution_role
)

sm_ep_model = model_builder.build()

predictor = sm_ep_model.deploy()

For all available SageMaker JumpStart model IDs, refer to Built-in Algorithms with pre-trained Model Table. Refer to model-builder-jumpstart-falcon.ipynb to deploy a SageMaker JumpStart model.

Inference component

ModelBuilder allows you to use the new inference component capability in SageMaker to deploy models. For more information on inference components, see Reduce Model Deployment Costs By 50% on Average Using SageMaker’s Latest Features. You can use inference components for deployment with ModelBuilder by specifying endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED in the deploy() method. You can also use the tune() method, which fetches the optimal number of accelerators, and modify it if required.

resource_requirements = ResourceRequirements(
    requests={
        "num_accelerators": 4,
        "memory": 1024,
        "copies": 1,
    },
    limits={},
)

predictor_2 = model_2.deploy(
    mode=Mode.SAGEMAKER_ENDPOINT,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resource_requirements,  # compute allocated to this model's inference component
    ...
)

Refer to model-builder-inference-component.ipynb to deploy a model as an inference component.

Customize the ModelBuilder class

The ModelBuilder class allows you to customize model loading using InferenceSpec.

In addition, you can control payload and response serialization and deserialization, and customize preprocessing and postprocessing, using CustomPayloadTranslator. When you need to extend our pre-built containers for model deployment on SageMaker, you can also use ModelBuilder to handle the model packaging process. In the following sections, we provide more details of these capabilities.

InferenceSpec

InferenceSpec offers an additional layer of customization. It allows you to define how the model is loaded and how it will handle incoming inference requests. Through InferenceSpec, you can define custom loading procedures for your models, bypassing the default loading mechanisms. This flexibility is particularly beneficial when working with non-standard models or custom inference pipelines. The invoke method can be customized, providing you with the ability to tailor how the model processes incoming requests (preprocessing and postprocessing). This customization can be essential to ensure that the inference process aligns with the specific needs of the model. See the following code:

class InferenceSpec(abc.ABC):
    @abc.abstractmethod
    def load(self, model_dir: str):
        pass

    @abc.abstractmethod
    def invoke(self, input_object: object, model: object):
        pass

The following code shows an example of using this class:

class MyInferenceSpec(InferenceSpec):
    def load(self, model_dir: str):
        ...  # load and return the model object from model_dir

    def invoke(self, input, model):
        return model(input)

CustomPayloadTranslator

When invoking SageMaker endpoints, the data is sent through HTTP payloads with different MIME types. For example, an image sent to the endpoint for inference needs to be converted to bytes at the client side and sent through the HTTP payload to the endpoint. When the endpoint receives the payload, it needs to deserialize the byte string back to the data type that is expected by the model (also known as server-side deserialization). After the model finishes prediction, the results need to be serialized to bytes that can be sent back through the HTTP payload to the user or client. When the client receives the response byte data, it needs to perform client-side deserialization to convert the bytes data back to the expected data format, such as JSON. At a minimum, you need to convert the data for the following (as numbered in the following diagram):

  1. Inference request serialization (handled by the client)
  2. Inference request deserialization (handled by the server or algorithm)
  3. Invoking the model against the payload
  4. Sending response payload back
  5. Inference response serialization (handled by the server or algorithm)
  6. Inference response deserialization (handled by the client)

The following diagram shows the process of serialization and deserialization during the invocation process.

In the following code snippet, we show an example of CustomPayloadTranslator when additional customization is needed to handle both serialization and deserialization in the client and server side, respectively:

from sagemaker.serve import CustomPayloadTranslator

# request translator
class MyRequestTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the client side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # convert the input payload to bytes and return them
        ...

    # This function converts the bytes to payload - happens on the server side
    def deserialize_payload_from_stream(self, stream) -> object:
        # convert bytes to an in-memory object and return it
        ...

# response translator
class MyResponseTranslator(CustomPayloadTranslator):
    # This function converts the payload to bytes - happens on the server side
    def serialize_payload_to_bytes(self, payload: object) -> bytes:
        # convert the response payload to bytes and return them
        ...

    # This function converts the bytes to payload - happens on the client side
    def deserialize_payload_from_stream(self, stream) -> object:
        # convert bytes to an in-memory object and return it
        ...

In the demo-model-builder-pytorch.ipynb notebook, we demonstrate how to easily deploy a PyTorch model to a SageMaker endpoint using ModelBuilder with the CustomPayloadTranslator and the InferenceSpec class.

Stage model for deployment

If you want to stage the model for later deployment or register it in the model registry, you can use model.create() or model.register(). The model is created in SageMaker, and you can deploy it later. See the following code:

model_builder = ModelBuilder(
    model=model,  
    schema_builder=SchemaBuilder(X_test, y_pred), 
    role_arn=execution_role, 
)
deployable_model = model_builder.build()

deployable_model.create() # deployable_model.register() for model registry

Use custom containers

SageMaker provides pre-built Docker images for its built-in algorithms and the supported deep learning frameworks used for training and inference. If a pre-built SageMaker container doesn’t fulfill all your requirements, you can extend the existing image to accommodate your needs. By extending a pre-built image, you can use the included deep learning libraries and settings without having to create an image from scratch. For more details about how to extend the pre-built containers, refer to the SageMaker documentation. ModelBuilder supports bringing your own containers when they are extended from our pre-built Docker containers.

To use your own container image in this case, you need to set the fields image_uri and model_server when defining ModelBuilder:

model_builder = ModelBuilder(
    model=model,  # Pass in the actual model object. its "predict" method will be invoked in the endpoint.
    schema_builder=SchemaBuilder(X_test, y_pred), # Pass in a "SchemaBuilder" which will use the sample test input and output objects to infer the serialization needed.
    role_arn=execution_role, 
    image_uri=image_uri, # REQUIRED FOR BYOC: Passing in image hosted in personal ECR Repo
    model_server=ModelServer.TORCHSERVE, # REQUIRED FOR BYOC: Passing in model server of choice
    mode=Mode.SAGEMAKER_ENDPOINT,
    dependencies={"auto": True, "custom": ["protobuf==3.20.2"]}
)

Here, image_uri is the URI of the container image stored in your account’s Amazon Elastic Container Registry (Amazon ECR) repository, for example:

# Pulled the xgboost:1.7-1 DLC and pushed to personal ECR repo
image_uri = "<your_account_id>.dkr.ecr.us-west-2.amazonaws.com/my-byoc:xgb"

When image_uri is set, ModelBuilder skips automatic detection of the image during the build process because the image URI is provided. If model_server is not set, you will receive a validation error message, for example:

ValueError: Model_server must be set when image_uri is set. Supported model servers: {<ModelServer.TRITON: 5>, <ModelServer.DJL_SERVING: 4>, <ModelServer.TORCHSERVE: 1>}

As of the publication of this post, ModelBuilder supports bringing your own containers that are extended from our pre-built DLC container images or containers built with model servers like Deep Java Library (DJL), Text Generation Inference (TGI), TorchServe, and Triton Inference Server.

Custom dependencies

When running ModelBuilder.build(), by default it automatically captures your Python environment into a requirements.txt file and installs the same dependencies in the container. However, sometimes your local Python environment will conflict with the environment in the container. ModelBuilder provides a simple way to fix such dependency conflicts by letting you provide your custom configurations. Note that this applies only to TorchServe and Triton with InferenceSpec. For example, you can specify the input parameter dependencies, which is a Python dictionary, in ModelBuilder as follows:

dependency_config = {
   "auto": True,
   "requirements": "/path/to/your/requirements.txt",
   "custom": ["module>=1.2.3,<1.5", "boto3==1.16.*", "some_module@http://some/url"]
}
  
ModelBuilder(
    # Other params
    dependencies=dependency_config,
).build()

We define the following fields:

  • auto – Whether to try to auto capture the dependencies in your environment.
  • requirements – A string of the path to your own requirements.txt file. (This is optional.)
  • custom – A list of any other custom dependencies that you want to add or modify. (This is optional.)

If the same module is specified in multiple places, custom will have highest priority, then requirements, and auto will have lowest priority. For example, let’s say that during autodetect, ModelBuilder detects numpy==1.25, and a requirements.txt file is provided that specifies numpy>=1.24,<1.26. Additionally, there is a custom dependency: custom = ["numpy==1.26.1"]. In this case, numpy==1.26.1 will be picked when we install dependencies in the container.
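
Expressed as the dependencies parameter, that scenario would look roughly like the following sketch (the requirements.txt path is a placeholder):

dependencies = {
    "auto": True,                                  # autodetection finds numpy==1.25
    "requirements": "/path/to/requirements.txt",   # pins numpy>=1.24,<1.26
    "custom": ["numpy==1.26.1"],                   # highest priority, so numpy==1.26.1 is installed
}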

Clean up

When you’re done testing the models, as a best practice, delete the endpoint to save costs if it is no longer required. You can follow the Clean up section in each of the demo notebooks or use the following code to delete the model and endpoint created by the demo:

predictor.delete_model()
predictor.delete_endpoint()

Conclusion

The new SageMaker ModelBuilder capability simplifies the process of deploying ML models into production on SageMaker. By handling many of the complex details behind the scenes, ModelBuilder reduces the learning curve for new users and maximizes utilization for experienced users. With just a few lines of code, you can deploy models with built-in frameworks like XGBoost, PyTorch, Triton, and Hugging Face, as well as models provided by SageMaker JumpStart into robust, scalable endpoints on SageMaker.
