MLOps on AWS


Welcome to this comprehensive guide on MLOps on AWS. In this article, we will explore the world of MLOps (Machine Learning Operations) on the AWS (Amazon Web Services) platform. MLOps brings together the disciplines of machine learning and DevOps, enabling organizations to effectively develop, deploy, and manage machine learning models at scale. By leveraging the powerful capabilities of AWS, you can streamline your ML workflows and drive innovation in your organization.

The MLOps Workflow

The MLOps workflow encompasses several crucial stages to ensure the successful development and deployment of machine learning models. Let’s dive into each of these stages:

Scoping: In this initial phase, we define the project scope and assess whether machine learning is the right approach to address the problem at hand. We perform requirement engineering and ensure that the necessary data is available. Data accuracy and relevance to the use case are validated to ensure high-quality input.

Data Engineering: This stage focuses on gathering and preprocessing the data. We establish baselines, clean the data, format it appropriately, and assign relevant labels. Organizing the data in a structured manner facilitates model training and enhances overall model performance.
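As a toy illustration of the cleaning and labelling step, here is a minimal sketch in Python; the field names and rules are made up for the example, and real pipelines would use tools like pandas or AWS Glue:

```python
def clean_rows(rows):
    """Drop rows with missing values and normalise the label field.

    A deliberately tiny stand-in for the cleaning/formatting/labelling
    work described above.
    """
    cleaned = []
    for row in rows:
        # Discard rows with any missing or empty field
        if any(v is None or v == "" for v in row.values()):
            continue
        # Normalise label casing so downstream training sees one spelling
        cleaned.append(dict(row, label=row["label"].strip().lower()))
    return cleaned

raw = [
    {"feature": 1.2, "label": " Fraud "},
    {"feature": None, "label": "ok"},   # dropped: missing feature
    {"feature": 3.4, "label": "OK"},
]
clean = clean_rows(raw)
```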

Modeling: Now, it’s time to design and implement the machine learning model. Using the prepared and cleaned data, we train the model and evaluate its performance. Error analysis and specifying appropriate error metrics help us monitor and fine-tune the model to achieve desired results.

Continuous integration and deployment: Once the model is ready, we package it for deployment. Depending on the requirements, deployment can take various forms. For edge-based models, we might package them as Docker containers deployed on cloud infrastructure, leverage serverless cloud platforms, integrate them into mobile applications, or wrap them with API servers exposing REST or gRPC endpoints.
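As a sketch of the last deployment option, here is a minimal REST prediction endpoint built on Python's standard library. The "model" is a placeholder that just sums the features; a real service would load a trained artifact and typically use a framework such as Flask or FastAPI behind a production server:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Placeholder "model": sums the features. A real service would load
    # a trained model artifact here instead.
    return sum(features)

def handle_request(body: str) -> str:
    # Parse {"features": [...]} and return {"prediction": ...} as JSON
    payload = json.loads(body)
    return json.dumps({"prediction": predict(payload["features"])})

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length).decode())
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response.encode())

# To serve: HTTPServer(("0.0.0.0", 8080), PredictHandler).serve_forever()
```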

Monitoring: Monitoring is a critical aspect of MLOps. After deployment, we continuously monitor the infrastructure and the model’s performance. This involves keeping a close eye on the infrastructure’s load, usage, storage, and overall health to ensure smooth operations. Additionally, we monitor the model’s efficiency, correctness, bias, and data drift. This ongoing monitoring helps us assess the model’s behavior in real-world scenarios and identify any necessary adjustments or improvements.
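To make the data-drift part concrete, here is a small, self-contained sketch of the Population Stability Index (PSI), a common drift statistic. The binning scheme and the ~0.2 alarm threshold are conventional choices, not an AWS-specific API:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """Compare the live feature distribution ('actual') against the
    training-time distribution ('expected'). Values above ~0.2 are
    commonly treated as significant drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the edge bins
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

training = [i / 100 for i in range(100)]        # uniform on [0, 1)
stable   = [i / 100 for i in range(100)]        # same distribution
shifted  = [0.5 + i / 100 for i in range(100)]  # mean shifted by 0.5
```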

Collaboration: Collaboration is essential in MLOps, just as it is in DevOps. It involves aligning team members towards shared goals, fostering cross-functional collaboration, establishing effective communication channels, documenting processes, conducting code reviews, and promoting continuous feedback and improvement. By prioritizing collaboration and communication, MLOps teams can work cohesively, share knowledge, and drive successful outcomes.

By following this comprehensive MLOps workflow, we can effectively develop, deploy, and monitor machine learning models, ensuring their accuracy, reliability, and alignment with the desired outcomes.

Why should you do MLOps Engineering on AWS?

AWS MLOps refers to the set of practices, tools, and technologies that enable the seamless integration of machine learning development and operations on the AWS platform. It encompasses the entire ML lifecycle, from data preparation and model training to deployment, monitoring, and continuous improvement. With AWS MLOps, you can create efficient and reliable ML pipelines, automate repetitive tasks, ensure reproducibility, and foster collaboration between data scientists, developers, and operations teams.

By adopting AWS MLOps, organizations can unlock the true potential of their machine learning projects. It provides a framework to manage complexities, maintain version control, track model performance, and automate the deployment of ML models in production. With AWS's extensive suite of managed ML services, including Amazon SageMaker, AWS Glue, and AWS Step Functions, you have the necessary tools and infrastructure to build end-to-end MLOps solutions.

There are several compelling reasons to embrace MLOps engineering on the AWS platform. Firstly, AWS offers a wide range of scalable and reliable infrastructure services that are purpose-built for machine learning workloads. With AWS’s elastic compute resources, you can easily scale your ML infrastructure based on demand, ensuring optimal performance and cost-efficiency.

Secondly, AWS provides a rich ecosystem of managed ML services that simplify and accelerate ML development and deployment. Services like Amazon SageMaker provide a comprehensive platform for building, training, and deploying ML models at scale, with built-in capabilities for experimentation, hyperparameter tuning, and automatic model deployment.

Furthermore, AWS’s integration with other AWS services, such as AWS Lambda, AWS Glue, and Amazon S3, enables seamless data ingestion, preprocessing, and integration with ML pipelines. This tight integration reduces the complexity of managing data and allows you to focus more on the ML aspects of your projects.

In conclusion, by adopting MLOps engineering on AWS, you can take advantage of the robust infrastructure, comprehensive ML services, and seamless integration to streamline your ML workflows, accelerate time-to-market, and drive innovation in your organization. So, let’s dive deeper into the world of AWS MLOps and discover how it can revolutionize your machine learning initiatives.

Guide to deploying an open-source MLOps platform on an EC2 instance


Before you dive into this integration guide, it is essential that you have a basic understanding of both AWS and Kubernetes. Ensure you have the following prerequisites:

  • An AWS account with the necessary permissions for creating and managing resources.
  • The AWS CLI installed and configured.


First, install the AWS CLI by following the instructions for your operating system.

Check out the official installation guide here.

Once it’s done, open the terminal or command prompt and run the command “aws configure”, then input your AWS Access Key ID and Secret Access Key when prompted.

Most people accept the default output format (e.g., json) and default region (e.g., us-east-1) when prompted. Once configured, you can interact with the resources in your account from the command line.
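Under the hood, `aws configure` persists these values in INI-format files (`~/.aws/credentials` and `~/.aws/config`). A quick way to sanity-check what was stored, shown here on an inline sample rather than the real file, with placeholder key values:

```python
import configparser

# Example of what ~/.aws/credentials looks like after `aws configure`
# (the key values below are placeholders, not real credentials)
sample = """\
[default]
aws_access_key_id = AKIAEXAMPLEKEY
aws_secret_access_key = exampleSecretKey123
"""

config = configparser.ConfigParser()
config.read_string(sample)
# For the real file: config.read(os.path.expanduser("~/.aws/credentials"))
access_key = config["default"]["aws_access_key_id"]
```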

What is MLflow?

MLflow is an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

Centralized Experiment Tracking with MLFlow

MLflow, an open-source MLOps platform, lets you efficiently track your experiments directly from the code or notebooks you use to train your models (among other features). The real beauty of the platform is its API, which you can access directly from your code (Python, R, Java) or via REST. Check out the Python code sample below to see how easily an experiment can be tracked:

import mlflow
import mlflow.sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from your_data_loader import load_data
# Set the MLflow tracking URI and experiment name
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
mlflow.set_experiment('Sample Experiment')
# Load the data
X, y = load_data()
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Start the MLflow run
with mlflow.start_run(run_name="your-run-name") as run:
    # Log run parameters
    mlflow.log_param("compute", 'local')
    mlflow.log_param("dataset", 'your-dataset-name')
    mlflow.log_param("dataset_version", '2.0')
    mlflow.log_param("dataset_path", 's3://your.example.bucket/path/to/dataset')
    mlflow.log_param("algo", 'random forest')
    # Log additional hyperparameters for reproducibility
    n_estimators = 5
    mlflow.log_param("n_estimators", n_estimators)
    # Train the model
    rf = RandomForestRegressor(n_estimators=n_estimators)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    # Save the model artifact to the MLflow server for later deployment
    mlflow.sklearn.log_model(rf, "rf-baseline-model")
    # Log model performance using a metric
    mse = mean_squared_error(y_test, y_pred)
    mlflow.log_metric("mse", mse)
    # The run ends automatically when the with-block exits


Notice that you can log any number of parameters and metrics, as shown above. After running a few of these experiments, you might want to go over to the MLflow UI and look at your results in a centralized location. Maybe other team members are trying different strategies with the same data; as long as everyone uses the same experiment name (MLflow's name for the project), all the results will be documented in the same table.
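Conceptually, a shared experiment is just a table of runs keyed by the experiment name, which you can then query for the best result. In MLflow itself this querying is done through the UI or `mlflow.search_runs`; the sketch below is a hypothetical in-memory stand-in to show the idea:

```python
# In-memory stand-in for an experiment's run table
experiment_runs = []

def log_run(run_name, params, metrics):
    # Each run contributes one row: its name, parameters, and metrics
    experiment_runs.append({"run": run_name, **params, **metrics})

# Two teammates logging to the same experiment
log_run("rf-5-trees",  {"algo": "random forest", "n_estimators": 5},  {"mse": 3.12})
log_run("rf-50-trees", {"algo": "random forest", "n_estimators": 50}, {"mse": 2.41})

# Query the table for the best run by metric
best = min(experiment_runs, key=lambda r: r["mse"])
```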

Setting up an MLflow Server on AWS: Step-by-step

Setting up MLflow for collaboration and experiment tracking on AWS is a breeze, and it comes with minimal operational costs. Here's a simplified step-by-step guide to get you started in less than 10 minutes:

1. Launch a new EC2 instance using the free-tier t2.micro instance type and Amazon Linux 2 AMI. Don’t forget to generate an SSH key pair and configure the security group to allow SSH access and port 5000 (the default UI port).

2. Set up a new free-tier PostgreSQL RDS (Relational Database Service) instance named ‘mlflow-rds’ or a similar name. Choose a password and ensure the networking settings match the EC2 network settings. Assign the initial DB name as ‘mlflow’ or a similar name.

3. Create a new S3 bucket to serve as the storage location for your MLflow server. Give it an appropriate name like ‘yourorganization.mlflow.data’ or similar.

4. Create a new EC2 service role and grant it the necessary permissions for S3 and RDS.

5. Once all the resources are created, SSH into the EC2 instance and execute the following commands:

# Update all packages
sudo yum update
# Install Python 3 (the Amazon Linux 2 AMI does not ship a recent Python 3 by default)
sudo yum install python3.10
# Install MLflow, the PostgreSQL driver, and the AWS Python SDK
sudo pip3 install 'mlflow[extras]' psycopg2-binary boto3
# Start the MLflow server; don't forget to change the fields in caps, and note
# that the backend store URI ends with the initial DB name chosen in step 2.
# --host 0.0.0.0 makes the UI reachable on the instance's public endpoint.
# If you are unfamiliar with nohup, read up on it here: https://man.openbsd.org/nohup.1
nohup mlflow server --host 0.0.0.0 --backend-store-uri postgresql://postgres:YOURPASSWORD@YOUR-DATABASE-ENDPOINT:5432/mlflow --default-artifact-root s3://YOURORGANISATION.MLFLOW.BUCKET &
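The `--backend-store-uri` follows the standard PostgreSQL connection-URI shape, and should end with the initial database name chosen in step 2. A small helper makes the pieces explicit (the endpoint and password values below are placeholders):

```python
def backend_store_uri(user, password, endpoint, dbname, port=5432):
    # Shape: postgresql://USER:PASSWORD@ENDPOINT:PORT/DBNAME
    return f"postgresql://{user}:{password}@{endpoint}:{port}/{dbname}"

uri = backend_store_uri(
    user="postgres",
    password="YOURPASSWORD",                                   # placeholder
    endpoint="mlflow-rds.abc123.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="mlflow",                                           # initial DB name from step 2
)
```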


6. Connect to your newly created MLflow UI by accessing the EC2 public endpoint on port 5000 in your browser.

By following these simple steps, you’ll have MLflow up and running on AWS, ready to streamline collaboration and track your experiments effectively.


