Genomics is the study of the structure, function, evolution, and mapping of genomes. A genome is the complete set of genetic instructions found in an organism, containing all the information it needs to grow, develop, and function. Genomics workflows are the series of steps carried out to process, analyze, and interpret genomic data. These workflows can include tasks such as aligning sequencing reads to a reference genome, calling genetic variants, and annotating variants with functional and clinical significance.
Genomics workflows can be resource-intensive, requiring substantial compute, storage, and memory to process and analyze large volumes of genomic data. Running these workflows on traditional on-premises infrastructure can be expensive and time-consuming, as it requires significant upfront investment in hardware and ongoing maintenance.
To address these challenges, many genomics researchers and organizations are turning to cloud-based platforms such as Amazon Web Services (AWS) to run their genomics workflows. AWS provides a range of services that can be used to automate the execution of genomics workflows in the cloud, including compute, storage, and database services. By using these services, genomics researchers can leverage the scalability, flexibility, and cost-effectiveness of the cloud to process and analyze genomic data more efficiently.
In this tutorial, we will discuss how to automate the execution of genomics workflows on AWS using a combination of AWS services. We will outline the steps you can follow to set up this automation workflow and discuss the benefits of using AWS for genomics workflows.
Automate the Genomics Workflow on AWS
To automate the regenie workflow on AWS, you can use a combination of services including Amazon FSx for Lustre, AWS Batch, Amazon Elastic Container Registry, AWS Step Functions, Amazon CloudWatch, and Amazon DynamoDB. Here is a summary of the steps you can follow to set up this automation:
Step 1
Set up an AWS account and create an IAM user with access to the required services.
Download and install the AWS CLI on your local machine.
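For example, on a Linux machine you can install AWS CLI version 2, configure it with the credentials of the IAM user created above, and confirm that it can reach your account (a minimal sketch; see the AWS documentation for other platforms):
# Download and install AWS CLI v2 (Linux x86_64)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Enter the IAM user's access key, secret key, and default region when prompted
aws configure
# Verify that the CLI is configured correctly
aws sts get-caller-identity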
Step 2
Create an Amazon FSx for Lustre file system and an Amazon S3 bucket to store the sample data. Set up access to the file system and the bucket using Amazon FSx for Lustre file share access and Amazon S3 bucket policies.
Here is an example of how you can set up an Amazon FSx for Lustre file system using the AWS SDK for Python (Boto3):
import time
import boto3

# Create an FSx client
fsx_client = boto3.client('fsx')

# Set up the parameters for the file system
params = {
    'FileSystemType': 'LUSTRE',
    'StorageCapacity': 1200,  # Set the storage capacity in GiB
    'SubnetIds': ['subnet-12345678'],        # Replace with your subnet IDs
    'SecurityGroupIds': ['sg-12345678'],     # Replace with your security group IDs
    'LustreConfiguration': {
        'ImportPath': 's3://my-genomic-data-bucket/data/',    # Replace with your S3 import path
        'ExportPath': 's3://my-genomic-data-bucket/exports/'  # Replace with your S3 export path
    }
}

# Create the file system
response = fsx_client.create_file_system(**params)

# Get the file system ID
file_system_id = response['FileSystem']['FileSystemId']

# Poll until the file system becomes available (the FSx client has no built-in waiter)
while True:
    described = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
    if described['FileSystems'][0]['Lifecycle'] == 'AVAILABLE':
        break
    time.sleep(30)

print(f'File system {file_system_id} is available')
This code creates a new Amazon FSx for Lustre file system with the specified storage capacity, subnet IDs, security group IDs, and S3 import and export paths, then polls until the file system becomes available before printing a confirmation message.
You can use similar code to set up other Amazon FSx file systems such as Amazon FSx for Windows File Server or Amazon FSx for OpenZFS. Just make sure to adjust the FileSystemType and other parameters as needed.
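Step 2 also calls for an Amazon S3 bucket to hold the sample data. If the bucket referenced by the import and export paths above does not exist yet, you can create it with the AWS CLI (the bucket name and region are placeholders):
aws s3 mb s3://my-genomic-data-bucket --region region-code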
Step 3
Create a Docker image of regenie and push it to Amazon Elastic Container Registry.
Install Docker on your local machine if you don’t have it already:
$ sudo apt-get update
$ sudo apt-get install docker.io
Create a new Docker image of regenie by writing a Dockerfile that specifies the necessary dependencies and steps to build the image. Here is an example Dockerfile:
FROM ubuntu:18.04

# Install build dependencies
RUN apt-get update && apt-get install -y \
        git \
        build-essential \
        zlib1g-dev \
        libncurses5-dev \
        python3 \
        python3-pip

# Clone the regenie repository and build it
RUN git clone https://github.com/rgcgithub/regenie.git
RUN cd regenie && make

# Set the entrypoint to the regenie executable
ENTRYPOINT ["/regenie/regenie"]
To create an Amazon Elastic Container Registry (ECR) repository and push a Docker image to it using the AWS CLI, follow these steps:
Install the AWS CLI and configure it with your AWS credentials.
Run the following command to create an ECR repository:
aws ecr create-repository --repository-name my-regenie-repo
This will create a new ECR repository with the name “my-regenie-repo”.
Build the Docker image for the Regenie workflow from the directory containing the Dockerfile using the following command:
docker build -t my-regenie-image .
This will build the Docker image and give it the tag “my-regenie-image”.
Run the following command to log in to your ECR registry:
aws ecr get-login-password --region region-code | docker login --username AWS --password-stdin account-id.dkr.ecr.region-code.amazonaws.com
Replace “region-code” and “account-id” with the appropriate values for your account and region. This command will log you in to your ECR registry using the AWS CLI.
Tag the Docker image with the URI of your ECR repository using the following command:
docker tag my-regenie-image account-id.dkr.ecr.region-code.amazonaws.com/my-regenie-repo:latest
Again, replace “region-code” and “account-id” with the appropriate values for your account and region. This will give the Docker image the tag “latest”, which will be used to identify the latest version of the image in the ECR repository.
Finally, push the Docker image to the ECR repository using the following command:
docker push account-id.dkr.ecr.region-code.amazonaws.com/my-regenie-repo:latest
This will push the Docker image to the ECR repository, making it available for use in your automation workflow. Alternatively, you can use AWS CodePipeline to automate the build and push process for the Docker image. This can be helpful if you want to set up a continuous delivery pipeline for your Regenie workflow.
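Optionally, before relying on the image in AWS Batch, you can check locally that the container starts and that the regenie executable responds:
docker run --rm my-regenie-image --help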
Step 4
Set up an AWS Batch job queue and compute environment to run the regenie tasks. Configure the compute environment to pull the regenie Docker image from Amazon Elastic Container Registry at launch time.
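A minimal sketch of creating a managed compute environment and a job queue with the AWS CLI is shown below. The compute environment name, subnet, security group, instance role, and service role are placeholders; replace them with values from your account:
aws batch create-compute-environment \
    --compute-environment-name my-regenie-ce \
    --type MANAGED \
    --compute-resources type=EC2,minvCpus=0,maxvCpus=64,desiredvCpus=0,instanceTypes=optimal,subnets=subnet-12345678,securityGroupIds=sg-12345678,instanceRole=ecsInstanceRole \
    --service-role arn:aws:iam::account-id:role/AWSBatchServiceRole

aws batch create-job-queue \
    --job-queue-name my-regenie-queue \
    --state ENABLED \
    --priority 1 \
    --compute-environment-order order=1,computeEnvironment=my-regenie-ce
The job queue name "my-regenie-queue" is reused later when the Step Functions state machine submits jobs to AWS Batch.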
To create an AWS Batch job definition that specifies the Regenie workflow as a Docker container image stored in your ECR repository using the AWS CLI, follow these steps:
Install the AWS CLI and configure it with your AWS credentials.
Run the following command to create a job definition for the Regenie workflow:
aws batch register-job-definition --job-definition-name my-regenie-job --type container --container-properties '{"image": "account-id.dkr.ecr.region-code.amazonaws.com/my-regenie-repo:latest", "vcpus": 4, "memory": 8192, "command": ["regenie", "run", "--input-file", "/data/input.txt", "--output-file", "/data/output.txt"]}'
Replace “region-code” and “account-id” with the appropriate values for your account and region. This command will create a new AWS Batch job definition with the name “my-regenie-job” that specifies the Regenie workflow as a Docker container image stored in your ECR repository. The job definition also specifies the number of vCPUs and memory to allocate for the job, as well as the command to run the Regenie workflow.
You can use the AWS Management Console or AWS CloudFormation to create the job definition as well. In the AWS Management Console, go to the AWS Batch dashboard and click "Create job definition" to create a new job definition. In AWS CloudFormation, you can use the AWS::Batch::JobDefinition resource to create a job definition. Once the job definition is created, you can use it to submit jobs to AWS Batch and run the Regenie workflow on the AWS cloud.
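Once the job definition, job queue, and compute environment exist, you can also submit a standalone job directly from the AWS CLI (a sketch using the names defined above; the job name is a placeholder):
aws batch submit-job \
    --job-name my-regenie-run \
    --job-queue my-regenie-queue \
    --job-definition my-regenie-job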
Step 5
Create an AWS Step Functions state machine to orchestrate the workflow. Use an AWS Lambda function to initiate the state machine with interactive user input or run the “start-execution” command in the AWS CLI and pass a JSON file with the input parameters.
To create an AWS Step Functions state machine to automate the execution of the Regenie workflow using the AWS CLI, follow these steps:
Create a file called state-machine.json with the following contents:
{
  "Comment": "A state machine to run the Regenie workflow",
  "StartAt": "Upload Data",
  "States": {
    "Upload Data": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:putObject",
      "Parameters": {
        "Bucket": "my-genomic-data-bucket",
        "Key": "data/input.txt",
        "Body.$": "$.input"
      },
      "Next": "Run Regenie Workflow"
    },
    "Run Regenie Workflow": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "my-regenie-job",
        "JobQueue": "my-regenie-queue",
        "JobDefinition": "my-regenie-job",
        "ContainerOverrides": {
          "Vcpus": 4,
          "Memory": 8192,
          "Command": ["regenie", "run", "--input-file", "/data/input.txt", "--output-file", "/data/output.txt"]
        }
      },
      "Next": "Process Results"
    },
    "Process Results": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "my-result-processor",
        "Payload.$": "$.output"
      },
      "End": true
    }
  }
}
This state machine includes three tasks: "Upload Data", "Run Regenie Workflow", and "Process Results". The "Upload Data" task writes the input data to the Amazon S3 bucket linked to the FSx for Lustre file system, the "Run Regenie Workflow" task submits the Regenie job to AWS Batch and waits for it to complete, and the "Process Results" task passes the results of the Regenie workflow to a Lambda function for processing.
Run the following command to create the state machine:
aws stepfunctions create-state-machine --name my-regenie-state-machine --definition file://state-machine.json --role-arn arn:aws:iam::account-id:role/my-stepfunctions-role
This command will create a new AWS Step Functions state machine with the name "my-regenie-state-machine" and the definition specified in the state-machine.json file. Replace "account-id" and the role name (shown here as a placeholder) with an IAM role that allows Step Functions to call Amazon S3, AWS Batch, and AWS Lambda on your behalf.
You can use the AWS Management Console or AWS CloudFormation to create the state machine as well. In the AWS Management Console, go to the AWS Step Functions dashboard and click "Create a state machine" to create a new state machine. In AWS CloudFormation, you can use the AWS::StepFunctions::StateMachine resource to create a state machine.
Once the state machine is created, you can use it to automate the execution of the Regenie workflow on AWS. You can start an execution manually with input parameters, or schedule the state machine to run at regular intervals using Amazon EventBridge (formerly CloudWatch Events).
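For example, to start an execution from the AWS CLI and pass a JSON file with the input parameters (the state machine ARN and input file name are placeholders):
aws stepfunctions start-execution \
    --state-machine-arn arn:aws:states:region-code:account-id:stateMachine:my-regenie-state-machine \
    --input file://input.json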
Step 6
Monitor the performance of the workflow using Amazon CloudWatch. Set up metric filters that match specific error types, and create subscription filters that deliver real-time log events to Amazon Kinesis or AWS Lambda for further processing and retries.
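A minimal sketch of such a subscription filter on the default AWS Batch job log group is shown below. The filter pattern and the error-handling Lambda function are placeholders, and the Lambda function must also grant CloudWatch Logs permission to invoke it:
aws logs put-subscription-filter \
    --log-group-name /aws/batch/job \
    --filter-name regenie-error-filter \
    --filter-pattern "ERROR" \
    --destination-arn arn:aws:lambda:region-code:account-id:function:my-error-handler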
To set up Amazon CloudWatch alarms to monitor the execution of the Regenie workflow and send notifications if any errors occur, follow these steps:
Run the following command to create an alarm that triggers when the number of failed executions of the Regenie workflow exceeds a certain threshold:
aws cloudwatch put-metric-alarm --alarm-name my-regenie-failure-alarm --metric-name ExecutionsFailed --namespace AWS/States --dimensions Name=StateMachineArn,Value=arn:aws:states:region-code:account-id:stateMachine:my-regenie-state-machine --statistic Sum --period 60 --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --actions-enabled --alarm-actions arn:aws:sns:region-code:account-id:my-regenie-alarm-topic
Replace "region-code" and "account-id" with the appropriate values for your account and region. This command will create an alarm that triggers when one or more executions of the Regenie state machine fail within a 60-second period, and sends a notification to the Amazon Simple Notification Service (SNS) topic "my-regenie-alarm-topic" when the alarm is triggered.
Run the following command to create an alarm that triggers when the execution time of the Regenie workflow exceeds a certain threshold:
aws cloudwatch put-metric-alarm --alarm-name my-regenie-execution-time-alarm --metric-name ExecutionTime --namespace AWS/States --dimensions Name=StateMachineArn,Value=arn:aws:states:region-code:account-id:stateMachine:my-regenie-state-machine --statistic Maximum --period 60 --threshold 300000 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --actions-enabled --alarm-actions arn:aws:sns:region-code:account-id:my-regenie-alarm-topic
This command will create an alarm that triggers when the maximum execution time of the Regenie workflow exceeds 300 seconds (300,000 milliseconds, the unit of the ExecutionTime metric) within a 60-second period, and sends a notification to the same SNS topic when the alarm is triggered.
You can use the AWS Management Console or AWS CloudFormation to create CloudWatch alarms as well. In the AWS Management Console, go to the CloudWatch dashboard and click "Alarms" in the left menu to create a new alarm. In AWS CloudFormation, you can use the AWS::CloudWatch::Alarm resource to create an alarm.
By setting up CloudWatch alarms, you can monitor the execution of the Regenie workflow and receive notifications if any errors occur, helping you to troubleshoot and resolve issues quickly.
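The alarm actions above assume that an SNS topic named "my-regenie-alarm-topic" already exists. A minimal sketch of creating the topic and subscribing an email address to it (the email address is a placeholder):
aws sns create-topic --name my-regenie-alarm-topic

aws sns subscribe \
    --topic-arn arn:aws:sns:region-code:account-id:my-regenie-alarm-topic \
    --protocol email \
    --notification-endpoint you@example.com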
Step 7
If desired, use another AWS Lambda function to write failed job logs to Amazon DynamoDB. Scientists can then update table items, and DynamoDB Streams can trigger a retry.
To use Amazon DynamoDB to store the status and progress of the Regenie workflow, you can follow these steps:
Run the following command to create a DynamoDB table for storing the status and progress of the Regenie workflow:
aws dynamodb create-table --table-name my-regenie-table --attribute-definitions AttributeName=WorkflowId,AttributeType=S AttributeName=Status,AttributeType=S --key-schema AttributeName=WorkflowId,KeyType=HASH AttributeName=Status,KeyType=RANGE --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
This command will create a new DynamoDB table with the name “my-regenie-table” and the specified attribute definitions and key schema. The table has a provisioned throughput of 1 read capacity unit and 1 write capacity unit.
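If you want DynamoDB Streams to trigger retries as described at the start of this step, you can enable a stream on the table; the stream view type shown here is one reasonable choice:
aws dynamodb update-table \
    --table-name my-regenie-table \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES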
Run the following command to add a new item to the DynamoDB table to store the status and progress of a Regenie workflow:
aws dynamodb put-item --table-name my-regenie-table --item '{"WorkflowId": {"S": "abc123"}, "Status": {"S": "Running"}, "Progress": {"N": "50"}}'
This command will add a new item to the DynamoDB table with the workflow ID “abc123”, the status “Running”, and the progress “50”.
To update the status and progress of a Regenie workflow, you can use the update-item command in the AWS CLI:
aws dynamodb update-item --table-name my-regenie-table --key '{"WorkflowId": {"S": "abc123"}, "Status": {"S": "Running"}}' --update-expression "SET Progress = :p" --expression-attribute-values '{":p": {"N": "75"}}'
This command will update the progress of the Regenie workflow with the workflow ID “abc123” and the status “Running” to “75”.
You can use similar commands to query, scan, or delete items from the DynamoDB table as needed. You can also use the AWS SDK for Python (Boto3) or other programming languages to interact with DynamoDB in your automation workflow.
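For example, the following query retrieves all status items recorded for the workflow ID "abc123", using the table and key schema defined above:
aws dynamodb query \
    --table-name my-regenie-table \
    --key-condition-expression "WorkflowId = :wid" \
    --expression-attribute-values '{":wid": {"S": "abc123"}}'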
By storing the status and progress of the Regenie workflow in DynamoDB, you can track the progress of your workflow and identify any issues that may arise. This can be helpful for monitoring and debugging purposes.
Conclusion
Amazon Web Services (AWS) provides a range of services that can be used to automate the execution of genomics workflows in the cloud. By using services such as Amazon FSx for Lustre, AWS Batch, Amazon Elastic Container Registry, AWS Step Functions, Amazon CloudWatch, and Amazon DynamoDB, you can set up an automation workflow that processes and analyzes genomic data efficiently and reliably.
To set up this automation workflow, follow the steps outlined above: create an Amazon FSx for Lustre file system for storing and processing genomic data, build and push a Docker image for the genomics workflow to Amazon Elastic Container Registry, create an AWS Batch job definition for the workflow, create an AWS Step Functions state machine to orchestrate its execution, and set up Amazon CloudWatch alarms to monitor it. You can also use Amazon DynamoDB to store the status and progress of the genomics workflow.
By automating the execution of genomics workflows on AWS, you can reduce the time and effort required to process and analyze large amounts of genomic data, and enable faster and more accurate analysis of genetic variation and disease.