1. Introduction

Genomics England (GEL) plays a pivotal role in advancing personalized medicine through large-scale genomic data analysis, primarily within the National Health Service (NHS) in the UK. By sequencing and analyzing genomes from patients with cancer and rare diseases, GEL aims to improve diagnosis, treatment strategies, and ultimately, patient outcomes. A critical area of focus is cancer research, where understanding the intricate molecular and clinical landscapes of tumors is paramount for accurate subtyping and predicting patient survival.

Traditional approaches often analyze single data modalities in isolation. However, the complexity of cancer necessitates a holistic view, integrating information from various sources, including genomics (e.g., somatic mutations, copy number variations, gene expression), clinical data (e.g., patient demographics, treatment history, pathology reports), and medical imaging (e.g., radiology scans). Multi-modal machine learning (MMML) offers a powerful framework to achieve this integration, potentially uncovering synergistic relationships and improving predictive accuracy beyond what single modalities can achieve.

Implementing and scaling MMML pipelines for large datasets like those managed by GEL presents significant technical challenges. These include managing diverse data formats, performing computationally intensive feature extraction and model training, ensuring data security and compliance, and deploying robust and interpretable models for clinical translation. To address these challenges, GEL has collaborated with Amazon Web Services (AWS), leveraging the capabilities of Amazon SageMaker, a fully managed machine learning service, to build and deploy sophisticated MMML models for cancer subtyping and survival analysis. This article delves into the technical details of this collaboration, exploring the architecture, data engineering processes, model development strategies, and deployment mechanisms employed.

2. Architecture Overview

The MMML pipeline on AWS leverages a suite of services orchestrated by Amazon SageMaker to handle the end-to-end workflow, from data ingestion to model deployment and visualization.

The following diagram illustrates the high-level architecture of the MMML pipeline:


Component Description

3. Data Engineering for Multi-Modal ML

Integrating data from diverse sources requires robust data engineering pipelines. Genomics England handles several data types, each requiring specific preprocessing: genomic data (e.g., somatic mutations, copy number variations, gene expression) must be quality-controlled and normalized; structured clinical data must be cleaned and encoded, and free-text reports processed into usable features; and medical imaging is typically reduced to fixed-length embeddings before modeling.

A significant challenge in MMML is aligning data across modalities. This involves ensuring that the features from different sources correspond to the same patient and time point (where relevant). Robust patient identifiers and data lineage tracking are crucial. Feature stores, like the one conceptually represented in S3, help in centralizing and managing these aligned features.
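As a minimal sketch of this alignment step (the patient IDs, feature names, and tables below are hypothetical, not GEL data), inner joins on a shared patient identifier yield a feature matrix in which every row is fully aligned across modalities:

```python
import pandas as pd

# Hypothetical per-modality feature tables, each keyed by a shared patient ID.
genomic = pd.DataFrame({"patient_id": ["P1", "P2", "P3"], "tmb": [12.1, 3.4, 7.8]})
clinical = pd.DataFrame({"patient_id": ["P1", "P2", "P4"], "age": [61, 54, 70]})
imaging = pd.DataFrame({"patient_id": ["P1", "P2", "P3"], "embedding_0": [0.12, -0.45, 0.33]})

# Inner joins keep only patients present in every modality, so each row of the
# aligned table corresponds to one patient across all three sources.
aligned = genomic.merge(clinical, on="patient_id").merge(imaging, on="patient_id")
print(aligned["patient_id"].tolist())  # ['P1', 'P2']: P3 lacks clinical data, P4 lacks genomic and imaging
```

In practice the join keys also carry time points and data lineage metadata, and the decision between inner joins (complete cases only) and outer joins with missing-modality handling is itself a modeling choice.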

4. Model Architecture

Developing effective MMML models requires careful consideration of how to fuse information from different modalities. Two common strategies are early fusion, in which features from all modalities are concatenated into a single representation before modeling, and late fusion, in which a separate model is trained per modality and their predictions are combined.

More complex architectures can involve hybrid approaches or attention mechanisms to dynamically weigh the contribution of different modalities.

PyTorch Code Snippet (Early Fusion Example)

Here’s a simplified PyTorch example of a multi-input model for early fusion:

import torch
import torch.nn as nn

class MultiModalModel(nn.Module):
    def __init__(self, genomic_input_dim, clinical_input_dim, imaging_embedding_dim, hidden_dim, output_dim):
        super(MultiModalModel, self).__init__()
        self.genomic_fc = nn.Linear(genomic_input_dim, hidden_dim)
        self.clinical_fc = nn.Linear(clinical_input_dim, hidden_dim)
        self.imaging_fc = nn.Linear(imaging_embedding_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.combined_fc = nn.Linear(3 * hidden_dim, hidden_dim)
        self.output_fc = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, genomic_features, clinical_features, imaging_embeddings):
        genomic_out = self.relu(self.genomic_fc(genomic_features))
        clinical_out = self.relu(self.clinical_fc(clinical_features))
        imaging_out = self.relu(self.imaging_fc(imaging_embeddings))

        combined = torch.cat((genomic_out, clinical_out, imaging_out), dim=1)
        combined = self.dropout(self.relu(self.combined_fc(combined)))
        output = self.output_fc(combined)
        return output

# Example usage
genomic_dim = 1000
clinical_dim = 50
imaging_dim = 256
hidden = 128
num_classes = 5 # For subtyping

model = MultiModalModel(genomic_dim, clinical_dim, imaging_dim, hidden, num_classes)
dummy_genomic = torch.randn(32, genomic_dim)
dummy_clinical = torch.randn(32, clinical_dim)
dummy_imaging = torch.randn(32, imaging_dim)
output = model(dummy_genomic, dummy_clinical, dummy_imaging)
print(output.shape)
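For contrast, the late-fusion strategy can be sketched as follows (a simplified illustration, not GEL's production architecture): each modality gets its own classification head, and the per-modality logits are combined at the end, here by simple averaging.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    """One classifier per modality; predictions are fused after each head."""
    def __init__(self, genomic_dim, clinical_dim, imaging_dim, hidden_dim, num_classes):
        super().__init__()
        def head(in_dim):
            return nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )
        self.genomic_head = head(genomic_dim)
        self.clinical_head = head(clinical_dim)
        self.imaging_head = head(imaging_dim)

    def forward(self, genomic, clinical, imaging):
        # Average the per-modality logits; learned fusion weights or
        # attention over modalities are common refinements.
        return (self.genomic_head(genomic) +
                self.clinical_head(clinical) +
                self.imaging_head(imaging)) / 3

model = LateFusionModel(1000, 50, 256, 128, 5)
out = model(torch.randn(32, 1000), torch.randn(32, 50), torch.randn(32, 256))
print(out.shape)  # torch.Size([32, 5])
```

Late fusion degrades more gracefully when a modality is missing for some patients, at the cost of not modeling cross-modal interactions in the feature space.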

Genomics England utilizes SageMaker’s custom training containers to accommodate specific library requirements and model architectures. SageMaker Data Parallel can be employed to accelerate the training of large models on distributed GPU instances.

SageMaker Hyperparameter Tuner automates the process of finding optimal model hyperparameters by running multiple training jobs in parallel across different hyperparameter configurations. This is crucial for maximizing model performance in complex MMML settings.
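Conceptually, a tuning job in the SageMaker Python SDK looks like the following sketch. The entry point, metric regex, instance type, and hyperparameter ranges are illustrative placeholders, not GEL's actual configuration:

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

# Illustrative estimator; role, script, and instance type are placeholders.
estimator = PyTorch(
    entry_point="train.py",
    role="<execution-role-arn>",
    framework_version="2.1",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.xlarge",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    objective_type="Maximize",
    # The regex must match the metric as the training script logs it.
    metric_definitions=[{"Name": "validation:auc", "Regex": "val_auc=([0-9\\.]+)"}],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "hidden_dim": IntegerParameter(64, 512),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)
# tuner.fit({"train": "s3://<bucket>/train", "validation": "s3://<bucket>/val"})
```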

5. Training and Evaluation

Rigorous training and evaluation strategies are essential to build reliable MMML models.

ML Pseudocode Example (Training Loop)

function train_model(model, train_dataloader, validation_dataloader, optimizer, loss_function, num_epochs):
  for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
      genomic_features, clinical_features, imaging_embeddings, labels = batch
      optimizer.zero_grad()
      predictions = model(genomic_features, clinical_features, imaging_embeddings)
      loss = loss_function(predictions, labels)
      loss.backward()
      optimizer.step()
      # Log training loss

    model.eval()
    with torch.no_grad():
      for batch in validation_dataloader:
        genomic_features, clinical_features, imaging_embeddings, labels = batch
        val_predictions = model(genomic_features, clinical_features, imaging_embeddings)
        val_loss = loss_function(val_predictions, labels)
        # Calculate and log validation metrics (e.g., AUC, C-index)

      # Log epoch-level validation metrics

  return model
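For the survival-analysis task, the C-index mentioned in the loop above measures how well predicted risk scores rank observed survival times. A minimal, dependency-free implementation (a sketch that counts censored patients as comparable only while still under observation):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose predicted risks are
    ordered consistently with their observed survival times.

    times  -- observed follow-up time for each patient
    events -- 1 if the event (e.g., death) was observed, 0 if censored
    risks  -- predicted risk score (higher = shorter expected survival)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable if patient i's event occurred before time j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # risks ordered consistently with outcomes
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risks count as half-concordant
    return concordant / comparable

# Perfect ranking: shorter survival paired with higher predicted risk.
print(concordance_index([2, 4, 6], [1, 1, 1], [0.9, 0.5, 0.1]))  # 1.0
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking; production pipelines would typically use a vetted implementation rather than this illustrative one.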

6. Model Deployment

Once a satisfactory model is trained and evaluated, it needs to be deployed for inference.

SageMaker Training and Deployment Configuration (Conceptual)

While the exact configuration involves Python SDK calls, conceptually, a SageMaker training job configuration would specify: the training container image (or framework and version), the instance type and count, the input data channels (S3 locations for the prepared features), hyperparameters, the IAM execution role, and the S3 output path for model artifacts.

Deployment configuration for a real-time endpoint would specify: the trained model artifact, the inference container and serving code, the endpoint instance type and initial instance count, and optional auto scaling policies.
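As a hedged sketch in the SageMaker Python SDK (the entry point, instance types, and S3 paths are placeholders, not GEL's actual configuration):

```python
from sagemaker.pytorch import PyTorch

# Training job: framework container, compute, hyperparameters, and output location.
estimator = PyTorch(
    entry_point="train.py",                 # illustrative training script
    role="<execution-role-arn>",
    framework_version="2.1",
    py_version="py310",
    instance_count=2,
    instance_type="ml.g5.2xlarge",
    output_path="s3://<bucket>/models",
    hyperparameters={"epochs": 50, "learning_rate": 1e-4},
)
# estimator.fit({"train": "s3://<bucket>/train", "validation": "s3://<bucket>/val"})

# Real-time endpoint: the model artifact is served behind an HTTPS endpoint.
# predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
```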

7. Clinical Integration & Interpretation

The ultimate goal of these MMML pipelines is to provide clinically relevant insights.

8. Security and Compliance

Handling sensitive patient data requires stringent security and compliance measures.

9. Conclusion

The collaboration between Genomics England and AWS, leveraging Amazon SageMaker, represents a significant advancement in applying multi-modal machine learning to cancer research. By integrating complex genomic, clinical, and imaging data, these pipelines have the potential to improve cancer subtyping accuracy, refine survival predictions, and ultimately contribute to more personalized and effective treatment strategies.

Future directions include exploring federated learning approaches to enable collaborative model training across multiple institutions without sharing raw data, developing longitudinal models to capture the temporal dynamics of cancer progression, and further enhancing model interpretability to facilitate clinical translation.

The lessons learned from deploying these real-world MMML pipelines highlight the importance of robust data engineering, scalable infrastructure, careful model design, and a strong focus on security, compliance, and clinical relevance. As multi-modal data continues to grow in volume and complexity, the partnership between organizations like Genomics England and cloud platforms like AWS will be crucial in unlocking the full potential of machine learning for the benefit of patients.
