Running a 3.2M vCPU HPC Workload on AWS with YellowDog Publication

  • Joint Publication
Running a 3.2M vCPU HPC Workload on AWS with YellowDog Publication

In this post, we’ll discuss the AWS and YellowDog services we deployed, and the mechanisms used to scale to 3.2m vCPUs in 33 minutes using multiple EC2 instance types across multiple regions. Once analysis started, the utilization rate was 95%.

Overview of solution

YellowDog democratizes HPC with a comprehensive cloud workload management solution that is available to all customers, at any scale, anywhere in the world. It’s cloud native and schedules compute nodes based on application characteristics, rather than the constraints of a fixed on-premises HPC cluster. This allows you to manage clusters using the YellowDog scheduler or third-party schedulers from one control pane, so you have a single view on cost and department consumption across applications. It can also provision resources across multiple Regions, Availability Zones, instance types and machine sizes.

The YellowDog platform runs on Amazon EC2, using Amazon Elastic Kubernetes Service (EKS). It uses Elastic Load Balancing to manage access to the cluster. Amazon Elastic Block Storage (EBS) is used for the foundational services (Kafka, Artemis and Zookeeper) that require persistent storage. AWS CloudTrail and Amazon CloudWatch are used for monitoring.

Read full article on AWS