Event Details

Please note: All times listed below are in Central Time Zone

HPC/HTC and Cloud: Making Them Work Together Efficiently

HPC / Research

Many scientific and research computing workflows have fundamentally insatiable compute needs. This talk presents work on delivering compute cycles more efficiently to help meet these demands in a cost-effective and resource-smart manner. It addresses a common use case from production high-performance computing (HPC) and high-throughput computing (HTC) environments that use backfill scheduling, and shows how OpenStack provides user-transparent suspend/resume of job execution that can increase the overall productive utilization of resources. HPC/HTC clusters run low-priority jobs that are generally preempted or killed by high-priority ones, which wastes the compute resources already spent on them. Clouds, meanwhile, are generally overprovisioned to meet peak loads, which leaves them underutilized most of the time.

HPC/HTC can exploit this underutilization to avoid that waste while helping the cloud boost its own utilization.

Low-priority jobs can be queued for execution on dynamically provisioned virtual machines in the cloud. These nodes can be suspended and resumed according to defined cloud-utilization thresholds, freeing resources when the cloud needs them while keeping job state intact.
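The threshold-driven suspend/resume decision described above can be sketched roughly as follows. This is only an illustrative sketch, not the code presented in the talk: the names `Thresholds` and `plan_action`, the example threshold values, and the use of two separate thresholds (so that nodes do not flip back and forth when utilization hovers near a single cutoff) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical cloud-utilization thresholds, as fractions in [0, 1]."""
    suspend_above: float  # suspend backfill nodes at or above this utilization
    resume_below: float   # resume suspended nodes at or below this utilization


def plan_action(utilization: float, thresholds: Thresholds, suspended: bool) -> str:
    """Return 'suspend', 'resume', or 'hold' for a group of backfill nodes.

    Assumption: the gap between the two thresholds acts as a dead band,
    so the nodes keep their current state while utilization sits inside it.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    if not suspended and utilization >= thresholds.suspend_above:
        return "suspend"  # cloud is busy: free resources, job state is kept
    if suspended and utilization <= thresholds.resume_below:
        return "resume"   # cloud has spare capacity: continue the jobs
    return "hold"


# Example values are illustrative only.
limits = Thresholds(suspend_above=0.85, resume_below=0.60)
print(plan_action(0.90, limits, suspended=False))
print(plan_action(0.70, limits, suspended=True))
print(plan_action(0.50, limits, suspended=True))
```

In a real deployment, a daemon would presumably feed measured cloud utilization into such a decision and then ask OpenStack to suspend or resume the affected instances; that integration is left out here.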

However, this requires breaking the mold of the traditional HPC/HTC cluster, which is either static or scaled dynamically according to workload rather than resource availability. We identified a minimum set of modifications and features needed to make a Slurm (a widely used HPC workload manager) cluster dynamic and driven by resource utilization and availability in the cloud. We developed daemon processes and modified Slurm so that nodes can be federated into and separated from the cluster dynamically while job state is kept intact. This will let us run research-oriented Open Science Grid (OSG) HTC jobs that backfill an HPC cluster on an OpenStack cloud.

Tuesday, May 9, 12:35pm-12:45pm (4:35pm - 4:45pm UTC)

Hynes Convention Center - Level Two - MR 206

Difficulty Level: N/A

Rajul Kumar

Research Intern

Rajul is a graduate student at Northeastern University, Boston, and a research assistant at the Massachusetts Open Cloud (MOC), Boston. His current work deals with performance monitoring of distributed systems. He is also working on building elastic HPC clusters on OpenStack.