Many scientific and research computing workflows have fundamentally insatiable compute needs. This talk presents work on delivering compute cycles more efficiently to help meet these demands in a cost-effective and resource-smart manner. It addresses a common use case from production high-performance computing (HPC) and high-throughput computing (HTC) environments with backfill scheduling, and shows how OpenStack's user-transparent suspend/resume of job execution can increase the overall productive utilization of resources. HPC/HTC clusters run low-priority jobs that are generally preempted or killed by high-priority ones, wasting the compute already spent on them. Clouds, in turn, are generally overprovisioned to meet peak loads, which leaves them underutilized most of the time.
HPC/HTC clusters can exploit this underutilization to avoid that waste and help the cloud raise its utilization.
Low-priority jobs can be queued for execution on dynamically provisioned virtual machines in the cloud. These nodes can be suspended and resumed as cloud utilization crosses defined thresholds, freeing resources when they are needed while keeping job state intact.
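The threshold-driven suspend/resume decision can be illustrated with a minimal Python sketch. It is not the implementation presented in the talk: the names `Thresholds`, `decide`, and `simulate`, the two-threshold (hysteresis) scheme, and the values 0.85 and 0.60 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; a real deployment would tune these.
    suspend_above: float = 0.85  # cloud is busy: give resources back
    resume_below: float = 0.60   # cloud is idle again: continue backfill jobs


def decide(utilization: float, suspended: bool,
           t: Thresholds = Thresholds()) -> str:
    """Return 'suspend', 'resume', or 'hold' for one backfill node."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be a fraction between 0 and 1")
    if not suspended and utilization >= t.suspend_above:
        return "suspend"
    if suspended and utilization <= t.resume_below:
        return "resume"
    # Between the two thresholds nothing changes, which avoids flapping.
    return "hold"


def simulate(trace, t: Thresholds = Thresholds()):
    """Replay a utilization trace and collect the action taken at each step."""
    suspended = False
    actions = []
    for utilization in trace:
        action = decide(utilization, suspended, t)
        if action == "suspend":
            suspended = True
        elif action == "resume":
            suspended = False
        actions.append(action)
    return actions


if __name__ == "__main__":
    print(simulate([0.5, 0.7, 0.9, 0.8, 0.7, 0.55, 0.5]))
```

Using separate suspend and resume thresholds is a design choice of this sketch: a node suspended at high utilization is not resumed until utilization has dropped well below the suspend point. Issuing the actual suspend or resume request to OpenStack is omitted here.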
However, this requires us to break the mold of the traditional HPC/HTC cluster, which is either static or scaled dynamically by workload rather than by resource availability. We identified a minimum set of modifications and features needed to make a Slurm cluster (Slurm is a widely used HPC workload manager) dynamic and driven by resource utilization and availability in the cloud. We developed daemon processes and modified Slurm to accept the federation and separation of nodes dynamically while keeping job state intact. This will let us run research-oriented Open Science Grid (OSG) HTC jobs that backfill an HPC cluster on an OpenStack cloud.