[batch] Azure storage requirements beyond tempdisk for standing worker result in NotImplementedError #14522

Open
@jeremiahwander

What happened?

Note: this is an Azure-specific issue.

Submitting a batch/job that requests more storage than is available on the temp disk of any standing worker, without requesting a specific number of cores or amount of memory, raises a NotImplementedError in batch/cloud/azure/worker/disk.py.

See this Batch record for an example of the issue: https://batch.azure.hail.is/batches/4563654/jobs/1. A minimal case that reproduces it:

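import hailtop.batch as hb

# Submit through the Batch service; substitute your own billing project.
backend = hb.ServiceBackend(billing_project="<YOUR BILLING PROJECT>")
b = hb.Batch(backend=backend, name="storage_test")
j = b.new_job()
j.image("ubuntu:20.04")
# Request more storage than the standing worker's temp disk provides,
# without specifying cores or memory.
j.storage("700GiB")
j.command("df -h")
b.run(wait=False)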

On the azure.hail.is cluster, this job is scheduled on a Standard_D16ds_v4 instance, which has a 600 GiB temp disk.

On GCP, when a request exceeds the storage available on the worker, a data disk is provisioned to service it. While this is feasible on Azure and could be implemented, it may not be the recommended solution, as temp disks are much better suited to ephemeral workloads than data disks.

On clusters with a smaller standing worker (i.e., fewer cores), there is a workaround that also suggests a reasonable partial solution: specify a number of cores that forces a larger VM of the same family to be provisioned, which makes a larger temp disk available to the job. The corresponding partial solution would be to take the temp disk size of each VM into account when scheduling jobs, and to provision larger VMs when a job's storage requirement warrants it.

Given the current limit on VM core count (16), this approach implies a ceiling of 600 GiB on the storage that can be allocated to any job on Azure. Beyond that, it would be necessary to allocate a data disk.
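
To make the proposed partial solution concrete, here is a rough sketch of how a scheduler might pick a core count from a job's storage request. This is not Hail's actual scheduling code. The function name `min_cores_for_storage` and the lookup table are hypothetical, and only the 16-core / 600 GiB figure comes from this issue; the smaller sizes are assumptions that should be checked against Azure's published VM specs. The sketch also ignores any part of the temp disk reserved for the worker itself or shared with other jobs.

```python
from typing import Optional

# Temp disk size (GiB) by core count for the VM family in question.
# ASSUMPTION: only the 16-core entry (600 GiB) is stated in this issue;
# the other entries are illustrative and should be verified.
TEMP_DISK_GIB_BY_CORES = {2: 75, 4: 150, 8: 300, 16: 600}

# Current VM core-count limit mentioned in this issue.
MAX_CORES = 16


def min_cores_for_storage(storage_gib: float, requested_cores: int = 1) -> Optional[int]:
    """Return the smallest core count whose temp disk fits the request.

    Returns None when no VM within MAX_CORES has a large enough temp disk,
    i.e. when a data disk would be needed instead.
    """
    for cores in sorted(TEMP_DISK_GIB_BY_CORES):
        if cores > MAX_CORES:
            break
        if cores >= requested_cores and TEMP_DISK_GIB_BY_CORES[cores] >= storage_gib:
            return cores
    return None
```

Under these assumptions, a 500 GiB request would map to a 16-core VM, while the 700 GiB request from the reproduction above would return None, signalling that a data disk is required. The manual workaround described above amounts to doing this lookup by hand and setting the job's core request accordingly.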

This issue reproduces on both azure.hail.is and our own Azure cluster.

Version

0.2.126-cdd2c132bfa2

Relevant log output

No response
