Installing libraries from external storage in Azure Databricks

Introduction

There are multiple ways of installing libraries on a Databricks cluster. The easiest one is to upload the files to DBFS and point to them in the cluster definition, as in this example:

[Image: Databricks job definition referencing a Python wheel stored in DBFS]

In this image, we’re defining a Databricks job that uses a Python wheel uploaded to DBFS in a previous step.
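The same kind of job can also be created through the Jobs API 2.1 instead of the UI. Here is a minimal sketch in Python; the workspace URL, token, package name, cluster settings and wheel path are all placeholders you would replace with your own:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder workspace URL
token = "dapi..."  # placeholder personal access token

job_spec = {
    "name": "wheel-from-dbfs",
    "tasks": [
        {
            "task_key": "main",
            "python_wheel_task": {"package_name": "mypkg", "entry_point": "main"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
            # The library entry points at the wheel previously uploaded to DBFS
            "libraries": [{"whl": "dbfs:/FileStore/libraries/mypkg-0.1.0-py3-none-any.whl"}],
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}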

However, there is another way of using libraries without uploading them to the internal Databricks storage: an Azure Storage Account container. This may be the best option for you if you want to keep library versions accessible in one place, or to simplify the deployment process so that a single library can be used by multiple Databricks workspaces.

How to do it

To do this, you need a container in an Azure Storage Account holding all the libraries that you want to install on the cluster:

[Image: Storage Account container holding the library wheel files]
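If you publish the wheels from a CI/CD pipeline, uploading them to that container is straightforward. A minimal sketch with the azure-storage-blob SDK; the account, container and file paths are placeholders, and authentication uses whatever DefaultAzureCredential resolves to (CLI login, managed identity, ...):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",  # placeholder account
    credential=DefaultAzureCredential(),
)
blob = service.get_blob_client(
    container="libraries",  # placeholder container
    blob="mypkg/mypkg-0.1.0-py3-none-any.whl",
)
# Upload the locally built wheel, replacing any previous version at that path
with open("dist/mypkg-0.1.0-py3-none-any.whl", "rb") as f:
    blob.upload_blob(f, overwrite=True)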

You can then reference the library like this:

[Image: Cluster library configuration referencing the abfss:// path]

abfss://{container-name}@{storage-account-name}.dfs.core.windows.net/path/to/file.whl
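In a job or cluster definition created through the API, that reference is just the library path. A sketch reusing the job spec from the earlier example; the container, account and file names are placeholders:

# Same "libraries" entry as in the DBFS example, now pointing at the container
libraries = [
    {"whl": "abfss://libraries@mystorageaccount.dfs.core.windows.net/mypkg/mypkg-0.1.0-py3-none-any.whl"}
]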

But for the cluster to access this file, you need to provide it with an authentication method, for example a Service Principal:

fs.azure.account.auth.type.{storage-account-name}.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.{storage-account-name}.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.{storage-account-name}.dfs.core.windows.net {{secrets/secret-scope/client-id}}
fs.azure.account.oauth2.client.secret.{storage-account-name}.dfs.core.windows.net {{secrets/secret-scope/client-secret}}
fs.azure.account.oauth2.client.endpoint.{storage-account-name}.dfs.core.windows.net {{secrets/secret-scope/tenant-endpoint}}

[Image: Spark config section of the cluster settings with the OAuth properties]

Notice that this configuration references Databricks secrets, so you don’t need to hardcode any credential values. You can also use any other authentication method, such as SAS tokens or account keys.
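For clusters defined through the API, the same properties go under the cluster's spark_conf. A sketch, assuming a hypothetical secret scope and key names; Databricks resolves the {{secrets/...}} references when the cluster starts, so the values never appear in plain text:

new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 1,
    "spark_conf": {
        "fs.azure.account.auth.type.mystorageaccount.dfs.core.windows.net": "OAuth",
        "fs.azure.account.oauth.provider.type.mystorageaccount.dfs.core.windows.net":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        # Secret scope and key names below are placeholders
        "fs.azure.account.oauth2.client.id.mystorageaccount.dfs.core.windows.net":
            "{{secrets/secret-scope/client-id}}",
        "fs.azure.account.oauth2.client.secret.mystorageaccount.dfs.core.windows.net":
            "{{secrets/secret-scope/client-secret}}",
        "fs.azure.account.oauth2.client.endpoint.mystorageaccount.dfs.core.windows.net":
            "{{secrets/secret-scope/tenant-endpoint}}",
    },
}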

  • This approach also works for AWS S3 buckets, with the corresponding URI and credentials configuration; see the sketch below.
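A hypothetical AWS equivalent of the task above, with the wheel in an S3 bucket and the cluster reading it through an instance profile; the bucket, paths and ARN are all placeholders:

task = {
    "task_key": "main",
    "python_wheel_task": {"package_name": "mypkg", "entry_point": "main"},
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "m5.large",
        "num_workers": 1,
        # The instance profile grants the cluster read access to the bucket
        "aws_attributes": {
            "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/libs-reader"
        },
    },
    "libraries": [{"whl": "s3://my-bucket/libraries/mypkg-0.1.0-py3-none-any.whl"}],
}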