How to access Azure ADLS Gen 2 with Spark RDD in Scala
Introduction
There are moments when you need to fall back to the Spark RDD API to access some data, because the DataFrame API isn’t enough.
For this you would normally use the existing “sparkContext”. But if you have tried to access data on an Azure Storage Account (ADLS Gen 2) directly with it, you may have seen the following error:
// Setup access to Storage
spark.conf.set(s"fs.azure.account.key.${storageAccountName}.dfs.core.windows.net", accessKey)
// Read data
val path = s"abfss://${container}@${storageAccountName}.dfs.core.windows.net/${filePath}"
println(spark.sparkContext.textFile(path).count())
// Error Message: Configuration property <storageAccountName>.dfs.core.windows.net not found.
This is because spark.conf.set doesn’t affect the sparkContext: the RDD API reads its settings from the Hadoop configuration, so you need to define the access differently. Set the same properties, but through “spark.sparkContext.hadoopConfiguration.set”, and it will work:
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.key.${storageAccountName}.dfs.core.windows.net", accessKey)
val path = s"abfss://${container}@${storageAccountName}.dfs.core.windows.net/${filePath}"
println(spark.sparkContext.textFile(path).count())
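If you don’t want to maintain every property twice, you can also mirror the relevant settings from the session configuration into the Hadoop configuration. Here is a minimal sketch; the helper name mirrorAzureConfToHadoop is mine, not a Spark API:
import org.apache.spark.sql.SparkSession
// Hypothetical helper: copies every fs.azure.* property already set via
// spark.conf into the Hadoop configuration used by the RDD API.
def mirrorAzureConfToHadoop(spark: SparkSession): Unit =
  spark.conf.getAll
    .filter { case (key, _) => key.startsWith("fs.azure.") }
    .foreach { case (key, value) =>
      spark.sparkContext.hadoopConfiguration.set(key, value)
    }
// Call it after all the spark.conf.set calls, before using the RDD API
mirrorAzureConfToHadoop(spark)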
The same pattern applies to the different access methods:
- SAS
// Dataframe API Config
spark.conf.set(s"fs.azure.account.auth.type.${storageAccountName}.dfs.core.windows.net", "SAS")
spark.conf.set(s"fs.azure.sas.token.provider.type.${storageAccountName}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set(s"fs.azure.sas.fixed.token.${storageAccountName}.dfs.core.windows.net", sasToken)
// RDD config
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.auth.type.${storageAccountName}.dfs.core.windows.net", "SAS")
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.sas.token.provider.type.${storageAccountName}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.sas.fixed.token.${storageAccountName}.dfs.core.windows.net", sasToken)
- Account Key
// Dataframe API Config
spark.conf.set(s"fs.azure.account.key.${storageAccountName}.dfs.core.windows.net", accessKey)
// RDD config
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.key.${storageAccountName}.dfs.core.windows.net", accessKey)
- Service Principal
// Dataframe API Config
spark.conf.set(s"fs.azure.account.auth.type.${storageAccountName}.dfs.core.windows.net", "OAuth")
spark.conf.set(s"fs.azure.account.oauth.provider.type.${storageAccountName}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(s"fs.azure.account.oauth2.client.id.${storageAccountName}.dfs.core.windows.net", clientId)
spark.conf.set(s"fs.azure.account.oauth2.client.secret.${storageAccountName}.dfs.core.windows.net", clientSecret)
spark.conf.set(s"fs.azure.account.oauth2.client.endpoint.${storageAccountName}.dfs.core.windows.net", s"https://login.microsoftonline.com/${tenantId}/oauth2/token")
// RDD config
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.auth.type.${storageAccountName}.dfs.core.windows.net", "OAuth")
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.oauth.provider.type.${storageAccountName}.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.oauth2.client.id.${storageAccountName}.dfs.core.windows.net", clientId)
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.oauth2.client.secret.${storageAccountName}.dfs.core.windows.net", clientSecret)
spark.sparkContext.hadoopConfiguration.set(s"fs.azure.account.oauth2.client.endpoint.${storageAccountName}.dfs.core.windows.net", s"https://login.microsoftonline.com/${tenantId}/oauth2/token")