Reading SAS Files with Spark
Introduction
If you need to read SAS7BDAT files with Spark, there aren't many options available.
The spark-sas7bdat library, which hasn't been maintained for a few years, works in many cases, but I have found files it cannot read, and it produces no useful errors to help debug the failure. So use it with caution.
GitHub - saurfang/spark-sas7bdat: Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
How to use the library
Configuration
Add the following dependency to your project:
- Maven configuration:
```xml
<repositories>
  <!-- Other repositories -->
  <repository>
    <id>spark-packages.org</id>
    <name>Spark Package Maven2 Repository</name>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>

<dependencies>
  <!-- Other dependencies -->
  <!-- https://mvnrepository.com/artifact/saurfang/spark-sas7bdat -->
  <dependency>
    <groupId>saurfang</groupId>
    <artifactId>spark-sas7bdat</artifactId>
    <version>3.0.0-s_2.12</version>
  </dependency>
</dependencies>
```
- SBT Configuration
```scala
resolvers += "Spark Package Maven2 Repository" at "https://repos.spark-packages.org"

libraryDependencies += "saurfang" % "spark-sas7bdat" % "3.0.0-s_2.12"
```
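If you prefer not to modify the build file, the package can also be pulled in at launch time with the `--packages` flag. This is a sketch using the same coordinates as the Maven/SBT entries above:

```shell
# Resolve the package from the spark-packages repository at startup.
spark-shell --packages saurfang:spark-sas7bdat:3.0.0-s_2.12 \
  --repositories https://repos.spark-packages.org
```

The same flags work with `spark-submit`.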
Usage
Simply use the new data source format:

```scala
val data = spark.read.format("com.github.saurfang.sas.spark").load("/path/to/file/my_file.sas7bdat")
```
Or the implicit version:

```scala
import com.github.saurfang.sas.spark._
val df = spark.read.sas("/path/to/file/my_file.sas7bdat")
```
Problems with cloud storage
If you work in a cloud environment such as Databricks and your data lives in cloud storage like ADLS Gen2 or AWS S3, one of the main issues you will face is that you may not be able to read the files: the library fails with a "Configuration property not found" error. This happens even if you have set up `spark.conf` with the right configuration to access the storage account.
For example, here we use the storage account key to configure access to the storage account, but it doesn't work when reading through the library.
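A minimal sketch of that failing setup (the storage account, container, and key below are placeholder values, not real credentials):

```scala
// Session-level configuration: works for the built-in DataFrame readers
// (parquet, csv, ...) but is NOT picked up by the sas7bdat library.
// "mystorageaccount", "mycontainer" and the key are placeholders.
spark.conf.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-key>"
)

val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/my_file.sas7bdat")
// Fails with a "Configuration property not found" error.
```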
Solution
This happens because the "Spark SAS Data Source (sas7bdat)" library reads the data through the RDD API, and RDD code does not see properties set with `spark.conf.set`. Instead, we need to set the properties on the Hadoop configuration, using the `spark.sparkContext.hadoopConfiguration.set` method.
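A sketch of the fix, with the same placeholder account and container names as before: set the property on the SparkContext's Hadoop configuration so the RDD-based reader can see it.

```scala
// Hadoop-level configuration: visible to RDD-based readers such as spark-sas7bdat.
// "mystorageaccount", "mycontainer" and the key are placeholders.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-key>"
)

import com.github.saurfang.sas.spark._
val df = spark.read.sas("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/my_file.sas7bdat")
```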
You can read more about RDD cloud storage access here: [Accessing ADLS Gen 2 with RDD](docs/spark/cloud/accessing-adls-gen-2-with-rdd.md)