Reading SAS Files with Spark
Introduction
If you need to read SAS7BDAT files with Spark, there aren't many options available.
The spark-sas7bdat library, which hasn't been maintained for a few years, works in many cases, but I have found files it cannot read, and it produces no useful errors to help debug the failure. So use it with caution.
GitHub - saurfang/spark-sas7bdat: Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
How to use the library
Configuration
Add the following dependency to your project:
- Maven configuration:
```xml
<repositories>
  <!-- Other repositories -->
  <repository>
    <id>spark-packages.org</id>
    <name>Spark Package Maven2 Repository</name>
    <url>https://repos.spark-packages.org/</url>
  </repository>
</repositories>

<dependencies>
  <!-- Other dependencies -->
  <!-- https://mvnrepository.com/artifact/saurfang/spark-sas7bdat -->
  <dependency>
    <groupId>saurfang</groupId>
    <artifactId>spark-sas7bdat</artifactId>
    <version>3.0.0-s_2.12</version>
  </dependency>
</dependencies>
```
- SBT Configuration
```scala
resolvers += "Spark Package Maven2 Repository" at "https://repos.spark-packages.org"

libraryDependencies += "saurfang" % "spark-sas7bdat" % "3.0.0-s_2.12"
```
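If you prefer not to modify the build file, the package can also be pulled in at launch time with the `--packages` flag. This is a sketch using the same coordinates as the Maven/SBT entries above:

```shell
# Resolve the package from the spark-packages repository at startup.
spark-shell --packages saurfang:spark-sas7bdat:3.0.0-s_2.12 \
  --repositories https://repos.spark-packages.org
```

The same flags work with `spark-submit`.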
Usage
Simply use the new data source format:

```scala
val data = spark.read.format("com.github.saurfang.sas.spark").load("/path/to/file/my_file.sas7bdat")
```
Or the implicit version:

```scala
import com.github.saurfang.sas.spark._
val df = spark.read.sas("/path/to/file/my_file.sas7bdat")
```
Problems with cloud storage
If you work in a cloud environment such as Databricks and your data lives in cloud storage like ADLS Gen2 or AWS S3, one of the main issues you will face is that you may not be able to read the files: the library fails with a "Configuration property not found" error. This happens even if you have set up `spark.conf` with the right configuration to access the storage account.
For example, here we use the storage account key to configure access to the storage account, but it doesn't work when reading through the library.
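A minimal sketch of that failing setup (the storage account, container, and key below are placeholder values, not real credentials):

```scala
// Session-level configuration: works for the built-in DataFrame readers
// (parquet, csv, ...) but is NOT picked up by the sas7bdat library.
// "mystorageaccount", "mycontainer" and the key are placeholders.
spark.conf.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-key>"
)

val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .load("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/my_file.sas7bdat")
// Fails with a "Configuration property not found" error.
```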
Solution
This happens because the "Spark SAS Data Source (sas7bdat)" library reads the data through the RDD API, and RDD code does not see properties set with `spark.conf.set`. Instead, we need to set the properties on the Hadoop configuration, using the `spark.sparkContext.hadoopConfiguration.set` method.
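A sketch of the fix, with the same placeholder account and container names as before: set the property on the SparkContext's Hadoop configuration so the RDD-based reader can see it.

```scala
// Hadoop-level configuration: visible to RDD-based readers such as spark-sas7bdat.
// "mystorageaccount", "mycontainer" and the key are placeholders.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-key>"
)

import com.github.saurfang.sas.spark._
val df = spark.read.sas("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/my_file.sas7bdat")
```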
You can read more about RDD cloud storage access here: [Accessing ADLS Gen 2 with RDD](docs/spark/cloud/accessing-adls-gen-2-with-rdd.md)