
Using TestContainers in Scala with Spark for real unit testing


TestContainers is a fascinating technology that lets you spin up Docker containers programmatically, so you can recreate real scenarios in your tests using actual infrastructure components, such as databases, instead of mocks.

You can find the project here: https://testcontainers.com/

Introduction

In this quick post, we're going to set up a Scala project using SBT, and write a simple test that reads from a PostgreSQL database (created through TestContainers) using Spark.

If you just want to see the code, here it is ;) https://github.com/HyperCodeLab/scala-testing/blob/main/src/test/scala/com/hypercodelab/testcontainers/SparkDatabaseExtractionTest.scala

Requirements

  • The host where you run the tests needs to have Docker running
  • Java 8 / 11
  • Scala 2.12

Configuration

For the build.sbt file, we're just going to add the Spark dependencies and the TestContainers one.

  • There is also a Scala wrapper for TestContainers, but for this example I’m going to use the Java version, as it is more actively developed.
ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "2.12.12"

lazy val root = (project in file("."))
  .settings(
    name := "scala-testing"
  )

libraryDependencies ++= Seq(
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.12.2",

  // Spark libraries
  "org.apache.spark" %% "spark-core" % "3.4.0",
  "org.apache.spark" %% "spark-sql" % "3.4.0",

  // PostgreSQL JDBC driver
  "org.postgresql" % "postgresql" % "42.6.0",

  // Testing libraries
  "org.scalatest" %% "scalatest" % "3.2.17" % "test",

  // TestContainers
  "org.testcontainers" % "postgresql" % "1.18.0" % "test"
)

Writing the test

Now, in the test folder of our project (scala-testing/src/test/scala/com/hypercodelab/testcontainers/), we are going to create a new test with the following code:

  • First, we create a test class and mix in the traits we need (AnyFlatSpec and Matchers in this case). This will also work with JUnit…
  • Inside the class definition, let's also create a SparkSession. You can do it like this, or just extend from a SparkSession provider trait (a minimal sketch follows the snippet below).
  • We also need spark.implicits._ to have the encoders for converting between DataFrames and Datasets.
import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class SparkDatabaseExtractionTest extends AnyFlatSpec with Matchers {

  val spark = SparkSession.builder().master("local").getOrCreate()
  import spark.implicits._

  ...

}
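
As a side note, such a SparkSession provider trait could look roughly like the sketch below. The trait name SparkSessionProvider and its settings are an assumption for illustration, not part of the original project:

import org.apache.spark.sql.SparkSession

// Hypothetical helper trait (the name and settings here are illustrative,
// not part of the original project)
trait SparkSessionProvider {
  lazy val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .appName("scala-testing")
    .getOrCreate()
}

// Usage:
// class SparkDatabaseExtractionTest extends AnyFlatSpec
//   with Matchers with SparkSessionProvider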

Now, inside the class, let’s define a test that reads from a database and asserts its content. Just for simplicity, let’s assume that we have this table in the database:

Name    Version
Python  3.12
Scala   2.12

If you have been working with Spark for some time, the following code should look very familiar:

  • Defining a case class to convert our DataFrame to a Dataset, so it will be easier to assert its content.
  • Creating a DataFrame from a JDBC connection.
  • Collecting the DataFrame into an array.
  • Asserting the content of the array against the expected output.

// Modeling the table as a case class
case class ProgrammingLanguage(name: String, version: String)

class SparkDatabaseExtractionTest ...

  it should "Extract data from a PostgreSQL database" in {

    val tableName: String = "programming_list"

    val data: Array[ProgrammingLanguage] = spark
      .read
      .format("jdbc")
      .option("url", "")
      .option("user", "")
      .option("password", "")
      .option("dbtable", tableName)
      .load()
      .as[ProgrammingLanguage]
      .collect()

    // Assert output
    data should contain allOf (
      ProgrammingLanguage("Python", "3.12"),
      ProgrammingLanguage("Scala", "2.12")
    )

  }

}

So the question is… how do we test this database connection? What values do we use for the url, user, and password? In some cases you may be able to connect to a real database just for testing, but that won't always be possible, and such a database would also need to be available to every developer on your team.

This is where TestContainers comes into play. It allows us to spin up a database instance with exactly the data we need, just for this test. And as you will see in a moment, it only takes a few lines of code, so there is no need to deal with infrastructure at all.

Let’s add the code we need to run that container. Inside the test definition, before what we wrote previously, let’s:

  • Define a PostgreSQL container using version 16.1. You could also use latest, just as you normally would in Docker.
  • Specify a .sql file with the queries we want to execute before the test runs, so the database is populated with the right data for the test.
  • Start the container, and stop it at the end. You could move this code to a beforeAll / afterAll so it runs at the start and end of the test suite (see the sketch after this snippet).
it should "Extract data from a PostgreSQL database" in {

  // Start container
  val postgresContainer = new PostgreSQLContainer("postgres:16.1")
  postgresContainer.withCopyFileToContainer(
    MountableFile.forClasspathResource("init-dbt.sql"),
    "/docker-entrypoint-initdb.d/")
  postgresContainer.start()

  // Test
  ...

  // Stop container
  postgresContainer.stop()
}
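
If you prefer the beforeAll / afterAll approach, a minimal sketch could look like this. The exact layout is my own assumption; the container and init script are the same as above:

import org.scalatest.BeforeAndAfterAll
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.testcontainers.containers.PostgreSQLContainer
import org.testcontainers.utility.MountableFile

class SparkDatabaseExtractionTest extends AnyFlatSpec with Matchers with BeforeAndAfterAll {

  // One container shared by all tests in this suite
  val postgresContainer = new PostgreSQLContainer("postgres:16.1")

  // Runs once before any test: copy the init script and start the container
  override def beforeAll(): Unit = {
    postgresContainer.withCopyFileToContainer(
      MountableFile.forClasspathResource("init-dbt.sql"),
      "/docker-entrypoint-initdb.d/")
    postgresContainer.start()
  }

  // Runs once after all tests: stop the container
  override def afterAll(): Unit =
    postgresContainer.stop()

  // ... tests go here, using postgresContainer.getJdbcUrl etc. ...
}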

As for the init-dbt.sql file, you can place it under src/test/resources/. The database will run this file at startup, so it ends up in the state we need for the test.

create table programming_list (
  name varchar not null,
  version varchar not null
);

insert into programming_list values ('Python', '3.12');
insert into programming_list values ('Scala', '2.12');

And that’s it, isn’t it amazing? 😄

The only thing left is to specify the JDBC URL and the user/password. Since those are generated dynamically, the postgresContainer instance gives us exactly what we need:

// In the spark read
.option("url", postgresContainer.getJdbcUrl)
.option("user", postgresContainer.getUsername)
.option("password", postgresContainer.getPassword)
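
Putting the pieces together (this fragment sits inside the test body, where spark, tableName, and postgresContainer are all in scope), the read from earlier becomes:

// Same read as before, with the connection details now taken
// from the running container instead of hard-coded values
val data: Array[ProgrammingLanguage] = spark
  .read
  .format("jdbc")
  .option("url", postgresContainer.getJdbcUrl)
  .option("user", postgresContainer.getUsername)
  .option("password", postgresContainer.getPassword)
  .option("dbtable", tableName)
  .load()
  .as[ProgrammingLanguage]
  .collect()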

If you run the test now, Testcontainers will first look for a running Docker service and then spin up the container. Adding a .show() on the DataFrame, we can see the data and the test passing.


The full code of this can be found here: https://github.com/HyperCodeLab/scala-testing/blob/main/src/test/scala/com/hypercodelab/testcontainers/SparkDatabaseExtractionTest.scala

Thanks for reading!