How to read from Databases in Apache Spark
Introduction
Spark allows you to connect to a Database and read the data from it. For that you’ll need to setup a few things first.
Configuration
In order to connect to the database you need:
jdbc url: this is the url that will point to the database. Following this pattern you'll need:
- driver: database type.
- jdbcHostname: ip address or url that points to the database.
- jdbcPort: database port.
- jdbcDatabase: database name.
Example:
val jdbcUrl = s"jdbc:${driver}://${jdbcHostname}:${jdbcPort};database=${databaseName}"
You can include the user and password in this url or pass them as a Properties / Dictionary to spark as you’ll see in the following code snippets.
Note: This jdbc urls are database dependent. All of them follow the same pattern, but they may differ in some attributes, for example:
// Oracle
val jdbcUrl = s"jdbc:oracle:thin:${userName}/${password}@${jdbcHostname}:${jdbcPort}/${serviceName}"
// Postgres
val jdbcUrl = s"jdbc:postgresql://${jdbcHostname}:${jdbcPort}/${databaseName}?user=${userName}&password=${password}&sslmode=require"
// SQL Server
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${databaseName};user=${userName};password=${password}"
Reading data
There are several ways of setting up the read method:
// Without providing User and Password in the jdbcUrl
import java.util.Properties
val connectionProperties =new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
val data = spark.read.jdbc(jdbcUrl, "tableName", connectionProperties)
// Providing the User and Password in the jdbcUrl
val data = spark.read.jdbc(jdbcUrl, "tableName")
You can also chain all the options:
val data = spark
.read
.format("jdbc")
.option("url", jdbcUrl)
.option("dbtable", tableName)
.option("user", username)
.option("password", password)
.load()
Some interesting configurations that you can add to the read method are
You can specify a Select Statement to run in the database and get that result back. You can’t use this option if you have specified “dbtable”.
.option("query", selectStatement)
Fetchsize determines how many rows to fetch per round trip. This can help performance on JDBC drivers which has a low default fetch size (e.g. Oracle with 10 rows).
.option("fetchsize", 10000)