Introduction to working with and defining Spark Encoders
Introduction
Sometimes it can be useful to work with Datasets in Spark (Scala), as they add an interesting layer of type safety.
However, when working with them you can run into issues with the encoders needed to serialize and deserialize the data.
Those encoders are already in scope if you are working in a Databricks notebook or the Spark shell; elsewhere, they can easily be brought in with the following import:
import spark.implicits._
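For context, in a standalone application those implicits come from a concrete SparkSession instance, so you need to build one first. A minimal sketch (the app name and master URL below are illustrative, not part of the original setup):

import org.apache.spark.sql.SparkSession

object ImplicitsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoders-example") // illustrative name
      .master("local[*]")          // local mode, for testing only
      .getOrCreate()

    // Brings the built-in encoders into scope: primitives, tuples, case classes...
    import spark.implicits._

    // Encoder[Int] is now resolved implicitly from the import above
    spark.createDataset(Seq(1, 2, 3)).show()

    spark.stop()
  }
}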
However, there are situations where you don't have access to a SparkSession, and therefore cannot rely on its implicits. For example, suppose you have a case class like this:
case class Person(name: String, age: Int)
If you try to create a Dataset from it:
val data = Seq(Person("p1", 1), Person("p2", 2))
val df = spark.createDataset(data)
You’ll see the following error:
No implicits found for parameter evidence$4: Encoder[Person]
The reason is that there is no Encoder[Person] in scope to make that serialization happen. The quick fix is to define one yourself:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

implicit def personEncoder: Encoder[Person] = ExpressionEncoder()
And the error should go away.
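As a side note, ExpressionEncoder lives under org.apache.spark.sql.catalyst, which is generally treated as an internal package; if you prefer to stay on the public API, the Encoders factory can derive the same encoder for any case class. A minimal alternative sketch:

import org.apache.spark.sql.{Encoder, Encoders}

// Encoders.product derives an encoder for any Product type (case classes, tuples)
implicit def personEncoder: Encoder[Person] = Encoders.product[Person]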
Example
The full code snippet:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)
implicit def personEncoder: Encoder[Person] = ExpressionEncoder()

val data = Seq(Person("p1", 1), Person("p2", 2))
val df = spark.createDataset(data) // assumes an existing SparkSession named `spark`
df.show()
+----+---+
|name|age|
+----+---+
| p1| 1|
| p2| 2|
+----+---+
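Once the implicit encoder is in scope, typed transformations that produce Person resolve it automatically. A small illustrative follow-up (these transformations are hypothetical, not part of the original example):

// filter preserves the element type, so no new encoder is needed
val adults = df.filter(_.age > 1)

// map back to Person looks up Encoder[Person] implicitly and finds personEncoder
val olderByOne = df.map(p => p.copy(age = p.age + 1))
olderByOne.show()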