Introduction to working with and defining Spark Encoders
Introduction
Sometimes it can be useful to work with Datasets in Spark (Scala), as they add an interesting layer of type safety.
However, when working with them you can run into issues with the encoders needed to serialize and deserialize the data.
Those encoders are already in scope if you are working in a Databricks notebook or the Spark shell; elsewhere, they can easily be brought in with the following import:
import spark.implicits._
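For context, in a standalone application those implicits come from a concrete SparkSession instance, so you need to build one first. A minimal sketch (the app name and master URL below are illustrative, not part of the original setup):

import org.apache.spark.sql.SparkSession

object ImplicitsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("encoders-example") // illustrative name
      .master("local[*]")          // local mode, for testing only
      .getOrCreate()

    // Brings the built-in encoders into scope: primitives, tuples, case classes...
    import spark.implicits._

    // Encoder[Int] is now resolved implicitly from the import above
    spark.createDataset(Seq(1, 2, 3)).show()

    spark.stop()
  }
}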
However, there are situations where you don't have access to a SparkSession, and therefore cannot rely on its implicits. For example, suppose you have a case class like this:
case class Person(name: String, age: Int)
If you try to create a Dataset from it:
val data = Seq(Person("p1", 1), Person("p2", 2))
val df = spark.createDataset(data)
You’ll see the following error:
No implicits found for parameter evidence$4: Encoder[Person]
The reason is that there is no Encoder[Person] in scope to make that serialization happen. The quick fix is to define one yourself:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

implicit def personEncoder: Encoder[Person] = ExpressionEncoder()
And the error should go away.
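As a side note, ExpressionEncoder lives under org.apache.spark.sql.catalyst, which is generally treated as an internal package; if you prefer to stay on the public API, the Encoders factory can derive the same encoder for any case class. A minimal alternative sketch:

import org.apache.spark.sql.{Encoder, Encoders}

// Encoders.product derives an encoder for any Product type (case classes, tuples)
implicit def personEncoder: Encoder[Person] = Encoders.product[Person]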
Example
The full code snippet:
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)
implicit def personEncoder: Encoder[Person] = ExpressionEncoder()

val data = Seq(Person("p1", 1), Person("p2", 2))
val df = spark.createDataset(data) // assumes an existing SparkSession named `spark`
df.show()
+----+---+
|name|age|
+----+---+
| p1| 1|
| p2| 2|
+----+---+
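Once the implicit encoder is in scope, typed transformations that produce Person resolve it automatically. A small illustrative follow-up (these transformations are hypothetical, not part of the original example):

// filter preserves the element type, so no new encoder is needed
val adults = df.filter(_.age > 1)

// map back to Person looks up Encoder[Person] implicitly and finds personEncoder
val olderByOne = df.map(p => p.copy(age = p.age + 1))
olderByOne.show()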