
Exploding arrays in Spark and keeping the index position of each element

Introduction

Sometimes you need to explode an array, that is, move each element of a row's array into its own row.

Function explode

You can do this with the explode function that Spark provides. It produces one new row for each element of the array and keeps the remaining columns unchanged.


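import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`, as in spark-shell
val data: DataFrame = Seq(("Example", Array("A", "B", "C", "D")))
  .toDF("c1", "data")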

data.show()

+-------+------------+
|     c1|        data|
+-------+------------+
|Example|[A, B, C, D]|
+-------+------------+

data.selectExpr("c1", "explode(data) as dataExploded").show()

+-------+------------+
|     c1|dataExploded|
+-------+------------+
|Example|           A|
|Example|           B|
|Example|           C|
|Example|           D|
+-------+------------+
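
The same transformation can also be written with the typed DataFrame API instead of a SQL expression string. The following is a minimal sketch, not the only way to do it: the object name ExplodeSketch, the local[1] master, and the app name are assumptions made so the example is self-contained. In spark-shell you would reuse the existing spark session instead of building one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSketch {
  // Local session for illustration only (assumption); spark-shell already provides `spark`.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("explode-sketch")
    .getOrCreate()

  // Builds the sample DataFrame and explodes the "data" array into one row per element.
  def exploded(): DataFrame = {
    import spark.implicits._ // brings toDF into scope
    val data = Seq(("Example", Array("A", "B", "C", "D"))).toDF("c1", "data")
    // Equivalent to selectExpr("c1", "explode(data) as dataExploded")
    data.select(col("c1"), explode(col("data")).as("dataExploded"))
  }

  def main(args: Array[String]): Unit = {
    exploded().show()
    spark.stop()
  }
}
```

Using col and explode from org.apache.spark.sql.functions means that a misspelled function name fails at compile time, whereas a typo inside a selectExpr string only fails when the query is analyzed.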

Function posexplode

However, when order matters, you may also need the position of each element in the array, for example to sort by it later. For that, use the posexplode function. It adds two columns to your DataFrame: the zero-based position of each element and the element's value:

val data: DataFrame = Seq(("Example", Array("A", "B", "C", "D")))
  .toDF("c1", "data")

data.show()

+-------+------------+
|     c1|        data|
+-------+------------+
|Example|[A, B, C, D]|
+-------+------------+

data.selectExpr("c1", "posexplode(data) as (index, dataExploded)").show()

+-------+-----+------------+
|     c1|index|dataExploded|
+-------+-----+------------+
|Example|    0|           A|
|Example|    1|           B|
|Example|    2|           C|
|Example|    3|           D|
+-------+-----+------------+
Tip

In this last example, you can alias both generated columns at once by listing both names inside parentheses.
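
With the DataFrame API, both columns can be aliased in a similar way by passing a sequence of names to Column.as. The sketch below also sorts by the generated index, which is the kind of ordering the position is useful for. The object name PosExplodeSketch and the local session settings are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, posexplode}

object PosExplodeSketch {
  // Local session for illustration only (assumption); spark-shell already provides `spark`.
  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("posexplode-sketch")
    .getOrCreate()

  // Explodes "data" into (index, dataExploded) pairs and sorts by the original position.
  def explodedWithIndex(): DataFrame = {
    import spark.implicits._ // brings toDF into scope
    val data = Seq(("Example", Array("A", "B", "C", "D"))).toDF("c1", "data")
    data
      // as(Seq(...)) names both columns produced by the generator function
      .select(col("c1"), posexplode(col("data")).as(Seq("index", "dataExploded")))
      // the index column makes the original array order explicit and sortable
      .orderBy(col("c1"), col("index"))
  }

  def main(args: Array[String]): Unit = {
    explodedWithIndex().show()
    spark.stop()
  }
}
```

The explicit orderBy matters once the exploded rows go through a shuffle (a join or an aggregation, for example), since row order is not guaranteed after that point and the index is what lets you restore it.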


Sources

You can find the source code of these functions here:

spark/sql/functions.scala - explode

spark/sql/functions.scala - posexplode