Exploding arrays in Spark and keeping the index position of each element
Introduction
Sometimes you need to explode an array, that is, move each element of a row's array into its own row.
Function explode
You can achieve this with the explode function that Spark provides. It produces one new row for each element of the array and keeps the rest of the columns as they are.
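import org.apache.spark.sql.DataFrame
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell
val data: DataFrame = Seq(("Example", Array("A", "B", "C", "D")))
  .toDF("c1", "data")
data.show()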
+-------+------------+
|     c1|        data|
+-------+------------+
|Example|[A, B, C, D]|
+-------+------------+
data.selectExpr("c1", "explode(data) as dataExploded").show()
+-------+------------+
|     c1|dataExploded|
+-------+------------+
|Example|           A|
|Example|           B|
|Example|           C|
|Example|           D|
+-------+------------+
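The same result can also be obtained with the Column API instead of a SQL expression string. The following is a minimal sketch, assuming a local SparkSession; the application name and the local master setting are illustrative choices, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-sketch") // hypothetical name
  .getOrCreate()
import spark.implicits._

val data = Seq(("Example", Array("A", "B", "C", "D"))).toDF("c1", "data")

// explode(col("data")) is the Column API counterpart of the "explode(data)" expression.
val exploded = data.select(col("c1"), explode(col("data")).as("dataExploded"))
exploded.show()
```

The Column API version lets the compiler catch misspelled function names, while selectExpr keeps the statement closer to plain SQL; which one to use is largely a matter of taste.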
Function posexplode
However, you may also want to know the position of each element in the array, for example when the order matters and you need that position for sorting later. For that, you can use the “posexplode” function. It adds two columns to your DataFrame: the zero-based position of each element and the element's value:
val data: DataFrame = Seq(("Example", Array("A", "B", "C", "D")))
  .toDF("c1", "data")
data.show()
+-------+------------+
|     c1|        data|
+-------+------------+
|Example|[A, B, C, D]|
+-------+------------+
data.selectExpr("c1", "posexplode(data) as (index, dataExploded)").show()
+-------+-----+------------+
|     c1|index|dataExploded|
+-------+-----+------------+
|Example|    0|           A|
|Example|    1|           B|
|Example|    2|           C|
|Example|    3|           D|
+-------+-----+------------+
Note that in this last example you can alias both columns at once by listing both names inside parentheses. Without the alias, Spark names them pos and col by default.
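As an illustration of why the position is useful, here is a minimal sketch that explodes the array with the Column API, transforms each element, and then rebuilds the array in its original order. It assumes a local SparkSession; the lower-casing step and the intermediate column name "pairs" are hypothetical, chosen only for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, lower, posexplode, sort_array, struct}

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("posexplode-sketch").getOrCreate()
import spark.implicits._

val data = Seq(("Example", Array("A", "B", "C", "D"))).toDF("c1", "data")

// With the Column API, both aliases are passed as a Seq.
val indexed = data.select(
  col("c1"),
  posexplode(col("data")).as(Seq("index", "dataExploded"))
)

// Hypothetical per-element transformation.
val transformed = indexed.withColumn("dataExploded", lower(col("dataExploded")))

// collect_list gives no ordering guarantee, so collect (index, value) structs,
// sort them (structs compare by their first field, the index), then keep the values.
val rebuilt = transformed
  .groupBy("c1")
  .agg(sort_array(collect_list(struct("index", "dataExploded"))).as("pairs"))
  .select(col("c1"), col("pairs.dataExploded").as("data"))

rebuilt.show()
```

Sorting the structs before extracting the values is what makes the rebuilt array independent of how the rows were shuffled during the aggregation.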
Sources
You can find the source code of these functions here: