Group output files from a Spark dataframe

Problem

You want to write a DataFrame out as a .csv (or any other format) so you can open it in another application. By default, Spark splits the output into many small files, normally one file per partition.
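For instance (a sketch, assuming a `SparkSession` named `spark` is in scope; the output path is illustrative):

```scala
// Sketch: a DataFrame with 4 partitions writes one part file per partition.
// Assumes a SparkSession named `spark`; the path is illustrative.
val df = spark.range(0, 1000).toDF("id").repartition(4)

df.write.format("csv").save("/tmp/output")
// The target directory now holds four part files (part-00000 ... part-00003),
// plus Spark's _SUCCESS marker file.
```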

Solution

To change this behavior and output a single file, repartition the data from its current partitions into 1 (or n), and you will get that number of output files:

df.coalesce(1).write.format("csv").save("/path/...")
// or
df.repartition(1).write.format("csv").save("/path/...")

note

Technically, coalesce is better when you are reducing the number of partitions, since it avoids a full shuffle, but both options will give you a single file with the content.

caution

Take into account that forcing a single partition can blow up an executor if the output is too big, since all the data has to fit into a single task's memory.
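If a single file would be too large, an alternative is to cap the number of records per output file instead of collapsing everything into one partition. A sketch (the `maxRecordsPerFile` write option exists in Spark 2.2+; the path and limit here are illustrative):

```scala
// Sketch: keep the existing parallelism but cap each output file's size,
// instead of forcing a single partition that may not fit in memory.
// Assumes a DataFrame `df`; the path and record limit are illustrative.
df.write
  .option("maxRecordsPerFile", 1000000)
  .format("csv")
  .save("/tmp/output-capped")
```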