Group output files from a Spark dataframe
Problem
You want to write a dataframe out as a .csv (or any other format) so you can open it in another application. When you do, Spark by default writes the output split across many small files, typically one file per partition.
Solution
To change this behavior and produce a single output file, you can repartition the data into 1 (or n) partitions and get that many output files:
df.coalesce(1).write.format("csv").save("/path/...")
// or
df.repartition(1).write.format("csv").save("/path/...")
note
Technically, coalesce is the better choice when reducing the number of partitions, because it merges existing partitions without a full shuffle, but both options will give you a single file with the content.
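For illustration, here is a minimal, self-contained sketch contrasting the two calls. The local master, the sample data, and the /tmp output paths are assumptions for the example, not part of the original snippet:

import org.apache.spark.sql.SparkSession

object SingleFileWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-file-write")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative dataframe; in practice this would be your own data.
    val df = (1 to 1000).toDF("id")

    // coalesce(1) merges the existing partitions without a full shuffle:
    // cheaper, but one task ends up doing all of the write work.
    df.coalesce(1).write.format("csv").mode("overwrite").save("/tmp/out-coalesce")

    // repartition(1) performs a full shuffle before writing, which costs
    // more I/O but produces the same single output file.
    df.repartition(1).write.format("csv").mode("overwrite").save("/tmp/out-repartition")

    spark.stop()
  }
}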
caution
Keep in mind that collapsing to one partition forces a single task to process all of the data, so you can blow up an executor if the output file is too big to fit into its memory.
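If a single file really is too large for one task, a common compromise is to write a small, fixed number of files instead of one. A hedged sketch follows; numFiles = 8 is an illustrative value, not a recommendation from the original:

// Instead of one file, write a handful so that no single task has to
// hold all of the data. Tune numFiles to your data volume.
val numFiles = 8
df.repartition(numFiles).write.format("csv").save("/path/...")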