Reading incremental files from a directory without Spark Streaming

Problem

You are working with Spark / Databricks and want to read new files as they arrive at a location, but you don't want to set up a streaming solution or deal with Structured Streaming's internal checkpoint mechanism.

Solution

The quickest solution would be a Spark Structured Streaming job with a Trigger.once() trigger, run from time to time so that each execution only processes the new subset of data. That approach, however, forces you to set up a Spark checkpoint, which is exactly what we want to avoid here in favor of our own solution.
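For comparison, here is a minimal sketch of that streaming alternative in PySpark (the schema, checkpoint, and output paths are hypothetical placeholders):

# Structured Streaming with a run-once trigger. Spark tracks which files
# were already processed in the checkpoint directory, which is the
# bookkeeping this article wants to avoid.
stream = (spark.readStream
    .format("csv")
    .schema(input_schema)  # streaming file sources require an explicit schema
    .load(pathToFolder))

(stream.writeStream
    .format("parquet")
    .option("checkpointLocation", checkpointPath)  # mandatory checkpoint
    .trigger(once=True)  # equivalent of Trigger.once() in Scala
    .start(outputPath))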

Spark's read methods accept two options for this:

  • modifiedAfter
  • modifiedBefore

These options filter the files in the source folder by their modification timestamp. The values must be in the form YYYY-MM-DDTHH:mm:ss; unless a timeZone option is set, they are interpreted in the Spark session time zone (spark.sql.session.timeZone).

Code Example

spark.read.format("csv")
.option("modifiedAfter", "2023-01-01T00:00:00")
.option("modifiedBefore", "2023-02-01T00:00:00")
.load(pathToFolder)

With this code, the job only reads files that have been modified after 2023-01-01 and before 2023-02-01. You can therefore store the last-read timestamp somewhere and, on each run, use the current timestamp as the new upper bound to pick up only the data that has arrived in the folder since.
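As a minimal sketch of that bookkeeping, assuming the watermark can be persisted in a small text file (the state-file path and function name below are hypothetical):

from datetime import datetime, timezone

STATE_FILE = "/dbfs/state/last_read_ts.txt"  # hypothetical watermark location

def read_new_files(spark, pathToFolder):
    # Load the previous watermark; fall back to the epoch on the first run.
    try:
        with open(STATE_FILE) as f:
            last_ts = f.read().strip()
    except FileNotFoundError:
        last_ts = "1970-01-01T00:00:00"

    # Assumes spark.sql.session.timeZone is UTC, so the string matches
    # how Spark interprets the modifiedAfter / modifiedBefore values.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")

    # Only files modified after the previous run and up to "now" are read.
    df = (spark.read.format("csv")
        .option("modifiedAfter", last_ts)
        .option("modifiedBefore", now)
        .load(pathToFolder))

    # Persist the new watermark (in production you would do this only
    # after the downstream write has succeeded).
    with open(STATE_FILE, "w") as f:
        f.write(now)

    return df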

Be aware that this solution is based on the file modification date: if a file is modified after it has been read, it will be read again.

Also, this solution does not work reliably on local storage; it is intended for cloud object stores / HDFS.