Tutorial - MySQL data to CSV using Singer.io
Singer.io is an open-source ETL (Extract, Transform, Load) tool designed to make it easier to move data between systems. It provides a set of standard interfaces, connectors, and protocols for building data integration pipelines that extract data from a source system, transform it as necessary, and load it into a target system.
As the official documentation can be quite difficult to follow, this tutorial shows how to build a complete pipeline for extracting data from a MySQL database into CSV files, covering both full and incremental loads.
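Conceptually, a Singer pipeline is just two programs connected by a Unix pipe: a tap reads from the source and writes JSON messages to stdout, and a target reads those messages from stdin and writes them to the destination. Every command in this tutorial follows this shape (the tap and target names here are placeholders):
some-tap --config tap-config.json | some-target --config target-config.json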
Initial setup
Requirements
In order to follow this tutorial you'll need:
- A working MySQL database
- Python 3.10 (for the MySQL tap)
- Python 3.9 (for the CSV target)
- virtualenv
Setup
Create a new folder for the tutorial and move into it (the following commands assume you stay in this folder):
mkdir mysql-csv-example
cd mysql-csv-example
Create a Python environment for the MySQL tap. It's highly recommended to create a separate environment for each tap/target, as they may depend on different versions of some libraries.
python3.10 -m venv tap-mysql
source ./tap-mysql/bin/activate
python3.10 -m pip install tap-mysql
deactivate
Make sure you have deactivated the previous environment before executing the next part.
Now, create a Python environment for the CSV target:
python3.9 -m venv target-csv
source ./target-csv/bin/activate
python3.9 -m pip install poetry
# Clone the repo:
cd target-csv
git clone https://github.com/MeltanoLabs/target-csv.git
cd target-csv
poetry install
deactivate
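Before moving on, you can sanity-check both installations by asking each executable for its help text. This should print usage information without touching any database (exact output varies by version):
./tap-mysql/bin/tap-mysql --help
./target-csv/bin/target-csv --help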
Config files
Every process in Singer.io is driven by config files.
- Config file for the MySQL tap:
Create a file config-mysql.json with the following content, filling in your own values to access the database.
{
  "host": "localhost",
  "port": "3306",
  "user": "root",
  "password": "password",
  "ssl": "false"
}
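If you want to rule out connectivity problems before involving the tap, you can test the same credentials with the mysql command-line client (assuming it is installed; adjust host, port, and user to match your config):
mysql -h localhost -P 3306 -u root -p -e "SELECT 1;"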
- Config for the CSV target:
Create a file config-csv.json. It sets the delimiter of the output file, as well as the output path and the naming of the files.
{
  "delimiter": "#",
  "quotechar": "'",
  "output_path_prefix": "./data/",
  "file_naming_scheme": "{stream_name}-{timestamp}.csv"
}
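With this naming scheme, each run writes one file per stream under ./data/ (the exact timestamp format depends on the target version). It's worth creating the output directory up front, in case the target does not create it automatically:
mkdir -p data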
Running the pipeline
First, we need to run a discovery process to obtain the metadata of all tables:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --discover > properties.json
This will generate a large file (its size depends on the number of tables in your database) containing all the properties needed for the extraction. We need to edit this file so that only the information we actually want is extracted from the database.
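To get a quick overview of what was discovered, you can list the table names in the catalog (this assumes jq is installed):
jq -r '.streams[].table_name' properties.json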
What you want to keep is, basically, one object inside the "streams" array for each table that you want to extract; you can remove all the others (or filter them with jq, as shown after the example below). The Singer.io process will use this file to drive the extraction.
{
  "streams": [
    {
      "tap_stream_id": "stream_name_....",
      "table_name": "table_...",
      "schema": {
        ...
      },
      "stream": "stream_name_....",
      "metadata": [
        ....
      ]
    }
  ]
}
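Instead of trimming the file by hand, you can filter it with jq. This sketch keeps only one stream; "my_table" is a placeholder for the table you want to extract:
jq '{streams: [.streams[] | select(.table_name == "my_table")]}' properties.json > properties-filtered.json
mv properties-filtered.json properties.json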
Running a full extraction
To extract a full snapshot of a table, we need to modify the properties.json file obtained in the previous section. In the stream corresponding to the table you want to extract, inside the metadata object whose "breadcrumb" is empty, you need to add the attributes:
- selected: true
- replication-method: FULL_TABLE
Like in this example:
"metadata": [
{
"breadcrumb": [],
"metadata": {
"selected-by-default": false,
"database-name": "db",
"row-count": 0,
"is-view": false,
"selected": true,
"replication-method": "FULL_TABLE",
"table-key-properties": [
"id"
]
}
},
{ ...
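If you prefer not to edit the JSON by hand, the same change can be scripted with jq (a sketch; "my_table" is again a placeholder for your table name):
jq '(.streams[] | select(.table_name == "my_table") | .metadata[] | select(.breadcrumb == []) | .metadata) += {"selected": true, "replication-method": "FULL_TABLE"}' properties.json > properties-full.json
mv properties-full.json properties.json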
With that change, if you now run the following command, you will get a full extraction of the selected tables:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json | ./target-csv/bin/target-csv --config ./config-csv.json
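If everything went well, the extracted CSV files should appear under the output path configured earlier:
ls ./data/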
Running an incremental extraction
To run incremental extractions, we need to add some additional attributes to the stream in properties.json, the same way we did in the full extraction example. In this case you need to provide:
- selected: true
- replication-method: INCREMENTAL
- replication-key: id
The replication-key is the column used to detect new rows between runs, such as a timestamp or an auto-incrementing identifier.
"metadata": [
{
"breadcrumb": [],
"metadata": {
"selected-by-default": false,
"database-name": "db",
"row-count": 0,
"is-view": false,
"selected": true,
"replication-method": "INCREMENTAL",
"replication-key": "id",
"table-key-properties": [
"id"
]
}
},
{ ...
After modifying the properties.json file, we need to do a first run without any "state":
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json | ./target-csv/bin/target-csv --config ./config-csv.json
This command will output something like this:
{"currently_syncing": null, "bookmarks": {"xxxxxxx": {"replication_key": "id", "version": 1672491271022, "replication_key_value": 119}}}
This is the bookmark metadata the tap will use to compute the next incremental, so we need to save that content to a file, for example state.json.
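You can also capture it automatically instead of copying it by hand. State messages are printed to stdout, so keeping the last line of the output should give you the final state (a sketch; it assumes the target writes nothing but Singer messages to stdout, which is the convention):
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json \
  | ./target-csv/bin/target-csv --config ./config-csv.json \
  | tail -n 1 > state.json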
Now, we can run the process again specifying the state, so it will pick up only the newer data.
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json --state state.json | ./target-csv/bin/target-csv --config ./config-csv.json
This run will output the state again, so you may want to run something like the following to automate replacing the old state with the new one:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json --state state.json \
  | ./target-csv/bin/target-csv --config ./config-csv.json \
  > state1.json \
  && mv state1.json state.json
It's important to note that the filter used to get incrementals is >= on the replication key, so each extraction will contain one overlapping row (the last row of the previous run).
We hope you find this useful!