Tutorial - MySQL data to CSV using Singer.io
Singer.io is an open-source ETL (Extract, Transform, Load) tool designed to make it easier to move data between systems. It provides a set of standard interfaces, connectors, and protocols for building data integration pipelines that extract data from a source system, transform it as necessary, and load it into a target system.
As the official documentation can be quite difficult to follow, this tutorial shows how to build a complete pipeline for extracting data from a MySQL database into CSV files, covering both full and incremental loads.
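Conceptually, a Singer pipeline is just two programs connected by a Unix pipe: a tap reads from the source and writes JSON messages to stdout, and a target reads those messages from stdin and writes them to the destination. Every command in this tutorial follows this shape (the tap and target names here are placeholders):
some-tap --config tap-config.json | some-target --config target-config.json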
Initial setup
Requirements
In order to follow this tutorial you'll need:
- A working MySQL database
- Python 3.10 (for the MySQL tap)
- Python 3.9 (for the CSV target)
- virtualenv
Setup
Create a new folder for the tutorial and move into it (the following commands assume you stay in this folder):
mkdir mysql-csv-example
cd mysql-csv-example
Create a Python environment for the MySQL tap. It's highly recommended to create a separate environment for each tap/target, as they may depend on different versions of some libraries.
python3.10 -m venv tap-mysql
source ./tap-mysql/bin/activate
python3.10 -m pip install tap-mysql
deactivate
Make sure you have deactivated the previous environment before executing the next part.
Now, create a Python environment for the CSV target:
python3.9 -m venv target-csv
source ./target-csv/bin/activate
python3.9 -m pip install poetry
# Clone the repo:
cd target-csv
git clone https://github.com/MeltanoLabs/target-csv.git
cd target-csv
poetry install
deactivate
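Before moving on, you can sanity-check both installations by asking each executable for its help text. This should print usage information without touching any database (exact output varies by version):
./tap-mysql/bin/tap-mysql --help
./target-csv/bin/target-csv --help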
Config files
Every process in Singer.io is driven by config files.
- Config file for the MySQL tap:
Create a file config-mysql.json with the following content, filling in your own values to access the database.
{
  "host": "localhost",
  "port": "3306",
  "user": "root",
  "password": "password",
  "ssl": "false"
}
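If you want to rule out connectivity problems before involving the tap, you can test the same credentials with the mysql command-line client (assuming it is installed; adjust host, port, and user to match your config):
mysql -h localhost -P 3306 -u root -p -e "SELECT 1;"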
- Config for the CSV target:
Create a file config-csv.json. It sets the delimiter of the output file, as well as the output path and the naming of the files.
{
  "delimiter": "#",
  "quotechar": "'",
  "output_path_prefix": "./data/",
  "file_naming_scheme": "{stream_name}-{timestamp}.csv"
}
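With this naming scheme, each run writes one file per stream under ./data/ (the exact timestamp format depends on the target version). It's worth creating the output directory up front, in case the target does not create it automatically:
mkdir -p data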
Running the pipeline
First, we need to run a discovery process to obtain the metadata of all tables:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --discover > properties.json
This will generate a large file (its size depends on the number of tables in your database) containing all the properties needed for the extraction. We need to edit this file so that only the information we actually want is extracted from the database.
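To get a quick overview of what was discovered, you can list the table names in the catalog (this assumes jq is installed):
jq -r '.streams[].table_name' properties.json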
What you want to keep is, basically, one object inside the "streams" array for each table that you want to extract; you can remove all the others (or filter them with jq, as shown after the example below). The Singer.io process will use this file to drive the extraction.
{
  "streams": [
    {
      "tap_stream_id": "stream_name_....",
      "table_name": "table_...",
      "schema": {
        ...
      },
      "stream": "stream_name_....",
      "metadata": [
        ....
      ]
    }
  ]
}
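Instead of trimming the file by hand, you can filter it with jq. This sketch keeps only one stream; "my_table" is a placeholder for the table you want to extract:
jq '{streams: [.streams[] | select(.table_name == "my_table")]}' properties.json > properties-filtered.json
mv properties-filtered.json properties.json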
Running a full extraction
To extract a full snapshot of a table, we need to modify the properties.json file obtained in the previous section. In the stream corresponding to the table you want to extract, inside the metadata object whose "breadcrumb" is empty, you need to add the attributes:
- selected: true
- replication-method: FULL_TABLE
Like in this example:
"metadata": [
{
"breadcrumb": [],
"metadata": {
"selected-by-default": false,
"database-name": "db",
"row-count": 0,
"is-view": false,
"selected": true,
"replication-method": "FULL_TABLE",
"table-key-properties": [
"id"
]
}
},
{ ...
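If you prefer not to edit the JSON by hand, the same change can be scripted with jq (a sketch; "my_table" is again a placeholder for your table name):
jq '(.streams[] | select(.table_name == "my_table") | .metadata[] | select(.breadcrumb == []) | .metadata) += {"selected": true, "replication-method": "FULL_TABLE"}' properties.json > properties-full.json
mv properties-full.json properties.json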
With that change, if you now run the following command, you will get a full extraction of the selected tables:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json | ./target-csv/bin/target-csv --config ./config-csv.json
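If everything went well, the extracted CSV files should appear under the output path configured earlier:
ls ./data/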
Running an incremental extraction
To run incremental extractions, we need to add some additional attributes to the stream in properties.json, the same way we did in the full extraction example. In this case you need to provide:
- selected: true
- replication-method: INCREMENTAL
- replication-key: id
The replication-key is the column used to detect new rows between runs, such as a timestamp or an auto-incrementing identifier.
"metadata": [
{
"breadcrumb": [],
"metadata": {
"selected-by-default": false,
"database-name": "db",
"row-count": 0,
"is-view": false,
"selected": true,
"replication-method": "INCREMENTAL",
"replication-key": "id",
"table-key-properties": [
"id"
]
}
},
{ ...
After modifying the properties.json file, we need to do a first run without any "state":
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json | ./target-csv/bin/target-csv --config ./config-csv.json
This command will output something like this:
{"currently_syncing": null, "bookmarks": {"xxxxxxx": {"replication_key": "id", "version": 1672491271022, "replication_key_value": 119}}}
This is the bookmark metadata the tap will use to compute the next incremental, so we need to save that content to a file, for example state.json.
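You can also capture it automatically instead of copying it by hand. State messages are printed to stdout, so keeping the last line of the output should give you the final state (a sketch; it assumes the target writes nothing but Singer messages to stdout, which is the convention):
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json \
  | ./target-csv/bin/target-csv --config ./config-csv.json \
  | tail -n 1 > state.json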
Now, we can run the process again specifying the state, so it will pick up only the newer data.
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json --state state.json | ./target-csv/bin/target-csv --config ./config-csv.json
This run will output the state again, so you may want to run something like the following to automate replacing the old state with the new one:
./tap-mysql/bin/tap-mysql --config ./config-mysql.json --properties properties.json --state state.json \
  | ./target-csv/bin/target-csv --config ./config-csv.json \
  > state1.json \
  && mv state1.json state.json
It's important to note that the filter used to get incrementals is >= on the replication key, so each extraction will contain one overlapping row (the last row of the previous run).
We hope you find this useful!