Melchi is a Snowflake to DuckDB replication tool that handles both initial data ingestion and continuous updates through CDC (Change Data Capture). It gives you a local, synchronized copy of your Snowflake data without the overhead of managing complex ETL processes.
Here's how Melchi works:
- You provide a list of Snowflake tables you want to replicate.
- Run the `setup` command in the terminal.
- Melchi automatically:
  - Creates equivalent tables in DuckDB based on your Snowflake table schemas
  - Sets up streams and change tracking tables in Snowflake for each replicated table
  - Creates change tracking tables in DuckDB to monitor update times
Once set up, simply run the `sync_data` command whenever you need to update. Melchi efficiently checks Snowflake for inserts, updates, and deletes, and applies these changes to DuckDB.
All you need to do is set up a role in Snowflake with the appropriate permissions. Melchi handles the rest, providing a low-maintenance, efficient solution for keeping your DuckDB instance in sync with Snowflake.
Melchi fills specific gaps in the data replication landscape:
- **DuckDB Evaluation**
  - Test DuckDB performance against your Snowflake workloads
  - Experiment with DuckDB without disrupting production
  - Compare query costs and performance
  - Prototype analytics workflows locally
- **Data Warehouse to Data Warehouse**
  - Most CDC tools focus on OLTP → warehouse pipelines (databases → Snowflake)
  - Melchi specializes in warehouse → warehouse movement
  - Handles warehouse-specific data types properly (ARRAY, VARIANT, GEOGRAPHY)
  - Optimized for analytical workload patterns
- **Local Development Environment**
  - Work with production data locally without Snowflake costs
  - Fast iteration on analytics queries
  - Develop and test without internet connectivity
  - Quick setup for new team members
- **Cost Optimization**
  - Reduce Snowflake compute costs for development and testing
  - Cache frequently accessed data locally
  - Run expensive queries without warehouse costs
  - Prototype optimizations before production deployment
Would you consider using Melchi? Let us know your use case in the Discord community.
Get a local copy of your Snowflake data in 5 minutes:
- Clone and set up:

```bash
git clone https://github.com/ryanwith/melchi.git
cd melchi
python3 -m venv venv && source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
- Create `config/config.yaml`:

```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT_IDENTIFIER}
  user: ${SNOWFLAKE_USER}
  password: ${SNOWFLAKE_PASSWORD}
  role: YOUR_ROLE
  warehouse: YOUR_WAREHOUSE
  change_tracking_database: melchi_cdc_db
  change_tracking_schema: streams

target:
  type: duckdb
  database: output/local.duckdb
  change_tracking_schema: melchi

tables_config:
  path: "config/tables_to_transfer.csv"
```
- Create `config/tables_to_transfer.csv`:

```csv
database,schema,table,cdc_type
your_db,your_schema,your_table,full_refresh
```
- Set environment variables in `.env`:

```
SNOWFLAKE_ACCOUNT_IDENTIFIER=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
```
- Run initial setup and sync:

```bash
python main.py setup --config config/config.yaml
python main.py sync_data --config config/config.yaml
```
Your data is now in DuckDB! Query it with:

```python
import duckdb

conn = duckdb.connect('output/local.duckdb')
conn.execute('SELECT * FROM your_schema.your_table').fetchall()
```
See Configuration for detailed setup options and CDC Types for advanced change tracking strategies.
Melchi uses a combination of Snowflake's native change tracking features and custom metadata tables to efficiently synchronize data. Here's how the key components work together:
- Standard Streams: Uses Snowflake's native stream objects to capture INSERTs, UPDATEs, and DELETEs
- Append-Only Streams: Uses Snowflake's native append-only stream objects to capture INSERTs
- Full Refresh: Direct table copies; good for smaller tables or for initial tests
- Processing Tables: Temporary staging areas in Snowflake for batching changes
- Source Tracking: Maintains table schemas, primary keys, and CDC configurations
- Sync State: Tracks last successful sync time and change volumes
- Row Identification: Adds a `melchi_row_id` column for tables using streams without primary keys
```
[Snowflake Source] → [Change Detection] → [Batched Processing] → [DuckDB Target]
```
- **Transactional Consistency**
  - All operations are wrapped in transactions
  - Failed syncs roll back completely
  - No partial updates are committed
  - Source cleanup only occurs after a successful target commit
- **Memory Management**
  - Streams process data in configurable batch sizes
  - Large tables are chunked automatically
  - Pandas DataFrames are used for efficient type conversion
  - Memory usage scales with batch size, not table size (see the sketch after this list)
- **Type System**
  - Automated type mapping between Snowflake and DuckDB
  - Handles complex types (ARRAY, VARIANT, GEOGRAPHY)
  - Preserves precision for numeric types
  - Consistent timezone handling for timestamps
- **Error Handling**
  - Automatic retry logic for transient failures
  - Preserves CDC data on failed syncs
  - Detailed error logging and state tracking
  - Self-healing for interrupted syncs
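To make the first two principles concrete, here is a minimal sketch of the batching-and-transaction pattern (delete handling omitted for brevity). The names `changes_df`, `your_schema.your_table`, and the batch size are hypothetical; this is illustrative, not Melchi's actual implementation:

```python
import duckdb
import pandas as pd

BATCH_SIZE = 10_000  # configurable; memory use is bounded by this, not by table size

def apply_changes(con: duckdb.DuckDBPyConnection, changes_df: pd.DataFrame) -> None:
    con.execute("BEGIN TRANSACTION")
    try:
        for start in range(0, len(changes_df), BATCH_SIZE):
            batch = changes_df.iloc[start:start + BATCH_SIZE]
            con.register("batch", batch)  # expose the DataFrame chunk to SQL
            # delete-then-insert acts as an upsert for each batch of changed rows
            con.execute("DELETE FROM your_schema.your_table "
                        "WHERE melchi_row_id IN (SELECT melchi_row_id FROM batch)")
            con.execute("INSERT INTO your_schema.your_table SELECT * FROM batch")
            con.unregister("batch")
        con.execute("COMMIT")  # changes become visible only if every batch succeeded
    except Exception:
        con.execute("ROLLBACK")  # a failed sync leaves the target untouched
        raise
```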
- Initial Setup

```mermaid
graph LR
    A[Configure Tables] --> B[Create CDC Objects]
    B --> C[Setup DuckDB Schema]
    C --> D[Initialize Metadata]
```

- Regular Sync

```mermaid
graph TD
    A[Check for Changes] --> B{CDC Type?}
    B -->|Standard| C[Process Changes]
    B -->|Append-Only| D[Process Inserts]
    B -->|Full Refresh| E[Copy Table]
    C --> F[Apply Changes]
    D --> F
    E --> F
    F --> G[Update Metadata]
    G --> H[Cleanup CDC Data]
```
- **Current Limitations**
  - Geography and Geometry columns are not supported with `standard_stream` due to Snowflake limitations
  - All identifiers must adhere to Snowflake's unquoted object identifier rules (https://docs.snowflake.com/en/sql-reference/identifiers-syntax)
  - Primary keys must be defined in Snowflake for streams (otherwise a `melchi_row_id` column is added)
  - All tables must be replaced together when modifying the transfer configuration
  - You cannot replicate two tables with the same schema and table names into DuckDB, even if they are in different databases in Snowflake
- **Planned Enhancements**
  - Additional warehouse support (BigQuery, Redshift)
  - Additional CDC mechanisms
- **Under Consideration**
  - Real-time CDC using Snowflake tasks
  - Built-in data validation
  - Web UI for monitoring
  - Multi-target sync support
Melchi supports three Change Data Capture (CDC) strategies, specified per table in your `tables_to_transfer.csv` file. All strategies provide transactional consistency; the key differences are in how changes are detected and synchronized.
- How it works: Completely drops and recreates the target table during each sync (see the sketch after this list)
- Best for:
  - Small lookup tables (< 100k rows)
  - Tables where change tracking setup in Snowflake is not desired
  - Testing and initial setup
- Advantages:
  - Simplest to set up - no change tracking required
  - Works with all column types including GEOGRAPHY and GEOMETRY
- Disadvantages:
  - Resource intensive for large tables
  - Higher latency as the entire table must be transferred
  - Higher costs due to full data scanning
- Example use case: A small configuration table that changes completely several times a day
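For intuition, a full refresh amounts to roughly the following sketch (illustrative only, using the quick-start placeholder names; it assumes `snowflake-connector-python` with the pandas extras installed):

```python
import os

import duckdb
import snowflake.connector

# Pull the entire table out of Snowflake...
sf = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
df = sf.cursor().execute(
    "SELECT * FROM your_db.your_schema.your_table"
).fetch_pandas_all()

# ...then drop and recreate it in DuckDB, which can scan the DataFrame directly.
con = duckdb.connect("output/local.duckdb")
con.execute("CREATE SCHEMA IF NOT EXISTS your_schema")
con.execute("CREATE OR REPLACE TABLE your_schema.your_table AS SELECT * FROM df")
```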
- How it works: Uses Snowflake's standard streams to capture all changes (inserts, updates, and deletes); a query sketch follows this list
- Best for:
  - Large tables with frequent changes
  - Tables needing update and delete tracking
- Advantages:
  - Efficient for large tables with moderate change volumes
  - Captures all types of changes (inserts, updates, deletes)
- Disadvantages:
  - Cannot be used with GEOGRAPHY or GEOMETRY columns (a Snowflake limitation)
  - Requires a Snowflake role with permissions to create tables and streams in the CDC schema
- Example use case: A customer table where records are frequently updated and occasionally deleted
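To see what a standard stream actually emits, you can query it directly (a plain SELECT does not advance the stream). The names below follow the quick-start config and are placeholders; Snowflake adds `METADATA$ACTION`, `METADATA$ISUPDATE`, and `METADATA$ROW_ID` columns, and an update surfaces as a DELETE/INSERT pair:

```python
import os

import snowflake.connector

sf = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT_IDENTIFIER"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
# SELECT * on a stream returns the table's columns plus the metadata columns.
for row in sf.cursor().execute(
    "SELECT * FROM melchi_cdc_db.streams.stream_your_db$your_schema$your_table"
):
    print(row)
```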
- How it works: Uses Snowflake's append-only streams to capture only new records
- Best for:
  - Tables that have only inserts, such as log tables, event data, and time-series data
  - Tables with GEOGRAPHY or GEOMETRY columns that need streaming
- Advantages:
  - Most efficient for append-only patterns
  - Works with all column types including GEOGRAPHY and GEOMETRY
- Disadvantages:
  - Cannot capture updates or deletes
  - Requires a Snowflake role with permissions to create tables and streams in the CDC schema
- Example use case: An event logging table where records are only ever inserted
- Start with these questions:
  - Does your table contain GEOGRAPHY or GEOMETRY columns?
    - If yes: use `append_only_stream` or `full_refresh`
  - Is your table append-only?
    - If yes: use `append_only_stream`
  - Do you need to track updates and deletes?
    - If yes: use `standard_stream`
  - Is your table small (< 100k rows)?
    - If yes: consider `full_refresh`
- Consider your data patterns:
  - High update frequency → `standard_stream`
  - Insert-only patterns → `append_only_stream`
  - Small reference tables → `full_refresh`
- Consider your resources:
  - Limited Snowflake compute → avoid `full_refresh` for large tables
  - Need minimal latency → use streams (`standard_stream` or `append_only_stream`)
  - Limited setup time → start with `full_refresh`
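For example, a single transfer file can mix strategies: an events table as `append_only_stream`, a frequently updated customers table as `standard_stream`, and a small country-code lookup as `full_refresh`.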
- Python 3.7 or later
- Git
- Clone the repository:

```bash
git clone https://github.com/ryanwith/melchi.git
cd melchi
```
- Set up a virtual environment:

```bash
python3 -m venv venv
```
- Activate the virtual environment:
  - On macOS and Linux: `source venv/bin/activate`
  - On Windows: `venv\Scripts\activate`
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Set up environment variables: Create a `.env` file in the project root directory and add your Snowflake credentials:

```
SNOWFLAKE_ACCOUNT_IDENTIFIER=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
```
- Verify the installation:

```bash
python main.py --help
```
If you see the help message with available commands, Melchi is installed correctly.
If you encounter any issues during installation, please check the following:
- Ensure you're using Python 3.7 or later
- Make sure all environment variables are set correctly
- Check that you have the necessary permissions to install packages and create directories
Melchi uses a YAML configuration file to manage connections and specify which tables to replicate. Follow these steps to set up your configuration:
- Create a `config.yaml` file in the `config` folder in the root directory.
- Add the following sections to your `config.yaml`:
```yaml
source:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT_IDENTIFIER}
  user: ${SNOWFLAKE_USER}
  # For username/password authentication:
  # leave the authenticator field out completely (do not set it to blank)
  password: ${SNOWFLAKE_PASSWORD}
  # For SSO via browser authentication:
  # authenticator: externalbrowser  # Uncomment this line for SSO
  role: snowflake_role_to_use
  warehouse: snowflake_warehouse_to_use
  change_tracking_database: database_with_change_tracking_schema
  change_tracking_schema: name_of_change_tracking_schema
  # Optional: use a connection profile file for any of the above settings
  # connection:
  #   file_path: "path/to/your/config.toml"
  #   profile_name: "dev"  # Optional - uses default profile if not specified

target:
  type: duckdb
  database: /path/to/your/local/duckdb/database.duckdb
  change_tracking_schema: name_of_change_tracking_schema

tables_config:
  path: "path/to/your/tables_to_transfer.csv"
```
Replace the placeholders with your actual Snowflake and DuckDB details. The DuckDB database file will be created automatically at runtime as long as the directory you point to exists.
Melchi supports two authentication methods:
- Username/Password (default):
  - Requires `password` to be set
  - Do not include the `authenticator` field at all
- SSO via Browser:
  - Set `authenticator: externalbrowser`
  - The `password` field will be ignored if present
Optionally, you can store some or all of your Snowflake connection details in a TOML file:
```yaml
source:
  connection:
    file_path: "path/to/your/config.toml"
    profile_name: "dev"  # Optional - uses default profile if not specified
```
Any values specified in the connection profile file will override corresponding values in the YAML config. You can specify just a few parameters in the profile file or all of them - it's up to you.
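Conceptually, the precedence behaves like a shallow merge in which profile keys win; a tiny illustrative sketch with hypothetical values (not Melchi's actual code):

```python
# Values parsed from config.yaml and from the TOML profile (hypothetical).
yaml_source = {"account": "acct_from_yaml", "user": "user_from_yaml", "role": "role_from_yaml"}
profile = {"account": "acct_from_profile"}  # a partial profile is fine

effective = {**yaml_source, **profile}  # profile values override YAML values
print(effective["account"])  # acct_from_profile
print(effective["role"])     # role_from_yaml (unchanged)
```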
- `action`: Required. One of:
  - `setup`: Sets up CDC tracking and creates target tables
  - `sync_data`: Syncs data from source to target
  - `generate_source_sql`: Generates the SQL needed for source setup
- `--config`: Optional. Path to the configuration file (default: `config/config.yaml`)
- `--output`: Optional. Output directory for generated SQL (default: `output`)
- `--replace_existing`: Optional. When used with `setup`, drops and recreates existing CDC tracking objects and target tables. Use with caution in production.
```bash
# Initial setup
python main.py setup --config config/config.yaml

# Add new tables to track. This sets up CDC for any new tables you add to the
# table transfer file while not touching existing tables.
python main.py setup --config config/config.yaml

# Setup with replacement of existing objects. This completely recreates all
# CDC tracking objects and tables in the source and target.
python main.py setup --config config/config.yaml --replace_existing

# Regular data sync
python main.py sync_data --config config/config.yaml

# Generate source SQL
python main.py generate_source_sql --config config/config.yaml
```
The `--replace_existing` flag is particularly useful when:
- Changing CDC types for existing tables
- Resetting CDC tracking after schema changes
- Testing different configurations
- Recovering from certain error states
Warning: Using `--replace_existing` drops and recreates all existing CDC tracking objects in the source and all tables in the target. Use with caution in production.
In addition to `config.yaml`, create a `tables_to_transfer.csv` file (or another name you specify in the config) in the `config` directory to specify which tables to replicate:

```csv
database,schema,table,cdc_type
your_db,your_schema,table1,full_refresh
your_db,your_schema,table2,standard_stream
your_db,your_schema,table3,append_only_stream
```
The `cdc_type` column specifies how changes should be tracked for each table:

- `full_refresh`: Completely refreshes the table in DuckDB on each sync by dropping and recreating it. This is the default if no `cdc_type` is specified.
- `standard_stream`: Uses Snowflake's standard streams to capture inserts, updates, and deletes.
- `append_only_stream`: Uses Snowflake's append-only streams for insert-only tables (more efficient for append-only data).
Then set up environment variables in a `.env` file:

```
SNOWFLAKE_ACCOUNT_IDENTIFIER=your_account
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
DUCKDB_DATABASE_PATH=/path/to/your/duckdb/database.db
```
Ensure all configuration files are properly set up before running Melchi. Note: if you do not have a DuckDB database file, Melchi will create one at the path you specify.
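For reference, `${VAR}` placeholders like those in the YAML above can be expanded from a `.env` file along these lines (a sketch assuming `python-dotenv` and `PyYAML`; Melchi's actual loader may differ):

```python
import os
import re

import yaml
from dotenv import load_dotenv

load_dotenv()  # loads .env values into the process environment

def expand_env(text: str) -> str:
    # Replace ${NAME} with the environment value; leave unknown names untouched.
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), m.group(0)), text)

with open("config/config.yaml") as f:
    config = yaml.safe_load(expand_env(f.read()))

print(config["source"]["account"])  # resolved from SNOWFLAKE_ACCOUNT_IDENTIFIER
```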
To use Melchi effectively, you need to set up the correct permissions in Snowflake. Here's how to do it:
- Create a dedicated role in Snowflake for Melchi:

```sql
USE ROLE ACCOUNTADMIN;
CREATE ROLE melchi_role;
```
- Grant the necessary permissions to this role. You can do this manually or use Melchi's `generate_source_sql` feature to help you.
If you prefer to set up permissions manually, you need to grant the following:
- Usage on the warehouse
- Usage on the databases and schemas containing the tables you want to replicate
- Select permission on the tables you want to replicate
- Create Table and Create Stream permissions on the change tracking schema
For example:
```sql
GRANT USAGE ON WAREHOUSE your_warehouse TO ROLE melchi_role;
GRANT USAGE ON DATABASE your_db TO ROLE melchi_role;
GRANT USAGE ON SCHEMA your_db.your_schema TO ROLE melchi_role;
GRANT SELECT ON TABLE your_db.your_schema.your_table TO ROLE melchi_role;
GRANT CREATE TABLE, CREATE STREAM ON SCHEMA change_tracking_db.change_tracking_schema TO ROLE melchi_role;
```
Melchi's `generate_source_sql` feature creates the necessary SQL statements for setting up permissions. To use it:

- Ensure your `config.yaml` and `tables_to_transfer.csv` are correctly set up.
- Run the following command:

```bash
python main.py generate_source_sql --config config/config.yaml --output output
```

- This will generate a file named `source_setup.sql` in the `output` directory. Review this file to ensure it meets your security requirements.
- Execute the SQL statements in the generated file in your Snowflake account to set up the permissions.
Remember to enable change tracking on the tables you want to replicate:
```sql
ALTER TABLE your_db.your_schema.your_table SET CHANGE_TRACKING = TRUE;
```
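If you prefer to script the manual route yourself, here is a short sketch that emits the per-table statements from your transfer file (it mirrors the examples above; it is not the `generate_source_sql` implementation):

```python
import csv

# Print GRANT and change-tracking statements for every table to transfer.
with open("config/tables_to_transfer.csv") as f:
    for row in csv.DictReader(f):
        table = f"{row['database']}.{row['schema']}.{row['table']}"
        print(f"GRANT USAGE ON DATABASE {row['database']} TO ROLE melchi_role;")
        print(f"GRANT USAGE ON SCHEMA {row['database']}.{row['schema']} TO ROLE melchi_role;")
        print(f"GRANT SELECT ON TABLE {table} TO ROLE melchi_role;")
        print(f"ALTER TABLE {table} SET CHANGE_TRACKING = TRUE;")
```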
By following these steps, you'll have the necessary permissions set up in Snowflake for Melchi to operate effectively.
Melchi provides several commands to manage your data synchronization process. Here's how to use each of them:
Before you start using Melchi, you need to set up the necessary permissions and change tracking in Snowflake. Use the following command to generate the required SQL:
```bash
python main.py generate_source_sql --config path/to/your/config.yaml --output path/to/output/directory
```
This will create a `source_setup.sql` file in the specified output directory. Review this file and execute the SQL statements in your Snowflake environment.
To initialize Melchi and prepare both your source (Snowflake) and target (DuckDB) for data synchronization, run:
```bash
python main.py setup --config path/to/your/config.yaml
```
This command will:
- Set up change data capture (CDC) in Snowflake for the specified tables
- Create corresponding tables in DuckDB
- Initialize change tracking metadata
To synchronize data from Snowflake to DuckDB, use:
```bash
python main.py sync_data --config path/to/your/config.yaml
```
This command will:
- Check for changes in the Snowflake tables since the last synchronization
- Apply those changes (inserts, updates, deletes) to the corresponding DuckDB tables
- Initial Setup: Always run the `setup` command before your first `sync_data` operation.
- Regular Syncing: Set up a scheduled job to run the `sync_data` command at regular intervals to keep your DuckDB instance up to date.
- Configuration Updates: If you modify your `config.yaml` or `tables_to_transfer.csv`, re-run the `setup` command to ensure all necessary changes are applied.
- Error Handling: If a `sync_data` operation fails, address the error and re-run the command. Melchi is designed to pick up where it left off.
- Monitoring: Regularly check the Melchi logs to ensure smooth operation and to catch any potential issues early.
- Set up your configuration files (`config.yaml` and `tables_to_transfer.csv`).
- Generate and execute the source SQL:

  ```bash
  python main.py generate_source_sql --config config/config.yaml --output sql_output
  ```

- Review and execute `sql_output/source_setup.sql` in Snowflake.
- Run the initial setup:

  ```bash
  python main.py setup --config config/config.yaml
  ```

- Perform your first data sync:

  ```bash
  python main.py sync_data --config config/config.yaml
  ```

- Set up a cron job or scheduled task to run the `sync_data` command at your desired frequency (see the example below).
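For example, a crontab entry such as `0 * * * * cd /path/to/melchi && venv/bin/python main.py sync_data --config config/config.yaml >> sync.log 2>&1` runs an hourly sync and keeps a log of each run (the paths are placeholders for your own setup).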
By following these steps and best practices, you'll be able to efficiently manage your data synchronization process with Melchi.
Melchi uses a combination of Snowflake's change tracking features and custom metadata tables to efficiently synchronize data from Snowflake to DuckDB. Here's a detailed explanation of how it works:
When you run the `setup` command, Melchi creates CDC tracking objects in your source (as necessary) and target. It also creates tables in the target whose schemas match the source tables you are replicating.
Melchi creates two components in Snowflake for each table that uses streams:
- Stream: Creates a stream to capture changes

```sql
CREATE STREAM your_db.change_tracking_schema.stream_your_table
ON TABLE your_db.your_schema.your_table;
```
- Processing Table: Creates a staging area for captured changes

```sql
CREATE TABLE your_db.change_tracking_schema.stream_your_table_processing
LIKE your_db.your_schema.your_table;
```
For all CDC strategies, Melchi:
- Creates tables that mirror your Snowflake schema
- Sets up metadata tables to track synchronization status
- For tables without primary keys, adds a `melchi_row_id` column to uniquely identify rows
If you run setup with the `--replace_existing` flag, all CDC objects in your source are replaced, all CDC data in your target is dropped, and all tables are recreated.
When you run `sync_data`, Melchi handles each table according to its CDC strategy:
For `full_refresh` tables:

- Drops and recreates the table with fresh data from Snowflake
- Updates synchronization metadata

For `standard_stream` and `append_only_stream` tables:

- Captures changes from Snowflake streams into processing tables
- Converts changes to pandas DataFrames for processing
- Applies changes to DuckDB:
  - `standard_stream`: handles inserts, updates, and deletes
  - `append_only_stream`: handles only inserts
- Updates synchronization metadata
- Cleans up processing tables
- Snowflake streams: `change_tracking_schema.stream_[database]$[schema]$[table]`
- Snowflake processing tables: `change_tracking_schema.stream_[database]$[schema]$[table]_processing`
- DuckDB tables: mirror the original Snowflake table names
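For example, following this pattern with the quick-start configuration (change tracking objects in `melchi_cdc_db.streams`), a table `your_db.your_schema.orders` would get a stream named `melchi_cdc_db.streams.stream_your_db$your_schema$orders` and a processing table with the `_processing` suffix.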
- All CDC strategies provide accurate data as of their last sync time
- Tables without primary keys automatically get a `melchi_row_id` column added
- Uses Snowflake's native change tracking capabilities for efficient syncing
- Supports mixing different CDC strategies across tables based on your needs
We welcome contributions to Melchi! Whether you're fixing bugs, improving documentation, or proposing new features, your efforts are appreciated. Here's how you can contribute:
- Fork the Repository: Start by forking the Melchi repository to your GitHub account.
- Clone the Fork: Clone your fork to your local machine:

```bash
git clone https://github.com/your-username/melchi.git
cd melchi
```
- Set Up the Development Environment:
  - Create a virtual environment: `python -m venv venv`
  - Activate it:
    - On Windows: `venv\Scripts\activate`
    - On macOS and Linux: `source venv/bin/activate`
  - Install dependencies: `pip install -r requirements.txt`
I would love help developing this! If you're interested, please reach out on Discord.
- Create a Branch: Create a new branch for your changes:

```bash
git checkout -b your-branch-name
```

- Make Your Changes: Implement your bug fix or feature.
- Follow Coding Standards:
  - Follow the PEP 8 style guide for Python code.
  - Write clear, commented code.
  - Ensure your changes don't break existing functionality.
- Write Tests: If you're adding new functionality, include unit tests. Ensure all tests pass.
- Update Documentation: Update the README or other documentation if necessary.
- Commit Your Changes:

```bash
git commit -m "Brief description of your changes"
```

- Push to Your Fork:

```bash
git push origin your-branch-name
```

- Create a Pull Request: Go to the Melchi repository on GitHub and create a new pull request from your branch.
- Describe Your Changes: In the pull request, provide a clear description of what you've done and why.
- Wait for Review: Maintainers will review your PR. Be open to feedback and be prepared to make additional changes if requested.
- Use the GitHub issue tracker to report bugs or suggest features.
- Before creating a new issue, please check if it already exists.
- Clearly describe the issue, including steps to reproduce for bugs.
- Be respectful and inclusive in your interactions with other contributors.
- We are committed to providing a welcoming and inspiring community for all.
If you have any questions about contributing, feel free to ask in the GitHub issues.
Thank you for contributing to Melchi! Your efforts help make this project better for everyone.
Melchi is released under the Apache License 2.0 with Commons Clause.
This means that while the software is open source, there are certain restrictions on its use:
- You are free to use, modify, and distribute this software for your own purposes.
- You can use this software within your organization, whether for commercial purposes or not.
- You can make modifications to the software and distribute your modified version.
However, the Commons Clause adds the following restriction:
- You are not permitted to sell the software or any modified version of it as a commercial offering, particularly as a managed service.
In simple terms, you can use Melchi for your own data synchronization needs, but you cannot offer Melchi (or a modified version of it) as a paid service to others.
For the full license text, please see the LICENSE file in the repository.
If you have any questions about the license or how you can use Melchi in your specific case, please open an issue in the GitHub repository for clarification.
If you have any questions, you can reach out on Discord.