Documentation, mostly #7

Merged
merged 5 commits into from
Dec 14, 2023
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,5 +2,6 @@
.idea
.venv*
*.egg-info
build
coverage.xml
build
dist
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Crate.io Inc

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
74 changes: 10 additions & 64 deletions README.md
@@ -1,4 +1,4 @@
# Meltano/Singer Target for CrateDB
# Singer target / Meltano loader for CrateDB

[![Tests](https://github.com/crate-workbench/meltano-target-cratedb/actions/workflows/main.yml/badge.svg)](https://github.com/crate-workbench/meltano-target-cratedb/actions/workflows/main.yml)
[![Test coverage](https://img.shields.io/codecov/c/gh/crate-workbench/meltano-target-cratedb.svg)](https://codecov.io/gh/crate-workbench/meltano-target-cratedb/)
@@ -13,73 +13,19 @@
## About

A [Singer] target for [CrateDB], built with the [Meltano SDK] for custom extractors
and loaders, and based on the [Meltano PostgreSQL target]. It connects a library of
[600+ connectors] with CrateDB, and vice versa.
and loaders, and based on the [Meltano PostgreSQL target].

In Singer ELT jargon, a "target" conceptually wraps a data sink into which you
"load" data.

Singer, Meltano, and PipelineWise provide foundational components and
an integration engine for composable Open Source ETL with [600+ connectors].
On the database integration side, they are heavily based on [SQLAlchemy].


### CrateDB

[CrateDB] is a distributed and scalable SQL database for storing and analyzing
massive amounts of data in near real-time, even with complex queries. It is
PostgreSQL-compatible, and based on [Apache Lucene].

CrateDB offers a Python SQLAlchemy dialect, in order to plug into the
comprehensive Python data-science and -wrangling ecosystems.

### Singer

_The open-source standard for writing scripts that move data._

[Singer] is an open source specification and software framework for [ETL]/[ELT]
data exchange between a range of different systems. For talking to SQL databases,
it employs a metadata subsystem based on SQLAlchemy.

Singer reads and writes Singer-formatted messages, following the [Singer Spec].
Effectively, those are JSONL files.

### Meltano

_Unlock all the data that powers your data platform._

_Say goodbye to writing, maintaining, and scaling your own API integrations
with Meltano's declarative code-first data integration engine, bringing
600+ APIs and DBs to the table._

[Meltano] builds upon Singer technologies, uses configuration files in YAML
syntax instead of JSON, adds an improved SDK and other components, and runs
the central addon registry, [meltano | Hub].

### PipelineWise

[PipelineWise] is another Data Pipeline Framework using the Singer.io
specification to ingest and replicate data from various sources to
various destinations. The list of [PipelineWise Taps] include another
20+ high-quality data-source and -sink components.

### SQLAlchemy

[SQLAlchemy] is the leading Python SQL toolkit and Object Relational Mapper
that gives application developers the full power and flexibility of SQL.

It provides a full suite of well known enterprise-level persistence patterns,
designed for efficient and high-performing database access, adapted into a
simple and Pythonic domain language.
In order to learn more about Singer, Meltano, and friends, navigate to the
[Singer Intro](./docs/singer-intro.md).


## Install

Usually, you will not install this package directly, but on behalf
of a Meltano definition instead, for example. A corresponding snippet
is outlined in the next section. After adding it to your `meltano.yml`
configuration file, you can install all defined components and their
dependencies.
Usually, you will not install this package directly, but rather as part
of a Meltano project. A corresponding snippet is outlined in the next section.

After adding it to your `meltano.yml` project definition file, you can install
all defined components and their dependencies with a single command.
```
meltano install
```
@@ -197,8 +143,8 @@ pip_url: --editable=/path/to/sources/meltano-target-cratedb
```


[600+ connectors]: https://hub.meltano.com/
[Apache Lucene]: https://lucene.apache.org/
[connectors]: https://hub.meltano.com/
[CrateDB]: https://cratedb.com/product
[CrateDB Cloud]: https://console.cratedb.cloud/
[ELT]: https://en.wikipedia.org/wiki/Extract,_load,_transform
131 changes: 131 additions & 0 deletions docs/singer-intro.md
@@ -0,0 +1,131 @@
## About

An introduction to the Singer ecosystem of data pipeline components for
composable open source ETL.

Singer, Meltano, PipelineWise, and Airbyte provide components and integration
engines adhering to the Singer specification.

On the database integration side, the [connectors] of Singer and Meltano are
based on [SQLAlchemy].


## Overview

### CrateDB

[CrateDB] is a distributed and scalable SQL database for storing and analyzing
massive amounts of data in near real-time, even with complex queries. It is
PostgreSQL-compatible, and based on [Apache Lucene].

CrateDB offers a Python SQLAlchemy dialect, in order to plug into the
comprehensive Python data-science and -wrangling ecosystems.

### Singer

_The open-source standard for writing scripts that move data._

[Singer] is an open source specification and software framework for [ETL]/[ELT]
data exchange between a range of different systems. For talking to SQL databases,
it employs a metadata subsystem based on SQLAlchemy.

Singer reads and writes Singer-formatted JSONL messages, following the [Singer Spec].

> The Singer specification was started in 2016 by Stitch Data. It specified a
> data transfer format that would allow any number of data systems, called taps,
> to send data to any data destinations, called targets. Airbyte was incorporated
> in 2020 and created their own specification that was heavily inspired by Singer.
> There are differences, but the core of each specification is sending new-line
> delimited JSON data from STDOUT of a tap to STDIN of a target.
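
As a minimal sketch of that exchange, assuming only the three core message
types of the Singer specification (`SCHEMA`, `RECORD`, `STATE`), the following
Python snippet writes newline-delimited JSON to a stream and reads it back.
The function names `emit_messages` and `load_messages`, the `users` stream, and
the `last_id` state key are hypothetical and chosen for illustration only. An
in-memory buffer stands in for the STDOUT/STDIN pipe.

```python
import io
import json


def emit_messages(stream_name, rows, out):
    """Tap side: write SCHEMA, RECORD, and STATE messages, one JSON document per line."""
    schema = {
        "type": "SCHEMA",
        "stream": stream_name,
        "schema": {
            "properties": {"id": {"type": "integer"}, "name": {"type": "string"}}
        },
        "key_properties": ["id"],
    }
    out.write(json.dumps(schema) + "\n")
    for row in rows:
        message = {"type": "RECORD", "stream": stream_name, "record": row}
        out.write(json.dumps(message) + "\n")
    if rows:
        # Hypothetical bookmark: remember the last primary key that was emitted.
        out.write(json.dumps({"type": "STATE", "value": {"last_id": rows[-1]["id"]}}) + "\n")


def load_messages(inp):
    """Target side: collect records per stream and keep the most recent state."""
    records, state = {}, None
    for line in inp:
        if not line.strip():
            continue
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return records, state


# An in-memory buffer replaces the pipe between the two processes.
buffer = io.StringIO()
emit_messages("users", [{"id": 1, "name": "foo"}, {"id": 2, "name": "bar"}], buffer)
buffer.seek(0)
records, state = load_messages(buffer)
```

A real target would additionally use the `SCHEMA` message to create or validate
the destination table before writing records. The sketch skips that step.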


### Meltano

_Unlock all the data that powers your data platform._

> _Say goodbye to writing, maintaining, and scaling your own API integrations
with Meltano's declarative code-first data integration engine, bringing
a number of APIs and DBs to the table._

[Meltano] builds upon Singer technologies, uses configuration files in YAML
syntax instead of JSON, adds an improved SDK and other components, and runs
the central addon registry, [meltano | Hub].

### PipelineWise

> [PipelineWise] is another Data Pipeline Framework using the Singer.io
specification to ingest and replicate data from various sources to
various destinations. The list of [PipelineWise Taps] includes another
bunch of high-quality data-source and -sink components.

### Data Mill

> Data Mill helps organizations utilize modern data infrastructure and data
> science to power analytics, products, and services.

- https://github.com/datamill-co
- https://datamill.co/

### SQLAlchemy

> [SQLAlchemy] is the leading Python SQL toolkit and Object Relational Mapper
that gives application developers the full power and flexibility of SQL.
>
> It provides a full suite of well known enterprise-level persistence patterns,
designed for efficient and high-performing database access, adapted into a
simple and Pythonic domain language.


## Evaluations

### Singer vs. Meltano

Meltano as a framework fills many gaps and makes Singer convenient to actually
use. It is impossible to outline all details and every difference, so we will
focus on the "naming things" aspects for now.

Both ecosystems use different names for the same elements. That may be confusing
at first, but it is easy to learn: For the notion of **data source** vs. **data
sink**, common to all pipeline systems in one way or another, Singer uses the
terms **tap** vs. **target**, while Meltano uses **extractor** vs. **loader**.
Essentially, they are the same things under different names.

| Ecosystem | Data source | Data sink |
|--------|--------|--------|
| Singer | Tap | Target |
| Meltano | Extractor | Loader |

In Singer jargon, you **tap** data from a source, and send it to a **target**.
In Meltano jargon, you **extract** data from a source, and then **load** it
into the target system.
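
Whatever the naming, the two components are separate programs connected by a
pipe. The following Python sketch illustrates this under loud assumptions: the
two one-line child programs are hypothetical stand-ins for a real tap/extractor
and target/loader, and `run_pipeline` is an illustrative helper, not part of
Singer or Meltano.

```python
import subprocess
import sys

# Hypothetical stand-in for a tap/extractor: emits a single RECORD message.
TAP_CODE = (
    'import json; '
    'print(json.dumps({"type": "RECORD", "stream": "demo", "record": {"id": 1}}))'
)

# Hypothetical stand-in for a target/loader: counts the RECORD messages it receives.
TARGET_CODE = (
    'import sys, json; '
    'print(sum(1 for line in sys.stdin if json.loads(line)["type"] == "RECORD"))'
)


def run_pipeline(tap_cmd, target_cmd):
    """Connect the tap's STDOUT to the target's STDIN, like `tap | target` in a shell."""
    tap = subprocess.Popen(tap_cmd, stdout=subprocess.PIPE)
    target = subprocess.Popen(
        target_cmd, stdin=tap.stdout, stdout=subprocess.PIPE, text=True
    )
    # Close the parent's copy so the tap receives SIGPIPE if the target exits early.
    tap.stdout.close()
    output, _ = target.communicate()
    tap.wait()
    return output.strip()


loaded = run_pipeline(
    [sys.executable, "-c", TAP_CODE],
    [sys.executable, "-c", TARGET_CODE],
)
```

Meltano orchestrates the same kind of connection on your behalf, based on the
extractors and loaders defined in the project.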


### Singer and Airbyte criticism

- https://airbyte.com/etl-tools/singer-alternative-airbyte
- https://airbyte.com/blog/airbyte-vs-singer-why-airbyte-is-not-built-on-top-of-singer
- https://airbyte.com/blog/why-you-should-not-build-your-data-pipeline-on-top-of-singer
- https://airbyte.com/blog/a-new-license-to-future-proof-the-commoditization-of-data-integration
- [Clarify in docs relationship to Singer project from Stitch/Talend]
- [Unfair comparison to PipelineWise and Meltano]


[Apache Lucene]: https://lucene.apache.org/
[Clarify in docs relationship to Singer project from Stitch/Talend]: https://github.com/airbytehq/airbyte/issues/445
[connectors]: https://hub.meltano.com/
[CrateDB]: https://cratedb.com/product
[CrateDB Cloud]: https://console.cratedb.cloud/
[ELT]: https://en.wikipedia.org/wiki/Extract,_load,_transform
[ETL]: https://en.wikipedia.org/wiki/Extract,_transform,_load
[Meltano]: https://meltano.com/
[meltano | Hub]: https://hub.meltano.com/
[Meltano SDK]: https://github.com/meltano/sdk
[Meltano PostgreSQL target]: https://pypi.org/project/meltanolabs-target-postgres/
[meltano-target-cratedb]: https://github.com/crate-workbench/meltano-target-cratedb
[Singer]: https://www.singer.io/
[Singer Spec]: https://hub.meltano.com/singer/spec/
[PipelineWise]: https://transferwise.github.io/pipelinewise/
[PipelineWise Taps]: https://transferwise.github.io/pipelinewise/user_guide/yaml_config.html
[SQLAlchemy]: https://www.sqlalchemy.org/
[Unfair comparison to PipelineWise and Meltano]: https://github.com/airbytehq/airbyte/issues/9253
[vanilla package on PyPI]: https://pypi.org/project/meltano-target-cratedb/
27 changes: 14 additions & 13 deletions pyproject.toml
@@ -11,22 +11,29 @@ default-tag = "0.0.0"

[project]
name = "meltano-target-cratedb"
description = "A Singer target for CrateDB, built with the Meltano SDK, and based on the Meltano PostgreSQL target."
description = "A Singer target / Meltano loader for CrateDB, built with the Meltano SDK, and based on the Meltano PostgreSQL target."
readme = "README.md"
keywords = [
"cratedb",
"data-loading",
"data-processing",
"CrateDB",
"data",
"data-toolkit",
"data-transfer",
"data-transformation",
"ELT",
"ETL",
"extract",
"ingest",
"io",
"load",
"Meltano",
"Meltano SDK",
"pipeline",
"Postgres",
"PostgreSQL",
"process",
"Singer",
"SQL",
"SQLAlchemy",
"transfer",
"transformation",
]
license = { text = "MIT" }
authors = [
@@ -56,6 +63,7 @@ classifiers = [
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: SQL",
"Topic :: Adaptive Technologies",
"Topic :: Communications",
"Topic :: Database",
@@ -103,7 +111,6 @@ release = [
"twine<5",
]
test = [
"meltano-target-cratedb[testing]",
"pytest<8",
"pytest-cov<5",
"pytest-mock<4",
@@ -156,10 +163,6 @@ testpaths = [
]
xfail_strict = true
markers = [
"examples",
"influxdb",
"mongodb",
"slow",
]

[tool.ruff]
@@ -207,8 +210,6 @@ extend-ignore = [
]

extend-exclude = [
"amqp-to-mqtt.py",
"workbench.py",
]

[tool.ruff.per-file-ignores]