diff --git a/docs/install/container/docker.rst b/docs/install/container/docker.rst index 96c4b35b..e3a1eaa8 100644 --- a/docs/install/container/docker.rst +++ b/docs/install/container/docker.rst @@ -501,7 +501,7 @@ example:: .. _containerization: https://www.docker.com/resources/what-container .. _CrateDB Docker image: https://hub.docker.com/_/crate/ .. _default bridge network: https://docs.docker.com/network/drivers/bridge/#use-the-default-bridge-network -.. _Docker Stack YAML file: https://docs.docker.com/docker-cloud/apps/stack-yaml-reference/ +.. _Docker Stack YAML file: https://docs.oldtimes.me/docker/docker-cloud/apps/stack-yaml-reference/index.html .. _Docker Swarm: https://docs.docker.com/engine/swarm/ .. _Docker volume: https://docs.docker.com/engine/tutorials/dockervolumes/ .. _Docker: https://www.docker.com/ diff --git a/docs/integrate/etl/kafka-connect.rst b/docs/integrate/etl/kafka-connect.rst index a8251d7e..6beda163 100644 --- a/docs/integrate/etl/kafka-connect.rst +++ b/docs/integrate/etl/kafka-connect.rst @@ -495,7 +495,7 @@ The remaining steps from above remain are applicable without changes. .. _Kafka: https://www.confluent.io/what-is-apache-kafka/ .. _Kafka Connect JDBC connector: https://docs.confluent.io/kafka-connect-jdbc/current/sink-connector/ .. _Confluent Platform: https://docs.confluent.io/current/cli/index.html -.. _Avro schema: https://avro.apache.org/docs/current/spec.html +.. _Avro schema: https://avro.apache.org/docs/1.10.2/spec.html .. _PostgreSQL Kafka Connect JDBC driver: https://docs.confluent.io/kafka-connect-jdbc/current/index.html#postgresql-database .. _Sink Connector: https://docs.confluent.io/current/connect/kafka-connect-jdbc/sink-connector/index.html .. 
_Source Connector: https://docs.confluent.io/current/connect/kafka-connect-jdbc/source-connector/index.html diff --git a/docs/integrate/etl/mongodb.md b/docs/integrate/etl/mongodb.md new file mode 100644 index 00000000..ba4c7598 --- /dev/null +++ b/docs/integrate/etl/mongodb.md @@ -0,0 +1,216 @@ +(integrate-mongodb)= +(migrating-mongodb)= +(integrate-mongodb-quickstart)= +(import-mongodb)= + +# Import data from MongoDB + +In this quick tutorial, you'll use the [CrateDB Toolkit MongoDB I/O subsystem] +to import data from [MongoDB] into [CrateDB]. + +:::{note} +**Important:** The tutorial uses adapter software that is currently in beta. +If you discover any issues, please [report them] back to us. +::: + +## Synopsis +Transfer data from a MongoDB database/collection into a CrateDB schema/table. +:::{code} shell +ctk load table \ + "mongodb+srv://admin:p..d@cluster0.nttj7.mongodb.net/testdrive/demo" \ + --cratedb-sqlalchemy-url='crate://admin:p..d@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200/testdrive/demo?ssl=true' +::: + +Query the data in CrateDB. +:::{code} shell +export CRATEPW=password +crash --host=cratedb.example.org --username=user --command="SELECT * FROM testdrive.demo;" +::: + +## Data Model + +MongoDB stores data in collections and documents. CrateDB stores +data in schemas and tables. + +- A **database** in MongoDB is a physical container for collections, similar + to a schema in CrateDB, which groups tables together within a database. +- A **collection** in MongoDB is a grouping of documents, similar to a table + in CrateDB, which is a structured collection of rows. +- A **document** in MongoDB is a record in a collection, similar to a row in + a CrateDB table. It is a set of key-value pairs, where each key represents + a field, and the value represents the data. +- A **field** in MongoDB is similar to a column in a CrateDB table. In both + systems, fields (or columns) define the attributes of the records + (or rows/documents). 
+- A **primary key** in MongoDB is typically the `_id` field, which uniquely + identifies a document within a collection. In CrateDB, a primary key + uniquely identifies a row in a table. +- An **index** in MongoDB is similar to an index in CrateDB. Both are used to + improve query performance by providing a fast lookup for fields (or columns) + within documents (or rows). + +-- [Databases and Collections] + +## Tutorial + +The tutorial makes heavy use of Docker to provide services and run jobs. +Alternatively, you can use Podman as a drop-in replacement. +The walkthrough uses a basic example setup including MongoDB v7.0.x, CrateDB, +and a few sample records that are transferred to CrateDB. + +### Services + +As prerequisites, you need running instances of CrateDB and MongoDB. + +Start MongoDB. +:::{code} shell +docker run --rm -it --name=mongodb \ + --publish=27017:27017 \ + --volume="$PWD/var/lib/mongodb:/data/db" \ + mongo:latest +::: + +Start CrateDB. +:::{code} shell +docker run --rm -it --name=cratedb \ + --publish=4200:4200 \ + --volume="$PWD/var/lib/cratedb:/data" \ + crate:latest -Cdiscovery.type=single-node +::: + +### Sample Data + +This tutorial imports demo data into MongoDB in JSON format: + +:::{code} json + [ + { + "_id": "66bb0bd8e17c5c509fbc8b2c", + "VendorID": 2, + "tpep_pickup_datetime": 1563051934000, + "tpep_dropoff_datetime": 1563053222000, + "passenger_count": 2, + "trip_distance": 3.29, + "RatecodeID": 1, + "store_and_fwd_flag": "N", + "PULocationID": 79, + "DOLocationID": 170, + "payment_type": 1, + "fare_amount": 15.5, + "extra": 0.5, + "mta_tax": 0.5, + "tip_amount": 3.86, + "tolls_amount": 0, + "improvement_surcharge": 0.3, + "total_amount": 23.16, + "congestion_surcharge": 2.5, + "airport_fee": "" + }, ... + ] +::: + +Import the data into MongoDB: +:::{code} shell +mongoimport --db testdrive --collection demo --file demodata.json --jsonArray +::: + +:::{note} +`mongoimport` is part of the [MongoDB Database tools]. 
+::: + +Verify that the data is present: +:::{code} shell +docker exec -it mongodb mongosh +::: + +:::{code} shell +use testdrive +db.demo.find().pretty() +::: + +### Data Import + +First, create these command aliases for convenience. +:::{code} shell +alias crash="docker run --rm -it --link=cratedb ghcr.io/crate-workbench/cratedb-toolkit:latest crash" +alias ctk="docker run --rm -it ghcr.io/crate/cratedb-toolkit:latest ctk" +::: + +Now, import data from the MongoDB database/collection into the CrateDB schema/table. +:::{code} shell +ctk load table \ + "mongodb://localhost:27017/testdrive/demo" \ + --cratedb-sqlalchemy-url="crate://crate@cratedb:4200/testdrive/demo" +::: + +Verify that the relevant data has been transferred to CrateDB. +:::{code} shell +crash --host=cratedb --command="SELECT * FROM testdrive.demo;" +::: + +## Cloud to Cloud + +The procedure for importing data from [MongoDB Atlas] into [CrateDB Cloud] is +similar, with a few small adjustments. + +First, define the helpful aliases again: +:::{code} shell +alias ctk="docker run --rm -it ghcr.io/crate/cratedb-toolkit:latest ctk" +alias crash="docker run --rm -it ghcr.io/crate-workbench/cratedb-toolkit:latest crash" +::: + +You will need your credentials for both CrateDB and MongoDB. +These are, with examples: + +**CrateDB Cloud** +* Host: `gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net` +* Username: `admin` +* Password: `-9..nn` + +**MongoDB Atlas** +* Host: `cluster0.nttj7.mongodb.net` +* User: `admin` +* Password: `a1..d1` + +For CrateDB, the credentials are displayed at the time of cluster creation. +For MongoDB, they can be found in the [cloud platform] itself. + +Now, same as before, import data from the MongoDB database/collection into +the CrateDB schema/table. 
+:::{code} shell +ctk load table \ + "mongodb+srv://admin:a..1@cluster0.nttj7.mongodb.net/testdrive/demo" \ + --cratedb-sqlalchemy-url='crate://admin:-..n@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200/testdrive/demo?ssl=true' +::: + +:::{note} +Note the **required** `ssl=true` query parameter at the end of the CrateDB connection URL +when running Cloud-to-Cloud transfers. +::: + +Verify that the relevant data has been transferred to CrateDB. +:::{code} shell +crash --hosts 'https://admin:-..n@gray-wicket-systri-warrick.aks1.westeurope.azure.cratedb.net:4200' --command 'SELECT * FROM testdrive.demo;' +::: + +## More information + +There are more ways to apply the CrateDB Toolkit I/O subsystem as a +pipeline element in your daily data operations routines. Please visit the +[CrateDB Toolkit MongoDB I/O subsystem] documentation to learn more about what's possible. + +The MongoDB I/O subsystem is based on the [migr8] migration utility package. Please also +check its documentation to learn more about its capabilities when +working with MongoDB. 
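The document-to-row mapping outlined in the "Data Model" section can be sketched in a few lines of Python. This is a purely illustrative, hypothetical helper, not part of CrateDB Toolkit (`ctk load table` performs the conversion for you); it assumes timestamps are stored as epoch milliseconds, as in the sample data above.

```python
from datetime import datetime, timezone

def document_to_row(doc: dict) -> dict:
    """Map a MongoDB document (key-value pairs) to a CrateDB-style row.

    Illustrative sketch only: `_id` becomes the primary key column,
    epoch-millisecond timestamps become timezone-aware datetimes
    (suitable for a TIMESTAMP column), all other fields map to
    columns as-is.
    """
    row = {}
    for key, value in doc.items():
        if key == "_id":
            # MongoDB's `_id` uniquely identifies the document; treat it
            # as the primary key column of the row.
            row["id"] = str(value)
        elif key.endswith("_datetime") and isinstance(value, int):
            # Assumption: fields named `*_datetime` hold epoch milliseconds.
            row[key] = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
        else:
            row[key] = value
    return row

doc = {
    "_id": "66bb0bd8e17c5c509fbc8b2c",
    "tpep_pickup_datetime": 1563051934000,
    "passenger_count": 2,
    "trip_distance": 3.29,
}
print(document_to_row(doc))
```

In practice, the adapter also has to handle nested documents and arrays, which CrateDB represents with `OBJECT` and `ARRAY` column types; the sketch above only covers flat documents.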
+ + +[cloud platform]: https://cloud.mongodb.com +[CrateDB]: https://github.com/crate/crate +[CrateDB Cloud]: https://console.cratedb.cloud/ +[CrateDB Toolkit MongoDB I/O subsystem]: https://cratedb-toolkit.readthedocs.io/io/mongodb/loader.html +[Databases and Collections]: https://www.mongodb.com/docs/manual/core/databases-and-collections/#databases-and-collections +[migr8]: https://cratedb-toolkit.readthedocs.io/io/mongodb/migr8.html +[MongoDB]: https://www.mongodb.com/docs/manual/tutorial/install-mongodb-community-with-docker/ +[MongoDB Atlas]: https://www.mongodb.com/cloud/atlas +[MongoDB Database tools]: https://www.mongodb.com/docs/database-tools/installation/installation-linux/ +[report them]: https://github.com/crate-workbench/cratedb-toolkit/issues diff --git a/docs/integrate/etl/mongodb.rst b/docs/integrate/etl/mongodb.rst deleted file mode 100644 index 537ed97a..00000000 --- a/docs/integrate/etl/mongodb.rst +++ /dev/null @@ -1,141 +0,0 @@ -.. highlight:: psql - -.. _integrate-mongodb: -.. _migrating-mongodb: - -======================== -Import data from MongoDB -======================== - -.. rubric:: Table of contents - -.. contents:: - :local: - - -Exporting data from MongoDB -=========================== - -When exporting data from a MongoDB collection, it is exported in the `MongoDB -Extended JSON`_ file format, which includes additional type information. This -additional information makes the format unsuitable for importing into a CrateDB -table. To help with this problem, we have created a `MongoDB migration tool`_ -that can export a MongoDB collection while converting it into a CrateDB friendly -format. - -First, download & install the tool according to the instructions on the repo. -You can then export a collection into a JSON file, as follows: - -.. 
code-block:: sh - - $ migr8 export --host --port --database --collection > data.json - - -Importing data into CrateDB -=========================== - -Before the converted file can be imported into CrateDB a table has to be -created. - -A basic CREATE TABLE statement looks as follows:: - - cr> CREATE TABLE mytable ( - ... name TEXT, - ... obj OBJECT (DYNAMIC) - ... ) CLUSTERED INTO 5 SHARDS WITH (number_of_replicas = 0); - CREATE OK, 1 row affected (... sec) - -In CrateDB each field is indexed by default. It is not necessary to create -any additional indices. - -However, if some fields are never used for filtering, indexing can be turned -off:: - - cr> CREATE TABLE mytable2 ( - ... name TEXT, - ... obj OBJECT (DYNAMIC), - ... dummy TEXT INDEX OFF - ... ) CLUSTERED INTO 5 SHARDS WITH (number_of_replicas = 0); - CREATE OK, 1 row affected (... sec) - -For fields that contain text consider using a full-text analyzer. This will -enable great full-text search capabilities. See :ref:`Indices and Fulltext -Search ` for more information. - -CrateDB is able to create dynamically defined table schemas, which can be -extended as data is inserted, so it is not necessary to define all the columns -up front:: - - cr> CREATE TABLE mytable3 ( - ... name TEXT, - ... obj OBJECT (DYNAMIC), - ... dumm TEXT index off - ... ) CLUSTERED INTO 5 SHARDS WITH (number_of_replicas = 0, column_policy = 'dynamic') - -Given the table above, it is possible to insert new columns at the top level of -the table and insert arbitrary objects into the **obj** column:: - - cr> INSERT INTO mytable3 (name, obj, newcol, dummy) VALUES - ... ('Trillian', {gender = 'female'}, 2804, 'dummy'); - INSERT OK, 1 row affected (... sec) - - cr> REFRESH TABLE mytable3; - REFRESH OK, 1 row affected (... sec) - -.. 
Hidden: wait for schema update so that newcol is available - - cr> _wait_for_schema_update('doc', 'mytable3', 'newcol') - -:: - - cr> SELECT * FROM mytable3; - +-------+----------+--------+----------------------+ - | dummy | name | newcol | obj | - +-------+----------+--------+----------------------+ - | dummy | Trillian | 2804 | {"gender": "female"} | - +-------+----------+--------+----------------------+ - SELECT 1 row in set (... sec) - -However, this has some limitations. For example timestamps in long format won't -be recognised as timestamps. Due to this limitation it is recommended to -specify fields up front. - -In these cases, the `MongoDB migration tool`_ can be used to autogenerate -a schema to fit your collection. For example, to create the above schema without -resorting to using a dynamic table definition: - -.. code-block:: sh - - $ migr8 extract --host --port --database --collection --scan full --out schema.json - $ migr8 translate --infile schema.json - - MongoDB -> CrateDB Exporter :: Schema Extractor - - Collection 'mytable': - CREATE TABLE IF NOT EXISTS "doc"."mytable" ( - "name" TEXT, - "obj" OBJECT (DYNAMIC) AS ( - "gender" TEXT - ), - "newcol" INTEGER, - "dummy" TEXT - ); - -This can be useful for collections with complex or heavily-nested schemas. - -.. SEEALSO:: - - - :ref:`Data Definition ` - - :ref:`CREATE TABLE ` - - -After the table has been created the file can be imported using -:ref:`COPY FROM `. - -.. SEEALSO:: - - :ref:`bulk-inserts` - - -.. _MongoDB Extended JSON: https://docs.mongodb.com/manual/reference/mongodb-extended-json/ -.. _MongoDB migration tool: https://github.com/crate/mongodb-cratedb-migration-tool