Skip to content

Commit

Permalink
Update doc with schema less mode (#33550)
Browse files Browse the repository at this point in the history
  • Loading branch information
rodireich authored Dec 19, 2023
1 parent 3ade71d commit ca33d30
Showing 1 changed file with 21 additions and 1 deletion.
22 changes: 21 additions & 1 deletion docs/integrations/sources/mongodb-v2.md
Original file line number Diff line number Diff line change
Expand Up @@ -160,10 +160,29 @@ The MongoDB source utilizes change data capture (CDC) as a reliable way to keep

Airbyte utilizes [the change streams feature](https://www.mongodb.com/docs/manual/changeStreams/) of a [MongoDB replica set](https://www.mongodb.com/docs/manual/replication/) to incrementally capture inserts, updates and deletes using a replication plugin. To learn more how Airbyte implements CDC, refer to [Change Data Capture (CDC)](https://docs.airbyte.com/understanding-airbyte/cdc/).

### Schema Enforcement

By default the MongoDB V2 source connector enforces a schema. This means that while setting up a connector it will sample a configureable number of docuemnts and will create a set of fields to sync. From that set of fields, an admin can then deselect specific fields from the Replication screen to filter them out from the sync.

When the schema enforced option is disabled, MongoDB collections are read in schema-less mode which doesn't assume documents share the same structure.
This allows for greater flexibility in reading data that is unstructured or vary a lot in between documents in a single collection.
When schema is not enforced, each document will generate a record that only contains the following top-level fields:
```json
{
"_id": <document id>,
"data": {<a JSON cotaining the entire set of fields found in document>}
}
```
The contents of `data` will vary according to the contents of each document read from MongoDB.
Unlike in Schema enforced mode, the same field can vary in type between document. For example field `"xyz"` may be a String on one document and a Date on another.
As a result no field will be omitted and no document will be rejected.
When Schema is not enforced there is not way to deselect fields as all fields are read for every document.

## Limitations & Troubleshooting

* Only supports [replica set](https://www.mongodb.com/docs/manual/replication/) cluster type.
* Schema discovery uses [sampling](https://www.mongodb.com/docs/manual/reference/operator/aggregation/sample/) of the documents to collect all distinct top-level fields. This value is universally applied to all collections discovered in the target database. The approach is modelled after [MongoDB Compass sampling](https://www.mongodb.com/docs/compass/current/sampling/) and is used for efficiency. By default, 10,000 documents are sampled. This value can be increased up to 100,000 documents to increase the likelihood that all fields will be discovered. However, the trade-off is time, as a higher value will take the process longer to sample the collection.
* Schema discovery uses [sampling](https://www.mongodb.com/docs/manual/reference/operator/aggregation/sample/) of the documents to collect all distinct top-level fields. This value is universally applied to all collections discovered in the target database. The approach is modelled after [MongoDB Compass sampling](https://www.mongodb.com/docs/compass/current/sampling/) and is used for efficiency. By default, 10,000 documents are sampled. This value can be increased up to 100,000 documents to increase the likelihood that all fields will be discovered. However, the trade-off is time, as a higher value will take the process longer to sample the collection.
* When Running with Schema Enforced set to `false` there is no attempt to discover any schema. See more in [Schema Enforcement](#Schema-Enforcement).
* TLS/SSL is required by this connector. TLS/SSL is enabled by default for MongoDB Atlas clusters. To enable TSL/SSL connection for a self-hosted MongoDB instance, please refer to [MongoDb Documentation](https://docs.mongodb.com/manual/tutorial/configure-ssl/).
* Views, capped collections and clustered collections are not supported.
* Empty collections are excluded from schema discovery.
Expand All @@ -181,6 +200,7 @@ Airbyte utilizes [the change streams feature](https://www.mongodb.com/docs/manua
| Username | The username which is used to access the database. Required for MongoDB Atlas clusters. |
| Password | The password associated with this username. Required for MongoDB Atlas clusters. |
| Authentication Source | (MongoDB Atlas clusters only) Specifies the database that the supplied credentials should be validated against. Defaults to `admin`. See the [MongoDB documentation](https://www.mongodb.com/docs/manual/reference/connection-string/#mongodb-urioption-urioption.authSource) for more details. |
| Schema Enforced | Controls whether schema is discovered and enforced. See discussion in [Schema Enforcement](#Schema-Enforcement). |
| Initial Waiting Time in Seconds (Advanced) | The amount of time the connector will wait when it launches to determine if there is new data to sync or not. Defaults to 300 seconds. Valid range: 120 seconds to 1200 seconds. |
| Size of the queue (Advanced) | The size of the internal queue. This may interfere with memory consumption and efficiency of the connector, please be careful. |
| Discovery Sample Size (Advanced) | The maximum number of documents to sample when attempting to discover the unique fields for a collection. Default is 10,000 with a valid range of 1,000 to 100,000. See the [MongoDB sampling method](https://www.mongodb.com/docs/compass/current/sampling/#sampling-method) for more details. |
Expand Down

0 comments on commit ca33d30

Please sign in to comment.