
[destination-mongodb] Out of memory error when copying from temporary to permanent collection #48851

Open

matthieu-dujany-technis opened this issue Dec 9, 2024 · 0 comments

Connector Name

destination-mongodb

Connector Version

0.2.0

What step the error happened?

During the sync

Relevant information

Hello,

I am reporting an out-of-memory error that occurs with the MongoDB destination connector.
The error happens at the end of a synchronisation job, when the connector copies all the synchronised records from a temporary collection to a permanent one.

I believe this error is caused by the copyTable function of the mongodb connector: link to source code

All the documents to copy are loaded into a single in-memory list before being inserted into the new collection, without any batching. If there are too many documents, this list can exhaust the Java heap.
This is consistent with my experience: the error only occurs when I synchronise a high volume of data.

I suggest fixing this by adding batching. The following code snippet uses a fixed batch size in terms of number of documents:

// Requires: import static com.mongodb.client.model.Projections.excludeId;
private static void copyTable(final MongoDatabase mongoDatabase, final String collectionName, final String tmpCollectionName) {
    final var tempCollection = mongoDatabase.getOrCreateNewCollection(tmpCollectionName);
    final var collection = mongoDatabase.getOrCreateNewCollection(collectionName);

    // Maximum number of documents held in memory at once
    final int BATCH_SIZE = 1000;

    // Holds the current batch of documents
    final List<Document> batch = new ArrayList<>();

    try (final MongoCursor<Document> cursor = tempCollection.find().projection(excludeId()).iterator()) {
        while (cursor.hasNext()) {
            batch.add(cursor.next());

            // When the batch size is reached, insert the batch and clear the list
            if (batch.size() == BATCH_SIZE) {
                collection.insertMany(new ArrayList<>(batch));
                batch.clear();
            }
        }

        // Insert remaining documents that didn't fill a complete batch
        if (!batch.isEmpty()) {
            collection.insertMany(batch);
        }
    }
}

This snippet already improves on the current behaviour.
Even better would be a batch bounded by memory size rather than by document count, since the size of each MongoDB document can vary.
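To illustrate the size-bounded variant, here is a minimal, self-contained sketch of the flushing logic in plain Java. The helper name, the `estimateSize` callback, and the byte threshold are all hypothetical; a real implementation in the connector would estimate each `Document`'s BSON size and call `insertMany` in the sink:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.ToLongFunction;

public class SizeBoundedBatcher {

    // Drains `source`, flushing a batch to `sink` whenever the accumulated
    // estimated size reaches maxBatchBytes. Returns the number of flushes.
    // (Hypothetical helper; the threshold would be chosen well below
    // MongoDB's message size limits.)
    static <T> int copyInSizeBoundedBatches(final Iterator<T> source,
                                            final Consumer<List<T>> sink,
                                            final long maxBatchBytes,
                                            final ToLongFunction<T> estimateSize) {
        final List<T> batch = new ArrayList<>();
        long batchBytes = 0;
        int flushes = 0;
        while (source.hasNext()) {
            final T doc = source.next();
            batch.add(doc);
            batchBytes += estimateSize.applyAsLong(doc);
            if (batchBytes >= maxBatchBytes) {
                sink.accept(new ArrayList<>(batch));
                batch.clear();
                batchBytes = 0;
                flushes++;
            }
        }
        // Flush the remaining partial batch, if any
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        // Ten 100-byte "documents" with a 250-byte cap: a flush happens
        // after every third document, plus one final partial flush.
        final List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add("x".repeat(100));
        final List<Integer> batchSizes = new ArrayList<>();
        final int flushes = copyInSizeBoundedBatches(docs.iterator(),
                b -> batchSizes.add(b.size()), 250, s -> s.length());
        System.out.println(flushes + " " + batchSizes); // 4 [3, 3, 3, 1]
    }
}
```

The same loop shape as the count-based snippet applies; only the flush condition changes from `batch.size() == BATCH_SIZE` to an accumulated-bytes check.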

Relevant log output

2024-12-05 20:00:13 replication-orchestrator INFO Stream Status Update Received: orders - COMPLETE
2024-12-05 20:00:13 replication-orchestrator INFO Updating status: orders - COMPLETE
2024-12-05 20:00:17 destination INFO i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):80 Airbyte message consumer: succeeded.
2024-12-05 20:00:17 destination INFO i.a.i.d.m.MongodbRecordConsumer(close):90 Migration finished with no explicit errors. Copying data from tmp tables to permanent
2024-12-05 20:09:11 destination INFO Malformed non-Airbyte record (connectionId = 9bb79ad9-f258-4d97-bbe6-03f691541e9e): Terminating due to java.lang.OutOfMemoryError: Java heap space
2024-12-05 20:09:12 replication-orchestrator INFO Destination finished successfully — exiting read dest...
2024-12-05 20:09:12 replication-orchestrator INFO readFromDestination: exception caught
2024-12-05 20:09:12 replication-orchestrator INFO readFromDestination: done. (writeToDestFailed:false, dest.isFinished:true)
2024-12-05 20:09:12 replication-orchestrator INFO Closing StateCheckSumCountEventHandler

Contribute

  • Yes, I want to contribute