Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OAK-11444 [full-gc] Save document id and empty properties names before deletion #2038

Open
wants to merge 12 commits into
base: trunk
Choose a base branch
from

Conversation

daniancu
Copy link

@daniancu daniancu commented Jan 30, 2025

Introduce a FullGcBin class as a wrapper around DocumentStore that expose two methods used to clean garbage from NODES collection

public int remove(Map<String, Long> orphanOrDeletedRemovalMap)
public List<NodeDocument> findAndUpdate(List<UpdateOp> updateOpList)

When enabled
Each methods saves the document ID or empty properties names (that will be deleted) to a separate _bin collection as a BinDocument then delegates deletion to DocumentStore

When disabled (default)
Each method delegates directly to DocumentStore

This PR does not contain the necessary configuration parameter to turn on/off the fullGcBin. I'll create a separate PR for that (due the fact how config params are passed in, it will require additional 15 file changes and will make this PR hard to follow)

Update: PR that adds configuration for this feature is available here: daniancu#1

Copy link
Contributor

@reschke reschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this adds a new Collection for the first time in more than a decade. I believe this needs more discussion, optimally not in the PR but in the ticket.

@daniancu daniancu marked this pull request as ready for review February 14, 2025 12:26
@daniancu
Copy link
Author

I've addressed all issues mentioned and increased test coverage to 100%

@thomasmueller thomasmueller self-requested a review February 14, 2025 13:38
@daniancu daniancu requested review from shodaaan and reschke February 14, 2025 15:04
import java.util.Map;
import java.util.stream.Collectors;

public class FullGcBin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding a short class description summarizing how and where this class is used, for example: "used in MongoDB garbage collection in order to first store the deleted document paths in MongoDB before doing the deletion".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

documentStore = ds;
}

public int remove(Map<String, Long> orphanOrDeletedRemovalMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider adding a short description (one line) for what the methods do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try {
return documentStore.create(Collection.SETTINGS, docs);
} catch (Exception e) {
LOG.error("Error while adding delete candidate documents to bin", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we also log the paths of the documents that failed to be added to the bin, they will be helpful for debug purposes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

documentStore.createOrUpdate(Collection.SETTINGS, binOpList);
return true;
} catch (Exception e) {
LOG.error("Error while adding removed properties to bin", e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we also log the paths of the properties that failed to be added to the bin, will be helpful for debugging.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@shodaaan shodaaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added suggestions for logging of paths and adding class and method descriptions.

Copy link
Contributor

@shodaaan shodaaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TTL index for the /bin documents should also be created automatically.
I suggest doing this in MongoDocumentStore.ensureIndexes method.

@daniancu
Copy link
Author

The TTL index for the /bin documents should also be created automatically. I suggest doing this in MongoDocumentStore.ensureIndexes method.

what should be the TTL for this documents? 30 days? 60 days? ...

@shodaaan
Copy link
Contributor

The TTL index for the /bin documents should also be created automatically. I suggest doing this in MongoDocumentStore.ensureIndexes method.

what should be the TTL for this documents? 30 days? 60 days? ...

I suggest we start with 90 days.

@@ -191,10 +191,10 @@ private static class RevisionsOptions extends Utils.NodeStoreOptions {
fullGcProgressSize = parser.accepts("fullGcProgressSize", "The number of documents to check for " +
"garbage in each Full GC cycle")
.withRequiredArg().ofType(Integer.class).defaultsTo(10000);
fullGcMaxAge = parser.accepts("fullGcMaxAge", "The maximum age of the document in seconds " +
fullGcMaxAgeSec = parser.accepts("fullGcMaxAge", "The maximum age of the document in seconds " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep both variable name & param name same.

Suggested change
fullGcMaxAgeSec = parser.accepts("fullGcMaxAge", "The maximum age of the document in seconds " +
fullGcMaxAge = parser.accepts("fullGcMaxAge", "The maximum age of the document in seconds " +

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@shodaaan shodaaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, approved.

* Each method delegates directly to DocumentStore
*/
public class MongoFullGcNodeBin implements FullGcNodeBin {
private static final Logger LOG = getLogger(MongoFullGcNodeBin.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use Fully qualified class name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean @rishabhdaim ... why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean @rishabhdaim ... why?

me too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avoid static import of "getLogger" for consistency with the rest of Oak :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, LoggerFactory.getLogger instead of getLogger. Right!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@thomasmueller thomasmueller self-requested a review March 3, 2025 13:01
return mongoDocumentStore.findAndUpdate(Collection.NODES, updateOpList);
}

protected boolean addToBin(Map<String, Long> orphanOrDeletedRemovalMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add the comments about what would be added in DB after this operation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


public MongoFullGcNodeBin(MongoDocumentStore ds) {
mongoDocumentStore = ds;
enabled = System.getProperty("oak.document.fullGcBin.enabled", "false").equals("true");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically have a (public) constant and then use this here.

Also, Boolean.getBoolean(...)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@daniancu daniancu Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part is gone in the configuration PR

Copy link
Contributor

@reschke reschke Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. But the convenience comes with drawbacks. That's why we have a utility method (logging, value checking etc...). How many times have you wondered whether you really have set the correct property? And what the default was without checking the source code?

Here's an example use:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also using "System.getProperty" instead of "Boolean.getBoolean" makes things worse because you lose common parsing of boolean values...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood. updated as suggested

return mongoDocumentStore.findAndUpdate(Collection.NODES, updateOpList);
}

protected boolean addToBin(Map<String, Long> orphanOrDeletedRemovalMap) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected boolean addToBin(Map<String, Long> orphanOrDeletedRemovalMap) {
private boolean addToBin(Map<String, Long> orphanOrDeletedRemovalMap) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return false;
}

protected boolean persist(List<BasicDBObject> inserts) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
protected boolean persist(List<BasicDBObject> inserts) {
private boolean persist(List<BasicDBObject> inserts) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

return true;
}

BasicDBObject toBasicDBObject(UpdateOp op) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
BasicDBObject toBasicDBObject(UpdateOp op) {
private BasicDBObject toBasicDBObject(UpdateOp op) {

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -481,6 +481,11 @@ private void logSplitDocIdsTobeDeleted(Bson query) {
LOG.debug(sb.toString());
}

@Override
public FullGcNodeBin getFullGCBin() {
return new MongoFullGcNodeBin(store);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be singleton.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@thomasmueller thomasmueller self-requested a review March 3, 2025 13:11
Copy link
Contributor

@rishabhdaim rishabhdaim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think the place to adding getBinCollection() is wrong. We should have this in DocumentStore interface with default implementation of noBin and then override this for MongoDocumentStore only.

For reference we could use default Throttler throttler().

For test we could override this for MemoryDocumentStore too.

cc @Joscorbe @stefan-egli

return false;
}

private boolean addToBin(List<UpdateOp> updateOpList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both addToBin method does exactly the same thing, just differ in input. We should extract out the common code to a common method.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also have the conversion of input param to BasicDBObject which is different. The common part is extracted in the persist() method

@daniancu
Copy link
Author

daniancu commented Mar 3, 2025

Also, I think the place to adding getBinCollection() is wrong. We should have this in DocumentStore interface with default implementation of noBin and then override this for MongoDocumentStore only.

For reference we could use default Throttler throttler().

For test we could override this for MemoryDocumentStore too.

cc @Joscorbe @stefan-egli

we discussed about adding a hidden collection in JIra
The idea was to avoid exposing it to other implementations because we only need it for Mongo one

Copy link
Member

@Joscorbe Joscorbe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all the comments have been addressed, but apparently there is an open question. If the implementation detail has to be challenged I propose to discuss in Jira ticket, so we have everything at one place.

@daniancu
Copy link
Author

daniancu commented Mar 4, 2025

I have merged OAK-11525 containing the configuration part into this PR so we have everything in a single place

@Joscorbe
Copy link
Member

Joscorbe commented Mar 4, 2025

I'm running the pipeline just to ensure it builds properly, and will merge afterwards. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants