Feature/emm #1516

Merged: 23 commits into develop from feature/emm, Nov 21, 2024
Conversation

jannistsiroyannis (Contributor):
This is not a finalized thing, but enough (in my opinion) to merge and start testing at scale.

sendDumpPageResponse(whelk, apiBaseUrl, dump, dumpFilePath, offsetNumeric, res);
}

private static void sendDumpIndexResponse(String apiBaseUrl, HttpServletResponse res) throws IOException {
Member:

This is a good idea which we need to integrate into the linked data surface.

We need to describe these as JSON-LD. This response has no @context yet, so we don't know what the index is for, nor what the categories mean.

I think EMM is silent about this, but XL can describe and link to them using KBV/platform terminology. These entity sets should be linked to as data dumps of regularly described datasets (to avoid adding yet another notion of dataset; cf. data.kb.se and libris datasets).
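For illustration, a minimal sketch of how such a described index might be built server-side. Everything concrete in it is an assumption rather than a settled decision: the context URL, the @type, and the category ids are placeholders, not agreed KBV/platform terms.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DumpIndexSketch {
    static Map<String, Object> buildIndex(String apiBaseUrl) {
        Map<String, Object> index = new LinkedHashMap<>(); // keeps a readable key order
        index.put("@context", "https://id.kb.se/context.jsonld"); // assumed context URL
        index.put("@id", apiBaseUrl);
        index.put("@type", "DataCatalog"); // assumed type, to be decided
        index.put("dataset", List.of(
                Map.of("@id", apiBaseUrl + "?category=all"),
                Map.of("@id", apiBaseUrl + "?category=agents"))); // example categories
        return index;
    }
}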

jannistsiroyannis (Contributor, author):
I agree! And I will need your help to get the details right!

Member:

I'll draft something. What do we need?

  1. The shape of the download; e.g. a gzipped archive stream of JSON-LD files, each representing one of our named graphs. That is, the stored data with the @context and the @id of the record added, both alongside the top-level @graph (see the sketch after this list; ping @olovy).
  2. Discovery of the entity sets, possibly based on Dataset and datasetDistribution (ping @olovy and @klngwll). This could be added later, with the initial entity sets just shared on a "need to know" basis (we have a bunch of possible nice-to-haves, e.g. NB and subject headings; see the dump page in the devops repo, which we can make obsolete with this).
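A rough sketch of what option 1 could look like, under stated assumptions: Jackson and Commons Compress on the classpath, a made-up GraphRecord type standing in for whatever XL actually stores, and a placeholder context URL.

import java.io.IOException;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

class DumpArchiveSketch {
    // Hypothetical stand-in for a stored named graph; not an XL type.
    record GraphRecord(String id, Object graph) {}

    static final ObjectMapper MAPPER = new ObjectMapper();

    static void writeArchive(List<GraphRecord> records, OutputStream out) throws IOException {
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(new GZIPOutputStream(out))) {
            for (GraphRecord r : records) {
                // One standalone JSON-LD document per record: @context and the
                // record's @id sit beside the stored top-level @graph.
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("@context", "https://id.kb.se/context.jsonld"); // assumed context URL
                doc.put("@id", r.id());
                doc.put("@graph", r.graph());
                byte[] bytes = MAPPER.writeValueAsBytes(doc);

                TarArchiveEntry entry = new TarArchiveEntry(r.id() + ".jsonld");
                entry.setSize(bytes.length);
                tar.putArchiveEntry(entry);
                tar.write(bytes);
                tar.closeArchiveEntry();
            }
        }
    }
}

A tar.gz of individual .jsonld files is only one possible shape; a single gzipped stream of newline-delimited documents would work with the same wrapping step.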

jannistsiroyannis (Contributor, author):
Let's talk back-channel, I'll write to you!


public class Dump {
private static final Logger logger = LogManager.getLogger(Dump.class);
private static final String DUMP_END_MARKER = "_DUMP_END_MARKER\n"; // Must be 17 bytes
olovy (Contributor), Nov 20, 2024:

I think it would be nice to let the reader know what all the 17s are about.
Looks like it will work until the year 2593!
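One way to name the magic number, as a sketch. The interpretation here (every line in the dump file is a fixed-width 17-byte record, 16 bytes of payload plus a newline) is inferred from the offset arithmetic in this PR and from the marker's length, so treat it as an assumption:

class DumpConstantsSketch {
    // Every line in the dump file is a fixed-width record: 16 bytes of payload
    // plus a trailing '\n' (inferred from the offset arithmetic; an assumption).
    static final int DUMP_LINE_BYTES = 17;
    // The end marker is deliberately exactly one line wide: 16 chars + '\n'.
    static final String DUMP_END_MARKER = "_DUMP_END_MARKER\n";

    static long offsetBytes(long offsetLines) {
        return DUMP_LINE_BYTES * offsetLines;
    }
}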

return;
}

// THIS SHIT is so painful in Java :(
olovy (Contributor), Nov 20, 2024:

You can use Map.of(...) and List.of(...), or new HashMap<>(Map.of(...)) if mutability is needed.

Still pretty painful.
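For reference, the two suggested variants as a compilable fragment (the names and values are placeholders):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LiteralsSketch {
    static Map<String, Object> immutable() {
        // Compact literals; the result is immutable and iteration order is unspecified.
        return Map.of("name", "all", "ids", List.of("a", "b"));
    }

    static Map<String, Object> mutable() {
        // Copy into a HashMap when the map still needs to be modified afterwards.
        return new HashMap<>(Map.of("name", "all"));
    }
}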

Contributor:

But using put() on a LinkedHashMap, which keeps insertion order, will give a nicer response for humans.
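A minimal illustration of the point (field names are placeholders): Map.of gives no iteration-order guarantee at all, while LinkedHashMap emits keys in the order they were put.

import java.util.LinkedHashMap;
import java.util.Map;

class OrderedResponseSketch {
    static Map<String, Object> index() {
        // Unlike Map.of, LinkedHashMap iterates keys in insertion order,
        // so the JSON a human reads matches the order written here.
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("name", "all");      // emitted first
        m.put("totalEntities", 0); // emitted second, and so on
        return m;
    }
}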

String apiBaseUrl = req.getRequestURL().toString();

res.setCharacterEncoding("utf-8");
res.setContentType("application/json");

// Is there not enough data for a full page yet ?
long offsetBytes = 17 * offsetLines;
while (!dumpFinished && file.length() < offsetBytes + (17 * (long)EmmChangeSet.TARGET_HITS_PER_PAGE)) {
Contributor:

Need some kind of timeout/limit here?
Otherwise it will get stuck in an infinite loop if generateDump fails in the middle for some reason.
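A sketch of the suggested guard, under assumptions: the one-minute limit, the poll interval, and the dumpFinished/file parameters are stand-ins for the real state in Dump.java, not its actual API.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.function.BooleanSupplier;

class WaitForPageSketch {
    // Wait until the dump file holds enough bytes for a full page, but give up
    // after a deadline so a failed generateDump can't hang the request forever.
    static boolean waitForBytes(RandomAccessFile file, long neededBytes, BooleanSupplier dumpFinished)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + 60_000; // assumed one-minute limit
        while (!dumpFinished.getAsBoolean() && file.length() < neededBytes) {
            if (System.currentTimeMillis() > deadline) {
                return false; // caller can answer with an error instead of looping
            }
            Thread.sleep(200); // assumed poll interval
        }
        return true;
    }
}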

olovy (Contributor) left a review:

👍
As discussed offline, we can merge this and then explore the options for full dumps further.

jannistsiroyannis merged commit 6d25b30 into develop on Nov 21, 2024 (1 check passed).
jannistsiroyannis deleted the feature/emm branch on Nov 21, 2024, 12:21.