Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IQSS/10108 Stata mimetype refinement for direct upload #11054

Open
wants to merge 11 commits into
base: develop
Choose a base branch
from

Conversation

qqmyers
Copy link
Member

@qqmyers qqmyers commented Nov 26, 2024

What this PR does / why we need it: As noted in the issue and recently in https://groups.google.com/g/dataverse-community/c/TBx0TTins2k, our ~extensible mimetype detector mechanism assumes starting from a local temp file which isn't useful with direct uploads. The most visible problem we have with this is that all Stata *.dta files get an application/x-status type to start and our ingest sends all of those to the Stata 13 ingestor, which then fails for Stata 14/15 files. With normal uploads, the local detector reads the first few bytes of the file and assigns a more specific type, e.g. application/x-stata-14 (or 15, etc) which gets routed to the correct version of the Stata ingestion code.

Since the determination of Stata version only relies on reading the first ~42 bytes of the file, and only needs those in a buffer/doesn't require a File to start from, this PR adds code to retrieve the required bytes and run the Stata version check on direct upload(S3) and presumably remote files/Globus files on S3 (cases where the storageidentifier is provided during upload and where getInputStream works.)

The PR also has some notes about the potential to clean things up further and implement other mimetype detectors that only require a subset of bytes in a more extensible framework. I don't have any plan to implement this but incremental steps to add other detectors to the direct upload path are probably possible if there are other cases where we're seeing problems.

Which issue(s) this PR closes:

Special notes for your reviewer: I also added a note that I think there's a no-op section now where we check files by extension when the extension is null. If someone knows more about the history there, perhaps we can restore whatever was supposed to happen there, or we could delete it.

The code also makes a slight logging change. After the code was updated to allow getting Mimetypes using the full filename, we've been getting info-level suggestions to perhaps add that specific filename to MimeTypeDetectionByFileName.properties which is pretty noisy. I changed the code to keep info-level if the extension is checked and isn't in MimeTypeDetectionByFileExtension.properties but dropped it to fine for the filename check.

Suggestions on how to test this: Find a Stata file with version 14 or 15 and upload it via direct upload/using a direct upload store, verify that it gets a mimetype including the version and is ingested (assuming ingest size limits, etc. allow). (Some Stata files have a *.dta extension - searching for that might help in finding an example.)

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?: included

Additional documentation: slightly updated the docs to indicate the Stata check is now done in direct upload.

@coveralls
Copy link

coveralls commented Nov 26, 2024

Coverage Status

coverage: 22.565% (-0.006%) from 22.571%
when pulling 976439f on QualitativeDataRepository:IQSS/10108-StataMimeTypeRefinementForDIrectUpload
into a4d0127 on IQSS:develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready for Triage
Development

Successfully merging this pull request may close these issues.

Not ingested file into Dataverse
2 participants