run_pacta_data_preparation should export iShares scraping info in a manifest file #26

Open
jdhoffa opened this issue Mar 22, 2023 · 1 comment
Labels: ADO Maintenance Day!, feature (a feature request or enhancement), priority


jdhoffa commented Mar 22, 2023

Supersedes https://github.com/RMI-PACTA/pacta.data.preparation/issues/165

Full context copied manually:
@jdhoffa:

Potentially useful information:

filename,
file_extension,
filesize,
url,
download_time,
base_url,
archive_url (once the URLs are archived)
@cjyetman can you validate which of these fields you think are/aren't useful to output (or if some are missing)?
And also, to be clear: this manifest should relate only to the raw URL, correct?

Relates to https://github.com/RMI-PACTA/pacta.data.preparation/pull/162
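
For concreteness (this is not part of the quoted thread), a minimal sketch of how one manifest row with the proposed fields could be assembled in R. The path, URLs, and variable names below are placeholders, not existing objects or columns in pacta.data.preparation.

```r
# Sketch only: placeholder path and URLs standing in for the real scraping output.
path         <- "ishares_holdings.json"                  # local copy of the scraped file
download_url <- "https://example.com/ajax/holdings.json" # exact URL the file came from
page_url     <- "https://example.com/etf-product-list"   # page whose table is fed by that JSON

manifest_row <- data.frame(
  filename       = basename(path),
  file_extension = tools::file_ext(path),                # may be "" for AJAX-style downloads
  filesize       = file.size(path),                      # bytes, as reported by the filesystem
  url            = download_url,
  download_time  = format(Sys.time(), tz = "UTC", usetz = TRUE),
  base_url       = page_url,
  archive_url    = NA_character_,                        # filled in once the URL is archived
  stringsAsFactors = FALSE
)
```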

@cjyetman:

I guess all of these are relevant.... maybe file_extension is a bit overkill.

filename is critical so you know which file you're talking about

filesize is good to have so that you can verify the file you're looking at is actually the same one being described, because the file could have been modified and you wouldn't be able to tell. Maybe a checksum would be better, but that's a bit more difficult to verify for an average user.

url is the precise location the file was downloaded from. I think this is pretty fundamental to recording the provenance of the data/file

download_time is the precise time that the file was downloaded. This is important because files found at URLs are not necessarily stable, and often change over time, so the URL is not really enough to precisely record the provenance of the data/file

base_url was originally included here because we're capturing a JSON file that technically is not intended for anyone to access directly, and is not linked to or findable by any "normal" web browsing. Instead, the JSON file is used to feed a table on the page found at the "base_url". I have been in situations before where someone else, or a future version of myself, asked "where did you get this from? I can't find it anywhere on that site?", and base_url was the answer.

archive_url: if the page is getting archived (on archive.org), this is also a convenience for anyone in the future trying to find this file or update this process, especially if the file has moved or completely disappeared from the site. One would be able to download the file again from this URL, exactly as it was at the time the archive was made. It's also a good indicator the file WAS archived, which is good to know.

These are all things to precisely record the provenance of the file, and facilitate someone in the future trying to understand something about where it came from, what it means, how to find a new version that's equivalent, etc.

@cjyetman:

Also... I think I had file_extension because sometimes JSON files like this don't even have an extension, since they come from an AJAX request or something, so it's convenient to know what type of file the original developer of the code/archiver expected the file to be, especially if the filename/URL is some random string of characters with no discernible meaning.
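
Again, not from the thread: continuing the sketch above, a checksum column (which the comment above suggests may be more robust than filesize) and a write step could look roughly like this. The output filenames are hypothetical, not the package's actual convention.

```r
# Add a checksum so later readers can verify the file is the one described.
manifest_row$md5 <- unname(tools::md5sum(path))

# Write the manifest next to the other data-preparation outputs
# (output names here are placeholders).
write.csv(manifest_row, "manifest_ishares.csv", row.names = FALSE)

# ...or as JSON, if jsonlite is available.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  jsonlite::write_json(manifest_row, "manifest_ishares.json",
                       pretty = TRUE, auto_unbox = TRUE)
}
```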

AB#9894


cjyetman commented Apr 7, 2023

cjyetman transferred this issue from RMI-PACTA/workflow.data.preparation Apr 12, 2023
jdhoffa added the feature and ADO Maintenance Day! labels Feb 7, 2024
jdhoffa self-assigned this Feb 7, 2024
jdhoffa added ADO Maintenance Day! and removed ADO Maintenance Day! labels Feb 19, 2024
jdhoffa removed their assignment Apr 10, 2024