please come up with "overlay" versioning scheme (e.g. in ds 201) #9

Open
yarikoptic opened this issue Sep 15, 2016 · 17 comments

@yarikoptic

https://openfmri.org/dataset/ds000201/
release 1.0.1 is a bugfix release and a new tarball was provided for "Metadata, demographics, survey, questionnaire, eye tracking, and non-imaging data (387 MB)". But the other tarballs were not uploaded for release 1.0.1, which is somewhat logical since they didn't change.

My concern is how I (well, my software) should decide whether the next version is an overlay (i.e. take the old files and replace only the updated ones) or a complete new release, in which some similarly named tarballs might indeed have been removed. One possible convention: releases that change only the last component (e.g. here from .0 to .1) are overlay releases, and I should assume that every tarball present in the previous release is either still present or replaced in the current one.
Not sure if I have coded such logic already, and not sure it would be free of glitches: in some datasets even a minor inconsistency in a filename could introduce "difficulties", e.g. hypothetically having "ds201_R1.0.0_dwi.tar" and "ds201_R1.0.1_dwis.tar".

But it would be nice to come up with some consistent and "standard" convention (should also be explained somewhere on the website)
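A minimal sketch of the "last component changed means overlay" rule, in Python. It is not existing openfmri or datalad code: the regular expression, `parse_tarball`, and `is_overlay` are hypothetical names, and treating a name without an `R<version>` part as 1.0.0 is an assumption. It also shows how the hypothetical "dwi" vs "dwis" mismatch would slip through a purely name-based match.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Hypothetical pattern: "<dataset>_[R<major.minor.patch>_]<suffix>.<tar|tgz|tar.gz>"
TARBALL_RE = re.compile(
    r"^(?P<ds>ds\d+[A-Z]?)_(?:R(?P<ver>\d+\.\d+\.\d+)_)?(?P<suffix>.+?)"
    r"\.(?:tgz|tar\.gz|tar)$"
)


class Tarball(NamedTuple):
    dataset: str
    version: Tuple[int, int, int]
    suffix: str


def parse_tarball(name: str) -> Optional[Tarball]:
    """Split a tarball name into dataset, version and suffix; None if no match."""
    m = TARBALL_RE.match(name)
    if m is None:
        return None
    # Assumption: a name without an R<version> part is treated as release 1.0.0.
    ver = m.group("ver") or "1.0.0"
    major, minor, patch = (int(p) for p in ver.split("."))
    return Tarball(m.group("ds"), (major, minor, patch), m.group("suffix"))


def is_overlay(prev: Tuple[int, int, int], new: Tuple[int, int, int]) -> bool:
    """True when only the last (patch) component increased."""
    return new[:2] == prev[:2] and new[2] > prev[2]


old = parse_tarball("ds201_R1.0.0_dwi.tar")
new = parse_tarball("ds201_R1.0.1_dwis.tar")
print(is_overlay(old.version, new.version))  # True
print(old.suffix == new.suffix)              # False: "dwi" vs "dwis"
```

With names like these, the patch release would be classified as an overlay, yet the new tarball would not replace the old one because the suffixes differ, which is exactly the kind of glitch a documented convention would need to rule out.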

@vsoch

vsoch commented Sep 15, 2016

I'm not sure exactly how the workflow is on the backend, but given that the tarballs are served on s3 that limits the versioning to the (current) file naming method, which of course also matches BIDS. Maybe an idea is to integrate some form of "real" version control, because you could easily use tags and releases for a simple file with the md5 sum and file name of the tarballs for the release. That could (eventually) be a sort of quasi "openfmri-data" hub, where the user would start with the version defined in the file, and then use some base path (in this case the folder on aws) to retrieve the correct files for the version. Just a thought! :)
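One way to read the "simple file with the md5 sum and file name" idea, as a sketch only. The JSON layout, the field names, and the `build_manifest` / `write_manifest` helpers are assumptions for illustration, not an existing openfmri format.

```python
import hashlib
import json
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks so large tarballs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(version, tarball_paths):
    """Describe one release as a version plus the name and md5 of each tarball."""
    files = [
        {"name": Path(p).name, "md5": md5sum(p)}
        for p in sorted(str(p) for p in tarball_paths)
    ]
    return {"version": version, "files": files}


def write_manifest(manifest, out_path):
    """Write the manifest as JSON; this is the file that would be tagged in git."""
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(out_path).write_text(text)
```

A release would then list every tarball that belongs to it, including the ones carried over unchanged, so nothing has to be inferred from file names.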

@yarikoptic
Author

ah... I see that 157 also has the same situation (1.0.2, which updated only some tarballs).
So -- should I only consider a 3rd digit increment as an 'overlay'? I see only ds 3, which had an increment in the 2nd number, but there it was obvious since only one tarball went from _raw to a versioned one..

@yarikoptic
Author

yarikoptic commented Sep 15, 2016

as for versioning -- just give a try to git-annex/datalad ;) (see e.g. http://datasets.datalad.org/?dir=/openfmri ; datalad install ///openfmri/ds000001 for a try).
in my current case I just want to reach some 'standard' agreement on which versions will come as 'overlays' and which as 'complete', so that any disappearing (not listed) tarball would be 'intentional'

@vsoch

vsoch commented Sep 15, 2016

imho I don't think the version number on the file name is enough given that "old version" files are brought forward with new versions - this scheme would only work given that all "old version" files are copied and provided with the new version, so there is never any doubt. If you do my suggestion above, you would also likely want:

  1. a dev command line tool to make it easy to select included files and generate the log, and then push to a github branch for a PR
  2. the continuous integration testing would need to check the md5 sum against the files and verify they exist at some base
  3. the tool should then make it easy to update the site
  4. the user should then be able to select some version (based on the github release) and download the appropriate files
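Step 2 of the list above could look roughly like this. It is a sketch under assumptions: the manifest is a dict with a `files` list of `name`/`md5` entries (a hypothetical layout), and the "base" is a local directory rather than the S3 location, so the check runs without network access.

```python
import hashlib
from pathlib import Path


def verify_manifest(manifest, base_dir):
    """Return a list of problems; an empty list means the release checks out."""
    problems = []
    for entry in manifest["files"]:
        path = Path(base_dir) / entry["name"]
        if not path.is_file():
            problems.append("missing: " + entry["name"])
            continue
        # Stream the file so multi-gigabyte tarballs are not loaded at once.
        digest = hashlib.md5()
        with open(path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != entry["md5"]:
            problems.append("md5 mismatch: " + entry["name"])
    return problems
```

A continuous integration job could fail the pull request whenever the returned list is not empty.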

I've never used datalab but it looks cool!

@yarikoptic
Author

datalad (just like git, but a good one, i.e. a lad) ;)

I guess I will reread your answers tomorrow morning with a clear mind to see if we aren't talking past each other ;) Thanks and Cheers!

@yarikoptic
Author

"simple file with the md5 sum and file name of the tarballs for the release" -- yes, that would be nice, since it would define the release unambiguously. But it would require the user to look into that file or, ideally, the openfmri web frontend to use it to present a view of "files of current release" or something like that. So all in all it would require some development...
But I think "not all is lost" with how things are already set up, and I am just asking for some formalization of the workflow. I will have a brief look later at what tarballs are provided:

$> git submodule foreach git ls-tree incoming
Entering 'ds000001'
040000 tree 53e962d5ab62ec6106e91291767cda58ec0d1caf    .datalad
100644 blob a6009f4392d2433656aa83ba285c90fa25821a31    .gitattributes
100644 blob 7ac454fbdca3c28cc50afe5440f40838810dae9b    changelog.txt
120000 blob 912e3153ddddfb8791b571c4a0175511a5707243    ds001_R1.1.0_raw.tgz
120000 blob 44b106c72a44ec8700e687d342dc3ea5b8dd1252    ds001_R2.0.0_raw.tgz
120000 blob 3465a356d86b4901015ec2814cddb402d9c61092    ds001_raw.tgz
Entering 'ds000002'
040000 tree e88517df6de36565c0ee8fca48e19335acc308d7    .datalad
100644 blob a6009f4392d2433656aa83ba285c90fa25821a31    .gitattributes
100644 blob eb41b053fcb4922422fcf16d893a943fc8907f14    changelog.txt
120000 blob 3692d181d7727475055860d32e9d73027c58b61a    ds002_raw.tgz
Entering 'ds000003'
040000 tree 7f1b959b15078502ca8cdeaff07dbf4c99799988    .datalad
100644 blob a6009f4392d2433656aa83ba285c90fa25821a31    .gitattributes
100644 blob a41909cf5a65af564af5301ec0e71c13c5166948    changelog.txt
120000 blob a79f6976293288eb2dc2794d6a341ef2b6a73fa7    ds003_R1.1.0_raw.tgz
120000 blob b8d579660c742b04c99e53b7933d1b58515d458e    ds003_raw.tgz
...

(full list is at http://www.onerussian.com/tmp/openfmri-incoming-20160915.txt; hopefully it matches what is available from the s3 archives/, I didn't check atm)

@yarikoptic
Author

somewhat of a demo example question: ds 17A -- at first there were ds017A_models.tgz and ds017A_raw.tgz, and then ds017A_R1.1.0_raw.tgz. Assuming that the "non-versioned" release was the 1.0.0 release (I think we agreed on that some time ago), I have a 2nd digit change. But are the _models still applicable to the 1.1.0 release or not? I guess not, since I think (I can only guess, since the changelog file includes dates but not release numbers) that 1.1.0 was the one fixing orientation in the nifti files, thus invalidating most if not all derived data as well. So it seems that not doing an overlay for this one (and it was a 2nd digit bump) would be the correct behavior, right?

@poldrack
Owner

IIRC, models were solely based on pre-bids data


Russell A. Poldrack
Albert Ray Lang Professor of Psychology
Bldg. 420, Jordan Hall
Stanford University
Stanford, CA 94305

[email protected]
http://www.poldracklab.org/

@yarikoptic
Author

and 1.1.0 for that is also pre-BIDS -- I am just dealing with datasets already released on openfmri

Re versioning -- an easy-to-grasp convention would be "major.minor.patch", where a ".patch" (3rd level) release is assumed to be an overlay that patches the previous "major.minor.(patch-1)" state. The overlay works at the level of tarball names: if there are 1.1.0_blah.tar.gz and 1.1.1_blah.tar.gz, I would simply not consider the 1.1.0 tarball for the "1.1.1" release, so if any file got removed within 1.1.1_blah.tar.gz, it will be removed from the 1.1.1 release. So it is not an overlay in the sense of "I extract all the tarballs on top of each other and just keep extending the content"... sorry for the confusing language, but maybe you get my point
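To make the tarball-level overlay concrete, here is a small sketch. The `releases` mapping, the `resolve_release` helper, and the "ds999" names are all hypothetical; the only rule encoded is the one proposed above, that within the same major.minor a later tarball with the same suffix replaces the earlier one as a whole.

```python
def resolve_release(releases, version):
    """Effective {suffix: tarball name} for `version`.

    `releases` maps (major, minor, patch) -> {suffix: tarball name} as uploaded.
    """
    major, minor, patch = version
    effective = {}
    # Walk patch levels 0..patch; a later tarball with the same suffix replaces
    # the earlier one entirely (nothing is extracted on top of anything).
    for p in range(patch + 1):
        effective.update(releases.get((major, minor, p), {}))
    return effective


# Hypothetical dataset "ds999", names chosen for illustration only.
releases = {
    (1, 1, 0): {"blah": "ds999_R1.1.0_blah.tar.gz", "raw": "ds999_R1.1.0_raw.tar.gz"},
    (1, 1, 1): {"blah": "ds999_R1.1.1_blah.tar.gz"},
}
print(resolve_release(releases, (1, 1, 1)))
# {'blah': 'ds999_R1.1.1_blah.tar.gz', 'raw': 'ds999_R1.1.0_raw.tar.gz'}
```

A change in the major or minor component starts from an empty set here, so any tarball not listed for that release is treated as intentionally dropped.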

@yarikoptic
Author

another "interesting" dataset is ds 9:

$> ls -lL
total 24427496
-rw-r--r-- 1 yoh datalad        207 Mar 31 01:01 changelog.txt
-r--r--r-- 1 yoh datalad 6003651730 Feb 21  2016 ds009_R1.1.0_raw.tgz
-r--r--r-- 1 yoh datalad 4007257776 Mar 25 22:39 ds009_R2.0.0_01-17.tgz
-r--r--r-- 1 yoh datalad 2364753676 Mar 25 22:51 ds009_R2.0.0_18-29.tgz
-r--r--r-- 1 yoh datalad  124894379 Mar 25 22:52 ds009_R2.0.0_toplevel_metadata.tgz
-r--r--r-- 1 yoh datalad 4007258231 May  8 15:49 ds009_R2.0.1_01-17.tgz
-r--r--r-- 1 yoh datalad 2364753689 May  8 15:51 ds009_R2.0.1_18-29.tgz
-r--r--r-- 1 yoh datalad  124894223 May  8 15:52 ds009_R2.0.1_metadata_derivatives.tgz
-r--r--r-- 1 yoh datalad 6016271853 Apr  9  2015 ds009_raw.tgz

where the suffix has changed from toplevel_metadata to metadata_derivatives, so it is not clear (just by looking at filenames) whether it is a new file added on top of the previous release (2.0.0), which would still provide the differently named file, or an update of that differently named previous file.
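A name-only comparison can at least flag such cases for a human to look at. The sketch below uses the suffixes from the ds009 listing; `suffix_changes` and `is_ambiguous` are hypothetical helpers, and the "ambiguous" criterion (something disappeared and something new appeared in the same patch release) is an assumption.

```python
def suffix_changes(prev_suffixes, new_suffixes):
    """Compare the tarball suffixes of two consecutive releases."""
    prev, new = set(prev_suffixes), set(new_suffixes)
    return {
        "replaced": sorted(prev & new),
        "only_in_previous": sorted(prev - new),
        "only_in_new": sorted(new - prev),
    }


def is_ambiguous(changes):
    """A suffix vanished and another appeared: rename or addition? Names cannot tell."""
    return bool(changes["only_in_previous"]) and bool(changes["only_in_new"])


# Suffixes taken from the ds009 listing (R2.0.0 vs R2.0.1).
r200 = ["01-17", "18-29", "toplevel_metadata"]
r201 = ["01-17", "18-29", "metadata_derivatives"]
changes = suffix_changes(r200, r201)
print(changes["only_in_previous"], changes["only_in_new"])
# ['toplevel_metadata'] ['metadata_derivatives']
print(is_ambiguous(changes))  # True
```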

@yarikoptic
Author

yarikoptic commented Sep 17, 2016

BTW -- looking at the API (which is nice!) https://openfmri.org/dataset/api/ds000201/ - 1.0.1 release lists only the "overlay/patch" (changed) file and not any other files which still apply to 1.0.1 release from 1.0.0 release @chrisfilo ?

@chrisgorgo

  1. This is the wrong repo to report those bugs - in the future please use https://github.com/poldracklab/open_fmri
  2. @suyashdb @jbwexler could you look into this?

@yarikoptic
Author

oh, ok -- then please move #10 there as well.

@jbwexler

We are currently trying to decide on practices for the workflow that will be used consistently, which will hopefully help with machine readability. One change we are planning to make is, for each new revision, to make copies of all the unaltered files from the previous revision, and rename these according to the new revision. See ds117: https://openfmri.org/dataset/ds000117/
Would this solve the issue?

Though even if this solves the problem from now on, there is still the issue with ds201, ds157, and others that had revisions added before we decided on this practice. I'm not sure we want to alter old revisions--what do people think?
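The copy-and-rename practice described above could be scripted along these lines. This is only a sketch: `carried_forward_names` is a hypothetical helper, the "ds999" names are made up, and it assumes every versioned tarball name contains `_R<version>_`.

```python
def carried_forward_names(prev_names, prev_version, new_version, updated_names):
    """Return {old name: new name} for tarballs to copy forward unchanged."""
    old_tag = "_R" + prev_version + "_"
    new_tag = "_R" + new_version + "_"
    updated = set(updated_names)
    mapping = {}
    for name in prev_names:
        if old_tag not in name:
            # No version tag to rewrite (e.g. an unversioned "_raw" tarball); skip.
            continue
        renamed = name.replace(old_tag, new_tag, 1)
        if renamed not in updated:
            # Nothing new was uploaded under this name, so copy the old file forward.
            mapping[name] = renamed
    return mapping


# Hypothetical example: only the metadata tarball changed in 1.0.1.
prev = ["ds999_R1.0.0_raw.tgz", "ds999_R1.0.0_metadata.tgz"]
updated = ["ds999_R1.0.1_metadata.tgz"]
print(carried_forward_names(prev, "1.0.0", "1.0.1", updated))
# {'ds999_R1.0.0_raw.tgz': 'ds999_R1.0.1_raw.tgz'}
```

With every release self-contained in this way, software no longer has to guess which older tarballs still apply.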

@poldrack
Owner

cc’ing Chris on this to make sure he is following this thread…


@suyashdb

Hello Joe,
Let's get the versioning finalized and documented. One other thought I have:

  • if we choose not to update the versioning of old datasets, we should document
    all previously used flavors of versioning somewhere on the website for users to
    understand. What do you think?

-suyash


@poldrack
Owner

agreed
rp

