This repository has been archived by the owner on May 7, 2019. It is now read-only.

file size missing in some data objects #18

Open
mjkrause opened this issue Jul 13, 2018 · 0 comments

mjkrause commented Jul 13, 2018

Problem:

This data object has a size attribute:

image

This one does not:

image

A few more details are in this Slack chat from 2018-07-13:

Michael Krause [10:42 AM]
Good morning David. Quick question: the remote-file-manifest requires file size (key is length). I don't want to download the file to find that information. So I'm running a HEAD on one of the files (aws replica). From the headers I see 'X-DSS-SIZE': '5897'. Is this the file size?
nm, i think it is, just downloaded it to confirm
David Steinberg [10:45 AM]
check the DOS message
it should have a size in it
Michael Krause [10:47 AM]
hmm, you mean by using the DOS API?
David Steinberg [10:48 AM]
yeah like the data object message you got back should have a size in it

```
HTTP/1.1 200 OK
Access-Control-Allow-Headers: Authorization,Content-Type,X-Amz-Date,X-Amz-Security-Token,X-Api-Key
Access-Control-Allow-Origin: *
Connection: keep-alive
Content-Length: 482
Content-Type: application/json
Date: Fri, 13 Jul 2018 17:50:25 GMT
Via: 1.1 3d3d633d266d05d90a4eea7a6a59b514.cloudfront.net (CloudFront)
X-Amz-Cf-Id: muSSjRGRpv_uVzAo98yrqw5ZtAGJaRHV4bzziX0lGh9LFzVCgcyW4A==
X-Amzn-Trace-Id: Root=1-5b48e65f-9f2d99288913c1f00dca9808;Sampled=0
X-Cache: Miss from cloudfront
x-amz-apigw-id: J-ju6ErRvHcF0gw=
x-amzn-RequestId: 37a32c67-86c5-11e8-b9b9-ab3e25a1e82e

{
    "data_object": {
        "checksums": [
            {
                "checksum": "0ff8cf77",
                "type": "crc32c"
            }
        ],
        "content_type": "application/octet-stream",
        "id": "46df9ac5-0e72-4aa5-8e6b-6b55b362c29a",
        "size": 1363840,
        "urls": [
            {
                "url": "s3://cgp-commons-public/topmed_open_access/4654080c-9a94-5fcf-a69c-c73059169207/NWD290849.recab.cram.crai"
            },
            {
                "url": "gs://cgp-commons-multi-region-public/topmed_open_access/4654080c-9a94-5fcf-a69c-c73059169207/NWD290849.recab.cram.crai"
            }
        ],
        "version": "2018-05-26T134316.170053Z"
    }
}
```
Michael Krause [11:16 AM]
Here's the data object information on one of the objects we looked at in the notebook, and the second one is the one you just printed. odd thing is that the top one doesn't have size:
```
{'checksums': [{'checksum': 'c873835a74cea9c811cc7799f8897ac480cccf84f631c99b5293900f7a071b53',
                'type': 'sha256'},
               {'checksum': '57db2e71deb4dab5e4b3f251ac9243b0', 'type': 'etag'},
               {'checksum': '05f818a54510272c17dcda69c948f8d904b5aae3',
                'type': 'sha1'},
               {'checksum': '63439d51', 'type': 'crc32c'}],
 'content_type': 'application/json',
 'id': '8ff23235-4435-4929-8fb2-5d55b4564999',
 'urls': [{'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=aws'},
          {'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=azure'},
          {'url': 'https://commons-dss.ucsc-cgp-dev.org/v1/files/8ff23235-4435-4929-8fb2-5d55b4564999?replica=gcp'}],
 'version': '2018-06-07T001700.470245Z'}
{'checksums': [{'checksum': '0ff8cf77', 'type': 'crc32c'}],
 'content_type': 'application/octet-stream',
 'id': '46df9ac5-0e72-4aa5-8e6b-6b55b362c29a',
 'size': 1363840,
 'urls': [{'url': 's3://cgp-commons-public/topmed_open_access/4654080c-9a94-5fcf-a69c-c73059169207/NWD290849.recab.cram.crai'},
          {'url': 'gs://cgp-commons-multi-region-public/topmed_open_access/4654080c-9a94-5fcf-a69c-c73059169207/NWD290849.recab.cram.crai'}],
 'version': '2018-05-26T134316.170053Z'}
```
David Steinberg [11:16 AM]
'size': 1363840,
oh
you can post an issue to the lambda with the key that is bugging you
I think it might not be getting the size of the application/json ones properly?
Michael Krause [11:18 AM]
ah, I wasn't sure if this has to do with the specifics of the URL. I think in the meantime I'll just patch it with the method I mentioned earlier. I can create an issue, and also help you fix it :slightly_smiling_face:
David Steinberg [11:57 AM]
nice! yeah I think the dos-dss-lambda is of less interest right now than dos-azul-lambda
but it has some obvious things to do on it
so the problem you sussed out is this
https://github.com/DataBiosphere/dos-dss-lambda/blob/master/app.py#L41
GitHub
DataBiosphere/dos-dss-lambda
dos-dss-lambda - Access HCA DSS using GA4GH DOS
when a "File" from the DSS rest API comes to the dos azul lambda, which is just a JSON describing a file, it can sometimes describe a file that FURTHER describes a file
which is a hack michael baumman added to "load file by reference"
when I convert a DSS file to DOS, I get the size from the file about the file (sheesh)
but when we just get the file directly from DSS (which is what he has organized for sending in the metadata.json) it does get a size

but it's not being added to the DOS response
so to fix just add a line with your `data_object['size'] = dss_file.get('X-DSS-SIZE')`
Michael Krause [12:02 PM]
I see, thanks for the explanation. Is it okay to create a ticket on it? We can just work on it whenever there's time or it becomes a problem.
David Steinberg [12:02 PM]
yes if you'd like to specify it please do feel free to copy this chat!
Michael Krause [12:02 PM]
great, will do
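
For reference, the workaround and the fix suggested in the chat could be sketched roughly as below. This is only a sketch under stated assumptions, not the code in `app.py`: the helper names `size_from_headers` and `add_missing_size` are hypothetical, the headers are assumed to be available as a plain mapping (for example from the HEAD request mentioned above), and converting the header value to `int` is an assumption based on `size` appearing as a number in the DOS response while the header value is a string. Pairing the `5897` header value with the object id below is for illustration only.

```python
from typing import Mapping, Optional


def size_from_headers(headers: Mapping[str, str]) -> Optional[int]:
    """Return the file size reported in X-DSS-SIZE, or None if it is absent."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "x-dss-size":
            return int(value)
    return None


def add_missing_size(data_object: dict, headers: Mapping[str, str]) -> dict:
    """Return a copy of data_object with 'size' filled in when it is missing."""
    patched = dict(data_object)
    if "size" not in patched:
        size = size_from_headers(headers)
        if size is not None:
            patched["size"] = size
    return patched


if __name__ == "__main__":
    # Trimmed-down version of the data object that came back without a size.
    data_object = {
        "id": "8ff23235-4435-4929-8fb2-5d55b4564999",
        "content_type": "application/json",
    }
    # Header value as seen in the HEAD response quoted in the chat.
    headers = {"X-DSS-SIZE": "5897"}
    print(add_missing_size(data_object, headers))
```

An existing `size` is left untouched, so objects that already report one (like the `.crai` file above) pass through unchanged.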

┆Issue is synchronized with this [JIRA Story](https://ucsc-cgl.atlassian.net/browse/DDL-10)
┆Project Name: dos-dss-lambda
┆Issue Number: DDL-10