Finalize format of master list. #1

lossyrob · 2015-05-07T02:11:37Z

The register acts as a list of URIs of endpoints that allow OIN participants to be discoverable. What are the requirements on a URI that is included in the list?

lossyrob · 2015-05-07T02:15:55Z

Some questions that we should answer:

Does the URI point to an endpoint that can be hit over HTTP and that responds with metadata about the data associated with that endpoint (such as a description and file locations)? Or should the URI simply point to a bucket key, where all imagery prefixed with that key is part of the OIN, so that discoverability comes from doing a "directory listing" of that URI endpoint?
What are the supported storage types? We're targeting s3 and other object stores initially, but should we include things like ftp? Should we rely on the URI to make it clear what storage type the endpoint uses?

smit1678 · 2015-05-14T23:36:33Z

@lossyrob Yeah, thanks for kicking this off. These are good questions. Here's an initial take on a format -- focused on s3 and object stores:

{
    "providers": [
        {
            "name": "Some provider",
            "contact": "[email protected]",
            "node": {
                "type": "s3",
                "bucket_name": "somebucket"
            }
        }
    ]
}

Pros: this is super simple on the catalog indexing side.
Cons: we'll just need to manage the PR process to ensure that proper format is followed.
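
To illustrate the "proper format" check mentioned in the cons, here is a minimal sketch of a validator that could be run against the register during PR review. It assumes the `providers` layout proposed above; the function name `validate_register` and the choice of required keys are assumptions made for illustration, not an agreed-upon spec.

```python
import json

# Assumed required keys, taken from the example format above.
REQUIRED_PROVIDER_KEYS = {"name", "contact", "node"}
REQUIRED_NODE_KEYS = {"type", "bucket_name"}


def validate_register(text):
    """Return a list of problems found in a register document (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    providers = doc.get("providers") if isinstance(doc, dict) else None
    if not isinstance(providers, list):
        return ['top-level "providers" must be a list']

    problems = []
    for i, provider in enumerate(providers):
        if not isinstance(provider, dict):
            problems.append(f"providers[{i}] must be an object")
            continue
        missing = REQUIRED_PROVIDER_KEYS - provider.keys()
        if missing:
            problems.append(f"providers[{i}] is missing {sorted(missing)}")
        node = provider.get("node")
        if isinstance(node, dict):
            missing = REQUIRED_NODE_KEYS - node.keys()
            if missing:
                problems.append(f"providers[{i}].node is missing {sorted(missing)}")
        elif "node" in provider:
            problems.append(f"providers[{i}].node must be an object")
    return problems
```

An empty list means the file matches the assumed shape; anything else could be reported back on the pull request.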

lossyrob · 2015-05-15T00:03:49Z

Interesting, this is the sort of format I was imagining for the 'grouping' metadata. What I thought of was to have the register be a flat file of a list of URIs, and those URIs would point to a json file that looked just like what you have suggested for the register entries. Either works, I'll have to think through the pros and cons of those two options, but this is a good idea.
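
For comparison, the flat-file alternative could be consumed roughly as follows. This is only a sketch under assumptions: the register is taken to be one URI per line, each URI is assumed to return a JSON document over HTTP, and the names `load_register` and `fetch_text` are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_text(uri):
    """Download a URI and decode the body as UTF-8 (default fetcher)."""
    with urlopen(uri) as response:
        return response.read().decode("utf-8")


def load_register(register_text, fetch=fetch_text):
    """Resolve each URI in a flat register to its provider-hosted JSON entry.

    `fetch` is injectable so the lookup can be swapped out or stubbed.
    """
    entries = {}
    for line in register_text.splitlines():
        uri = line.strip()
        if not uri:
            continue  # skip blank lines
        entries[uri] = json.loads(fetch(uri))
    return entries
```

With this layout the register itself never changes when a provider edits its metadata; only the document behind the URI does.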

lossyrob · 2015-05-16T18:42:49Z

The major difference I can come up with between keeping the JSON entry metadata in the register and having the register be a list of URIs that point to provider-hosted JSON entry metadata is how updates work. If a provider needs to update its entry metadata, in the former case the provider has to make a pull request to edit the information; in the latter case, the provider can modify the entry metadata on their side without making a PR to the register. If we want to make it hard for providers to update information (and review all updates), then we should keep the entries in the register. If we want providers to be able to update their own entries easily (like if they move some imagery to another bucket and want to provide the new bucket name), it would be advantageous to keep the provider entry metadata on the provider side.

wonderchook · 2015-05-18T13:50:43Z

I was envisioning the pointer to the JSON entry so the provider could update themselves.

Is there an in between solution? Basic registration but the detailed metadata is on the provider side? Perhaps that makes things too complicated.

smit1678 · 2015-05-18T21:55:29Z

Yeah, these are valid points @lossyrob. Couple follow ups:

> If we want the provider to be able to update their own entries easily (like if they move some imagery to another bucket and want to provide the new bucket name), it would be advantageous to keep the provider entry metadata on the provider side.

I don't think we expect people to be moving imagery between buckets a lot, do we? I would think that using something like S3 allows providers to dump into a bucket and then not worry about it.

> If we want to make it hard for providers to update information (and review all updates), then having them in the register.

To me, it's not about making it harder for providers but about what is better for the system and keeping control over the inputs into the system. Ultimately, this is a short-term, non-scalable solution. Keeping control in the repo and ensuring providers add valid data will mean less time spent debugging or checking. This doesn't really scale once we're above 100 objects in the file.

I also don't think we're talking about a lot of data - just bucket location, name, and contact information so you can follow up.

We also don't want to prescribe how someone structures their node. They may or may not want to use folders. Indexing can recursively go over a folder structure. Here's an updated format with some names adjusted in case a provider has multiple buckets:

{
    "nodes": [
        {
            "name": "Some provider",
            "contact": "[email protected]",
            "locations": [
                {
                    "type": "s3",
                    "bucket_name": "somebucket-1"
                },
                {
                    "type": "s3",
                    "bucket_name": "somebucket-2"
                },
                {
                    "type": "s3",
                    "bucket_name": "somebucket-3"
                }                        
            ]
        }, ... 
    ]
}

Would it be worthwhile to pin this and start with a test HOT node to begin with? We can see how it works and functions and then evaluate what could be improved?
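
As a rough sketch of what the catalog indexing side might do with the updated format, the snippet below flattens the `nodes` / `locations` structure into one record per bucket. The helper name `iter_locations` is an assumption, and the actual bucket listing (for example the recursive walk over folders) is deliberately left out.

```python
import json


def iter_locations(register_text):
    """Yield (node name, storage type, bucket name) for every location."""
    doc = json.loads(register_text)
    for node in doc["nodes"]:
        for location in node["locations"]:
            yield node["name"], location["type"], location["bucket_name"]


# Sample register in the proposed shape (contact value is a placeholder).
SAMPLE = """
{
    "nodes": [
        {
            "name": "Some provider",
            "contact": "contact@example.org",
            "locations": [
                {"type": "s3", "bucket_name": "somebucket-1"},
                {"type": "s3", "bucket_name": "somebucket-2"}
            ]
        }
    ]
}
"""

for name, storage_type, bucket in iter_locations(SAMPLE):
    print(name, storage_type, bucket)
```

Each yielded tuple is one place an indexer would need to crawl, regardless of how the provider organizes folders inside the bucket.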

lossyrob · 2015-05-18T22:14:00Z

I was imagining not moving imagery between buckets, but specifying keys within a bucket. Say a provider wanted to add another folder to the set of folders specified under one bucket (assuming we could have a { "type": "s3", "bucket_name" : "somebucket-3/thisorthatdata" } structure for adding an s3 bucket): they would be able to do that either by editing metadata on their side, or by making a PR to update the metadata on the OIN side. Making it harder for providers meant just what you said, giving us more control over changes; if it's really easy for providers to modify information, then more mistakes could happen without detection.
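
If a `bucket_name` value were allowed to carry a key prefix like this, an indexer would need to split the two parts before listing. A minimal sketch, assuming everything before the first `/` is the bucket and the rest is the key prefix (the helper name `split_bucket_name` is hypothetical):

```python
def split_bucket_name(bucket_name):
    """Split 'bucket/some/prefix' into ('bucket', 'some/prefix').

    A value without a slash yields an empty prefix, meaning the whole bucket.
    """
    bucket, _, prefix = bucket_name.partition("/")
    return bucket, prefix


print(split_bucket_name("somebucket-3/thisorthatdata"))
print(split_bucket_name("somebucket-1"))
```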

I see advantages and disadvantages to both sides, but also I'm big on an implement-and-refactor workflow, so I totally agree with your last couple of sentences.

smit1678 · 2015-05-19T21:52:48Z

Cool, I'll get a HOT bucket set up and use that as the first pull request to the list.

lossyrob mentioned this issue May 18, 2015

S3 bucket requirements #2

Open
