Finalize format of master list. #1

Open
lossyrob opened this issue May 7, 2015 · 8 comments

Comments

@lossyrob
Member

lossyrob commented May 7, 2015

The register acts as a list of URIs of endpoints that make OIN participants discoverable. What are the requirements for a URI that is included in the list?

@lossyrob
Member Author

lossyrob commented May 7, 2015

Some questions that we should answer:

  • Does the URI point to an endpoint that can be hit over HTTP, with a response that provides metadata about the data associated with that endpoint (such as description and file locations)? Or should the URI simply point to a bucket key, where all imagery prefixed with that key is part of the OIN, so that discoverability comes from doing a "directory listing" of that URI endpoint?
  • What are the supported storage types? We're targeting s3 and other object stores initially, but should we include things like ftp? Should we rely on the URI to make it clear what storage type the endpoint uses?
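
One way to picture the "rely on the URI" option is a small sketch that reads the storage type from the URI scheme. The function name `describe_endpoint` and the set of accepted schemes are assumptions made for illustration; the thread only commits to s3 and other object stores and leaves ftp as an open question.

```python
from urllib.parse import urlparse

# Hypothetical list of accepted storage types; not decided in this thread.
SUPPORTED_SCHEMES = {"s3", "http", "https"}


def describe_endpoint(uri):
    """Split a register URI into storage type, host/bucket, and key prefix."""
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported storage type: {scheme!r}")
    return {
        "type": scheme,
        # For s3:// URIs the network location is the bucket name.
        "host": parsed.netloc,
        # Everything after the bucket acts as the key prefix to list.
        "prefix": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(describe_endpoint("s3://somebucket/imagery/2015/"))
```

With this approach the register entry itself carries the storage type, so no separate "type" field would be needed.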

@smit1678
Member

@lossyrob Yeah, thanks for kicking this off. These are good questions. Here's an initial take on a format -- focused on s3 and object stores:

{
    "providers": [
        {
            "name": "Some provider",
            "contact": "[email protected]",
            "node": {
                "type": "s3",
                "bucket_name": "somebucket"
            }
        }
    ]
}

Pros: this is super simple on the catalog indexing side.
Cons: we'll just need to manage the PR process to ensure that proper format is followed.
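
To make the "manage the PR process" concern concrete, here is a minimal sketch of a format check that could run against a proposed register file. The function name `validate_register` and the choice of which keys are required are assumptions based only on the example above, not an agreed schema.

```python
import json

# Assumed required keys, taken from the example format in this comment.
REQUIRED_PROVIDER_KEYS = ("name", "contact", "node")
REQUIRED_NODE_KEYS = ("type", "bucket_name")


def validate_register(text):
    """Return a list of problems found; an empty list means the file passed."""
    try:
        register = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    providers = register.get("providers") if isinstance(register, dict) else None
    if not isinstance(providers, list):
        return ['top-level "providers" list is missing']

    problems = []
    for i, provider in enumerate(providers):
        if not isinstance(provider, dict):
            problems.append(f"providers[{i}]: expected an object")
            continue
        for key in REQUIRED_PROVIDER_KEYS:
            if key not in provider:
                problems.append(f"providers[{i}]: missing {key!r}")
        node = provider.get("node")
        if isinstance(node, dict):
            for key in REQUIRED_NODE_KEYS:
                if key not in node:
                    problems.append(f"providers[{i}].node: missing {key!r}")
    return problems


if __name__ == "__main__":
    sample = '{"providers": [{"name": "Some provider", "node": {"type": "s3"}}]}'
    for problem in validate_register(sample):
        print(problem)
```

A check like this could be run by a reviewer or a CI job before a PR to the register is merged.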

@lossyrob
Member Author

Interesting, this is the sort of format I was imagining for the 'grouping' metadata. What I thought of was to have the register be a flat file of a list of URIs, and those URIs would point to a json file that looked just like what you have suggested for the register entries. Either works, I'll have to think through the pros and cons of those two options, but this is a good idea.

(Sent by email in reply to Nate Smith's comment of May 14, 2015, 7:36 PM, quoted in full above.)

@lossyrob
Member Author

The major difference I can come up with between having the JSON entry metadata in the register and having the register be URIs that point to provider-hosted JSON entry metadata is how updates happen. If a provider needed to update the provider entry metadata, in the former case the provider would need to make a pull request to edit the information; in the latter case, the provider would be able to modify the entry metadata on their side without having to make a PR to the register. If we want to make it hard for providers to update information (and review all updates), then we should have them in the register. If we want the provider to be able to update their own entries easily (like if they move some imagery to another bucket and want to provide the new bucket name), it would be advantageous to keep the provider entry metadata on the provider side.
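
The second option (a flat register of URIs pointing at provider-hosted JSON) could look roughly like the sketch below. The function name `load_register`, the `fetch` callable, the example URI, and the convention of skipping blank and `#` lines are all assumptions for illustration; fetching is injected so the sketch does not assume any particular HTTP or S3 client.

```python
import json


def load_register(lines, fetch):
    """Resolve each URI in a flat register to the provider-hosted entry metadata.

    `lines` is an iterable of register lines; `fetch` is any callable that
    takes a URI and returns the JSON text stored there.
    """
    entries = []
    for line in lines:
        uri = line.strip()
        # Assumed convention: skip blank lines and '#' comments.
        if not uri or uri.startswith("#"):
            continue
        entries.append(json.loads(fetch(uri)))
    return entries


if __name__ == "__main__":
    # Stand-in for provider-hosted files; a real fetch would do network I/O.
    hosted = {
        "https://example.org/oin.json": '{"name": "Some provider", '
        '"node": {"type": "s3", "bucket_name": "somebucket"}}'
    }
    register = ["# OIN register", "https://example.org/oin.json", ""]
    print(load_register(register, hosted.__getitem__))
```

Under this design a provider changes a bucket name by editing their own hosted file, and the register itself only changes when a provider is added or removed.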

@wonderchook

I was envisioning the pointer to the JSON entry so the provider could update themselves.

Is there an in between solution? Basic registration but the detailed metadata is on the provider side? Perhaps that makes things too complicated.

@smit1678
Member

Yeah, these are valid points @lossyrob. Couple follow ups:

If we want the provider to be able to update their own entries easily (like if they move some imagery to another bucket and want to provide the new bucket name), it would be advantageous to keep the provider entry metadata on the provider side.

I don't think we expect people to be moving imagery between buckets a lot, do we? I would think that using something like S3 allows providers to dump into a bucket and then not worry about it.

If we want to make it hard for providers to update information (and review all updates), then we should have them in the register.

To me, it's not about making it harder for providers but better for the system and control over the inputs into the system. Ultimately, this is a short-term, non-scalable solution. Putting control on the repo and ensuring providers are adding valid data will enable less time spent debugging or checking. This doesn't really scale when we're above 100 objects in the file.

I also don't think we're talking about a lot of data - just bucket location, name, and contact information so you can follow up.

We also don't want to prescribe how someone structures their node. They may or may not want to use folders. Indexing can recursively go over a folder structure. Here's an updated format with some names adjusted in case a provider has multiple buckets:

{
    "nodes": [
        {
            "name": "Some provider",
            "contact": "[email protected]",
            "locations": [
                {
                    "type": "s3",
                    "bucket_name": "somebucket-1"
                },
                {
                    "type": "s3",
                    "bucket_name": "somebucket-2"
                },
                {
                    "type": "s3",
                    "bucket_name": "somebucket-3"
                }                        
            ]
        }, ... 
    ]
}
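
As a rough sketch of how an indexer might walk this updated format, the generator below flattens the register into one record per location. The name `iter_locations` is hypothetical, and the sample data is a shortened stand-in for the format above rather than a real register.

```python
def iter_locations(register):
    """Yield (node name, storage type, bucket name) for every location."""
    for node in register.get("nodes", []):
        for location in node.get("locations", []):
            yield node["name"], location["type"], location["bucket_name"]


if __name__ == "__main__":
    register = {
        "nodes": [
            {
                "name": "Some provider",
                "contact": "contact-placeholder",
                "locations": [
                    {"type": "s3", "bucket_name": "somebucket-1"},
                    {"type": "s3", "bucket_name": "somebucket-2"},
                ],
            }
        ]
    }
    for name, storage_type, bucket in iter_locations(register):
        print(name, storage_type, bucket)
```

Each yielded bucket would then be the starting point for the recursive listing mentioned above.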

Would it be worthwhile to pin this and start with a test HOT node to begin with? We can see how it works and functions and then evaluate what could be improved?

@lossyrob
Member Author

I was imagining not moving imagery between buckets, but specifying keys within a bucket. Say a provider wanted to add another folder to the set of folders specified under one bucket (assuming we could have a { "type": "s3", "bucket_name" : "somebucket-3/thisorthatdata" } structure for adding an s3 bucket): they would be able to do that either by editing metadata on their side or by making a PR to update the metadata on the OIN side. Making it harder for providers meant just what you said, giving us more control over changes; if it's really easy for providers to modify information, then it might be the case that more mistakes could happen without detection.
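
If `bucket_name` were allowed to carry a folder as suggested, an indexer would need to separate the bucket from the key prefix before listing. A minimal sketch, assuming the first `/` marks that boundary (the helper name `split_bucket_and_prefix` is hypothetical):

```python
def split_bucket_and_prefix(bucket_name):
    """Split 'bucket/some/prefix' into ('bucket', 'some/prefix')."""
    # partition() splits at the first '/', so deeper folders stay in the prefix.
    bucket, _, prefix = bucket_name.partition("/")
    return bucket, prefix


if __name__ == "__main__":
    print(split_bucket_and_prefix("somebucket-3/thisorthatdata"))
    print(split_bucket_and_prefix("somebucket-1"))
```

A plain bucket name yields an empty prefix, so entries written in the earlier format would keep working unchanged.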

I see advantages and disadvantages to both sides, but also I'm big on an implement-and-refactor workflow, so I totally agree with your last couple of sentences.

@smit1678
Member

Cool, I'll get a HOT bucket set up and use that as the first pull request to the list.
