Re-structure Capella Bucket=>Scope=>Collection configuration #379

Open
gopa-noaa opened this issue May 29, 2024 · 19 comments

Comments

@gopa-noaa
Contributor

gopa-noaa commented May 29, 2024

No change to the bucket: 3 scopes (development, integration, production) with 3 collections under each, currently just METAR, RAOB, and COMMON.

@gopa-noaa gopa-noaa added couchbase VXingest issues related to the VXingest project task Tasks break a project down into discrete steps labels May 29, 2024
@gopa-noaa gopa-noaa self-assigned this May 29, 2024
@randytpierce
Contributor

randytpierce commented May 29, 2024 via email

@gopa-noaa
Contributor Author

From a quick Google search, it appears a scope cannot be renamed after it is created. I have sent an email to Couchbase ...
Worst case, we can do the following:

  1. Create a new "development" scope
  2. Create a new "METAR" collection in that scope
  3. Re-configure our XDCR to vxdata=>development=>METAR
  4. Wait for the data sync to complete
  5. Delete the original _default=>METAR

@ian-noaa
Contributor

ian-noaa commented Jul 17, 2024

A couple of other questions:

  1. Would it make sense to move more document types out into their own collections? I think we currently have MD (Metadata), DD (Data Document), and JOB/JOB-TEST documents. Are there other document types that would make sense to put in their own collections?
  2. Could our document types be replaced by using collections more? If they are useful, when does it make sense to have a collection vs a document type field? E.g. - if the METAR collection solely contains type=DD documents, I could see dropping the type field unless there are reasons clients need to track that type.
  3. Should the JOB-TEST docs be renamed to JOB and left in a "test" scope?
  4. Does it make sense for the scorecard to be its own scope or does it make sense to be scoped with the rest of vxdata?
  5. Can we XDCR at a scope level instead of a collection?

@ian-noaa
Contributor

ian-noaa commented Jul 19, 2024

To summarize the discussion from the dev meeting:

We decided we need to move this issue up and address how best to use collections, scopes, and buckets for our project & application.

We would like to come up with some use cases & whiteboard through how key parts of the application lifecycle would work with different data models. Ideally this would happen during the ingest meeting.

During the meeting we:

  • Debated what would go into a common collection. The point was made that common is pretty generic (like default) and it could be better to have explicit & meaningful names to describe the data that collections hold so that we don't end up with a grab bag of data. However, we’re unsure of the performance tradeoffs of multiple collections.
  • Called out that we will need scripts or SDK calls to create our DB schema if it becomes more complicated.
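
As a starting point for such a script, here is a minimal sketch (not an agreed design) that generates SQL++ `CREATE SCOPE` and `CREATE COLLECTION` statements for the layout proposed at the top of this issue. The function name `schema_statements` is hypothetical, and the statements would still need to be run against each cluster, for example through the query workbench or an SDK.

```python
from itertools import product

# Assumed layout, taken from the proposal at the top of this issue.
BUCKET = "vxdata"
SCOPES = ["development", "integration", "production"]
COLLECTIONS = ["METAR", "RAOB", "COMMON"]


def schema_statements(bucket, scopes, collections):
    """Return SQL++ DDL that creates every scope, then every collection."""
    statements = [f"CREATE SCOPE `{bucket}`.`{scope}`" for scope in scopes]
    statements += [
        f"CREATE COLLECTION `{bucket}`.`{scope}`.`{collection}`"
        for scope, collection in product(scopes, collections)
    ]
    return statements


if __name__ == "__main__":
    for statement in schema_statements(BUCKET, SCOPES, COLLECTIONS):
        print(statement)
```

Keeping the layout in plain lists means the same script could be pointed at each cluster, so the schema is not created by hand in three places.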

Information needed

  1. What are collections, scopes, and buckets? What are their use cases?
  2. How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?
  3. It'd be useful to get a list of the Types, DocTypes, and Subsets we have in our documents and an idea of how we are using them. @randytpierce and @gopa-noaa may have the best input here.
  4. Can we use collections/scopes/buckets to obviate some of the above fields (type, docType, and subset) in our documents? And do we want to? (I suspect no, to support our archiving & retrieval use case)
  5. What use cases should we explore to ensure we have thought the DB schema through? This is something @ian-noaa, @randytpierce & @gopa-noaa should consider by the vxingest meeting. Off the top of my head I have:
    • Ingesting data via cron, for various data types if relevant
    • Ingesting data via event, for various data types if relevant
    • Expiring data
    • Retrieving archived data
    • Querying data from MATS
  6. Where does the scorecard fit into this? Should the data be stored in a separate bucket, scope, or collection?

Context

Couchbase Server 7 (released in 2021) introduced Scopes & Collections. Previously it was recommended to put all data in a “Bucket” and distinguish the documents with a type field. It appears scopes are recommended for data isolation (prod/dev environments, introducing schema changes, etc…) and collections are intended as a replacement for the previously recommended “type” field.

@gopa-noaa
Contributor Author

This link explains Collections and Scope:
https://docs.couchbase.com/server/current/learn/data/scopes-and-collections.html

Just noting down some salient points below:

A collection is a data container.
Up to 1000 collections can be created per cluster.
A collection can be indexed; and it can be dropped. The data in a collection can be replicated, by means of XDCR.

A scope is a mechanism for the grouping of multiple collections. Up to 1000 scopes can be created per cluster.
A scope can be dropped. A scope cannot be indexed. The contents of a scope can be replicated, by means of XDCR.

Benefits of Scopes and Collections
The benefits of scopes and collections include:

The logical grouping of similar documents; potentially simplifying operations such as query, XDCR, and backup and restore.

The increased efficiency of indexing, due to the Data Service being able to provide documents from specific collections to the Index Service.

Simplified querying, since query statements are able to easily specify particular subsets of documents.

Easier migration from relational databases to Couchbase Server, since collections can be designed to correspond to pre-existing relational tables.

Secure isolation of different document-types, within a bucket; allowing applications to be specifically authorized to use only their appropriate subsets of data (see Access to Scopes and Collections, below).

This should help give us some guidance in organizing our document hierarchy. Let's plan to discuss further.

@ian-noaa
Contributor

Thanks, Gopa! That makes it sound like it would be beneficial to explore using collections more.

2. How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?

TTL fields

  • Couchbase can have a default TTL set on buckets and collections but not scopes. You can also use the SDK to set TTL individually for each document. If we went the second route, having the import process be in charge of setting TTL values would seem to make sense.
  • See Couchbase's Data Expiration docs.
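
A minimal sketch of the second route, where the import process picks a TTL per document. The `RETENTION` table and its values are hypothetical placeholders, not decided retention periods, and `ttl_for` is an illustrative name.

```python
from datetime import timedelta

# Hypothetical retention policy keyed by docType; real values are undecided.
RETENTION = {
    "obs": timedelta(days=30),
    "model": timedelta(days=30),
}


def ttl_for(doc):
    """Return the expiry for a document, or None when no per-document TTL applies.

    None means the caller should not set an expiry, so any default TTL
    configured on the collection or bucket takes effect instead.
    """
    return RETENTION.get(doc.get("docType"))
```

The returned `timedelta` could then be handed to the SDK at write time, assuming the Couchbase Python SDK's `UpsertOptions(expiry=...)` option; that call is left out here so the sketch runs without a cluster.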

XDCR

  • Is configured at the bucket level. However, filtering can be applied to map data to different collections or exclude collections/documents.
  • XDCR will not automatically create scopes and collections. Scopes & Collections must be preconfigured on each DB cluster.
  • See XDCR with Scopes & Collections
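
Because XDCR will not create scopes or collections on the target, it may help to check for gaps before configuring replication. Below is a rough sketch under the assumption that each cluster's layout has already been fetched into a `{scope: [collections]}` dict; `missing_keyspaces` and the example layouts are illustrative only.

```python
def missing_keyspaces(source, target):
    """Return (scope, collection) pairs present in source but absent from target.

    Scopes that contain no collections are ignored in this sketch.
    """
    def flatten(layout):
        return {(scope, coll) for scope, colls in layout.items() for coll in colls}

    return sorted(flatten(source) - flatten(target))


# Example layouts as {scope: [collections]}; illustrative values only.
source = {"_default": ["METAR", "RAOB"], "development": ["METAR"]}
target = {"_default": ["METAR"]}
print(missing_keyspaces(source, target))
```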

@ian-noaa
Contributor

During the dev meeting we confirmed that we:

  • Want the Database Scope to reflect the environment; development, test, and prod were mentioned.
    • We also noted that we could use a new scope to distinguish between the on-prem and aws ingest systems.
    • Do we want 3 copies of the data? How much data is retained/what data goes where?
  • Want the Database Collections to mirror the document subset fields
    • We will need to redo our indices to take advantage of this
    • The contents and naming of the "metadata" collection is still an open discussion. Do we have a singular metadata collection, or do we have multiple collections based on metadata type? (Job, Stations, etc...)
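
To make the scope and collection mapping concrete, here is a small sketch of how a writer could pick the target keyspace from a document's subset field. It only covers data documents; where metadata documents go is still open, so they are not handled. The function name and default arguments are assumptions.

```python
def keyspace_for(doc, bucket="vxdata", scope="_default"):
    """Return 'bucket.scope.collection', taking the collection from doc['subset']."""
    subset = doc.get("subset")
    if not subset:
        raise ValueError("document has no subset field")
    return f"{bucket}.{scope}.{subset}"


# A data document with subset METAR, written in a development environment.
print(keyspace_for({"type": "DD", "subset": "METAR"}, scope="development"))
```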

And we need the following for today:

  1. @randytpierce & @gopa-noaa - To provide a list of document type, docType, and subset fields currently in use.
  2. @randytpierce & @gopa-noaa - To consider what scenarios we want to whiteboard out. Currently, we have:
    • Ingesting data via cron, for various data types if relevant
    • Ingesting data via event, for various data types if relevant
    • Expiring data
    • Retrieving archived data
    • Querying data from MATS

@randytpierce
Contributor

randytpierce commented Jul 25, 2024 via email

@gopa-noaa
Contributor Author

gopa-noaa commented Jul 25, 2024

Here are the results from the METAR Collection:

select distinct raw docType from vxdata._default.METAR

[
  "obs",
  "model",
  "CTC",
  "SUMS"
]

@gopa-noaa
Contributor Author

gopa-noaa commented Jul 25, 2024

And on the On-Prem Cluster:

select distinct raw docType from vxdata._default.METAR

[
  null,
  "CTC",
  "SUMS",
  "classic_stations",
  "ingest",
  "landUseTypes",
  "matsAux",
  "matsGui",
  "model",
  "obs",
  "region",
  "station"
]

@gopa-noaa
Contributor Author

gopa-noaa commented Jul 25, 2024

On-Prem Cluster output for types:

select distinct raw type from vxdata._default.METAR

[
  "DF",
  "JOB-TEST",
  "JOB",
  "LJ",
  null,
  "DD",
  "MD-TEST",
  "MD",
  "DD-TEST"
]

@JeffHamiltonNOAA
Contributor

[Image attachment: IMG_0427]

@ian-noaa
Contributor

To summarize the meeting last week:

  • We want to have two collection "types", largely based on the document's subset field. The largest by far will be the "data" collections, which will be based on the subset field of "Data" documents where type=DD. We also want to have one or more metadata collection(s).
  • It is important to retain subset and type fields in our documents so that those documents can be properly imported from the long term store.
  • Document IDs are constructed out of top-level predicates and follow this form: type:version:subset:docType:subdocType:<level>:<valid time epoch>. Note that <> items are optional.
  • MATS GUI documents should at a minimum have their own collection.
  • Currently we have a common subset. It encompasses regions and landuse and should be renamed to region, as landuse is handled like a region. Potentially, we want to make this its own collection. Within that collection, it may be important to distinguish between region & landuse documents and to include metadata in the landuse documents specifying which landuse tables apply to which models.
  • If we have a "common" collection, it would be small so we may be able to create a primary index on it.
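
The document ID form described above can be sketched as a small builder. This is an illustration rather than the ingest code: `build_doc_id` is a hypothetical name, and the version `V01` and subdocType `example` used below are made-up placeholder values.

```python
def build_doc_id(type_, version, subset, doc_type, subdoc_type,
                 level=None, valid_epoch=None):
    """Join the ID parts with ':'; level and valid_epoch are the optional <> items."""
    parts = [type_, version, subset, doc_type, subdoc_type]
    # Optional items are appended only when they were supplied.
    parts += [str(p) for p in (level, valid_epoch) if p is not None]
    return ":".join(parts)


# Placeholder values; only the field order comes from the form above.
print(build_doc_id("DD", "V01", "METAR", "obs", "example", valid_epoch=1721865600))
```

Note that when only one optional item is present, the ID alone does not say whether it is the level or the valid time, so parsing IDs back into fields would need extra knowledge of the document type.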

Remaining questions:

  • where does the scorecard fit into this?
  • It's still unclear what we want to do with metadata. Should it go in a single collection or multiple? It was pointed out that the metadata may be small enough that we could set up a primary index on it. It sounds like common should be renamed to region.
  • scopes - federated db will need to write to a dev scope until it's operational. When it becomes operational, we'll need to determine if we use the existing prod scope, or create a new one and drop the old one.
  • What do we do with *_TEST types? Are they obviated if we have dev and test scopes?
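
On the last question, one way to picture the test types being replaced by scopes is a mapping from the current type value to a base type plus a scope. This is only a sketch of the idea: the scope names `test` and `production` are assumptions, and both `-TEST` (as seen in the query output above) and `_TEST` suffixes are accepted.

```python
def split_test_type(doc_type, test_scope="test", default_scope="production"):
    """Map e.g. 'JOB-TEST' to ('JOB', 'test') and 'JOB' to ('JOB', 'production')."""
    for suffix in ("-TEST", "_TEST"):
        if doc_type.endswith(suffix):
            return doc_type[: -len(suffix)], test_scope
    return doc_type, default_scope


for t in ["JOB-TEST", "MD-TEST", "DD"]:
    print(t, "->", split_test_type(t))
```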

I'm sure I missed a few things. 🙂

@gopa-noaa
Contributor Author

I forgot to take notes in the last meeting, but if I remember correctly, these were the main points:

  1. We will have multiple buckets, at least 2 for now, 1 - vxdata_prod, 2 - vxdata_dev. Both these buckets would be readily available for testing without resorting to any involved data/index setup.
  2. Integration tests would be done against vxdata_prod bucket
  3. Developers can create additional buckets for specific needs.
  4. Currently we plan to use the "_default" scope
  5. Multiple collections under the _default scope, such as METAR, RAOB, etc.

Questions:

  1. What would be a good mechanism to load a subset of production data into another bucket?
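
One possible mechanism, offered as a sketch rather than a decision, is a SQL++ `INSERT ... SELECT` that copies documents matching a filter from one keyspace to another. The helper below only builds the statement text; `copy_subset_statement` is a hypothetical name, the bucket names come from the list above, and `$docType` is a named query parameter the caller would need to supply when running it. The target collection must already exist, and for large collections other tools may be a better fit.

```python
def copy_subset_statement(src, dst):
    """Build an INSERT ... SELECT copying documents of one docType from src to dst.

    The docType is left as the named parameter $docType so that the value
    is bound by the query service instead of being pasted into the text.
    """
    return (
        f"INSERT INTO {dst} (KEY _k, VALUE _v) "
        f"SELECT META(d).id AS _k, d AS _v FROM {src} AS d "
        "WHERE d.docType = $docType"
    )


print(copy_subset_statement(
    src="`vxdata_prod`.`_default`.`METAR`",
    dst="`vxdata_dev`.`_default`.`METAR`",
))
```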

@gopa-noaa
Contributor Author

gopa-noaa commented Nov 14, 2024

Recording the current state of affairs here ...

Magma storage transition status:

  • On Prem Single Node Cluster and Three Node Cluster all migrated to Magma
  • Capella Migrated to Magma

Buckets->Scopes->Collections

adb-cb1

vxdata =>
    _default =>
        [METAR, RAOB, SCORECARD, SCORECARD_SETTINGS]
vxdatatest =>
    _default =>
        [METAR, RAOB, SCORECARD, SCORECARD_SETTINGS]
metdata =>
    _default =>
        [MET_default]

adb-cb2,3,4

vxdata =>
    _default =>
        [METAR, RAOB, SCORECARD, SCORECARD_SETTINGS]
    development =>
        [COMMON, METAR, RAOB]
    integration =>
        [COMMON, METAR, RAOB]
    production =>
        [COMMON, METAR, RAOB]

Capella

vxdata =>
    _default =>
        [METAR, RAOB, SCORECARD, SCORECARD_SETTINGS]
vxdatatest =>  (not used)
    _default =>
        [_default]

@gopa-noaa
Contributor Author

Based on our decisions on Sep 18th (see above), the action items are:

  1. Rename vxdata => vxdata_prod.
    If a rename is not possible, we will create another bucket called vxdata_prod and copy the data from vxdata. Both buckets will need to co-exist until applications are migrated and tested. Another option is to leave the bucket named vxdata but deem it the production bucket; the advantage there is that no data or app migration is needed.
  2. Create a vxdata_dev bucket and copy data from prod.
    Replicate this configuration to the 3 node cluster and Capella.
    Delete unused buckets, scopes, and collections that do not fit the above scheme on all 3 clusters.

@gopa-noaa
Contributor Author

  1. metdata => metplusdata
  2. We don't need a COMMON collection, at least initially. Collection-specific metadata can belong in its respective collection, but only metadata that describes the data in a collection should live in that collection. Other metadata is better placed in a dedicated metadata collection for performance reasons. If there is a need to capture very specific metadata, it could go in a separate collection, for example MD_MATS or MD_STATIONS. For now, we start without a COMMON collection and can add one later if needed.
  3. Integration tests would be done against the vxdata_prod bucket, but only as a read-only user.
  4. Move ingest to the 3 node On Prem cluster (???) and point XDCR back from the 3 node On Prem cluster to adb-cb1. This needs further discussion.
  5. Basic estimates on new Capella credit purchase

@gopa-noaa
Contributor Author

So, the action plan from our last meeting:

  1. We are leaving vxdata as is, but it will be our production bucket.
  2. We will create a vxdata_dev bucket, initialized with a snapshot of data from the vxdata bucket.

@ian-noaa
Contributor

I'd add one more action item:

  1. Identify metadata that is applicable across data collections so they can be moved to their own collections. I believe MATS GUI settings (MD_MATS), model names (MD_MODELS), and regions (MD_REGIONS) were mentioned. It'd be useful to have a list of the types of metadata we have so we can review them for the next vxingest meeting.
