title | slug | date | keyword | license |
---|---|---|---|---|
Manage fileset metadata using Gravitino | /manage-fileset-metadata-using-gravitino | 2024-04-02 | Gravitino fileset metadata manage | This software is licensed under the Apache License version 2. |
import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';
This page describes how to manage fileset metadata in Apache Gravitino. A fileset is a collection of files and directories. Users can use filesets to manage non-tabular data such as training datasets and other raw data.
Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc. With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with tabular data in Gravitino in a unified way.
After a fileset is created, users can access and manage its files and directories through the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with a unified access control mechanism, filesets can be managed through the same role-based access control mechanism, without the need to set access controls separately across different storage systems.
To use filesets, make sure that:
- The Gravitino server has started and is listening at http://localhost:8090.
- A metalake has been created.
:::tip
For a fileset catalog, you must specify the catalog type as `FILESET` when creating the catalog.
:::
You can create a catalog by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs` endpoint or by using the Gravitino Java or Python client. The following is an example of creating a catalog:
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "catalog",
"type": "FILESET",
"comment": "comment",
"provider": "hadoop",
"properties": {
"location": "file:/tmp/root"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
Map<String, String> properties = ImmutableMap.<String, String>builder()
.put("location", "file:/tmp/root")
// Property "location" is optional. If specified, a managed fileset without
// a storage location will be stored under this location.
.build();
Catalog catalog = gravitinoClient.createCatalog("catalog",
Type.FILESET,
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog = gravitino_client.create_catalog(name="catalog",
type=Catalog.Type.FILESET,
provider="hadoop",
comment="This is a Hadoop fileset catalog",
properties={"location": "/tmp/test1"})
Currently, Gravitino supports the following catalog providers:
Catalog provider | Catalog property |
---|---|
hadoop | Hadoop catalog property |
Refer to Load a catalog in relational catalog for more details. For a fileset catalog, the load operation is the same.
Refer to Alter a catalog in relational catalog for more details. For a fileset catalog, the alter operation is the same.
Refer to Drop a catalog in relational catalog for more details. For a fileset catalog, the drop operation is the same.
:::note
Currently, Gravitino doesn't support dropping a catalog with schemas and filesets under it. You have to drop all the schemas and filesets under the catalog before dropping the catalog.
:::
Please refer to List all catalogs in a metalake in relational catalog for more details. For a fileset catalog, the list operation is the same.
Please refer to List all catalogs' information in a metalake in relational catalog for more details. For a fileset catalog, the list operation is the same.
`Schema` is a virtual namespace in a fileset catalog, which is used to organize filesets. It is similar to the concept of `schema` in a relational catalog.
:::tip
Users should create a metalake and a catalog before creating a schema.
:::
You can create a schema by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas` endpoint or by using the Gravitino Java or Python client. The following is an example of creating a schema:
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "schema",
"comment": "comment",
"properties": {
"location": "file:/tmp/root/schema"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
// Assuming you have just created a Hadoop catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
SupportsSchemas supportsSchemas = catalog.asSchemas();
Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
// Property "location" is optional. If specified, managed filesets created without
// a storage location are stored under this location.
.put("location", "file:/tmp/root/schema")
.build();
Schema schema = supportsSchemas.createSchema("schema",
"This is a schema",
schemaProperties
);
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_schemas().create_schema(name="schema",
comment="This is a schema",
properties={"location": "/tmp/root/schema"})
Currently, Gravitino supports the following schema properties:
Catalog provider | Schema property |
---|---|
hadoop | Hadoop schema property |
Please refer to Load a schema in relational catalog for more details. For a fileset catalog, the schema load operation is the same.
Please refer to Alter a schema in relational catalog for more details. For a fileset catalog, the schema alter operation is the same.
Please refer to Drop a schema in relational catalog for more details. For a fileset catalog, the schema drop operation is the same.
Note that the drop operation will also remove all of the filesets as well as the managed files under this schema path if `cascade` is set to `true`.
Please refer to List all schemas under a catalog in relational catalog for more details. For a fileset catalog, the schema list operation is the same.
:::tip
- Users should create a metalake, a catalog, and a schema before creating a fileset.
- Currently, Gravitino only supports managing Hadoop Compatible File System (HCFS) locations.
:::
You can create a fileset by sending a `POST` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets` endpoint or by using the Gravitino Java or Python client. The following is an example of creating a fileset:
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"name": "example_fileset",
"comment": "This is an example fileset",
"type": "MANAGED",
"storageLocation": "file:/tmp/root/schema/example_fileset",
"properties": {
"k1": "v1"
}
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
GravitinoClient gravitinoClient = GravitinoClient
.builder("http://localhost:8090")
.withMetalake("metalake")
.build();
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
.put("k1", "v1")
.build();
filesetCatalog.createFileset(
NameIdentifier.of("schema", "example_fileset"),
"This is an example fileset",
Fileset.Type.MANAGED,
"file:/tmp/root/schema/example_fileset",
propertiesMap
);
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
type=Fileset.Type.MANAGED,
comment="This is an example fileset",
storage_location="/tmp/root/schema/example_fileset",
properties={"k1": "v1"})
Currently, Gravitino supports two types of filesets:
- `MANAGED`: The storage location of the fileset is managed by Gravitino. When a fileset is specified as `MANAGED`, its physical location is deleted when the fileset is dropped.
- `EXTERNAL`: The storage location of the fileset is not managed by Gravitino. When a fileset is specified as `EXTERNAL`, its files are not deleted when the fileset is dropped.
storageLocation
The `storageLocation` is the physical location of the fileset. Users can specify this location when creating a fileset. If it is not specified, the location is derived from the catalog or schema `location` property.
For a `MANAGED` fileset, the storage location is determined as follows:
- If the user specifies a location during fileset creation, that location is used.
- When the catalog property `location` is specified but the schema property `location` isn't, the storage location is `catalog location/schema name/fileset name`.
- When the catalog property `location` isn't specified but the schema property `location` is, the storage location is `schema location/fileset name`.
- When both the catalog property `location` and the schema property `location` are specified, the storage location is `schema location/fileset name`.
- When neither the catalog property `location` nor the schema property `location` is specified, the user must specify the `storageLocation` when creating the fileset.
For an `EXTERNAL` fileset, users must specify `storageLocation` when creating the fileset; otherwise, Gravitino throws an exception.
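The location rules above can be sketched as a small standalone Python function. This is only an illustration of the precedence described on this page, not Gravitino's actual implementation: the function name `resolve_storage_location`, its parameters, and the use of `ValueError` are all hypothetical.

```python
from typing import Optional


def resolve_storage_location(
    fileset_type: str,
    fileset_name: str,
    schema_name: str,
    storage_location: Optional[str] = None,
    catalog_location: Optional[str] = None,
    schema_location: Optional[str] = None,
) -> str:
    """Hypothetical helper: return the effective storage location of a fileset."""
    # An explicitly specified location is used for both MANAGED and EXTERNAL.
    if storage_location:
        return storage_location
    # EXTERNAL filesets have no fallback: the location is mandatory.
    if fileset_type == "EXTERNAL":
        raise ValueError("storageLocation is required for an EXTERNAL fileset")
    # The schema location takes precedence over the catalog location.
    if schema_location:
        return f"{schema_location.rstrip('/')}/{fileset_name}"
    if catalog_location:
        return f"{catalog_location.rstrip('/')}/{schema_name}/{fileset_name}"
    # Neither fallback is configured, so the caller must supply a location.
    raise ValueError(
        "storageLocation is required when neither the catalog nor the schema "
        "defines a location"
    )


print(
    resolve_storage_location(
        "MANAGED", "example_fileset", "schema", catalog_location="file:/tmp/root"
    )
)
# file:/tmp/root/schema/example_fileset
```

With only the catalog `location` set to `file:/tmp/root`, the sketch yields the same path that the creation examples pass explicitly.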
You can modify a fileset by sending a `PUT` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets/{fileset_name}` endpoint or by using the Gravitino Java or Python client. The following is an example of modifying a fileset:
curl -X PUT -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
"updates": [
{
"@type": "removeProperty",
"property": "key2"
}, {
"@type": "setProperty",
"property": "key3",
"value": "value3"
}
]
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets/fileset
// ...
// Assuming you have just created a Fileset catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
Fileset f = filesetCatalog.alterFileset(NameIdentifier.of("schema", "fileset"),
FilesetChange.rename("fileset_renamed"), FilesetChange.updateComment("xxx"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
changes = (
FilesetChange.remove_property("fileset_properties_key1"),
FilesetChange.set_property("fileset_properties_key2", "fileset_properties_new_value"),
)
fileset_new = catalog.as_fileset_catalog().alter_fileset(NameIdentifier.of("schema", "fileset"),
*changes)
Currently, Gravitino supports the following changes to a fileset:
Supported modification | JSON | Java |
---|---|---|
Rename a fileset | `{"@type":"rename","newName":"fileset_renamed"}` | `FilesetChange.rename("fileset_renamed")` |
Update a comment | `{"@type":"updateComment","newComment":"new_comment"}` | `FilesetChange.updateComment("new_comment")` |
Set a fileset property | `{"@type":"setProperty","property":"key1","value":"value1"}` | `FilesetChange.setProperty("key1", "value1")` |
Remove a fileset property | `{"@type":"removeProperty","property":"key1"}` | `FilesetChange.removeProperty("key1")` |
Remove comment | `{"@type":"removeComment"}` | `FilesetChange.removeComment()` |
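When calling the REST endpoint directly, the JSON shapes in the table can be assembled programmatically. The following is a minimal sketch that uses only the Python standard library. The helper functions (`rename`, `set_property`, `build_alter_body`, and so on) are hypothetical names introduced for illustration and are not part of the Gravitino client. Only the `@type` values and field names are taken from the table.

```python
import json
from typing import Any, Dict


# Hypothetical helpers: each returns one entry of the "updates" array.
def rename(new_name: str) -> Dict[str, Any]:
    return {"@type": "rename", "newName": new_name}


def update_comment(new_comment: str) -> Dict[str, Any]:
    return {"@type": "updateComment", "newComment": new_comment}


def set_property(key: str, value: str) -> Dict[str, Any]:
    return {"@type": "setProperty", "property": key, "value": value}


def remove_property(key: str) -> Dict[str, Any]:
    return {"@type": "removeProperty", "property": key}


def remove_comment() -> Dict[str, Any]:
    return {"@type": "removeComment"}


def build_alter_body(*changes: Dict[str, Any]) -> str:
    """Serialize the changes, in order, into the body of the PUT request."""
    return json.dumps({"updates": list(changes)})


# Same changes as the curl example: remove "key2", then set "key3".
body = build_alter_body(remove_property("key2"), set_property("key3", "value3"))
print(body)
```

The resulting string can be passed as the `-d` argument of the `curl -X PUT` call or sent with any HTTP client.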
You can remove a fileset by sending a `DELETE` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets/{fileset_name}` endpoint or by using the Gravitino Java or Python client. The following is an example of dropping a fileset:
curl -X DELETE -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets/fileset
// ...
// Assuming you have just created a Fileset catalog named `catalog`
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
// Drop a fileset
filesetCatalog.dropFileset(NameIdentifier.of("schema", "fileset"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
catalog.as_fileset_catalog().drop_fileset(ident=NameIdentifier.of("schema", "fileset"))
For a `MANAGED` fileset, the physical location of the fileset will be deleted when this fileset is dropped. For an `EXTERNAL` fileset, only the metadata of the fileset will be removed.
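The difference between the two drop behaviors can be illustrated with a toy model. The sketch below does not call Gravitino at all. It keeps a hypothetical in-memory registry (`Registry`, `drop_fileset`) and uses temporary local directories to show what "delete the physical location" versus "remove only the metadata" means in practice.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Tuple

# Toy metadata store (hypothetical): fileset name -> (type, storage directory).
Registry = Dict[str, Tuple[str, Path]]


def drop_fileset(registry: Registry, name: str) -> bool:
    """Remove a fileset's metadata; delete its files only if it is MANAGED."""
    entry = registry.pop(name, None)
    if entry is None:
        return False  # no such fileset, nothing to drop
    fileset_type, location = entry
    if fileset_type == "MANAGED" and location.exists():
        shutil.rmtree(location)  # MANAGED: data is removed with the metadata
    return True  # EXTERNAL: only the metadata entry was removed


root = Path(tempfile.mkdtemp())
managed_dir = root / "managed_fileset"
external_dir = root / "external_fileset"
managed_dir.mkdir()
external_dir.mkdir()

registry: Registry = {
    "managed_fileset": ("MANAGED", managed_dir),
    "external_fileset": ("EXTERNAL", external_dir),
}
drop_fileset(registry, "managed_fileset")
drop_fileset(registry, "external_fileset")

print(managed_dir.exists())   # False
print(external_dir.exists())  # True
```

After both drops the registry is empty, but only the directory of the `MANAGED` fileset has been deleted.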
You can list all filesets in a schema by sending a `GET` request to the `/api/metalakes/{metalake_name}/catalogs/{catalog_name}/schemas/{schema_name}/filesets` endpoint or by using the Gravitino Java or Python client. The following is an example of listing all the filesets in a schema:
curl -X GET -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" \
http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
// ...
Catalog catalog = gravitinoClient.loadCatalog("catalog");
FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
NameIdentifier[] identifiers =
filesetCatalog.listFilesets(Namespace.of("schema"));
// ...
gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
catalog: Catalog = gravitino_client.load_catalog(name="catalog")
fileset_list: List[NameIdentifier] = catalog.as_fileset_catalog().list_filesets(namespace=Namespace.of("schema"))
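The fileset endpoints used on this page all follow the same nesting pattern (metalake, catalog, schema, fileset). As a convenience for scripts that call the REST API directly, the URL can be composed with a small helper. This is a sketch under assumptions: `fileset_endpoint` and `BASE_URL` are hypothetical names, and percent-encoding each path segment is a defensive choice made here rather than something this page requires.

```python
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://localhost:8090"  # server address used in the examples on this page


def fileset_endpoint(
    metalake: str,
    catalog: str,
    schema: str,
    fileset: Optional[str] = None,
    base_url: str = BASE_URL,
) -> str:
    """Build the filesets collection URL, or a single fileset URL if a name is given."""
    parts = [
        "api", "metalakes", metalake,
        "catalogs", catalog,
        "schemas", schema,
        "filesets",
    ]
    if fileset is not None:
        parts.append(fileset)
    # Percent-encode each segment so unusual names cannot break the path.
    return base_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


print(fileset_endpoint("metalake", "catalog", "schema"))
# http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
print(fileset_endpoint("metalake", "catalog", "schema", "fileset"))
# http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets/fileset
```

The first form matches the URL used for creating and listing filesets, and the second matches the URL used for altering and dropping a single fileset.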