
Feature: add ibis meta data routers #603

Merged (4 commits, Jun 12, 2024)
Conversation

@onlyjackfrost (Collaborator) commented Jun 7, 2024

add ibis meta data routers

POST /v2/ibis/{datasource}/metadata/tables

Request body

for postgres

{
    "connectionInfo": {
        "host": "localhost",
        "port": "6432",
        "user": "postgres",
        "database": "postgres",
        "password": "postgres"
    }
}

for bigquery

{
    "connectionInfo": {
        "project_id": "wrenai",
        "dataset_id": "wrenai.ecommerce",
        "credentials": "your base64 encoded JSON string of your credential file"
    }
}

Response

  • name: unique table name (may include the schema name, depending on whether tables are listed across schemas)
  • columns
    • name: column name in the datasource
    • type: column data type
    • notNull: boolean, true if the column is non-nullable
    • description: column description (comment), if any
    • properties: column properties, if any
  • description: table description, if any
  • properties
    • schema: schema name used to build tableReference
    • catalog: catalog name used to build tableReference
    • table: table name used to build tableReference
  • primaryKey: the column name bound by the primary key constraint
[
    {
        "name": "public.nation",
        "columns": [
            {
                "name": "n_nationkey",
                "type": "INTEGER",
                "notNull": true,
                "description": "",
                "properties": {}
            }
        ],
        "description": "",
        "properties": {
            "schema": "public",
            "catalog": "postgres",
            "table": "nation"
        },
        "primaryKey": ""
    }
]
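The request above can be exercised with a short client sketch. This is a minimal example using only the standard library; the base URL, port, and a running ibis server are assumptions, and only the URL path and payload shape come from the PR description:

```python
import json
import urllib.request

def build_tables_request(base_url, datasource, connection_info):
    """Build the URL and JSON body for the tables metadata endpoint."""
    url = f"{base_url}/v2/ibis/{datasource}/metadata/tables"
    body = json.dumps({"connectionInfo": connection_info}).encode()
    return url, body

def list_tables(base_url, datasource, connection_info):
    """POST the payload and return the parsed table list (needs a running server)."""
    url, body = build_tables_request(base_url, datasource, connection_info)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Postgres payload from the example above; host/port values are illustrative.
pg_info = {
    "host": "localhost",
    "port": "6432",
    "user": "postgres",
    "database": "postgres",
    "password": "postgres",
}
url, body = build_tables_request("http://localhost:8000", "postgres", pg_info)
```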

POST /v2/ibis/{datasource}/metadata/constraints

Request body

Same as above

Response

  • constraints
    • constraintName: unique constraint name, formatted as {constraint_table}_{constraint_column}_{constrainted_table}_{constrainted_column}
    • constraintType: "FOREIGN KEY"
    • constraintTable
    • constraintColumn
    • constraintedTable
    • constraintedColumn
[
    {
        "constraintName": "composite_pk_y_composite_fk_z",
        "constraintType": "FOREIGN KEY",
        "constraintTable": "composite_pk",
        "constraintColumn": "y",
        "constraintedTable": "composite_fk",
        "constraintedColumn": "z"
    }
]

@goldmedal (Contributor):
Could you add some tests for them? It would make it more stable. Furthermore, could you add a description for the new API?

@grieve54706 (Contributor):
Please rebase onto the main branch and use Ruff to format the code. Follow #601.

Comment on lines 14 to 30
class Metadata(StrEnum):
    postgres = auto()
    bigquery = auto()

    def get_table_list(self, connection_info):
        if self == Metadata.postgres:
            return self.get_postgres_table_list_sql(connection_info)
        if self == Metadata.bigquery:
            return self.get_bigquery_table_list_sql(connection_info)
        raise NotImplementedError(f"Unsupported data source: {self}")

    def get_constraints(self, connection_info):
        if self == Metadata.postgres:
            return self.get_postgres_table_constraints(connection_info)
        if self == Metadata.bigquery:
            return self.get_bigquery_table_constraints(connection_info)
        raise NotImplementedError(f"Unsupported data source: {self}")
Contributor:

If every method needs to check the data source type, the class should be split by data source, e.g. PostgresMetadata(Metadata) and BigQueryMetadata(Metadata).
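A minimal sketch of that split, assuming an abstract base with one subclass per data source; the class and method names follow the diff, while the SQL strings and the `metadata_for` dispatch helper are placeholders for illustration:

```python
from abc import ABC, abstractmethod

class Metadata(ABC):
    @abstractmethod
    def get_table_list(self, connection_info): ...

    @abstractmethod
    def get_constraints(self, connection_info): ...

class PostgresMetadata(Metadata):
    def get_table_list(self, connection_info):
        return "SELECT ... -- postgres table list"

    def get_constraints(self, connection_info):
        return "SELECT ... -- postgres constraints"

class BigQueryMetadata(Metadata):
    def get_table_list(self, connection_info):
        return "SELECT ... -- bigquery table list"

    def get_constraints(self, connection_info):
        return "SELECT ... -- bigquery constraints"

# Dispatch once at the edge instead of inside every method.
METADATA = {"postgres": PostgresMetadata, "bigquery": BigQueryMetadata}

def metadata_for(datasource):
    try:
        return METADATA[datasource]()
    except KeyError:
        raise NotImplementedError(f"Unsupported data source: {datasource}")
```

With this shape, adding a data source means adding a subclass and one registry entry rather than editing every method.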

Comment on lines 53 to 73
res = to_json(
    DataSource.postgres.get_connection(connection_info)
    .sql(sql, dialect="trino")
    .to_pandas()
)

# transform the result to a list of dictionaries
response = [
    (
        lambda x: {
            "table_catalog": x[0],
            "table_schema": x[1],
            "table_name": x[2],
            "column_name": x[3],
            "data_type": transform_postgres_column_type(x[4]),
            "is_nullable": x[5],
            "ordinal_position": x[6],
        }
    )(row)
    for row in res["data"]
]
Contributor:

You can use df.to_json(orient="records") to get the data keyed by column names, like

[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]

Reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
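A runnable illustration of the records orientation on a small frame matching the example above:

```python
import json
import pandas as pd

# Two rows, two columns, as in the reviewer's example.
df = pd.DataFrame([["a", "b"], ["c", "d"]], columns=["col 1", "col 2"])

# orient="records" emits one JSON object per row, keyed by column name.
records = json.loads(df.to_json(orient="records"))
# records == [{"col 1": "a", "col 2": "b"}, {"col 1": "c", "col 2": "d"}]
```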

Comment on lines 81 to 92
table = {
    "name": schema_table,
    "description": "",
    "columns": [],
    "properties": {
        "schema": row["table_schema"],
        "catalog": row["table_catalog"],
        "table": row["table_name"],
    },
    "primaryKey": "",
}
unique_tables[schema_table] = CompactTable(**table)
Contributor:

How about constructing the class directly?

unique_tables[schema_table] = CompactTable(
    name=schema_table,
    properties=CompactTableProperties(
        schema=row["table_schema"],
        catalog=row["table_catalog"],
        table=row["table_name"],
    )
)

CompactColumn too.

table: Optional[str] # only table name without schema or catalog


class CompactTable(BaseModel):
Contributor:

You can just name it Table in metadata_dto.py. Let's drop the redundant Compact prefix.

Comment on lines 104 to 105
compact_tables: list[CompactTable] = list(unique_tables.values())
return compact_tables
Contributor:

You can just return list(unique_tables.values()).

def transform_postgres_column_type(data_type):
    # lower case the data_type
    data_type = data_type.lower()
    print(f"=== data_type: {data_type}")
Contributor:

Don't use print

Comment on lines 329 to 355
switcher = {
    "text": WrenEngineColumnType.TEXT,
    "char": WrenEngineColumnType.CHAR,
    "character": WrenEngineColumnType.CHAR,
    "bpchar": WrenEngineColumnType.CHAR,
    "name": WrenEngineColumnType.CHAR,
    "character varying": WrenEngineColumnType.VARCHAR,
    "bigint": WrenEngineColumnType.BIGINT,
    "int": WrenEngineColumnType.INTEGER,
    "integer": WrenEngineColumnType.INTEGER,
    "smallint": WrenEngineColumnType.SMALLINT,
    "real": WrenEngineColumnType.REAL,
    "double precision": WrenEngineColumnType.DOUBLE,
    "numeric": WrenEngineColumnType.DECIMAL,
    "decimal": WrenEngineColumnType.DECIMAL,
    "boolean": WrenEngineColumnType.BOOLEAN,
    "timestamp": WrenEngineColumnType.TIMESTAMP,
    "timestamp without time zone": WrenEngineColumnType.TIMESTAMP,
    "timestamp with time zone": WrenEngineColumnType.TIMESTAMPTZ,
    "date": WrenEngineColumnType.DATE,
    "interval": WrenEngineColumnType.INTERVAL,
    "json": WrenEngineColumnType.JSON,
    "bytea": WrenEngineColumnType.BYTEA,
    "uuid": WrenEngineColumnType.UUID,
    "inet": WrenEngineColumnType.INET,
    "oid": WrenEngineColumnType.OID,
}
Contributor:

Use enum mapping.

class ColumnType(Enum):
    TEXT = ("TEXT", "text")

    def __init__(self, wtype, ptype):
        # store under private names so the properties below don't recurse
        self._wtype = wtype
        self._ptype = ptype

    @property
    def wtype(self):
        return self._wtype

    @property
    def ptype(self):
        return self._ptype
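Extending that idea, here is a self-contained sketch with a lookup helper that could replace the switcher dict. The member set is abbreviated, and the `from_ptype` name and the TEXT fallback are assumptions for illustration:

```python
from enum import Enum

class ColumnType(Enum):
    # (wren type, postgres type) pairs; abbreviated member set
    TEXT = ("TEXT", "text")
    INTEGER = ("INTEGER", "integer")
    BOOLEAN = ("BOOLEAN", "boolean")

    def __init__(self, wtype, ptype):
        # private names so the properties below do not recurse
        self._wtype = wtype
        self._ptype = ptype

    @property
    def wtype(self):
        return self._wtype

    @property
    def ptype(self):
        return self._ptype

    @classmethod
    def from_ptype(cls, ptype):
        # fall back to TEXT for unmapped types (an assumption in this sketch)
        return next((m for m in cls if m.ptype == ptype), cls.TEXT)
```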

Comment on lines 13 to 17
connection_info: Union[
    PostgresConnectionUrl | PostgresConnectionInfo,
    BigQueryConnectionInfo,
    SnowflakeConnectionInfo,
] = Field(alias="connectionInfo")
Contributor:

You could just use connection_info: ConnectionInfo = Field(alias="connectionInfo")

Comment on lines 20 to 77
class WrenEngineColumnType(Enum):
    # Boolean Types
    BOOLEAN = "BOOLEAN"

    # Numeric Types
    TINYINT = "TINYINT"

    INT2 = "INT2"
    SMALLINT = "SMALLINT"  # alias for INT2

    INT4 = "INT4"
    INTEGER = "INTEGER"  # alias for INT4

    INT8 = "INT8"
    BIGINT = "BIGINT"  # alias for INT8

    NUMERIC = "NUMERIC"
    DECIMAL = "DECIMAL"

    # Floating-Point Types
    FLOAT4 = "FLOAT4"
    REAL = "REAL"  # alias for FLOAT4

    FLOAT8 = "FLOAT8"
    DOUBLE = "DOUBLE"  # alias for FLOAT8

    # Character Types
    VARCHAR = "VARCHAR"
    CHAR = "CHAR"
    BPCHAR = "BPCHAR"  # BPCHAR is a fixed-length, blank-padded string
    TEXT = "TEXT"  # alias for VARCHAR
    STRING = "STRING"  # alias for VARCHAR
    NAME = "NAME"  # alias for VARCHAR

    # Date/Time Types
    TIMESTAMP = "TIMESTAMP"
    TIMESTAMPTZ = "TIMESTAMP WITH TIME ZONE"
    DATE = "DATE"
    INTERVAL = "INTERVAL"

    # JSON Types
    JSON = "JSON"

    # Object identifiers (OIDs) are used internally by PostgreSQL as primary keys for various system tables.
    # https://www.postgresql.org/docs/current/datatype-oid.html
    OID = "OID"

    # Binary Data Types
    BYTEA = "BYTEA"

    # UUID Type
    UUID = "UUID"

    # Network Address Types
    INET = "INET"

    # Unknown Type
    UNKNOWN = "UNKNOWN"
Contributor:

Please remove the extra blank lines.

@log_dto
def get_bigquery_constraints(dto: MetadataDTO) -> dict:
    table_list = Metadata.bigquery.get_constraints(dto.connection_info)
    return {"constraints": table_list}
Contributor:

It doesn't carry any other data, so how about returning the list directly as a JSON array?
By the way, please add an empty line at the end of the file.

@grieve54706 (Contributor):

I see the codebase uses POST for these APIs, but your PR description doesn't match it.

Comment on lines +224 to +325
def test_metadata_list_tables(self):
    connection_info = self.get_connection_info()
    response = client.post(
        url="/v2/ibis/bigquery/metadata/tables",
        json={"connectionInfo": connection_info},
    )
    assert response.status_code == 200

def test_metadata_list_constraints(self):
    connection_info = self.get_connection_info()
    response = client.post(
        url="/v2/ibis/bigquery/metadata/constraints",
        json={"connectionInfo": connection_info},
    )
    assert response.status_code == 200
Contributor:

Please assert the contents.

Collaborator (Author):

I think we do not need to assert the content here, because Pydantic will raise errors if the response data structure is incorrect, and we do not care about the actual data inside that structure.

@grieve54706 (Contributor) commented Jun 12, 2024:

No, FastAPI only checks that the response type is a dict. We should check the content, e.g. the data count, and that fields like name and columns are present. The test is there to make sure the result follows our code design.
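For example, a content assertion along these lines; the field names come from the response spec in the PR description, while the helper name and the sample payload are hypothetical:

```python
def assert_table_shape(tables):
    """Check the response follows the documented table metadata shape."""
    assert isinstance(tables, list) and len(tables) > 0
    first = tables[0]
    for field in ("name", "columns", "properties", "primaryKey"):
        assert field in first, f"missing table field: {field}"
    for column in first["columns"]:
        for field in ("name", "type", "notNull"):
            assert field in column, f"missing column field: {field}"

# Sample payload shaped like the PR description's response example.
sample = [
    {
        "name": "public.nation",
        "columns": [{"name": "n_nationkey", "type": "INTEGER", "notNull": True}],
        "properties": {"schema": "public", "catalog": "postgres", "table": "nation"},
        "primaryKey": "",
    }
]
assert_table_shape(sample)
```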

@onlyjackfrost (Collaborator, Author) commented Jun 12, 2024:

I'll update the test case and modify the data source's schema & constraints in another PR

@grieve54706 grieve54706 merged commit a325b49 into main Jun 12, 2024
1 check passed
@grieve54706 grieve54706 deleted the feature/ibis-metadata branch June 12, 2024 06:52