[Core feature] Allow nested fields in structured datasets #4241
Comments
Can I try this issue?
#take
Hi! @dylanwilder cc @pingsutw
import os
import tempfile
from dataclasses import dataclass

import pandas as pd
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset

# If you're using a Flytekit version below v1.10, decorate the class with @dataclass_json
# (from dataclasses_json import dataclass_json) instead of inheriting from Mashumaro's DataClassJSONMixin.
from mashumaro.mixins.json import DataClassJSONMixin

# Add Annotated: https://docs.python.org/3/library/typing.html#typing.Annotated
from typing_extensions import Annotated


@dataclass
class DetailField:
    age: int
    sex: str


@dataclass
class RecordField:
    name: str
    detail: DetailField


# @dataclass_json
@dataclass
class FlyteTypes(DataClassJSONMixin):
    dataframe: Annotated[StructuredDataset, {"name": str, "detail": {"age": int, "sex": str}}]
    file: FlyteFile
    directory: FlyteDirectory


@task
def upload_data() -> FlyteTypes:
    """
    Flytekit will upload FlyteFile, FlyteDirectory and StructuredDataset to the blob store,
    such as GCP or S3.
    """
    # 1. StructuredDataset
    df = pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]})
    # 2. FlyteDirectory
    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(temp_dir + "/df.parquet")
    # 3. FlyteFile
    file_path = tempfile.NamedTemporaryFile(delete=False)
    file_path.write(b"Hello, World!")
    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=FlyteFile(file_path.name),
        directory=FlyteDirectory(temp_dir),
    )
    return fs


@task
def download_data(res: FlyteTypes):
    assert pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]}).equals(res.dataframe.open(pd.DataFrame).all())
    f = open(res.file, "r")
    assert f.read() == "Hello, World!"
    assert os.listdir(res.directory) == ["df.parquet"]


@workflow
def dataclass_wf() -> FlyteTypes:
    o1 = upload_data()
    download_data(res=o1)
    return o1


if __name__ == "__main__":
    dataclass_wf()
import os
import tempfile
from dataclasses import dataclass

import pandas as pd
from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset

# If you're using a Flytekit version below v1.10, decorate the class with @dataclass_json
# (from dataclasses_json import dataclass_json) instead of inheriting from Mashumaro's DataClassJSONMixin.
from mashumaro.mixins.json import DataClassJSONMixin

# Add Annotated: https://docs.python.org/3/library/typing.html#typing.Annotated
from typing_extensions import Annotated


@dataclass
class DetailField:
    age: int
    sex: str


@dataclass
class RecordField:
    name: str
    detail: DetailField


# @dataclass_json
@dataclass
class FlyteTypes(DataClassJSONMixin):
    dataframe: Annotated[StructuredDataset, {"name": str, "detail": {"age": int, "sex": str}}]
    file: FlyteFile
    directory: FlyteDirectory


@task
def upload_data() -> FlyteTypes:
    """
    Flytekit will upload FlyteFile, FlyteDirectory and StructuredDataset to the blob store,
    such as GCP or S3.
    """
    # 1. StructuredDataset (note: the first Age is the string "20", not the int 20)
    df = pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": "20", "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]})
    # 2. FlyteDirectory
    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(temp_dir + "/df.parquet")
    # 3. FlyteFile
    file_path = tempfile.NamedTemporaryFile(delete=False)
    file_path.write(b"Hello, World!")
    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=FlyteFile(file_path.name),
        directory=FlyteDirectory(temp_dir),
    )
    return fs


@task
def download_data(res: FlyteTypes):
    assert pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]}).equals(res.dataframe.open(pd.DataFrame).all())
    f = open(res.file, "r")
    assert f.read() == "Hello, World!"
    assert os.listdir(res.directory) == ["df.parquet"]


@workflow
def dataclass_wf() -> FlyteTypes:
    o1 = upload_data()
    download_data(res=o1)
    return o1


if __name__ == "__main__":
    dataclass_wf()
# Schema = Annotated[StructuredDataset, RecordField]
Schema = Annotated[StructuredDataset, {"name": str, "detail": {"age": int, "sex": str}}]


@task
def mytask() -> Schema:
    return StructuredDataset(dataframe=pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]}))
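The two annotation styles above (a nested dict, or the RecordField dataclass) describe the same shape. As a minimal sketch of how they could correspond, the hypothetical helper `dataclass_to_schema` below expands a dataclass into the nested dict form using only the standard library. It is an illustration, not part of flytekit, and it assumes the field annotations are real types rather than strings (that is, no `from __future__ import annotations`).

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class DetailField:
    age: int
    sex: str


@dataclass
class RecordField:
    name: str
    detail: DetailField


def dataclass_to_schema(cls) -> dict:
    """Hypothetical helper: expand a dataclass type into a nested {field: type} dict."""
    schema = {}
    for f in fields(cls):
        # Recurse into nested dataclasses; keep plain types unchanged.
        schema[f.name] = dataclass_to_schema(f.type) if is_dataclass(f.type) else f.type
    return schema


print(dataclass_to_schema(RecordField))
```

The result has the same structure as the dict passed to Annotated above, so either form could in principle be normalized to the other before column types are derived.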
@austin362667 I tested your code and it works for me, but it does not appear to be compatible with
@austin362667 We noticed that the schema stored in the Flyte workflow definition is empty, as shown below. Could you confirm whether you are able to see the schema in the workflow definition? We also noticed that this works locally, but not on Flyte.
Hi @gitgraghu, @dylanwilder
import os
import tempfile
from dataclasses import dataclass

from flytekit import task, workflow
from flytekit.types.directory import FlyteDirectory
from flytekit.types.file import FlyteFile
from flytekit.types.structured import StructuredDataset

# If you're using a Flytekit version below v1.10, decorate the class with @dataclass_json
# (from dataclasses_json import dataclass_json) instead of inheriting from Mashumaro's DataClassJSONMixin.
from mashumaro.mixins.json import DataClassJSONMixin

# Add Annotated: https://docs.python.org/3/library/typing.html#typing.Annotated
from typing_extensions import Annotated
from flytekit import ImageSpec, Resources, task
import pandas as pd

# custom_image = ImageSpec(
#     name="basic:latest",
#     registry="localhost:30000",
#     packages=["pandas", "numpy"],
#     apt_packages=['git'],
#     python_version=3.11,
# )
# if custom_image.is_container():
#     import pandas as pd


@dataclass
class DetailField:
    age: int
    sex: str


@dataclass
class RecordField:
    name: str
    detail: DetailField


# Schema = Annotated[StructuredDataset, RecordField]
Schema = Annotated[StructuredDataset, {"name": str, "detail": {"age": int, "sex": str}}]


# @dataclass_json
@dataclass
class FlyteTypes(DataClassJSONMixin):
    dataframe: Schema
    file: FlyteFile
    directory: FlyteDirectory


@task(requests=Resources(cpu="1", mem="1Gi"))
def upload_data() -> FlyteTypes:
    """
    Flytekit will upload FlyteFile, FlyteDirectory and StructuredDataset to the blob store,
    such as GCP or S3.
    """
    # 1. StructuredDataset
    df = pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]})
    # 2. FlyteDirectory
    temp_dir = tempfile.mkdtemp(prefix="flyte-")
    df.to_parquet(temp_dir + "/df.parquet")
    # 3. FlyteFile
    file_path = tempfile.NamedTemporaryFile(delete=False)
    file_path.write(b"Hello, World!")
    fs = FlyteTypes(
        dataframe=StructuredDataset(dataframe=df),
        file=FlyteFile(file_path.name),
        directory=FlyteDirectory(temp_dir),
    )
    print("upload_data:\n", fs.dataframe.dataframe)
    return fs


@task(requests=Resources(cpu="1", mem="1Gi"))
def download_data(res: FlyteTypes):
    assert pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]}).equals(res.dataframe.open(pd.DataFrame).all())
    f = open(res.file, "r")
    assert f.read() == "Hello, World!"
    assert os.listdir(res.directory) == ["df.parquet"]
    print("download_data:\n", res.dataframe.open(pd.DataFrame).all())


@task(requests=Resources(cpu="1", mem="1Gi"))
def mytask() -> Schema:
    df = pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}]})
    print("my_df:\n", df)
    return StructuredDataset(dataframe=df)
    # TODO: returning pd.DataFrame can't work cc [Kevin Su](https://github.com/pingsutw)
    # return df


@workflow
def dataclass_wf() -> FlyteTypes:
    mytask()
    o1 = upload_data()
    download_data(res=o1)
    return o1


if __name__ == "__main__":
    dataclass_wf()
@austin362667 I think the two ways of specifying the schema below are not working as expected; they behave as if no schema were specified.
Schema = Annotated[StructuredDataset, {"Name": str, "Detail": {"Age": int, "Sex": str}}]
Schema = Annotated[StructuredDataset, RecordField]
Try the following example with different variations of Schema (JSON, dataclass and kwtypes). I have added a City column to the dataframe.
@task
def write_df() -> Schema:
    df = pd.DataFrame({"Name": ["Tom", "Joseph", "Alyssa"], "Detail": [{"Age": 20, "Sex": "M"}, {"Age": 22, "Sex": "M"}, {"Age": 24, "Sex": "F"}], "City": ["Madrid", "Paris", "London"]})
    return StructuredDataset(dataframe=df)


@task
def read_df(structds: Schema):
    df = structds.open(pd.DataFrame).all()
    print(df)


@workflow
def struct_record_wf():
    out = write_df()
    read_df(structds=out)
Even if I specify only the Name and Detail fields in the schema, the open call returns all of the data, which is equivalent to not specifying any schema. However, using kwtypes does take the schema into account and reads only the specified columns.
Schema = Annotated[StructuredDataset, kwtypes(Name=str, City=str)]
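To make the expected behavior concrete, here is a minimal sketch of what projecting a record onto a nested schema would mean: only the declared fields survive, at every nesting level. The helper `project` and the sample row are hypothetical and use plain Python dicts; this does not describe how flytekit implements column selection.

```python
def project(record: dict, schema: dict) -> dict:
    """Hypothetical helper: keep only the fields named in a (possibly nested) schema."""
    out = {}
    for key, expected in schema.items():
        value = record[key]
        # A dict in the schema means a nested record, so project recursively.
        out[key] = project(value, expected) if isinstance(expected, dict) else value
    return out


schema = {"Name": str, "Detail": {"Age": int}}
row = {"Name": "Tom", "Detail": {"Age": 20, "Sex": "M"}, "City": "Madrid"}

print(project(row, schema))  # {'Name': 'Tom', 'Detail': {'Age': 20}}
```

With a schema that omits City (and Sex inside Detail), the projected row drops both. This is the behavior the kwtypes variant shows for flat columns, extended here to nested fields.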
@gitgraghu You're right! Thanks for reporting; I'm working on it now.
…lass, kwargs or mix Signed-off-by: Austin Liu <[email protected]>
Motivation: Why do you think this is important?
Most storage formats support nested field structures today (Avro, Parquet, BigQuery), but StructuredDatasets currently support only flat schemas. This prevents the use of common data modeling and organizational practices, or requires bypassing structured datasets to simply pass a URI, which prevents type checking.
Goal: What should the final outcome look like, ideally?
One or both of the following should be possible
Describe alternatives you've considered
As noted above, a task can return a URI instead, but this essentially bypasses the type system entirely.
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?