Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add file type validation and mime type extraction #1893

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
50 commits
Select commit Hold shift + click to select a range
80c50c6
feat: add file type validation and mime type extraction
jasonlai1218 Oct 15, 2023
2cf4d57
test: refactor test functions and add new tests
jasonlai1218 Oct 15, 2023
d327dee
Based on the file summaries provided, the best label for the commit w…
jasonlai1218 Oct 16, 2023
e669cbe
chore: update dev requirements and setup.py dependencies
jasonlai1218 Oct 16, 2023
3246d37
chore: update dependencies and remove unnecessary dev requirement
jasonlai1218 Oct 16, 2023
53aef24
chore: import `magic` module in `file.py`
jasonlai1218 Oct 16, 2023
38b684e
docs: fix typo in error message for incorrect file type
jasonlai1218 Oct 16, 2023
3034803
build: add libmagic1 package to Dockerfile.dev
jasonlai1218 Oct 17, 2023
65c8918
feat: refactor file type validation in FlyteFilePathTransformer class
jasonlai1218 Oct 18, 2023
50419e2
test: add test for invalid file type validation
jasonlai1218 Oct 18, 2023
ed980d6
refactor: update type hint for `validate_file_type` method
jasonlai1218 Oct 19, 2023
b64f06d
refactor: handle missing `magic` module gracefully
jasonlai1218 Oct 19, 2023
1fbffac
test: refactor file type tests to use `can_import_magic` fixture
jasonlai1218 Oct 19, 2023
cfa9801
Based on the file summaries provided, the best label for this code ch…
jasonlai1218 Oct 19, 2023
35ddf82
feat: refactor file handling and introduce hash calculation
jasonlai1218 Oct 20, 2023
c5b7408
The label that best describes this change is "refactor". This is beca…
jasonlai1218 Oct 21, 2023
2dfb7aa
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 21, 2023
946bb9c
Based on the file summaries provided, the best label for the commit w…
jasonlai1218 Oct 21, 2023
75177e4
refactor: change MIME types for `hdf5`, `ipynb`, and `onnx` files
jasonlai1218 Oct 22, 2023
73be136
refactor: update MIME types for specific file types
jasonlai1218 Oct 22, 2023
3dc3525
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 24, 2023
0577b3e
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 26, 2023
c07cffc
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 27, 2023
0b8bf4b
The label best describing this change is "test".: update file.py and …
jasonlai1218 Oct 27, 2023
6a09db0
feat: refactor file handling and input parameter in test cases
jasonlai1218 Oct 27, 2023
0beb145
chore: remove write_csv_format_file.py
jasonlai1218 Oct 27, 2023
85338ca
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 30, 2023
00d3452
feat: add `python-magic` library for file type validation and logging…
jasonlai1218 Oct 31, 2023
b176572
feat: add validation method for file types in FlyteFilePathTransformer
jasonlai1218 Oct 31, 2023
2d2e7d8
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Oct 31, 2023
7ac4c8f
feat: refactor file handling and add new tests
jasonlai1218 Oct 31, 2023
2b8cd4f
The label best describing this change is "refactor".: remove unnecess…
jasonlai1218 Oct 31, 2023
0b6c26c
test: add skipif marks to test functions for dataclass and enum in da…
jasonlai1218 Oct 31, 2023
c6b96ab
test: add import statement and skip test in unit test file
jasonlai1218 Oct 31, 2023
c1e2774
test: remove unnecessary tests in core and type_hints modules
jasonlai1218 Nov 2, 2023
d1d0476
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 2, 2023
143f1db
chore: remove custom type files
jasonlai1218 Nov 2, 2023
78ba16b
refactor: remove unnecessary validation in file transformations
jasonlai1218 Nov 2, 2023
06bd4c4
refactor: refactor file validation and test updates
jasonlai1218 Nov 3, 2023
ef04855
refactor: refactor file imports and remove unnecessary code
jasonlai1218 Nov 3, 2023
b4caa3a
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 6, 2023
f63dd51
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 7, 2023
ec732bd
feat: refactor file type handling
jasonlai1218 Nov 7, 2023
4b837a8
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 8, 2023
aea3a47
build: update requirements for installing `python-magic` library
jasonlai1218 Nov 8, 2023
bd4232a
chore: update `python-magic` usage comments for Windows compatibility
jasonlai1218 Nov 9, 2023
63e634a
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 9, 2023
d277be0
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 10, 2023
ee38477
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 11, 2023
7dc39e6
Merge remote-tracking branch 'upstream/master' into add-validation-fl…
jasonlai1218 Nov 15, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Dockerfile.dev
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ WORKDIR /root

ARG VERSION

RUN apt-get update && apt-get install build-essential vim -y
RUN apt-get update && apt-get install build-essential vim libmagic1 -y

COPY . /flytekit

Expand Down
6 changes: 6 additions & 0 deletions dev-requirements.in
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,12 @@ torch<=1.12.1; python_version<'3.11'
# pytorch 2 supports python 3.11
torch<=2.0.0; python_version>='3.11' or platform_system!='Windows'

# TODO: Currently, the python-magic library causes build errors on Windows due to its dependency on DLLs for libmagic.
# We have temporarily disabled this feature on Windows and are using python-magic for Mac OS and Linux instead.
# For more details, see the related GitHub issue.
# Once a solution is found, this should be updated to support Windows as well.
python-magic; (platform_system=='Darwin' or platform_system=='Linux')

pillow
scikit-learn
types-protobuf
Expand Down
54 changes: 54 additions & 0 deletions flytekit/types/file/file.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
from __future__ import annotations

import mimetypes
import os
import pathlib
import typing
Expand Down Expand Up @@ -324,6 +325,57 @@
def get_literal_type(self, t: typing.Union[typing.Type[FlyteFile], os.PathLike]) -> LiteralType:
return LiteralType(blob=self._blob_type(format=FlyteFilePathTransformer.get_format(t)))

def get_mime_type_from_extension(self, extension: str) -> str:
extension_to_mime_type = {

Check warning on line 329 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L329

Added line #L329 was not covered by tests
"hdf5": "text/plain",
"joblib": "application/octet-stream",
"python_pickle": "application/octet-stream",
"ipynb": "application/json",
"onnx": "application/json",
"tfrecord": "application/octet-stream",
}

for ext, mimetype in mimetypes.types_map.items():
extension_to_mime_type[ext.split(".")[1]] = mimetype

Check warning on line 339 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L339

Added line #L339 was not covered by tests

return extension_to_mime_type[extension]

Check warning on line 341 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L341

Added line #L341 was not covered by tests

def validate_file_type(
self, python_type: typing.Type[FlyteFile], source_path: typing.Union[str, os.PathLike]
) -> None:
"""
This method validates the type of the file at source_path against the expected python_type.
It uses the magic library to determine the real type of the file. If the magic library is not installed,
it logs a debug message and returns. If the actual file does not exist, it returns without raising an error.

:param python_type: The expected type of the file
:param source_path: The path to the file to validate
:raises ValueError: If the real type of the file is not the same as the expected python_type
"""
if FlyteFilePathTransformer.get_format(python_type) == "":
return

Check warning on line 356 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L356

Added line #L356 was not covered by tests

try:

Check warning on line 358 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L358

Added line #L358 was not covered by tests
# isolate the exception to the libmagic import
import magic

Check warning on line 360 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L360

Added line #L360 was not covered by tests

except ImportError as e:
logger.debug(f"Libmagic is not installed. Error message: {e}")
return

Check warning on line 364 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L362-L364

Added lines #L362 - L364 were not covered by tests

ctx = FlyteContext.current_context()

Check warning on line 366 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L366

Added line #L366 was not covered by tests
if ctx.file_access.is_remote(source_path):
# Skip validation for remote files. One of the use cases for FlyteFile is to point to remote files,
# you might have access to a remote file (e.g., in s3) that you want to pass to a Flyte workflow.
# Therefore, we should only validate FlyteFiles for which their path is considered local.
return

Check warning on line 371 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L371

Added line #L371 was not covered by tests

if FlyteFilePathTransformer.get_format(python_type):
pingsutw marked this conversation as resolved.
Show resolved Hide resolved
real_type = magic.from_file(source_path, mime=True)
expected_type = self.get_mime_type_from_extension(FlyteFilePathTransformer.get_format(python_type))

Check warning on line 375 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L374-L375

Added lines #L374 - L375 were not covered by tests
if real_type != expected_type:
raise ValueError(f"Incorrect file type, expected {expected_type}, got {real_type}")

Check warning on line 377 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L377

Added line #L377 was not covered by tests

def to_literal(
self,
ctx: FlyteContext,
Expand All @@ -348,6 +400,7 @@

if isinstance(python_val, FlyteFile):
source_path = python_val.path
self.validate_file_type(python_type, source_path)

Check warning on line 403 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L403

Added line #L403 was not covered by tests

# If the object has a remote source, then we just convert it back. This means that if someone is just
# going back and forth between a FlyteFile Python value and a Blob Flyte IDL value, we don't do anything.
Expand All @@ -373,6 +426,7 @@
elif isinstance(python_val, pathlib.Path) or isinstance(python_val, str):
source_path = str(python_val)
if issubclass(python_type, FlyteFile):
self.validate_file_type(python_type, source_path)

Check warning on line 429 in flytekit/types/file/file.py

View check run for this annotation

Codecov / codecov/patch

flytekit/types/file/file.py#L429

Added line #L429 was not covered by tests
if ctx.file_access.is_remote(source_path):
should_upload = False
else:
Expand Down
133 changes: 131 additions & 2 deletions tests/flytekit/unit/core/test_flyte_file.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
import pathlib
import tempfile
import typing
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch

import pytest
from typing_extensions import Annotated
Expand All @@ -19,7 +19,7 @@
from flytekit.core.workflow import workflow
from flytekit.models.core.types import BlobType
from flytekit.models.literals import LiteralMap
from flytekit.types.file.file import FlyteFile
from flytekit.types.file.file import FlyteFile, FlyteFilePathTransformer


# Fixture that ensures a dummy local file
Expand All @@ -34,6 +34,25 @@ def local_dummy_file():
os.remove(path)


@pytest.fixture
def local_dummy_txt_file():
fd, path = tempfile.mkstemp(suffix=".txt")
try:
with os.fdopen(fd, "w") as tmp:
tmp.write("Hello World")
yield path
finally:
os.remove(path)


def can_import(module_name) -> bool:
try:
__import__(module_name)
return True
except ImportError:
return False


def test_file_type_in_workflow_with_bad_format():
@task
def t1() -> FlyteFile[typing.TypeVar("txt")]:
Expand All @@ -52,6 +71,116 @@ def my_wf() -> FlyteFile[typing.TypeVar("txt")]:
assert fh.read() == "Hello World\n"


def test_matching_file_types_in_workflow(local_dummy_txt_file):
# TXT
@task
def t1(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile[typing.TypeVar("txt")]:
return path

@workflow
def my_wf(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile[typing.TypeVar("txt")]:
f = t1(path=path)
return f

res = my_wf(path=local_dummy_txt_file)
with open(res, "r") as fh:
assert fh.read() == "Hello World"


def test_file_types_with_naked_flytefile_in_workflow(local_dummy_txt_file):
@task
def t1(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile:
return path

@workflow
def my_wf(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile:
f = t1(path=path)
return f

res = my_wf(path=local_dummy_txt_file)
with open(res, "r") as fh:
assert fh.read() == "Hello World"


@pytest.mark.skipif(not can_import("magic"), reason="Libmagic is not installed")
def test_mismatching_file_types(local_dummy_txt_file):
@task
def t1(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile[typing.TypeVar("jpeg")]:
return path

@workflow
def my_wf(path: FlyteFile[typing.TypeVar("txt")]) -> FlyteFile[typing.TypeVar("jpeg")]:
f = t1(path=path)
return f

with pytest.raises(TypeError) as excinfo:
my_wf(path=local_dummy_txt_file)
assert "Incorrect file type, expected image/jpeg, got text/plain" in str(excinfo.value)


def test_get_mime_type_from_extension_success():
transformer = TypeEngine.get_transformer(FlyteFile)
assert transformer.get_mime_type_from_extension("html") == "text/html"
assert transformer.get_mime_type_from_extension("jpeg") == "image/jpeg"
assert transformer.get_mime_type_from_extension("png") == "image/png"
assert transformer.get_mime_type_from_extension("hdf5") == "text/plain"
assert transformer.get_mime_type_from_extension("joblib") == "application/octet-stream"
assert transformer.get_mime_type_from_extension("pdf") == "application/pdf"
assert transformer.get_mime_type_from_extension("python_pickle") == "application/octet-stream"
assert transformer.get_mime_type_from_extension("ipynb") == "application/json"
assert transformer.get_mime_type_from_extension("svg") == "image/svg+xml"
assert transformer.get_mime_type_from_extension("csv") == "text/csv"
assert transformer.get_mime_type_from_extension("onnx") == "application/json"
assert transformer.get_mime_type_from_extension("tfrecord") == "application/octet-stream"
assert transformer.get_mime_type_from_extension("txt") == "text/plain"


def test_get_mime_type_from_extension_failure():
transformer = TypeEngine.get_transformer(FlyteFile)
with pytest.raises(KeyError):
transformer.get_mime_type_from_extension("unknown_extension")


@pytest.mark.skipif(not can_import("magic"), reason="Libmagic is not installed")
def test_validate_file_type_incorrect():
transformer = TypeEngine.get_transformer(FlyteFile)
source_path = "/tmp/flytekit_test.png"
source_file_mime_type = "image/png"
user_defined_format = "jpeg"

with patch.object(FlyteFilePathTransformer, "get_format", return_value=user_defined_format):
with patch("magic.from_file", return_value=source_file_mime_type):
with pytest.raises(
ValueError, match=f"Incorrect file type, expected image/jpeg, got {source_file_mime_type}"
):
transformer.validate_file_type(user_defined_format, source_path)


@pytest.mark.skipif(not can_import("magic"), reason="Libmagic is not installed")
def test_flyte_file_type_annotated_hashmethod(local_dummy_file):
def calc_hash(ff: FlyteFile) -> str:
return str(ff.path)

HashedFlyteFile = Annotated[FlyteFile["jpeg"], HashMethod(calc_hash)]

@task
def t1(path: str) -> HashedFlyteFile:
return HashedFlyteFile(path)

@task
def t2(ff: HashedFlyteFile) -> None:
print(ff.path)

@workflow
def wf(path: str) -> None:
ff = t1(path=path)
t2(ff=ff)

with pytest.raises(TypeError) as excinfo:
wf(path=local_dummy_file)
assert "Incorrect file type, expected image/jpeg, got text/plain" in str(excinfo.value)


eapolinario marked this conversation as resolved.
Show resolved Hide resolved
def test_file_handling_remote_default_wf_input():
SAMPLE_DATA = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

Expand Down
6 changes: 3 additions & 3 deletions tests/flytekit/unit/core/test_type_hints.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@
from flytekit.models.types import LiteralType, SimpleType
from flytekit.tools.translator import get_serializable
from flytekit.types.directory import FlyteDirectory, TensorboardLogs
from flytekit.types.file import FlyteFile, PNGImageFile
from flytekit.types.file import FlyteFile
from flytekit.types.schema import FlyteSchema, SchemaOpenMode
from flytekit.types.structured.structured_dataset import StructuredDataset

Expand Down Expand Up @@ -390,7 +390,7 @@ def test_flyte_file_in_dataclass():
@dataclass
class InnerFileStruct(DataClassJsonMixin):
a: FlyteFile
b: PNGImageFile
b: FlyteFile
eapolinario marked this conversation as resolved.
Show resolved Hide resolved

@dataclass
class FileStruct(DataClassJsonMixin):
Expand All @@ -400,7 +400,7 @@ class FileStruct(DataClassJsonMixin):
@task
def t1(path: str) -> FileStruct:
file = FlyteFile(path)
fs = FileStruct(a=file, b=InnerFileStruct(a=file, b=PNGImageFile(path)))
fs = FileStruct(a=file, b=InnerFileStruct(a=file, b=FlyteFile(path)))
return fs

@dynamic
Expand Down
9 changes: 4 additions & 5 deletions tests/flytekit/unit/extras/tasks/test_shell.py
Original file line number Diff line number Diff line change
Expand Up @@ -117,7 +117,7 @@ def test_input_output_substitution_files():
name="test",
debug=True,
script=script,
inputs=kwtypes(f=CSVFile),
inputs=kwtypes(f=FlyteFile),
output_locs=[
OutputLocation(var="y", var_type=FlyteFile, location="{inputs.f}.mod"),
],
Expand All @@ -127,11 +127,10 @@ def test_input_output_substitution_files():

contents = "1,2,3,4\n"
with tempfile.TemporaryDirectory() as tmp:
csv = os.path.join(tmp, "abc.csv")
print(csv)
with open(csv, "w") as f:
test_data = os.path.join(tmp, "abc.txt")
with open(test_data, "w") as f:
f.write(contents)
y = t(f=csv)
y = t(f=test_data)
assert y.path[-4:] == ".mod"
assert os.path.exists(y.path)
with open(y.path) as f:
Expand Down
4 changes: 4 additions & 0 deletions tests/flytekit/unit/extras/tasks/testdata/test.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
SN,Name,Contribution
eapolinario marked this conversation as resolved.
Show resolved Hide resolved
1,Linus Torvalds,Linux Kernel
2,Tim Berners-Lee,World Wide Web
3,Guido van Rossum,Python Programming
Loading