Pyarrow direct loading #679

Merged
merged 30 commits into from
Oct 16, 2023
Commits
d30eb1b
poc: direct pyarrow load
steinitzu Oct 10, 2023
730dce2
arrow to schema types with precision, test
steinitzu Oct 1, 2023
41e2805
Fix naming copy_atomic -> move_atomic
steinitzu Oct 2, 2023
9433cc4
jsonl/parquet file normalizer classes
steinitzu Oct 14, 2023
3393110
pathlib extension checks
steinitzu Oct 2, 2023
40de1b5
indent fix
steinitzu Oct 2, 2023
a203822
Write parquet with original schema
steinitzu Oct 2, 2023
c2cd0bd
extract refactoring
steinitzu Oct 10, 2023
487987c
Init testing, bugfix
steinitzu Oct 9, 2023
2b94489
Fix import filename
steinitzu Oct 9, 2023
817eca5
Dep
steinitzu Oct 9, 2023
9973bb8
Mockup incremental implementation for arrow tables
steinitzu Oct 10, 2023
6203d71
Create loadstorage per filetype, import with hardlink
steinitzu Oct 11, 2023
4fa718d
Fallback for extract item format, detect type of lists
steinitzu Oct 11, 2023
1000613
Error message, load tests with arrow
steinitzu Oct 12, 2023
5f452b6
Some incremental optimizations
steinitzu Oct 12, 2023
b414a69
Incremental fixes and run incremental tests on arrow & pandas
steinitzu Oct 12, 2023
274f0ee
Add/update normalize tests
steinitzu Oct 14, 2023
6b4d43a
Fix load test
steinitzu Oct 14, 2023
919c5ab
Lint
steinitzu Oct 14, 2023
d58f77e
Add docs page for arrow loading
steinitzu Oct 14, 2023
73b481b
Handle none capability
steinitzu Oct 14, 2023
7df1252
Fix extract lists
steinitzu Oct 14, 2023
3662756
Exclude TIME in redshift test
steinitzu Oct 15, 2023
f16a7ba
Fix type errors
steinitzu Oct 15, 2023
93811e0
Typo
steinitzu Oct 15, 2023
33ce792
Create col from numpy array for >200x speedup, index after filter
steinitzu Oct 15, 2023
b9f7aaf
in -> not in
steinitzu Oct 16, 2023
b78618c
Format binary as hex for redshift
steinitzu Oct 16, 2023
2b96904
enables bool and duckdb test on pyarrow loading
rudolfix Oct 16, 2023
Fix naming copy_atomic -> move_atomic
steinitzu committed Oct 15, 2023
commit 41e28057938957d38f0c300866bd8fa6ad8dd8a1
4 changes: 2 additions & 2 deletions dlt/common/storages/file_storage.py
@@ -45,7 +45,7 @@ def save_atomic(storage_path: str, relative_path: str, data: Any, file_type: str
             raise
 
     @staticmethod
-    def copy_atomic(source_file_path: str, dest_folder_path: str) -> str:
+    def move_atomic(source_file_path: str, dest_folder_path: str) -> str:
         file_name = os.path.basename(source_file_path)
         dest_file_path = os.path.join(dest_folder_path, file_name)
         try:
@@ -197,7 +197,7 @@ def rename_tree_files(self, from_relative_path: str, to_relative_path: str) -> N

     def atomic_import(self, external_file_path: str, to_folder: str) -> str:
         """Moves a file at `external_file_path` into the `to_folder` effectively importing file into storage"""
-        return self.to_relative_path(FileStorage.copy_atomic(external_file_path, self.make_full_path(to_folder)))
+        return self.to_relative_path(FileStorage.move_atomic(external_file_path, self.make_full_path(to_folder)))
         # file_name = FileStorage.get_file_name_from_file_path(external_path)
         # os.rename(external_path, os.path.join(self.make_full_path(to_folder), file_name))
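
The rename reflects that the helper removes the source file rather than leaving a copy behind. The body after `try:` is not shown in this diff, so the following is only a sketch of how such a move could work, not the code in `file_storage.py`. It assumes a plain `os.rename` first, with a copy-then-replace fallback for cases where the rename fails (for example across filesystems). The temp-file naming via `uuid` is an illustration, not taken from the source.

```python
import os
import shutil
import tempfile
import uuid


def move_atomic(source_file_path: str, dest_folder_path: str) -> str:
    """Move a file into dest_folder_path, keeping its base name.

    Sketch only: the fallback branch is an assumption, not the dlt code.
    """
    file_name = os.path.basename(source_file_path)
    dest_file_path = os.path.join(dest_folder_path, file_name)
    try:
        # Atomic when source and destination are on the same filesystem.
        os.rename(source_file_path, dest_file_path)
    except OSError:
        # Fallback (assumed): copy to a temp name inside the destination
        # folder, then swap it into place so readers never see a partial file.
        temp_path = os.path.join(dest_folder_path, uuid.uuid4().hex)
        try:
            shutil.copyfile(source_file_path, temp_path)
            os.replace(temp_path, dest_file_path)
            os.unlink(source_file_path)
        except Exception:
            # Clean up the partial copy before re-raising.
            if os.path.isfile(temp_path):
                os.remove(temp_path)
            raise
    return dest_file_path


if __name__ == "__main__":
    # Small demonstration in a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        src_dir = os.path.join(root, "src")
        dst_dir = os.path.join(root, "dst")
        os.makedirs(src_dir)
        os.makedirs(dst_dir)
        src = os.path.join(src_dir, "data.parquet")
        with open(src, "wb") as f:
            f.write(b"payload")
        moved = move_atomic(src, dst_dir)
        print(os.path.basename(moved))
```

With a helper like this, the source path no longer exists after the call, which is the behavior the name `move_atomic` describes and `copy_atomic` did not.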