[WIP] VFS access for unittesting #595

Draft
wants to merge 48 commits into
base: main
48 commits
ce513bf
Check VFS access of CI.
JohnMoutafis Jul 2, 2024
a660a60
Gitlab CI test aws access
JohnMoutafis Jul 2, 2024
4697da4
Check mirroring between github and gitlab.
JohnMoutafis Jul 11, 2024
f80eac1
Update branch and check assume role.
JohnMoutafis Jul 31, 2024
69f9574
Attempt to run gitlab ci from yml.
JohnMoutafis Jul 31, 2024
7ea083b
Fix typo in gitlab job.
JohnMoutafis Jul 31, 2024
a4725ed
Attempt to set the stage.
JohnMoutafis Jul 31, 2024
2687cfe
Attempt to use aws-cli image for aws commands.
JohnMoutafis Aug 1, 2024
d47bdfc
Attempt to maintain aws credentials between ci jobs.
JohnMoutafis Aug 2, 2024
a16bc58
Attempt to maintain aws credentials between jobs (2)
JohnMoutafis Aug 2, 2024
d11de06
Attempt to maintain aws credentials between jobs (3)
JohnMoutafis Aug 2, 2024
00bf4ab
Attempt to maintain aws credentials between ci jobs (4)
JohnMoutafis Aug 2, 2024
e1c9266
Attempt to set AWS credentials correctly between jobs.
JohnMoutafis Aug 2, 2024
83dbc47
Debug oidc creds.
JohnMoutafis Aug 2, 2024
1f6c27f
Attempt different web identity provider.
JohnMoutafis Aug 2, 2024
785d94c
Attempt to assume role.
JohnMoutafis Aug 6, 2024
ae008eb
Fix check-aws-credentials.
JohnMoutafis Aug 6, 2024
31f7154
Check if gitlab assumes the runners-role.
JohnMoutafis Aug 6, 2024
5f4162c
Use correct arn.
JohnMoutafis Aug 6, 2024
6a0eeea
Remove potentially overriding env vars.
JohnMoutafis Aug 6, 2024
d55021c
Fix typo.
JohnMoutafis Aug 6, 2024
4c8598d
Attempt to use CloudPy CI role.
JohnMoutafis Aug 6, 2024
1c6d229
Remove assume-role query.
JohnMoutafis Aug 6, 2024
a951dfc
Check assume role for cloud-py-ci.
JohnMoutafis Aug 6, 2024
2e13c92
Attempt to override aws access env vars.
JohnMoutafis Aug 6, 2024
6925acb
Unify assume role method.
JohnMoutafis Aug 6, 2024
2474929
Attempt setting up tests.
JohnMoutafis Aug 7, 2024
29580a4
Attempt to setup test env explicitly.
JohnMoutafis Aug 7, 2024
c205b6d
Attempt to install .[tests]
JohnMoutafis Aug 7, 2024
c66f598
Try installing git in env to install cloud-py .[tests]
JohnMoutafis Aug 7, 2024
1b9cc4f
Attempt !reference tag to avoid `before_script` override.
JohnMoutafis Aug 7, 2024
33a21c7
Attempt to set AWS credentials in env.
JohnMoutafis Aug 7, 2024
2a1c912
Debug AWS access.
JohnMoutafis Aug 7, 2024
43d0679
Attempt to ingest files test.
JohnMoutafis Aug 7, 2024
0533769
Debug repo folder location in ci runner.
JohnMoutafis Aug 8, 2024
fb0a423
Attempt to ls repo and builds location.
JohnMoutafis Aug 8, 2024
aceae06
Search where pytest is executed
JohnMoutafis Aug 8, 2024
0ec6884
ls the "builds" repo folder.
JohnMoutafis Aug 8, 2024
f13da40
Attempt to debug test data location (1)
JohnMoutafis Aug 8, 2024
3136b56
Move data files and try again.
JohnMoutafis Aug 8, 2024
7516a6c
Attempt to ingest test data files.
JohnMoutafis Aug 8, 2024
f2687da
Remove error variable.
JohnMoutafis Aug 8, 2024
3a60aa3
Attempt to debug env variables.
JohnMoutafis Aug 9, 2024
dbc5765
Attempt to print Config dict.
JohnMoutafis Aug 9, 2024
76e3371
Attempt to get config through vfs.ctx.
JohnMoutafis Aug 9, 2024
d3cd20f
Attempt to manually config AWS parameters.
JohnMoutafis Aug 9, 2024
3a4a626
Attempt to use the new unittest user ACN.
JohnMoutafis Aug 14, 2024
484519c
Attempt to use acn 2.
JohnMoutafis Aug 23, 2024
1 change: 1 addition & 0 deletions .github/workflows/tiledb-cloud-py.yaml
@@ -132,6 +132,7 @@ jobs:
pytest -sv \
--tb short \
--color yes \
--ignore=tests/gitlab_tests \
--splits ${{ env.PYTEST_SPLIT_GROUPS }} \
--group ${{ matrix.pytest-split-group }} \
--store-durations \
54 changes: 54 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,54 @@
variables:
  AWS_REGION: "us-east-1"
  AWS_CLOUD_CI_ROLE_ARN: "arn:aws:iam::211125629230:role/TiledbCloudPyCIRole"
  AWS_CLOUD_CI_S3_BUCKET: "tiledb-cloud-py-ci"

stages:
  - aws-assume-role
  - run-tests

aws-assume-role:
  # Assume role and override AWS env variables
  stage: aws-assume-role
  image:
    name: "amazon/aws-cli:latest"
    entrypoint: [""]
  before_script:
    - >
      export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s"
      $(aws sts assume-role
      --role-arn ${AWS_CLOUD_CI_ROLE_ARN}
      --role-session-name "GitLabRunner-${CI_PROJECT_ID}-${CI_PIPELINE_ID}"
      --duration-seconds 3600
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]'
      --output text))
  script:
    - aws s3 ls s3://${AWS_CLOUD_CI_S3_BUCKET}
    # Write AWS ENV Variables to run-tests.env for use in the next stages
    - echo "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID" >> run-tests.env
    - echo "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" >> run-tests.env
    - echo "AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN" >> run-tests.env
  artifacts:
    expire_in: "30 min"
    reports:
      dotenv: run-tests.env

run-tests:
  stage: run-tests
  image: python:3.9-slim
  needs:
    - aws-assume-role
  before_script:
    - apt-get update && apt-get install -y git
    - pip install --upgrade pip wheel setuptools setuptools-scm
    - pip install .[tests]
    - pip install tiledb-vector-search --no-deps
    # Export ENV Variables from aws-assume-role stage
    - export "AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID"
    - export "AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY"
    - export "AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN"
  script:
    - >
      pytest -sv tests/gitlab_tests
      --tb short
      --color yes
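Note on the credential handoff: the `aws-assume-role` job writes `KEY=value` lines to `run-tests.env`, and the dotenv report makes them available as variables in `run-tests`. The mapping from the three queried credential fields to environment variable names can be sketched in Python as below. This is only an illustration, assuming a dictionary shaped like the `Credentials` object returned by `aws sts assume-role`; the names `credentials_to_dotenv` and `parse_dotenv` are hypothetical and not part of this PR.

```python
from typing import Dict, Mapping

# STS credential fields paired with the environment variable names used by the jobs.
_FIELD_TO_ENV = {
    "AccessKeyId": "AWS_ACCESS_KEY_ID",
    "SecretAccessKey": "AWS_SECRET_ACCESS_KEY",
    "SessionToken": "AWS_SESSION_TOKEN",
}


def credentials_to_dotenv(credentials: Mapping[str, str]) -> str:
    """Render temporary credentials as dotenv text, one KEY=value per line."""
    missing = [field for field in _FIELD_TO_ENV if field not in credentials]
    if missing:
        raise KeyError(f"missing credential fields: {missing}")
    lines = [f"{env}={credentials[field]}" for field, env in _FIELD_TO_ENV.items()]
    return "\n".join(lines) + "\n"


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first '=' only, since token values may contain '='.
        key, _, value = line.partition("=")
        result[key] = value
    return result
```

Splitting on the first `=` only matters here because session tokens are base64-like strings and may themselves contain `=` characters.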
1 change: 1 addition & 0 deletions pyproject.toml
@@ -32,6 +32,7 @@ life-sciences = ["tiledbsoma"]
docs = ["quartodoc"]
dev = ["black", "pytest", "ruff"]
tests = [
"pytest",
"xarray",
"pytest-cov",
"pytest-explicit",
Empty file added tests/gitlab_tests/__init__.py
Empty file.
3 changes: 3 additions & 0 deletions tests/gitlab_tests/data/file_ingestion/to_ingest_0.txt
@@ -0,0 +1,3 @@
Test 0
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
3 changes: 3 additions & 0 deletions tests/gitlab_tests/data/file_ingestion/to_ingest_1.txt
@@ -0,0 +1,3 @@
Test 1
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
3 changes: 3 additions & 0 deletions tests/gitlab_tests/data/file_ingestion/to_ingest_2.txt
@@ -0,0 +1,3 @@
Test 2
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
3 changes: 3 additions & 0 deletions tests/gitlab_tests/data/file_ingestion/to_ingest_3.txt
@@ -0,0 +1,3 @@
Test 3
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
3 changes: 3 additions & 0 deletions tests/gitlab_tests/data/file_ingestion/to_ingest_4.txt
@@ -0,0 +1,3 @@
Test 4
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
nisi ut aliquip ex ea commodo consequat.
230 changes: 230 additions & 0 deletions tests/gitlab_tests/test_file.py
@@ -0,0 +1,230 @@
import os
import unittest
from typing import List

from tqdm import tqdm

import tiledb
import tiledb.vfs
from tiledb.cloud import groups
from tiledb.cloud._common import testonly
from tiledb.cloud._common import utils
from tiledb.cloud.array import delete_array
from tiledb.cloud.files import ingestion as file_ingestion

CURRENT_DIR = os.path.dirname(os.path.realpath(__file__))


def _cleanup_residual_test_arrays(array_uris: List[str]) -> None:
    """Deletes every array in a list and potential non unique tables"""
    for array_uri in array_uris:
        try:
            print(f"Deleting array: {array_uri}")
            delete_array(array_uri)
        except Exception as exc:
            error_msg = str(exc)
            print(f"-- {error_msg}")
            if "is not unique" in error_msg:
                namespace, _ = utils.split_uri(array_uri)
                uuids = error_msg[error_msg.find("[") + 1 : error_msg.find("]")]
                uuids = uuids.split(" ")
                for uid in tqdm(
                    uuids, desc=f"Deleting multiple arrays with URI: {array_uri}"
                ):
                    try:
                        delete_array(f"tiledb://{namespace}/{uid}")
                    except Exception:
                        continue
            continue
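
Note on the "is not unique" branch: the slicing relies on the error message listing the conflicting array UUIDs between square brackets, separated by spaces. A minimal sketch of that parsing as a standalone helper follows. `extract_bracketed_ids` is a hypothetical name, and the message format is inferred only from the slicing code in this function, not from any documented server response.

```python
from typing import List


def extract_bracketed_ids(error_msg: str) -> List[str]:
    """Return the whitespace-separated tokens between the first '[' and the next ']'."""
    start = error_msg.find("[")
    end = error_msg.find("]", start + 1)
    if start == -1 or end == -1:
        # No bracketed list in the message, so there is nothing to delete one by one.
        return []
    # split() without arguments drops empty tokens caused by repeated spaces.
    return error_msg[start + 1 : end].split()
```

Compared with slicing on the raw `find` results, the guard avoids building a meaningless slice when either bracket is absent, and `split()` avoids empty IDs that `split(" ")` would produce for consecutive spaces.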


class TestFileIngestion(unittest.TestCase):
    @classmethod
    def setUpClass(cls) -> None:
        """Setup group and destinations once before the file tests start."""
        cls.config = tiledb.Config()
        cls.config["vfs.s3.region"] = os.environ["AWS_REGION"]
        cls.config["vfs.s3.aws_access_key_id"] = os.environ["AWS_ACCESS_KEY_ID"]
        cls.config["vfs.s3.aws_secret_access_key"] = os.environ["AWS_SECRET_ACCESS_KEY"]
        cls.config["vfs.s3.aws_session_token"] = os.environ["AWS_SESSION_TOKEN"]

        cls.vfs = tiledb.VFS(config=cls.config)

        cls.s3_bucket = f"s3://{os.environ['AWS_CLOUD_CI_S3_BUCKET']}"
        cls.test_files_folder = os.path.join(CURRENT_DIR, "data", "file_ingestion")

        cls.namespace, cls.storage_path, _ = groups._default_ns_path_cred()
        cls.acn = "tiledb-unittest-ci-bucket"
        cls.namespace = cls.namespace.rstrip("/")
        cls.storage_path = cls.storage_path.rstrip("/")
        cls.destination = f"{cls.storage_path}/{testonly.random_name('file_test')}"

        cls.group_name = testonly.random_name("file_ingestion_test_group")
        cls.group_uri = f"tiledb://{cls.namespace}/{cls.group_name}"
        cls.group_destination = f"{cls.storage_path}/{cls.group_name}"
        groups.create(cls.group_name, storage_uri=cls.group_destination)

        return super().setUpClass()

    @classmethod
    def tearDownClass(cls) -> None:
        """Cleanup after the tests have run"""
        groups.delete(cls.group_uri, recursive=True)
        return super().tearDownClass()
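
Note on the environment-driven setup: `setUpClass` reads four AWS variables directly from `os.environ`, so a missing one surfaces as a bare `KeyError`. One possible way to gather them with a clearer failure message is sketched below. The helper name `s3_config_from_env` is hypothetical; the config keys are the ones already used in `setUpClass`.

```python
from typing import Dict, Mapping

# TileDB VFS config keys (as used in setUpClass) paired with their source env vars.
_S3_CONFIG_ENV = {
    "vfs.s3.region": "AWS_REGION",
    "vfs.s3.aws_access_key_id": "AWS_ACCESS_KEY_ID",
    "vfs.s3.aws_secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "vfs.s3.aws_session_token": "AWS_SESSION_TOKEN",
}


def s3_config_from_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Build the S3 config mapping, reporting every missing or empty variable at once."""
    missing = sorted(var for var in _S3_CONFIG_ENV.values() if not env.get(var))
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in _S3_CONFIG_ENV.items()}
```

In the test class this would presumably be called as `s3_config_from_env(os.environ)`, with the entries of the returned dictionary assigned to `cls.config` in the same way as above.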

    def setUp(self) -> None:
        s3_test_folder = testonly.random_name("file_ingestion_test_files")
        self.cleanup_arrays = []
        self.s3_test_folder_uri = f"{self.s3_bucket}/{s3_test_folder}"
        self.vfs.create_dir(self.s3_test_folder_uri)

        # VFS does not yet support copying across file systems, so each local
        # file is written to the S3 test folder through a VFS handle instead of:
        # self.vfs.copy_file(
        #     old_uri=os.path.join(self.test_files_folder, fname),
        #     new_uri=f"{self.s3_test_folder_uri}/{testonly.random_name(fn)}.{suffix}",
        # )
        self.test_file_uris = []
        for fname in os.listdir(self.test_files_folder):
            fn, suffix = os.path.splitext(fname)
            s3_uri = f"{self.s3_test_folder_uri}/{testonly.random_name(fn)}{suffix}"
            # Read as bytes so the payload matches the binary ("wb") VFS handle.
            with open(os.path.join(self.test_files_folder, fname), "rb") as fp:
                with self.vfs.open(s3_uri, mode="wb") as vfp:
                    vfp.write(fp.read())
            self.test_file_uris.append(s3_uri)

        return super().setUp()
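
Note on the upload loop: since the files are pushed by reading from one handle and writing to another, the copy can be expressed as a small chunked helper that works on any pair of binary file objects. The sketch below uses in-memory buffers as stand-ins; `copy_stream` is a hypothetical name, and it assumes the VFS handle accepts `write(bytes)` like an ordinary binary file, which is how it is used in `setUp`.

```python
import io
from typing import BinaryIO


def copy_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 1024 * 1024) -> int:
    """Copy bytes from src to dst in chunks and return the number of bytes copied."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # An empty read signals end of input.
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


# In-memory stand-ins for the local test file and the remote handle.
source = io.BytesIO(b"Test 0\nLorem ipsum dolor sit amet")
target = io.BytesIO()
copied = copy_stream(source, target, chunk_size=8)
```

For the three-line fixture files in this PR a single `read()` is enough; chunking would only matter if larger fixtures were added later.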

    def tearDown(self) -> None:
        """Clean up ingested arrays and tmp file folder from s3"""
        _cleanup_residual_test_arrays(array_uris=self.cleanup_arrays)
        self.vfs.remove_dir(self.s3_test_folder_uri)
        return super().tearDown()

    def test_files_ingestion_udf(self):
        ingested_array_uris = file_ingestion.ingest_files_udf(
            dataset_uri=self.destination,
            file_uris=self.test_file_uris,
            acn=self.acn,
            namespace=self.namespace,
        )

        self.assertEqual(len(ingested_array_uris), len(self.test_file_uris))
        # Add arrays for cleanup on tearDown
        self.cleanup_arrays += ingested_array_uris


# def test_files_ingestion_udf_into_group(self):
# ingested_array_uris = file_ingestion.ingest_files_udf(
# dataset_uri=self.group_destination,
# file_uris=self.test_file_uris,
# acn=self.acn,
# namespace=self.namespace,
# )

# file_ingestion.add_arrays_to_group_udf(
# array_uris=ingested_array_uris,
# group_uri=self.group_uri,
# config=client.Ctx().config().dict(),
# verbose=True,
# )

# group_info = groups.info(self.group_uri)
# self.assertEqual(group_info.asset_count, len(self.test_file_uris))
# # Clean up
# _cleanup_residual_test_arrays(array_uris=ingested_array_uris)

# def test_add_array_to_group_udf_raises_bad_namespace_error(self):
# with self.assertRaises(tiledb.TileDBError):
# file_ingestion.add_arrays_to_group_udf(
# array_uris=[f"tiledb://{self.namespace}/{self.test_file_uris[0]}"],
# group_uri=f"tiledb://very-bad-namespace/{self.group_name}",
# config=client.Ctx().config().dict(),
# verbose=True,
# )

# def test_add_array_to_group_udf_non_existing_group_raises_value_error(self):
# with self.assertRaises(ValueError):
# file_ingestion.add_arrays_to_group_udf(
# array_uris=[f"tiledb://{self.namespace}/{self.test_file_uris[0]}"],
# group_uri=f"tiledb://{self.namespace}/non-existing-group",
# config=client.Ctx().config().dict(),
# verbose=True,
# )


# class TestFileIndexing(unittest.TestCase):
# @classmethod
# def setUpClass(cls) -> None:
# """
# Setup test files, group and destinations once before the file tests start.
# """
# cls.input_file_location = "s3://tiledb-unittest/groups/file_indexing_test_files"
# # Files with name "input_file_<n[0, 4]>.pdf" have already been placed
# # in the "cls.input_file_location"
# cls.input_file_names = [f"file_to_index_{i}.pdf" for i in range(5)]
# cls.test_file_uris = [
# f"{cls.input_file_location}/{fname}" for fname in cls.input_file_names
# ]

# cls.namespace, cls.storage_path, cls.acn = groups._default_ns_path_cred()
# cls.namespace = cls.namespace.rstrip("/")
# cls.storage_path = cls.storage_path.rstrip("/")
# cls.destination = (
# f"{cls.storage_path}/{testonly.random_name('file-indexing-test')}"
# )

# # Ingest test files for testing
# cls.ingested_array_uris = file_ingestion.ingest_files_udf(
# dataset_uri=cls.destination,
# file_uris=cls.test_file_uris,
# acn=cls.acn,
# namespace=cls.namespace,
# )

# return super().setUpClass()

# @classmethod
# def tearDownClass(cls) -> None:
# """Remove index testing residuals"""
# _cleanup_residual_test_arrays(array_uris=cls.ingested_array_uris)
# return super().tearDownClass()

# def tearDown(self) -> None:
# """Cleanup indexing arrays between tests"""
# groups.delete(self.created_index_uri, recursive=True)
# # FIXME: Not a nice way to cleanup vector search residuals:
# _cleanup_residual_test_arrays(
# array_uris=[
# f"tiledb://{self.namespace}/object_metadata",
# f"tiledb://{self.namespace}/updates",
# f"tiledb://{self.namespace}/shuffled_vectors",
# f"tiledb://{self.namespace}/shuffled_vector_ids",
# f"tiledb://{self.namespace}/partition_indexes",
# f"tiledb://{self.namespace}/partition_centroids",
# ]
# )
# return super().tearDown()

# @unittest.skip("Extremely slow execution times in the CI/CD client")
# def test_create_and_update_dataset_udf(self):
# with self.assertLogs(get_logger_wrapper()) as lg:
# # Create a vector search group with 1 file
# self.created_index_uri = file_indexing.create_dataset_udf(
# search_uri=self.input_file_location,
# index_uri=f"tiledb://{self.namespace}/{self.destination}",
# config=client.Ctx().config().dict(),
# max_files=3,
# )
# self.assertTrue("Creating dataset" in lg.output[0])

# # Update the group with all the available files
# file_indexing.create_dataset_udf(
# search_uri=self.input_file_location,
# index_uri=self.created_index_uri,
# config=client.Ctx().config().dict(),
# )
# self.assertTrue("Updating reader" in lg.output[1])
# index_group_info = groups.info(self.created_index_uri)
# self.assertIsNotNone(index_group_info)
# self.assertEqual(index_group_info.asset_count, 6)