Merge branch 'master' into feat/docs_strings_memory_storage
drobnikj authored Jan 26, 2023
2 parents 53b6d1a + 4674728 commit 4bc5ab7
Showing 25 changed files with 783 additions and 37 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/lint_and_test.yaml
@@ -29,4 +29,4 @@ jobs:
run: make type-check

- name: Unit tests
run: make test
run: make unit-tests
20 changes: 10 additions & 10 deletions .github/workflows/pr_toolkit.yml
@@ -1,23 +1,23 @@
name: Apify pull request toolkit
# For more info see: https://github.com/apify/pull-request-toolkit-action/
name: Apify PR toolkit

on:
pull_request:
branches:
- master
types: ['opened', 'reopened', 'synchronize', 'labeled', 'unlabeled', 'edited', 'ready_for_review'] # The first 3 are default.

concurrency: # This is to make sure that it's executed only for the most recent changes of the PR.
group: ${{ github.ref }}
cancel-in-progress: true

jobs:
apify-pr-toolkit:
name: Run the Apify pull request toolkit
runs-on: ubuntu-20.04
runs-on: ubuntu-latest
steps:
- name: clone pull-request-toolkit-action
uses: actions/checkout@v3
with:
repository: apify/pull-request-toolkit-action
path: ./.github/actions/pull-request-toolkit-action

- name: run pull-request-toolkit action
uses: ./.github/actions/pull-request-toolkit-action
uses: apify/pull-request-toolkit-action@main
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
org-token: ${{ secrets.PULL_REQUEST_TOOLKIT_ACTION_GITHUB_TOKEN }}
zenhub-token: ${{ secrets.PULL_REQUEST_TOOLKIT_ACTION_ZENHUB_TOKEN }}
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -48,7 +48,7 @@ jobs:
run: make type-check

- name: Unit tests
run: make test
run: make unit-tests

check_docs:
name: Check whether the documentation is up to date
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -13,9 +13,9 @@ repos:
language: system
pass_filenames: false

- id: unit-test
- id: unit-tests
name: "Run unit tests"
entry: "make test"
entry: "make unit-tests"
language: system
pass_filenames: false

9 changes: 6 additions & 3 deletions Makefile
@@ -12,13 +12,16 @@ install-dev:
lint:
python3 -m flake8

test:
python3 -m pytest -ra tests
unit-tests:
python3 -m pytest -n auto -ra tests/unit

integration-tests:
python3 -m pytest -ra tests/integration

type-check:
python3 -m mypy

check-code: lint type-check test
check-code: lint type-check unit-tests

format:
python3 -m isort src tests
11 changes: 10 additions & 1 deletion README.md
@@ -42,10 +42,19 @@ To install this package and its development dependencies, run `make install-dev`

We use `autopep8` and `isort` to automatically format the code to a common style. To run the formatting, just run `make format`.

### Linting and Testing
### Linting, type-checking and unit testing

We use `flake8` for linting, `mypy` for type checking and `pytest` for unit testing. To run these tools, just run `make check-code`.

### Integration tests

We have integration tests that build and run actors using the Python SDK on the Apify Platform.
To run these tests, you need to set the `APIFY_TEST_USER_API_TOKEN` environment variable to the API token of the Apify user you want to use for the tests,
and then start them with `make integration-tests`.

If you want to run the integration tests against an environment other than the main Apify Platform,
set the `APIFY_INTEGRATION_TESTS_API_URL` environment variable to the URL of the Apify API instance you want to use.

### Documentation

We use the [Google docstring format](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html) for documenting the code.
55 changes: 52 additions & 3 deletions docs/docs.md
@@ -146,7 +146,23 @@ That’s useful if you want to use the client as a different Apify user than the

#### async classmethod open_dataset(dataset_id_or_name=None, \*, force_cloud=False)

TODO: docs.
Open a dataset.

Datasets are used to store structured data where each object stored has the same attributes,
such as online store products or real estate offers.
The actual data is stored either on the local filesystem or in the Apify cloud.

* **Parameters**

* **dataset_id_or_name** (`str`, *optional*) – ID or name of the dataset to be opened.
If not provided, the method returns the default dataset associated with the actor run.

* **force_cloud** (`bool`, *optional*) – If set to `True`, the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.

* **Returns**

An instance of the Dataset class for the given ID or name.

* **Return type**

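A minimal usage sketch (the dataset name and record are hypothetical; it assumes the calls run inside an actor run, i.e. within an `async with Actor:` block, and uses `Dataset.push_data()` as the typical follow-up call):

```python
from apify import Actor

async def main() -> None:
    async with Actor:
        # Default dataset of this actor run (local or cloud, depending on the environment)
        dataset = await Actor.open_dataset()
        await dataset.push_data({'name': 'Example product', 'price': 9.99})

        # A named dataset, forced to live in the Apify cloud
        cloud_dataset = await Actor.open_dataset('my-results', force_cloud=True)
```
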
@@ -156,7 +172,23 @@ TODO: docs.

#### async classmethod open_key_value_store(key_value_store_id_or_name=None, \*, force_cloud=False)

TODO: docs.
Open a key-value store.

Key-value stores are used to store records or files, along with their MIME content type.
The records are stored and retrieved using a unique key.
The actual data is stored either on a local filesystem or in the Apify cloud.

* **Parameters**

* **key_value_store_id_or_name** (`str`, *optional*) – ID or name of the key-value store to be opened.
If not provided, the method returns the default key-value store associated with the actor run.

* **force_cloud** (`bool`, *optional*) – If set to `True`, the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.

* **Returns**

An instance of the KeyValueStore class for the given ID or name.

* **Return type**

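A short sketch of typical usage (the key and value are hypothetical; assumes an active actor run):

```python
from apify import Actor

async def main() -> None:
    async with Actor:
        store = await Actor.open_key_value_store()

        # Records are stored and retrieved under a unique key
        await store.set_value('crawl-state', {'last_page': 5})
        state = await store.get_value('crawl-state')
```
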
@@ -166,7 +198,24 @@ TODO: docs.

#### async classmethod open_request_queue(request_queue_id_or_name=None, \*, force_cloud=False)

TODO: docs.
Open a request queue.

A request queue represents a queue of URLs to crawl, which is stored either on the local filesystem or in the Apify cloud.
The queue is used for deep crawling of websites, where you start with several URLs and then
recursively follow links to other pages. The data structure supports both breadth-first
and depth-first crawling orders.

* **Parameters**

* **request_queue_id_or_name** (`str`, *optional*) – ID or name of the request queue to be opened.
If not provided, the method returns the default request queue associated with the actor run.

* **force_cloud** (`bool`, *optional*) – If set to `True`, the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.

* **Returns**

An instance of the RequestQueue class for the given ID or name.

* **Return type**

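A sketch of the crawling loop this enables (the URL is a placeholder; assumes an active actor run):

```python
from apify import Actor

async def main() -> None:
    async with Actor:
        queue = await Actor.open_request_queue()
        await queue.add_request({'url': 'https://example.com'})

        # Drain the queue; fetch_next_request() returns None once nothing is left
        while (request := await queue.fetch_next_request()) is not None:
            # ... fetch the page, enqueue discovered links via add_request() ...
            await queue.mark_request_as_handled(request)
```
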
1 change: 1 addition & 0 deletions setup.py
@@ -77,6 +77,7 @@
'pytest ~= 7.2.0',
'pytest-asyncio ~= 0.20.3',
'pytest-randomly ~= 3.12.0',
'pytest-xdist ~= 3.1.0',
'respx ~= 0.20.1',
'sphinx ~= 5.3.0',
'sphinx-autodoc-typehints ~= 1.19.5',
50 changes: 47 additions & 3 deletions src/apify/actor.py
@@ -478,7 +478,22 @@ def _get_storage_client(self, force_cloud: bool) -> Optional[ApifyClientAsync]:

@classmethod
async def open_dataset(cls, dataset_id_or_name: Optional[str] = None, *, force_cloud: bool = False) -> Dataset:
"""TODO: docs."""
"""Open a dataset.
Datasets are used to store structured data where each object stored has the same attributes,
such as online store products or real estate offers.
The actual data is stored either on the local filesystem or in the Apify cloud.
Args:
dataset_id_or_name (str, optional): ID or name of the dataset to be opened.
If not provided, the method returns the default dataset associated with the actor run.
force_cloud (bool, optional): If set to `True` then the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.
Returns:
Dataset: An instance of the `Dataset` class for the given ID or name.
"""
return await cls._get_default_instance().open_dataset(dataset_id_or_name=dataset_id_or_name, force_cloud=force_cloud)

async def _open_dataset_internal(self, dataset_id_or_name: Optional[str] = None, *, force_cloud: bool = False) -> Dataset:
@@ -488,7 +503,21 @@ async def _open_dataset_internal(self, dataset_id_or_name: Optional[str] = None,

@classmethod
async def open_key_value_store(cls, key_value_store_id_or_name: Optional[str] = None, *, force_cloud: bool = False) -> KeyValueStore:
"""TODO: docs."""
"""Open a key-value store.
Key-value stores are used to store records or files, along with their MIME content type.
The records are stored and retrieved using a unique key.
The actual data is stored either on a local filesystem or in the Apify cloud.
Args:
key_value_store_id_or_name (str, optional): ID or name of the key-value store to be opened.
If not provided, the method returns the default key-value store associated with the actor run.
force_cloud (bool, optional): If set to `True` then the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.
Returns:
KeyValueStore: An instance of the `KeyValueStore` class for the given ID or name.
"""
return await cls._get_default_instance().open_key_value_store(key_value_store_id_or_name=key_value_store_id_or_name, force_cloud=force_cloud)

async def _open_key_value_store_internal(self, key_value_store_id_or_name: Optional[str] = None, *, force_cloud: bool = False) -> KeyValueStore:
@@ -498,7 +527,22 @@ async def _open_key_value_store_internal(self, key_value_store_id_or_name: Optio

@classmethod
async def open_request_queue(cls, request_queue_id_or_name: Optional[str] = None, *, force_cloud: bool = False) -> RequestQueue:
"""TODO: docs."""
"""Open a request queue.
A request queue represents a queue of URLs to crawl, which is stored either on the local filesystem or in the Apify cloud.
The queue is used for deep crawling of websites, where you start with several URLs and then
recursively follow links to other pages. The data structure supports both breadth-first
and depth-first crawling orders.
Args:
request_queue_id_or_name (str, optional): ID or name of the request queue to be opened.
If not provided, the method returns the default request queue associated with the actor run.
force_cloud (bool, optional): If set to `True` then the Apify cloud storage is always used.
This way it is possible to combine local and cloud storage.
Returns:
RequestQueue: An instance of the `RequestQueue` class for the given ID or name.
"""
return await cls._get_default_instance().open_request_queue(request_queue_id_or_name=request_queue_id_or_name, force_cloud=force_cloud)

async def _open_request_queue_internal(
2 changes: 1 addition & 1 deletion src/apify/config.py
@@ -52,8 +52,8 @@ def __init__(self) -> None:
self.token = _fetch_and_parse_env_var(ApifyEnvVars.TOKEN)
self.user_id = _fetch_and_parse_env_var(ApifyEnvVars.USER_ID)
self.xvfb = _fetch_and_parse_env_var(ApifyEnvVars.XVFB, False)
self.system_info_interval_millis = _fetch_and_parse_env_var(ApifyEnvVars.SYSTEM_INFO_INTERVAL_MILLIS, 60000)

self.system_info_interval_millis = 60000
self.max_used_cpu_ratio = 0.95

@classmethod
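In practical terms, the reporting interval can now be overridden from the environment. A quick sketch (assuming the variable is read when the `Configuration` object is constructed, as the diff suggests):

```python
import os

# Must be set before the Configuration instance is created
os.environ['APIFY_SYSTEM_INFO_INTERVAL_MILLIS'] = '30000'

from apify.config import Configuration

config = Configuration()
print(config.system_info_interval_millis)  # 30000 instead of the 60000 default
```
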
2 changes: 2 additions & 0 deletions src/apify/consts.py
@@ -66,6 +66,7 @@ class ApifyEnvVars(str, Enum):
METAMORPH_AFTER_SLEEP_MILLIS = 'APIFY_METAMORPH_AFTER_SLEEP_MILLIS'
PERSIST_STATE_INTERVAL_MILLIS = 'APIFY_PERSIST_STATE_INTERVAL_MILLIS'
PURGE_ON_START = 'APIFY_PURGE_ON_START'
SYSTEM_INFO_INTERVAL_MILLIS = 'APIFY_SYSTEM_INFO_INTERVAL_MILLIS'


_INTEGER_ENV_VARS_TYPE = Literal[
@@ -76,6 +77,7 @@ class ApifyEnvVars(str, Enum):
ApifyEnvVars.METAMORPH_AFTER_SLEEP_MILLIS,
ApifyEnvVars.PERSIST_STATE_INTERVAL_MILLIS,
ApifyEnvVars.PROXY_PORT,
ApifyEnvVars.SYSTEM_INFO_INTERVAL_MILLIS,
]

INTEGER_ENV_VARS: List[_INTEGER_ENV_VARS_TYPE] = list(get_args(_INTEGER_ENV_VARS_TYPE))
16 changes: 12 additions & 4 deletions src/apify/storage_client_manager.py
@@ -7,7 +7,7 @@


class StorageClientManager:
"""TODO: docs."""
"""A class for managing storage clients."""

_config: Configuration

@@ -16,18 +16,26 @@ class StorageClientManager:
_default_instance: Optional['StorageClientManager'] = None

def __init__(self) -> None:
"""TODO: docs."""
"""Create a `StorageClientManager` instance."""
self._config = Configuration.get_global_configuration()
self._client = MemoryStorage(persist_storage=self._config.persist_storage)

@classmethod
def get_storage_client(cls) -> Union[ApifyClientAsync, MemoryStorage]:
"""TODO: docs."""
"""Get the current storage client instance.
Returns:
ApifyClientAsync or MemoryStorage: The current storage client instance.
"""
return cls._get_default_instance()._client

@classmethod
def set_storage_client(cls, client: Union[ApifyClientAsync, MemoryStorage]) -> None:
"""TODO: docs."""
"""Set the storage client.
Args:
client (ApifyClientAsync or MemoryStorage): The instance of a storage client.
"""
cls._get_default_instance()._client = client

@classmethod
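For illustration, a sketch of swapping the default `MemoryStorage` client for the Apify API client (the token value is a placeholder):

```python
from apify_client import ApifyClientAsync

from apify.storage_client_manager import StorageClientManager

# Route all subsequently opened storages through the Apify API
apify_client = ApifyClientAsync(token='<your-api-token>')
StorageClientManager.set_storage_client(apify_client)

assert StorageClientManager.get_storage_client() is apify_client
```
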
34 changes: 26 additions & 8 deletions src/apify/storages/storage_manager.py
@@ -15,7 +15,7 @@


class Storage(Protocol[T]):
"""TODO: Docs."""
"""A protocol defining common interface for storage classes."""

@classmethod
def _create_instance(cls, storage_id_or_name: str, client: Union[ApifyClientAsync, MemoryStorage]) -> T: # noqa: U100
@@ -33,16 +33,14 @@ async def _purge_default_storages(client: Union[ApifyClientAsync, MemoryStorage]


class StorageManager:
"""TODO: docs."""
"""A class for managing storages."""

_default_instance: Optional['StorageManager'] = None
_cache: Dict[Type[Storage], Dict[str, Storage]]
_config: Configuration

def __init__(self) -> None:
"""TODO: docs."""
"""Create a `StorageManager` instance."""
self._cache = {}
self._config = Configuration.get_global_configuration()

@classmethod
def _get_default_instance(cls) -> 'StorageManager':
@@ -59,9 +57,23 @@ async def open_storage(
client: Optional[Union[ApifyClientAsync, MemoryStorage]] = None,
config: Optional[Configuration] = None,
) -> T:
"""TODO: docs."""
"""Open a storage of the given class, or return a cached storage object if it was opened before.
Opens a new storage (`Dataset`, `KeyValueStore`, or `RequestQueue`) with the given ID or name.
Returns the cached storage object if the storage was opened before.
Args:
storage_class (Type[Dataset] or Type[KeyValueStore] or Type[RequestQueue]): Class of the storage to be opened.
storage_id_or_name (str, optional): ID or name of the storage to be opened. If omitted, an unnamed storage will be opened.
client (ApifyClientAsync or MemoryStorage, optional): The storage client which should be used in the storage.
If omitted, the default client will be used.
config (Configuration, optional): The actor configuration to be used in this call. If omitted, the global configuration will be used.
Returns:
An instance of the storage given by `storage_class`.
"""
storage_manager = StorageManager._get_default_instance()
used_config = config or storage_manager._config
used_config = config or Configuration.get_global_configuration()
used_client = client or StorageClientManager.get_storage_client()

# Create cache for the given storage class if missing
Expand Down Expand Up @@ -93,7 +105,13 @@ async def open_storage(

@classmethod
async def close_storage(cls, storage_class: Type[Storage], id: str, name: Optional[str]) -> None:
"""TODO: docs."""
"""Close the given storage by removing it from the cache.
Args:
storage_class (Type[Dataset] or Type[KeyValueStore] or Type[RequestQueue]): Class of the storage to be closed.
id (str): ID of the storage to be closed.
name (str, optional): Name of the storage to be closed.
"""
storage_manager = StorageManager._get_default_instance()
del storage_manager._cache[storage_class][id]
if name is not None:
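To illustrate the caching behaviour the docstring describes, a sketch (assuming the parameter names from the docstring, and that `Dataset` and `StorageManager` are importable from `apify.storages`):

```python
from apify.storages import Dataset, StorageManager

async def demo() -> None:
    # The first call creates and caches the storage ...
    first = await StorageManager.open_storage(storage_class=Dataset, storage_id_or_name='my-dataset')
    # ... and a repeated call with the same arguments returns the cached object
    second = await StorageManager.open_storage(storage_class=Dataset, storage_id_or_name='my-dataset')
    assert first is second
```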