
922 preprocessor acceleration #1004

Merged 72 commits on Dec 11, 2023
Changes from 63 commits
f2cacc2
accelerated define_column_types
IIaKyJIuH Nov 21, 2022
4a5a2cf
hotfix for pytests
IIaKyJIuH Nov 22, 2022
c0bff91
accelerated _clean_extra_spaces
IIaKyJIuH Nov 22, 2022
986d534
convert num col to str optimized
IIaKyJIuH Nov 23, 2022
c594f81
type inference fixes
IIaKyJIuH Nov 23, 2022
54d10ee
categorical.py/data_preprocessing.py refactored
IIaKyJIuH Nov 24, 2022
fcabc4d
fixed replacing inf with nan
IIaKyJIuH Nov 25, 2022
cf9447a
label encoder same refactoring
IIaKyJIuH Nov 28, 2022
7431c98
logical fix in label encoder
IIaKyJIuH Nov 28, 2022
a8b9c90
nans with cats in unique func fix
IIaKyJIuH Nov 30, 2022
243df8e
types fixes
IIaKyJIuH Dec 6, 2022
b6d5e77
minor improvements
IIaKyJIuH Dec 12, 2022
21f4ce4
minor conversation fix from PR
IIaKyJIuH Dec 14, 2022
001d8b1
fix format + rename semantically
IIaKyJIuH Dec 15, 2022
267f704
PR fixes
IIaKyJIuH Dec 29, 2022
d2d0f9e
PR fixes
IIaKyJIuH Jan 23, 2023
466a3ea
style fixes
IIaKyJIuH Feb 9, 2023
3978514
array creation via multiplication fix
IIaKyJIuH Feb 9, 2023
f5e1589
unified unimodal methods
IIaKyJIuH Feb 15, 2023
806acd1
+OperationTypesRepository type in operation.py
IIaKyJIuH Feb 15, 2023
d6dd5a9
add safer version of enum/strategies imports
IIaKyJIuH Feb 15, 2023
b6ecf9a
optimizations and style fixes
IIaKyJIuH Feb 17, 2023
08712b3
set psutil req with the one from golem
IIaKyJIuH Feb 21, 2023
571157c
bug fix
IIaKyJIuH Feb 28, 2023
b2e5f82
nan to num optimization
IIaKyJIuH Feb 28, 2023
4d42edb
optimized cat features transform
IIaKyJIuH Feb 28, 2023
8f96d2d
rid of for loops (v1)
IIaKyJIuH Mar 30, 2023
7568b2b
compound names fix
IIaKyJIuH Apr 18, 2023
d566947
simplified data_preprocessing.py
IIaKyJIuH Apr 18, 2023
eb7e28a
data_preprocessing logic fix
IIaKyJIuH Apr 19, 2023
2247e02
numpy's nonzero to flatzero
IIaKyJIuH Apr 19, 2023
9443d21
simplified data_types.py
IIaKyJIuH Apr 19, 2023
adc77b6
further opts
IIaKyJIuH Apr 27, 2023
5570180
cats ids via numpy
IIaKyJIuH Apr 27, 2023
1c044ee
numpy arr extend fix
IIaKyJIuH May 15, 2023
71560df
data_types.py cleanup
IIaKyJIuH May 15, 2023
9e89a92
lint fixes
IIaKyJIuH May 15, 2023
b91a993
lint fixes (v2)
IIaKyJIuH May 15, 2023
74fe8b8
supp_data typing upd
IIaKyJIuH May 16, 2023
d3534d0
ensure all column_types are of ndarray type
IIaKyJIuH May 16, 2023
08e221e
column types naming fix
IIaKyJIuH May 17, 2023
f9e47cf
remove unused f-string signs
IIaKyJIuH May 17, 2023
415bb0b
minor fixes
IIaKyJIuH Jul 5, 2023
3d70436
minor dct update fix
IIaKyJIuH Jul 7, 2023
6be9b2c
data_types.py further vectorization
IIaKyJIuH Jul 7, 2023
d39e623
preprocessing simplifications and logical fixes
IIaKyJIuH Jul 10, 2023
280822e
minor test lint fix
IIaKyJIuH Jul 10, 2023
e2e287a
minor polishing
IIaKyJIuH Jul 10, 2023
752b4ac
applymap simplification data_types.py
IIaKyJIuH Jul 11, 2023
0148e35
test_pipeline.py increase time constraint
IIaKyJIuH Jul 11, 2023
432d9ea
rename all *types to *type_ids
IIaKyJIuH Jul 31, 2023
3c338d9
rename column_types to col_type_ids
IIaKyJIuH Jul 31, 2023
3fddcc8
pandas version fix
IIaKyJIuH Jul 31, 2023
315ab99
inf condition simplification
IIaKyJIuH Jul 31, 2023
5240a6c
upd gitignore
IIaKyJIuH Aug 29, 2023
9a770e0
lint fixes
IIaKyJIuH Aug 30, 2023
8c793bb
typings
IIaKyJIuH Sep 1, 2023
ac1a577
Adding preprocessing data at once from API
aPovidlo Nov 27, 2023
3084851
Fixes in params, data preprocessor merging and fixes in tests
aPovidlo Nov 27, 2023
adaf590
Fixes for MultiModalData
aPovidlo Nov 28, 2023
097c163
Added new api param, fix in merge, fixes & editing tests
aPovidlo Nov 28, 2023
94b6af5
Fix param for test
aPovidlo Nov 29, 2023
be007cb
Fix bug in API
aPovidlo Nov 29, 2023
8e046f5
@kasyanovse requested improvements
aPovidlo Dec 5, 2023
cac26f6
Return fixes
aPovidlo Dec 5, 2023
5f62ef4
Return fixes (1)
aPovidlo Dec 6, 2023
b962148
Remove transformations to str categories
aPovidlo Dec 7, 2023
d5b0648
Return transformations to str for categories
aPovidlo Dec 7, 2023
8be836e
Fix control_categorical for label encoder
aPovidlo Dec 7, 2023
a91c9ba
Fix log message
aPovidlo Dec 7, 2023
304e29f
Small fixes with merger
aPovidlo Dec 8, 2023
0c48f7f
@andreygetmanov requested fixes
aPovidlo Dec 11, 2023
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,4 +1,5 @@
.idea/
.vscode/
**/.pytest_cache/
**/__pycache__/
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
@@ -76,4 +77,4 @@ dist/
test/unit/test_log.log
test/unit/catboost_info

local/
local/
54 changes: 54 additions & 0 deletions fedot/api/api_utils/api_data.py
@@ -1,7 +1,10 @@
import sys
from datetime import datetime
from typing import Dict, Union
from typing import Optional

import numpy as np
from golem.core.log import default_log

from fedot.api.api_utils.data_definition import data_strategy_selector, FeaturesType, TargetType
from fedot.core.data.data import InputData, OutputData, data_type_is_table
@@ -10,6 +13,7 @@
from fedot.core.pipelines.pipeline import Pipeline
from fedot.core.pipelines.ts_wrappers import in_sample_ts_forecast, convert_forecast_to_output
from fedot.core.repository.tasks import Task, TaskTypesEnum
from fedot.core.utils import convert_memory_size
from fedot.preprocessing.dummy_preprocessing import DummyPreprocessor
from fedot.preprocessing.preprocessing import DataPreprocessor

@@ -39,6 +43,8 @@ def __init__(self, task: Task, use_input_preprocessing: bool = True):
self._recommendations = {'cut': self.preprocessor.cut_dataset,
'label_encoded': self.preprocessor.label_encoding_for_fit}

self.log = default_log(self)

def define_data(self,
features: FeaturesType,
target: Optional[TargetType] = None,
@@ -123,3 +129,51 @@ def accept_and_apply_recommendations(self, input_data: Union[InputData, MultiMod
for name, rec in recommendations.items():
# Apply desired preprocessing function
self._recommendations[name](input_data, *rec.values())

def fit_transform(self, train_data: InputData) -> InputData:
start_time = datetime.now()
self.log.message('Preprocessing data')
memory_usage = convert_memory_size(sys.getsizeof(train_data.features))
features_shape = train_data.features.shape
target_shape = train_data.target.shape
self.log.message(
f'Train Data (Original) Memory Usage: {memory_usage} Data Shapes: {features_shape, target_shape}')

train_data = self.preprocessor.obligatory_prepare_for_fit(data=train_data)
train_data = self.preprocessor.optional_prepare_for_fit(pipeline=Pipeline(), data=train_data)
train_data = self.preprocessor.convert_indexes_for_fit(pipeline=Pipeline(), data=train_data)
train_data.supplementary_data.is_auto_preprocessed = True

memory_usage = convert_memory_size(sys.getsizeof(train_data.features))
features_shape = train_data.features.shape
target_shape = train_data.target.shape
self.log.message(
f'Train Data (Processed) Memory Usage: {memory_usage} Data Shapes: {features_shape, target_shape}')
self.log.message(f'Data preprocessing runtime = {datetime.now() - start_time}')

return train_data

def transform(self, test_data: InputData, current_pipeline) -> InputData:
start_time = datetime.now()
self.log.message('Preprocessing data')
memory_usage = convert_memory_size(sys.getsizeof(test_data.features))
features_shape = test_data.features.shape
target_shape = test_data.target.shape
self.log.message(
f'Test Data (Original) Memory Usage: {memory_usage} Data Shapes: {features_shape, target_shape}')

test_data = self.preprocessor.obligatory_prepare_for_predict(data=test_data)
test_data = self.preprocessor.optional_prepare_for_predict(pipeline=current_pipeline, data=test_data)
test_data = self.preprocessor.convert_indexes_for_predict(pipeline=current_pipeline, data=test_data)
test_data = self.preprocessor.update_indices_for_time_series(test_data)
test_data.supplementary_data.is_auto_preprocessed = True

memory_usage = convert_memory_size(sys.getsizeof(test_data.features))
features_shape = test_data.features.shape
target_shape = test_data.target.shape
self.log.message(
f'Test Data (Processed) Memory Usage: {memory_usage} Data Shapes: {features_shape, target_shape}')
self.log.message(f'Data preprocessing runtime = {datetime.now() - start_time}')

return test_data
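Both new methods follow the same shape: log memory usage and array shapes, run the prepare steps in order, log again, and report the runtime. A minimal self-contained sketch of that pattern, with a hypothetical `human_size` helper standing in for FEDOT's `convert_memory_size` (which is not reproduced here):

```python
import sys
from datetime import datetime

import numpy as np


def human_size(num_bytes: float) -> str:
    # Stand-in for fedot.core.utils.convert_memory_size (assumed behavior).
    for unit in ('B', 'KB', 'MB', 'GB'):
        if num_bytes < 1024:
            return f'{num_bytes:.1f} {unit}'
        num_bytes /= 1024
    return f'{num_bytes:.1f} TB'


def with_logging(prepare_steps, data: np.ndarray, log=print) -> np.ndarray:
    """Apply prepare_steps in order, logging size and shape before and
    after, mirroring the structure of fit_transform/transform above."""
    start = datetime.now()
    log(f'Original: {human_size(sys.getsizeof(data))}, shape {data.shape}')
    for step in prepare_steps:
        data = step(data)
    log(f'Processed: {human_size(sys.getsizeof(data))}, shape {data.shape}')
    log(f'Runtime = {datetime.now() - start}')
    return data


out = with_logging([lambda d: d.astype(np.float32)], np.zeros((100, 4)))
```

The sketch only illustrates the logging envelope; the real methods delegate the steps to `DataPreprocessor`.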

1 change: 1 addition & 0 deletions fedot/api/api_utils/api_params_repository.py
@@ -62,6 +62,7 @@ def default_params_for_task(task_type: TaskTypesEnum) -> dict:
use_pipelines_cache=True,
use_preprocessing_cache=True,
use_input_preprocessing=True,
use_auto_preprocessing=False,
use_meta_rules=False,
cache_dir=default_fedot_data_dir(),
keep_history=True,
17 changes: 4 additions & 13 deletions fedot/api/api_utils/input_analyser.py
@@ -1,19 +1,16 @@
from functools import partial
from inspect import signature
from typing import Any, Dict, Tuple, Union

import numpy as np
from typing import Dict, Tuple, Any, Union

from golem.core.log import default_log

from fedot.core.composer.meta_rules import get_cv_folds_number, get_recommended_preset, \
get_early_stopping_generations
from fedot.core.composer.meta_rules import get_cv_folds_number, get_early_stopping_generations, get_recommended_preset
from fedot.core.data.data import InputData
from fedot.core.data.data_preprocessing import find_categorical_columns
from fedot.core.data.multi_modal import MultiModalData
from fedot.core.repository.dataset_types import DataTypesEnum


meta_rules = [get_cv_folds_number,
get_recommended_preset,
get_early_stopping_generations]
@@ -118,11 +115,5 @@ def control_categorical(self, input_data: InputData) -> bool:
"""

categorical_ids, _ = find_categorical_columns(input_data.features)
Collaborator:
Finding the categorical columns takes time, so it would be better to take the categorical column indices from input_data, or to store them there once they have been determined.

Collaborator:
I agree with you. It would be worth tracing where this function is called from. The saving is done in order to extract the unencoded categorical features; I added such a field to InputData, which is filled at one of the preprocessing stages.

However, I think this could be filed as an issue and handled as a follow-up step rather than in this PR.
all_cardinality = 0
need_label = False
for idx in categorical_ids:
all_cardinality += np.unique(input_data.features[:, idx].astype(str)).shape[0]
if all_cardinality > self.max_cat_cardinality:
need_label = True
break
return need_label
uniques = np.unique(input_data.features[:, categorical_ids].astype(str))
Collaborator:

1. Why cast to str?
2. The previous version took into account that different columns may contain identical values. The new version does not, although logically it should.
3. If the cast to str is dropped, the equal_nan argument will have to be added; that should still be faster than casting to str.

Collaborator:

> Why cast to str?

These are categorical features, so they may well contain numbers such as 1, 2, 3, etc. That was probably the idea behind converting to str.

> The previous version took into account that different columns may contain identical values. The new version does not. If the cast to str is dropped, the equal_nan argument will have to be added; that should be faster than casting to str.

I did not follow. Can you elaborate?

Collaborator:

> These are categorical features, so they may well contain numbers such as 1, 2, 3, etc. That was probably the idea behind converting to str.

Strange, because '1' and 1 would then end up as the same category.

> I did not follow. Can you elaborate?

In the new code, a 1 in the first column and a 1 in any other column count as one unique category. In the old code, a 1 in the first column is a unique value of the first column, and a 1 in the second column is a unique value of the second.
return len(uniques) > self.max_cat_cardinality
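The counting difference discussed in the thread above is easy to demonstrate with plain NumPy; this is a sketch on made-up data, not FEDOT code:

```python
import numpy as np

# Two categorical columns that share the value "a".
features = np.array([["a", "a"],
                     ["b", "a"],
                     ["a", "c"]], dtype=object)
categorical_ids = [0, 1]

# Old behavior: cardinalities are counted per column and summed, so the
# "a" in column 0 and the "a" in column 1 are distinct categories.
per_column = sum(
    np.unique(features[:, idx].astype(str)).shape[0]
    for idx in categorical_ids
)  # column 0 -> {"a", "b"}, column 1 -> {"a", "c"}: 2 + 2 = 4

# New behavior: one np.unique over the whole slice deduplicates values
# across columns, so the shared "a" is counted only once.
global_unique = len(np.unique(features[:, categorical_ids].astype(str)))
# {"a", "b", "c"}: 3

print(per_column, global_unique)
```

The vectorized version is faster, but as the reviewer notes it can undercount the total cardinality whenever columns share values, which may flip the `need_label` decision near the threshold.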
2 changes: 2 additions & 0 deletions fedot/api/builder.py
@@ -330,6 +330,7 @@ def setup_data_preprocessing(
safe_mode: bool = DEFAULT_VALUE,
use_input_preprocessing: bool = DEFAULT_VALUE,
use_preprocessing_cache: bool = DEFAULT_VALUE,
use_auto_preprocessing: bool = DEFAULT_VALUE,
) -> FedotBuilder:
""" Sets parameters of input data preprocessing.

@@ -351,6 +352,7 @@
safe_mode=safe_mode,
use_input_preprocessing=use_input_preprocessing,
use_preprocessing_cache=use_preprocessing_cache,
use_auto_preprocessing=use_auto_preprocessing,
)
return self

15 changes: 12 additions & 3 deletions fedot/api/main.py
@@ -156,6 +156,9 @@ def fit(self,

self._init_remote_if_necessary()

if isinstance(self.train_data, InputData) and self.params.get('use_auto_preprocessing'):
self.train_data = self.data_processor.fit_transform(self.train_data)
Comment on lines +159 to +160

Collaborator:
Why only for InputData?

Collaborator:
I did it this way because MultiModalData has no supplementary_data, which made the tests fail. I think it needs a different approach: auto-preprocess it only when it contains tabular data. I do not yet know the best way to do that.


if predefined_model is not None:
# Fit predefined model and return it without composing
self.current_pipeline = PredefinedModel(predefined_model, self.train_data, self.log,
@@ -175,9 +178,12 @@
else:
self.log.message('Already fitted initial pipeline is used')

# Store data encoder in the pipeline if it is required
# Merge API & pipelines encoders if it is required
self.current_pipeline.preprocessor = BasePreprocessor.merge_preprocessors(
self.data_processor.preprocessor, self.current_pipeline.preprocessor)
api_preprocessor=self.data_processor.preprocessor,
pipeline_preprocessor=self.current_pipeline.preprocessor,
use_input_preprocessing=self.params.get('use_auto_preprocessing')
)

self.log.message(f'Final pipeline: {graph_structure(self.current_pipeline)}')

@@ -258,6 +264,9 @@ def predict(self,
self.test_data = self.data_processor.define_data(target=self.target, features=features, is_predict=True)
self._is_in_sample_prediction = in_sample

if isinstance(self.test_data, InputData) and self.params.get('use_auto_preprocessing'):
self.test_data = self.data_processor.transform(self.test_data, self.current_pipeline)

self.prediction = self.data_processor.define_predictions(current_pipeline=self.current_pipeline,
test_data=self.test_data,
in_sample=self._is_in_sample_prediction,
@@ -521,4 +530,4 @@ def _train_pipeline_on_full_dataset(self, recommendations: Optional[dict],
self.current_pipeline.fit(
full_train_not_preprocessed,
n_jobs=self.params.n_jobs
)
)
9 changes: 6 additions & 3 deletions fedot/core/data/data.py
@@ -528,11 +528,14 @@ def subset_indices(self, selected_idx: List):
target=self.target[row_nums],
task=self.task, data_type=self.data_type)

def subset_features(self, features_ids: list):
"""Return new :obj:`InputData` with subset of features based on ``features_ids`` list
def subset_features(self, feature_ids: list) -> Optional[InputData]:
"""
Return a new :obj:`InputData` with a subset of features selected by a non-empty ``feature_ids`` list, or ``None`` otherwise
"""
if not feature_ids:
return None
Comment on lines +535 to +536

Collaborator:
Maybe it would be better to raise an error?

Collaborator:
I think the None is not accidental. If you look at how the function is used, this behavior is what callers expect. If there are no such indices, e.g. no categorical ones, the resulting data should be empty too, i.e. None.


subsample_features = self.features[:, features_ids]
subsample_features = self.features[:, feature_ids]
subsample_input = InputData(features=subsample_features,
data_type=self.data_type,
target=self.target, task=self.task,
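The contract the reviewers discuss above (empty id list yields None rather than a zero-width array) can be sketched as a standalone function; this is a plain-NumPy stand-in for the `InputData` method, not FEDOT code:

```python
from typing import List, Optional

import numpy as np


def subset_features(features: np.ndarray, feature_ids: List[int]) -> Optional[np.ndarray]:
    """Mirror of the InputData.subset_features contract from the diff:
    an empty id list means 'no such features', so return None instead of
    an empty (n, 0) array that downstream code would have to special-case."""
    if not feature_ids:
        return None
    return features[:, feature_ids]


table = np.arange(12).reshape(4, 3)
no_subset = subset_features(table, [])       # e.g. no categorical columns found
two_cols = subset_features(table, [0, 2])    # keep the first and last column
```

Callers then use a simple `is None` check to detect the "no features of this kind" case, which matches the usage the second reviewer describes.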