Safe mode implementation
Created by: valer1435
Sometimes we face very large table datasets (when M×N is too big), which can cause memory problems. This pull request addresses that by cutting the training data. It also resolves the problem of encoding categorical features with high cardinality: label encoding is used instead of one-hot encoding when the summary cardinality exceeds a constant threshold.
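For orientation, the decision logic described above can be sketched as follows. The function and constant names here are illustrative, not the PR's actual API; the real thresholds live in the data analyser class quoted later in this thread.

```python
import numpy as np

# Illustrative thresholds; the PR hardcodes similar constants in its data analyser.
MAX_SIZE = 600_000            # maximum allowed number of table cells (M x N)
MAX_CAT_CARDINALITY = 1_000   # maximum summary cardinality of categorical columns

def recommend_safe_mode_actions(features: np.ndarray, categorical_ids) -> dict:
    """Suggest which safe-mode actions to apply to a table dataset (sketch)."""
    recommendations = {}
    n_rows, n_cols = features.shape
    if n_rows * n_cols > MAX_SIZE:
        # Dataset is too large: recommend cutting the training data down
        # to a row budget that fits the size limit.
        recommendations['cut'] = {'border': MAX_SIZE // n_cols}
    summary_cardinality = sum(len(np.unique(features[:, idx]))
                              for idx in categorical_ids)
    if summary_cardinality > MAX_CAT_CARDINALITY:
        # One-hot encoding would explode the feature count:
        # recommend switching to label encoding (and tree-based models).
        recommendations['label'] = True
    return recommendations
```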
Created by: pep8speaks
Hello @valer1435! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers!
Comment last updated at 2021-12-11 17:52:24 UTC
Created by: codecov[bot]
Codecov Report
Merging #514 (80741f45) into master (dccf5d2c) will increase coverage by 0.06%. The diff coverage is 96.63%.

```text
@@            Coverage Diff             @@
##           master     #514      +/-   ##
==========================================
+ Coverage   78.25%   78.32%    +0.06%
==========================================
  Files         159      160        +1
  Lines       10979    11080      +101
==========================================
+ Hits         8592     8678       +86
- Misses       2387     2402       +15
==========================================
```
| Impacted Files | Coverage Δ |
| --- | --- |
| ...mentations/data_operations/categorical_encoders.py | 95.18% <ø> (-0.06%) |
| fedot/api/api_utils/presets.py | 87.75% <62.50%> (-4.93%) |
| fedot/preprocessing/preprocessing.py | 94.42% <96.55%> (-1.85%) |
| fedot/api/api_utils/api_data.py | 88.05% <100.00%> (+1.85%) |
| fedot/api/api_utils/api_data_analyser.py | 100.00% <100.00%> (ø) |
| fedot/api/api_utils/params.py | 96.25% <100.00%> (+1.08%) |
| fedot/api/main.py | 78.10% <100.00%> (+0.94%) |
| fedot/core/pipelines/tuning/hyperparams.py | 93.61% <0.00%> (-6.39%) |
| fedot/api/api_utils/api_composer.py | 84.33% <0.00%> (-4.22%) |
| ... and 10 more | |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update dccf5d2...80741f4. Read the comment docs.
```python
    All methods are inplace to prevent unnecessary copies of large datasets.
    Its functionality is:
    1) Cut large datasets to prevent memory overflow
    2) Use a label encoder with tree models instead of OneHot when the summary
       cardinality of categorical features is high
    """
    def __init__(self, safe_mode, preprocessor: ApiDataProcessor):
        self.safe_mode = safe_mode
        self.max_size = 600000
        self.max_cat_cardinality = 1000
        self.data_preprocessor = preprocessor

    def safe_preprocess(self, input_data: InputData):
        """ Performs preprocessing that prevents a crash due to memory
        overflow when safe_mode is on
        :param input_data: data for preprocessing
        """
```
Created by: Dreamlone
Question: after the composer has run, is the final fit performed on the full data or on the cut data? It should be on the full data.
Ah, yes, right now it will of course be performed on the cut data, since all the transformation operations are done in place. Apparently we will have to store a copy of the original dataset somewhere.
```diff
     Encode categorical features to numerical. In addition,
     save the encoders to use later for prediction data.

     :param data: data to transform
+    :param is_label: whether to use the label encoder
     :return encoder: operation for preprocessing categorical features
     """

-    transformed, encoder = self._create_encoder(data)
+    transformed, encoder = self._create_encoder(data, is_label)
     data.features = transformed
-    # Store encoder to make prediction in th future
+    # Store encoder to make prediction in the future
     self.features_encoder = encoder

+def cut_dataset(self, data: InputData, border):
```

Created by: valer1435
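The hunk above only introduces the `cut_dataset` signature. A possible body, sketched under the assumption that `InputData` keeps `features`, `target`, and `idx` as aligned numpy arrays and that the method works in place (as the class docstring earlier in the thread states), could be:

```python
import numpy as np

def cut_dataset(self, data: 'InputData', border: int):
    """Cut the dataset in place so that at most `border` rows remain (sketch).

    The real implementation may select rows differently; a random subsample
    is used here because it roughly preserves the class balance on average.
    """
    n_rows = data.features.shape[0]
    if n_rows <= border:
        return
    selected = np.random.choice(n_rows, border, replace=False)
    data.features = data.features[selected]
    data.target = data.target[selected]
    data.idx = np.arange(border)
```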
```diff
     """

     self.target = target
-    self.train_data = self.data_processor.define_data(features=features, target=target, is_predict=False)
+    self.train_data = deepcopy(self.data_processor.define_data(features=features, target=target, is_predict=False))
+    full_train_not_preprocessed = deepcopy(self.train_data)
```

Created by: valer1435
In theory we could do without it. Here it is needed to keep a copy of the original dataset for the final fit (the original may be cut).
So we have:
- the source data received from the caller (must not be modified);
- the data for composing (a copy of the source), which may be cut;
- a copy of the original, uncut dataset for the final fit; see the sketch after this list.

Hence the two deepcopy() calls.
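A condensed sketch of that data flow; `composer.compose_pipeline` is a placeholder name, the rest follows the diff quoted in this thread:

```python
from copy import deepcopy

# The caller's features/target stay untouched: define_data wraps them,
# and the result is deep-copied before any inplace preprocessing (copy #1).
self.train_data = deepcopy(self.data_processor.define_data(
    features=features, target=target, is_predict=False))

# Copy #2: safe mode may cut self.train_data in place while composing,
# so an uncut version is reserved for the final fit of the best pipeline.
full_train_not_preprocessed = deepcopy(self.train_data)

best_pipeline = composer.compose_pipeline(self.train_data)  # may see cut data
best_pipeline.fit(full_train_not_preprocessed)              # always full data
```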
```python
def accept_recommendations(self, input_data: InputData, recommendations: Dict):
    if 'label' in recommendations:
        self.log.info("Change preset due to label encoding")
        return self.change_preset_for_labeled_data(input_data.task)
    else:
        param_dict = {
            'task': self.task,
            'logger': self.log,
            'metric_name': self.metric_name,
            'composer_metric': self.metric_to_compose
        }
        return {**param_dict, **self.api_params}

def change_preset_for_labeled_data(self, task: Task):
    preset_name = 'tree_reg' if task.task_type == 'regression' else 'tree_class'
```

Created by: valer1435
```diff
     """

     self.target = target
-    self.train_data = self.data_processor.define_data(features=features, target=target, is_predict=False)
+    self.train_data = deepcopy(self.data_processor.define_data(features=features, target=target, is_predict=False))
+    full_train_not_preprocessed = deepcopy(self.train_data)
```

Created by: valer1435
As agreed, I moved the deepcopy into define_data.
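If `define_data` now owns the copy, every call site gets an independent object automatically. A hedged sketch of what that might look like; `_data_from_source` is a hypothetical helper, and the real method also dispatches on the input type:

```python
from copy import deepcopy

def define_data(self, features, target=None, is_predict=False):
    # Hypothetical helper name; the real method handles numpy arrays,
    # pandas DataFrames, CSV paths, etc.
    input_data = self._data_from_source(features, target, is_predict)
    # Returning a deep copy means later inplace safe-mode operations
    # cannot mutate the caller's original data.
    return deepcopy(input_data)
```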
```diff
         ]
     },
     "rf": {
-        "meta": "sklearn_class"
+        "meta": "sklearn_class",
+        "tags": ["tree"]
```

Created by: valer1435
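Presumably the new "tree" tag is what lets the label-encoding presets (`tree_class` / `tree_reg`) keep only tree-based models. A minimal sketch of such tag-based filtering; the `xgboost` and `logit` entries are illustrative, not copied from the repository file:

```python
# Illustrative fragment of an operations repository keyed by tags.
OPERATIONS = {
    'rf':      {'meta': 'sklearn_class', 'tags': ['tree']},
    'xgboost': {'meta': 'sklearn_class', 'tags': ['tree', 'boosting']},
    'logit':   {'meta': 'sklearn_class', 'tags': ['linear']},
}

def operations_with_tag(tag: str) -> list:
    """Select the names of operations whose entry carries the given tag."""
    return [name for name, info in OPERATIONS.items()
            if tag in info.get('tags', [])]

# A 'tree_class' preset could then be assembled as:
available_operations = operations_with_tag('tree')   # ['rf', 'xgboost']
assert 'logit' not in available_operations
```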
```python
assert data.features[0, 0] == 'a'


def test_api_fit_predict_with_pseudo_large_dataset_with_label_correct():
    model = Fedot(problem="classification",
                  composer_params=composer_params)
    model.data_analyser.max_cat_cardinality = 5
    model.data_analyser.max_size = 18
    data = get_small_cat_data()
    model.fit(features=data)
    model.predict(features=data)
    assert len(model.api_params['available_operations']) == 13
    assert 'logit' not in model.api_params['available_operations']


def test_api_fit_predict_with_pseudo_large_dataset_with_onehot_correct():
```

Created by: valer1435
Added descriptions to the tests.
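`get_small_cat_data` itself is not shown in the diff. A hypothetical fixture consistent with how the tests use it (more than 18 cells, a first cell equal to 'a', and a summary categorical cardinality above the lowered threshold of 5) might look like this; the import paths are assumed from the FEDOT layout seen elsewhere in this PR:

```python
import numpy as np
from fedot.core.data.data import InputData
from fedot.core.repository.dataset_types import DataTypesEnum
from fedot.core.repository.tasks import Task, TaskTypesEnum

def get_small_cat_data() -> InputData:
    """Hypothetical fixture: a tiny table (7 x 3 = 21 cells > 18) whose
    categorical cardinality (3 + 7 = 10 > 5) trips the lowered thresholds."""
    features = np.array([['a', 'qq', 1],
                         ['b', 'ww', 2],
                         ['c', 'ee', 3],
                         ['a', 'rr', 4],
                         ['b', 'tt', 5],
                         ['c', 'yy', 6],
                         ['a', 'uu', 7]], dtype=object)
    target = np.array([0, 1, 0, 1, 0, 1, 0])
    return InputData(idx=np.arange(len(target)), features=features,
                     target=target, task=Task(TaskTypesEnum.classification),
                     data_type=DataTypesEnum.table)
```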