Commit 6ca5bc43 authored by Илья Соколов

Merge branch 'master' into DRMPN-better-caching

parents 55dc4827 e15c0bfe
Showing with 323 additions and 228 deletions
@@ -211,6 +211,14 @@ Jupyter notebooks with examples are located in the repos
We are grateful to the contributors for their important contribution, and to the participants of numerous conferences and workshops for their valuable advice and suggestions.

Funding
=======
Implemented with the financial support of the Foundation for National Technology Initiative Projects Support, as part of the implementation of the roadmap for the development of the high-tech field "Artificial Intelligence" for the period up to 2030 (Agreement No. 70-2021-00187)

Side Projects
=============
- The optimization core, extracted into the GOLEM library.
@@ -210,6 +210,14 @@ Acknowledgments
We acknowledge the contributors for their important impact and the participants of numerous scientific conferences and workshops for their valuable advice and suggestions.

Funding
=======
This research is financially supported by the Foundation for
National Technology Initiative's Projects Support as a part of the roadmap
implementation for the development of the high-tech field of
Artificial Intelligence for the period up to 2030 (agreement No. 70-2021-00187).

Side Projects
=============
- The optimisation core, implemented in the GOLEM library.
@@ -16,7 +16,7 @@ FEDOT and Docker

Jupyter
=======
- **Check that docker (docker-compose) is available**: docker (docker-compose) must be installed
- `git clone https://github.com/aimclub/FEDOT.git` fetches the files from git
- `cd FEDOT` changes into the project folder
- `cd docker/jupiter` changes into the folder with the Docker files for the Jupyter notebook
docs/source/basics/comp_table.png

209 KB

@@ -10,3 +10,10 @@ The main framework concepts are as follows:
- **Versatility.** FEDOT is :doc:`not limited to specific modeling tasks </advanced/architecture>`; for example, it can be used for ODE or PDE problems;
- **Reproducibility.** Resulting pipelines can be :doc:`exported separately as JSON </advanced/pipeline_import_export>` or :doc:`together with your input data as a ZIP archive </advanced/project_import_export>` for experiment reproducibility;
- **Customizability.** FEDOT allows `managing models complexity <https://fedot.readthedocs.io/en/master/introduction/fedot_features/automation_features.html#models-used>`_ and thereby achieving the desired quality.

A comparison of FEDOT with the main existing AutoML tools is provided below:

|automl_features|

.. |automl_features| image:: ./comp_table.png
   :width: 80%
\ No newline at end of file
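To make the concepts above concrete, here is a hedged sketch of a minimal run through the high-level ``Fedot`` facade that this commit touches elsewhere; the CSV paths, the target column name, and the ``save`` destination are illustrative assumptions, not project artifacts:

    from fedot.api.main import Fedot

    # Hypothetical end-to-end run: compose a pipeline, export it for
    # reproducibility, and predict on unseen data. Paths are placeholders.
    model = Fedot(problem='classification', timeout=5)
    pipeline = model.fit(features='train.csv', target='target')
    pipeline.save('fitted_pipeline')  # assumed JSON export entry point
    predictions = model.predict(features='test.csv')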
docs/source/benchmarks/img_benchmarks/fedot_amlb.png

80.7 KB

docs/source/benchmarks/img_benchmarks/metrics.png

56 KB

docs/source/benchmarks/img_benchmarks/ranks.png

47.3 KB

Tabular data
------------
Here are the overall classification results across state-of-the-art AutoML frameworks
on self-run tasks from the OpenML test suite (a 10-fold run), scored with F1:
.. csv-table::
   :header: Dataset,FEDOT,AutoGluon,H2O,TPOT

   adult,0.874,0.874,0.875,0.874
   airlines,0.669,0.669,0.675,0.617
   airlinescodrnaadult,0.812,-,0.818,0.809
   albert,0.670,0.669,0.697,0.667
   amazon_employee_access,0.949,0.947,0.951,0.953
   apsfailure,0.994,0.994,0.995,0.995
   australian,0.871,0.870,0.865,0.860
   bank-marketing,0.910,0.910,0.910,0.899
   blood-transfusion,0.747,0.697,0.797,0.746
   car,1.000,1.000,0.998,0.998
   christine,0.746,0.746,0.748,0.737
   click_prediction_small,0.835,0.835,0.777,0.777
   cnae-9,0.957,0.954,0.957,0.954
   connect-4,0.792,0.788,0.865,0.867
   covertype,0.964,0.966,0.976,0.952
   credit-g,0.753,0.759,0.766,0.727
   dilbert,0.985,0.982,0.996,0.984
   fabert,0.688,0.685,0.726,0.534
   fashion-mnist,0.885,-,0.734,0.718
   guillermo,0.821,-,0.915,0.897
   helena,0.332,0.333,-,0.318
   higgs,0.731,0.732,0.369,0.336
   jannis,0.718,0.718,0.743,0.719
   jasmine,0.817,0.821,0.734,0.727
   jungle_chess_2pcs_raw_endgame_complete,0.953,0.939,0.817,0.817
   kc1,0.866,0.867,0.996,0.947
   kddcup09_appetency,0.982,0.982,0.866,0.818
   kr-vs-kp,0.995,0.996,0.982,0.962
   mfeat-factors,0.980,0.979,0.980,0.980
   miniboone,0.948,0.948,0.952,0.949
   nomao,0.969,0.970,0.975,0.974
   numerai28_6,0.523,0.522,0.522,0.505
   phoneme,0.915,0.916,0.916,0.910
   riccardo,0.997,-,0.998,0.997
   robert,0.405,-,0.559,0.487
   segment,0.982,0.982,0.982,0.980
   shuttle,1.000,1.000,1.000,1.000
   sylvine,0.952,0.951,0.952,0.948
   vehicle,0.851,0.849,0.846,0.835
   volkert,0.694,0.694,0.758,0.697
   Mean F1,0.838,0.837,0.833,0.812
-Also, we tested FEDOT on the results of the `AMLB <https://github.com/openml/automlbenchmark>`_ benchmark.
-The visualization of FEDOT (v.0.7.3) results against H2O (3.46.0.4), AutoGluon (v.1.1.0), TPOT (v.0.12.1) and LightAutoML (v.0.3.7.3),
-obtained using the built-in critical difference plot visualizations of AutoMLBenchmark, is provided below:
-All datasets (ROC AUC and negative log loss):
+We tested FEDOT on the results of the `AMLB <https://github.com/openml/automlbenchmark>`_ benchmark.
+We used the setup of each framework obtained from 'frameworks.yaml' on the date the experiments started.
+Thus, the following stable versions were used: AutoGluon 0.7.0, TPOT 0.11.7, LightAutoML 0.3.7.3, H2O 3.40.0.2, FEDOT 0.7.2.
+Some runs for AutoGluon failed due to errors (also described in Appendix D of the AMLB paper [1]).
+The visualizations were obtained using the built-in critical difference (CD) plot of AutoMLBenchmark [1].
+In a CD (critical difference) diagram,
+we display each framework's average rank and highlight which ranks are
+statistically significantly different from one another.
+To determine the average rank per task,
+we first replace any missing values with a constant predictor,
+calculate ranks for the represented AutoML solutions and the constant predictor
+for each dataset, and then take the average of those ranks across all datasets for each solution.
+We assess the statistical significance of the rank differences using a non-parametric Friedman test with a
+threshold of p < 0.05 (resulting in p ≈ 0 for all diagrams)
+and apply a Nemenyi post-hoc test to identify which framework pairs differ significantly.
+The time budget for all experiments is 1 hour and 10 folds are used (the 1h8c setup of AMLB). The results were
+obtained on a server based on Xeon Cascade Lake (2900 MHz) with 12 cores and 16 GB of memory.
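For illustration, a minimal sketch of this ranking procedure under stated assumptions: the per-dataset score table below is hypothetical, SciPy provides the Friedman test, and the Nemenyi post-hoc step would need an additional package such as scikit-posthocs:

    import numpy as np
    from scipy.stats import friedmanchisquare, rankdata

    # Hypothetical per-dataset scores: rows are datasets, columns are frameworks;
    # missing runs are assumed to be already replaced by the constant predictor.
    frameworks = ['FEDOT', 'AutoGluon', 'H2O', 'TPOT', 'constant']
    scores = np.array([[0.874, 0.874, 0.875, 0.874, 0.500],
                       [0.669, 0.669, 0.675, 0.617, 0.500],
                       [0.949, 0.947, 0.951, 0.953, 0.500]])

    # Rank frameworks per dataset (rank 1 = best score), then average the ranks.
    ranks = np.vstack([rankdata(-row) for row in scores])
    print(dict(zip(frameworks, ranks.mean(axis=0))))

    # Non-parametric Friedman test on the per-dataset scores (threshold p < 0.05).
    stat, p_value = friedmanchisquare(*scores.T)
    print(f'Friedman statistic = {stat:.3f}, p = {p_value:.4f}')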
+CD for all datasets (ROC AUC and negative log loss):

.. image:: ./img_benchmarks/cd-all-1h8c-constantpredictor.png

-Binary classification (ROC AUC):
+The CD diagram for all datasets (ROC AUC and negative log loss) shows that all AutoML frameworks
+(LightAutoML, H2OAutoML, TPOT, AutoGluon, FEDOT) perform statistically better than the constant predictor:
+CD for binary classification (ROC AUC):

.. image:: ./img_benchmarks/cd-binary-classification-1h8c-constantpredictor.png

-Multiclass classification (negative logloss):
+The CD diagram for binary classification (ROC AUC) shows that all AutoML frameworks
+(LightAutoML, H2OAutoML, TPOT, AutoGluon, FEDOT) perform similarly,
+falling within the same CD interval, and significantly outperform the constant predictor:
+CD for multiclass classification (negative logloss):

.. image:: ./img_benchmarks/cd-multiclass-classification-1h8c-constantpredictor.png

-We can claim that the results are statistically better than those of TPOT and indistinguishable from H2O and AutoGluon.
+The CD diagram for multiclass classification (negative log loss) shows that
+TPOT and FEDOT demonstrate intermediate performance, lying on the border of the
+CD interval with the constant predictor and of the CD interval with H2OAutoML:
We can conclude that FEDOT achieves performance comparable with competitors for tabular tasks.
The ranks for frameworks are provided below:
.. image:: ./img_benchmarks/ranks.png
The raw metrics (ROC AUC for binary and logloss for multiclass) for frameworks are provided below:
.. image:: ./img_benchmarks/metrics.png
The comparison with [1] shows that AutoGluon is underperforming in our hardware setup,
while TPOT and H2O are quite close in both setups.
To avoid any confusion, we provide below an additional comparison of the FEDOT metrics with the metrics from [1].
However, it should be noted that the conditions are different, as are the exact versions of the frameworks.
.. image:: ./img_benchmarks/fedot_amlb.png
[1] Gijsbers P. et al. AMLB: an AutoML Benchmark // Journal of Machine Learning Research. 2024. Vol. 25, No. 101. P. 1-65.
@@ -41,6 +41,10 @@ def prepare_multi_modal_data(files_path: str, task: Task, images_size: tuple = (
    """
    path = os.path.join(str(fedot_project_root()), files_path)
+    if not os.path.exists(path):
+        raise FileNotFoundError(path)
+
    # unpacking of data archive
    unpack_archived_data(path)

    # import of table data
@@ -68,7 +72,7 @@ def prepare_multi_modal_data(files_path: str, task: Task, images_size: tuple = (
    return data


-def run_multi_modal_pipeline(files_path: str, visualization=False) -> float:
+def run_multi_modal_pipeline(files_path: str, timeout=15, visualization=False) -> float:
    task = Task(TaskTypesEnum.classification)
    images_size = (224, 224)

@@ -76,7 +80,7 @@ def run_multi_modal_pipeline(files_path: str, visualization=False) -> float:
    fit_data, predict_data = train_test_data_setup(data, shuffle=True, split_ratio=0.6)

-    automl_model = Fedot(problem='classification', timeout=15)
+    automl_model = Fedot(problem='classification', timeout=timeout)
    pipeline = automl_model.fit(features=fit_data,
                                target=fit_data.target)
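A hedged usage sketch of the updated example: the hard-coded 15-minute budget is now an argument, so a quick smoke run becomes possible; the ``files_path`` value below is a placeholder for the example's dataset folder:

    # Hypothetical quick run with a reduced optimization budget (in minutes).
    if __name__ == '__main__':
        f1 = run_multi_modal_pipeline(files_path='examples/data/multimodal',
                                      timeout=2, visualization=False)
        print(f'F1 on the holdout set: {f1:.3f}')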
import datetime
from typing import Sequence

from golem.core.optimisers.genetic.operators.inheritance import GeneticSchemeTypesEnum
from golem.core.optimisers.genetic.operators.mutation import MutationTypesEnum

from fedot.core.composer.gp_composer.specific_operators import parameter_change_mutation, add_resample_mutation
from fedot.core.constants import AUTO_PRESET_NAME
from fedot.core.repository.tasks import TaskTypesEnum
from fedot.core.utils import default_fedot_data_dir


class ApiParamsRepository:
    """Repository storing possible Api parameters and their default values. Also returns parameters required
    for the data classes (``PipelineComposerRequirements``, ``GPAlgorithmParameters``, ``GraphGenerationParams``)
    used during model composition.
    """

    COMPOSER_REQUIREMENTS_KEYS = {'max_arity', 'max_depth', 'num_of_generations',
                                  'early_stopping_iterations', 'early_stopping_timeout',
                                  'parallelization_mode', 'use_input_preprocessing',
                                  'show_progress', 'collect_intermediate_metric', 'keep_n_best',
                                  'keep_history', 'history_dir', 'cv_folds'}

    STATIC_INDIVIDUAL_METADATA_KEYS = {'use_input_preprocessing'}

    def __init__(self, task_type: TaskTypesEnum):
        self.task_type = task_type
        self.default_params = ApiParamsRepository.default_params_for_task(self.task_type)

    @staticmethod
    def default_params_for_task(task_type: TaskTypesEnum) -> dict:
        """Returns a dict with default parameters"""
        if task_type in [TaskTypesEnum.classification, TaskTypesEnum.regression]:
            cv_folds = 5
        elif task_type == TaskTypesEnum.ts_forecasting:
            cv_folds = 3

        # Dict with the allowed keyword attributes for the Api and their default values. If a value is None,
        # the default set in the dataclasses ``PipelineComposerRequirements``, ``GPAlgorithmParameters``,
        # ``GraphGenerationParams`` will be used.
        default_param_values_dict = dict(
            parallelization_mode='populational',
            show_progress=True,
            max_depth=6,
            max_arity=3,
            pop_size=20,
            num_of_generations=None,
            keep_n_best=1,
            available_operations=None,
            metric=None,
            cv_folds=cv_folds,
            genetic_scheme=None,
            early_stopping_iterations=None,
            early_stopping_timeout=10,
            optimizer=None,
            collect_intermediate_metric=False,
            max_pipeline_fit_time=None,
            initial_assumption=None,
            preset=AUTO_PRESET_NAME,
            use_operations_cache=True,
            use_preprocessing_cache=True,
            use_predictions_cache=False,
            use_input_preprocessing=True,
            use_auto_preprocessing=False,
            use_meta_rules=False,
            cache_dir=default_fedot_data_dir(),
            keep_history=True,
            history_dir=default_fedot_data_dir(),
            with_tuning=True
        )
        return default_param_values_dict

    def check_and_set_default_params(self, params: dict) -> dict:
        """Sets default values for parameters which were not set by the user
        and raises KeyError for invalid parameter keys"""
        allowed_keys = self.default_params.keys()
        invalid_keys = params.keys() - allowed_keys
        if invalid_keys:
            raise KeyError(f"Invalid key parameters {invalid_keys}")
        else:
            missing_params = self.default_params.keys() - params.keys()
            for k in missing_params:
                if (v := self.default_params[k]) is not None:
                    params[k] = v
        return params

    @staticmethod
    def get_params_for_composer_requirements(params: dict) -> dict:
        """Returns a dict with parameters suitable for ``PipelineComposerRequirements``"""
        composer_requirements_params = {k: v for k, v in params.items()
                                        if k in ApiParamsRepository.COMPOSER_REQUIREMENTS_KEYS}

        max_pipeline_fit_time = params.get('max_pipeline_fit_time')
        if max_pipeline_fit_time:
            composer_requirements_params['max_graph_fit_time'] = datetime.timedelta(minutes=max_pipeline_fit_time)

        composer_requirements_params = ApiParamsRepository.set_static_individual_metadata(composer_requirements_params)
        return composer_requirements_params

    @staticmethod
    def set_static_individual_metadata(composer_requirements_params: dict) -> dict:
        """Returns a dict representing ``static_individual_metadata`` for ``PipelineComposerRequirements``"""
        static_individual_metadata = {k: v for k, v in composer_requirements_params.items()
                                      if k in ApiParamsRepository.STATIC_INDIVIDUAL_METADATA_KEYS}
        for k in ApiParamsRepository.STATIC_INDIVIDUAL_METADATA_KEYS:
            composer_requirements_params.pop(k)

        composer_requirements_params['static_individual_metadata'] = static_individual_metadata
        return composer_requirements_params

    def get_params_for_gp_algorithm_params(self, params: dict) -> dict:
        """Returns a dict with parameters suitable for ``GPAlgorithmParameters``"""
        gp_algorithm_params = {'pop_size': params.get('pop_size'),
                               'genetic_scheme_type': GeneticSchemeTypesEnum.parameter_free}
        if params.get('genetic_scheme') == 'steady_state':
            gp_algorithm_params['genetic_scheme_type'] = GeneticSchemeTypesEnum.steady_state

        gp_algorithm_params['mutation_types'] = ApiParamsRepository._get_default_mutations(self.task_type, params)
        return gp_algorithm_params

    @staticmethod
    def _get_default_mutations(task_type: TaskTypesEnum, params) -> Sequence[MutationTypesEnum]:
        mutations = [parameter_change_mutation,
                     MutationTypesEnum.single_change,
                     MutationTypesEnum.single_drop,
                     MutationTypesEnum.single_add,
                     MutationTypesEnum.single_edge]

        # TODO: remove this workaround after the boosting mutation is fixed.
        # Boosting mutation does not work due to a problem with ``__eq__`` on its copy;
        # refactoring the ``partial`` into a ``def`` does not help, and the boosting
        # mutation also does not work on its own.
        if task_type == TaskTypesEnum.ts_forecasting:
            # mutations.append(partial(boosting_mutation, params=params))
            pass
        else:
            mutations.append(add_resample_mutation)
        return mutations
import datetime
from typing import Sequence

from golem.core.optimisers.genetic.operators.inheritance import GeneticSchemeTypesEnum
from golem.core.optimisers.genetic.operators.mutation import MutationTypesEnum

from fedot.core.composer.gp_composer.specific_operators import parameter_change_mutation, add_resample_mutation
from fedot.core.constants import AUTO_PRESET_NAME
from fedot.core.repository.tasks import TaskTypesEnum
from fedot.core.utils import default_fedot_data_dir


class ApiParamsRepository:
    """Repository storing possible Api parameters and their default values. Also returns parameters required
    for the data classes (``PipelineComposerRequirements``, ``GPAlgorithmParameters``, ``GraphGenerationParams``)
    used during model composition.
    """

    COMPOSER_REQUIREMENTS_KEYS = {'max_arity', 'max_depth', 'num_of_generations',
                                  'early_stopping_iterations', 'early_stopping_timeout',
                                  'parallelization_mode', 'use_input_preprocessing',
                                  'show_progress', 'collect_intermediate_metric', 'keep_n_best',
                                  'keep_history', 'history_dir', 'cv_folds'}

    STATIC_INDIVIDUAL_METADATA_KEYS = {'use_input_preprocessing'}

    def __init__(self, task_type: TaskTypesEnum):
        self.task_type = task_type
        self.default_params = ApiParamsRepository.default_params_for_task(self.task_type)

    @staticmethod
    def default_params_for_task(task_type: TaskTypesEnum) -> dict:
        """Returns a dict with default parameters"""
        if task_type in [TaskTypesEnum.classification, TaskTypesEnum.regression]:
            cv_folds = 5
        elif task_type == TaskTypesEnum.ts_forecasting:
            cv_folds = 3

        # Dict with the allowed keyword attributes for the Api and their default values. If a value is None,
        # the default set in the dataclasses ``PipelineComposerRequirements``, ``GPAlgorithmParameters``,
        # ``GraphGenerationParams`` will be used.
        default_param_values_dict = dict(
            parallelization_mode='populational',
            show_progress=True,
            max_depth=6,
            max_arity=3,
            pop_size=20,
            num_of_generations=None,
            keep_n_best=1,
            available_operations=None,
            metric=None,
            cv_folds=cv_folds,
            genetic_scheme=None,
            early_stopping_iterations=None,
            early_stopping_timeout=10,
            optimizer=None,
            collect_intermediate_metric=False,
            max_pipeline_fit_time=None,
            initial_assumption=None,
            preset=AUTO_PRESET_NAME,
            use_pipelines_cache=True,
            use_preprocessing_cache=True,
            use_predictions_cache=False,
            use_input_preprocessing=True,
            use_auto_preprocessing=False,
            use_meta_rules=False,
            cache_dir=default_fedot_data_dir(),
            keep_history=True,
            history_dir=default_fedot_data_dir(),
            with_tuning=True,
            seed=None
        )
        return default_param_values_dict

    def check_and_set_default_params(self, params: dict) -> dict:
        """Sets default values for parameters which were not set by the user
        and raises KeyError for invalid parameter keys"""
        allowed_keys = self.default_params.keys()
        invalid_keys = params.keys() - allowed_keys
        if invalid_keys:
            raise KeyError(f"Invalid key parameters {invalid_keys}")
        else:
            missing_params = self.default_params.keys() - params.keys()
            for k in missing_params:
                if (v := self.default_params[k]) is not None:
                    params[k] = v
        return params

    @staticmethod
    def get_params_for_composer_requirements(params: dict) -> dict:
        """Returns a dict with parameters suitable for ``PipelineComposerRequirements``"""
        composer_requirements_params = {k: v for k, v in params.items()
                                        if k in ApiParamsRepository.COMPOSER_REQUIREMENTS_KEYS}

        max_pipeline_fit_time = params.get('max_pipeline_fit_time')
        if max_pipeline_fit_time:
            composer_requirements_params['max_graph_fit_time'] = datetime.timedelta(minutes=max_pipeline_fit_time)

        composer_requirements_params = ApiParamsRepository.set_static_individual_metadata(composer_requirements_params)
        return composer_requirements_params

    @staticmethod
    def set_static_individual_metadata(composer_requirements_params: dict) -> dict:
        """Returns a dict representing ``static_individual_metadata`` for ``PipelineComposerRequirements``"""
        static_individual_metadata = {k: v for k, v in composer_requirements_params.items()
                                      if k in ApiParamsRepository.STATIC_INDIVIDUAL_METADATA_KEYS}
        for k in ApiParamsRepository.STATIC_INDIVIDUAL_METADATA_KEYS:
            composer_requirements_params.pop(k)

        composer_requirements_params['static_individual_metadata'] = static_individual_metadata
        return composer_requirements_params

    def get_params_for_gp_algorithm_params(self, params: dict) -> dict:
        """Returns a dict with parameters suitable for ``GPAlgorithmParameters``"""
        gp_algorithm_params = {'pop_size': params.get('pop_size'),
                               'genetic_scheme_type': GeneticSchemeTypesEnum.parameter_free}
        if params.get('genetic_scheme') == 'steady_state':
            gp_algorithm_params['genetic_scheme_type'] = GeneticSchemeTypesEnum.steady_state

        gp_algorithm_params['mutation_types'] = ApiParamsRepository._get_default_mutations(self.task_type, params)
        gp_algorithm_params['seed'] = params['seed']
        return gp_algorithm_params

    @staticmethod
    def _get_default_mutations(task_type: TaskTypesEnum, params) -> Sequence[MutationTypesEnum]:
        mutations = [parameter_change_mutation,
                     MutationTypesEnum.single_change,
                     MutationTypesEnum.single_drop,
                     MutationTypesEnum.single_add,
                     MutationTypesEnum.single_edge]

        # TODO: remove this workaround after the boosting mutation is fixed.
        # Boosting mutation does not work due to a problem with ``__eq__`` on its copy;
        # refactoring the ``partial`` into a ``def`` does not help, and the boosting
        # mutation also does not work on its own.
        if task_type == TaskTypesEnum.ts_forecasting:
            # mutations.append(partial(boosting_mutation, params=params))
            pass
        else:
            mutations.append(add_resample_mutation)
        return mutations
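A short, hedged sketch of how this repository is meant to be driven (mirroring what ``ApiParams.__init__`` does below): note that ``seed`` defaults to None, so ``check_and_set_default_params`` does not copy it into the dict, and the caller has to set it explicitly before ``get_params_for_gp_algorithm_params`` reads ``params['seed']``:

    from fedot.core.repository.tasks import TaskTypesEnum

    repo = ApiParamsRepository(TaskTypesEnum.classification)

    # Unknown keys fail fast with KeyError; valid keys are merged with non-None defaults.
    params = repo.check_and_set_default_params({'pop_size': 50})
    assert params['cv_folds'] == 5      # classification default
    assert 'seed' not in params         # None defaults are not copied

    params['seed'] = 42                 # ApiParams.__init__ sets this explicitly
    gp_params = repo.get_params_for_gp_algorithm_params(params)
    assert gp_params['seed'] == 42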
@@ -127,6 +127,12 @@ class MultiModalAssumptionsBuilder(AssumptionsBuilder):
            data_pipeline_alternatives = subbuilder.build(first_node, use_input_preprocessing=use_input_preprocessing)
            subpipelines.append(data_pipeline_alternatives)

+        # TODO: fix this workaround during the improvement of multi-modality
+        for i, subpipeline in enumerate(subpipelines):
+            if (len(subpipeline) == 1 and len(subpipeline[0].nodes) == 1 and
+                    str(subpipeline[0].nodes[0]) in ['cnn', 'data_source_img']):
+                subpipelines[i] = [Pipeline(PipelineNode('cnn', nodes_from=[PipelineNode('data_source_img')]))]
+
        # Then zip these alternatives together and add final node to get ensembles.
        ensemble_builders: List[PipelineBuilder] = []
        for pre_ensemble in zip(*subpipelines):
@@ -93,6 +93,7 @@ class RegressionAssumptions(TaskAssumptions):
        return {
            'rfr': PipelineBuilder().add_node('rfr'),
            'ridge': PipelineBuilder().add_node('ridge'),
+            'lgbmreg': PipelineBuilder().add_node('lgbmreg'),
        }

    def ensemble_operation(self) -> str:
@@ -112,9 +113,13 @@ class ClassificationAssumptions(TaskAssumptions):
    @property
    def builders(self):
        return {
+            'gbm_linear': PipelineBuilder().
+                add_branch('catboost', 'xgboost', 'lgbm').join_branches('logit'),
+            'catboost': PipelineBuilder().add_node('catboost'),
+            'xgboost': PipelineBuilder().add_node('xgboost'),
+            'lgbm': PipelineBuilder().add_node('lgbm'),
            'rf': PipelineBuilder().add_node('rf'),
            'logit': PipelineBuilder().add_node('logit'),
-            'catboost': PipelineBuilder().add_node('catboost'),
        }

    def ensemble_operation(self) -> str:
@@ -26,7 +26,7 @@ from fedot.core.repository.tasks import Task, TaskTypesEnum, TaskParams, TsForec
class ApiParams(UserDict):
    def __init__(self, input_params: Dict[str, Any], problem: str, task_params: Optional[TaskParams] = None,
-                 n_jobs: int = -1, timeout: float = 5):
+                 n_jobs: int = -1, timeout: float = 5, seed=None):
        self.log: LoggerAdapter = default_log(self)
        self.task: Task = self._get_task_with_params(problem, task_params)
        self.n_jobs: int = determine_n_jobs(n_jobs)

@@ -34,6 +34,7 @@ class ApiParams(UserDict):
        self._params_repository = ApiParamsRepository(self.task.task_type)
        parameters: dict = self._params_repository.check_and_set_default_params(input_params)
+        parameters['seed'] = seed
        super().__init__(parameters)

        self._check_timeout_vs_generations()
@@ -139,9 +140,14 @@ class ApiParams(UserDict):
        """Method to initialize ``GPAlgorithmParameters``"""
        gp_algorithm_parameters = self._params_repository.get_params_for_gp_algorithm_params(self.data)

+        # workaround for "{TypeError}__init__() got an unexpected keyword argument 'seed'"
+        seed = gp_algorithm_parameters['seed']
+        del gp_algorithm_parameters['seed']
+
        self.optimizer_params = GPAlgorithmParameters(
            multi_objective=multi_objective, **gp_algorithm_parameters
        )
+        self.optimizer_params.seed = seed
        return self.optimizer_params
    def init_graph_generation_params(self, requirements: PipelineComposerRequirements) -> GraphGenerationParams:
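The ``seed`` detour above exists because ``GPAlgorithmParameters.__init__`` rejects the keyword; here is a minimal stand-alone sketch of the pop-then-assign pattern, using a stand-in dataclass since the real class lives in GOLEM:

    from dataclasses import dataclass

    @dataclass
    class _AlgoParams:               # hypothetical stand-in for GPAlgorithmParameters
        pop_size: int = 20

    kwargs = {'pop_size': 50, 'seed': 42}
    seed = kwargs.pop('seed')        # passing it to __init__ would raise TypeError
    params = _AlgoParams(**kwargs)
    params.seed = seed               # attach it as a plain attribute afterwards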
@@ -33,9 +33,9 @@ from fedot.explainability.explainer_template import Explainer
from fedot.explainability.explainers import explain_pipeline
from fedot.preprocessing.base_preprocessing import BasePreprocessor
from fedot.remote.remote_evaluator import RemoteEvaluator
+from fedot.utilities.composer_timer import fedot_composer_timer
from fedot.utilities.define_metric_by_task import MetricByTask
from fedot.utilities.memory import MemoryAnalytics
-from fedot.utilities.composer_timer import fedot_composer_timer
from fedot.utilities.project_import_export import export_project_to_zip, import_project_from_zip

NOT_FITTED_ERR_MSG = 'Model not fitted yet'
@@ -95,7 +95,7 @@ class Fedot:
        self.log = self._init_logger(logging_level)

        # Attributes for dealing with metrics, data sources and hyperparameters
-        self.params = ApiParams(composer_tuner_params, problem, task_params, n_jobs, timeout)
+        self.params = ApiParams(composer_tuner_params, problem, task_params, n_jobs, timeout, seed)
        default_metrics = MetricByTask.get_default_quality_metrics(self.params.task.task_type)

        passed_metrics = self.params.get('metric')

@@ -256,7 +256,7 @@ class Fedot:
                         .with_timeout(timeout)
                         .build(input_data))

-        self.current_pipeline = pipeline_tuner.tune(self.current_pipeline, show_progress)
+        self.current_pipeline = pipeline_tuner.tune(self.current_pipeline, show_progress=show_progress)
        self.api_composer.was_tuned = pipeline_tuner.was_tuned

        # The tuner returns an unfitted pipeline, so it must be fitted on the train dataset
import pickle
import sqlite3
import zlib
from sys import getsizeof
from contextlib import closing
from os import getpid
from typing import List, Optional, Tuple, TypeVar

from golem.core.log import default_log

from fedot.core.caching.base_cache_db import BaseCacheDB
from fedot.core.operations.operation import Operation

IOperation = TypeVar('IOperation', bound=Operation)

MAX_BLOB_SIZE = 2 ** 31 - 1


class OperationsCacheDB(BaseCacheDB):
@@ -78,10 +83,17 @@ class OperationsCacheDB(BaseCacheDB):
        with closing(sqlite3.connect(self.db_path)) as conn:
            with conn:
                cur = conn.cursor()
-                pickled = [
-                    (uid, sqlite3.Binary(pickle.dumps(val, pickle.HIGHEST_PROTOCOL)))
-                    for uid, val in uid_val_lst
-                ]
+                pickled = []
+                for uid, val in uid_val_lst:
+                    serialized = pickle.dumps(val, pickle.HIGHEST_PROTOCOL)
+                    serialized_size = getsizeof(serialized)
+                    if serialized_size > MAX_BLOB_SIZE:
+                        serialized = zlib.compress(serialized)
+                        default_log('Cache').warning(
+                            f'Pipeline serialization was compressed because the size limit was exceeded. '
+                            f'Size: {serialized_size} bytes (limit: {MAX_BLOB_SIZE} bytes)'
+                        )
+                    pickled.append((uid, sqlite3.Binary(serialized)))
                cur.executemany(f'INSERT OR IGNORE INTO {self._main_table} VALUES (?, ?);', pickled)

    def _init_db(self):
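A self-contained sketch of the compress-on-overflow idea above; the helper names are illustrative, and a matching decompression branch would be needed on the read path:

    import pickle
    import zlib

    MAX_BLOB_SIZE = 2 ** 31 - 1  # SQLite limits a single BLOB to 2**31 - 1 bytes

    def to_blob(value) -> bytes:
        """Serialize a cache value, compressing it only when it would not fit."""
        raw = pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
        return zlib.compress(raw) if len(raw) > MAX_BLOB_SIZE else raw

    def from_blob(blob: bytes):
        # zlib streams start with 0x78 under the default settings, while pickle
        # protocol 2+ starts with b'\x80', so the two formats are distinguishable.
        return pickle.loads(zlib.decompress(blob) if blob[:1] == b'\x78' else blob)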
@@ -57,11 +57,11 @@ class Data:
    def from_numpy(cls,
                   features_array: np.ndarray,
                   target_array: np.ndarray,
-                  features_names: np.ndarray[str] = None,
-                  categorical_idx: Union[list[int, str], np.ndarray[int, str]] = None,
                   idx: Optional[np.ndarray] = None,
                   task: Union[Task, str] = 'classification',
-                  data_type: Optional[DataTypesEnum] = DataTypesEnum.table) -> InputData:
+                  data_type: Optional[DataTypesEnum] = DataTypesEnum.table,
+                  features_names: np.ndarray[str] = None,
+                  categorical_idx: Union[list[int, str], np.ndarray[int, str]] = None) -> InputData:
        """Import data from numpy array.

        Args:
@@ -79,7 +79,13 @@
        """
        if isinstance(task, str):
            task = Task(TaskTypesEnum(task))
-        return array_to_input_data(features_array, target_array, features_names, categorical_idx, idx, task, data_type)
+        return array_to_input_data(features_array=features_array,
+                                   target_array=target_array,
+                                   features_names=features_names,
+                                   categorical_idx=categorical_idx,
+                                   idx=idx,
+                                   task=task,
+                                   data_type=data_type)
    @classmethod
    def from_numpy_time_series(cls,

@@ -104,7 +110,11 @@
            task = Task(TaskTypesEnum(task))
        if target_array is None:
            target_array = features_array
-        return array_to_input_data(features_array, target_array, idx, task, data_type)
+        return array_to_input_data(features_array=features_array,
+                                   target_array=target_array,
+                                   idx=idx,
+                                   task=task,
+                                   data_type=data_type)
    @classmethod
    def from_dataframe(cls,

@@ -848,11 +858,11 @@ def np_datetime_to_numeric(data: np.ndarray) -> np.ndarray:
def array_to_input_data(features_array: np.ndarray,
                        target_array: np.ndarray,
-                        features_names: np.ndarray[str] = None,
-                        categorical_idx: Union[list[int, str], np.ndarray[int, str]] = None,
                        idx: Optional[np.ndarray] = None,
                        task: Task = Task(TaskTypesEnum.classification),
-                        data_type: Optional[DataTypesEnum] = None) -> InputData:
+                        data_type: Optional[DataTypesEnum] = None,
+                        features_names: np.ndarray[str] = None,
+                        categorical_idx: Union[list[int, str], np.ndarray[int, str]] = None) -> InputData:
    if idx is None:
        idx = np.arange(len(features_array))
    if data_type is None:
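The switch to keyword arguments matters here: under the old signature, the positional call in ``from_numpy_time_series`` would have bound ``idx`` to ``features_names`` and ``task`` to ``categorical_idx``. A hedged usage sketch with synthetic arrays (the feature names and the categorical index are illustrative):

    import numpy as np

    features = np.random.rand(100, 3)
    target = np.random.randint(0, 2, size=100)

    data = InputData.from_numpy(features_array=features,
                                target_array=target,
                                task='classification',
                                features_names=np.array(['a', 'b', 'c']),
                                categorical_idx=[2])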
from typing import Optional

import numpy as np

from fedot.core.data.data import InputData, OutputData
from fedot.core.operations.evaluation.evaluation_interfaces import EvaluationStrategy
from fedot.core.operations.evaluation.operation_implementations.models.boostings_implementations import \

@@ -7,6 +9,7 @@ from fedot.core.operations.evaluation.operation_implementations.models.boostings
    FedotXGBoostClassificationImplementation, FedotXGBoostRegressionImplementation, \
    FedotLightGBMClassificationImplementation, FedotLightGBMRegressionImplementation
from fedot.core.operations.operation_parameters import OperationParameters
+from fedot.core.operations.evaluation.evaluation_interfaces import is_multi_output_task
from fedot.core.repository.tasks import TaskTypesEnum
from fedot.utilities.random import ImplementationRandomStateHandler
@@ -33,6 +36,15 @@ class BoostingStrategy(EvaluationStrategy):
            raise ValueError(f'Impossible to obtain Boosting Strategy for {operation_type}')

    def fit(self, train_data: InputData):
+        if train_data.task.task_type == TaskTypesEnum.ts_forecasting:
+            raise ValueError('Time series forecasting is not supported for boosting models')
+
+        if is_multi_output_task(train_data):
+            if self.operation_type == 'catboost':
+                self.params_for_fit.update(loss_function='MultiLogloss')
+            elif self.operation_type == 'catboostreg':
+                self.params_for_fit.update(loss_function='MultiRMSE')
+
        operation_implementation = self.operation_impl(self.params_for_fit)
        with ImplementationRandomStateHandler(implementation=operation_implementation):
@@ -49,21 +61,35 @@ class BoostingClassificationStrategy(BoostingStrategy):
        super().__init__(operation_type, params)

    def predict(self, trained_operation, predict_data: InputData) -> OutputData:
-        n_classes = len(trained_operation.classes_)
        if self.output_mode in ['labels']:
            prediction = trained_operation.predict(predict_data)
        elif (self.output_mode in ['probs', 'full_probs', 'default'] and
              predict_data.task.task_type is TaskTypesEnum.classification):
+            n_classes = len(trained_operation.classes_)
+            is_multi_output_target = is_multi_output_task(predict_data)
            prediction = trained_operation.predict_proba(predict_data)
+            is_prediction_correct = self._check_prediction_correctness(prediction)
            if n_classes < 2:
                raise ValueError('Data set contains only 1 target class. Please reformat your data.')
-            elif n_classes == 2 and self.output_mode != 'full_probs' and len(prediction.shape) > 1:
-                prediction = prediction[:, 1]
+            elif n_classes == 2 and self.output_mode != 'full_probs' and is_prediction_correct:
+                if is_multi_output_target and isinstance(prediction, list):
+                    prediction = np.stack([pred[:, 1] for pred in prediction]).T
+                else:
+                    prediction = prediction[:, 1]
        else:
            raise ValueError(f'Output mode {self.output_mode} is not supported')

        return self._convert_to_output(prediction, predict_data)

+    @staticmethod
+    def _check_prediction_correctness(prediction) -> bool:
+        if isinstance(prediction, list):
+            return len(prediction[0].shape) > 1
+        else:
+            return len(prediction.shape) > 1
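A small synthetic sketch of the new binary-probability handling: a CatBoost-style multi-output ``predict_proba`` returns one ``(n_samples, 2)`` array per target column, and stacking the positive-class columns reproduces the single-output ``[:, 1]`` slice for every target:

    import numpy as np

    # One (n_samples, 2) probability array per target column (made-up values).
    preds = [np.array([[0.9, 0.1], [0.2, 0.8]]),
             np.array([[0.4, 0.6], [0.7, 0.3]])]

    stacked = np.stack([pred[:, 1] for pred in preds]).T
    print(stacked.shape)  # (2, 2): n_samples x n_targets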
class BoostingRegressionStrategy(BoostingStrategy):
    def __init__(self, operation_type: str, params: Optional[OperationParameters] = None):
@@ -302,6 +302,7 @@ def convert_to_multivariate_model(sklearn_model, train_data: InputData):
def is_multi_output_task(train_data):
-    target_shape = train_data.target.shape
-    is_multi_target = len(target_shape) > 1 and target_shape[1] > 1
-    return is_multi_target
+    if train_data.target is not None:
+        target_shape = train_data.target.shape
+        is_multi_target = len(target_shape) > 1 and target_shape[1] > 1
+        return is_multi_target
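A sketch of what the added guard changes, with a hypothetical stand-in for ``InputData``; note that for a ``None`` target the function now implicitly returns ``None``, which is falsy, instead of raising ``AttributeError``:

    import numpy as np

    class _Data:  # hypothetical stand-in for InputData
        def __init__(self, target):
            self.target = target

    print(is_multi_output_task(_Data(np.zeros((10, 3)))))  # True: 2-D, >1 column
    print(is_multi_output_task(_Data(np.zeros(10))))       # False: 1-D target
    print(is_multi_output_task(_Data(None)))               # None (falsy), no error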
@@ -99,11 +99,14 @@ class OneHotEncodingImplementation(DataOperationImplementation):
        if isinstance(features, np.ndarray):
            transformed_categorical = self.encoder.transform(features[:, self.categorical_ids]).toarray()
            # Stack transformed categorical and non-categorical data, ignore if none
-            non_categorical_features = features[:, self.non_categorical_ids.astype(int)]
+            non_categorical_features = np.array(features[:, self.non_categorical_ids.astype(int)])
        else:
            transformed_categorical = self.encoder.transform(features.iloc[:, self.categorical_ids]).toarray()
-            non_categorical_features = features.iloc[:, self.non_categorical_ids.astype(int)].to_numpy()
+            non_categorical_features = np.array(features.iloc[:, self.non_categorical_ids.astype(int)])

+        transformed_categorical = transformed_categorical.astype(np.float32)
+        non_categorical_features = non_categorical_features.astype(np.float32)
        frames = (non_categorical_features, transformed_categorical)
        transformed_features = np.hstack(frames)
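A sketch of why the explicit ``float32`` casts help (an assumed rationale): stacking an object-dtype block with the float encoder output would otherwise keep the whole matrix as dtype=object:

    import numpy as np

    nums_obj = np.array([[3], [4]], dtype=object)
    cats = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)

    print(np.hstack((nums_obj, cats)).dtype)                     # object
    print(np.hstack((nums_obj.astype(np.float32), cats)).dtype)  # float32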