ITMO-NSS-team / FEDOT · Merge requests · !484

Merged
Created Nov 03, 2021 by Elizaveta Lutsenko (@LizLutsenko, Owner)

Modifying table preprocessing

  • Overview 23
  • Commits 42
  • Changes 45

Created by: Dreamlone

The preprocessing has been significantly refactored. Previously, all preprocessing functions lived in pipeline.py; they have now been moved into a dedicated preprocessing module. Its central class is DataPreprocessor, which handles both "obligatory" and "optional" preprocessing. Obligatory preprocessing covers things such as stripping stray whitespace from cells (turning "x " or " x " into "x"), converting one-dimensional targets into column vectors for classification and regression problems, excluding features with more than 90% missing values, etc.
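To make the obligatory steps concrete, here is a minimal sketch of what they could look like. The function name and signature are illustrative, not FEDOT's actual API:

```python
import numpy as np
import pandas as pd

def obligatory_preprocess(features: pd.DataFrame, target: np.ndarray):
    """Hypothetical sketch of the 'obligatory' preprocessing steps."""
    # 1. Strip stray whitespace from string cells ("x " / " x " -> "x").
    features = features.apply(
        lambda col: col.str.strip() if col.dtype == object else col
    )
    # 2. Exclude features with more than 90% missing values.
    features = features.loc[:, features.isna().mean() <= 0.9]
    # 3. Convert a one-dimensional target into a column vector.
    if target.ndim == 1:
        target = target.reshape(-1, 1)
    return features, target
```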

There is also "optional" preprocessing, which consists of imputation and categorical feature encoding. These operations are applied only if the pipeline being preprocessed does not already contain them; if it does, optional preprocessing is skipped. How do we know whether suitable operations are present in the pipeline? For this purpose, the PipelineStructureExplorer class is implemented in the preprocessing module. It inspects the pipeline structure, and if it detects that the pipeline would crash unless the gaps are filled at least somehow, it signals DataPreprocessor to fill them before the data is fed to the pipeline. This way the composer remains free to find a better way to encode features or fill gaps (e.g. LabelEncoding or anything other than OneHotEncoding), while the pipeline will not crash even if its structure contains no preprocessing operations.
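The decision logic can be sketched roughly as follows. The operation names and the flat-list representation of the pipeline are assumptions for illustration, not the real PipelineStructureExplorer interface:

```python
# Hypothetical operation tags; real FEDOT operation names may differ.
IMPUTATION_OPS = {"simple_imputation", "knn_imputation"}
ENCODING_OPS = {"one_hot_encoding", "label_encoding"}

def needs_default_imputation(pipeline_ops, data_has_gaps: bool) -> bool:
    """Signal the preprocessor to fill gaps only when the data has gaps
    and no node in the pipeline can fill them itself."""
    return data_has_gaps and not (set(pipeline_ops) & IMPUTATION_OPS)

def needs_default_encoding(pipeline_ops, data_has_categories: bool) -> bool:
    """Same idea for categorical features and encoding operations."""
    return data_has_categories and not (set(pipeline_ops) & ENCODING_OPS)
```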

Preprocessing at the different levels (API and Pipeline) has also changed. Preprocessing is now always performed at the API level first. Once done, the preprocessed InputData block is marked as such via the was_preprocessed flag in SupplementaryData. Then, in the Pipeline fit and predict methods, if a data block has not been preprocessed yet, obligatory preprocessing starts at the Pipeline level, followed by optional preprocessing.
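A minimal sketch of the flag handoff between the two levels, with class and method names approximating the description rather than FEDOT's exact API:

```python
from dataclasses import dataclass, field

@dataclass
class SupplementaryData:
    was_preprocessed: bool = False  # set by the API level after preprocessing

@dataclass
class InputData:
    features: list
    supplementary: SupplementaryData = field(default_factory=SupplementaryData)

class Pipeline:
    def fit(self, data: InputData) -> "Pipeline":
        # If the API level has not already preprocessed this data block,
        # run obligatory preprocessing here, then optional.
        if not data.supplementary.was_preprocessed:
            self._obligatory_preprocess(data)
            self._optional_preprocess(data)
            data.supplementary.was_preprocessed = True
        # ... actual fitting would follow ...
        return self

    def _obligatory_preprocess(self, data: InputData) -> None:
        pass  # placeholder for the obligatory steps

    def _optional_preprocess(self, data: InputData) -> None:
        pass  # placeholder for imputation / encoding
```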

Did a little refactoring of the API as well. Removed the (imho) unnecessary ApiFacade, which simply duplicated the functionality of the Fedot class. Renamed some classes for clarity (again, imho). Also moved some repeated variables into state variables and got rid of multiple inheritance.

Source branch: penn-check