MultiModalData class improvement
Created by: andreygetmanov
CSV files with text and table columns can now be read and split into separate data sources in a single call.
- MultiModalData.from_csv method added
- text fields are detected automatically if the user does not specify them
- tests added
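As a rough illustration of the idea (not the actual FEDOT implementation), the sketch below splits a DataFrame into a table source and per-column text sources. The helpers `guess_text_columns` and `split_sources`, the whitespace heuristic, and the `data_source_text/<column>` key format are assumptions made for this example only.

```python
import io

import pandas as pd


def guess_text_columns(df: pd.DataFrame, min_share: float = 0.5) -> list:
    """Hypothetical heuristic: a column counts as text if it holds strings
    and at least `min_share` of its non-missing values contain a space."""
    text_columns = []
    for column in df.columns:
        values = df[column].dropna()
        if values.empty or not all(isinstance(v, str) for v in values):
            continue
        share = sum(' ' in v.strip() for v in values) / len(values)
        if share >= min_share:
            text_columns.append(column)
    return text_columns


def split_sources(df: pd.DataFrame, target_column: str, text_columns=None):
    """Split features into one table source and one source per text column."""
    if text_columns is None:
        # Text columns are guessed only when the caller did not name them
        text_columns = guess_text_columns(df.drop(columns=[target_column]))
    table_columns = [c for c in df.columns
                     if c not in text_columns and c != target_column]
    sources = {'data_source_table': df[table_columns].to_numpy()}
    for column in text_columns:
        sources[f'data_source_text/{column}'] = df[column].to_numpy()
    return sources, df[target_column].to_numpy()


csv_text = io.StringIO(
    'price,description,variety\n'
    '10.5,dry red wine,merlot\n'
    '7.0,light and fruity,riesling\n'
)
frame = pd.read_csv(csv_text)
sources, target = split_sources(frame, target_column='variety')
print(sorted(sources))
```

A real detector would likely need more than a whitespace check (for example, categorical columns with multi-word labels would be misclassified here), which is why the explicit `text_columns` argument takes priority.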
Created by: pep8speaks
Hello @andreygetmanov! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
There are currently no PEP 8 issues detected in this Pull Request. Cheers!
Comment last updated at 2022-08-12 11:35:49 UTC
Created by: codecov[bot]
Codecov Report
Merging #789 (a1084773) into master (bf399742) will increase coverage by 0.33%. The diff coverage is 96.51%.

    @@            Coverage Diff             @@
    ##           master     #789      +/-   ##
    ==========================================
    + Coverage   86.87%   87.20%   +0.33%
    ==========================================
      Files         184      184
      Lines       12823    12971     +148
    ==========================================
    + Hits        11140    11312     +172
    + Misses       1683     1659      -24
Impacted Files and Coverage Δ:
fedot/core/data/multi_modal.py 89.43% <95.08%> (+3.89%)
fedot/core/data/data_preprocessing.py 97.22% <100.00%> (-0.04%)
fedot/core/data/merge/data_merger.py 98.93% <100.00%> (ø)
fedot/preprocessing/preprocessing.py 99.33% <100.00%> (+0.01%)
...n/operation_implementations/models/custom_model.py 85.00% <0.00%> (-5.00%)
...on_implementations/models/discriminant_analysis.py 91.66% <0.00%> (-1.30%)
...implementations/models/ts_implementations/naive.py 94.56% <0.00%> (-1.09%)
fedot/core/pipelines/tuning/unified.py 100.00% <0.00%> (ø)
fedot/core/pipelines/automl_wrappers.py 0.00% <0.00%> (ø)
fedot/core/pipelines/pipeline_node_factory.py 100.00% <0.00%> (ø)
... and 44 more
- Last updated by Elizaveta Lutsenko
        except ValueError:
            return False


    def prepare_multimodal_text_data(dataframe: pd.DataFrame, text_columns: List[str]) -> dict:
        """ Prepares MultiModal text data in a form of dictionary

        :param dataframe: pandas DataFrame to process
        :param text_columns: list of text columns names

        :return multimodal_text_data: dictionary with numpy arrays of text data
        """
        multi_modal_text_data = {}

        for column in text_columns:
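The body of the function is cut off above. Purely as an illustration of what its docstring describes, combined with the NaN-to-empty-string substitution mentioned in this PR's commit list, a helper of this shape might look as follows. The name `text_columns_to_arrays` is hypothetical, and this is not the code under review.

```python
from typing import Dict, List

import numpy as np
import pandas as pd


def text_columns_to_arrays(dataframe: pd.DataFrame,
                           text_columns: List[str]) -> Dict[str, np.ndarray]:
    """Hypothetical helper: build one numpy array of strings per text column."""
    text_data = {}
    for column in text_columns:
        # Missing values become empty strings so that text models only see str
        cleaned = dataframe[column].fillna('').astype(str)
        text_data[column] = cleaned.to_numpy()
    return text_data


frame = pd.DataFrame({'review': ['good', None, 'bad'], 'score': [1, 2, 3]})
arrays = text_columns_to_arrays(frame, ['review'])
print(list(arrays['review']))
```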
    -
    -    data = MultiModalData({
    -        'data_source_table': data_num,
    -        'data_source_text': data_text
    -    })
    -
    -    return data
    -
    -
    -def run_multi_modal_example(files_path: str, is_visualise=True) -> float:
    +def run_multi_modal_example(file_path: str, is_visualise=True) -> float:
         task = Task(TaskTypesEnum.classification)
    -
    -    data = prepare_multi_modal_data(files_path, task)
    +    path = Path(fedot_project_root(), file_path)
    +    data = MultiModalData.from_csv(file_path=path, task=task, target_columns='variety', index_col=None)

Created by: andreygetmanov
For ease of use, it would be good to allow the task for multimodal data to be given as a string, as is done in Fedot: simply "classification".
In the example I fixed this by wrapping the string task when the data is imported. In a future PR I will try to extend this feature to any kind of data; defining the task with a string really is more convenient.
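A minimal sketch of the string-to-task wrapping discussed here. `TaskTypesEnum` and `Task` below are simplified stand-ins defined for this example so that it runs without FEDOT installed, and the helper `as_task` is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class TaskTypesEnum(Enum):
    # Stand-in for the FEDOT enum; only the task names from this thread
    classification = 'classification'
    regression = 'regression'


@dataclass
class Task:
    # Stand-in for the FEDOT Task class
    task_type: TaskTypesEnum


def as_task(task: Union[str, Task]) -> Task:
    """Accept either a ready Task object or its string name."""
    if isinstance(task, Task):
        return task
    try:
        return Task(TaskTypesEnum(task))
    except ValueError:
        known = ', '.join(t.value for t in TaskTypesEnum)
        raise ValueError(f'Unknown task {task!r}; expected one of: {known}') from None


print(as_task('classification'))
```

Normalizing the argument once at the entry point keeps the rest of the code working with a single type, whichever form the caller used.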
Created by: Dreamlone
Regarding the bug in the example multimodal_text_num_example.py, here is what I managed to find; I hope it helps. You suspected the missing values, so I decided to dig into that.
- In the fit method I set predefined_model='auto' so that it runs faster.
- The structure of the generated pipeline is as follows: the figure shows that the tabular branch contains simple imputation. Since a gap-filling operation is present in the pipeline, the optional imputation in preprocessing is not performed.
- Where the problem is: the method _init_main_target_source_name in DataPreprocessor. The idea of the method is this: when multi-task mode was introduced for pipelines, it became necessary to encode only the "main target". For example, if a regression task (secondary) and a classification task (main) are solved at the same time, target encoding is applied only to the classification branch, if it is the main one. In that case both tasks are solved on tabular data. In the current example, however, a problem arises: the target is the same everywhere, but it is not transformed for the text branch, because at the training stage the algorithm set up encoding only for tabular data. In other words, the preprocessor prepares an encoder only for tabular data, but during predict, when the preprocessor has to apply the inverse transformation, the algorithm finds only that no transformation is required for the text branch and that this branch is the main one. It concludes: "Since the target in the text branch is the main one, nothing needs to be converted back for the whole pipeline", which is not true.
- Proposed solution: make target encoding mandatory not only for tabular data but for other nodes as well. Let text, images, and so on be marked as main_target_source_name (this is already done), and let the preprocessor be required to convert the target for such data when necessary (this is not implemented yet).
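The failure mode and the proposed fix can be illustrated with a toy preprocessor. Everything below is a simplified stand-in written for this comment: `ToyPreprocessor`, its `encode_all_sources` flag, and `ToyLabelEncoder` are hypothetical and do not mirror the real `DataPreprocessor` code.

```python
class ToyLabelEncoder:
    """Tiny label encoder: maps labels to integer codes and back."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        return self

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]


class ToyPreprocessor:
    def __init__(self, encode_all_sources: bool):
        # False imitates the reported behavior, True imitates the proposed fix
        self.encode_all_sources = encode_all_sources
        self.target_encoders = {}
        self.main_target_source_name = None

    def fit(self, sources: dict, main_source: str):
        """`sources` maps a source name to (data_type, target labels)."""
        self.main_target_source_name = main_source
        for name, (data_type, target) in sources.items():
            if data_type == 'table' or self.encode_all_sources:
                self.target_encoders[name] = ToyLabelEncoder().fit(target)

    def restore_target(self, predicted_codes):
        encoder = self.target_encoders.get(self.main_target_source_name)
        if encoder is None:
            # No encoder for the main source: codes are returned unconverted
            return list(predicted_codes)
        return encoder.inverse_transform(predicted_codes)


sources = {
    'data_source_table': ('table', ['cat', 'dog', 'cat']),
    'data_source_text': ('text', ['cat', 'dog', 'cat']),
}

buggy = ToyPreprocessor(encode_all_sources=False)
buggy.fit(sources, main_source='data_source_text')
print(buggy.restore_target([0, 1]))  # [0, 1]

fixed = ToyPreprocessor(encode_all_sources=True)
fixed.fit(sources, main_source='data_source_text')
print(fixed.restore_target([0, 1]))  # ['cat', 'dog']
```

With the text source marked as main and no encoder fitted for it, the integer codes leak into the final prediction; fitting an encoder for every source regardless of data type restores the original labels.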
added 16 commits
- a1084773...e2ddda73 - 7 commits from branch master
- 342b28ad - MultiModalData class improvement
- 888fd0d9 - minor changes:
- 539b113e - - added 2 tests on MultiModalData.from_csv use
- 6e7c1053 - - fixed bug with incorrect multimodal data preprocessing
- dbe94e5e - - fixed bug with incorrect multimodal data preprocessing
- 71d2b364 - - added data for tests
- b498fe50 - - rewrote path in test_multimodal_data.py by Path
- 4db1c9aa - - task now is defined by str, not by Task class
- a5c3751c - - added substitution of nans to '' in text features
added 12 commits
- a5c3751c...75e9a9b1 - 2 commits from branch master
- 270c0346 - MultiModalData class improvement
- 0bd26671 - minor changes:
- 2390b34b - - added 2 tests on MultiModalData.from_csv use
- ca2a323a - - fixed bug with incorrect multimodal data preprocessing
- eeae6ebb - - fixed bug with incorrect multimodal data preprocessing
- 97f5de12 - - added data for tests
- ca7a2f8f - - rewrote path in test_multimodal_data.py by Path
- 98f8e17a - - task now is defined by str, not by Task class
- 8bb8f5ae - - added substitution of nans to '' in text features
- e2554e7e - - tests of multimodal data class are finished
added 1 commit
- b3798a26 - - text and ts preparation methods are now in distinct classes inheriting...
added 18 commits
- b3798a26...2e9500fb - 6 commits from branch master
- e20a4b5c - MultiModalData class improvement
- 0a13e177 - minor changes:
- ff8b4a09 - - added 2 tests on MultiModalData.from_csv use
- 145087b6 - - fixed bug with incorrect multimodal data preprocessing
- ef38fc62 - - fixed bug with incorrect multimodal data preprocessing
- e2d9906d - - added data for tests
- 19f54a7e - - rewrote path in test_multimodal_data.py by Path
- a959e315 - - task now is defined by str, not by Task class
- cfceb1e4 - - added substitution of nans to '' in text features
- a5eed0c4 - - tests of multimodal data class are finished
- 9aabeece - - text and ts preparation methods are now in distinct classes inheriting...
- faca601b - - refactoring of prepare_multimodal_ts_data method
added 1 commit
- 87e78ddb - - refactoring structure of test_text_data_only
added 16 commits
- 022e53be - 1 commit from branch master
- 64e77572 - MultiModalData class improvement
- 4f45749d - minor changes:
- 0bfba50c - - added 2 tests on MultiModalData.from_csv use
- 9ecc762f - - fixed bug with incorrect multimodal data preprocessing
- 3920d99a - - fixed bug with incorrect multimodal data preprocessing
- 29649398 - - added data for tests
- 9448c621 - - rewrote path in test_multimodal_data.py by Path
- 3d476c98 - - task now is defined by str, not by Task class
- 2ab2620f - - added substitution of nans to '' in text features
- 25644470 - - tests of multimodal data class are finished
- ee325bee - - text and ts preparation methods are now in distinct classes inheriting...
- b06330cc - - refactoring of prepare_multimodal_ts_data method
- 797efae9 - - refactoring structure of test_text_data_only
- f9efeb78 - - refactoring structure of test_multimodal_data_from_csv
- 38c87962 - - table and text preprocessing are now distinguished for easier readability
added 1 commit
- 1147eb21 - - table and text preprocessing are now distinguished for easier readability
added 1 commit
- a2da375f - - table and text preprocessing are now distinguished for easier readability
added 17 commits
- a2da375f...9a718076 - 2 commits from branch master
- 823d9b3e - MultiModalData class improvement
- 55bb27a5 - minor changes:
- e1f384c6 - - added 2 tests on MultiModalData.from_csv use
- b383140f - - fixed bug with incorrect multimodal data preprocessing
- ec06c6de - - fixed bug with incorrect multimodal data preprocessing
- 12221d65 - - added data for tests
- 004df00d - - rewrote path in test_multimodal_data.py by Path
- fded2937 - - task now is defined by str, not by Task class
- 6b125796 - - added substitution of nans to '' in text features
- 2ecdadb3 - - tests of multimodal data class are finished
- c91aba4f - - text and ts preparation methods are now in distinct classes inheriting...
- 45eda3d4 - - refactoring of prepare_multimodal_ts_data method
- c03f35cb - - refactoring structure of test_text_data_only
- 90ac72b2 - - refactoring structure of test_multimodal_data_from_csv
- 64d320d3 - - table and text preprocessing are now distinguished for easier readability
added 18 commits
- 64d320d3...9ae9151b - 2 commits from branch master
- 99121bef - MultiModalData class improvement
- 47a53e99 - minor changes:
- b64819bc - - added 2 tests on MultiModalData.from_csv use
- c1e7ecb0 - - fixed bug with incorrect multimodal data preprocessing
- c240d69e - - fixed bug with incorrect multimodal data preprocessing
- 74731c25 - - added data for tests
- 2fbc8797 - - rewrote path in test_multimodal_data.py by Path
- fa8ab7bd - - task now is defined by str, not by Task class
- 3bf4cbe1 - - added substitution of nans to '' in text features
- 9b9a0d0d - - tests of multimodal data class are finished
- 9d58be80 - - text and ts preparation methods are now in distinct classes inheriting...
- 20bb336c - - refactoring of prepare_multimodal_ts_data method
- ebd5f342 - - refactoring structure of test_text_data_only
- 11e953a0 - - refactoring structure of test_multimodal_data_from_csv
- cad1d587 - - table and text preprocessing are now distinguished for easier readability
- ef31bae9 - - removed duplicate of array_to_input_data method
added 1 commit
- 1265ed49 - - decorators moved to the distinguished file