|
|
The repository provides two examples that demonstrate the functionality of the library.
|
|
|
________________________________________________________________________________________________________
|
|
|
|
|
|
The first example covers news selection and novelty detection: each processed text fragment is compared against news that has already been processed. The test set of news texts, partially ordered by time, was collected from news aggregators without preliminary selection by thematic domain. To prepare the dataset, all available pages of each resource were collected, then analyzed, filtered and indexed. All documents collected from open sources were converted from their original format (HTML, XML, DOC, PDF, ODT) to plain text without markup, service areas or promotional material. The texts were then normalized in terms of formatting (excess service characters were removed) and converted to a single encoding, UTF-8.

From this data a news stream of 150 texts was formed, consisting of two thematic groups of 75 texts each. A dictionary of word connections was built (see examples/example_data_noveltyFiltering/connections_dictionary.txt) and used to convert the 150 texts into a sequence of 150 SSPs (examples/example_data_noveltyFiltering/stream_text_data_encoded_to_SSPs.txt). During processing, the increments of the total synapse weights and the threshold detector are evaluated. When the thematic group of the processed texts changes, the total weight of the RNN synapses changes abruptly (see figure):
|
|
|
|
|
|
![img_2](https://user-images.githubusercontent.com/63652471/207627858-8a515923-45b4-45c5-8bc7-3994bb7f55f6.png)
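
The abrupt change visible in the figure can be illustrated with a minimal sketch. It is not the library's detector: the function name `detect_abrupt_changes`, the fixed threshold on the absolute increment, and the weight values are all assumptions made for illustration only.

```python
from typing import List, Sequence


def detect_abrupt_changes(total_weights: Sequence[float], threshold: float) -> List[int]:
    """Return the sample indices where the total synapse weight jumps.

    Hypothetical helper: a sample i is flagged when the absolute increment
    |w[i] - w[i-1]| exceeds a fixed threshold. The real detector in the
    library may use a different rule.
    """
    change_points = []
    for i in range(1, len(total_weights)):
        increment = total_weights[i] - total_weights[i - 1]
        if abs(increment) > threshold:
            change_points.append(i)
    return change_points


# Made-up total weights: stable for four samples, then a jump at sample 4,
# as would be expected when the thematic group of the stream changes.
weights = [10.0, 10.2, 10.1, 10.3, 14.8, 15.0, 14.9]
print(detect_abrupt_changes(weights, threshold=1.0))  # [4]
```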
|
|
|
|
|
|
By filtering texts in the second RNN by new links and then selecting the new blocks of text in the original news stream, lists of word links representing novelty were formed (see results_noveltyFiltering/results_xxxxxxxx.txt, where xxxxxxxx is a timestamp).
|
|
|
|
|
|
_________________________________________________________________________________________________________________________
|
|
|
|
|
|
The second example covers forecasting the text content of news feeds. As in the first example, news was collected from news aggregators, here at 15-minute intervals, and filtered, and a 1000-word dictionary was built (see examples/example_data_forecasting/words_dictionary.txt). The news texts to be processed by the neural network were then filtered by the dictionary and encoded as an SSP sequence (examples/example_data_forecasting/stream_words_data_encoded_to_SSPs.txt). The resulting sequence is 50 samples long. Forecasts are made after 35, 40 and 45 samples have been processed, each for 4 samples ahead; at one sample every 15 minutes, this corresponds to a forecasting horizon of 1 hour. The forecasting results are written to the results_forecasting folder in the text file "results_xxxxxxxx.txt", where xxxxxxxx is the timestamp. Forecast accuracy is assessed by the percentage of misses (pe0) and the percentage of false positives (pe1), determined by the formulas:
|
|
|
|
|
|
![img_5](https://user-images.githubusercontent.com/63652471/207628014-2ba7e722-b42e-4fcb-92aa-e8b9fb2fe02d.png) |
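
The authoritative definitions of pe0 and pe1 are the formulas in the figure above. The sketch below only illustrates the idea under an assumed reading: pe0 is the share of words that actually occurred but were not predicted, and pe1 is the share of predicted words that did not occur. The function name, the 0/1 vector representation and the sample data are hypothetical and are not taken from the library.

```python
from typing import Sequence, Tuple


def miss_and_false_positive_rates(
    actual: Sequence[int], predicted: Sequence[int]
) -> Tuple[float, float]:
    """Return (pe0, pe1) in percent for two equal-length 0/1 vectors.

    Assumed definitions (check them against the formulas in the figure):
      pe0 = misses / number of words that actually occurred
      pe1 = false positives / number of words that were predicted
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Miss: the word occurred (1) but was not predicted (0).
    misses = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    # False positive: the word was predicted (1) but did not occur (0).
    false_positives = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

    n_actual = sum(actual)
    n_predicted = sum(predicted)

    # Guard against empty denominators.
    pe0 = 100.0 * misses / n_actual if n_actual else 0.0
    pe1 = 100.0 * false_positives / n_predicted if n_predicted else 0.0
    return pe0, pe1


# Made-up example over an 8-word dictionary: one miss and one false positive.
actual = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
pe0, pe1 = miss_and_false_positive_rates(actual, predicted)
print(f"pe0 = {pe0:.1f}%, pe1 = {pe1:.1f}%")  # pe0 = 25.0%, pe1 = 25.0%
```

If the formulas in the figure normalize by a different quantity (for example, the dictionary size), only the two denominators need to change.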