Choosing allowability boundaries for describing objects in subject areas

Anomaly detection is one of the most promising problems for study; it can be used both as an independent task and as a preprocessing tool before solving fundamental data mining problems. This article proposes a method for detecting specific errors that involves subject-area experts in filling the knowledge base. The proposed method hypothesizes that outliers lie close to the logical boundaries of intervals derived from pairs of features, and that the interval ranges vary across domains. We construct the intervals from the values of feature pairs. While forming knowledge in a specific field, a domain specialist checks the logical allowability of objects based on the range of the intervals. If objects are logical outliers, the specialist ignores or corrects them. We present the general algorithm for forming the database based on the proposed method in the form of pseudo-code.


INTRODUCTION
The problem of data preprocessing to detect and remove invalid data values, and thereby improve the efficiency of problem-solving algorithms, has become urgent due to the growth in the volume of processed data. In [1], various approaches have been proposed for data cleaning based on metric functional dependencies and on minimizing statistical distortion measured with the Earth Mover's Distance [2]. It should be noted that searching for logically incompatible data values in descriptions of subject-area objects over sets of two or more features is difficult to implement in practice. The problem lies in the lack of methods for checking the allowability of relationships between feature values on such sets. Hypothetically, the question of the allowability of such relationships can be answered by competent experts in the subject areas. The process requires: i) the development of special methods for analyzing and visualizing data to detect outliers; and ii) the construction and filling of knowledge bases to check that data values belong to intervals with allowable boundaries.
Finding the boundaries of allowability of relations for a pair of quantitative attributes within a subject area is treated as the solution to the problem of the logical incompatibility of values for that pair. Acceptance limits define intervals that control the correctness of the data used. Among the universal restrictions on the use of such intervals is invariance to the measurement scale of the features. The following example shows that the allowability of values for each feature separately differs from the allowability of a pair of features. A whale aged 5 that weighs 45,000 kilograms has two values that are acceptable within their individual ranges, but, analyzed as a pair, these feature values are unacceptable or doubtful. Building a regression relationship between features is one way to describe their relationship; an example of using such a dependency is filling in missing values in data. The imperfection of this method can be judged by the decrease in the generalizing ability of recognition algorithms on samples in which the missing values are randomly generated. Jouravlev proved the need to introduce additional restrictions on the concept of an "allowable object" [3]. The introduced restrictions expand the possibilities for controlling the correctness of the data used. The proposed method for checking data correctness does not in any way supersede existing ones; it is recommended in cases where "classical methods" fail to detect errors. The practical use of this method is preceded by the creation of a knowledge base, with whose help the user can analyze the reasons for incorrect data. The relevance of research on this issue is increasing with the use of cloud storage technologies for processing big data.
The efficiency of using the boundaries of permissible values when forming a training data set can be checked through the generalizing ability of the algorithms. For comparison, "clean" samples can be used alongside samples with additional falsifier objects whose descriptions violate the allowability boundaries. For example, in classification problems, noise objects are unallowable relative to their class and negatively affect the accuracy of the solution algorithm. This factor can be used to test the stability of algorithms; in particular, [4] proposes the "data + noise" technique, which, on the one hand, contributes to a "smoother" convergence of the procedure for the interactive search for logical patterns. On the other hand, "noisy" objects perform the essential function of falsifiers, the "collision" with which contributes to an increase in the robustness of the obtained solutions.
Anomalies are objects that differ from the main part of the data in a subject area; in data mining [5], there are several approaches to detecting outliers depending on the type of task: parsing, data transformation, methods enforcing integrity constraints, duplicate detection methods, and others [6]. Closely related work was surveyed in [7] in 2018. According to that paper, outlier methods fall into several categories: global vs. local, labeling vs. scoring, supervised vs. unsupervised, and parametric vs. nonparametric. Over the years, many types of outlier detection techniques have been proposed for various purposes [7], [8]: statistical, which check whether an input value lies in an interval formed from the standard deviation based on the Chebyshev inequality [9]; deviation-based; density-based techniques, based on the distance between objects; cluster-based, such as "density-based spatial clustering of applications with noise" [10], [11]; and association rule templates, which validate the input against a template whose representation is considered correct.

RELATED WORKS
One of the fastest ways to detect abnormal objects is the Isolation Forest method [12]. Most current methods for detecting anomalous objects first create a model of normal objects and then check every object in the sample for abnormality, which incurs significant computational cost. The classical Isolation Forest method has been thoroughly analyzed and augmented with an innovative approach [13]: a k-Means-based Isolation Forest that builds a search tree with many branches, in contrast to the only two considered in the original method. A set of methods enhancing the Isolation Forest based on Fuzzy C-Means has also been proposed [14]. The main goal of that study is to analyze the possibilities of grouping with Fuzzy C-Means at the stage of building a search tree [15]. In particular, it utilizes the degree of membership of a given object in the group of similar objects positioned close to a given tree node. The memberships are determined by the distance from the so-called middle of the cluster, i.e., the average value of the feature [16].
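As a point of reference, the Isolation Forest method can be exercised with its scikit-learn implementation; the sketch below uses made-up synthetic points, not the data from any of the cited studies (`predict` returns +1 for inliers and -1 for outliers):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# A dense "normal" cloud plus two injected extreme points.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal, outliers])

# Isolation Forest isolates anomalies with short random-split paths.
clf = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = clf.predict(X)  # +1 = inlier, -1 = outlier
```

Points far from the dense cloud are isolated after very few random splits, which is why the method is among the fastest anomaly detectors.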
The idea behind density-based spatial clustering of applications with noise (DBSCAN) is that the density of objects inside each cluster is higher than the density outside the cluster [10]. Noise objects are selected on the assumption that the density in their regions is lower than in any of the clusters. Moreover, for each point of a cluster, its neighborhood of a given radius ϵ (eps) must contain at least a certain number of points, set by the threshold value MinPts. A new concept of moveability and a new, comprehensive, hybrid feature-based density measurement method that considers temporal and spatial properties have been defined [17], followed by an improved DBSCAN algorithm using the new density measurement method. Another density-based clustering algorithm has been presented based on DBSCAN and computational geometry [18]. It introduces three significant modifications or extensions to DBSCAN: selection of the parameter ϵ (eps) using the radii of empty or Voronoi circles; selection of the parameter MinPts for the same epsilon; and redistribution of noise points to suitable clusters using the concept of centroid-hinged clustering.
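The eps/MinPts behavior described above can be reproduced with scikit-learn's DBSCAN. The sketch below uses illustrative synthetic blobs and parameter values of our own choosing (they differ from the settings used in the experiments later in the paper); points in low-density regions receive the noise label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
# Two dense blobs plus two isolated points far from both.
blob1 = rng.normal(0.0, 0.3, size=(100, 2))
blob2 = rng.normal(5.0, 0.3, size=(100, 2))
far = np.array([[20.0, 20.0], [-20.0, 15.0]])
X = np.vstack([blob1, blob2, far])

# eps is the neighborhood radius, min_samples plays the role of MinPts.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Cluster members get labels 0, 1, ...; noise points get -1.
```

Because the two far points have fewer than `min_samples` neighbors within `eps`, they are never density-reachable from either blob and are reported as noise.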
The main drawbacks of most existing approaches to anomaly detection are summarized in [19], [20]. These approaches are not optimized for detecting anomalies; as a consequence, they are often not efficient enough, which leads to too many false alarms (normal instances identified as anomalies) or too many missed anomalies, and many existing methods are limited to low-dimensional data and small data sizes due to their legacy algorithms. Based on this overview, we compare our method with these methods with respect to our statement of the task. The main contributions of the paper, in continuity with [20], are:
− We construct a latent feature as a simple combination of two features in order to conduct a logical analysis based on feature pairs;
− We propose an expert-based model that leverages the latent feature to solve the problem of logical inconsistencies;
− We suggest how to utilize this model with a general example, and we compare the results with those of existing methods, even though many of them are suited to other tasks. For example, DBSCAN is a clustering algorithm, but it can also separate noisy clusters [21].

METHOD
A method is proposed for detecting logically invalid values in pairs of quantitative features in the descriptions of sample objects of the studied subject area; it is focused on detecting errors in the input data [3]. Dividing the feature values by the medians or means of the respective features makes the method invariant to the measurement scale. Since the median is a random variable, its values for a real data sample can be considered and interpreted according to the law of large numbers [22]: as the sample size grows, the expectation of the median tends to a stable value, i.e., it is an unbiased estimate.
Let an object set E0 = {S1, . . ., Sm} be given, described by a set of features X = (x1, . . ., xn) of different types, and let I and J denote the index sets of the quantitative and nominal features, respectively. For each pair of features (xi, xj) ⊂ X(n), i ≠ j, i, j ∈ I, we calculate the latent feature for the pair as in (1):

z_k = (a_ki / P_i) / (a_kj / P_j), (1)

where Sk = (a_k1, . . ., a_kn), Sk ∈ E0, and P_i, P_j are the values of the medians of the features xi, xj on the set E0. The boundaries of the interval are calculated as in (2):

z1 = min over Sk ∈ E0 of z_k, z2 = max over Sk ∈ E0 of z_k. (2)

Checking the allowability of nominal features is determined through the membership of their gradations in a finite set of values. The use of dimensionless quantities for calculating the boundaries of feature ratios according to (2), together with data visualization, allows an expert to interactively determine the correctness of object descriptions and enter information into the knowledge base of the subject area. To inform the user about the interval [z1; z2], it is recommended to split it into a given number of parts (for example, 10 or 100); the choice of the number of parts depends on the size of the original dataset or on the recommendations for exploratory data analysis. An object is suspected of being anomalous when its values fall into the extreme (leftmost or rightmost) parts of the interval [z1; z2]. The method is implemented in three stages [23]: i) calculating the latent feature (1) and the interval boundaries (2) for each pair of quantitative features; ii) removing invalid objects from E0 and forming the set of intervals Z = ∪_{i,j ∈ I, i ≠ j} [z1; z2] according to (2); iii) for every new object Su whose value (a_ui/P_i)/(a_uj/P_j) falls outside or near the boundaries of the corresponding [z1; z2], making an expert decision on its allowability (inallowability).

iForest has time complexities of O(tψ log ψ) in the training stage and O(nt log ψ) in the evaluating stage, where t is the number of trees, ψ the number of samples, and n the testing data size [12]. DBSCAN scans the whole dataset only once but needs to calculate the distance between every pair of objects; the time complexity of the original DBSCAN algorithm is therefore high, O(n^2), although with efficient indexing structures it can be reduced to O(n log n). The proposed method has a time complexity of O(n · m(m−1)/2) in the evaluating stage, where m is the number of features and n the testing data size. The training stage complexity depends on the specialist's decisions; one iteration of the training stage costs O(m(m−1)/2).
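A minimal sketch of the proposed pairwise check, under the assumption that the latent feature (1) is the ratio of median-normalized values and that the boundaries (2) are the sample minimum and maximum; the function names and the example data are illustrative, and the 3h margin mirrors the "dangerous proximity" rule used in the experiments:

```python
import numpy as np

def latent_feature(X, i, j):
    """Median-normalized ratio for the feature pair (i, j), cf. Eq. (1)."""
    Pi, Pj = np.median(X[:, i]), np.median(X[:, j])
    return (X[:, i] / Pi) / (X[:, j] / Pj)

def pair_boundaries(X, i, j):
    """Interval [z1, z2] over the reference sample, cf. Eq. (2)."""
    z = latent_feature(X, i, j)
    return z.min(), z.max()

def check_object(a, X, i, j, parts=100):
    """Classify one object against the pair interval:
    'critical'   - latent value outside [z1, z2];
    'suspicious' - within 3 steps h = (z2 - z1)/parts of a boundary;
    'ok'         - otherwise."""
    z1, z2 = pair_boundaries(X, i, j)
    h = (z2 - z1) / parts
    Pi, Pj = np.median(X[:, i]), np.median(X[:, j])
    v = (a[i] / Pi) / (a[j] / Pj)
    if v < z1 or v > z2:
        return "critical"
    if v < z1 + 3 * h or v > z2 - 3 * h:
        return "suspicious"
    return "ok"

# Hypothetical reference sample: height (cm) and weight (kg).
X = np.array([[150, 50], [160, 55], [170, 60],
              [180, 70], [175, 65], [165, 58]], dtype=float)
status = check_object(np.array([100.0, 250.0]), X, 0, 1)  # 1 m, 250 kg
```

Each value is individually plausible for some feature range, but the pair's ratio falls far outside the interval learned from the reference sample, so the object is flagged for expert review.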

RESULTS AND DISCUSSION
We conduct our experiments on the dataset "Kalahari Kung San people", collected by Nancy Howell and publicly available [24], [25]. It contains information on 545 people, each described by four features: height, weight, age, and sex; we use only the numerical features. Before exploring, we omit logically incorrect and missing rows; the modified version of the dataset may still contain some anomalous objects, but we assume it has no logically incorrect instances during our simulations. Table 1 depicts the calculated interval boundaries for each pair of features and the objects lying on the boundaries of the interval for each pair. Table 2 presents examples of the results of checking for invalid objects, with an indication of the error status: "Critical error" means the value of the latent feature (1) lies outside the interval; "Possibly a mistake" means the value of (1) lies in "dangerous" proximity to the interval boundaries, i.e., within [z1; z1 + 3h] or [z2 − 3h; z2], where h is the step of dividing the interval into a given number of parts. In this experiment, h = 1% for testing purposes. Moreover, we omit one object in each iteration and test the skipped object for anomaly. Because one test object is omitted each time, the intervals vary in each row of Table 2.
Figure 1 illustrates an example of objects located close to the critical ranges. The two lines indicate the upper and lower borders of the interval based on pair feature values; we calculate them using 3h, h = (z2 − z1)/100, where 3h is the gap between the lines and the boundaries. Objects outside the two lines are considered outliers. To compare methods for finding anomalous objects, we artificially add 20 anomalous objects to the original data [24]. We experiment with two forms of data: in the first form we use all the data; in the second form we fit the method on the original data and then use the trained model to identify anomalous objects. Table 3 presents the test results. Since the proposed method works with pairs of features, during testing an object is considered anomalous if it is anomalous with respect to any pair of features. Experimenting with separated data is not supported for DBSCAN in the Scikit-learn library. The proposed method and Isolation Forest both reach 100% accuracy, so we use precision and the F1 score to make the comparison more precise. According to Table 3, DBSCAN outperforms the proposed method in most tests with respect to the F1 score, but the proposed method outperforms DBSCAN in many test cases with respect to precision. Since our task is to find anomalous objects in the dataset, by the definition of the precision score (the positive predictive value) we can conclude that the proposed method finds anomalous objects better than DBSCAN does. Moreover, we provide another example, similar to this experiment but in contrast to the previous one: we use only two features to find anomalous objects with DBSCAN and the proposed method. This experiment ended with the same average scores in every test, F1 = 0.9390, precision = 1.00, and recall = 0.8864, after we removed the feature "age" and some rows from the anomaly datasets to fit our setup. During this experiment, we tuned the value of h individually for each test to obtain high performance from the proposed method; that is why the results are identical across tests. In contrast, we did not apply such tuning in the previous experiment, since our main aim there was to illustrate the method simply.
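The comparison in Table 3 rests on precision, recall, and the F1 score; with binary anomaly labels these follow directly from scikit-learn's metric functions. The label vectors below are made up for illustration and are not the paper's actual predictions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical ground truth and predictions: 1 = anomaly, 0 = normal.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of the two
```

High precision means few normal objects are falsely flagged, which is the property the proposed method is argued to have; F1 additionally penalizes missed anomalies.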

Discussion
The construction of logical intervals for a pair of features in practical problems is considered separately for each subject area. As in the example in the Introduction, a person cannot be 1 meter tall and weigh 250 kilograms; however, in other subject areas, for example animals, a 1-meter-tall animal can weigh 250 kilograms or even more. For this reason, we added the phrase "subject areas" to the title. We also included the term "describing objects" in the title because the same objects can be used to build different boundaries, for example, when objects are grouped.
The proposed model allows constructing intervals for a pair of features with the help of experts, which makes it possible to determine an allowable range of values for the pair. Therefore, the proposed method can be used in the process of filling the knowledge bases of subject areas. The method is nonparametric and has linear computational complexity, which makes it applicable to big data problems. The main limitation of the proposed work is the need for an expert to build the knowledge; this is a common limitation elsewhere in machine learning as well, such as labeling data in classification tasks.
In contrast, other approaches mainly focus on finding anomalous objects in the data. For example, DBSCAN is a clustering algorithm based on density-level estimation with many modifications. One of its most significant drawbacks is its reliance on distances, which may suffer from the curse of dimensionality. However, this limitation does not affect the results in this paper, because we used only three features.
During the experiments, we made several trade-offs on the parameters of both DBSCAN and the proposed method to obtain acceptable accuracies. However, we paid little attention to the Isolation Forest method, because the nature of this algorithm requires more objects and features in the dataset to achieve good outcomes, and it may not match our setup. In practice, subject-area experts are required to build the ranges that identify anomalous objects with the proposed model, so if the ranges on the latent features are built properly, the results of the proposed method can be more accurate; indeed, the accuracies of the proposed method did not dominate in all experimental cases. We used the DBSCAN implementation in the scikit-learn library [26] with parameters ϵ = 0.1 and min_samples = 5.

CONCLUSION AND FUTURE WORK
A method for searching for logically incompatible quantitative data in descriptions of subject-area objects over sets of two features has been proposed. Using the median values of the analyzed features gives the method invariance to the measurement scale of the features, which widens its range of application. The general use of the method has been illustrated by forming a knowledge base about the incompatibility of values based on intervals of paired features from subject-area data. This work considered the space of quantitative features. Future research will aim to adapt the method to other types of feature spaces, to reduce the specialist's decision-making role, and to correct the detected logical errors.
Int J Artif Intell, Vol. 13, No. 1, March 2024: 329-336. Choosing allowability boundaries for describing objects in subject areas (Musulmon Lolaev)

Table 1. The boundaries of a latent feature formed from pair features

Table 2. Examples of defining objects with potentially invalid values

Table 3. The comparison results of identifying anomalous objects by methods