Fish survival prediction in an aquatic environment using random forest model

Received Oct 31, 2020; Revised May 5, 2021; Accepted May 20, 2021

In practice, it is very difficult for fish farmers to select the most suitable fish species for aquaculture in a specific aquatic environment. The main goal of this research is to build a machine learning model that can predict the most suitable fish species for a given aquatic environment. In this paper, we use a model based on random forest (RF). To validate the model, we use a dataset of aquatic environment measurements for 11 different fish species. To predict the fish species, we use characteristics of the aquatic environment, namely pH, temperature, and turbidity. As performance metrics, we measure accuracy, true positive (TP) rate, and the kappa statistic. Experimental results demonstrate that the proposed RF-based prediction model achieves an accuracy of 88.48%, a kappa statistic of 87.11%, and a TP rate of 88.5% on the tested dataset. In addition, we compare the proposed model with the state-of-the-art models J48, naïve Bayes (NB), k-nearest neighbor (k-NN), and classification and regression trees (CART). The proposed model outperforms these models, exhibiting higher accuracy, TP rate, and kappa statistic.


INTRODUCTION
Aquaculture refers to the farming of aquatic animals or plants, primarily for food. It involves the breeding, rearing, and harvesting of fish, mollusks, crustaceans, and plants in freshwater and saltwater environments. The practice began in China about 4,000 years ago, and global production continues to be dominated by China and other Asian countries. Aquaculture is used to produce food by some of the poorest communities around the globe as well as by major corporations. Globally, aquaculture now supplies more than half of all seafood consumed by humans, a proportion that continues to rise as the world population grows. According to the Food and Agricultural Organization (FAO) [1], 3 million tons of food were produced by aquaculture in the 1970s, a figure that rose steadily to over 80 million tons in 2017.
Manual fish classification is a complex and tedious task for those who are not specialists. Fish species are involved in many industrial and agricultural sectors, as well as in the manufacture of foodstuffs, and are used as food that is vital to humans [2]. Marine biologists classify fish by their traits and have also used classification trees for this purpose, which led them to apply machine learning to the structure in the data, saving time and effort and increasing the speed of fish classification [3].

ISSN: 2252-8938
Fish survival prediction in an aquatic environment using random forest model (Md. Monirul Islam)

Fish classification is the identification of fish species based on their physical characteristics or similarities. It can also be described as the technique of determining the type of a fish [4]. Classification of fish is important for numerous reasons, including pattern matching and feature extraction, identification of physical or behavioral characteristics, statistical control, and quality assessment applied to fish of all kinds [5]. Moreover, fish classification is regarded as an important task for fisheries and population assessments [6].
On the other hand, automated fish classification can speed up the process and improve the accuracy of classifying or identifying fish species. Several approaches to automated fish species identification have been introduced in the literature. In this paper, we perform classification using machine learning models including J48, random forest (RF), k-nearest neighbor (KNN), and classification and regression trees (CART). Classification is used for prediction, whereas traditional rule-based algorithms offer no way to predict on unseen data. A confusion matrix yields several measures of prediction accuracy, which rule-based algorithms cannot provide [7]. A convolutional neural network (CNN) is a deep learning model whose computational complexity is higher than that of traditional machine learning models and which needs much more training time. We therefore consider only machine learning algorithms in this paper because of their lower computational complexity.
In this paper, we propose a fish survival prediction model for an aquatic environment based on random forest. The rest of the paper is organized as follows. Section 2 reviews the literature. Section 3 discusses the proposed model. Section 4 describes the experimental setup and analyzes the results. Finally, section 5 discusses the findings of the paper.

LITERATURE REVIEW
The literature reports a number of works related to decision support systems for aquaculture farm operations. Several decision support systems have been developed; some use machine learning methods and others do not. An automatic fish identification method is proposed in which color and texture features are extracted from fish images [8]. A framework is introduced that uses real-time water quality indicators and operational information to evaluate the impact on survival rate, biomass, and production failure of aquaculture species [9]. A prediction model for aquatic creatures is presented that uses a single water feature, dissolved oxygen (DO) [10]. A hardware system is built for monitoring water quality factors including pH, temperature, and dissolved oxygen [11]. An IoT device is proposed for detecting and controlling water factors including pH and temperature; however, the data are not analyzed [12]. A regression model is used to predict water quality for fish cultivation; however, prediction accuracy is not considered [13]. An automated strategy for fish identification is developed based on a support vector machine and the k-means clustering algorithm [14]. A robust automated Nile tilapia classification approach is proposed in [15], in which scale-invariant features of the fish are extracted and then fed to a support vector machine.
Hatchery production management is addressed using rules and calculations of physical, chemical, and biological processes [16]. A scientific model is developed to evaluate environmental impact [17]. Rules are hand-crafted by domain experts [18]. A machine learning method is presented to balance farm closure and farm opening events [19]. A feature ranking algorithm is presented to identify the most influential cause of closure [20]. Time-series approaches such as principal component analysis (PCA) and the autocorrelation function (ACF) are adopted to predict the closure event [21]. A set of rules is extracted from data gathered by sensor networks to find associations between environmental variables and algae growth [22]. An ensemble method is designed to find the environmental variables responsible for algae growth and to predict that growth [23]. A machine learning method is developed to predict the propagation of algae patches along a waterway [24].

PROPOSED MODEL
Figure 1 shows a detailed block diagram of the proposed model. First, we import our dataset. In the preprocessing stage, we filter and resample the dataset. In the classification stage, we select the random forest (RF) classifier as our model and also run the other machine learning models for comparison. After classification, the classifier output is produced.

Description of dataset
The data used in this study consist of aquatic environment parameters for fish farming and were obtained from the University of Dhaka, Faculty of Fisheries, Dhaka, Bangladesh. There are 191 instances with 4 attributes: pH, temperature, turbidity, and fish. We choose pH, temperature, and turbidity as feature attributes and fish as the target attribute. The dataset is thus partitioned into two parts: aquatic environment characteristics and fish species. The target attribute covers 11 fish species: katla (14 instances), shing (17), prawn (14), rui (19), koi (15), pangas (22), tilapia (25), silver carp (7), carpio (33), magur (11), and shrimp (14).
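As a quick sanity check on the class distribution, the per-species counts listed above can be summed and compared. This is a minimal sketch; the dictionary layout and helper name are illustrative choices, not part of the original study.

```python
# Instances per species, as listed in the dataset description.
species_counts = {
    "katla": 14, "shing": 17, "prawn": 14, "rui": 19, "koi": 15,
    "pangas": 22, "tilapia": 25, "silver carp": 7, "carpio": 33,
    "magur": 11, "shrimp": 14,
}

def class_shares(counts):
    """Return each class's share of the dataset as a fraction."""
    n = sum(counts.values())
    return {name: c / n for name, c in counts.items()}

total = sum(species_counts.values())
shares = class_shares(species_counts)
largest = max(species_counts, key=species_counts.get)
smallest = min(species_counts, key=species_counts.get)

print(total, largest, smallest)
```

The counts add up to the 191 instances stated for the dataset, and the classes are noticeably imbalanced (carpio is the largest class, silver carp the smallest), which is worth keeping in mind when reading per-class results.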
Aquatic environment characteristics: We utilized pH, temperature, and turbidity as aquatic environment parameters in our study.
− pH: pH is important in aquaculture as a measure of the acidity of the water or soil. The optimal pH for fish is between 6.5 and 9. Fish grow poorly, and reproduction is affected, at consistently higher or lower pH levels [25]. For warm-water pond fish, pH 4 is the acid death point, 4 to 5 gives no reproduction, 5 to 6.5 gives slow growth, 6.5 to 8.5 is the desirable range, 9 to 10 gives slow growth, and ≥11 is the alkaline death point.
− Temperature: The growth and activity of fish depend on their body temperature. The body temperature of a fish is about the same as the water temperature and varies with it. Each fish species is adapted to grow and reproduce within well-defined ranges of water temperature, but optimal growth and reproduction take place within narrower ranges. It is therefore important to know the water temperatures available at a fish farm well in order to select the right species and plan its management accordingly. Table 1 shows the thermal range of some common fish species [26].
− Turbidity: Turbidity is the property of water that restricts light penetration and limits photosynthesis. It is the combined effect of several factors such as suspended clay particles, dispersed plankton organisms, particulate organic matter, and pigments produced by the decomposition of organic matter. Turbidity in the range of 30-80 cm is acceptable for fish health [27].
− Fish species: In our dataset, we utilized a total of 11 fish species as the target variable.
The fish species in our dataset are presented in Figure 2: carpio in Figure 2(a), katla in Figure 2(b), rui in Figure 2(c), koi in Figure 2(d), magur in Figure 2(e), pangas in Figure 2(f), prawn in Figure 2(g), silver carp in Figure 2(h), tilapia in Figure 2(i), and shing in Figure 2(j).
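The pH ranges quoted above for warm-water pond fish can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, the handling of shared endpoints is an assumption, and the text does not say what happens between 8.5 and 9 or between 10 and 11, so those gaps are reported as unspecified rather than guessed.

```python
def ph_zone(ph):
    """Map a pH reading to the warm-water pond fish zones quoted in the text.

    Which side a shared endpoint belongs to is an assumption; the text
    lists ranges with overlapping endpoints.
    """
    if ph <= 4.0:
        return "acid death point"
    if ph < 5.0:
        return "no reproduction"
    if ph < 6.5:
        return "slow growth"
    if ph <= 8.5:
        return "desirable"
    if ph < 9.0:
        return "not specified in the text"
    if ph <= 10.0:
        return "slow growth"
    if ph < 11.0:
        return "not specified in the text"
    return "alkaline death point"

for reading in (3.5, 5.5, 7.2, 9.5, 11.2):
    print(reading, ph_zone(reading))
```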

Preprocessing
In the preprocessing step, we filtered our dataset using the resampling option to inspect the relation between the instances and attributes of the dataset. In the attribute selection window, we can check the missing, unique, and distinct values of each attribute. All attributes show 0% missing values; pH has 28 unique values, temperature has 22 unique values, turbidity has 56 unique values, and fish has 11 distinct values.
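The attribute summary described above can be reproduced outside the tool with a few lines. The sketch below assumes the usual convention in which "distinct" counts the different values present and "unique" counts the values that occur exactly once; the readings are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

def summarize(values):
    """Return (missing %, distinct count, unique count) for one attribute.

    distinct = number of different non-missing values
    unique   = number of values that occur exactly once
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    missing_pct = 100.0 * (len(values) - len(present)) / len(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return missing_pct, distinct, unique

# Hypothetical pH readings; None marks a missing value.
ph_readings = [7.0, 7.5, 7.0, 8.1, None, 6.8]
summary = summarize(ph_readings)
print(summary)
```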

Classification
In the classification stage, we classified our dataset using five different classifier models. Random forest (RF) outperforms the other models described.

Random forest (RF)
RF is a supervised learning method based on decision trees. As the name suggests, a random forest classifier is an ensemble of decision trees in which each classifier is generated using a random vector sampled from the input vector [28], and each tree casts a unit vote for the most popular class to classify an input vector. It is almost always trained with the bagging method. The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set X = x1, ..., xn with responses Y = y1, ..., yn, bagging repeatedly (A times) selects a random sample with replacement from the training set and fits trees to these samples. For a = 1, ..., A:
− Sample, with replacement, n training examples from X, Y; call these Xa, Ya.
− Fit a tree to Xa, Ya.
The general idea of the bagging method is that combining learning models improves the overall result. The random forest is less sensitive than other mainstream machine learning classifiers to overfitting and to the quality of the training samples [29]. Figure 3 shows the concept of the random forest model. Tree 1 and Tree 2 predict Class A, so the majority vote in Figure 3 is Class A and the predicted output is Class A.
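The sampling step of bagging can be sketched as follows. This covers only the drawing of the A bootstrap replicates; fitting a tree on each replicate is left out. The function names, the seed, and the toy rows are illustrative assumptions, not values from the dataset.

```python
import random

def bootstrap_sample(X, Y, rng):
    """Draw n (x, y) pairs with replacement from a training set of size n."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [Y[i] for i in idx]

def bagging_rounds(X, Y, A, seed=0):
    """Return A bootstrap replicates (X_a, Y_a); a tree would be fit on each."""
    rng = random.Random(seed)
    return [bootstrap_sample(X, Y, rng) for _ in range(A)]

# Toy rows of (pH, temperature, turbidity) with made-up values and labels.
X = [(7.1, 26.0, 40.0), (6.8, 24.5, 55.0), (8.0, 28.0, 35.0), (7.5, 27.0, 60.0)]
Y = ["rui", "koi", "tilapia", "rui"]
replicates = bagging_rounds(X, Y, A=3)

for X_a, Y_a in replicates:
    print(Y_a)
```

Because sampling is with replacement, a replicate typically repeats some rows and omits others, which is what makes the individual trees differ from one another.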
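The voting step can be illustrated in a few lines. In this sketch the third tree's vote for Class B is an assumption made for the example, since only the votes of Tree 1 and Tree 2 are stated in the text.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]

# Tree 1 and Tree 2 vote Class A; the third vote is hypothetical.
votes = ["Class A", "Class A", "Class B"]
predicted = majority_vote(votes)
print(predicted)
```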

Classifier output
In the classifier output, we can see the performance of our model and of the other state-of-the-art models. By selecting the described model, we can inspect its results, including the detailed accuracy by class. Figure 4 shows these performance results. We did not find any existing machine learning model that uses RF for fish environment monitoring, and the dataset we used is our own. Figure 4 reports an average TP rate of 0.885, FP rate of 0.013, precision of 0.890, recall of 0.885, F-measure of 0.879, MCC of 0.871, ROC area of 0.981, and PRC area of 0.929. Correctly classified instances amount to 88.48% and incorrectly classified instances to 11.52%, with a kappa statistic of 0.87, mean absolute error of 0.04, root mean squared error of 0.13, relative absolute error of 24.53%, and root relative squared error of 45.46%.

EXPERIMENTAL SETUP AND RESULT ANALYSIS
For data analysis, we used the WEKA tool to run the proposed model and the other models described. The tool is well suited to this kind of analysis and has many techniques built in. For every model, we used 10% of the instances of each species for testing and 90% for training.
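A per-species 90/10 split of this kind can be sketched as below. This is not the tool's own procedure: the function name, the seed, and the rounding rule (round to nearest, with at least one test instance per class) are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.10, seed=0):
    """Split instance indices into train/test, holding out a share per class."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)
        # Assumption: round to nearest, but keep at least one test instance.
        n_test = max(1, round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# Three of the species, using their instance counts from the dataset.
labels = ["carpio"] * 33 + ["tilapia"] * 25 + ["silver carp"] * 7
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each species keeps small classes such as silver carp represented in the test portion, which a purely random split of the whole dataset would not guarantee.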

Performance metrics
Performance metrics are the basis for comparing classifiers and selecting the best one. We applied three performance metrics: accuracy, TP rate, and the kappa statistic. Each is calculated from the confusion matrix produced at every classification step. Accuracy is the total number of correctly classified instances divided by the total number of instances; it is derived from the confusion matrix as given in (4). TP rate is the second performance metric of our study and is calculated by (3). The kappa statistic is the last metric and is computed by (5). The higher the kappa statistic, the better the accuracy level of the model. A general view of the confusion matrix is illustrated in Table 2.
We used the Waikato environment for knowledge analysis (WEKA) to process the data. The proposed RF model achieves an accuracy of 88.4817%, an average TP rate of 88.5%, and a kappa statistic of 87.11%, so all three metrics indicate good performance. We compared these metrics for our proposed model against other state-of-the-art models. We used five models in our experimental work: random forest, J48, naïve Bayes, KNN, and CART. Table 3 gives a detailed comparison of all the models. As Table 3 shows, random forest (RF) gives the highest score on every metric, with an accuracy of 88.48%, a kappa statistic of 87.11%, and a true positive (TP) rate of 88.5%. The second highest score belongs to the KNN model, with an accuracy of 85.79%, a kappa statistic of 84.05%, and a TP rate of 85.8%. J48 takes third place with an accuracy of 73.16%, a kappa statistic of 69.88%, and a TP rate of 73.2%. CART is fourth with an accuracy of 64.21%, a kappa statistic of 59.80%, and a TP rate of 64.2%.
Naïve Bayes (NB) gives the lowest performance, with an accuracy of 56.84%, a kappa statistic of 51.60%, and a TP rate of 56.88%. NB classifies only 108 of the 191 instances correctly and cannot classify silver carp at all. Naïve Bayes is a probabilistic machine learning algorithm that assumes the features are independent of each other. In the real world, however, features depend on each other, which explains why it gives lower accuracy than the other classifier models. Adding multiple classifiers to the model would increase the computational complexity, and for our tested dataset the proposed model already gives a significant result.
These performance metrics are shown graphically in Figure 5, with three colored curves for the three metrics. The blue curve shows accuracy, the maroon curve in the middle shows the kappa statistic, and the green curve shows the TP rate. The figure shows that the proposed RF model gives the highest score in all categories of performance metrics.
All terms in this figure are given in short form. RF stands for random forest, CART for classification and regression tree, NB for naïve Bayes, KNN for k-nearest neighbor, J48 for the decision tree classifier, and KS for kappa statistic. Every point for the RF model sits at the top position, with an accuracy of 88.48%, a KS of 87.11%, and a TP rate of 88.5%.
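The three metrics can be computed from a confusion matrix as sketched below. Equations (3) to (5) are not reproduced here, so the code uses the standard textbook definitions, with rows taken as actual classes and columns as predicted classes; the 3-class matrix is a toy example, not data from the study.

```python
def accuracy(cm):
    """Correctly classified instances divided by all instances."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def tp_rates(cm):
    """Per-class TP rate (recall): TP / (TP + FN), rows = actual class."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def weighted_tp_rate(cm):
    """TP rate averaged over classes, weighted by class size."""
    total = sum(sum(row) for row in cm)
    return sum(r * sum(row) / total for r, row in zip(tp_rates(cm), cm))

def cohen_kappa(cm):
    """Kappa = (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(k)) / total
    row_tot = [sum(cm[i]) for i in range(k)]
    col_tot = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / total ** 2
    return (p_o - p_e) / (1 - p_e)

# Toy confusion matrix for three classes (hypothetical counts).
cm = [[8, 2, 0],
      [1, 7, 2],
      [0, 1, 9]]
print(accuracy(cm), weighted_tp_rate(cm), cohen_kappa(cm))
```

The class-size-weighted average TP rate is algebraically the same as accuracy, which is consistent with the two values being reported as nearly identical throughout the results.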
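The independence assumption behind naïve Bayes can be made concrete with a short sketch. It assumes a Gaussian density per numeric feature, which is one common choice and not necessarily the configuration used in the experiments; the class names, priors, means, and standard deviations are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def nb_score(features, prior, params):
    """Unnormalized posterior: prior times the product of per-feature densities.

    params holds one (mean, std) pair per feature. Treating the features as
    independent given the class is what allows this simple product.
    """
    score = prior
    for x, (mean, std) in zip(features, params):
        score *= gaussian_pdf(x, mean, std)
    return score

# Hypothetical per-class parameters for (pH, temperature, turbidity).
classes = {
    "species_a": (0.5, [(7.0, 0.5), (26.0, 2.0), (40.0, 10.0)]),
    "species_b": (0.5, [(8.0, 0.5), (30.0, 2.0), (60.0, 10.0)]),
}
sample = (7.1, 26.5, 42.0)
scores = {name: nb_score(sample, prior, params)
          for name, (prior, params) in classes.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```

When features such as temperature and turbidity vary together, the product over-counts the shared evidence, which is one way the independence assumption can hurt accuracy.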
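The reported scores can be collected and ranked programmatically to confirm the ordering described above. The values are copied from the text as reported (in percent); the data structure and helper name are illustrative.

```python
# Scores as reported in the text (percent): accuracy, kappa statistic, TP rate.
reported = {
    "RF":   (88.48, 87.11, 88.5),
    "KNN":  (85.79, 84.05, 85.8),
    "J48":  (73.16, 69.88, 73.2),
    "CART": (64.21, 59.80, 64.2),
    "NB":   (56.84, 51.60, 56.88),
}

def rank_by(metric_index):
    """Return model names sorted from best to worst on one metric."""
    return sorted(reported, key=lambda m: reported[m][metric_index], reverse=True)

for name in rank_by(0):
    acc, ks, tp = reported[name]
    print(f"{name:<5}{acc:>8.2f}{ks:>8.2f}{tp:>8.2f}")
```

The ordering is the same on all three metrics: RF, KNN, J48, CART, NB.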

CONCLUSION
We conducted this research to find the best prediction model for fish farmers in an aquatic environment using various aquatic parameters. We used pH, temperature, turbidity, and fish as the attributes of the dataset, with temperature, pH, and turbidity as feature variables and fish as the target variable. We used a total of 11 types of fish: katla, shing, prawn, rui, koi, pangas, tilapia, silver carp, carpio, magur, and shrimp. We measured accuracy, the kappa statistic, and TP rate as performance metrics. We analyzed a total of five supervised machine learning models: random forest (RF), naïve Bayes (NB), k-nearest neighbor (KNN), CART, and J48. Among these models, our proposed model, random forest, shows the best accuracy, kappa statistic, and TP rate, making it the best of the five at predicting fish species for an aquatic environment. Random forest provides an accuracy of 88.48%, a KS of 87.11%, and a TP rate of 88.5%. In future work, the research can be extended by enriching the dataset with more observations and by testing with an artificial neural network.