Deep learning-based modeling of second-hand ship prices in South Korea

ABSTRACT


INTRODUCTION
Ship prices are notoriously volatile because of the peaks and valleys of the shipping industry, and there are always rent seekers attempting to gain financial returns by exploiting the unpredictability of ship prices. This situation has become more apparent during the coronavirus disease (COVID-19) pandemic. Although the ship sale and purchase market thrives on price volatility [1], this volatility also indicates the difficulties that valuation practitioners experience in estimating ship prices. Deep learning algorithms such as neural networks (NNs) have appeared in the valuation literature since the 1990s [2], [3] to boost the predictive accuracy of asset valuation and have now started to replace conventional models in real estate valuation [4], [5]. However, in contrast to the active application of NNs to real estate valuation, few such studies have been conducted in the domain of ship valuation.
In this study, we apply an NN to estimate ship prices. First, we specify an appropriate NN architecture and apply it to tax returns of ship acquisitions to estimate second-hand ship prices. We then compare the NN's prediction accuracy with that of a few baseline models. Finally, we interpret the results produced by NN training to provide useful insights for ship valuation stakeholders. Ship valuation plays a key role in the shipping market, for example in the investment decisions of shipping investors and in ship loan approvals by financial institutions. In contrast to the new building price of a ship, which is essentially a forward contract for the delivery of an age-zero vessel in the future, the price of second-hand ships is remarkably unpredictable and thus more difficult to estimate to a reasonable degree, as explained above. This study aims to estimate second-hand ship prices.

ISSN: 2252-8938. Deep learning-based modeling of second-hand ship prices in South Korea (Changro Lee)
It is well known that deep learning has been disproportionately applied in sectors dominated by unstructured data, such as computer vision and natural language processing. The use of the NN algorithm in this study is expected to promote the adoption of deep learning tools in structured-data-dominant sectors, including the shipping industry. Additionally, the empirical findings identified in the results of the NN may provide useful insights for ship valuation practitioners.
This paper is organized as follows. Section 2 describes background information on ship valuation and the entity embedding technique. Section 3 explains the data used and the NN architecture selected for estimating ship prices. Results and implications are interpreted in section 4. Lastly, conclusions are presented in section 5.

LITERATURE REVIEW

2.1. Ship valuation approach
The literature on ship valuation has largely focused on the time-series properties of ship prices, such as co-integrating relationships between ship prices and the time-charter rate, or market trends in the new building and second-hand prices of ships [6]-[8]. Although this approach can provide an estimate of the value of a standardized generic ship based on a constructed time series of ship values, it cannot produce a price estimate for a specific ship. To provide price estimates at a vessel-specific level, valuation must be performed on micro-scale cross-sectional data such as individual ship sales records. A few studies have used cross-sectional data to generate ship-specific price estimates [1], [9], [10]. In this line of research, one of three valuation methods, namely the income approach, sales comparison approach, and cost approach, is typically employed. The income approach estimates ship prices using discounted cash flow analysis or Monte Carlo simulation techniques based on the freight rate or time-charter rate. Because the income approach is based on cash flow generated in the future, it is typically favored by financial institutions when approving loan applications for ships. In the sales comparison approach, actual transactions between buyers and sellers are collected and utilized for estimating ship prices. Because this method relies directly on market evidence, its estimates are generally accepted as convincing by market participants. The cost approach estimates ship prices by subtracting depreciation from replacement costs and is widely adopted in tax assessment because it is relatively simple and easy to implement. However, estimates from the cost approach tend to deviate from the market value as a ship becomes obsolete. In an ideal market, price estimates from all three valuation approaches tend to converge.
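To make the contrast between the three valuation approaches concrete, the following minimal sketch reduces each one to a single formula. It is an illustration under stated assumptions, not the procedure used by any cited study: straight-line depreciation, a constant discount rate, and the median of comparable sales are simplifications chosen here, and all function names and parameters are hypothetical.

```python
import statistics


def cost_approach(replacement_cost, age_years, useful_life_years, scrap_value=0.0):
    """Replacement cost minus depreciation.

    Assumption: straight-line depreciation down to a scrap value; practitioners
    may use other depreciation schedules.
    """
    age = min(age_years, useful_life_years)
    depreciation = (replacement_cost - scrap_value) * age / useful_life_years
    return replacement_cost - depreciation


def income_approach(annual_cash_flows, discount_rate, resale_value=0.0):
    """Discounted cash flow: present value of projected net cash flows.

    Assumption: one cash flow per year and a constant discount rate.
    """
    present_value = sum(
        cash_flow / (1.0 + discount_rate) ** year
        for year, cash_flow in enumerate(annual_cash_flows, start=1)
    )
    present_value += resale_value / (1.0 + discount_rate) ** len(annual_cash_flows)
    return present_value


def sales_comparison(comparable_prices):
    """Simplest form of the sales comparison approach: median of comparable sales."""
    return statistics.median(comparable_prices)


# Hypothetical inputs, for illustration only.
cost_estimate = cost_approach(replacement_cost=100.0, age_years=5, useful_life_years=20)
income_estimate = income_approach(annual_cash_flows=[110.0], discount_rate=0.10)
market_estimate = sales_comparison([30.0, 38.0, 50.0])
```

The sales comparison function above ignores vessel-specific characteristics entirely; a regression or NN fitted to transaction records can be viewed as a refinement of the same idea that adjusts for those characteristics.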
In this study, the sales comparison approach was used for ship valuation. Its estimates are based on market evidence; thus, they are considered reliable and convincing by stakeholders, as explained earlier.
Additionally, the prices of second-hand ships vary more drastically than new building prices depending on vessel-specific characteristics such as shipbuilding materials, engine manufacturers, and age. This individual heterogeneity cannot be effectively captured by the income approach or cost approach. In contrast, the sales comparison approach can capture heterogeneity; thus, it was adopted for estimating second-hand ship prices in this study.

2.2. Neural network (NN) with entity embedding
Deep learning has achieved excellent performance in utilizing unstructured data such as images, audio, video, and free-form text. For example, it intensively exploits imagery data to operate self-driving cars [11], [12] or detect concrete building defects [13], [14]. In contrast to the unstructured data that are commonly used in deep learning, the dominant data type found in the shipping industry is spreadsheet-like structured data. Ship transaction records, ship inspection certificates, and safety construction certificates are specific examples of structured data observed in shipping businesses. This study relies on such structured data.
Structured data comprise continuous variables and categorical variables. Categorical variables assume values that are names or labels, such as vessel body color and engine type, which can be represented by fixed numbers. These numbers are called levels or elements and provide no meaningful information directly. Categorical variables are observed frequently in the shipping industry. For example, ships are often classified as passenger ships, fishing boats, or cargo ships. Ship valuation agents provide price estimates by considering key factors such as ship type and the presence or absence of specific features (with or without automatic radar plotting aid, and with or without lifeboats). These valuation factors are categorical variables. In the ship valuation literature, most studies have paid little attention to categorical variables or have been reluctant to use them because efficient tools for processing such variables were unavailable. The study in [1] estimated ship prices using a multivariate density estimation method, with deadweight tonnage, age, and the time-charter rate as explanatory variables. All of these are continuous variables that are convenient to use for density estimation. The study in [10] estimated second-hand ship prices using NNs, but it was also limited in that only continuous variables (age, time-charter rate, new building price, and scrap value) were utilized for price estimation. One reason for using only continuous variables is that NNs are natively well suited to processing continuous data but inefficient at handling categorical data.
Categorical variables need to be converted to numerical representations to allow a quantitative model to process them. Several methods have been developed for this purpose. The simplest method, namely the one-hot encoding approach, converts each element in a categorical variable into a new binary column and assigns a value of one or zero to those columns. Although this method has been used commonly in the valuation literature [15], it has two disadvantages. First, when there are high-cardinality variables such as ship types or ZIP codes, one-hot encoding incurs excessive computational demand. Second, it treats different elements of categorical variables completely independently of each other and does not account for their informative interrelations.
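As a small illustration of one-hot encoding and its second disadvantage, the sketch below encodes a hypothetical list of ship types. The helper name `one_hot_encode` and the sample values are assumptions made for this example only.

```python
def one_hot_encode(values):
    """Return the sorted levels and one binary row per input value."""
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)
        row[position[value]] = 1
        rows.append(row)
    return levels, rows


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical sample of ship types.
ship_types = ["fishing boat", "cargo ship", "passenger ship", "fishing boat"]
levels, encoded = one_hot_encode(ship_types)

# Any two different levels are orthogonal, so the encoding carries no notion
# of one ship type being more similar to another.
similarity_between_types = dot(encoded[0], encoded[1])
```

The number of columns equals the number of levels, which is why high-cardinality variables become expensive, and the inner product between any two distinct levels is always zero, which is why interrelations between levels are lost.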
If there is a clear ordering of the elements in a categorical variable, they can be translated into a set of numerical scores based on domain-specific knowledge. For example, ship agents can rank the grade of ship maintenance on an integer scale. A ship in a poor maintenance state may be scored as five, whereas an optimally maintained ship could be given a score of one. This method can be used efficiently in processing ordinal categorical variables [16], but has the drawback of demanding domain knowledge in the form of expert advice or consultation.
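The ordinal scoring described above amounts to a lookup table supplied by a domain expert. In the sketch below, only the two endpoints (optimal = 1, poor = 5) come from the text; the intermediate grade labels and the function name are hypothetical.

```python
# Expert-defined scores. The labels "good", "fair", and "below average" are
# hypothetical placeholders for the intermediate grades.
MAINTENANCE_SCORES = {
    "optimal": 1,
    "good": 2,
    "fair": 3,
    "below average": 4,
    "poor": 5,
}


def encode_maintenance(grade):
    """Map a maintenance grade label to its integer score."""
    key = grade.strip().lower()
    if key not in MAINTENANCE_SCORES:
        raise ValueError(f"unknown maintenance grade: {grade!r}")
    return MAINTENANCE_SCORES[key]


scores = [encode_maintenance(g) for g in ["Poor", "optimal", "fair"]]
```

The table itself is the domain knowledge: a model using these scores inherits whatever spacing between grades the expert chose.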
The entity embedding technique has been proposed as an alternative for extracting meaningful information from categorical variables more efficiently. Entity embedding is a technique for mapping categorical values into a multi-dimensional space with fewer dimensions than the original number of levels. In this space, values with similar function outputs are close to each other [17]. Figure 1 presents an example of categorical data embedding methods. If there is a categorical variable, day of week, each element in it is converted into a separate binary variable under the one-hot encoding approach. Figure 1(a) shows an example of this approach. The problem with it is that the informative relationship between elements is lost during conversion. In contrast, an embedding is a vector representation of a categorical variable, and the example can be represented with four numbers for each element. Figure 1(b) shows this method, and the number four is referred to as the embedding dimension.

In Figure 1(b), the embeddings fall into three groups of mutually similar days (Monday and Tuesday; Wednesday through Friday; Saturday and Sunday), indicating that the entity embedding matrix reasonably captures interrelations between elements within the variable, day of week. Each value in the resultant embedding matrix represents a weight connecting the input layer to the embedding layer in an NN. Therefore, the embedding matrix contains the weights of the embedding layer and can be learned in the same manner as the parameters of other NN layers [17]. Through this representation, rich information among elements in a categorical variable can be captured efficiently and subsequently utilized in NN training. Lee [18] applied this approach to real estate valuation.
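The statement that the embedding matrix holds the weights between the input layer and the embedding layer can be checked with a short sketch: multiplying a one-hot vector by the matrix selects exactly one row. The matrix values below are made up for illustration (they are not the values in Figure 1(b) and were not learned by any model), and the function names are hypothetical.

```python
import math

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Hypothetical 7 x 4 embedding matrix (embedding dimension = 4).
EMBEDDING = [
    [0.9, 0.1, 0.0, 0.2],  # Mon
    [0.8, 0.2, 0.1, 0.2],  # Tue
    [0.1, 0.9, 0.2, 0.1],  # Wed
    [0.2, 0.8, 0.1, 0.1],  # Thu
    [0.1, 0.8, 0.3, 0.2],  # Fri
    [0.0, 0.1, 0.9, 0.8],  # Sat
    [0.1, 0.0, 0.8, 0.9],  # Sun
]


def lookup(day):
    """Embedding layer as a table lookup: return the row for this level."""
    return EMBEDDING[DAYS.index(day)]


def one_hot_times_matrix(day):
    """Same result computed as a one-hot row vector times the matrix."""
    one_hot = [1.0 if d == day else 0.0 for d in DAYS]
    n_dims = len(EMBEDDING[0])
    return [
        sum(one_hot[i] * EMBEDDING[i][j] for i in range(len(DAYS)))
        for j in range(n_dims)
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


weekday_pair = cosine_similarity(lookup("Mon"), lookup("Tue"))
mixed_pair = cosine_similarity(lookup("Mon"), lookup("Sat"))
```

With these illustrative values, `weekday_pair` is larger than `mixed_pair`, which is the kind of structure a one-hot encoding cannot express. In a trained NN the rows would be adjusted by backpropagation like any other weights.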
The following problems can be alleviated by using the entity embedding technique. First, excessive computational resource consumption caused by the one-hot encoding of high-cardinality variables can be avoided. Second, different levels of categorical variables can be handled in a meaningful manner instead of being treated completely independently of each other. Third, domain knowledge is not required once an NN is trained to learn the relationships between values of the same categorical variable efficiently. Finally, learned embeddings can be visualized using a dimensionality reduction technique, which can provide useful insights for stakeholders.
Several studies have estimated ship prices [9], [19]-[21], and recently, NNs have begun to be utilized to enhance the predictive accuracy of real estate valuation [22]-[24] and ship valuation [10]. However, most studies have not employed categorical variables for valuation, and even the few studies that have used these variables have not explicitly exploited the advantages of the aforementioned entity embedding technique. In this study, we attempted to fill this research gap by employing categorical variables actively in deep learning-based valuation.
The extracted features, such as the embedding vectors, can be reused in any subsequent model, including a support vector machine, a random forest, or an NN. However, the main focus of this study is to create embedding vectors for categorical variables by using an NN and demonstrate their benefits in the context of asset valuation. Thus, an NN is adopted as the main model in this study.

METHOD

3.1. Dataset
The dataset used in this study contained tax returns for property acquisitions of ships traded in 2018. The ship attributes available in the tax returns include the deadweight tonnage, age, and acquisition price, as reported by taxpayers. Table 1 presents descriptive statistics for the 3,475 ships used in our analysis.
The initial dataset included over 4,000 ships, and the following records were removed during data preprocessing: records with missing values, redundant records, and implausible records such as an acquisition price reported as less than 10,000,000 KRW (approximately 9,000 USD). The median acquisition price for all ships in the dataset is 38,000,000 KRW (approximately 34,000 USD). The median deadweight tonnage is 5 tons and 99.4% of the ships (3,453 of 3,475) are less than 50,000 tons. Overall, this table indicates that small- and medium-sized ships are typically traded and reported for the purposes of property acquisition taxes. Ships with deadweight tonnage values of up to 50,000 tons are generally referred to as "handy-size" vessels and are mainly constructed in shipyards in South Korea, China, Japan, and Vietnam. Province refers to a local government having jurisdiction over the port of ship acquisition; 35.1% of the ships (1,218 of 3,475) were registered in Jeolla Province in 2018. The dominant ship type is a fishing boat (49.6%, 1,725 of 3,475).

This study adopted a fully connected (dense) NN. Input layers were created, and embedding layers for the two high-cardinality categorical variables (province and ship type) were additionally created and joined to the architecture. A proper number of dimensions had to be defined for each embedding layer, and the prediction performance under various dimension sizes was investigated using the usual validation process. The number of dimensions assigned to each categorical variable based on this validation was four, as shown in the last column of Table 2. Three hidden layers were joined to the architecture to include more parameters to capture minor information contained in the data. The final architecture of the NN is schematized in Figure 2.

The numbers in Figure 2 indicate the number of units (neurons) in each layer. That is, the three dense layers following concatenation contain 80, 40, and 10 neurons in order. The output layer is a linear layer with one neuron representing the ship price.
The specific implementation details are as follows. A gradient descent optimizer with momentum was used, together with Glorot initialization with a uniform distribution. A constant learning rate of 0.001 was adopted because the result changed only trivially when the learning rate or learning schedule (e.g., exponential scheduling and power scheduling) was altered. A rectified linear unit (ReLU) activation function was used for all layers except the output layer, which used a linear activation function. The NN was trained for 30 epochs with a batch size of 64, and mean squared error was adopted as the loss function.
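The architecture described above (embedding layers concatenated with the other inputs, dense layers of 80, 40, and 10 ReLU units, and a linear output) can be sketched as an untrained forward pass in plain Python. This is a simplified illustration, not the author's implementation: the class and function names are hypothetical, only the three continuous inputs and the two embedded variables are fed in (the remaining categorical inputs are omitted), the number of ship-type levels (10) is a placeholder, and the embedding tables are initialized with small uniform values as an assumption. Training with momentum and mean squared error is not shown.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initialization for a fan_in x fan_out weight matrix."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def dense(x, weights, biases, relu):
    """Fully connected layer: x @ weights + biases, optionally followed by ReLU."""
    out = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + biases[j]
        for j in range(len(biases))
    ]
    return [max(0.0, v) for v in out] if relu else out


class ShipPriceNet:
    """Untrained sketch: embeddings -> concatenate -> 80 -> 40 -> 10 -> 1."""

    def __init__(self, n_continuous=3, n_provinces=12, n_ship_types=10,
                 embedding_dim=4, seed=0):
        rng = random.Random(seed)
        # Assumption: small uniform initial values for the embedding tables.
        self.province_emb = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                             for _ in range(n_provinces)]
        self.ship_type_emb = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                              for _ in range(n_ship_types)]
        sizes = [n_continuous + 2 * embedding_dim, 80, 40, 10, 1]
        self.layers = [(glorot_uniform(a, b, rng), [0.0] * b)
                       for a, b in zip(sizes, sizes[1:])]

    def predict(self, continuous, province_id, ship_type_id):
        """Return the (log) price estimate for one ship."""
        x = (list(continuous)
             + self.province_emb[province_id]
             + self.ship_type_emb[ship_type_id])
        last = len(self.layers) - 1
        for k, (weights, biases) in enumerate(self.layers):
            x = dense(x, weights, biases, relu=(k != last))  # linear output layer
        return x[0]


net = ShipPriceNet()
# Standardized tonnage, age, and assessed value for one hypothetical ship.
estimate = net.predict([0.5, -1.2, 0.3], province_id=2, ship_type_id=7)
```

Because the weights are random, `estimate` has no meaning until the network is trained; the sketch only shows how the embedding rows enter the dense layers alongside the continuous inputs.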

Results
The mean absolute percentage error (MAPE) was used to evaluate model performance, as expressed by the following equation:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ indicates the observed price, $\hat{y}_i$ denotes the estimated price from the NN, and $n$ is the number of ships evaluated. MAPE is a measure that is frequently used by property valuation agencies [25]. MAPE indicates the prediction error as a percentage and is convenient for comparisons across different valuation models. According to [26], [27], MAPE values greater than 10.0 are typically regarded as inappropriate for loan collateral programs, and a valuation model with MAPE values in excess of 20.0 should not be used in applications involving risk.
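For reference, MAPE in its standard form (the mean of absolute errors relative to the observed values, expressed as a percentage) can be computed as in the following sketch. The function name and the sample prices are hypothetical.

```python
def mape(observed, estimated):
    """Mean absolute percentage error, in percent."""
    if len(observed) != len(estimated) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(y - y_hat) / abs(y) for y, y_hat in zip(observed, estimated))
    return 100.0 * total / len(observed)


# Hypothetical prices: each estimate is off by 10% of the observed value.
error = mape([100.0, 200.0], [110.0, 180.0])
```

Here `error` is approximately 10, which sits at the boundary that the text cites for loan collateral programs.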
The dataset of 3,475 records was randomly split into training (80%) and test (20%) datasets. Table 3 shows the MAPE results based on the test dataset. The performances of a few baseline models are presented for comparative purposes. Specifically, we consider a regression model and an NN without entity embedding layers. One-hot encoding was used for the categorical variables in both the regression model and the NN without entity embedding. As listed in the table, MAPE is generally less than 10.0, indicating that the fitted models do not present serious drawbacks. The MAPE of the NN utilizing entity embedding layers is the lowest by a meaningful margin. Figure 3 presents the distributions of residuals from the three models in Table 3. There are no specifically notable patterns in any of the three models. One clear result is that the range of residuals was reduced remarkably in the case of the NN with entity embedding layers, indicating its excellent prediction performance.
Note: EE stands for entity embedding.

Interpreting learned embedding space
The primary aim of entity embedding is to map similar elements in a categorical variable close to each other in an embedding space. A relevant question then arises: how does the distribution of elements in a categorical variable appear in the embedding space? To present high-dimensional embeddings visually, t-distributed stochastic neighbor embedding (t-SNE) was employed to map the embeddings to a 2D space. t-SNE is a dimensionality reduction technique that produces outstanding visualizations by reducing the tendency to crowd points together in a map [28]. Figures 4 and 5 present the results of the province and ship type embeddings represented in a 2D space.
The province variable had 12 levels and was transformed to an embedding layer with four dimensions. Figure 4 presents the learned province embedding, which is visualized in a 2D space. Three distinct clusters can be observed. One group consisting of Incheon, Chungcheong, and Gyeonggi shares the same location, that is, all three provinces face the west coast of the Korean Peninsula. Another group consisting of Ulsan, Jeolla, Jeju, and Busan also shares the same location, that is, all four provinces in the group are located in the south of the Korean Peninsula facing the south coast. The third group consisting of the remaining five provinces seems to defy a relevant interpretation. However, it is promising that some of the geographically adjacent provinces are clustered together, even though the algorithm was given no geographic information about South Korea. Although interpreting resultant embedding patterns is difficult and necessarily involves subjective judgment, it is for high-cardinality variables such as the province and ship type that the entity embedding technique exhibits noteworthy performance compared to the one-hot encoding approach.

CONCLUSION
Categorical variables are abundant in ship valuation, and the most common approach to handle these variables is to use one-hot encoding; thus, the advantages of entity embedding have not yet been explicitly exploited. We employed the entity embedding technique in this study to improve the performance of an NN by specifying and fitting an NN with entity embedding layers to tax returns for ship acquisitions. The results demonstrated that the NN with entity embedding layers outperformed the baseline models (regression model and NN without entity embedding layers). This improvement can be attributed to the capabilities of the entity embedding layers to capture informative relationships between the elements in each categorical variable. A study limitation is that we only discussed the clustering patterns in a compressed 2D embedding space and did not interpret the values in the embedding matrices. For example, the embedding matrix for the provinces had 12 rows and 4 columns. The element values in this matrix were learned during NN training. Subsequent studies should interpret the values in embedding matrices to provide more explainable deep learning models for stakeholders. Another limitation is the small number of variables used in the valuation model. The variables frequently reported in the literature include the time-charter rate, scrap value, oil price, and London Interbank Offered Rate. These variables are primarily responsible for reflecting changes in the shipping market and need to be utilized in future studies, particularly when the dataset consists of time-series data.

Figure 1. Example of categorical data encoding methods: (a) one-hot encoding and (b) entity embedding

Figure 3. Residual distributions for each model on the test dataset

Figure 4. Province embedding mapped to a 2D space using t-SNE

Table 1. Descriptive statistics for 3,475 ships (transactions in 2018)

Ship prices are affected by various factors, and selecting relevant variables involves a trade-off between valuation theory and data availability. Nine inputs were employed to estimate ship prices, as shown in Table 2. All variables excluding deadweight tonnage, age, and assessed value are categorical data. The continuous variables were scaled to have a mean of zero and a standard deviation of one. The target variable was the acquisition price, which was log-transformed to alleviate a right-skewed distribution. The 3,475 ships were randomly split into training data (80%, 2,780 ships) and test data (20%, 695 ships), and the training data were further divided into two sets for ordinary training (2,224 ships) and validation purposes such as hyperparameter tuning (556 ships). The test data were reserved for the final evaluation of NN performance.

Artif Intell, Vol. 11, No. 3, September 2022: 886-894
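The preprocessing steps described here (standardizing continuous inputs, log-transforming the target, and the two-stage random split) can be sketched as follows. The function names, the use of the population standard deviation, and the random seed are assumptions made for illustration; the source does not state which statistics the scaling was fitted on.

```python
import math
import random
import statistics


def standardize(values):
    """Scale values to mean 0 and standard deviation 1.

    Assumption: population standard deviation; a sample-based variant is
    equally plausible.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def log_transform(prices):
    """Natural log of the acquisition price, to reduce right skew."""
    return [math.log(p) for p in prices]


def split_indices(n_records, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Random split into test, then validation and training from the remainder."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_test = round(n_records * test_fraction)
    test_idx, remaining = indices[:n_test], indices[n_test:]
    n_validation = round(len(remaining) * validation_fraction)
    validation_idx, train_idx = remaining[:n_validation], remaining[n_validation:]
    return train_idx, validation_idx, test_idx


train_idx, validation_idx, test_idx = split_indices(3475)
split_sizes = (len(train_idx), len(validation_idx), len(test_idx))
```

With 3,475 records and both fractions set to 0.2, `split_sizes` reproduces the counts reported in the text: 2,224 training, 556 validation, and 695 test ships.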

Table 2. Input variables

Table 3. Comparisons of model performance