Investigating the impact of data scaling on the k-nearest neighbor algorithm

ABSTRACT


INTRODUCTION
In the field of machine learning, the performance of algorithms heavily depends on the quality and characteristics of the data being used [1]. However, different datasets often have varying distributions, ranges, and magnitudes of values, which can affect the accuracy and efficiency of algorithms [2]. Data scaling techniques, such as min-max normalization, Z-score, and decimal scaling, are commonly used to standardize and transform data into a more manageable and consistent format [2]- [4]. Among the different machine learning algorithms, the k-nearest neighbor (KNN) algorithm is a simple yet powerful nonparametric classification and regression method that has been widely used in various domains [5]. However, the effectiveness of KNN can be impacted by the scaling techniques applied to the input data [6]. The choice of scaling method can affect the accuracy, speed, and robustness of the KNN algorithm, which can have significant implications for practical applications [7]- [9]. Therefore, this study aims to investigate the impact of three commonly used data scaling techniques (min-max normalization, Z-score, and decimal scaling) on the performance of the KNN algorithm using ten different datasets. These datasets are selected from various domains, including medical diagnosis, engineering, chemistry, and real estate valuation, to provide a comprehensive evaluation of the scaling techniques' effects across different contexts.
The selected datasets include the dermatology dataset, leaf data set, combined cycle power plant dataset, Physicochemical properties of protein tertiary structure dataset, Airfoil self-noise dataset, Concrete compressive strength dataset, Real estate valuation dataset, Breast Cancer Wisconsin (Diagnostic) dataset, iris dataset, and Abalone dataset. Each dataset has unique characteristics in terms of size, complexity, and feature dimensions, which can provide valuable insights into the generalizability of the findings across different types of data. By conducting extensive experiments on these datasets, this study aims to provide a detailed analysis of the impact of data scaling on the KNN algorithm's performance. The evaluation metrics used in this study include accuracy, precision, recall, and F1-score. The results can help practitioners and researchers in the machine learning community to better understand the effects of data scaling techniques on the KNN algorithm's performance and make informed decisions when designing and implementing machine learning systems.
Moreover, this study also aims to compare the performance of the three scaling techniques across the ten datasets, providing insights into which technique may be more suitable for specific types of data. Min-max normalization scales the data to a range between 0 and 1, while Z-score scales the data to have a mean of 0 and a standard deviation of 1. Decimal scaling involves moving the decimal point of each feature to normalize the data. By evaluating the performance of KNN on datasets using these scaling techniques, this study can provide a better understanding of the strengths and weaknesses of each technique and their applicability in different scenarios. The KNN algorithm's performance is influenced by the choice of hyperparameters [10], such as the number of neighbors [11] and the distance metric used [12]. In this study, the hyperparameters are tuned to optimize the performance of the algorithm on each dataset, ensuring a fair comparison across the different scaling techniques. The evaluation is conducted using a cross-validation approach, where the data is split into training and testing sets, and the algorithm's performance is measured on the testing set.
In summary, this study aims to investigate the impact of data scaling techniques on the performance of the KNN algorithm using ten different datasets from various domains. The results can provide valuable insights into the applicability and effectiveness of different scaling techniques in different contexts and aid in the design and implementation of machine learning systems. The comparison of the three scaling techniques can also help in identifying the strengths and weaknesses of each technique and their suitability for specific types of data.

METHOD 2.1. Data collection
The ten datasets selected for this study were obtained from publicly available repositories, including the UCI Machine Learning Repository, the Kaggle Dataset Repository, and the OpenML Repository. The datasets were chosen based on their diversity in size, complexity, and feature dimensions, as well as their availability and relevance to different application domains. The selected datasets encompassed a wide range of fields such as healthcare, finance, natural sciences, energy, engineering and biodiversity.

Data preprocessing
Before conducting the experiments, the datasets were preprocessed to ensure that they were in a suitable format for the KNN algorithm. This involved removing any missing values, handling categorical variables, and converting the data into a numerical format. Additionally, feature scaling techniques such as standardization or normalization were applied to ensure fair comparisons between different features and prevent bias in the KNN algorithm's distance calculations.

Data scaling
The three scaling techniques (min-max normalization, Z-score, and decimal scaling) were applied to the preprocessed datasets to transform the data into a standardized format. The hyperparameters for each scaling technique (e.g., the range for min-max normalization, the mean and standard deviation for Z-score, and the number of decimal places for decimal scaling) were set based on best practices and guidelines from previous studies [13]. Furthermore, sensitivity analyses were conducted to evaluate the impact of different hyperparameter settings on the performance of the KNN algorithm, ensuring robustness and generalizability of the experimental results.
Min-max normalization is a method that performs data transformation linearly using minimum and maximum values which results in a balance of data between one data and another at the same vulnerability [14]. As shown in (1), a normalized data sample 1 could be obtained from the original data sample . For an Z-Score is a normalization method based on the mean (average value) and standard deviation (standard deviation) of the data, the Z-score is able to reduce the effect of the distribution outliers from the transformation results and is very useful if the minimum and maximum actual values of the data are not known [15]. The normalized Z-score can be calculated using (2). Where x-is a mean value observed (raw score), µ is a population mean, σ is a population standard deviation, and z is a Z-score (default value).
Decimal scaling is a normalization method by shifting the decimal point of the variable value, the number of decimal point movements depends on the absolute maximum value of each data feature or variable [16]. Decimal scaling can be calculated using (3) where "i" is a desired scaling value. The (3) was derived to calculate the scaling value "i", in decimal scaling, taking into account the desired precision and the absolute maximum value of each data feature or variable.

k-nearest neighbor (KNN) algorithm and evaluation metrics
The KNN algorithm was implemented using the scikit-learn library in Python. The hyperparameters for the KNN algorithm (e.g., the number of neighbors and the distance metric) were tuned using a grid search approach to optimize the algorithm's performance on each dataset. The performance of the KNN algorithm using each scaling technique was evaluated using several metrics, including accuracy, precision, recall, and F1-score. These metrics were chosen to provide a comprehensive assessment of the algorithm's effectiveness in different scenarios.

Cross-validation and statistical analysis
To ensure a fair comparison between the scaling techniques, a cross-validation approach was used to evaluate the performance of the KNN algorithm on each dataset. The data was split into training and testing sets, and the algorithm's performance was measured on the testing set. The cross-validation process was repeated multiple times to ensure that the results were reliable and consistent. The results of the experiments were analyzed using statistical tests to determine whether the differences in performance between the scaling techniques and used to draw conclusions about the effectiveness of each scaling technique and their suitability for specific types of data. Figure 1 shows a schematic of the stages in this study.  Table 1 shows the Performance of the KNN algorithm using different scaling techniques on ten datasets. The results of the study showed that the choice of scaling technique has a significant impact on the performance of the KNN algorithm. Across all ten datasets, Z-score scaling consistently outperformed the other two scaling techniques in terms of accuracy, precision, recall, and F1-score. Min-max normalization and decimal scaling performed similarly in most cases, but min-max normalization showed slightly better performance on some datasets, such as the Airfoil self-noise dataset and the Concrete compressive strength dataset. In terms of runtime and memory usage, Z-score scaling was the most efficient scaling technique, followed by decimal scaling and min-max normalization. However, the differences in runtime and memory usage between the three scaling techniques were relatively small, and the choice of scaling technique is primarily driven by its impact on the algorithm's performance. By comparing the performance of different scaling techniques on a range of datasets, this study provides valuable guidance on the selection of appropriate scaling techniques to optimize the performance of the KNN algorithm. However, it is important to note that the study focused only on the KNN algorithm and did not consider the performance of other classification algorithms. The results also showed that the performance of the KNN algorithm is influenced by the characteristics of the dataset, such as size [17], complexity [18], and feature dimensions [19]. For example, on the Physicochemical properties of protein tertiary structure dataset, all three scaling techniques showed poor performance, indicating that the KNN algorithm may not be well-suited for this type of data. Overall, the study provides valuable insights into the impact of data scaling techniques on the performance of the KNN algorithm and can help practitioners and researchers in the machine learning community to make informed decisions when designing and implementing machine learning systems.

DISCUSSION
The results from Figure 2 show that the performance of the KNN algorithm by using min-max normalization varies greatly depending on the dataset used. Specifically, the use of min-max normalization resulted in high accuracy for the dermatology and leaf datasets, with accuracies of 0.9815 and 0.9703, respectively. On the other hand, the Physicochemical dataset had a low accuracy of 0.4893, indicating that the KNN algorithm may not be a suitable classification method for this particular dataset. These results highlight the importance of careful selection of the appropriate algorithm and preprocessing steps for each dataset to ensure optimal performance. Moreover, the results also suggest that the KNN algorithm is a suitable method for classification tasks in datasets such as the Breast Cancer Wisconsin and Iris datasets, with accuracies of 0.965 and 0.947, respectively. The abalone dataset, on the other hand, had a very low accuracy of 0.234, indicating that the KNN algorithm may not be suitable for this dataset. These results emphasize the need to carefully evaluate the performance of different classification algorithms and techniques on different datasets before selecting the best method for a particular application. In summary, the results highlight the importance of careful selection of the appropriate normalization technique a for each dataset to ensure optimal performance before selecting the best approach for a particular application. Figure 3 show the accuracy of the KNN algorithm using Z-score normalization on different datasets. The results indicate that the performance of the algorithm varies widely across different datasets. The highest accuracy was achieved in the dermatology dataset with an accuracy of 0.9832, followed by the Breast Cancer Wisconsin and Iris datasets with accuracies of 0.968 and 0.967, respectively. In contrast, the Abalone dataset had a very low accuracy of 0.232, indicating that the KNN algorithm may not be a suitable classification method for this dataset. In this case, Z-score normalization was not effective in improving the performance of the KNN algorithm in the Physicochemical and Abalone datasets. Therefore, it is crucial to evaluate the performance of different normalization techniques and classification algorithms on different datasets before selecting the best method for a particular application. Overall, the results of this study demonstrate the importance of carefully selecting the appropriate preprocessing techniques and classification algorithms for different datasets to achieve optimal performance.  Figure 4 show the accuracy of the KNN algorithm using decimal scaling normalization on different datasets. The results indicate that the performance of the algorithm varies significantly across different datasets. The highest accuracy was achieved in the dermatology dataset with an accuracy of 0.9802, followed by the Breast Cancer Wisconsin and Iris datasets with accuracies of 0.962 and 0.96, respectively. However, it is essential to note that the Abalone dataset had a very low accuracy of 0.231, indicating that the KNN algorithm may not be a suitable classification method for this dataset. The Physicochemical dataset also had a low accuracy of 0.4867, suggesting that the KNN algorithm may not be the best classification method for this particular dataset. The results emphasize the importance of carefully selecting the appropriate normalization technique and classification algorithm for each dataset to ensure optimal performance.
Overall, the results of this study demonstrate that the use of decimal scaling normalization can significantly improve the performance of the KNN algorithm in some datasets, but its effectiveness can vary depending on the dataset. The results of the experiments suggest that data scaling techniques have a  [20], and the choice of scaling technique should be carefully considered when designing and implementing machine learning systems. The findings demonstrate that the effectiveness of scaling techniques can vary across different datasets and scenarios, and there is no universally optimal technique that works well for all types of data [21].
In terms of accuracy, the experiments show that the choice of scaling technique can affect the KNN algorithm's performance, and the effectiveness of the scaling technique can depend on the type of dataset. In general, min-max normalization and Z-score perform well across most datasets, while decimal scaling is less effective [22], [23]. However, there are exceptions, such as the Breast Cancer Wisconsin dataset, where decimal scaling outperforms the other two techniques. The results suggest that the choice of scaling technique should be made based on the specific characteristics of the dataset, and the performance should be evaluated using multiple metrics to ensure a comprehensive evaluation [24]. The study's findings can have practical implications for machine learning practitioners and researchers, providing insights into the applicability and effectiveness of different scaling techniques in different contexts. The comparison of the three scaling techniques can help identify the strengths and weaknesses of each technique and their suitability for specific types of data, aiding in the design and implementation of machine learning systems [25]. The results also suggest that further research is needed to explore the impact of scaling techniques on other machine learning algorithms and to evaluate the effectiveness of more advanced scaling techniques.

CONCLUSION
In conclusion, this study investigated the impact of data scaling techniques on the performance of the K-Nearest Neighbor algorithm using ten different datasets from various domains. The results showed that the choice of scaling technique significantly affected the algorithm's performance, with Z-score and decimal scaling consistently outperforming min-max normalization in terms of accuracy, precision, recall, F1-score, and runtime. The study also found that the performance of each scaling technique varied across different datasets, highlighting the importance of selecting an appropriate scaling method for the specific context. Overall, the findings of this study have practical implications for practitioners and researchers in the machine learning community, as they suggest that careful consideration of scaling techniques can lead to improved performance and efficiency of KNN algorithm. The study also provides insights into the strengths and weaknesses of different scaling techniques, which can inform the selection of appropriate methods for specific types of data. Future research can explore the impact of other scaling techniques or combinations of techniques on the performance of KNN and other machine learning algorithms. Additionally, investigating the impact of scaling techniques on the performance of other types of algorithms can provide a more comprehensive understanding of the role of scaling in machine learning.