Measurement of Centroid Distance in Determining Stunting Clusters

This study evaluates the effectiveness of distance measurement methods in the K-Means clustering algorithm for determining stunting clusters by comparing the Euclidean and Manhattan distances. The goal is to obtain optimal cluster centroids and the closest distances within each cluster. The study uses a sample of 552 records with 3 attributes. The process begins with applying the K-Means algorithm, followed by distance measurement using the Euclidean and Manhattan methods. Iterations are performed until optimal results are achieved. Evaluation is conducted using the Sum of Squared Errors (SSE) to assess the total error within clusters and the Mean Squared Error (MSE) to calculate the average nearest distance within clusters. The results indicate that both the SSE and MSE metrics are effective in identifying cluster quality and provide insights into the accuracy and effectiveness of the Euclidean and Manhattan methods in clustering.


Introduction
Technology for processing data is developing very rapidly, with the aim of discovering patterns and extracting the information stored in that data. In this context, clustering is the process of grouping data objects into several clusters so that the objects within each cluster are highly similar to one another. The K-Means algorithm is a partitional clustering method capable of grouping data by partitioning it into one or more clusters whose members share the same characteristics (Fadilah et al., 2022).
A cluster is a collection of data objects that share the same characteristics; an object whose characteristics differ belongs to another cluster. The cluster center point (centroid) is the starting point for grouping with the K-Means algorithm. Data grouping proceeds by calculating each object's distance to the initial centroid, which serves as the midpoint of cluster formation. The quality of the output produced by K-Means clustering therefore depends heavily on the selection of the initial centroids.
The initial cluster center points (centroids) are selected randomly, after the number of clusters (k) has been determined prior to the analysis; an iterative process then computes the distance from each data point to the nearest centroid (Retno, 2019). When calculating the distance of each data point to every centroid, the K-Means algorithm repeats this step until no data point moves between clusters or the specified iteration limit is reached. Applying the K-Means algorithm produces a centroid value for each cluster in accordance with the clustering requirements.
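As a concrete illustration, the iterative procedure described above (random initial centroids, nearest-centroid assignment, recomputing each centroid as its cluster mean, repeating until no point changes cluster) can be sketched in Python with NumPy. This is a minimal sketch of the generic algorithm, not the authors' implementation; all names are illustrative.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: random initial centroids, then alternate
    nearest-centroid assignment and mean recomputation until stable."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # pick k distinct records as the random initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    prev = None
    for _ in range(max_iter):
        # Euclidean distance of every point to every centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if prev is not None and np.array_equal(labels, prev):
            break  # no record moved between clusters: converged
        prev = labels
        for j in range(k):
            members = data[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

The Manhattan variant differs only in the distance line, as shown later in the paper's comparison of the two measures.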

Flow chart
A flowchart is a diagram that represents an algorithm, workflow, or process in creating a program. Flowcharts are depicted as symbols connected to one another by lines or arrows (Rosaly & Prasetyo, 2019). Using a flowchart to describe the program flow in this research makes it clearer and more concise and reduces errors of interpretation. The flowchart of the K-Means algorithm is shown in the image below. The flow starts by entering the transformed data, then determines the number of clusters and the cluster center points. The cluster results are then calculated by comparing the Euclidean distance and Manhattan distance calculations, and the results of the two methods are grouped around the cluster centers based on the minimum (closest) distance. If any data has moved between clusters, the data mining process is repeated; if no data has moved, the results of the K-Means clustering algorithm are presented and the process finishes.

Results of Implementation of the K-Means Method
In the implementation stage of the K-Means method, determining the stunting clusters begins with collecting data on several attributes, such as age, weight, and height, followed by preprocessing, including filtering the required data and normalizing the values of each attribute. Next, the K-Means model is built by determining the number of clusters and then calculating the distance of each record to the centroids. The model is applied to training and test data to evaluate its accuracy and to identify any errors. Once the model is well trained, the final step is to use it to assign each record to a cluster and to measure the accuracy of the centroid distances using the Euclidean distance and Manhattan distance calculation methods. In the distance calculation process, the two methods are compared to determine which yields the more accurate analysis. Combining the K-Means algorithm with these distance measures can thus provide information for grouping stunted toddlers and support predictions aimed at minimizing the level of stunting in toddlers.

Determination of K Values
The number of clusters is set to k = 3. The K-Means algorithm then groups the data based on this value of k, after which testing is carried out to obtain optimal accuracy results from the distance calculations and to compare the two distance calculation methods.

Initial Centroid Determination
Before calculating distances, the first step is to determine the cluster center points (initial centroids) randomly. After a new centroid value is obtained, the distances from the data to the centroids are recalculated. The iterations continue until the cluster membership no longer changes, i.e., no data moves from one cluster to another.

Manhattan Distance Calculation
The Manhattan distance calculation method was applied to the 552 records. The results of calculating the distance of every record to each of the three data groups/clusters can be seen below.
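For reference, for a record x and a centroid c with attributes indexed by i, the Euclidean distance is sqrt(sum_i (x_i - c_i)^2) and the Manhattan distance is sum_i |x_i - c_i|. A minimal sketch of both measures in Python (illustrative names, not the authors' code):

```python
import numpy as np

def euclidean(x, c):
    """Euclidean distance: square root of the summed squared differences."""
    x, c = np.asarray(x, dtype=float), np.asarray(c, dtype=float)
    return float(np.sqrt(((x - c) ** 2).sum()))

def manhattan(x, c):
    """Manhattan (city-block) distance: sum of absolute differences."""
    x, c = np.asarray(x, dtype=float), np.asarray(c, dtype=float)
    return float(np.abs(x - c).sum())
```

For example, euclidean([0, 3], [4, 0]) is 5.0 and manhattan([1, 2, 3], [4, 6, 3]) is 7.0.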

Algorithm Loop and Results
The final stage of the data grouping process runs the K-Means algorithm through repeated iterations until the data no longer moves between clusters and stable, accurate final data is produced. In this research, the iterations continued until convergence; in the last iteration, the final centroid centers are obtained for grouping each record and determining the cluster results. The final centroid centers from the Euclidean distance calculation show that the centroid value of C1 is higher than those of C2 and C3. From this it can be concluded that C1 corresponds to the predetermined "Mild" category, C2 to the "Severe" category, and C3 to the "Very Severe" category of the clustering results. The results of the last iteration of the Euclidean distance centroid centers can be seen in the table below. After designing and developing the system, the next stage is the Python implementation. The aim of implementing it in Python with Jupyter Notebook is to evaluate whether the system meets the researcher's expectations.

Data Import View
Data import is an important first step in data analysis using Jupyter Notebook with the Python programming language. The data is read using the Pandas library by entering the lines of code listed in the Jupyter Notebook. When using Jupyter Notebook, it is important to ensure that the data file is in the correct directory. At this stage, the dataset is an Excel file containing 552 records and 6 attributes. Next, each data attribute is normalized into a form appropriate for the clustering process. This prepares the data for the machine learning process, thereby increasing the accuracy and performance of the model. The data normalization process can be seen in the figure below.
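The import and normalization step can be sketched with Pandas as follows. The file name and column labels in the commented usage are assumptions for illustration, not taken from the paper.

```python
import pandas as pd

def minmax_normalize(df):
    """Min-max normalization: rescale every column to the [0, 1] range."""
    return (df - df.min()) / (df.max() - df.min())

# Hypothetical usage (file name and column labels are assumptions):
# raw = pd.read_excel("stunting_dataset.xlsx")   # 552 records, 6 attributes
# features = minmax_normalize(raw[["age_months", "weight_kg", "height_cm"]])
```

Min-max scaling keeps all attributes on a comparable [0, 1] scale so that no single attribute (e.g., height in centimeters) dominates the distance calculations.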

Clustering View
The clustering process is carried out in Jupyter Notebook using the Python programming language. It begins by determining the desired number of data groups and creating a K-Means object. After the number of clusters has been set, the process continues by calculating the distance of each record to the centroids using both the Euclidean distance and the Manhattan distance.

Evaluation View
In the evaluation view, the model is evaluated using the Sum of Squared Errors (SSE), called inertia in the context of K-Means clustering. SSE is an evaluation metric that measures how far the data points in a cluster are from the centroid. From the SSE values in Figure 4.7 above, the highest value is 429.83913472119467, followed by 127.54606586829607, down to 16.90792640466273, which is the limiting value for determining n_clusters (the number of clusters); the optimal result from Figure 4.7 is three clusters for the Euclidean distance.
Next, the SSE values for the Manhattan distance, from highest to lowest, can be seen in Figure 4.8 below.

Figure 9. Manhattan Distance Evaluation View
From the SSE values in Figure 4.8 above, the highest value is 423.643482548034, followed by 129.74239383865506, down to 17.099894093634045, which is the limiting value for determining n_clusters; the optimal result from Figure 4.8 is three clusters for the Manhattan distance.
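The SSE (inertia) reported above is the total squared distance of every record to the centroid of its assigned cluster; comparing it across candidate cluster counts is what identifies the optimal k. A minimal sketch (illustrative names, not the authors' code):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of Squared Errors (inertia): total squared Euclidean distance
    of each record to the centroid of its assigned cluster."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = np.asarray(labels)
    # centroids[labels] lines each record up with its own centroid
    return float(((X - centroids[labels]) ** 2).sum())
```

Computed for k = 1, 2, 3, ..., the SSE drops sharply until the useful number of clusters is reached and then flattens; the "elbow" of that curve, three clusters here, is taken as optimal.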

Result Report Display
The results report display shows the results obtained from the Jupyter Notebook, which are saved as an Excel file. Based on Figures 4.11 and 4.12, the data is divided into three clusters: the first cluster is purple, the second blue, and the third yellow. Comparing the cluster results of each distance calculation, the Euclidean distance required a total of 8 iterations, the same as the Manhattan distance with 8 iterations; in terms of iterations, the two distance calculations behave identically.
The results of testing the closest distance for the Euclidean distance and Manhattan distance can be seen in the table below. From the cluster results in Figure 4.13, cluster 0, containing toddlers affected by stunting with "Mild" status, has 169 records for the Euclidean distance with an average age of 38-61 months, and 163 records for the Manhattan distance with an average age of 38-61 months. Cluster 1, toddlers with "Severe" status, has 208 records for the Euclidean distance with an average age of 15-44 months, and 214 records for the Manhattan distance with an average age of 15-58 months. Cluster 2, toddlers with "Very Severe" status, has 175 records for the Euclidean distance with an average age of 0-24 months, and the same count of 175 records for the Manhattan distance with an average age of 0-24 months.

Conclusion
The results show that the distribution of cluster members for the two distance methods is largely similar, although there are slight differences in the number of records belonging to each cluster. This similarity demonstrates the consistency of K-Means when applied with either distance method.
The analysis shows that the Sum of Squared Errors can be used to determine the optimal number of clusters for the K-Means method in this case study of measuring centroid distances to determine stunting clusters. The application of K-Means with the Euclidean distance and Manhattan distance successfully grouped toddlers by the severity of their stunting symptoms (mild, severe, very severe). This demonstrates the potential of the method for further analysis of child health problems and for evaluating stunting reduction efforts.

Figure 2. Data Import View

Figure 4. Data Normalization Process Display

Figure 6. Euclidean Distance Calculation Process
Below, the SSE Euclidean distance values, from highest to lowest, can be seen in Figure 4.7.
ISSN: 2716-3865 (Print), 2721-1290 (Online) Copyright © 2024, Journal La Multiapp, Under the license CC BY-SA 4.0

Figure 10. Euclidean Distance Results Report Display

Figure 12. Cluster Iteration Results on Euclidean Distance

Calculating the Closest Distance to the Cluster Center
Calculation of the closest distance to the cluster center point involves two comparable distance analyses. The analysis yielded the same cluster groupings from the two different distance calculations. The distance calculations used to determine the closest distance to each cluster are the Euclidean distance and the Manhattan distance. The results of calculating the distance of every record to each of the three data groups/clusters can be seen in the table below.
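The closest distance of each record to the cluster centers, under either metric, can be sketched as follows (illustrative names, not the authors' code):

```python
import numpy as np

def nearest_distance(X, centroids, metric="euclidean"):
    """Return, for each row of X, its distance to the closest centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diff = X[:, None, :] - centroids[None, :, :]   # all record-centroid pairs
    if metric == "manhattan":
        dists = np.abs(diff).sum(axis=2)
    else:
        dists = np.sqrt((diff ** 2).sum(axis=2))
    return dists.min(axis=1)
```

Running this with metric="euclidean" and metric="manhattan" over the same records and centroids gives the two per-record closest-distance columns that the tables below compare.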

Table 2. Results of calculating the Euclidean distance from the initial centroid
In the next stage, a new centroid is determined for the Euclidean distance by calculating the mean value of the data in each cluster. The results of the new centroid can be seen in the table below:

Table 3. New Centroid Center for the Euclidean Distance

Table 4. Results of calculating the Manhattan distance from the initial centroid
The new centroid center for the Manhattan distance is determined in the same way as for the Euclidean distance, namely by finding the mean value of the data in each cluster. The results of the new centroid can be seen in the table below:

Table 5. New Centroid Center for the Manhattan Distance

Table 7.
The algorithm then iterates again until the cluster data values reach their final result. The final results of the Euclidean distance and Manhattan distance calculations for grouping the data in the last iteration can be seen in Tables 4.10 and 4.11. Before that, the categories used to determine the members of each cluster can be seen in the table below.
In the last iteration, the K-Means calculation with the Euclidean distance produced the same cluster data values as the previous iteration. From the Euclidean distance calculation, each record is clustered by its minimum distance: cluster one contains 169 records, cluster two 208 records, and cluster three 175 records, for a total of 552 records. The Euclidean distance calculation results from the last iteration can be seen in the table below.

Table 8. Centroid Center for the Euclidean Distance in the Last Iteration
Next, the Manhattan distance results are obtained for each data cluster by minimum distance. The first cluster contains 163 records, the second 214 records, and the third 175 records, for a total of 552 records. The results of the last iteration of the Manhattan distance calculation can be seen in the table below.

Table 9. Manhattan Distance Calculation Results from the Final Iteration
After the Manhattan distance calculation results are determined, the centroid center for the last iteration can also be determined, producing the final centroid values. The value of C1 is higher than those of C2 and C3, so C1 can be placed in the "Mild" category, C2 in the "Severe" category, and C3 in the "Very Severe" category. The centroid centers from the Manhattan distance calculation in the last iteration can be seen in the table below.

Table 11. Euclidean Distance and Manhattan Distance Test Results
Based on Table 4.14 above, for cluster 1 the centroid values for the variables age, weight, and height with the Euclidean distance are 0.848288, 0.779408, and 0.941526, whereas with the Manhattan distance the centroid values obtained are 0.850749, 0.788078, and 0.944960. For cluster 2, the centroid values for age, weight, and height with the Euclidean distance are 0.471311, 0.497622, and 0.802323, while with the Manhattan distance they are 0.478934, 0.499791, and 0.803945. Finally, for cluster 3, the Euclidean distance yields centroid values of 0.173396, 0.277105, and 0.638767, while the Manhattan distance yields 0.174707, 0.276038, and 0.638358. In determining the best accuracy by measuring centroid distances, the comparison is dominated by the Euclidean distance calculation. The test results using the Mean Squared Error to compare the Euclidean distance and Manhattan distance can be seen in Table 4.15 below. With the centroid and mean squared error (MSE) results, the number of records per cluster for the Euclidean distance and Manhattan distance can be determined; the cluster counts can be seen in Figure 4.13 below.
The comparison of average MSE values shows that the Euclidean distance has the smallest value, so it fulfils the optimal MSE criterion.
Figure 14. The Euclidean Distance and Manhattan Distance Clusters
K-Means was implemented by calculating the Euclidean distance and Manhattan distance on 552 records. Based on the centroid values, the comparison of the test results determines that the optimal distance calculation is the Euclidean distance, with centroid values of 0.848288, 0.779408, 0.941526 at centroid 1, then 0.471311, 0.497622, 0.802323 at centroid 2, and 0.173396, 0.277105, 0.638767 at centroid 3. The Euclidean distance required 8 iterations, the same as the Manhattan distance with 8 iterations, showing that the two methods produce balanced results in terms of iterations. Comparing the mean squared error (MSE) values, the Euclidean distance is lower than the Manhattan distance in every one of the eight iterations of each distance calculation. The Euclidean distance dominates with the smallest MSE, from 0.061861306 in the first iteration down to the lowest value of 0.030630301 in the eighth iteration, giving an average Euclidean MSE of 0.035460278 over these iterations. In contrast, the Manhattan distance shows higher MSE values, with a maximum of 0.148646255 in the first iteration and a minimum of 0.07263329 in the eighth iteration, giving an average MSE of 0.08432837 from the first to the eighth iteration.
From the test results using the Euclidean distance and Manhattan distance on a total of 552 records: with the Euclidean distance, 169 toddlers experienced mild stunting symptoms (cluster 0) with an average age of 38-61 months, 208 toddlers severe stunting symptoms (cluster 1) with an average age of 15-44 months, and 175 toddlers very severe stunting symptoms (cluster 2) with an average age of 0-24 months. With the Manhattan distance, 163 toddlers experienced mild stunting symptoms (cluster 0) with an average age of 38-61 months, 214 toddlers severe stunting symptoms (cluster 1) with an average age of 15-58 months, and 175 toddlers very severe stunting symptoms (cluster 2) with an average age of 0-24 months.
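The Mean Squared Error used in the comparison above averages the squared distance of each record to the centroid of its assigned cluster. A minimal NumPy sketch (illustrative names, not the authors' code):

```python
import numpy as np

def mse(X, labels, centroids, metric="euclidean"):
    """Mean Squared Error: average squared distance of each record
    to the centroid of its assigned cluster, under the chosen metric."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    diff = X - centroids[np.asarray(labels)]
    if metric == "manhattan":
        d = np.abs(diff).sum(axis=1)   # per-record Manhattan distance
    else:
        d = np.sqrt((diff ** 2).sum(axis=1))   # per-record Euclidean distance
    return float((d ** 2).mean())
```

Evaluating this per iteration for both metrics reproduces the kind of comparison reported above, where the metric with the consistently lower average MSE is taken as the more accurate distance measure.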