An investigation of fuzzy modeling for spatial prediction with sparsely distributed data




Thomas, Robert




Dioxins are highly toxic, persistent environmental pollutants that occur in marine harbour sediments as a result of industrial practices around the world and pose a significant risk to human health. To adequately remediate contaminated sediments, the spatial extent of contamination must first be determined by spatial interpolation. The ability to lower sampling frequency and perform laboratory analysis on fewer samples, yet still produce an adequate pollutant distribution map, would reduce the initial cost of new remediation projects. Fuzzy set theory has been shown to reduce uncertainty due to data sparsity, and fuzzy clustering-based approaches provide an advantageous way to quantify gradational changes such as those of pollutant concentrations; fuzzy modeling can exploit these advantages when making spatial predictions. To assess the ability of fuzzy modeling to make spatial predictions using fewer sample points, its predictive ability was compared to that of Ordinary Kriging (OK) and Inverse Distance Weighting (IDW) under increasingly sparse data conditions. This research used a Takagi-Sugeno (T-S) fuzzy modeling approach with fuzzy c-means (FCM) clustering to make spatial predictions of lead concentrations in soil, in order to determine the efficacy of the fuzzy model for modeling dioxins in marine sediment. The spatial density of the data used to make the predictions was incrementally reduced to simulate increasingly sparse spatial data conditions. To determine model performance, the data withheld from the spatial predictions at each increment was used as validation data: the model predicted the withheld points, and the predictions were compared with the observed values. Initially, the parameters of the T-S fuzzy model were selected by observed performance: the combination of parameters that produced the most accurate prediction of the validation data was retained as optimal for each increment of the data reduction.
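As an illustration of the clustering step mentioned above, the following is a minimal sketch of standard fuzzy c-means on point coordinates. It is not the thesis code: the function names, the sample points, and the initial centres are hypothetical, and the fuzzifier `m` is assumed here to play the role of the "fuzzy overlap" parameter (larger `m` gives softer, more overlapping memberships).

```python
import math

def fcm_memberships(points, centres, m=2.0):
    """Membership of each point in each cluster for a fuzzifier m > 1."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centres]
        if 0.0 in dists:
            # Point coincides with one or more centres: share full membership there.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
        else:
            row = [d ** -exponent for d in dists]
        total = sum(row)
        memberships.append([v / total for v in row])
    return memberships

def fcm_update_centres(points, memberships, m=2.0):
    """Recompute each centre as the mean of the points weighted by membership**m."""
    dim = len(points[0])
    centres = []
    for k in range(len(memberships[0])):
        weights = [row[k] ** m for row in memberships]
        total = sum(weights)
        centres.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ))
    return centres

def fuzzy_c_means(points, initial_centres, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    centres = list(initial_centres)
    for _ in range(iterations):
        u = fcm_memberships(points, centres, m)
        centres = fcm_update_centres(points, u, m)
    return centres, fcm_memberships(points, centres, m)

# Hypothetical sample locations forming two loose groups.
demo_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
demo_centres, demo_u = fuzzy_c_means(demo_points, [(0.0, 0.0), (10.0, 10.0)])
```

A fixed iteration count is used instead of a convergence tolerance to keep the sketch short; a practical implementation would normally stop once the memberships change by less than a chosen threshold.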
To determine performance, Mean Absolute Error (MAE), the Coefficient of Determination, and Root Mean Squared Error (RMSE) were selected as metrics. To give each metric equal weighting, a binned scoring system was developed in which each metric received a score from 1 to 10; the average of these scores represented that method's score. The Akaike Information Criterion (AIC) was also employed to determine the effect of the varied validation set lengths on performance. For the T-S fuzzy model, as the amount of data used to predict the respective validation set points was reduced, the number of clusters was lower, the cluster centres were more spread out, the fuzzy overlap between clusters was larger, and the membership functions were wider. Although it was possible to determine a number of clusters, fuzzy overlap, and membership function width that yielded an optimal prediction of the validation data, the gain in performance was minor compared to many other combinations of parameters. Therefore, for the data used in this study, the T-S fuzzy model was insensitive to parameter choice. For OK, as the data was reduced, the range of spatial dependence obtained from variography became shorter, and for IDW the optimal value of the power parameter became lower, giving greater weighting to more widely spread points. For the T-S fuzzy model, OK, and IDW, the increasingly sparse data conditions resulted in increasingly poor model performance for all metrics. This was supported by the AIC values, which for the three methods were within 1 point of each other at each increment of the data reduction. The ability of the methods to predict outlier points and reproduce the variance in the validation sets was very similar and overall quite poor. Based on the scoring system, IDW slightly outperformed the T-S fuzzy model, which in turn slightly outperformed OK.
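The IDW behaviour and the three error metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names and sample values are hypothetical, and the Coefficient of Determination is taken as 1 - SS_res / SS_tot, one common definition (the abstract does not say which form was used).

```python
import math

def idw_predict(known, target, power=2.0):
    """Inverse Distance Weighting estimate at `target` from (x, y, value) samples."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in known:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a sample location
        w = 1.0 / d ** power  # a lower power spreads weight to farther samples
        numerator += w * value
        denominator += w
    return numerator / denominator

def mae(observed, predicted):
    """Mean Absolute Error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root Mean Squared Error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def r_squared(observed, predicted):
    """Coefficient of Determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical training samples (x, y, concentration) and withheld validation points.
training = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0), (0.0, 10.0, 30.0), (10.0, 10.0, 40.0)]
validation = [(5.0, 5.0, 25.0), (2.0, 2.0, 14.0), (8.0, 3.0, 24.0)]

observed = [v for _, _, v in validation]
predicted = [idw_predict(training, (x, y), power=2.0) for x, y, _ in validation]
scores = {"MAE": mae(observed, predicted),
          "RMSE": rmse(observed, predicted),
          "R2": r_squared(observed, predicted)}
```

Lowering `power` pulls each prediction toward the mean of all samples, which matches the observation that sparser data favoured lower power values.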
However, the scoring system employed in this research was overly sensitive and so was only useful for assessing relative performance. The performance of the T-S fuzzy model was strongly dependent on the number of outliers in the respective validation set. For modeling under sparse data conditions, the T-S fuzzy modeling approach used in this research, with FCM clustering and constant-width, Gaussian-shaped membership functions, did not show any advantages over IDW and OK for the type of data tested. Therefore, it was not possible to draw conclusions about a potential reduction in sampling frequency for delineating the extent of contamination in new remediation projects.
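To make the role of the membership function width concrete, the following is a minimal sketch of a T-S prediction with constant-width Gaussian membership functions centred on cluster centres. It assumes zero-order (constant) rule consequents for brevity; the abstract does not state the consequent form used in the thesis, and the function names and values are hypothetical.

```python
import math

def gaussian_membership(point, centre, width):
    """Gaussian membership of `point` in a rule centred at `centre`."""
    d2 = sum((a - b) ** 2 for a, b in zip(point, centre))
    return math.exp(-d2 / (2.0 * width ** 2))

def ts_predict(point, centres, consequents, width):
    """Zero-order T-S output: firing-strength-weighted mean of rule constants."""
    strengths = [gaussian_membership(point, c, width) for c in centres]
    total = sum(strengths)
    if total == 0.0:
        # Every rule underflowed to zero; fall back to the unweighted mean.
        return sum(consequents) / len(consequents)
    return sum(s * q for s, q in zip(strengths, consequents)) / total

# Hypothetical rule centres (x, y) and constant consequents (concentrations).
rule_centres = [(0.0, 0.0), (10.0, 0.0)]
rule_outputs = [10.0, 30.0]

narrow = ts_predict((2.0, 0.0), rule_centres, rule_outputs, width=1.0)
wide = ts_predict((2.0, 0.0), rule_centres, rule_outputs, width=10.0)
```

With a narrow width the nearest rule dominates the prediction, while a wide width blends the rules toward their average, which is the smoothing effect expected when sparse data favour wider membership functions.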



fuzzy modeling, geospatial, prediction, sparse data, kriging, IDW, Inverse Distance Weighting