CUSTODES: Automatic spreadsheet cell clustering and smell detection using strong and weak features

Authors

Shing-Chi Cheung, Wanjun Chen, Yepang Liu, & Chang Xu

Abstract

Various techniques have been proposed to detect smells in spreadsheets, which are susceptible to errors. These techniques typically detect spreadsheet smells through a mechanism based on a fixed set of patterns or metric thresholds.

Unlike conventional programs, tabulation styles vary greatly across spreadsheets. Smell detection based on fixed patterns or metric thresholds, which are insensitive to the varying tabulation styles, can miss many smells in one spreadsheet while reporting many spurious smells in another.

In this paper, we propose CUSTODES to effectively cluster spreadsheet cells and detect smells in these clusters. The clustering mechanism can automatically adapt to the tabulation styles of each spreadsheet using strong and weak features. These strong and weak features capture the invariant and variant parts of tabulation styles, respectively. As smelly cells in a spreadsheet are normally in the minority, they can be mechanically detected as outliers of their clusters in the feature spaces.
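
The outlier idea can be illustrated with a minimal sketch. It assumes, hypothetically, that each cell in a cluster has already been reduced to a single categorical feature value (here a formula written in relative R1C1 notation) and that any value held by less than a chosen share of the cluster counts as an outlier. The function name `minority_outliers` and the 0.5 threshold are illustrative assumptions, not the detection algorithm used in the paper.

```python
from collections import Counter


def minority_outliers(cluster, max_share=0.5):
    """Return addresses of cells whose feature value is in the minority.

    cluster:   dict mapping a cell address to its feature value.
    max_share: a value held by a share of cells strictly below this
               threshold is treated as an outlier (assumed threshold).
    """
    counts = Counter(cluster.values())
    total = len(cluster)
    return sorted(
        addr for addr, value in cluster.items()
        if counts[value] / total < max_share
    )


# Toy cluster: one column of products where D4 holds a hard-coded number.
cluster = {
    "D2": "=RC[-2]*RC[-1]",
    "D3": "=RC[-2]*RC[-1]",
    "D4": "120",
    "D5": "=RC[-2]*RC[-1]",
}
print(minority_outliers(cluster))  # ['D4']
```

In this toy cluster three of four cells share the same relative formula, so the hard-coded constant in D4 is the only value below the threshold.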

We implemented and applied CUSTODES to 70 spreadsheet files randomly sampled from the EUSES corpus. These spreadsheets contain 1,610 formula cell clusters. Experimental results confirmed that CUSTODES is effective. It successfully detected harmful smells that can induce computation anomalies in spreadsheets with an F-measure of 0.72, outperforming state-of-the-art techniques.

Sample

The CUSTODES methodology consists of four steps: preprocessing, first-stage clustering, second-stage clustering by bootstrapping, and smell detection.
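
The four steps can be sketched as a toy pipeline. Everything concrete here is an assumption made for illustration: the sheet is a dict from (row, column) to cell content, the strong feature is an identical relative (R1C1) formula in the same column, the weak feature is vertical adjacency to a cluster member, and a smell is any cell that disagrees with its cluster's majority content. The function names are hypothetical, and the features and detectors in the paper are richer than this.

```python
from collections import Counter, defaultdict


def preprocess(sheet):
    """Step 1: keep non-empty cells as (row, col, content) records."""
    return [(r, c, f) for (r, c), f in sorted(sheet.items()) if f]


def first_stage(cells):
    """Step 2: seed clusters from the assumed strong feature."""
    groups = defaultdict(list)
    for r, c, f in cells:
        if f.startswith("="):
            groups[(c, f)].append((r, c))
    # Only groups with at least two formula cells become seed clusters.
    return [members for members in groups.values() if len(members) >= 2]


def second_stage(cells, clusters):
    """Step 3: grow clusters iteratively using the assumed weak feature."""
    grown = [set(members) for members in clusters]
    unassigned = {(r, c) for r, c, _ in cells} - set().union(*grown)
    changed = True
    while changed:  # repeat until no cluster absorbs another cell
        changed = False
        for r, c in sorted(unassigned):
            for cluster in grown:
                if (r - 1, c) in cluster or (r + 1, c) in cluster:
                    cluster.add((r, c))
                    unassigned.discard((r, c))
                    changed = True
                    break
    return grown


def detect_smells(sheet, clusters):
    """Step 4: flag cells that differ from their cluster's majority content."""
    smells = []
    for cluster in clusters:
        counts = Counter(sheet[cell] for cell in cluster)
        majority, _ = counts.most_common(1)[0]
        smells.extend(cell for cell in cluster if sheet[cell] != majority)
    return sorted(smells)


def run_pipeline(sheet):
    cells = preprocess(sheet)
    clusters = second_stage(cells, first_stage(cells))
    return detect_smells(sheet, clusters)


# Toy sheet: column 3 should hold col1 * col2, but row 4 is hard-coded.
product = "=RC[-2]*RC[-1]"
sheet = {
    (2, 1): "10", (2, 2): "3", (2, 3): product,
    (3, 1): "4",  (3, 2): "5", (3, 3): product,
    (4, 1): "20", (4, 2): "6", (4, 3): "120",
    (5, 1): "7",  (5, 2): "2", (5, 3): product,
}
print(run_pipeline(sheet))  # [(4, 3)]
```

The first stage groups only the three formula cells. The second stage then absorbs the hard-coded cell at row 4 because it sits between cluster members, which is what lets the final step flag it as the odd one out.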

Publication

2016, ICSE, 38th International Conference on Software Engineering

Full article

CUSTODES: Automatic spreadsheet cell clustering and smell detection using strong and weak features