i-nth - Smelly spreadsheet structures: Structural analysis of spreadsheets to enhance smell detection

Authors

Patrick Koch

Abstract

The use of spreadsheets is the most popular form of end-user programming.

As spreadsheets are in general not created and maintained by professional programmers, error rates are high. As a remedy, many approaches were proposed to avoid, find, and fix errors in spreadsheets. A number of these approaches, mostly proposing unit- and type inference techniques, take the structure of spreadsheets into consideration.

However, in general, structural information is rarely used in established spreadsheet quality assurance (QA) techniques. This is attributable to structure inference being a difficult problem. Hence, no general process exists for the extraction of these structures.

In this master’s thesis, we present a novel analysis process to identify spreadsheet structures. Our analysis process follows three main concepts:

First, our approach focusses on structural information provided by formula cells and formula relations.
Second, wherever possible, we utilize one-dimensional areas for our analysis purposes, merging individual cells or partitioning two-dimensional areas.
Third, in case of ambiguity during the analysis process, we opt for the interpretation which requires the least amount of assumptions.

We evaluated the performance of our structural analysis process using the EUSES and ENRON spreadsheet corpora. We compared expected high-level structures, detected by manual inspection, with the results of our structural analysis process.

Almost all of the expected structures could be found at least partly by our approach. Moreover, more than 80% of structures could be inferred in their entirety. We conclude that these structures are well suited to enhance spreadsheet QA techniques.

To demonstrate how structural information could be applied, we enhance the detection processes of spreadsheet smells (a QA technique) by following two approaches:

First, we propose five updates to the catalogue of established smells making use of structural information.
Second, we introduce seven new smells based on spreadsheet structures.

The proposed enhancements reduce the runtime of smell detection routines, lower the number of false positive smell detections, and allow for the detection of further issues in form of new smells.

Sample

The String Distance smell identifies typographical errors using the algorithm introduced by Levenshtein.

In this example, cell C6 is faulty, containing a wrong string. The cell is inferred as smelly by applying String Distance detection on the group in area C2:C6.

Publication

2016, Master's thesis, Graz University of Technology, April

Full article

Smelly spreadsheet structures: Structural analysis of spreadsheets to enhance smell detection