i-nth - Detecting faulty empty cells in spreadsheets

Authors

Liang Xu, Shuo Wang, Wensheng Dou, Bo Yang, Chushu Gao, Jun Wei, Tao Huang

Abstract

Spreadsheets play an important role in various business tasks, such as financial reports and data analysis. In spreadsheets, empty cells are widely used for different purposes, e.g., separating different tables, or default value "0".

However, a user may delete a formula unintentionally, and leave a cell empty. Such ad-hoc modification may introduce a faulty empty cell that should have a formula.

We observe that the context of an empty cell can help determine whether the empty cell is faulty. For example, is the empty cell next to a cell array in which all cells share the same semantics? Does the empty cell have headers similar to other non-empty cells?

In this paper, we propose EmptyCheck, to detect faulty empty cells in spreadsheets. By analyzing the context of an empty cell, EmptyCheck validates whether the cell belong to a cell array. If yes, the empty cell is faulty since it does not contain a formula.

We evaluate EmptyCheck on 100 randomly sampled EUSES spreadsheets. The experimental result shows that EmptyCheck can detect faulty empty cells with high precision (75.00%) and recall (87.04%). Existing techniques can detect only 4.26% of the true faulty empty cells that EmptyCheck detects.

Sample

This is a real spreadsheet extracted from the EUSES corpus.

This spreadsheet contains many empty cells. We cluster them into two categories according to the degree of their harmfulness:

Harmless empty cells. Users may leave cells empty intentionally: To make the layout more intuitive or when data is not available. These empty cells are considered as correct and harmless.
Faulty empty cells. Some empty cells should be filled with formulas. For example, column G is used to calculate the parcel tax by adding columns B, C and E. G8 should have the formula "=B8+C8+E8". We consider cell G8 as a faulty empty cell. Similarly, cells G11 and G12 are also faulty empty cells.

EmptyCheck marks the faulty empty cells with a red triangle.

Publication

2018, IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), March

Full article

Detecting faulty empty cells in spreadsheets