WARDER: Refining cell clustering for effective spreadsheet defect detection via validity properties

Authors

Da Li, Huiyan Wang, Chang Xu, Fengmin Shi, Xiaoxing Ma, & Jian Lu

Abstract

Spreadsheets are widely used, but they are subject to various defects, with severe consequences, due to poor maintenance by end users. Existing spreadsheet defect detection techniques fall short in effectiveness, either because their scope is limited or because they rely on rigid patterns.

In this paper, we discuss and improve one state-of-the-art technique, CUSTODES. CUSTODES uses cell clustering and anomaly detection to extend its scope and to make its patterns adaptive to varying spreadsheet styles. However, its clustering is fragile when irrelevant cells are involved, which largely reduces its detection precision.

We present WARDER, which refines CUSTODES's cell clustering based on validity properties. Experimental results show that, for cell clustering, WARDER improves precision by 20.7% on average or reaches 100% on 79.8% of worksheets, which contributes to a precision improvement of 23.1% for defect detection.

WARDER also exhibits satisfactory results when compared against other spreadsheet defect detection techniques and when applied to another large-scale spreadsheet corpus, VEnron2.

Sample

True positives for spreadsheet defect detection techniques

We study the intersections of the true positives reported by four spreadsheet defect detection techniques.

In the figure, the yellow ellipse represents the true positives reported by AmCheck, green by CACheck, pink by CUSTODES, and purple by WARDER. Each sub-area represents a specific intersection of reported true positives by two or more techniques.
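The sub-areas of such a diagram can be computed with plain set operations. The following is a minimal sketch, not the authors' analysis: the cell identifiers in `true_positives` and the helper `venn_regions` are hypothetical, chosen only to illustrate how each region holds the cells reported by exactly one combination of techniques (including cells reported by a single technique only).

```python
from itertools import combinations

# Hypothetical true-positive cells per technique (illustrative, not from the paper).
true_positives = {
    "AmCheck": {"Sheet1!B4", "Sheet1!C7"},
    "CACheck": {"Sheet1!B4", "Sheet1!C7", "Sheet2!D2"},
    "CUSTODES": {"Sheet1!C7", "Sheet2!D2", "Sheet3!E9"},
    "WARDER": {"Sheet1!C7", "Sheet2!D2", "Sheet3!E9", "Sheet3!F1"},
}


def venn_regions(sets_by_name):
    """Map each combination of techniques to the cells reported by exactly
    those techniques and no others. Empty regions are omitted."""
    names = sorted(sets_by_name)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Cells reported by every technique in the combination.
            inside = set.intersection(*(sets_by_name[n] for n in combo))
            # Cells reported by any technique outside the combination.
            others = [sets_by_name[n] for n in names if n not in combo]
            outside = set().union(*others)
            exclusive = inside - outside
            if exclusive:
                regions[combo] = exclusive
    return regions


if __name__ == "__main__":
    for combo, cells in venn_regions(true_positives).items():
        print(" & ".join(combo), "->", sorted(cells))
```

Because every cell falls into exactly one region, the region sizes sum to the number of distinct true positives across all techniques, which is a quick sanity check on the diagram.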

In conclusion, WARDER is satisfactory in detecting defects in practical spreadsheets. It achieved the highest precision among the studied techniques. Its time cost is somewhat high, but it is comparable to that of CACheck and lower than that of its predecessor, CUSTODES.

All of the studied techniques have their own strengths, and we suggest using them to complement one another.

Publication

2019, IEEE 19th International Conference on Software Quality, Reliability and Security (QRS)

Full article

WARDER: Refining cell clustering for effective spreadsheet defect detection via validity properties