Authors

Wensheng Dou, Liang Xu, Shing-Chi Cheung, Chushu Gao, Jun Wei, & Tao Huang

Abstract

Like most conventional software, spreadsheets are subject to software evolution. However, spreadsheet evolution is rarely assisted by version management tools.

As a result, the version information across evolved spreadsheets is often missing or highly fragmented. This makes it difficult for users to notice the evolution issues arising from their spreadsheets.

In this paper, we propose a semi-automated approach that leverages spreadsheets' contexts (e.g., attached emails) and contents to identify evolved spreadsheets and recover the embedded version information. We apply it to the released email archive of the Enron Corporation and build an industrial-scale, versioned spreadsheet corpus VEnron.

Our approach first clusters spreadsheets that likely evolved from one to another into evolution groups based on various fragmented information, such as spreadsheet filenames, spreadsheet contents, and spreadsheet-attached emails. Then, it recovers the version information of the spreadsheets in each evolution group.
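As a rough illustration of the filename-based part of this clustering step, the sketch below groups spreadsheets whose names differ only in version-like tokens. It is not the authors' algorithm: the token list, `normalize_name`, and `group_by_name` are hypothetical, and the approach described above also uses spreadsheet contents and attached emails, which this sketch ignores.

```python
import re
from collections import defaultdict

# Hypothetical heuristic (an assumption, not taken from the paper):
# version markers such as "v2", digit runs, and words like "final"
# are treated as noise when comparing filenames.
_VERSION_TOKEN = re.compile(
    r"\bv(er(sion)?)?\s*\d+\b|\d+|\b(final|draft|copy|new|old|rev)\b"
)


def normalize_name(filename):
    """Reduce a filename to a key with version-like tokens removed."""
    stem = filename.rsplit(".", 1)[0].lower()   # drop the extension
    stem = re.sub(r"[_\-.]+", " ", stem)        # unify separators
    stem = _VERSION_TOKEN.sub(" ", stem)        # drop version markers
    return " ".join(stem.split())               # collapse whitespace


def group_by_name(filenames):
    """Cluster filenames that share a normalized key.

    Keys with a single file are dropped, since one file on its own
    carries no version history.
    """
    groups = defaultdict(list)
    for name in filenames:
        groups[normalize_name(name)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


candidates = [
    "budget_v1.xls",
    "budget_v2.xls",
    "Budget final.xls",
    "gas_prices_2001.xls",
    "notes.xls",
]
evolution_groups = group_by_name(candidates)
```

Here the three budget files fall into one candidate group, while the other two files are left ungrouped. Filename similarity alone will both miss renamed versions and merge unrelated files with similar names, which is why a real pipeline would need further evidence and manual checking.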

VEnron enables us to identify interesting issues that can arise from spreadsheet evolution. For example, versioned spreadsheets are common in the Enron email archive; changes in formulas are frequent; and in some groups (16.9%), new errors are introduced during evolution.

To the best of our knowledge, VEnron is the first spreadsheet corpus with version information. It provides a valuable resource for understanding the issues that arise from spreadsheet evolution.

Sample

Distribution of users who changed a spreadsheet group

Spreadsheets are often maintained by multiple people. In this corpus, the 7,294 spreadsheets, which contain more than nine million formulae, were divided into 360 groups. 72% of the spreadsheet groups were changed by more than one person, and 20% were changed by five or more people.

Publication

2016, International Conference on Software Engineering Companion, May, pages 162-171

Full article

VEnron: A versioned spreadsheet corpus and related evolution analysis