Authors

Liang Xu

Abstract

Spreadsheets are a special kind of software, and like other software they evolve throughout their life cycle. Understanding spreadsheet evolution can facilitate spreadsheet design, maintenance and fault detection.

However, understanding spreadsheet evolution is challenging in practice. Many factors hinder comprehension of spreadsheet evolution, such as the lack of version information and complicated structural changes during evolution.

We therefore propose this work to facilitate the understanding of spreadsheet evolution. It includes developing a semi-automated technique to build versioned spreadsheet corpora, characterizing and understanding how spreadsheet templates are reused, developing automated tools for spreadsheet comparison, and devising new approaches for fault detection during evolution.

Sample

Overview of SpreadCluster

SpreadCluster is a content-based algorithm to identify different versions of the same spreadsheet.

It works in two phases:

  • Training phase. SpreadCluster extracts features (e.g., table headers and worksheet names) from each spreadsheet. It calculates the similarity between spreadsheets based on the extracted features and trains a clustering model on a training dataset derived from VEnron.
  • Working phase. SpreadCluster extracts the same features from the spreadsheets to be analyzed and calculates the similarity between them. It then uses the trained model to cluster the spreadsheets into different evolution groups.
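
The working phase can be illustrated with a minimal Python sketch. It is not SpreadCluster's actual implementation: the Jaccard similarity, the equal weighting of the two feature types, the single-linkage grouping, and the fixed `threshold` (standing in for the trained model) are all assumptions made for illustration, as are the names `jaccard`, `similarity`, `cluster` and the toy `corpus`.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(s1, s2):
    """Assumed measure: mean Jaccard over worksheet names and table headers."""
    return 0.5 * jaccard(s1["sheets"], s2["sheets"]) + 0.5 * jaccard(
        s1["headers"], s2["headers"]
    )


def cluster(spreadsheets, threshold):
    """Group spreadsheets whose similarity reaches the threshold.

    Uses single linkage implemented with union-find; the threshold is a
    hypothetical stand-in for a model learned in the training phase.
    """
    parent = {name: name for name in spreadsheets}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(spreadsheets, 2):
        if similarity(spreadsheets[a], spreadsheets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in spreadsheets:
        groups.setdefault(find(name), set()).add(name)
    # Sort groups by their member names so the result order is deterministic.
    return sorted(groups.values(), key=lambda g: sorted(g))


# Toy corpus (hypothetical): two versions of a budget and an unrelated roster.
corpus = {
    "budget_v1": {"sheets": {"Summary", "Q1"}, "headers": {"Item", "Cost", "Owner"}},
    "budget_v2": {"sheets": {"Summary", "Q1", "Q2"}, "headers": {"Item", "Cost", "Owner"}},
    "roster": {"sheets": {"Staff"}, "headers": {"Name", "Dept"}},
}

evolution_groups = cluster(corpus, threshold=0.5)
```

In this toy corpus the two budget files share all table headers and two of three worksheet names, so they fall into one evolution group, while the roster shares no features with either and forms its own group.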

Publication

2017, IEEE International Conference on Software Maintenance and Evolution, September, pages 670-674

Full article

Understanding spreadsheet evolution in practice