i-nth logo

Authors

Karl Broman & Kara Woo

Abstract

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this paper offers pracitical recommendations for organizing spreadsheet data to reduce errors and ease later analyses.

The basic principles are:

  • Be consistent.
  • Write dates like YYYY-MM-DD.
  • Don't leave any cells empty.
  • Put just one thing in a cell.
  • Organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row).
  • Create a data dictionary.
  • Don't include calculations in the raw data files.
  • Don't use font color or highlighting as data.
  • Choose good names for things.
  • Make backups.
  • Use data validation to avoid data entry errors.
  • Save the data in plain text file.

Sample

Examples of good and bad variable names
Examples of good and bad variable names

It is important to pick good names for things. This can be hard, and so it is worth putting some time and thought into it.

Characteristics of good names include:

  • The main principle in choosing names, whether for variables or for file names, is short, but meaningful. So not too short.
  • Don't use spaces. They make programming harder: the analyst will need to surround everything in double quotes.
  • Where you might use spaces, use underscores or perhaps hyphens. But don't use a mixture of underscores and hyphens; pick one and be consistent.
  • Be careful about extraneous spaces at the beginning or end of a variable name.
  • Avoid special characters, except for underscores and hyphens. Other symbols often have special meaning.
  • Never include "final" in a file name. You will invariably end up with "final ver2".

Publication

2017, The American Statistician, Volume 72, Number 1, pages 2-10

Full article

Data organization in spreadsheets