–Geoff-Hart.com: Editing, Writing, and Translation —Home —Services —Books —Articles —Resources —Fiction —Contact me —Français |
You are here: Articles --> 2023 --> Using spreadsheets safely
Vous êtes ici : Essais --> 2023 --> Using spreadsheets safely
By Geoffrey Hart
Previously published as: Hart, G. 2023. Using spreadsheets safely. https://www.worldts.com/english-writing/eigo-ronbun87/index.html
Spreadsheet software seems so simple that it misleads many users into thinking it’s safe. Unfortunately, like any other software, a spreadsheet can be either an essential tool to support analysis—or a nightmare of altered or incorrectly analyzed data. Learning to use spreadsheets safely increases the likelihood that your spreadsheet will become an essential tool. In this article, inspired by Perkel (2022), I’ll discuss some of the problems I’ve seen with spreadsheet use and some potential solutions.
Note: Most of these suggestions work equally well with special-purpose statistical software, so look for ways to implement them with that software too.
My first advice is to avoid using a spreadsheet if you have learned how to use to real statistical software, such as R or SPSS. The statistical software is debugged more carefully and validated more rigorously than most spreadsheets. For example, Microsoft Excel has a long history of calculation bugs (Abeysooriya et al. 2021). Updating your spreadsheet software regularly is one way to ensure that known bugs have been corrected, but there’s a risk that if you update your software too quickly, you’ll encounter a newly introduced bug that hasn’t yet been detected. Don’t be the first person to update nor the last.
The raw data you collect in the field or the lab is irreplaceable. Thus, it’s essential that you store it somewhere safe and always work on a copy of that original data. For additional security, write-protect this file and create multiple backups: one in your working directory, one on external media such as a flash drive or DVD, and one “in the cloud” (e.g., in DropBox). You can use that original version of your data to provide a reference for the spreadsheet data, but the original copy will also let you start your analysis again if you discover a problem with the transformed data in your spreadsheet.
The spreadsheet that looks clear and comprehensible immediately after you create it may become incomprehensible in a year when you try to reuse your data, whether to perform new calculations or validate old ones. Explain the contents of each page of a spreadsheet by using human-comprehensible variable names (see my article on naming variables) and by adding headings and subheadings that are separate from the data and presented as full sentences. Note that although spreadsheets offer many tools for organizing data and making it more legible, you should use those tools sparingly, and test carefully to ensure that they don’t interfere with your ability to transfer the data to other programs, such as R or SPSS.
If you use color to code categories of your data, ensure that the coding remains visible if the file is converted to black and white. (Statistical software probably won’t import any customized formats you created in your spreadsheet.) If you use color, choose colors that differ visibly even for readers who suffer from colorblindness. For example, red–green colorblindness affects up to 8% of men and 0.5% of women, depending on the population. Thus, don’t rely on those two colors to label data as (for example) positive = green and negative = red. Instead, use headings such as “Positive” and “Negative”.
Note: The more complex your data structure, the greater the necessity to describe it clearly in a document that isn’t part of your spreadsheet or that appears on a worksheet separate from the data and calculations. If your spreadsheet doesn’t handle large chunks of text well, create “read me” files that contain the instructions. Then store them in the same directory as your data.
Rely on the tools provided by your software to minimize data-transcription errors. Don’t manually re-type anything if you can copy and paste that information. Don’t copy and paste data if you can import it (i.e., have the new software read the data file directly). These practices eliminate different categories of error (typing errors and copy/paste errors, respectively).
Once the data is stored in your spreadsheet, confirm that the grid of data you defined in your spreadsheet matches the grid in the raw data and the grid that your statistical software will use to import the data. Missing data and mismatched data, such as heading rows that are imported incorrectly as data, can lead to serious errors.
Data validation is tedious and frustrating, but essential. Start with simple steps such as ensuring that the number of records equals the sample size. For a simple, nontechnological solution, have one person read the data aloud while the other person examines the spreadsheet and ensures the data was entered (or imported) correctly. Develop more sophisticated automatic checks. For example, use two methods (e.g., copy/paste the data and then use the spreadsheet’s import function) to create two copies of the same data on separate pages. Then merge the data in a third page and subtract each cell in one page of data from the corresponding cell in the other page. Since the data should be identical in both files, any non-zero values represent a transcription or import error.
Don’t forget to refresh your spreadsheet after you make changes. The software should do this automatically, but sometimes it forgets. To force your spreadsheet to refresh its calculation results, look for a relevant menu choice or keyboard shortcut. In Microsoft Excel, for example, use the Refresh function (under the Data menu on a Macintosh or in the Data tab of the ribbon for both Macintosh and Windows versions), or press F9 to refresh the entire workbook.
Excel offers a useful tool that will automatically change shortcut text into a word or sentence, or automatically reformat the contents of a cell. This can save much time by reducing the amount of typing required. Unfortunately, many of these changes are poorly implemented and can introduce serious errors. A recent review (Abeysooriya et al. 2021) suggested that the well-publicized 2018 Microsoft Excel bug that changed correct gene names into incorrect names had been corrected, but that nearly one-third of the published gene names they checked 2 years later had not been corrected.
Review your sequence of analytical steps carefully to ensure that the logic is correct and that you have defined the correct range of spreadsheet cells. It’s easy to add a new row of data and fail to include the new data in calculations of the total or mean value for one or more columns of data. Once your logic seems correct, validate it using test data; for example, manually calculate the mean value of a subset of the data, and compare it with the same calculation performed by the spreadsheet.
Document your logic (the reasoning behind your method) for calculations and for classifying data into categories. For example, define one or more criteria and choose appropriate threshold values for each criterion. This makes it much easier to revisit your data and confirm that there are no classification errors. In particular, confirm that field types are correctly defined so that the spreadsheet doesn’t damage your data (e.g., by treating numerical data as text, by rounding numbers to the wrong number of decimal places). Ask a colleague who wasn’t involved in the analysis to review your documentation to ensure that it’s as clear as you think it is.
Note: Don’t use built-in commenting features such as Microsoft Excel’s comments. They’re implemented using tiny icons that are easy to miss (particularly on large monitors) or to accidentally delete.
People who work full-time as professional computer programmers or data analysts (e.g., statisticians) usually hold at least one degree in this subject, supplemented by years of experience. Some may have spent 9 or more years achieving a PhD in this subject. You will not acquire this same level of expertise simply by spending a week reading the user manual for your spreadsheet. Always consult experts to ensure that you really know what you’re doing.
Abeysooriya, M.; Soria, M.; Sravya Kasu, M.; Ziemann, M. 2021. Gene name errors: lessons not learned. PLOS Computational Biology.
Perkel, J.M. 2022. Six tips for better spreadsheets. Nature Technology Features 2 August 2022.
©2004–2024 Geoffrey Hart. All rights reserved.