Managing your study data and the supporting documentation. Part 1: Why it’s important to rigorously document your study

By Geoffrey Hart

Previously published as: Hart, G. 2021. Managing your study data and the supporting documentation. Part 1: Why it's important to rigorously document your study. https://www.worldts.com/english-writing/eigo-ronbun82/index.html

Increasingly, researchers are being asked to archive their data so that it remains available in the long term and can be used by future researchers. Kathy R. Berenson (2018) provides a thorough and excellent overview of the subject, but with a focus on psychology research. In this series of four articles, I’ll build upon what she’s written to provide a summary that focuses more on field and laboratory research.

Note: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index

Whatever type of research you perform, it’s important to find ways to control the quality of your input data and how you recorded and analyzed the data. Documenting your data, methods, and analyses makes it easier for future researchers to benefit from and build on your results. For example, if you’ve studied an ecosystem and want to support your conclusions by citing another researcher’s results, you must find an ecosystem that is sufficiently similar to your ecosystem and that was studied with a sufficiently similar methodology that the results are truly comparable. Alternatively, the ecosystem must be sufficiently different that you can provide a fair contrast with your results and can propose a logical explanation for those differences. Such comparisons are easiest if the research has been documented well enough that you truly understand what the other researchers did—and well enough that other researchers can truly understand what you did. This is particularly important to support future meta-analyses.

Replicability is an essential part of science, particularly in fields such as psychology and the social sciences that include human subjects and that directly affect human lives. In these fields, studies have proven particularly difficult to replicate, in large part due to the high variability in any human population and the difficulty of selecting a representative subset of that population. To mitigate this problem, you must manage your data, and document how you obtained and analyzed it, well enough that you could give your dataset and instructions to a colleague who could then reanalyze your data and produce identical results. (This is particularly important to facilitate meta-analyses of large datasets.) Better still, your documentation should be so complete and detailed that another researcher could control their experimental conditions well enough that if they repeated your study, they would be likely to obtain similar results.

A further complication involves the fact that modern research generates large quantities of complex data, and most researchers receive little or no training in how to manage that data. Fortunately, there are logical ways to proceed that make data management easier and more effective. One of the key strategies is to develop objective, clear methods for how to organize, process, and analyze your data—and how to explain what you’ve done so well that another researcher could exactly repeat what you did.

There are important ethical considerations to managing data so that it can be shared. For example, research that received government funding should be available to all the taxpayers who supported that research. This issue is sufficiently important that the American Psychological Association asks its members to ensure that their data remains available for a minimum of 5 years. A better target might be 10 years, particularly for important and innovative research or research that was difficult and expensive to perform.

Create a hierarchical project structure on your computer

Every research project should have its own folder (directory) on your computer. That folder and all subordinate folders it contains should be named clearly enough to distinguish it from the many similar projects that you will subsequently perform during your research career. How to define that name depends on how your research is organized. If you are early in your career and are only conducting one or two studies simultaneously, the folder name may be as simple as your name, a key word related to the subject, and the year. For example, if you store all of your research in a folder named “Research”, your current project might be named “Hart 2020 drought stress field study”. If you’re part of a research group, you may need to include the names of the principal investigators or you may need to use your employer’s project naming structure.

Note: If your research group comprises people from multiple institutions, create a document that lists complete names and contact information for all investigators that was valid at the time of the study. If possible, try to update this document periodically to include current contact information.

Avoid nicknames and abbreviations that only your immediate research group will know. After a few years, these shortcuts become problematic because your colleagues may have moved to another institution or retired, and institutional memory of these shortcuts may have been lost; that is, the remaining researchers may no longer remember what the shortcuts mean or why they were chosen. If it’s necessary to use a complex project naming system that only bureaucrats could love, consider creating a document named “Explanation of project folder names” that provides the necessary explanations.

How you name the folders for each project depends on the nature of your research and how you approach the design and subsequent management of your studies. Berenson (2018) recommends creating the following subfolders:

Project files: All of the “paperwork” associated with a project (whether scans of paper documents or electronic copies), such as funding applications and Institutional Review Board approvals.
Data files: All of your original data, formatted as “read only” so that it can’t be accidentally changed. Most computer operating systems let you apply this format directly from the file management system (e.g., the File Explorer in Windows, the Macintosh Finder).

To protect files against accidental modification, change their format to “read only”:

Macintosh: Select the file in a Finder window, and then press Command+I to display the file information dialog box. Scroll down to the heading “Sharing & Permissions”. For the “everyone” settings, select “Read Only” under the heading “Privilege”.

Windows: Using the Windows File Explorer, right-click the file and select “Properties”. Select the checkbox beside “Read-only”.
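
If you have many data files, setting this format one file at a time becomes tedious. The following is a minimal sketch, in Python, of one way to do it for a whole folder. The function names (make_read_only, protect_folder) are my own illustrations rather than part of any standard tool. Note one difference from the Finder steps above: this sketch removes write permission for every user, including you, so you would need to restore that permission before you could deliberately edit a file.

```python
import stat
from pathlib import Path


def make_read_only(path: Path) -> None:
    """Remove every write-permission bit from a single file."""
    current = stat.S_IMODE(path.stat().st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    # Keep all existing permissions except the write bits.
    path.chmod(current & ~write_bits)


def protect_folder(folder: Path) -> int:
    """Make every file inside `folder` (and its subfolders) read-only.

    Returns the number of files that were processed.
    """
    count = 0
    for item in folder.rglob("*"):
        if item.is_file():
            make_read_only(item)
            count += 1
    return count
```

For example, protect_folder(Path("Data files")) would process every file in a folder named “Data files” inside the current folder. On Windows, Python's chmod can only switch the read-only flag on or off, which is all this sketch requires.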

Working files: All files that represent your work in progress, such as “cleaned” data files (files from which outliers and erroneous data have been removed), and not the original data files.
Command files: This includes all data-processing scripts (e.g., for the R statistical software) and the code for any software you developed to analyze your data. (I’ll discuss command files in more detail in Part 3 of this article.) If that software evolves during the project, create a version control system so that you can retain old versions of the software in case you need to return to an older version. This doesn't have to be complex; it's sufficient to simply add the revision date to the file name (e.g., R-script-21-October-2021-revision).
Replication files: Files that you can provide to someone who wants to replicate your analysis, using either your data or their own data. Depending on the nature of your research, some of the information may be confidential and should be excluded from the replication files.
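
If you start new projects often, you might prefer to create these five subfolders automatically so that every project begins with the same structure. Here is a small Python sketch of that idea. The function name create_project is hypothetical, and the subfolder names are simply Berenson's categories as listed above; substitute your own names if you use a different structure.

```python
from pathlib import Path

# Subfolder names taken from the list above.
SUBFOLDERS = [
    "Project files",
    "Data files",
    "Working files",
    "Command files",
    "Replication files",
]


def create_project(research_root: Path, project_name: str) -> Path:
    """Create a project folder containing the standard subfolders.

    Folders that already exist are left untouched, so it is safe to
    run this again on an existing project.
    """
    project = research_root / project_name
    for name in SUBFOLDERS:
        (project / name).mkdir(parents=True, exist_ok=True)
    return project
```

For example, create_project(Path("Research"), "Hart 2020 drought stress field study") would create the project folder from my earlier example, with all five subfolders inside it.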

Note that although these file protections are helpful, they are not a substitute for a rigorous backup and archiving strategy. Although your employer’s computer staff should implement such a strategy for you, you can also create your own backups. For some thoughts on how to do this, see my article "Backing up your data… and other important things".

An alternative project folder structure might be folders named “original data”, “cleaned data”, “method documentation”, and “paperwork”. In this series of articles, I’ll use Berenson’s suggested names so that if you choose to consult her book, you can more easily find details that I don’t have room to discuss here. However, I’ll discuss her categories in an order that seems more logical to me, as it more closely follows the order in which you will probably perform your research.

Whatever names and structure you choose, create a standard nomenclature for all files in a given category and document that nomenclature so that a year after you developed it, when you begin your next project, you can create names that are consistent with that system of names and equally easy to understand. (This is also useful for variable names, since this will make it easier for future researchers to compare your results with results from other studies.) For example, your data files could be named using the format Site–Date–Treatment or the format Subject–Trial number–Data type (where "data type" is replaced by words such as "video" or "chemical analyses"). This will make it much easier to organize and find files because when you display them in alphabetical order on your computer, the files for a given project will be grouped under the same names. Avoid cryptic coded names, even if you document the meanings of those names in a separate document. The ideal system should be easily understood without referring to this document. Modern computers allow long file names, and carefully chosen names reduce the risk of misinterpreting a name and assigning it to the wrong date or treatment.
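
One way to keep names consistent is to let a short script build them for you, so that every file follows the same pattern. The following Python sketch assumes the Site–Date–Treatment format. Everything else is my assumption for illustration only: the function name data_file_name, the double-hyphen separator, the ".csv" extension, and the year-month-day date format (chosen here because names in that format sort chronologically when displayed in alphabetical order).

```python
from datetime import date


def data_file_name(site: str, collected: date, treatment: str,
                   extension: str = "csv") -> str:
    """Build a file name in the form Site--Date--Treatment.extension."""
    parts = [site.strip(), collected.isoformat(), treatment.strip()]
    # Refuse to build a name with a missing part; an incomplete name
    # defeats the purpose of a standard nomenclature.
    if any(not part for part in parts):
        raise ValueError("site, date, and treatment are all required")
    return "--".join(parts) + "." + extension


print(data_file_name("North ridge", date(2021, 3, 2), "drought"))
# prints: North ridge--2021-03-02--drought.csv
```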

Note: Don’t rely on computer time stamps to create versions of a file; if you open a file and save it, the time-stamp date will change. Instead, explicitly add the date. For example, “Cleaned data--2 March 2021--outliers removed”.
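
If you would rather not type the date by hand each time, a few lines of Python can save a dated copy for you. This is only a sketch: the function name save_dated_copy is hypothetical, and it reproduces the naming pattern from my example above. It copies the file rather than renaming it, so the version you were working on remains in place.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def save_dated_copy(source: Path, note: str,
                    when: Optional[date] = None) -> Path:
    """Copy `source` to a new file whose name includes the date and a note.

    For example, "Cleaned data.csv" becomes
    "Cleaned data--2 March 2021--outliers removed.csv".
    """
    when = when or date.today()
    # Day without a leading zero, then the full month name and the year.
    stamp = f"{when.day} {when.strftime('%B %Y')}"
    target = source.with_name(f"{source.stem}--{stamp}--{note}{source.suffix}")
    # Never overwrite an existing version.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copy2(source, target)
    return target
```

If you omit the date, the sketch uses today's date. The month name follows your computer's language settings, so it will appear in English only if those settings are English.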

In Part 2 of this article, I’ll discuss how to manage your project files.