Geoff-Hart.com: Editing, Writing, and Translation

Managing your study data and the supporting documentation. Part 4: Replication Files

By Geoffrey Hart

In part 1, part 2, and part 3 of this series, I provided suggestions on how to set up your project files and validate your data. In this final part, I'll discuss how to prepare your “replication files”. The goal of creating these files is to help future researchers repeat your research or include your results in a meta-analysis. Your replication files should include enough information that another researcher could repeat the steps in your study to obtain similar results or repeat your data analysis to obtain identical results. You can think of these files as the “public” versions of the “private” files that you store in your other research directories.

Note: This series of articles is based on the following book, but with the details modified for non-psychological research: Berenson, K.R. 2018. Managing Your Research Data and Documentation. American Psychological Association. 105 p. including index.

Your replication files should include source data downloaded from a database or provided by a colleague, but only data that you are legally or ethically allowed to share. For any data that is proprietary or confidential, be sure to include only the subset of your original data that you can share. For proprietary information, your employer’s intellectual property manager can provide detailed guidance. For human data, where privacy is important, store the identities of the participants separately from their data. That way, nobody who uses the dataset can identify the individuals unless an institutional review board or other authority determines that it’s necessary to do so. Structure the replication folder (directory) so that you can simply give your colleagues a copy of the whole folder and be confident that they will be able to analyze the data correctly.
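As a minimal sketch of storing identities separately from the data, the Python function below splits each participant record into a shareable part and a private key table, linked only by a random code. The function name, the "code" column, and the example field names are hypothetical and chosen for illustration. Removing direct identifiers in this way does not by itself guarantee anonymity, so treat it as a starting point rather than a complete privacy procedure.

```python
import secrets


def split_identities(rows, id_fields):
    """Split records into a shareable list and a private key list.

    rows: list of dicts, one per participant.
    id_fields: names of the columns that identify a person (assumed names).
    Returns (shareable_rows, key_rows); both carry the same random code.
    """
    shareable, key = [], []
    used_codes = set()
    for row in rows:
        # Draw a random code and retry in the unlikely event of a collision.
        code = secrets.token_hex(4)
        while code in used_codes:
            code = secrets.token_hex(4)
        used_codes.add(code)

        # The key table holds the identifying fields; keep it OUTSIDE
        # the replication folder.
        key.append({"code": code, **{f: row[f] for f in id_fields}})
        # The shareable table holds everything else.
        shareable.append(
            {"code": code,
             **{k: v for k, v in row.items() if k not in id_fields}}
        )
    return shareable, key
```

Only the shareable list would go into the replication folder; the key list stays in your private directories, where only authorized people can use it to link a code back to a person.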

Note: Develop a procedure to ensure that whenever you modify a source file that is included in the replication folder, you also update the copy in that folder so that the two copies remain identical. For example, before you send replication files to a colleague, compare their time stamps with those of the original files. If the copy in the replication folder is older than the original, carefully check whether it's necessary to replace it with the newer version.
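The time-stamp comparison can be automated. The following Python sketch assumes that each file in the replication folder has an original at the same relative path in a source folder; the function name and that folder layout are assumptions made for illustration, not part of any standard tool.

```python
from pathlib import Path


def stale_replication_files(source_dir, replication_dir):
    """List replication copies that are older than their originals.

    Assumes each copy sits at the same relative path in both folders.
    Returns relative paths (with forward slashes) in sorted order.
    """
    source_dir = Path(source_dir)
    replication_dir = Path(replication_dir)
    stale = []
    for copy in sorted(replication_dir.rglob("*")):
        if not copy.is_file():
            continue
        relative = copy.relative_to(replication_dir)
        original = source_dir / relative
        # Flag the copy only if an original exists and was modified later.
        if original.is_file() and original.stat().st_mtime > copy.stat().st_mtime:
            stale.append(relative.as_posix())
    return stale
```

A file that appears in the returned list is only a candidate for replacement: you still need to check whether the newer original should actually replace the shared copy.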

It's also appropriate to include methodology documents, such as copies of blank questionnaires and participant recruitment forms for human studies. If the replication folder is intended solely for your own use, include final versions of grant applications that you can reuse, with appropriate modification, in future funding applications. If a colleague provided a copy of one of their papers that contains detailed methodology that you used in your study, save that paper. If you downloaded a paper from a journal site, store it outside the Replication directory so that you don't share copyrighted material with other researchers. Instead, provide a document that lists all the manuscripts you consulted, with a link to the Web page where each manuscript can be obtained. The advantage of retaining the downloaded copy is that you won’t have to go looking for it again when you perform your next study.

Journal articles are well designed to help you understand what an author did in their research. However, they rarely provide full details of a procedure, and their structure is inefficient when it comes time for you to actually use that method in your research. For example, the Methods section almost never starts with a list of materials and lab instruments you will need to gather together before you can perform an analysis. Moreover, you will often see phrases such as “using the method of Hart (2021)” instead of a description of what that author actually did. In long or complex studies, there may be a dozen or more external documents that readers of your published paper must obtain before they can repeat your experiment. Why not make life easier for all future researchers who use your methods by providing all necessary details of those methods in a laboratory manual or similar resource?

For example, to make your documented methods easy to use, rewrite them like a recipe in a cookbook:

  1. Start with a complete list of ingredients (e.g., laboratory chemicals) and tools (e.g., a gas chromatograph; 50 Erlenmeyer flasks, each with a volume of 200 mL).
  2. Next, describe each step of the analysis as a numbered sentence presented in exactly the same order you would follow to perform the procedure. In effect, you are transforming the journal paper’s high-level overview of the analysis into the equivalent of a laboratory procedures manual.

Carefully note any lessons you learned from previous analyses or from the present analysis. For example, provide details of mistakes to avoid (e.g., types of bias), things you could do to make the work easier (e.g., specific software settings), ways to improve the validity of your data (e.g., triangulation, suggestions for dealing with skewed data distributions), and problems to avoid in the data analysis (e.g., an inadequate sample size). Describe each of these problems by explaining their cause, how to avoid that cause, and what to do if you’re unable to avoid that cause. Provide the most important of these lessons in the Methods section of your paper so that future researchers can benefit from your problems and solutions. In particular, describe any procedures you developed to minimize errors in data entry and retrieval.

For any research method, it’s essential to ask someone who is unfamiliar with your research to test your description. This test will reveal any implicit assumptions that should be made explicit (“why did you do that?”), identify any missing steps, and ensure that there is no question about what must be done in each step. If a protocol will be repeated over many years, the original researchers may have retired or moved to a different research institute by the time a new year’s group of graduate students arrives. In that case, ask one of the graduate students to follow the protocol to confirm that they get the same results you obtained; any discrepancy suggests that your instructions are flawed and must be corrected.

Note: It’s very useful to create a “Read Me First” document that explains all the contents of the Replication directory (what information is present, where to look for each type of information, and how to use the information when you find it).
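One way to start such a document is to generate a list of every file in the folder and then fill in the descriptions by hand. The Python sketch below does this; the function name and the wording of the placeholder text are illustrative assumptions only.

```python
from pathlib import Path


def draft_readme(replication_dir):
    """Return draft text for a 'Read Me First' file.

    Lists every file under replication_dir, one per line, followed by a
    placeholder that the researcher replaces with a real description.
    """
    root = Path(replication_dir)
    lines = ["READ ME FIRST", "", "Contents of this folder:", ""]
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            lines.append(
                f"{relative}: [describe what this file contains and how to use it]"
            )
    return "\n".join(lines) + "\n"
```

The generated list only tells readers where each file is. The explanations of what each file contains and how to use it still have to be written by someone who knows the study.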

For large datasets, you may find it inconvenient or difficult to provide the data to other researchers, particularly if the data is important and will be reused by many future researchers. Answering hundreds of requests to provide your data takes time that would be better spent on your research. To simplify the task of sharing your data, consider using a public data repository, so that other researchers can access your data without bothering you. The specific data repository you should use varies among journals. For example, many genetics journals ask authors to save gene sequences in a location such as the DNA Data Bank of Japan (DDBJ), whereas journals that specialize in Arabidopsis genetics may ask you to use The Arabidopsis Information Resource (TAIR). The journal Science asks its authors to use a non-profit publicly accessible site such as Dryad, Dataverse, or Zenodo.

The best format for storing textual and numeric data is debated. For example, some authors recommend saving all files in PDF format because the contents of the files cannot be easily altered and this format is likely to remain readable for a long time. Though that’s a reasonable proposal, it suffers from a significant problem: file formats become popular for many years, then abruptly disappear with little warning. (This happened recently with Adobe’s Flash software, which is no longer supported by most computer operating systems.) In more than 35 years of working with computers, I’ve seen popular word processor formats such as WordStar and WordPerfect 5 disappear, making files stored in these formats unreadable by the more modern software that replaced them (e.g., Microsoft Word). In addition, it’s difficult to extract information in a usable format from a PDF file if you make the mistake of using a complicated layout (e.g., tables, multiple columns of text). For these reasons, “text” (Unicode) format files, which include only Unicode characters, are a better choice. Any program can read text files, and such files will remain easily readable for the foreseeable future.

Where formatting is important, as in the case of data tables, you have two main options. First, for tabular data such as the contents of an Excel spreadsheet that does not contain calculation formulas, you can save the data in comma-delimited or tab-delimited format. In these formats, the software adds a comma or a tab character between consecutive values and a “hard return” character to mark the end of each row of data. (Don’t use comma-delimited format for data such as text that itself contains commas, unless your software encloses such values in quotation marks!) Most spreadsheets and databases can read these files easily. Second, for information that requires more formatting, consider the HTML format, since it offers two huge advantages over competing formats: it is stored in text format, so it can be easily read by most software, and it contains formatting tags (the words inside the < > brackets) that define the meaning of a group of information (e.g., a heading, a table cell) or that specify its formatting. It’s easy to search for and replace these tags if that becomes necessary.
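To illustrate the delimited formats, here is a minimal Python sketch that uses the standard csv module to write rows as tab-delimited text and read them back. The function names and the sample values are hypothetical. With its default settings, csv.writer encloses any value that contains the delimiter in quotation marks, which is one way to handle text that contains commas, although not every program that reads such files treats the quotation marks identically.

```python
import csv
import io


def to_delimited(rows, delimiter="\t"):
    """Serialize rows (lists of values) as delimited text, one row per line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter="\t"):
    """Parse delimited text back into rows of strings."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    return list(reader)


rows = [["site", "note"], ["A1", "wet, cold soil"]]

# Tab-delimited: the comma inside the note needs no special treatment.
tab_text = to_delimited(rows)

# Comma-delimited: the writer quotes the note because it contains a comma.
comma_text = to_delimited(rows, delimiter=",")
```

In this example, from_delimited(tab_text) and from_delimited(comma_text, delimiter=",") both return the original rows. Note that every value comes back as a string, so numbers must be converted after reading.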

Note: Research data is increasingly moving beyond text and numbers to include sound and graphics. Because these formats are frequently replaced or upgraded (e.g., to support greater compression to reduce file size, to include metadata), the best solution if these types of data must remain available for a decade or more may be to add an annual reminder on your computer’s calendar software to open the files in your current software and save them again in the newer format.

Final thoughts

To learn more, it’s worthwhile consulting the Teaching Integrity in Empirical Research (TIER) project, whose goal is to provide “guidance to students conducting quantitative research to help ensure that their work is transparent and reproducible”. If their guidelines are not directly applicable to your field of research, perhaps you could work with colleagues to create specific guidelines for your field. Future researchers will be very grateful.


©2004–2024 Geoffrey Hart. All rights reserved.