Geoff-Hart.com: Editing, Writing, and Translation


Analyzing your data

By Geoff Hart

Previously published as: Hart, G. 2018. Analyzing your data. World Translation Services, Japan. <https://www.worldts.com/english-writing/eigo-ronbun57/index.html>

Experimental design can be the hardest part of any research project. That’s unfortunate, because if you design your study poorly, you’ll collect data that is at best difficult to analyze or at worst meaningless. But once you have collected your data, you face an additional challenge: how to analyze it. In this article, I’ll build on my previous article on designing effective research by describing a few common problems you should avoid when you analyze your research data.

Normality and data transformations

Researchers often believe that data must be normally distributed before they can analyze it. If the sample size is relatively small, this is indeed a requirement for some familiar statistical tests, such as Pearson's correlation coefficient or analysis of variance (ANOVA). This is why you'll often see researchers describe how they transformed their data (e.g., using the logarithm of the measured values) before testing for significance. But in some cases, such as general linear models, it's more important that the error term (the residuals) be distributed normally. For example, see the paper by Marc Kéry and Jeff Hatfield (2003, Bulletin of the Ecological Society of America 84(2):92–94).
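To illustrate the distinction, here is a minimal sketch (using simulated data, not data from any study discussed here) of fitting a simple linear model and then testing the residuals, rather than the raw response values, for normality:

```python
import numpy as np
from scipy import stats

# Simulated data: a linear trend plus normally distributed noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)

# Fit a simple linear model and compute the residuals.
slope, intercept, r, p, se = stats.linregress(x, y)
residuals = y - (slope * x + intercept)

# The Shapiro-Wilk test is applied to the residuals, not to the raw
# y values; a small p-value would suggest non-normal residuals.
stat, p_resid = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value for residuals: {p_resid:.3f}")
```

The raw `y` values themselves need not look normal at all (they follow the trend in `x`); what matters for the model's assumptions is the distribution of the residuals.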

In addition to being unnecessary in many cases, transformation of the raw data can create problems. The most obvious—and therefore the easiest to forget—is that if the original data were not normally distributed, the transformed data conceal that lack of normality. When a statistical distribution is strongly skewed (non-normal), that may reveal the action of an important physical or biological phenomenon. An example might be the variable selection pressure caused by fishing, which preferentially eliminates the largest individuals in a population.

Transforming data to produce a normal distribution is not appropriate if the real-world phenomenon cannot reasonably be expected to have a normal distribution. For example, biological sex more closely approximates a bimodal distribution (one with two distinct peaks: male and female). Trying to transform data on the distribution of sexes to create a normal (unimodal) distribution would be illogical and misleading. Moreover, closer examination often reveals intermediate values (in this case, intersex individuals who have properties between those of the two dominant sexes), and in psychological research, the transformation would conceal important phenomena such as the difference between gender (one's perceived sex) and biological (chromosomal) sex.

The most important point about testing for normality is not related to normality itself, but rather to the fact that most statistical tests depend on specific assumptions and have specific requirements. If your data don't meet those requirements, you must either choose a different test or transform the data in a way that doesn't distort its meaning. Many researchers assume that their statistical software will warn them if a given test is inappropriate for their data; it often won't. Before you use any statistical test, learn whether your software will confirm that the test is valid, or whether you must manually confirm that the test is appropriate.
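As a brief sketch of "choosing a different test" (again with simulated data): when data are strongly skewed, a rank-based test such as Spearman's correlation avoids the normality assumption that underlies Pearson's r for small samples:

```python
import numpy as np
from scipy import stats

# Strongly right-skewed data with a monotone (but nonlinear) relationship.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)
y = x ** 2 * np.exp(rng.normal(0, 0.2, size=200))

# Pearson's r assumes a linear relationship; Spearman's rho works on
# ranks and requires only a monotone association.
pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```

Here the rank-based statistic captures the monotone association directly, without any transformation of the raw data and without pretending the data are normal.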

(Incorrectly) assuming linearity

Another common problem relates to regression analysis. Many researchers choose simple linear regression for scatterplots that show some degree of curvature (i.e., that show evidence of a nonlinear relationship). Linear regression has an additional problem: it assumes that the values of a variable can increase or decrease without bound, with no maximum or minimum value and no asymptote. For some physical processes, this may be a reasonable assumption; for most physical processes and all biological processes, this assumption is illogical and incorrect. As a result, even a strong and highly significant linear regression can be misleading.

A further problem is that even when a linear relationship is valid for a specific range of conditions, the relationship may change dramatically outside that range. Among physical processes, consider water: if you apply heat to liquid water, the temperature will increase at a rate defined by the water's heat capacity (about 4.2 kJ/(kg·K)). However, once the water freezes or boils, its heat capacity changes drastically, and predicting temperature changes using the linear relationship derived for liquid water will produce dramatically incorrect results. Among biological processes, consider an organism's population growth rate. Most organisms exhibit linear and exponential (nonlinear) growth during different growth stages or at different population densities. You can't predict nonlinear population growth using linear regression, and vice versa.

These examples demonstrate the importance of thinking about your data's physical meaning before you determine how to analyze the data. In many cases, it will be necessary to use nonlinear regression (e.g., a sigmoidal curve with asymptotes) or piecewise regression (with different linear or nonlinear equations for different ranges of data). Always critically consider the meaning of your data instead of blindly relying on the statistical results.
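A minimal sketch of nonlinear regression, using simulated population data (the parameter names `K`, `r`, and `t0` are illustrative, not from any specific study): a logistic growth curve has an asymptote (the carrying capacity) that no straight line can represent.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    """Logistic growth: carrying capacity K, growth rate r, midpoint t0."""
    return K / (1.0 + np.exp(-r * (t - t0)))

# Simulated population counts following a logistic curve plus noise.
rng = np.random.default_rng(1)
t = np.linspace(0, 20, 60)
pop = logistic(t, K=100.0, r=0.8, t0=10.0) + rng.normal(0, 2.0, t.size)

# curve_fit estimates the parameters; p0 supplies rough starting guesses.
params, _ = curve_fit(logistic, t, pop, p0=[80.0, 0.5, 8.0])
K_hat, r_hat, t0_hat = params
print(f"K = {K_hat:.1f}, r = {r_hat:.2f}, t0 = {t0_hat:.2f}")
```

The fitted curve respects the asymptote that the biology implies, whereas a linear fit to the same data would predict unbounded growth far outside the observed range.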

Think carefully before you analyze!

Statistical analysis and data presentation are far more complex than I can discuss in a short series of articles. Unfortunately, many researchers don’t learn these skills thoroughly as undergraduates, and have no time in graduate school to improve their understanding. As a result, they rely on emulating sometimes-flawed examples in the research literature. If you’re uncertain about your design, ask an expert (a statistician) for help. Statisticians can also help you analyze messy data that you might be unable to analyze on your own.

Additional thoughts on triangulation

In my previous article on experimental design, I wrote about "triangulation" as a way to validate hypotheses using datasets for two or more parameters. What you’re doing is describing a single phenomenon from two or more perspectives by formulating hypotheses that describe different aspects of the phenomenon. For example, consider coins. We know from experience that one side of the coin presents its denomination (i.e., amount and currency type) and the other side has a symbol of the nation that created the coin. If we examine one side of a round metal object and see a denomination, we can hypothesize that it is a coin and that the other side will have a national symbol—but we can only be confident of this if we actually examine the other side.

Acknowledgment

I thank Dr. Julian Norghauer (https://www.statsediting.com/) for suggestions about additional design and analysis topics, and for providing a reality check on what I’ve written.


©2004–2024 Geoffrey Hart. All rights reserved.