
Analyzing your data (part 2 of 3): statistical analysis

By Geoffrey Hart

In part 1 of this article, I described how to obtain a preliminary understanding of what your data means. Now it’s time to confirm that understanding by analyzing the data more rigorously. Doing so requires statistical analysis.

Match the analysis to the study design

Ideally, an experiment should be designed to directly support a specific form of data analysis. For example, a replicated randomized-block experimental design with a control and one or more treatments is popular because it directly supports one-way ANOVA, with “treatment” as the factor. Conversely, if you rigorously juxtapose a treatment with its control to create pairs of values obtained under nearly identical conditions, a paired-difference analysis (e.g., a paired t-test) may be suitable.
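
As an illustration only, here is a minimal Python sketch (using SciPy and hypothetical, randomly generated measurements) of how each design maps onto its test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # One-way ANOVA: a control and two treatments, each replicated 10 times.
    control = rng.normal(10.0, 2.0, size=10)
    treatment_a = rng.normal(12.0, 2.0, size=10)
    treatment_b = rng.normal(11.0, 2.0, size=10)
    f_stat, p_anova = stats.f_oneway(control, treatment_a, treatment_b)
    print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

    # Paired-difference test: each treatment value is paired with a control
    # value measured under nearly identical conditions.
    paired_control = rng.normal(10.0, 2.0, size=10)
    paired_treatment = paired_control + rng.normal(1.5, 0.5, size=10)
    t_stat, p_paired = stats.ttest_rel(paired_treatment, paired_control)
    print(f"Paired t-test: t = {t_stat:.2f}, p = {p_paired:.4f}")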

Always confirm that your actual data support the analytical method you chose. For example, common tests such as the t-test and ANOVA require normally distributed residuals (i.e., the error terms), not (as is commonly believed) normally distributed raw data; see Kéry and Hatfield (2003) for details. Your statistical software may test such assumptions and warn you when a test is inappropriate, but if the documentation doesn’t state that the software checks whether your data meet a test’s requirements, perform that check yourself. For example, ANOVA requires approximately equal (homogeneous) variance among the samples being compared, which a test such as Levene’s test can confirm. If the software does perform such checks, learn where to find the results in its output.
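
For example, here is a minimal sketch of such a confirmation, using hypothetical data; the Shapiro–Wilk test is used for the residuals only because it is widely available, and your statistician or software may recommend another normality test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical data: three groups (a control and two treatments) of 10 values each.
    groups = [rng.normal(mu, 2.0, size=10) for mu in (10.0, 12.0, 11.0)]

    # For a one-way ANOVA, the residuals are the deviations from each group's mean.
    residuals = np.concatenate([g - g.mean() for g in groups])

    # Test the residuals (not the raw data) for normality.
    shapiro_stat, shapiro_p = stats.shapiro(residuals)
    print(f"Shapiro-Wilk test on residuals: p = {shapiro_p:.3f}")

    # Levene's test for homogeneity of variance among the groups.
    levene_stat, levene_p = stats.levene(*groups)
    print(f"Levene's test: p = {levene_p:.3f}")
    # Small p-values (e.g., < 0.05) suggest that an assumption is violated.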

Note: If your experiment was not designed to support a specific statistical test, work with a statistician to find the optimal test for your data. Then, plan future experiments to support a specific test.

Look for significant results

Choose an appropriate category of test. Parametric tests are the most powerful because they use more of the information contained in the data; for example, they account for both the mean and the variance. However, they require independent, continuously distributed data, and often require normally distributed residuals and homogeneous variance. Here, “independent” means that the datasets being compared don’t depend on each other. For example, if you grow plants from different treatments in the same pot, their interactions within the pot may determine the response more strongly than the treatment does.

Nonparametric tests are less powerful than parametric tests because they use less of the information in the data; for example, a simple nonparametric test may use only the ranks of results (i.e., A > B) rather than the mean, standard deviation, skewness, or other characteristics of the data. But if you perform 100 trials and A > B in every case, despite very high standard deviations for the individual values of A and B, that result is as meaningful as a parametric test result. Nonparametric tests have the additional benefit of being less affected by outliers and by assumptions about the data’s distribution.
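
As a minimal illustration, using hypothetical skewed data, compare a parametric test with its rank-based nonparametric counterpart:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # Hypothetical skewed data with occasional large values in each group.
    a = rng.lognormal(mean=1.0, sigma=0.8, size=30)
    b = rng.lognormal(mean=1.4, sigma=0.8, size=30)

    # Parametric: compares means and assumes roughly normal residuals.
    t_stat, p_t = stats.ttest_ind(a, b)

    # Nonparametric: compares ranks only, so skew and outliers matter much less.
    u_stat, p_u = stats.mannwhitneyu(a, b, alternative="two-sided")

    print(f"t-test: p = {p_t:.4f}; Mann-Whitney U: p = {p_u:.4f}")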

Note: For normally distributed data, tests based on means are appropriate; tests based on medians are likely to be more appropriate for data that contains several unusually large or small values that would distort the mean.

Should you transform the data?

When data don’t meet the requirements for a statistical test, that test will not produce trustworthy results even if it reports statistical significance, so choose a more suitable test. However, you may still want to use the original test because it perfectly fits your experimental design. In that case, it’s sometimes appropriate to transform the data so that it meets the test’s requirements. For this approach to be valid, the data should be continuous and the transformation should be invertible; that is, inverting the transformation should restore the original data. For example, the exponential function inverts the corresponding logarithmic function.

Transformations don’t always succeed, so ask a statistician which transformation is most appropriate for your data and which test will confirm that the transformation succeeded. For example, if the goal of transformation is to produce a normal distribution for the residuals, you can perform a test such as the Kolmogorov–Smirnov test to confirm that the residuals are now normally distributed.
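
Here is a minimal sketch of this workflow, assuming hypothetical log-normally distributed data. For simplicity, the test is applied to the transformed values themselves rather than to model residuals, and the Kolmogorov–Smirnov test uses the sample’s own mean and standard deviation, which makes it only approximate:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    raw = rng.lognormal(mean=2.0, sigma=0.5, size=200)   # hypothetical skewed data

    transformed = np.log(raw)          # the transformation
    restored = np.exp(transformed)     # the exponential inverts the logarithm
    assert np.allclose(raw, restored)  # i.e., the transformation is invertible

    # Kolmogorov-Smirnov test against a normal distribution; estimating the
    # mean and standard deviation from the same sample makes this approximate.
    ks_stat, ks_p = stats.kstest(transformed, "norm",
                                 args=(transformed.mean(), transformed.std(ddof=1)))
    print(f"KS test on the transformed values: p = {ks_p:.3f}")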

Transformation can create problems. Consider a logarithmic transformation. First, it may lead you to ignore important characteristics of your data, such as the fact that your study system does not produce normally distributed data. Second, transformed data shows a different pattern than the original data; forgetting this leads to misleading conclusions. Third, testing hypotheses based on the transformed data may accurately reflect the statistical significance of the transformed data, but not the real-world (practical) significance of the non-transformed data. Fourth, using a logarithmic transformation for count data (e.g., the number of individuals) requires the addition of a small value to all raw data to eliminate 0 values before the transformation. This adjustment is unlikely to be necessary for a continuous variable such as length; if you are able to measure a length, then by definition that length cannot be 0 (i.e., a physical object always has some non-zero length). The added value must be chosen so that it does not distort the results. O’Hara and Kotze (2010) and Feng et al. (2019) describe some of the problems with logarithmic transformation, and Warton and Hui (2011) provide some cautions about arcsine transformations.
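
To illustrate only the zero-count adjustment described above, here is a small sketch using hypothetical counts; note that O’Hara and Kotze (2010) argue against log-transforming counts at all:

    import numpy as np

    counts = np.array([0, 1, 2, 0, 5, 12, 3])   # hypothetical counts, including zeros

    # log(0) is undefined, so a constant must be added first; log1p adds 1.
    log_counts = np.log1p(counts)                # equivalent to np.log(counts + 1)

    # The choice of constant affects the results; compare a smaller offset.
    log_counts_small = np.log(counts + 0.5)
    print(log_counts.round(2))
    print(log_counts_small.round(2))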

Perform a reality check on graphs

As you explored your data visually using graphs (see part 1 of this article), you began to detect certain patterns and trends. Now it’s time to confirm those interpretations. A preliminary graph of your data may show data that falls along a straight or curved line, or the data may form distinct clusters separated by a clear gap. Choose an analytical method that agrees with that pattern. For example, if your graph shows data that parallels the line y = x, you can analyze the data as a single group using a single linear regression equation. But if the data forms two groups, one lying mostly above the line and the other mostly below it, consider performing separate regressions for the two groups to see whether doing so improves your results. So long as there is a plausible reason to expect two distinct groups with different values (e.g., an organism with two sexes), you can easily justify this analysis.
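
As a minimal sketch of that comparison, using hypothetical data in which two groups straddle the 1:1 line:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, size=60)
    group = np.repeat([0, 1], 30)
    # Hypothetical data: one group lies above the line y = x, the other below it.
    y = x + np.where(group == 0, 1.5, -1.5) + rng.normal(0, 0.5, size=60)

    pooled = stats.linregress(x, y)
    print(f"Pooled regression: R2 = {pooled.rvalue**2:.3f}")

    for g in (0, 1):
        fit = stats.linregress(x[group == g], y[group == g])
        print(f"Group {g}: slope = {fit.slope:.2f}, R2 = {fit.rvalue**2:.3f}")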

Instead of saying that two trends are similar because the zigs and zags of the graphs appear to follow the same pattern, quantify that similarity. For example, calculate a correlation (e.g., using Pearson’s r) to confirm that the relationship is statistically significant (i.e., is more likely to be real). For time series, consider repeating your analysis using time-lagged data; that is, test whether the value of the response at time t+1 correlates with the value of the driver at time t. This is particularly important if you know that changes in an independent variable (e.g., adding heat to a system) will take some time to produce a result (e.g., a change in the system’s temperature).
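
For example, here is a minimal sketch with hypothetical series in which the response lags the driver by one time step:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    heat_input = rng.normal(0, 1, size=100)
    # Hypothetical response: it follows the driver one time step later, plus noise.
    temperature = np.concatenate([[0.0], heat_input[:-1]]) + rng.normal(0, 0.3, size=100)

    r_same, p_same = stats.pearsonr(heat_input, temperature)
    # Lagged comparison: driver at time t versus response at time t+1.
    r_lag, p_lag = stats.pearsonr(heat_input[:-1], temperature[1:])
    print(f"No lag: r = {r_same:.2f} (p = {p_same:.3f})")
    print(f"One-step lag: r = {r_lag:.2f} (p = {p_lag:.3f})")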

Note: Use terminology carefully. Correlation strengths should be reported as correlation coefficients (usually, r values), whereas regression strengths should be reported as goodness-of-fit values (usually, R² values). They are not the same parameter, although for a simple linear regression, R² = r × r. If you only calculate a correlation, the calculation will not provide a regression equation.
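
A quick check of that relationship, using hypothetical data and a simple linear regression:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.uniform(0, 10, size=50)
    y = 2.0 * x + rng.normal(0, 2.0, size=50)   # hypothetical linear relationship

    r, _ = stats.pearsonr(x, y)                 # correlation only: no equation
    fit = stats.linregress(x, y)                # regression: equation plus goodness of fit
    print(f"r = {r:.3f}, r x r = {r**2:.3f}, R2 = {fit.rvalue**2:.3f}")
    print(f"Regression equation: y = {fit.slope:.2f}x + {fit.intercept:.2f}")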

Define outliers

When you look for outliers that should be excluded from your analysis, objectively define your threshold for identifying an outlier. For example, use a criterion such as excluding values that fall more than three standard deviations from the mean; for normally distributed data, only about 0.3% of values fall that far from the mean (i.e., p < 0.01), giving you reasonable confidence that the data can be excluded. (Some fields, such as particle physics, use “five sigma”, which represents five standard deviations, as their standard of proof.) However, if you find several outliers, ask whether they may represent real results rather than errors. For example, in a study of plant communities, outliers may represent individuals incorrectly assigned to the focal species, or genetically unusual members of the correct species. They may instead be data-entry errors (e.g., typos). The more outliers you find, and the more they cluster together, the more likely it is that they mean something.
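
A minimal sketch of such an objective threshold, using hypothetical data with two planted extreme values:

    import numpy as np

    rng = np.random.default_rng(6)
    data = np.append(rng.normal(50.0, 5.0, size=200), [95.0, 4.0])  # two planted extremes

    # Flag values more than three standard deviations from the mean.
    z_scores = (data - data.mean()) / data.std(ddof=1)
    outliers = data[np.abs(z_scores) > 3]
    print("Flagged as possible outliers:", outliers)
    # Flagged values still deserve scrutiny: they may be data-entry errors,
    # or real (and interesting) observations.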

Choose the right variables

Often, two or more related variables may be relevant to an analysis, and one of them is likely to relate more directly to the physical process you’re studying and therefore produce better results. For example, many studies of regional plant growth use the mean annual temperature as an independent variable, but this can be misleading. Dublin (Ireland) has a mean annual temperature of about 9.7°C and Toronto (Canada) has a mean annual temperature of about 8.3°C. That difference seems minor, but plants are dormant for at least 3 months of the year in Toronto because temperatures drop below 0°C, whereas plants can grow year-round in Dublin because the temperature rarely drops below 5°C. The annual temperature range (the lowest and highest monthly averages) or the temperature during the growing season is likely to be a more meaningful variable.

In part 3 of this article, I’ll describe how to present your results.

Acknowledgments

I’m grateful for the reality check on my statistical descriptions provided by Dr. Julian Norghauer. Any errors in this article are my sole responsibility.

References

Feng, C.Y., Wang, H.Y., Lu, N.J. et al. 2019. Log-transformation and its implications for data analysis. Shanghai Archives of Psychiatry 26(2): 105-109.

Kéry, M., Hatfield, J.S. 2003. Normality of raw data in general linear models: the most widespread myth in statistics. Bulletin of the Ecological Society of America 84(2): 92-94.

O’Hara, R.B., Kotze, D.J. 2010. Do not log-transform count data. Methods in Ecology and Evolution 1(2): 118-122.

Warton, D.I., Hui, F.K.C. 2011. The arcsine is asinine: the analysis of proportions in ecology. Ecology 92(1): 3-10.
