Analyzing your data (part 1 of 3): exploring your data

Obtaining data, whether in the lab or the field, can be the most time-consuming part of research. But eventually, you’ll return to your office with your data and face the next challenge: exploring the data to see what you’ve learned and then sharing it with the world. In this three-part article, I’ll discuss one way to think of this process: explore your data first, formally analyze the data to confirm the first impressions that result from that exploration, and then decide how to present your discoveries.

Note: Data analysis is a large, complex subject. Whole books have been written about it, so I cannot cover the entire subject here. Instead, my goal is to describe a helpful way to think about the process.

Decide what to look for

Before you start your exploration, define what you’re looking for. This depends on your research hypothesis. For example, if your goal is:

to get a sense of how your data are distributed: plot a histogram (i.e., a frequency or probability distribution)
to determine the specific values of measured variables: calculate a “measure of central tendency” such as the mean or median
to detect differences between experimental treatments: calculate the differences between the treatment means or medians
to describe the variation: calculate standard deviations, standard errors, coefficients of variation, or confidence intervals
to show trends: create graphs of variables as a function of changes in time or changes in another variable
to detect relationships: examine scatterplots for the relationships between pairs of variables
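
As a minimal sketch of the central-tendency and variation measures listed above, the following Python example uses only the standard library. The function name describe, the plant-height values, and the treatment names are hypothetical and serve only as illustration; the 95% interval uses a normal approximation (z = 1.96), which is an assumption that is too narrow for small samples.

```python
import math
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # sample standard deviation (n - 1)
    se = sd / math.sqrt(n)             # standard error of the mean
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "sd": sd,
        "se": se,
        # Coefficient of variation, undefined when the mean is zero
        "cv_percent": 100 * sd / mean if mean != 0 else float("nan"),
        # Rough 95% interval from the normal approximation (z = 1.96);
        # a t-based interval would be wider for small samples.
        "ci95": (mean - 1.96 * se, mean + 1.96 * se),
    }


# Hypothetical plant heights (cm) for two treatments
control = [12.1, 13.4, 11.8, 12.9, 13.0, 12.5]
fertilized = [14.2, 15.1, 13.9, 15.6, 14.8, 14.4]

for name, group in [("control", control), ("fertilized", fertilized)]:
    print(name, describe(group))

# Difference between treatment means
difference = statistics.mean(fertilized) - statistics.mean(control)
print("difference in means:", round(difference, 2))
```

Numbers like these are only a starting point for exploration; whether a difference is statistically meaningful is a question for the formal analysis that follows exploration.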

At this stage, your goal should be to obtain an overall understanding of your data by making values, differences, variations, trends, and relationships easier to see. Subsequently, as you think more deeply about your data, you may discover better ways to look at your data. We often think of tables and figures only as ways to present data to our readers, but during the exploration phase, they make it easier to explore our data and discover its secrets.

Because this stage is exploratory, don’t focus too narrowly. Keep an open mind as you look for relationships and patterns. Some of these may be unexpected and may or may not relate directly to your research questions. The purpose of exploration is not to confirm your prejudices by looking only for a specific pattern; it’s to discover new things, including patterns you did not anticipate. An overly narrow focus can cause you to miss important results, particularly when those results contradict a research hypothesis.

Group and compare your data

Next, divide the data into groups that you can compare to reveal similarities and differences. If your research is based on clearly defined treatments, group the data initially by treatment. If you based your research on different geographic locations, group your data by location. Now begin to compare those groups to look for similarities and differences both within and between groups.

Examine the data for each group on its own merits, rather than starting with an assumption of what you’ll see. Assumptions bias our thinking so that we see only data that support those assumptions (“confirmation bias”) instead of seeing what the data actually show. For example, a closer look at the data in one group may reveal the existence of two subgroups. Consider a preliminary graph of data points with the coordinates (x, y) that shows two clusters, one with high values of both x and y and the other with low values of both variables, separated by a large gap that contains no intermediate values. It’s natural to want to analyze all of your data as a single group, but in this case, you clearly need to subdivide the data into two groups and analyze each group separately. If you can propose a plausible explanation of that clustering, you may have discovered evidence for a physical mechanism that explains the separation between the two groups.
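
One simple way to check for the kind of gap described above is to sort the values of one variable and split them at the widest gap between neighbors. The sketch below is only an illustration: the function name split_at_largest_gap and the x values are hypothetical, and a wide gap is a hint to look closer, not proof that two subgroups exist.

```python
def split_at_largest_gap(values):
    """Sort values and split them at the widest gap between neighbors.

    Returns (low_group, high_group, gap_width).
    """
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    # Width of the gap between each pair of neighboring sorted values
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[: widest + 1], ordered[widest + 1:], gaps[widest]


# Hypothetical x values from one treatment group
x = [9.0, 1.2, 8.7, 1.5, 2.1, 9.4, 1.9, 9.9]
low, high, gap = split_at_largest_gap(x)
print("low group:", low)
print("high group:", high)
print("gap width:", round(gap, 2))
```

If the widest gap is many times larger than the typical spacing between neighboring values, it may be worth analyzing the two groups separately and asking what mechanism could separate them.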

Propose physical explanations for your data

Think carefully about the physical processes you’re studying and how those processes constrain the distribution of your data. Is it reasonable to assume that the relationship between two variables is linear for all possible values of each variable, with no minimum or maximum value and no regions where behavior changes? More often, there are discontinuities, thresholds, upper and lower limits, or other differences that constrain your data.

Note: See my previous article about linear regression for some additional thoughts on this subject.

If you studied a known physical process, the nature of that process may define separate groups of data. These groups may differ from the groups you were thinking of when you defined your research questions. For example, if your results are not normally distributed, what does this mean? Skewness may reveal a physical process that biases the measured values towards small or large values; for example, a forest that is harvested by selection harvesting, in which the harvesters remove only the largest trees, will show a skew towards a population dominated by small trees. In contrast, a bimodal distribution (one with two peaks) suggests a need to segment your population into two subgroups with different means and statistical distributions before you continue your analysis.
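
To put a number on the skewness mentioned above, one common measure is the third central moment divided by the 1.5 power of the second central moment. The following sketch assumes that definition (the population form, with no small-sample correction); the function name skewness and the tree diameters are hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = statistics.mean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical stem diameters (cm): many small trees, a few large ones
diameters = [8, 9, 10, 10, 11, 12, 12, 13, 15, 18, 24, 35]

print("mean:", statistics.mean(diameters))
print("median:", statistics.median(diameters))
print("skewness:", round(skewness(diameters), 2))
```

A positive value indicates a long tail towards large values, with most of the observations at the small end; a mean that is noticeably larger than the median points the same way. A value near zero does not rule out a bimodal distribution, so a histogram is still worth examining.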

Look for patterns and deviations

Although visual interpretations can guide how we analyze data by revealing patterns, don’t rely only on those subjective interpretations; our eyes often mislead us, particularly when we preferentially look for results that support our expectations. Confirm those interpretations using objective techniques. I’ll discuss those techniques in part 2 of this article.

Since you’re still exploring your data at this stage, avoid simplifications that can lead to incorrect interpretations of your data. For example, plot all graphs with both axes starting at 0. Although it’s tempting to plot only the range of data that contains your dataset to clarify details of the variations within that range, this conceals important context: how far your data lie from (0, 0) and how far they fall from the line y=x. Many authors reach incorrect conclusions when they ignore this context. If this additional context proves to be unimportant, you can subsequently graph only the part of the original graphs that contains your data.
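
As a small sketch of the advice above, the following helper computes axis limits that always include 0 and use the same scale on both axes, so that the line y = x runs along the diagonal of the graph. The function name limits_including_origin, the padding value, and the example data are hypothetical.

```python
def limits_including_origin(xs, ys, pad=0.05):
    """Return (x_limits, y_limits) that include 0 and share one scale."""
    lowest = min(0.0, min(xs), min(ys))
    highest = max(0.0, max(xs), max(ys))
    if highest == lowest:
        highest = 1.0          # all values are zero; use an arbitrary span
    margin = pad * (highest - lowest)
    # Pad below zero only when the data actually contain negative values
    lower = lowest - margin if lowest < 0 else 0.0
    shared = (lower, highest + margin)
    return shared, shared


# Hypothetical measurements that sit far from the origin
xs = [10.2, 11.0, 11.7, 12.4]
ys = [50.1, 52.3, 53.8, 55.0]
x_limits, y_limits = limits_including_origin(xs, ys)
print("x axis:", x_limits)
print("y axis:", y_limits)
```

In a plotting library such as Matplotlib, these tuples could be passed to the axes’ set_xlim and set_ylim methods before you decide whether to zoom in on the range that contains your data.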

Once you establish an overall understanding of the patterns in your data, look for deviations from those patterns (i.e., exceptions). Exceptions sometimes result from random statistical noise, but at other times they reveal a real and important departure from the pattern. Of course, they may also represent data-entry errors, which will be much easier to fix now than in several months, when you’re revising your paper after peer review and have forgotten the details of your raw data.
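
One widely used screening rule for such exceptions flags values that lie more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below assumes that rule; the function name flag_outliers and the readings are hypothetical, and a flagged value is only a candidate for a closer look, not a value to delete automatically.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return (index, value) pairs that fall outside the IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Hypothetical readings; 51.0 may be 5.10 with a misplaced decimal point
readings = [4.8, 5.1, 5.0, 4.9, 5.2, 51.0, 5.0, 4.7]
for index, value in flag_outliers(readings):
    print(f"check row {index}: {value}")
```

Reporting the position of each flagged value makes it easier to return to the raw data sheet and decide whether the value is a typing error, noise, or a real exception that deserves an explanation.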

Consider non-traditional ways to explore your data

When you’re exploring your data, consider unusual alternatives that can make patterns easier to see. For example, animation provides powerful insights. The human eye is exquisitely skilled at detecting visual changes, so you can learn much by animating a graph to show how a variable (e.g., vegetation cover in each month of the year) evolves during the year or how an organism (e.g., an infectious pathogen) moves through a community. Trying to detect such changes by examining a series of static graphs can reveal the same trend, but because this form of analysis is highly abstract, it is more difficult than actually seeing the change. The popular Origin software offers animation as a built-in tool, but you can also use less powerful tools, such as Excel and PowerPoint, to create animations.

Thinking even farther outside the box, some researchers have used sound to explore their data by taking advantage of the power of human hearing to detect changes in frequency or loudness. For example, if you convert all your data to magnitudes, and map those magnitudes to a sound volume or sound frequency, you can play the resulting “song”. The changes in volume or frequency may reveal trends that would be difficult to detect in other ways.
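
The mapping from magnitudes to sound frequency can be sketched with Python’s standard wave module. Everything specific here is an assumption chosen for illustration: the function names, the pitch range of 220 to 880 Hz, the note length, the output filename, and the monthly vegetation cover values.

```python
import math
import struct
import wave


def values_to_frequencies(values, low_hz=220.0, high_hz=880.0):
    """Linearly map each value onto a pitch between low_hz and high_hz."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [low_hz for _ in values]
    return [low_hz + (v - lo) / span * (high_hz - low_hz) for v in values]


def write_tones(path, frequencies, note_seconds=0.25, sample_rate=44100):
    """Write one sine-wave tone per frequency to a 16-bit mono WAV file."""
    samples_per_note = int(note_seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)
        for freq in frequencies:
            frames = bytearray()
            for i in range(samples_per_note):
                # Half amplitude leaves headroom and avoids clipping
                sample = 0.5 * math.sin(2 * math.pi * freq * i / sample_rate)
                frames += struct.pack("<h", int(sample * 32767))
            wav.writeframes(bytes(frames))


if __name__ == "__main__":
    # Hypothetical vegetation cover (%) for each month of the year
    monthly_cover = [5, 8, 20, 45, 70, 85, 90, 88, 65, 35, 12, 6]
    write_tones("cover.wav", values_to_frequencies(monthly_cover))
```

Playing the resulting file gives a rising and then falling pitch over the year. Mapping the values to volume instead of frequency would only require scaling the amplitude of each tone rather than its pitch.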

If you found that visual animations or sound files helped you to understand your data, your readers and other researchers will also find them useful. In that case, provide the software you used and the resulting visual or auditory data as online supplemental information.

Learn from your exploration

One thing your exploration may reveal is a more complex situation than you expected when you designed your study. Don’t assume that the simplest explanation is the best explanation; that’s a misunderstanding of Occam’s principle, which is often incorrectly assumed to mean that the simplest solution is most likely to be correct. The correct interpretation is that you should not choose a solution more complex than what is necessary to explain your data; a complex explanation is sometimes the only realistic explanation.

Exploration is not only a way to decide what you’ve found; it’s also an important way to improve your future research. Exploration may reveal problems such as an inadequate sample size or inadequate stratification. Learn from those problems and design your next study to mitigate them.

In part 2 of this article, I’ll describe how to rigorously confirm what you found.

Acknowledgments

I’m grateful for the reality check on my statistical descriptions provided by Dr. Julian Norghauer. Any errors in this article are my sole responsibility.