How to Interpret the Exploratory Results of AutoDiscovery

Discovery Map showing potential relationships between biomarkers, mitosis and tumor stages

The main outcome of the intelligent exhaustive exploratory data analysis process of AutoDiscovery is the list of the most relevant potential relationships between the variables of our clinical study or biomedical research project.

But how should we interpret those relationships? What new insights do they provide?

If you want to know the answer to these questions, keep reading!

# Exploratory statistics in AutoDiscovery

As we explained some time ago, AutoDiscovery is an automated exploratory data analysis tool. This kind of statistical analysis is different from, and complementary to, regular hypothesis testing.

How exploratory and confirmatory data analysis complement each other

These differences are important for understanding that AutoDiscovery is in no way a substitute for confirmatory statistical tools. Rather, it helps in the early stages of our studies to identify the most promising areas for future research projects.

# Significance, strength and exclusiveness

Because the goal of AutoDiscovery's exploratory process is to minimize type II errors (beta, or false negatives), it does not apply any significance or multiple-testing corrections. The price is a higher chance of false positives, which is why the true potential of every relationship suggested by AutoDiscovery must be properly assessed by the researcher, based on experience or by means of a further confirmatory study.
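To get a feel for why uncorrected exploratory results need a second look, the arithmetic can be sketched as follows. This is only an illustration: the function names and the 20-variable dataset are hypothetical, and the family-wise formula assumes independent tests, which pairwise tests on a shared dataset generally are not.

```python
def n_pairs(n_variables: int) -> int:
    """Number of distinct variable pairs in a dataset."""
    return n_variables * (n_variables - 1) // 2


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected count of false positives if every null hypothesis were true."""
    return n_tests * alpha


def family_wise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical dataset with 20 variables, each pair tested once at p < 0.05.
pairs = n_pairs(20)
print("pairs tested:", pairs)
print("expected false positives:", expected_false_positives(pairs))
print("family-wise error rate:", family_wise_error_rate(pairs))
```

Even a modest dataset yields enough pairwise tests that some significant results are expected by chance alone, so strength and exclusiveness are useful for deciding which ones deserve a confirmatory study.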

AutoDiscovery complements the minimum significance criterion (p < 0.05) with two additional exploratory-specific items: strength and exclusiveness.


Strength is a quantitative assessment of the intensity of the association between two given variables. The rule of thumb is: the more evident the potential effect of one variable on the other, the higher the strength of that relationship. As an example, these graphs show how the relationship on the right has a higher strength (175%) than the one on the left (10%).

Visual interpretation of the strength of a relationship


Exclusiveness of a given relationship is a qualitative assessment of its potential scientific relevance. Exclusiveness is associated with the different subsets of our samples in which that relationship was identified.

For example, a potential relationship between a gene X and overall survival that is identified only in a particular group of our patients with a specific treatment (and not in the rest of the patients) is considered by AutoDiscovery as "more interesting" than other relationships identified in the entire sample, which may already be familiar to the researcher.

Strength and exclusiveness are thus very useful to prioritise and focus our (scarce) resources on those relationships with higher scientific potential.

# 3 types of statistical relationships in AutoDiscovery

AutoDiscovery is designed to exhaustively analyze the potential relationships between every pair of variables in our dataset applying the proper exploratory statistical test in each case.

This automatic selection of the proper test is done after a thorough analysis of the data type, statistical distribution and other methodological aspects of both variables involved in the relationship, much as an expert human biostatistician would do.

In general terms, AutoDiscovery is able to identify three types of statistical relationships.
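As a rough orientation, the choice between these three families depends on whether each variable is quantitative or categorical. The sketch below captures only that rule. The function name and labels are hypothetical, and the real selection also weighs statistical distribution and other methodological aspects that are not modelled here.

```python
def choose_test_family(a_is_categorical: bool, b_is_categorical: bool) -> str:
    """Pick a test family from the data types of the two variables.

    Simplified illustration: only the quantitative/categorical split is used.
    """
    if a_is_categorical and b_is_categorical:
        return "contingency"            # both categorical
    if a_is_categorical or b_is_categorical:
        return "analysis of variance"   # one factor, one quantitative response
    return "correlation"                # both quantitative


# Hypothetical variable pairs from a clinical dataset.
print(choose_test_family(False, False))  # e.g. biomarker level vs. age
print(choose_test_family(True, False))   # e.g. treatment vs. Ki67 level
print(choose_test_family(True, True))    # e.g. treatment vs. tumor stage
```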

## Monotonic relationships (correlations)

Using the Spearman correlation test, a non-parametric test that can be applied even to non-normally distributed data, AutoDiscovery measures the degree of monotonic dependence (or inter-relationship) between both variables, whether linear or not.

This test is automatically applied by AutoDiscovery when both variables are quantitative.

An example of negative correlation between two quantitative variables in our study

A positive correlation between variables A and B should be interpreted as ...

“as A increases, B increases too, suggesting a potential direct relationship between both variables.”

while a negative correlation should be interpreted as ...

“as A increases, B decreases, suggesting a potential inverse relationship between both variables.”

The strength of a correlation is determined by Spearman's rank correlation coefficient, expressed as a percentage.
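For readers who want to see what Spearman's coefficient measures, here is a minimal standard-library sketch: rank both variables (ties get the average rank), then compute the Pearson correlation of the ranks. The data are made up, and reporting the absolute value as a percentage is an assumption about how a strength figure could be derived, not a description of AutoDiscovery's internals.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: b grows with a in a non-linear but monotonic way; c falls.
a = [1, 2, 3, 4, 5]
b = [2, 4, 8, 16, 32]
c = [10, 8, 7, 3, 1]
rho_ab = spearman(a, b)
rho_ac = spearman(a, c)
print("a vs b:", rho_ab, "strength %:", abs(rho_ab) * 100)
print("a vs c:", rho_ac, "strength %:", abs(rho_ac) * 100)
```

Because only the ordering of the values matters, the exponential curve in `b` is rated as strongly as a straight line would be.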

## Differences between treatments (analysis of variance)

AutoDiscovery applies a variety of statistical calculations within the analysis of variance (ANOVA) family to assess whether different treatments ("factor" variables) show significant differences in other variables ("responses"). In other words, it identifies which mean responses differ significantly depending on which factors.

These tests are automatically applied by AutoDiscovery when one of the variables is categorical (factor) and the other one is quantitative (response).

An example of a significantly different mean response (Ki67 levels) identified with an analysis of variance

Such a relationship should be interpreted as ...

“the differences between the mean 'response' of the groups given by the 'factor' are statistically significant, suggesting a potential association between both variables.”

The strength of an analysis of variance is determined by the average effect size of the factor, expressed as a percentage.
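The article does not say which effect size measure is used, so the sketch below takes eta squared, one common choice for a one-way analysis of variance, purely as an assumption. It is the share of the total variation in the response that is explained by the differences between group means. The Ki67 values and stage labels are invented for illustration.

```python
def eta_squared(groups):
    """Effect size of a factor: between-group sum of squares over the total.

    groups maps each factor level to its list of response values.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Invented Ki67 levels grouped by a hypothetical tumor stage factor.
ki67_by_stage = {
    "I": [10, 12, 14],
    "II": [20, 22, 24],
    "III": [30, 32, 34],
}
effect = eta_squared(ki67_by_stage)
print("eta squared:", effect, "as %:", effect * 100)
```

Values close to 100% mean that the group a sample belongs to explains almost all of the variation in the response, while values close to 0% mean that the groups barely differ.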

## Heterogeneous distributions (contingencies)

Using Cramér's V coefficient, AutoDiscovery compares the observed joint frequency distribution of both variables with the frequencies that would be expected if the variables were independent.

A contingency table is composed of two variables and is based on the calculation of percentages. The goal of this statistical method is to assess whether both variables are related, and the way to do that is through the distribution of these percentages.

This method is automatically applied by AutoDiscovery when both variables are categorical.

An example of contingency table analysis suggesting a potential association between both variables

The heterogeneity is usually caused by an unexpectedly large number of samples in a particular combination (X, Y) of both variables (A, B). That's why such a relationship should be interpreted, in general, as ...

“the number of samples in our study with the attributes X and Y in the variables A and B is significantly higher/lower than the expected number, suggesting a potential relationship between both attributes.”

The strength of a contingency analysis is determined by the value of Cramér's V coefficient, expressed as a percentage.
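As an illustration of how Cramér's V summarises a contingency table, the sketch below computes the chi-squared statistic from observed and expected counts and rescales it to the 0 to 1 range. It is the textbook formula without any bias correction, it assumes every expected count is greater than zero, and the tables are invented.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k))


# Invented counts: rows are levels of variable A, columns are levels of B.
associated = [[30, 10], [10, 30]]
independent = [[10, 20], [20, 40]]
print("associated table, V as %:", cramers_v(associated) * 100)
print("independent table, V as %:", cramers_v(independent) * 100)
```

In the first table two combinations hold many more samples than independence would predict, which gives a clearly non-zero V. In the second table every cell matches its expected count, so V is zero.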

# Try it out with your own dataset (it's free)

Ray G. Butler (R&D Manager)

If you want to unveil relevant relationships that may be hidden in a dataset of your own study, do not hesitate to contact us. Just ask for a free pilot to assess to what extent AutoDiscovery might be of help in your research projects.
