How to Design an Efficient EDA Study


By their very nature, exploratory studies keep both the initial questions and the research process more open and flexible than in the classic confirmatory analysis stage, without losing effectiveness or accuracy.

Would you like to know how to design an efficient exploratory data analysis study? Then keep reading…

The exploratory question

As noted on previous occasions, the greatest contribution of an exploratory study comes when it is carried out before the confirmatory study.

Exploratory vs Confirmatory Research

Because research is a cyclic process, an exploratory study may be needed in order to:

  1. Generate more effective hypotheses, which are to be confirmed at a later stage.

  2. Improve the knowledge gathered from a previous confirmatory study.

Whatever the goal may be, we must express it in the form of what I call an “exploratory question”.

The exploratory question is the counterpart of the formal hypothesis of a confirmatory study, but without the same level of precision and specificity.

For example, given a formal hypothesis such as:

“to assess the role of the inducible nitric oxide synthase on surgical pain”

the equivalent exploratory question could be:

“which are the most relevant biochemical compounds involved in surgical pain?”

As you can see, the exploratory question concerns relationships among complex (multidimensional) groups of data in which the most relevant specific elements are not yet known.

>>> Our experience in dozens of scientific projects has allowed us to identify the 5 most common kinds of questions that researchers and data scientists ask themselves. <<<

Data is the starting point

Although the existing body of knowledge is very important when designing an exploratory study, the foundation of our work will always be the data.

The second step of an exploratory study is therefore to identify which sources and datasets are best suited to answering the exploratory question formulated in the first step.

The selected datasets must always meet a number of minimum requirements:

  • RELEVANCE: data must relate to the subject described by the exploratory question.

  • RELIABILITY: data must have been generated through capture and storage processes that are appropriate for the given research study.

  • SUITABILITY: data must contain the greatest possible number of dimensions (variables) and records related, however vaguely or faintly, to the subject under exploration.

  • AUTOMATION: data must be ready to be stored, exported and processed by automatic data exploration tools.
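Two of these requirements, SUITABILITY and AUTOMATION, can be given a quick first check before any exploration starts. The following is only a minimal sketch, assuming the dataset is available as CSV text; the function name `profile_dataset`, the column names and the sample values are all hypothetical and invented for illustration. It counts records and variables and reports how many cells are actually filled in for each variable.

```python
import csv
import io


def profile_dataset(csv_text):
    """Summarise a CSV dataset: record count, variables, share of filled cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    variables = list(rows[0].keys()) if rows else []
    completeness = {}
    for name in variables:
        # A cell counts as filled when it holds something other than whitespace.
        filled = sum(1 for row in rows if (row[name] or "").strip() != "")
        completeness[name] = filled / len(rows)
    return {
        "records": len(rows),
        "variables": variables,
        "completeness": completeness,
    }


# Hypothetical sample data, invented for illustration only.
SAMPLE = """patient_id,pain_score,compound_a,compound_b
1,7,0.82,
2,3,0.31,1.4
3,5,,1.1
4,8,0.90,0.7
"""

profile = profile_dataset(SAMPLE)
print(profile["records"], "records,", len(profile["variables"]), "variables")
for name, share in profile["completeness"].items():
    print(f"{name}: {share:.0%} filled")
```

A profile like this says nothing about RELEVANCE or RELIABILITY, which still depend on knowing how and why the data was captured.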

These datasets may be collected from different sources:

  • Open-access databases: these contain relevant and reliable information on multiple subjects. Many free tools are available, which saves us resources and effort. Although these databases may turn out to be less “complete” than required, they will still help us get closer to the problem.

  • Your own prior studies: it is common for a researcher (or colleagues from other teams) to already have experience in the area they are about to start working in, and most likely they will still have data stored from that previous experience. That dataset is usually closer to the problem at hand, although it is probably also smaller.

  • A preliminary stage of the current study: if neither of the previous options is available, it is worth extending the design of the project to include a preliminary stage in which generic data related to the exploratory question is acquired.

Let your data tell the story

We have the data and know the question… why not let the data tell its own story?

With AutoDiscovery this process is very simple and consists of three steps:

  • CONSOLIDATE: all datasets coming from the different sources must be aggregated into a single dataset. This makes it possible to identify relationships across sources.

  • DISCOVER: having defined the exploratory question, we must ask the software to search, filter and prioritize all relationships that are relevant to the subject of the exploratory question, that is, relationships involving the variables that appear in the question.

  • EXPLORE: the discovery process generates a visual representation of the story behind those variables. We now need to “read” that story in order to explore and understand the subject of our exploratory question.
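To make the three steps more concrete, here is a rough sketch in plain Python. It is not how AutoDiscovery works internally: as a stand-in for the discovery step it simply ranks variables by the absolute Pearson correlation with the variable named in the exploratory question. The function names (`consolidate`, `pearson`, `discover`), the field names and the sample values are all hypothetical.

```python
from math import sqrt


def consolidate(*sources, key):
    """CONSOLIDATE: merge several lists of records into one record per key."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def discover(records, target, key):
    """DISCOVER: rank every other variable by |r| with the target variable."""
    names = {name for record in records for name in record} - {target, key}
    ranking = []
    for name in sorted(names):
        pairs = [(r[name], r[target]) for r in records if name in r and target in r]
        if len(pairs) >= 3:  # too few shared records would be meaningless
            xs, ys = zip(*pairs)
            ranking.append((name, pearson(xs, ys)))
    return sorted(ranking, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data from two sources, invented for illustration only.
lab = [
    {"patient_id": 1, "compound_a": 0.8, "compound_b": 1.0},
    {"patient_id": 2, "compound_a": 0.3, "compound_b": 1.4},
    {"patient_id": 3, "compound_a": 0.5, "compound_b": 0.9},
    {"patient_id": 4, "compound_a": 0.9, "compound_b": 1.2},
]
clinic = [
    {"patient_id": 1, "pain_score": 7},
    {"patient_id": 2, "pain_score": 3},
    {"patient_id": 3, "pain_score": 5},
    {"patient_id": 4, "pain_score": 8},
]

dataset = consolidate(lab, clinic, key="patient_id")
ranking = discover(dataset, target="pain_score", key="patient_id")

# EXPLORE: a plain-text stand-in for the visual representation.
for name, r in ranking:
    print(f"{name:12s} r = {r:+.2f}")
```

Correlation only captures linear, pairwise relationships, and four records are far too few to conclude anything; the point is just to show how a consolidated dataset lets variables from different sources be compared against the one the exploratory question is about.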

In the next (and last) chapter of this series I will explain precisely how to get the most out of the results generated by the exploration.

Do you dare to try it with your own data? You cannot imagine what you could discover today…