Uncovering and Predicting Trends with EDA

Knowing the rules that govern a variable’s behavior has a very clear payoff: being able to estimate its response with precision.

Today we will talk about how the knowledge generated with AutoDiscovery can be used to build predictive models that help cast light where we cannot reach with “just data”.

Recipe for a predictive model

In general terms, the goal of our work is to obtain a set of mathematical expressions that lead us to the values of our key variables (from now on, “response variables”) from other variables whose values are already known.

Canadian Oil Production (1960 to 2020)

This way, our model generation starts from two elements that we need to identify right from the beginning:

  • Our response variables: that is, those whose behaviour we wish to predict.

  • The data we already have available to train the model.

Our predictive model is a function of a training data set and our variables of interest

Training the model

The first step in building the model is “training” it: that is, feeding the model with data that come from real observations, in order to obtain the underlying set of basic rules governing our response variable.

In AutoDiscovery, that is carried out in a simple and efficient manner:

  1. Consolidation of observed data, coming from a number of different sources.

  2. Configuration of variables of interest, making them coincide with the response we are after.

  3. Initiation of a discovery process in order to describe the functioning of our response variables.

The identified numerical correlations and qualitative factors (ANOVAs) will become the foundations of our model. Of course, only those relationships tagged as “meaningful” may be considered within our future model.
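The two kinds of building block mentioned above can be pictured with a small Python sketch. It is only an illustration, not AutoDiscovery’s implementation: the function names `fit_line` and `group_summary` are hypothetical, the correlation is assumed to be a simple straight-line fit, and the spread of the residuals is assumed to be an acceptable error measure.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus a simple error term.

    Illustrative stand-in for a numerical correlation (regression + error).
    """
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Assumption: the sample standard deviation of the residuals is the error.
    return slope, intercept, stdev(residuals)


def group_summary(groups):
    """Mean and standard deviation of the response per category.

    Illustrative stand-in for a qualitative factor (mean + deviation).
    """
    return {name: (mean(values), stdev(values)) for name, values in groups.items()}


# Made-up observations, for illustration only.
slope, intercept, error = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
summary = group_summary({"a": [1.0, 2.0, 3.0], "b": [10.0, 12.0, 14.0]})
```

Each result carries both an estimate and a spread, which is the information the assembled model needs from every relationship.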

Assembling our model

The last stage in our building process is to merge all the knowledge produced, taking into account the heterogeneity of the selected relationships, that is:

  • The mathematical expression that sustains the relationship (regression+error in the case of a correlation, mean+deviation in the case of an ANOVA),

  • The conditions under which the relationship holds, and

  • The strength of the relationship.

For all those relationships that hold under the same conditions, we propose obtaining a single “average” mathematical expression, weighted by the strength of each relationship. Of course, that expression should include both the estimated value and its error.
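As a minimal sketch of that weighting, assume each relationship has already produced an estimated value, an error and a strength for the same conditions. The `Estimate` class and `combine` function are hypothetical names, and averaging the errors with the same weights is an assumption made here for simplicity, not a rule stated by AutoDiscovery.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    value: float     # response value predicted by one relationship
    error: float     # error attached to that prediction
    strength: float  # strength of the relationship, used as its weight


def combine(estimates):
    """Strength-weighted average of the values and of the errors."""
    total = sum(e.strength for e in estimates)
    if total <= 0:
        raise ValueError("at least one relationship with positive strength is needed")
    value = sum(e.strength * e.value for e in estimates) / total
    # Assumption: errors are averaged with the same weights as the values.
    error = sum(e.strength * e.error for e in estimates) / total
    return value, error


# Two made-up relationships that hold under the same conditions.
value, error = combine([Estimate(10.0, 2.0, 0.75), Estimate(20.0, 4.0, 0.25)])
# value == 12.5, error == 2.5
```

The stronger relationship pulls the combined estimate towards its own prediction, which is the intended effect of weighting by strength.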

Estimating the response value

In order to estimate our response value from the knowledge previously gathered, we simply need to supply the conditions, which are already known for the given model.

These known conditions allow us to choose which subset of mathematical expressions is relevant in that case and, of course, to supply the parameters that each of those functions requires.
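The selection step can be sketched as follows, under stated assumptions: conditions are represented as simple key/value pairs, a relationship applies when all of its conditions match the known ones, and the selected expressions are merged by a strength-weighted average. `Relationship`, `applies` and `estimate` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Relationship:
    conditions: Dict[str, str]                    # when the relationship holds
    predict: Callable[[Dict[str, float]], float]  # its mathematical expression
    error: float                                  # error of that expression
    strength: float                               # weight in the final estimate


def applies(rel, known_conditions):
    """True when every condition of the relationship matches the known ones."""
    return all(known_conditions.get(k) == v for k, v in rel.conditions.items())


def estimate(relationships, known_conditions, inputs):
    """Select the relevant expressions, evaluate them and weight the results."""
    relevant = [r for r in relationships if applies(r, known_conditions)]
    if not relevant:
        raise LookupError("no relationship matches the known conditions")
    total = sum(r.strength for r in relevant)
    value = sum(r.strength * r.predict(inputs) for r in relevant) / total
    # Assumption: errors are averaged with the same weights as the values.
    error = sum(r.strength * r.error for r in relevant) / total
    return value, error


# Made-up knowledge base: one regression and two group means.
rules = [
    Relationship({"region": "north"}, lambda x: 2.0 * x["temperature"] + 1.0, 0.5, 0.75),
    Relationship({"region": "north"}, lambda x: 12.0, 1.5, 0.25),
    Relationship({"region": "south"}, lambda x: 100.0, 5.0, 0.9),
]
value, error = estimate(rules, {"region": "north"}, {"temperature": 4.0})
# value == 9.75, error == 0.75
```

Only the two “north” relationships take part in the estimate; the “south” one is discarded because its conditions do not match.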

This flow diagram is a graphical summary of what has been described so far:

A flow diagram to estimate the response value

So ... do you dare to try with your own data? You cannot imagine what you could discover today ...