Unlock the Power of Automation: Why Automating Data Exploration is Essential for Your Efficiency
If you have ever analyzed data, you may have ended up not exploring all of the available data for a simple reason: time.
In other words, you may have started by assuming that some features were not important, based on your prior idea of the data. This happens especially with non-numerical data: we tend to drop them because we often have no time to treat them correctly, but this leads to biased results.
The question is: is there a solution to all of this? Can we really explore all the data we have without risking biased results?
The answer is yes, and it has a name and a surname: Automated Exploratory Data Analysis.
Automated EDA is the process of using algorithms to identify patterns, correlations, and anomalies in data in an automated, smart, comprehensive way.
This process can help analysts gain insights quickly and accurately, reducing the time and effort spent on manual EDA (e.g. scripts or dashboards). It also makes it possible to explore all the available data, uncovering hidden patterns and correlations that manual analysis might miss.
In this article, we’ll show a particular Automated EDA software called AutoDiscovery, and we’ll see the benefits of using it in a real example.
AutoDiscovery is an intelligent automated exploratory data analysis software that has been primarily created for biomedical researchers to unveil complex relationships hidden in the data files of scientific experiments and clinical trials.
It can also perform automated EDA on data from any field, as we’ll see in the example provided at the end of this article.
Benefits of using automated EDA
When we analyze data, we understand the power of automation, but let’s be honest: sometimes we may have the feeling that we are losing control over our data and over our explorations and processes.
This is totally understandable, but the truth is that it is only a matter of habit. So, let’s see some benefits of automating our EDAs:
Increased Efficiency: Automated data exploration can quickly process large datasets and perform various data operations in less time than manual exploration.
Cost Savings: Automated data exploration can save time and money compared to manual exploration, which can require extensive human resources.
Increased Accuracy: Automated data exploration can identify patterns and anomalies in data more accurately than manual exploration.
Improved Insights: Automated data exploration can provide insights that may be overlooked in manual exploration.
Increased Scalability: Automated data exploration can scale up or down depending on the size of the dataset, allowing for more efficient data exploration.
Increased Flexibility: Automated data exploration can quickly adjust to changing data requirements and can be easily adapted to new data sources.
Showing the benefits of AutoDiscovery
We had the chance to test AutoDiscovery on a project we had previously completed, and here we’ll see what it added to our analysis.
Our solution can be found on GitHub here, and it is about creating a data-driven strategy for a wine marketplace.
This is what the data look like in a Jupyter Notebook environment:
What the data look like. Image by Author.
The data were taken from Kaggle and record: the country of the wine, the description, the designation, the points that tasters gave to the bottle, the taster names and their Twitter accounts, the price of the bottle tasted, the province and region of origin of the wine, the title and the variety of the wine, and the winery where the bottles come from.
Now, we wanted to extract some insights from this data but, as we can see, the majority of the data are not numerical. So, since we had to complete this project for an exam within a limited amount of time, we did the most natural thing: we assumed we just needed the numerical data.
Or, to put it another way: we created a solution starting with the data we could “rapidly” use. In fact, we had an open-ended task, with many possible paths to follow and explore, all related to one goal: understanding how the value of wine works. So, our idea was simply to create a strategy for a marketplace like so:
check for a possible correlation between the price and the points (the higher-priced bottles could have received the highest points, for example).
subdivide the wines by region, defining which wines to buy for a hypothetical wine marketplace based on their points and price. This way, a physical marketplace can be created by buying bottles from a particular region (e.g., just European wines) that received certain points and fall within a certain price range.
Let’s briefly see some results.
Firstly, there is no strong correlation between the price and the points:
The correlation matrix. Image by Author.
This is an important result because it suggests that the tasters gave points to the bottles regardless of their price.
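If you want to run this kind of check yourself, here is a minimal sketch in pandas. The rows are made-up toy values, not the Kaggle data, and the column names `price` and `points` are assumed from the description of the dataset; on the real file you would load the CSV instead of building the frame by hand.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is much larger.
wines = pd.DataFrame({
    "points": [87, 90, 85, 92, 88, 95, 86, 91],
    "price": [15.0, 40.0, 30.0, 25.0, 60.0, 110.0, 12.0, 18.0],
})

numeric = wines[["price", "points"]]

# Pearson measures linear association between the two columns.
pearson = numeric.corr(method="pearson")

# Spearman works on ranks, so a few very expensive bottles weigh less.
spearman = numeric.corr(method="spearman")

print("Pearson:\n", pearson.round(2))
print("Spearman:\n", spearman.round(2))
```

Looking at both coefficients is a design choice of this sketch: prices are usually skewed, so the rank-based value is a useful second opinion.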
Also, to our great surprise, we can see that the highest prices are spread across the whole of Europe. Furthermore, we have an important outlier in the US:
The price of the bottles by country. Image by Author
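A plot like this one can be backed by a small per-country summary. The following is only a sketch on invented numbers (the `country` and `price` column names are assumed from the dataset description), showing how a single extreme bottle stands out when we compare the median and the maximum price.

```python
import pandas as pd

# Hypothetical toy rows, including one extreme price for the US.
wines = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "US", "US"],
    "price": [20.0, 45.0, 30.0, 80.0, 25.0, 2000.0],
})

# Median is robust to outliers; max reveals them.
summary = (
    wines.groupby("country")["price"]
    .agg(["median", "max"])
    .sort_values("max", ascending=False)
)
print(summary)
```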
But we can do a lot more, as exploratory questions are the foundation of understanding the data we are working on.
For example, we may ask ourselves:
what influences the price of a bottle the most?
This is a very good exploratory question but, as we may understand, we need to explore the whole data to answer this question.
Luckily for us, AutoDiscovery does the analysis for us, and it finds that the variables that most influence the price of a bottle are: the region, the province, and…the taster name!
Price-region_2 boxplot. Image by Author with AutoDiscovery.
Price-province boxplot. Image by Author with AutoDiscovery.
Price-taster name boxplot. Image by Author with AutoDiscovery.
Now, these results raise more questions, which can lead to further analyses. For example, since we found that the price is associated with the taster's name, can this result be biased? In other words: can the price of a bottle be biased by the person who tasted it?
Let’s see this plot created by AutoDiscovery:
Price-taster name boxplot. Image by Author with AutoDiscovery
Jim Gordon tasted 467 wines, while Matt Kettman tasted "just" 55, from the same region. However, Jim tasted low-priced wines, while Matt tasted wines across a wide price range. From a statistical point of view, these differences are “significant”, so we can infer, with a high level of confidence, that “Matt is really more focused on expensive wines”. That’s what AutoDiscovery helped us to discover!
So, can Jim’s results be biased, since he tasted a lot of low-priced wines coming from the same region? They may be…
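AutoDiscovery picks the statistical test for us. If we wanted to check a two-taster comparison like this by hand, one option is the Mann-Whitney U test, which compares two samples without assuming normality. This is our own choice for illustration, not necessarily the method behind the plot above, and the prices below are invented for two hypothetical tasters.

```python
from scipy.stats import mannwhitneyu

# Made-up bottle prices for two hypothetical tasters.
prices_taster_a = [12, 14, 15, 15, 18, 20, 22, 16, 13, 19]  # mostly low-priced
prices_taster_b = [20, 35, 60, 18, 90, 45, 120, 30]         # wide price range

# Two-sided test: do the two price distributions differ?
stat, p_value = mannwhitneyu(
    prices_taster_a, prices_taster_b, alternative="two-sided"
)

print(f"U = {stat:.1f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("The two tasters' price distributions differ significantly.")
```

A significant difference only tells us that the two tasters reviewed differently priced wines; it does not, by itself, say why.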
How AutoDiscovery works
So, we have seen the results that AutoDiscovery can give us, but how does it actually work? Let’s see some of its features.
First of all, we start with what we call the “Exploratory question”. Here, we want to simply ask ourselves what results we want to obtain. In this case, we were interested in understanding what influences the price of a bottle of wine:
The exploratory question in AutoDiscovery. Image by Author.
As we can see, in 1 minute and 35 seconds, AutoDiscovery evaluated 199 associations: how much time would we need by hand?
After we have defined the exploratory question, we can feed AutoDiscovery with the data (in CSV, for example) and set its parameters:
Setting the parameters of AutoDiscovery. Image by Author.
Here’s what we did:
We set “price” as the variable to be explained. This way, AutoDiscovery will relate all the features to the price, because we are interested in understanding what influences the price of a bottle. At every step, AutoDiscovery will use a proper statistical method, depending on the nature of the data (qualitative vs quantitative) and other relevant aspects (such as the normality of the distribution).
We set the variables to ignore. For example, we have an “unnamed” column, which is nothing more than an artifact of importing the CSV. Also, we decided to ignore the Twitter accounts of the tasters, since they cannot really influence the price of a bottle.
We set the subject groups. AutoDiscovery uses these to subdivide the available data into subgroups and to test the relationship between the features and the label (the price) within each of them, that is, stratifying the data by the chosen subgroups. What if the price of a wine is related to a feature only in a specific group, say German wines (and not in the rest of the wines)? Stratification is how AutoDiscovery can discover this.
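To make the three settings above more concrete, here is a rough sketch of what a stratified scan could look like if we wrote it by hand with pandas and SciPy. It is not AutoDiscovery's actual algorithm, which the article does not detail: the helper names (`looks_normal`, `check_association`, `stratified_scan`), the 0.05 threshold, the minimum group size and the toy data are all our assumptions.

```python
import pandas as pd
from scipy import stats


def looks_normal(values, alpha=0.05):
    """Shapiro-Wilk normality check (needs at least 3 observations)."""
    return len(values) >= 3 and stats.shapiro(values)[1] > alpha


def check_association(df, feature, target):
    """Pick a test from the data types and return (test name, p-value)."""
    data = df[[feature, target]].dropna()
    if pd.api.types.is_numeric_dtype(data[feature]):
        # Quantitative vs quantitative: Pearson if both look normal.
        if looks_normal(data[feature]) and looks_normal(data[target]):
            return "pearson", stats.pearsonr(data[feature], data[target])[1]
        return "spearman", stats.spearmanr(data[feature], data[target])[1]
    # Qualitative vs quantitative: compare the target across categories.
    groups = [g[target].to_numpy() for _, g in data.groupby(feature) if len(g) >= 3]
    if len(groups) < 2:
        return "skipped", float("nan")
    if all(looks_normal(g) for g in groups):
        return "anova", stats.f_oneway(*groups)[1]
    return "kruskal-wallis", stats.kruskal(*groups)[1]


def stratified_scan(df, target, strata_col, ignore=()):
    """Test every remaining feature against the target inside each stratum."""
    features = [c for c in df.columns if c not in {target, strata_col, *ignore}]
    rows = []
    for stratum, subset in df.groupby(strata_col):
        for feature in features:
            name, p_value = check_association(subset, feature, target)
            rows.append({"stratum": stratum, "feature": feature,
                         "test": name, "p_value": p_value})
    return pd.DataFrame(rows)


# Hypothetical toy data; "Unnamed: 0" mimics a CSV import artifact.
wines = pd.DataFrame({
    "Unnamed: 0": range(12),
    "country": ["Germany"] * 6 + ["Italy"] * 6,
    "winery": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "points": [85, 86, 88, 92, 93, 95, 87, 90, 88, 89, 86, 91],
    "price": [10.0, 12.0, 11.0, 40.0, 45.0, 50.0, 20.0, 22.0, 21.0, 23.0, 19.0, 24.0],
})

result = stratified_scan(wines, target="price", strata_col="country",
                         ignore=("Unnamed: 0",))
print(result)
```

Note that scanning many feature pairs this way calls for a multiple-testing correction before trusting any single p-value; the sketch leaves that step out for brevity.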
Finally, AutoDiscovery gives the results of its analysis:
The Discovery Map: associations found by AutoDiscovery on the given data. Image by Author.
The List of Associations found by AutoDiscovery on the given data. Image by Author.
Also, AutoDiscovery gives us an idea of how strong the associations it finds are, and of the method used to find them.
For example, it finds that the price is strongly associated with the winery in “region_2=North Coast” with a strength of 43%, using the Kruskal-Wallis test:
The results of the association Winery-price with “region_2=North Coast”. Image by Author on AutoDiscovery.
And this is the resulting box plot:
The boxplot of the association Winery-price with “region_2=North Coast”. Image by Author on AutoDiscovery.
As you can see, AutoDiscovery is a white-box method that can explain in detail how and why it concluded that an association was relevant for us.
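For readers curious about the Kruskal-Wallis test mentioned above, this is a small sketch of how it can be run with SciPy on invented prices for three hypothetical wineries. We also compute epsilon-squared, one common effect size for this test; the article does not say how AutoDiscovery computes its "strength" percentage, so this number should not be read as the same quantity.

```python
from scipy.stats import kruskal

# Made-up prices for three hypothetical wineries in one region.
prices_by_winery = {
    "Winery A": [20, 22, 25, 24],
    "Winery B": [40, 38, 45, 50],
    "Winery C": [30, 28, 33, 31],
}

# Kruskal-Wallis compares the groups using ranks, without assuming
# normally distributed prices.
h_stat, p_value = kruskal(*prices_by_winery.values())

# Epsilon-squared effect size: H / (n - 1), ranging from 0 to 1.
n_total = sum(len(prices) for prices in prices_by_winery.values())
epsilon_squared = h_stat / (n_total - 1)

print(f"H = {h_stat:.3f}, p-value = {p_value:.4f}, "
      f"epsilon-squared = {epsilon_squared:.2f}")
```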
AutoDiscovery was designed to do its best in biomedical, pharma, and clinical research projects with complex datasets integrating information from genes, blood analysis, demographics, treatments, etc. (you may find several examples in many therapies here) but, as we can see, the same strategy to solve exploratory questions may be followed in a range of other applications: from banking to sports... to wines!
This was just a short overview of how AutoDiscovery can help us in EDA but we believe it really shows its power. How much time would it take to find all these associations by hand? Well, some days, we’d say.