Preparing your Data for AutoDiscovery: Table-Like Structure in Excel.


What data scientists spend the most time doing

Did you know that around 80% of your time is spent on preparing and managing data for analysis? And that more than 75% of us view data preparation as the least enjoyable part of our work?

The bad thing is that, many times, that leads to procrastination and to a very surface analysis.

In this series of articles I will give you some funny superfast tricks to collect, organize and clean your data for an automated exploratory data analysis with AutoDiscovery just using Microsoft Excel.

# The Goal

The overall goal of the data cleaning tasks is to adapt our data files to make the analysis effective and to get the desired results.

To be more precise, the idea is to build a data structure with the most appropriate format and content for the tool we are using, that is, AutoDiscovery.

# Proper data structure and content for AutoDiscovery

In order to combine our data files and to assess the potential relationships between all the variables in the resulting consolidated table, AutoDiscovery just needs two basic requirements to be fulfilled:

1. Table-like data structure: In short, one row per each sample (patient, gen, experiment trial, etc.) and one column per each variable measured (age, gen expression level, trial duration, etc.).

  • Avoid empty rows and columns.

  • Do not use duplicated headers in columns.

2. Proper values in cells: the values in the cells of our table must contain the measurements of each variable involved in our study.

  • Do not mix quantitative (numbers) and qualitative (categories, text) values in the same column.

  • Identify and remove the values that indicate errors in your study (outliers).

  • Keep a cell empty (avoid “N/A”, “-“ or similar values) if the measurement was not properly collected. That will make AutoDiscovery’s work easier.

In this article I will focus on the data structure leaving the proper values to the next chapter. All right?

# I'm Used to Microsoft Excel 2010, And You?

The version of Microsoft Excel I’m used to use is 2010 for Windows. I know that’s an old version but it works very well and I still love its old-fashioned user interface :)

Do not worry if you use an updated version of Excel. Every feature I mention here work in a very similar way.

Of course there are a myriad of advanced techniques to process your data with Excel and other tools. I am confident that the ones I suggest here are more than enough for most researchers.

If you still need help with a particular case do not hesitate to contact me!

# Table-like Data Structure

Data files generated with automated software tools (statistical packages, data acquisition forms, etc.) usually meet the basic requirements of a table-like data structure. However, when data is collected, combined, copied or just typed manually in an Excel sheet some ocassional problems may arise.

Avoid empty rows and columns!

Raise his hands who has never inserted empty rows or columns to facilitate reading and visualizing the data.

An Excel sheet with empty rows and columns

White space is really useful because it tells our brain which elements are most important in our data and helps us to visualize and process it. However, when it comes to automated data analysis, white space is devastating!

Detecting and removing empty rows and columns in Excel is quite simple and fast if you know some magic keyboard shortcuts (disclaimer: if you’re using Microsoft Excel in Mac I strongly recommend you to visit this site)

Removing empty columns

1. Move to the first cell in your data table.

2. Move to the next empty column.

3. Select the entire column.

4. Remove the column.

5. Repeat steps 2 - 4 until reaching the last column in your data table.

Removing empty rows

1. Move to the first cell in your data table.

2. Move to the next empty row.

3. Select the entire row.

4. Remove the row.

5. Repeat steps 2 - 4 until reaching the last column in your data table.

Do not use duplicated headers in columns

Standard data analysis software tools usually consider column headers as the name of the variables in our study. If these headers are duplicated, it would be impossible to differenciate the meaning of two variables.

There are many ways to identify duplicated names in Excel. I will explain now the way that has best worked for us in the exploratory studies we have automated with AutoDiscovery.

I am talking about the Conditional Formatting feature.

That feature makes the duplicated values in a range to be visually highlighted with a particular color and that helps us to identify them.

This is as simple as:

1. Select the row with the column headers.

2. Apply the Conditional Formatting:

a. Click Conditional Format button in the “Home” tab of the ribbon. b. Select the “Highlight Cell Rules” menu. c. Click the “Duplicate Values” option.

3. Accept the default configuration.

Now the cells in the row with duplicated values are shown in red so that it will be easy for you to i) identify them and ii) fix the names (highlights are removed automatically when the names of the columns are edited).

# The Next Steps

Once your data is arranged in a simple plain table, the next step will be adapting the values to our analytic goals.

I will talk about this in the next chapter of this series. Stay tuned!

In the meanwhile, try AutoDiscovery by your own (of course for free!):

Try AutoDiscovery for free now