- Ray G. Butler
Preparing your Data for AutoDiscovery (III): Identifying Outliers in Excel.
Outliers are one of those statistical issues that everyone knows about, but most people aren’t sure how to deal with. Outliers can skew your statistical calculations leading you to false or misleading conclusions about your data.
In this article I will give you some easy-powerful-superfast tricks to identify outliers in your Excel data tables.
# The Goal
Outliers can have many anomalous causes:
a physical apparatus for taking measurements may have suffered a transient malfunction,
there may have been an error in data transmission or transcription,
a sample may have been contaminated with elements from outside the population being examined,
or simply natural deviations in populations.
Although AutoDiscovery mostly uses non-parametric statistics to minimize these issues, calculations are still sensitive to outliers so they can really mess up your data analysis. Additionally, outliers can skew your statistical calculations leading you to false or misleading conclusions about your data.
So the main goal of the outliers identification for us is, if possible, to remove them before exploring potential relationships with AutoDiscovery.
# A Subjective Task
There is no rigid mathematical definition of what constitutes an outlier so determining whether or not an observation is an outlier is ultimately a subjective task.
But the news is not all bad.
Many different effective algorithms and techniques are available to help researchers with identifying potential outliers. Some of them are complex but most are simple and straightforward.
In this article I will provide you with two extremely easy and powerful ways of identifying outliers in small-but-complex datasets –few records but many variables potentially involved in the biological process under study- in Excel.
# Trick #1. Excel AutoFilter
With the AutoFilter function in Excel you can find values quickly. For the most part we use AutoFilters to decide which information we would like to see/hide in the data table.
However, AutoFilters have a secondary useful function which is often overlooked: when you apply an AutoFilter to a column in your data table, Excel provides you with a sorted list of the values stored in the column.
To do that:
1. Select one of the cells of your data table. 2. Press Ctrl + Shift + L shortcut to activate AutoFilters. 3. For each relevant column in your data table:
3.1. Press the filter dropdown button. 3.2. Look at the list of values of the column to identify wrong values, such as:
3.2.1. Numerical values out of the range (look at the first and last values in the list). 3.2.2. Values that were typed wrongly (e.g. null or empty values, dates with a wrong order). 3.2.3. Mixed data types in the same column (e.g. a single numerical value within a text column).
3.3. If you find an anomalous value:
3.3.1. Uncheck the “(Select All)” option in the list
3.3.2. Check just the anomalous value to go to the specific records in the table containing that value. 3.3.3. Fix the value following your own criteria. 3.3.4. Check again the “(Select All)” option in the filter.
This method indeed involves subjectivity and can be hard to use in data sets with many columns but it can be applied to explore both numerical and qualitative information, which is something that not many methods can boast about.
# Trick #2. Highlighting Outlier Cells
Frankly, my original idea was to explain this method in detail but I googled a bit and I found that guys at Dummies.com already did it so I rather prefer to share their post.
I just wanted to point two issues out:
This method works just for numerical columns (which indeed is the most common case when identifying outliers).
I strongly recommend you to automate the process with Excel’s macros. If you want to explore this option but you need some assistance please let me know!
# The Next Steps
In our previous article of this series we learnt how to adapt the numerical data to our analytic goals. Now we have cleaned the anomalous values in our data table to minimize their impact on the exploratory statistical calculations of AutoDiscovery.
The next step before an exhaustive exploratory data analysis will be to check whether the variables in our dataset are properly named. I will talk about this in the next chapter of this series.
In the meanwhile, try AutoDiscovery by your own (of course for free!).