Preparing your Data for AutoDiscovery (II): From Numbers to Categories in Excel.


From Numbers to Categories in Excel

Do you have a large list of numbers and you would like to group them into ranges or categories? Do you want to change numerical codes by more descriptive names to facilitate your data analysis?

In this article I will give you some funny superfast tricks to do these kind of pre-processing taks just using Microsoft Excel.

# The Goal

In statistics, a categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

Many times, the values of our categorial variables are stored in the data table as numbers as this is the most efficient way from a computational point of view. However, we humans usually feel more comfortable reading and processing text labels ("Large" or "Small" instead of "17" or "5").

# Trick #1. Numerical Codes to Text Labels

Let's imagine that Dr. Discovery performed a study to assess how a variety of known gen mutations impact the clinical evolution of 200 bed patients with mechanical ventilation.

These patients are arranged into 4 different groups depending on the kind of infection they suffered. Groups are coded in a numerical variable called "Group Code" with the following values:

1 = Bronchitis

2 = Pneumonia

3 = Colonization

4 = Carriage

Clinical records with numerical groups codes

Our goal is to generate a new categorical variable called "Group Name" with the proper group name for each patient.

To do that, Microsoft Excel provides a powerful easy to use function called

INDEX

Technically, this function returns the value of an element in a list, selected by the position in that list.

Given that our group codes are stored in the B column, the INDEX function should be used in the following way:

1. Insert a new column C called "Group_Name".

2. Copy the following function and paste it within the C2 cell:

=INDEX({"Bronchitis";"Pneumonia";"Colonization";"Carriage"}; B2)

Application of the INDEX function

NOTE: Be sure to use the proper character to separate the arguments in the function (semicolon or comma) depending on your setup.

As you see, the group code "1" is associated to the first element in the list, that is, "Bronchitis".

Group code "1" is associated to "Bronchitis"

3. Double-clic the bottom right corner of the cell C2 to extend the function to the rest of the column.

Double-clic the bottom right corner to extend the function

Et voilà! You now have a new categorical variable which may be considered by AutoDiscovery in the analysis of variances or contingencies.

New categorical variable

# Trick #2. Numerical Values to Ranges

Each patient in our study stayed at ICU for a particular period of time that ranges from 0 to 120 days. That information is stored in the variable "ICU Length Stay":

ICU length stay of each patient in days

Our goal now is to classify each patient depending on the length of the ICU stay in the following way:

Short: between 0 and 10 days

Regular: stay between 11 and 30 days

Large: more than 30 days at ICU

To do that, Microsoft Excel provides another powerful function called

MATCH

This function searches for a specified item in a list, and then returns the relative position of that item in the range. The great thing is that the MATCH function can be combined with the above mentioned function INDEX to return the name of that item (instead of just the relative position)!

So let's start.

1. Insert a new column H called "ICU_Stay".

New categorical column "ICU Stay"

2. Copy the following function and paste it within the H2 cell:

=INDEX({"Short";"Regular";"Large"}; MATCH(G2;{0;10;30}))

Ok, this expression is a bit more complicated but the idea is that MATCH looks for the range in which the ICU length stay of the patient falls and, then, the INDEX function returns the name of that range.

In the case of the first patient (26 days at ICU), the categorical value is "Regular".

Categorical value for ICU length stay

3. Extend the function in this cell to the rest of the column by double-clicking the bottom right corner of the H2 cell.

# The Next Steps

In our first article of this series we learnt how to arrange our data in a simple plain table. Now we have adapted the numerical data to our analytic goals.

The next step before an exhaustive exploratory data analysis will be to check whether the cells contain the proper values. I will talk about this in the next chapter of this series.

Stay tuned!

In the meanwhile, try AutoDiscovery by your own (of course for free!):

Try AutoDiscovery for free now