My current project is building a survey system (in simple terms) for reporting PROMs (patient-reported outcome measures) in regenerative medicine. This is their third attempt at building a product to seize the opportunities in this space. The data the system produces has huge promise and sets the stage for moving beyond this niche corner of practice.
Some examples of these surveys are the FAOS and the KOOS. Standard surveys like these are well defined and their scoring is discrete. Other surveys, such as ones created by a clinician, can be much looser in the way they ask their questions. A simple example would look like this:
Function: Stairs (normal up and down, normal up down with rail, up with rail down unable, etc.)
This is pretty straightforward for a human to read and understand, but when you have thousands of responses to munge through, the raw text doesn't mean anything to a program. Just like any data that you want to report on, you have to transform it into a shape that makes sense to a computer. Starting with the example question above, it seems to make sense to use one-hot encoding. I found this article and it appears to do what I want: Categorical encoding using Label-Encoding and One-Hot-Encoder.
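To make the idea concrete before reaching for a library, here is a minimal hand-rolled sketch of one-hot encoding in plain Python. The `answers` list is a made-up sample for illustration, not real survey data, and sorting the categories is just one way to get a stable column order.

```python
# Hypothetical answers to the "Function: Stairs" question.
answers = [
    "Normal up and down",
    "Normal up and down",
    "Up with rail down unable",
]

# One column per distinct answer, sorted so the column layout is stable.
categories = sorted(set(answers))

# Each row gets a 1 in the column matching its answer and 0 everywhere else.
encoded = [
    [1 if answer == category else 0 for category in categories]
    for answer in answers
]

for answer, row in zip(answers, encoded):
    print(row, answer)
```

Every row ends up with exactly one 1, which is where the name "one-hot" comes from.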
Using his examples as a guide, I can create a pandas data frame to get started. The data set that I used has only two columns for simplicity:
    UserId, Function_Stairs
    1, "Normal up and down"
    2, "Normal up and down"
    3, "Up with rail down unable"
    import pandas as pd

    # creating initial dataframe
    filename = 'PROMSingleQuestion.csv'
    names = ['UserId', 'Function_Stairs']
    # header=0 skips the file's own header row, since the names are supplied here
    stairs_mobility_df = pd.read_csv(filename, names=names, header=0)
If I print the data frame at this point, this is what my data looks like:
       UserId             Function_Stairs
    0       1        "Normal up and down"
    1       2        "Normal up and down"
    2       3  "Up with rail down unable"
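If you want to follow along without the CSV file on disk, the loading step can be reproduced from an in-memory string. This is only a sketch: `csv_text` is a stand-in I made up for the contents of PROMSingleQuestion.csv, and it assumes there are no spaces after the commas.

```python
import io

import pandas as pd

# Stand-in for PROMSingleQuestion.csv so the sketch runs without the file.
csv_text = '''UserId,Function_Stairs
1,"Normal up and down"
2,"Normal up and down"
3,"Up with rail down unable"
'''

names = ['UserId', 'Function_Stairs']

# header=0 tells pandas the first line is a header row to be replaced by `names`,
# so it is not read in as a data row.
stairs_mobility_df = pd.read_csv(io.StringIO(csv_text), names=names, header=0)

print(stairs_mobility_df.shape)
print(stairs_mobility_df['Function_Stairs'].unique())
```

If the real file does have a space after each comma, the quotes can end up as part of the value; passing `skipinitialspace=True` to `read_csv` may help in that case.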
The next step is to transform the Function_Stairs column so that we can use it for measurements. First we will drop the UserId column, since we want to protect PII (personally identifiable information) and it is not useful for the experiment.
stairs_mobility_df = stairs_mobility_df.drop(columns=['UserId'])
Now if we print the frame at this point we are left with only the mobility column:
                  Function_Stairs
    0        "Normal up and down"
    1        "Normal up and down"
    2  "Up with rail down unable"
It is time to start rotating the data so that each distinct value in the rows becomes its own column. To do this, we will use the pandas function get_dummies for the transformation and then join the new columns back to the original data frame:
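    # dtype=int keeps the new columns as 1/0; newer pandas versions default to True/False
    dum_df = pd.get_dummies(stairs_mobility_df, columns=["Function_Stairs"], prefix=["Mobility_Level"], dtype=int)
    # join the indicator columns back onto the original frame by row index
    stairs_mobility_df = stairs_mobility_df.join(dum_df)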
When it is all said and done, we are left with the rotation that we want:
| Row | Mobility_Level_Normal up and down | Mobility_Level_Up with rail down unable |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
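One thing the table above hides is that get_dummies only creates columns for answers that actually appear in the data. A possible way around that, sketched below, is to declare the full set of answer options with `pd.Categorical` first. The `levels` list is my assumption based on the example question (the middle option does not appear in the sample data), and the data frame is built inline so the snippet runs on its own.

```python
import pandas as pd

# Assumed answer options for the question; adjust to the real survey definition.
levels = [
    "Normal up and down",
    "Normal up down with rail",
    "Up with rail down unable",
]

stairs_mobility_df = pd.DataFrame({
    "Function_Stairs": [
        "Normal up and down",
        "Normal up and down",
        "Up with rail down unable",
    ]
})

# Declaring the levels up front keeps a column for answers nobody has picked yet.
stairs_mobility_df["Function_Stairs"] = pd.Categorical(
    stairs_mobility_df["Function_Stairs"], categories=levels
)

dum_df = pd.get_dummies(
    stairs_mobility_df,
    columns=["Function_Stairs"],
    prefix=["Mobility_Level"],
    dtype=int,  # 1/0 instead of True/False
)

print(dum_df)
```

With this approach the column layout stays the same no matter which answers show up in a given batch, which should make batches easier to stack later.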
This is much better. Next time I will start transforming the entire dataset. I am still learning, so be nice. Thanks to Dinesh Yadav for the help.