Weekly Update 8/1/2020

For the last several weeks, I have been trying to learn more about data science. I have always enjoyed writing queries for reporting to dig into the stories that the data can tell. Data science takes that interest to a whole new level.

As a typical engineer, I haven't exercised my linear algebra or calculus since my undergraduate years. The book I am reading (Data Science from Scratch: First Principles with Python, 2nd Edition) starts hard and fast in that direction. Luckily I still have most of my old school books, so I have been speed reading through them to try to reclaim some level of understanding. I have been nerding out with my stats, linear algebra, and calculus books for the last several weeks, and reading them a decade later gives me a new appreciation for the material.

All of that leads to why I am really interested in learning data science. On a project that I am working on, there is an opportunity to utilize some machine learning techniques. More specifically, the project asks a user a sequence of questions to arrive at a customer satisfaction score.

One of the areas where we may be able to employ such techniques is determining the next question (limited to excluding questions, not adding them). It would seem reasonable to craft the sequence of questions (let's call it a survey) based on the answers to the previous ones. If you have n completed surveys, then you might be able to do some multiple regression analysis. This could tell you the probability of the next question being answered in a certain way, which would allow you to remove questions that have a high likelihood of being unnecessary since they are consistently answered the same.
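To make that concrete, here is a minimal sketch of the regression idea using made-up survey data (the answers, question count, and threshold are all assumptions for illustration, not anything from a real project):

```python
import numpy as np

# Hypothetical data: each row is one completed survey; columns are the
# yes/no answers (1 = yes, 0 = no) to questions 1 through 4.
surveys = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
])

# Multiple regression: predict the answer to question 4 from the
# answers to questions 1-3 using ordinary least squares.
X = np.column_stack([np.ones(len(surveys)), surveys[:, :3]])
y = surveys[:, 3]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicted score for a new respondent who answered (1, 1, 0) so far;
# values near 0 or 1 mean question 4 is almost certainly unnecessary.
pred = np.array([1, 1, 1, 0]) @ coef
print(round(float(pred), 2))

if pred > 0.9 or pred < 0.1:
    print("question 4 can probably be skipped")
```

A real version would use something like logistic regression and far more data, but the shape of the idea is the same: fit on the n completed surveys, then score the next question before asking it.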

One of the first things that comes to mind with that approach is that some questions may be more important, thereby requiring a user's answer. In other words, they may carry more weight and should be removed from the possible exclusion set. Another top-of-mind concern is the actual ordering of the questions. If the ordering is random such that the questions are independent, how could you predict which question should not be asked next, or should the machine determine the order itself?

I think having the machine determine the order of the questions would be pretty awesome. Would that somehow invalidate the comparison between surveys because of psychological factors the ordering creates? I am not sure. Theoretically, if the questions are truly independent, you should be able to present them in any order. I do believe they could tell a different story, even with the same answers, if presented in different ways. ¯\_(ツ)_/¯

So at this point, we have multiple regression to determine the likelihood that the next question will be answered in a certain way. This would need to be a multi-pass operation, since you should be checking each successive question to determine whether it needs to be asked. It is possible that you could reach a point where no further questions are required.
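The multi-pass loop might look something like this sketch, where predict() and ask() are stand-in interfaces I invented for illustration:

```python
# Multi-pass survey pruning sketch: before each question, re-score it
# against the answers so far and skip it when the model is confident.
def run_survey(questions, predict, ask, threshold=0.9):
    answers = {}
    for q in questions:
        p = predict(answers, q)        # P(q answered "yes" | answers so far)
        if p >= threshold:
            answers[q] = True          # near-certain: record without asking
        elif p <= 1 - threshold:
            answers[q] = False
        else:
            answers[q] = ask(q)        # genuinely uncertain: ask the user
    return answers

# Toy model: "q3" becomes predictable once "q1" is answered yes.
def predict(answers, q):
    if q == "q3" and answers.get("q1"):
        return 0.95
    return 0.5

asked = []
def ask(q):
    asked.append(q)
    return True

result = run_survey(["q1", "q2", "q3"], predict, ask)
print(asked)  # q3 is never asked
```

Because every answer feeds back into the scoring, the survey can naturally end early once everything remaining is predictable.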

That covers exclusion, but what about including new questions dynamically? Would you have a pool of questions related to a product or service for possible inclusion? In that scenario, would you include some static questions and allow the machine to add questions from the pool based on prior responses? That sounds like a simple decision tree, but given that we are already excluding some questions, it seems unreasonable to construct that tree. You could always have a parent-child relationship where one answer prompts a series of new questions, but again, that would need to be preordained.
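The preordained parent-child version is easy to picture; here is a tiny sketch with made-up question text:

```python
# Static inclusion rules: a (question, answer) pair triggers follow-up
# questions pulled from a pool. All question text here is hypothetical.
follow_ups = {
    ("Was the product damaged?", "yes"): [
        "Was the outer packaging also damaged?",
        "Would you like a replacement?",
    ],
}

def expand(queue, question, answer):
    """Append any preordained follow-ups triggered by this answer."""
    queue.extend(follow_ups.get((question, answer), []))
    return queue

queue = expand([], "Was the product damaged?", "yes")
print(len(queue))  # 2 follow-up questions added
```

The limitation is exactly the one above: every branch has to be authored ahead of time, which is what makes truly dynamic inclusion the harder problem.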

It seems that dynamic inclusion is a much more challenging endeavor. I think I will leave it here for now so I can marinate a little more on this topic.

T-SQL String Aggregation

Recently, I needed a way to take a list of names and concatenate them with a separator between them. There are many ways to do this in C#, but I was feeling lazy, so I wanted to see if there was a simpler way… in SQL. To set the stage, I needed to work with a table that stores each name in its own row. So let's say we have two tables, Group and GroupMembers:

CREATE TABLE [dbo].[Group]
(
    [Id] [BIGINT] IDENTITY(1,1) NOT NULL,
    [Name] VARCHAR(255) NULL
)

CREATE TABLE [dbo].[GroupMembers]
(
    [Id] [BIGINT] IDENTITY(1,1) NOT NULL,
    [GroupId] BIGINT NOT NULL ,
    [UserEmail] VARCHAR(255) NOT NULL 
)

Let’s create a group and some new members:

INSERT INTO [Group] (Name) SELECT 'My Group'

INSERT INTO [GroupMembers] ([GroupId], [UserEmail]) SELECT 1, 'test@test.com'
INSERT INTO [GroupMembers] ([GroupId], [UserEmail]) SELECT 1, 'test1@test.com'
INSERT INTO [GroupMembers] ([GroupId], [UserEmail]) SELECT 1, 'test2@test.com'
INSERT INTO [GroupMembers] ([GroupId], [UserEmail]) SELECT 1, 'test3@test.com'

The goal is to take a group and its members and return the name of the group along with a concatenation of their emails. To start, we can just pull the data directly, but that gives us all rows as opposed to just one:

SELECT 
    g.[Name],
    gm.[UserEmail] 
FROM
    [Group] g
    INNER JOIN [GroupMembers] gm ON gm.[GroupId] = g.[Id]

Reminder: the goal is to have a single row with the emails concatenated, and we want to do this in SQL, so this isn't helpful. If I chose to do this in code, I would have to iterate over the collection, insert the separator, and combine the strings. I would also need to track whether I was on the last item so that I didn't add a trailing separator. Enter STRING_AGG. In their words:

Concatenates the values of string expressions and places separator values between them. The separator is not added at the end of string.

This is perfect, so let's try it out. From the docs, we can see that it takes two parameters: expression and separator.

With a few adjustments, our previous query becomes this:

SELECT 
    g.[Name],
    [Emails] = STRING_AGG(gm.[UserEmail], ', ') 
FROM
    [Group] g
    INNER JOIN [GroupMembers] gm ON gm.[GroupId] = g.[Id]
GROUP BY
     g.[Name]

When we run this, we get the exact results that we are looking for.

There are a couple of things to note. I added a space after the separator to make the output look better, and for the aggregation to work we have to apply the GROUP BY. The only thing left is to set the order of the aggregated strings. I mention this because it got logged as a bug. I should have known better; rookie mistake. Oh well. It is really simple: we just need to add the WITHIN GROUP clause at the end of the STRING_AGG function:

SELECT 
    g.[Name],
    [Emails] = STRING_AGG(gm.[UserEmail], ', ') WITHIN GROUP ( ORDER BY gm.[UserEmail] DESC) 
FROM
    [Group] g
    INNER JOIN [GroupMembers] gm ON gm.[GroupId] = g.[Id]
GROUP BY
     g.[Name]

It's really that simple. Good luck!

State Drug Utilization Sample

I wanted to start looking at legit data sets that exist in the wild, so I went to data.gov and downloaded the state drug utilization data from 2010. I am going to test it out with Pandas and see what I learn. Here is their description of what the data represents:

Drug utilization data are reported by states for covered outpatient drugs that are paid for by state Medicaid agencies since the start of the Medicaid Drug Rebate Program. The data includes state, drug name, National Drug Code, number of prescriptions and dollars reimbursed

This was a pretty big data set at almost 400 MB. The spreadsheet app that I use, Numbers, wasn't up to the job (I am on a Mac, so that is what I have to work with). To get cracking, I need to load the file into Pandas to create a data frame:

import pandas as pd

filename = 'State_Drug_Utilization_Data_2010.csv'
drug_df = pd.read_csv(filename)

Crap, from the start there are already problems loading the data: sys:1: DtypeWarning: Columns (19) have mixed types.Specify dtype option on import or set low_memory=False. With a little bit of Google magic, I came across an article explaining that this warning means Pandas found a column whose type it couldn't infer consistently. The solution was simple: I just needed to tell it to load the column as an object:

drug_df = pd.read_csv(filename, dtype={"NDC" : object})
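To see why the object dtype matters for a column like NDC, here is a tiny synthetic example (not the real data.gov file; the two code values are made up):

```python
import io
import pandas as pd

# A stand-in for the real CSV: codes that look numeric, one of them
# with a leading zero that a numeric dtype would silently drop.
csv = io.StringIO("NDC,State\n17478021420,WI\n00409116502,NH\n")

df = pd.read_csv(csv, dtype={"NDC": object})
print(df["NDC"].dtype)     # object: the codes stay as strings
print(df["NDC"].iloc[1])   # leading zero survives
```

The alternative the warning suggests, low_memory=False, makes Pandas infer types from the whole file at once, which costs more memory on a ~400 MB file; the dtype hint is the more targeted fix here.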

Now when we print the head to get the top five rows, we get the data we are looking for:

Utilization Type State … Location NDC
0 FFSU WI … (37.768, -78.2057) 17478021420
1 FFSU NH … (47.5362, -99.793) 409116502
2 FFSU XX … NaN 338062904
3 FFSU TN … (41.6772, -71.5101) 68382009906
4 FFSU WY … (44.0407, -72.7093) 472091145

I want to look at the count of unique drugs that exist in the set, and I want that same metric grouped by state. As I learn Pandas, I find that basic tasks are pretty simple. I am not sure if that is a function of the library or the simple nature of Python.

print(drug_df['NDC'].nunique())
by_state = drug_df.groupby('State')['NDC'].nunique()
print(by_state.describe())

Above, I took the data frame and first captured the count of unique drugs that were prescribed, and then I ran the same query with the results grouped by state. The output is pretty long, but the interesting pieces are the unique count of 21977 medicines prescribed and the output of the count grouped by state:

Name: NDC, dtype: int64
count 52.000000
mean 1211.346154
std 613.963877
min 421.000000
25% 842.000000
50% 1077.500000
75% 1362.000000
max 3826.000000

I can't speak to why it shows 52 states, but I do see that it adds a state XX, which I assume is the bucket for N/A. I decided to hone in on something familiar. The SSRI Sertraline is a very common antidepressant that also has applications across a large spectrum of mental health disorders, including anxiety. So I wanted to ask: what is the prescription rate by state, and is there anything interesting in the data? To get started, I took the data frame and filtered it down to only the rows for sertraline. First I had to remove the null rows. The data set presents the value in both upper case and lower case, so I lowered the case and did the comparison that way:

drug_df = drug_df.dropna() # remove the nulls
df = drug_df[drug_df['Product Name'].str.lower() == 'sertraline'] #do the filtering

Now that I have reduced the data set to only the drug I am interested in, I can start to perform some aggregation. As mentioned above, I wanted to see the amount of Sertraline prescribed in each state, so I need to group by State and then sum up the prescription quantities. If we simply print the counts by state, we get something similar to this:

       Number of Prescriptions State
420                      125.0    AL
1060                      82.0    MS
1305                      51.0    WI
1580                    1162.0    SC
2128                      49.0    GA

Now we can do the grouping and the accumulation and graph it. For the charting, I am using matplotlib. I tried several different chart types, but since the data is so simple, a bar chart seems the most appropriate. Here is what we need to do to present the chart:

import matplotlib.pyplot as plt

df = df.groupby('State')['Number of Prescriptions'].sum()
df = df.nlargest(10)  # top 10 states by total prescriptions
df.plot(kind='bar', color='red')  # the grouped result is a Series, so no x/y needed
plt.show()

So what is the takeaway from this exercise and the results? The simple answer is I don't know. It looks like California heavily prescribed Sertraline for Medicaid patients in 2010. You cannot make any inferences past that since there were few other fields in the data. I would be interested in seeing a comparison with the same data set from last year, so I will keep searching to see if that data is available. Anyways…

In this post I learned how to do a few things:

  • Tell Pandas what a column's type is when you get that weird error: sys:1: DtypeWarning: Columns (19) have mixed types.Specify dtype option on import or set low_memory=False
  • You can take a single column and filter the rows using a standard comparison: df = drug_df[drug_df['Product Name'].str.lower() == 'sertraline']
  • The groupby function works like GROUP BY in T-SQL, but maybe a little simpler
  • You can chart the data with matplotlib straight from the plot method: df.plot(kind='bar', color='red')

I am sure a pro would say this is child's play, and some day I am certain I will be able to say the same, but for now I am learning, so suck it.

Last thing: mental health is a serious issue in the world, and if we work to lift the stigma associated with it, a lot of good people could get the help they need to live a happier, more enriching life!