For the last several weeks, I have been trying to learn more about data science. I have always enjoyed writing reporting queries to dig into the stories that the data can tell. Data science takes that interest to a whole new level.
As a typical engineer, I haven’t exercised my linear algebra or calculus since my undergraduate days. The book that I am reading (Data Science from Scratch: First Principles with Python, 2nd Edition) starts hard and fast in that direction. Luckily I have most of my old school books, so I have been speed reading through them to try to reclaim some level of understanding. In short, I have been a nerd with my stats, linear algebra and calculus books for the last several weeks. Reading them a decade later gives me a new appreciation for the material.
All of that leads to why I am really interested in learning data science. On a project that I am working on, there is an opportunity to use some machine learning techniques. More specifically, the project asks a user a sequence of questions to arrive at a customer satisfaction score.
One of the areas where we may be able to employ such techniques is determining the next question (for now, only by excluding questions). It would seem reasonable to be able to craft the sequence of questions (let’s call it a survey) based on the answers to the previous ones. If you have n completed surveys, then you might be able to do some multiple regression analysis. This could tell you the probability of the next question being answered in a certain way. That would allow you to remove questions that are very likely unnecessary because they are consistently answered the same way.
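To make the exclusion idea concrete, here is a minimal sketch. It does not use multiple regression; it swaps in a much simpler stand-in, the empirical frequency of each answer among past surveys that match the answers given so far. The survey data, the question ids, the function names and the 0.9 threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical data: each completed survey maps a question id to its answer.
completed_surveys = [
    {"q1": "yes", "q2": "good", "q3": "yes"},
    {"q1": "yes", "q2": "poor", "q3": "yes"},
    {"q1": "no", "q2": "good", "q3": "yes"},
    {"q1": "yes", "q2": "good", "q3": "yes"},
]


def answer_probabilities(surveys, question, given=None):
    """Empirical P(answer to `question` | the answers in `given`)."""
    given = given or {}
    # Keep only past surveys that agree with every answer given so far.
    matching = [
        s for s in surveys
        if question in s and all(s.get(q) == a for q, a in given.items())
    ]
    counts = Counter(s[question] for s in matching)
    total = sum(counts.values())
    if total == 0:
        return {}  # no history to learn from
    return {answer: n / total for answer, n in counts.items()}


def is_predictable(surveys, question, given=None, threshold=0.9):
    """True when one answer dominates strongly enough to skip the question."""
    probs = answer_probabilities(surveys, question, given)
    return bool(probs) and max(probs.values()) >= threshold


print(answer_probabilities(completed_surveys, "q1"))
print(is_predictable(completed_surveys, "q3"))
```

A real model would likely replace `answer_probabilities` with something fitted to the data, but the decision around it (skip a question when one answer is overwhelmingly likely) could stay the same.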
One of the first things that comes to mind with that approach is that some questions may be more important and therefore always require a user’s answer. In other words, they may carry more weight and should be removed from the set of questions eligible for exclusion. Another top-of-mind concern is the actual ordering of the questions. If the ordering is random and the questions are independent, how could you predict what the next question should or should not be? Or does the machine determine the order itself?
I think having the machine determine the order of the questions would be pretty awesome. Would that somehow invalidate the comparison between surveys because of psychological factors that the order creates? I am not sure. Theoretically, if the questions are truly independent, you should be able to present them in any order. I do believe, though, that they could tell a different story even with the same answers if presented in different ways. ¯\_(ツ)_/¯
So at this point, we have multiple regression to estimate how likely the next questions in the sequence are to be answered in a certain way. This would need to be a multi-pass operation, since you should check each successive question to determine whether it needs to be asked. It is possible that you could reach a point where no further questions are required.
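The multi-pass check could look something like the sketch below. Again, this is only an illustration under assumptions: the predictor is a simple frequency count standing in for the regression, and `likely_answer`, `run_survey`, the `required` set, the threshold and the `min_support` cutoff are all hypothetical names and values. Each question is re-evaluated against the answers actually collected so far, and questions marked as required are always asked.

```python
from collections import Counter


def likely_answer(surveys, question, answered, threshold=0.9, min_support=3):
    """Return the dominant past answer if it clears `threshold`, else None."""
    matching = [
        s[question] for s in surveys
        if question in s and all(s.get(q) == a for q, a in answered.items())
    ]
    if len(matching) < min_support:
        return None  # too little history to trust a prediction
    answer, count = Counter(matching).most_common(1)[0]
    return answer if count / len(matching) >= threshold else None


def run_survey(questions, surveys, ask, required=frozenset()):
    """Walk the questions in order, asking only those that are not predictable.

    `ask` is any callable that takes a question id and returns the answer.
    Returns (answers actually given, answers that were assumed and skipped).
    """
    answered, skipped = {}, {}
    for question in questions:
        guess = None
        if question not in required:
            # Condition only on real answers, not on earlier guesses.
            guess = likely_answer(surveys, question, answered)
        if guess is None:
            answered[question] = ask(question)
        else:
            skipped[question] = guess
    return answered, skipped


# Hypothetical history and canned answers standing in for a real user.
history = [
    {"q1": "yes", "q2": "good", "q3": "yes"},
    {"q1": "yes", "q2": "good", "q3": "yes"},
    {"q1": "yes", "q2": "good", "q3": "yes"},
    {"q1": "no", "q2": "poor", "q3": "yes"},
    {"q1": "no", "q2": "good", "q3": "yes"},
    {"q1": "no", "q2": "poor", "q3": "yes"},
]
canned = {"q1": "yes", "q2": "good", "q3": "no"}
print(run_survey(["q1", "q2", "q3"], history, canned.__getitem__))
```

If every remaining question turns out to be predictable, the loop simply finishes without asking anything else, which is the "no further questions are required" case.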
That covers exclusion, but what about including new questions dynamically? Would you have a pool of questions related to a product or service to draw from for possible inclusion? In that scenario, would you include only some static questions and allow the machine to add questions from the pool based on prior responses? That seems like a simple decision tree, but given that we are already excluding some questions, it seems unreasonable to construct that tree. You could always have a parent-child relationship where one answer prompts a series of new questions, but again that would need to be preordained.
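For comparison, the preordained parent-child version is easy to sketch, which also shows why it is not really dynamic: every follow-up has to be written down in advance. The rule table, the question ids and the function name below are hypothetical.

```python
from collections import deque

# Hypothetical, hand-written rules: (question, answer) -> follow-up questions.
FOLLOW_UPS = {
    ("was_delivery_on_time", "no"): ["how_late_was_it", "were_you_notified"],
    ("were_you_notified", "no"): ["preferred_contact_method"],
}


def run_with_follow_ups(static_questions, ask, follow_ups=FOLLOW_UPS):
    """Ask the static questions, inserting child questions right after a parent."""
    queue = deque(static_questions)
    answers = {}
    while queue:
        question = queue.popleft()
        if question in answers:
            continue  # never ask the same question twice
        answers[question] = ask(question)
        children = follow_ups.get((question, answers[question]), [])
        # extendleft reverses its input, so reverse first to keep the order.
        queue.extendleft(reversed(children))
    return answers


# Canned answers standing in for a real user.
canned = {
    "overall_rating": "3",
    "was_delivery_on_time": "no",
    "how_late_was_it": "2 days",
    "were_you_notified": "no",
    "preferred_contact_method": "email",
}
print(run_with_follow_ups(["overall_rating", "was_delivery_on_time"], canned.__getitem__))
```

Anything smarter than this would need the machine to choose follow-ups from the pool on its own, rather than reading them from a fixed table.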
It seems that dynamic inclusion is a much more challenging endeavor. I think I will leave it here for now so I can marinate a little more on this topic.



