There are many, many strong walkthroughs on the internet. This post is not supposed to be one of those. There are also dozens of analyses where the reader only cares about the result. Again, this post is not that. Instead, this article is meant to show the creative process of analysis. Hopefully, you'll see that great portions of this career are not wholly driven by metrics and assumptions. Instead, computing limitations force analysts to make educated guesses in order to reach a finish line, even if it's not the line their stakeholders had set. When reading, watch out for these points:
Data is often intractable. Sometimes it practices gluttony, leaving itself far too out of shape to fit into an Excel sheet. It may be prideful, harboring coding errors and impurities that it cannot bring itself to admit to you. Data can be envious; there are times it arrives as a terrible tangle of nested files, wishing it had come in a clean rectangular form. With constant streams of new information always cascading in, data may be as wrathful as all the Huns at a bowling alley birthday party. With these problems, analysts must spend a lot of their time in inspection and thought. In this example, categorizations of multi-punch questions and rarely represented countries need special attention.
There are many model types that can do a decent job. At some point, you have to settle. There are always pros and cons to continuing the quest for the "best" model. When does one stop?
Pure performance is regularly not enough. Stakeholders want to know why black-box models choose particular outcomes, and it's very common for them to ask about general trends the models are picking up on. Learning to interpret simple models like trees and regression is easy, but not enough time is spent on interpreting the heavier machine learning algorithms. LIME is my preferred choice as a detective.
Background
Data is cyber gold. It's the basis for the scientific method, yet it can be so hard to find any. Open source advocates complain about this all the time: scientific journals sit behind hefty paywalls, and white papers written by private companies rarely provide the foundational data supporting their points. In short, it's hard to get access to the fuel that makes data science valuable. That's why I found myself shocked that StackOverflow offered the results of its yearly developer survey as a data set for a Kaggle competition. Although this is not the post for that discussion, it's helpful to note what the platform is about: Kaggle hosts modeling competitions where the best predictive models win large prizes. Champions interact with noobs, provide write-ups for how they built their algorithms, and even offer boilerplate scripts to help you get started. Again, I did not expect a for-profit company with its own data scientists to offer this trove of intriguing information to the masses. It's greatly appreciated by the community, but it's hard to imagine they're getting nothing out of it. The survey is also analyzed internally and used for marketing purposes (I was in the running for that position pre-COVID before hiring was frozen, even meeting the previous standard bearer at a conference just days prior).
That said, the data was very easy to find. I searched StackOverflow itself for it, wondering what the most important features of prospective data scientists are. Maybe two search queries in, I found myself at the foot of a 2018 Kaggle competition dealing with precisely the data set I was looking for. Speaking from experience, this is tremendously lucky. Accessing data can be full of miserable steps like scraping or wrangling APIs, or it can involve hellish political boundaries set by wary external teams and companies, or it's held tightly in the clutches of private fat cats looking to squeeze one more penny out from your corneas.
Cleaning
Now, StackOverflow didn't just leave this rich data set in pristine analytical condition. Their survey involved lots of categorical selections, pools of unanswered questions, and an insulting quantity of multi-punch selections. Many questions listed out tons of common programming languages, databases, cloud platforms, and analytical tools, each asked in slightly different ways, leaving many opportunities for inconsistency. Cleaning, then, was the first barrier to working with this dataset.
Using R, I wrote a function to find multi-punch questions. Instead of having a binary column for each question-answer combination, this dataset crams every selected answer into a single cell, separated by semicolons. Left unchanged, most modeling algorithms would treat each unique combination as its own category and would not have the freedom to check the impact of particular answers or interactions. Knowing Java and Python would be considered its own thing; so would knowing Java, Python, and C++; and the modeling process would be blind to what those combinations share unless each answer becomes its own column. When the answers are in separate columns, a model has the freedom to ignore the effect of knowing C++ if that is not important for determining salary. Below is an example of the transformation.
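A minimal sketch of the idea, assuming the multi-punch column is the survey's LanguageWorkedWith field and leaning on tidyr; the actual function generalizes this across every multi-punch question, but the core transformation looks something like this:

```r
library(dplyr)
library(tidyr)

# Toy version of a multi-punch column: every selected answer sits in one
# cell, separated by semicolons.
survey <- tibble(
  Respondent = 1:3,
  LanguageWorkedWith = c("Java;Python", "Java;Python;C++", "Python")
)

# Split each cell into one row per answer, then pivot to binary columns.
survey_wide <- survey %>%
  separate_rows(LanguageWorkedWith, sep = ";") %>%
  mutate(value = 1L) %>%
  pivot_wider(
    names_from   = LanguageWorkedWith,
    names_prefix = "Language_",
    values_from  = value,
    values_fill  = 0L
  )

survey_wide
# One binary column per language: respondent 2 gets Language_C++ = 1,
# while respondents 1 and 3 get Language_C++ = 0.
```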
A secondary problem with this data is that many questions are skipped or irrelevant for respondents, which leaves massive gaps in information. There's not a single survey response where the respondent answered every question. This jams the cogs of model building, as most algorithms don't have mechanisms for handling missing values. It's on the analyst to figure out how best to "feed" these hangry modeling algorithms. There are many choices for imputing empty data: use the most common category or the median number, build a submodel to predict what the missing values should be, use 0 or create a new "empty" category, filter out rows with missing data, or derive a sensible value after reviewing why it would be missing. In this analysis, I use a mix of these, depending on the question.
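As a hedged sketch of two of those strategies, here is median imputation for a numeric question and an explicit "no answer" level for a categorical one; survey_data, YearsCodingNum, and FormalEducation are stand-in names rather than the exact columns used:

```r
library(dplyr)
library(tidyr)

survey_data <- survey_data %>%
  mutate(
    # Numeric question: fill the gaps with the median of observed answers.
    YearsCodingNum = replace_na(YearsCodingNum,
                                median(YearsCodingNum, na.rm = TRUE)),
    # Categorical question: a skipped answer becomes its own level, since
    # choosing not to answer can itself carry signal.
    FormalEducation = replace_na(FormalEducation, "No answer")
  )
```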
As a final wrinkle, Country has many unique values, yet only 42 countries have at least 150 responses. Rare levels need to be handled: what if a user is from Lesotho and our model has never seen that country? It will fail to predict because it doesn't know what that value means. Instead, we can group the smaller countries together. Ideally, we'd group by similarity (geographic region, K-means clustering, business knowledge), but here I simply lump the rarer countries into "Other".
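One convenient way to do that lumping, using forcats and the 150-response cutoff mentioned above (fct_lump_min is just one option, not necessarily the exact approach used here):

```r
library(forcats)

# Keep any country with at least 150 responses; everything rarer becomes "Other".
survey_data$Country <- fct_lump_min(
  factor(survey_data$Country),
  min = 150,
  other_level = "Other"
)
```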
This gets me a working dataset that will move through an algorithm just fine. I'm comfortable with each imputation choice, but it's important to keep in mind how those choices shape the analysis.
Modeling
This data set ends up being fairly large for my laptop: 425 variables across 98,000 observations. Normally, that is nowhere near a problem to hold in RAM. However, parallel processing greatly reduces my computer's responsiveness, which appears to be caused by the dataset being copied to each core (though I'm not 100% sure of this). So, I recommend not running these models on a slouch of a machine.
Always start with a simple model. Many times, linear regression or a regression tree can give you enough insight to call it a day. At worst, it serves as a baseline for judging the effectiveness of more complex methods. Before we build anything, the dataset needs to be separated into training and validation sets. Normally, a test set would also be held out as a final indicator of model performance, but that is essentially what I'm using the validation set for. Models can overfit the data they've seen and modeled on, which is more like "remembering" than "understanding", just like if someone gave you a math test ahead of time and you tried to memorize the answers instead of learning how to work them out yourself.
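A sketch of that split plus the simple baseline, assuming the cleaned data lives in a model_data frame with the target in a Salary column (the names are illustrative, not the exact ones in the survey):

```r
library(caret)

set.seed(42)

# Hold out 20% of respondents as a validation set the models never train on.
train_idx  <- createDataPartition(model_data$Salary, p = 0.8, list = FALSE)
training   <- model_data[train_idx, ]
validation <- model_data[-train_idx, ]

# Baseline: plain linear regression on every predictor.
baseline_lm <- lm(Salary ~ ., data = training)

# Mean absolute error on the held-out validation set.
val_pred <- predict(baseline_lm, newdata = validation)
mean(abs(val_pred - validation$Salary))
```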
The validation MAE for this baseline is 75,027, which is objectively terrible: it means the predictions are off from the truth by an average of $75,000. Let's try something more complicated; a deeper model can likely improve on this. For rectangular data like this, XGBoost is hugely popular and is almost always implemented in competitions. Before that, random forests were considered 2 legit 2 quit. Besides introducing a new model, we can speed things up with parallel processing. Improvements can also come from cross-validation, which essentially carves out mini training and validation sets to determine whether a particular version of a model is any good at predicting data it hasn't seen before.
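Roughly what that looks like with caret: register a parallel backend, ask for cross-validation in trainControl, and swap the method to an XGBoost learner. The specific method, fold count, and core count below are assumptions for illustration, not necessarily the exact settings used:

```r
library(caret)
library(doParallel)

# Parallel backend; caret farms the cross-validation fits out to these
# workers (each of which receives its own copy of the data).
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 5)

# Gradient boosting through caret. "xgbLinear" boosts linear learners;
# "xgbTree" is the tree-based flavor.
xgb_fit <- train(
  Salary ~ .,
  data      = training,
  method    = "xgbLinear",
  metric    = "MAE",
  trControl = ctrl
)

stopCluster(cl)

# Validation MAE, computed the same way as before.
xgb_pred <- predict(xgb_fit, newdata = validation)
mean(abs(xgb_pred - validation$Salary))
```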
Machine learning algorithms can take unbelievable amounts of time to train. Because they use hyperparameters to control how fast to learn, what limits to place on the predictors used, and how deep to grow trees, one needs to attempt many different sets of hyperparameters to see what works for a given problem. And for just one set of hyperparameters, ensemble models like this are themselves composed of a significant number of smaller models; XGBoost continuously learns from "weak learners", i.e. shallow trees. This all sounds great and fancy, but it hardly improves things on this dataset: still quite a large 73,775 for validation MAE.
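To make the combinatorics concrete, here is what a modest tuning grid for the tree-based XGBoost method might look like in caret; the values are arbitrary illustrations, and the grid would be handed to train() via the tuneGrid argument:

```r
# Even a modest grid multiplies fast: 3 * 3 * 2 * 2 = 36 hyperparameter
# combinations, and each one is refit once per cross-validation fold.
xgb_grid <- expand.grid(
  nrounds          = c(100, 300, 500),   # how many weak learners to add
  max_depth        = c(2, 4, 6),         # how deep each shallow tree may grow
  eta              = c(0.05, 0.3),       # how fast to learn
  gamma            = 0,
  colsample_bytree = c(0.7, 1),          # limit on predictors sampled per tree
  min_child_weight = 1,
  subsample        = 1
)

nrow(xgb_grid)  # 36 candidate models before the folds even multiply it
```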
Going even further, we can try a bunch of different models at once and allow them to "vote" on what the prediction should be. The caretEnsemble package has nice functions to help us do this fairly easily. If each model can give good insight on different parts of the dataset, then their combined knowledge should be a marked improvement over any single model.
Okay, this model did worse: 75,348 validation MAE. Likely, the additional models are worse than the linear XGBoost model by itself, and an equal vote lets these malcontents spread their negative dispositions to the whole of the prediction. Bonkers! Instead, we can institute a weighted vote by having a higher-level model choose which of these submodels should be trusted, and when. This can be done with the caretStack function, which stacks levels of models into a hierarchy. Stacking works best when the submodels are uncorrelated but still performant. Using a simpler algorithm at this top level helps it simply "weight" each submodel, while a more sophisticated meta-model can cause extreme overfitting in many cases. So we'll see how these compare.
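A sketch of both the plain vote and the weighted vote with caretEnsemble; the submodel list and meta-model choices below are assumptions that simply mirror the linear and random forest stackers described here:

```r
library(caret)
library(caretEnsemble)

set.seed(42)

# Fit several model types on identical resampling folds so their
# out-of-fold predictions can be combined later.
model_list <- caretList(
  Salary ~ .,
  data       = training,
  trControl  = trainControl(method = "cv", number = 5,
                            savePredictions = "final"),
  methodList = c("xgbLinear", "ranger", "glmnet")
)

# Plain blend: a simple linear combination of the submodels.
blend <- caretEnsemble(model_list)

# Weighted vote: a higher-level model learns when to trust each submodel.
stack_lm <- caretStack(model_list, method = "lm")
stack_rf <- caretStack(model_list, method = "rf")

# Validation predictions are compared against validation$Salary as before.
stack_pred <- predict(stack_lm, newdata = validation)
```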
Evaluation
Validation MAE is 77,717 for the random forest stacker and 72,688 for the linear stacker. The linear stacker beats the XGBoost model, but the random forest stacker is another decline. These were just two choices made on the wisdom of others before me, and it could easily have gone the other way; knowing which paths are worth exploring can feel very artistic. Here is where we landed on all metrics:
Now, what does all this look like? What do these error metrics actually mean? Let's plot the predictions against the truth to see where they go so wrong. Mind that the red line marks perfect accuracy.
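A predicted-versus-actual plot along those lines could be drawn like this, using the XGBoost fit from earlier; the figure in the post may have been built differently:

```r
library(ggplot2)

plot_df <- data.frame(
  actual    = validation$Salary,
  predicted = predict(xgb_fit, newdata = validation)
)

ggplot(plot_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.2) +
  # The red line marks perfect accuracy: predicted == actual.
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(x = "Actual salary (USD)", y = "Predicted salary (USD)")
```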
It's obvious from this that these models have no idea what gets someone over $250,000. Not only that, they appear to have only moderate skill in predicting lower salaries as well.
Based on these results, it's clear that the work we've done so far isn't yet useful. We'll have to think about this problem more before we get to describing what drives a given prediction. The next post in this series will involve more tuning. If that goes well, it will also cover LIME to help uncover what causes each prediction to come out the way it does. I hope this was a useful read, but I'm most excited about wiping away the fog that surrounds black-box models!