Predicting Crime in SF, a Toy WMD: Machine Learning 101, from Linear Regression to Deep Learning
See the interactive map with the prediction results here.
For the code, visit the repo for this project here.
When new technologies emerge, our ethics and our laws normally take some time to adjust. As a social scientist and a philosopher by training, I've always been interested in this intersection of technology and morality. A few months ago I read Cathy O'Neil's book Weapons of Math Destruction (link to my review) and realized its message was too important to remain neglected by data scientists.
I started this project to show the potential ethical conflicts created by our new algorithms. In every conceivable field, algorithms are being used to filter people. In many cases, the algorithms are obscure, unchallenged, and self-perpetuating. This is what O'Neil refers to as Weapons of Math Destruction - WMDs. They are unfair by design: they are our biases turned into code and let loose. Worst of all, they create feedback loops that reinforce their own predictions.
I decided to create a WMD for illustration purposes. This project is meant to be as simple and straightforward as possible. The goals are twofold: first, to show how easy it is to create a Weapon of Math Destruction; second, to help aspiring data scientists see the process of a project from start to finish. I hope people are inspired to think twice about the ethical implications of their models.
For this project, I will create a predictive policing model to determine where crime is more likely to occur. I will attempt to show how easy it is to create such a model, and why it can be so dangerous. Models like these are being adopted by police agencies all over the United States. Given the implicit racial bias found in all human beings, and given how people of color are already twice as likely to be killed by police, this is a scary trend. Here's how machine learning can make the problem worse.
The data used for this project comes from the open data initiative of the City of San Francisco, a great resource for data scientists interested in public policy. Hopefully more cities will follow this example and make their data public and machine-readable.
The crime data for 2016 looks like this:
Predictive policing models, and WMDs in general, benefit from being obscure and complicated. Their vendors can charge a higher premium for technical wizardry their customers can't understand. They typically use hundreds or even thousands of different input variables to make their predictions, and claim this is what makes their predictions so good.
I will do the opposite, in order to demonstrate the inner workings of a WMD, and how easy it actually is to build one.
I will attempt to predict:
the number of crime incidents that will occur in a given zip code, given the day of the week, and the time of day.
I will train my model on the 2016 data, and then use the 2017 data to test how well I did.
After selecting only the variables I want, and summing the total number of crimes per year per zip code per hour, I get the following:
In other words, 265 crime incidents were reported between 17:00h and 18:00h on Fridays in zip code 94103 during all of 2016. Since I sorted these by number of crimes, we can see that the highest number of crimes always occurs in zip code 94103. This already gives you a hint of how easy it is to sell snake-oil models: "Send police to 94103 to find the crime!"
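The aggregation described above can be sketched with pandas. The column names here are hypothetical stand-ins for the real dataset's fields, and the toy rows are invented for illustration:

```python
import pandas as pd

# Hypothetical columns -- the real SF dataset's names differ.
# One row per reported incident.
raw = pd.DataFrame({
    "zip_code":    ["94103", "94103", "94103", "94102"],
    "day_of_week": ["Friday", "Friday", "Friday", "Monday"],
    "hour":        [17, 17, 18, 9],
})

# Count incidents per (zip code, day of week, hour), then sort so
# the busiest combinations come first.
crime_counts = (
    raw.groupby(["zip_code", "day_of_week", "hour"])
       .size()
       .reset_index(name="num_crimes")
       .sort_values("num_crimes", ascending=False)
)
print(crime_counts.head())
```

On the real 2016 data, the top row of this table is the one described above: zip code 94103, Friday, 17:00, with 265 incidents.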
However, this is far too simple, and if customers knew we're just predicting crime exactly where it already happened most often, nobody would pay too much for it. Let's make it more complicated.
Data scientists commonly split data randomly for testing, with about 70% used for training and 30% used for testing. However, when there is a time component involved, it is common to split it chronologically, and see if we could have predicted the future. I will use 2016 data to see if I could have predicted 2017 data.
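A chronological split like this is a one-liner once the data carries a date column. A minimal sketch, with invented rows standing in for the real incident data:

```python
import pandas as pd

# Hypothetical frame: one row per aggregated observation, with a date.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-03-01", "2016-11-15", "2017-02-10", "2017-06-05"]),
    "num_crimes": [265, 180, 240, 175],
})

# Chronological split: train on 2016, test on 2017 -- no random shuffle,
# since we want to know whether we could have predicted the future.
train = df[df["date"].dt.year == 2016]
test = df[df["date"].dt.year == 2017]
```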
Testing is where the magic of machine learning comes in. When I was in demography graduate school, we made predictions about the world and then presented them in a paper. Nobody knew whether they were good predictions or bad ones, because nobody made an effort to measure accuracy. If they seemed reasonable, intuitive, and more or less in line with the literature, the project was praised.
Data science is more rigorous. We split the data into a training set and a testing set. We create a model based on the training set, make predictions, and then compare our predictions to the actual results of the test set.
We iterate over and over until we get better results. And then we iterate again, to the point where we're willing to sacrifice understanding for the sake of better accuracy.
This is a double-edged sword. It can be great for certain applications. For example, we don't need to know exactly how an image recognition model works. If it can recognize someone's face when prompted, that's what matters. The problem comes up when we are making decisions that filter people and we can't explain to them why they have been chosen, or worse, discriminated against.
If we tell someone they are fired because our model says they are underperforming, and then we can't explain how our model works, they can never appeal the decision. If we made a mistake, who would know? If our obscure models have encoded common racist, sexist, or classist assumptions, who could stand up against the injustice?
Trying out five different models
Now that we have our data separated into training and testing buckets, we can start evaluating how well different models do.
Note: All these models have different ways to tune and tweak them (hyperparameters). For simplicity, I will use the defaults and won't go into this here; just be aware there's normally a way to eke out better accuracy from each of these.
1. Linear Regression
This is an early-19th-century statistical technique that is fast and simple. A linear model tries to draw a line through the data that fits it as well as possible. This model is so common because it is fast and explainable. If we wanted to, we could say how important each input is in making the resulting predictions. We train the model with the 2016 data, make predictions, and see how well they match the actual number of crimes in 2017.
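A minimal sketch with scikit-learn. Synthetic data stands in for the real crime counts, and the features (zip-code index, day-of-week index, hour) are numerically encoded for simplicity; in practice you'd one-hot encode the zip codes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X_train = rng.integers(0, [50, 7, 24], size=(500, 3)).astype(float)
y_train = 2 * X_train[:, 2] + 5 * (X_train[:, 0] == 3) + rng.normal(0, 2, 500)

X_test = rng.integers(0, [50, 7, 24], size=(200, 3)).astype(float)
y_test = 2 * X_test[:, 2] + 5 * (X_test[:, 0] == 3) + rng.normal(0, 2, 200)

model = LinearRegression()
model.fit(X_train, y_train)

# .score() returns R^2 on held-out data -- the "accuracy"
# figure quoted throughout this post.
print(model.score(X_test, y_test))
```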
Our results show we can predict the number of crimes with 63% accuracy with this model. This is a good baseline, but we can do better with newer techniques.
2. Random Forest Regressor
A random forest model, in a nutshell, is a bunch of random decision trees working together.
A decision tree, in a nutshell, is splitting the data on the most likely places it'll split, and then choosing the most likely outcome.
For example, our model first notices that Friday is the day with the most crime and that 94103 is the zip code with the most crime. It then goes through every observation and asks, "Did it take place on a Friday?" If the answer is yes, it predicts a higher number of crimes; if no, a lower one. Then it asks, "Did it take place in 94103?" Again, yes pushes the prediction up and no pushes it down. It keeps splitting in this way and then makes predictions for each combination of inputs.
Let's see how well we do.
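The "three extra lines" look roughly like this; the synthetic data here is an invented stand-in for the real crime counts:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X_train = rng.integers(0, [50, 7, 24], size=(500, 3)).astype(float)
y_train = 2 * X_train[:, 2] + 5 * (X_train[:, 0] == 3) + rng.normal(0, 2, 500)
X_test = rng.integers(0, [50, 7, 24], size=(200, 3)).astype(float)
y_test = 2 * X_test[:, 2] + 5 * (X_test[:, 0] == 3) + rng.normal(0, 2, 200)

# Default hyperparameters, as everywhere in this post.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))    # R^2 on held-out data
print(model.feature_importances_)     # how much each input mattered
```

The `feature_importances_` attribute is what produces the importance ranking discussed below.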
Whoa! We're up to 80% accuracy, not bad for three extra lines of code. This is how important it found each feature, or independent variable, to be:
These results are still slightly interpretable. We can see that the most important features that determine the number of crimes this model predicts are whether it's zip code 94103, what time of the day it is, and whether it's zip code 94102. It predicts how many crimes will happen based on a combination of all these inputs.
3. K-Nearest Neighbors
A KNN model is where we start losing our ability to explain how exactly we got our results. The underlying theory is easy to grasp; explaining how a particular outcome came to be is not.
A KNN model is pretty self-explanatory. It looks at the closest "neighbors" of an input and spits out an answer that is most similar to its neighbors.
For example, let's say our model sees an input that says "Friday, 94103, 4pm." It might determine that the "nearest neighbors" are "Friday, 94103, 5pm" and "Friday, 94103, 6pm." It will take the number of crimes that occurred at those two neighbors, average them, and the result will be the prediction for our original 4pm input. The number of neighbors, and the ways to define "average," can vary a lot, but the intuition is the same.
Let's see how we do:
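The out-of-the-box version, again on invented stand-in data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X_train = rng.integers(0, [50, 7, 24], size=(500, 3)).astype(float)
y_train = 2 * X_train[:, 2] + 5 * (X_train[:, 0] == 3) + rng.normal(0, 2, 500)
X_test = rng.integers(0, [50, 7, 24], size=(200, 3)).astype(float)
y_test = 2 * X_test[:, 2] + 5 * (X_test[:, 0] == 3) + rng.normal(0, 2, 200)

# Default of 5 neighbors, averaged with equal weights.
model = KNeighborsRegressor()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```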
This time, KNN out of the box doesn't do as well as our other models. This would probably improve if we found an optimal number of neighbors, but I'll keep it simple and just move on to the next model.
4. XGBoost
XGBoost is an award-winning algorithm that is famous for excelling in Kaggle competitions.
In a nutshell, it's also a collection of trees, like the random forest we saw above. The difference is that a random forest splits the data randomly, as the name implies. Boosted tree models, by contrast, construct trees iteratively: the examples are reweighted for each subsequent tree so that it focuses on reducing the remaining error. In other words, it constructs trees, compares how well they did, and then constructs better and better trees.
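The xgboost library offers a scikit-learn-style `XGBRegressor`, but the same boosting idea can be sketched with scikit-learn's built-in `GradientBoostingRegressor`, which avoids an extra dependency. As before, the data here is an invented stand-in:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X_train = rng.integers(0, [50, 7, 24], size=(500, 3)).astype(float)
y_train = 2 * X_train[:, 2] + 5 * (X_train[:, 0] == 3) + rng.normal(0, 2, 500)
X_test = rng.integers(0, [50, 7, 24], size=(200, 3)).astype(float)
y_test = 2 * X_test[:, 2] + 5 * (X_test[:, 0] == 3) + rng.normal(0, 2, 200)

# Each new tree is fit to the residual errors of the trees before it.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```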
A very small increase in our accuracy. But good enough to move to the next model.
5. Deep Learning - Multilayer Perceptron
Now we've reached the point where there's no way to explain how we get the results that we do. If someone is charged higher interest on a loan because of a deep learning model and asks, "What factors contributed to the amount you are charging me?", we would have no way to give them an answer.
This is where data science can become very dangerous if we are affecting people's lives and can't explain why. If we made a mistake, or if we introduced our biases into the model, the model is pretty much a black box.
I will use a multilayer perceptron regressor to see how accurate I can get my predictions to be. I'm choosing 4 hidden layers with 100 nodes each. I will not do a randomized search to find the best hyperparameters, but as you can see, there are a lot of options to tweak.
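A minimal sketch of that architecture with scikit-learn's `MLPRegressor`, on invented stand-in data (in practice you'd also scale the inputs, which helps neural networks converge):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X_train = rng.integers(0, [50, 7, 24], size=(500, 3)).astype(float)
y_train = 2 * X_train[:, 2] + rng.normal(0, 2, 500)

# Four hidden layers of 100 nodes each, as described above.
model = MLPRegressor(hidden_layer_sizes=(100, 100, 100, 100),
                     max_iter=500, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_train)
```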
We can now predict how many crimes will occur in a given zip code in a given hour of day in a given day of the week with about 87% accuracy. We did this even though we didn't even spend time tuning hyperparameters or obtaining more data.
After trying out five different models, we almost reached 90% accuracy. This was done in the simplest possible way. If we wanted to improve our results we could do the following:
- Use more input variables: weather, population density, distance to liquor stores or homeless shelters, demographic makeup of each zip code, among others.
- Tune hyperparameters: we could have done a "gridsearch" to find the best number of neighbors for the KNN model, the size of the trees in the random forest model, the regularization to use in the linear regression model, and many, many options in the deep learning model.
- Use more data: data from 2015, 2014 and every other year that was available.
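The hyperparameter tuning mentioned above is straightforward with scikit-learn's `GridSearchCV`. Here's a sketch for the KNN neighbor count, on invented stand-in data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in features: (zip-code index, day-of-week index, hour).
X = rng.integers(0, [50, 7, 24], size=(300, 3)).astype(float)
y = 2 * X[:, 2] + rng.normal(0, 2, 300)

# Try several neighbor counts and keep the best by cross-validated R^2.
search = GridSearchCV(KNeighborsRegressor(),
                      param_grid={"n_neighbors": [3, 5, 10, 20]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
```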
As you can see, there's a lot of room for improvement. But my point was to get the best possible model with the least amount of effort. I wanted to show how easy it really is to make these predictions, and how quickly we lose explainability.
If we start sending more police to the areas where we predict more crime, the police will find crime. However, if we start sending more police anywhere, they will also find more crime. This is simply a result of having more police in any given area trying to find crime.
This means that our model could be off, but it will always appear right. If police already frequent a neighborhood and search people because of their inherent racial bias, they will already have found more crime there. This means the model will send them there again and again, and will become a self-fulfilling prophecy. Some people have argued that sending enough police to high-crime areas will deter crime and eventually cause a reduction. The question we must consider is how many innocent lives were accidentally cut short by the police to achieve that reduction.
Data scientists must start to become more aware of the possible misuses of our algorithms. We must start thinking about making our models more transparent. We must be aware of how our models hurt people. We must do better.
Let's grab coffee!
I'm interested in talking about the intersection of technology and ethics, data science consulting, full-time opportunities and passion projects.