How to estimate the impact of algorithms

You’ve just finished training a credit risk tree model with a whopping 57 AUC score, and you feel great. And you should. But let’s dig deeper. How much better will this model be than using no model? Or than using the previous model, which had an AUC of 48?

Have you ever wondered what the impact of an algorithm you are building is? How much money are you making for your company? How many lives are your campaigns saving?

Every member of an organization should know how their actions contribute to the organization’s goals. This allows them to prioritize and be more efficient in their work.

The impact of an algorithm is tied to the actions it enables

To estimate the impact of an algorithm, we first need to define a metric. This will usually be money, because it’s the main means of exchanging value and one of the main goals of businesses. However, depending on the nature of your project, you can use other metrics such as lives or time saved.

To estimate the impact of an action, we have to calculate the difference of our metric between two different scenarios:

  1. The current outcome (measured in the metric we’ve defined). This can be 0 if nothing can be done without the algorithm
  2. The outcome we expect to get by using the algorithm instead

Building simplified models of the situation will allow us to make estimations of the impact. This is similar to how we would build a business case.

Let’s make it clearer with an example

STL limited (Short Term Loans) is a credit company that gives 1-year loans. This is how their business is going:

  1. They give loans at a 10% interest rate to everyone that applies for one
  2. Their default rate is 10% (percentage of customers that don’t pay back all the money they owe)
  3. Customers that default had paid back an average of 30% of the loan amount before defaulting
  4. The average loan amount is 1.000$
  5. Every year 100.000 new customers apply for a loan

With this information we can estimate how much they currently earn per year using a simple Excel spreadsheet. We will first estimate the expected earnings per non-defaulting customer (NDC) and per defaulting customer (DC). We will then combine those estimates, weighting each one by the probability of its outcome, to calculate the expected value per customer.
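The same calculation can be sketched in a few lines of Python instead of a spreadsheet. This is a minimal sketch, assuming that a non-defaulting customer earns STL the full interest and that a defaulting customer costs STL the part of the loan that was never repaid. The function and variable names are made up for this example.

```python
# Base-case figures taken from the description of STL limited.
LOAN_AMOUNT = 1_000           # average loan amount, in $
INTEREST_RATE = 0.10          # interest charged on a 1-year loan
DEFAULT_RATE = 0.10           # share of customers who default
RECOVERY_RATE = 0.30          # share of the loan a defaulter pays back
CUSTOMERS_PER_YEAR = 100_000  # new loan applicants per year


def expected_earnings_per_customer(default_rate):
    """Expected earnings per customer for a given default rate."""
    # Assumption: a non-defaulting customer (NDC) pays back the loan plus interest.
    earnings_ndc = LOAN_AMOUNT * INTEREST_RATE
    # Assumption: a defaulting customer (DC) pays back only part of the loan,
    # so STL loses the rest.
    earnings_dc = LOAN_AMOUNT * RECOVERY_RATE - LOAN_AMOUNT
    # Weight each outcome by its probability.
    return (1 - default_rate) * earnings_ndc + default_rate * earnings_dc


per_customer = expected_earnings_per_customer(DEFAULT_RATE)
total_per_year = per_customer * CUSTOMERS_PER_YEAR

print(f"Expected earnings per customer: {per_customer:.2f}$")
print(f"Expected earnings per year: {total_per_year:,.0f}$")
```

Under these assumptions each customer is worth about 20$ on average, which adds up to roughly 2M$ per year for 100.000 customers.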

Building a model to improve earnings

John, the lead data scientist at STL limited, has developed a probability of default model. He has trained it using customer employment data that was collected anyway for regulatory reasons.

John uses the model to make predictions on a holdout set (a dataset that the model has never seen before). He then divides the customers into four groups of the same size based on the probability of default predictions. The following table shows the probability of default for each of the groups:

Modifying the default rate in the previous spreadsheet, we can estimate the expected earnings per customer for each of the groups:

The average customer in group 4 loses money for STL limited.

How much would STL limited earn if they only gave credit to people in groups 1, 2 and 3?

By only giving credit to customers with positive expected earnings, STL could make a total of 3,5M$ per year. This means that the model would have an impact of 1,5M$ (3,5M$ minus the 2M$ of the base case).
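The comparison between the two scenarios can be sketched as follows. The per-group default rates in this snippet are hypothetical placeholders, not the values from John's table: they were picked only so that they average out to the overall 10% default rate and give totals in line with the figures quoted above. The earnings assumptions (full interest from non-defaulting customers, loss of the unpaid part of the loan from defaulting ones) and all names are also assumptions made for illustration.

```python
LOAN_AMOUNT = 1_000           # average loan amount, in $
INTEREST_RATE = 0.10          # interest charged on a 1-year loan
RECOVERY_RATE = 0.30          # share of the loan a defaulter pays back
CUSTOMERS_PER_YEAR = 100_000  # new loan applicants per year

# HYPOTHETICAL default rate per group (placeholders, not the article's table).
group_default_rates = {1: 0.02, 2: 0.06, 3: 0.12, 4: 0.20}


def expected_earnings_per_customer(default_rate):
    """Expected earnings per customer for a given default rate."""
    earnings_ndc = LOAN_AMOUNT * INTEREST_RATE                # full interest
    earnings_dc = LOAN_AMOUNT * RECOVERY_RATE - LOAN_AMOUNT   # unpaid part is lost
    return (1 - default_rate) * earnings_ndc + default_rate * earnings_dc


# The groups have the same size, so each one gets an equal share of applicants.
customers_per_group = CUSTOMERS_PER_YEAR / len(group_default_rates)

earnings_by_group = {
    group: expected_earnings_per_customer(rate) * customers_per_group
    for group, rate in group_default_rates.items()
}

# Scenario 1: lend to everyone (no model).
base_case = sum(earnings_by_group.values())
# Scenario 2: lend only to groups with positive expected earnings.
with_model = sum(e for e in earnings_by_group.values() if e > 0)
# The impact of the model is the difference between the two scenarios.
impact = with_model - base_case

for group, earnings in earnings_by_group.items():
    print(f"Group {group}: {earnings:,.0f}$ per year")
print(f"Base case: {base_case:,.0f}$  With model: {with_model:,.0f}$  Impact: {impact:,.0f}$")
```

Swapping in the real default rate of each group is enough to redo the estimate. The decision rule stays the same: lend only where the expected earnings are positive.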

Wrapping it up

This impact estimation method is based on simplification and it leaves out second-order consequences of the actions. Additionally, future performance isn’t guaranteed to be the same as in the past. To account for these sources of uncertainty, I generally multiply the impact estimation by a conservative factor of 50-80%.
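As a quick illustration, this is how the conservative factor could be applied to the 1,5M$ impact from the STL example. It is only a sketch and the variable names are made up.

```python
estimated_impact = 1_500_000  # $ per year, impact estimated in the STL example

# Conservative factors mentioned above: between 50% and 80%.
factor_low, factor_high = 0.50, 0.80

conservative_low = estimated_impact * factor_low
conservative_high = estimated_impact * factor_high

print(f"Conservative impact: {conservative_low:,.0f}$ to {conservative_high:,.0f}$ per year")
```

This turns the single estimate into a range of roughly 0,75M$ to 1,2M$ per year, which is still enough to compare and prioritize projects.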

Nevertheless, the objective of these estimations is not perfect accuracy but getting a ballpark figure that will allow us to compare and prioritize.


When is a master’s degree the right way into Data Science?

You are a student finishing your bachelor’s degree and unsure of what to do next. Or perhaps you are a professional considering a change of industry. You may even be a PhD who wants to transition into the private sector. Whatever your origin, what concerns you now is how to accomplish this change. And you wonder: should I study a data science master’s? Will it be worth it?

Since you want to get a data science job, enrolling in a master’s program sounds like a logical step. A master’s will give you the knowledge needed to do the job as well as the credentials to get it. However, it is also costly. Tuition fees may vary from a few thousand dollars to several tens of thousands. Additionally, you will have to invest a year or two in it, potentially going jobless during that time.

Since it’s a very important decision, you consider the alternatives. Is there a better way to learn data science? Maybe self-learning and some online courses? Maybe you can learn it on the job in another data-related position and easily transition later?

In the rest of the article I will compare the master’s and the self-learner routes and highlight when one is better than the other.

Let me tell you a couple of stories

With one year left to finish my degree (Math+Civil engineering) I took an interest in data science. That year I only had to do my final thesis, so I had a lot of free time. I used that as an opportunity to get into data science by:

  1. Doing a machine learning project as my thesis with the help of some online courses and books
  2. Joining a local analytics consulting firm for an internship, where I learned about SQL and databases

That turned me into a great candidate for entry-level data science positions, and I got my first full-time job right after finishing my thesis.

My wife Anna enrolled in a master’s in statistics and operations research right after finishing her bachelor’s in mathematics. After finishing her master’s she has held a couple of data scientist jobs and is now a biostatistician. Doing a master’s was a great decision in her case. It led to job offers and meeting great friends.

As illustrated by these examples, both ways can work. Which one is best will depend on your personal situation and preferences.

Why should you study a master’s in data science?

Certification. A master’s degree is a recognizable badge and will make it easier for you to get interviews. Recruiters and HR professionals value it highly. Data scientists don’t value it as much: according to a recent Kaggle survey, only 20-30% of data scientists hold a master’s degree. If you decide to skip the master’s, there are some ways to get proof of your skill, such as:

  • Data science competitions (Kaggle, local hackathons …)
  • LinkedIn skill assessments
  • Personal projects and open-source contributions
  • Experience in adjacent fields (data analyst, data engineer)

Peers. Another advantage of a master’s degree is that you will have a class of like-minded people to study with. During the degree you can have fun together. Afterwards you will form a valuable network of professionals who can help each other. Self-learners won’t have the same camaraderie as classmates. However, you can join online communities as well as local meetups and study groups to network and socialize.

Convenience. The final major advantage of a master’s degree is that it’s simply easier to follow from start to finish. Most people find it a lot easier to commit to a habit once they’ve paid for it or given it some formal structure. Finding the discipline and motivation for consistent self-study is hard. If you struggle with it, you can try some of these tricks:

  • Learn with a friend
  • Allocate a certain time of the week to it
  • Find a way to track your progress and give you a sense of accomplishment

What are the advantages of self-learning?

Flexibility. Self-study lets you advance at your own pace, from wherever you are, and skip subjects you find boring or uninteresting. This was very important in my case as I have always struggled with things I find boring. Some master’s programs aren’t as rigid as they used to be, but they are still nowhere near as flexible as self-learning.

Cost. And an obvious one. The cost of learning data science on your own will be close to zero. You may spend some money on a couple of books and online courses. Master’s degrees are very expensive unless education is heavily subsidized where you live.

Quality of education. I know this may come as a shock to some. Self-learning will let you pick and choose the best materials from different sources. On the other hand, a master’s will have you committed to a single program. If you are unsure about the best books and courses, worry not. Online forums like Reddit and Stack Overflow will answer your questions, and blogs like this one will try to point you in the right direction. Moreover, bloggers and Kaggle winners regularly share their experience and tips, something that many teachers won’t do. So even if you decide a master’s is the better option for you, it’s good to stay online.

And finally, let’s talk about salary. The Kaggle study found no significant salary differences between those who had a master’s and those who didn’t. So we can call this even.

Conclusion

As with most things in life, whether or not studying a DS master’s is a good idea will depend on your situation and preferences.

Study a master’s if you really value the certification or having classmates, or if you aren’t confident you have the discipline to do it solo.

Self-learn if you value the flexibility or money is an issue.
