13 essential tips for learning machine learning and data science

When you start learning, it’s very hard to have a clear direction. You often waste time on uninteresting, useless, or outdated topics. You wander and run in circles.

However, once you’ve mastered the topic, it’s easy to look back and see the fastest path from noob to pro. If only you could go back in time and give yourself the roadmap… Even though I can’t do that for myself, I can do it for others. That is the objective of this article: to give you the tips I wish I had known when I started learning data science and machine learning.

To build this list, I first wrote down what has been useful to me in my experience as a data scientist. Then I went to Reddit to seek help curating and completing the list; the post got 300+ upvotes and 35+ comments. I hope you find it helpful!

1. Get solid mathematics, probabilities, and statistics foundations

Mathematics and statistics are at the core of machine learning. So it will be very difficult to understand machine learning algorithms if you don’t know the building blocks.

However, this doesn’t mean you need to be a math wizard. You should understand math and stats concepts such as vectors, matrices, derivatives, probability distributions, independent variables, or standard deviation. More advanced mathematics (like learning to prove theorems) won’t help you much when studying machine learning, even though it can be a lot of fun.
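To give you a feel for what these building blocks look like in practice, here’s a tiny Python sketch (Python comes up in the next tip) with made-up numbers; it only assumes NumPy is installed:

```python
import numpy as np

# A vector of observations (made-up data) and its standard deviation
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.std())          # spread of the data around its mean

# A matrix-vector product, the core operation behind linear models
W = np.array([[1.0, 0.5],
              [0.2, 2.0]])
v = np.array([3.0, -1.0])
print(W @ v)            # [2.5, -1.4]

# A numerical derivative of f(t) = t**2 at t = 3 (should be close to 6)
f = lambda t: t ** 2
h = 1e-6
print((f(3 + h) - f(3 - h)) / (2 * h))
```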

2. Learn either Python or R and learn them well

When doing data science and machine learning, you will spend most of your time coding in R/Python. So it’s important to learn the ins and outs of your language of choice.

Data scientists spend a lot of time cleaning and manipulating data, so you should give special attention to data manipulation libraries. The most popular ones are Pandas for Python and data.table and dplyr for R.
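As a small illustration, here’s a minimal Pandas sketch with invented data showing the kind of cleaning and aggregation that eats up most of a data scientist’s day:

```python
import pandas as pd

# Invented example data with the usual real-world problems:
# missing values, inconsistent capitalization
raw = pd.DataFrame({
    "city":  ["Paris", "paris", "Berlin", "Berlin", None],
    "sales": [100, 120, None, 90, 50],
})

clean = (
    raw
    .dropna(subset=["city"])                        # drop rows without a city
    .assign(city=lambda d: d["city"].str.title())   # normalize capitalization
    .fillna({"sales": 0})                           # treat missing sales as 0
)

# Aggregate: total sales per city
print(clean.groupby("city")["sales"].sum())
```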

3. Learn good programming practices

Writing clean and efficient code will make it easier to share your work with others. And even if you work alone, it will make it easier for you to debug and maintain your own code. Entire books have been written about this, so I’ll just give you a short list, followed by a small example:

  1. Use consistent and descriptive names for variables, columns, and functions
  2. Don’t repeat code, use functions or classes if you need to do the same process multiple times
  3. Understandable code is better than compact code: 10 lines everybody understands beat 2 lines nobody understands
  4. Don’t over-optimize your code at the start, but know where the bottlenecks are (the parts that won’t work well if you increase the volume of data) in case you need it to scale
  5. Use consistent indentation and try to limit line length
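Here’s a small before/after sketch (with invented column names) illustrating points 1 and 2:

```python
import pandas as pd

# Before: cryptic names and the same logic copy-pasted for every column
# a = df["x1"].mean() / df["x1"].std()
# b = df["x2"].mean() / df["x2"].std()

# After: one descriptive, reusable function
def signal_to_noise(values: pd.Series) -> float:
    """Ratio of the mean to the standard deviation of a column."""
    return values.mean() / values.std()

df = pd.DataFrame({"height_cm": [170, 165, 180], "weight_kg": [70, 60, 85]})
ratios = {column: signal_to_noise(df[column]) for column in df.columns}
print(ratios)
```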

4. You don’t need to learn all the different supervised learning models

This is one I struggled with. When I started learning I thought that every situation would need a different type of model and that I needed to learn them all to be well equipped. But this is far from true. Linear/logistic regression is surprisingly effective for tabular data problems. And XGBoost or random forest will help you if you have a lot of non-linearities. Artificial neural nets are great for image and NLP problems but are otherwise overkill and more difficult to set up.
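As a rough illustration of that progression, here’s a minimal scikit-learn sketch (using one of its bundled toy datasets) that pits a simple linear baseline against a tree ensemble on the same tabular data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small tabular classification problem bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Start with a simple, interpretable linear baseline...
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print("logistic regression:", cross_val_score(baseline, X, y, cv=5).mean())

# ...and reach for a tree ensemble when you need to capture non-linearities
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```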

Additionally, you don’t have to keep up with all the published papers. Most staple techniques in the industry are decades old. If you ever face a truly unique problem, that may be a good moment to dive into the literature.

5. Once you know the basics and understand them well, it’s mostly about doing projects

After completing one or two ML courses, don’t spend your time on more theory; dive straight into doing some projects. If you’re lacking some knowledge, you can pick it up along the way.

Working on projects puts your knowledge into practice and helps you figure out whether you really understood everything well. Additionally, by doing projects you create valuable experience that will help you get hired later on.

6. Doing tutorials and reviewing other people’s projects is very helpful at the start

When you’re learning a new tool or model and don’t feel confident about using it on your own, looking at an example is a great way to get some inspiration.

7. You can learn everything online for free, but some paid resources can be helpful

For example, studying a master’s will give you credentials and a class of peers. I’ve actually written a full article about self-learning vs studying a master’s.

Additionally, some useful online resources are paid. I have personally tried to distill my years of experience as a data scientist into Data Projects, a product to learn data science by doing real-world projects. I hope it can help others as much as it would’ve helped me.

8. Explaining your work to others is a great way to consolidate your knowledge

It’s also a great way to work on your communication. You can do this by telling your friends, blogging, or making YouTube videos. This will be a crucial skill when working with others.

9. Don’t despair if you don’t get it right

Nobody gets it right the first time. Trial and error is the way to go, especially in fields like this one, where there is no single exact solution.

10. Lean on online communities

The internet is full of helpful and generous people. If you’re struggling with something, search for it first; if you can’t find the answer, ask in forums like Reddit or Stack Overflow.

11. Learn more about your problem domain

Don’t focus only on the purely technical; try to understand what is really behind the problems you’re modeling. It will help you decide which error metric best fits the problem, select the most insightful variables, and communicate with non-technical stakeholders in their own language.

12. Work with messy data

Don’t just stick to problems with pre-cleaned data. The world is messy, and having some experience cleaning and structuring data will prepare you for future challenges.

13. Work on what makes you curious, that will keep you motivated

Following your curiosity and your passions will make sure you don’t abandon your path to becoming a data scientist halfway through. Additionally, it makes the whole learning experience a lot more fun!


How to self-learn data science from scratch

When I learned data science I didn’t know where to start, so I wasted many hours learning only tangentially useful stuff. Now, after more than five years as a data science consultant, I know what I would’ve done differently. In this article, I will offer you a roadmap on how to self-learn data science, with links to useful resources.

Data science pre-requisites

Even though I believe everyone can learn data science, those with a technical background will have a head start. Before getting into DS-specific subjects, it is useful to have some notions of mathematics, statistics, and probability.

It is not necessary to be an expert in any of those, but you need a solid foundation. If you’ve never studied any of those, don’t worry, I’m here to help. In the following paragraphs, I’ll briefly describe each prerequisite and link to educational resources.

Mathematics for data science

To get started with data science you need to get familiar with some of mathematics’ most common objects. These Khan Academy lessons about vectors, matrices and functions are a good place to start. Also, here’s the summary (in more formal mathematical language) of a Stanford course. These concepts are the building blocks of most machine learning algorithms and provide you with a framework for structuring data. Reaching this level of mathematics will allow you to understand and use the algorithms that others have invented and implemented, and to get results.

If you really like mathematics, you can dive deeper by taking full calculus and linear algebra courses. This will require a lot more work but will unlock a more complete understanding of the inner workings of machine learning algorithms and how to implement and adjust them.

Probability and statistics

Probability lies at the core of the data scientist’s view of the world. When dealing with big numbers and random events, probability and statistics provide the tools to make sense of them. It isn’t only about the exact methods or formulas, but also about developing a probabilistic intuition. These Khan Academy courses on probability and statistics are both beginner-friendly and have all the information you’ll need. Here is a mathematically formal summary of a probability course from Stanford.
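A cheap way to start building that intuition is to simulate random processes yourself; here’s a tiny sketch (a fair coin and the law of large numbers) you can tinker with:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulate flipping a fair coin and watch the running proportion of heads
# drift towards 0.5 as the sample grows (the law of large numbers)
for n in [10, 100, 10_000, 1_000_000]:
    flips = rng.integers(0, 2, size=n)   # 0 = tails, 1 = heads
    print(n, flips.mean())
```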

In addition to formal education in probability and statistics, reading non-fiction books can also help develop intuition. I recommend the following books in no particular order: Thinking, Fast and Slow; Factfulness; Thinking in Bets; Fooled by Randomness (or any of Nassim Taleb’s books).

Finally, reading about statistical paradoxes will help you make sense of data when you face unintuitive conclusions.

Data-oriented programming language

A big part of a data scientist’s job is reading, manipulating and running analysis on data. This is usually done by coding in a data-oriented language. These languages allow us to write instructions for a computer to execute. Even though there are many different programming languages, most of them use very similar structures. The two most popular data-oriented programming languages are Python and R, and you can start with either one. If at some later point you work with people using the other one, you can use that as an opportunity to learn it.

If you’ve never coded before, don’t worry. Both of them can be a good first point of contact with programming. A lot has been written about which one is better, but the truth is they have different strengths.

R’s strong points are:

  • It is designed for data and statistical work, so manipulating data is easier
  • There is a vast universe of statistics libraries
  • The Shiny library makes it very easy to make a web app with no previous web design experience
  • RStudio is a wonderful IDE (I haven’t found one that I like as much for Python)

Python’s strong points are:

  • It’s a general-purpose programming language as well as one of the most popular languages overall
  • It usually runs faster than R
  • It has better packages for deep learning

I personally prefer R because of its more compact syntax in the data.table package and also because I have more experience with it.

Learning R

If you are new to programming, I recommend you start with one of these resources:

If you have been coding for a while, you can get the basics with learn R in Y minutes.

Once you know the basics, it’s time to learn one of the two main data manipulation libraries: data.table (my personal favorite) or dplyr. Another useful library is ggplot2 for making beautiful graphics.

Learning Python

If Python is your first programming language, you can start with any of these:

If you’re already familiar with coding you can just read this documentation.

And once you’ve mastered Python’s basics, you can move on to the specialized tools for manipulating data: Pandas and NumPy. Here’s a tutorial and here’s a video to help you learn those packages.
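If you’d like a quick taste before diving into those resources, here’s a minimal sketch (with invented prices) of NumPy and Pandas working together:

```python
import numpy as np
import pandas as pd

# NumPy: fast arrays and vectorized math on invented price data
prices = np.array([9.99, 14.50, 4.25, 20.00])
taxed = prices * 1.21            # one operation applied to every element

# Pandas: labeled tables built on top of NumPy arrays
items = pd.DataFrame({"product": ["A", "B", "C", "D"],
                      "price": prices,
                      "price_with_tax": taxed})
print(items.sort_values("price_with_tax", ascending=False).head(2))
```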

Learn machine learning

Now we get to the exciting part.

There are many different techniques and tools in machine learning, but one of them has been my most used analytical tool during my years as a data science consultant: supervised learning, in both of its forms, classification and regression.

Supervised learning, also known as predictive modeling, is about learning from examples in which we know in advance the correct answer. In regression the answer is a numerical value, and in classification it is categorical.

Predictive models can be used to make demand forecasts, identify risky creditors and estimate the market price of a house among many other uses.

Here are some courses that will teach you the main framework to approach predictive modeling problems, as well as some supervised learning models:

In my experience, 3 families of models can help you solve most supervised learning problems you’ll ever encounter:

  1. Linear and logistic models (explained in the above courses) are easy to understand, easy to interpret, fast to train and reasonably accurate
  2. XGBoost (a gradient boosting trees implementation) is a top-of-the-class model in terms of precision, speed, and ease of use. However, it’s not as easy to interpret as linear models. Here’s an introduction to decision trees (pre-requisite) and a couple of articles about how XGBoost works
  3. Neural networks are great for natural language processing and image models. However, I’d leave them to more advanced data scientists since they’re more difficult to set up

Here are some examples of using linear regression in R and Python, and of using XGBoost in both languages.
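As a quick, standalone illustration in Python only, here’s a minimal sketch that fits both model families on scikit-learn’s bundled diabetes dataset (it assumes the xgboost package is installed):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor  # assumes `pip install xgboost`

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("linear regression", LinearRegression()),
                    ("xgboost", XGBRegressor(n_estimators=300, max_depth=3))]:
    model.fit(X_train, y_train)
    error = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {error:.1f}")
```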

SQL

SQL is the most used database language, and most companies use one of its variants for their databases. Even Amazon’s Athena and Google’s BigQuery can be accessed using SQL syntax.

So if you’re planning on getting a job in data science, I recommend you learn SQL, since it will be a requirement for most employers. If you’re doing personal projects, it’s up to you: for small-scale projects you’ll be fine just saving your data in text files, while for bigger projects SQL skills may come in handy.
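If you want to try SQL without installing a database server, Python’s built-in sqlite3 module is enough for a first taste; here’s a small sketch with an invented orders table:

```python
import sqlite3

# An in-memory SQLite database with an invented orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 120.0), ("bob", 35.5), ("alice", 42.0)])

# The kind of aggregation query you'll write constantly on the job
query = """
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
    ORDER BY total_spent DESC
"""
for row in conn.execute(query):
    print(row)
```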


What’s next?

Once you’ve learned the basics about R/Python and supervised learning, it’s time to practice. Do a project with open data or participate in a Kaggle competition. Or get a job as a data scientist and learn while getting paid. Practice is what will help you hone your skills and generate proof of your knowledge.

How to get a job (in data science)

In this article, I’ll give you a structured approach to getting a data science job.

In fact, I’ll be sharing all the techniques that have helped me get offers from startups and management consulting firms, along with examples of my own resume and project portfolio. Additionally, I’ll talk about what I look for when screening CVs and running interviews.

So if you want to get a job in data science, you’ll love the actionable steps in this guide.

Let’s get started.

Be eligible for a data science job

Before going into the details of job hunting, let’s get this out of the way: no amount of tricks will get you a job if you don’t have the required skills. So the first step is to learn the fundamentals: coding in a data-friendly language (ideally Python or R), some machine learning, and SQL. Those are the basics for an entry position in data science. And they can be learned for free on the internet.

If your market is very hot or you’re looking for internships, you may get hired with a technical degree (CS, math, physics, engineering) and no specialized data science knowledge.

Some work experience may also help to make you a more attractive candidate. Adjacent positions like data analyst and data engineer can help you move to a data science position.

Additionally, some domain knowledge of the industry of the companies you’re applying to will be a great asset to your job search.

You have to present your story in the best light possible

And this is true through the whole hiring process, from your resume to interviews.

When you are looking for a job, you are both the product and the salesperson. No one else but you will highlight your qualities. There are many ways to explain who you are and what kind of work you’ve done in the past. You should choose the most persuasive way in every situation. To do this focus on two main principles:

  1. Make your story as interesting as possible by being specific enough
  2. Adapt your story to your audience, highlighting what is more relevant to them

For example, when Mary is asked in an interview what she does for a living, she can say “I do customer segmentation”, and that is as true as it is boring and unspecific. But she can do better. She could say “I use algorithms to segment users according to their past purchases”. That sounds more interesting.

Moreover, if Mary is in an interview with a software engineer, she can specify that she uses a mix of SQL and Python code for her analyses. If she is talking to the marketing manager, she can explain how her segmentations helped increase the email open rate by 12%.

Additionally, she should try to use her audience’s own words. Her official job title states “Business Analyst”, but she’s using SQL and Python to do her job. If she’s applying for a “Data Analyst” position, she could say her current job is a “Data Analyst” position too.

These ideas apply to interviews as well as the wording of your resume and any other document you present.

Improve the steps of the funnel where you’re weak

A job search is like a sales funnel. You find some job postings and apply to them. Some of those applications will get you interviews. And some of the interviews will result in job offers.

By thinking of the process as a funnel, you can isolate its parts and try to optimize them separately. For example, imagine John has sent lots of applications and isn’t getting any interviews. In that case, before sending more applications John should make sure his CV is well-formatted and that he is a good fit for the positions he’s applying to.

The main parts of the job search funnel are applications, interviews, and offers. Let’s go through each step.

Applications, increasing the funnel input

The first step to getting a job is finding job postings. The main ways to do so are:

  • Asking your network, which may even let you skip directly to the interview phase
  • Online job searches, I’d suggest searching about once per week (LinkedIn has by far been the best for me)
  • Improving your LinkedIn profile and setting it as open to work
  • Local job banks
  • Company jobs pages, if you’re interested in specific companies
  • Cold emails to people in your industry

Once you’ve got some job posts, you have to decide which ones to apply for. My rule of thumb is to apply to any job you feel confident you could do, regardless of the requirements. Very often companies post job offers where it’s almost impossible to find a person who meets 100% of the requirements. If you don’t fulfill some of them but feel like you could pick them up easily on the job, just apply.

Maybe you have confidence issues and feel you may not be worthy of the job. In that case, look at what kinds of jobs your classmates got, and shoot for something at that level. If people who studied with you did it, you can do it too!

Unless you’ve applied to at least 20 positions, your best bet is to keep sending more applications. Think of it this way: if 20 people apply for a job your base probability of getting an offer is 5%. Lately, on LinkedIn, I’ve seen many postings with as many as 100 applicants.

Improving your LinkedIn profile

On my last job search, I was contacted by many recruiters who found me through LinkedIn and brought relevant offers. This will probably happen more and more as you advance through your career.

But to get noticed you have to work on your profile. Here’s what I did:

  • Create a complete profile with all your relevant work and learning experiences
  • Follow LinkedIn’s advice to improve your profile
  • Write an “about” section that sounds professional
  • Get your friends to endorse you on the necessary skills for your job search
  • Take LinkedIn’s skill certificates to make your profile stand out
  • Accept random connections, sometimes they have offers for you
  • Set your profile as “Open to work”

Successful applications and getting noticed

Once you’ve selected some job posts, you need to send the best application you can. Your objective when applying is to convince HR that you’re a good fit for the job.

Before going into which documents to send, let’s talk about referrals. This is the step in the process with the highest potential return on your effort. If you have a connection working at the company, talk to them and ask for a referral. This will bring extra attention to your application and possibly let you skip some of the required steps. Just do it if it’s available.

Whether or not you can get a referral, applying for a position involves sending some documents that show who you are. Make every document you include a PDF; Word and other editable formats may render differently on different computers.

How to write your CV

The most common document is your resume or CV (Curriculum Vitae). The objective of your CV is to communicate who you are to the recruiter. Here are a few guidelines about how to write your curriculum:

  1. Use a nice template, for example
  2. Only include a picture if you look good in it, and in that picture dress for the job you want
  3. One page, include only relevant information
  4. Make everything easy to understand
  5. Don’t get too technical (your CV will probably be filtered by someone with no technical knowledge)

Here’s the resume I used the last time I was looking for a job (2021).

In addition to your CV, you may want to include a cover letter or a project portfolio.

Cover letter

The cover letter should be about why this job is a good match for you and why you are a good match for the job. It could increase your chances of getting a positive response, especially in more formal recruiting processes such as management consulting. You can send it as an email or as an attached pdf (max 1 page). You can structure your cover letter as follows:

  1. Introduction: why you’re writing this document
  2. Why the company is a good fit for you, for example: it’s a market leader, very innovative, you personally use their products, …
  3. Why the position is a good fit for you
  4. Why you are a good fit for the position and how your previous work experience and education has prepared you for this job (try to address all the points in the job description)
  5. Conclusion

Here’s an example of a cover letter I used some time ago to get into management consulting.

Project portfolio tips

Another document you can send is a project portfolio. This is a document explaining some of the projects you’ve worked on. If you have done some projects that you’re proud of, this can make them shine. In fact, the last time I applied for a job I impressed some of the interviewers with my project portfolio.

If you do so, keep in mind the following points:

  1. Don’t make it too long: 2-3 projects, 10-15 slides max
  2. For every project, explain the technology used, the process, and the results
  3. If possible, showcase projects relevant to your potential employer (same industry, technologies, or modeling problems)
  4. Don’t assume the reader has previous knowledge of your projects, give all the necessary context
  5. Don’t share any confidential information

If you send a GitHub link, make sure to keep your profile clean and organized. Also, create a clear README file with a summary for each project. Otherwise, reviewers may not know where to start and will just skip it.

What I look for when screening applications

At my past job, I screened applications of potential candidates for data science consulting jobs. This is what I looked for in order of relevance:

  1. Evidence of proactiveness and problem-solving in previous work experience and side projects
  2. Fundamental data science skills (ML, R/Python, SQL). I personally don’t care much whether they’ve taken a master’s degree or some MOOCs
  3. Numerical and coding skills (technical degree, side projects, …)

What if you don’t get answers to your job applications?

Looking for a job can be tough, especially when companies and recruiters ignore you. Don’t despair. If you find yourself in this spot, here’s what could be happening:

  • You haven’t applied to enough jobs or have been unlucky so far, send more applications
  • Your resume is poorly formatted or difficult to understand, work on it
  • You have the skills but not the credentials, try to explain better why you are a good candidate and maybe give recruiters some proof of your skills (for example: portfolio, LinkedIn certificates)
  • You don’t have the necessary skills for the positions you’re applying to, in which case you should level up your knowledge or apply to other more suitable jobs

Proving yourself: Tests and assignments

If the company likes your application, they will contact you. At this point, some companies will assess your skills and commitment with an assignment or a test.

Tests are like exams. You will have limited time to answer a series of theoretical questions or practical exercises. Ask about it and try to prepare in advance. Doing similar tests from other companies or preparation websites will help. Also, try to schedule it for a time when you’re rested.

Here are some resources to prepare for a Data Science test:

Assignments are small projects that may take anywhere from about 2 to 12 hours. Some assignments are downright abusive. If you aren’t too interested in the job, now is the time to get out of the process.

Other assignments are interesting and fun challenges. You can take them as a chance to see what the job will be like and also test your skills.

Whatever the type of assignment, remember that it’s a relatively small investment compared to the time you’ll spend at the job if you end up getting an offer.

What to do if you are not passing the tests

Don’t worry if you get turned down after one test or assignment; flukes happen. And sometimes recruiters don’t know what they’re doing.

However, if you fail at tests repeatedly, that means you should review the theory. Try to remember which questions you missed and study those topics and adjacent ones. Also, try to practice doing tests, as that always helps.

If you’re having trouble with assignments, then it’s a matter of practice. Try to do projects on your own and explain them to friends. Reviewing projects by others can also help.

The interview

Interviews come in many shapes and forms but tend to follow a common pattern. Most of them will consist of 4 main parts: introduction, HR-type questions, technical questions, and your questions. Preparing for each of the parts will improve your odds of getting a job offer.

Before an interview, you should review the job description and make sure you understand all concepts mentioned in it. This will automatically make you a better candidate. Bonus points if you think for some time about their business and how data science can improve their bottom line.

In the first part, the interviewer will introduce herself and give some information about the company and the role. Then she will ask you to introduce yourself. You should prepare your introduction and practice it in front of a friend to project a better image of yourself.

HR-type questions are usually about your motivations, your character, and your soft skills. Some typical HR questions are:

  • Why did you decide to apply to this role?
  • Tell us about your strengths and weaknesses
  • What do your colleagues think of you?
  • Can you describe your management style?

It is impossible to have a prepared answer for all these kinds of questions. However, taking the time to prepare answers for some of them will make you better at coming up with good answers to others. Additionally, writing down a description of the impression you want to give is also a good way to prepare.

There are 4 main types of technical questions:

  • Explaining a previous project
  • Case-type questions about how you would approach a certain task
  • Theory questions (here are some examples)
  • Practical exercises

Again, the range of possible questions here is almost unlimited. If you’re applying to a big company, you may find some information about their interviewing style online. Having more experience with data science projects will give you an edge on technical questions, but you can always get blindsided by a theory question about an algorithm you’ve never used. If this happens, acknowledge your ignorance and offer another subject you could talk about instead.

Finally, when it’s your turn, asking a couple of questions will make you look interested in the job. You can spend 30 minutes googling the company before the interview to stand out from the competition by asking interesting questions.

How I run interviews

I have run data science interviews at both my current and past jobs. One was for consulting, the other for a SaaS that optimizes retail stock management. In both cases, interviews consisted of a data science business case, in which candidates have to solve business problems using analytical tools. It’s not so much about coding (in fact we don’t do live coding) as it is about problem-solving and knowing when to apply each data science technique. More specifically, what I look for when interviewing is:

  1. Problem-solving, understanding business problems and developing data-driven strategies to solve them.
  2. Communication skills, capacity to explain complex concepts in a clear and concise manner
  3. Leadership and initiative, ability and willingness to propose and run projects as well as to mentor more junior colleagues.
  4. Code craftsmanship, love for writing clear and easy to maintain code, while being conscious of the problems associated with excess complexity.
  5. Analysis depth, for example by identifying confounding variables and getting to the root cause of issues.

Just keep in mind that this is based on my personal opinions and what my company needs. Other interviews may be different.

How to get better at interviews

Many people struggle with interviews. The good news is, practice can help a lot.

Practice your introduction in front of the mirror until it’s perfect. Create or get some interview scripts, and get a friend or relative to interview you. Or if you’re still in college, maybe get together with some other students to interview each other.

After 5-10 mock interviews, you will be more articulate and more confident in yourself.

If you feel like anxiety is an issue in interviews, try to do breathing exercises to relax before going in.


Tying it up 

The job search doesn’t end when you get the offer, but when you sign the contract. Now it’s time to negotiate the terms. If there is something you don’t like, you have the right to say so and try to reach a different agreement. This can range from salary to vacation days.

Words of encouragement

Finally, the most important thing to keep in mind is that getting a job, like many things in life, is a numbers game. No amount of effort and skill will guarantee that you get the job. They may forget about your CV, the position may be filled by the CEO’s nephew, or the recruiter may be an ass.

Additionally, for every posting there may be lots of applicants. So don’t lose faith and don’t let rejection bring you down. Even if it’s tedious and takes a long time, in the end it’s still worth it.

Say you send 50 applications and each of them takes you about one hour. Then you interview with 10 companies, spending an average of 3 hours on each. That’s a total of 80 hours, which is about 4-5% of what someone with a full-time job works in a year. If you get a 10% raise, it’s a great investment. If you get 5%, better conditions, or a more fulfilling job, it’s still well worth it.

So, go get it!

How to estimate the impact of algorithms

You’ve just finished training a credit risk tree model with a whopping 0.57 AUC score, and you feel great. And you should. But let’s dig deeper. How much better will this model be than using no model? Or than using the previous model, which had an AUC of 0.48?

Have you ever wondered what the impact of an algorithm you are building is? How much money you are making for your company? How many lives are our campaigns saving?

Every member of an organization should know how their actions contribute to the organization’s goals. This allows them to prioritize and be more efficient in their work.

The impact of an algorithm is tied to the actions it enables

To estimate the impact of an algorithm, first we need to define a metric. This will usually be money, because it’s the main means of value exchange and one of the main goals of businesses. However, depending on the nature of your project, you can use metrics such as lives or time saved.

To estimate the impact of an action, we have to calculate the difference of our metric between two different scenarios:

  1. The current outcome (measured in the metric we’ve defined). This can be 0 if nothing can be done without the algorithm
  2. The outcome we expect to get by using the algorithm instead

Building simplified models of the situation will allow us to make estimations of the impact. This is similar to how we would build a business case.

Let’s make it clearer with an example

STL limited (Short Term Loans) is a credit company that gives 1-year loans. This is how their business is going:

  1. They give loans at a 10% interest rate to everyone that applies for one
  2. Their default rate is 10% (percentage of customers that don’t pay back all the money they owe)
  3. Customers that default had paid back an average of 30% of the loan amount before defaulting
  4. The average loan amount is $1,000
  5. Every year, 100,000 new customers apply for a loan

With this information, we can estimate how much they currently earn per year using a simple Excel spreadsheet. We will first estimate the expected earnings per non-defaulting customer (NDC) and per defaulting customer (DC). After this, we will combine those estimations to calculate the expected value per customer using conditional probabilities.
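The spreadsheet is the original tool here, but the same base-case arithmetic fits in a few lines of Python. This sketch treats the 30% repaid before default as pure principal recovery (no interest), which reproduces the $2M base-case figure used later:

```python
# Base-case assumptions taken from the bullet list above
interest_rate = 0.10            # earned on loans that are fully repaid
default_rate = 0.10             # share of customers who don't pay everything back
repaid_before_default = 0.30    # fraction of the principal repaid by defaulters
loan_amount = 1_000             # dollars
customers_per_year = 100_000

# Expected earnings per customer, weighted by the two outcomes
earn_if_repaid = loan_amount * interest_rate                   # +$100
loss_if_default = -loan_amount * (1 - repaid_before_default)   # -$700
expected_per_customer = ((1 - default_rate) * earn_if_repaid
                         + default_rate * loss_if_default)     # $20

print(expected_per_customer * customers_per_year)  # $2,000,000 per year
```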

Building a model to improve earnings

John, the lead data scientist at STL limited, has developed a probability of default model. He has trained it using customer employment data that was collected anyway for regulatory reasons.

John uses the model to make predictions on a holdout set (a dataset that the model has never seen before). He then divides the customers into four groups of the same size based on the probability of default predictions. The following table shows the probability of default for each of the groups:

Modifying the default rate in the previous spreadsheet, we can estimate the expected earnings per customer for each of the groups:

The average customer in group 4 loses money for STL limited.

How much would STL limited earn if they only gave credit to people in groups 1, 2 and 3?

By only giving credit to customers with positive expected earnings, STL could make a total of $3.5M per year. This means that the model would have an impact of $1.5M ($3.5M minus the $2M of the base case).

Wrapping it up

This impact estimation method is based on simplification, and it leaves out second-order consequences of the actions. Additionally, future performance isn’t guaranteed to match past performance. To account for these sources of uncertainty, I generally multiply the impact estimate by a conservative factor of 50-80%.

Nevertheless, the objective of these estimations is not perfect accuracy but getting a ballpark figure that will allow us to compare and prioritize.


When is a master’s degree the right way into Data Science?

You are a student finishing your bachelor’s degree and unsure of what to do next. Or perhaps you are a professional considering a change of industry. You may even be a PhD who wants to transition into the private sector. Whatever your background, what concerns you now is how to make this change. And you wonder: should I study a data science master’s? Will it be worth it?

Since you want to get a data science job, enrolling in a master’s program sounds like a logical step. A master’s will give you both the knowledge needed to do the job and the credentials to get it. However, it is also costly. Tuition fees vary from a few thousand dollars to several tens of thousands. Additionally, you will have to invest a year or two in it, potentially going without a job during that time.

Since it’s a very important decision, you consider the alternatives. Is there a better way to learn data science? Maybe self-learning and some online courses? Or maybe another data-related position where you can learn on the job and transition later?

In the rest of this article, I will compare the master’s and self-learner routes and highlight when one is better than the other.

Let me tell you a couple of stories

With one year left to finish my degree (Math + Civil Engineering), I became interested in data science. That year I only had to write my final thesis, so I had a lot of free time. I used that as an opportunity to get into data science by:

  1. Doing a machine learning project as my thesis with the help of some online courses and books
  2. Joining a local analytics consulting firm for an internship, where I learned about SQL and databases

That made me a great candidate for entry-level data science positions, and I got my first full-time job right after finishing my thesis.

My wife Anna enrolled in a master’s in statistics and operations research right after finishing her bachelor’s in mathematics. Since finishing her master’s, she has held a couple of data scientist jobs and is now a biostatistician. Doing a master’s was a great decision in her case: it led to job offers and to meeting great friends.

As illustrated by these examples, both ways can work. Which one is best will depend on your personal situation and preferences.

Why should you study a master’s in data science?

Certification. A master’s degree is a recognizable badge and will make it easier for you to get interviews. Recruiters and HR professionals value it highly. Data scientists don’t value it as much: according to a recent Kaggle survey, only 20-30% of data scientists hold a master’s degree. If you decide to skip the master’s, there are other ways to get proof of your skills, such as:

  • Data science competitions (Kaggle, local hackathons …)
  • LinkedIn skill assessments
  • Personal projects and open-source contributions
  • Experience in adjacent fields (data analyst, data engineer)

Peers. Another advantage of a master’s degree is that you will have a class of like-minded people to study with. During the degree you can have fun together; afterwards, they become a valuable network of professionals who can help each other. Self-learners won’t have the same camaraderie as classmates. However, you can join online communities as well as local meetups and study groups to network and socialize.

Convenience. The final major advantage of a master’s degree is that it’s simply easier to follow from start to finish. Most people find it a lot easier to commit to a habit once they’ve paid for it or given it some formal structure. Finding the discipline and motivation for consistent self-study is hard. If you struggle with it, you can try some of these tricks:

  • Learn with a friend
  • Allocate a certain time of the week to it
  • Find a way to track your progress and give yourself a sense of accomplishment

What are the advantages of self-learning?

Flexibility. Self-study lets you advance at your own pace, from wherever you are, and skip subjects you find boring or uninteresting. This was very important in my case, as I have always struggled with things I find boring. Some master’s programs aren’t as rigid as they used to be, but they’re still nowhere near as flexible as self-learning.

Cost. This one is obvious: the cost of learning data science on your own will be close to zero. You may spend some money on a couple of books and online courses. Master’s degrees, on the other hand, are very expensive unless education is heavily subsidized where you live.

Quality of education. I know this one may come as a shock to some, but self-learning lets you pick and choose the best materials from different sources, while a master’s commits you to a single program. If you are unsure about the best books and courses, worry not: online forums like Reddit and Stack Overflow will answer your questions, and blogs like this one will try to point you in the right direction. Moreover, bloggers and Kaggle winners regularly share their experience and tips, something many teachers won’t do. So even if you decide a master’s is the better option for you, it’s good to stay plugged into the online community.

And finally, let’s talk about salary. The Kaggle survey found no significant salary differences between those who had a master’s and those who didn’t. So we can call this one even.

Conclusion

As with most things in life, whether or not studying a DS master’s is a good idea will depend on your situation and preferences.

Study a master’s if you really value the certification or having classmates, or if you aren’t confident in your discipline to do it solo.

Self-learn if you value the flexibility or money is an issue.
