When I learned data science I didn’t know where to start, so I wasted many hours learning only tangentially useful stuff. Now, after more than five years as a data science consultant, I know what I would’ve done differently. In this article, I will offer you a roadmap on how to self-learn data science, with links to useful resources.
Data science pre-requisites
Even though I believe everyone can learn data science, those with a technical background will have a head start. Before getting into DS-specific subjects, it is useful to have some grounding in mathematics, statistics and probability.
It is not necessary to be an expert in any of those, but you need a solid foundation. If you’ve never studied any of those, don’t worry, I’m here to help. In the following paragraphs, I’ll briefly describe each prerequisite and link to educational resources.
Mathematics for data science
To get started with data science you need to get familiar with some of mathematics’ most common objects. These Khan Academy lessons about vectors, matrices and functions are a good place to start. Also, here’s the summary (in more formal mathematical language) of a Stanford course. These concepts are the building blocks of most machine learning algorithms and provide you with a framework for structuring data. Reaching this level of mathematics will let you understand and use algorithms that others have invented and implemented, and get results with them.
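To make these objects concrete, here is a minimal sketch in Python using NumPy (a library discussed later in this article). The numbers, the variable names and the pricing rule are made up for illustration; the point is only to show a vector, a matrix and a function working together.

```python
import numpy as np

# A vector: an ordered list of numbers, e.g. the features of one house
# (size in square meters, number of rooms). Values are illustrative.
house = np.array([120.0, 3.0])

# A matrix: a table of numbers, one row per house and one column per feature.
houses = np.array([
    [120.0, 3.0],
    [80.0, 2.0],
    [200.0, 5.0],
])

# A function: a rule that maps inputs to an output. This hypothetical
# pricing rule weights each feature and adds the results together.
weights = np.array([1000.0, 5000.0])

def price(features):
    """Return the dot product of the features with the weights."""
    return features @ weights

single_price = price(house)    # one number for one house
all_prices = houses @ weights  # one number per row of the matrix

print(single_price)
print(all_prices)
```

The same matrix-times-vector operation sits at the heart of linear models, which is why these objects are worth knowing well.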
If you really like mathematics, you can dive deeper by taking full calculus and linear algebra courses. This requires a lot more work, but it unlocks a more complete understanding of the inner workings of machine learning algorithms and of how to implement and adjust them.
Probability and statistics
Probability lies at the core of a data scientist’s view of the world. When dealing with big numbers and random events, probability and statistics provide the tools to make sense of them. It isn’t only about the exact methods or formulas, but also about developing a probabilistic intuition. These courses from Khan Academy on probability and statistics are both beginner-friendly and have all the information you’ll need. Here is a mathematically formal summary of a probability course from Stanford.
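One way to build that intuition is to simulate random events and watch the long-run frequencies settle down. The sketch below uses only Python’s standard library; the helper name `estimate_probability` and the choice of experiment (two dice summing to 7) are mine, picked for illustration.

```python
import random

def estimate_probability(n_trials, seed=0):
    """Estimate P(two dice sum to 7) by simulating n_trials rolls."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            hits += 1
    return hits / n_trials

exact = 6 / 36  # 6 of the 36 equally likely outcomes sum to 7

for n in (100, 10_000, 100_000):
    print(n, estimate_probability(n), "exact:", round(exact, 4))
```

With few trials the estimate can be noticeably off; with more trials it tends to get closer to the exact value, which is the law of large numbers at work.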
In addition to formal education in probability and statistics, reading non-fiction books can also help you develop an intuition. I recommend the following books, in no particular order: Thinking, Fast and Slow; Factfulness; Thinking in Bets; Fooled by Randomness (or any of Nassim Taleb’s books).
Finally, reading about statistical paradoxes will help you make sense of data when you face unintuitive conclusions.
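Simpson’s paradox is one well-known example: a trend that holds within every subgroup can reverse when the subgroups are pooled. The sketch below shows it with plain Python. The counts are hypothetical, chosen only to make the effect visible.

```python
# Hypothetical (successes, attempts) for two treatments in two patient groups.
results = {
    "mild cases":   {"A": (8, 10),   "B": (70, 100)},
    "severe cases": {"A": (20, 100), "B": (1, 10)},
}

def rate(successes, attempts):
    """Return the success rate as a fraction between 0 and 1."""
    return successes / attempts

# Within each group, treatment A has the higher success rate.
for group, by_treatment in results.items():
    for treatment, (s, n) in by_treatment.items():
        print(group, treatment, round(rate(s, n), 2))

# Pooling the groups reverses the ranking, because A was mostly given
# to severe cases and B mostly to mild ones.
overall = {}
for treatment in ("A", "B"):
    s = sum(results[g][treatment][0] for g in results)
    n = sum(results[g][treatment][1] for g in results)
    overall[treatment] = rate(s, n)
    print("overall", treatment, round(overall[treatment], 2))
```

The lesson is to check how the data was grouped before trusting an aggregate comparison.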
Data-oriented programming language
A big part of a data scientist’s job is reading, manipulating and running analysis on data. This is usually done by coding in a data-oriented language. These languages allow us to write instructions for a computer to execute. Even though there are many different programming languages, most of them use very similar structures. The two most popular data-oriented programming languages are Python and R, and you can start with either one. If at some later point you work with people using the other one, you can use that as an opportunity to learn it.
If you’ve never coded before, don’t worry. Both of them can be a good first point of contact with programming. A lot has been written about which one is better, but the truth is they have different strengths.
R’s strong points are:
- It is designed for data and statistical work, so manipulating data is easier
- There is a vast universe of statistics libraries
- The Shiny library makes it very easy to make a web app with no previous web design experience
- RStudio is a wonderful IDE (I haven’t found one that I like as much for Python)
Python’s strong points are:
- It’s a general-purpose programming language as well as one of the most popular languages overall
- It usually runs faster than R
- It has better packages for deep learning
I personally prefer R because of its more compact syntax in the data.table package and also because I have more experience with it.
Learning R
If you are new to programming, I recommend you start with one of these resources:
- Swirl, an interactive R tutorial on the console
- A complete video tutorial
If you have been coding for a while, you can get the basics with learn R in Y minutes.
Once you know the basics, it’s time to learn one of the two main data manipulation libraries: data.table (my personal favorite) or dplyr. Another useful library is ggplot2 for making beautiful graphics.
Learning Python
If Python is your first programming language, you can start with any of these:
- Automate the boring stuff, a great book
- Learnpython’s tutorials
- This extensive video tutorial
If you’re already familiar with coding you can just read this documentation.
And once you’ve mastered Python’s basics, you can move on to the specialized tools for manipulating data: pandas and NumPy. Here’s a tutorial and here’s a video to help you learn those packages.
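As a taste of what these packages look like in practice, here is a small sketch with pandas and NumPy. The table contents and column names are invented for illustration.

```python
import numpy as np
import pandas as pd

# A small, made-up sales table; in practice you would usually load one
# from a file, for example with pd.read_csv.
sales = pd.DataFrame({
    "city": ["Lima", "Lima", "Quito", "Quito", "Quito"],
    "units": [10, 15, 7, 3, 5],
    "unit_price": [2.0, 2.0, 3.0, 3.0, 3.0],
})

# Add a column computed from existing ones (vectorized, no explicit loop).
sales["revenue"] = sales["units"] * sales["unit_price"]

# Filter rows with a boolean condition.
big_orders = sales[sales["units"] >= 5]

# Aggregate by group: total revenue per city.
revenue_by_city = sales.groupby("city")["revenue"].sum()

# NumPy functions work directly on pandas columns.
mean_units = np.mean(sales["units"])

print(big_orders)
print(revenue_by_city)
print(mean_units)
```

Creating columns, filtering rows and aggregating by group cover a large share of everyday data manipulation.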
Learn machine learning
Now we get to the exciting part.
There are many different techniques and tools in machine learning. One of them has been my most-used analytical tool during my years as a data science consultant: supervised learning, in both of its forms, classification and regression.
Supervised learning, also known as predictive modeling, is about learning from examples for which we already know the correct answer. In regression the answer is a numerical value, and in classification it is a category.
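To illustrate the idea of learning from labeled examples, here is a deliberately simple sketch in plain Python. It uses a one-nearest-neighbor rule (predict the answer of the most similar known example), which I chose only because it fits in a few lines; it is not one of the models recommended later in this article. The data and the function name `predict_nearest` are made up.

```python
def predict_nearest(examples, new_input):
    """Return the known answer of the example whose input is closest.

    examples is a list of (input, answer) pairs with numeric inputs.
    """
    _, closest_answer = min(
        examples, key=lambda pair: abs(pair[0] - new_input)
    )
    return closest_answer

# Regression: the answer is a number (house size in square meters -> price).
price_examples = [(50, 100_000), (80, 160_000), (120, 240_000)]
predicted_price = predict_nearest(price_examples, 85)

# Classification: the answer is a category (yearly income -> credit risk).
risk_examples = [
    (15_000, "high risk"),
    (40_000, "low risk"),
    (90_000, "low risk"),
]
predicted_risk = predict_nearest(risk_examples, 20_000)

print(predicted_price)
print(predicted_risk)
```

The same function handles both cases; the only difference is whether the known answers are numbers or categories.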
Predictive models can be used to make demand forecasts, identify risky borrowers and estimate the market price of a house, among many other uses.
Here are some courses that will teach you the main framework to approach predictive modeling problems, as well as some supervised learning models:
In my experience, three families of models can help you solve most supervised learning problems you’ll ever encounter:
- Linear and logistic models (explained in the above courses) are easy to understand, easy to interpret, fast to train and reasonably accurate
- XGBoost (an implementation of gradient-boosted trees) is a top-of-the-class model in terms of accuracy, speed and ease of use. However, it’s not as easy to interpret as linear models. Here’s an introduction to decision trees (a prerequisite) and a couple of articles about how XGBoost works
- Neural networks are great for natural language processing and image models. However, I’d leave them to more advanced data scientists since they’re more difficult to set up
Here are some examples of using linear regression in R and Python, and of using XGBoost in both languages.
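For a quick feel of what fitting a linear regression looks like in Python, here is a minimal sketch using NumPy’s `polyfit` on made-up data that roughly follows a straight line. It is my own illustration, separate from the linked examples, and a real project would more likely use a dedicated modeling library.

```python
import numpy as np

# Made-up training data: x is the input, y the known numeric answer.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Fit the line y = slope * x + intercept by ordinary least squares
# (a degree-1 polynomial fit returns the slope first, then the intercept).
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict the answer for an unseen input.
x_new = 6.0
prediction = slope * x_new + intercept

print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```

The slope is what makes linear models easy to interpret: it says how much the prediction changes when the input increases by one unit.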
SQL
SQL is the most used database language, and most companies use one of its variants for their databases. Even Amazon’s Athena and Google’s BigQuery can be queried using SQL syntax.
So if you’re planning on getting a job in data science, I recommend you learn SQL, since it will be a requirement for most employers. If you’re doing personal projects, it’s up to you. For small-scale projects, you can simply save your data in text files. For bigger projects, SQL skills may come in handy.
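If you have never seen SQL, the sketch below gives a feel for its syntax. It uses Python’s built-in `sqlite3` module with an in-memory database so it runs without any setup. The table, columns and rows are hypothetical, and other SQL variants differ in some details.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("luis", 12.5)],
)

# A typical analytical query: total spent per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The `SELECT ... FROM ... GROUP BY ... ORDER BY` pattern carries over almost unchanged to most SQL databases.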
What’s next?
Once you’ve learned the basics of R or Python and of supervised learning, it’s time to practice. Do a project with open data or participate in a Kaggle competition. Or get a job as a data scientist and learn while getting paid. Practice is what will help you hone your skills and generate proof of your knowledge.