When you start learning, it’s very hard to have a clear direction. You often waste time on uninteresting, useless, or outdated topics. You wander and run in circles.
However, once you’ve mastered the topic, it’s easy to look back and see the fastest path from noob to pro. If only you could go back in time and give yourself the roadmap… Even if I cannot do that for myself, I can do it for others. This is the objective of this article: to give you the tips I wish I had known when I started learning data science and machine learning.
To build this list, I first wrote down what has been useful to me in my experience as a data scientist. Then I went to Reddit for help curating and completing the list, and the post got 300+ upvotes and 35+ comments. I hope you find it helpful!
1. Get solid foundations in mathematics, probability, and statistics
Mathematics and statistics are at the core of machine learning. So it will be very difficult to understand machine learning algorithms if you don’t know the building blocks.
However, this doesn’t mean you need to be a math wizard. You should understand math and stats concepts such as vectors, matrices, derivatives, probability distributions, independent variables, and standard deviation. More advanced mathematics (like learning to prove theorems) won’t help you much when studying machine learning, even though it can be a lot of fun.
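To make a few of these concepts concrete, here is a minimal sketch in plain Python. The numbers and the helper names (`dot`, `matvec`, `derivative`) are made up for illustration; in practice you would probably reach for a numerical library instead.

```python
import math
import statistics

# Two small vectors represented as plain Python lists (made-up values).
u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def derivative(f, x, h=1e-6):
    """Central-difference approximation of the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# A matrix that doubles every component of the vector it multiplies.
scale_by_two = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

dot_uv = dot(u, v)                         # 1*4 + 2*5 + 3*6 = 32.0
scaled_u = matvec(scale_by_two, u)         # [2.0, 4.0, 6.0]
slope = derivative(lambda x: x ** 2, 3.0)  # approximately 6, since d/dx x^2 = 2x
spread = statistics.pstdev(u)              # population standard deviation of u
```

If you can predict each of these results by hand before running the code, you already have most of the foundations you need to follow an introductory ML course.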
2. Learn either Python or R and learn them well
When doing data science and machine learning, you will spend most of your time coding in R/Python. So it’s important to learn the ins and outs of your language of choice.
Data scientists spend a lot of time cleaning and manipulating data, so you should give special attention to data manipulation libraries. The most popular ones are Pandas for Python and data.table and dplyr for R.
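As a small taste of what this looks like in Pandas, here is a sketch that drops incomplete rows and aggregates the rest. The `orders` table and its column names are invented for this example.

```python
import pandas as pd

# Made-up orders table; the column names are illustrative only.
orders = pd.DataFrame(
    {
        "customer": ["ana", "ben", "ana", "cy", "ben"],
        "amount": [20.0, 35.0, None, 15.0, 5.0],
    }
)

# Drop rows with a missing amount, then total the spend per customer
# and sort from the biggest spender to the smallest.
clean = orders.dropna(subset=["amount"])
spend = (
    clean.groupby("customer", as_index=False)["amount"]
    .sum()
    .sort_values("amount", ascending=False)
    .reset_index(drop=True)
)

print(spend)
```

Filtering, grouping, aggregating, and sorting cover a large share of day-to-day data work, so these are good operations to practice first.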
3. Learn good programming practices
Writing clean and efficient code will make it easier to share your work with others. And even if you work alone, it will make it easier for you to debug and maintain your own code. Entire books have been written about this, so I’ll just give you a short list:
- Use consistent and descriptive names for variables, columns, and functions
- Don’t repeat code; use functions or classes if you need to do the same process multiple times
- Understandable code is better than compact code: 10 lines everybody understands beat 2 lines nobody understands
- Don’t overoptimize your code at the start, but know where the bottlenecks (parts that won’t work well if you increase the volume of data) are in case you need it to scale
- Use consistent indentation and try to limit line length
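Here is one way the first few points could look in practice. The function name `normalize` and the sample columns are hypothetical, chosen only to illustrate descriptive naming and reuse.

```python
def normalize(values):
    """Rescale a list of numbers to the 0-1 range (min-max scaling)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# The same helper is reused for each column instead of copy-pasting the formula.
ages = [18, 30, 42]
incomes = [1000, 3000, 2000]

normalized_ages = normalize(ages)        # [0.0, 0.5, 1.0]
normalized_incomes = normalize(incomes)  # [0.0, 1.0, 0.5]
```

If the scaling logic ever needs to change, there is now exactly one place to fix it.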
4. You don’t need to learn all the different supervised learning models
This is one I struggled with. When I started learning I thought that every situation would need a different type of model and that I needed to learn them all to be well equipped. But this is far from true. Linear/logistic regression is surprisingly effective for tabular data problems. And XGBoost or random forest will help you if you have a lot of non-linearities. Artificial neural nets are great for image and NLP problems but are otherwise overkill and more difficult to set up.
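To show how little machinery a linear baseline needs, here is a sketch of one-feature linear regression fitted with the closed-form least squares solution. The data is made up and the function name `fit_line` is my own; a real project would more likely use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 4.0 + intercept  # predicted y for x = 4
```

A baseline like this takes minutes to build and gives you a reference score that any fancier model has to beat.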
Additionally, you don’t have to keep up with all the published papers. Most staple techniques in the industry are decades old. If you ever face a truly unusual problem, that may be a good moment to dive into the literature.
5. Once you know the basics and understand them well, it’s mostly about doing projects
After completing one or two ML courses, don’t spend your time on more theory; dive straight into doing some projects. If you’re lacking some knowledge, you can pick it up along the way.
Working on projects puts your knowledge into practice and helps you figure out whether you really understood everything. Additionally, by doing projects you build valuable experience that will help you get hired later on.
6. Doing tutorials and reviewing other people’s projects is very helpful at the start
When you’re learning a new tool or model and don’t feel confident about using it on your own, looking at an example is a great way to get some inspiration.
7. You can learn everything online for free, but some paid resources can be helpful
For example, studying for a master’s will give you credentials and a class of peers. I’ve actually written a full article about self-learning vs studying for a master’s.
Additionally, some useful online resources are paid. I have personally tried to distill my years of experience as a data scientist into Data Projects, a product to learn data science by doing real-world projects. I hope it can help others as much as it would’ve helped me.
8. Explaining your work to others is a great way to consolidate your knowledge
It’s also a great way to work on your communication. You can do this by telling your friends, blogging, or making YouTube videos. This will be a crucial skill when working with others.
9. Don’t despair if you don’t get it right
Nobody gets it right the first time. Trial and error is the way to go, especially in fields like this one, where there is no single exact solution.
10. Lean on online communities
The internet is full of helpful and generous people. If you’re struggling with something, search first, and if you don’t find the answers, ask in the forums (Reddit or Stack Overflow).
11. Learn more about your problem domain
Don’t focus only on the purely technical; try to understand what is really behind the problems you’re modeling. It will help you decide which error metric best fits the problem, select the most insightful variables, and communicate with non-technical stakeholders in their own language.
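The choice of error metric is a good example. In this sketch, two made-up models have the same mean absolute error, but the root mean squared error punishes the one that is occasionally very wrong. Whether that matters depends entirely on the domain. The function and variable names are my own.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the errors, treating all errors proportionally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; large errors weigh more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


actual = [10.0, 10.0, 10.0, 10.0]
steady_model = [12.0, 12.0, 12.0, 12.0]  # always off by 2
spiky_model = [10.0, 10.0, 10.0, 18.0]   # right three times, off by 8 once

mae_steady = mean_absolute_error(actual, steady_model)       # 2.0
mae_spiky = mean_absolute_error(actual, spiky_model)         # 2.0
rmse_steady = root_mean_squared_error(actual, steady_model)  # 2.0
rmse_spiky = root_mean_squared_error(actual, spiky_model)    # 4.0
```

If one large miss is much more costly than several small ones in your domain, a metric like RMSE reflects that, while MAE treats both models as equally good.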
12. Work with messy data
Don’t just stick to problems with pre-cleaned data. The world is messy, and having some experience cleaning and structuring data will prepare you for future challenges.
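To give an idea of what “messy” means, here is a small sketch using only the Python standard library. The raw export, its columns, and the `parse_income` helper are invented for illustration; real data will have its own quirks.

```python
import csv
import io

# Made-up raw export with stray whitespace, inconsistent casing,
# a missing value, and a thousands separator.
raw = """name,city,income
 Ana ,madrid,"1,200"
ben,MADRID,
Cy, Paris ,950
"""


def parse_income(text):
    """Return the income as a float, or None when the field is empty."""
    text = text.strip().replace(",", "")
    return float(text) if text else None


rows = []
for record in csv.DictReader(io.StringIO(raw)):
    rows.append(
        {
            # Trim whitespace and normalize casing so equal values compare equal.
            "name": record["name"].strip().title(),
            "city": record["city"].strip().title(),
            "income": parse_income(record["income"]),
        }
    )
```

After cleaning, "madrid" and "MADRID" become the same city, and the missing income is an explicit `None` rather than an empty string that could silently break a calculation later.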
13. Work on what makes you curious; that will keep you motivated
Following your curiosity and your passions will make sure you don’t abandon your path to becoming a data scientist halfway through. Additionally, it makes the whole learning experience a lot more fun!