The Ouroboros King, big update

These posts haven’t been getting much traction, so I haven’t made one in what feels like an eternity. Since I last posted, I have made lots of improvements to the game:

  • Improved the map, both visually and in its inner workings; it now offers more paths
  • Added some juice by improving some of the game’s visual effects, adding a slight screen shake, and adding blood stains on the board
  • Added a lot more pieces, relics, and items
  • Prepared the infrastructure for the full game (3 chapters + final boss)

You can check out all these improvements on the Steam demo and on itch.

Additionally, I’ve participated in a couple of festivals, getting to ~250 wishlists. Assuming I double the number of wishlists by launch, the wishlist conversion rate is 20%, and the price is 10€, the game will make ~1,000€, of which I’ll probably keep about half after Steam’s cut and taxes. This figure is way below my initial expectations, but it’s only my first commercial game, so I’ll still be happy. Even so, I’ll do what I can to get it to more people.

My plan is to release it in February. Before launch, the plan goes as follows:

  • Finish the game by the end of December. This includes a new type of map location involving sacrifices in exchange for more powerful units and relics, completing the game lore (I have a draft in my head but it needs some ironing), and populating the final stages (90% of the content is made, I just need to tell the game where to show it)
  • Run a beta to gather some feedback. I’ll get players for the beta from alphabetagamer and some specific subreddits. My plan is to credit and give steam keys to everyone that makes a contribution to improving the game
  • Contact streamers (hopefully by mid-January) to start generating some interest in the game
  • Participate in February’s Steam Next Fest
  • Release a week after Next Fest

So that’s that. Hopefully, the plan works out and I can be here in a couple of months talking about a successful launch 😉

HS Battlegrounds, optimizing your late game Naga board (post-nerf)

In May 2022 the Naga tribe was introduced to HS Battlegrounds. From the start, the tribe was completely OP, with decent early-game units and crazy late-game scaling. Since then they’ve been nerfed twice, lowering both the initial stats and the scaling potential of some minions. In this post I’ll help you build a Naga board optimized for scaling, using the tools of numerical analysis.

The growth engine

This scaling is thanks to growth engines that interact with spells and the new Spellcraft mechanic. There are many Naga that scale when you play spells, but not all of them are equally effective. Here are the scaling Nagas, in decreasing order of effectiveness:

  • Tidemistress Athissa is not as OP as it used to be, but it’s still very strong. If you get 5 procs (quite a conservative amount: 4 Spellcrafts on board and cycling 2 extra spells), that is +18/+18 on your board, more than a golden Lightfang with 4 tribes or a Charly and a Pumba. Note that Athissa procs on all spells, including coins, blood gems and discovers from triples. We’ll compare the other minions to Athissa.
  • Critter Wrangler has half the scaling of Athissa on Spellcrafts and none on other spells. All in all, this will be ~40% as effective as Athissa, depending on whether Quilboar are in the lobby and the number of triples you get.
  • Eventide Brute (after you cast a spell, gain +1/+1). ~33% of Athissa’s scaling and it gets all the buffs, making it more vulnerable to poison/Leeroy.
  • Lava Lurker (the 1st Spellcraft spell cast on this each turn is permanent). The best spell you can use on it is Shoal Commander’s one, which gives it +7/+7 assuming you have 7 Nagas. If you optimize your setup for the Lurker and get 1 golden Lurker and 2 golden Commanders, you could get +28/+28 scaling per turn, which is still below the conservative estimate for Athissa. All in all, Lava Lurker can help you in the mid-game, but it falls short as a scaling engine.
  • Corrupted Myrmidon (Start of combat: double this minion’s stats). It doesn’t grow on its own but utilizes buffs better than other minions. Assuming you get all Athissa procs on it, you’ll get an extra ~25% plus you can double the stats from gems. If you have Critter Wrangler instead, you’ll double its efficiency on spells from hand. Another bonus is that it gives you a lot of tempo if you already have some Spellcrafts to buff it. As with Eventide Brute, concentrating buffs on this will make you susceptible to poison and Leeroy.

The clear winner by a wide margin is Athissa. In its absence, you can try to survive with a combination of Wranglers, Brutes, Corrupted Myrmidons and Lava Lurker.

Spellcraft minions

There are 7 Spellcraft minions, 6 of which are Naga and the other one gives you Nagas. Let’s analyze them:

  • Orgozoa, the Tender is not a Naga, but it procs Athissa and also gives you more Nagas to round out your composition or proc Athissa again. Once you have 4 Naga on the board, this gives you the best scaling, since it can discover more spells for extra procs.
  • Glowscale is great for combat, giving you the ability to DS your biggest minion.
  • Other Spellcraft minions. They offer a moderate amount of stats and taunt/windfury. They can be useful in helping you survive while you get your growth engine, but they won’t help you scale as much as Orgozoa, and their buffs aren’t as significant as DS in the late game. The best of them in terms of stats is Shoal Commander. However, even a golden Commander will give +14/+14 in combat stats, which can easily be outclassed by one or two turns of scaling with Athissa. The only case where it’s relevant, and even necessary, is when you include Lava Lurker in your composition.

The ideal composition

Once we know the pieces of the puzzle, it’s time to think about the best way to assemble it. How many Spellcraft minions should we get? Is Lava Lurker worth it?

To analyze the composition, I’ve simulated the number of +1/+1 buffs we get for many different board combinations. These simulations make the following assumptions:

  • We have 6 “stable” minions that we are growing and 1 flex slot that we use to rotate spells
  • 3 played spells per turn from the shop (Spellcraft, coins, gems, discovers)
  • 80% of the spells are Spellcraft, and 20% are other types
  • We have a maximum of 1 Corrupted Myrmidon (or a golden one), which gets an equivalent of an extra 80% of the Critter Wrangler procs (you may put DS on other minions or use the discover from Orgozoa) and 20% of the Athissa procs
  • We have a maximum of 1 Lava Lurker (or a golden one) and it gets +7/+7 each turn (+14/+14 if golden), equivalent to having 1 Shoal Commander (2 if golden) and 7 Naga on board

With this in mind, we can calculate the number of procs as follows:

Spells cast = Other spells + Spellcraft minions

Athissa procs = Spells cast * (3 * Athissa + 6 * golden Athissa)

Critter Wrangler procs = 80% * Spells cast * 80% * (1.5 * Critter Wrangler + 3 * golden Critter Wrangler)

Eventide Brute procs = Spells cast * (Eventide Brute + 2 * Golden Brute)

Corrupted Myrmidon procs = (20% * Athissa procs + Critter Wrangler procs) * (Corrupted Myrmidon + 1.5 * Golden Corrupted Myrmidon)

Lava Lurker procs = 7 * Lava Lurker + 14 * Golden Lava Lurker

Procs = Athissa procs + Critter Wrangler procs + Eventide Brute procs + Corrupted Myrmidon procs + Lava Lurker procs

The best composition gets an equivalent of 104 +1/+1 procs per turn and consists of 2 golden Athissa, 2 golden Critter Wrangler, 1 golden Myrmidon and 1 Spellcraft minion.

The best composition without golden Athissa gets an equivalent of 79 +1/+1 procs and consists of 3 golden Wranglers, 1 golden Corrupted Myrmidon and 2 Spellcraft minions.

The best composition without any golden minions gets an equivalent of 46 +1/+1 procs and consists of 2 Athissa, 1 Critter Wrangler, 1 Corrupted Myrmidon and 2 Spellcraft minions.
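To make the formulas above concrete, here is a minimal Python sketch that transcribes them directly. The default of 3 other spells per turn comes from the assumptions listed earlier; since the exact spell-count assumptions of my simulation aren’t fully spelled out here, treat the totals as approximations that land close to (but not exactly on) the figures quoted above.

```python
def procs_per_turn(athissa=0, g_athissa=0, wrangler=0, g_wrangler=0,
                   brute=0, g_brute=0, myrmidon=0, g_myrmidon=0,
                   lurker=0, g_lurker=0, spellcraft_minions=0, other_spells=3):
    # Direct transcription of the formulas above; arguments are copy counts
    spells_cast = other_spells + spellcraft_minions
    athissa_procs = spells_cast * (3 * athissa + 6 * g_athissa)
    wrangler_procs = 0.8 * spells_cast * 0.8 * (1.5 * wrangler + 3 * g_wrangler)
    brute_procs = spells_cast * (brute + 2 * g_brute)
    myrmidon_procs = (0.2 * athissa_procs + wrangler_procs) * (myrmidon + 1.5 * g_myrmidon)
    lurker_procs = 7 * lurker + 14 * g_lurker
    return athissa_procs + wrangler_procs + brute_procs + myrmidon_procs + lurker_procs

# The no-golden composition above: 2 Athissa, 1 Wrangler, 1 Myrmidon, 2 Spellcrafts
print(procs_per_turn(athissa=2, wrangler=1, myrmidon=1, spellcraft_minions=2))
```

Plugging in the compositions listed above is a quick way to sanity-check how much each swap costs you.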

I’ve measured the importance of each minion by calculating its average number of appearances in the top 10 compositions for each scenario. All copies are golden unless forbidden by the scenario:

                      All compositions   No golden Athissa   No golden minions
Corrupted Myrmidon                   1                   1                 0.5
Critter Wrangler                   1.4                 2.8                 0.8
Lava Lurker                        0.1                 0.3                 0.5
Eventide Brute                       0                 0.1                 0.1
Spellcraft Minions                 1.4                   1                 2.1
Avg. procs per turn                 97                  74                  44

I’ve made this spreadsheet calculator that computes the number of procs you’d get based on your composition. It’s read-only so it stays intact, but you can copy it to another spreadsheet and use it if you want.

The flex slot

As suggested above, the flex slot is used to rotate minions that give you spells (Spellcraft, Seashell Collector, Quilboar). However, at the end of the turn, you should be playing a minion on that slot.

If you feel like the combat will be easy, you can try to get an extra spell for the next round by playing a Spellcraft minion or a Quilboar that gets gems on combat. If you play a Spellcraft minion, you should do so after playing all your spells so it doesn’t “steal” any procs.

If you’re pressured, try to get a Leeroy, Mantid Queen, Ghastcoiler or Selfless Hero to strengthen your board.

Getting there

This article just covers the ideal composition in a void, but in a BG game you need to survive while you build your comp. In some cases it will be impossible to build full scaling and you’ll keep your early Lurker or Brute on the board, and that’s completely fine.


I’ve done the math on scaling for Naga comps. Here are the main takeaways:

  • Get as many copies of Athissa as you can
  • Critter Wrangler is a great minion to complement Athissa
  • A Corrupted Myrmidon (especially golden) is a great receiver of Athissa and Wrangler buffs
  • Lava Lurker (if you have Shoal Commander) and Eventide Brute are also viable
  • Get between 1 and 3 Spellcraft minions on the board, Orgozoa and Glowscale are the best
  • Round out your comp with another Spellcraft for a bit more scaling, or another useful unit if under pressure

How to use simulations in data science

Simulation is a very potent tool that is missing from many data scientists’ toolkits. In this article, I will teach you how to use simulation in combination with other analytical tools.

I will be sharing some educational and professional examples of simulation with Python code. If you are a data scientist (or on the road to becoming one), you’ll love the possibilities that simulation opens for you.

What is simulation?

Simulating is digitally running a series of events and recording their outcomes. Simulations help us when we have a good understanding of how individual events work, but not of how the aggregate works.

In physics, simulations are often used when we have a hard-to-solve differential equation. We know the starting state, and we know the rules for infinitesimal (very small) changes, but we don’t have a closed formula for longer timespans. Simulation allows us to project that initial state into the future, step by step.
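As a toy illustration of this step-by-step projection, here is a minimal sketch that simulates exponential decay (dx/dt = -k·x), a case chosen because we also know the closed-form solution to check against:

```python
import numpy as np

# Rule for a very small change: dx = -k * x * dt
k, x0, dt, steps = 0.5, 1.0, 0.001, 2000
x = x0
for _ in range(steps):
    x += -k * x * dt  # project the state one small step forward

# After t = steps * dt = 2, the analytic solution is x0 * exp(-k * t) = exp(-1)
print(x, np.exp(-1))
```

The same pattern, repeatedly applying a local update rule, is what physics engines and numerical ODE solvers do at much larger scale.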

In data science, we usually work with probabilistic events. Sometimes we can easily aggregate them analytically. Other times there is no analytical solution, or it’s very hard to reach. We can estimate the probabilities and expected results of complex chains of events by running multiple simulations and aggregating the results. This can be very useful for understanding the risks we are exposed to.

Simulation is also used in artificial intelligence. When interacting with other agents, simulation can allow us to anticipate their behavior and plan accordingly. For example, DeepMind’s AlphaGo uses simulations to calculate some moves into the future and make a better assessment of the best moves in its current position.

To run a simulation we will need a model of the underlying events. This model will tell us what can happen at any given point, the probabilities of each outcome and how we should evaluate the results.

The better our model, the better the accuracy of the simulation. However, simulations with imperfect models can still be helpful and give us a ballpark estimate.

Simulation is a subject where examples work better than theory, so let’s jump into some use cases.

Example 1. Estimate the value of pi by using simulation

This task can be done in many ways. One of the easiest is as follows:

  1. Draw a square of side 2 with its center at the origin of coordinates of a 2D plane
  2. Draw the inscribed circle of that square (radius 1 and its center at the origin of coordinates)
  3. Sample random points from the square (two uniform distributions from -1 to 1)
  4. Whenever you draw a point, check whether it is inside the circle or not
  5. The proportion of points inside the circle will be proportional to the area of the circle so:

    \[{Num\_points\_inside\_circle \over Num\_total\_points} \approx {Area\_of\_circle \over Area\_of\_square} = {\pi \cdot 1^2 \over 2 \cdot 2} =  {\pi \over 4}\]

And finally:

    \[\pi \approx 4 \cdot {Num\_points\_inside\_circle \over Num\_total\_points}\]

Here is Python code to simulate the value of pi:

import numpy as np
import matplotlib.pyplot as plt


num_sims = 5000
# Sample uniform points in the square [-1, 1] x [-1, 1]
x_random = np.random.uniform(-1, 1, num_sims)
y_random = np.random.uniform(-1, 1, num_sims)

# Check whether each point lands inside the unit circle
inside_circle = (x_random ** 2 + y_random ** 2) < 1

# Running estimate of pi: 4 * proportion of points inside the circle
one_to_n = np.arange(1, num_sims + 1)
plt.plot(one_to_n, 4 * inside_circle.cumsum() / one_to_n)
Pi simulation convergence

Similar methods can be used to estimate the value of integrals via simulation.

Example 2. Solve a difficult probability problem

Solve this problem by P. Winkler:

One hundred people line up to board an airplane. Each has a boarding pass with an assigned seat. However, the first person to board has lost his boarding pass and takes a random seat. After that, each person takes the assigned seat if it is unoccupied, and one of the unoccupied seats at random otherwise. What is the probability that the last person to board gets to sit in his assigned seat?

The problem can be solved using logic and probabilities, but it can also be solved by simply programming the described behavior and running some simulations:

import numpy as np
import matplotlib.pyplot as plt


def simulate_boarding(num_passengers):
    free_seats = set(range(num_passengers))
    for i in range(num_passengers):
        if i == num_passengers - 1:
            # The last passenger gets their assigned seat only if it's the one left
            return 1 if i in free_seats else 0
        if i == 0 or i not in free_seats:
            # Lost boarding pass, or assigned seat taken: pick a random free seat
            seat = list(free_seats)[np.random.randint(len(free_seats))]
        else:
            seat = i
        free_seats.remove(seat)


num_sims = 10000
num_passengers = 100

is_same_seat = [simulate_boarding(num_passengers) for i in range(num_sims)]
is_same_seat = np.array(is_same_seat)

one_to_n = np.arange(1, num_sims+1)
plt.plot(one_to_n, is_same_seat.cumsum() / one_to_n)
Probability simulation convergence

You can find more probability problems to practice here.

Example 3. Simulating game outcomes

How many games would it take Magnus Carlsen (Elo of 2847 as of 18-07-2021) to get back to his current rating if he was dropped at 1000?

To solve this problem we need to understand how the Elo system works.

First, given two players’ Elo ratings, the probability of player1 beating player2 is:

    \[P(\textrm{player1 beats player2}) = {1 \over 1 + 10 ^{(Elo_2 - Elo_1)/400}}\]

Second, after the game, player1’s Elo rating is updated as follows:

    \[Elo_1= Elo_1+K \cdot (\textrm{result} - P(\textrm{player1 beats player2}))\]


  • result is 1 for a win, 0.5 for a tie and 0 for a loss
  • K (also known as the K-factor) is the maximum possible adjustment per game and varies depending on the player’s age, games played and Elo

Now that we have a model, we just have to initialize Magnus’s current Elo to 1000 and code a while loop that:

  1. Has Magnus play a game against a player of his current Elo
  2. Calculates the probability of winning using the real Elo and simulates the outcome of the game
  3. Updates Magnus’s current Elo according to the result
  4. Stops the loop if Magnus has reached his real Elo
import numpy as np
import matplotlib.pyplot as plt


def get_prob(elo1, elo2):
    return 1/(1+10**((elo2 - elo1)/400))

def update_elo(elo, prob, result, k):
    return elo + k * (result - prob)

def play_until_top(real_elo, initial_elo):
    current_elo = initial_elo
    num_games = 0
    k = 40
    elo_list = [initial_elo]
    while current_elo < real_elo:
        if num_games > 30:
            k = 20
        if current_elo > 2400:
            k = 10
        # The outcome is simulated using Magnus's real (hidden) strength...
        prob_win = get_prob(real_elo, current_elo)
        result = 1 if np.random.rand(1)[0] < prob_win else 0
        # ...but the rating update uses the expected score from the official
        # ratings: both players are listed at current_elo, so it is 0.5
        current_elo = update_elo(current_elo, 0.5, result, k)
        elo_list.append(current_elo)
        num_games += 1
    return elo_list

num_sims = 1000

num_games = [len(play_until_top(2847, 1000)) for i in range(num_sims)]
num_games = np.array(num_games)


elo_history = np.array(play_until_top(2847, 1000))
plt.plot(np.arange(0, len(elo_history)), elo_history)
Example Elo trajectory
Games to real Elo distribution

Another cool example would be to simulate the NBA playoffs. For a first approach, you can assume that each team has a probability of winning proportional to the games they won during the regular season (GW) so that in any game the probability of team 1 winning is GW1 / (GW1 + GW2). You can also analyze how probabilities change if you change the series from Best of 7 to Best of 5 or Best of 9.
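The playoff idea can be sketched as follows. The win-share model and the team numbers in the usage line are just the first-approach assumption described above, not real data:

```python
import numpy as np

def series_win_prob(gw1, gw2, best_of=7, num_sims=100_000):
    # First-approach model: P(team 1 wins any single game) = GW1 / (GW1 + GW2)
    p = gw1 / (gw1 + gw2)
    wins_needed = best_of // 2 + 1
    # Simulating all best_of games is equivalent to stopping at wins_needed:
    # team 1 takes the series iff it wins a majority of the games
    games = np.random.rand(num_sims, best_of) < p
    return (games.sum(axis=1) >= wins_needed).mean()

# Longer series favor the stronger team
print(series_win_prob(60, 40, best_of=5), series_win_prob(60, 40, best_of=9))
```

Comparing Best of 5, 7 and 9 with the same per-game probability shows how series length amplifies the favorite’s edge.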

Example 4. Business application, estimating value at risk

Collectors LTD is a debt collection company focused on enterprise debt. It buys portfolios of business loans that have defaulted at some point and tries to collect the payments for those loans. Some of the companies will be bankrupt and won’t be able to pay, and others are likely to go bankrupt in the future. The key to Collectors LTD’s business is in estimating the value it can get back from a portfolio. For this reason, Collectors LTD has developed a model that predicts the probability of a company repaying part of that debt. Among those companies that repay some of the debt, the amount paid is distributed uniformly from 0% to 100%. Collectors LTD can use its model in combination with simulation to evaluate the expected return of the portfolio, and how volatile that return is.

Since I can’t share the real data with you, I’ve created a synthetic dataset that mimics the relevant properties:

import numpy as np
import matplotlib.pyplot as plt


def generate_synthetic_portfolio(num_companies):
    debt = 100000 * np.random.weibull(0.75, num_companies)
    prob_repayment = np.random.normal(0.2, 0.1, num_companies)
    prob_repayment = np.clip(prob_repayment, a_min=0, a_max=1)
    return debt, prob_repayment

num_companies = 1000
debt, prob_repayment = generate_synthetic_portfolio(num_companies)

Given this synthetically generated portfolio, estimate the expected amount collected and the 95% percentile:

def simulate_collection(debt, prob_repayment):
    num_companies = len(debt)
    did_repay = (np.random.rand(num_companies) < prob_repayment)
    # Among companies that repay, the fraction paid is uniform in [0, 1]
    pct_paid = np.random.rand(num_companies)
    amount_collected = debt * did_repay * pct_paid
    return amount_collected.sum()


num_sims = 1000
amount_collected = np.array(
    [simulate_collection(debt, prob_repayment) for i in range(num_sims)]
)

print(f"Total debt: {np.round(debt.sum())} usd")
print(f"Average amount collected: {np.round(amount_collected.mean())} usd")
# The amount we expect to exceed in 95% of cases (5th percentile of outcomes)
percentile_95 = np.round(np.sort(amount_collected)[int(0.05 * num_sims)])
print(f"95% percentile collection: {percentile_95} usd")

Debt collection distribution

Keep in mind that this solution assumes the probabilities of collection are independent of one another. This isn’t true for systemic risks such as a global economic downturn.


I hope you’ve liked these examples and that you can find applications of simulation in your day-to-day data science job. If you’ve enjoyed the article, please subscribe and share it with your friends.


13 essential tips for learning machine learning and data science

When you start learning, it’s very hard to have a clear direction. You often waste time on uninteresting, useless, or outdated topics. You wander and run in circles.

However, once you’ve mastered the topic, it’s easy to look back and see the fastest path from noob to pro. If you only could go back in time and give yourself the roadmap… Even if I cannot do that with myself, I can do that for others. This is the objective of this article: to give you the tips I wish I knew when I started learning data science and machine learning.

To build this list, first I wrote down what has been useful to me in my experience as a data scientist. Then I went to Reddit, to seek help in curating and completing the list, getting 300+ upvotes and 35+ comments. I hope you find it helpful!

1. Get solid mathematics, probabilities, and statistics foundations

Mathematics and statistics are at the core of machine learning. So it will be very difficult to understand machine learning algorithms if you don’t know the building blocks.

However, this doesn’t mean you need to be a math wizard. You should understand math and stats concepts such as vectors, matrices, derivatives, probability distribution, independent variables, or standard deviation. More advanced mathematics (like learning to prove theorems) won’t help you much when studying machine learning, even though it can be a lot of fun.

2. Learn either Python or R and learn them well

When doing data science and machine learning, you will spend most of your time coding in R/Python. So it’s important to learn the ins and outs of your language of choice.

Data scientists spend a lot of time cleaning and manipulating data, so you should give special attention to data manipulation libraries. The most popular ones are Pandas for Python and data.table and dplyr for R.

3. Learn good programming practices

Writing clean and efficient code will make it easier to share your work with others. And even if you work alone, it will make it easier for you to debug and maintain your own code. Entire books have been written about this, so I’ll give you a short list:

  1. Use consistent and descriptive names for variables, columns, and functions
  2. Don’t repeat code, use functions or classes if you need to do the same process multiple times
  3. Understandable code is better than compact code: 10 lines everybody understands beat 2 lines nobody understands
  4. Don’t overoptimize your code at the start, but know where the bottlenecks (parts that won’t work well if you increase the volume of data) are in case you need it to scale
  5. Use consistent indentation and try to limit line length
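As a tiny made-up illustration of practices 1 and 2, a descriptive, reusable function beats pasting the same arithmetic twice:

```python
def normalize(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# One well-named function replaces two copies of the same formula
heights = [150, 165, 180]
weights = [50, 70, 90]
norm_heights = normalize(heights)
norm_weights = normalize(weights)
```

If the scaling logic ever needs to change, there is exactly one place to change it.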

4. You don’t need to learn all the different supervised learning models

This is one I struggled with. When I started learning I thought that every situation would need a different type of model and that I needed to learn them all to be well equipped. But this is far from true. Linear/logistic regression is surprisingly effective for tabular data problems. And XGBoost or random forest will help you if you have a lot of non-linearities. Artificial neural nets are great for image and NLP problems but are otherwise overkill and more difficult to set up.

Additionally, you don’t have to keep up with all the published papers. Most staple techniques in the industry are decades old. If you ever face a very unique problem, then it may be a good moment to dive into the literature.

5. Once you know the basics and understand them well, it’s mostly about doing projects

After completing one or two ML courses, don’t spend your time on more theory; dive straight into doing some projects. If you’re lacking some knowledge, you can pick it up along the way.

Working on projects puts your knowledge into practice and helps you figure out whether you really understood everything well. Additionally, by doing projects you create valuable experiences that will help you get hired later on.

6. Doing tutorials and reviewing other people’s projects is very helpful at the start

When you’re learning a new tool or model and don’t feel confident about using it on your own, looking at an example is a great way to get some inspiration.

7. You can learn everything online for free, but some paid resources can be helpful

For example, studying a master’s will give you credentials and a class of peers. I’ve actually written a full article about self-learning vs studying a master’s.

Additionally, some useful online resources are paid. I have personally tried to distill my years of experience as a data scientist into Data Projects, a product to learn data science by doing real-world projects. I hope it can help others as much as it would’ve helped me.

8. Explaining your work to others is a great way to consolidate your knowledge

It’s also a great way to work on your communication. You can do this by explaining things to your friends, blogging, or making YouTube videos. This will be a crucial skill when working with others.

9. Don’t despair if you don’t get it right

Nobody gets it right the first time. Trial and error is the way to go, especially in fields like this, where there is no single exact solution.

10. Lean on online communities

The internet is full of helpful and generous people. If you’re struggling with something, search for it, and if you don’t find the answers, ask in the forums (Reddit or Stack Overflow).

11. Learn more about your problem domain

Don’t focus only on the purely technical, try to understand what is really behind the problems you’re modeling. It will help you decide which is the best error metric for the problem, select the most insightful variables, and communicate to non-technical stakeholders using their own language.

12. Work with messy data

Don’t just stick to problems with pre-cleaned data. The world is messy, and having some experience treating and structuring data will prepare you for future challenges.

13. Work on what makes you curious, that will keep you motivated

Following your curiosity and your passions will make sure you don’t abandon your path to becoming a data scientist halfway through. Additionally, it makes the whole learning experience a lot more fun!


How to self-learn data science from scratch

When I learned data science I didn’t know where to start, so I wasted many hours learning only tangentially useful stuff. Now, after more than five years as a data science consultant, I know what I would’ve done differently. In this article, I will offer you a roadmap on how to self-learn data science, with links to useful resources.

Data science pre-requisites

Even though I believe everyone can learn data science, those with a technical background will have a head start. Before getting into DS-specific subjects, it is useful to have some notions of mathematics, statistics and probability.

It is not necessary to be an expert in any of those, but you need a solid foundation. If you’ve never studied any of those, don’t worry, I’m here to help. In the following paragraphs, I’ll briefly describe each prerequisite and link to educational resources.

Mathematics for data science

To get started with data science you need to get familiar with some of mathematics’ most common objects. These Khan Academy lessons about vectors, matrices and functions are a good place to start. Also, here’s the summary (in more formal mathematical language) of a Stanford course. These concepts are the building blocks of most machine learning algorithms and provide you with a framework for structuring data. Getting to this level of mathematics will allow you to understand and use the algorithms that others have invented and implemented and get results.

If you really like mathematics, you can dive deeper by taking full calculus and linear algebra courses. This will require a lot more work but will unlock a more complete understanding of the inner workings of machine learning algorithms and how to implement and adjust them.

Probability and statistics

Probability lies at the core of the data scientist’s view of the world. When dealing with big numbers and random events, probability and statistics provide the tools to make sense of them. It isn’t only about the exact methods or formulas, but also about developing a probabilistic intuition. These courses from Khan Academy on probability and statistics are beginner-friendly and have all the information you’ll need. Here is a mathematically formal summary of a probability course from Stanford.

In addition to formal education in probability and statistics, reading non-fiction books can also help to develop an intuition. I recommend the following books in no particular order: Thinking fast and slow, Factfulness, Thinking in bets, Fooled by randomness (or any of Nassim Taleb’s books).

Finally, reading about statistical paradoxes will help you make sense of data when you face unintuitive conclusions.

Data-oriented programming language

A big part of a data scientist’s job is reading, manipulating and running analysis on data. This is usually done by coding in a data-oriented language. These languages allow us to write instructions for a computer to execute. Even though there are many different programming languages, most of them use very similar structures. The two most popular data-oriented programming languages are Python and R, and you can start with either one. If at some later point you work with people using the other one, you can use that as an opportunity to learn it.

If you’ve never coded before, don’t worry. Both of them can be a good first point of contact with programming. A lot has been written about which one is better, but the truth is they have different strengths.

R’s strong points are:

  • It is designed for data and statistical work, so manipulating data is easier
  • There is a vast universe of statistics libraries
  • The Shiny library makes it very easy to make a web app with no previous web design experience
  • RStudio is a wonderful IDE (I haven’t found one that I like as much for Python)

Python’s strong points are:

  • It’s a general-purpose programming language as well as one of the most popular languages overall
  • It usually runs faster than R
  • It has better packages for deep learning

I personally prefer R because of its more compact syntax in the data.table package and also because I have more experience with it.

Learning R

If you are new to programming, I recommend you start with one of these resources:

If you have been coding for a while, you can get the basics with learn R in Y minutes.

Once you know the basics, it’s time to learn one of the two main data manipulation libraries: data.table (my personal favorite) or dplyr. Another useful library is ggplot2 for making beautiful graphics.

Learning Python

If python is your first programming language you can start with any of these:

If you’re already familiar with coding you can just read this documentation.

And once you’ve mastered Python’s basics, you can go into the specialized tools to manipulate data: Pandas and NumPy. Here’s a tutorial and here’s a video to help you learn those packages.
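To give you a flavor of what these libraries are for, here is a minimal sketch. The toy dataset is invented just for illustration:

```python
import pandas as pd

# Toy dataset: apartment prices in two cities
df = pd.DataFrame({
    "city": ["Madrid", "Madrid", "Paris"],
    "price": [100_000, 140_000, 200_000],
})

# Typical manipulations: derive a new column, then group and aggregate
df["price_eur_k"] = df["price"] / 1000
avg_price = df.groupby("city")["price"].mean()
```

Most day-to-day data science work is variations on exactly these operations: reading, filtering, transforming and aggregating tables.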

Learn machine learning

Now we get to the exciting part.

There are many different techniques and tools in machine learning. One of them has been my most used analytical tool during my years as a data science consultant. And that technique is supervised learning, in both of its forms: classification and regression.

Supervised learning, also known as predictive modeling, is about learning from examples in which we know in advance the correct answer. In regression the answer is a numerical value, and in classification it is categorical.

Predictive models can be used to make demand forecasts, identify risky borrowers, and estimate the market price of a house, among many other uses.

Here are some courses that will teach you the main framework to approach predictive modeling problems, as well as some supervised learning models:

In my experience, 3 families of models can help you solve most supervised learning problems you’ll ever encounter:

  1. Linear and logistic models (explained in the above courses) are easy to understand, easy to interpret, fast to train and reasonably accurate
  2. XGBoost (a gradient boosting trees implementation) is a top-of-the-class model in terms of precision, speed, and ease of use. However, it’s not as easy to interpret as linear models. Here’s an introduction to decision trees (a prerequisite) and a couple of articles about how XGBoost works
  3. Neural networks are great for natural language processing and image models. However, I’d leave them to more advanced data scientists since they’re more difficult to set up

Here are some examples of using linear regression in R and Python, and of using XGBoost in both languages.
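As an illustration of what such a fit can look like, here’s a sketch in Python using scikit-learn (a common choice, though not the only one) on synthetic data; all the numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: house size vs. price with a known linear relationship
rng = np.random.default_rng(0)
size = rng.uniform(50, 200, size=(100, 1))   # feature matrix (100 x 1)
price = 1500 * size[:, 0] + 20000 + rng.normal(0, 5000, size=100)

model = LinearRegression()
model.fit(size, price)

# The fitted slope should land close to the true value of 1500
print(model.coef_[0], model.intercept_)
```

The same model could be fit in R with `lm(price ~ size)`.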


Learn SQL

SQL is the most widely used database language, and most companies use one of its variants for their databases. Even Amazon’s Athena and Google’s BigQuery can be accessed using SQL syntax.

So if you’re planning on getting a job in data science, I recommend you learn SQL, since it will be a requirement for most employers. If you’re doing personal projects, it’s up to you: for small-scale projects you can just save your data in text files, while for bigger projects SQL skills may come in handy.
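If you want a quick taste of SQL without installing a database server, here’s a sketch using Python’s built-in sqlite3 module; the table and its values are made up for illustration, and the same SELECT syntax carries over to most dialects:

```python
import sqlite3

# In-memory SQLite database for experimenting
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (customer TEXT, amount REAL, defaulted INTEGER)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("ann", 1000, 0), ("bob", 500, 1), ("eve", 2000, 0)],
)

# Aggregate query: total amount lent to customers who did not default
total = conn.execute(
    "SELECT SUM(amount) FROM loans WHERE defaulted = 0"
).fetchone()[0]
print(total)  # 3000.0
```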

What’s next?

Once you’ve learned the basics about R/Python and supervised learning, it’s time to practice. Do a project with open data or participate in a Kaggle competition. Or get a job as a data scientist and learn while getting paid. Practice is what will help you hone your skills and generate proof of your knowledge.

How to get a job (in data science)

In this article, I’ll give you a structured approach to getting a data science job.

In fact, I’ll be sharing all the techniques that have helped me get offers from startups and management consulting firms, along with examples of my own resume and project portfolio. Additionally, I’ll talk about what I look for when screening CVs and running interviews.

So if you want to get a job in data science, you’ll love the actionable steps in this guide.

Let’s get started.

Be eligible for a data science job

Before going into the details of job hunting, let’s get this out of the way: no amount of tricks will get you a job if you don’t have the required skills. So the first step is to learn the fundamentals: coding in a data-friendly language (ideally Python or R), some machine learning, and SQL. Those are the basics for an entry position in data science. And they can be learned for free on the internet.

If your market is very hot or you’re looking for internships, you may get hired with a technical degree (CS, math, physics, engineering) and no specialized data science knowledge.

Some work experience may also help to make you a more attractive candidate. Adjacent positions like data analyst and data engineer can help you move to a data science position.

Additionally, some domain knowledge of the industry of the companies you’re applying to will be a great asset to your job search.

You have to present your story in the best light possible

And this is true through the whole hiring process, from your resume to interviews.

When you are looking for a job, you are both the product and the salesperson. No one else but you will highlight your qualities. There are many ways to explain who you are and what kind of work you’ve done in the past. You should choose the most persuasive way in every situation. To do this, focus on two main principles:

  1. Make your story as interesting as possible by being specific enough
  2. Adapt your story to your audience, highlighting what is more relevant to them

For example, when Mary is asked in an interview what she does for a living, she can say “I do customer segmentation”, and that can be as true as it is boring and unspecific. But she can do better. She could say “I use algorithms to segment users according to their past purchases”. That sounds more interesting.

Moreover, if Mary is in an interview with a software engineer, she can specify that she uses a mix of SQL and Python code for her analysis. If she is talking to the marketing manager, she can explain how her segmentations helped increase the email open rate by 12%.

Additionally, she should try to use her audience’s own words. Her official job title states “Business Analyst”, but she’s using SQL and Python to do her job. If she’s applying for a “Data Analyst” position, she could say her current job is a “Data Analyst” position too.

These ideas apply to interviews as well as the wording of your resume and any other document you present.

Improve the steps of the funnel where you’re weak

A job search is like a sales funnel. You find some job postings and apply to them. Some of those applications will get you interviews. And some of the interviews will result in job offers.

By thinking of the process as a funnel, you can isolate its parts and try to optimize them separately. For example, imagine John has sent lots of applications and isn’t getting any interviews. In that case, before sending more applications John should make sure his CV is well-formatted and that he is a good fit for the positions he’s applying to.

The main parts of the job search funnel are:

How to get a job – search funnel

Applications, increasing the funnel input

The first step to getting a job is finding job postings. The main ways to do so are:

  • Asking your network, which may even let you skip directly to the interview phase
  • Online job searches, I’d suggest searching about once per week (LinkedIn has by far been the best for me)
  • Improving your LinkedIn profile and setting it as open to work
  • Local job banks
  • Company jobs pages if you’re interested in specific companies
  • Cold emails to people in your industry

Once you’ve got some job posts, you have to decide which ones to apply for. My rule of thumb is to apply to any job you feel confident you can do, regardless of the requirements. Very often, companies post job offers where it’s almost impossible to find a person with 100% of the requirements. If you don’t fulfill some of them but feel like you can pick them up easily on the job, just apply.

Maybe you have confidence issues and feel you may not be worthy of the job. In this case, look at what kind of jobs your classmates got, and shoot for something at that level. If people who studied with you did it, you can do it too!

Unless you’ve applied to at least 20 positions, your best bet is to keep sending more applications. Think of it this way: if 20 people apply for a job, your base probability of getting an offer is 5%. Lately, on LinkedIn, I’ve seen many postings with as many as 100 applicants.
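Here’s a back-of-envelope sketch of that numbers game, under the simplifying assumption that each application is an independent draw with 5% odds (1 offer per 20 applicants):

```python
# Chance of at least one offer after n independent applications,
# each with probability p of success: 1 - (1 - p)**n
p = 0.05
for n in (1, 10, 20, 40):
    at_least_one = 1 - (1 - p) ** n
    print(n, round(at_least_one, 2))
```

Even at these modest odds, 20 applications already put the chance of at least one offer around 64%, which is why volume matters.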

Improving your LinkedIn profile

On my last job search, I was contacted by many recruiters who found me through LinkedIn and brought relevant offers. This will probably happen more and more as you advance through your career.

But to get noticed you have to work on your profile. Here’s what I did:

  • Create a complete profile with all your relevant work and learning experiences
  • Follow LinkedIn’s advice to improve your profile
  • Write an “about” section that sounds professional
  • Get your friends to endorse you on the necessary skills for your job search
  • Take LinkedIn’s skill certificates to make your profile stand out
  • Accept random connections, sometimes they have offers for you
  • Set your profile as “Open to work”

Successful applications and getting noticed

Once you’ve selected some job posts, you need to send the best application you can. Your objective when applying is to convince HR that you’re a good fit for the job.

Before going into which documents to send, let’s talk about referrals. This is the step in the process with the highest potential return on your effort. If you have a connection working at the company, talk to them and ask for a referral. This will bring extra attention to your application and possibly let you skip some of the required steps. Just do it if it’s available.

Whether or not you can get a referral, applying for a position involves sending some documents that show who you are. Make any document you include a PDF: Word and other editable formats may render differently on different computers.

How to write your CV

The most common document is your resume or CV (Curriculum Vitae). The objective of your CV is to communicate who you are to the recruiter. Here are a few guidelines about how to write your curriculum:

  1. Use a nice template, for example
  2. Only include a picture if you look good in it, and in the picture, dress for the job you want
  3. One page, include only relevant information
  4. Make everything easy to understand
  5. Don’t get too technical (your CV will probably be screened first by someone with no technical knowledge)

Here’s the resume I used the last time I was looking for a job (2021).

In addition to your CV, you may want to include a cover letter or a project portfolio.

Cover letter

The cover letter should be about why this job is a good match for you and why you are a good match for the job. It could increase your chances of getting a positive response, especially in more formal recruiting processes such as management consulting. You can send it as an email or as an attached PDF (max 1 page). You can structure your cover letter as follows:

  1. Introduction: why you’re writing this document
  2. Why the company is a good fit for you, for example: it’s a market leader, very innovative, you personally use their products, …
  3. Why the position is a good fit for you
  4. Why you are a good fit for the position and how your previous work experience and education has prepared you for this job (try to address all the points in the job description)
  5. Conclusion

Here’s an example of a cover letter I used some time ago to get into management consulting.

Project portfolio tips

Another document you can send is a project portfolio. This is a document explaining some of the projects you’ve worked on. If you have done some projects that you’re proud of, this can make them shine. In fact, the last time I applied for a job I impressed some of the interviewers with my project portfolio.

If you do so, keep in mind the following points:

  1. Don’t make it too long: 2-3 projects, 10-15 slides max
  2. For every project, explain the technology used, the process, and the results
  3. If possible, showcase projects relevant to your potential employer (same industry, technologies, or modeling problems)
  4. Don’t assume the reader has previous knowledge of your projects, give all the necessary context
  5. Don’t share any confidential information

If you send a GitHub link, make sure to keep your profile clean and organized. Also, add a clear README file with a summary to each project. Otherwise, reviewers may not know where to start and will just skip it.

What I look for when screening applications

At my past job, I screened applications of potential candidates for data science consulting jobs. This is what I looked for in order of relevance:

  1. Evidence of proactiveness and problem-solving in previous work experience and side projects
  2. Fundamental data science skills (ML, R/Python, SQL). I personally don’t care much whether they’ve taken a master’s degree or some MOOCs
  3. Numerical and coding skills (technical degree, side projects, …)

What if you don’t get answers to your job applications?

Looking for a job can be tough, especially when companies and recruiters ignore you. Don’t despair. If you find yourself in this spot, here’s what could be happening:

  • You haven’t applied to enough jobs or have been unlucky so far, send more applications
  • Your resume is poorly formatted or difficult to understand, work on it
  • You have the skills but not the credentials, try to explain better why you are a good candidate and maybe give recruiters some proof of your skills (for example: portfolio, LinkedIn certificates)
  • You don’t have the necessary skills for the positions you’re applying to, in which case you should level up your knowledge or apply to other more suitable jobs

Proving yourself: Tests and assignments

If after sending your application letter the company likes you, they will contact you. At this point, some companies will assess your skills and commitment with an assignment or a test.

Tests are like exams. You will have limited time to answer a series of theoretical questions or practical exercises. Ask about it and try to prepare in advance. Doing similar tests from other companies or preparation websites will help. Also, try to schedule it for a time when you’re rested.

Here are some resources to prepare for a Data Science test:

Assignments are small projects that may take anywhere from 2 to 12 hours. Some assignments are downright abusive. If you aren’t too interested in the job, now is the time to get out of the process.

Other assignments are interesting and fun challenges. You can take them as a chance to see what the job will be like and also test your skills.

Whatever the type of assignment, remember that it’s a relatively small investment compared to the time you’ll spend at the job if you end up getting an offer.

What to do if you are not passing the tests

Don’t worry if you get turned down after one test or assignment; flukes happen. And sometimes recruiters don’t know what they’re doing.

However, if you fail at tests repeatedly, that means you should review the theory. Try to remember which questions you missed and study those topics and adjacent ones. Also, try to practice doing tests, as that always helps.

If you’re having trouble with assignments, then it’s a matter of practice. Try to do projects on your own and explain them to friends. Reviewing projects by others can also help.

The interview

Interviews come in many shapes and forms but tend to follow a common pattern. Most of them will consist of 4 main parts: introduction, HR-type questions, technical questions, and your questions. Preparing for each of the parts will improve your odds of getting a job offer.

Before an interview, you should review the job description and make sure you understand all concepts mentioned in it. This will automatically make you a better candidate. Bonus points if you think for some time about their business and how data science can improve their bottom line.

In the first part, the interviewer will introduce herself and give some information about the company and the role. Then she will ask you to introduce yourself. You should prepare your introduction and practice it in front of a friend to project a better image of yourself.

HR-type questions are usually about your motivations, your character, and your soft skills. Some typical HR questions are:

  • Why did you decide to apply to this role?
  • Tell us about your strengths and weaknesses
  • What do your colleagues think of you?
  • Can you describe your management style?

It is impossible to have a prepared answer for every question of this kind. However, taking the time to prepare answers for some of them will make you better at coming up with good answers to the rest. Additionally, writing down a description of the impression you want to give is also a good way to prepare.

There are 4 main types of technical questions:

  • Explaining a previous project
  • Case-type questions about how you would approach a certain task
  • Theory questions (here are some examples)
  • Practical exercises

Again, the range of possible questions here is almost unlimited. If you’re applying for a big company you may find some information about their interviewing style online. Having more experience with data science projects will give you an edge on technical questions but you can always get blindsided with a theory question about an algorithm you’ve never used. If this happens, acknowledge your ignorance and offer another subject about which you could talk.

Finally, when it’s your turn, asking a couple of questions will make you look interested in the job. You can spend 30 minutes googling the company before the interview to stand out from the competition by asking interesting questions.

How I run interviews

I have run data science interviews at both my current and past jobs. One was for consulting, the other for a SaaS that optimizes retail stock management. In both cases, interviews consisted of a data science business case, in which candidates have to solve business problems using analytical tools. It’s not so much about coding (in fact we don’t do live coding) as it is about problem-solving and knowing when to apply each data science technique. More specifically, what I look for when interviewing is:

  1. Problem-solving, understanding business problems and developing data-driven strategies to solve them.
  2. Communication skills, capacity to explain complex concepts in a clear and concise manner
  3. Leadership and initiative, ability and willingness to propose and run projects as well as to mentor more junior colleagues.
  4. Code craftsmanship, love for writing clear and easy to maintain code, while being conscious of the problems associated with excess complexity.
  5. Analysis depth, for example by identifying confounding variables and getting to the root cause of issues.

Just keep in mind that this is based on my personal opinions and what my company needs. Other interviews may be different.

How to get better at interviews

Many people struggle with interviews. The good news is, practice can help a lot.

Practice your introduction in front of the mirror until it’s perfect. Create or get some interview scripts, and get a friend or relative to interview you. Or if you’re still in college, maybe get together with some other students to interview each other.

After 5-10 mock interviews, you will be more articulate and more confident in yourself.

If you feel like anxiety is an issue in interviews, try to do breathing exercises to relax before going in.

Tying it up 

The job search doesn’t end when you get the offer, but when you sign the contract. Now it’s time to negotiate the terms. If there is something you don’t like, you have a right to say it and try to reach a different agreement. This can range from salary to vacation days.

Words of encouragement

Finally, the most important thing to keep in mind is that getting a job, like many things in life, is a numbers game. No amount of effort and skill will guarantee that you get the job. They may forget about your CV, the position may be covered by the CEO’s nephew, or the recruiter may be an ass.

Additionally, for every posting, there may be lots of applicants. So don’t lose faith and don’t let rejection bring you down. Even if it’s tedious and takes a long time, in the end it’s still worth it.

Assume you send 50 applications, each taking about one hour, and then interview with 10 companies, spending an average of 3 hours with each. That’s a total of 80 hours, which is about 4-5% of what someone with a full-time job works in a year. If you get a 10% raise, it’s a great investment. If you get 5%, better conditions, or a more fulfilling job, it’s still well worth it.
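The arithmetic above can be checked in a few lines; the 1,800 working hours per year is an assumed round figure for a full-time job:

```python
# Back-of-envelope cost of a job search
applications = 50        # ~1 hour each
interviews = 10          # ~3 hours each
hours_spent = applications * 1 + interviews * 3   # 80 hours

full_time_year = 1800    # assumed full-time working hours per year
share = hours_spent / full_time_year
print(hours_spent, round(share * 100, 1))  # 80 hours, about 4.4% of a year
```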

So, go get it!

How to estimate the impact of algorithms

You’ve just finished training a credit risk tree model with a whopping 57 AUC score, and you feel great. And you should. But let’s dig deeper. How much better will this model be than using no model? Or than the previous model, which had an AUC of 48?

Have you ever wondered what the impact of an algorithm you are building is? How much money are you making for your company? How many lives are your campaigns saving?

Every member of an organization should know how their actions contribute to the organization’s goals. This allows them to prioritize and be more efficient in their work.

The impact of an algorithm is tied to the actions it enables

To estimate the impact of an algorithm, we first need to define a metric. This will usually be money, because it’s the main human means of value exchange and one of the main goals of businesses. However, depending on the nature of your project, you can use metrics such as lives or time saved.

To estimate the impact of an action, we have to calculate the difference of our metric between two different scenarios:

  1. The current outcome (measured in the metric we’ve defined). This can be 0 if nothing can be done without the algorithm
  2. The outcome we expect to get by using the algorithm instead

Building simplified models of the situation will allow us to make estimations of the impact. This is similar to how we would build a business case.

Let’s make it clearer with an example

STL Limited (Short Term Loans) is a credit company that gives 1-year loans. This is how their business is going:

  1. They give loans at a 10% interest rate to everyone that applies for one
  2. Their default rate is 10% (percentage of customers that don’t pay back all the money they owe)
  3. Customers that default had paid back an average of 30% of the loan amount before defaulting
  4. The average loan amount is $1,000
  5. Every year, 100,000 new customers apply for a loan

With this information, we can estimate how much they currently earn per year using a simple Excel spreadsheet. We will first estimate the expected earnings per non-defaulting customer (NDC) and per defaulting customer (DC). After this, we will combine those estimations to calculate the expected value per customer using conditional probabilities.
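The same base-case calculation can be sketched in a few lines of Python instead of a spreadsheet, using only the figures listed above:

```python
# Base case: expected yearly earnings with no model
loan = 1000          # average loan amount ($)
rate = 0.10          # interest rate
default_rate = 0.10  # share of customers who default
recovered = 0.30     # fraction of the loan repaid before defaulting
customers = 100_000  # new applicants per year

earn_ndc = loan * rate             # non-defaulting customer: +$100
earn_dc = loan * recovered - loan  # defaulting customer: -$700

# Expected value per customer, weighting by default probability
expected_per_customer = (1 - default_rate) * earn_ndc + default_rate * earn_dc
total = expected_per_customer * customers
print(expected_per_customer, total)  # $20 per customer, $2M per year
```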

Building a model to improve earnings

John, the lead data scientist at STL Limited, has developed a probability-of-default model. He has trained it using customer employment data that was collected anyway for regulatory reasons.

John uses the model to make predictions on a holdout set (a dataset that the model has never seen before). He then divides the customers into four groups of the same size based on the probability of default predictions. The following table shows the probability of default for each of the groups:

Modifying the default rate on the previous spreadsheet, we can estimate the expected earnings per customer for each of the groups:

The average customer in group 4 loses money for STL Limited.

How much would STL Limited earn if they only gave credit to people in groups 1, 2, and 3?

By only giving credit to customers with positive expected earnings, STL could make a total of $3.5M per year. This means that the model would have an impact of $1.5M ($3.5M minus the $2M of the base case).
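The per-group table isn’t reproduced here, so the sketch below uses assumed per-group default rates (2%, 6%, 12%, and 20%, which average to the overall 10% and are consistent with the article’s totals of $3.5M with the model and $2M without):

```python
# Assumed per-group default rates (illustrative values only)
loan, rate, recovered = 1000, 0.10, 0.30
group_size = 25_000   # 100,000 customers split into 4 equal groups
default_rates = [0.02, 0.06, 0.12, 0.20]

def expected_earnings(d):
    """Per-customer expected earnings at default rate d."""
    return (1 - d) * loan * rate + d * (loan * recovered - loan)

per_group = [expected_earnings(d) for d in default_rates]
print(per_group)  # [84.0, 52.0, 4.0, -60.0] -> group 4 loses money

# Lend only to groups with positive expected earnings
with_model = sum(e * group_size for e in per_group if e > 0)
without_model = sum(e * group_size for e in per_group)
print(with_model, without_model, with_model - without_model)
# $3.5M with the model vs $2M without: a $1.5M impact
```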

Wrapping it up

This impact estimation method is based on simplification and it leaves out second-order consequences of the actions. Additionally, future performance isn’t guaranteed to be the same as in the past. To account for these sources of uncertainty, I generally multiply the impact estimation by a conservative factor of 50-80%.

Nevertheless, the objective of these estimations is not perfect accuracy but getting a ballpark figure that will allow us to compare and prioritize.

When is a master’s degree the right way into Data Science?

You are a student finishing your bachelor’s degree, unsure of what to do next. Or perhaps you are a professional considering a change of industry. You may even be a PhD who wants to transition into the private sector. Whatever your origin, what concerns you now is how to accomplish this change. And you wonder: should I study a data science master’s? Will it be worth it?

Since you want to get a data science job, enrolling in a master’s program sounds like a logical step. A master’s will teach you the knowledge needed to do the job and give you the credentials to get it. However, it is also costly. Tuition fees vary from a few thousand dollars to several tens of thousands. Additionally, you will have to invest a year or two in it, potentially going jobless during that time.

Since it’s a very important decision, you consider the alternatives. Is there a better way to learn data science? Maybe self-learning and some online courses? Maybe another data-related position where you can learn on the job and easily transition later?

In the rest of the article, I will compare the master’s and self-learner routes and highlight when one is better than the other.

Let me tell you a couple of stories

With one year left to finish my degree (Math+Civil engineering) I took interest in data science. That year I only had to do my final thesis so I had a lot of free time. I used that as an opportunity to get into data science by:

  1. Doing a machine learning project as my thesis with the help of some online courses and books
  2. Joining a local analytics consulting firm for an internship, where I learned about SQL and databases

That made me a great candidate for entry-level data science positions, and I got my first full-time job right after finishing my thesis.

My wife Anna enrolled in a master’s in statistics and operations research right after finishing her bachelor’s in mathematics. Since finishing it, she has held a couple of data scientist jobs and is now a biostatistician. Doing a master’s was a great decision in her case: it led to job offers and meeting great friends.

As illustrated by these examples, both ways can work. Which one is best will depend on your personal situation and preferences.

Why should you study a master’s in data science?

Certification. A master’s degree is a recognizable badge and will make it easier for you to get interviews. Recruiters and HR professionals value it highly. Data scientists don’t value it as much: according to a recent Kaggle survey, only 20-30% of data scientists hold a master’s degree. If you decide to skip the master’s, there are some ways to get proof of your skills, such as:

  • Data science competitions (Kaggle, local hackathons …)
  • LinkedIn skill assessments
  • Personal projects and OS contributions
  • Experience in adjacent fields (data analyst, data engineer)

Peers. Another advantage of a master’s degree is that you will have a class of like-minded people to study with. During the degree, you can have fun together; afterwards, you will have a valuable network of professionals who can help each other. Self-learners won’t have the same camaraderie as classmates. However, you can join online communities as well as local meetups and study groups to network and socialize.

Convenience. The final major advantage of a master’s degree is that it’s simply easier to follow from start to finish. Most people find it a lot easier to commit to a habit once they’ve paid for it or given it some formal structure. Finding the discipline and motivation for consistent self-study is hard. If you struggle with it, you can try some of these tricks:

  • Learn with a friend
  • Allocate a certain time of the week to it
  • Find a way to track your progress and give yourself a sense of accomplishment

What are the advantages of self-learning?

Flexibility. Self-study lets you advance at your own pace, from wherever you are, and skip subjects you find boring or uninteresting. This was very important in my case, as I have always struggled with things I find boring. Some master’s programs aren’t as rigid as they used to be, but they’re still nowhere near as flexible as self-learning.

Cost. This one is obvious. The cost of learning data science on your own will be close to zero. You may spend some money on a couple of books and online courses. Master’s degrees, on the other hand, are very expensive unless education is heavily subsidized where you live.

Quality of education. I know this may come as a shock to some. Self-learning lets you pick and choose the best materials from different sources; a master’s, on the other hand, commits you to a single program. If you are unsure about the best books and courses, worry not: online forums like Reddit and Stack Overflow will answer your questions, and blogs like this one will try to point you in the right direction. Moreover, bloggers and Kaggle winners regularly share their experience and tips, something that many teachers won’t do. So even if you decide a master’s is the better option for you, it’s good to stay online.

And finally, let’s talk about salary. The Kaggle study found no significant salary difference between those who had a master’s and those who didn’t. So we can call this one even.


As with most things in life, whether or not studying a DS master’s is a good idea will depend on your situation and preferences.

Study a master’s if you really value the certification or having classmates, or if you aren’t confident in your discipline to do it solo.

Self-learn if you value the flexibility or money is an issue.
