I write about Data Science, Data Visualization, books and learning more effectively 📚🌱💡🚀

HANDS-ON TUTORIALS

Your dog’s nap time as a regularized linear model

When you’re building a machine learning model you’re faced with the bias-variance tradeoff, where you have to find the balance between having a model that:

  1. Is very expressive and captures the real patterns in the data.
  2. Generates predictions that are not too far off from the actual values.

A very expressive model has low bias, but it can also be too complex. A model whose predictions stay close to the true values, even when it is trained on a different sample of the data, has low variance.
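To make the tradeoff concrete, here is a minimal sketch of how regularization trades bias for variance. It assumes a ridge (L2) penalty and a one-feature model with no intercept; the excerpt doesn't say which regularizer the article uses, and the names `ridge_slope` and `simulate_slopes`, the true slope of 2 and the noise level are all made up for illustration.

```python
import random
import statistics


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for the one-feature model y = w * x."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def simulate_slopes(lam, true_w=2.0, n_points=20, n_trials=2000, seed=0):
    """Refit the model on many freshly drawn training sets (hypothetical data)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_trials):
        xs = [rng.uniform(-1, 1) for _ in range(n_points)]
        ys = [true_w * x + rng.gauss(0, 1) for x in xs]
        slopes.append(ridge_slope(xs, ys, lam))
    return slopes


# lam = 0 is plain least squares; lam = 10 is a fairly strong penalty.
for lam in (0.0, 10.0):
    slopes = simulate_slopes(lam)
    bias = statistics.mean(slopes) - 2.0
    variance = statistics.pvariance(slopes)
    print(f"lambda={lam}: bias={bias:+.3f}, variance={variance:.3f}")
```

With the penalty switched on, the estimated slope should be pulled toward zero (more bias) while varying less from one training set to the next (less variance).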

Overfitting

When the model is too complex and tries to encode more patterns from the training data than…


Markov defined a way to represent real-world stochastic systems and processes that encode dependencies and reach a steady state over time.

Andrei Markov disagreed with Pavel Nekrasov, who claimed that independence between variables was necessary for the Weak Law of Large Numbers to apply.

The Weak Law of Large Numbers states something like this:

When you collect independent samples, as the number of samples gets bigger, the mean of those samples converges to the true mean of the population.
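A quick way to see this statement in action is to simulate it. The sketch below assumes a fair six-sided die as the population (true mean 3.5); the function name and sample sizes are just illustrative choices.

```python
import random
import statistics


def sample_mean_errors(sample_sizes, true_mean=3.5, seed=0):
    """Gap between the mean of n independent die rolls and the true mean."""
    rng = random.Random(seed)
    errors = {}
    for n in sample_sizes:
        rolls = [rng.randint(1, 6) for _ in range(n)]
        errors[n] = abs(statistics.mean(rolls) - true_mean)
    return errors


for n, err in sample_mean_errors([10, 1_000, 100_000]).items():
    print(f"n={n:>7}: |sample mean - 3.5| = {err:.4f}")
```

The gap doesn't have to shrink at every single step, but for large samples it should end up very close to zero.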

But Markov believed independence was not a necessary condition for the mean to converge. So he set out to define how the average of the outcomes from a process involving dependent random variables could converge over time.
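Here is a minimal sketch of that idea, using a made-up two-state chain where each step depends on the previous one. The transition probabilities `p_01` and `p_10` are hypothetical, and the article's own example isn't visible in this excerpt. Even though the steps are dependent, the share of time spent in state 1 should settle near the chain's steady-state probability.

```python
import random


def fraction_in_state_one(p_01, p_10, n_steps, seed=0):
    """Simulate a two-state Markov chain and return the share of steps in state 1.

    p_01 is the probability of moving 0 -> 1, p_10 of moving 1 -> 0.
    """
    rng = random.Random(seed)
    state = 0
    visits_to_one = 0
    for _ in range(n_steps):
        if state == 0:
            if rng.random() < p_01:
                state = 1
        elif rng.random() < p_10:
            state = 0
        visits_to_one += state
    return visits_to_one / n_steps


def steady_state_probability_of_one(p_01, p_10):
    """Long-run probability of state 1 for a two-state chain."""
    return p_01 / (p_01 + p_10)


print(fraction_in_state_one(0.3, 0.1, 200_000))
print(steady_state_probability_of_one(0.3, 0.1))
```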

Markov chain: a random chain of dependencies

Thanks to this…


SQL is a fundamental part of a Data Scientist’s toolbox. It’s a great tool to explore and prepare your data, either for analysis or to create a machine learning model.

An effective approach to learning SQL is to focus on the questions you want to answer, rather than on specific methods or functions. Once you know what questions you want to answer with data, the functions and operators you use to get there will make more sense.
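As a small sketch of the question-first approach, suppose the question is "which customers spent the most in total?". The `orders` table, its columns and its rows below are invented for illustration, and the query runs against an in-memory SQLite database through Python's built-in `sqlite3` module.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("ana", 30), ("ben", 20), ("ana", 15), ("cy", 50)],
    )
    return conn


def total_spent_per_customer(conn):
    """Question: which customers spent the most in total?"""
    query = """
        SELECT customer, SUM(amount) AS total_spent
        FROM orders
        GROUP BY customer
        ORDER BY total_spent DESC
    """
    return conn.execute(query).fetchall()


print(total_spent_per_customer(build_demo_db()))
```

Starting from the question tells you what you need: a total per customer (`SUM` with `GROUP BY`) and a ranking (`ORDER BY`).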

This article is organized around what questions to ask about data, and you’ll become familiar with:

  • Structure of a…


This year was challenging, stressful, messed up, overwhelming, brutal, … for everyone.

Lots of us found comfort in books, which might seem like a small thing, but it’s an immense privilege.

This year my reading gravitated toward:

  • Fiction,
  • Science,
  • Entrepreneurship & history behind high-performers,
  • Personal development & curiosity.

I love talking about books, and I discover lots of new books through these kinds of lists. So I hope you enjoy this article and find a book that sparks your interest in something new.

Happy reading 📚

Fiction


Hands-on Tutorials

Monte Carlo methods are a group of algorithms that simulate the behavior of a complex system, or probabilistic phenomena, using inferential statistics. They simulate physical processes that are typically time-consuming, or too expensive to set up and run a large number of times.

Since they are a tool to model probabilistic real-world processes, Monte Carlo methods are widely used in areas ranging from particle physics and biochemistry to engineering. So, if you can model it, you can use Monte Carlo methods and run simulations!
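A classic toy example, not necessarily the one used in the article, is estimating π by throwing random points at a unit square and counting how many land inside the quarter circle. This is only a sketch, and `estimate_pi` is a name picked for illustration.

```python
import random


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the share of random points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle covers pi / 4 of the unit square.
    return 4 * inside / n_samples


print(estimate_pi(200_000))
```

The more points you simulate, the closer the estimate tends to get to the real value.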

Why you want to run simulations

Monte Carlo simulations are a great methodology when you want to:

  • Simulate processes that are time consuming, i.e…


The Central Limit Theorem (CLT) is one of the most popular theorems in statistics and it’s very useful in real-world problems. In this article we’ll see why the Central Limit Theorem is so useful and how to apply it.

In a lot of situations where you use statistics, the ultimate goal is to identify the characteristics of a population.

The Central Limit Theorem gives you an approximation you can use when the population you’re studying is so big that it would take a long time to gather data about each individual that’s part of it.
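To get a feel for the approximation, here is a sketch that draws many samples from a clearly non-normal population and looks at their means. It assumes an exponential population with mean 1 and standard deviation 1; the sample size of 50 and the number of samples are arbitrary choices.

```python
import random
import statistics


def sample_means(n_samples=2000, sample_size=50, seed=0):
    """Draw many samples from a skewed population and return each sample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means


means = sample_means()
center = statistics.mean(means)
spread = statistics.pstdev(means)
within = sum(abs(m - center) <= 1.96 * spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")   # should sit near the population mean, 1
print(f"std of sample means:  {spread:.3f}")   # should sit near 1 / sqrt(50)
print(f"share within 1.96 std: {within:.3f}")  # should sit near 0.95 if roughly normal
```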

Population

Population is the group of individuals that…


Boxplots are underrated. They are jam-packed with insights about the underlying distribution, because they condense lots of information about your data into a small visualization.

In this article you’ll see how boxplots are great tools to:

  • Understand the spread of the data.
  • Spot outliers.
  • Compare distributions, and see how small tweaks in the boxplot visualization make it easier to spot differences between distributions.
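The numbers a boxplot draws can also be computed directly. The sketch below assumes the common convention of flagging points more than 1.5 × IQR beyond the quartiles as outliers; the article may use a different one, and the data is made up.

```python
import statistics


def boxplot_summary(data, whisker=1.5):
    """Quartiles, whisker ends and outliers, using the 1.5 * IQR rule by default."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inliers = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inliers),
        "whisker_high": max(inliers),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical data with one unusually large value.
print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

The box spans `q1` to `q3`, the whiskers reach the most extreme values that aren't outliers, and everything beyond them is drawn as an individual point.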

Understanding the spread of the data

During exploratory data analysis, boxplots can be a great complement to histograms.

With histograms it’s easy to see the shape and trends in a distribution, because histograms highlight how frequently each data point occurs in the distribution.

Boxplots don’t…


Growing up I was not the type of kid who wanted to be an astronaut. And even though science found its way into my life through engineering, I was never interested in space or cosmology.

But when I read The Martian it sparked my interest in sci-fi and, unexpectedly, my interest in space. After that I discovered Chris Hadfield’s book, An Astronaut’s Guide to Life on Earth. One book led to another and, a few weeks ago, I finished Scott Kelly’s book Endurance: A Year in Space, A Lifetime of Discovery. …


How to use SVMs in classification problems.

Support vector machines (SVMs) are a supervised machine learning technique. Even though they’re mostly used in classification, they can also be applied to regression problems.

SVMs define a decision boundary along with a maximal margin that separates almost all the points into two classes, while also leaving some room for misclassifications.
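One common way to write down this balance is the soft-margin objective for a linear SVM: a term that rewards a wide margin plus a hinge-loss penalty for points that fall inside the margin or on the wrong side. The sketch below only evaluates that objective for a given boundary; it doesn't train anything, and the weights, points and the penalty `c` are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score of point x relative to the boundary w . x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, x, y):
    """Zero when x is on the correct side and outside the margin, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


def soft_margin_objective(w, b, points, labels, c=1.0):
    """0.5 * ||w||^2 (wide margin) + c * total hinge loss (room for mistakes)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_term = sum(hinge_loss(w, b, x, y) for x, y in zip(points, labels))
    return margin_term + c * slack_term


# Hypothetical boundary x1 = 0 and four labelled points.
w, b = (1.0, 0.0), 0.0
points = [(2.0, 0.0), (-2.0, 1.0), (0.5, 0.0), (1.0, 1.0)]
labels = [1, -1, 1, -1]

print([hinge_loss(w, b, x, y) for x, y in zip(points, labels)])
print(soft_margin_objective(w, b, points, labels))
```

The first two points are well classified and cost nothing, the third sits inside the margin and the fourth is misclassified, so both add to the penalty. A larger `c` makes those violations more expensive.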

Support vector machines are an improvement over maximal margin algorithms. Their biggest advantage is that they can define either a linear or a non-linear decision boundary by using kernel functions. …


Understanding model error and how to improve it.

In supervised machine learning, the goal is to build a high-performing model that is good at predicting the targets of the problem at hand and does so with low bias and low variance.

But if you reduce bias you can end up increasing variance, and vice versa. That’s where the bias-variance tradeoff comes into play.

In this article, we’re going to look into what bias and variance mean in the context of machine learning models, and what you can do to minimize them.
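As a rough sketch of what these two quantities measure, you can retrain a model on many different training sets and look at its predictions for one fixed input: bias is how far the average prediction is from the truth, and variance is how much the predictions move around. Everything below is an assumption made for illustration: the true function, the noise level, the test point `x0`, and the two toy models (one too simple, one too flexible).

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical true relationship between input and target."""
    return math.sin(2 * math.pi * x)


def draw_training_set(rng, n_points=30, noise_sd=0.3):
    xs = [rng.random() for _ in range(n_points)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Very simple model: always predict the average target."""
    return statistics.mean(ys)


def predict_nearest(xs, ys, x0):
    """Very flexible model: copy the target of the closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def bias_and_variance(predict, x0=0.25, n_trials=2000, seed=0):
    """Bias and variance of the predictions at x0 across many training sets."""
    rng = random.Random(seed)
    preds = [predict(*draw_training_set(rng), x0) for _ in range(n_trials)]
    return statistics.mean(preds) - true_f(x0), statistics.pvariance(preds)


for name, model in (("mean", predict_mean), ("nearest", predict_nearest)):
    bias, variance = bias_and_variance(model)
    print(f"{name:>7}: bias={bias:+.3f}, variance={variance:.3f}")
```

The simple model should show a large bias and a small variance, and the flexible one the opposite.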

To build a supervised machine learning model you take a dataset that looks somewhat like this.
