How to use Data to get your way

Hi again! This is a post about a recurring discussion about a traditional dish from Porto called Francesinha and its relationship to the work of Joseph Berkson. Most of the images are from a presentation I gave at PyData Edinburgh. You can find it here.

First off, what is a Francesinha? It’s an amazing dish that is not only healthy but also easy to digest.

When it comes to Francesinha, there are basically two schools of thought:

The “You can’t have it all” argument

These people say you can find a francesinha with a great sauce, or one with a great selection of meats and ingredients (I’ll call this combination the ‘sandwich’ from now on), but rarely both. This also means that, if the quality of one increases, the quality of the other decreases: a negative correlation between the two.

The “Better Ingredients = Better Francesinha” argument

These people argue that, as you get better and better ingredients, both the ‘sandwich’ and the sauce will improve. Not only that, as you become a better cook, both things will improve in parallel. This means that there should be a positive correlation between the two things.

Who’s right?

As a data scientist, I had to do some research on the topic. The first step was to get the data. Luckily for me, I had a database teacher who started this project. Obviously I gathered the data and drew this very scientific plot:

At first glance it looks like the first argument was right, but here is where our friend Joseph comes into play. In this paper, Berkson discussed how studying hospital populations can mischaracterise the correlation between diseases. You could, for example, find a strong negative correlation between two diseases simply because in a hospital study you would tend to find people with either of the diseases and possibly even both (to a lesser extent), but you would tend to find fewer people without either of the diseases because, logically, they would not be in the hospital in the first place.
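To make Berkson's argument concrete, here is a minimal sketch with entirely made-up numbers. It assumes two independent diseases (hypothetically named A and B, each affecting 10% of people) and the simplifying assumption that only people with at least one disease end up in the hospital. It then compares how common B is among people with and without A, first in the whole population and then inside the hospital.

```python
import random


def simulate_hospital(n=200_000, p_a=0.1, p_b=0.1, seed=0):
    """Compare P(B | A) and P(B | not A) in the population and in the hospital.

    All parameters are illustrative assumptions, not real prevalence figures.
    """
    rng = random.Random(seed)
    # Each person gets disease A and disease B independently of each other.
    population = [(rng.random() < p_a, rng.random() < p_b) for _ in range(n)]
    # Simplifying assumption: you are only in the hospital if you have A or B.
    hospital = [(a, b) for a, b in population if a or b]

    def rate_of_b(people, has_a):
        # Fraction of people with B among those who do / do not have A.
        group = [b for a, b in people if a == has_a]
        return sum(group) / len(group)

    return {
        "population": (rate_of_b(population, True), rate_of_b(population, False)),
        "hospital": (rate_of_b(hospital, True), rate_of_b(hospital, False)),
    }


result = simulate_hospital()
for where, (b_given_a, b_given_not_a) in result.items():
    print(f"{where}: P(B | A) = {b_given_a:.3f}, P(B | not A) = {b_given_not_a:.3f}")
```

In the whole population the two rates should be roughly equal, because the diseases were generated independently. Inside the hospital, everyone without A must have B (otherwise they would not be there), so having A appears to "protect" against B even though no such relationship exists.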

Let’s go back to our dilemma. Remember that our friends at ‘project francesinha’ were having a francesinha every other week and, as you might have realised already, there is a limited number of those things you can have in a lifetime. So, what would they do? They would follow the recommendations of friends, family or (god forbid) the internet. The one thing these have in common is that they will tend to suggest places where either the sauce or the sandwich (or both) is considered good. Inevitably this means that we are missing a huge chunk of the population: the restaurants that are not that great. In fact, if you zoom out on our previous plot that is what we see:

Without this data it is impossible to know if there is really any correlation between the two things, and thus the only scientifically correct way of solving this problem is to go out there and GET MORE SAMPLES.
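The selection effect described above can be sketched with simulated restaurants. This is not the project's data: the scores, the 0 to 10 scale, and the recommendation threshold of 7 are all assumptions chosen for illustration. Sauce and sandwich scores are drawn independently, so there is no real relationship between them, and a restaurant only gets "recommended" if at least one of the two scores is high.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
THRESHOLD = 7.0  # hypothetical cut-off for "considered good"

# Every restaurant in town: independent sauce and sandwich scores in [0, 10].
restaurants = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(20_000)]

# Only places with a good sauce OR a good sandwich get recommended and visited.
recommended = [(sauce, sandwich) for sauce, sandwich in restaurants
               if sauce > THRESHOLD or sandwich > THRESHOLD]

corr_all = pearson(*zip(*restaurants))
corr_recommended = pearson(*zip(*recommended))

print(f"all restaurants:      r = {corr_all:+.2f}")
print(f"recommended only:     r = {corr_recommended:+.2f}")
```

Across all restaurants the correlation should sit close to zero, while the recommended subset shows a clearly negative one. The "you can't have it all" pattern appears purely from how the sample was picked, which is why the missing mediocre restaurants matter so much.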

I hope you enjoyed this post. Let me know what you think. See you next time!

