Bayes’ theorem — practice makes perfect

Outstanding data scientists are rarer than needles in a haystack: less than one in one thousand students of the discipline turn out to be truly exceptional. This said, let’s assume that you’ve just “aced” a new skills test that promises to correctly identify such top talent 99% of the time…[i]

What is the probability that you are truly exceptional? Why would Bayes’ theorem be useful here, how is it applied to machine learning, what are its assumptions, and what precautions should we take when relying on this methodology?

Bayes’ theorem is attributed to the work of Rev. Thomas Bayes, the 18th-century English mathematician who studied how one can infer the causes of an event from its effects. His work, later substantiated by Pierre-Simon Laplace, was driven by a simple idea: the pertinence of our predictions can improve through better use of observable data. The key takeaway is that predictive modeling should be based on experience; our initial beliefs must be continuously updated as we gain additional information about the problem at hand. The major implication for Data Science is that machine learning algorithms are inherently bound by both theory and experience.

Bayes’ theorem is based on the intimate relationship between joint and conditional probabilities. In a nutshell, Bayes’ rule posits that the posterior probability equals the likelihood times the prior divided by the normalization constant.

The Posterior, or response variable, is the probability of our hypothesis given the evidence, i.e. the quantity we are trying to predict. The Likelihood, or conditional probability, is the chance of observing new evidence given our initial hypothesis. The Prior, or existing knowledge, is the probability of our hypothesis being correct without any additional information. Finally, the Marginal Likelihood, or normalization constant, is the absolute probability of observing the Evidence.[ii] When we employ Naïve Bayes, we assume that the variables we are studying are conditionally independent given the class.
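These four quantities plug directly into Bayes’ rule. A minimal sketch, using hypothetical numbers for a weather example (the priors and likelihoods here are invented for illustration):

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes' rule: posterior = likelihood * prior / marginal likelihood."""
    return likelihood * prior / evidence

# Hypothetical numbers: P(rain) = 0.2, P(clouds | rain) = 0.9, P(clouds) = 0.4
posterior = bayes_posterior(prior=0.2, likelihood=0.9, evidence=0.4)
print(round(posterior, 2))  # 0.45: P(rain | clouds)
```

Note that the marginal likelihood simply rescales the numerator so that the posteriors over all hypotheses sum to one.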

How are Bayesian methods used in Data Science? Bayesian inference can be used profitably in binary or multiclass classification problems whenever the amount of data to model is moderate, incomplete and/or uncertain. Bayes’ classifiers require relatively few computational resources and perform well even with large data sets or high-dimensional data points. These classifiers have been especially popular in text analytics, where they are frequently used to tackle the challenges of natural language processing, text classification, and spam detection. More generally, Bayes’ algorithms can be deployed to predict the probability of response variables given a new set of attributes. Finally, Bayes’ theorem can be used to calibrate expert opinions and advice, combining human judgment with machine learning.
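The spam-detection use case can be sketched with a tiny Naive Bayes classifier built from scratch. The four training messages below are invented for illustration; the mechanics (class priors, per-word likelihoods with Laplace smoothing, log-space scoring) are the standard multinomial Naive Bayes recipe:

```python
import math
from collections import Counter

# Toy training set (hypothetical messages): (label, text)
train = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham",  "meeting at noon"),
    ("ham",  "project meeting tomorrow"),
]

# Count class frequencies (priors) and per-class word frequencies (likelihoods)
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for label, text in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Score each class: log P(class) + sum of log P(word | class), Laplace-smoothed."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("free money"))    # spam
print(predict("team meeting"))  # ham
```

Working in log space avoids numerical underflow when multiplying many small probabilities, which is why production implementations do the same.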

What precautions should we take when using Bayes’ theorem? Because Naive Bayes assumes the conditional independence of the input features, it cannot be used to detect interactions between features. Bayes’ rule also assumes that the variables follow recognizable distributions over the model parameters: Gaussian for continuous variables, Bernoulli or multinomial for discrete ones. Finally, Bayesian logic is meaningful only when associated with prior knowledge; the goal is to solve a specified “learning problem” rather than explore the higher-level “problems of learning”.
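The interaction blind spot is easy to demonstrate on a contrived XOR data set, where the class depends entirely on the interaction between two features and not on either feature alone. Because Naive Bayes only looks at per-feature conditionals, both features appear completely uninformative:

```python
# XOR-labelled toy data: the class depends on the *interaction* of x1 and x2
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)] * 25

def conditional(i, value, y):
    """Empirical P(x_i = value | class = y): all Naive Bayes ever sees."""
    in_class = [x for x, label in data if label == y]
    return sum(1 for x in in_class if x[i] == value) / len(in_class)

# Each feature alone carries no signal: P(x1=1 | y=0) == P(x1=1 | y=1) == 0.5
print(conditional(0, 1, 0), conditional(0, 1, 1))  # 0.5 0.5
```

A model that captures joint feature behavior (a decision tree, for example) separates this data perfectly, while Naive Bayes can do no better than chance.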

“It’s not what you know that gets you into trouble, it’s what you know for sure that just isn’t so.” – Josh Billings

Bayesian logic offers Data Scientists more than just an algorithm; it provides a mindset for thinking about Data Science problems. Before crunching the numbers, we would be wise to examine all the pertinent evidence (prior probabilities), to test our vision of the problem against competing visions (conditional probability), and to continually update our predictions based on new evidence (weighted probability).[iii] In the case that introduced this post on predicting exceptional talent, we need to carefully consider the prior of how few exceptional data scientists there really are (one in a thousand). Even if the test correctly identifies 99% of the top talent, it incorrectly qualifies roughly 10 candidates for every correct prediction. Under such conditions, even though you’ve passed this imaginary test, there is only a 9% chance that you are truly exceptional. Keep working: practice makes perfect!
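The talent-test arithmetic follows directly from Bayes’ rule. A short sketch, assuming (as the footnotes suggest) a 1% error rate in both directions:

```python
prior = 0.001          # one data scientist in a thousand is truly exceptional
sensitivity = 0.99     # the test flags 99% of the truly exceptional
false_positive = 0.01  # assumed: it also wrongly flags 1% of everyone else

# Marginal likelihood: total probability of passing the test
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: probability of being exceptional, given that you passed
posterior = sensitivity * prior / evidence
print(round(posterior, 2))  # 0.09: only a 9% chance, despite "acing" the test
```

The tiny prior dominates the result: with 999 ordinary candidates for every exceptional one, even a 1% false-positive rate produces about ten false alarms per genuine hit.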

The practice of business analytics is the heart and soul of the Business Analytics Institute. In our Summer School in Bayonne, as well as in our Master Classes in Europe, the Business Analytics Institute focuses on digital economics, data-driven decision making, machine learning, and visual communications to put analytics to work for you and your organization.

Lee Schlenker is a Professor and a Principal in the Business Analytics Institute. His LinkedIn profile can be viewed at You can follow us on Twitter at


[i] One percent false negatives; the closing calculation also assumes a one percent false-positive rate.

[ii] Soni, D. (2018). What is Bayes Rule?, Towards Data Science

[iii] Galef, J. (2015). A Visual Guide to Bayesian Thinking
