Shark Attack - explaining the use of Poisson regression
Nothing wreaks havoc on an idyllic summer swim off the Florida Keys quite like a shark attack. Images of an imposing dorsal fin, large “white” teeth and generalized panic aside, how much have shark attacks over the last thirty years influenced the influx of residents on the Florida shoreline? Where would you look for the data? More importantly, which algorithm would a Data Scientist choose to predict the results, and why?
The Florida Museum of Natural History maintains the International Shark Attack File, a record of unprovoked shark attacks (and the resulting fatalities) worldwide dating back to 1957.[i] Using this data, we can use Poisson regression to predict changes in the coastal population by examining patterns in the attack rates, rather than relying on the fishy assumption that changes in population are directly tied to the attack counts. Puns aside, let’s examine what Poisson regression is, why this method is the logical pick here, what its assumptions are, and what precautions we should take when relying on this methodology.
Poisson regression is a generalized linear model form of regression analysis for count data. It takes its name from the Poisson distribution, which Siméon Denis Poisson introduced in 1837 in his work on the probability of judgments in criminal and civil matters, including wrongful convictions. Poisson regression, also referred to as a log-linear model when working with categorical data, is now available in most analytical packages and is recommended wherever you need to model count data or contingency tables. The model could be used profitably, for example, to predict retweets of Twitter data, failures of nuclear plants under various operating conditions, or exam success rates among identifiable groups of students.
What are the assumptions of this model? Poisson regression is similar to logistic regression in that it also has a discrete response variable. The model assumes that the logarithm of the response variable's expected value can be modeled by a linear combination of unknown parameters. The recorded events are assumed to occur at a constant average rate and independently of the time since previous events. More importantly, the model assumes that the response variable has a Poisson, rather than normal, distribution, which implies that its variance equals its mean. In other words, the possible values of the response variable are nonnegative integers such as 0, 1, 2, 3, etc.
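To make these assumptions concrete, here is a minimal sketch in plain Python. The function names `poisson_pmf` and `expected_count` and the coefficient values are illustrative assumptions, not part of any particular package. It maps a linear predictor to an expected count through the log link, then checks numerically that the resulting Poisson distribution has a mean equal to its variance.

```python
import math


def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson variable with mean mu, computed on the log scale."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def expected_count(intercept, slope, x):
    """Log link: log(mu) = intercept + slope * x, hence mu = exp(intercept + slope * x)."""
    return math.exp(intercept + slope * x)


# Illustrative (hypothetical) coefficients and predictor value.
mu_demo = expected_count(intercept=0.5, slope=0.2, x=3.0)

# Probabilities for counts 0..59; the tail beyond that is negligible for this mu.
probs_demo = [poisson_pmf(k, mu_demo) for k in range(60)]
mean_demo = sum(k * p for k, p in enumerate(probs_demo))
var_demo = sum((k - mean_demo) ** 2 * p for k, p in enumerate(probs_demo))

print(f"mu={mu_demo:.4f} total={sum(probs_demo):.6f} "
      f"mean={mean_demo:.4f} variance={var_demo:.4f}")
```

Because the expected count is the exponential of the linear predictor, it can never be negative, which is one reason the log link suits count data.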
What are the use cases for Poisson regression? The model can be used profitably for stochastic processes in which the observable events occur randomly over time. The probability of an occurrence in a short time interval should be proportional to the interval’s length, and there should be little, if any, chance that two or more occurrences of the event fall in the same very short interval. Accepting these assumptions, we can argue that the number of occurrences of the event in a fixed interval of time (or distance, area, or volume) follows a Poisson distribution.
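A small simulation can illustrate this argument. The sketch below assumes events arrive with independent, exponentially distributed waiting times (one standard way to generate such a process) and counts how many fall into each unit-length interval. The function name `simulate_interval_counts`, the rate, and the seed are arbitrary choices made for illustration.

```python
import random


def simulate_interval_counts(rate, n_intervals, seed=0):
    """Count events per unit interval for arrivals with exponential waiting times."""
    rng = random.Random(seed)
    counts = [0] * n_intervals
    t = rng.expovariate(rate)  # time of the first event
    while t < n_intervals:
        counts[int(t)] += 1  # the event falls in the interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)  # independent waiting time until the next event
    return counts


sim_counts = simulate_interval_counts(rate=3.0, n_intervals=20000)
sim_mean = sum(sim_counts) / len(sim_counts)
sim_var = sum((c - sim_mean) ** 2 for c in sim_counts) / (len(sim_counts) - 1)

print(f"mean count per interval: {sim_mean:.3f}")
print(f"variance of counts:      {sim_var:.3f}")
```

If the process behaves as assumed, the sample mean and the sample variance of the counts should both land close to the chosen rate.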
What precautions should we take when using the Poisson regression model? Three potential characteristics of the data come to mind. As a Data Scientist, you should be wary of heterogeneity in the data: is there more than one process generating the observed counts? Overdispersion is a second anomaly that needs to be accounted for: is the variance of the observed counts larger than the model's assumptions would lead you to expect? Finally, does the data sample reflect underdispersion, i.e. does the data exhibit less variation, for example due to autocorrelation between adjacent subgroups, than one would expect under a Poisson distribution?
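One common way to screen for over- or underdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below is a minimal illustration; the helper name `pearson_dispersion` and the observed and fitted values are made up for the example and do not come from the shark attack data.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by the residual degrees of freedom.

    Under a well-specified Poisson model the variance equals the mean,
    so each term (y - mu)^2 / mu should average roughly 1.
    """
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / (len(observed) - n_params)


# Hypothetical observed counts and fitted means from a two-parameter model.
observed_demo = [3, 9, 2, 14, 5, 18, 4, 21]
fitted_demo = [5.0, 6.0, 7.0, 8.5, 10.0, 12.0, 14.0, 16.5]

ratio_demo = pearson_dispersion(observed_demo, fitted_demo, n_params=2)
print(f"dispersion ratio: {ratio_demo:.2f}")
```

A ratio well above 1 hints at overdispersion and a ratio well below 1 at underdispersion. What counts as "well above" is a judgment call, and a quasi-Poisson or negative binomial model is a frequently used alternative when overdispersion is present.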
In the suggested case of shark attacks, the choice of the Poisson model is justified by the relative rarity of such horrid events. As Jeffrey Simonoff suggests,[ii] we can fit a generalized linear model that assumes a Poisson distribution and uses a log link to connect the Poisson mean μ with the change over time. Dividing the fitted means by population size yields a model for rates. To model the attack rate rather than the raw count, we can introduce an offset term, the log of the population size, in the regression equation. As a result, the predicted time trend follows an exponential rather than a linear pattern, due to the use of the log link.
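The offset idea can be sketched in a few lines of plain Python. The model below is log(μ) = log(population) + b0 + b1 · year, fitted by Newton-Raphson on the Poisson log-likelihood. Everything here is an assumption made for illustration: the function name `fit_poisson_rate_trend`, the restriction to a single predictor, and above all the numbers, which are invented and are not figures from the International Shark Attack File. In practice you would more likely rely on a GLM routine from a statistics package.

```python
import math


def fit_poisson_rate_trend(counts, exposure, t, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = log(exposure_i) + b0 + b1 * t_i by Newton-Raphson."""
    # Start from the overall rate and no trend.
    b0 = math.log(sum(counts) / sum(exposure))
    b1 = 0.0
    for _ in range(max_iter):
        mu = [e * math.exp(b0 + b1 * ti) for e, ti in zip(exposure, t)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(counts, mu))
        g1 = sum((y - m) * ti for y, m, ti in zip(counts, mu, t))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(m * ti for m, ti in zip(mu, t))
        h11 = sum(m * ti * ti for m, ti in zip(mu, t))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical data: year index, coastal population (millions), attack counts.
years = list(range(10))
population = [12.0, 12.4, 12.9, 13.3, 13.8, 14.2, 14.7, 15.1, 15.6, 16.0]
attacks = [9, 11, 10, 13, 12, 15, 17, 16, 20, 21]

b0_hat, b1_hat = fit_poisson_rate_trend(attacks, population, years)
fitted_rates = [math.exp(b0_hat + b1_hat * ti) for ti in years]

print(f"b0={b0_hat:.4f} b1={b1_hat:.4f}")
print(f"estimated yearly growth factor of the rate: {math.exp(b1_hat):.4f}")
```

Because the offset enters with a fixed coefficient of 1, exp(b0 + b1 · year) is the fitted rate per million residents, and exp(b1) is the multiplicative change in that rate from one year to the next. That is the exponential time trend described above.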
The practice of business analytics is the heart and soul of the Business Analytics Institute. In our Summer School in Bayonne, as well as in our Master Classes in Europe, the Business Analytics Institute focuses on digital economics, data-driven decision making, machine learning, and visual communications to put analytics to work for you and your organization.
Lee Schlenker is a Professor and a Principal in the Business Analytics Institute http://baieurope.com. His LinkedIn profile can be viewed at www.linkedin.com/in/leeschlenker. You can follow us on Twitter at https://twitter.com/DSign4Analytics
[ii] Simonoff, Jeffrey, 2003, Analyzing Categorical Data, Springer