Change-point analysis with MCMC

Data: Time series of mining accidents
Techniques: Bayesian analysis, MCMC


Mining disasters

This project uses data from a time series of recorded coal mining disasters in the UK from 1851 to 1962. Sometime during this period, new legislation was passed that changed safety regulations. The number of disasters is thought to have gone down because of these changes, but we don’t know for sure when this change-point (switchpoint) occurred.

Here I go through a tutorial to estimate when the change-point occurred, and the rate of disasters before and after the change-point, using PYMC3. There will be a lot of annotations to explain what the code is doing.

%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pymc3 import DiscreteUniform, Poisson, Exponential
from pymc3.math import switch
from pymc3 import Metropolis, NUTS, sample, Model, traceplot
from pymc3 import summary
sns.set_style("whitegrid")
sns.set_context("poster")
# The data
disaster_data = np.ma.masked_values([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                            3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                            2, 2, 3, 4, 2, 1, 3, -999, 2, 1, 1, 1, 1, 3, 0, 0,
                            1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                            0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                            3, 3, 1, -999, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                            0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], value=-999)
year = np.arange(1851, 1962)

plt.plot(year, disaster_data, 'o', markersize=8);
plt.ylabel("Disaster count")
plt.xlabel("Year")

png

The plot of disasters over time seems to show that the change-point occurred sometime in the early 1900s.

Specifying the model

We can model the occurrences of disasters with a Poisson, with an early rate for the early part of the time series, and a later (smaller) rate for the later part.

For the year in which the switchpoint occurred, we are just using a uniform distribution from the beginning to the end of the period (this is a rather “diffuse” prior that does not give the model much information on when the switchpoint is thought to have occurred).

# Specifying the model
with Model() as disaster_model:
    
    # Discrete uniform prior for the switchpoint
    switchpoint = DiscreteUniform('switchpoint', lower=year.min(), upper=year.max(), testval=1900)

    # Exponential priors for pre- and post-switch rates
    early_rate = Exponential('early_rate', 1)
    late_rate = Exponential('late_rate', 1)
    
    # <switch> in PYMC3 works like an if statement- it uses the first argument (switchpoint) to switch between the next two arguments.
    # Allocate appropriate Poisson rates to years before and after the current
    rate = switch(switchpoint >= year, early_rate, late_rate) 
    disasters = Poisson('disasters', rate, observed=disaster_data)
Applied log-transform to early_rate and added transformed early_rate_log_ to model.
Applied log-transform to late_rate and added transformed late_rate_log_ to model.

The Model() object is a container for the model random variables.

All statements indented after the ‘with’ statement specify the components of the model.

We have several stochastic variables here, with Poisson, discrete-uniform, and exponential priors. They are stochastic because their values are partly random.

We also have a deterministic random variable (rate). “Deterministic” (as opposed to stochastic) means that its value is completely determined by its parents’ values.

Obtain posterior estimates

with disaster_model:
    trace = sample(10000) # draw 10000 posterior samples. 
Assigned Metropolis to switchpoint
Assigned NUTS to early_rate_log_
Assigned NUTS to late_rate_log_
Assigned Metropolis to disasters_missing
 [-----------------100%-----------------] 10000 of 10000 complete in 6.1 sec

The ‘sample’ function runs the step method (a particular MCMC sampling algorithm). These can be assigned manually or assigned automatically by PYMC3.

-Binary variables will be assigned to BinaryMetropolis

-Discrete variables will be assigned to Metropolis

-Continuous variables will be assigned to NUTS

It returns a trace object, containing the samples collected.

Posterior analysis

Let’s look at the output from the sampling.

The left column: a smoothed histogram (using kernel density estimation) of the marginal posteriors of each stochastic random variable. The right column: the samples of the Markov chain plotted in order.

# Look at the posterior plot
traceplot(trace);

png

There is about a 10 year span that’s credible for our switchpoint, though it looks like most of the probability mass is over a 5 year span around the early 1890s- this is our interval estimate of when the switchpoint occurred.

Posterior statistics

We can also get an output of some common posterior statistics.

summary(trace)
switchpoint:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  1889.726         2.428            0.097            [1884.000, 1894.000]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  1886.000       1888.000       1890.000       1891.000       1896.000


disasters_missing:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  2.075            1.819            0.081            [0.000, 6.000]
  0.921            0.939            0.026            [0.000, 3.000]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.000          1.000          2.000          3.000          6.000
  0.000          0.000          1.000          1.000          3.000


early_rate:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  3.093            0.287            0.005            [2.550, 3.651]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  2.561          2.895          3.085          3.280          3.667


late_rate:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.930            0.119            0.002            [0.696, 1.158]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.715          0.848          0.924          1.005          1.180

We estimate that the switchpoint occurred between 1884 and 1894- this is the highest posterior density interval.

The early rate of disasters was about 3.09/year. It went down to about .93/year after the switchpoint.