Chapter 2

Original content created by Cam Davidson-Pilon

Ported to Python 3 and PyMC3 by Max Margenot (@clean_utensils) and Thomas Wiecki (@twiecki) at Quantopian (@quantopian)

Ported to Julia and Turing by Abid


This chapter introduces more Turing syntax and variables, as well as ways to think about how to model a system from a Bayesian perspective. It also contains tips and data visualization techniques for assessing goodness-of-fit for your Bayesian model.

A little more on Turing

Model Context

In Turing, we typically define all the variables we want in our model within the body of the @model macro.

In [6]:
using Turing, Distributions
@model function demo_model(data_generator)
    λ              ~ Exponential(1)
    data_generator ~ Poisson(λ)
    data_plus_one  = data_generator + 1
    return λ,data_generator,data_plus_one
end
Out[6]:
demo_model (generic function with 1 method)

This is an extra layer of convenience compared to PyMC. Any variables created within a given model's body are automatically assigned to that model. If you try to define a model variable (with ~) outside of a @model function, you will get an error.

All of the model's variables must be defined inside the @model macro. We can examine the variables outside of the model once they have been defined, but any additional variables that the model should recognize have to be defined within its body.

Each variable in a model is identified by the name on the left-hand side of its ~ statement (we will cover this further in the variables section). To change a model that shares a name with one we have defined previously, we simply rewrite the @model function.

In [2]:
@model function demo_model(data_generator)
    θ              ~ Exponential(2)
    data_generator ~ Poisson(θ)
    data_plus_one  = data_generator + 1
    return data_plus_one
end
Out[2]:
demo_model (generic function with 1 method)

We can also define an entirely separate model. Note that we are free to name our models whatever we like, so if we do not want to overwrite an old model we need only make another.

In [3]:
@model function ab_testing(p_A,p_B)
    p_A  ~ Uniform(0,1)
    p_B  ~ Uniform(0,1)
end
Out[3]:
ab_testing (generic function with 1 method)

Turing Variables

Turing models allow us to sample from the prior by passing missing as the data argument:

In [16]:
test_value = demo_model(missing)
λ_sample, data_sample, data_plus_one_sample = test_value()   # one call runs the model and samples every variable
println( "λ sample value is                 = ", λ_sample)
println( "data_generator sample value is    = ", data_sample)
println( "data_plus_one sample value is     = ", data_plus_one_sample)
λ sample value is                 = 0.2183628417498831
data_generator sample value is    = 0
data_plus_one sample value is     = 1

Turing is concerned with two types of programming variables: stochastic and deterministic.

  • stochastic variables are variables that are not deterministic, i.e., even if you knew all the values of the variables' parameters and components, it would still be random. Included in this category are instances of the Poisson, DiscreteUniform, and Exponential distributions.

  • deterministic variables are variables that are not random if the variables' parameters and components were known. This might be confusing at first: a quick mental check is if I knew all of variable foo's component variables, I could determine what foo's value is.

We will detail each below.

Initializing Stochastic variables

Initializing a stochastic, or random, variable requires a variable name on the left of ~ and a distribution with its class-specific parameters on the right. For example:

                  λ ~ Uniform(a, b)

where a, b are the Uniform-specific lower and upper bounds on the random variable. The Distributions package documents the specific parameters of each distribution. (Or use ? in the Julia REPL to look them up.)

For multivariable problems, rather than creating a Julia array of separate stochastic variables, the filldist function creates a single multivariate array of (independent) stochastic variables.

The filldist function also solves the annoying case where you may have many variables $\beta_i, \; i = 1,...,N$ you wish to model. Instead of creating arbitrary names and variables for each one, like:

beta_1 ~ Uniform(0, 1)
beta_2 ~ Uniform(0, 1)
...

we can instead wrap them into a single variable:

betas ~ filldist(Uniform(0, 1), N)
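
As a minimal sketch of how this looks inside a model (the model name many_betas and its argument N are illustrative, not part of the examples above):

using Turing, Distributions

@model function many_betas(N)
    # betas is a single length-N vector of i.i.d. Uniform(0,1) variables
    betas ~ filldist(Uniform(0, 1), N)
    return betas
end

many_betas(3)()   # draw one prior sample of the three betas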

Deterministic variables

The use of the deterministic variable was seen in the previous chapter's text-message example. Recall the model for $\lambda$ looked like:

$$ \lambda = \begin{cases}\lambda_1 & \text{if } t \lt \tau \cr \lambda_2 & \text{if } t \ge \tau \end{cases} $$

Both stochastic and deterministic Turing variables have to be defined in the same @model macro, and inside the model you can use normal Julia control flow, for example:

                   λ = (τ > idx ? λ₁ : λ₂)
In [43]:
@model function model(y)
    λ₁ ~ Exponential(1)
    λ₂ ~ Exponential(1)
    τ  ~ DiscreteUniform(0,10)
    
    for idx in eachindex(y)
        λ = τ > idx ? λ₁ : λ₂  
        y[idx] ~ Poisson(λ) 
    end 
    return λ₁,λ₂,τ
end
Out[43]:
model (generic function with 1 method)

Modeling approaches

A good starting point for Bayesian modeling is to think about how your data might have been generated. Put yourself in an omniscient position, and try to imagine how you would recreate the dataset.

In the last chapter we investigated text message data. We begin by asking how our observations may have been generated:

  1. We started by thinking "what is the best random variable to describe this count data?" A Poisson random variable is a good candidate because it can represent count data. So we model the number of sms's received as sampled from a Poisson distribution.

  2. Next, we think, "Ok, assuming sms's are Poisson-distributed, what do I need for the Poisson distribution?" Well, the Poisson distribution has a parameter $\lambda$.

  3. Do we know $\lambda$? No. In fact, we have a suspicion that there are two $\lambda$ values, one for the earlier behaviour and one for the later behaviour. We don't know when the behaviour switches, though; call the switchpoint $\tau$.

  4. What is a good distribution for the two $\lambda$s? The exponential is good, as it assigns probabilities to positive real numbers. Well the exponential distribution has a parameter too, call it $\alpha$.

  5. Do we know what the parameter $\alpha$ might be? No. At this point, we could continue and assign a distribution to $\alpha$, but it's better to stop once we reach a set level of ignorance: whereas we have a prior belief about $\lambda$, ("it probably changes over time", "it's likely between 10 and 30", etc.), we don't really have any strong beliefs about $\alpha$. So it's best to stop here.

    What is a good value for $\alpha$ then? We think that the $\lambda$s are between 10 and 30, so an $\alpha$ that puts most of the prior mass on much higher or much lower values would not reflect this belief well. A good way for $\alpha$ to reflect our belief is to set it so that the mean of $\lambda$, given $\alpha$, is equal to our observed mean, as was shown in the last chapter (a small sketch of this follows the list below).

  6. We have no expert opinion of when $\tau$ might have occurred. So we will suppose $\tau$ is from a discrete uniform distribution over the entire timespan.
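
As a quick sketch of the heuristic in step 5 (the count_data vector below is a hypothetical stand-in for the observed counts from the last chapter): Distributions.jl parameterizes Exponential(θ) by its mean θ, so matching the prior mean of $\lambda$ to the sample mean is direct.

using Distributions, Statistics

count_data = [13, 24, 8, 24, 7, 35, 14, 11]   # hypothetical stand-in for the observed counts
α = mean(count_data)                          # Exponential(α) in Distributions.jl has mean α
prior_λ = Exponential(α)
mean(prior_λ)                                 # equals the observed mean by construction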

Below we give a graphical visualization of this, where arrows denote parent-child relationships (figure produced with the Daft Python library).

Turing, PyMC3, and other probabilistic programming languages have been designed to tell these data-generation stories. More generally, B. Cronin writes [5]:

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories - thus the unreasonable power of the anecdote to drive decision-making, well-founded or not. But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options.

Same story; different ending.

Interestingly, we can create new datasets by retelling the story. For example, if we reverse the above steps, we can simulate a possible realization of the dataset.

1. Specify when the user's behaviour switches by sampling from $\text{DiscreteUniform}(0, 80)$:

In [244]:
τ = rand(DiscreteUniform(0,80))
Out[244]:
42

2. Draw $\lambda_1$ and $\lambda_2$ from an $\text{Exp}(\alpha)$ distribution:

In [258]:
α = 20
λ₁,λ₂ = rand(Exponential(α),2)
println(λ₁," ", λ₂)
33.347313344745956 17.372703994390484

3. For days before $\tau$, represent the user's received SMS count by sampling from $\text{Poi}(\lambda_1)$, and sample from $\text{Poi}(\lambda_2)$ for days after $\tau$. For example:

In [259]:
data = vcat(rand(Poisson(λ₁),τ),rand(Poisson(λ₂),80-τ))
Out[259]:
80-element Vector{Int64}:
 32
 29
 34
 37
 33
 25
 29
 30
 35
 25
 24
 30
 42
  ⋮
 15
 15
 18
 22
 15
 19
 15
 15
 11
 20
 18
 14

4. Plot the artificial dataset:

In [260]:
using Plots,StatsPlots,PlotThemes,Logging
theme(:ggplot2)
disable_logging(Logging.Warn)

n_count_data = length(data)
x = collect(1:80)
y = data

cats = [x >= τ ? "a" : "b" for x in x]
colorMap = Dict("a"=>"red","b"=>"blue")
labelMap = Dict("a"=>"After Change","b"=>"Before Change")
colours = [colorMap[i] for i in cats]
labels = [labelMap[i] for i in cats]

bar(x,y,colour=colours,group=labels, 
     legend=:topright,
     title  = "Artificial dataset",
     ylabel = "count of text-msgs received",
     xlabel = ("Time (days)"),
     xlim   = (0,n_count_data),
     size   = (800, 450),
     xticks = 0:10:80,
     yticks = 0:10:70,
     titlefontsize  = 10,
     yguidefontsize = 8,
     xguidefontsize = 8)
Out[260]:

It is okay that our fictional dataset does not look like our observed dataset: the probability that it would is incredibly small. Turing's engine is designed to find good parameters, $\lambda_i, \tau$, that maximize this probability.

The ability to generate artificial datasets is an interesting side effect of our modeling, and we will see that this ability is a very important method of Bayesian inference. We produce a few more datasets below:

In [283]:
function makePlots()
    τ = rand(DiscreteUniform(0,80))
    α = 20
    λ₁,λ₂ = rand(Exponential(α),2)
    data = vcat(rand(Poisson(λ₁),τ),rand(Poisson(λ₂),80-τ))
    n_count_data = length(data)
    x = collect(1:80)
    y = data
    cats = [x >= τ ? "a" : "b" for x in x]
    colorMap = Dict("a"=>"red","b"=>"blue")
    labelMap = Dict("a"=>"After Change","b"=>"Before Change")
    colours = [colorMap[i] for i in cats]
    labels = [labelMap[i] for i in cats]

     p =   bar(x,y,colour=colours,group=labels, 
                 legend=:topright,
                 title  = "Artificial dataset",
                 ylabel = "count of text-msgs received",
                 xlabel = ("Time (days)"),
                 xlim   = (0,n_count_data),
                 size   = (800, 200),
                 xticks = 0:10:80,
                 yticks = 0:10:70,
                 titlefontsize  = 10,
                 yguidefontsize = 8,
                 xguidefontsize = 8)
    return p
end

for i in 1:4
    display(makePlots())
end

Later we will see how we use this to make predictions and test the appropriateness of our models.

Example: Bayesian A/B testing

A/B testing is a statistical design pattern for determining the difference of effectiveness between two different treatments. For example, a pharmaceutical company is interested in the effectiveness of drug A vs drug B. The company will test drug A on some fraction of their trials, and drug B on the other fraction (this fraction is often 1/2, but we will relax this assumption). After performing enough trials, the in-house statisticians sift through the data to determine which drug yielded better results.

Similarly, front-end web developers are interested in which design of their website yields more sales or some other metric of interest. They will route some fraction of visitors to site A, and the other fraction to site B, and record if the visit yielded a sale or not. The data is recorded (in real-time), and analyzed afterwards.

Often, the post-experiment analysis is done using a hypothesis test, like a difference-of-means test or a difference-of-proportions test. This involves often-misunderstood quantities like a "Z-score" and even more confusing "p-values" (please don't ask). If you have taken a statistics course, you have probably been taught this technique (though not necessarily learned this technique). And if you were like me, you may have felt uncomfortable with their derivation -- good: the Bayesian approach to this problem is much more natural.

A Simple Case

As this is a hacker book, we'll continue with the web-dev example. For the moment, we will focus on the analysis of site A only. Assume that there is some true probability, $0 \lt p_A \lt 1$, that users who are shown site A eventually purchase from it. This is the true effectiveness of site A. Currently, this quantity is unknown to us.

Suppose site A was shown to $N$ people, and $n$ people purchased from the site. One might conclude hastily that $p_A = \frac{n}{N}$. Unfortunately, the observed frequency $\frac{n}{N}$ does not necessarily equal $p_A$ -- there is a difference between the observed frequency and the true frequency of an event. The true frequency can be interpreted as the probability of an event occurring. For example, the true frequency of rolling a 1 on a 6-sided die is $\frac{1}{6}$. Knowing the true frequency of events like:

  • fraction of users who make purchases,
  • frequency of social attributes,
  • percent of internet users with cats etc.

are common requests we ask of Nature. Unfortunately, often Nature hides the true frequency from us and we must infer it from observed data.

The observed frequency is then the frequency we observe: say rolling the die 100 times you may observe 20 rolls of 1. The observed frequency, 0.2, differs from the true frequency, $\frac{1}{6}$. We can use Bayesian statistics to infer probable values of the true frequency using an appropriate prior and observed data.

With respect to our A/B example, we are interested in using what we know, $N$ (the total trials administered) and $n$ (the number of conversions), to estimate what $p_A$, the true frequency of buyers, might be.

To set up a Bayesian model, we need to assign prior distributions to our unknown quantities. A priori, what do we think $p_A$ might be? For this example, we have no strong conviction about $p_A$, so for now, let's assume $p_A$ is uniform over [0,1].

Had we had stronger beliefs, we could have expressed them in this prior choice.
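
For instance, a sketch of one way such a belief could be encoded (the Beta(2, 20) choice is purely illustrative and is not used in the model below):

using Distributions

# A Beta(2, 20) prior concentrates its mass on small conversion rates,
# encoding a belief that p_A is probably well below 0.2.
stronger_prior = Beta(2, 20)
mean(stronger_prior)   # ≈ 0.09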

For this example, consider $p_A = 0.05$, and $N = 1500$ users shown site A, and we will simulate whether the user made a purchase or not. To simulate this from $N$ trials, we will use a Bernoulli distribution: if $X\ \sim \text{Ber}(p)$, then $X$ is 1 with probability $p$ and 0 with probability $1 - p$. Of course, in practice we do not know $p_A$, but we will use it here to simulate the data.

In [287]:
#set constants
p_true = 0.05  # remember, this is unknown.
N = 1500

# sample N Bernoulli random variables from Ber(0.05).
# each random variable has a 0.05 chance of being a 1.
# this is the data-generation step
occurrences = rand(Bernoulli(p_true),N)

#print(occurrences)
println(sum(occurrences))
65

The observed frequency is:

In [303]:
# mean(occurrences) is equal to n/N.
using Printf
@printf "What is the observed frequency in Group A? %.4f" mean(occurrences)
@printf "   Does this equal the true frequency? %s"  (mean(occurrences) == p_true)
What is the observed frequency in Group A? 0.0433   Does this equal the true frequency? false
In [306]:
@model function model(data)
    pₐ ~ Uniform(0,1)
    #include the observations, which are Bernoulli
    for i in eachindex(data)
        data[i] ~ Bernoulli(pₐ)
    end
    return pₐ
end
Out[306]:
model (generic function with 1 method)
In [307]:
web_model = model(occurrences)
chain     = sample(web_model,MH(),20000)
Out[307]:
Chains MCMC chain (20000×2×1 Array{Float64, 3}):

Iterations        = 1:1:20000
Number of chains  = 1
Samples per chain = 20000
Wall duration     = 2.35 seconds
Compute duration  = 2.35 seconds
parameters        = pₐ
internals         = lp

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64

          pₐ    0.0442    0.0082     0.0001    0.0003   572.9506    1.0037
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

          pₐ    0.0344    0.0405    0.0439    0.0473    0.0540
In [366]:
pₐ_samples = chain[:pₐ]
histogram(pₐ_samples,
          normalize = true,
          bins = 0.02:0.001:0.08 ,
          title = "Posterior distribution of pₐ, the true effectiveness of site A",
          titlefontsize = 10,
          size = (800, 400),
          label= nothing
          )
vline!([0.05],label = "true pₐ (unknown)")
Out[366]:

Our posterior distribution puts most weight near the true value of $p_A$, but also some weights in the tails. This is a measure of how uncertain we should be, given our observations. Try changing the number of observations, N, and observe how the posterior distribution changes.

A and B Together

A similar analysis can be done for site B's response data to determine the analogous $p_B$. But what we are really interested in is the difference between $p_A$ and $p_B$. Let's infer $p_A$, $p_B$, and $\text{delta} = p_A - p_B$, all at once. We can do this using Turing's deterministic (returned) quantities. (We'll assume for this exercise that $p_B = 0.04$, so $\text{delta} = 0.01$, and $N_B = 750$, significantly less than $N_A$; we will simulate site B's data just as we did site A's.)

In [447]:
using Turing, Distributions

#these two quantities are unknown to us.
true_p_A = 0.05
true_p_B = 0.04

#notice the unequal sample sizes -- no problem in Bayesian analysis.
N_A = 1500
N_B = 750

#generate some observations
observations_A = rand(Bernoulli(true_p_A),N_A)
observations_B = rand(Bernoulli(true_p_B),N_B)

println("Obs from Site A: ", observations_A[1:20], "...")
println("Obs from Site B: ", observations_B[1:20], "...")
Obs from Site A: Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]...
Obs from Site B: Bool[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]...
In [448]:
println(mean(observations_A))
println(mean(observations_B))
0.054
0.03866666666666667
In [382]:
# Set up the Turing model. Again assume Uniform priors for p_A and p_B.

@model function model(data_A,data_B)
    p_A ~ Uniform(0,1)
    p_B ~ Uniform(0,1)
    
    # Define the deterministic delta function. This is our unknown of interest.
    delta = p_A-p_B
    
    # Set of observations, in this case we have two observation datasets.
    for i in eachindex(data_A)
        data_A[i] ~ Bernoulli(p_A)
    end
    
    for j in eachindex(data_B)
        data_B[j] ~ Bernoulli(p_B)
    end
    return delta
end
Out[382]:
model (generic function with 2 methods)
In [449]:
# Use the given observation data to fit the model
web_model_AB = model(observations_A,observations_B)

# Sample to find out the posterior
chain     = sample(web_model_AB,MH(),200000)
Out[449]:
Chains MCMC chain (200000×3×1 Array{Float64, 3}):

Iterations        = 1:1:200000
Number of chains  = 1
Samples per chain = 200000
Wall duration     = 5.69 seconds
Compute duration  = 5.69 seconds
parameters        = p_B, p_A
internals         = lp

Summary Statistics
  parameters      mean       std   naive_se      mcse        ess      rhat
      Symbol   Float64   Float64    Float64   Float64    Float64   Float64

         p_A    0.0531    0.0071     0.0000    0.0002   505.8402    1.0012
         p_B    0.0402    0.0099     0.0000    0.0002   649.7590    1.0018
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

         p_A    0.0400    0.0493    0.0533    0.0570    0.0646
         p_B    0.0240    0.0355    0.0404    0.0450    0.0584
In [450]:
# Deterministic variables can be recovered using the `generated_quantities` function in Turing
gen = generated_quantities(web_model_AB, chain)

# Extract the traces
p_A_samples = chain[:p_A]
p_B_samples = chain[:p_B]
delta_samples = gen
Out[450]:
200000×1 Matrix{Float64}:
  0.04127093911604329
  0.04127093911604329
  0.04127093911604329
  0.04127093911604329
 -0.2456667523393874
 -0.12578478784083846
 -0.12578478784083846
 -0.26123447687053214
 -0.26123447687053214
 -0.26123447687053214
 -0.26123447687053214
 -0.26123447687053214
 -0.26123447687053214
  ⋮
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
  0.001402841316159087
In [461]:
p1 =  histogram(p_A_samples,
          normalize = true,
          bins = 0.02:0.002:0.08 ,
          size = (800, 400),
          label= nothing
          )
vline!([true_p_A],label = "true p_A (unknown)")

p2 =  histogram(p_B_samples,
          normalize = true,
          bins = 0.02:0.002:0.08 ,
          size = (800, 400),
          label= nothing
          )
vline!([true_p_B],label = "true p_B (unknown)")

p3 =  histogram(delta_samples,
          normalize = true,
          bins = -0.02:0.005:0.04 ,
          size = (800, 400),
          label= nothing
          )
vline!([true_p_A-true_p_B ],label = "true delta (unknown)")

plot(p1,p2,p3,layout=(3,1))
Out[461]:

Notice that as a result of N_B < N_A, i.e. we have less data from site B, our posterior distribution of $p_B$ is fatter, implying we are less certain about the true value of $p_B$ than we are of $p_A$.

With respect to the posterior distribution of $\text{delta}$, we can see that the majority of the distribution is above $\text{delta}=0$, implying that site A's response is likely better than site B's response. The probability this inference is incorrect is easily computable:

In [464]:
# Count the number of samples less than 0, i.e. the area under the curve
# before 0, which represents the probability that site A is worse than site B.
println("Probability site A is WORSE than site B: ", mean(delta_samples .< 0))
println("Probability site A is BETTER than site B: ",mean(delta_samples .> 0))
Probability site A is WORSE than site B: 0.10462
Probability site A is BETTER than site B: 0.89538

If this probability is too high for comfortable decision-making, we can perform more trials on site B (as site B has fewer samples to begin with, each additional data point for site B contributes more inferential "power" than each additional data point for site A).

Try playing with the parameters true_p_A, true_p_B, N_A, and N_B, to see what the posterior of $\text{delta}$ looks like. Notice in all this, the difference in sample sizes between site A and site B was never mentioned: it naturally fits into Bayesian analysis.

I hope the readers feel this style of A/B testing is more natural than hypothesis testing, which has probably confused more than helped practitioners. Later in this book, we will see two extensions of this model: the first to help dynamically adjust for bad sites, and the second will improve the speed of this computation by reducing the analysis to a single equation.

An algorithm for human deceit

Social data has an additional layer of interest as people are not always honest with responses, which adds a further complication into inference. For example, simply asking individuals "Have you ever cheated on a test?" will surely contain some rate of dishonesty. What you can say for certain is that the true rate is less than your observed rate (assuming individuals lie only about not cheating; I cannot imagine one who would admit "Yes" to cheating when in fact they hadn't cheated).

To present an elegant solution to circumventing this dishonesty problem, and to demonstrate Bayesian modeling, we first need to introduce the binomial distribution.

The Binomial Distribution

The binomial distribution is one of the most popular distributions, mostly because of its simplicity and usefulness. Unlike the other distributions we have encountered thus far in the book, the binomial distribution has 2 parameters: $N$, a positive integer representing $N$ trials or number of instances of potential events, and $p$, the probability of an event occurring in a single trial. Like the Poisson distribution, it is a discrete distribution, but unlike the Poisson distribution, it only assigns mass to integers from $0$ to $N$. The probability mass function looks like:

$$P( X = k ) = {{N}\choose{k}} p^k(1-p)^{N-k}$$

If $X$ is a binomial random variable with parameters $p$ and $N$, denoted $X \sim \text{Bin}(N,p)$, then $X$ is the number of events that occurred in the $N$ trials (obviously $0 \le X \le N$). The larger $p$ is (while still remaining between 0 and 1), the more events are likely to occur. The expected value of a binomial is equal to $Np$. Below we plot the mass probability distribution for varying parameters.
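
As a quick numerical check of the formulas above (the values N = 10 and p = 0.4 are arbitrary choices for this sketch):

using Distributions

d = Binomial(10, 0.4)
pdf(d, 3)    # P(X = 3) = binomial(10,3) * 0.4^3 * 0.6^7 ≈ 0.215
mean(d)      # N * p = 4.0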

In [16]:
using Distributions

x = collect(0:11)
p = [0.1,0.2,0.4,0.6,0.8,0.9]
N = 10

using Plots,Logging
animation = @animate for i in eachindex(p)
    # Plotting
    y = p[i]
    bar(x ,pdf(Binomial(10,y), x),
           bar_width=0.8,
            ylabel = "Probability of k",
            yguidefontsize=6,
            title = "Probability mass function of a Binomial random variable(N=10); differing p values",
            label=" p = $y ",
            size = (800, 400),
            xticks = 0:1:10,
            bar_position = :dodge,
            titlefontsize = 10,
            )
end
disable_logging(Logging.Info)
gif(animation,"tmp.gif",fps = 0.5)
Out[16]:

The special case when $N = 1$ corresponds to the Bernoulli distribution. There is another connection between Bernoulli and Binomial random variables. If we have $X_1, X_2, ... , X_N$ Bernoulli random variables with the same $p$, then $Z = X_1 + X_2 + ... + X_N \sim \text{Binomial}(N, p )$.

The expected value of a Bernoulli random variable is $p$. This can be seen by noting the more general Binomial random variable has expected value $Np$ and setting $N=1$.
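
A small simulation illustrating this connection (p = 0.3 and N = 5 are arbitrary choices for this sketch):

using Distributions, Statistics

p, N = 0.3, 5
# sum N Bernoulli(p) draws many times; the sums are Binomial(N, p) distributed
z = [sum(rand(Bernoulli(p), N)) for _ in 1:100_000]
println(mean(z), " ≈ ", mean(Binomial(N, p)))   # both should be close to N*p = 1.5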

Example: Cheating among students

We will use the binomial distribution to determine the frequency of students cheating during an exam. If we let $N$ be the total number of students who took the exam, and assuming each student is interviewed post-exam (answering without consequence), we will receive integer $X$ "Yes I did cheat" answers. We then find the posterior distribution of $p$, given $N$, some specified prior on $p$, and observed data $X$.

This is a completely absurd model. No student, even with a free-pass against punishment, would admit to cheating. What we need is a better algorithm to ask students if they had cheated. Ideally the algorithm should encourage individuals to be honest while preserving privacy. The following proposed algorithm is a solution I greatly admire for its ingenuity and effectiveness:

In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student (secretly) flips the coin again, and answers "Yes, I did cheat" if the coin flip lands heads, and "No, I did not cheat", if the coin flip lands tails. This way, the interviewer does not know if a "Yes" was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers.

I call this the Privacy Algorithm. One could of course argue that the interviewers are still receiving false data since some Yes's are not confessions but instead randomness, but an alternative perspective is that the researchers are discarding approximately half of their original dataset since half of the responses will be noise. But they have gained a systematic data generation process that can be modeled. Furthermore, they do not have to incorporate (perhaps somewhat naively) the possibility of deceitful answers. We can use Turing to dig through this noisy model, and find a posterior distribution for the true frequency of liars.
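
To make the algorithm concrete, here is a small simulation of the interview process (the true cheating rate of 0.25 and the 100 students are assumptions made only for this sketch, not part of the analysis below):

using Distributions, Statistics

true_cheat_rate = 0.25        # assumed here; unknown to the interviewer
n_students      = 100

first_flip  = rand(Bernoulli(0.5), n_students)            # heads -> answer truthfully
second_flip = rand(Bernoulli(0.5), n_students)            # used only when the first flip is tails
cheated     = rand(Bernoulli(true_cheat_rate), n_students)

# observed answer: the truth if the first flip is heads, otherwise the second flip
answers = [f1 ? c : f2 for (f1, f2, c) in zip(first_flip, second_flip, cheated)]
mean(answers)   # ≈ 0.5 * true_cheat_rate + 0.25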

Suppose 100 students are being surveyed for cheating, and we wish to find $p$, the proportion of cheaters. There are a few ways we can model this in Turing. I'll demonstrate the most explicit way, and later show a simplified version. Both versions arrive at the same inference. In our data-generation model, we sample $p$, the true proportion of cheaters, from a prior. Since we are quite ignorant about $p$, we will assign it a $\text{Uniform}(0,1)$ prior.

We skip directly to the condensed model (the "Alternative PyMC3 Model"); you can refer to the more explicit version in the original PyMC3 notebooks.

Turing Model (Alternative PyMC3 Model)

Given a value for $p$ (which from our god-like position we know), we can find the probability the student will answer yes:

\begin{align} P(\text{"Yes"}) &= P( \text{Heads on first coin} )P( \text{cheater} ) + P( \text{Tails on first coin} )P( \text{Heads on second coin} ) \\\\ &= \frac{1}{2}p + \frac{1}{2}\frac{1}{2}\\\\ &= \frac{p}{2} + \frac{1}{4} \end{align}

Thus, knowing $p$ we know the probability a student will respond "Yes". In Turing, we can compute this "Yes" probability as a deterministic quantity inside the model, given $p$:

In [3]:
using Turing, Distributions,StatsPlots,Plots
In [11]:
N = 100
@model function model(yes_responses,N)
    p                ~ Uniform(0,1)
    p_skewed         = p*(0.5)+0.25
    yes_responses    ~ Binomial(N,p_skewed)
    return p_skewed
end
Out[11]:
model (generic function with 1 method)

This is where we include our observed 35 "Yes" responses. In the function arguments of the model, we pass yes_responses = 35 and N = 100.

In [12]:
trace = sample(model(35,100),MH(), 100000)
Out[12]:
Chains MCMC chain (100000×2×1 Array{Float64, 3}):

Iterations        = 1:1:100000
Number of chains  = 1
Samples per chain = 100000
Wall duration     = 1.95 seconds
Compute duration  = 1.95 seconds
parameters        = p
internals         = lp

Summary Statistics
  parameters      mean       std   naive_se      mcse          ess      rhat
      Symbol   Float64   Float64    Float64   Float64      Float64   Float64

           p    0.2079    0.0908     0.0003    0.0006   21855.8983    1.0001
                                                                1 column omitted

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

           p    0.0409    0.1437    0.2039    0.2691    0.3960
In [2]:
using Plots,StatsPlots,PlotThemes,Logging
theme(:ggplot2)
disable_logging(Logging.Warn)
Out[2]:
LogLevel(1001)
In [26]:
p_samples = trace[:p]

histogram(p_samples,
          normalize = true,
          bins = 0:0.01:0.9 ,
          size = (800, 250),
          label= "posterior distribution"
          )
vline!([0.05,0.35])
Out[26]:

With regards to the above plot, we are still pretty uncertain about what the true frequency of cheaters might be, but we have narrowed it down to a range between 0.05 and 0.35 (marked by the solid lines). This is pretty good, as a priori we had no idea how many students might have cheated (hence the uniform distribution for our prior). On the other hand, it is also pretty bad since there is a 0.3-length window the true value most likely lives in. Have we even gained anything, or are we still too uncertain about the true frequency?

I would argue, yes, we have discovered something. It is implausible, according to our posterior, that there are no cheaters, i.e. the posterior assigns low probability to $p=0$. Since we started with a uniform prior, treating all values of $p$ as equally plausible, but the data has all but ruled out $p=0$, we can be confident that there were cheaters.

This kind of algorithm can be used to gather private information from users and be reasonably confident that the data, though noisy, is truthful.

More Turing Tricks

Example: Challenger Space Shuttle Disaster

On January 28, 1986, the twenty-fifth flight of the U.S. space shuttle program ended in disaster when one of the rocket boosters of the Shuttle Challenger exploded shortly after lift-off, killing all seven crew members. The presidential commission on the accident concluded that it was caused by the failure of an O-ring in a field joint on the rocket booster, and that this failure was due to a faulty design that made the O-ring unacceptably sensitive to a number of factors including outside temperature. Of the previous 24 flights, data were available on failures of O-rings on 23, (one was lost at sea), and these data were discussed on the evening preceding the Challenger launch, but unfortunately only the data corresponding to the 7 flights on which there was a damage incident were considered important and these were thought to show no obvious trend. The data are shown below (see [1]):

In [9]:
using CSV , DataFrames
In [11]:
challenger_data = DataFrame(CSV.File("data/challenger_data.csv"))
Out[11]:

25 rows × 3 columns

 Row │ Date        Temperature  Damage Incident
     │ String      Int64        String
─────┼─────────────────────────────────────────────
   1 │ 04/12/1981  66           0
   2 │ 11/12/1981  70           1
   3 │ 3/22/82     69           0
   4 │ 6/27/82     80           NA
   5 │ 01/11/1982  68           0
   6 │ 04/04/1983  67           0
   7 │ 6/18/83     72           0
   8 │ 8/30/83     73           0
   9 │ 11/28/83    70           0
  10 │ 02/03/1984  57           1
  11 │ 04/06/1984  63           1
  12 │ 8/30/84     70           1
  13 │ 10/05/1984  78           0
  14 │ 11/08/1984  67           0
  15 │ 1/24/85     53           1
  16 │ 04/12/1985  67           0
  17 │ 4/29/85     75           0
  18 │ 6/17/85     70           0
  19 │ 7/29/85     81           0
  20 │ 8/27/85     76           0
  21 │ 10/03/1985  79           0
  22 │ 10/30/85    75           1
  23 │ 11/26/85    76           0
  24 │ 01/12/1986  58           1
  25 │ 1/28/86     31           Challenger Accident
In [35]:
challenger_data_new = challenger_data[ :,2:3]
challenger_data_new = filter(:"Damage Incident" => !=("NA"), challenger_data_new)
challenger_data_new = filter(:"Damage Incident" => !=("Challenger Accident"), challenger_data_new)
challenger_data_new."Damage Incident" = parse.(Float64, challenger_data_new."Damage Incident")
challenger_data_new
Out[35]:

23 rows × 2 columns

 Row │ Temperature  Damage Incident
     │ Int64        Float64
─────┼──────────────────────────────
   1 │ 66           0.0
   2 │ 70           1.0
   3 │ 69           0.0
   4 │ 68           0.0
   5 │ 67           0.0
   6 │ 72           0.0
   7 │ 73           0.0
   8 │ 70           0.0
   9 │ 57           1.0
  10 │ 63           1.0
  11 │ 70           1.0
  12 │ 78           0.0
  13 │ 67           0.0
  14 │ 53           1.0
  15 │ 67           0.0
  16 │ 75           0.0
  17 │ 70           0.0
  18 │ 81           0.0
  19 │ 76           0.0
  20 │ 79           0.0
  21 │ 75           1.0
  22 │ 76           0.0
  23 │ 58           1.0
In [54]:
scatter(challenger_data_new[:,1],challenger_data_new[:,2],
        size = (800,400),
        label = nothing,
        title  = "Defects of the Space Shuttle O-Rings vs temperature",
        ylabel = "Damage Incident?",
        xlabel = "Outside temperature (Fahrenheit)",
        yticks = [0,1],
        titlefontsize  = 10,
        yguidefontsize = 7,
        xguidefontsize = 8
        )
Out[54]:

It looks clear that the probability of damage incidents occurring increases as the outside temperature decreases. We are interested in modeling the probability here because it does not look like there is a strict cutoff point between temperature and a damage incident occurring. The best we can do is ask "At temperature $t$, what is the probability of a damage incident?". The goal of this example is to answer that question.

We need a function of temperature, call it $p(t)$, that is bounded between 0 and 1 (so as to model a probability) and changes from 1 to 0 as we increase temperature. There are actually many such functions, but the most popular choice is the logistic function.

$$p(t) = \frac{1}{ 1 + e^{ \;\beta t } } $$

In this model, $\beta$ is the variable we are uncertain about. Below is the function plotted for $\beta = 1, 3, -5$.

In [357]:
function logistic(x,beta)
    return @. 1.0/(1.0+exp(beta*x))
end

x = LinRange(-4,4,100)

plot(x,logistic(x,1), label = "β = 1",size = (800,200))
plot!(x,logistic(x,3), label = "β = 3")
plot!(x,logistic(x,-5), label = "β = -5")
Out[357]:

But something is missing. In the plot of the logistic function, the probability changes only near zero, but in our data above the probability changes around 65 to 70. We need to add a bias term to our logistic function:

$$p(t) = \frac{1}{ 1 + e^{ \;\beta t + \alpha } } $$

Some plots are below, with differing $\alpha$.

In [364]:
using LinearAlgebra   # provides dot

function logistic_dot(x,beta ,alpha = 0)
    return @. 1.0/(1.0+exp(dot(beta,x) + alpha))
end

plot(x,logistic_dot(x,1), label = "β = 1",size = (800,200),line =  ( :dot,  0.5, 4,:red))
plot!(x,logistic_dot(x,3), label = "β = 3",line =  ( :dot,  0.5, 4,:blue))
plot!(x,logistic_dot(x,-5), label = "β = -5",line =  ( :dot,  0.5, 4,:green))
plot!(x,logistic_dot(x,1,1), label = "β = 1 , α = 1" ,line = (:red))
plot!(x,logistic_dot(x,3,-2), label = "β = 3, α = -2",line = (:blue))
plot!(x,logistic_dot(x,-5,7), label = "β = -5, α = 7",line = (:green))
Out[364]:

Adding a constant term $\alpha$ amounts to shifting the curve left or right (hence why it is called a bias).

Let's start modeling this in Turing. The $\beta, \alpha$ parameters have no reason to be positive, bounded or relatively large, so they are best modeled by a Normal random variable, introduced next.

Normal distributions

A Normal random variable, denoted $X \sim N(\mu, 1/\tau)$, has a distribution with two parameters: the mean, $\mu$, and the precision, $\tau$. Those familiar with the Normal distribution already have probably seen $\sigma^2$ instead of $\tau^{-1}$. They are in fact reciprocals of each other. The change was motivated by simpler mathematical analysis and is an artifact of older Bayesian methods. Just remember: the smaller $\tau$, the larger the spread of the distribution (i.e. we are more uncertain); the larger $\tau$, the tighter the distribution (i.e. we are more certain). Regardless, $\tau$ is always positive.
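
One practical note for the code below: Distributions.jl parameterizes Normal(μ, σ) by the standard deviation σ rather than the precision, so a precision τ corresponds to σ = 1/√τ. A quick check:

using Distributions

τ = 2.8
d = Normal(0, 1/sqrt(τ))   # σ = 1/√τ
var(d)                     # ≈ 1/τ ≈ 0.357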

The probability density function of a $N( \mu, 1/\tau)$ random variable is:

$$ f(x | \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left( -\frac{\tau}{2} (x-\mu)^2 \right) $$

We plot some different density functions below.

In [104]:
x = LinRange(-8, 7, 150)
μ = [-2,0,3]
τ = [0.7, 1, 2.8]
colors = ["#348ABD", "#A60628", "#7A68A6"]

using Plots,Logging
animation = @animate for i in eachindex(μ)
    # Plotting
    a = μ[i]
    b = 1.0/sqrt(τ[i])   # Normal(μ, σ) takes the standard deviation, so σ = 1/√τ
    c = τ[i]
    plot(x ,pdf(Normal(a,b) , x),
            ylabel = "density function at x",
            yguidefontsize=6,
            title = "Probability distribution of three different Normal random variables",
            label= (" μ = $a ,τ = $c "),
            size = (800, 400),
            xlabel = "x",
            xticks = -8:2:8,
            ylims = (0,1.2),
            titlefontsize = 10,
            fillrange = 0,
            fillalpha = 0.5,
            fillcolor = colors[i],
            legend=:topleft
            )
end
disable_logging(Logging.Info)
gif(animation,"tmp.gif",fps = 0.5)
Out[104]:

A Normal random variable can take on any real number, but the variable is very likely to be relatively close to $\mu$. In fact, the expected value of a Normal is equal to its $\mu$ parameter:

$$ E[ X | \mu, \tau] = \mu$$

and its variance is equal to the inverse of $\tau$:

$$Var( X | \mu, \tau ) = \frac{1}{\tau}$$

Below we continue our modeling of the Challenger spacecraft data:

In [404]:
using Turing, Distributions, LinearAlgebra   # LinearAlgebra provides dot, used in the model

temperature = challenger_data_new[:,1]
D = challenger_data_new[:,2]

@model function challenger(data,temperature)
    β ~ Normal(0,10)
    α ~ Normal(0,10)
    p = @. 1/(1+exp(dot(temperature,β)+α))
    for i in eachindex(p)
        data[i] ~ Bernoulli(p[i])
    end
    return p
end
Out[404]:
challenger (generic function with 1 method)
In [405]:
using Optim

# Instantiate the challenger model with our data.
model = challenger(D,temperature)

# Generate a MAP estimate.
map_estimate = optimize(model, MAP())

# Sampling with the MAP/MLE as initial states
trace = sample(model,MH(),1000000, init_theta = map_estimate.values.array)
Out[405]:
Chains MCMC chain (1000000×3×1 Array{Float64, 3}):

Iterations        = 1:1:1000000
Number of chains  = 1
Samples per chain = 1000000
Wall duration     = 15.35 seconds
Compute duration  = 15.35 seconds
parameters        = α, β
internals         = lp

Summary Statistics
  parameters       mean       std   naive_se      mcse         ess      rhat
      Symbol    Float64   Float64    Float64   Float64     Float64   Float64

           β     0.1853    0.0750     0.0001    0.0007   2125.2114    1.0115
           α   -11.8383    5.1210     0.0051    0.0508   2106.6253    1.0099
                                                                1 column omitted

Quantiles
  parameters       2.5%      25.0%      50.0%     75.0%     97.5% 
      Symbol    Float64    Float64    Float64   Float64   Float64 

           β     0.0493     0.1318     0.1861    0.2302    0.3305
           α   -21.8920   -14.9777   -11.7391   -8.1991   -2.6428
In [406]:
β_samples = trace[:β]
α_samples = trace[:α]

 p1 = histogram(β_samples,
          normalize = true,
          bins = 0:0.05:0.6 ,
          size = (800, 300),
          label= "posterior distribution of β" 
          )
 p2 = histogram(α_samples,
          normalize = true,
          bins = -30:5:5,
          size = (800, 400),
          label= "posterior distribution of α" 
          )
plot(p1, p2, layout = (2, 1))
Out[406]:

All samples of $\beta$ are greater than 0. If instead the posterior was centered around 0, we may suspect that $\beta = 0$, implying that temperature has no effect on the probability of defect.

Similarly, all $\alpha$ posterior values are negative and far away from 0, implying that it is correct to believe that $\alpha$ is significantly less than 0.

Regarding the spread of the data, we are very uncertain about what the true parameters might be (though considering the low sample size and the large overlap of defects-to-nondefects this behaviour is perhaps expected).

Next, let's look at the expected probability for a specific value of the temperature. That is, we average over all samples from the posterior to get a likely value for $p(t_i)$.

In [407]:
t = LinRange(minimum(temperature) -5,maximum(temperature)+5,50)

p_t = logistic_dot(transpose(t),β_samples,α_samples)

mean_prob_t = mean(p_t,dims = 1)[:]
Out[407]:
50-element Vector{Float64}:
 0.8945714357940184
 0.8868877794981587
 0.8784767167121309
 0.8692660197930718
 0.8591776858870359
 0.8481281152740635
 0.8360285750904654
 0.8227860526979158
 0.8083046254242187
 0.7924874927837388
 0.7752398276836711
 0.7564725969860879
 0.7361074713706405
 ⋮
 0.0952395680582839
 0.08626398862131693
 0.07825917064592262
 0.07111958283668392
 0.06474946593179307
 0.05906261259812182
 0.05398187887604805
 0.049438543004854854
 0.04537159223762226
 0.04172699171411963
 0.03845697014169672
 0.03551934335591508
In [408]:
plot(t ,mean_prob_t ,label = "Avg. posterior prob of defect",size = (800,400),lw = 3)
scatter!(temperature,D,label = nothing)
plot!(t, p_t[60000,:] ,label="realization from posterior 1",line = (:dot) )
plot!(t, p_t[50000,:] ,label="realization from posterior 2",line = (:dot))
Out[408]:

Above we also plotted two possible realizations of what the actual underlying system might be. Both are equally as likely as any other draw. The blue line is what occurs when we average all of the posterior samples' dotted lines together.

An interesting question to ask is for what temperatures are we most uncertain about the defect-probability? Below we plot the expected value line and the associated 95% intervals for each temperature.

In [409]:
using SciPy

qs = stats.mstats.mquantiles(p_t,[0.025, 0.975], axis=0)
plot(t, qs[1,:], fillrange = qs[2,:], fillalpha = 0.35, c = 1,yticks = 0:0.2:1, 
          label = "Confidence band", legend = :bottomleft,size = (800,400))
plot!(t ,mean_prob_t ,label = "Avg. posterior prob of defect",size = (800,400),lw = 3)
scatter!(temperature,D,label = nothing)
Out[409]:

The 95% credible interval, or 95% CI, shown as the shaded band, represents the interval, for each temperature, that contains 95% of the posterior distribution. For example, at 65 degrees, we can be 95% sure that the probability of defect lies between 0.25 and 0.75.

More generally, we can see that as the temperature nears 60 degrees, the CI's spread out over [0,1] quickly. As we pass 70 degrees, the CI's tighten again. This can give us insight about how to proceed next: we should probably test more O-rings around 60-65 temperature to get a better estimate of probabilities in that range. Similarly, when reporting to scientists your estimates, you should be very cautious about simply telling them the expected probability, as we can see this does not reflect how wide the posterior distribution is.
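
For example, a sketch of how to read the interval at a single temperature straight from the posterior samples (using t and p_t from above, and Statistics.quantile instead of SciPy):

using Statistics

idx = argmin(abs.(t .- 65))                  # grid point closest to 65 degrees
quantile(vec(p_t[:, idx]), [0.025, 0.975])   # 95% credible interval at that temperature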

What about the day of the Challenger disaster?

On the day of the Challenger disaster, the outside temperature was 31 degrees Fahrenheit. What is the posterior distribution of a defect occurring, given this temperature? The distribution is plotted below. It looks almost guaranteed that the Challenger was going to be subject to defective O-rings.

In [410]:
prob_31=logistic_dot(31,β_samples,α_samples)

histogram(prob_31,
          normalize = true,
          bins = 0.995:0.0001:1 ,
          size = (800, 300),
          label= "posterior distribution of β" ,
          legend = :topleft
          )
Out[410]:

Is our model appropriate?

The skeptical reader will say "You deliberately chose the logistic function for $p(t)$ and the specific priors. Perhaps other functions or priors will give different results. How do I know I have chosen a good model?" This is absolutely true. To consider an extreme situation, what if I had chosen the function $p(t) = 1,\; \forall t$, which guarantees a defect always occurring: I would have again predicted disaster on January 28th. Yet this is clearly a poorly chosen model. On the other hand, if I did choose the logistic function for $p(t)$, but specified all my priors to be very tight around 0, likely we would have very different posterior distributions. How do we know our model is an expression of the data? This encourages us to measure the model's goodness of fit.

We can think: how can we test whether our model is a bad fit? An idea is to compare the observed data (which, recall, is a fixed stochastic variable) with an artificial dataset that we can simulate. The rationale is that if the simulated dataset does not appear similar, statistically, to the observed dataset, then our model likely does not accurately represent the observed data.

Previously in this chapter, we simulated artificial datasets for the SMS example. To do this, we sampled values from the priors. We saw how varied the resulting datasets looked, and rarely did they mimic our observed dataset. In the current example, we should sample from the posterior distributions to create very plausible datasets. Luckily, our Bayesian framework makes this very easy.

1.) Create a Vector of missing values for the variable you want to simulate.

2.) Instantiate the model using the missing-data vector.

3.) Use the predict function with the instantiated model and the previously sampled chain (trace).

In [412]:
missing_data = Vector{Missing}(missing, length(D))
test_data    = challenger(missing_data,temperature)
Out[412]:
DynamicPPL.Model{var"#23#24", (:data, :temperature), (), (), Tuple{Vector{Missing}, Vector{Int64}}, Tuple{}}(:challenger, var"#23#24"(), (data = [missing, missing, missing, missing, missing, missing, missing, missing, missing, missing  …  missing, missing, missing, missing, missing, missing, missing, missing, missing, missing], temperature = [66, 70, 69, 68, 67, 72, 73, 70, 57, 63  …  53, 67, 75, 70, 81, 76, 79, 75, 76, 58]), NamedTuple())
In [413]:
predictions  = predict(test_data,trace)
Out[413]:
Chains MCMC chain (1000000×23×1 Array{Float64, 3}):

Iterations        = 1:1:1000000
Number of chains  = 1
Samples per chain = 1000000
parameters        = data[23], data[9], data[2], data[4], data[18], data[11], data[19], data[8], data[13], data[21], data[22], data[5], data[3], data[6], data[7], data[10], data[20], data[1], data[17], data[12], data[16], data[14], data[15]
internals         = 

Summary Statistics
  parameters      mean       std   naive_se      mcse          ess      rhat 
      Symbol   Float64   Float64    Float64   Float64      Float64   Float64 

     data[1]    0.4096    0.4918     0.0005    0.0013   33860.0038    1.0001
     data[2]    0.2564    0.4366     0.0004    0.0010   44665.6982    1.0005
     data[3]    0.2899    0.4537     0.0005    0.0011   42695.7070    1.0004
     data[4]    0.3275    0.4693     0.0005    0.0012   39883.3462    1.0003
     data[5]    0.3668    0.4819     0.0005    0.0012   37196.3287    1.0001
     data[6]    0.1967    0.3975     0.0004    0.0009   43461.1489    1.0006
     data[7]    0.1722    0.3775     0.0004    0.0009   43045.5787    1.0006
     data[8]    0.2553    0.4360     0.0004    0.0010   43692.0089    1.0005
     data[9]    0.7447    0.4360     0.0004    0.0017   15023.4915    1.0004
    data[10]    0.5356    0.4987     0.0005    0.0015   24307.4859    1.0000
    data[11]    0.2553    0.4360     0.0004    0.0010   44336.8862    1.0005
    data[12]    0.0891    0.2849     0.0003    0.0007   34949.2448    1.0005
    data[13]    0.3669    0.4820     0.0005    0.0012   37114.8408    1.0002
    data[14]    0.8299    0.3757     0.0004    0.0016   12404.1132    1.0003
    data[15]    0.3672    0.4820     0.0005    0.0012   37109.4349    1.0001
    data[16]    0.1318    0.3383     0.0003    0.0008   39703.8590    1.0007
    data[17]    0.2560    0.4364     0.0004    0.0010   43767.9185    1.0004
    data[18]    0.0612    0.2396     0.0002    0.0007   31310.9655    1.0004
    data[19]    0.1153    0.3194     0.0003    0.0008   38183.0971    1.0006
    data[20]    0.0781    0.2683     0.0003    0.0007   33979.0180    1.0005
    data[21]    0.1315    0.3380     0.0003    0.0008   40135.4681    1.0007
    data[22]    0.1158    0.3200     0.0003    0.0008   38268.8048    1.0006
    data[23]    0.7173    0.4503     0.0005    0.0017   15988.9886    1.0003

Quantiles
  parameters      2.5%     25.0%     50.0%     75.0%     97.5% 
      Symbol   Float64   Float64   Float64   Float64   Float64 

     data[1]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[2]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[3]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[4]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[5]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[6]    0.0000    0.0000    0.0000    0.0000    1.0000
     data[7]    0.0000    0.0000    0.0000    0.0000    1.0000
     data[8]    0.0000    0.0000    0.0000    1.0000    1.0000
     data[9]    0.0000    0.0000    1.0000    1.0000    1.0000
    data[10]    0.0000    0.0000    1.0000    1.0000    1.0000
    data[11]    0.0000    0.0000    0.0000    1.0000    1.0000
    data[12]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[13]    0.0000    0.0000    0.0000    1.0000    1.0000
    data[14]    0.0000    1.0000    1.0000    1.0000    1.0000
    data[15]    0.0000    0.0000    0.0000    1.0000    1.0000
    data[16]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[17]    0.0000    0.0000    0.0000    1.0000    1.0000
    data[18]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[19]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[20]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[21]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[22]    0.0000    0.0000    0.0000    0.0000    1.0000
    data[23]    0.0000    0.0000    1.0000    1.0000    1.0000

Once you have the simulated data, use predictions.value[:, :, 1] to read them into a matrix.

In [422]:
posterior_simulated_data = predictions.value[:,:,1]
Out[422]:
2-dimensional AxisArray{Float64,2,...} with axes:
    :iter, 1:1:1000000
    :var, [Symbol("data[1]"), Symbol("data[2]"), Symbol("data[3]"), Symbol("data[4]"), Symbol("data[5]"), Symbol("data[6]"), Symbol("data[7]"), Symbol("data[8]"), Symbol("data[9]"), Symbol("data[10]")  …  Symbol("data[14]"), Symbol("data[15]"), Symbol("data[16]"), Symbol("data[17]"), Symbol("data[18]"), Symbol("data[19]"), Symbol("data[20]"), Symbol("data[21]"), Symbol("data[22]"), Symbol("data[23]")]
And data, a 1000000×23 Matrix{Float64}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0     1.0  1.0  1.0  1.0  1.0  1.0  1.0
 ⋮                        ⋮              ⋱                      ⋮         
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  1.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0     1.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 1.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  1.0
In [431]:
x = temperature

using Plots
animation = @animate for i in 1:10
    # Plotting
    scatter(x,posterior_simulated_data[i*100000,:],
        size = (800,300),
        label = nothing,
        title  = "Defects of the Space Shuttle O-Rings vs temperature",
        yticks = 0:0.2:1,
        titlefontsize  = 10,
        yguidefontsize = 7,
        xguidefontsize = 8
        )
end
disable_logging(Logging.Info)
gif(animation,"tmp.gif",fps = 0.5)
Out[431]:

References

  • [1] Dalal, Fowlkes and Hoadley (1989), JASA, 84, 945-957.
  • [2] German Rodriguez. Datasets. In WWS509. Retrieved 30/01/2013, from http://data.princeton.edu/wws509/datasets/#smoking.
  • [3] McLeish, Don, and Cyntha Struthers. STATISTICS 450/850 Estimation and Hypothesis Testing. Winter 2012. Waterloo, Ontario: 2012. Print.
  • [4] Fonnesbeck, Christopher. "Building Models." PyMC-Devs. N.p., n.d. Web. 26 Feb 2013. http://pymc-devs.github.com/pymc/modelbuilding.html.
  • [5] Cronin, Beau. "Why Probabilistic Programming Matters." 24 Mar 2013. Google, Online Posting to Google . Web. 24 Mar. 2013. https://plus.google.com/u/0/107971134877020469960/posts/KpeRdJKR6Z1.
  • [6] S.P. Brooks, E.A. Catchpole, and B.J.T. Morgan. Bayesian animal survival estimation. Statistical Science, 15: 357–376, 2000
  • [7] Gelman, Andrew. "Philosophy and the practice of Bayesian statistics." British Journal of Mathematical and Statistical Psychology. (2012): n. page. Web. 2 Apr. 2013.
  • [8] Greenhill, Brian, Michael D. Ward, and Audrey Sacks. "The Separation Plot: A New Visual Method for Evaluating the Fit of Binary Models." American Journal of Political Science. 55.No.4 (2011): n. page. Web. 2 Apr. 2013.