Lecture 12 : Sampling and Monte Carlo Methods
Assoc. Prof. Yuan-Sen Ting
Astron 5550 : Advanced Astronomical Data Analysis
Logistics
Homework Assignment 4 due this Friday, 11:59pm
Group Project 2 - Presentation - April 16 (Next Week, Wed)
Group Project 2 - Report - April 28 (no deferment)
Remote lecture this Wednesday (link on Carmen)
Plan for Today :
Grid-Based Sampling
Inverse Cumulative Distribution Function
Rejection Sampling
Importance Sampling
Sampling and Monte Carlo Methods
We've covered supervised learning (regression, classification) and unsupervised learning (PCA, clustering)
All involve maximizing likelihood and deriving posterior distributions
Posterior distribution always has the form:
p(\boldsymbol{\theta}|\mathcal{D}) \propto p(\mathcal{D}|\boldsymbol{\theta}) \cdot p(\boldsymbol{\theta})
The Bayesian Framework
Knowing the mathematical form isn't enough for practical use
We need to draw samples or compute expectations
Many ML tasks require computing expectations:
E[f(z)] = \int f(z) p(z) dz

Posterior Predictive Distribution
Integrates over all parameter values weighted by posterior probability:
Allows us to make predictions for new observations
Accounts for all uncertainty in model parameters
p(y_*|x_*, \mathcal{D}) = \int p(y_*|x_*, \boldsymbol{\theta}) p(\boldsymbol{\theta}|\mathcal{D}) d\boldsymbol{\theta}
From Simple to Complex Model
Simple models (linear regression): Posterior is Gaussian, analytically tractable
Slightly non-linear models (logistic regression): Require approximations
Complex models (non-linear, hierarchical): Analytical solutions unavailable
Need methods to sample directly from complex posteriors
Monte Carlo Approximation
Approximate expectations using samples:
Samples \( z^{(1)}, z^{(2)}, \ldots, z^{(N)} \) drawn from distribution \( p(z) \)
Error decreases at rate of \( \mathcal{O}(1/\sqrt{N}) \) regardless of dimensionality
E[f(z)] = \int f(z) p(z) dz
E[f(z)] \approx \frac{1}{N} \sum_{l=1}^{N} f(z^{(l)})

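A minimal sketch of the Monte Carlo approximation above, using a standard normal for \( p(z) \) and \( f(z) = z^2 \) (both illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples z^(l) from p(z); here p(z) is a standard normal
N = 100_000
z = rng.standard_normal(N)

# Monte Carlo estimate of E[f(z)] with f(z) = z^2 (true value: 1)
estimate = np.mean(z**2)
print(estimate)  # close to 1, with error ~ O(1/sqrt(N))
```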
Buffon's Needle Experiment
Sampling has deep historical roots, predating modern computers
Buffon's needle experiment (18th century): A Monte Carlo method for estimating \( \pi \)
Drop needles of length \( L \) on a floor with parallel lines spaced \( D \) apart (\( L \leq D \))
p(\text{crossing}) = \int^\pi_0 \frac{L \sin \theta}{D \pi} d \theta = \frac{2L}{\pi D}

From Physical to Computational Randomization
Estimate \(\pi \) by:
Profound connection between physical randomization and numerical integration
Enables work with arbitrarily complex models while performing proper Bayesian inference
\pi \approx 2\frac{L}{D} \frac{N}{n}
N = total needles, n = crossing needles
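The estimator above can be simulated directly. A sketch using the standard parameterization (needle center distance to the nearest line uniform on \( [0, D/2] \), angle uniform on \( [0, \pi] \)):

```python
import numpy as np

rng = np.random.default_rng(42)

L, D = 1.0, 2.0          # needle length and line spacing (L <= D)
N = 1_000_000            # total needles dropped

# Distance from needle center to nearest line, and needle angle
y = rng.uniform(0, D / 2, N)
theta = rng.uniform(0, np.pi, N)

# A needle crosses a line when y <= (L/2) sin(theta)
n = np.sum(y <= (L / 2) * np.sin(theta))

pi_est = 2 * (L / D) * (N / n)
print(pi_est)  # close to pi
```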
Grid-Based Sampling
Discretize continuous domain into equally spaced grid points
For domain \( [a,b] \):
Evaluate PDF at each grid point: \( p_i = p(x_i) \)
x_i = a + i\Delta x, \quad \Delta x = \frac{b-a}{n}, \quad i = 0, 1, \ldots, n

Normalized probabilities:
\tilde{p}_i = \frac{p_i}{Z} , \quad Z = \sum_{i=0}^{n} p_i


Generating Samples
Generate uniform random number \( u \sim \mathcal{U}(0, 1) \)
Compute partial sums: \( s_i = \sum_{j=0}^{i} \tilde{p}_j \) for \( i = 0, 1, \ldots, n \)
Find bin where \( s_{i-1} < u \leq s_i \)
Return corresponding grid point \( x_i \) as sample
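The four steps above can be sketched as follows, with a standard normal as an illustrative target PDF; `np.searchsorted` finds the bin satisfying \( s_{i-1} < u \leq s_i \):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target PDF (unnormalized is fine): standard normal, illustrative choice
p = lambda x: np.exp(-x**2 / 2)

a, b, n = -5.0, 5.0, 1000
x_grid = np.linspace(a, b, n + 1)        # x_i = a + i * dx
p_tilde = p(x_grid) / p(x_grid).sum()    # normalized probabilities

# Partial sums s_i, then find the bin with s_{i-1} < u <= s_i
s = np.cumsum(p_tilde)
u = rng.uniform(0, 1, size=10_000)
idx = np.searchsorted(s, u)
samples = x_grid[idx]

print(samples.mean(), samples.std())  # approx 0 and 1
```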
Extension to Multiple Dimensions
For \(d\)-dimensional random variable, create \(d\)-dimensional grid
Grid points:
Evaluate joint PDF at each grid point and normalize
Sampling procedure is similar to the one-dimensional case, computing a cumulative sum over the flattened grid.
x_{i_1, i_2, \ldots, i_d} = (a_1 + i_1\Delta x_1, a_2 + i_2\Delta x_2, \ldots, a_d + i_d\Delta x_d)
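A sketch of the multidimensional case for \( d = 2 \), using two independent standard normals as an illustrative joint PDF; the flattened cumulative sum reduces the problem to the one-dimensional procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2D illustrative target: independent standard normals on a grid
x1 = np.linspace(-5, 5, 201)
x2 = np.linspace(-5, 5, 201)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
P = np.exp(-(X1**2 + X2**2) / 2)

# Flatten the grid, normalize, take the cumulative sum
flat = P.ravel()
cdf = np.cumsum(flat / flat.sum())

# Sample flat indices, then map back to grid coordinates
u = rng.uniform(0, 1, size=50_000)
idx = np.searchsorted(cdf, u)
i1, i2 = np.unravel_index(idx, P.shape)
samples = np.column_stack([x1[i1], x2[i2]])

print(samples.mean(axis=0))  # both coordinates near 0
```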
Limitation of Grid-Based Sampling
Curse of dimensionality: Total grid points scale as \( n^d \), growing exponentially with dimension
Discretization error: Samples constrained to grid points
Only practical for low dimensions, e.g., 1D or 2D.
Inverse CDF Method: A Continuous Approach
Inverse CDF method: Limiting case of grid approach as bin width approaches zero
Works directly with continuous distributions rather than discretized approximations
Grid approach partial sums generalize to cumulative distribution function (CDF)
P(x) = \int_{-\infty}^{x} p(t) dt

The Inverse CDF Method
Simple procedure for generating samples:
Generate uniform random number \( u \sim \mathcal{U}(0,1) \)
Compute \( x = P^{-1}(u) \), where \( P^{-1} \) is inverse of CDF
Return \( x \) as sample from target distribution

Geometric Interpretation
In regions of high PDF: CDF increases rapidly, samples compressed on x-axis
In regions of low PDF: CDF increases slowly, samples spread out on x-axis
Analytical Examples - Exponential Distribution
PDF: For \( x \geq 0 \)
CDF:
Inverse CDF:
Generating samples: Take negative logarithm of uniform random numbers, divide by \( \lambda \)
p(x) = \lambda e^{-\lambda x}
P(x) = 1 - e^{-\lambda x}
x = -\frac{1}{\lambda} \ln(1-u)
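The exponential case from the slides, sketched in code; note that since \( 1-u \) is itself uniform on \( (0,1) \), taking the negative logarithm of uniform numbers and dividing by \( \lambda \) is equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0
u = rng.uniform(0, 1, size=100_000)

# Inverse CDF: x = -ln(1 - u) / lambda
x = -np.log(1 - u) / lam

print(x.mean())  # approx 1/lambda = 0.5
```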
Analytical Examples - Power Law Distribution
PDF: \( p(x) = C x^{-\alpha} \) where \( \alpha > 1 \) and \( C = \frac{\alpha - 1}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}} \)
CDF:
Inverse CDF:
P(x) = \frac{x_{\min}^{1-\alpha} - x^{1-\alpha}}{x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha}}
x = [x_{\min}^{1-\alpha} - u(x_{\min}^{1-\alpha} - x_{\max}^{1-\alpha})]^{\frac{1}{1-\alpha}}
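The power-law inverse CDF above, sketched with illustrative values of \( \alpha \), \( x_{\min} \), and \( x_{\max} \):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, x_min, x_max = 2.35, 1.0, 100.0   # illustrative slope and bounds
u = rng.uniform(0, 1, size=100_000)

# Inverse CDF: x = [x_min^(1-a) - u (x_min^(1-a) - x_max^(1-a))]^(1/(1-a))
a, b = x_min**(1 - alpha), x_max**(1 - alpha)
x = (a - u * (a - b)) ** (1 / (1 - alpha))

print(x.min(), x.max())  # all samples lie in [x_min, x_max]
```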
Advantages of the Inverse CDF Method
Conceptually straightforward
Computationally efficient when inverse CDF can be calculated analytically
Generates samples that exactly follow target distribution
Limitations of the Inverse CDF Method
Requires closed-form expression for inverse CDF
Challenges with multivariate distributions
Rosenblatt transformation uses conditional distributions
Practical only when conditional distributions have known forms (rare in practice)
Rejection Sampling
More versatile than inverse CDF - handles distributions without closed-form inverse CDFs
Evaluating \( p(x) \) doesn't directly tell us how to generate samples
Transforms samples from simpler distribution to target distribution
Applicable to a broader class of distributions than previously discussed methods

Proposal Distribution
Target distribution \( p(x) \) from which we want to draw samples
Proposal distribution: PDF \( q(x) \) that satisfies two requirements
Must be able to efficiently generate samples from \( q(x) \)
An envelope completely covering the target distribution: there exists a constant \( M \geq 1 \) such that \( M q(x) \geq p(x) \) for all \( x \)
Rejection Sampling Algorithm
Generate sample \( x_* \sim q(x) \) from proposal distribution
Compute the acceptance ratio:
Generate uniform random number \( u \sim \mathcal{U}(0, 1) \)
If \( u \leq \alpha \), accept \( x_* \) as sample from \( p(x) \); otherwise, reject
\alpha = \frac{p(x_*)}{M q(x_*)}

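The algorithm above, sketched for a standard normal target with a standard Cauchy proposal (an illustrative pairing: the Cauchy has heavier tails, and \( M = 1.6 \) is chosen so that \( M q(x) \geq p(x) \) everywhere):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: standard normal PDF
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
# Proposal: standard Cauchy; M q(x) >= p(x) holds for M = 1.6
q = lambda x: 1 / (np.pi * (1 + x**2))
M = 1.6

x_star = rng.standard_cauchy(200_000)          # samples from q
alpha = p(x_star) / (M * q(x_star))            # acceptance ratio
u = rng.uniform(0, 1, size=x_star.size)
samples = x_star[u <= alpha]                   # accept if u <= alpha

accept_rate = samples.size / x_star.size
print(samples.mean(), samples.std())  # approx 0 and 1
print(accept_rate)                    # approx 1/M for normalized p, q
```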
Efficiency Considerations
Optimal proposal distribution balances:
Ease of sampling from \( q(x) \)
Similarity to target \( p(x) \) in shape
Sufficient heavy tails to dominate \( p(x) \)
Advantages of Rejection Sampling
Extends naturally to multidimensional distributions
Works for any distribution we can evaluate
(even up to normalizing constant)
Requires minimal assumptions about target distribution
Limitations of Rejection Sampling
Efficiency deteriorates rapidly as dimensionality increases
Challenges with heavy-tailed distributions
Advanced techniques (MCMC) necessary for high-dimensional problems
Importance Sampling
Different approach from rejection sampling but shares some similarities
Efficiently estimating expectations rather than generating exact samples
Uses proposal distribution but weights samples instead of rejecting them
The Expectation Problem
Many problems involve computing expectations:
Direct Monte Carlo approach:
But we do not know how to sample from \( p(x) \)
E_{p}[f] = \int f(x) p(x) dx
E_{p}[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}),
x^{(i)} \sim p(x)
The Key Insight
Sample from proposal distribution \( q(x) \) that is easier to work with
E_{p}[f] = \int f(x) p(x) dx
= \int f(x) \frac{p(x)}{q(x)} q(x) dx
= E_{q}\left[f(x) \frac{p(x)}{q(x)}\right]
The Key Insight
This leads to importance sampling estimator:
E_{p}[f] \approx \frac{1}{N} \sum_{i=1}^{N} w(x^{(i)}) f(x^{(i)}),
w(x) = \frac{p(x)}{q(x)}
E_{q}\left[f(x) \frac{p(x)}{q(x)}\right]
x^{(i)} \sim q(x)
Importance weights defined as:
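A sketch of the importance sampling estimator with normalized densities, using illustrative choices \( p = \mathcal{N}(0,1) \), \( q = \mathcal{N}(0, 2^2) \), and \( f(x) = x^2 \) (so the true expectation is 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p = N(0,1); proposal q = N(0, 2^2), which is easy to sample
N = 200_000
x = rng.normal(0, 2, N)

p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-x**2 / 8) / np.sqrt(2 * np.pi * 4)
w = p / q                          # importance weights w(x) = p(x)/q(x)

f = x**2
estimate = np.mean(w * f)          # E_p[x^2] = 1
print(estimate)
```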
Handling Unnormalized Distributions
In Bayesian inference, we often know \( p(x) \) only up to normalizing constant:
p(x) = \frac{\tilde{p}(x)}{Z_p}
E_{p}[f] = \frac{1}{Z_p} \int f(x) \tilde{p}(x) dx = \frac{Z_q}{Z_p} \int f(x) \frac{\tilde{p}(x)}{\tilde{q}(x)} q(x) dx
Rewrite expectation:
Handling Unnormalized Distributions
Unnormalized importance weights:
\tilde{w}(x) = \frac{\tilde{p}(x)}{\tilde{q}(x)}
Z_p = \int \tilde{p}(x) dx = \int \frac{\tilde{p}(x)}{\tilde{q}(x)} \tilde{q}(x) dx
Normalization constant
= Z_q \int \frac{\tilde{p}(x)}{\tilde{q}(x)} q(x) dx = Z_q E_{q}\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\right]
Handling Unnormalized Distributions
Therefore:
\frac{Z_p}{Z_q} = E_q\left[\frac{\tilde{p}(x)}{\tilde{q}(x)}\right] \approx \frac{1}{N} \sum_{i=1}^{N} \tilde{w}(x^{(i)})
E_{p}[f] \approx \frac{\sum_{i=1}^{N} \tilde{w}(x^{(i)}) f(x^{(i)})}{\sum_{i=1}^{N} \tilde{w}(x^{(i)})},
Self-normalized estimator:
x^{(i)} \sim q(x)
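The self-normalized estimator above, sketched with unnormalized densities (the code never needs \( Z_p \) or \( Z_q \)); the same illustrative \( p = \mathcal{N}(0,1) \), \( q = \mathcal{N}(0, 2^2) \), \( f(x) = x^2 \) as before:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target and proposal (normalizing constants unknown to the code)
p_tilde = lambda x: np.exp(-x**2 / 2)   # N(0,1) up to a constant
q_tilde = lambda x: np.exp(-x**2 / 8)   # N(0, 2^2) up to a constant

N = 200_000
x = rng.normal(0, 2, N)                 # samples from q

w = p_tilde(x) / q_tilde(x)             # unnormalized weights
estimate = np.sum(w * x**2) / np.sum(w) # self-normalized; E_p[x^2] = 1
print(estimate)
```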
Effective Sample Size
Finite sampling noise goes as \( 1 / \sqrt{\text{ESS}} \), instead of \( 1 / \sqrt{N} \)
Effective Sample Size (ESS):
Interpretation: Measures weight distribution evenness
Range: \( 1 \leq \text{ESS} \leq N \)
(equal weights → ESS = N, one dominant weight → ESS ≈ 1)
\text{ESS} \approx \frac{\left(\sum_{i=1}^{N} \tilde{w}(x^{(i)})\right)^2}{\sum_{i=1}^{N} \tilde{w}(x^{(i)})^2}
ESS and the Truncated Normal Example
Target: Standard normal truncated to region \( [5,\infty) \)
Naive proposal (standard normal): since \( P(x \geq 5) \approx 2.9 \times 10^{-7} \), with \( N = 10^7 \) samples we expect only about 3 non-zero weights, so ESS \( \approx 3 \) no matter how many samples we generate
Better proposal (shifted exponential):
\( q(x) = \lambda e^{-\lambda(x-5)} \cdot \mathbf{1}(x \geq 5) \)

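A sketch of this comparison (with a smaller \( N = 10^5 \) for speed, so the naive proposal typically produces zero non-zero weights; the rate \( \lambda = 5 \) for the shifted exponential is an illustrative choice, roughly matching the target's decay near 5):

```python
import numpy as np

rng = np.random.default_rng(0)

def ess(w):
    """Effective sample size: (sum w)^2 / sum w^2."""
    return w.sum()**2 / (w**2).sum()

N = 100_000
# Unnormalized target: standard normal truncated to [5, inf)
p_tilde = lambda x: np.exp(-x**2 / 2) * (x >= 5)

# Naive proposal: standard normal -- almost no samples land in [5, inf)
x1 = rng.standard_normal(N)
q1 = np.exp(-x1**2 / 2) / np.sqrt(2 * np.pi)
w1 = p_tilde(x1) / q1

# Better proposal: shifted exponential q(x) = lam * exp(-lam (x-5)), x >= 5
lam = 5.0
x2 = 5 + rng.exponential(1 / lam, N)
q2 = lam * np.exp(-lam * (x2 - 5))
w2 = p_tilde(x2) / q2

print("non-zero naive weights:", np.count_nonzero(w1))   # ~0 of 100,000
print("ESS with shifted exponential:", ess(w2))          # large fraction of N
```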
Proposal Distribution Selection Principles
Support coverage: \( q(x) > 0 \) whenever \( f(x)p(x) \neq 0 \)
Maximize effective sample size: Ideally \( q(x) \propto |f(x)|p(x) \)
Tail behavior: \( q(x) \) should have heavier tails than \( p(x) \)
Tractability: \( q(x) \) should be easy to sample from and evaluate
What We Have Learned:
Grid-Based Sampling
Inverse Cumulative Distribution Function
Rejection Sampling
Importance Sampling