Max Likelihood Estimator#
Using BPF with Gradient Descent#
Recall setup:
With a slight abuse of notation, we write \(\theta\) for the full set of model parameters \(m_0, P_0, Q, H, R\), which together parametrize all three distributions \(\pi_{\theta}, \tau_{\theta}, g_{\theta}\).
To estimate \(p_{\theta}(y_{1:T})\), we look for the \(\theta\) that maximizes the log-likelihood \(\log p_{\theta}(y_{1:T})\).
To do this, we perform gradient descent on \(-\log p_{\theta}(y_{1:T})\), using Fisher's identity:

$$
\nabla \log p_{\theta}(y_{1:T}) = \mathbb{E}_{p_{\theta}(x_{0:T} \mid y_{1:T})}\big[\nabla \log p_{\theta}(x_{0:T}, y_{1:T})\big].
$$
We can estimate this gradient by Monte Carlo: sampling \( x_{0:T}^{(i)} \sim p_{\theta}(x_{0:T} \mid y_{1:T}) \) for \(i = 1, \ldots, N\), we obtain

$$
\nabla \log p_{\theta}(y_{1:T}) \approx \frac{1}{N} \sum_{i=1}^N \nabla \log p_{\theta}(x_{0:T}^{(i)}, y_{1:T}).
$$
We denote this MC estimator at step \(k\) by \(\alpha_k = \frac{1}{N} \sum_{i=1}^N \nabla \log p_{\theta_k}(x_{0:T}^{(i)}, y_{1:T})\), and perform gradient descent on \( -\log p_{\theta}(y_{1:T})\):

$$
\theta_{k+1} = \theta_k + \gamma_k \alpha_k,
$$

where \(\gamma_k\) is the learning rate / step size.
Now, to compute \(\alpha_k = \frac{1}{N} \sum_{i=1}^N \alpha_k^{(i)}\), where \(\alpha_k^{(i)} = \nabla \log p_{\theta}(x_{0:T}^{(i)}, y_{1:T}) \), note that the joint log-density decomposes additively:

$$
\nabla \log p_{\theta}(x_{0:T}, y_{1:T}) = \nabla \log \pi_{\theta}(x_0) + \sum_{t=1}^{T} \Big[ \nabla \log \tau_{\theta}(x_t \mid x_{t-1}) + \nabla \log g_{\theta}(y_t \mid x_t) \Big].
$$
We therefore define the per-step score increment

$$
s_{\theta, t}(x_{t-1}, x_t) = \nabla \log \tau_{\theta}(x_t \mid x_{t-1}) + \nabla \log g_{\theta}(y_t \mid x_t),
$$

which can be accumulated recursively along each particle path.
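To make this increment concrete, here is a small sketch for a hypothetical 1-D linear-Gaussian model (an illustration only, not the model defined elsewhere in these notes): take \(x_t = a x_{t-1} + \mathcal{N}(0, q)\) with \(\theta = a\), so \(\nabla_a \log \tau_{\theta}(x_t \mid x_{t-1}) = (x_t - a x_{t-1}) x_{t-1} / q\), and if \(g\) does not depend on \(a\), the observation term vanishes.

```python
import numpy as np

def score_increment(a, q, x_prev, x_curr):
    """Per-step score s_{theta,t} for theta = a in the toy model
    x_t = a x_{t-1} + N(0, q), with g independent of a, so only the
    transition term d/da log tau contributes."""
    return (x_curr - a * x_prev) * x_prev / q

# sanity check against a finite difference of log tau
a, q, x_prev, x_curr = 0.8, 0.5, 1.3, 0.9
log_tau = lambda a_: -0.5 * np.log(2 * np.pi * q) - (x_curr - a_ * x_prev) ** 2 / (2 * q)
fd = (log_tau(a + 1e-6) - log_tau(a - 1e-6)) / 2e-6
```

The finite-difference check is a cheap way to validate any hand-derived score term before plugging it into the filter.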
The full BPF pseudocode is as follows. Fix \(N, \theta\); at step \(t\), given \(x_{0:t-1}^{(i)}\) and \( \alpha_{t-1}^{(i)}\):
Sample:
\( x_0^{(i)} \sim \pi_{\theta} \) for \( i = 1, \ldots, N. \)
Initialize \( \alpha_0^{(i)} = \nabla \log \pi_{\theta}(x_0^{(i)})\) for \( i = 1, \ldots, N. \)
for \( t = 1, \ldots, T \) do
Sample: \( \bar{x}_t^{(i)} \sim \tau_{\theta}(\cdot|x_{t-1}^{(i)}) \quad \text{for} \quad i = 1, \ldots, N. \)
Weight: \( W_t^{(i)} = g_{\theta}(y_t|\bar{x}_t^{(i)}), \) for \( i = 1, \ldots, N \).
Normalize: \( w_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^{N} W_t^{(j)}} \) for \( i = 1, \ldots, N \).
Resample: Sample \( o_t(1), \dots, o_t(N) \sim \) Multinomial( \(w_t^{(1)}, \dots, w_t^{(N)}\) ), and set \( x_t^{(i)} = \bar{x}_t^{(o_t(i))} \) for \( i = 1, \ldots, N \).
Compute: \( s_{\theta, t}(x_{t-1}^{(o_t(i))}, x_t^{(i)}) = \nabla \log \tau_{\theta}(x_t^{(i)} \mid x_{t-1}^{(o_t(i))}) + \nabla \log g_{\theta}(y_t \mid x_t^{(i)}) \) for \( i = 1, \ldots, N \).
Update the gradient: \( \alpha_t^{(i)} = \alpha_{t-1}^{(o_t(i))} + s_{\theta, t}(x_{t-1}^{(o_t(i))}, x_t^{(i)}) \) for \( i = 1, \ldots, N \).
end for; return the final \(\{\alpha_T^{(i)}\}_{i=1}^N\)
Perform the gradient update \( \theta_{k+1} = \theta_k + \gamma_k \bigg(\frac{1}{N} \sum_{i=1}^N \alpha_T^{(i)} \bigg) \) (ascent on the log-likelihood, i.e. descent on \(-\log p_{\theta}(y_{1:T})\)).
We repeat until \(|\theta_{k+1} - \theta_k|\) falls below a pre-defined threshold, and output the final \(\theta\) as our maximum-likelihood estimate. We then plug this into the Kalman filter we wrote in 2.1 to obtain \(p_{\theta}(y_{1:T})\).
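The procedure above can be sketched end to end. This is a minimal illustration on a hypothetical 1-D linear-Gaussian toy model (my own choice, not the model from the text), estimating only a scalar transition coefficient \(a\) with \(q, r\) held fixed; the prior \(\pi\) does not depend on \(a\), so \(\alpha_0^{(i)} = 0\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (an illustration, not the text's model):
#   x_0 ~ N(0, 1),  x_t = a x_{t-1} + N(0, q),  y_t = x_t + N(0, r)
q, r, a_true, T, N = 0.3, 0.5, 0.7, 200, 300

x, ys = 0.0, []
for _ in range(T):                        # simulate synthetic data
    x = a_true * x + rng.normal(0, np.sqrt(q))
    ys.append(x + rng.normal(0, np.sqrt(r)))

def bpf_score(a):
    """One BPF sweep; returns (1/N) sum_i alpha_T^{(i)}, the Monte Carlo
    estimate of d/da log p_a(y_{1:T})."""
    xs = rng.normal(0, 1, N)              # x_0^{(i)}; pi does not depend on a
    alpha = np.zeros(N)                   # so alpha_0^{(i)} = 0
    for y in ys:
        xbar = a * xs + rng.normal(0, np.sqrt(q), N)  # propagate via tau
        logw = -0.5 * (y - xbar) ** 2 / r             # log g up to a constant
        w = np.exp(logw - logw.max())
        w /= w.sum()
        o = rng.choice(N, size=N, p=w)                # multinomial resample
        s = (xbar[o] - a * xs[o]) * xs[o] / q         # score increment s_{theta,t}
        alpha = alpha[o] + s
        xs = xbar[o]
    return alpha.mean()

a, gamma = 0.0, 0.002                     # theta_0 and step size
for _ in range(60):                       # noisy gradient ascent on log-likelihood
    a += gamma * bpf_score(a)
```

Because each sweep re-runs the filter with fresh randomness, this is stochastic gradient ascent; in practice one would use a decaying step size \(\gamma_k\) rather than the fixed constant used here.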
Bayesian Inference#
MCMC#
Build an MCMC chain on \(\theta\) targeting the posterior \(p(\theta \mid y_{1:T})\).
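A minimal sketch of such a chain, using random-walk Metropolis on a scalar \(\theta\). The target `log_post` here is a hypothetical stand-in (a Unif(-1, 1) prior times a Gaussian pseudo-likelihood I made up for illustration); in practice the likelihood term \(\log p(y_{1:T} \mid \theta)\) would be computed exactly (Kalman filter) or estimated by a particle filter.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_post(theta):
    """Hypothetical stand-in target: Unif(-1, 1) prior times a Gaussian
    pseudo-likelihood centred at 0.5 with variance 0.01."""
    if not -1.0 < theta < 1.0:
        return -np.inf                    # outside the prior support
    return -0.5 * (theta - 0.5) ** 2 / 0.01

def rw_metropolis(log_post, theta0, n_steps, step=0.1):
    """Random-walk Metropolis chain on a scalar theta."""
    chain, theta, lp = [], theta0, log_post(theta0)
    for _ in range(n_steps):
        prop = theta + step * rng.normal()     # symmetric Gaussian proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept / reject
            theta, lp = prop, lp_prop
        chain.append(theta)
    return np.array(chain)

chain = rw_metropolis(log_post, 0.0, 5000)
post_mean = chain[1000:].mean()           # discard burn-in
```

Since the proposal is symmetric, the acceptance ratio reduces to the difference of log-posteriors; the chain's empirical distribution (after burn-in) approximates \(p(\theta \mid y_{1:T})\).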
NPF#
Nested Particle Filtering runs an outer particle filter over \(\theta\), where each \(\theta\)-particle carries its own inner particle filter over the states \(x\).
Let prior \(p_0(\theta) = \text{Unif}(-1, 1)\)
Sample: \( \hat{\theta}_0^{(i)} \sim p_0\) for \( i = 1, \ldots, M \).
for \( t = 1, \ldots, T \) do
Propose \( \hat{\theta}_t^{(i)} \sim \kappa(\cdot \mid \theta_{t-1}^{(i)}) \) for \( i = 1, \ldots, M \), where \(\kappa\) is a jittering kernel.
Sample Inner layer: \( \bar{x}_t^{(i, j)} \sim \tau_{\hat{\theta}_t^{(i)}}(\cdot|x_{t-1}^{(i, j)}) \quad \text{for} \quad j = 1, \ldots, N, \quad \text{for} \quad i = 1, \ldots, M. \)
Weight: \( V_t^{(i, j)} = g_{\hat{\theta}_t^{(i)}}(y_t \mid \bar{x}_t^{(i, j)}) \) for \( j = 1, \ldots, N \), \( i = 1, \ldots, M \).
Normalize \( v_t^{(i,j)} = \frac{V_t^{(i,j)}}{\sum_{k=1}^N V_t^{(i,k)}} \) for \( j = 1, \ldots, N \), \( i = 1, \ldots, M \).
Approximate the outer-layer likelihood: \( W_t^{(i)} = \frac{1}{N} \sum_{j=1}^N V_t^{(i,j)} \). (Note the unnormalized \(V_t^{(i,j)}\) here: the normalized \(v_t^{(i,j)}\) always sum to one, so averaging them would give the constant \(1/N\).)
Normalize \( w_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^M W_t^{(j)}}, \quad \text{for} \quad i = 1, \ldots, M \)
Resample inner particles: \( \tilde{x}_t^{(i,j)} \sim \sum_{k=1}^N v_t^{(i,k)} \delta_{\bar{x}_t^{(i,k)}} \) for \( j = 1, \ldots, N \), \( i = 1, \ldots, M \).
Resample outer particles: sample \( o_t(1), \dots, o_t(M) \sim \) Multinomial( \(w_t^{(1)}, \dots, w_t^{(M)}\) ), and set \( \theta_t^{(i)} = \hat{\theta}_t^{(o_t(i))} \) and \( x_t^{(i,j)} = \tilde{x}_t^{(o_t(i), j)} \) for \( i = 1, \ldots, M \), \( j = 1, \ldots, N \).
Return \( \{\theta_T^{(i)}\}_{i=1}^M\), and approximate the posterior by

$$
p(\theta \mid y_{1:T}) \approx \frac{1}{M} \sum_{i=1}^M \delta_{\theta_T^{(i)}}.
$$
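The NPF steps above can be sketched on the same hypothetical 1-D linear-Gaussian toy model used earlier (my own illustration, not the text's model), with \(\theta\) the transition coefficient, a Unif(-1, 1) prior, and a small Gaussian jittering kernel \(\kappa\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy model (an illustration, not the text's model):
#   x_t = theta x_{t-1} + N(0, q),  y_t = x_t + N(0, r),  theta ~ Unif(-1, 1)
q, r, theta_true, T = 0.3, 0.5, 0.7, 200
M, N = 200, 100                          # outer (theta) / inner (x) particles

x, ys = 0.0, []
for _ in range(T):                       # simulate synthetic data
    x = theta_true * x + rng.normal(0, np.sqrt(q))
    ys.append(x + rng.normal(0, np.sqrt(r)))

theta = rng.uniform(-1, 1, M)            # theta_0^{(i)} ~ p_0
xs = rng.normal(0, 1, (M, N))            # inner particles x_0^{(i,j)}

for y in ys:
    # propose: jitter each theta with a small Gaussian kernel kappa,
    # clipped to the prior support (-1, 1)
    theta = np.clip(theta + 0.02 * rng.normal(size=M), -1, 1)
    # inner layer: propagate and weight
    xbar = theta[:, None] * xs + rng.normal(0, np.sqrt(q), (M, N))
    logV = -0.5 * (y - xbar) ** 2 / r            # log g up to a constant
    V = np.exp(logV - logV.max())                # global shift for stability
    W = V.mean(axis=1)                           # outer weight: mean of UNnormalized V
    v = V / V.sum(axis=1, keepdims=True)         # inner normalized weights
    for i in range(M):                           # resample each inner cloud
        xs[i] = xbar[i, rng.choice(N, size=N, p=v[i])]
    w = W / W.sum()                              # outer normalized weights
    o = rng.choice(M, size=M, p=w)               # resample theta with its cloud
    theta, xs = theta[o], xs[o]

post_mean = theta.mean()                 # posterior approx: (1/M) sum of deltas
```

Note that each outer resample moves a \(\theta\)-particle together with its entire inner cloud, and that the outer weights average the unnormalized inner weights (the global `logV.max()` shift cancels when `w` is normalized).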