
Max Likelihood Estimator#

Using BPF with Gradient Descent#

Recall the setup:

\[\begin{align*} \pi_0(x_0) &= \mathcal{N}(x_0; m_0, P_0) \\ \tau(x_t | x_{t-1}) &= \mathcal{N}(x_t; \theta x_{t-1}, Q) \\ g(y_t | x_t) &= \mathcal{N}(y_t; H x_t, R) \end{align*}\]

With a slight abuse of notation, we write \(\theta\) for the full parameter set \((m_0, P_0, \theta, Q, H, R)\), which parametrizes all three distributions \(\pi_{\theta}, \tau_{\theta}, g_{\theta}\).
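
For concreteness, here is a minimal JAX sketch of these three log-densities in the scalar case, with all parameters packed into a single dict `theta`; the function names (`log_pi0`, `log_tau`, `log_g`) and the key `a` for the transition coefficient are illustrative choices, not notation from these notes.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_pi0(theta, x0):
    # log N(x0; m0, P0)
    return norm.logpdf(x0, theta["m0"], jnp.sqrt(theta["P0"]))

def log_tau(theta, x_prev, x_curr):
    # log N(x_t; a * x_{t-1}, Q), with a the transition coefficient
    return norm.logpdf(x_curr, theta["a"] * x_prev, jnp.sqrt(theta["Q"]))

def log_g(theta, x_curr, y_curr):
    # log N(y_t; H * x_t, R)
    return norm.logpdf(y_curr, theta["H"] * x_curr, jnp.sqrt(theta["R"]))

# example parameter values (arbitrary)
theta = {"m0": 0.0, "P0": 1.0, "a": 0.8, "Q": 0.1, "H": 1.0, "R": 0.5}
```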

To estimate \(\theta\), we look for the value that maximizes the log-likelihood \( \log p_{\theta}(y_{1:T})\).
To do this, we perform gradient ascent on \( \log p_{\theta}(y_{1:T})\) (equivalently, gradient descent on \( -\log p_{\theta}(y_{1:T})\)), using the identity

\[ \nabla \log p_{\theta}(y_{1:T}) = \int \nabla \log p_{\theta}(x_{0:T}, y_{1:T}) \cdot p_{\theta}(x_{0:T}| y_{1:T}) dx_{0:T} \]
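
This is Fisher's identity. It follows by differentiating \( p_{\theta}(y_{1:T}) = \int p_{\theta}(x_{0:T}, y_{1:T}) \, dx_{0:T} \) and using \(\nabla p_{\theta} = p_{\theta} \nabla \log p_{\theta}\):

\[\begin{align*} \nabla \log p_{\theta}(y_{1:T}) &= \frac{\nabla p_{\theta}(y_{1:T})}{p_{\theta}(y_{1:T})} = \frac{\int \nabla p_{\theta}(x_{0:T}, y_{1:T}) \, dx_{0:T}}{p_{\theta}(y_{1:T})} \\ &= \int \nabla \log p_{\theta}(x_{0:T}, y_{1:T}) \, \frac{p_{\theta}(x_{0:T}, y_{1:T})}{p_{\theta}(y_{1:T})} \, dx_{0:T} = \int \nabla \log p_{\theta}(x_{0:T}, y_{1:T}) \, p_{\theta}(x_{0:T} \mid y_{1:T}) \, dx_{0:T}. \end{align*}\]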

We can estimate this gradient by Monte Carlo: sampling \( x_{0:T}^{(i)} \sim p_{\theta}(x_{0:T} \mid y_{1:T}) \) for \( i = 1, \ldots, N \) gives

\[\begin{align*} \nabla \log p_{\theta}(y_{1:T}) &\approx \frac{1}{N} \sum_{i=1}^N \nabla \log p_{\theta}(x_{0:T}^{(i)}, y_{1:T}) \end{align*}\]

We denote this Monte Carlo estimator at the current iterate \(\theta_k\) by \(\alpha_k = \frac{1}{N} \sum_{i=1}^N \nabla \log p_{\theta_k}(x_{0:T}^{(i)}, y_{1:T})\), and take a gradient step on the log-likelihood (i.e. a descent step on \( -\log p_{\theta}(y_{1:T})\)):

\[ \theta_{k+1} = \theta_k + \gamma_k \alpha_k \]

where \(\gamma_k\) is the learning rate (step size).

Now, to compute each term \(\alpha_k^{(i)} = \nabla \log p_{\theta}(x_{0:T}^{(i)}, y_{1:T})\) in \(\alpha_k = \frac{1}{N} \sum_{i=1}^N \alpha_k^{(i)}\), note that

\[\begin{align*} \nabla \log p_{\theta}(x_{0:T}, y_{1:T}) &= \nabla \log \pi_{\theta}(x_0) + \sum_{t=1}^T \nabla \log \tau_{\theta}(x_t \mid x_{t-1}) + \sum_{t=1}^T \nabla \log g_{\theta}(y_t \mid x_t). \end{align*}\]

We define

\[\begin{align*} s_{\theta, 0}(x_0) &= \nabla \log \pi_{\theta}(x_0), \\ s_{\theta, t}(x_{t-1}, x_t) &= \nabla \log g_{\theta}(y_t \mid x_t) + \nabla \log \tau_{\theta}(x_t \mid x_{t-1}). \end{align*}\]
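
Continuing the (hypothetical) naming of the earlier sketch, both score terms can be obtained by automatic differentiation with respect to the `theta` dict, so no gradients need to be derived by hand:

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

# log-densities as in the earlier sketch
def log_pi0(theta, x0):
    return norm.logpdf(x0, theta["m0"], jnp.sqrt(theta["P0"]))

def log_tau(theta, x_prev, x_curr):
    return norm.logpdf(x_curr, theta["a"] * x_prev, jnp.sqrt(theta["Q"]))

def log_g(theta, x_curr, y_curr):
    return norm.logpdf(y_curr, theta["H"] * x_curr, jnp.sqrt(theta["R"]))

# s_{theta,0}(x_0) = grad_theta log pi_theta(x_0)
s_0 = jax.grad(log_pi0, argnums=0)

# s_{theta,t}(x_{t-1}, x_t) = grad_theta [ log g_theta(y_t | x_t) + log tau_theta(x_t | x_{t-1}) ]
def s_t(theta, x_prev, x_curr, y_curr):
    return jax.grad(lambda th: log_g(th, x_curr, y_curr) + log_tau(th, x_prev, x_curr))(theta)

theta = {"m0": 0.0, "P0": 1.0, "a": 0.8, "Q": 0.1, "H": 1.0, "R": 0.5}
print(s_0(theta, 0.3))             # dict with one gradient entry per parameter
print(s_t(theta, 0.3, 0.5, 0.4))   # same pytree structure as theta
```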

The full bootstrap particle filter (BPF) pseudocode, with gradient accumulation, is given below.

Fix \(N\) and the current parameter value \(\theta\).

Sample: \( x_0^{(i)} \sim \pi_{\theta} \) for \( i = 1, \ldots, N. \)
Initialize: \( \alpha_0^{(i)} = s_{\theta, 0}(x_0^{(i)}) = \nabla \log \pi_{\theta}(x_0^{(i)}) \) for \( i = 1, \ldots, N. \)

for \( t = 1, \ldots, T \) do

  • Sample: \( \bar{x}_t^{(i)} \sim \tau_{\theta}(\cdot|x_{t-1}^{(i)}) \quad \text{for} \quad i = 1, \ldots, N. \)

  • Weight: \( W_t^{(i)} = g_{\theta}(y_t|\bar{x}_t^{(i)}), \) for \( i = 1, \ldots, N \).

  • Normalize: \( w_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^{N} W_t^{(j)}} \) for \( i = 1, \ldots, N \).

  • Resample: Sample \( o_t(1), \dots, o_t(N) \sim \) Multinomial( \(w_t^{(1)}, \dots, w_t^{(N)}\) ), and set \( x_t^{(i)} = \bar{x}_t^{(o_t(i))} \) for \( i = 1, \ldots, N \).

  • Compute: \( s_{\theta, t}(x_{t-1}^{(o_t(i))}, x_t^{(i)}) = \nabla \log g_{\theta}(y_t \mid x_t^{(i)}) + \nabla \log \tau_{\theta}(x_t^{(i)} \mid x_{t-1}^{(o_t(i))}) \) for \( i = 1, \ldots, N \).

  • Update the gradient: \( \alpha_t^{(i)} = \alpha_{t-1}^{(o_t(i))} + s_{\theta, t}(x_{t-1}^{(o_t(i))}, x_t^{(i)}) \) for \( i = 1, \ldots, N \).

end for; return the final set \(\{\alpha_T^{(i)}\}_{i=1}^N\).

Then perform the gradient step \( \theta_{k+1} = \theta_k + \gamma_k \bigg(\frac{1}{N} \sum_{i=1}^N \alpha_T^{(i)} \bigg). \)

We repeat this until \(|\theta_{k+1} - \theta_k|\) is smaller than some pre-defined threshold, and output the final \(\theta\) as our maximum-likelihood estimate. We can then plug this \(\theta\) into the Kalman filter we wrote in 2.1 to obtain \(p_{\theta}(y_{1:T})\). A sketch of the whole procedure is given below.
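
Below is a minimal NumPy sketch of the whole procedure under two simplifying assumptions: \(\theta\) is only the transition coefficient (so \(\pi_0\) and \(g\) carry no \(\theta\)-dependence and their scores vanish), and the score of \(\tau_{\theta}\) is written analytically as \((x_t - \theta x_{t-1})x_{t-1}/Q\) rather than via autodiff. All names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m0, P0, Q, H, R = 0.0, 1.0, 0.1, 1.0, 0.5   # fixed model parameters (illustrative)

def bpf_grad_estimate(theta, y, N=500):
    """One BPF pass returning (1/N) * sum_i alpha_T^{(i)}, the Monte Carlo
    estimate of d/dtheta log p_theta(y_{1:T})."""
    x = rng.normal(m0, np.sqrt(P0), size=N)          # x_0^{(i)} ~ pi_0
    alpha = np.zeros(N)                              # s_{theta,0} = 0 (pi_0 is theta-free here)
    for t in range(len(y)):
        x_bar = rng.normal(theta * x, np.sqrt(Q))    # propagate through tau_theta
        logW = -0.5 * (y[t] - H * x_bar) ** 2 / R    # log g(y_t | x_bar) up to a constant
        W = np.exp(logW - logW.max())
        w = W / W.sum()                              # normalized weights
        idx = rng.choice(N, size=N, p=w)             # multinomial resampling, idx[i] = o_t(i)
        s = (x_bar[idx] - theta * x[idx]) * x[idx] / Q   # d/dtheta log tau_theta(x_t | x_{t-1})
        alpha = alpha[idx] + s                       # accumulate scores along the resampled ancestries
        x = x_bar[idx]
    return alpha.mean()

def fit_theta(y, theta0=0.0, gamma=1e-3, max_iter=200, tol=1e-4):
    """Gradient ascent theta_{k+1} = theta_k + gamma * alpha_k, stopped when the step is small."""
    theta = theta0
    for _ in range(max_iter):
        theta_new = theta + gamma * bpf_grad_estimate(theta, y)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```

Because each pass gives only a noisy gradient estimate, in practice a decreasing step size \(\gamma_k\) (or iterate averaging) is needed for the stopping rule on \(|\theta_{k+1} - \theta_k|\) to be reliable.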

Bayesian Inference#

MCMC#

Build an MCMC chain on \(\theta\) targeting the posterior \(p(\theta \mid y_{1:T})\); a sketch of one such chain follows.
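
As a sketch of what such a chain could look like: a random-walk Metropolis sampler over \(\theta\) with the \(\text{Unif}(-1, 1)\) prior used in the NPF section below, assuming some `log_likelihood(theta)` helper (for example, the exact Kalman-filter log-likelihood from 2.1). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def rw_metropolis(log_likelihood, n_steps=5000, step=0.05, theta0=0.0):
    """Random-walk Metropolis chain targeting p(theta | y_{1:T}) ∝ p_theta(y_{1:T}) p_0(theta)."""
    theta, ll = theta0, log_likelihood(theta0)
    chain = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal()            # symmetric Gaussian proposal
        if abs(prop) < 1.0:                           # stay inside the Unif(-1, 1) prior support
            ll_prop = log_likelihood(prop)
            if np.log(rng.uniform()) < ll_prop - ll:  # flat prior cancels in the MH ratio
                theta, ll = prop, ll_prop
        chain.append(theta)
    return np.array(chain)
```

If only a particle estimate of the likelihood is available, the chain has to be constructed more carefully (pseudo-marginal arguments); the sketch above assumes the exact likelihood.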

NPF#

Nested Particle Filtering

Let the prior be \(p_0(\theta) = \text{Unif}(-1, 1)\).

Sample: \( \theta_0^{(i)} \sim p_0 \) for \( i = 1, \ldots, M \), and \( x_0^{(i,j)} \sim \pi_0 \) for \( j = 1, \ldots, N \), \( i = 1, \ldots, M \).

for \( t = 1, \ldots, T \) do

  • Propose \( \hat{\theta}_t^{(i)} \sim \kappa(\cdot|\theta_{t-1}^{(i)}) \quad \text{for} \quad i = 1, \ldots, M. \)

  • Sample Inner layer: \( \bar{x}_t^{(i, j)} \sim \tau_{\hat{\theta}_t^{(i)}}(\cdot|x_{t-1}^{(i, j)}) \quad \text{for} \quad j = 1, \ldots, N, \quad \text{for} \quad i = 1, \ldots, M. \)

  • Weight: \( V_t^{(i, j)} = g_{\hat{\theta}_t^{(i)}}(y_t|\bar{x}_t^{(i, j)}), \quad \text{for} \quad j = 1, \ldots, N, \quad \text{for} \quad i = 1, \ldots, M \).

  • Normalize: \( v_t^{(i,j)} = \frac{V_t^{(i,j)}}{\sum_{k=1}^N V_t^{(i,k)}}, \quad \text{for} \quad j = 1, \ldots, N, \quad \text{for} \quad i = 1, \ldots, M \).

  • Approximate the outer-layer weight (an estimate of the likelihood of \(\hat{\theta}_t^{(i)}\)): \( W_t^{(i)} = \frac{1}{N} \sum_{j=1}^N V_t^{(i,j)}, \quad \text{for} \quad i = 1, \ldots, M \).

  • Normalize \( w_t^{(i)} = \frac{W_t^{(i)}}{\sum_{j=1}^M W_t^{(j)}}, \quad \text{for} \quad i = 1, \ldots, M \)

  • Resample inner particles: \( \tilde{x}_t^{(i,j)} \sim \sum_{k=1}^N v_t^{(i,k)} \delta_{\bar{x}_t^{(i,k)}}, \quad \text{for} \quad j = 1, \ldots, N, \quad \text{for} \quad i = 1, \ldots, M \).

  • Resample outer particles: Sample \( o_t(1), \dots, o_t(M) \sim \) Multinomial( \(w_t^{(1)}, \dots, w_t^{(M)}\) ), and set

\[\begin{align*} \theta_t^{(i)} &= \hat{\theta}_t^{(o_t(i))}\\ x_t^{(i,j)} &= \tilde{x}_t^{(o_t(i), j)} \end{align*}\]

for \( i = 1, \ldots, M \) and \( j = 1, \ldots, N \).

end for

Return \( \{\theta_T^{(i)}\}_{i=1}^M \) and approximate the posterior by

\[ p(\mathrm{d}\theta|y_{1:T}) \approx \frac{1}{M} \sum_{i=1}^M \delta_{\theta_T^{(i)}}(\mathrm{d}\theta) \]
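
A compact NumPy sketch of the nested filter under the same simplifications as before (\(\theta\) is the scalar transition coefficient, other parameters fixed); the Gaussian jitter plays the role of the kernel \(\kappa\), and all names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m0, P0, Q, H, R = 0.0, 1.0, 0.1, 1.0, 0.5   # fixed model parameters (illustrative)

def npf(y, M=200, N=100, jitter=0.02):
    """Nested particle filter: M outer theta-particles, each carrying N inner x-particles."""
    theta = rng.uniform(-1.0, 1.0, size=M)              # theta_0^{(i)} ~ Unif(-1, 1)
    x = rng.normal(m0, np.sqrt(P0), size=(M, N))        # x_0^{(i,j)} ~ pi_0
    for t in range(len(y)):
        theta_hat = theta + jitter * rng.normal(size=M)                  # propose via jittering kernel kappa
        x_bar = rng.normal(theta_hat[:, None] * x, np.sqrt(Q))           # inner propagation
        V = np.exp(-0.5 * (y[t] - H * x_bar) ** 2 / R) / np.sqrt(2 * np.pi * R)   # g(y_t | x_bar)
        W = V.mean(axis=1)                              # outer weight: likelihood estimate for theta_hat^(i)
        w = W / W.sum()
        v = V / V.sum(axis=1, keepdims=True)            # inner weights, normalized per outer particle
        x_tilde = np.empty_like(x_bar)
        for i in range(M):                              # resample inner particles
            x_tilde[i] = x_bar[i, rng.choice(N, size=N, p=v[i])]
        o = rng.choice(M, size=M, p=w)                  # resample outer particles
        theta, x = theta_hat[o], x_tilde[o]
    return theta                                        # plays the role of {theta_T^{(i)}}
```

The returned array is the sample \(\{\theta_T^{(i)}\}_{i=1}^M\) whose empirical distribution gives the posterior approximation above.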