The purpose of this vignette is to present the calculations for a piecewise linear regression where, for each time step, there are multiple independent observations.
In the following, variables identified by Greek letters are considered unknown.
At time step \(t\) the vector of iid observations \(\mathbf{y}_{t}=\left\{y_{t,1},\ldots,y_{t,p}\right\}\) is explained by the design matrix \(\mathbf{X}_{t}\) and modelled as a multivariate Gaussian distribution. Consider known "background" parameters \(\mathbf{m}_{t}\) and precision matrix \(\mathbf{S}_{t} = \mathbf{U}_{t}^{\prime}\mathbf{U}_{t}\), deviations from which are modelled by \(\theta\) and \(\Lambda\) through the likelihood
\[ L\left(\mathbf{y}_{t} \left| \theta,\Lambda\right.\right) = \left(2\pi\right)^{-p/2} \det\left(\mathbf{S}_{t}\right)^{1/2} \det\left(\Lambda\right)^{1/2} \exp\left(-\frac{1}{2}\left( \mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t} - \mathbf{X}_{t} \theta\right)^{\prime} \mathbf{U}_{t}^{\prime} \Lambda \mathbf{U}_{t} \left( \mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t} - \mathbf{X}_{t} \theta\right) \right) \]
Pre-whitening the known values such that \(\hat{\mathbf{y}}_{t} = \mathbf{U}_{t} \left(\mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t}\right)\) and \(\hat{\mathbf{X}}_{t} = \mathbf{U}_{t} \mathbf{X}_{t}\) gives
\[ L\left(\mathbf{y}_{t} \left| \theta,\Lambda\right.\right) = \left(2\pi\right)^{-p/2} \det\left(\mathbf{S}_{t}\right)^{1/2} \det\left(\Lambda\right)^{1/2} \exp\left(-\frac{1}{2}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right)^{\prime} \Lambda \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right) \right) \]
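The pre-whitening step can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated precision matrix; the function name `prewhiten` and all simulated values are illustrative and not part of the original calculations. It takes \(\mathbf{U}_{t}\) to be the transpose of the lower Cholesky factor of \(\mathbf{S}_{t}\), which is one of several valid choices satisfying \(\mathbf{S}_{t} = \mathbf{U}_{t}^{\prime}\mathbf{U}_{t}\), and sets \(\Lambda\) to the identity.

```python
import numpy as np


def prewhiten(y, X, m, S):
    """Return (y_hat, X_hat) for a precision matrix S = U'U.

    U is taken as the transpose of the lower Cholesky factor of S,
    which is an assumption of this sketch rather than a requirement.
    """
    U = np.linalg.cholesky(S).T          # S = L L', so U = L' gives S = U'U
    y_hat = U @ (y - X @ m)              # whitened, background-centred data
    X_hat = U @ X                        # whitened design matrix
    return y_hat, X_hat


rng = np.random.default_rng(0)
p, q = 5, 2                              # observations per step, regressors
A = rng.normal(size=(p, p))
S = A @ A.T + p * np.eye(p)              # symmetric positive definite precision
X = rng.normal(size=(p, q))
m = rng.normal(size=q)
theta = rng.normal(size=q)
y = rng.normal(size=p)

y_hat, X_hat = prewhiten(y, X, m, S)

# Quadratic form before and after whitening (Lambda = identity here)
r = y - X @ m - X @ theta
quad_original = r @ S @ r
r_hat = y_hat - X_hat @ theta
quad_whitened = r_hat @ r_hat
```

The two quadratic forms agree up to floating point error, which is the identity used to move from the first form of the likelihood to the whitened form.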
Grouping the known values into \(K_{t} = p\log\left(2\pi\right) - \log\left(\det{\mathbf{S}_{t}}\right)\) the log likelihood is \[ l\left(\mathbf{y}_{t} \left| \theta,\Lambda \right.\right) = -\frac{1}{2}K_{t} + \frac{1}{2}\log\left( \det\left(\Lambda\right)\right) -\frac{1}{2}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right)^{\prime} \Lambda \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right) \]
Suppose an anomaly with common parameters occurs over \(n_{k}\) consecutive time steps in the set \(T_{k}\). The log-likelihood of \(\mathbf{y}_{t \in T_{k}}\) is \[ l\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\Lambda_{k} \right.\right) = -\frac{1}{2}\sum_{t \in T_{k}}K_{t} + \frac{n_{k}}{2}\log\left( \det\left(\Lambda_{k}\right)\right) -\frac{1}{2}\sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) \]
with the cost being twice the negative log likelihood plus a penalty \(\beta\) giving
\[ C\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k}, \Lambda_{k} \right.\right) = \sum_{t \in T_{k}}K_{t} - n_{k}\log\left( \det\left(\Lambda_{k}\right)\right) +\sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) + \beta \]
Computation is greatly aided by being able to keep adequate sufficient statistics. Expanding the summation in the cost gives \[ \sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) = \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{y}}_{t} + \theta^{\prime}_{k}\hat{\mathbf{X}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{X}}_{t} \theta_{k} - 2 \theta^{\prime}_{k} \hat{\mathbf{X}}^{\prime}_{t}\Lambda_{k} \hat{\mathbf{y}}_{t} \right) \] \[ = \sum_{t \in T_{k}} \left( \mathrm{tr}\left( \hat{\mathbf{y}}_{t}\hat{\mathbf{y}}^{\prime}_{t} \Lambda_{k} \right) + \theta^{\prime}_{k}\hat{\mathbf{X}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{X}}_{t} \theta_{k} - 2 \theta^{\prime}_{k} \hat{\mathbf{X}}^{\prime}_{t}\Lambda_{k} \hat{\mathbf{y}}_{t} \right) \]
For the background, where no anomaly is present, \(\theta_{k}=\mathbf{0}\), \(\Lambda_{k}\) is an identity matrix and there is no penalty so \(\beta = 0\). The resulting cost is \[ C_{B}\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k}, \Lambda_{k} \right.\right) = \sum_{t \in T_{k}} K_{t} + \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \]
There is no change in variance so \(\Lambda_{k}\) is an identity matrix. The estimate \(\hat{\theta}_{k}\) of \(\theta_{k}\) can be selected to minimise the cost by taking
\[ \hat{\theta}_{k} = \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{X}}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \right) \]
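As a short worked step (a sketch of the standard least-squares argument, with \(\Lambda_{k}\) an identity matrix and assuming \(\sum_{t \in T_{k}} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{X}}_{t}\) is invertible), this estimate follows from setting the gradient of the \(\theta_{k}\)-dependent part of the cost to zero:
\[ \frac{\partial}{\partial \theta_{k}} \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) = -2 \sum_{t \in T_{k}} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{y}}_{t} + 2 \left( \sum_{t \in T_{k}} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{X}}_{t} \right) \theta_{k} = \mathbf{0} \]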
Substituting \(\hat{\theta}_{k}\) into the cost gives \[ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \hat{\theta}_{k}, \Lambda_{k}\right.\right) = \sum_{t \in T_{k}} K_{t} + \left( \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \right) - \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \right)^{\prime} \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{X}}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \right) +\beta \]
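The role of the sufficient statistics in the cost of an anomaly in the regression parameters can be illustrated with a small sketch. It assumes NumPy, already whitened inputs, \(\Lambda_{k}\) equal to the identity, and simulated data; the function name `mean_anomaly_cost` is hypothetical, and the known term \(\sum_{t \in T_{k}} K_{t}\) is omitted since it does not depend on the estimate.

```python
import numpy as np


def mean_anomaly_cost(y_hats, X_hats, beta):
    """Cost of an anomaly in the regression parameters, without sum of K_t.

    y_hats, X_hats: lists of whitened observations and design matrices,
    one entry per time step in the segment. Lambda_k is the identity.
    """
    # Sufficient statistics accumulated over the segment
    syy = sum(y @ y for y in y_hats)                     # sum of y_hat' y_hat
    sxx = sum(X.T @ X for X in X_hats)                   # sum of X_hat' X_hat
    sxy = sum(X.T @ y for X, y in zip(X_hats, y_hats))   # sum of X_hat' y_hat

    theta_hat = np.linalg.solve(sxx, sxy)                # least-squares estimate
    cost = syy - sxy @ theta_hat + beta                  # cost at the estimate
    return theta_hat, cost


rng = np.random.default_rng(1)
n_k, p, q = 6, 4, 2
X_hats = [rng.normal(size=(p, q)) for _ in range(n_k)]
y_hats = [X @ np.array([1.0, -0.5]) + rng.normal(size=p) for X in X_hats]

theta_hat, cost = mean_anomaly_cost(y_hats, X_hats, beta=3.0)

# Direct evaluation of the residual sum of squares for comparison
rss = sum((y - X @ theta_hat) @ (y - X @ theta_hat)
          for X, y in zip(X_hats, y_hats))
```

Only the three accumulated sums are needed to evaluate the cost, so a search over candidate segments could in principle store their cumulative versions rather than the raw data.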
There is no mean anomaly in the regression parameters so \(\theta_{k}=0\). The estimate of \(\sigma_{k}\) therefore changes to
\[ \hat{\sigma}_{k} = \frac{1}{n_{k}} \sum\limits_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \]
while the cost is \[ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \mu_t,m_k,\sigma_k,s_k\right.\right) = \sum_{t \in T_{k}} K_{t} +n_{k} \log\left(\hat{\sigma}_{k}\right) +n_{k} +\beta \]
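The variance estimate and the resulting cost can be checked with a few lines of code. This is a minimal sketch, assuming NumPy and treating the per-time-step quadratic forms \(\hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t}\) as given inputs; the function names and numerical values are illustrative only, and \(\sum_{t \in T_{k}} K_{t}\) is omitted. The helper `cost_at` assumes the cost at an arbitrary \(\sigma\) has the form \(n_{k}\log\sigma + \sigma^{-1}\sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} + \beta\).

```python
import numpy as np


def variance_anomaly_cost(quad_forms, beta):
    """Variance anomaly cost at the estimate, without the sum of K_t.

    quad_forms: one value of y_hat' S^{-1} y_hat per time step in the
    segment (treated as given inputs in this sketch).
    """
    q = np.asarray(quad_forms, dtype=float)
    n_k = q.size
    sigma_hat = q.sum() / n_k                      # estimate of sigma_k
    cost = n_k * np.log(sigma_hat) + n_k + beta    # cost at the estimate
    return sigma_hat, cost


def cost_at(sigma, quad_forms, beta):
    """Same cost evaluated at an arbitrary sigma rather than the estimate."""
    q = np.asarray(quad_forms, dtype=float)
    return q.size * np.log(sigma) + q.sum() / sigma + beta


quad_forms = [0.8, 2.5, 4.1, 3.3]
sigma_hat, cost = variance_anomaly_cost(quad_forms, beta=2.0)

# Evaluate the cost on a grid to confirm the estimate is the minimiser
grid = np.linspace(0.5, 6.0, 200)
grid_costs = cost_at(grid, quad_forms, beta=2.0)
```

No value on the grid gives a lower cost than the one at \(\hat{\sigma}_{k}\), which is consistent with the estimate being the minimiser of the assumed form.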
Since \[ \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t} - \mathbf{X}_{t} \theta_{k}\right)^{\prime} \mathbf{S}_{t}^{-1} \left( \hat{\mathbf{y}}_{t} - \mathbf{X}_{t} \theta_{k}\right) = \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} - 2 \theta_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} + \theta_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \theta_{k} \right) \]
the estimate \(\hat{\theta}_{k}\) of \(\theta_{k}\) can be selected to minimise the cost by taking \[ \hat{\theta}_{k} = \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) \]
Substitution of this result into the cost gives \[ \hat{\sigma}_{k} = \frac{1}{n_{k}} \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} - 2 \hat{\theta}_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} + \hat{\theta}_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \hat{\theta}_{k} \right) \] which simplifies to \[ \hat{\sigma}_{k} = \frac{1}{n_{k}} \left[ \left( \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) - \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right)^{\prime} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) \right] \]
The cost is given by \[ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \mu_t,m_k,\sigma_k,s_k\right.\right) = \sum_{t \in T_{k}} K_{t} +n_{k} \log\left(\hat{\sigma}_{k}\right) +n_{k} +\beta \]
There is no change in variance so \(\sigma_{k}=1\). The estimate \(\hat{\theta}_{k}\) is unchanged, which gives a cost of
\[ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \mu_t,m_k,\sigma_k,s_k\right.\right) = \sum_{t \in T_{k}} K_{t} + \left( \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) - \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right)^{\prime} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) +\beta \]
There is no mean anomaly in the regression parameters so \(\theta_{k}=0\). The estimate of \(\sigma_{k}\) therefore changes to
\[ \hat{\sigma}_{k} = \frac{1}{n_{k}} \sum\limits_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \]
while the cost is \[ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \mu_t,m_k,\sigma_k,s_k\right.\right) = \sum_{t \in T_{k}} K_{t} +n_{k} \log\left(\hat{\sigma}_{k}\right) +n_{k} +\beta \]
A point anomaly occurs at a single time instance and is represented as a variance anomaly. Naively the cost could be computed using the formulae for a variance anomaly as \[ C_{p}\left(\mathbf{y}_{t}\left| \sigma_{t}\right.\right) = K_{t} + n_{t} \log\left( \hat{\sigma}_{t} \right) + n_{t} + \beta \] with \[ \hat{\sigma}_{t} = \frac{1}{n_{t}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \]
Relating this to the background cost we see that point anomalies may be accepted in the capa search when \[ f\left(\hat{\sigma}_{t},\gamma,\beta\right) = C_{p}\left(y_{t}\left| \mu_{t},\sigma_{t}\right.\right) - C_{B}\left(y_{t}\left| \mu_{t},\sigma_{t}\right.\right) = n_{t} \log \left( \hat{\sigma}_{t} \right) + n_{t} + \beta - n_{t} \hat{\sigma}_{t} < 0 \]
The following plot shows \(\log \left( \hat{\sigma}_{t} \right) + 1 - \hat{\sigma}_{t}\), which is zero at \(\hat{\sigma}_{t}=1\) and negative for both smaller and larger values. This indicates that point anomalies may be declared for both outlying and inlying data.
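The behaviour of \(\log \left( \hat{\sigma}_{t} \right) + 1 - \hat{\sigma}_{t}\) can also be tabulated directly. The sketch below assumes NumPy; the function name `g`, the grid of \(\hat{\sigma}_{t}\) values and the choices \(n_{t}=1\) and \(\beta=2\) are illustrative assumptions, not values taken from the original calculations.

```python
import numpy as np


def g(sigma_hat):
    """log(sigma_hat) + 1 - sigma_hat, the per-observation part of f."""
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    return np.log(sigma_hat) + 1.0 - sigma_hat


sigma_values = np.array([0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
g_values = g(sigma_values)

# With n_t observations and penalty beta, f = n_t * g + beta;
# a point anomaly may be accepted when f < 0
n_t, beta = 1, 2.0
accepted = n_t * g_values + beta < 0

for s, val, acc in zip(sigma_values, g_values, accepted):
    print(f"sigma_hat={s:6.2f}  g={val:8.3f}  accepted={acc}")
```

Under these illustrative settings the acceptance condition holds at both ends of the grid, for the very small value \(\hat{\sigma}_{t}=0.01\) as well as for the large values \(5\) and \(10\), which is the inlier and outlier behaviour described above.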
In the case of \(n_{t}=1\) Fisch et al. control this by modifying the cost of a point anomaly so it is expressed as \[ C_{p}\left(y_{t}\left| \sigma_{t}, \mathbf{X}_{t}\right.\right) = \log\left(\exp\left(-\beta\right) + \hat{\sigma}_{t} \right) + K_{t} + 1 + \beta \]
This has the effect of allowing only outlier anomalies, something that can be much more easily achieved by taking
\[ \hat{\sigma}_{t} = \max\left(1,\frac{1}{n_{t}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t}\right) \]
giving the cost as
\[ C\left(\mathbf{y}_{t \in T_{k}} \left| \mu_t,m_k,\sigma_k,s_k\right.\right) = \sum_{t \in T_{k}} K_{t} +n_{k} \log\left(\hat{\sigma}_{k}\right) +\frac{1}{\hat{\sigma}_{k}}\sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} \right) +\beta \]
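The effect of flooring the variance estimate at one can be seen by comparing the point anomaly cost with the background cost for a single time step. This is a minimal sketch, assuming NumPy; the function name `point_anomaly_gain`, the `truncate` flag and the numerical inputs are hypothetical, and \(K_{t}\) is dropped because it cancels in the difference.

```python
import numpy as np


def point_anomaly_gain(quad_form, n_t, beta, truncate=True):
    """Point anomaly cost minus background cost for one time step.

    quad_form is y_hat' S^{-1} y_hat; K_t cancels in the difference.
    With truncate=True the variance estimate is floored at 1.
    """
    sigma_hat = quad_form / n_t
    if truncate:
        sigma_hat = max(1.0, sigma_hat)
    point_cost = n_t * np.log(sigma_hat) + quad_form / sigma_hat + beta
    background_cost = quad_form
    return point_cost - background_cost


n_t, beta = 1, 2.0

# An inlying observation: accepted by the naive cost, rejected once floored
inlier_naive = point_anomaly_gain(0.01, n_t, beta, truncate=False)
inlier_truncated = point_anomaly_gain(0.01, n_t, beta, truncate=True)

# An outlying observation is still accepted with the floored estimate
outlier_truncated = point_anomaly_gain(10.0, n_t, beta, truncate=True)
```

A negative value means the point anomaly is preferred to the background. With the floored estimate the difference for an inlying observation equals \(\beta\), so it is never accepted for a positive penalty, while sufficiently outlying observations still give a negative difference.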