Regression Cost Calculations

The purpose of this vignette is to present the calculations for a piecewise linear regression in which each time step has multiple independent observations.

In what follows, variables identified by Greek letters are considered unknown.

Linear regression

At time step t the vector of iid observations yt = {yt, 1, …, yt, p} is explained by the design matrix Xt and modelled as a multivariate Gaussian distribution. Consider known “background” parameters mt and precision matrix St = Ut′Ut, deviations from which are modelled by θ and Λ through the likelihood

$$ L\left(\mathbf{y}_{t} \left| \theta,\Lambda\right.\right) = \left(2\pi\right)^{-p/2} \det\left(\mathbf{S}_{t}\right)^{1/2} \det\left(\Lambda\right)^{1/2} \exp\left(-\frac{1}{2}\left( \mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t} - \mathbf{X}_{t} \theta\right)^{\prime} \mathbf{U}_{t}^{\prime} \Lambda \mathbf{U}_{t} \left( \mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t} - \mathbf{X}_{t} \theta\right) \right) $$

Pre-whitening the known values such that $\hat{\mathbf{y}}_{t} = \mathbf{U}_{t} \left(\mathbf{y}_{t} - \mathbf{X}_{t} \mathbf{m}_{t}\right)$ and $\hat{\mathbf{X}}_{t} = \mathbf{U}_{t} \mathbf{X}_{t}$ gives

$$ L\left(\mathbf{y}_{t} \left| \theta,\Lambda\right.\right) = \left(2\pi\right)^{-p/2} \det\left(\mathbf{S}_{t}\right)^{1/2} \det\left(\Lambda\right)^{1/2} \exp\left(-\frac{1}{2}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right)^{\prime} \Lambda \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right) \right) $$
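The pre-whitening step can be checked numerically. The following is a minimal Python/NumPy sketch (the vignette itself contains no code, and all names and dimensions here are illustrative) confirming that the quadratic form is unchanged by whitening:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2                        # observations per time step, regressors
y = rng.normal(size=p)             # y_t
X = rng.normal(size=(p, q))        # X_t
m = rng.normal(size=q)             # background coefficients m_t
theta = rng.normal(size=q)         # anomaly coefficients
A = rng.normal(size=(p, p))
S = A @ A.T + p * np.eye(p)        # background precision S_t (SPD)
U = np.linalg.cholesky(S).T        # upper-triangular factor, S = U' U
Lam = np.diag(rng.uniform(1, 2, size=p))  # Lambda

r = y - X @ m - X @ theta
q_raw = r @ U.T @ Lam @ U @ r      # quadratic form before whitening

y_hat = U @ (y - X @ m)            # pre-whitened response
X_hat = U @ X                      # pre-whitened design
r_hat = y_hat - X_hat @ theta
q_white = r_hat @ Lam @ r_hat      # same quadratic form after whitening

assert np.isclose(q_raw, q_white)
```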

Grouping the known values into Kt = p log(2π) − log(det St), the log likelihood is $$ l\left(\mathbf{y}_{t} \left| \theta,\Lambda \right.\right) = -\frac{1}{2}K_{t} + \frac{1}{2}\log\left( \det\left(\Lambda\right)\right) -\frac{1}{2}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right)^{\prime} \Lambda \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta\right) $$

Suppose an anomaly with common parameters occurs over nk consecutive time steps in the set Tk. The log-likelihood of yt ∈ Tk is $$ l\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\Lambda_{k} \right.\right) = -\frac{1}{2}\sum_{t \in T_{k}}K_{t} + \frac{n_{k}}{2}\log\left( \det\left(\Lambda_{k}\right)\right) -\frac{1}{2}\sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) $$

with the cost being twice the negative log likelihood plus a penalty β giving

$$ C\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k}, \Lambda_{k} \right.\right) = \sum_{t \in T_{k}}K_{t} - n_{k}\log\left( \det\left(\Lambda_{k}\right)\right) +\sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) + \beta $$

Sufficient statistics

Computation is greatly aided by being able to keep adequate sufficient statistics. Expanding the summation in the cost gives $$ \sum_{t \in T_{k}}\left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right)^{\prime} \Lambda_{k} \left( \hat{\mathbf{y}}_{t} - \hat{\mathbf{X}}_{t} \theta_{k}\right) = \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{y}}_{t} + \theta^{\prime}_{k}\hat{\mathbf{X}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{X}}_{t} \theta_{k} - 2 \theta^{\prime}_{k} \hat{\mathbf{X}}^{\prime}_{t}\Lambda_{k} \hat{\mathbf{y}}_{t} \right) $$ which, applying the identity x′Λx = tr(xx′Λ) to the first term, equals $$ \sum_{t \in T_{k}} \left( \mathrm{tr}\left( \hat{\mathbf{y}}_{t}\hat{\mathbf{y}}^{\prime}_{t} \Lambda_{k} \right) + \theta^{\prime}_{k}\hat{\mathbf{X}}^{\prime}_{t} \Lambda_{k} \hat{\mathbf{X}}_{t} \theta_{k} - 2 \theta^{\prime}_{k} \hat{\mathbf{X}}^{\prime}_{t}\Lambda_{k} \hat{\mathbf{y}}_{t} \right) $$
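The practical payoff is that cumulative sums of the terms above let the statistics of any candidate segment be recovered in constant time. A Python/NumPy sketch of this idea (the data and names here are illustrative, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, q = 10, 3, 2
y_hat = rng.normal(size=(T, p))        # pre-whitened responses
X_hat = rng.normal(size=(T, p, q))     # pre-whitened designs

# Cumulative sufficient statistics, one entry per time step.
A = np.cumsum([x.T @ x for x in X_hat], axis=0)                  # sum X' X
b = np.cumsum([x.T @ y for x, y in zip(X_hat, y_hat)], axis=0)   # sum X' y
c = np.cumsum([y @ y for y in y_hat])                            # sum y' y

def segment_stats(s, e):
    """Sufficient statistics for the segment T_k = {s, ..., e} in O(1)."""
    if s == 0:
        return A[e], b[e], c[e]
    return A[e] - A[s - 1], b[e] - b[s - 1], c[e] - c[s - 1]

# Check against direct summation over the segment {3, ..., 7}.
Ak, bk, ck = segment_stats(3, 7)
assert np.allclose(Ak, sum(x.T @ x for x in X_hat[3:8]))
assert np.allclose(bk, sum(x.T @ y for x, y in zip(X_hat[3:8], y_hat[3:8])))
assert np.isclose(ck, sum(y @ y for y in y_hat[3:8]))
```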

Baseline: No Anomaly

Here θk = 0, Λk is an identity matrix, and there is no penalty so β = 0. The resulting cost is $$ C_{B}\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k}, \Lambda_{k} \right.\right) = \sum_{t \in T_{k}} K_{t} + \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \hat{\mathbf{y}}_{t} $$

Collective Anomalies

Anomaly in Regression parameters

There is no change in variance so Λk is an identity matrix. The estimate θ̂k of θk can be selected to minimise the cost by taking

$$ \hat{\theta}_{k} = \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{X}}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \hat{\mathbf{X}}_{t}^{\prime} \hat{\mathbf{y}}_{t} \right) $$

Substituting this estimate into the cost gives

$$ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\Lambda_{k}\right.\right) = \sum_{t \in T_{k}} K_{t} + \left( \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) - \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right)^{\prime} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) +\beta $$
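As a numerical check, the estimate above coincides with ordinary least squares on the stacked pre-whitened system, and the quadratic part of the cost can be evaluated from sufficient statistics alone. A Python/NumPy sketch, working in pre-whitened coordinates (illustrative data, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(2)
n_k, p, q = 5, 3, 2
y_hat = rng.normal(size=(n_k, p))               # pre-whitened responses
X_hat = rng.normal(size=(n_k, p, q))            # pre-whitened designs

A = sum(x.T @ x for x in X_hat)                 # sum X' X
b = sum(x.T @ y for x, y in zip(X_hat, y_hat))  # sum X' y
theta_hat = np.linalg.solve(A, b)               # the estimate from the text

# The same estimate from ordinary least squares on the stacked system.
theta_ls = np.linalg.lstsq(X_hat.reshape(-1, q),
                           y_hat.reshape(-1), rcond=None)[0]
assert np.allclose(theta_hat, theta_ls)

# The data-dependent part of the cost, direct vs sufficient-statistic form.
c = sum(y @ y for y in y_hat)
direct = sum((y - x @ theta_hat) @ (y - x @ theta_hat)
             for x, y in zip(X_hat, y_hat))
via_stats = c - b @ np.linalg.solve(A, b)
assert np.isclose(direct, via_stats)
```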

Anomaly in Variance

There is no anomaly in the regression parameters, so θk = 0. The estimate of σk therefore becomes

$$ \hat{\sigma}_{k} = \frac{1}{n_{k}} \sum\limits_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} $$

while the cost is $$ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\sigma_{k}\right.\right) = \sum_{t \in T_{k}}K_{t} + n_{k}\log\left(\hat{\sigma}_{k}\right) + n_{k} + \beta $$
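The closed-form estimate of σk can be checked against a grid search over the variance part of the cost. A Python/NumPy sketch, using synthetic stand-ins for the per-step quadratic terms (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n_k = 8
q_t = rng.chisquare(3, size=n_k)     # stand-ins for the terms y' S^-1 y
sigma_hat = q_t.sum() / n_k          # closed-form estimate from the text

# The sigma-dependent part of the cost as a function of sigma.
cost = lambda s: n_k * np.log(s) + q_t.sum() / s

# The closed form should do at least as well as any value on a fine grid.
grid = np.linspace(0.05, 20.0, 20000)
assert cost(sigma_hat) <= cost(grid).min() + 1e-9
```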

Anomaly in regression parameters and variance

Expanding the quadratic form in the cost gives $$ \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t} - \mathbf{X}_{t} \theta_{k}\right)^{\prime} \mathbf{S}_{t}^{-1} \left( \hat{\mathbf{y}}_{t} - \mathbf{X}_{t} \theta_{k}\right) = \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} - 2 \theta_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} + \theta_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \theta_{k} \right) $$

The estimate θ̂k of θk can be selected to minimise the cost by taking $$ \hat{\theta}_{k} = \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) $$

Substituting this result into the cost and minimising over σk gives $$ \hat{\sigma}_{k} = \frac{1}{n_{k}} \sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} - 2 \hat{\theta}_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} + \hat{\theta}_{k}^{\prime}\mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \hat{\theta}_{k} \right) $$ which simplifies to $$ \hat{\sigma}_{k} = \frac{1}{n_{k}} \left[ \left( \sum_{t \in T_{k}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) - \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right)^{\prime} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\mathbf{X}_{t} \right)^{-1} \left( \sum\limits_{t \in T_k} \mathbf{X}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} \right) \right] $$

The cost is given by $$ C_{C}\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\sigma_{k}\right.\right) = \sum_{t \in T_{k}}K_{t} + n_{k}\log\left(\hat{\sigma}_{k}\right) + n_{k} + \beta $$
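The simplification of σ̂k above can be verified numerically. A Python/NumPy sketch working in pre-whitened coordinates, where the St−1 factors reduce to identities (illustrative data and names):

```python
import numpy as np

rng = np.random.default_rng(3)
n_k, p, q = 6, 3, 2
y_hat = rng.normal(size=(n_k, p))               # pre-whitened responses
X_hat = rng.normal(size=(n_k, p, q))            # pre-whitened designs

A = sum(x.T @ x for x in X_hat)                 # sum X' S^-1 X
b = sum(x.T @ y for x, y in zip(X_hat, y_hat))  # sum X' S^-1 y
c = sum(y @ y for y in y_hat)                   # sum y' S^-1 y
theta_hat = np.linalg.solve(A, b)

# sigma-hat from the expanded residual sum ...
sigma_direct = sum((y - x @ theta_hat) @ (y - x @ theta_hat)
                   for x, y in zip(X_hat, y_hat)) / n_k
# ... and from the simplified sufficient-statistic form.
sigma_stats = (c - b @ np.linalg.solve(A, b)) / n_k
assert np.isclose(sigma_direct, sigma_stats)
```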

Point anomaly

A point anomaly occurs at a single time instance and is represented as a variance anomaly. Naively, the cost could be computed using the formula for a variance anomaly as $$ C_{p}\left(\mathbf{y}_{t} \left| \sigma_{t}\right.\right) = K_{t} + n_{t}\log\left(\hat{\sigma}_{t}\right) + n_{t} + \beta $$ with $$ \hat{\sigma}_{t} = \frac{1}{n_{t}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t} $$

Relating this to the background cost, we see that point anomalies may be accepted in the capa search when $$ f\left(\hat{\sigma}_{t},\beta\right) = C_{p}\left(\mathbf{y}_{t} \left| \sigma_{t}\right.\right) - C_{B}\left(\mathbf{y}_{t} \left| \sigma_{t}\right.\right) = n_{t}\log\left(\hat{\sigma}_{t}\right) + n_{t} + \beta - n_{t}\hat{\sigma}_{t} < 0 $$

The function log (σ̂t) + 1 − σ̂t is zero at σ̂t = 1 and negative on both sides, which indicates that point anomalies may be declared for both outlying and inlying data.
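Evaluating log (σ̂t) + 1 − σ̂t at a few points makes this behaviour concrete. A small Python/NumPy sketch (values chosen purely for illustration):

```python
import numpy as np

# g is negative for both inlying (sigma < 1) and outlying (sigma > 1) fits,
# so with a small penalty either kind of point could be flagged.
g = lambda s: np.log(s) + 1.0 - s

for s in [0.1, 0.5, 1.0, 2.0, 10.0]:
    print(f"sigma = {s:5.2f}  g = {g(s): .4f}")

assert g(1.0) == 0.0                 # maximum of zero at sigma = 1
assert g(0.1) < 0 and g(10.0) < 0    # negative on both sides
```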

In the case of nt = 1, Fisch et al. control this by modifying the cost of a point anomaly so that it is expressed as $$ C_{p}\left(\mathbf{y}_{t} \left| \sigma_{t},\mathbf{X}_{t}\right.\right) = \log\left(\exp\left(-\beta\right) + \hat{\sigma}_{t}\right) + K_{t} + 1 + \beta $$

This has the effect of allowing only outlier anomalies, something that can be achieved much more easily by taking

$$ \hat{\sigma}_{t} = \max\left(1,\frac{1}{n_{t}} \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1}\hat{\mathbf{y}}_{t}\right) $$

giving the cost as

$$ C\left(\mathbf{y}_{t \in T_{k}} \left| \theta_{k},\sigma_{k}\right.\right) = \sum_{t \in T_{k}} K_{t} +n_{k} \log\left(\hat{\sigma}_{k}\right) +\frac{1}{\hat{\sigma}_{k}}\sum_{t \in T_{k}} \left( \hat{\mathbf{y}}_{t}^{\prime} \mathbf{S}_{t}^{-1} \hat{\mathbf{y}}_{t} \right) +\beta $$
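A quick numerical illustration of the clamped estimate (Python/NumPy; the penalty value and the unclamped estimates below are arbitrary stand-ins): inlying fits now pay exactly the penalty, while only sufficiently outlying fits yield a negative difference from the background cost.

```python
import numpy as np

n_t, beta = 1, 2.0
raw = np.array([0.05, 0.5, 1.0, 4.0, 25.0])   # unclamped variance estimates
sig = np.maximum(1.0, raw)                     # clamped estimates

# Difference between the clamped cost and the background cost:
# n_t*log(sig) + n_t*raw/sig + beta - n_t*raw
diff = n_t * np.log(sig) + n_t * raw / sig + beta - n_t * raw
print(diff)

# Inlying points (raw <= 1) always pay exactly the penalty beta ...
assert np.allclose(diff[raw <= 1.0], beta)
# ... while only a sufficiently outlying point gives a negative difference.
assert diff[-1] < 0
assert np.sum(diff < 0) == 1
```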