RNNs for Irregularly Spaced Time Series

This post is mainly based on

Irregularly Space Time Series

Real-world time-series observations are recorded irregularly, with measurement frequency varying between data sources, variables and over time
Simple Imputation: fix-width time steps + imputation + indicator variable
GRU-D: variable-width time steps + time interval width + decay
CT-GRU: decomposition of hidden state decaying at different rates

Simple Imputation

Strategy: Group observations into a sequence with discrete, fixed-width time steps
Data processing: Indicator Variable + Imputation
Performance: significantly outperforms Linear Regression / MLP / Engineered features

Missing Values

Missing values are not missing at random
The pattern of recorded measurements could contain potential information about the state of the data source
For example, in clinical setting, the fact that some lab values are measured more often than other may implies state of patients
Heuristic or unsupervised imputation ignores the information carried in the missingness itself

Treatment

Indicator Variable
- if $x^{(t)}_i$ is missing, set $m^{(t)}_i = 1$
Imputation
- Zero imputation: $x^{(t)}_i = 0$ if missing variable $i$ at time step $t$
- Forward-filling: $x^{(t)}_i = x^{(t’)}_i$ for previous recorded time step $t’$
  - If no previous measurement available, use training data median

impute

Top left: no imputation or indicators. Bottom left: imputation absent indicators. Top right: indicators but no imputation. Bottom right: indicators and imputation. Time flows from left to right.

Experiments

Data
- Irregularly spaced measurements of 13 variables in clinical dataset
- Combine multiple measurements of the same variable within the same hour window by taking their mean
- Scale each variable to the [0, 1] interval, using expert-defined ranges
Model: LSTM
- Layer = 2
- Dim = 128
- Non-recurrent dropout = 0.5
- L-2 weight decay = $10^{−6}$

impute-eval

Performance on aggregate metrics for logistic regression (Log Reg), MLP, and LSTM classifiers with and without imputation and missing data indicators.

Analysis of Result

RNN out-performs Hand-Engineered Features
Missing pattern contains information
- See: LSTM - Indicators Only
Even without indicators, the RNN might learn to recognize filled-in vs real values
- For forward-filling, the RNN could learn to recognize exact repeats
- For zero-filling, the RNN could recognize that values set to exactly 0 were likely missing measurements
- See: LSTM - Zeros and LSTM - Impute

GRU-D

Handle variable-width time steps with missing values
- Missing values: imputation + indicator variable
- Variable-width time steps: absolute time + $\Delta$ time
Outperforms SOTA on MIMIC-III, PhysioNet mortality prediction

Treatment

Data pre-processing / added dimensions

grud-data

An example of measurement vectors $x_t$, time stamps $s_t$, masking $m_t$, and time interval $\delta_t$.

GRU-D

GRU-D: GRU with trainable decay
Influence of the input variables will fade away over time if the variable has been missing for a while
Decay
- Input decay rate: $\gamma_t$
- Hidden state decay rate: $\gamma_h$

Input decay

\[\gamma_t = \exp \{ -\max(0, W_\gamma \delta_t + b_\gamma) \}\]

where $W_\gamma, b_\gamma$ are trainable and $W_\gamma$ constrain to be diagonal (decay rate of each variable independent from the others).

Input decay is directly applied to forward imputed missing value to decay it over time toward the empirical mean

\[\gamma_t^d x_{t'}^d + (1-\gamma_t^d)\tilde{x}^d\]

where,

$d$: dimension number
$x_{t’}^d$: last observation of the d-th variable
$\tilde{x}^d$: empirical mean of the d-th variable

Hidden state decay

\[h_{t-1} \leftarrow \gamma_h \odot h_{h-1}\]

where $W_\gamma$ is not constrain to be diagonal

GRU-D

\[z_t = \sigma(W_z x_t + U_z h_{t-1} + V_z m_t + b_z)\] \[r_t = \sigma(W_r x_t + U_r h_{t-1} + V_r m_t + b_r)\] \[\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + V m_t + b)\] \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

where $\odot$ is element-wise multiplication

GRU for Reference

\[z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)\] \[r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)\] \[\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}) + b)\] \[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

grud-arch

Experiments

Datasets
- MIMIC-III: ICD-9 20 tasks
- PhysioNet: All 4 tasks
Baseline: indicator variable ($m_t$) + mean/forward imputation
Adding $\delta_t$ improves performance (GRU-simple)
Adding learned decay parameter further improves performance (GRU-D)

grud-eval

Model performances measured by AUC score (mean ± std) for mortality prediction.

Ablation

grud-ablation

Left: GRU-D performance continued to increased when given longer historical data. x-axis, # of hours after admission; y-axis, AUC score; Dash line, RF-simple results for 48 hours.

Right: GRU-D scaled better with more training data. x-axis, subsampled dataset size; y-axis, AUC score.

CT-GRU / Hidden State Decomposition

Impact of event could be short-lived or long-lasting
- Decompose hidden state $h$ into memory-decaying at different rate
- Each decaying rate is called a trace
- Store event in different traces
Results: fails to outperform benchmark

Background

Problems with GRU
- Sequences may have different structure at different scales
- GRU has too much flexibility / no inductive bias on hidden state decay
Goal: Add a temporal inductive bias to RNN (similar to spatial inductive bias of CNN)
Continuous-time GRU (CT-GRU)
- Endow each hidden unit with multiple memory traces that span a range of time scales
- Define future state $h$ as differential equation: $dh/dt = -h/\tau$
- Hence, $h(t) = e^{-t/\tau} h(0)$
- Time-scale $\tau$ is defined as the time for the state to decay to a proportion $e^{−1} = 0.37$ of its initial level

Suppose impact of event decay at ground truth rate $\tau^s_k$

$s$: “storage”
$k$: time step

Fixed set of M traces with log-linear spaced time scales: $\tilde{T} = { \tau_1, \tau_2, …, \tau_M }$ Approximate true $\tau^s_k$ by a mixture of traces from $\tilde{T}$

ctgru-decay

(a) Half life for a range of time scales: true value (dashed black line) and mixture approximation (blue line). (b) Decay curves for time scales $\tau \in [10, 100]$ (solid lines) and the mixture approximation (dashed lines).

CT-GRU

Refer to paper’s Section 2.2

Experiments

Baseline model
- GRU
- GRU with $\Delta t$ (add time-lag between current event and previous event as input)
CT-GRU fails to outperform benchmark
- CT-GRU performs no better than the GRU with $\Delta t$ on Synthetic / Real world dataset
The author suggest that: Although CT-GRU and GRU enforce different degree of flexibility, but perhaps the space of solutions they can encode is roughly the same