Spatio-Temporal Graph Convolutional Network with an attention-based MIL ensemble.

Our solution for the Video-based Seizure Detection Challenge 2026 at the Artificial Intelligence in Epilepsy and Other Neurological Disorders Conference.

Overview

We achieved 5th place in the 2026 Video-based Seizure Detection Challenge at the Artificial Intelligence in Epilepsy and Other Neurological Disorders Conference with a Spatio-Temporal Graph Convolutional Network (ST-GCN) combined with an attention-based multiple instance learning (MIL) ensemble.

The challenge asked participants to detect epileptic spasms from short anonymized video clips of infants using pose estimation sequences rather than raw RGB videos. Publicly, the organizers describe the task as classifying 5-second segments as seizure or non-seizure and frame the challenge around earlier and more accessible diagnosis of Infantile Epileptic Spasm Syndrome (IESS).1

This post is a technical write-up of how we approached the problem, what ended up mattering most, and why the final solution was built around skeletal dynamics instead of conventional frame-based video models.

The code is available on GitHub.

The Challenge

The public challenge page highlights three constraints that shaped the modeling strategy immediately:1

  1. The input is video-based, but the usable representation is pose estimation data.
  2. Each sample is a short 5-second segment.
  3. The target is binary seizure detection for infant spasms.

That combination makes the problem interesting. It is a vision task, but not a classic image-classification problem. We are not trying to recognize static appearance. We are trying to detect a temporal motor event from noisy human keypoints under partial visibility, missing landmarks, and child-specific movement patterns.

The pose-only setup is also important from a privacy standpoint. Since the organizers do not permit redistribution of the dataset, we are sharing the code, modeling ideas, and implementation details, but not the data itself.

Our Approach

1. Pose Preprocessing

The raw input for each segment is a tensor of shape (T, V, 5), where T is the number of frames in the 5-second segment, V = 33 is the number of MediaPipe pose landmarks, and the last dimension holds the (x, y, z) coordinates plus two per-landmark confidence values from the pose extractor.

The first part of the pipeline is entirely about making these pose sequences more stable and more informative.

Interpolation over missing landmarks

Pose extraction is never perfect. Some joints disappear temporarily, some frames are partially corrupted, and some landmarks are unreliable. Instead of discarding those frames, we interpolate missing (x, y, z) values over time and keep a joint-validity mask so the model still knows which coordinates were originally observed.

def interpolate_landmarks(arr):
    # arr: (T, V, 5) raw landmarks; the first three channels are (x, y, z)
    xyz = arr[..., 0:3]
    valid = np.isfinite(xyz).all(axis=-1)  # (T, V) observed-joint mask

    xyz_out = np.zeros((PP.T, PP.V, 3), dtype=np.float32)
    for v in range(PP.V):
        m = valid[:, v]
        for k in range(3):
            # fill temporal gaps per joint and per coordinate
            xyz_out[:, v, k] = _interp_1d_over_time(xyz[:, v, k], m)

    # keep the original observation mask so the model knows
    # which coordinates were interpolated
    joint_mask = valid.astype(np.float32)[..., None]
    return xyz_out, joint_mask

Root-relative coordinates

Absolute pose coordinates are less useful than body-relative motion. A child moving slightly in the frame or a change in camera framing should not look like a seizure. We therefore compute the body root as the midpoint of the left and right hip landmarks and express all coordinates relative to that root. When the hips are missing, we carry forward the last valid root to avoid discontinuities.

This one decision makes the model much more sensitive to movement patterns rather than camera position.
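
As a sketch of that recentering step (hip indices follow the MediaPipe Pose convention; the helper name is illustrative, not the repository's exact API):

```python
import numpy as np

L_HIP, R_HIP = 23, 24  # MediaPipe Pose hip landmark indices

def to_root_relative(xyz, joint_mask):
    """Subtract the mid-hip root from every joint, carrying the last
    valid root forward when the hips are missing."""
    out = xyz.copy()
    last_root = np.zeros(3, dtype=np.float32)
    for t in range(xyz.shape[0]):
        hips_valid = joint_mask[t, L_HIP, 0] > 0.5 and joint_mask[t, R_HIP, 0] > 0.5
        if hips_valid:
            last_root = 0.5 * (xyz[t, L_HIP] + xyz[t, R_HIP])
        out[t] -= last_root  # body-relative coordinates, no jumps on dropout
    return out
```

Carrying the last valid root forward is what avoids sudden coordinate jumps when the hip detections briefly drop out.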

Mask-aware motion features

After interpolation and recentering, we compute, for every joint and frame, the root-relative (x, y, z) position, mask-aware velocity and acceleration, and magnitude and validity features (including the joint mask and the fraction of valid joints in each frame).

That gives us 14 channels per joint per frame. The most useful design choice here was making the derivatives mask-aware: velocity is only computed when a joint is valid in consecutive frames, and acceleration is only computed when velocity itself is valid in consecutive steps.

def compute_vel_acc_masked(xyz, joint_mask):
    m = (joint_mask[..., 0] > 0.5)  # (T, V) boolean validity
    T, V, _ = xyz.shape

    # fraction of valid joints per frame, kept as an extra feature
    frame_valid_frac = (m.sum(axis=1) / float(V)).astype(np.float32)[:, None]

    vel = np.zeros((T, V, 3), dtype=np.float32)
    acc = np.zeros((T, V, 3), dtype=np.float32)

    # velocity only where a joint is valid in two consecutive frames
    vmask = m[1:] & m[:-1]
    dv = (xyz[1:] - xyz[:-1]).astype(np.float32)
    vel[1:] = dv * vmask[..., None].astype(np.float32)

    # acceleration only where velocity is valid in two consecutive steps
    amask = vmask[1:] & vmask[:-1]
    da = (vel[2:] - vel[1:-1]).astype(np.float32)
    acc[2:] = da * amask[..., None].astype(np.float32)

    return vel, acc, frame_valid_frac

In practice, this avoids teaching the network false motion created by missing pose detections.

2. Graph Construction

Once the features are built, each frame becomes a graph over the 33 MediaPipe landmarks. The edges come from the standard body connectivity structure: arms, legs, torso, face anchors, and hip-shoulder links.

The adjacency matrix is row-normalized and includes self-loops:

def build_adjacency(num_nodes=33, edges=POSE_EDGES, self_loops=True):
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i, j in edges:
        A[i, j] = 1.0  # undirected body edges
        A[j, i] = 1.0
    if self_loops:
        np.fill_diagonal(A, 1.0)  # let each joint keep its own features
    D = A.sum(axis=1, keepdims=True) + 1e-6
    A = A / D  # row-normalize so aggregation averages over neighbors
    return torch.from_numpy(A)

Why use a graph at all? Because the event we care about is not just motion over time. It is structured motion of a human body. A graph prior gives the model a natural way to reason about how movement in one limb relates to movement in neighboring joints.
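
Concretely, the spatial step inside each block mixes neighbor features through the row-normalized adjacency before any pointwise convolution; a minimal sketch (the function name is illustrative, not the repository code):

```python
import torch

def spatial_graph_mix(x, A):
    """x: (N, C, T, V) joint features, A: (V, V) row-normalized adjacency.
    Each joint's features become the average of its neighbors' features."""
    # out[n, c, t, v] = sum_w x[n, c, t, w] * A[v, w]
    return torch.einsum('nctw,vw->nctv', x, A)
```

With self-loops and row normalization, this is exactly a neighborhood average, so movement in one limb directly informs the representation of adjacent joints.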

3. ST-GCN Backbone

The backbone is an efficient ST-GCN: a 1×1 convolutional stem, followed by four spatio-temporal blocks that mix joint features through the normalized body adjacency and apply depthwise temporal convolutions, with temporal striding in the middle blocks and a widening from 64 to 128 channels.

The backbone is intentionally not huge. For this challenge, reliability and clean inductive bias mattered more than stacking excessive depth.

class STGCN_MIL(nn.Module):
    def __init__(self, A, c_in=14, c_base=64, dropout=0.15,
                 mil_mode="attn", topk_frac=0.25):
        super().__init__()
        self.register_buffer("A_buf", A.float())

        self.stem = nn.Sequential(
            nn.Conv2d(c_in, c_base, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_base),
            nn.ReLU(inplace=True),
        )

        self.b1 = STGCNBlock(c_base, c_base, dropout=dropout, stride_t=1)
        self.b2 = STGCNBlock(c_base, c_base, dropout=dropout, stride_t=2)          # halve T
        self.b3 = STGCNBlock(c_base, c_base * 2, dropout=dropout, stride_t=2)      # halve T, widen channels
        self.b4 = STGCNBlock(c_base * 2, c_base * 2, dropout=dropout, stride_t=1)

Two choices were especially useful here:

  1. Root-relative input features gave the backbone cleaner motion patterns.
  2. Depthwise temporal convolutions kept the model efficient enough for repeated cross-validation and ensembling.
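
A depthwise temporal convolution learns one temporal filter per channel, which is what keeps these blocks cheap; a minimal sketch of such a layer (illustrative, not the exact STGCNBlock internals):

```python
import torch
import torch.nn as nn

class DepthwiseTemporalConv(nn.Module):
    """Temporal convolution over (N, C, T, V) with one filter per channel."""
    def __init__(self, channels, kernel_t=9, stride_t=1):
        super().__init__()
        pad = (kernel_t - 1) // 2
        self.conv = nn.Conv2d(
            channels, channels,
            kernel_size=(kernel_t, 1),  # convolve along the temporal axis only
            stride=(stride_t, 1),       # stride_t=2 halves the temporal length
            padding=(pad, 0),
            groups=channels,            # depthwise: no cross-channel mixing
            bias=False,
        )

    def forward(self, x):
        return self.conv(x)
```

With `groups=channels`, the parameter count scales with C·kernel_t instead of C²·kernel_t, which is what makes repeated cross-validation and ensembling affordable.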

4. Attention-based MIL and Ensembling

Not every frame in a 5-second segment is equally informative. Some frames are almost irrelevant; a few may carry the strongest evidence. That is why we used multiple instance learning on top of the temporal features.

Instead of forcing the model to treat all timesteps equally, the MIL head learns how to weight them. We experimented with different aggregation rules and settled on attention-based pooling in the final setup.

class MILHead(nn.Module):
    def __init__(self, d, mode="attn", topk_frac=0.25):
        super().__init__()
        assert mode in ["topk", "logsumexp", "attn"]
        self.mode = mode
        if mode == "attn":
            self.attn = nn.Sequential(
                nn.Linear(d, d),
                nn.Tanh(),
                nn.Linear(d, 1)
            )
        self.cls = nn.Linear(d, 1)

    def forward(self, h_t):
        # h_t: (N, T, d) per-frame features; attention pooling path
        # (the mode used in the final setup)
        frame_logits = self.cls(h_t).squeeze(-1)   # per-frame evidence scores
        w = self.attn(h_t).squeeze(-1)
        w = torch.softmax(w, dim=1)                # normalized weights over time
        seg_logit = (w * frame_logits).sum(dim=1)  # weighted segment-level logit
        return seg_logit, frame_logits

This is a good fit for seizure detection because the evidence within a clip is often sparse and unevenly distributed in time.
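
For completeness, the two alternative aggregation rules named in the head's assert can be sketched as follows (simplified standalone versions, assuming an (N, T) tensor of frame logits):

```python
import torch

def topk_pool(frame_logits, topk_frac=0.25):
    """Average the highest-scoring fraction of frame logits."""
    T = frame_logits.shape[1]
    k = max(1, int(round(T * topk_frac)))
    top, _ = frame_logits.topk(k, dim=1)
    return top.mean(dim=1)

def logsumexp_pool(frame_logits, r=1.0):
    """Smooth maximum over time; larger r approaches a hard max."""
    T = frame_logits.shape[1]
    return (torch.logsumexp(r * frame_logits, dim=1)
            - torch.log(torch.tensor(float(T)))) / r
```

Both are fixed pooling rules; the attention head instead learns where to look, which is why it won out on this data.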

We then trained five GroupKFold models, grouped by child_id, and averaged the fold probabilities at inference time. The final decision threshold was chosen from out-of-fold predictions rather than fixed at 0.5.
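
Choosing the operating point from out-of-fold predictions can be sketched like this (a simplified NumPy version; the F1 objective and threshold grid are illustrative assumptions, not necessarily the exact challenge metric):

```python
import numpy as np

def best_f1_threshold(oof_probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Scan a probability grid and keep the threshold with the best F1
    on out-of-fold predictions."""
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        pred = (oof_probs >= t).astype(int)
        tp = int(((pred == 1) & (labels == 1)).sum())
        fp = int(((pred == 1) & (labels == 0)).sum())
        fn = int(((pred == 0) & (labels == 1)).sum())
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Because the out-of-fold predictions come from held-out children, the chosen threshold reflects generalization rather than training-set calibration.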

Training Setup

The training recipe was straightforward but deliberate: five GroupKFold folds grouped by child_id, one model checkpoint per fold, probability averaging across folds at inference time, and a decision threshold chosen from out-of-fold predictions rather than fixed at 0.5.

The child-wise split matters a lot. If segments from the same child appear in both train and validation sets, performance can look much better than it really is. For a medical detection problem, that kind of leakage would make the evaluation much less trustworthy.

Why This Worked

Several parts of the solution helped, but a few mattered more than the rest:

  1. Representing motion explicitly. Seizure-related events are temporal. Velocity, acceleration, and speed cues gave the model a more direct signal than coordinates alone.
  2. Treating missing data carefully. Interpolation plus validity masks preserved information without pretending the pose extractor was perfect.
  3. Using a body graph prior. ST-GCN is naturally aligned with skeletal dynamics.
  4. Letting the model focus on informative moments. Attention-based MIL handled sparse temporal evidence better than uniform pooling.
  5. Preventing leakage. GroupKFold by child was essential for meaningful validation.

In other words, the solution worked less because it was large and more because it matched the structure of the problem.

Code and Reproducibility

Click here to access the code.

We are sharing the full modeling pipeline and code structure, including the pose preprocessing (interpolation, root-relative coordinates, mask-aware motion features), the graph construction, the ST-GCN + MIL model, and the child-wise GroupKFold training, thresholding, and ensembling code.

We are not sharing the challenge dataset, because redistribution is not allowed by the challenge policy and the data involves sensitive clinical material.

If you want to reproduce the approach on the official challenge setup, the main implementation stages are:

  1. Load the pose .npy files and parse child_id from the segment name.
  2. Convert each segment into a (C, V, T) feature tensor with the 14-channel preprocessing pipeline.
  3. Train ST-GCN + MIL with child-wise GroupKFold.
  4. Save each fold model and compute an out-of-fold threshold.
  5. Ensemble fold probabilities at inference time.

Closing Thoughts

This challenge was a good example of how much can be achieved with a model that respects the structure of the data. We did not need raw pixels, giant video transformers, or overly complicated training tricks. A careful skeletal representation, graph-based temporal modeling, and robust validation were enough to produce a strong result.

More broadly, this kind of work is exciting because it points toward practical, privacy-conscious, AI-assisted neurological screening tools. There is still a large gap between leaderboard performance and clinical deployment, but challenges like this are useful because they force us to think about robustness, generalization, and medically meaningful failure modes.

  1. Challenge information from the official challenge pages (Computational Neurology Video Challenge).