The purpose of this article is to investigate how the EEGNet convolutional neural network works. The corresponding code is walked through in an interactive, step-by-step fashion. This implementation was created to be educational and fun.
Click on the nodes to activate the network
EEGNet, created by Lawhern et al. in 2018, is a compact CNN architecture designed to accurately classify EEG signals across different BCI paradigms while remaining computationally efficient and easy to implement [1].
Since publication, EEGNet has been reimplemented in various machine learning frameworks, such as TensorFlow and PyTorch, becoming a standard baseline for BCI signal classification and feature extraction due to its robustness and small parameter count.
The goal of this project is to build a foundational understanding of how EEGNet works from scratch, using only NumPy.
Specifically, this investigation will observe how EEGNet can be used to detect the P300 Event-Related Potential (ERP).
The P300 is an Event-Related Potential (ERP): a measurable brain voltage change that occurs roughly 300 milliseconds after a person perceives an infrequent or meaningful stimulus. It is most commonly elicited when a subject is presented with a series of repetitive, frequent stimuli (e.g., a low-pitched tone or a white image) and is then tasked with detecting or responding to a rare, deviant stimulus (e.g., a high-pitched tone or a change in image color). It appears as a positive deflection in EEG recordings, most prominently at the parietal midline electrode Pz and the occipital electrode Oz. Click the electrodes below to see how the signal differs between P300 sites and non-P300 sites.
Rather than using real recordings (which require ethics approval and specialised hardware), this implementation generates synthetic EEG trials, which are far easier to produce than real EEG data while preserving the structure we care about. Each trial is an 8-channel × 100-sample matrix of Gaussian background noise. For "Hit" trials, a scaled copy of the P300 template is embedded in the Pz and Oz channels at time indices 40-44 (out of 100 samples), simulating the positive deflection that occurs ~300 ms after stimulus onset.
Training batches are balanced: half Hit, half No-Hit, then shuffled. This ensures the model cannot achieve high accuracy by simply predicting the majority class.
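One detail the snippets below rely on but never show is the P300_TEMPLATE array. Any short positive bump works; the definition here is an assumed stand-in (five samples, matching the 40-44 embed window), not the article's exact values:

```python
import numpy as np

# Assumed stand-in for the article's undefined P300_TEMPLATE:
# a 5-sample positive bump peaking in the middle, so that embedding it at
# indices 40-44 produces a smooth deflection centred near sample 42.
P300_TEMPLATE = np.array([0.3, 0.7, 1.0, 0.7, 0.3])
```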
import numpy as np

def generate_eeg_trial(has_p300=True, n_channels=8, n_time=100, noise_std=0.2):
    """Generate a single EEG trial with or without a P300 signal."""
    data = np.random.normal(0, noise_std, (n_channels, n_time))
    if has_p300:
        p300_onset = 40
        p300_length = len(P300_TEMPLATE)
        p300_amplitude = 1.5
        # Embed P300 in Pz (index 4) - primary P300 location
        data[4, p300_onset:p300_onset + p300_length] += P300_TEMPLATE * p300_amplitude
        # Embed P300 in Oz (index 7) - secondary P300 location (slightly weaker)
        data[7, p300_onset:p300_onset + p300_length] += P300_TEMPLATE * p300_amplitude * 0.8
    return data
def generate_batch(batch_size=32, n_channels=8, n_time=100):
    """Generate a balanced batch of EEG trials for training."""
    X = np.zeros((batch_size, n_channels, n_time))
    y = np.zeros(batch_size, dtype=int)
    # First half: Hit trials (P300 present)
    for i in range(batch_size // 2):
        X[i] = generate_eeg_trial(has_p300=True, n_channels=n_channels, n_time=n_time)
        y[i] = 1
    # Second half: No-Hit trials (noise only)
    for i in range(batch_size // 2, batch_size):
        X[i] = generate_eeg_trial(has_p300=False, n_channels=n_channels, n_time=n_time)
        y[i] = 0
    # Shuffle the batch
    perm = np.random.permutation(batch_size)
    return X[perm], y[perm]

Before feeding a trial to the network, we apply global z-score normalization: subtract the mean and divide by the standard deviation. This gives every trial zero mean and unit variance, which prevents large-amplitude noise from dominating the gradient updates. Because all input features are scaled equally, sudden high-voltage artifacts (e.g., eye blinks or muscle twitches) cannot disproportionately skew the network's weights in one direction. A small ε is added to the denominator for numerical stability.
def z_score_normalize(x, epsilon=1e-6):
    """
    Apply global z-score normalization.
    This normalizes the entire input to have zero mean and unit variance,
    preserving the relative shape of patterns across channels.
    """
    mean = x.mean()
    std = x.std()
    x_norm = (x - mean) / (std + epsilon)
    return x_norm, mean, std

EEGNet [1][4] processes data through a compact pipeline: a temporal convolution extracts time-domain features, a spatial convolution learns which electrodes matter for P300 detection, an ELU activation introduces non-linearity, average pooling compresses the temporal dimension, and a fully connected layer produces the final classification [5]. Hover over each block to see its weight shape and output dimensions; click to jump to that section.
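Before diving into each layer, it helps to trace the tensor shapes with the default hyperparameters (8 channels, 100 samples, kernel size 5, pool size 4). The arithmetic below mirrors the dimension bookkeeping done in __init__:

```python
n_channels, n_time = 8, 100
kernel_size, pool_size = 5, 4
n_classes = 2

temporal_out_len = n_time - kernel_size + 1   # "valid" 1-D convolution: 96
pooled_len = temporal_out_len // pool_size    # average pooling: 24

print((n_channels, n_time))            # input:           (8, 100)
print((n_channels, temporal_out_len))  # temporal conv:   (8, 96)
print((temporal_out_len,))             # spatial conv:    (96,)
print((pooled_len,))                   # average pooling: (24,)
print((n_classes,))                    # logits:          (2,)
```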
class EEGNet:
    """EEGNet-inspired architecture for P300 detection."""
    def __init__(self, n_channels=8, n_time=100, kernel_size=5, pool_size=4):
        self.n_channels = n_channels
        self.n_time = n_time
        self.kernel_size = kernel_size
        self.pool_size = pool_size
        # Compute dimensions through the network
        self.temporal_out_len = n_time - kernel_size + 1  # After temporal conv
        self.pooled_len = self.temporal_out_len // pool_size  # After pooling
        # LAYER 1: Temporal Convolution
        # A single kernel shared across all channels to detect time patterns
        self.temp_kernel = np.random.randn(kernel_size) * 0.1
        # LAYER 2: Spatial Convolution (Depthwise)
        # Learns which electrode locations are informative
        self.spat_weights = np.random.randn(n_channels) * np.sqrt(2.0 / n_channels)
        # LAYER 3: Fully Connected Classification
        # Maps pooled features to 2 classes (No-Hit, Hit)
        self.fc_weights = np.random.randn(2, self.pooled_len) * np.sqrt(2.0 / self.pooled_len)
        self.fc_bias = np.zeros(2)

The first layer slides a small kernel (size 5) across time for every channel independently. At each position, the kernel computes a dot product with the 5-sample window of z-score-normalized EEG values beneath it. This is the classical 1-D convolution: the same set of weights is shared across all channels, so the kernel learns a single temporal pattern, ideally the P300 shape. Note: because the kernel has size 5, the output has 100 - 5 + 1 = 96 time points.
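The sliding dot product described above is what NumPy calls cross-correlation, so the output length can be confirmed with np.correlate (the data here is made up, just to check the 96-point output):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)    # one EEG channel
kernel = rng.normal(size=5)

# The forward-pass loop computes a sliding dot product (cross-correlation),
# which NumPy exposes as np.correlate with mode='valid'. Note that
# np.convolve would additionally flip the kernel, which the loop does not.
out = np.correlate(x, kernel, mode='valid')
print(out.shape)  # (96,) -> 100 - 5 + 1 output positions
```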
Press Play to watch the kernel slide across the Pz channel. Notice how the output peaks when the kernel aligns with the P300 peak around sample 40.
# Inside EEGNet.forward():
# Store input for backward pass
self.input = x

# STAGE 1: Temporal Convolution
# Slides the kernel across time for each channel independently
# Output shape: (n_channels, temporal_out_len)
self.temp_out = np.zeros((self.n_channels, self.temporal_out_len))
for c in range(self.n_channels):
    for t in range(self.temporal_out_len):
        # Extract window of input at time t
        window = x[c, t:t + self.kernel_size]
        # Dot product with kernel (no bias)
        self.temp_out[c, t] = np.sum(window * self.temp_kernel)

After the temporal convolution produces 8 feature maps (one per channel), the spatial convolution collapses them into a single time series using a learned weighted sum. Each electrode gets its own weight: the network must learn that Pz and Oz carry the P300 signal, while frontal and central electrodes contribute mostly noise.
Drag the sliders to see how different spatial weight configurations change the combined output. Click "Learned weights" to see the trained model's solution.
# Inside EEGNet.forward() (continued):
# STAGE 2: Spatial Convolution
# Weighted sum across channels at each time point
# Output shape: (temporal_out_len,)
self.spat_out = np.zeros(self.temporal_out_len)
for t in range(self.temporal_out_len):
    # Weighted combination of all channels at time t
    self.spat_out[t] = np.sum(self.temp_out[:, t] * self.spat_weights)

The ELU (Exponential Linear Unit) activation function [6] introduces non-linearity. The previous steps (temporal and spatial convolutions) are purely linear operations (multiplying by weights and adding). If you stacked them directly into a Fully Connected (FC) layer without an activation function in between, the whole network would collapse into a single linear map and could not learn complex patterns in the data. Note: while ELU is the standard activation function for EEGNet, other activation functions may work as well, such as ReLU, Leaky ReLU, and PReLU [7].
Here α is a hyperparameter that controls how far negative inputs saturate. Positive values pass through unchanged, while negative values are smoothly compressed toward -α. Look at the graphical representation of the ELU function below to understand how ELU changes the shape of the signal.
Average pooling then divides the time axis into non-overlapping windows of size 4, replacing each window with its mean. This reduces 96 time points to 96 / 4 = 24 time points, which captures the essential trend while discarding noise. This is a common technique in time-series data analysis to reduce dimensionality and improve computational efficiency.
def elu(self, x, alpha=1.0):
    """
    Exponential Linear Unit activation.
    ELU(x) = x if x > 0, else alpha * (exp(x) - 1)
    """
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_derivative(self, x, alpha=1.0):
    """
    Derivative of ELU for backpropagation.
    d/dx ELU(x) = 1 if x > 0, else alpha * exp(x)
    """
    return np.where(x > 0, 1.0, alpha * np.exp(x))

# Inside EEGNet.forward() (continued):
# STAGE 3: ELU Activation
self.elu_out = self.elu(self.spat_out)

# STAGE 4: Average Pooling
# Reduces temporal dimension by averaging over windows
# Output shape: (pooled_len,)
self.pool_out = np.zeros(self.pooled_len)
for i in range(self.pooled_len):
    start = i * self.pool_size
    end = start + self.pool_size
    self.pool_out[i] = np.mean(self.elu_out[start:end])

Groups of 4 values are averaged, reducing 96 samples to 24.
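As an aside, the pooling loop can be collapsed into a single vectorized expression. The sketch below (using random stand-in data) checks the equivalence; since 96 divides evenly by 4, no remainder handling is needed:

```python
import numpy as np

elu_out = np.random.randn(96)
pool_size = 4

# Loop version, as in the forward pass above
pool_loop = np.array([elu_out[i * pool_size:(i + 1) * pool_size].mean()
                      for i in range(96 // pool_size)])

# Vectorized equivalent: reshape into (24, 4) windows and average each row
pool_vec = elu_out.reshape(-1, pool_size).mean(axis=1)

print(np.allclose(pool_loop, pool_vec))  # True
```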
The pooled vector has 24 values. The Fully Connected (FC) layer multiplies it by a weight matrix W of shape (2, 24) (two rows, one per class) and adds a bias. This produces two raw scores called logits. The softmax function (a generalization of the sigmoid function [3]) then converts these logits into probabilities that sum to 1.
Where does the weight matrix come from? It is randomly initialized at the start and then learned during training via backpropagation, exactly like the temporal kernel and spatial weights. Toggle between Hit and No-Hit below to watch the full pipeline animate step by step.
Cross-entropy loss measures how wrong the prediction is: -log(p_correct), the negative log of the probability assigned to the correct class. When the model is confident and correct, the loss is near zero; when confident and wrong, it spikes toward infinity.
def softmax(self, x):
    """
    Softmax function for classification output.
    Converts logits to probabilities that sum to 1.
    """
    # Subtract max for numerical stability (prevents overflow)
    exp_x = np.exp(x - np.max(x))
    return exp_x / np.sum(exp_x)

def cross_entropy_loss(probs, target_label):
    """
    Compute cross-entropy loss for a single sample.
    Loss = -log(probability of correct class)
    """
    # Clip to avoid log(0)
    prob = np.clip(probs[target_label], 1e-15, 1.0 - 1e-15)
    return -np.log(prob)

# Inside EEGNet.forward() (continued):
# STAGE 5: Fully Connected + Softmax
self.logits = np.dot(self.fc_weights, self.pool_out) + self.fc_bias
self.probs = self.softmax(self.logits)
return self.probs

The output of average pooling: 24 numbers summarizing the entire trial.
Each row is dot-multiplied with the pooled vector. These weights are randomly initialized and learned during training.
We just traversed the entire neural network forwards! The model has gone from taking in raw brainwave data to making a prediction on whether or not it contains the P300 spike.
But what if the prediction was wrong? How can the model be refined? This is where Backpropagation comes in.
Think of it this way: the model just made a guess. Maybe it was right, maybe it was completely wrong. Either way, the loss function gave us a single number that says how inaccurate that guess was. The question now is: Which weights in the network were most responsible for the inaccuracy of the prediction, and how should they change to do better next time?
That is exactly what backpropagation [2] answers. It walks backwards through the network, starting from the loss and working all the way back to the very first temporal kernel, and at every single layer it asks: "if I nudged this weight up by a tiny amount, how much would the loss change?" That ratio (change in loss / change in weight) is called the gradient, and it tells us both the direction and the magnitude of the adjustment we need to make.
The math behind this is the chain rule from calculus. Since the network is just a chain of operations (convolve → combine channels → ELU → pool → multiply by weights → softmax), the gradient at any layer is the product of all the local derivatives from the loss back to that layer.
Once all the gradients are computed, we update every weight by subtracting the gradient scaled by a small learning rate. That is one complete learning step: forward pass → compute loss → backpropagate gradients → update weights. Repeat this over and over, and the model steadily improves.
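This chain-rule bookkeeping can be sanity-checked numerically. The sketch below (a toy example, not part of the article's model) builds a miniature version of the pipeline, a weighted sum followed by ELU and a squared loss, and compares the analytic chain-rule gradient against central finite differences:

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

# Miniature pipeline: weighted sum -> ELU -> squared loss
x = np.array([0.5, -1.0, 2.0])   # a tiny "input"
w = np.array([0.1, 0.2, -0.3])   # a tiny set of "weights"

def loss(w):
    return float(elu(np.dot(w, x)) ** 2)

# Analytic gradient via the chain rule: dL/dw = 2*ELU(y) * ELU'(y) * x
y = np.dot(w, x)
analytic = 2 * elu(y) * elu_grad(y) * x

# Numerical gradient via central finite differences
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(len(w)):
    wp, wm = w.copy(), w.copy()
    wp[i] += eps
    wm[i] -= eps
    numeric[i] = (loss(wp) - loss(wm)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```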
Step through each layer below to see exactly how the gradient flows from right to left: from the loss, all the way back to the temporal kernel.
def backward(self, target_label):
    """Backward pass: compute gradients for all parameters."""
    # GRADIENT OF CROSS-ENTROPY LOSS w.r.t. LOGITS
    # For softmax + cross-entropy: dL/dz = probs - one_hot(target)
    d_logits = self.probs.copy()
    d_logits[target_label] -= 1.0

    # GRADIENT OF FC LAYER
    d_fc_weights = np.outer(d_logits, self.pool_out)
    d_fc_bias = d_logits.copy()
    d_pool_out = np.dot(self.fc_weights.T, d_logits)

    # GRADIENT THROUGH AVERAGE POOLING
    d_elu_out = np.zeros(self.temporal_out_len)
    for i in range(self.pooled_len):
        start = i * self.pool_size
        end = start + self.pool_size
        d_elu_out[start:end] = d_pool_out[i] / self.pool_size

    # GRADIENT THROUGH ELU ACTIVATION
    d_spat_out = d_elu_out * self.elu_derivative(self.spat_out)

    # GRADIENT OF SPATIAL CONVOLUTION
    d_spat_weights = np.zeros(self.n_channels)
    d_temp_out = np.zeros((self.n_channels, self.temporal_out_len))
    for c in range(self.n_channels):
        for t in range(self.temporal_out_len):
            d_spat_weights[c] += d_spat_out[t] * self.temp_out[c, t]
            d_temp_out[c, t] = d_spat_out[t] * self.spat_weights[c]

    # GRADIENT OF TEMPORAL CONVOLUTION KERNEL (CRITICAL)
    # dL/d_kernel[k] = sum_{c,t} dL/d_temp_out[c,t] * input[c, t+k]
    d_temp_kernel = np.zeros(self.kernel_size)
    for k in range(self.kernel_size):
        for c in range(self.n_channels):
            for t in range(self.temporal_out_len):
                d_temp_kernel[k] += d_temp_out[c, t] * self.input[c, t + k]

    return {
        'temp_kernel': d_temp_kernel,
        'spat_weights': d_spat_weights,
        'fc_weights': d_fc_weights,
        'fc_bias': d_fc_bias
    }

def update_weights(self, gradients, learning_rate):
    """Update all weights using computed gradients."""
    self.temp_kernel -= learning_rate * gradients['temp_kernel']
    self.spat_weights -= learning_rate * gradients['spat_weights']
    self.fc_weights -= learning_rate * gradients['fc_weights']
    self.fc_bias -= learning_rate * gradients['fc_bias']

dL/dz = softmax(z) - one_hot(y)
The starting point. We subtract the true label (as a one-hot vector: [0, 1] for Hit, [1, 0] for No-Hit, matching class indices 1 and 0) from the softmax probabilities. If the model said "80% Hit" and the answer was indeed Hit, the gradient for the Hit logit is just 0.8 - 1 = -0.2. That negative sign tells the network: push the Hit score higher next time.
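This probs-minus-one-hot shortcut can be verified against a finite-difference gradient of the loss -log(softmax(z)[target]). The snippet below is an illustrative check, not part of the article's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.2, -0.4])  # logits for (No-Hit, Hit)
target = 1                 # true class: Hit

# Shortcut used in backward(): probs - one_hot(target)
probs = softmax(z)
shortcut = probs.copy()
shortcut[target] -= 1.0

# Numerical gradient of L(z) = -log(softmax(z)[target])
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    numeric[i] = (-np.log(softmax(zp)[target])
                  + np.log(softmax(zm)[target])) / (2 * eps)

print(np.allclose(shortcut, numeric, atol=1e-6))  # True
```

Note that the two gradient components always sum to zero, since increasing one logit and decreasing the other by the same amount leaves the softmax unchanged.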
The training loop [2] generates a fresh batch each epoch, runs forward and backward passes for every sample, averages the gradients, and updates weights. A cosine-similarity regularization term gently steers the temporal kernel toward the P300 template shape.
Adjust the hyperparameters and press Train to run the full EEGNet training in your browser. Watch the loss decrease and the accuracy climb. The temporal kernel should converge to the P300 bump shape. The spatial weights should peak at Pz and Oz.
def cosine_similarity(a, b):
    """
    Compute cosine similarity between two vectors.
    A value of 1.0 means perfect alignment, -1.0 means opposite.
    """
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a < 1e-8 or norm_b < 1e-8:
        return 0.0
    return np.dot(a, b) / (norm_a * norm_b)

def train_eegnet(epochs=2000, batch_size=32, learning_rate=0.005):
    """Train the EEGNet model on synthetic P300 data."""
    model = EEGNet(n_channels=8, n_time=100, kernel_size=5, pool_size=4)
    initial_kernel = model.temp_kernel.copy()
    history = {'loss': [], 'accuracy': [], 'kernel_similarity': []}
    for epoch in range(epochs):
        X_batch, y_batch = generate_batch(batch_size=batch_size)
        batch_gradients = {
            'temp_kernel': np.zeros_like(model.temp_kernel),
            'spat_weights': np.zeros_like(model.spat_weights),
            'fc_weights': np.zeros_like(model.fc_weights),
            'fc_bias': np.zeros_like(model.fc_bias)
        }
        batch_loss = 0.0
        batch_correct = 0
        for i in range(batch_size):
            x = X_batch[i]
            y = y_batch[i]
            x_norm, _, _ = z_score_normalize(x)
            probs = model.forward(x_norm)
            loss = cross_entropy_loss(probs, y)
            batch_loss += loss
            prediction = np.argmax(probs)
            if prediction == y:
                batch_correct += 1
            gradients = model.backward(y)
            # Regularization: encourage kernel to match P300 template
            kernel_norm = model.temp_kernel / (np.linalg.norm(model.temp_kernel) + 1e-8)
            template_norm = P300_TEMPLATE / (np.linalg.norm(P300_TEMPLATE) + 1e-8)
            reg_lambda = 0.1
            gradients['temp_kernel'] += reg_lambda * (kernel_norm - template_norm)
            for key in batch_gradients:
                batch_gradients[key] += gradients[key]
        # Average gradients and update weights
        for key in batch_gradients:
            batch_gradients[key] /= batch_size
        model.update_weights(batch_gradients, learning_rate)
        avg_loss = batch_loss / batch_size
        accuracy = batch_correct / batch_size
        kernel_sim = cosine_similarity(model.temp_kernel, P300_TEMPLATE)
        history['loss'].append(avg_loss)
        history['accuracy'].append(accuracy)
        history['kernel_similarity'].append(kernel_sim)
    return model, history, initial_kernel

Congratulations, you have finished exploring the entire EEGNet architecture! That was a great journey, wasn't it?