Ch 7: Preprocessing Steps Necessary and Useful for Advanced Data Analysis

This chapter covers the basics of preprocessing your data before full analysis.

7.1 What is Preprocessing?

Preprocessing refers to any transformation or reorganization that occurs between data collection and data analysis. This can include organizing data into epochs, removing bad data, or filtering out noise.

Be sure to record every step you take during this process, so that anyone can recreate your preprocessed data from the raw source data!

7.2 The Balance between Signal and Noise

While noise is sometimes obvious and easily removed (as with large noise spikes), signal and noise are often mixed together in EEG data, so one must compromise between maximizing signal retention and minimizing noise.

There is no one-size-fits-all process; sometimes one scientist's noise is another's data.

7.3 Creating Epochs

Epoching means cutting the continuous time series into segments surrounding experimental events. The event used to time-lock each trial's epoch may be obvious (such as stimulus onset), but sometimes it is harder to ascertain. Because time-locking can be redone during analysis, there is no worry about losing data later.
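
As a concrete illustration, here is a minimal epoching sketch using MNE-Python (the chapter itself is tool-agnostic, so the library choice, the "raw.fif" filename, and the event code 1 are all assumptions for illustration):

```python
import mne

# Load the continuous (raw) recording; "raw.fif" is a placeholder filename
raw = mne.io.read_raw_fif("raw.fif", preload=True)

# Find experiment events recorded on the stimulus channel
events = mne.find_events(raw)

# Cut the continuous data into epochs around each event.
# tmin/tmax are chosen with generous buffers on either side of t = 0.
epochs = mne.Epochs(raw, events, event_id={"stimulus": 1},
                    tmin=-1.0, tmax=2.0, baseline=None, preload=True)
```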

It is important to have sufficient time before and after the t = 0 event, and epoch lengths must be sufficient for the intended analysis:

  • ERPs can get by with about 1 second
  • Time-frequency analysis needs more time to avoid edge artifacts

Edge artifacts arise from the discontinuities at the epoch boundaries: there are no data beyond the edges (they default to 0), which produces high-amplitude broadband artifacts in the results. Edge artifacts tend to last 2-3 cycles, so include a buffer zone of at least 3 cycles of the lowest frequency of interest on either side.
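
The buffer requirement is simple arithmetic. A short sketch, assuming the lowest frequency of interest is 4 Hz (the value is illustrative):

```python
# Buffer needed against edge artifacts: at least 3 cycles of the
# lowest frequency of interest, on EACH side of the window of interest.
lowest_freq_hz = 4.0                # illustrative assumption (e.g., low theta)
buffer_s = 3 / lowest_freq_hz       # 3 cycles = 0.75 s per side at 4 Hz
print(f"Use at least {buffer_s:.2f} s of buffer on each side")
```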

Very long epochs may overlap in the raw data. While often not an issue, overlap can bias independent component analysis results, since some data points are sampled multiple times.

If epochs are too short, a last-resort mitigation is to reflect the data and append the mirrored copies to the beginning and end of the epoch, creating an epoch three times as long, as sketched below. The reflected segments are valid only as buffers, never for analysis.
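
A minimal sketch of this reflection trick in NumPy (the function name and shapes are illustrative, not from the source):

```python
import numpy as np

def reflect_epoch(epoch):
    """Mirror an epoch onto both ends, tripling its length.

    epoch: array of shape (n_channels, n_times). The mirrored segments
    serve only as edge-artifact buffers; analyze only the middle third.
    """
    mirrored = epoch[:, ::-1]  # time-reversed copy
    return np.concatenate([mirrored, epoch, mirrored], axis=1)

# Example: a 2-channel, 500-sample epoch becomes 1500 samples long
tripled = reflect_epoch(np.random.randn(2, 500))
assert tripled.shape == (2, 1500)
```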

7.4 Matching Trial Count across Conditions

Ideally, all experimental conditions have an equal number of trials. Phase-based analyses are particularly affected, because a small number of trials produces a positive bias. ERPs do not become biased, since they can take positive or negative values, but noise increases with a lower trial count.

Minor differences in trial count may not matter when trial counts are high, but a substantial difference across conditions is a problem.

There are a few ways to artificially match trial count:

  • Select the first X trials
    • This is bad, as it biases the data toward conditions earlier in the experiment
  • Select X trials at random (see the sketch after this list)
    • This reduces bias, but record which trials were chosen to ensure reproducibility
  • Select X trials based on some criterion
    • This can be used as a way to "reduce noise" but can obviously introduce bias into the data. Record which trials were chosen and discuss why in your results
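
A sketch of the random-selection option in NumPy (the helper name and array shapes are illustrative; the fixed seed is what makes the selection reproducible):

```python
import numpy as np

def match_trial_counts(cond_a, cond_b, seed=42):
    """Randomly subsample the larger condition to match the smaller one.

    cond_a, cond_b: arrays of shape (n_trials, n_channels, n_times).
    Returns the matched arrays plus the chosen indices; save the indices
    (and the seed) alongside the results for reproducibility.
    """
    rng = np.random.default_rng(seed)
    n = min(len(cond_a), len(cond_b))
    idx_a = np.sort(rng.choice(len(cond_a), size=n, replace=False))
    idx_b = np.sort(rng.choice(len(cond_b), size=n, replace=False))
    return cond_a[idx_a], cond_b[idx_b], idx_a, idx_b
```
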
7.5 Filtering

Notch filtering out the 50/60 Hz power-line frequency is good practice.

High-pass filtering should be done only on continuous raw data (not epoched data), since low frequencies may have cycles longer than the epoch itself.
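
Continuing the earlier MNE-Python sketch (again an assumption of tooling, with 60 Hz assumed as the local line frequency), both filters are applied to the continuous raw object before epoching:

```python
# Notch out power-line noise and its harmonics (60 Hz assumed here;
# use 50 Hz and its harmonics where applicable)
raw.notch_filter(freqs=[60, 120, 180])

# High-pass filter the CONTINUOUS data, not the epochs: slow drifts
# have cycles longer than any single epoch. The cutoff is illustrative.
raw.filter(l_freq=0.5, h_freq=None)
```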

7.6 Trial Rejection

It is important to remove trials with obvious artifacts.

Artifacted trials can be removed algorithmically, but algorithms commit both Type I and Type II errors (false rejections and false retentions). Visual inspection may be a better solution, but beware of the bias it can introduce.
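
For flavor, here is what a simple algorithmic criterion might look like: a fixed amplitude threshold (the 100 µV value and the helper name are illustrative, and this is exactly the kind of rule that commits both error types):

```python
import numpy as np

def trials_to_keep(epochs_uv, abs_max_uv=100.0):
    """Flag trials whose peak absolute amplitude exceeds a threshold.

    epochs_uv: array of shape (n_trials, n_channels, n_times), in microvolts.
    Returns a boolean mask of trials to KEEP. A fixed threshold will make
    both false rejections and false retentions.
    """
    peak = np.abs(epochs_uv).max(axis=(1, 2))
    return peak < abs_max_uv
```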

The author personally suggests using visual inspection to remove bad trials, but acknowledges this could be contested.

7.7 Spatial Filtering

Spatial filters allow localization of a result (where does an activity peak occur?).

They can also be used to filter out low-spatial-frequency features, i.e., to separate simultaneous processing activity from two regions of the brain.
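
One classic example of such a filter is a Hjorth-style local Laplacian (each electrode minus the average of its nearest neighbors), which attenuates activity that is broadly shared across the scalp. This sketch is a generic illustration, not the chapter's specific method; the neighbor map depends on your montage:

```python
import numpy as np

def hjorth_laplacian(data, neighbors):
    """Local Laplacian: each electrode minus the mean of its neighbors.

    data: array of shape (n_channels, n_times).
    neighbors: dict mapping a channel index to a list of its
    nearest-neighbor indices (montage-specific).
    Acts as a spatial high-pass filter, suppressing low-spatial-frequency
    (widely shared) activity.
    """
    out = data.copy()
    for ch, nbrs in neighbors.items():
        out[ch] = data[ch] - data[nbrs].mean(axis=0)
    return out
```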

7.8 Referencing

Referencing is relevant only for EEG; MEG intrinsically does not need a reference. Also, some analyses, such as the surface Laplacian, are independent of the reference.

Placing the reference electrode on the earlobe or mastoid is common, but these sites can be affected by facial movements. In general, keep the reference electrode away from the area whose activity you intend to measure. Be sure to label your reference electrode.

Averaging all electrodes together into an average reference can work well with a large number of electrodes (over 100).
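
The average reference itself is a one-liner; a sketch in NumPy (shapes are illustrative):

```python
import numpy as np

def average_reference(data):
    """Re-reference by subtracting the instantaneous mean across electrodes.

    data: array of shape (n_channels, n_times). Works best with dense
    coverage (100+ electrodes), where the scalp-wide mean is a reasonable
    approximation of a neutral reference.
    """
    return data - data.mean(axis=0, keepdims=True)
```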

7.9 Interpolating Bad Electrodes

Interpolation uses known data from surrounding electrodes to estimate data at missing or noisy electrodes. It becomes more accurate with more electrodes, but it ultimately adds no data, and it may be better to reject trials with a bad electrode instead. Treat interpolation as a last resort: it is better to monitor the experiment, catch the issue as it appears, and remove a handful of trials than to interpolate an electrode for the whole experiment.

Because interpolated electrodes add no data, interpolation reduces the rank of the data matrix, which can cause issues for analyses that use the matrix inverse (a pseudo-inverse may be needed as a workaround, as illustrated below).
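
A quick demonstration of the rank problem with synthetic data (the channel counts are illustrative):

```python
import numpy as np

# Interpolating a channel makes it an exact linear combination of other
# channels, so the channel covariance matrix becomes rank-deficient.
n_channels, n_times = 64, 10_000
data = np.random.randn(n_channels, n_times)
data[10] = data[[9, 11]].mean(axis=0)  # channel 10 "interpolated" from neighbors

cov = np.cov(data)
print(np.linalg.matrix_rank(cov))      # should print 63, not 64

# Inverting this covariance directly is unstable/singular;
# the Moore-Penrose pseudo-inverse is the standard workaround.
cov_pinv = np.linalg.pinv(cov)
```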

Another option, where possible, is to simply remove the electrode from the set (setting its values to null) instead of interpolating. This may also cause issues for some analyses, since different trials may then have different numbers of electrodes.

It is also worth investigating whether the electrode data truly are unsalvageable, or whether the noise can be filtered out.

7.10 Start with Clean Data

Garbage in -> Garbage out.

While preprocessing helps make good data even better, it cannot rescue bad data!