Epistemic and aleatoric uncertainty in machine learning

The error in a machine learning prediction combines epistemic and aleatoric uncertainty. Understanding the two is essential for improving a model and for explaining its performance. For example, collecting more data can reduce epistemic uncertainty but not aleatoric uncertainty, so it tackles only one source of the error.

Assume we would like to discover the relationship between one input and one output variable. We collected the data (Fig. 1 - blue dots) and fit a line (Fig. 1 - orange line). In practice, we almost never know the true relationship, but in this case, it is known since the data has been generated synthetically (Fig. 1 - blue line).

Figure 1 makes it easy to show what epistemic and aleatoric uncertainty are. Epistemic uncertainty is the difference between the true model (blue line, what we aim for) and our model (orange line, our fit). Even if we recover the true model exactly, the observations will still scatter around it; this remaining uncertainty is the aleatoric uncertainty.

The difference between a model prediction (orange line) and an observation (blue dot) is the error (red line), which is the sum of the epistemic and aleatoric components.

Figure 1: Linear fit (orange line) to a dataset (blue dots) sampled from a data generator (blue line). Error in one prediction (red line) and its components are shown as well.
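
The decomposition in Figure 1 can be reproduced with a few lines of NumPy. This is only a minimal sketch: the slope, intercept, noise level, and sample size below are assumed values chosen for illustration, not the ones used to draw the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generator (the "blue line") plus Gaussian noise.
# These constants are illustrative assumptions, not taken from the post.
TRUE_SLOPE, TRUE_INTERCEPT, NOISE_STD = 2.0, 1.0, 0.5

x = rng.uniform(0.0, 10.0, size=20)
y_true = TRUE_SLOPE * x + TRUE_INTERCEPT                     # blue line
y_obs = y_true + rng.normal(0.0, NOISE_STD, size=x.shape)    # blue dots

# Least-squares linear fit (the "orange line").
slope, intercept = np.polyfit(x, y_obs, deg=1)
y_fit = slope * x + intercept

# Per-point decomposition of the error (the "red line").
epistemic = y_true - y_fit   # gap between the true model and our model
aleatoric = y_obs - y_true   # noise that remains even with the true model
error = y_obs - y_fit        # what we actually observe as the residual

print(f"fitted slope={slope:.3f}, intercept={intercept:.3f}")
print("first point:", epistemic[0], "+", aleatoric[0], "=", error[0])
```

With real data only `error` is observable, because `y_true` is unknown; the split is possible here only because the data is synthetic.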

Now that we know what these components are, we can look at how to reduce them. Getting more data, tuning hyper-parameters, and selecting a better model all help with epistemic uncertainty. Figure 2 shows how getting more data decreases the epistemic uncertainty but not the aleatoric uncertainty.



Figure 2: Effect of getting more data on the sources of uncertainty.
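
The effect in Figure 2 can be checked numerically. The sketch below assumes the same kind of synthetic setup (a linear generator with Gaussian noise; all constants and the helper name `uncertainty_components` are hypothetical) and averages over repeated trials to smooth out sampling luck.

```python
import numpy as np

def uncertainty_components(n_samples, noise_std=0.5, n_trials=200, seed=0):
    """Return (mean epistemic MSE, mean aleatoric MSE) for a linear fit.

    The generator y = 2x + 1 and the noise level are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 10.0, 50)      # where the two lines are compared
    true_on_grid = 2.0 * grid + 1.0
    epistemic, aleatoric = [], []
    for _ in range(n_trials):
        x = rng.uniform(0.0, 10.0, size=n_samples)
        y_true = 2.0 * x + 1.0
        y_obs = y_true + rng.normal(0.0, noise_std, size=n_samples)
        slope, intercept = np.polyfit(x, y_obs, deg=1)
        fit_on_grid = slope * grid + intercept
        # Squared gap between our model and the true model.
        epistemic.append(np.mean((fit_on_grid - true_on_grid) ** 2))
        # Squared scatter of the observations around the true model.
        aleatoric.append(np.mean((y_obs - y_true) ** 2))
    return float(np.mean(epistemic)), float(np.mean(aleatoric))

results = {n: uncertainty_components(n) for n in (10, 100, 1000)}
for n, (epi, ale) in results.items():
    print(f"n={n:5d}  epistemic MSE={epi:.5f}  aleatoric MSE={ale:.5f}")
```

In this setup the epistemic term shrinks as the sample size grows, while the aleatoric term stays near the noise variance (0.25 for the assumed noise level) regardless of how much data is collected.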

Tackling aleatoric uncertainty is more about designing the dataset. Adding more or better features and using better data collection methods help with aleatoric uncertainty. Figure 3 shows what happens to aleatoric uncertainty if we use a higher-resolution data collection method.



Figure 3: Effect of improving data collection on the sources of uncertainty.
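
One way to mimic Figure 3 is to treat a better data collection method as lower measurement noise. That mapping is an assumption made for this sketch, as are the generator, the noise levels, and the helper name `error_floor`.

```python
import numpy as np

def error_floor(noise_std, n_samples=5000, seed=0):
    """Fit a line to plenty of data and return (epistemic MSE, total MSE).

    Lower noise_std stands in for a higher-resolution measurement process;
    the generator y = 2x + 1 is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n_samples)
    y_true = 2.0 * x + 1.0
    y_obs = y_true + rng.normal(0.0, noise_std, size=n_samples)
    slope, intercept = np.polyfit(x, y_obs, deg=1)
    y_fit = slope * x + intercept
    epistemic_mse = float(np.mean((y_fit - y_true) ** 2))
    total_mse = float(np.mean((y_obs - y_fit) ** 2))
    return epistemic_mse, total_mse

floors = {std: error_floor(std) for std in (1.0, 0.5, 0.1)}
for std, (epi, total) in floors.items():
    print(f"noise std={std:.1f}  epistemic MSE={epi:.6f}  total MSE={total:.4f}")
```

With this much data the epistemic part is already small, so the total error sits close to the noise variance. Only reducing the noise itself lowers that floor.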

As mentioned previously, in practice we don't know the blue line, so what good is knowing the sources of uncertainty?

Imagine doing image classification using images collected with a high-resolution camera. The project leader tells you to switch to a low-resolution camera to decrease the cost of the hardware and collect more data than before to improve the accuracy (by decreasing epistemic uncertainty). Figure 4 shows that even though this will increase the current model accuracy, it might decrease the upper limit you can eventually reach (due to increased aleatoric uncertainty).

Figure 4: Possible effects of design choices on accuracy.

Epistemic and aleatoric uncertainties constitute the difference between our current model performance and perfect model performance. Understanding what action decreases which uncertainty is a big step towards building better models.
