Inspecting Model Performance
Understanding your model's performance and how to analyze it through the Model Metrics View

Introduction to Model Metrics View

The Model Metrics View, accessible by clicking the bar chart icon in the left-hand navigation bar, is the primary way to interact with your models' aggregate performance in Aquarium.
This page will cover how to use the different features within Aquarium so your team can analyze your model's performance and more efficiently find insights in the underlying datasets.
To get started, select a dataset and at least one inference set in the project path underneath the top navigation bar.
Example of what your view will look like with a dataset and an inference selected

The Model Metrics View is split into two tabs, Scenarios and Metrics, which we elaborate on in the sections below.


Using Scenarios requires setting up model performance segments within your dataset. Learn more about organizing your data with Segments.
The Scenarios tab provides a summary view of your model's performance against pre-defined subsets of your dataset.
  • Scenarios allow you to define a target threshold for your models' performance against a known set of frames, and then evaluate all inference sets against those thresholds. This may be as simple as reaching a target F1 score on the test set, or as complex as multi-metric pass/fail regression tests against domain-specific problems.
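As a refresher, the F1 score mentioned above is the harmonic mean of precision and recall. A minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with high precision but modest recall is penalized toward the lower value:
print(round(f1_score(0.9, 0.6), 3))  # 0.72
```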


The Metrics tab provides a high-level overview of the overall dataset or an individual segment, with additional drill-through capability beyond the Scenarios view.

Configuring Thresholds for Model Performance Segments

When you create a Model Performance type segment, you can set thresholds for both precision and recall to easily evaluate the performance of an inference set. These thresholds are especially useful with Regression Test type segments, because your team can query the API and retrieve a pass/fail result.
Example Payload From Querying the Results of a Regression Test
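To illustrate, here is a sketch of how a team might consume such a result. The payload shape and field names below are hypothetical; consult Aquarium's API documentation for the actual schema.

```python
# Hypothetical payload shape returned when querying a regression test result.
payload = {
    "segment_name": "night_driving_regression",
    "inference_set": "model_v2",
    "precision": {"value": 0.91, "threshold": 0.85, "pass": True},
    "recall": {"value": 0.78, "threshold": 0.80, "pass": False},
}

def regression_test_passed(result: dict) -> bool:
    """A regression test passes only if every thresholded metric passes."""
    return all(
        metric["pass"]
        for metric in result.values()
        if isinstance(metric, dict) and "pass" in metric
    )

print(regression_test_passed(payload))  # False (recall is below its threshold)
```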

Setting a Threshold for a Model Performance Segment

There are two ways to navigate to the page where you can set a model performance segment's thresholds.
  1. From a segment overview page, click on the Metrics tab.
Metrics tab on Segment Overview page
  2. From the Model Metrics View, click the fly-out button in one of the Scenario cards.
Button to take you from the Model Metrics Scenarios view to the Specific Metrics page to set thresholds
On the metrics page, for each metric, you'll see a dot plotted representing each inference set related to the base labeled dataset.
Arrow highlighting each inference set down below for each metric
Once you have navigated to the Metrics page, to set a threshold:
  1. Click the gear button.
  2. Enter a number or use the arrows to set a value from 0.0 to 1.0 for the desired threshold (you must type the leading 0 first, e.g. 0.8).
  3. Click anywhere on screen outside the input box for the value to take effect.
How to set a threshold
Once you set a threshold, you'll notice the values turn green or red depending on whether your inference set metrics are above or below that threshold.
Once you set the thresholds, you'll also see dotted lines that represent the threshold values superimposed on the PR curves in the Model Metrics View:
An example segment with thresholds set (left) and one with no thresholds (right)

Scenarios Tab

The Scenarios tab summarizes your models' performance across all defined Model Performance Segments.
Model Performance Segments are grouped into three primary categories:
  • Splits
    • Always includes a segment card for All Frames in the dataset.
    • Typically the test, training and validation subsets of your dataset.
  • Regression Tests
    • Sets of frames within your dataset on which the model must meet a certain performance threshold in order to be considered for deployment.
    • Regression tests might be tied to overall business goals, specific model development experiments, domain specific difficult performance scenarios, etc.
  • Scenarios
    • Any other subset of frames you'd like to evaluate the model's performance on (e.g. data source, labeling provider, embedding clusters, etc.)
From the Scenarios tab, select up to two inference sets to compare model performance.
  • Initially, the metrics calculations will respect the project-wide default IOU and confidence settings.
  • Otherwise, use the metrics settings to adjust the confidence and IOU thresholds for both models, or either model independently.
Comparing two inference sets
Click the fly out button to open the Segment Details view.
From here you can:
  • View the performance of any uploaded inference set compared to the segment-specific precision, recall, and F1 target thresholds.
  • Modify the metric target thresholds.
  • Manage the segment's elements and segment metadata.
Access Segment details.
Click anywhere in the segment card to open the Metrics tab, pre-filtered to only the frames in that specific segment.
View metrics for a single Segment.

Metrics Tab

Here, you can see a high-level overview of the model's performance on the base dataset or a scenario subset.
You can move the slider for parameters like confidence threshold and IOU matching threshold, and the metrics will recompute to reflect the model's performance with those parameters. You can also change the Metric Class.
Example of changing the Metric Class
Adjust Metrics Thresholds

Many Class Metrics View

When the number of classes exceeds 25, the model metrics view switches from the grids above to a filterable table format for better legibility.
Selecting a row in the Classification Report (left) will filter the Confusion Matrix (right) to only rows where either GT or Prediction matches the selected class. Selecting a row in the Confusion Matrix will show examples of those types of confusions, and sort by the amount of confusion.

Understanding the Confusion Matrix

The confusion matrix in the Metrics tab is extremely useful for identifying label/data quality issues. While we have a whole guide on how to move through a data quality workflow, this section will focus specifically on how to use all the features of the matrix.

Filtering Buttons

Filtering buttons in the top left corner of the screen
In the Metrics tab, these filtering buttons allow you to quickly filter and view subsets of your dataset.
For example, when you click on FN (False Negatives), you'll see specific cells highlighted in the matrix. Once the cells are highlighted, the matching examples also populate below for review.
Interacting with the FN button
The Confused As and Confused From buttons reveal a dropdown where you can filter on specific classes. When you select a class, you'll again see cells in the matrix highlighted, and examples that meet the criteria populate below.
Interacting with the Confused As dropdown

Toggle Buttons

Toggle buttons above the confusion matrix

Absolute/Percentage Toggle Buttons

The toggle buttons on the left allow you to change the value displayed in each cell:
  • Absolute is the number of crops that meet the label/inference class criteria.
  • Percentage depends on which option is selected on the other toggle button (Row, Column, Value).
    • If Row is selected, the cell percentage is the cell's crop count relative to the row total.
    • If Column is selected, the cell percentage is the cell's crop count relative to the column total.
    • If Value is selected, the cell percentage is the cell's crop count relative to the total number of crops across all classes.
Absolute/Percentage Toggle
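The three percentage modes can be sketched with a small confusion matrix. This is an illustration of the denominators described above, not Aquarium's actual implementation.

```python
# A small 3-class confusion matrix: rows = ground truth, columns = prediction.
matrix = [
    [50,  3,  2],
    [ 4, 40,  6],
    [ 1,  5, 30],
]

def cell_percentage(m, row, col, mode):
    """Percentage shown for a cell under each toggle mode."""
    count = m[row][col]
    if mode == "row":
        denom = sum(m[row])                 # total crops in this GT row
    elif mode == "column":
        denom = sum(r[col] for r in m)      # total crops in this prediction column
    else:  # "value": total crops across all classes
        denom = sum(sum(r) for r in m)
    return 100 * count / denom

print(round(cell_percentage(matrix, 1, 0, "row"), 1))     # 8.0
print(round(cell_percentage(matrix, 1, 0, "column"), 1))  # 7.3
print(round(cell_percentage(matrix, 1, 0, "value"), 1))   # 2.8
```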

Row/Column/Value Toggle Buttons

The toggle buttons above the confusion matrix change how the cell values are colored and which numbers display on cell hover.
The darker the color, the larger the percentage of the row, column, or overall total that the cell represents.
Depending on the toggled option, the denominator displayed on cell hover reflects the total count for the row, the column, or the entire dataset.

Comparing Two Inference Sets

In the Model Metrics View it is possible to compare two inference sets at once.
You can see two inference sets selected
When two inference sets are selected, both the value displayed on a cell and the values displayed when hovering over a cell will appear different than with just a single inference.
Note that whichever inference set you select first in the dropdown is the one listed above the confusion matrix, and the selection order affects the results you see in the matrix.
An example of confusion matrix with two inference sets selected
Taking a look at coloring in the matrix pictured above, the darker the blue the better, the darker the red the worse.
Breaking this statement down, the coloring depends on whether we are looking at values on the main diagonal or off the diagonal.
Here we mean the diagonal that represents the correct predictions from the model:
The diagonal we are referring to
On the diagonal, any positive value is good: it signifies an increase in the number of correct classifications in the second inference set compared to the first, so positive numbers on the diagonal are colored blue. By the same logic, any negative value on the diagonal signifies a decrease in performance and is colored red.
For any value off the diagonal, the colors represent the opposite, because each off-diagonal cell represents a specific kind of error. Positive numbers therefore mean more misclassifications in the second inference set compared to the first, while negative numbers mean fewer errors, i.e. better performance, and are colored blue.
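The coloring rule above can be summarized in a few lines. This is a sketch of the described behavior, not Aquarium's actual rendering code.

```python
def delta_cell_color(gt_class, pred_class, delta):
    """Color for a comparison cell: on the diagonal, an increase in correct
    predictions is an improvement (blue); off the diagonal, an increase in a
    specific confusion is a regression (red)."""
    if delta == 0:
        return "none"
    on_diagonal = gt_class == pred_class
    improved = (delta > 0) if on_diagonal else (delta < 0)
    return "blue" if improved else "red"

print(delta_cell_color("car", "car", +5))         # blue: more correct predictions
print(delta_cell_color("car", "pedestrian", +3))  # red: more confusions
print(delta_cell_color("car", "pedestrian", -3))  # blue: fewer confusions
```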

Understanding the Values on Hover

zoomed in image of a cell from the images above
Looking at the image above, when comparing two inference sets the format of the message on hover for a cell is slightly different.
The message for this cell reads:
delta - straight + 10 (2 -> 12)
This reads as: for objects classified as delta but labeled straight, the second inference set had 10 more of these failures than the first. The first inference set had 2 examples of this particular failure and the second had 12. (2 + 10 = 12)
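The arithmetic behind the hover message can be sketched as follows. The exact format string is inferred from the example above and may differ from Aquarium's actual output.

```python
def hover_message(gt_class, pred_class, first_count, second_count):
    """Format the hover text for a comparison cell, e.g.
    'delta - straight + 10 (2 -> 12)' (prediction first, then ground truth)."""
    change = second_count - first_count
    sign = "+" if change >= 0 else "-"
    return f"{pred_class} - {gt_class} {sign} {abs(change)} ({first_count} -> {second_count})"

print(hover_message("straight", "delta", 2, 12))  # delta - straight + 10 (2 -> 12)
```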
How to determine which inference set is first vs. second