Updating Datasets

Key Features

Previously, datasets were uploaded to Aquarium in batches, and as a result, datasets could not be modified or added to after uploading.
As of version 0.0.61 of the Python client, the default upload mode for a new dataset is as a "mutable dataset." Uploading your dataset as mutable is the recommended way to get your data into Aquarium, and provides access to several useful features:

Fully Versioned w/ Edit History

As a dataset grows and changes, we maintain a versioned history of every previous version. This has many benefits, including:
  • Reproducible experiment results. If an experiment produced inferences that were evaluated against version X of the dataset, it can be evaluated and explored against that version, even if the dataset continues to be updated in the background.
  • Time-travel / rollbacks. Do you want to know what the dataset looked like before a major relabeling effort? Did that effort introduce problems that you want to undo? Load up a previous version at any time!
  • Edit histories / Audit logs. Each entry is versioned by its ID, so you can always look up the full history for a given image and see each modification made to its labels.
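The versioning behavior described above can be pictured with a small in-memory model. This is only a conceptual sketch, not how Aquarium stores data: the `FrameHistory` class and its method names are hypothetical and exist purely for illustration.

```python
class FrameHistory:
    """Toy append-only store: each upload of a frame ID adds a new version."""

    def __init__(self):
        self._versions = {}  # frame_id -> list of label lists, oldest first

    def upload(self, frame_id, labels):
        # Re-uploading an existing frame ID appends; nothing is overwritten.
        versions = self._versions.setdefault(frame_id, [])
        versions.append(list(labels))
        return len(versions)  # 1-based version number of this upload

    def latest(self, frame_id):
        return self._versions[frame_id][-1]

    def at_version(self, frame_id, version):
        # "Time travel": read the labels as they were at an earlier version.
        if version < 1:
            raise ValueError("versions are numbered from 1")
        return self._versions[frame_id][version - 1]

    def history(self, frame_id):
        # Full edit history for one frame, oldest first.
        return [list(labels) for labels in self._versions[frame_id]]


history = FrameHistory()
history.upload("frame_001", ["car"])
history.upload("frame_001", ["car", "pedestrian"])  # relabeling pass
```

Here `history.at_version("frame_001", 1)` still returns the labels from before the relabeling pass, while `history.latest("frame_001")` returns the current ones.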

Streaming Inserts + Partial Success

Traditional Aquarium datasets required batching together a full dataset into a single operation, which would either fully succeed or fully fail after analyzing all entries.
Mutable datasets also allow you to upload data in a streaming format, for example one frame at a time as you receive images back from a labeling provider. If one batch of updates encounters an error, only that batch will fail; the rest of the dataset will still be processed and made available to users.
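As a rough illustration of partial success, the sketch below processes several upload batches independently, so that an error in one batch does not block the others. The function names, the record shape, and the validation rule are all assumptions made for the example; they are not part of the Aquarium client.

```python
def process_stream(batches, validate):
    """Validate each batch on its own; a failing batch does not stop the rest."""
    succeeded, failed = [], []
    for batch_id, frames in batches:
        try:
            for frame in frames:
                validate(frame)
        except ValueError as err:
            # Record the failure and move on to the next batch.
            failed.append((batch_id, str(err)))
            continue
        succeeded.append(batch_id)
    return succeeded, failed


def require_frame_id(frame):
    # Hypothetical validation rule for the example.
    if not frame.get("frame_id"):
        raise ValueError("missing frame_id")


batches = [
    ("upload_1", [{"frame_id": "a"}]),
    ("upload_2", [{"frame_id": ""}]),  # this batch fails validation
    ("upload_3", [{"frame_id": "b"}, {"frame_id": "c"}]),
]
succeeded, failed = process_stream(batches, require_frame_id)
```

Only `upload_2` ends up in `failed`; the other two batches are processed normally.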

Using Mutable Datasets

When uploading new datasets, no action is necessary, since streaming is now the default upload mode.
To update existing frames in your dataset, for example to change the label(s) on the frame, simply specify the original frame_id when reuploading that frame, so that Aquarium can link it to the original. Any new information provided will be appended to the frame as a new version. The previous state of the frame will continue to be available as an old version.
Note: If you are looking only to update metadata, you can use the client.update_frames_metadata(...) method to avoid having to pull in all the data you would need to instantiate a LabeledDataset and LabeledFrames.
Inferences can be updated in the same fashion, using create_or_update_inferences.
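A minimal sketch of the re-upload flow might look like the following. The call shape used here (positional project and dataset IDs plus a `dataset=` keyword) is an assumption and should be checked against the client reference. `RecordingClient` is a hypothetical stand-in that only records calls so the example runs without credentials, and the placeholder dictionary stands in for a real LabeledDataset.

```python
class RecordingClient:
    """Hypothetical stand-in for an authenticated Aquarium client."""

    def __init__(self):
        self.calls = []

    def create_or_update_dataset(self, project_id, dataset_id, dataset=None, **kwargs):
        # A real client would upload here; this stub just records the call.
        self.calls.append((project_id, dataset_id, dataset, kwargs))


def reupload_frames(client, project_id, dataset_id, dataset):
    """Send updated frames to an existing dataset.

    `dataset` should contain frames built with their ORIGINAL frame_id values,
    so that each one is linked to the existing frame and stored as a new version.
    """
    # Assumed call shape; verify against the client documentation.
    client.create_or_update_dataset(project_id, dataset_id, dataset=dataset)


client = RecordingClient()
# Placeholder payload; in practice this would be a LabeledDataset of LabeledFrames.
relabeled = {"frames": ["frame_001"]}
reupload_frames(client, "my_project", "my_dataset", relabeled)
```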

Monitoring Upload Status

Similar to batch uploads, you'll be able to view the status of your streaming uploads in the web app.
If you go to the Project Details page, you'll see a Streaming Uploads tab (previous batch uploads under your project will still be visible under Uploads).
Each upload ID corresponds to a subset of your dataset/inference set (with the associated frame count + label count).
To view more details on which specific frames/labels are present in a given upload, you can click on the Status (e.g. DONE). A pop-up will appear with details for that upload.
In the case of a failed upload, you can debug via the Errors section (which exposes frame-specific debug logs), and download this info to determine which frames/crops may need to be re-uploaded.
If you are running into an error and the error logs are not sufficient to understand how to fix the issue, please reach out to the Aquarium team and we can help resolve your problem.

Migrating Projects with Immutable Datasets to Mutable

Any scripts that make calls to client.create_dataset or client.create_or_update_dataset no longer need to pass the pipeline_mode argument in order to use streaming mode, as "STREAMING" is now the default argument. However, if you would prefer to continue using batch mode for your uploads, you will now have to specify pipeline_mode="BATCH".
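One way to keep the mode choice in a single place is a small helper that returns the extra keyword arguments for the upload call. The helper name `pipeline_kwargs` is hypothetical, and the commented call at the end assumes positional project and dataset IDs plus a `dataset=` keyword; only the `pipeline_mode` argument and its "STREAMING" and "BATCH" values come from the text above.

```python
def pipeline_kwargs(use_batch=False):
    """Return extra keyword arguments for create_dataset / create_or_update_dataset."""
    # "STREAMING" is the default, so streaming uploads need no extra argument.
    # Batch uploads must now opt in explicitly.
    return {"pipeline_mode": "BATCH"} if use_batch else {}


streaming_kwargs = pipeline_kwargs()
batch_kwargs = pipeline_kwargs(use_batch=True)

# Example usage (call shape assumed, not confirmed here):
# client.create_dataset(project_id, dataset_id, dataset=dataset, **batch_kwargs)
```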
You may have existing immutable datasets that were uploaded via batch mode, and want to convert them to mutable datasets.
If you go to the "Datasets" tab of your "Project Details" page, each of the listed legacy datasets should now have a new teal "Clone as New Mutable Dataset" button.
When you click this button, cloning will begin.
After a minute or so, if you refresh the page, the new dataset will appear with the prefix "MUTABLE_". The old dataset will also have a tooltip that points to the new dataset.
Depending on the size of your original dataset, it may take some more time for this new mutable dataset to be fully processed and become viewable in "Explore" view.
NOTE: A dataset's corresponding inference sets will not be automatically cloned for now, but can be uploaded to the mutable dataset using the Aquarium client. Please contact us if you have questions about migrating inference sets.

Notes & Limitations

There are a few things to be aware of when using mutable datasets:

Some Dataset Attributes Are Still Immutable

This change allows all elements of a dataset (frames, metadata values, labels, bounding box geometry, etc.) to be added / updated / deleted, but they must still be compatible with the dataset as a whole.
Most notably, the following dataset attributes must remain consistent over time:
  • Set of known valid label classes
  • User provided metadata field schemas
  • Embedding source (i.e., embeddings are expected to be compatible between all frames in the dataset)
We plan to support changes to all of these in the future. Please let us know if any of them are particularly valuable for you.
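To catch incompatible updates before uploading, a client-side pre-check along these lines could be useful. It is only a sketch: the frame record shape, the schema representation, and the `check_frame` function are assumptions for illustration and do not reflect how Aquarium validates uploads.

```python
def check_frame(frame, valid_classes, metadata_schema):
    """Return a list of problems that would make a frame incompatible."""
    problems = []
    for label in frame["labels"]:
        # Label classes must come from the dataset's known set.
        if label not in valid_classes:
            problems.append(f"unknown label class: {label}")
    for field, value in frame["metadata"].items():
        expected = metadata_schema.get(field)
        if expected is None:
            problems.append(f"unknown metadata field: {field}")
        elif not isinstance(value, expected):
            problems.append(f"metadata field {field} should be {expected.__name__}")
    return problems


valid_classes = {"car", "pedestrian"}
metadata_schema = {"camera": str, "speed_mph": float}

ok = check_frame(
    {"labels": ["car"], "metadata": {"camera": "front"}},
    valid_classes,
    metadata_schema,
)
bad = check_frame(
    {"labels": ["bicycle"], "metadata": {"speed_mph": "fast"}},
    valid_classes,
    metadata_schema,
)
```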

Inference Sets Are Pinned to a Specific Dataset Version

When an inference set is uploaded, it will be pinned to a specific version of the labeled dataset, which will default to the most up-to-date version at the time of submission.
Updates to the inference set itself will show up in the UI, but updates to the base dataset (ground truth) won't be incorporated.
Metrics shown will be computed against those pinned dataset labels, and any visualizations of the ground truth will be from that specific version.
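The effect of pinning can be shown with a toy model in which metrics are always computed against the pinned version of the labels, even after the dataset has moved on. The class, field names, and accuracy metric here are hypothetical and chosen only to make the behavior concrete.

```python
# Two versions of the ground truth; frame_002 was relabeled in version 2.
dataset_versions = {
    1: {"frame_001": "car", "frame_002": "truck"},
    2: {"frame_001": "car", "frame_002": "car"},
}


class InferenceSet:
    def __init__(self, predictions, pinned_version):
        self.predictions = predictions
        self.pinned_version = pinned_version  # fixed at submission time

    def accuracy(self, versions):
        # Metrics use the pinned labels, not the latest ones.
        truth = versions[self.pinned_version]
        correct = sum(
            1 for frame_id, pred in self.predictions.items() if truth[frame_id] == pred
        )
        return correct / len(self.predictions)


inferences = InferenceSet(
    {"frame_001": "car", "frame_002": "car"}, pinned_version=1
)
pinned_accuracy = inferences.accuracy(dataset_versions)
```

In this example the relabeling in version 2 agrees with the prediction for `frame_002`, but the reported accuracy does not change because the inference set remains pinned to version 1.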

Segment Elements are Versioned

Elements (frames, labels, inferences) added to segments correspond to the specific version of the element when it was added to the segment. They do not automatically update to the latest version of the element within the dataset. This is intentional, but has tradeoffs:
  • At any time, you can review elements in segments as they were when they were created. This makes it easy to reproduce issues in datasets across members of the team, even as the labels or associated frames change.
  • Because you're viewing each element as it was when it was added to the segment, reviewing a segment after a data quality issue has been corrected may show that the issue is still present, even though the current version of the dataset is correct. Use the segment state tracking features in Aquarium (and archive old segments that have been resolved) to mitigate any potential confusion.
For example, you might create a segment with example labels of "bounding box too loose." If you go re-label those boxes and update them in the dataset, the segment will still contain the original (poorly drawn) labels, with an icon indicating that it belongs to an older version of the dataset.
[Figure: Warning icon indicating that an issue element is out-of-date.]
The out-of-date element will still be available for viewing, but some features (like within-dataset similarity search) may be disabled for out-of-date elements.
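The way segments hold on to element versions can be sketched as follows. This is a conceptual model only: `SegmentElement`, `is_out_of_date`, and the version-number scheme are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentElement:
    element_id: str
    version: int  # version of the element at the time it was added to the segment


def is_out_of_date(element, latest_versions):
    # An element is stale once the dataset holds a newer version of it.
    return element.version < latest_versions[element.element_id]


latest_versions = {"label_17": 1}

# Add the loosely drawn box to a "bounding box too loose" segment.
segment = [SegmentElement("label_17", latest_versions["label_17"])]

# Later, the box is relabeled and the dataset moves to version 2 of that label.
latest_versions["label_17"] = 2

stale = [e.element_id for e in segment if is_out_of_date(e, latest_versions)]
```

The segment still references version 1 of `label_17`, so it is reported as stale, which corresponds to the warning icon described above.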