Collection Campaigns
Identify and submit the most relevant data for labeling
If you have a large corpus of new, unlabeled data, Aquarium's Collection Campaign segment helps you quickly collect the subset you actually want to use—without the need for someone to manually review the full corpus.
Utilizing Collection Campaigns requires setting up a Collection Campaign segment within your dataset. Learn more about organizing your data with Segments.
Based on a set of difficult edge cases grouped into a Collection Campaign segment, you can find more examples similar to these. You can then send these examples to a labeling provider, use them to retrain your model, and get the most model improvement for the least labeling cost!

User Guide

This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to those you've previously identified in a Collection Campaign Segment.
The flow will be something like this:
Collection Campaign Flow


In order to successfully create a Collection Campaign, the following requirements must be met:
  • This feature will only work on Issues where all of the contained elements come from Datasets and Inference Sets uploaded on or after January 11th, 2021.
  • All elements within the Issue must be from the same Dataset or Inference Set.
    • NOTE: This means that an Issue can't have an element from a dataset and an element from the dataset's corresponding inference set. Those count as distinct sets.
  • Embeddings must be generated for the data corpus being searched through.
    • Whatever model you use to generate the new corpus's embeddings must be the same model you used to generate the Issue elements' embeddings.
  • The data corpus must have its data accessible to Aquarium (URLs, GCS paths, etc.), much like how it currently is for uploaded Datasets and Inference Sets.

1. Start a Collection Campaign (Web App)

In order to start a Collection Campaign, first navigate to the Segments tab in the web app.
Navigate to a Collection Campaign Segment that contains the sort of data that you want more examples of. (If you haven't created one yet, do so first.)
For collection campaigns, a well-curated set of issue elements will help you achieve better sampling results.
Ordinarily, this can be a lengthy and tedious manual process, but you can use the process described in Finding Similar Elements Within a Dataset to speed that up.
A further note: your collection campaign will collect the same element type as your seed issue. If your issue is made up of crops, your campaign will collect similar crops; if it is made up of frames, it will collect similar frames. Both workflows otherwise operate the same way.
When you go to that Issue's page, you should see a Collection Campaign box in the right panel. Click the Start Campaign button within that box, as follows:
Once a collection campaign is created for an issue, you should see a new Collection Samples tab pop up. There won't be any samples displayed, because the Python collection client hasn't run yet.
In the sidebar, you'll also be able to see additional info such as its version and status:

Deactivating and Reactivating Campaigns

If you no longer want the Python client (described below) to collect new samples for that particular Issue, you can click Deactivate Campaign.
The examples previously collected by a deactivated campaign will still be visible, and this campaign can easily be reactivated at any time by clicking Reactivate Campaign.

Set a Sampling Threshold (Optional)

The sampling threshold (default 0.5) allows you to control how "strict" you want to be for a given campaign. During sampling, a similarity score is calculated for each unlabeled dataframe, which determines whether it qualifies for upload. A lower threshold will result in more samples, but also more false positives. You can tune it according to your labeling needs.
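For intuition, here is how a threshold trades off sample volume against precision. This is a toy sketch, not the client's actual scoring code, and the names are illustrative:

```python
def qualifying_samples(scored_frames, threshold=0.5):
    """Keep the frames whose similarity score meets the threshold.

    scored_frames: list of (frame_id, similarity_score) pairs.
    A lower threshold admits more frames, but also more false positives.
    """
    return [frame_id for frame_id, score in scored_frames if score >= threshold]

scored = [("frame_a", 0.9), ("frame_b", 0.55), ("frame_c", 0.3)]
strict = qualifying_samples(scored)                 # default 0.5 keeps two frames
loose = qualifying_samples(scored, threshold=0.25)  # keeps all three
```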
Collection Campaigns work as a point-in-time snapshot of all items inside an issue. The Python collection client will search for examples most similar to what was in that snapshot.
If you activate a Collection Campaign and subsequently modify the Issue it is based on (e.g. adding or removing elements), these changes won't be automatically picked up by the Python collection client.
However, a Commit New Campaign Version button will appear. To register any changes, you'll need to click this button:

2. Collect and Upload Your Data (Python Client)

In this section, we'll cover the setup and API calls needed to scan your local data corpus and upload the examples most similar to the ones in your active Collection Campaign.
You'll be using Aquarium's Python client to write a script, similar to how you've used it in the past to upload your data.

Initialize Your Collection Client

First, you'll need to initialize a new collection client, much in the way you would initialize an aquariumlearning client when uploading data.
import aquariumlearning as al
al_client = al.CollectionClient() # Note: This is the new client to use
al_client.set_credentials(api_key="YOUR API KEY HERE")

Fetch the Latest Collection Campaign Info

One of the first commands to run is syncing state. The following call downloads information about all active Collection Campaigns to the local client:
al_client.sync_state()
NOTE: As the number of items in your Collection Campaigns increases, the amount of data downloaded also increases.
Please ensure that there is sufficient disk space to support your Collection Campaigns.
Optionally, you can constrain sampling to the collection campaigns in a specific list of projects:
project_names = ["some_project_1", "some_project_2"]
al_client.sync_state(project_names=project_names)
Alternately, if you want to specify individual issues, you can do so as follows:
issue_uuids = ["cf8c92f5-e720-47fd-bf8e-ed5b07d47372", "5e8cb31c-9b3e-4a97-89b8-3428543a9778"]
al_client.sync_state(issue_uuids=issue_uuids)

Preprocess Your Data Corpus

Now, you'll need to turn the data corpus you are scanning through into a construct that the client can understand.
Luckily, the client already has a Labeled Frames data type to handle this. You can construct Labeled Frames much like you already do when you upload a dataset.
Unlike before, you'll add them directly to a list rather than to a LabeledDataset:
corpus_of_data_frames = []
for item in my_corpus_of_data:
    # Create a Frame
    frame = al.LabeledFrame(frame_id=item.frame_id, date_captured=item.date_captured)
    # Add relevant Metadata
    frame.add_user_metadata("location", item.location)
    frame.add_user_metadata("vehicle", item.vehicle)
    # Add the actual image url
    frame.add_image(sensor_id=item.sensor_id, image_url=item.image_url, date_captured=item.date_captured)
    # Add relevant embeddings, generated by the same model used for the Issue elements
    frame.add_frame_embedding(embedding=item.embedding)
    # Add the frame to the list of frames
    corpus_of_data_frames.append(frame)

Labels (optional)

Since the frames in your corpus are constructed as LabeledFrames, labels can also be added to each one if needed.
You will want to add labels (and their corresponding embeddings) if your original issue is made up of crop elements (e.g. bounding boxes).
Labels added to each frame must use the same task type as the dataset you are running the collection campaign on (e.g. use add_label_2d_classification if your dataset is a 2D classification task). Added frame labels will be visible in the collection campaign results.
NOTE: Adding confidence values to labels is not currently supported in collection campaign results.
Here are some common label types, their expected formats, and how to work with them in Aquarium:
2D Bounding Box
3D Cuboid
2D Semseg
2D Polygon Lists
# Standard 2D case
frame.add_label_2d_bbox(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_1',
    classification='dog',
    # Coordinates are in absolute pixel space
    top=200,
    left=300,
    width=250,
    height=150
)
Aquarium supports 3D cuboid labels, with 6-DOF position and orientation.
frame.add_label_3d_cuboid(
    # A unique id across all other labels in this dataset
    label_id='unique_id_2',
    classification='car',
    # XYZ dimensions of this cuboid
    dimensions=[1.0, 0.5, 0.5],
    # XYZ position of the center of this object
    position=[2.0, 2.0, 1.0],
    # An XYZW ordered object rotation quaternion
    rotation=[0.0, 0.0, 0.0, 1.0],
    # Optional, defaults to the implicit WORLD coordinate frame.
    # If your cuboid is relative to a specific coordinate frame,
    # you can reference it by name here.
    coord_frame_id='lidar_sensor'
)
2D Semantic Segmentation labels are represented by an image mask, where each pixel is assigned an integer value in the range of [0, 255]. For efficient representation across both servers and browsers, Aquarium expects label masks to be encoded as grey-scale PNGs of the same dimension as the underlying image.
If you have your label masks in the form of a numpy.ndarray, we recommend using the pillow python library to convert it into a PNG:
! pip3 install pillow
from PIL import Image

# 2D array, where each value is [0,255] corresponding to a class_id
# in the project's label_class_map.
int_arr = your_2d_ndarray.astype('uint8')
Image.fromarray(int_arr).save('label_mask.png')
Because this will be loaded dynamically by the web-app for visualization, this image mask will need to be hosted somewhere. To upload it as an asset to Aquarium, you can use the following utility:
mask_url = al_client.upload_asset_from_filepath(project_id, dataset_id, filepath)
This utility hosts and stores a copy of the label mask (not the underlying RGB image) with Aquarium. If you would like your label masks to remain outside of Aquarium, chat with us and we'll help figure out a good setup.
Now, we add the label to the frame like any other label type:
frame.add_label_2d_semseg(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_3',
    # Expected to be a PNG, with values in [0,255] that correspond
    # to the class_id of classes in the label_class_map
    mask_url=mask_url
)
Aquarium represents instance segmentation labels as 2D Polygon Lists. Each label is represented by one or more polygons, which do not need to be connected.
frame.add_label_2d_polygon_list(
    # The sensor id of the image this label corresponds to
    sensor_id='some_camera',
    # A unique id across all other labels in this dataset
    label_id='unique_id_4',
    classification='dog',
    # All coordinates are in absolute pixel space
    polygons=[
        # These are polygon vertices, not a line string. This means
        # that no vertices are duplicated in the lists.
        {'vertices': [(x1, y1), (x2, y2), ...]},
        {'vertices': [(x1, y1), (x2, y2), ...]}
    ],
    # Optional: indicate the center position of the object
    center=[center_x, center_y]
)

Assign Similarity Scores to Each Data Corpus Frame

Now that you've transformed your data corpus into a list of LabeledFrames to scan, you'll make two simple API calls.
The first call iterates through each frame in your list and assigns a similarity score between that frame and each of the active Collection Campaigns. This call does not upload any data:
# Can be called any number of times
al_client.sample_probabilities(corpus_of_data_frames)
If you are dealing with a crop issue, a similarity score will be calculated for each crop in a given sample frame, and the highest scoring crop will be the overall frame's "similarity score."
This means that even if a given sample frame has multiple qualifying "similar" crops, only the most similar crop will appear in the app UI.
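In other words, a crop issue's frame-level score is just the maximum over its crops' scores. A toy illustration, not the client's implementation:

```python
def frame_similarity(crop_scores):
    """A frame's similarity score is that of its most similar crop."""
    return max(crop_scores)

frame_similarity([0.2, 0.8, 0.5])  # -> 0.8: the best crop decides for the frame
```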

Filter and Upload Relevant Examples

The second API call will filter the frames based on an internally calibrated threshold. This threshold is determined as follows:
  • If the override_sampling_threshold parameter is specified in the save_for_collection call, this threshold is used for all of the collection campaigns from the earlier sync_state call.
  • Otherwise, if a campaign's sampling threshold was specifically configured in the web app, this is the threshold used for that campaign.
  • If no override or campaign-specific threshold was set, a default of 0.5 is used.
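The precedence above can be sketched as follows. This is an illustrative stand-in for the client's internal logic, and the function and argument names are hypothetical:

```python
DEFAULT_SAMPLING_THRESHOLD = 0.5

def effective_threshold(campaign_threshold=None, override=None):
    """Resolve a campaign's sampling threshold, in priority order:
    1. an override_sampling_threshold passed to save_for_collection,
    2. a campaign-specific threshold configured in the web app,
    3. the default of 0.5.
    """
    if override is not None:
        return override
    if campaign_threshold is not None:
        return campaign_threshold
    return DEFAULT_SAMPLING_THRESHOLD
```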
The frames that meet this threshold are the most similar examples, and will be uploaded back to Aquarium for analysis:
# Will upload all frames passing the threshold that had
# sample_probabilities called on them since the client was
# initialized
al_client.save_for_collection()

# Alternately, you can specify an override threshold:
al_client.save_for_collection(override_sampling_threshold=0.7)

# Alternately, you can specify a target count. That will attempt
# to save up to `target_sample_count` entries, prioritized by
# highest similarity score:
al_client.save_for_collection(target_sample_count=100)
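For intuition, target_sample_count behaves like a top-k selection over similarity scores. A toy sketch, not the client's actual code:

```python
def top_k_samples(scored_frames, target_sample_count):
    """Keep up to `target_sample_count` frames, prioritized by
    highest similarity score."""
    ranked = sorted(scored_frames, key=lambda f: f[1], reverse=True)
    return ranked[:target_sample_count]

top_k_samples([("a", 0.6), ("b", 0.9), ("c", 0.7)], 2)  # -> [("b", 0.9), ("c", 0.7)]
```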
If you want to see what your collected samples look like before actually uploading them, there is a dry_run flag that you can specify:
al_client.save_for_collection(dry_run=True)
It will (1) display basic stats and (2) link out to a "preview frame", where a single sample frame is uploaded so you can make sure it looks how you expect (similar to the one used in dataset uploads).

3. View your Collection Campaign (Web App)

Now you can view the collected samples in the web app!
To do so, simply navigate back to the Issue that contains the active Collection Campaign (or refresh the page if you already have it open). New data should appear, assuming your data corpus had examples that passed the similarity threshold.
You can sort the samples according to similarity score or campaign version.

Understanding Why Samples were Selected

If you are using the most recent version of the client, you can now view the cluster of issue elements that a sample was closest to (which may help build intuition on why a sample was selected).
Simply click on the question mark displayed next to a particular sample's campaign version info:

Viewing Collection Rate

Note: Collection rate is not displayed for older collection campaigns, because some of the info required to calculate it was not recorded at the time.
In the sidebar, you can see the collection rate of your campaign:
This reports the number of samples uploaded, out of the number of dataframes actually processed by the Python collection client.
Note: Although uploaded samples are deduped by task_id, the collection client does not dedupe when tracking the number of frames that have been looked at.
Consequently, if you run the collection client over the same (or overlapping) set of unlabeled dataframes, your reported collection rate will be lower than it actually is.
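The arithmetic behind this caveat, in an illustrative sketch (not the client's actual bookkeeping):

```python
def collection_rate(uploaded_sample_ids, frames_processed):
    """Collection rate = unique samples uploaded / frames processed.

    Uploads are deduped (by task id), but every scanned frame is counted,
    so rescanning the same corpus deflates the reported rate.
    """
    return len(set(uploaded_sample_ids)) / frames_processed

collection_rate(["a", "b"], 10)       # 0.2 after one pass over 10 frames
collection_rate(["a", "a", "b"], 20)  # 0.1 after a rescan: same uploads, double the count
```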

Discarding Bad Samples

To remove samples that don't match what you are looking for, you can select and discard them:

Exporting Samples for Labeling

To export the collected frames, you can click the blue Download button, much like you already do when exporting Issue Elements from Aquarium today.
You can then send these to a labeling provider and use the results to retrain your model. This latest dataset iteration (and corresponding inference set) can be uploaded to Aquarium via the standard data ingestion flow, and you can continue repeating this process to improve your model performance!
Yay positive feedback loops

Unlabeled Indexed In-App Collections

  • Note: as with collection campaign uploads, your unlabeled indexed dataset should have region proposals (likely from your model), which are uploaded as "labels".
The previously described collection campaign flow requires periodically running the Aquarium Python client to find potentially relevant samples. With this feature, you can instead do a one-time upload of your corpus as an unlabeled indexed dataset (the search dataset) and handle the rest of your sampling workflow in the app UI (or via the Python client):
  1. Identify rare examples of interest from a labeled dataset or inference set (the seed dataset), and add them to an Issue.
  2. Choose which unlabeled indexed dataset (the search dataset) to search through.
  3. Generate similar elements from that unlabeled indexed dataset.
  4. (Optional) Export the relevant unlabeled examples for labeling.
Once your engineering team has uploaded the corpus to Aquarium, your ops team can handle the entire "rare scenario" workflow on their own.

Generating Embedding Versions

In order for similarity search to work, your search dataset and your seed dataset must have compatible embedding spaces. To specify this explicitly, you will be using embedding versions (represented by UUIDs).
To determine the embedding version for your seed dataset, go to the Project Details page and select the Embeddings tab:
Select the name of your seed dataset from the dropdown and click Get Version:
The UUID that appears is the embedding version that you will use in the following section, when uploading an unlabeled indexed dataset via the Python client.
Note that the Get Version button will be disabled if your seed dataset is still post-processing.
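Embedding versions are plain UUID strings, so if you're passing them through config it can be worth a quick sanity check. This helper is illustrative and not part of the Aquarium client:

```python
import uuid

def is_valid_embedding_version(value):
    """Return True if `value` parses as a canonical UUID string."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False

is_valid_embedding_version("f700a622-d252-4169-839a-70a3ec6ca741")  # True
is_valid_embedding_version("not-a-uuid")                            # False
```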

Uploading an Unlabeled Indexed Dataset

NOTE: The arguments seed_dataset_name_for_unlabeled_search and is_unlabeled_indexed_dataset have been deprecated and are no longer supported by our API.
Going forward, new unlabeled datasets will need to be created using the UnlabeledDataset class, and instead of seed_dataset_name_for_unlabeled_search, you will provide the existing_embedding_version_uuid.
This is essentially the same as uploading a normal labeled dataset, but instead of LabeledDataset and LabeledFrame, you will use UnlabeledDataset and UnlabeledFrame.
To link the new unlabeled examples to the embedding space of your previous labeled dataset, you'll need the embedding version (see the previous section) of the dataset your seed issue's elements come from, since that issue is what drives the sampling.

End-to-end Code Sample

First upload a labeled dataset (see code example). Say it's called fullset_0.
You go to the UI to get the embedding version. For the sake of example, say that it's f700a622-d252-4169-839a-70a3ec6ca741. (For a real upload, replace it with the one from your own seed dataset.)
Then upload your unlabeled dataset as follows:
search_dataset = al.UnlabeledDataset()
for entry in label_entries:
    # Create a frame object, using the filename as an id
    frame_id = entry['file_name'].split('.jpg')[0]
    frame = al.UnlabeledFrame(frame_id=frame_id)
    # Add arbitrary metadata
    frame.add_user_metadata('some_field', 'some_value')
    # Add an image to the frame
    image_url = "" + entry['file_name']
    frame.add_image(sensor_id='cam', image_url=image_url)
    # Add the region proposal as a label
    label_id = frame_id + '_proposal'
    frame.add_label_2d_bbox(
        sensor_id='cam', label_id=label_id,
        classification=entry['category'],  # these entry fields are illustrative
        top=entry['top'], left=entry['left'],
        width=entry['width'], height=entry['height'],
    )
    # Add the frame to the dataset collection
    search_dataset.add_frame(frame)

al_client.create_or_update_dataset(
    aquarium_project_name, # Your existing project
    aquarium_search_dataset_name, # A new name, not the seed dataset
    dataset=search_dataset,
    existing_embedding_version_uuid='f700a622-d252-4169-839a-70a3ec6ca741',
)

(Option 1) Running a search in the app

Once your unlabeled indexed dataset has finished uploading, you can run the search workflow directly in the app.

(Option 2) Running a Search via the Python Client

Iterative Refinement Collection Campaigns

Users often want to iteratively refine their search results. You can start with a few "query" samples in an issue, run a collection campaign search over your labeled or unlabeled dataset, select and add the relevant results to the issue, and then run another search. With each iteration of search and refinement, the issue accumulates more relevant examples, and the collection results become more relevant in turn.
To run an iterative refinement collection campaign:
  1. Follow the steps above to generate similar elements from an unlabeled indexed dataset.
  2. "Accept" relevant elements returned from the search, and optionally "Discard" irrelevant elements. "Accepted" elements will be used as seed elements when the similarity search is rerun.
  3. Click "Recalculate Similar Dataset Elements" to rerun the search, generating new similar elements seeded from the newly "Accepted" elements as well as the original issue elements. Elements already marked "Accepted" or "Discarded" keep their status when the search is rerun.
  4. Repeat as needed.
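The loop above can be illustrated with a toy, one-dimensional stand-in for similarity search (distances over numbers instead of embedding vectors; not Aquarium's implementation):

```python
def search_similar(seeds, corpus, radius=1.0):
    """Toy similarity search: an element matches if it lies within
    `radius` of any seed element."""
    return [x for x in corpus if any(abs(x - s) <= radius for s in seeds)]

def refine(seeds, corpus, rounds=2, radius=1.0):
    """Each round, accepted results join the seed set, so later searches
    reach examples the original seeds alone would have missed."""
    seeds = set(seeds)
    for _ in range(rounds):
        seeds |= set(search_similar(seeds, corpus, radius))
    return seeds

sorted(refine([0], [0.5, 1.4, 2.2, 5.0], rounds=3))  # -> [0, 0.5, 1.4, 2.2]
```

Note how 2.2 is only reachable after 1.4 has been accepted as a seed, which is exactly why iterating grows the set of relevant results.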
In addition to the above, we've added support to further refine collection results via a feature-flagged classifier (see the link for further info).