How to collect relevant data in your labeled and unlabeled datasets using Aquarium
It is no easy task to decide what data you should label next and add to your training set. Determining which data will have the greatest impact on your model presents its own set of challenges.
By focusing data collection and labeling on the highest-value data, you can get more model improvement in less time and at lower labeling cost than with random sampling.
If you have a large unlabeled dataset, Aquarium's Collection Campaign segment helps you quickly collect the subset you actually want to use, without anyone having to manually review the entirety of your unlabeled data.
Aquarium enables you to analyze your datasets to determine whether there are underrepresented areas within your data or areas where your model struggles. Once you have identified these difficult cases, you can group them into a Collection Campaign segment and find more examples like them in the unlabeled dataset. You can then send these examples to a labeling provider, use them to retrain your model, and get the most model improvement for the least labeling cost!
In this guide, the main steps we will cover are:
Navigating a Collection Campaign segment
Kicking off a similarity search through an unlabeled dataset
Exporting your newly collected data from Aquarium
Once completed you should feel comfortable:
Searching unlabeled datasets using example data you've collected
Reviewing the results of the similarity search
Exporting your newly found data
This guide assumes that you have already found subsets of your data that are of interest for targeted unlabeled data collection and added them to a Collection Campaign segment. Your team can use Aquarium's various views to understand where your training dataset could benefit from additional, targeted data.
We have an entire guide dedicated to the process of assessing your data quality here.
In summary, Aquarium has tools that can help you find areas of confusion, low model metric scores, and sparse representation, so that you can target your data collection at the datapoints most helpful for improving your model.
Embeddings must be generated for the unlabeled dataset being searched through.
Whatever model you use to generate the new corpus's embeddings must be the same model you used to generate the Issue elements' embeddings.
See this guide for uploading unlabeled data correctly using the embedding_version parameter.
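To illustrate why the two sets of embeddings need to come from the same model, here is a minimal sketch of a sanity check you could run on your own side before uploading. It is not part of the Aquarium client: the function name, the dict-based layout, and the version strings are all assumptions made for this example. Matching versions and dimensions are necessary for a meaningful comparison, but they do not prove the same model was used.

```python
from typing import Dict, List

# Hypothetical layout for this example: element id -> embedding vector.
Embeddings = Dict[str, List[float]]


def embeddings_compatible(
    seed: Embeddings,
    seed_version: str,
    corpus: Embeddings,
    corpus_version: str,
) -> bool:
    """Return True only if both sets plausibly come from the same model."""
    # Different version tags mean the vectors live in different spaces.
    if seed_version != corpus_version:
        return False
    # Collect every vector length that appears in either set.
    dims = {len(v) for v in seed.values()} | {len(v) for v in corpus.values()}
    # Exactly one dimensionality should appear across both sets.
    return len(dims) == 1
```

For example, calling embeddings_compatible(segment_vectors, "resnet50-v1", unlabeled_vectors, "resnet50-v1") with your own vectors would flag a mismatch before you spend time on a search whose scores would be meaningless.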
In order to follow the step-by-step instructions, this guide assumes you have already:
This guide runs through the complete flow of setting up a Collection Campaign and collecting new, unlabeled data similar to the examples you've previously identified in a Collection Campaign segment.
1. Navigate to Your Collection Campaign Segment
At this point you should have already created a Collection Campaign segment. If you need help using Aquarium to assess your data quality and find subsets of data to put into a Collection Campaign segment, check out this guide!
In the top navigation bar in Aquarium, click the "Segments" button to open the Segments page.
Once on the Segments page, you'll be able to view all your created segments. Navigate to the Data Collection tab to view all of your created Collection Campaign segments.
Data Collection Tab
Once you have selected which Collection Campaign you would like to work with, click on the name in the table to view details regarding your specific segment.
Detail view of a Collection Campaign segment
2. Click on "Collected" Tab
For more information regarding a Collection Campaign segment, read here.
If you have not run a Collection Campaign before on this segment, your screen will look like this:
You will be able to see text that says, "No samples have been collected yet!"
3. Start the Similarity Search
In the Collected tab, if you have properly uploaded an unlabeled dataset, you will see a dropdown with all of the valid unlabeled datasets that you are able to search through. Depending on your goals for the similarity search, it may make sense to split your unlabeled datasets up in different ways when uploading.
Unlabeled dataset dropdown
Click the button to the right of your selected unlabeled dataset that says "Calculate Similar Dataset Elements", and you'll see the text below change to reflect the status of your similarity search. You'll also see a green bar pop up at the top of your screen indicating the similarity search has started.
4. Review Results of the Search
Results can take anywhere from 10 seconds to a couple of minutes as Aquarium compares your subset of data to the indexed unlabeled dataset. Once the results are returned, your screen will look like this:
You can see the total number of results returned and a tile for each result
You can scroll through the returned images and use the Sort By Ascending or Descending option to order the elements by similarity score.
At this point you could export your data and follow step 6 in this guide. For even better results, though, you may want to take it a step further and refine your search results.
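To build intuition for what a similarity score and its sort order mean, the sketch below ranks unlabeled elements against a set of seed embeddings. This guide does not describe how Aquarium computes its scores, so the choice of cosine similarity and of "best match against any seed" is an assumption made purely for illustration, as are the function names and the data layout.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(
    seeds: List[List[float]],
    corpus: Dict[str, List[float]],
    descending: bool = True,
) -> List[Tuple[str, float]]:
    """Score each unlabeled element by its best match against any seed, then sort."""
    if not seeds:
        raise ValueError("at least one seed embedding is required")
    scored = [
        (element_id, max(cosine(vec, seed) for seed in seeds))
        for element_id, vec in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=descending)
```

Sorting in descending order puts the elements most like your seeds first, which is usually where to start reviewing. Sorting in ascending order surfaces the weakest matches, which can be a quick way to find candidates to discard.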
5. Iteratively Refine Search Results
You may want to iteratively refine your search results, and by running your first similarity search on your unlabeled dataset you have taken the first step.
Once you have done a first pass on your initial results, you can select an image by clicking on the circle in the top left corner of its tile, and then choose to accept or discard the frame or crop. (The examples shown are at the frame level view.) You'll notice that the element will then show up under the Accepted tab or the Discarded tab. Elements added to the Accepted bucket will be used as seed elements when rerunning the similarity search.
Once you have added at least 10 elements to the Accepted bucket and 20 to the Discarded bucket, run another similarity search by clicking the white 'Recalculate Similar Dataset Elements' button that will appear.
By going through multiple iterations of search and refinement, you can grow the number of relevant examples in the issue and get more relevant collection results after each iteration.
Repeat this process as many times as needed to refine your newly collected dataset.
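The bookkeeping behind this refinement loop can be sketched as follows. The thresholds (10 accepted, 20 discarded) and the use of accepted elements as seeds come from the steps above. Everything else, including the class name, its methods, and the rule that an element lives in only one bucket at a time, is a hypothetical model for illustration and not Aquarium's implementation.

```python
from dataclasses import dataclass, field
from typing import Set

MIN_ACCEPTED = 10   # threshold stated in this guide
MIN_DISCARDED = 20  # threshold stated in this guide


@dataclass
class ReviewState:
    """Hypothetical model of one Collection Campaign review session."""
    seeds: Set[str] = field(default_factory=set)      # original segment elements
    accepted: Set[str] = field(default_factory=set)
    discarded: Set[str] = field(default_factory=set)

    def accept(self, element_id: str) -> None:
        # Assumption: an element can only be in one bucket at a time.
        self.discarded.discard(element_id)
        self.accepted.add(element_id)

    def discard(self, element_id: str) -> None:
        self.accepted.discard(element_id)
        self.discarded.add(element_id)

    def can_recalculate(self) -> bool:
        """True once both buckets meet the minimums for another search."""
        return (len(self.accepted) >= MIN_ACCEPTED
                and len(self.discarded) >= MIN_DISCARDED)

    def next_seeds(self) -> Set[str]:
        """Seed set for the next search: original seeds plus accepted elements."""
        return self.seeds | self.accepted
```

Each pass grows the seed set with the elements you accepted, which is why later searches tend to return more relevant results than the first one.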
6. Export Your Collected Data
Once you have finished running your similarity search and refining your data, Aquarium provides two options for exporting your newly created dataset:
Batch export to JSON
Use a webhook to export your data directly to a labeling provider.
We have separate pages in our docs dedicated to exporting data out of Aquarium. These docs will show you how the export data is formatted, as well as things like how to set up a webhook with a labeling provider.
To access both options, use the dropdown button in the top right corner of your screen to select which export option you would like to use. Note that if you have not set up a webhook to a labeling provider, the button will be greyed out.
Greyed out export button
GIF demonstrating how to export your data to a JSON file depending on which tab you are in
Within the Unsorted, Accepted, and Discarded tabs, you can also select individual elements to export instead of all of the data contained in the tab.
Your download should start within a few seconds. Depending on how much data you are exporting, it can take a little longer to finish.
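Once the JSON file is on disk, a short script can split it into batches for labeling. The sketch below assumes only that the top level of the file is a JSON array. The real export schema is described on the export pages of the docs, so treat the function names and that single structural assumption as hypothetical and adjust them to the actual format.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List


def load_export(path: str) -> List[Any]:
    """Load an exported JSON file, assuming its top level is an array."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array; check the export format docs")
    return records


def batches(records: List[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(records), size):
        yield records[start:start + size]
```

Batching this way keeps each labeling job small enough to review and lets you stop early if the first batches already give your model the improvement you were after.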
And congrats! You have successfully located new targeted subsets of your unlabeled data to then label and add into your training set in order to improve model performance!
Have questions about other export formats or want to discuss a more customized version of the workflow in this guide? Please feel free to reach out to us here.