Data Sharing Methodologies

Methodologies for properly allowing Aquarium to render your images
If you don't see your specific configuration mentioned here, please reach out and we'll point you to the right cloud vendor resources!
To state it plainly: Aquarium works without needing to store or host sensitive raw data on Aquarium servers.

Aquarium's Data Model

Aquarium needs the following information to operate:
  • Raw data (images, pointclouds, etc.) in your dataset OR URLs to raw data hosted in your system.
  • Labels (bounding boxes, classifications, etc.) on your dataset.
  • Inferences from your model on your dataset.
  • Any additional metadata (timestamps, device id, etc.) that you'd like Aquarium to index.
  • Optionally, embeddings generated from your own neural network for each image and each object. If this is not available, Aquarium can use a set of pre-trained models to generate embeddings with generally good results.
Aquarium provides insights primarily based on metadata such as data URLs, labels and inferences, and embeddings.
If a user hosts their data on their own S3 bucket / similar service, they can provide a URL to that data instead of the data itself. This has the benefit of reducing the amount of data sent to Aquarium and makes uploads through our client library much faster.
Raw data is only accessed for two purposes:
  • Visualization
  • Embedding Generation
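As a rough illustration of the data model above, the sketch below bundles the listed pieces of information into one record and checks whether server-side access to the raw data would be needed. `FrameRecord`, its field names, and the example URLs are hypothetical and are not part of the Aquarium client library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FrameRecord:
    """Hypothetical container for what gets indexed about one image."""
    image_url: str  # URL to raw data hosted in your own system
    labels: List[Dict[str, Any]] = field(default_factory=list)
    inferences: List[Dict[str, Any]] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)
    embedding: Optional[List[float]] = None  # None means "not supplied"


def server_side_access_needed(record: FrameRecord) -> bool:
    """Raw data must be server-readable only when embeddings are missing.

    Visualization happens in the user's browser, so it is not counted here.
    """
    return record.embedding is None


with_embedding = FrameRecord(
    image_url="https://images.example.com/frame_0001.jpg",
    labels=[{"class": "car", "bbox": [10, 20, 110, 80]}],
    metadata={"device_id": "cam-3"},
    embedding=[0.12, -0.40, 0.88],
)
without_embedding = FrameRecord(
    image_url="https://images.example.com/frame_0002.jpg"
)

print(server_side_access_needed(with_embedding))     # False
print(server_side_access_needed(without_embedding))  # True
```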

Anonymous Mode

Aquarium currently doesn't have an on-prem offering. However, most users don't actually need on-prem (outside of specific health care and defense contexts) because Aquarium's Anonymous Mode satisfies their requirements for data security.
Anonymous Mode is a mode of operation that allows users to get full value out of Aquarium while protecting their sensitive data. To do so, users must provide access-controlled URLs that only they are allowed to see, and then submit their own embeddings so Aquarium doesn't need to generate them.

Embedding Generation

Aquarium can generate embeddings for your dataset as long as you are NOT using Anonymous Mode. Aquarium needs access to your raw data in order to generate these embeddings.
  • When users explore their datasets in Aquarium, the client-side code references a data URL to render a visualization in their browser.
  • If users do not provide their own embeddings, Aquarium will load the images a single time to run pretrained models on the raw data to generate embeddings.
If a user provides access-controlled data URLs and embeddings they generated themselves, Aquarium servers won't require access to the raw data to derive useful insights. In other words, a user can have full access to the functionality of Aquarium without ever exposing their raw data!

Data Sharing

Before anything else, we should figure out data sharing, since you probably aren't working with public images of puppies. Aquarium offers several easy ways to securely work with your data assets on our platform, from re-hosting the data with us to keeping it locked down so that we never get to touch it. If you don't see a solution here, reach out -- we've worked with many enterprise IT teams on unique security schemes too.
Options with an asterisk (*) will only allow your users to view raw data (images, point clouds, etc.) and will never make them accessible to Aquarium's servers.
This increased data privacy does mean that some features around clustering and similarity search will require users to provide their own "image embeddings," as we won't be able to compute them for you.
  • Public URLs
  • Aquarium Hosted
  • Aquarium Mirrored
  • Local Bucket Credentials*
  • Network Restricted*
Public URLs: Your data is free to share publicly.
This is the easiest, assuming you're working on data without any security restrictions. Anywhere you need to provide a URL to an asset, just provide any URL that's accessible on the public internet. For example, the above toy dataset example uses public URLs like the following:
image_url = ""
Aquarium Hosted: Your data needs to be kept secure, and it's ok for Aquarium to host a copy.
Letting Aquarium issue you a storage bucket and host your data means Aquarium can manage a lot of the annoying details that go into producing a snappy user experience. Browser cache control headers, CORS settings, user access credentials -- we'll handle all of these for you.
We'll issue you a Google Cloud Storage bucket of the form gs://aquarium-customer-abcd1234/ and a Google Cloud service account with read/write permissions for that bucket. If your organization uses Google Cloud, we can also directly grant permissions to your admin users.
To upload data:
# Install gsutil, Google Cloud's storage CLI utility, then confirm it is on your PATH
gsutil --version
# Activate the service account credentials
gcloud auth activate-service-account \
  --key-file /path/to/credentials.json
# Recursively sync a local directory up to your bucket
# (-m runs transfers in parallel, -r recurses into subdirectories).
# See the gsutil rsync docs for more options.
gsutil -m rsync -r \
  ./projectA/imgs/ gs://aquarium-customer-abcd1234/projectA/imgs/
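To preview which objects a recursive sync like the one above would create, a small sketch such as the following may help. It only walks a local directory and prints the destination object URIs; it does not upload anything. The function name `planned_object_uris` is hypothetical, and the `gs://` URIs it prints are object locations, not necessarily the exact form expected for `image_url`.

```python
import tempfile
from pathlib import Path
from typing import Dict


def planned_object_uris(local_dir: str, bucket_prefix: str) -> Dict[str, str]:
    """Map each local file to the object URI a recursive sync would create."""
    root = Path(local_dir)
    prefix = bucket_prefix.rstrip("/")
    mapping = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Relative paths are preserved under the destination prefix.
            relative = path.relative_to(root).as_posix()
            mapping[str(path)] = f"{prefix}/{relative}"
    return mapping


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "batch1").mkdir()
    (Path(tmp) / "batch1" / "img_0001.jpg").write_bytes(b"")
    (Path(tmp) / "img_0002.jpg").write_bytes(b"")
    uris = planned_object_uris(
        tmp, "gs://aquarium-customer-abcd1234/projectA/imgs/"
    )
    for uri in uris.values():
        print(uri)
```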
To reference the data:
image_url = ""
Aquarium Mirrored: Your data needs to be kept secure, and it's ok for Aquarium to mirror a copy.
If you sign your data URLs so that Aquarium can access them, we can mirror these assets in a secure bucket without you needing to upload your images into an Aquarium bucket directly.
To upload:
# Flag that tells Aquarium to mirror the image
The mirror_asset flag is available for all Data Formats seen below.
Local Bucket Credentials*: Your data is in a private storage bucket, and you don't want Aquarium to ever access the raw data.
Because the underlying raw data is only needed for visualization purposes, you can provide bucket paths that point to secure resources. Then, when your users want to view them in the application, they can use local credentials to view the data. Neither the credentials nor the data will ever leave your users' browser / local device.
To reference the data, simply use the bucket path as the data url:
image_url = "s3://yourbucket/path/to/img.jpg"
image_url = "gs://yourbucket/path/to/img.jpg"
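If you want to sanity-check bucket paths before submitting them, a sketch along these lines splits a path into provider, bucket, and object key. `BucketPath` and `parse_bucket_path` are hypothetical helper names, and the check covers only the two schemes shown above.

```python
from typing import NamedTuple


class BucketPath(NamedTuple):
    provider: str  # "s3" or "gs"
    bucket: str
    key: str


def parse_bucket_path(url: str) -> BucketPath:
    """Split an s3:// or gs:// path into provider, bucket, and object key."""
    scheme, separator, rest = url.partition("://")
    if separator == "" or scheme not in ("s3", "gs"):
        raise ValueError(f"expected an s3:// or gs:// path, got {url!r}")
    bucket, _, key = rest.partition("/")
    if not bucket or not key:
        raise ValueError(f"missing bucket or object key in {url!r}")
    return BucketPath(scheme, bucket, key)


print(parse_bucket_path("s3://yourbucket/path/to/img.jpg"))
print(parse_bucket_path("gs://yourbucket/path/to/img.jpg"))
```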
Then, each user can point Aquarium to a local credentials file. This file is expected to match the formats provided by your cloud provider's admin/IAM console. Reach out if you have questions about the format!
Local device credentials files
Doing this securely requires modern browser capabilities that aren't available in all browsers yet. We recommend upgrading to the latest version of Google Chrome or Microsoft Edge if you want to use this data sharing scheme.
Network Restricted*: Your team locks down resources through a corporate VPN.
Because the underlying raw data is only needed for visualization purposes, you can provide URLs that will only resolve for users on your corporate network. Because your users will be accessing Aquarium on that network, they'll be able to access those resources without them being accessible to Aquarium's servers.
The best option depends on your specific infrastructure setup, but common options include:
  • An image server / URL signing service only accessible on your network.
  • Bucket-level IP allow permissions for your VPN's known IP Addresses.
If you're planning on this path, reach out to the Aquarium team and we'd be happy to chat with your IT/Infra team to find the easiest solution!

Generating Access-Controlled URLs

There are many ways to share data -- here are a few common approaches, ordered from least to most locked-down.

Authenticated URLs (Aquarium Auth Headers)

Users can provide URLs that are authenticated through normal authentication measures, such as by providing URL signing endpoints for raw data stored on their own S3 / GCS buckets. Requests from Aquarium will include auth headers that identify Aquarium, so users can allow Aquarium to see data without exposing it to the wider world.
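To make the idea of a URL signing endpoint concrete, here is a minimal sketch of a generic HMAC-based scheme with an expiry time. It is not the S3 or GCS signing algorithm and not an Aquarium API; `SECRET_KEY`, `sign_url`, `verify_url`, and the example host are all assumptions made for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; keep it server-side


def _signature(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, path: str, ttl_seconds: int = 300,
             now: Optional[float] = None) -> str:
    """Return a URL that carries an expiry timestamp and a signature."""
    issued = now if now is not None else time.time()
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{base_url}{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Check that the signature matches the path and has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = now if now is not None else time.time()
    if current > expires:
        return False
    return hmac.compare_digest(sig, _signature(parsed.path, expires))


signed = sign_url("https://images.internal.example.com",
                  "/projectA/imgs/0001.jpg", now=1_700_000_000)
print(verify_url(signed, now=1_700_000_100))  # within the 300 s window
print(verify_url(signed, now=1_700_000_301))  # after expiry
```

An image server would call something like `verify_url` on each request and reject any URL whose path was changed or whose expiry has passed.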

Authenticated URLs (Company Identity Provider)

If your company has more complex access controls, we can also discuss integration with Identity Providers like Okta, and include that information when requesting resources. This means that all requests must come from an environment where the user has authenticated with the company's identity provider.

IP Restricted URLs

If users do not want Aquarium to have read access to their data at all, they can restrict permissions on their URLs to only be accessible to users within their corporate network or VPN. This way, Aquarium's servers won't have read access to the raw data. When someone uses Aquarium from within an approved network, the Aquarium browser client will be able to access and render the raw data correctly. This can also be done in conjunction with other access control schemes, such as authenticated image signing URLs.
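For data in S3, one common way to express this kind of restriction is a bucket policy that allows `s3:GetObject` only from known source IP ranges. The sketch below builds such a policy as JSON. The bucket name and CIDR range are placeholders, and your IT team may prefer a different mechanism, so treat this as one possible shape and not a recommended configuration.

```python
import ipaddress
import json
from typing import Any, Dict, Iterable


def ip_restricted_read_policy(bucket: str,
                              allowed_cidrs: Iterable[str]) -> Dict[str, Any]:
    """Build a bucket policy allowing object reads only from the given CIDRs."""
    # ip_network raises ValueError on malformed ranges, catching typos early.
    cidrs = [str(ipaddress.ip_network(cidr)) for cidr in allowed_cidrs]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReadFromCorporateNetwork",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {"IpAddress": {"aws:SourceIp": cidrs}},
            }
        ],
    }


# Placeholder bucket name and a documentation-only IP range.
policy = ip_restricted_read_policy("your-bucket", ["203.0.113.0/24"])
print(json.dumps(policy, indent=2))
```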

Serving Locally Hosted Images

Aquarium's only requirement is that images are available via an HTTP request from the user's browser. Users can host data from their own computer if needed, which is particularly useful for fast experimentation workflows. You can serve a folder of local images with a simple HTTP server by installing Python or Node and running one of the following commands:
# Python 3
python3 -m http.server 5000
# Python 2
python -m SimpleHTTPServer 5000
# Node (serves with CORS enabled)
npx http-server --cors='*' --port=5000
If you're working with semantic segmentation or point cloud data, you'll need to use a local file server that supports Cross-Origin Resource Sharing (CORS). The NPM command above supports it as written; the built-in Python servers do not send CORS headers.
If you are unable to use an NPM package, please reach out and we can get you set up.
Afterwards, you can submit URLs to Aquarium that are formatted like http://localhost:5000/{image_path}.jpg. When the Aquarium client tries to render these URLs, it will load the images served by the local server on your machine with minimal latency.
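The stock Python commands do not add CORS headers. If Node is not an option, a Python-only alternative might look like the sketch below, which subclasses the standard library's static file handler to add a permissive `Access-Control-Allow-Origin` header. `CORSRequestHandler` and `make_server` are hypothetical names, and this has not been validated against every Aquarium data type.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds a permissive CORS header to responses."""

    def end_headers(self):
        # Allow any origin to read the response; tighten this if needed.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(directory: str = ".", port: int = 5000) -> ThreadingHTTPServer:
    """Create a server bound to localhost that serves files from directory."""
    handler = functools.partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

To serve the current folder on port 5000, call `make_server(".", 5000).serve_forever()` and stop it with Ctrl+C.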

Granting Aquarium Read Access to an AWS S3 Bucket

There are many ways to grant others access to data in an S3 Bucket. We recommend creating a custom IAM Role and then allowing cross-organization "Assume Role." That allows a different organization (Aquarium) to temporarily (and easily revocably) take on a role in your policies. This gives full control of permissions and usage logs to you, the user / data owner, while limiting the number of secrets you must share with Aquarium.
We're following the recommended AWS practices described here, broken down to include screenshots of relevant AWS console views and minimal permissions for this use case.
Note: S3 bucket access does not support automatic embedding computation at this time. Please reach out if this is a feature you want to see in Aquarium.

Step-by-Step Breakdown

First, reach out to Aquarium for our 12-digit Account ID, which you'll be granting access to.
Navigate to the IAM Roles Page, and create a new role using the button in the top right:
IAM Roles Page
On this screen, create a role where:
  • Trusted Entity is "AWS Account"
  • Aquarium's 12-digit Account ID is entered as "Another AWS account"
Setting Trusted Entity
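Behind the console form, this step produces a trust policy on the role. For reference, a minimal sketch of what that trust policy typically looks like is generated below. The account ID `123456789012` is a placeholder, not Aquarium's real Account ID, and the console may add extra fields such as an empty `Condition` block.

```python
import json
import re
from typing import Any, Dict


def cross_account_trust_policy(trusted_account_id: str) -> Dict[str, Any]:
    """Build a trust policy letting another AWS account assume this role."""
    if not re.fullmatch(r"\d{12}", trusted_account_id):
        raise ValueError("AWS account IDs are exactly 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{trusted_account_id}:root"
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Placeholder account ID for illustration only.
trust_policy = cross_account_trust_policy("123456789012")
print(json.dumps(trust_policy, indent=2))
```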
On the next screen, you can attach or create an appropriate IAM Policy with the permissions you will grant Aquarium. This should include only s3:GetObject, scoped to the specific S3 bucket(s) you want Aquarium to have access to.
If you have not yet created an IAM policy, you can create one from this page, as shown in the following steps. To start, click on "Create Policy" in the top right.
Add Permissions Page
You would want a simple policy containing only s3:GetObject read access to a specific bucket. If created with explicit JSON, you want the following policy:
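The exact policy text is not reproduced here. As a stand-in, the sketch below generates a minimal read-only policy of the shape described, granting only `s3:GetObject` on the objects in the named bucket(s). The bucket name `your-bucket-name` is a placeholder, and you should review the output against your own requirements before pasting it into the JSON editor.

```python
import json
from typing import Any, Dict, Iterable


def get_object_policy(bucket_names: Iterable[str]) -> Dict[str, Any]:
    """Build an IAM policy granting s3:GetObject on the given buckets only."""
    resources = [f"arn:aws:s3:::{name}/*" for name in bucket_names]
    if not resources:
        raise ValueError("at least one bucket name is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": resources,
            }
        ],
    }


# Placeholder bucket name for illustration only.
read_policy = get_object_policy(["your-bucket-name"])
print(json.dumps(read_policy, indent=2))
```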
If created through the visual editor, you want a config like the following, with Resources restricted to just the one bucket you wish to share.
Policy Creation Visual Editor
Set any appropriate tags and descriptions, then create the policy.
Back at the "Add Permissions" screen, press the refresh button next to the "Create Policy" button, then select your newly created policy and press Next.
Add Permissions Page With Role Selected
Add a name, description, tags, etc., review the permissions one last time, and create the role. After creation, you should see a green success banner:
Successful Role Creation
If you view that role by clicking the View Role button on the banner, you can copy the role ARN from the center of the screen:
Role Summary Page with ARN
You're done! Reach back out to Aquarium with:
  • the full ARN string you just copied
    • Example: arn:aws:iam::227217811048:role/Aquarium-Customer-Bucket-GetObject-Access-Role
  • The AWS region(s) that contains the buckets you want to grant Aquarium access to
    • Example: us-east-1, us-west-2, ap-northeast-1
  • The S3 Bucket name(s) you want to grant Aquarium access to
    • Example: example-bucket-name, ground-truth-images
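Before sending these details, it can be worth checking that the ARN was copied in full. The sketch below parses a role ARN into its account ID and role name. `parse_role_arn` is a hypothetical helper, and the pattern assumes the standard `aws` partition.

```python
import re
from typing import NamedTuple


class RoleArn(NamedTuple):
    account_id: str
    role_name: str


# Role names may contain letters, digits, and the characters _+=,.@-
# A "/" is allowed as well so that roles created with a path still match.
_ROLE_ARN = re.compile(
    r"arn:aws:iam::(?P<account>\d{12}):role/(?P<name>[\w+=,.@/-]+)"
)


def parse_role_arn(arn: str) -> RoleArn:
    """Split an IAM role ARN into account ID and role name."""
    match = _ROLE_ARN.fullmatch(arn.strip())
    if match is None:
        raise ValueError(f"not a complete IAM role ARN: {arn!r}")
    return RoleArn(match.group("account"), match.group("name"))


parsed_arn = parse_role_arn(
    "arn:aws:iam::227217811048:role/"
    "Aquarium-Customer-Bucket-GetObject-Access-Role"
)
print(parsed_arn.account_id)
print(parsed_arn.role_name)
```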
Within 2-3 business days, your images should be visible in the Aquarium app. And as always, please reach out to us if you have any trouble with this process.