DeepView Familiarity: Dataset Distribution

Compare the distributions of two datasets, e.g. train/test datasets, synthetic/real datasets, etc.

Please see the doc page for a discussion on applying Familiarity to dataset distribution analysis, including what actions can be taken to improve the dataset.

For a more detailed guide on using all of these DeepView components, try the Familiarity for Rare Data Discovery Notebook.

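# Don't run this cell if stochasticity is desired
import numpy as np

# Fix NumPy's random seed so repeated runs give the same results (the value is arbitrary)
np.random.seed(42)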

Optional: Download MobileNet and CIFAR-10

This example uses MobileNet (trained on ImageNet) and CIFAR-10, but feel free to use any other model and dataset. This notebook uses TFModelExamples and TFDatasetExamples to load MobileNet and CIFAR-10. Please see the DeepView docs for information about how to load a model or dataset. The docs also describe how responses can be collected outside of DeepView and passed into Familiarity via a Producer.

# User-Defined Variables #

# Change the following labels to see which labels are more familiar.
#   The example illustrates a comparison between the distributions of the train
#   and test sets for automobiles, for 100 images.
TRAIN_CLASS_LABEL = 'automobile'
TEST_CLASS_LABEL = 'automobile'
# Number of images to draw from each split (100, as described above)
N_SAMPLES = 100

from deepview.processors import ImageResizer, SnapshotSaver
from deepview.base import Batch, PixelFormat, pipeline, ImageFormat
from deepview_tensorflow import TFDatasetExamples, TFModelExamples

# Load CIFAR10 dataset and feed into MobileNet,
# observing responses from layer conv_pw_13
mobilenet = TFModelExamples.MobileNet()
mobilenet_preprocessor = mobilenet.preprocessing
assert mobilenet_preprocessor is not None

# Load CIFAR-10 with train and test datasets, and
# attach metadata (labels, dataset origins, image filepaths) to each batch
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)

# Create pre-processing pipeline
preprocessing_stages = (
    # Save a snapshot of the raw image data to refer back to later
    SnapshotSaver(),

    # Preprocess the image batches in the manner expected by MobileNet
    mobilenet_preprocessor,

    # Resize images to fit the input of MobileNet, (224, 224) using an ImageResizer
    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
)

# Create producers for subsets of the dataset for comparing train / test distribution
# :: Note: The subset method will filter the batch LABELS metadata matching the provided dict
data_producers = {
    'train': cifar10.subset(labels=[TRAIN_CLASS_LABEL], datasets=["train"], max_samples=N_SAMPLES),
    # The test split is filtered by TEST_CLASS_LABEL so the two labels can differ
    'test': cifar10.subset(labels=[TEST_CLASS_LABEL], datasets=["test"], max_samples=N_SAMPLES),
}

Put it all together to produce familiarity scores

For a more detailed breakdown of these steps, see the Familiarity for Rare Data Discovery Notebook.

A. Define user variables

First define some user variables, which can be modified to play around with different classes, or different datasets.

B. Create producers

from deepview.processors import Cacher, Pooler

producers = {
    split: pipeline(
        # Start from the filtered data producer for this split
        data_producers[split],

        # Apply previously-defined preprocessing stages for Mobilenet & CIFAR
        *preprocessing_stages,

        # run inference -- pass a list of requested responses or a single string
        mobilenet.model(requested_responses=['conv_pw_13']),

        # perform spatial max pooling on the result
        Pooler(dim=(1, 2), method=Pooler.Method.MAX),

        # Cache results to re-run the pipeline later without recomputing the responses
        Cacher()
    )
    for split in ('train', 'test')
}
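
To make the pooling step concrete, here is a minimal NumPy sketch of spatial max pooling over dimensions 1 and 2. It is not DeepView code: the `responses` array is a hypothetical stand-in for a batch of conv_pw_13 responses, assumed here to be in NHWC layout with a 7x7 spatial grid and 1024 channels.

```python
import numpy as np

# Hypothetical stand-in for a batch of conv_pw_13 responses (assumed NHWC):
# 4 images, a 7x7 spatial grid, 1024 channels.
rng = np.random.default_rng(0)
responses = rng.normal(size=(4, 7, 7, 1024))

# Taking the max over the two spatial axes (1 and 2) keeps one value per channel,
# so each image is summarized by a single 1024-dimensional vector.
pooled = responses.max(axis=(1, 2))

print(responses.shape, "->", pooled.shape)
```

Under these assumptions each image ends up as one 1024-dimensional vector, which is the dimensionality that the reduction step starts from.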

Reduce dimensionality of responses

from deepview.introspectors import DimensionReduction

# Configure the DimensionReduction Introspector
#    The dimensionality of the data will be reduced from 1024 to 40
n_dim = 40

# Trigger the pipeline & fit the PCA model on the train dataset, which will be used as the base
pca = DimensionReduction.introspect(producers["train"], strategies=DimensionReduction.Strategy.PCA(n_dim))

# Apply the PipelineStage pca object to both train/test pipelines to reduce responses in all batches to a lower dimension
reduced_producers = {
    name: pipeline(producer, pca)
    for name, producer in producers.items()
}
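
The important detail in this step is that the projection is fit on the train responses only and then reused unchanged for the test responses, so both splits live in the same reduced space. A rough NumPy sketch of that idea follows. The helper names `fit_pca` and `apply_pca` and the random data are illustrative assumptions, not part of DeepView, and DeepView's own PCA strategy may differ in details.

```python
import numpy as np


def fit_pca(train: np.ndarray, n_dim: int):
    """Fit PCA on the train data; return the train mean and top n_dim components."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_dim]


def apply_pca(x: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project data with a previously fitted PCA (no refitting)."""
    return (x - mean) @ components.T


# Hypothetical pooled responses: 100 samples per split, 1024 features each.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 1024))
test = rng.normal(size=(100, 1024))

mean, components = fit_pca(train, n_dim=40)
train_reduced = apply_pca(train, mean, components)
test_reduced = apply_pca(test, mean, components)

print(train_reduced.shape, test_reduced.shape)
```

Refitting on the test split instead would give each split its own axes, and the familiarity scores of the two splits would no longer be comparable.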

Build Familiarity model on train data and score both train & test data

from deepview.introspectors import Familiarity

# The Familiarity model is first fit on the base dataset, which is "train" in this case
#   Trigger pipeline & run DeepView Familiarity, default strategy is Familiarity.Strategy.GMM
familiarity = Familiarity.introspect(reduced_producers['train'])

# Use dict-comprehension to apply familiarity to the train and test datasets individually
scored_producers = {
    producer_name: pipeline(
        cached_response_producer,
        familiarity
    )
    # reduced_producers maps 'train'/'test' to the split's reduced producer
    for producer_name, cached_response_producer in reduced_producers.items()
}

Compute familiarity likelihood score

Produce the final familiarity likelihood score.

  • If the likelihood score is close to 0, the two distributions are closely matched under the fitted model.

  • Typically, the test dataset’s mean log score will be lower than the train dataset’s, since familiarity was fit to the train dataset, so the likelihood score (test minus train) is negative. The more negative the score, the larger the distribution gap, and one of the datasets likely needs to be re-collected.

  • The likelihood score can still be greater than 0. This also indicates a distribution gap, and will require analysis and possibly data re-collection.

Please refer to the doc page for more information, and check out the other Familiarity use case, discovering rare samples, or the DatasetReport to evaluate why there is a distribution gap.
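
As an illustration of how this score behaves, the sketch below fits a simple density model on a "train" sample and compares mean log-likelihoods, computed as test minus train. It is only a toy under stated assumptions: a single diagonal Gaussian stands in for the GMM that Familiarity uses by default, the data is synthetic, and the helper names are hypothetical.

```python
import numpy as np


def fit_diag_gaussian(x: np.ndarray):
    """Fit a diagonal Gaussian (a stand-in for the GMM) to the base data."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6


def log_likelihood(x: np.ndarray, mean: np.ndarray, var: np.ndarray) -> np.ndarray:
    """Per-sample log-likelihood under the fitted diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)


rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(500, 5))
test_same = rng.normal(loc=0.0, scale=1.0, size=(500, 5))     # same distribution
test_shifted = rng.normal(loc=3.0, scale=1.0, size=(500, 5))  # shifted distribution

# The model is fit on the train data only, mirroring the steps above.
mean, var = fit_diag_gaussian(train)
train_score = log_likelihood(train, mean, var).mean()

# Score = mean log-likelihood of test minus mean log-likelihood of train.
gap_same = log_likelihood(test_same, mean, var).mean() - train_score
gap_shifted = log_likelihood(test_shifted, mean, var).mean() - train_score

print(f"same distribution:    {gap_same:0.4f}")
print(f"shifted distribution: {gap_shifted:0.4f}")
```

In this toy setup the matched test sample gives a score near 0, while the shifted one gives a strongly negative score, which is the pattern the first two bullets describe.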

from deepview.base import Producer

def compute_score_mean(producer: Producer, response_name: str, meta_key: Batch.DictMetaKey) -> float:
    """ Compute mean of score, for given metadata key, response name, and producer """
    scores = [
        # Familiarity attaches one result per sample to the batch metadata
        batch.metadata[meta_key][response_name][index].score
        for batch in producer(32)
        for index in range(batch.batch_size)
    ]
    return float(np.mean(scores))

# Trigger remaining pipeline, compute mean of familiarity scores for both train and test datasets
stats = {
    producer_name: compute_score_mean(
        producer=producer,
        response_name='conv_pw_13',
        meta_key=familiarity.meta_key
    )
    # scored_producers maps 'train'/'test' to the split's scored producer
    for producer_name, producer in scored_producers.items()
}

familiarity_ratio = stats['test'] - stats['train']
print(f"Likelihood ratio [{TRAIN_CLASS_LABEL}]->[{TEST_CLASS_LABEL}] = {familiarity_ratio:0.4f}")
Likelihood ratio [automobile]->[automobile] = 12.3662
