Dataset Report

Explore a dataset to find rare data samples, duplicate data, annotation errors, or dataset bias. The DatasetReport combines three DeepView dataset introspection algorithms: familiarity, duplicates, and projection.

To explore the dataset in an interactive UI, the Dataset Report results can be fed directly into the Canvas framework, a visualization platform for creating interactive data science components that supports filtering, sorting, and exporting data samples.

For motivation behind the Dataset Report, see Description below.

General Usage

For getting started with DeepView code, please see the how-to pages.

Assuming a pipeline is set up to produce responses from a model, the DatasetReport can be run as follows:

from deepview.introspectors import DatasetReport

producer = ...  # pipeline setup here

# Run DatasetReport on responses from a producer
report = DatasetReport.introspect(producer, batch_size=128)

Introspection is typically performed on intermediate model responses (rather than the final outputs of a network). Here’s a full example on the CIFAR10 dataset that runs the analysis on the outputs of conv_pw_13, the last convolution layer of a MobileNet model:

from deepview.introspectors import DatasetReport
from deepview_tensorflow import TFDatasetExamples, TFModelExamples
from deepview.processors import Cacher, ImageResizer, Pooler
from deepview.base import pipeline

# Load CIFAR10 dataset and feed into MobileNet,
# observing responses from layer conv_pw_13
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)
mobilenet = TFModelExamples.MobileNet()
producer = pipeline(
    cifar10,
    ImageResizer(pixel_format=ImageResizer.Format.HWC, size=(224, 224)),
    mobilenet(requested_responses=['conv_pw_13']),
    Pooler(dim=(1, 2), method=Pooler.Method.MAX),
    Cacher()
)

# Run DatasetReport on intermediate layer conv_pw_13's responses to the data:
report = DatasetReport.introspect(producer)

Visualization

Exploring with Canvas

DeepView’s DatasetReport can also connect with the Canvas UI framework to explore a dataset in a web browser or in a Jupyter notebook. Please see Canvas’s documentation for an example of how to feed the output of DatasetReport.introspect directly into Canvas. Reports created with Canvas are interactive and shareable.

Warning

The current release of Canvas operates only on images, audio, and tabular data. To visualize other data types, run the DeepView side of the DatasetReport, which works on any dataset type, and visualize the results in a custom manner.

The dataset-report extra can be installed alongside DeepView with pip:

pip install "deepview[dataset-report]"

Exploring as Pandas DataFrame

The resulting Dataset Report object has a property, data, which is a pandas DataFrame of all DatasetReport results. Each row represents a data sample, and each column holds a report result, e.g., duplicate set.

report.data

These results can be visualized in a custom manner, but it’s recommended to try Canvas for image, audio, or tabular data.
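
As a rough illustration of custom exploration, the sketch below filters a small stand-in DataFrame the way one might filter report.data. The column names (label, duplicate_set, familiarity) and the values are hypothetical and chosen only for this example, and treating a lower familiarity score as "rarer" is an assumption; check report.data.columns and the API docs for the actual schema.

```python
import pandas as pd

# Stand-in for report.data. Column names and values are hypothetical;
# inspect report.data.columns to see what the real report contains.
data = pd.DataFrame(
    {
        "label": ["cat", "cat", "dog", "dog", "dog"],
        "duplicate_set": [0, 0, 1, 2, 2],
        "familiarity": [-1.2, -1.1, -7.5, -0.9, -1.0],
    }
)

# Size of the duplicate set each sample belongs to.
set_sizes = data["duplicate_set"].map(data["duplicate_set"].value_counts())

# Keep only samples that share a duplicate set with another sample.
duplicates = data[set_sizes > 1]

# Sample with the lowest score (assumed here to mean the rarest sample).
rarest = data.nsmallest(1, "familiarity")

print(duplicates)
print(rarest)
```

The same filters could be applied to report.data after renaming the columns to match the real report.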

Saving and Loading

To save the report, call to_disk() on the report object. To load a saved report, use DatasetReport.from_disk(filepath).

Config Options

Dataset Report’s introspect method has a parameter config that accepts a ReportConfig object. The config can be used to run only a subset of introspectors. For instance, to run only duplicates analysis:

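from deepview.introspectors import DatasetReport, ReportConfig

# Disable projection and familiarity so that only duplicates analysis runs
config = ReportConfig(
    projection=None,
    familiarity=None
)

# Pass the config to introspect (producer is the pipeline set up earlier)
report = DatasetReport.introspect(producer, config=config)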

The strategies used in the underlying algorithms can also be modified via the config. See ReportConfig in the API docs for more details.

Description

Automated dataset diversity analysis often looks at inter-class diversity, i.e., diversity across classes, as defined by metadata labels. Known methods include grouping data by label and performing various statistical analyses to see how evenly the number of data samples or model accuracy is distributed across these labels.
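
The label-grouping approach described above can be sketched with the standard library alone. This is a minimal illustration and not part of DeepView: the label list is made up, and the imbalance ratio is just one simple statistic among many that could be used.

```python
from collections import Counter

# Hypothetical label metadata, one entry per data sample.
labels = ["cat", "dog", "dog", "bird", "dog", "cat", "dog", "dog"]

# Number of samples per label.
counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset held by each label, largest first.
shares = {label: n / total for label, n in counts.most_common()}

# Ratio between the largest and smallest class; 1.0 means perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(shares)
print(imbalance_ratio)
```

A high ratio flags an uneven label distribution, but it says nothing about variation within a single label.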

Intra-class diversity, like fairness within a particular class label, is also important, yet harder to evaluate in an automated fashion. Intra-class diversity analysis is often manual, which doesn’t scale to large datasets. Manual analysis can also make it harder to communicate findings with team members or partners. Because of these problems, sometimes intra-class diversity analysis is skipped altogether.

The Dataset Report aims to automate and simplify the process of analyzing datasets for both inter- and intra-class diversity, in a manner that enables sharing and exploration. With the Canvas framework, it’s possible to build a standalone static report or explore results live in a Jupyter notebook. Canvas also provides centralized filtering, grouping, highlighting, and selection across all widgets, forming a cohesive workspace for dataset exploration.

Example

A Jupyter notebook that demonstrates how to run the Dataset Report on the CIFAR-10 dataset:

Relevant API