.. _data_producers:

==============
Data Producers
==============

DeepView provides several data :class:`Producers <deepview.base.Producer>` that load data and
dispatch batches. All :func:`pipelines <pipeline>` must start with a data producer. (Batches are
only pulled through a :func:`pipeline` when an Introspector's ``introspect()`` method is called.)

The following are the data loaders available out of the box, with links to their API
documentation for more information.

Example Datasets
----------------

:mod:`deepview_tensorflow` has several example datasets, which wrap off-the-shelf datasets from
the `Keras <https://keras.io>`_ library. They are loaded via
:class:`TFDatasetExamples <deepview_tensorflow.TFDatasetExamples>`:

.. code-block:: python

    from deepview_tensorflow import TFDatasetExamples

    cifar10 = TFDatasetExamples.CIFAR10()

Available datasets are:
:class:`CIFAR10 <deepview_tensorflow.TFDatasetExamples.CIFAR10>`,
:class:`CIFAR100 <deepview_tensorflow.TFDatasetExamples.CIFAR100>`,
:class:`MNIST <deepview_tensorflow.TFDatasetExamples.MNIST>`, and
:class:`FashionMNIST <deepview_tensorflow.TFDatasetExamples.FashionMNIST>`.

All example datasets are :class:`TrainTestSplitProducers <deepview.base.TrainTestSplitProducer>`
with ``attach_metadata`` and ``max_samples`` initialization parameters, and :meth:`shuffle` and
:meth:`subset` methods. For instance, the following code loads MNIST, then creates a subset of
100 fives drawn only from the test set:

.. code-block:: python

    from deepview.base import Batch
    from deepview_tensorflow import TFDatasetExamples

    mnist = TFDatasetExamples.MNIST(attach_metadata=True)
    mnist_100_test_fives = mnist.subset(labels=[5], datasets=["test"], max_samples=100)

    # Inspect the batches generated by mnist_100_test_fives:
    for batch in mnist_100_test_fives(batch_size=10):
        imgs = batch.fields["samples"]  # ndarray of shape (batch.batch_size, 28, 28, 1)
        labels = batch.metadata[Batch.StdKeys.LABELS]["label"]  # will be all 5's
        dataset_ids = batch.metadata[Batch.StdKeys.LABELS]["dataset"]  # will be all 1's (test)

In addition, CIFAR10, CIFAR100, and FashionMNIST accept string labels in the :meth:`subset`
method, e.g. to load all foxes from CIFAR100:

.. code-block:: python

    cifar100 = TFDatasetExamples.CIFAR100(label_mode='fine')
    foxes = cifar100.subset(labels=["fox"])

To check which string labels an example dataset supports, the built-in Keras example datasets
provide a :meth:`str_to_label_idx` method.

Data loaders
------------

:class:`ImageProducer <deepview.base.ImageProducer>`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

DeepView provides a helper producer, :class:`ImageProducer <deepview.base.ImageProducer>`, to
load all images from a local directory. By default, it searches recursively through all
subdirectories. For example, if the MNIST dataset is stored locally:

.. code-block:: python

    from deepview.base import ImageProducer

    mnist_dataset = ImageProducer('path/to/mnist/directory')

.. autoclass:: deepview.base.ImageProducer
    :noindex:

:class:`TrainTestSplitProducer <deepview.base.TrainTestSplitProducer>`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. autoclass:: deepview.base.TrainTestSplitProducer
    :noindex:
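Since the example datasets above are themselves ``TrainTestSplitProducer``\ s, the same class
can wrap any dataset that is already split into train and test arrays. The snippet below is a
minimal sketch, assuming a ``split_dataset`` parameter in the Keras ``load_data()`` format of
``((x_train, y_train), (x_test, y_test))``; the parameter names are assumptions, so consult the
class documentation above for the exact signature:

.. code-block:: python

    import numpy as np

    from deepview.base import TrainTestSplitProducer

    # Illustrative stand-in arrays in the Keras load_data() format:
    # ((x_train, y_train), (x_test, y_test))
    x_train = np.random.rand(100, 32, 32, 3).astype(np.float32)
    y_train = np.random.randint(0, 10, size=(100, 1))
    x_test = np.random.rand(20, 32, 32, 3).astype(np.float32)
    y_test = np.random.randint(0, 10, size=(20, 1))

    # `split_dataset` and `attach_metadata` are assumed parameter names here;
    # see the autoclass documentation above for the exact signature.
    producer = TrainTestSplitProducer(
        split_dataset=((x_train, y_train), (x_test, y_test)),
        attach_metadata=True,
    )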
.. _creating_custom_producer:

Writing a custom Producer
-------------------------

To teach DeepView how to load data as batches, create a custom producer by subclassing
:class:`Producer <deepview.base.Producer>`. Here is an example of creating a custom Producer for
the `CIFAR-10 dataset <https://www.cs.toronto.edu/~kriz/cifar.html>`_.

DeepView operates on datasets in batches so that it can handle large-scale datasets without
loading everything into memory at once. Each batch can also carry metadata, such as unique
identifiers and labels, which support data introspectors' analysis and visualization features:

- **Identifier**: a unique identifier for each data sample (in this case, the path to the file)
- **Label**: a dict of (key, value) pairs with any number of labels for each data sample

  - e.g. a :code:`"class"` label: airplane, automobile, etc.
  - e.g. a :code:`"dataset"` label: train vs. test

Follow the comments in the ``Cifar10Producer`` code block below to learn how to create a custom
producer.

.. code-block:: python

    import os
    import typing as t
    from pathlib import Path

    import cv2
    import numpy as np
    from keras.datasets import cifar10

    from deepview.base import Producer, Batch


    class Cifar10Producer(Producer):
        def __init__(self, data_path: str, max_data: int = -1) -> None:
            # Where data will be written to be packaged up with the Dataset Report
            self.data_path = data_path

            # Max data samples to pull from. This is helpful for local debugging.
            self.max_data = max_data

            # Load the entire CIFAR-10 dataset into memory
            (x_train, y_train), (x_test, y_test) = cifar10.load_data()

            # Concatenate train and test into one array, alongside the
            # train/test ("dataset") labels and the class labels
            self.dataset = np.concatenate((x_train, x_test))
            self.dataset_labels = ['train'] * len(x_train) + ['test'] * len(x_test)
            self.class_labels = np.squeeze(np.concatenate((y_train, y_test)))
            self.class_to_name = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                                  'dog', 'frog', 'horse', 'ship', 'truck']

            # A non-positive max_data means "use the whole dataset"
            if self.max_data <= 0:
                self.max_data = len(self.dataset)

        def _class_path(self, index: int) -> str:
            return f"{self.dataset_labels[index]}/{self.class_to_name[int(self.class_labels[index])]}"

        def _write_images_to_disk(self, ii: int, jj: int) -> None:
            for idx in range(ii, jj):
                base_path = os.path.join(self.data_path, self._class_path(idx))
                Path(base_path).mkdir(exist_ok=True, parents=True)
                filename = os.path.join(base_path, f"image{idx}.png")
                # Write to disk after converting to the BGR format used by OpenCV
                cv2.imwrite(filename, cv2.cvtColor(self.dataset[idx, ...], cv2.COLOR_RGB2BGR))

        def __call__(self, batch_size: int) -> t.Iterable[Batch]:
            """The important method: yield batches of data from the downloaded dataset."""
            # Iteratively loop over the data samples and yield them in batches
            for ii in range(0, self.max_data, batch_size):
                jj = min(ii + batch_size, self.max_data)

                # Optional step: write the data locally, since it was loaded from Keras
                self._write_images_to_disk(ii, jj)

                # Create a batch from the data already in memory
                builder = Batch.Builder(
                    fields={"images": self.dataset[ii:jj, ...]}
                )

                # Use the pathname as the identifier for each data sample,
                # excluding the base data directory
                builder.metadata[Batch.StdKeys.IDENTIFIER] = [
                    os.path.join(self._class_path(idx), f"image{idx}.png")
                    for idx in range(ii, jj)
                ]

                # Add class and dataset labels
                builder.metadata[Batch.StdKeys.LABELS] = {
                    "class": [self.class_to_name[int(lbl_idx)]
                              for lbl_idx in self.class_labels[ii:jj]],
                    "dataset": self.dataset_labels[ii:jj]
                }

                yield builder.make_batch()

To use this custom Producer, instantiate it like so:

.. code-block:: python

    cifar10_producer = Cifar10Producer(
        # Where to store the data on disk. For exporting a standalone report,
        # this should be a local (relative) path to the current working directory.
        data_path='./cifar/',

        # This "max data" param is purely for running a notebook quickly.
        # Remove it to run on the whole dataset.
        max_data=1000
    )
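Before wiring ``cifar10_producer`` into a pipeline, it can be handy to pull a batch directly and
verify its fields and metadata. This sketch relies only on the ``__call__`` contract shown above:

.. code-block:: python

    from deepview.base import Batch

    # Pull the first batch straight from the producer to verify its contents.
    for batch in cifar10_producer(batch_size=64):
        images = batch.fields["images"]  # shape (batch.batch_size, 32, 32, 3)
        identifiers = batch.metadata[Batch.StdKeys.IDENTIFIER]  # e.g. "train/frog/image0.png"
        class_labels = batch.metadata[Batch.StdKeys.LABELS]["class"]  # e.g. "frog"
        print(images.shape, identifiers[0], class_labels[0])
        break  # only inspect the first batch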
.. _creating_response_producer:

Producer of Model Responses
^^^^^^^^^^^^^^^^^^^^^^^^^^^

As noted above, a Producer can also produce model responses to feed directly into an
Introspector.

.. code-block:: python

    import typing as t

    import numpy as np

    from deepview.base import Producer, Batch
    from deepview.introspectors import Familiarity


    def function_to_run_model_inference_on_batch(data: np.ndarray,
                                                 response_name: str) -> np.ndarray:
        # Run `data` through the model and extract the `response_name` responses
        ...


    class MyModelResponseProducer(Producer):
        def __init__(self, datafolder: str, response_name: str) -> None:
            self.datafolder = datafolder
            self.response_name = response_name

        def __call__(self, batch_size: int) -> t.Iterable[Batch]:
            # `n_data_samples` stands in for the total number of samples in `datafolder`
            for ii in range(0, n_data_samples, batch_size):
                # Read the next `batch_size` data samples from self.datafolder
                current_data = ...
                responses = function_to_run_model_inference_on_batch(
                    current_data, self.response_name)
                yield Batch({self.response_name: responses})


    my_response_producer = MyModelResponseProducer(datafolder, 'response1')
    familiarity = Familiarity.introspect(my_response_producer)
    ...

This is a good option if model responses have already been generated and saved, the model is in
a format that DeepView does not currently support (e.g., JAX), or the model is hosted in the
cloud and responses will be fetched asynchronously.
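For the first case, where responses were precomputed and saved to disk, a response producer only
has to read the saved arrays and yield them. Below is a minimal sketch assuming one ``.npy``
file per data sample; the directory layout and class name are illustrative, not part of the
DeepView API:

.. code-block:: python

    import typing as t
    from pathlib import Path

    import numpy as np

    from deepview.base import Producer, Batch


    class SavedResponseProducer(Producer):
        """Yields batches of responses previously saved as .npy files (illustrative layout)."""

        def __init__(self, response_dir: str, response_name: str) -> None:
            # One .npy file per data sample, e.g. responses/sample_0001.npy
            self.files = sorted(Path(response_dir).glob("*.npy"))
            self.response_name = response_name

        def __call__(self, batch_size: int) -> t.Iterable[Batch]:
            for ii in range(0, len(self.files), batch_size):
                chunk = self.files[ii:ii + batch_size]
                responses = np.stack([np.load(f) for f in chunk])
                builder = Batch.Builder(fields={self.response_name: responses})
                # Keep the filename as the identifier so results trace back to samples
                builder.metadata[Batch.StdKeys.IDENTIFIER] = [f.name for f in chunk]
                yield builder.make_batch()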