Introspectors API

Data Introspectors

Familiarity

class deepview.introspectors.Familiarity(meta_key, _distributions)[source]

An algorithm that fits a density model to model responses and produces a PipelineStage that can score responses.

Like other introspectors, use Familiarity.introspect to instantiate.

Parameters:
class Strategy[source]

Bundled Familiarity computation strategies. See FamiliarityStrategyType.

class GMM(*, gaussian_count=5, convergence_threshold=0.001, max_iterations=200, covariance_type=GMMCovarianceType.DIAG, _random_state=None)

A FamiliarityStrategyType that fits a mixture of multivariate gaussian distributions on the introspected responses using sklearn.mixture.GaussianMixture.

Parameters:
  • gaussian_count[keyword arg, optional] Number of gaussian distributions to be fitted in the mixture model.

  • convergence_threshold[keyword arg, optional] Convergence threshold used when fitting the mixture model.

  • max_iterations[keyword arg, optional] Maximum number of iterations to use when fitting the mixture model.

  • covariance_type[keyword arg, optional] Covariance type, usually GMMCovarianceType.DIAG or GMMCovarianceType.FULL. See sklearn’s GaussianMixture docs for extra information.

convergence_threshold: float = 0.001

Convergence threshold used when fitting the mixture model.

covariance_type: GMMCovarianceType = 'diag'

Covariance type, usually GMMCovarianceType.DIAG or GMMCovarianceType.FULL. See sklearn’s GaussianMixture docs for extra information.

gaussian_count: int = 5

Number of gaussian distributions to be fitted in the mixture model.

max_iterations: int = 200

Maximum number of iterations to use when fitting the mixture model.

metadata_key: ClassVar = Batch.DictMetaKey(name='GMM')

Key used to view the GMM Familiarity results.

static introspect(producer, *, strategy=None, batch_size=1024)[source]

Examines the producer to fit a model for classifying familiarity of another set of responses.

Parameters:
Returns:

a Familiarity PipelineStage that, when added to a pipeline, scores responses against the familiarity model fit to the input producer and attaches the score as metadata under its meta_key.

Return type:

Familiarity

meta_key: DictMetaKey[FamiliarityResult]

Metadata key used to access the familiarity result (FamiliarityResult). This is accessible via:

Example

results = batch.metadata[familiarity_processor.meta_key]['response_a']
# type of results: t.Sequence[FamiliarityResult]
class deepview.introspectors.FamiliarityStrategyType(*args, **kwargs)[source]

Protocol for a class/function that takes a Producer and produces a per-layer mapping of FamiliarityDistribution.

__call__(producer, batch_size=1024)[source]
Parameters:
  • producer (Producer) – producer of model responses

  • batch_size (int) – [optional] how many data samples to pull through the producer at a time

Returns:

mapping of layer name (field of the input producer) to the FamiliarityDistribution fit for that layer.

Return type:

Mapping[str, FamiliarityDistribution]

metadata_key: ClassVar[DictMetaKey[FamiliarityResult]]

Key that will be used to view the metadata for a particular strategy.

class deepview.introspectors.FamiliarityResult(*args, **kwargs)[source]

Protocol for the result of applying a FamiliarityDistribution to a response.

score: float

Familiarity score.

class deepview.introspectors.GMMCovarianceType(value)[source]

Covariance type to be learnt from data. Typically, use FULL for low dimensional data and DIAG for high dimensional data.

The main problem with FULL in high dimensions is that the algorithm learns dim x dim parameters for each gaussian, and so overfitting or degenerate solutions may be a problem.

The boundary between low and high dimensional data is fuzzy, and the choice of covariance type also depends on the application, data distribution or amount of data available.

A general rule is:

  • If there are concerns about overfitting, for example because little data is available or the dimensionality is high relative to the amount of data, use DIAG. This is typically the case when working with DNN embeddings.

  • Otherwise, use FULL, for example when fitting 2D data.

For more information about covariance types, refer to the sklearn GMM covariances page.
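
To make the dim x dim concern concrete, the following sketch counts the covariance entries learnt across a mixture, using the counts stated above (dim x dim per gaussian for FULL, dim per gaussian for DIAG). The function name is hypothetical and is not part of deepview.

```python
def covariance_parameter_count(dim: int, gaussian_count: int, covariance_type: str) -> int:
    """Count covariance entries learnt across all gaussians in the mixture."""
    if covariance_type == "full":
        per_gaussian = dim * dim  # every entry of the dim x dim matrix
    elif covariance_type == "diag":
        per_gaussian = dim  # only the diagonal entries
    else:
        raise ValueError(f"unknown covariance type: {covariance_type!r}")
    return gaussian_count * per_gaussian


# A 1024-dimensional embedding with the default of 5 gaussians.
full_count = covariance_parameter_count(1024, 5, "full")
diag_count = covariance_parameter_count(1024, 5, "diag")
print(full_count, diag_count)
```

Because a covariance matrix is symmetric, the number of unique FULL entries is smaller than dim x dim, but it still grows quadratically with the dimension, whereas DIAG grows linearly.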

DIAG = 'diag'

Diagonal covariance type, only the diagonal parameters will be learnt from data.

FULL = 'full'

Full covariance type, all dim x dim parameters will be learnt from data.

class deepview.introspectors.FamiliarityDistribution(*args, **kwargs)[source]

The per-response result of FamiliarityStrategyType. An instance of this represents the distribution for a single layer and can evaluate the contents of a response.

compute_familiarity_score(x)[source]

Compute and return the Familiarity score for each data point in x.

Parameters:

x (ndarray) – input data samples to score according to the built distribution

Returns:

Familiarity score for each data sample

Return type:

Sequence[FamiliarityResult]

Dimensionality Reduction

class deepview.introspectors.DimensionReduction(_reducers)[source]

Introspector to reduce dimensionality of Batch fields (usually model responses).

Like other introspectors, use DimensionReduction.introspect to instantiate.

class Strategy[source]

Bundled dimension reduction strategies. See DimensionReductionStrategyType.

The available options are:

  • PCA – an Incremental PCA algorithm from sklearn that can process data incrementally without accumulating the dataset

  • StandardPCA – PCA algorithm from sklearn that requires accumulating the full dataset in memory

  • TSNE – t-SNE algorithm from sklearn that requires accumulating the full dataset in memory

  • UMAP – the UMAP algorithm from umap-learn that requires accumulating the full dataset in memory

  • PaCMAP – the PaCMAP algorithm that requires accumulating the full dataset in memory

class PCA(target_dimensions=2)

Principal Component Analysis based dimension reduction using SKLearn IncrementalPCA.

Note

This does not require reading all of the responses into memory to compute the model. A larger batch size will improve the quality of the fit at the cost of additional memory. The incremental approach produces an approximation of PCA, but is documented to be very close and testing backs this up.

DimensionReduction.Strategy.StandardPCA can be used if exact computation of PCA is necessary.

Parameters:

target_dimensions[optional] Target dimensionality of the data.

target_dimensions: int = 2

Target dimensionality of the data.

class PaCMAP(target_dimensions=2, *, _parameters=None, **kwargs)

Dimension reduction using PaCMAP (Pairwise Controlled Manifold Approximation). PaCMAP can be used for visualization, preserving both local and global structure of the data in the original space.

This dimension reduction strategy requires reading all of the data into memory before producing the projection. Typically the input data should be reduced from high dimension to low, e.g. 1024 -> 40, before applying PaCMAP.

Parameters:
  • target_dimensions[optional] Target dimensionality of the data.

  • kwargs[optional] Any additional PaCMAP keyword args.

target_dimensions: int = 2

Target dimensionality of the data.

class StandardPCA(target_dimensions=2)

Principal Component Analysis based dimension reduction using SKLearn PCA.

This dimension reduction strategy requires reading all of the data into memory before producing the projection.

DimensionReduction.Strategy.PCA is preferred for its lower memory use.

Parameters:

target_dimensions[optional] Target dimensionality of the data.

target_dimensions: int = 2

Target dimensionality of the data.

class TSNE(target_dimensions=2, *, _parameters=None, **kwargs)

t-distributed Stochastic Neighbor Embedding (t-SNE) using SKLearn t-SNE.

This dimension reduction strategy requires reading all of the data into memory before producing the projection. Typically the input data should be reduced from high dimension to low, e.g. 1024 -> 40, before applying t-SNE.

Parameters:
  • target_dimensions[optional] Target dimensionality of the data.

  • kwargs[optional] Any additional SKLearn t-SNE args.

target_dimensions: int = 2

Target dimensionality of the data.

class UMAP(target_dimensions=2, *, _parameters=None, **kwargs)

UMAP based dimension reduction using umap-learn (https://umap-learn.readthedocs.io).

This dimension reduction strategy requires reading all of the data into memory before producing the projection. Typically the input data should be reduced from high dimension to low, e.g. 1024 -> 40, before applying UMAP.

Parameters:
  • target_dimensions[optional] Target dimensionality of the data.

  • kwargs[optional] Any additional umap-learn args.

Raises:

DeepViewException – if a layer’s response shape does not have exactly 2 dimensions.

target_dimensions: int = 2

The dimension of the space to embed into. This defaults to 2 to provide straightforward visualization, but can reasonably be set to any integer value in the range 2 to 100. (from https://umap-learn.readthedocs.io)

static introspect(producer, *, strategies, batch_size=None)[source]

Perform dimension reduction using training data generated by producer, and return a DimensionReduction PipelineStage that can perform dimensionality reduction in a pipeline.

The producer must produce 1d vectors, e.g. the Batch will be of dimension BxN. See Flattener or Pooler if multi-dimensional data is used.

Note: some strategies will need to read all of the response data into memory to fit their model. Currently only the PCA algorithm runs in a streaming fashion.

Parameters:
  • producer (Producer) – the source of data to train the strategies on

  • strategies (DimensionReductionStrategyType | Mapping[str, DimensionReductionStrategyType]) – [keyword arg] which dimension reduction strategy to use, or a mapping from field name to strategy (for running a different dimension reduction per layer).

  • batch_size (int | None) – [keyword arg, optional] size of batch to read out – this must be >= the target dimension. For some strategies like PCA, this will improve the quality of the dimension reduction. The default value will select the batch_size automatically.

Raises:
Return type:

DimensionReduction

OneOrManyDimStrategies

alias of Union[DimensionReductionStrategyType, Mapping[str, DimensionReductionStrategyType]]

class deepview.introspectors.DimensionReductionStrategyType(*args, **kwargs)[source]

Strategy for performing dimension reduction on a single layer. This is initialized with the target dimensions.

The fit_incremental() method is called repeatedly, once for each batch that is processed. When all data has been visited, the fit_complete() method is called. Algorithms that require the full data set in memory may collect values on each fit_incremental() call and then combine and process them in fit_complete().

transform() is used to transform high dimensional data into the target dimensions.
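
The call order described above can be sketched with a toy reducer. This is an illustration only: TopVarianceReducer is a hypothetical class, it works on plain Python lists rather than ndarray to stay dependency-free, and it does not implement the full protocol (for example check_batch_size or is_one_shot).

```python
from typing import List, Sequence


class TopVarianceReducer:
    """Hypothetical reducer: keeps the columns with the highest variance."""

    def __init__(self, target_dimensions: int = 2) -> None:
        self.target_dimensions = target_dimensions
        self._rows: List[Sequence[float]] = []
        self._kept_columns: List[int] = []

    def fit_incremental(self, data: Sequence[Sequence[float]]) -> None:
        # This toy algorithm needs the full data set, so just collect each batch.
        self._rows.extend(data)

    def fit_complete(self) -> None:
        # All batches seen: compute per-column variance and pick the top columns.
        count = len(self._rows)
        width = len(self._rows[0])
        variances = []
        for col in range(width):
            values = [row[col] for row in self._rows]
            mean = sum(values) / count
            variances.append(sum((v - mean) ** 2 for v in values) / count)
        ranked = sorted(range(width), key=lambda c: variances[c], reverse=True)
        self._kept_columns = sorted(ranked[: self.target_dimensions])
        self._rows = []  # collected data is no longer needed

    def transform(self, data: Sequence[Sequence[float]]) -> List[List[float]]:
        # Project new rows onto the columns chosen during fitting.
        return [[row[c] for c in self._kept_columns] for row in data]


reducer = TopVarianceReducer(target_dimensions=2)
batches = [
    [[0.0, 1.0, 10.0], [0.1, 2.0, 20.0]],
    [[0.0, 3.0, 30.0], [0.1, 4.0, 40.0]],
]
for batch in batches:
    reducer.fit_incremental(batch)  # called once per batch
reducer.fit_complete()  # called once, after all batches
reduced = reducer.transform([[0.5, 5.0, 50.0]])
print(reduced)
```

The first column barely varies, so it is the one dropped when reducing three columns to two.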

check_batch_size(batch_size)[source]

Validate the batch_size and throw an error if there is an issue.

Parameters:

batch_size (int) – batch size to validate

Return type:

None

default_batch_size()[source]

Compute the default batch size.

Return type:

int

fit_complete()[source]

Called when all fit data has been passed.

Return type:

None

fit_incremental(data)[source]

Fit the reducer to the incremental data

Parameters:

data (ndarray) – data to fit the reducer to

Return type:

None

property is_one_shot: bool

Returns True if this can transform input data via transform(), or if the entire input data set is transformed at once via transform_one_shot().

property target_dimensions: int

How many dimensions this is reducing to.

transform(data)[source]

Transform the given high dimensional data into the target dimensions. See is_one_shot().

Parameters:

data (ndarray) – data to transform

Return type:

ndarray

transform_one_shot()[source]

Returns the input data transformed per the reducer. See is_one_shot().

Return type:

ndarray

Duplicates

class deepview.introspectors.Duplicates(results, count)[source]

Introspector for finding duplicate data in a Producer. This uses an approximate nearest neighbor algorithm to build clusters of nearby samples, Duplicates.DuplicateSetCandidate. By default, it uses the Annoy (Approximate Nearest Neighbors Oh Yeah) library.

Like other introspectors, use Duplicates.introspect to instantiate.

Parameters:
class DuplicateSetCandidate(std, mean, projection, indices, batch)[source]
Parameters:
batch: Batch

Set of data in Batch form, which are duplicate candidates.

indices: Sequence[int]

Indices of the elements in the cluster from the original producer.

mean: float

Mean of the distance to the centroid from each of the points in the cluster.

projection: ndarray | None

Optional 2-d projection of the data – this can be displayed to show the relationship between the samples. The order corresponds to the order in the batch.

property size: int

Size of the cluster.

std: float

Std. deviation of the distance to the centroid from each of the points in the cluster.
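
As a rough illustration of the mean and std fields, the sketch below computes the distance from each point in a cluster to the cluster centroid and summarizes those distances. Euclidean distance and the population standard deviation are assumptions made for this example; the function name is hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def centroid_distance_stats(points: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean, std) of each point's Euclidean distance to the centroid."""
    dims = len(points[0])
    centroid = [mean(p[d] for p in points) for d in range(dims)]
    distances: List[float] = [math.dist(p, centroid) for p in points]
    # Population standard deviation is an assumption made for this sketch.
    return mean(distances), pstdev(distances)


# Four corners of a square: every point is equally far from the centroid.
cluster = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
cluster_mean, cluster_std = centroid_distance_stats(cluster)
print(cluster_mean, cluster_std)
```

A small mean with a small std suggests a tight cluster, which is why sorting clusters by mean is a reasonable way to surface the strongest duplicate candidates first.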

class KNNStrategy[source]

Bundled K Nearest Neighbours computation strategies. See DuplicatesStrategyType.

class KNNAnnoy

Strategy for computing duplicates using the Annoy library.

class KNNFaiss

Strategy for computing duplicates using the FAISS library.

class ThresholdStrategy[source]
class Percentile(percentile)

Strategy that determines the closeness threshold by taking the nth percentile distance number in the sorted distances. For example a value of 98.5 would use a threshold such that 98.5% of the points were not considered close.

Parameters:

percentile – n_th percentile to use for “closeness” in the sorted distances

percentile: float

n_th percentile to use for “closeness” in the sorted distances
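
One reading of the percentile description can be sketched as follows: sort the distances in ascending order and pick the value below which roughly (100 - percentile)% of them fall, so that percentile% are not considered close. The function name, the index rounding, and the strict "less than" comparison are assumptions for illustration, not the library's implementation.

```python
from typing import Sequence


def percentile_threshold(distances: Sequence[float], percentile: float) -> float:
    """Distance below which roughly (100 - percentile)% of the distances fall."""
    ordered = sorted(distances)
    index = int(len(ordered) * (100.0 - percentile) / 100.0)
    index = min(index, len(ordered) - 1)  # guard against percentile == 0
    return ordered[index]


distances = [float(i) for i in range(200)]
threshold = percentile_threshold(distances, 98.5)
# Assumption: a point is "close" when its distance is strictly below the threshold.
close = [d for d in distances if d < threshold]
print(threshold, len(close))
```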

class Slope(sensitivity=5)

Given an array of distances, find the “close” threshold – the distance where points are close to each other.

This strategy determines the closeness threshold dynamically using a sensitivity value. A lower sensitivity (down to 2) will consider more items to be close (less sensitive to the curve of distances). A value of 5 will use a sliding window 1/5 the size of the distance array (related to the size of the dataset) and is a good default. A sensitivity of 20 will use a window 1/20 the size of the distance array and is a reasonable large value.

The distances are likely a sharp up-slope followed by an elbow and finally a long, possibly rising, tail. The target delta will be computed from the difference between the 25th and 75th percentile values. A sliding window will be run over the data with a size of len(distances) // sensitivity to find when the delta in the window exceeds the middle delta. This will approximate the tail end of the elbow.

This returns the threshold value and the index into the distances array where it was found.

Parameters:

sensitivity[optional] a lower value considers more items to be close, a larger value considers fewer items to be close.

Raises:

ValueError – if sensitivity <=2

sensitivity: int = 5

A lower value considers more items to be close, a larger value considers fewer items to be close.

count: int

Number of elements in the producer.

static introspect(producer, *, batch_size=32, strategy=None, threshold=None)[source]

Uses an approximate nearest neighbor to build a distance matrix for all samples and build clusters from the closest samples.

Although this works on data of any dimension, the performance is linear in the number of samples in the producer AND the number of dimensions. Consider using DimensionReduction to reduce the number of dimensions before detecting duplicates – if the dimensions are already being reduced for Familiarity, the same can be used here, otherwise a reduction to 40 still gives good results.

The data from the producer is L2 normalized per-column – this will help keep one column from dominating the distance metric. See also this explanation about how and why this is done.
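
A minimal sketch of per-column L2 normalization, written with plain lists so it runs without dependencies. The function name and the handling of all-zero columns are assumptions for illustration.

```python
import math
from typing import List, Sequence


def l2_normalize_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every column so its L2 norm is 1 (all-zero columns are left unchanged)."""
    width = len(rows[0])
    norms = [math.sqrt(sum(row[c] ** 2 for row in rows)) for c in range(width)]
    return [
        [row[c] / norms[c] if norms[c] > 0 else row[c] for c in range(width)]
        for row in rows
    ]


# The second column is 100x larger and would dominate a Euclidean distance.
samples = [[3.0, 300.0], [4.0, 400.0]]
normalized = l2_normalize_columns(samples)
print(normalized)
```

After normalization both columns contribute on the same scale, even though the raw values differ by two orders of magnitude.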

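from deepview.introspectors import Duplicates

producer = ...  # any Producer of responses (elided in the original example)
duplicates = Duplicates.introspect(producer)

# results maps each response name to its candidate duplicate clusters
for response_name, clusters in duplicates.results.items():
    # sort by the mean distance to the centroid
    clusters = sorted(clusters, key=lambda x: x.mean)
    ...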
Parameters:
  • producer (Producer) – producer of data

  • batch_size (int) – [optional] size of batch to read while collecting data from the producer

  • strategy (DuplicatesStrategyType | None) – [optional] strategy to use for finding the nearest neighbors. Default is KNNAnnoy

  • threshold (DuplicatesThresholdStrategyType | None) – [optional] strategy to use for finding the distance between points that are considered duplicates. Default is Slope threshold.

Returns:

Duplicates, which contains candidate duplicates for each response name

Return type:

Duplicates

results: Mapping[str, Sequence[DuplicateSetCandidate]]

Mapping from response name to a list of candidate duplicates.

class deepview.introspectors.DuplicatesStrategyType(*args, **kwargs)[source]

Protocol for code that takes an array of vectors (embeddings) and computes a list of duplicates for each point.

__call__(vectors, threshold)[source]

Given the sorted distances compute the list of duplicates for each point.

Parameters:
Return type:

Sequence[Sequence[int]]

class deepview.introspectors.DuplicatesThresholdStrategyType(*args, **kwargs)[source]

Protocol for code that takes a sorted array of distances and computes a duplicate threshold – how close do two points need to be to be considered duplicates.

__call__(distances)[source]

Given the sorted distances compute the distance threshold for duplicates.

Parameters:

distances (ndarray) – sorted distances

Return type:

float

Dataset Report

class deepview.introspectors.DatasetReport(data, _report_save_data_path=PosixPath('report_save_data.pkl'))[source]

A report built to inspect a dataset for a given model from the perspective of fairness.

Like other introspectors, use DatasetReport.introspect to instantiate, or load a saved report using DatasetReport.from_disk.

This report is particularly useful for introspecting datasets that have various class labels attached. See overall DatasetReport page in docs to learn more.

The following components can be run (default to all), configured using a ReportConfig:

  • Summarize overall dataset, including by metadata labels, if they exist

  • Find near duplicate data samples, see Duplicates

  • Find most / least representative data overall and per metadata label, see Familiarity

  • Project the data down to visualize overall in a 2D scatterplot

The input Producer to this class’s instantiation is expected to have fields of model responses (likely a layer towards the end of the model but not the last response). These responses can come either from loading data and running it through a DeepView Model, or by loading the responses directly from file into a Producer. In each Batch's metadata, this report looks for identifiers and optional labels attached as metadata using Batch.StdKeys.IDENTIFIER and Batch.StdKeys.LABELS metadata keys.

Note

For the moment, the Batch.StdKeys.IDENTIFIER should be a path to the image data.

This class creates a pandas.DataFrame full of the data needed to build the UI for the DatasetReport, which can then be exported into a standalone static site to explore. The different components built in the UI interact with each other.

# Build all components of the dataset report using default configuration.
#    This output can then be used to visualize the results with Canvas:
#        (1) as a standalone web dashboard to explore interactively
#        (2) inline in a Jupyter notebook to explore interactively
#    Please see the Canvas documentation for an example:
#    https://satishlokkoju.github.io/deepview/
report = DatasetReport.introspect(producer)
Parameters:

data – do not instantiate DatasetReport directly, use DatasetReport.introspect

data: DataFrame

pandas.DataFrame of introspection results for responses and report components

static from_disk(directory)[source]

Create DatasetReport object from a report save directory

Parameters:

directory (str | Path) – path of directory where report file has been saved

Returns:

Instance of DatasetReport

Return type:

DatasetReport

static introspect(producer, *, config=None, batch_size=1024)[source]

Build relevant DatasetReport components from input Producer.

Parameters:
  • producer (Producer) – response producer (separate caching not needed as responses are cached in this function)

  • config (ReportConfig | None) – [keyword arg, optional] ReportConfig. Set components to None to omit them from report.

  • batch_size (int) – [keyword arg, optional] number of samples to batch at once

Returns:

a DatasetReport whose results can be exported into different formats

Return type:

DatasetReport

to_disk(directory='./report_save', *, overwrite=False)[source]

Save Dataset Report’s data to a directory, to avoid running introspect again when visualizing or sharing results.

Parameters:
  • directory (str | Path) – [optional] directory to save data within

  • overwrite (bool) – [keyword arg, optional] True to overwrite any existing report save files in this directory

Return type:

None

class deepview.introspectors.ReportConfig(projection=<factory>, duplicates=<factory>, familiarity=<factory>, dim_reduction=None, split_familiarity_min=50)[source]

Configuration for which components to build into the DatasetReport, and what strategies to use to build those components. Default config corresponds to running all components with default strategies (projection, duplicates, and familiarity).

When running familiarity, “split” familiarity is also run, which means that a familiarity model is built for each label, for each label category, and then that subgroup of data is evaluated according to the model.

Parameters:
dim_reduction: DimensionReductionStrategyType | Mapping[str, DimensionReductionStrategyType] | None = None

If None, default to DimensionReduction.Strategy.PCA before running familiarity, duplicates, and/or projection. Else provide a DimensionReduction.Strategy.

duplicates: DuplicatesThresholdStrategyType | None

Skip Duplicates if None, else Duplicates.ThresholdStrategy (default is Slope).

familiarity: FamiliarityStrategyType | None

Skip Familiarity if None, else provide Familiarity.Strategy to apply to overall and split familiarity.

property n_stages: int

How many stages of multi introspect need to be run (not counting stub introspectors).

projection: DimensionReductionStrategyType | Mapping[str, DimensionReductionStrategyType] | None

Skip projection if None, else provide a DimensionReduction.Strategy that projects down to 2 dimensions, for visualization (default is DimensionReduction.Strategy.UMAP).

split_familiarity_min: int = 50

If running Familiarity, min data that must exist per-label for fitting individual models to subgroups of data determined by label (“split” familiarity).
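
The effect of split_familiarity_min can be pictured as a filter over label groups: only labels with enough samples get their own model. The sketch below is an assumption-laden illustration (hypothetical function name, and whether the boundary is inclusive is a guess), not the report's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def groups_large_enough(labels: Sequence[str], minimum: int = 50) -> Dict[str, List[int]]:
    """Map each label to its sample indices, keeping labels with at least `minimum` samples."""
    indices_by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)
    # Assumption: a group with exactly `minimum` samples is still eligible.
    return {
        label: indices
        for label, indices in indices_by_label.items()
        if len(indices) >= minimum
    }


labels = ["cat"] * 60 + ["dog"] * 10
eligible = groups_large_enough(labels, minimum=50)
print(sorted(eligible))
```

Here only "cat" has enough samples for a per-label model; "dog" would be skipped for split familiarity.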

property use_dim_reduction: bool

True if overall dimension reduction needs to be run

Model Introspectors

Principal Filter Analysis

class deepview.introspectors.PFA(failed_responses, _covariance_result_by_response)[source]

Like other introspectors, use PFA.introspect to instantiate.

Use PFA to discover highly correlated filter, or more generically unit, responses within layers of a neural network. Exploit data to guide network compression in order to decrease inference time and memory footprint while improving generalization. See the DeepView docs for more information.

Parameters:

failed_responses – do not instantiate PFA directly, use PFA.introspect

class Strategy[source]

Bundled PFA strategies. To implement a custom strategy, see PFAStrategyType.

class Energy(energy_threshold, min_kept_count=0)

Energy strategy for generating PFA recipes – this targets a given energy_threshold to keep.

Parameters:
  • energy_threshold – The spectral energy to keep

  • min_kept_count[optional] The minimum number of outputs to keep per response

energy_threshold: float

The spectral energy to keep

min_kept_count: int = 0

The minimum number of outputs to keep per response
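
A minimal sketch of how an energy threshold might translate into a unit count, assuming "spectral energy" means the share of the eigenvalue sum captured by the largest eigenvalues. The function name and the exact stopping rule are assumptions for illustration, not the library's implementation.

```python
from typing import Sequence


def units_to_keep_for_energy(
    eigenvalues: Sequence[float], energy_threshold: float, min_kept_count: int = 0
) -> int:
    """Smallest number of leading eigenvalues whose share of the total reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    kept, running = 0, 0.0
    for value in ordered:
        # Stop as soon as the kept eigenvalues already reach the requested share.
        if total > 0 and running / total >= energy_threshold:
            break
        running += value
        kept += 1
    # Respect the minimum, but never report more units than exist.
    return min(len(ordered), max(kept, min_kept_count))


eigenvalues = [6.0, 3.0, 0.5, 0.5]
keep_90 = units_to_keep_for_energy(eigenvalues, 0.9)
keep_50 = units_to_keep_for_energy(eigenvalues, 0.5)
keep_min = units_to_keep_for_energy(eigenvalues, 0.5, min_kept_count=3)
print(keep_90, keep_50, keep_min)
```

With these eigenvalues, the two largest already carry 90% of the total, so a 0.9 threshold keeps two units, while min_kept_count can force a larger count than the threshold alone would give.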

class KL(interpolation_function=None)

KL strategy for generating PFA recipes.

Parameters:

interpolation_function[optional] the interpolation function to use, see KLInterpolationFunction.

class KLInterpolationFunction(*args, **kwargs)

A protocol to map a KL divergence to the ratio of the number of units in the layer. The KL divergence is that between the distribution of eigenvalues of the covariance matrix of model responses and the uniform distribution.

class LinearInterpolation(*args, **kwargs)

A concrete KLInterpolationFunction that performs its mapping by linearly interpolating [kl_divergence, max_kl_divergence] to [0, 1].
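
The idea can be sketched as follows. The eigenvalues are normalized into a distribution, its KL divergence to the uniform distribution is computed, and that value is mapped linearly to a ratio. The direction of the mapping (a KL of 0 gives a ratio of 1, the maximum KL gives 0) and the use of log(n) as the maximum are assumptions for this example; both function names are hypothetical.

```python
import math
from typing import Sequence


def kl_to_uniform(eigenvalues: Sequence[float]) -> float:
    """KL divergence between the normalized eigenvalues and the uniform distribution."""
    total = sum(eigenvalues)
    n = len(eigenvalues)
    kl = 0.0
    for value in eigenvalues:
        p = value / total
        if p > 0:  # 0 * log(0) is taken as 0
            kl += p * math.log(p * n)  # p * log(p / (1 / n))
    return kl


def linear_units_ratio(kl_divergence: float, max_kl_divergence: float) -> float:
    """Assumed mapping: KL of 0 keeps every unit (1.0), the maximum KL keeps none (0.0)."""
    return 1.0 - kl_divergence / max_kl_divergence


max_kl = math.log(4)  # reached when a single eigenvalue carries all the energy
uniform_ratio = linear_units_ratio(kl_to_uniform([1.0, 1.0, 1.0, 1.0]), max_kl)
peaked_ratio = linear_units_ratio(kl_to_uniform([4.0, 0.0, 0.0, 0.0]), max_kl)
print(uniform_ratio, peaked_ratio)
```

Evenly spread eigenvalues indicate uncorrelated units, so the ratio stays at 1; a single dominant eigenvalue indicates strong correlation, so the ratio drops toward 0.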

class Size(relative_size, min_kept_count=0, epsilon_energy=1e-08)

Size strategy for generating PFA recipes – this targets a given relative_size to produce a cross-layer energy threshold that will produce that result.

Parameters:
  • relative_size – The relative amount of channels to keep (in 0..1)

  • min_kept_count[optional] The minimum number of outputs to keep per response

  • epsilon_energy[optional] Minimum level of energy

epsilon_energy: float = 1e-08

Minimum level of energy

min_kept_count: int = 0

The minimum number of outputs to keep per response

relative_size: float

The relative amount of channels to keep (in 0..1)

class UnitSelectionStrategy[source]

Strategy for selecting the maximally correlated units. To implement a custom strategy, see PFAUnitSelectionStrategyType.

class AbsMax

Given a correlation matrix, choose units based on the one with the greatest coefficient

direction: Callable[[ndarray], ndarray]

Direction of selection

distance: _DirectionalDistanceCalculation

Distance function

class AbsMin

Given a correlation matrix, choose units based on the one with the lowest coefficient

direction: Callable[[ndarray], ndarray]

Direction of selection

distance: _DirectionalDistanceCalculation

Distance function

class L1Max

Given a correlation matrix, choose units based on the one with the greatest L1 norm

direction: Callable[[ndarray], ndarray]

Direction of selection

distance: _DirectionalDistanceCalculation

Distance function

class L1Min

Given a correlation matrix, choose units based on the one with the lowest L1 norm

direction: Callable[[ndarray], ndarray]

Direction of selection

distance: _DirectionalDistanceCalculation

Distance function

classmethod get_algos()[source]

Returns: a list of all registered unit selection strategies.

Return type:

Iterable[PFAUnitSelectionStrategyType]

class VisType[source]

Type of visualization modality for PFA, available to visualize via PFA.show()

CHART: Final = 'chart'

Chart comparing recommended vs. original unit counts per layer

TABLE: Final = 'table'

Table of all PFA result data

failed_responses: Sequence[str]

The names of any responses that failed to generate output. This is caused by layers with insufficient data to support the analysis.

get_recipe(*, strategy=None, unit_strategy=None)[source]

Generate a recipe using the given algorithm and unit strategy. For more information refer to the PFA documentation page.

Parameters:
Returns:

a mapping from response name to PFARecipe for the given algorithm and unit strategy.

Return type:

Mapping[str, PFARecipe]

static introspect(producer, *, batch_size=32, epsilon_inactive=1e-08)[source]

Perform Principal Filter Analysis on the responses (fields) generated by the producer.

Caution

The responses generated by producer are assumed to be 2D (Batch x C). Thus it might be necessary to pipeline together the Producer with a Processor (e.g., Pooler), that transforms each individual response from multi-dimensional to mono-dimensional.

Parameters:
  • producer (Producer) – The producer of the responses (in fields) to be analyzed

  • batch_size (int) – [keyword arg, optional] the batch size to use when consuming the responses (via batch.fields)

  • epsilon_inactive (float) – [keyword arg, optional] factor used to identify inactive units (whose var < epsilon_inactive * np.max(var))

Returns:

an instance of PFA that can generate PFARecipes using a PFAStrategyType (e.g., PFA.Strategy.KL).

Return type:

PFA
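
The epsilon_inactive rule above (var < epsilon_inactive * np.max(var)) can be illustrated with plain Python. The sketch assumes the per-unit variances are already available as a list; the function name is hypothetical.

```python
from typing import List, Sequence


def inactive_unit_indices(
    variances: Sequence[float], epsilon_inactive: float = 1e-08
) -> List[int]:
    """Indices of units whose variance is below epsilon_inactive * max(variances)."""
    cutoff = epsilon_inactive * max(variances)
    return [index for index, var in enumerate(variances) if var < cutoff]


# Units 1 and 2 barely respond compared with the most active unit.
variances = [2.0, 0.0, 1e-12, 0.5]
inactive = inactive_unit_indices(variances)
print(inactive)
```

Because the cutoff is relative to the largest variance, the same epsilon_inactive works across layers whose responses have very different scales.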

static show(recipe_result, *, vis_type='table', include_columns=None, exclude_columns=None)[source]

Create a table or chart to visualize PFA results in an IPython / Jupyter notebook.

Note: Requires pandas (vis_type is PFA.VisType.TABLE) or matplotlib (vis_type is PFA.VisType.CHART), which can be installed with pip install "deepview[notebook]"

Parameters:
  • recipe_result (Mapping[str, PFARecipe] | Collection[Mapping[str, PFARecipe]]) – result of pfa.get_recipe, mapping of layer to PFARecipe. When plotting for vis_type PFA.VisType.TABLE, a sequence of t.Mapping[str, PFARecipe] can be passed in to compare multiple results.

  • vis_type (str) – [keyword arg, optional] determines visualization type. PFA.VisType.TABLE for pandas dataframe result or PFA.VisType.CHART for matplotlib pyplot of recommended vs. original unit counts

  • include_columns (Sequence[str] | None) – [keyword arg, optional] For vis_type as PFA.VisType.TABLE only. If included, only return pandas.DataFrame with these columns. Defaults to include all columns (value None). Options are: [layer name, original count, recommended count, units to keep, KL divergence, PFA strategy, units ratio, kept energy].

  • exclude_columns (Sequence[str] | None) – [keyword arg, optional] For vis_type as PFA.VisType.TABLE only. If included, return pandas.DataFrame without these columns (irrelevant if include_columns is specified). Defaults to None. Options are: [layer name, original count, recommended count, units to keep, KL divergence, PFA strategy, units ratio, kept energy].

Returns:

pandas.DataFrame or matplotlib.axes.Axes of PFA results from input recipe_result

Return type:

Axes | DataFrame

class deepview.introspectors.PFAKLDiagnostics(kl_divergence, units_ratio)[source]

Diagnostic information for PFA.Strategy.KL

Parameters:
kl_divergence: float

Computed Kullback-Leibler (KL) divergence

units_ratio: float

Indicates how much of a layer is deemed uncorrelated based on the KL divergence found.

class deepview.introspectors.PFAEnergyDiagnostics(total_kept_energy)[source]

Diagnostic information for PFA.Strategy.Energy

Parameters:

total_kept_energy (float) – see total_kept_energy

total_kept_energy: float

Energy remaining after recommended compression

class deepview.introspectors.PFARecipe(original_output_count, recommended_output_count, maximally_correlated_units, number_inactive_units, diagnostics)[source]

Recommendation about a specific model response. This will likely never be instantiated directly, and instead an instance will be returned from pfa.get_recipe.

Parameters:
diagnostics: PFAKLDiagnostics | PFAEnergyDiagnostics | None

Per algorithm diagnostic information

maximally_correlated_units: Sequence[int]

Maximally correlated units found with this recommendation.

number_inactive_units: int

Number of inactive units. If maximally_correlated_units exists then the first number_inactive_units are the units selected due to inactivity.

original_output_count: int

Original length of the response.

recommended_output_count: int

Recommended length of the response.

class deepview.introspectors.PFAUnitSelectionStrategyType(*args, **kwargs)[source]

Given a correlation matrix and a number of units to keep, choose which units are maximally correlated.

__call__(covariances, *, num_units_to_keep)[source]
Parameters:
  • covariances (PFACovariancesResult) – the covariance data for the layer

  • num_units_to_keep (int) – [keyword arg, optional] number of recommended units to be kept

Returns:

numpy.ndarray with the indices of the maximally correlated units (the first part of the list contains the indices of the inactive units). The number of inactive units can be found in covariances.inactive_units.shape[0]

Return type:

ndarray
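A callable with this shape might look like the sketch below. It is not one of deepview's bundled strategies: the greedy rule, the name greedy_selection_sketch, and the SimpleNamespace stand-in for PFACovariancesResult are all assumptions. It also assumes that the returned indices are the units proposed for removal and that num_units_to_keep counts the active units left afterwards.

```python
from types import SimpleNamespace

import numpy as np


def greedy_selection_sketch(covariances, *, num_units_to_keep):
    """Hypothetical selection: inactive units first, then highly correlated ones."""
    cov = np.asarray(covariances.covariances, dtype=float)
    inactive = [int(i) for i in covariances.inactive_units]
    inactive_set = set(inactive)
    active = [i for i in range(cov.shape[0]) if i not in inactive_set]
    selected = []
    # Assumption: select units until only num_units_to_keep active units remain.
    while len(active) > max(num_units_to_keep, 0):
        sub = cov[np.ix_(active, active)]
        std = np.sqrt(np.diag(sub))  # active units are assumed to have variance > 0
        corr = np.abs(sub / np.outer(std, std))
        np.fill_diagonal(corr, 0.0)
        # Pick the unit with the largest total absolute correlation to the others.
        position = int(np.argmax(corr.sum(axis=1)))
        selected.append(active.pop(position))
    return np.array(inactive + selected, dtype=int)


# Stand-in for a PFACovariancesResult: unit 3 never varies, units 0 and 1 overlap.
layer = SimpleNamespace(
    covariances=np.array(
        [
            [1.0, 0.9, 0.0, 0.0],
            [0.9, 1.0, 0.1, 0.0],
            [0.0, 0.1, 1.0, 0.0],
            [0.0, 0.0, 0.0, 0.0],
        ]
    ),
    inactive_units=np.array([3]),
)
units = greedy_selection_sketch(layer, num_units_to_keep=2)
```

Here the inactive unit 3 is listed first, followed by unit 1, which is the most correlated with the remaining active units.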

class deepview.introspectors.PFAStrategyType(*args, **kwargs)[source]

Protocol for PFA strategies (PFA.Strategy). These examine the per-layer PFACovariancesResult and produce a per-layer PFARecipe.

Note

This takes all layers and produces a result for each of the layers, but the algorithm operates on each layer independently.

__call__(covariances)[source]
Parameters:

covariances (Mapping[str, PFACovariancesResult]) – mapping from layer name (field name) to PFACovariancesResult for that layer

Return type:

Mapping[str, PFARecipe]
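The per-layer independence described in the note can be sketched as a mapping comprehension. The helper make_per_layer_strategy and the toy per-layer rule are hypothetical; plain lists of eigenvalues stand in for PFACovariancesResult, and plain integers for PFARecipe.

```python
from typing import Callable, Dict, Mapping, TypeVar

LayerInput = TypeVar("LayerInput")
LayerOutput = TypeVar("LayerOutput")


def make_per_layer_strategy(
    per_layer: Callable[[LayerInput], LayerOutput],
) -> Callable[[Mapping[str, LayerInput]], Dict[str, LayerOutput]]:
    """Lift a single-layer function into a strategy over all layers."""

    def strategy(covariances: Mapping[str, LayerInput]) -> Dict[str, LayerOutput]:
        # Each layer is handled on its own; no information crosses layers.
        return {name: per_layer(layer) for name, layer in covariances.items()}

    return strategy


# Toy per-layer rule (an assumption): count strictly positive eigenvalues.
count_positive = make_per_layer_strategy(
    lambda eigenvalues: sum(1 for value in eigenvalues if value > 0)
)
recipes = count_positive({"conv1": [2.0, 0.5, 0.0], "fc1": [1.0, 0.0]})
```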

class deepview.introspectors.PFACovariancesResult(covariances, eigenvalues, eigenvectors, original_output_count, inactive_units)[source]

Encapsulates the results of the covariance calculation

Parameters:
covariances: ndarray

The covariance matrix. This is a two dimensional square array of size original_output_count.

eigenvalues: ndarray

The eigenvalues of the covariances. This is a one dimensional array of size original_output_count.

eigenvectors: ndarray

The eigenvectors of the covariances. This is a two dimensional square array of size original_output_count.

inactive_units: ndarray

The indices of the inactive units

static make_covariance_result(*, covariances, epsilon_inactive=1e-08)[source]

Create an instance of PFACovariancesResult.

Parameters:
  • covariances (ndarray) – [keyword arg] the covariances

  • epsilon_inactive (float) – [keyword arg, optional] factor used to identify inactive units (whose var < epsilon_inactive * np.max(var)).

Return type:

PFACovariancesResult
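The inactive-unit rule quoted above (var < epsilon_inactive * np.max(var)) can be tried out with plain NumPy. The sketch below is not the library's implementation: the function covariance_summary is hypothetical, it returns a dict rather than a PFACovariancesResult, and it assumes the eigendecomposition is taken over the full covariance matrix.

```python
import numpy as np


def covariance_summary(covariances, epsilon_inactive=1e-08):
    """Hypothetical summary mirroring the documented PFACovariancesResult fields."""
    cov = np.asarray(covariances, dtype=float)
    variances = np.diag(cov)
    # Documented rule: a unit is inactive if var < epsilon_inactive * max(var).
    inactive_units = np.flatnonzero(variances < epsilon_inactive * np.max(variances))
    # eigh suits symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    return {
        "covariances": cov,
        "eigenvalues": eigenvalues,
        "eigenvectors": eigenvectors,
        "original_output_count": cov.shape[0],
        "inactive_units": inactive_units,
    }


summary = covariance_summary(
    np.array(
        [
            [2.0, 0.5, 0.0],
            [0.5, 1.0, 0.0],
            [0.0, 0.0, 0.0],  # unit 2 has zero variance
        ]
    )
)
```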

original_output_count: int

The number of features in the response data

Inactive Unit Analysis

class deepview.introspectors.IUA(_layer_counts, _unit_counts, _total_probe_counts)[source]

An introspector that evaluates responses to compute inactive unit statistics.

Like other introspectors, use IUA.introspect to instantiate.

class Result(mean_inactive, std_inactive, inactive, unit_inactive_count, unit_inactive_proportion)[source]

Per-response IUA Result

Parameters:
inactive: Sequence[float]

sequence tracking the number of inactive units in the layer per batch input, used to compute mean_inactive and std_inactive

mean_inactive: float

mean number of inactive units across batch inputs

std_inactive: float

standard deviation in number of inactive units across batch inputs

unit_inactive_count: Sequence[float]

sequence tracking the number of times each unit was inactive across batch inputs

unit_inactive_proportion: Sequence[float]

sequence tracking the proportion of times each unit was inactive across batch inputs
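The fields above can be illustrated on a small array of responses for one layer. This is a sketch under assumptions, not deepview's code: it treats a unit as inactive when its response is close to zero according to numpy.isclose, uses the population standard deviation, and returns a dict whose keys mirror the IUA.Result field names. The function inactive_unit_stats is hypothetical.

```python
import numpy as np


def inactive_unit_stats(responses, rtol=1e-05, atol=1e-08):
    """Compute IUA-style statistics for an array of shape (num_inputs, num_units)."""
    responses = np.asarray(responses, dtype=float)
    # Assumption: "inactive" means the response is approximately zero.
    is_inactive = np.isclose(responses, 0.0, rtol=rtol, atol=atol)
    inactive = is_inactive.sum(axis=1)             # inactive units per input
    unit_inactive_count = is_inactive.sum(axis=0)  # inactive inputs per unit
    return {
        "inactive": inactive.tolist(),
        "mean_inactive": float(inactive.mean()),
        "std_inactive": float(inactive.std()),
        "unit_inactive_count": unit_inactive_count.tolist(),
        "unit_inactive_proportion": (unit_inactive_count / responses.shape[0]).tolist(),
    }


# Four inputs, three units; unit 0 never fires.
stats = inactive_unit_stats(
    [
        [0.0, 1.2, 0.0],
        [0.0, 0.0, 3.4],
        [0.0, 0.7, 0.5],
        [0.0, 2.0, 0.0],
    ]
)
```

A unit_inactive_proportion of 1.0, as for unit 0 here, marks a unit that was inactive for every input seen.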

class VisType[source]

Type of visualization modality for IUA, available to visualize via IUA.show()

CHART: Final = 'chart'

Charts showing inactive units per layer

TABLE: Final = 'table'

Table of all IUA result data

static introspect(producer, *, batch_size=32, rtol=1e-05, atol=1e-08)[source]

Compute inactive unit statistics (mean, standard deviation, counts, and unit frequency) for each layer (field) in the input producer of model responses.

Parameters:
  • producer (Producer) – The producer of the model responses to be introspected

  • batch_size (int) – [keyword arg, optional] number of inputs to pull from producer at a time

  • rtol (float) – [keyword arg, optional] float relative tolerance parameter (see doc for numpy.isclose()).

  • atol (float) – [keyword arg, optional] float absolute tolerance parameter (see doc for numpy.isclose()).

Returns:

an IUA instance that can provide information about inactive units in the model

Return type:

IUA
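When choosing rtol and atol, it may help to recall that numpy.isclose(a, b) tests |a - b| <= atol + rtol * |b|. The sketch below assumes, without confirmation from the source, that responses are compared against zero as the second argument; in that case the rtol term vanishes and only atol changes the outcome.

```python
import numpy as np

values = np.array([0.0, 5e-9, 2e-8, 1e-3])

# Defaults from the introspect signature.
default_mask = np.isclose(values, 0.0, rtol=1e-05, atol=1e-08)

# A much larger rtol changes nothing, because rtol is multiplied by |b| == 0.
loose_rtol_mask = np.isclose(values, 0.0, rtol=0.5, atol=1e-08)

# A larger atol widens the band of values treated as zero.
loose_atol_mask = np.isclose(values, 0.0, rtol=1e-05, atol=1e-02)
```

If that assumption holds, atol is the parameter to adjust when near-zero activations should count as inactive.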

property results: Mapping[str, Result]

A per-layer IUA.Result encapsulating Inactive Unit Analysis results.

static show(iua, *, vis_type='table', response_names=None)[source]

Create a table or chart to visualize IUA results in an IPython / Jupyter notebook.

Note: Requires pandas (vis_type is IUA.VisType.TABLE) or matplotlib (vis_type is IUA.VisType.CHART), which can be installed with pip install "deepview[notebook]"

Parameters:
  • iua (IUA) – result of IUA.introspect(), instance of IUA

  • vis_type (str) – [keyword arg, optional] determines visualization type. IUA.VisType.TABLE for pandas dataframe result or IUA.VisType.CHART for matplotlib pyplot of inactive units

  • response_names (Sequence[str] | None) – [keyword arg, optional] For IUA.VisType.CHART vis. Sequence of responses (field names) to visualize (defaults to None for showing all responses)

Returns:

pandas.DataFrame or matplotlib.axes.Axes of IUA results

Return type:

Axes | DataFrame