Note
This page was generated from a Jupyter notebook. The original can be downloaded from here.
Duplicates: Find Near Duplicate Data Samples¶
Discover near duplicate data samples, which might indicate that there is not enough variation in the dataset.
Please see the doc page on near duplicate discovery and removal for a description of the problem, the Duplicates introspector, and what actions can be taken to improve the dataset.
For a more detailed guide on using all of these DeepView components, try the Familiarity for Rare Data Discovery Notebook.
1. Setup¶
Here, the required imports are grouped together, and then the desired paths are set.
[1]:
import logging
logging.basicConfig(level=logging.INFO)
from deepview.base import pipeline, ImageFormat
from deepview.processors import Cacher, ImageResizer, Pooler
from deepview_tensorflow import TFDatasetExamples, TFModelExamples
# For future protection, any deprecated DeepView features will be treated as errors
from deepview.exceptions import enable_deprecation_warnings
enable_deprecation_warnings()
2025-06-21 23:16:56.729207: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1750547816.743315    5449 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1750547816.747837    5449 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1750547816.759174    5449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1750547816.759191    5449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1750547816.759193    5449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1750547816.759195    5449 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-06-21 23:16:56.763083: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Optional: Download MobileNet and CIFAR-10¶
This example uses MobileNet (trained on ImageNet) and CIFAR-10, but feel free to use any other model and dataset. This notebook also uses TFModelExamples and TFDatasetExamples to load in MobileNet and CIFAR-10. Please see the docs for information about how to load a model or dataset. This doc page also describes how responses can be collected outside of DeepView, and passed into Familiarity via a Producer.
No processing is done at this point! Batches are only pulled from the Producer and through the pipeline when the Dataset Report “introspects”.
[2]:
# Load CIFAR10 dataset
# :: Note: max_samples makes it only go up to that many data samples. Remove to run on entire dataset.
cifar10 = TFDatasetExamples.CIFAR10(attach_metadata=True)
# Create a subset of the CIFAR dataset that produces only automobiles
cifar10_cars = cifar10.subset(datasets=['train'], labels=['automobile'], max_samples=1000)
# Load the MobileNet model
# :: Note: If loading a model from tensorflow,
# :: see deepview_tensorflow's load_tf_model_from_path method.
mobilenet = TFModelExamples.MobileNet()
mobilenet_preprocessor = mobilenet.preprocessing
assert mobilenet_preprocessor is not None
requested_response = 'conv_pw_13'
# Create a pipeline, feeding CIFAR data into MobileNet,
# and observing responses from layer conv_pw_13
producer = pipeline(
    # Pull from the CIFAR dataset subset that was created earlier,
    # so that only 1000 samples are analyzed in the "automobile" class here
    cifar10_cars,
    # Preprocess the image batches in the manner expected by MobileNet
    mobilenet_preprocessor,
    # Resize images to fit the input of MobileNet, (224, 224) using an ImageResizer
    ImageResizer(pixel_format=ImageFormat.HWC, size=(224, 224)),
    # Run inference with MobileNet and extract intermediate embeddings
    # (this time, just `conv_pw_13`, but other layers can be added)
    # :: Note: This auto-detects the input layer and connects up 'images' to it:
    mobilenet.model(requested_responses=[requested_response]),
    # Max pool the responses before DeepView processing using a DeepView Pooler
    Pooler(dim=(1, 2), method=Pooler.Method.MAX),
    # Cache responses
    Cacher()
)
2025-06-21 23:17:00.895478: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
INFO:deepview_tensorflow.TF2:Instantiating TF2 Model
INFO:deepview_tensorflow.TF2:GPUs Available: 0
(Optional) Perform Dimension Reduction¶
Although the Duplicates algorithm can run data with any number of dimensions, there will be improved performance and near identical results if the number of dimensions is reduced. The DeepView DimensionReduction introspector can be used for this.
[3]:
from deepview.introspectors import DimensionReduction
dimension_reduction = DimensionReduction.introspect(
    producer,batch_size=64, strategies=DimensionReduction.Strategy.PCA(40))
reduced_producer = pipeline(
    producer,
    dimension_reduction,
)
INFO:faiss.loader:Loading faiss with AVX2 support.
INFO:faiss.loader:Successfully loaded faiss with AVX2 support.
INFO:faiss:Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. This is only an error if you're trying to use GPU Faiss.
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(64, 224, 224, 3))']
  warnings.warn(msg)
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/keras/src/models/functional.py:241: UserWarning: The structure of `inputs` doesn't match the expected structure.
Expected: [['input_layer']]
Received: inputs=['Tensor(shape=(40, 224, 224, 3))']
  warnings.warn(msg)
2. Look for duplication in embedding space using Duplicates introspector¶
In this step, use the Duplicates introspector to build the clusters of duplicates. This can consume a producer or reduced_producer with no other arguments and give good results.
In some cases, it might be necessary to refine the results. The first way to do this is by modifying the threshold parameter to the introspect() function, of type Duplicates.ThresholdStrategy. For using the percentile or close_sensitivity parameters to the introspect() call. By default the introspector will use a dynamic method to find the bend in the sorted distances. The sensitivity can be controlled with close_sensitivity – the default value is 5. Setting it to 2
will give more duplicates and a sensitivity of 20 will produce fewer. The percentile will use the given percentile in the distance as the “close” threshold. For example a percentile value of 99.5 would take the 99.5th percentile distance and consider that “close” when building thresholds. A lower value would produce more duplicates.
The other way to refine the results is to sort or filter the clusters. They can be sorted by the mean distance to the centroid of the cluster to see the tightest clusters first. For example, filter out clusters larger than N to avoid seeing any very large clusters.
[4]:
from deepview.introspectors import Duplicates
duplicates = Duplicates.introspect(reduced_producer)
print(f"Found {len(duplicates.results[requested_response])} unique clusters of duplicates.")
INFO:deepview.base._cached_producer.Cacher:Using cached batches. Attempting to retrieve values..
INFO:deepview.introspectors.duplicates:Building duplicate clusters with 1000 samples
INFO:deepview.introspectors.duplicates:Found 153 duplicate clusters
Found 57 unique clusters of duplicates.
3. Manual analysis of near duplicates - Images¶
Now that sets of the most similar images have been computed, visually inspect them and note which data to consider removing before training. All data displayed are unique.
As mentioned in the introduction, it’s not always bad to have similar samples in the dataset. It is bad practice to have duplicates across training and testing datasets, and it’s important to consider the effective training/testing dataset size (including class breakdowns) really is. Remember that slight variation of a data sample for training can be achieved with data augmentation and applied uniformly across all training samples instead of just to a few specific samples as part of the stored dataset.
In general, it’s possible to evaluate the following criteria with the sets of near duplicates:
- Set contains train + test images: Are there near identical data samples that span the train and test sets for that class? 
- Set contains only train images: For near duplicates in the training set, is the network learning anything more by including all near-duplicates? Could some be replaced with good data augmentation methods? 
- Set contains only test images: For near duplicates in the test set, is there anything new learned about the model if it is able to classify both all the near-duplicate data samples? Does this skew the reported accuracy? 
Notice that in the following results, some of the images in the same set look very different (e.g. Cluster 8), but in other lines (e.g. Cluster 1, 2, 3, etc.), the images look nearly identical. This is why it’s important to perform some manual inspection of the data before removing samples.
Even if these images do not appear similar to the human eye, they are closest in the current embedding space, for the chosen layer response name. This can still provide useful information about what the network is using to distinguish between data samples, and it’s recommended to look for what unites the images, and if desired, collect more data to help the network distinguish between the uniting feature.
Anecdote: In this notebook, a small dataset was chosen for efficiency purposes. However, duplicate analysis has been run on all of CIFAR-10, across train and test datasets, and an extremely large number of duplicates has been found. CIFAR-10 is a widely used dataset for training and evaluation purposes, but the fact that there are so many duplicates calls into question the validity of the accuracy scores reported on this dataset for different models, especially with respect to model generalization. Though this is not the first time someone has noticed this issue in the dataset, still, this dataset is widely used in the ML community. It’s a lesson for all of us to always look at the data.
For now, grab all the images into a data structure in order to index into them and grab the data to visualize.
[5]:
images = [
    element.fields['samples']
    for b in cifar10_cars(batch_size=64)
    for element in b.elements
]
[6]:
from deepview.base import Batch
import matplotlib.pyplot as plt
clusters = duplicates.results[requested_response]
sorted_clusters = sorted(clusters, key=lambda x: x.mean)
for cluster_number, cluster in enumerate(sorted_clusters):
    print(f'Cluster {cluster_number + 1}, mean={cluster.mean}')
    img_idx_list = cluster.batch.metadata[Batch.StdKeys.IDENTIFIER]
    f, axarr = plt.subplots(1, len(img_idx_list), squeeze=False, figsize=(15,6))
    for i, img_idx in enumerate(img_idx_list):
        assert isinstance(img_idx, int), "For this example plot, the identifier should be an int"
        axarr[0, i].imshow(images[img_idx])
        plt.setp(axarr[0, i].spines.values(), lw=2)
        axarr[0, i].yaxis.set_major_locator(plt.NullLocator())
        axarr[0, i].xaxis.set_major_locator(plt.NullLocator())
        axarr[0, i].set_title(f'Data Index {img_idx}')
    plt.show()
Cluster 1, mean=0.03711043563031244
 
Cluster 2, mean=0.041643341011409304
 
Cluster 3, mean=0.04701003477656179
 
Cluster 4, mean=0.048393732793773764
 
Cluster 5, mean=0.049093733175808854
 
Cluster 6, mean=0.06304742316401657
 
Cluster 7, mean=0.06342160643286605
 
Cluster 8, mean=0.06524132105362718
 
Cluster 9, mean=0.07335787412940689
 
Cluster 10, mean=0.07456910167594859
 
Cluster 11, mean=0.07502994561601287
 
Cluster 12, mean=0.07518076527941331
 
Cluster 13, mean=0.07521830559473024
 
Cluster 14, mean=0.07648248642358052
 
Cluster 15, mean=0.07665654496282272
 
Cluster 16, mean=0.07691458887388733
 
Cluster 17, mean=0.07691480214421785
 
Cluster 18, mean=0.07716838695061487
 
Cluster 19, mean=0.07759753306979492
 
Cluster 20, mean=0.07784050039936251
 
Cluster 21, mean=0.07823455444377718
 
Cluster 22, mean=0.0784515712941939
 
Cluster 23, mean=0.0786274355175134
 
Cluster 24, mean=0.07895573143658405
 
Cluster 25, mean=0.07899509860416559
 
Cluster 26, mean=0.07925082693168495
 
Cluster 27, mean=0.0797877484019068
 
Cluster 28, mean=0.07986661029937701
 
Cluster 29, mean=0.07990785182752508
 
Cluster 30, mean=0.08007968293919594
 
Cluster 31, mean=0.08015837223712717
 
Cluster 32, mean=0.08030362766148633
 
Cluster 33, mean=0.08031058684150633
 
Cluster 34, mean=0.08034479115429904
 
Cluster 35, mean=0.08039016054952672
 
Cluster 36, mean=0.08087715325494958
 
Cluster 37, mean=0.08108779819015267
 
Cluster 38, mean=0.0811695673849599
 
Cluster 39, mean=0.08128661488477157
 
Cluster 40, mean=0.08154354498809352
 
Cluster 41, mean=0.08177106627411254
 
Cluster 42, mean=0.08570682941214111
 
Cluster 43, mean=0.08947846843938324
 
Cluster 44, mean=0.09168960041621523
 
Cluster 45, mean=0.09400866406358196
 
Cluster 46, mean=0.09696206437470721
 
Cluster 47, mean=0.09769157306397269
 
Cluster 48, mean=0.09890874392695877
 
Cluster 49, mean=0.10321206749949945
 
Cluster 50, mean=0.10350246791750546
 
Cluster 51, mean=0.10377487376360454
 
Cluster 52, mean=0.10568047469992141
 
Cluster 53, mean=0.10860266362862597
 
Cluster 54, mean=0.11469585501620631
 
Cluster 55, mean=0.11564367087970955
 
Cluster 56, mean=0.12017353475354665
 
Cluster 57, mean=0.12377428607580715
 
[ ]: