Utilization of Neighbor Information for Image Classification with Different Levels of Supervision

Department of Computer Science, University of Maryland
2025
Project Teaser

Unifying Clustering and GCD. We observe that the goals of image clustering and generalized category discovery (GCD) are identical; they differ only slightly in terms of supervision (top). Therefore, we propose a clustering approach based on mining positive and negative neighbors, which belong to the same class as an anchor and to a different class, respectively (bottom left). We extend this approach to GCD by using the ground-truth labels as "perfect" neighbors (bottom right).

Abstract

We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, existing methods are restricted to a single one: GCD methods rely on the labeled portion of the data, while deep image clustering methods have no built-in way to leverage labels efficiently. We connect the two regimes with an approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and semi-supervised (GCD) settings. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by finetuning the backbone with clustering losses computed over both types of neighbors. We then adapt this pipeline to GCD by treating the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, ImageNet-200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).

Methodology

Training optimizes an embedding function $f_{\text{emb};\theta_e}$ and a classifier $f_{\text{cls};\theta_c}$ by leveraging neighbors in the embedding space. Each image is first embedded as a vector, and neighbors are ranked by Euclidean distance. Positive neighbors are drawn from the $\tau_1$ nearest examples, while negative neighbors are sampled from those beyond the first $\tau_2$ nearest. A cleaning step then discards candidate neighbors whose second-order neighborhood disagreement exceeds a threshold $\eta$ (see the first sketch below).

During training, each triplet of an image, a positive neighbor, and a negative neighbor yields two losses: $L_{\text{pos}}$, which encourages the image's prediction to agree with that of its positive neighbor, and $L_{\text{neg}}$, which enforces disagreement with its negative neighbor. The overall objective balances these losses with an entropy term that keeps the class distributions well formed (see the second sketch below).

A Vision Transformer (ViT-B/16) serves as the embedding backbone, initialized with pretrained DINO weights and further refined with SimGCD. The classifier is a randomly initialized two-layer MLP, and training uses transformed images of size $224 \times 224$ (see the third sketch below). Optimization yields refined parameters $\theta_e^*$ and $\theta_c^*$ that improve both embedding quality and classification performance.
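To make the neighbor mining concrete, here is a minimal sketch assuming precomputed, fixed embeddings. The names (`mine_neighbors`, `clean_positives`) are illustrative, and the cleaning rule shown, a mutual-neighbor check, is one plausible reading of the thresholded second-order neighborhood, not necessarily the paper's exact criterion.

```python
import torch

def mine_neighbors(emb: torch.Tensor, tau1: int, tau2: int):
    """emb: (N, D) embeddings. Returns positive candidates (the tau1
    nearest examples) and a pool of negatives (everything farther than
    the tau2 nearest) for each image."""
    dists = torch.cdist(emb, emb)          # (N, N) Euclidean distances
    order = dists.argsort(dim=1)           # indices sorted by distance
    positives = order[:, 1:tau1 + 1]       # skip self at index 0
    negatives = order[:, tau2 + 1:]        # farther than tau2 nearest
    return positives, negatives

def clean_positives(positives: torch.Tensor, eta: int):
    """Discard positive neighbor j of anchor i unless i also appears
    among j's first eta neighbors (requires eta <= tau1). This is an
    assumed instantiation of the second-order neighborhood test."""
    n, k = positives.shape
    keep = torch.zeros(n, k, dtype=torch.bool)
    for i in range(n):
        for c, j in enumerate(positives[i].tolist()):
            keep[i, c] = (positives[j, :eta] == i).any()
    return keep
```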
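The loss terms can be sketched as follows, assuming SCAN-style agreement losses on the classifier's soft assignments; the exact forms of $L_{\text{pos}}$, $L_{\text{neg}}$, and the entropy weight (here `lambda_e`) are assumptions for illustration rather than the paper's verbatim objective.

```python
import torch

def unic_losses(p_anchor, p_pos, p_neg, lambda_e=2.0):
    """p_*: (B, C) softmax class distributions for the anchor image,
    a sampled positive neighbor, and a sampled negative neighbor."""
    # L_pos: pull the anchor's prediction toward its positive neighbor's
    agree_pos = (p_anchor * p_pos).sum(dim=1)       # in (0, 1]
    l_pos = -torch.log(agree_pos + 1e-8).mean()
    # L_neg: push the anchor's prediction away from its negative neighbor's
    agree_neg = (p_anchor * p_neg).sum(dim=1)
    l_neg = -torch.log(1.0 - agree_neg + 1e-8).mean()
    # Entropy term: keep the mean assignment over the batch spread out,
    # so the class distributions stay well formed (no single-cluster collapse)
    mean_p = p_anchor.mean(dim=0)
    entropy = -(mean_p * torch.log(mean_p + 1e-8)).sum()
    return l_pos + l_neg - lambda_e * entropy
```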
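Finally, a sketch of the model setup: a ViT-B/16 backbone with public DINO weights (the `torch.hub` entry point below is the official DINO release) and a randomly initialized two-layer MLP head. The hidden width of 2048 is an assumption, and the SimGCD refinement step is not shown.

```python
import torch
import torch.nn as nn

num_classes = 100  # e.g., ImageNet-100

# ViT-B/16 embedding backbone, pretrained with DINO
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')

# Two-layer MLP classifier on top of the 768-dim ViT-B/16 embedding
classifier = nn.Sequential(
    nn.Linear(768, 2048),
    nn.GELU(),
    nn.Linear(2048, num_classes),
)

x = torch.randn(4, 3, 224, 224)   # transformed images, 224 x 224
z = backbone(x)                   # (4, 768) embeddings f_emb(x)
logits = classifier(z)            # class scores f_cls(z)
probs = logits.softmax(dim=1)     # soft class assignments
```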

Results

BibTeX

@misc{unic-2025,
  title={Utilization of Neighbor Information for Image Classification with Different Levels of Supervision},
  author={Gihan Jayatilaka and Abhinav Shrivastava and Matthew Gwilliam},
  year={2025},
  eprint={2503.14500},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.14500}
}