Utilization of Neighbor Information for Image Classification with Different Levels of Supervision

Department of Computer Science, University of Maryland
2025
Project Teaser

Unifying Clustering and GCD. We observe that the goals of image clustering and generalized category discovery (GCD) are identical; they differ only slightly in terms of supervision (top). Therefore, we propose a clustering approach based on mining positive and negative neighbors, which belong to the same class as an anchor and to a different class, respectively (bottom left). We extend this approach to GCD by using the ground-truth labels as "perfect" neighbors (bottom right).

Abstract

We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, existing methods are restricted to a single one: GCD methods rely on the labeled portion of the data, while deep image clustering methods have no built-in way to leverage labels efficiently. We connect the two regimes with an approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and semi-supervised (GCD) settings. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by finetuning the backbone with clustering losses computed over both types of neighbors. We then adapt this pipeline to GCD by treating the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, ImageNet-200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).

Methodology

Training optimizes an embedding function $f_{\text{emb};\theta_e}$ and a classifier $f_{\text{cls};\theta_c}$ by leveraging neighbors in the embedding space. Each image is first embedded as a vector, and neighbors are ranked by Euclidean distance. Positive neighbors are drawn from the $\tau_1$ nearest examples, while negative neighbors are sampled from those beyond the first $\tau_2$ nearest. A cleaning step then discards candidate neighbors whose second-order neighborhood disagreement exceeds a threshold $\eta$ (see the first sketch below).

During training, each triplet of an image, a positive neighbor, and a negative neighbor yields two losses: $L_{\text{pos}}$, which encourages the image's prediction to agree with that of its positive neighbor, and $L_{\text{neg}}$, which enforces disagreement with its negative neighbor. The overall objective balances these losses with an entropy term that keeps the class distributions well formed (see the second sketch below).

A Vision Transformer (ViT-B/16) serves as the embedding backbone, initialized with pretrained DINO weights and further refined with SimGCD. The classifier is a randomly initialized two-layer MLP, and training uses transformed images of size $224 \times 224$ (see the third sketch below). Optimization yields refined parameters $\theta_e^*$ and $\theta_c^*$ that improve both embedding quality and classification performance.
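To make the neighbor mining concrete, here is a minimal sketch assuming precomputed, fixed embeddings. The names (`mine_neighbors`, `clean_positives`) are illustrative, and the cleaning rule shown, a mutual-neighbor check, is one plausible reading of the thresholded second-order neighborhood, not necessarily the paper's exact criterion.

```python
import torch

def mine_neighbors(emb: torch.Tensor, tau1: int, tau2: int):
    """emb: (N, D) embeddings. Returns positive candidates (the tau1
    nearest examples) and a pool of negatives (everything farther than
    the tau2 nearest) for each image."""
    dists = torch.cdist(emb, emb)          # (N, N) Euclidean distances
    order = dists.argsort(dim=1)           # indices sorted by distance
    positives = order[:, 1:tau1 + 1]       # skip self at index 0
    negatives = order[:, tau2 + 1:]        # farther than tau2 nearest
    return positives, negatives

def clean_positives(positives: torch.Tensor, eta: int):
    """Discard positive neighbor j of anchor i unless i also appears
    among j's first eta neighbors (requires eta <= tau1). This is an
    assumed instantiation of the second-order neighborhood test."""
    n, k = positives.shape
    keep = torch.zeros(n, k, dtype=torch.bool)
    for i in range(n):
        for c, j in enumerate(positives[i].tolist()):
            keep[i, c] = (positives[j, :eta] == i).any()
    return keep
```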
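The loss terms can be sketched as follows, assuming SCAN-style agreement losses on the classifier's soft assignments; the exact forms of $L_{\text{pos}}$, $L_{\text{neg}}$, and the entropy weight (here `lambda_e`) are assumptions for illustration rather than the paper's verbatim objective.

```python
import torch

def unic_losses(p_anchor, p_pos, p_neg, lambda_e=2.0):
    """p_*: (B, C) softmax class distributions for the anchor image,
    a sampled positive neighbor, and a sampled negative neighbor."""
    # L_pos: pull the anchor's prediction toward its positive neighbor's
    agree_pos = (p_anchor * p_pos).sum(dim=1)       # in (0, 1]
    l_pos = -torch.log(agree_pos + 1e-8).mean()
    # L_neg: push the anchor's prediction away from its negative neighbor's
    agree_neg = (p_anchor * p_neg).sum(dim=1)
    l_neg = -torch.log(1.0 - agree_neg + 1e-8).mean()
    # Entropy term: keep the mean assignment over the batch spread out,
    # so the class distributions stay well formed (no single-cluster collapse)
    mean_p = p_anchor.mean(dim=0)
    entropy = -(mean_p * torch.log(mean_p + 1e-8)).sum()
    return l_pos + l_neg - lambda_e * entropy
```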
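Finally, a sketch of the model setup: a ViT-B/16 backbone with public DINO weights (the `torch.hub` entry point below is the official DINO release) and a randomly initialized two-layer MLP head. The hidden width of 2048 is an assumption, and the SimGCD refinement step is not shown.

```python
import torch
import torch.nn as nn

num_classes = 100  # e.g., ImageNet-100

# ViT-B/16 embedding backbone, pretrained with DINO
backbone = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')

# Two-layer MLP classifier on top of the 768-dim ViT-B/16 embedding
classifier = nn.Sequential(
    nn.Linear(768, 2048),
    nn.GELU(),
    nn.Linear(2048, num_classes),
)

x = torch.randn(4, 3, 224, 224)   # transformed images, 224 x 224
z = backbone(x)                   # (4, 768) embeddings f_emb(x)
logits = classifier(z)            # class scores f_cls(z)
probs = logits.softmax(dim=1)     # soft class assignments
```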

Results

BibTeX

@misc{unic-2025,
  title={Utilization of Neighbor Information for Image Classification with Different Levels of Supervision},
  author={Gihan Jayatilaka and Abhinav Shrivastava and Matthew Gwilliam},
  year={2025},
  eprint={2503.14500},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.14500}
}