Multi-label prototype selection based on editing nearest neighbor rule

Pattern Recognition, Volume 180, Part A, 2026, 114001, ISSN 0031-3203, 22 May 2026

Abstract

A common method to improve the efficiency of instance-based classifiers, while maintaining high accuracy, is to decrease the training set size by substituting it with a smaller, representative subset. This is typically achieved by utilizing Data Reduction Techniques (DRTs). These techniques are widely applicable to single-label classification. This is not the case, however, for multi-label data, where each instance may be associated with multiple classes. In the present paper, we adapt a popular single-label data reduction technique, the Edited Nearest Neighbor (eNN) rule, to handle multi-label data. In single-label classification, eNN focuses on removing noise and close-border instances, making the dataset clear and the borders well separated. The core idea is that instances with a different class than the majority of their neighbors are considered noise and are removed. In the context of multi-label data, label boundaries tend to be ambiguous, and the notion of noise is not clearly defined. Nevertheless, we hypothesize that instances whose labelsets differ significantly from those dominating their local neighborhood can be treated as noise and that their removal serves to condense the training set. Based on this principle, we propose three new eNN variations for multi-label data and test them in practice. Experimental tests and statistical analysis conducted across nine diverse multi-label datasets indicate that the proposed algorithms significantly reduce the size of the datasets without compromising accuracy, while also demonstrating superior performance compared to existing methods.
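The single-label core idea stated in the abstract (remove an instance when its class differs from the majority class of its neighbors) can be sketched as follows. This is a minimal illustration of the classic single-label rule only, not the three multi-label variations the paper proposes; the function name `edited_nearest_neighbor`, the Euclidean distance, and the toy data are assumptions made for the example.

```python
from collections import Counter
import math

def edited_nearest_neighbor(X, y, k=3):
    """Return indices of the instances kept by a basic single-label ENN pass."""
    kept = []
    for i, xi in enumerate(X):
        # Rank all other instances by Euclidean distance to instance i.
        neighbors = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        # Majority class among the k nearest neighbors.
        majority, _ = Counter(y[j] for j in neighbors).most_common(1)[0]
        # Keep the instance only if it agrees with its neighborhood.
        if y[i] == majority:
            kept.append(i)
    return kept

# Toy data (hypothetical): index 3 is a "b" instance sitting inside the "a" group.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
     (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
y = ["a", "a", "a", "b", "b", "b", "b"]
kept = edited_nearest_neighbor(X, y, k=3)
print(kept)
```

A multi-label adaptation would need to replace the class-equality check with some measure of labelset disagreement; which measure to use is exactly what the paper's variations address, so it is left out here.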

Keywords: Multi-label classification, Data reduction techniques, Prototype selection, eNN rule, MLeNN, BRkNN

Link: https://doi.org/10.1016/j.patcog.2026.114001

Condensing multi-label data based on Clustering

In 28th Pan-Hellenic Conference on Progress in Computing and Informatics (PCI 2024), December 13–15, 2024, Egaleo, Greece. ACM, New York, NY, USA, 6 pages.

Abstract

A common approach to speed up the instance-based classifiers, while preserving accuracy, is to use a subset of the original training set. This process, known as condensing, reduces the training dataset to a smaller, representative set. This is accomplished by employing a data reduction technique that selects the most relevant training instances. While many of these techniques have been successfully applied to single-label classification, most are not appropriate for multi-label data, where instances can be associated with multiple classes. This paper proposes a simple data reduction technique for multi-label datasets utilizing 𝐾-means++ clustering. The proposed method selects instances near cluster centroids to form the condensing set, utilizing 𝐾-means++’s initialization strategy to achieve widely spread initial centroids. Experiments conducted on nine datasets, combined with a statistical test, show that our approach achieves significant data reduction while maintaining high classification accuracy.
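A rough sketch of clustering-based condensing along the lines the abstract describes is given below. It makes several assumptions that the abstract does not state: clustering uses only the feature vectors, exactly one instance (the one closest to each final centroid) is selected per cluster, each selected instance keeps its full labelset, and `k` does not exceed the number of distinct instances. The names `kmeans_pp_init` and `condense_by_clustering` are hypothetical.

```python
import math
import random

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids with the K-means++ seeding rule."""
    centroids = [rng.choice(X)]
    while len(centroids) < k:
        # Squared distance from each instance to its closest chosen centroid.
        d2 = [min(math.dist(x, c) ** 2 for c in centroids) for x in X]
        # Sample the next centroid with probability proportional to d2,
        # which favors instances far from the centroids chosen so far.
        centroids.append(rng.choices(X, weights=d2, k=1)[0])
    return centroids

def condense_by_clustering(X, k, iterations=20, seed=0):
    """Return sorted indices of the instances closest to the final centroids."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(iterations):
        # Assignment step: attach every instance to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in X:
            nearest = min(range(k), key=lambda c: math.dist(x, centroids[c]))
            clusters[nearest].append(x)
        # Update step: mean of the members; an empty cluster keeps its centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    # Selection step: one real instance per centroid (duplicates collapse).
    selected = {
        min(range(len(X)), key=lambda i: math.dist(X[i], c)) for c in centroids
    }
    return sorted(selected)

# Toy multi-label data (hypothetical): feature vectors and their labelsets.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
Y = [{"a"}, {"a", "b"}, {"a"}, {"c"}, {"b", "c"}, {"c"}]
selected = condense_by_clustering(X, k=2)
condensing_set = [(X[i], Y[i]) for i in selected]
print(condensing_set)
```

Because real instances are selected rather than synthetic centroids, no rule for merging labelsets is needed in this sketch.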

Keywords: Multi-label classification, Data Reduction, Prototype Selection, Condensing, 𝐾-means++, BR𝑘NN

Link: https://doi.org/10.1145/3716554.3716591

Prototype Selection for Multilabel Instance-Based Learning

Information 2023, 14(10), 572. This paper is an extended version of our paper published in the 27th International Database Engineered Applications Symposium, IDEAS 2023, Heraklion, Greece, 5–7 May 2023.

Abstract

Reducing the size of the training set, which involves replacing it with a condensed set, is a widely adopted practice to enhance the efficiency of instance-based classifiers while trying to maintain high classification accuracy. This objective can be achieved through the use of data reduction techniques, also known as prototype selection or generation algorithms. Although there are numerous algorithms available in the literature that effectively address single-label classification problems, most of them are not applicable to multilabel data, where an instance can belong to multiple classes. Well-known transformation methods cannot be combined with a data reduction technique due to different reasons. The Condensed Nearest Neighbor rule is a popular parameter-free single-label prototype selection algorithm. The IB2 algorithm is the one-pass variation of the Condensed Nearest Neighbor rule. This paper proposes variations of these algorithms for multilabel data. Through an experimental study conducted on nine distinct datasets as well as statistical tests, we demonstrate that the eight proposed approaches (four for each algorithm) offer significant reduction rates without compromising the classification accuracy.
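For orientation, the two single-label algorithms named in the abstract can be sketched as below. This shows only the classic single-label versions, not the eight multi-label variations the paper proposes. Seeding the condensing set with the first instance, the Euclidean distance, and the helper name `nearest_label` are assumptions made for the example.

```python
import math

def nearest_label(x, prototypes):
    """1-NN prediction: label of the stored prototype closest to x."""
    return min(prototypes, key=lambda p: math.dist(x, p[0]))[1]

def ib2(X, y):
    """One pass: store an instance only if the current set misclassifies it."""
    condensed = [(X[0], y[0])]
    for xi, yi in zip(X[1:], y[1:]):
        if nearest_label(xi, condensed) != yi:
            condensed.append((xi, yi))
    return condensed

def cnn(X, y):
    """Repeat passes over the data until a full pass stores nothing new."""
    condensed = [(X[0], y[0])]
    stored = {0}
    changed = True
    while changed:
        changed = False
        for i, (xi, yi) in enumerate(zip(X, y)):
            if i not in stored and nearest_label(xi, condensed) != yi:
                condensed.append((xi, yi))
                stored.add(i)
                changed = True
    return condensed

# Toy 1-D data (hypothetical) where the extra passes of CNN matter:
# once (10.0, "b") is stored, the instance at 6.0 becomes misclassified.
X = [(0.0,), (6.0,), (10.0,)]
y = ["a", "a", "b"]
ib2_set = ib2(X, y)
cnn_set = cnn(X, y)
print(ib2_set)
print(cnn_set)
```

On this data the single pass of IB2 never revisits the instance at 6.0, while CNN stores it in its second pass, which is the practical difference between the two.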

Condensed Nearest Neighbour Rules for Multi-Label Datasets

IDEAS ’23: Proceedings of the International Database Engineered Applications Symposium Conference Pages 43–50

Abstract

Reducing the size of the training set, that is, replacing it with a condensing set, while maintaining the classification accuracy as much as possible is a very common practice to speed up instance-based classifiers. Data reduction techniques, also known as prototype selection or generation algorithms, can be used to accomplish this. There are numerous such algorithms that can be found in the literature that are effective for single-label classification problems, but the majority of them cannot be used for multi-label data where an instance may belong to multiple classes. Due to the numerous binary condensing sets it creates, the well-known Binary Relevance transformation method cannot be combined with a Data Reduction algorithm. Condensed Nearest Neighbor is a well-known parameter-free single-label prototype selection algorithm. This study proposes three variations of that algorithm for training datasets with multiple labels. An experimental study that we conducted over nine distinct datasets shows that our three proposed approaches provide good reduction rates while not tampering with the classification rates.
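The remark about Binary Relevance can be made concrete with a small sketch. Binary Relevance builds one binary training set per label, so applying a single-label data reduction algorithm to each of them would yield one condensing set per label rather than a single multi-label condensing set. The function name `binary_relevance` and the toy data are assumptions made for the example.

```python
def binary_relevance(X, Y, labels):
    """Build one binary training set per label (True if the label is present)."""
    return {
        label: [(x, label in labelset) for x, labelset in zip(X, Y)]
        for label in labels
    }

# Toy multi-label data (hypothetical).
X = [(0.1, 0.2), (0.8, 0.9), (0.5, 0.5)]
Y = [{"sports"}, {"politics", "economy"}, {"sports", "economy"}]
labels = sorted(set().union(*Y))

problems = binary_relevance(X, Y, labels)
for label, dataset in problems.items():
    print(label, dataset)
```

With three labels there are three separate binary datasets, and each would be condensed independently.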

Link: https://doi.org/10.1145/3589462.3589492

Data reduction via multi-label prototype generation

Neurocomputing, Volume 526, 14 March 2023, Pages 1-8

Abstract

A very common practice to speed up instance-based classifiers is to reduce the size of their training set, that is, replace it with a condensing set, hoping that their accuracy will not worsen. This can be achieved by applying a Prototype Selection or Generation algorithm, also referred to as a Data Reduction Technique. Most of these techniques cannot be applied to multi-label problems, where an instance may belong to more than one class. Reduction through Homogeneous Clustering (RHC) and Reduction by Space Partitioning (RSP3) are parameter-free single-label Prototype Generation algorithms. Both are based on recursive data partitioning procedures that identify homogeneous clusters of training data, which they replace by their representatives. This paper proposes variations of these algorithms for multi-label training datasets. The proposed methods generate multi-label prototypes and inherit all the desirable properties of their single-label versions. They consider clusters that contain instances that share at least one common label as homogeneous clusters. It is shown via an experimental study based on nine multi-label datasets that the proposed algorithms achieve good reduction rates without negatively affecting classification accuracy.
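The multi-label homogeneity criterion from the abstract (a cluster is homogeneous when its instances share at least one common label) can be expressed as a small predicate. This sketch covers only that test. The recursive partitioning and the way a prototype's labelset is formed are not described in the abstract and are left out. The function name `shares_common_label` and the choice to treat an empty cluster as non-homogeneous are assumptions.

```python
def shares_common_label(labelsets):
    """True if at least one label appears in every labelset of the cluster."""
    if not labelsets:
        # Assumption: an empty cluster is not considered homogeneous.
        return False
    # Intersect all labelsets; a non-empty result means a shared label exists.
    return bool(set.intersection(*map(set, labelsets)))

# Toy clusters (hypothetical labelsets).
cluster_a = [{"a", "b"}, {"b", "c"}, {"b"}]   # every instance carries "b"
cluster_b = [{"a"}, {"b"}]                    # no label in common
print(shares_common_label(cluster_a))
print(shares_common_label(cluster_b))
```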

Link: https://doi.org/10.1016/j.neucom.2023.01.004
Link in ResearchGate: Panagiotis Filippakis on ResearchGate

Prototype Generation for Multi-label Nearest Neighbours Classification

International Conference on Hybrid Artificial Intelligence Systems HAIS 2021: Hybrid Artificial Intelligent Systems pp 172–183

Abstract

Numerous Prototype Selection and Generation algorithms for instance-based classifiers and single-label classification problems have been proposed in the past and are available in the literature. They build a small set of prototypes that represents the initial training data as well as possible. This set is called the condensing set and has the benefit of low computational cost while preserving accuracy. However, the proposed Prototype Selection and Generation algorithms are not applicable to multi-label problems, where an instance may belong to more than one class. The popular Binary Relevance transformation method is also unsuitable for combination with a Prototype Selection or Generation algorithm because of the multiple binary condensing sets it builds. Reduction through Homogeneous Clustering (RHC) is a simple, fast, parameter-free single-label Prototype Generation algorithm that is based on k-means clustering. This paper proposes an RHC variation for multi-label training datasets. The proposed method, called Multi-label RHC (MRHC), inherits all the aforementioned desirable properties of RHC and generates multi-label prototypes. The experimental study based on nine multi-label datasets shows that MRHC achieves high reduction rates without negatively affecting accuracy.

Link: https://doi.org/10.1007/978-3-030-86271-8_15

Selection and combination of classifiers for optimizing the performance of Artificial Intelligence algorithms for disease diagnosis

Bachelor’s thesis

Nowadays the medical profession faces a number of challenges, since numerous diseases can be treated effectively if they are diagnosed correctly at an early stage. Computer science and some of its techniques can help in this effort. To this end, there is a continuing effort to create mathematical models (classifiers) that support the correct diagnosis of illnesses such as cancer, diabetes, Alzheimer’s, and heart disease. The classifiers act either separately or in combination to ensure maximum effectiveness. Computing power has increased significantly, so large volumes of data can be processed quickly in order to diagnose multifactorial diseases. The objective of this work is to study and implement a particular method of combining such classifiers and to apply it to ten datasets covering different illnesses. The aim of this dissertation is to show that combining heterogeneous classifiers on medical data can yield better illness predictions than single classifiers. For this reason, the method was applied to ten different medical problems, each with its own distinctive features, to obtain as broad a view as possible of the results it may yield. The source of the disease data was of great importance in drawing conclusions, so we used only data obtained from hospital doctors and from databases available online. We used the WEKA tool and Excel to implement the classifiers in our study, on top of which we developed our own computer program.

Link: https://apothesis.eap.gr/archive/item/143199?lang=en