Salutary Labeling with Zero Human Annotation (2024)

Wenxiao Xiao
Department of Computer Science
Brandeis University
Waltham, MA 02452
wenxiaoxiao@brandeis.edu
&Hongfu Liu
Department of Computer Science
Brandeis University
Waltham, MA 02452
hongfuliu@brandeis.edu

Abstract

Active learning strategically selects informative unlabeled data points and queries their ground truth labels for model training. The prevailing assumption underlying this machine learning paradigm is that acquiring these ground truth labels will optimally enhance model performance. However, this assumption may not always hold true or maximize learning capacity, particularly considering the costly labor annotations required for ground truth labels. In contrast to traditional ground truth labeling, this paper proposes salutary labeling, which automatically assigns the most beneficial labels to the most informative samples without human annotation. Specifically, we utilize the influence function, a tool for estimating sample influence, to select newly added samples and assign their salutary labels by choosing the category that maximizes their positive influence. This process eliminates the need for human annotation. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our salutary labeling approach over traditional active learning strategies. Additionally, we provide several in-depth explorations and a practical application to large language model (LLM) fine-tuning.

1 Introduction

Active learning[15, 87, 69] is a specialized area in machine learning that focuses on effectively training models by enabling them to request the labeling of particularly informative data points with a certain budget. This approach arises from the challenge and expense involved in obtaining labeled data, which is often a major bottleneck in machine learning applications. The principle behind active learning is that a machine learning model can achieve higher accuracy with fewer ground truth labels if it is allowed to choose the data from which it learns. This makes active learning particularly valuable in fields where the labeling process is costly and time-consuming.

Consequently, significant research efforts have been dedicated to active learning in various research areas such as computer vision[40, 9], natural language processing[88, 58], and medical diagnosis[7, 77]. Traditionally, active learning methods select data points based on uncertainty and representativeness. The early uncertainty-based methods mainly measure data uncertainty with the posterior probability predicted by the model[38, 78, 4], while some recent works utilize auxiliary modules[49, 44] to estimate uncertainty. Solely focusing on uncertainty might cause bias in sampling; therefore, other methods[83, 40] aim to find the most representative subset of the full data. Recently, some works[55, 14] attempt to estimate the effect of integrating each data point on the training loss with the influence function[17, 45].

The above active learning approaches show promising results but hinge on a critical assumption that training with ground truth labels of the selected samples will optimally enhance model performance. However, this assumption may not always hold, as some human-annotated labels can be incorrect or misleading, potentially harming the model's efficacy[75, 11]. Moreover, even correct annotations might still harm or limit model performance. Therefore, we believe that the most valuable annotations may not always be the ground truth but rather the labels that most improve the model. This perspective underscores the need for a strategy that not only identifies the most informative data points but also accurately determines the labels that offer the greatest benefit.

Contributions. In this paper, we present salutary labeling, which aims to select the most informative samples and automatically annotate them with the most beneficial labels, enhancing training efficacy and minimizing the need for human intervention. We summarize our contributions as follows:

  • We consider a new task named salutary labeling, which integrates the querying and annotating processes of active learning into a single autonomous step. To the best of our knowledge, this is the first initiative aimed at both maximizing model performance and eliminating the need for ground truth in active learning with an automatic labeling strategy.

  • We adapt the influence function to calculate sample influence, which serves as a criterion for selecting the most influential samples for labeling. However, label information is required when calculating sample influence. Our salutary labeling ingeniously addresses this challenge by assessing the impact of each sample across all possible labels and assigning the label that yields the greatest positive influence. This simple strategy allows the model to automatically select and label samples, maximizing their overall benefit without any human annotation.

  • We validate the efficacy of our approach on nine benchmark datasets, comparing it with four classical active learning methods and two recent influence function-based methods. In addition to standard active learning experiments, we also conduct various in-depth explorations to address key questions concerning salutary labeling and extend its application to LLM fine-tuning.

2 Related Work

Our work intersects with several areas within machine learning. Among them, our work is most closely related to, yet stands in contrast with, existing active learning. Active learning[15, 79] selectively queries the user to annotate data points that are likely to be most beneficial for improving model performance, while our work introduces a new task named salutary labeling, aiming to accomplish the querying and labeling in one unified step without any human annotation. Traditionally, some strategies[87, 69, 53] select important data points with indirect criteria such as uncertainty or representativeness. Uncertainty-based methods define sample uncertainty in one of three main ways: the entropy of the posterior probability distribution[72, 82, 38], the probability of the predicted class[51, 78, 63], or the margin between the probabilities of the two highest predicted classes[43, 70, 4]. Beyond these, research works[28, 29] utilize consensus among multiple classifiers[73, 44], or employ an auxiliary module[86] to measure uncertainty. Another strand of active learning approaches focuses on selecting the most representative samples[83, 40, 71] through clustering[62] or by maximizing the distances between selected samples[36]. Alternatively, several methods[33, 36, 85] attempt to identify the most diverse subset to represent the full dataset. Unlike these uncertainty-based and representativeness-based methods, our salutary labeling directly estimates each sample's impact on model performance with the influence function.

Technically, our work is inherently related to the influence function[17], which measures the change in a model's output due to an infinitesimal perturbation of one training data point. Following Koh and Liang [45], significant research efforts[30, 46, 67, 12] are dedicated to quantifying the impact of individual training samples or groups of samples on model performance. Recently, ISAL[55] extends the influence function to active learning by utilizing pseudo labels to calculate the influence. Alternatively, IBDS[14] incorporates an auxiliary regression module, which is specifically trained on labeled data and their calculated influences, to estimate the impact of unlabeled samples. While these methods avoid the requirement of labels in calculating the influence function, they still rely on human annotators to label the selected data. In contrast, our method eliminates the need for human annotation, thereby avoiding the labor-intensive process of annotation and the potential inaccuracies associated with detrimental ground truth labels.

Conceptually, our work is also related to several data-centric topics. Data relabeling methods[84, 47] seek to relabel the harmful training samples for better model performance, while partial label learning[42, 57, 31] aims to train a classifier to accurately predict the ground-truth label using partially labeled data, where each training instance is associated with multiple candidate labels. Although both tasks involve automatically assigning labels to data points, neither of them is designed to query unseen samples for further improving model performance. Data-efficient learning[41, 60, 16, 65] aims to accelerate model training by selecting a minimum subset of the data, which requires ground truth labels for all available data. Antidote data[52, 68] overlaps with our method as it generates additional training data to modify specific model behaviors such as fairness. However, these approaches do not primarily focus on the context of active learning.

[Figure 1]

3 Motivation

Conventional active learning methods aim to strategically select unlabeled samples for annotation, assuming that correctly labeled samples inherently enhance model performance. However, this assumption may not always hold true. Research in the realm of noisy labels[61, 74] has revealed that even a small subset of samples with noisy labels can contribute positively to model improvement. Our own observations, depicted in Figure 1 (top), further substantiate this claim. Leveraging the influence function, we discern the impact of individual samples on model performance. Based on this analysis, we calculate each sample's influence under its most salutary label adjustment, i.e., the label that maximizes its impact on model performance. Subsequently, we partition the entire training set into 20 equally-sized bins and replace the labels of samples within each bin with their optimal counterparts. Notably, the red line in the figure illustrates the model's performance with the entire training set, but with the labels of samples within each bin adjusted accordingly. Note that the dots representing equally-sized bins along the red line do not have uniform intervals and do not align with the unevenly-sized histogram. Surprisingly, for bins with high influence scores, retraining the model with these adjusted labels results in a significant performance improvement. For instance, in the last bin, the accuracy increases from 69% to 74%. This underscores the presence of salutary labels that surpass ground truth labels in enhancing model performance.

Expanding on the concept of salutary labels, we apply it within the framework of active learning, as depicted in Figure 1 (bottom). Analogous to our previous approach, we sort the data points in the pool set based on their influence when labeled with salutary labels, dividing the pool set into 20 equally-sized bins. The red and blue lines represent the performance when each bin is added to the initial set with ground truth labels and salutary labels, respectively. Our salutary labeling strategy consistently outperforms ground truth labels in most scenarios, and it is particularly notable for samples with high influence estimates, which exhibit a remarkable 5% improvement over ground truth labels. It is noteworthy that the inclusion of bins with low influence leads to a decrease in prediction accuracy, highlighting the presence of detrimental samples. These findings motivate us to pursue active learning with salutary labels, a strategy that not only enhances performance compared to ground truth labels but also alleviates the need for costly annotation efforts.

4 Method

4.1 Preliminaries

Active learning. The active learning process begins with training a model on a small initial labeled dataset $L=\{(x_i,y_i)\}_{i=1}^{N_L}$. Guided by certain criteria, active learning selects a number of the most informative unlabeled data points from a pool set $U=\{x_j\}_{j=1}^{N_U}$, queries their labels to obtain $B=\{(x_{j'},y_{j'})\}_{j'=1}^{b}$, where $b$ represents the querying budget in each iteration, and updates the model with the newly labeled data $L\cup B$. These queried samples are then removed from the unlabeled pool for subsequent iterations. This learning cycle is repeated for multiple rounds, gradually enhancing model performance while minimizing labeling effort.
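To make this loop concrete, the sketch below shows a minimal pool-based version built around a scikit-learn logistic regression; the function name, the `score_fn` acquisition hook, and the use of `y_pool` as a stand-in for the human annotator are our own simplifications rather than part of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def standard_active_learning(X_init, y_init, X_pool, y_pool, score_fn, rounds=10, budget=10):
    """Generic pool-based active learning loop (hypothetical sketch).
    score_fn(model, X) is any acquisition criterion (uncertainty, representativeness,
    influence, ...); y_pool plays the role of the human annotation oracle."""
    X_pool, y_pool = np.asarray(X_pool), np.asarray(y_pool)
    X_l, y_l = np.asarray(X_init), np.asarray(y_init)
    pool = np.arange(len(X_pool))
    model = LogisticRegression(max_iter=1000).fit(X_l, y_l)
    for _ in range(rounds):
        scores = score_fn(model, X_pool[pool])                     # score remaining pool samples
        picked = pool[np.argsort(scores)[-budget:]]                # top-b most informative points
        X_l = np.vstack([X_l, X_pool[picked]])                     # query their ground truth labels
        y_l = np.concatenate([y_l, y_pool[picked]])                # ... from the annotation oracle
        pool = np.setdiff1d(pool, picked)                          # remove queried samples from U
        model = LogisticRegression(max_iter=1000).fit(X_l, y_l)    # retrain on L ∪ B
    return model
```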

Influence function. For a labeled training dataset $\{(x_i,y_i)\}_{i=1}^{N}$ and a model with a convex loss function $\ell(\cdot,\cdot)$, the optimized parameters for empirical risk minimization (ERM) can be represented as $\hat{\theta}=\arg\min_{\theta\in\Theta}\frac{1}{N}\sum_{i}\ell(x_i,y_i)+\frac{\lambda}{2}\|\theta\|_2^2$. If one training point $(x_j,y_j)$ is down-weighted by an infinitesimal $\epsilon$ during training, the new optimized parameters become $\hat{\theta}_{(x_j,y_j);-\epsilon}=\arg\min_{\theta\in\Theta}\frac{1}{N}\sum_{i}\ell(x_i,y_i)-\epsilon\ell(x_j,y_j)+\frac{\lambda}{2}\|\theta\|_2^2$. Without actually retraining the model, the influence function[17] estimates the actual change by $\hat{\theta}_{(x_j,y_j);-\epsilon}-\hat{\theta}=-\mathbf{H}_{\hat{\theta}}^{-1}\nabla_{\hat{\theta}}\ell(x_j,y_j)$, where $\mathbf{H}_{\hat{\theta}}=\frac{1}{N}\sum_{i}\nabla^2_{\hat{\theta}}\ell(x_i,y_i)+\lambda\mathbf{I}$ is the positive definite Hessian matrix at $\hat{\theta}$.

By setting $\epsilon=1/N$, we can linearly approximate the change of $\hat{\theta}$ after removing a training sample, as removing sample $(x_j,y_j)$ is equivalent to down-weighting it with $\epsilon=1/N$. If a validation set $V$ is taken into consideration, let the validation loss be $\mathcal{L}_v=\ell(V;\hat{\theta})$; the impact of a specific training data point $(x_j,y_j)$ on the validation loss can be estimated as follows[45]:

$$\mathcal{I}(x_j,y_j)=-\nabla_{\hat{\theta}}\mathcal{L}_v^{\top}\,\mathbf{H}_{\hat{\theta}}^{-1}\,\nabla_{\hat{\theta}}\ell(x_j,y_j).\tag{1}$$
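As a concrete illustration, here is a minimal PyTorch sketch of Eq. (1) for an L2-regularized binary logistic regression with an explicitly formed Hessian; the function name, the regularization strength `lam`, and the binary-label/no-bias assumptions are ours and are only meant to make the formula executable for small parameter dimensions.

```python
import torch
import torch.nn.functional as F

def influence_eq1(theta, X_tr, y_tr, X_val, y_val, x_j, y_j, lam=0.01):
    """I(x_j, y_j) = -grad(L_v)^T H^{-1} grad(l(x_j, y_j)) for binary logistic regression.
    theta: (d,) optimized parameters; labels are float tensors in {0, 1}; no bias term."""
    theta = theta.detach().clone().requires_grad_(True)

    def reg_train_loss(t):
        # Mean training loss plus the L2 term, matching the ERM objective in the text.
        return F.binary_cross_entropy_with_logits(X_tr @ t, y_tr) + 0.5 * lam * t.dot(t)

    # Positive definite Hessian of the regularized training loss at theta.
    H = torch.autograd.functional.hessian(reg_train_loss, theta.detach())

    # Gradient of the validation loss L_v with respect to theta.
    grad_val = torch.autograd.grad(
        F.binary_cross_entropy_with_logits(X_val @ theta, y_val), theta)[0]

    # Gradient of the loss of the single (proposed) training point (x_j, y_j).
    grad_j = torch.autograd.grad(
        F.binary_cross_entropy_with_logits((x_j @ theta).view(1), y_j.view(1)), theta)[0]

    return -(grad_val @ torch.linalg.solve(H, grad_j)).item()
```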

Unlike traditional active learning methods that rely on indirect criteria such as uncertainty[4, 85, 63] or representativeness[39, 24, 32] to select informative samples, influence functions offer a more direct and precise assessment of a data point's importance to the model. By quantifying the effect of each sample on the model's loss on the validation set, the influence function provides a more accurate means of selecting the most informative data points for labeling. Despite its potential benefits, the influence function presents a crucial challenge in active learning. As shown in Eq. (1), the influence function relies on label information to estimate the impact of each data point, which poses a challenge when dealing with pool samples in active learning, where such labels are unavailable. To overcome this obstacle, we introduce salutary labeling, a simple and effective labeling strategy that makes the influence function flexible for active learning.

4.2 Salutary Labeling for Active Learning

In this work, we propose salutary labeling for active learning, a novel approach that directly evaluates the impact of each unlabeled sample and automatically assigns labels to the selected data without any human annotation. This method circumvents the requirement for ground truth labels in influence function calculation by systematically exploring all possible labels for each data point and calculating the influence corresponding to each label. The label with the highest influence estimation is then assigned to each sample as the salutary label. This salutary influence, estimated using the salutary label, represents the maximum possible benefit of incorporating the data point into training. Subsequently, our method selects the unlabeled samples with the highest salutary influence and annotates them with salutary labels in a unified step, without requiring any human intervention. In the following section, we introduce the notations and provide technical details of our method.

Training protocol and technical notations. In each iteration of active learning, the model is trained on the labeled training set $L$ with label space $\mathcal{C}$. The optimized model parameters for the convex training loss function $\ell(\cdot,\cdot)$ are denoted as $\hat{\theta}$. To actively query the most beneficial samples from the unlabeled pool set $U=\{x_i\}_{i=1}^{N_U}$, our salutary labeling algorithm calculates the influence estimation of every data point $x_i$ with its salutary label on the validation loss $\mathcal{L}_v=\ell(V;\hat{\theta})$. The samples with the highest influences are selected as the salutary set, denoted as $B=\{(x_j,y_j^s)\}_{j=1}^{b}$, where $y_j^s\in\mathcal{C}$ represents the salutary label of the queried data and the superscript 's' indicates a salutary label. After forming the salutary set, it is removed from the pool $U$, updating $U=U\setminus B$. Subsequently, the model is re-trained on the expanded labeled set $L=L\cup B$ for the next active learning cycle.

Salutary labeling with influence function. With the concept of the salutary label, we can handle the absence of label information when calculating the influence function. Specifically, for an unlabeled sample, we compute the influence estimations for each label and pick the one with the largest influence value as follows:

$$\mathcal{I}(x_j,y_j^s)=\mathcal{I}(x_j,\hat{c}),\quad\text{where}\ \hat{c}=\operatorname*{arg\,max}_{c\in\mathcal{C}}\mathcal{I}(x_j,c).\tag{2}$$

Autonomous active learning. Eq. (2) directly measures the impact of each unlabeled sample and automatically assigns the salutary label, enabling our method to query the unlabeled data without human intervention. Specifically, the model selects the top $b$ samples with the highest influences from the pool set $U$ and annotates them with salutary labels to form an active salutary set $B=\{(x_j,y_j^s)\}_{j=1}^{b}$. This salutary set is then removed from $U$ and integrated into the labeled training set $L$ to update the model.
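The sketch below spells out this step, assuming an `influence_fn(x, c)` wrapper that evaluates Eq. (1) for a pool point paired with a candidate label (for instance, the `influence_eq1` sketch above); the helper names and the NumPy-based top-b selection are our own.

```python
import numpy as np

def salutary_label(influence_fn, x_j, label_space):
    """Eq. (2): evaluate every candidate label and keep the one with the largest influence."""
    scores = {c: influence_fn(x_j, c) for c in label_space}
    c_hat = max(scores, key=scores.get)              # salutary label y_j^s
    return c_hat, scores[c_hat]                      # (y_j^s, I(x_j, y_j^s))

def select_salutary_batch(influence_fn, X_pool, label_space, budget=10):
    """Pick the top-b pool samples by salutary influence and auto-annotate them."""
    labels, influences = zip(*(salutary_label(influence_fn, x, label_space) for x in X_pool))
    top = np.argsort(influences)[-budget:]           # indices of the b most beneficial samples
    return top, np.asarray(labels)[top]              # pool indices and their salutary labels
```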

We summarize the full training protocol of our salutary labeling for active learning in Algorithm 1. The time complexity of salutary labeling is bounded by the calculation of the influence function in Eq. (2). For each label $c\in\mathcal{C}$, calculating the gradients of all unlabeled samples takes $\mathcal{O}(nd)$, where $n$ is the number of samples and $d$ is the dimension of the model parameter $\theta$. Notice that the computation of the Hessian matrix and its inverse only involves the label information of the validation set; therefore, these calculations need to be performed only once for all potential labels. Since the explicit computation of the Hessian takes $\mathcal{O}(nd^2)$ and its inversion takes $\mathcal{O}(d^3)$, we apply conjugate gradients and stochastic estimation of Hessian-vector products[45], reducing the time complexity to $\mathcal{O}(nd)$.
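For larger models, the inverse-Hessian-vector product can be approximated without ever forming the Hessian. The sketch below uses double backpropagation for Hessian-vector products and plain conjugate gradients in place of the stochastic LiSSA-style estimator of [45]; `reg_loss_fn` is assumed to compute the regularized training loss from the parameter vector.

```python
import torch

def hvp(reg_loss_fn, theta, v):
    """Hessian-vector product H v via double backprop, without materializing H."""
    theta = theta.detach().clone().requires_grad_(True)
    grad = torch.autograd.grad(reg_loss_fn(theta), theta, create_graph=True)[0]
    return torch.autograd.grad(grad @ v, theta)[0]

def inverse_hvp_cg(reg_loss_fn, theta, b, iters=100, tol=1e-8):
    """Approximate H^{-1} b with conjugate gradients; each step costs one HVP,
    avoiding the O(d^2) explicit Hessian and its O(d^3) inversion."""
    x = torch.zeros_like(b)
    r = b.clone()                          # residual b - Hx with x = 0
    p = r.clone()
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(reg_loss_fn, theta, p)
        alpha = rs / (p @ Hp)
        x = x + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x                               # plug into Eq. (1) as H^{-1} grad(l(x_j, y_j))
```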

Algorithm 1: Salutary Labeling for Active Learning

Input: Labeled training set $L$, unlabeled pool set $U$, validation set $V$, and model parameters $\theta$;
Parameters: Total number of active training rounds $R$ and query budget $b$;

1: Train the model and obtain the optimized parameters $\hat{\theta}$ with the loss term $\frac{1}{N_L}\sum_{(x_i,y_i)\in L}\ell(x_i,y_i)$;
2: for $r=1$ to $R$ do
3:   for $x_j\in U$ do
4:     Calculate the sample influence with its salutary label, $\mathcal{I}(x_j,y_j^s)$, by Eq. (2).
5:   end for
6:   Select the $b$ samples with the highest influence as the salutary set $B=\{(x_j,y_j^s)\}_{j=1}^{b}$.
7:   Update the labeled training set as $L=L\cup B$.
8:   Remove the salutary set from the pool set as $U=U\setminus B$.
9:   Re-train the model on the updated labeled set $L$ and update $\hat{\theta}$.
10: end for

Output: The final optimized model parameters $\hat{\theta}$ after $R$ rounds of active learning.
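Below is a compact end-to-end sketch of Algorithm 1 for a binary problem, using a scikit-learn logistic regression and closed-form gradients/Hessian of the log loss; the regularization strength `lam`, the `{0, 1}` label assumption, and the omission of a bias term are simplifications we introduce for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam):
    # scikit-learn's C is the inverse of lam * N for the ERM objective in the text.
    return LogisticRegression(C=1.0 / (lam * len(X)), fit_intercept=False,
                              max_iter=1000).fit(X, y)

def influences(clf, X_tr, X_val, y_val, X_pool, y_cand, lam):
    """Vectorized Eq. (1) for binary logistic regression: I = -g_v^T H^{-1} g_j."""
    theta = clf.coef_.ravel()
    p_tr = sigmoid(X_tr @ theta)
    H = (X_tr * (p_tr * (1 - p_tr))[:, None]).T @ X_tr / len(X_tr) + lam * np.eye(len(theta))
    g_val = X_val.T @ (sigmoid(X_val @ theta) - y_val) / len(X_val)
    g_pool = X_pool * (sigmoid(X_pool @ theta) - y_cand)[:, None]   # per-sample gradients
    return -(g_pool @ np.linalg.solve(H, g_val))

def salutary_active_learning(X_l, y_l, X_pool, X_val, y_val, rounds=10, budget=10, lam=0.01):
    """Sketch of Algorithm 1 for a binary problem with labels {0, 1}."""
    pool = np.arange(len(X_pool))
    clf = fit(X_l, y_l, lam)
    for _ in range(rounds):
        # Salutary label = candidate label with the larger influence (Eq. (2)).
        inf_per_label = np.stack([influences(clf, X_l, X_val, y_val,
                                             X_pool[pool], np.full(len(pool), c), lam)
                                  for c in (0, 1)])                  # shape (2, |pool|)
        y_sal = inf_per_label.argmax(axis=0)
        best = np.argsort(inf_per_label.max(axis=0))[-budget:]       # top-b salutary influence
        X_l = np.vstack([X_l, X_pool[pool[best]]])
        y_l = np.concatenate([y_l, y_sal[best]])
        pool = np.delete(pool, best)                                 # U = U \ B
        clf = fit(X_l, y_l, lam)                                     # retrain on L ∪ B
    return clf
```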

5 Experiments

We demonstrate the performance of our method in this section. We first introduce the experimental setup, then report the algorithmic performance of extended active learning experiments, and finally provide various in-depth analyses of autonomous salutary labeling.

5.1 Experimental Setup

Datasets. We use seven tabular datasets[25] and two vision datasets in our experiments. The Bank[59] dataset has a total of 30,488 records of bank telemarketing phone calls. Each sample contains 51 features used to predict whether a client will subscribe to a term deposit. The Diabetic[20] dataset contains 1,151 retina images of patients for predicting whether a patient suffers from diabetes. We use 19 features extracted by Antal and Hajdu [1]. CelebA[56] has a total of 104,163 face images with 39 features per image; we treat the features as tabular data to predict whether the person is smiling. The Musk_v2[10] dataset contains 6,598 instances of molecules with 166 features representing their low-energy conformations, used to predict whether new molecules are musks or non-musks. The Electrical[2] dataset contains 10,000 points and 11 attributes, such as power consumption and price, in a 4-node star electrical grid system, used to predict whether the system is stable. The Wine[18] dataset consists of physicochemical test results for 4,898 variants of the Portuguese “Vinho Verde” wine. We use it to predict quality scores (from 3 to 9) based on 11 physicochemical attributes, such as acidity, density, and alcohol content. The Waveform[8] dataset contains 5,000 instances of waveform records, each described by 21 attributes; we classify each record into one of three waveform classes. MNIST[22] is a collection of 70,000 handwritten digit images (0 through 9). We use a ResNet-34[37], pre-trained on ImageNet[21], to extract 512 deep features for each image. CIFAR10[48] consists of 60,000 real-life images in 10 classes, with 6,000 images per class. As with MNIST, we extract 512 features with the pre-trained ResNet-34. We summarize the datasets used in the experiments in Appendix A.
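As a rough illustration of the vision-feature pipeline described above, the sketch below extracts 512-dimensional features with a torchvision ResNet-34 whose classification layer is replaced by the identity; the exact preprocessing used in the paper is not specified, so the transforms here are standard ImageNet ones and the weight-enum name depends on the torchvision version.

```python
import torch
import torchvision
from torchvision import transforms

# ImageNet pre-trained ResNet-34 with its final classifier removed: the output of the
# global-average-pool layer is a 512-dimensional feature vector per image.
resnet = torchvision.models.resnet34(weights=torchvision.models.ResNet34_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    """List of PIL images (e.g., MNIST/CIFAR10 samples) -> (N, 512) array of deep features."""
    batch = torch.stack([preprocess(img.convert("RGB")) for img in pil_images])
    return resnet(batch).cpu().numpy()
```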

Table 1: Test accuracy (%) of the initial model and all active learning methods after 10 querying rounds.

Method           | Electrical | Bank  | Diabetic | CelebA | Musk_v2 | Wine  | Waveform | CIFAR10 | MNIST
Init             | 63.85      | 65.89 | 56.43    | 73.33  | 73.45   | 44.76 | 79.11    | 46.74   | 77.75
Random           | 65.15      | 67.77 | 58.41    | 82.06  | 78.33   | 46.31 | 81.10    | 55.92   | 80.93
Entropy [38]     | 69.72      | 73.84 | 65.34    | 81.23  | 79.11   | 45.00 | 83.23    | 53.91   | 83.77
Margin [4]       | 69.72      | 73.84 | 65.34    | 81.23  | 79.11   | 47.30 | 82.26    | 56.95   | 83.72
Uncertainty [63] | 69.72      | 73.84 | 65.34    | 81.23  | 79.11   | 44.53 | 83.33    | 55.47   | 83.63
ISAL [55]        | 67.98      | 64.41 | 61.38    | 84.71  | 77.72   | 47.15 | 79.40    | 53.91   | 79.35
IBDS [14]        | 67.66      | 65.14 | 64.35    | 82.49  | 78.15   | 44.84 | 82.91    | 54.61   | 80.05
Ours             | 71.31      | 78.07 | 71.28    | 85.50  | 81.06   | 49.92 | 84.21    | 58.33   | 86.68

[Figure 2]

Baseline methods. We include six baseline methods for active learning. Random sampling is the most intuitive baseline, which randomly queries samples from the pool set. Entropy sampling[38] selects the unlabeled samples with the highest entropy of the current model's predictions. Margin sampling[4] ranks all pool samples by the margin between the highest and second-highest values of the soft-max logits predicted by the model. Uncertainty sampling[63] queries by classification uncertainty, which is determined by the probability of the predicted class as assigned by the classifier. We also include two influence-based active learning methods, both of which choose the unlabeled data with influence estimation. Influence Selection for Active Learning (ISAL)[55] uses base model predictions as pseudo-labels to compute influence. Influence-based Data Selection (IBDS)[14] uses an influence regressor, which is trained with labeled training data and their influences calculated with Eq. (1), to predict influence estimations for the unlabeled data. It is important to note that while all six baseline methods require human effort to annotate the queried unlabeled samples, our approach is completely human annotation-free.
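For reference, the uncertainty-style baselines reduce to simple functions of the predicted class probabilities; the sketch below (with our own naming and sign conventions, where a higher score means the sample is queried first) also makes it easy to see why the three criteria coincide on binary problems: each is a monotone function of the top-class probability.

```python
import numpy as np

def acquisition_scores(probs):
    """probs: (n_samples, n_classes) predicted probabilities from the current model."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)      # entropy sampling [38]
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = -(top2[:, 1] - top2[:, 0])                         # margin sampling [4] (negated)
    uncertainty = 1.0 - probs.max(axis=1)                       # least-confident sampling [63]
    return entropy, margin, uncertainty
```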

Experimental protocol and implementation details. In our experiments, we divide all datasets into a training set (60%), a validation set (20%), and a test set (20%), except for the Bank, CelebA, and Diabetic datasets, which have predefined splits for training, validation, and testing. The influence-based models, including ISAL, IBDS, and our method, exclusively utilize the validation set to compute influence estimations. This setup ensures that none of the methods access any information from the test set, keeping the testing data unseen to the models during evaluation. All experiments are repeated 5 times with different random seeds. In each run, we randomly choose 300 samples from the training set as the initial set and reserve the rest as the pool set.

We implement our method with Scikit-learn[66] and PyTorch[64]. All experiments are conducted on our workstation equipped with one 24GB NVIDIA TITAN RTX GPU. We choose a logistic regression classification model that satisfies the convexity requirement of the influence function. We initiate the process by training this model on the initial set. Subsequently, we conduct active learning for $R=10$ rounds. In each round, the model queries 10 samples from the pool set $U$. For the baseline methods, the ground truth labels of these selected samples are used, whereas our method automatically assigns salutary labels according to Eq. (2). After labeling, the queried data points are integrated into the labeled set for re-training the model. After each round of learning, we evaluate the model's performance by measuring prediction accuracy on the test set.

We set the query budget $b$ to 10 to maintain the distinction in performance between different models. Using a larger budget, such as 1% of the pool set, might cause the model to reach the performance ceiling on some datasets. We provide a detailed discussion and visualization of this in Appendix D.

5.2 Algorithmic Performance

We evaluate the performance of our salutary labeling method alongside the active learning baselines. Note that entropy, margin, and uncertainty sampling yield the same results for the same random initial/pool splits on binary classification datasets, as these three metrics induce the same ranking for 2-dimensional logits. As shown in Table 1, our method shows significant improvements over the initial model despite a limited querying budget and achieves the highest accuracy among all active learning methods. We notice that the two influence-based baselines do not perform well on datasets like Diabetic and Wine. This highlights the difficulty of estimating influence without access to label information, emphasizing the challenges and limitations of current influence-based approaches in handling complex datasets, where salutary labeling shows a clear advantage.

Moreover, we also present the accuracy change throughout the 10 learning rounds for all methods in Figure 2. Notably, our method demonstrates a significant and steady improvement in accuracy, particularly on challenging datasets like Bank, Waveform, and Wine, where the baselines show limited progress. This indicates the efficiency of salutary labeling in active learning, which is particularly noteworthy as it operates without the need for human annotation effort. This capability to function autonomously underscores the potential of our method for practical applications.

5.3 In-depth Explorations

We would like to answer the following questions for salutary labeling in our in-depth exploration:

  • The influence function has been demonstrated to be an accurate estimation of the leave-one-out influence[45], which estimates the impact of removing a training sample. In contrast, salutary labeling adapts this function to assess the effect of adding a sample unseen during model training, raising the question: How accurate is this estimation?

  • As salutary labeling does not require human annotation, there is no budget constraint. Is it possible to achieve better performance by training with more pool samples?

  • The influence function requires the learning model to be convex, which limits its application scenarios. Can we circumvent the convexity requirement of the influence function and extend salutary labeling to applications involving non-convex deep models?

Influence estimation vs. add-one-in retraining. We empirically verify how accurately the influence function estimates the impact of adding a new data point on three datasets, namely Diabetic, CelebA, and Bank. For each dataset, we compare the predicted influence estimations with the actual changes in loss observed after adding a sample and re-training the model. Using the initial set, we train a logistic regression model $\hat{\theta}$ and compute the influence $\mathcal{I}(x_j,y_j)$ for every data point in the pool set. We then individually add each pool sample $(x_j,y_j)$ to the training set and update the model parameters to $\hat{\theta}_j$. We compare the influence estimation $\mathcal{I}(x_j,y_j)$ with the validation loss difference after adding a sample, $\ell(V;\hat{\theta}_j)-\ell(V;\hat{\theta})$. As shown in Figure 3, the influence estimation for new samples does not perfectly match the actual loss change, likely because these samples were unseen during initial training. However, the influence estimations are highly correlated with the actual loss differences, as measured by Spearman's rank correlation coefficient. Therefore, the influence function still provides an accurate indication of each sample's relative impact.
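A sketch of this add-one-in check is given below; `fit(X, y)` returns a trained classifier and `influence_fn(clf, x, y)` is assumed to implement Eq. (1), so both are placeholders rather than the paper's exact implementation. Note that we record the loss reduction (the negative of the difference quoted above) so that beneficial samples carry positive values under both measures.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import log_loss

def add_one_in_check(fit, influence_fn, X_l, y_l, X_pool, y_pool, X_val, y_val):
    """Compare Eq. (1) estimates against the actual validation-loss change from
    adding each pool point and retraining; returns Spearman's rank correlation."""
    base = fit(X_l, y_l)
    base_loss = log_loss(y_val, base.predict_proba(X_val))
    estimated, actual = [], []
    for x_j, y_j in zip(X_pool, y_pool):
        estimated.append(influence_fn(base, x_j, y_j))
        retrained = fit(np.vstack([X_l, x_j[None]]), np.append(y_l, y_j))
        actual.append(base_loss - log_loss(y_val, retrained.predict_proba(X_val)))
    rho, _ = spearmanr(estimated, actual)
    return rho
```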

[Figure 3]

Salutary labeling with more data points. In Section 5.2, we demonstrated the efficacy of salutary labeling. The fact that salutary labeling requires zero human intervention allows our method to query even more unlabeled samples without incurring any annotation costs. Therefore, we conduct additional experiments to evaluate the effectiveness of our method with more pool samples. Following the setup described in Section 5.2, we split the data into an initial set for training the initial logistic regression model, along with a pool set, validation set, and test set. For each dataset, the model queries and automatically annotates 10 samples from the pool set with salutary labeling in each active learning iteration. We allow the model to query up to 50% of the samples from the pool set and choose the iteration with the best prediction accuracy on the validation set as the final model.

In addition to evaluating our salutary labeling, we also report the test accuracy obtained after training the model with all labeled data from both the initial and pool sets. This provides a reference point for the maximum achievable accuracy when the model is supervised by all available data. As demonstrated in Table 2, our method outperforms supervised learning on four datasets, which validates that salutary labels can indeed provide superior guidance compared to ground truth labels under certain conditions. On the Musk_v2[10], Wine[18], and Waveform[8] datasets, the fully supervised model leads our method by a narrow margin of less than 1%. On CIFAR10[48] and MNIST[22], our method trails the fully supervised model by approximately 3.5%, but it still boosts the accuracy by over 15% compared to the initial model. Notably, these gains are achieved without any human annotation, demonstrating the effectiveness of our approach in leveraging unlabeled data.

Table 2: Test accuracy (%) of the initial model, fully supervised training on all labeled data (Full Pool), and our method querying up to 50% of the pool set.

Method    | Electrical | Bank  | Diabetic | CelebA | Musk_v2 | Wine  | Waveform | CIFAR10 | MNIST
Init      | 63.85      | 65.89 | 56.43    | 73.33  | 73.45   | 44.76 | 79.11    | 46.74   | 77.75
Full Pool | 70.08      | 80.14 | 72.27    | 85.07  | 85.75   | 52.53 | 85.60    | 65.67   | 95.36
Ours      | 72.25      | 81.21 | 73.26    | 85.89  | 85.68   | 52.38 | 85.50    | 62.05   | 92.06

Salutary labeling for LLM fine-tuning. In the above experiments, we used a logistic regression model to fulfill the convexity requirement of the influence function. In this section, we aim to extend our salutary labeling method to practical applications with complex model structures. Specifically, we conduct active learning experiments on LLM fine-tuning with the RoBERTa[54] model on three datasets from the GLUE[76] benchmark, namely WNLI[50], MRPC[23], and RTE[6]. We simulate an active learning scenario for fine-tuning the RoBERTa model, denoted by $g\circ h$, where $g$ represents the transformer layers and $h$ represents the classification head. Following the setting of Section 5.1, we divide each dataset into an initial set, pool set, validation set, and test set.

During the whole training process, we fix the transformer layers $g$ in RoBERTa and fine-tune the non-convex classification head $h$. Initially, we train the model using the initial set. Subsequently, in each learning cycle, we use the 768-dimensional hidden state extracted by $g$, along with the predictions from $h$, to train a surrogate logistic regression model $h'(\cdot;\hat{\theta})$. This surrogate model is then used to identify and annotate 10 samples from the pool set, as detailed in Algorithm 1. The newly annotated samples are used to update the classification head $h$. We provide the training details in Appendix C.
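A rough sketch of this surrogate step with the Hugging Face transformers library is shown below; the model name `roberta-base`, the use of the <s> token's hidden state as the 768-dimensional feature, and the single-sentence encoding (GLUE pair tasks would pass sentence pairs to the tokenizer) are our assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")   # frozen transformer layers g
encoder.eval()

@torch.no_grad()
def encode(sentences, batch_size=32):
    """Map raw sentences to 768-d features using the frozen encoder g."""
    feats = []
    for i in range(0, len(sentences), batch_size):
        enc = tokenizer(sentences[i:i + batch_size], padding=True, truncation=True,
                        return_tensors="pt")
        feats.append(encoder(**enc).last_hidden_state[:, 0, :])   # <s> token hidden state
    return torch.cat(feats).numpy()

def fit_surrogate(sentences, labels):
    """Convex surrogate head h' trained on the currently labeled set; this is the model
    that drives the influence computation and salutary labeling of Algorithm 1."""
    return LogisticRegression(max_iter=1000).fit(encode(sentences), labels)
```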

As illustrated in Figure 4, our method outperforms all baseline approaches on all three tasks after 10 learning cycles. The performance advantage is consistent across most rounds, with detailed per-round results displayed in Figure 5 of Appendix C. These findings underscore the potential of our method in practical applications, highlighting the adaptability and effectiveness of our approach in real-world settings, even when the model is not strictly convex.

[Figure 4]

6 Conclusion

In this paper, we delved into the realm of active learning and proposed a novel concept called salutary labeling, which seamlessly merges the querying and annotating processes of active learning into a single autonomous step. Unlike traditional methods, our approach eliminates the need for human annotation; instead, it automatically assigns a salutary label, i.e., the label category that maximizes model performance. Technically distinct from conventional active learning approaches that rely on indirect measurements such as uncertainty and representativeness to select samples for labeling, we utilized the influence function to directly compute sample influence. However, a significant challenge arises when dealing with pool samples in active learning tasks, as label information may be unavailable. Our salutary labeling method adeptly overcomes this hurdle by evaluating the impact of each sample across all possible labels and assigning the label that generates the greatest positive influence. Extensive experimental results underscored the efficacy and advantages of our salutary labeling approach across various scenarios.

References

  • Antal and Hajdu [2014]Bálint Antal and András Hajdu.An ensemble-based system for automatic screening of diabetic retinopathy.Knowledge-Based Systems, 2014.
  • Arzamasov [2018]Vadim Arzamasov.Electrical Grid Stability Simulated Data .UCI Machine Learning Repository, 2018.
  • Bae etal. [2022]Juhan Bae, Nathan Ng, Alston Lo, Marzyeh Ghassemi, and RogerB Grosse.If influence functions are the answer, then what is the question?Advances in Neural Information Processing Systems, 2022.
  • Balcan etal. [2007]Maria-Florina Balcan, Andrei Broder, and Tong Zhang.Margin based active learning.In International Conference on Computational Learning Theory, 2007.
  • Basu etal. [2020]Samyadeep Basu, Phil Pope, and Soheil Feizi.Influence functions in deep learning are fragile.In International Conference on Learning Representations, 2020.
  • Bentivogli etal. [2017]Luisa Bentivogli, Ido Dagan, and Bernardo Magnini.The recognizing textual entailment challenges: Datasets and methodologies.In Handbook of Linguistic Annotation. 2017.
  • Biswas etal. [2023]Angona Biswas, NasimMd AbdullahAl, MdShahin Ali, Ismail Hossain, MdAzim Ullah, and Sajedul Talukder.Active learning on medical image.In Data Driven Approaches on Medical Imaging. 2023.
  • Breiman and Stone [1988]L.Breiman and C.J. Stone.Waveform Database Generator (Version 2).UCI Machine Learning Repository, 1988.
  • Chai etal. [2021]Junyi Chai, Hao Zeng, Anming Li, and EricWT Ngai.Deep learning in computer vision: A critical review of emerging techniques and application scenarios.Machine Learning with Applications, 2021.
  • Chapman and Jain [1994]David Chapman and Ajay Jain.Musk (Version 2).UCI Machine Learning Repository, 1994.
  • Chen etal. [2019]Pengfei Chen, BenBen Liao, Guangyong Chen, and Shengyu Zhang.Understanding and utilizing deep neural networks trained with noisy labels.In International Conference on Machine Learning, 2019.
  • Chen etal. [2021]Yuanyuan Chen, Boyang Li, Han Yu, Pengcheng Wu, and Chunyan Miao.Hydra: Hypergradient data relevance analysis for interpreting deep neural networks.In AAAI Conference on Artificial Intelligence, 2021.
  • Chen etal. [2023]Zizhang Chen, Peizhao Li, Hongfu Liu, and Pengyu Hong.Characterizing the influence of graph elements.In International Conference on Learning Representations, 2023.
  • Chhabra etal. [2024]Anshuman Chhabra, Peizhao Li, Prasant Mohapatra, and Hongfu Liu."What data benefits my classifier?" Enhancing model performance and interoperability through influence-based data selection.In International Conference on Learning Representations, 2024.
  • Cohn etal. [1996]DavidA Cohn, Zoubin Ghahramani, and MichaelI Jordan.Active learning with statistical models.Journal of Artificial Intelligence Research, 1996.
  • Coleman etal. [2019]Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, and Matei Zaharia.Selection via proxy: Efficient data selection for deep learning.arXiv preprint arXiv:1906.11829, 2019.
  • Cook and Weisberg [1980]R.Dennis Cook and Sanford Weisberg.Characterizations of an empirical influence function for detecting influential cases in regression.Technometrics, 1980.
  • Cortez etal. [2009]Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis.Wine Quality.UCI Machine Learning Repository, 2009.
  • Decencière etal. [2014]Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, John-Richard Ordóñez-Varela, Pascale Massin, Ali Erginay, Béatrice Charton, and Klein Jc.Feedback on a publicly distributed image database: The messidor database.Image Analysis & Stereology, 2014.
  • Decencière etal. [2014]Etienne Decencière, Xiwei Zhang, Guy Cazuguel, Bruno Lay, Béatrice Cochener, Caroline Trone, Philippe Gain, Richard Ordonez, Pascale Massin, Ali Erginay, and etal.Feedback on a publicly distributed image database: The messidor database.Image Analysis and Stereology, 2014.
  • Deng etal. [2009]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei.Imagenet: A large-scale hierarchical image database.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
  • Deng [2012]Li Deng.The MNIST database of handwritten digit images for machine learning research.IEEE Signal Processing Magazine, 2012.
  • Dolan and Brockett [2005]Bill Dolan and Chris Brockett.Automatically constructing a corpus of sentential paraphrases.In International Workshop on Paraphrasing, 2005.
  • Du etal. [2015]Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Wei Liu, Jialie Shen, and Dacheng Tao.Exploring representativeness and informativeness for active learning.IEEE Transactions on Cybernetics, 2015.
  • Dua etal. [2017]Dheeru Dua, Casey Graff, etal.UCI Machine Learning Repository.URL http://archive.ics.uci.edu/ml, 2017.
  • Epifano etal. [2023]JacobR Epifano, RaviP Ramachandran, AaronJ Masino, and Ghulam Rasool.Revisiting the fragility of influence functions.Neural Networks, 2023.
  • Fang etal. [2020]Minghong Fang, NeilZhenqiang Gong, and Jia Liu.Influence function based data poisoning attacks to top-n recommender systems.In International World Wide Web Conference, 2020.
  • Freytag etal. [2014]Alexander Freytag, Erik Rodner, and Joachim Denzler.Selecting influential examples: Active learning with expected model output changes.In European Conference on Computer Vision, 2014.
  • Gal and Ghahramani [2016]Yarin Gal and Zoubin Ghahramani.Dropout as a bayesian approximation: Representing model uncertainty in deep learning.In International Conference on Machine Learning, 2016.
  • Giordano etal. [2019]Ryan Giordano, William Stephenson, Runjing Liu, Michael Jordan, and Tamara Broderick.A swiss army infinitesimal jackknife.In International Conference on Artificial Intelligence and Statistics, 2019.
  • Gong etal. [2022]Xiuwen Gong, Dong Yuan, and Wei Bao.Partial label learning via label influence function.In International Conference on Machine Learning, 2022.
  • Gu etal. [2021]Bin Gu, Zhou Zhai, Cheng Deng, and Heng Huang.Efficient active learning by querying discriminative and representative samples and fully exploiting unlabeled data.IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • Guo [2010]Yuhong Guo.Active instance sampling via matrix partition.Advances in Neural Information Processing Systems, 2010.
  • Han etal. [2020]Xiaochuang Han, ByronC. Wallace, and Yulia Tsvetkov.Explaining black box predictions and unveiling data artifacts through influence functions.In Annual Meeting of the Association for Computational Linguistics, 2020.
  • Harris etal. [2020]CharlesR. Harris, K.Jarrod Millman, StéfanJ. vander Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, NathanielJ. Smith, Robert Kern, Matti Picus, Stephan Hoyer, MartenH. van Kerkwijk, Matthew Brett, Allan Haldane, JaimeFernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and TravisE. Oliphant.Array programming with NumPy.Nature, 585, 2020.
  • Hasan and Roy-Chowdhury [2015]Mahmudul Hasan and AmitK Roy-Chowdhury.Context aware active learning of activity recognition models.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.
  • He etal. [2016]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
  • Holub etal. [2008]Alex Holub, Pietro Perona, and MichaelC Burl.Entropy-based active learning for object recognition.In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008.
  • Huang etal. [2010]Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou.Active learning by querying informative and representative examples.Advances in Neural Information Processing Systems, 2010.
  • Huang etal. [2018]Sheng-Jun Huang, Jia-Wei Zhao, and Zhao-Yang Liu.Cost-effective training of deep cnns with active model adaptation.In International Conference on Knowledge Discovery & Data Mining, 2018.
  • Huggins etal. [2016]Jonathan Huggins, Trevor Campbell, and Tamara Broderick.Coresets for scalable bayesian logistic regression.Advances in Neural Information Processing Systems, 2016.
  • Hüllermeier and Beringer [2005]Eyke Hüllermeier and Jürgen Beringer.Learning from ambiguously labeled examples.Intelligent Data Analysis, 2005.
  • Joshi etal. [2009]AjayJ. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos.Multi-class active learning for image classification.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009.
  • Kee etal. [2018]Seho Kee, Enrique DelCastillo, and George Runger.Query-by-committee improvement with diversity and density in batch active learning.Information Sciences, 2018.
  • Koh and Liang [2017]PangWei Koh and Percy Liang.Understanding black-box predictions via influence functions.In International Conference on Machine Learning, 2017.
  • Koh etal. [2019]Pang WeiW Koh, Kai-Siang Ang, Hubert Teo, and PercyS Liang.On the accuracy of influence functions for measuring group effects.Advances in Neural Information Processing Systems, 2019.
  • Kong etal. [2021]Shuming Kong, Yanyan Shen, and Linpeng Huang.Resolving training biases via influence-based data relabeling.In International Conference on Learning Representations, 2021.
  • Krizhevsky and Hinton [2009]Alex Krizhevsky and Geoffrey Hinton.Learning multiple layers of features from tiny images.Master’s thesis, University of Toronto, 2009.
  • Lakshminarayanan etal. [2017]Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell.Simple and scalable predictive uncertainty estimation using deep ensembles.Advances in Neural Information Processing Systems, 2017.
  • Levesque etal. [2011]HectorJ. Levesque, Ernest Davis, and L.Morgenstern.The winograd schema challenge.In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011.
  • Lewis and Catlett [1994]DavidD. Lewis and Jason Catlett.Heterogeneous uncertainty sampling for supervised learning.In Machine Learning Proceedings. 1994.
  • Li and Liu [2022]Peizhao Li and Hongfu Liu.Achieving fairness at no utility cost via data reweighing with influence.In International Conference on Machine Learning, 2022.
  • Li etal. [2024]Xingjian Li, Pengkun Yang, Yangcheng Gu, Xueying Zhan, Tianyang Wang, Min Xu, and Chengzhong Xu.Deep active learning with noise stability.In AAAI Conference on Artificial Intelligence, 2024.
  • Liu etal. [2019]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov.Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019.
  • Liu etal. [2021]Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, and Conghui He.Influence selection for active learning.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • Liu etal. [2015]Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.Deep learning face attributes in the wild.In International Conference on Computer Vision, 2015.
  • Lyu etal. [2020]Gengyu Lyu, Songhe Feng, Yidong Li, Yi Jin, Guojun Dai, and Congyan Lang.Hera: partial label learning by combining heterogeneous loss with sparse and low-rank regularization.ACM Transactions on Intelligent Systems and Technology, 2020.
  • Ma etal. [2023]Ying Ma, YuZhang, ArunKumar Sangaiah, Ming Yan, Guoqi Li, and Tian Wang.Active learning for name entity recognition with external knowledge.ACM Transactions on Asian and Low-Resource Language Information Processing, 2023.
  • Moro etal. [2014]Sérgio Moro, Paulo Cortez, and Paulo Rita.A data-driven approach to predict the success of bank telemarketing.Decision Support Systems, 2014.
  • Munteanu etal. [2018]Alexander Munteanu, Chris Schwiegelshohn, Christian Sohler, and David Woodruff.On coresets for logistic regression.Advances in Neural Information Processing Systems, 2018.
  • Natarajan etal. [2013]Nagarajan Natarajan, InderjitS Dhillon, PradeepK Ravikumar, and Ambuj Tewari.Learning with noisy labels.Advances in Neural Information Processing Systems, 2013.
  • Nguyen and Smeulders [2004]HieuT. Nguyen and Arnold Smeulders.Active learning using pre-clustering.In International Conference on Machine Learning, 2004.
  • Nguyen etal. [2022]Vu-Linh Nguyen, Mohammad Shaker, and Eyke Hüllermeier.How to measure uncertainty in uncertainty sampling for active learning.Machine Learning, 2022.
  • Paszke etal. [2019]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, LuFang, Junjie Bai, and Soumith Chintala.Pytorch: An imperative style, high-performance deep learning library.Advances in Neural Information Processing Systems, 2019.
  • Paul etal. [2021]Mansheej Paul, Surya Ganguli, and GintareKarolina Dziugaite.Deep learning on a data diet: Finding important examples early in training.Advances in Neural Information Processing Systems, 2021.
  • Pedregosa etal. [2011]F.Pedregosa, G.Varoquaux, A.Gramfort, V.Michel, B.Thirion, O.Grisel, M.Blondel, P.Prettenhofer, R.Weiss, V.Dubourg, J.Vanderplas, A.Passos, D.Cournapeau, M.Brucher, M.Perrot, and E.duch*esnay.Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 2011.
  • Pruthi etal. [2020]Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan.Estimating training data influence by tracing gradient descent.Advances in Neural Information Processing Systems, 2020.
  • Rastegarpanah etal. [2019]Bashir Rastegarpanah, KrishnaP Gummadi, and Mark Crovella.Fighting fire with fire: Using antidote data to improve polarization and fairness of recommender systems.In ACM International Conference on Web Search and Data Mining, 2019.
  • Ren etal. [2021]Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, BrijB. Gupta, Xiaojiang Chen, and Xin Wang.A survey of deep active learning.ACM Computing Surveys, 2021.
  • Roth and Small [2006]Dan Roth and Kevin Small.Margin-based active learning for structured output spaces.In European Conference on Machine Learning, 2006.
  • Sener and Savarese [2018]Ozan Sener and Silvio Savarese.Active learning for convolutional neural networks: A core-set approach.In International Conference on Learning Representations, 2018.
  • Settles and Craven [2008]Burr Settles and Mark Craven.An analysis of active learning strategies for sequence labeling tasks.In Conference on Empirical Methods in Natural Language Processing, 2008.
  • Seung etal. [1992]HSebastian Seung, Manfred Opper, and Haim Sompolinsky.Query by committee.In Annual Workshop on Computational Learning Theory, 1992.
  • Song etal. [2022]Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee.Learning from noisy labels with deep neural networks: A survey.IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Song etal. [2023]Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee.Learning from noisy labels with deep neural networks: A survey.IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • Wang etal. [2018]Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and SamuelR Bowman.Glue: A multi-task benchmark and analysis platform for natural language understanding.arXiv preprint arXiv:1804.07461, 2018.
  • Wang etal. [2024]Haoran Wang, Qiuye Jin, Shiman Li, Siyu Liu, Manning Wang, and Zhijian Song.A comprehensive survey on deep active learning in medical image analysis.Medical Image Analysis, 2024.
  • Wang etal. [2016]Keze Wang, Dongyu Zhang, YaLi, Ruimao Zhang, and Liang Lin.Cost-effective active learning for deep image classification.IEEE Transactions on Circuits and Systems for Video Technology, 2016.
  • Wei etal. [2015]Kai Wei, Rishabh Iyer, and Jeff Bilmes.Submodularity in data subset selection and active learning.In International Conference on Machine Learning, 2015.
  • Wes McKinney [2010]Wes McKinney.Data Structures for Statistical Computing in Python.In Python in Science Conference, 2010.
  • Wolf etal. [2020]Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, TevenLe Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and AlexanderM. Rush.Transformers: State-of-the-art natural language processing.In Conference on Empirical Methods in Natural Language Processing, 2020.
  • Wu etal. [2022]Jiaxi Wu, Jiaxin Chen, and DiHuang.Entropy-based active learning for object detection with progressive diversity constraint.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Xu etal. [2003]Zhao Xu, Kai Yu, Volker Tresp, Xiaowei Xu, and ji*zhi Wang.Representative sampling for text classification using support vector machines.In Advances in Information Retrieval, 2003.
  • Yang and Yu [2023]Jinghan Yang and Lequan Yu.Relabel minimal training subset to flip a prediction.arXiv preprint arXiv:2305.12809, 2023.
  • Yang etal. [2015]YiYang, Zhigang Ma, Feiping Nie, Xiaojun Chang, and AlexanderG Hauptmann.Multi-class active learning by uncertainty sampling with diversity maximization.International Journal of Computer Vision, 2015.
  • Yoo and Kweon [2019]Donggeun Yoo and InSo Kweon.Learning loss for active learning.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • Zhan etal. [2022]Xueying Zhan, Qingzhong Wang, Kuan-hao Huang, Haoyi Xiong, Dejing Dou, and AntoniB Chan.A comparative survey of deep active learning.arXiv preprint arXiv:2203.13450, 2022.
  • Zhang etal. [2022]Zhisong Zhang, Emma Strubell, and Eduard Hovy.A survey of active learning for natural language processing.In Conference on Empirical Methods in Natural Language Processing, 2022.

Appendix

Appendix A Dataset

We summarize key statistics of the nine datasets used in the experiments of Section 5.2 in Table 3. For each dataset, we conduct 5 runs with different random seeds. In each run, we fix the validation set and test set, randomly choose 300 data points from the training samples as the initial set, and reserve the remaining training samples as the pool set; a minimal sketch of this split protocol is provided after Table 3. All datasets are publicly available under the CC BY 4.0 license.

Table 3: Key statistics of the nine datasets used in Section 5.2.

Dataset | # of Training | # of Val | # of Test | # of Classes | # of Dim | Data Type
Bank[59] | 18,292 | 6,098 | 6,098 | 2 | 51 | tabular
Diabetic[20] | 950 | 100 | 100 | 2 | 19 | tabular
CelebA[56] | 62,497 | 20,833 | 20,833 | 2 | 39 | tabular
Musk_v2[10] | 3,958 | 1,320 | 1,320 | 2 | 166 | tabular
Electrical[2] | 6,000 | 2,000 | 2,000 | 2 | 12 | tabular
Waveform[8] | 3,000 | 1,000 | 1,000 | 3 | 21 | tabular
Wine[18] | 3,896 | 1,300 | 1,300 | 7 | 11 | tabular
MNIST[22] | 54,000 | 6,000 | 10,000 | 10 | 512 | vision
CIFAR10[48] | 45,000 | 5,000 | 10,000 | 10 | 512 | vision
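The following sketch illustrates the split protocol; it is not the released implementation. The function name, the fixed random_state for the validation/test split, and the 60/20/20 ratios are illustrative assumptions (the actual proportions differ per dataset, as Table 3 shows).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def make_splits(X, y, run_seed, init_size=300):
    """Illustrative split: fixed validation/test sets, a random 300-sample
    initial labeled set, and the remaining training data as the pool."""
    # Validation and test sets are fixed across runs (the fixed random_state and
    # the 60/20/20 ratio are assumptions; actual proportions vary per dataset).
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # The initial set is re-drawn in each of the 5 runs with a different seed.
    rng = np.random.default_rng(run_seed)
    init_idx = rng.choice(len(X_tr), size=init_size, replace=False)
    pool_mask = np.ones(len(X_tr), dtype=bool)
    pool_mask[init_idx] = False

    init_set = (X_tr[init_idx], y_tr[init_idx])
    pool_set = (X_tr[pool_mask], y_tr[pool_mask])  # pool labels are never shown to the model
    return init_set, pool_set, (X_val, y_val), (X_te, y_te)

# Five experimental runs with different random seeds (X, y as NumPy arrays):
# runs = [make_splits(X, y, run_seed=s) for s in range(5)]
```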

Appendix B Detailed algorithmic performance with standard deviation

We do not include the standard deviation in Table 1 for better visualization. Here we report the full experimental results with standard deviation in Table 4, covering the active learning experiments in Section 5.2 and the LLM fine-tuning experiments in Section 5.3. Our salutary labeling method outperforms all baseline methods across multiple datasets in the standard active learning setting, for both convex logistic regression and non-convex LLM fine-tuning, without requiring any human annotation. Notably, our method not only achieves the highest final prediction accuracy on all datasets but also maintains relatively small standard deviations, indicating consistent performance across experimental runs. These results highlight the efficacy of our method and its potential in practical applications.

Table 4: Accuracy (%) with standard deviation over 5 runs.

Method | Electrical | Bank | Diabetic | CelebA | Musk_v2 | Wine
Init | 63.85 | 65.89 | 56.43 | 73.33 | 73.45 | 44.76
Random | 65.15±0.40 | 67.77±0.61 | 58.41±0.93 | 82.06±0.13 | 78.33±1.30 | 46.31±0.16
Entropy[38] | 69.72±0.55 | 73.84±1.33 | 65.34±0.11 | 81.23±2.11 | 79.11±0.60 | 45.00±0.41
Margin[4] | 69.72±0.55 | 73.84±1.33 | 65.34±0.11 | 81.23±2.11 | 79.11±0.60 | 47.30±0.25
Uncertainty[63] | 69.72±0.55 | 73.84±1.33 | 65.34±0.11 | 81.23±2.11 | 79.11±0.60 | 44.53±0.67
ISAL[55] | 67.98±0.74 | 64.41±0.54 | 61.38±0.80 | 84.71±0.41 | 77.72±0.14 | 47.15±0.65
IBDS[14] | 67.66±0.94 | 65.14±0.15 | 64.35±0.46 | 82.49±0.36 | 78.15±0.64 | 44.84±0.64
Ours | 71.31±0.04 | 78.07±0.92 | 71.28±1.68 | 85.50±0.12 | 81.06±0.39 | 49.92±0.61

Method | Waveform | CIFAR10 | MNIST | WNLI | MRPC | RTE
Init | 79.11 | 46.74 | 77.75 | 40.69 | 60.13 | 52.87
Random | 81.10±0.39 | 55.92±0.52 | 80.93±0.30 | 40.77±1.33 | 61.51±1.31 | 55.23±1.76
Entropy[38] | 83.23±0.44 | 53.91±0.46 | 83.77±0.13 | 42.25±3.04 | 63.95±0.40 | 54.73±1.19
Margin[4] | 82.26±0.59 | 56.95±0.64 | 83.72±0.38 | 41.32±1.75 | 63.89±0.38 | 54.78±1.24
Uncertainty[63] | 83.33±0.23 | 55.47±1.01 | 83.63±0.50 | 41.31±1.64 | 63.93±0.42 | 54.99±1.48
ISAL[55] | 79.40±0.80 | 53.91±0.87 | 79.35±1.87 | 46.01±2.39 | 60.21±0.45 | 53.54±1.02
IBDS[14] | 82.91±0.41 | 54.61±0.60 | 80.05±2.28 | 45.98±1.32 | 63.88±1.02 | 55.95±1.02
Ours | 84.21±0.40 | 58.33±0.33 | 86.68±0.42 | 55.86±0.66 | 68.59±0.53 | 59.44±0.17


Appendix C Training details for active LLM fine-tuning

We conduct our LLM fine-tuning experiments on three datasets from the GLUE[76] benchmark, namely WNLI[50], MRPC[23], and RTE[6]. WNLI is a reading comprehension dataset in which sentence pairs are constructed by replacing an ambiguous pronoun in the original sentence with each possible referent; the task is to predict whether the sentence with the substituted pronoun is entailed by the original sentence. MRPC is a corpus of sentence pairs automatically extracted from online news sources, and the task is to predict whether the sentences in each pair are semantically equivalent. RTE is constructed from news and Wikipedia text, and the task is to classify each sample into one of two classes assigned by human annotators.

For each dataset, we randomly select 100 samples from the predefined training split to form the initial set and use the remaining data as the pool set. Half of the predefined validation split serves as the validation set for salutary labeling, and the other half is used as the test set. We use the Hugging Face[81] implementation of RoBERTa[54], denoted by $g \circ h$. We freeze the transformer layers $g$ and fine-tune the classification head $h$, a two-layer multilayer perceptron with dropout before each layer and a $\tanh$ activation between the two layers. The model is first trained on the initial set; in each of the 10 active learning cycles, 10 pool samples are queried and annotated. For sampling methods such as entropy, margin, and uncertainty, the output of $h$ determines the pool-set queries. For influence-based methods, including ISAL[55], IBDS[14], and our method, we train a surrogate logistic regression model $h'(\cdot\,;\hat{\theta})$ on the 768-dimensional hidden states extracted by $g$, using the predictions from $h$ as labels; a minimal sketch of this surrogate construction is given below. This surrogate model is then used to compute the influence function and query pool samples for model re-training. We compute the test accuracy after each round and plot the results in Figure 5.
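The sketch below illustrates the surrogate construction described above; it is not the released code. The checkpoint name "roberta-base", the use of the first-token embedding, and the function names are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Frozen RoBERTa body g ("roberta-base" and first-token pooling are illustrative assumptions).
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def extract_features(sent_a, sent_b):
    """Encode sentence pairs with the frozen body g and return 768-d hidden states."""
    enc = tokenizer(sent_a, sent_b, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state   # (batch, seq_len, 768)
    return hidden[:, 0, :].cpu().numpy()        # <s> token embedding

def fit_surrogate(features, head_predictions):
    """Fit the convex surrogate h' on g's embeddings, using h's predicted labels as
    targets, so that influence scores can be computed on a strictly convex model."""
    surrogate = LogisticRegression(max_iter=1000)
    return surrogate.fit(features, head_predictions)

# Example usage (hypothetical variables):
# feats = extract_features(premises, hypotheses)
# h_prime = fit_surrogate(feats, head_preds)   # head_preds come from the fine-tuned head h
```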

[Figure 5: Test accuracy after each query round for active LLM fine-tuning on WNLI, MRPC, and RTE.]

Appendix D Choice of query budget $b$

We set a relatively small query budget $b$ to maintain clear performance distinctions between different models. In our preliminary exploration, we found that a larger budget, such as 1% of the pool set, allows models to reach the performance ceiling on datasets like CelebA[56], Waveform[8], and Electrical[2]. As shown in Figure 6, such a budget causes different active learning methods to perform very similarly after several rounds. Consequently, we opted for a smaller budget in our experiments to better evaluate the distinct capabilities of each model.

Appendix E Broader Impact and Limitations

This paper presents work whose goal is to advance the field of Machine Learning. We broaden the scope of active learning with a novel approach called salutary labeling, which integrates the querying and annotating processes of active learning into a single, autonomous step. The proposed salutary labeling method eliminates human annotation and maximizes the benefit obtained from queried data. Beyond the impact mentioned above, there are other potential societal consequences of our work, none of which we feel must be specifically highlighted here.

One potential limitation of our method stems from the influence function, a key component of salutary labeling. The influence function requires the model to be convex, so that its Hessian matrix is positive definite and invertible after training to convergence. Despite ongoing discussions[5, 3, 26] on the accuracy of the influence function for non-convex models, many works have successfully applied the influence function across various applications[27, 34, 13]. In this work, we adopt the same strategy as Li and Liu [52], which fits a surrogate convex model on the embeddings extracted by the non-convex model, and achieve promising results as illustrated in Section 5.3. Further exploring the application of the influence function to non-convex models is beyond the focus of this study, and we defer this topic to future work.
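To make this requirement concrete, recall the standard up-weighting approximation of Koh and Liang [2017], where $\hat{\theta}_{\epsilon,(x,c)}$ denotes the model retrained with pool sample $x$ and candidate label $c$ up-weighted by $\epsilon$, $L_{\mathrm{val}}$ is the validation loss, and $\ell$ is the training loss; the notation and sign convention here are ours and may differ slightly from the main text:
\[
\left.\frac{d\, L_{\mathrm{val}}\big(\hat{\theta}_{\epsilon,(x,c)}\big)}{d\epsilon}\right|_{\epsilon=0}
  = -\,\nabla_{\theta} L_{\mathrm{val}}(\hat{\theta})^{\top}
    H_{\hat{\theta}}^{-1}\,\nabla_{\theta}\,\ell\big((x,c),\hat{\theta}\big),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2}\,\ell\big(z_i,\hat{\theta}\big).
\]
The inverse Hessian $H_{\hat{\theta}}^{-1}$ is exactly where the positive-definiteness requirement enters; it is well defined for a strictly convex surrogate such as $\ell_2$-regularized logistic regression, and the candidate label whose estimated change in validation loss is most negative is the most beneficial choice for $x$.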

Appendix F Code and Reproducibility

The code implementing our method will be released soon.

All experiments were conducted on a Linux workstation running Ubuntu 20.04.6 LTS with an Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz. Experiments requiring a GPU (such as multi-class influence calculation and LLM fine-tuning) were conducted on a single NVIDIA TITAN RTX with 24 GB of VRAM and CUDA version 11.4.

All code is written in Python and uses standard libraries such as NumPy[35], scikit-learn[66], PyTorch[64], and Pandas[80]. Detailed package information will be provided in the code repository.
