Events
Maintaining Privacy of the Minority without Ignoring It: Differentially Private Oversampling for Imbalanced Learning
Speaker: Rachel Cummings
Location: Courant Institute / Warren Weaver Hall
Date: Wednesday, April 19, 2023
The problem of learning from imbalanced datasets, where the label classes are not equally represented, arises often in practice. A widely used approach to address this problem is to rely on pre-processing techniques to amplify the minority class, e.g., by reusing or resampling the existing minority instances. However, when the training data are sensitive and privacy is a concern, this amplification also increases privacy leakage for members of the minority class. Therefore, oversampling pre-processing steps must be designed with downstream privacy in mind. Our work provides insights into the implications of pre-processing before the use of DP algorithms, which we hope will be of broad interest.
In this paper, we first quantify the privacy degradation from running a commonly used Synthetic Minority Oversampling TEchnique (SMOTE) as a pre-processing step before applying any differentially private algorithm. We show that this degradation increases exponentially in the dimensionality of the data, which harms the accuracy of downstream private learning at a corresponding rate. We then present a differentially private variant of this algorithm (DP-SMOTE) that guarantees formal privacy for the minority class, even during oversampling. Finally, we show empirically that, when applied as a pre-processing step to highly imbalanced data before private regression, DP-SMOTE outperformed SMOTE and other baseline methods, due primarily to the privacy degradation incurred by the non-private pre-processing methods.
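For readers unfamiliar with the pre-processing step the talk analyzes, the sketch below illustrates standard (non-private) SMOTE: each synthetic minority point is an interpolation between a real minority sample and one of its k nearest minority-class neighbors. The function name and parameters are illustrative assumptions, not taken from the paper, and the differentially private variant (DP-SMOTE) presented in the talk is not reproduced here.

```python
# A minimal sketch of standard (non-private) SMOTE oversampling on a numeric
# minority-class matrix X_min. Names and defaults are illustrative only.
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate n_synthetic points by interpolating each chosen minority
    sample toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n, d = X_min.shape
    # Pairwise distances among minority samples (brute force, for clarity).
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude the point itself
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors

    synthetic = np.empty((n_synthetic, d))
    for i in range(n_synthetic):
        base = rng.integers(n)                 # pick a real minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                     # interpolation coefficient in [0, 1]
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Example: oversample a toy 2-D minority class before any downstream learner.
X_min = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.0],
                  [0.1, 0.9], [0.5, 0.5], [0.9, 0.6]])
X_new = smote_oversample(X_min, n_synthetic=10, k=3, rng=0)
```

Because each synthetic point is a deterministic-looking function of real minority records, releasing (or training on) the oversampled data can reveal those records; this is the leakage the talk quantifies and that DP-SMOTE is designed to bound.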