1. Introduction
Recently, with the rapid evolution of machine learning technology and the expansion of data due to developments in information technology, it has become increasingly important that companies determine how they might utilize big data effectively and efficiently. However, big data often include personal and private information; thus, careless utilization of such sensitive information may lead to unexpected sanctions.
To overcome this problem, many privacy-preserving technologies have been proposed for the utilization of data in ways that nevertheless maintain privacy. Typical privacy-preserving technologies include data anonymization (e.g., [1,2,3]) and secure computation (e.g., [4]). This paper focuses on the relationship between data anonymization and decision trees, a typical machine learning method. Historically, data anonymization research has progressed from pseudonymization to k-anonymity [1], ℓ-diversity [2,5], and t-closeness [3] and is continually growing. Currently, many researchers are focused on membership privacy and differential privacy [6].
In [7,8], the authors pointed out that the decision tree is not robust to homogeneity attacks and background knowledge attacks; they then demonstrated the application of k-anonymity and ℓ-diversity to strengthen security. However, their proposals could not satisfy the requirements of differential privacy. In this paper, we discuss how to prevent the leakage of private information from a learned decision tree, achieving differential privacy using data anonymization techniques such as k-anonymity and ℓ-diversity.
To prevent leakage of private information, we propose the application of k-anonymity and sampling to a random decision tree, which is a variant of the decision tree proposed by Fan et al. [9]. Interestingly, we show in this paper that this modification results in differential privacy. The essential idea is as follows: instead of adding Laplace noise, as in [10,11] (please see [12] for a survey of differentially private (random) decision trees), we propose enhancing the security of a random decision tree by sampling and then removing any leaf containing fewer records than some threshold k; this rule is applied to every leaf of the tree. The basic concept is outlined in [13]. Our proposed model, in which k-anonymity is achieved after sampling, provides differential privacy, as in [13].
As mentioned above, researchers have shifted their attention to differential privacy rather than k-anonymity and ℓ-diversity. In fact, building upon the work outlined in [14], decision trees that satisfy differential privacy use techniques that are typical of differential privacy, such as the exponential, Gaussian, and Laplace mechanisms [10,11,15]. That is, all of these algorithms achieve differential privacy by adding some kind of noise. Our approach is very different. That said, the basic technique involves applying k-anonymity to each leaf in the random decision tree; this is similar to pruning, a widely accepted technique used to avoid overfitting.
The remainder of this paper is organized as follows. Section 2 introduces relevant preliminary information, e.g., anonymization methods and decision trees, and demonstrates how strategies for attacking data anonymization can be converted into attacks targeting decision trees. In Section 3, we demonstrate how much security and accuracy can be achieved in practice when the random decision tree is strengthened using a method that is similar to kanonymity. In Section 4, the potential advantages of our proposal are discussed. Finally, the paper is concluded in Section 5, which includes a brief discussion of potential future research topics.
2. Preliminaries
2.1. Data Anonymization
When providing personal data to a third party, it is necessary to modify data to preserve user privacy. Here, modifying the user’s data (i.e., a particular record) such that an adversary cannot re-identify a specific individual is referred to as data anonymization. As a basic technique and to prevent re-identification, an identifier, e.g., a person’s name or employee number, is deleted or replaced with a pseudonym ID by the data holder. This process is referred to as pseudonymization. However, simply modifying identifiers does not ensure the preservation of privacy. In some cases, individuals can be re-identified by a combination of features (i.e., a quasi-identifier); thus, it is necessary to modify both the identifier and the quasi-identifier to reduce the risk of re-identification. In most cases, the identifiers themselves are not used for data analysis; thus, removing identifiers does not significantly sacrifice the quality of the dataset. However, if we modify quasi-identifiers in the same manner, although the data may become anonymous, they will also become useless. A typical anonymization technique for quasi-identifiers is “roughening” the numerical values.
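As a toy illustration of “roughening”, the sketch below generalizes a record's quasi-identifiers. The attribute layout and generalization rules (zip-code truncation, 10-year age bands, nationality suppression) are hypothetical choices of ours, picked only to mirror the zip/age/nationality examples used later:

```python
def generalize_record(record):
    """Roughen quasi-identifiers: truncate the zip code, bucket the age,
    and suppress the nationality. Illustrative only; real anonymization
    pipelines choose generalization hierarchies per attribute."""
    zip_code, age, nationality = record
    band = (age // 10) * 10
    return (zip_code[:3] + "**",        # keep only the zip-code prefix
            f"{band}-{band + 9}",       # 10-year age band
            "*")                        # suppress nationality entirely

print(generalize_record(("13068", 29, "American")))  # ('130**', '20-29', '*')
```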
2.1.1. Attacks Targeting Pseudonymization
A simple attack is possible against pseudonymized data from which identifiers, e.g., names, have been removed. In this attack, the attacker uses the quasi-identifier of a user u. If this attacker obtains the pseudonymized data, by searching for user u’s quasi-identifier in the pseudonymized data, the attacker can obtain sensitive information about u. For example, if the attacker obtains the dataset shown in Table 1 and knows friend u’s zip code is 13068, their age is 29, and their nationality is American, then, by searching the dataset, the attacker can identify that user u is suffering from some heart-related disease. This attack is referred to as the uniqueness attack.
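The uniqueness attack amounts to a simple lookup. A minimal sketch with made-up records (the attribute order and values are hypothetical, not the contents of Table 1):

```python
# Pseudonymized records: (pseudonym, zip, age, nationality, disease).
pseudonymized = [
    ("p01", "13068", 29, "American", "heart disease"),
    ("p02", "13053", 31, "Japanese", "flu"),
]

def uniqueness_attack(data, zip_code, age, nationality):
    """Return the sensitive values of every record matching the quasi-identifier."""
    return [rec[4] for rec in data if rec[1:4] == (zip_code, age, nationality)]

# If exactly one record matches, the victim's sensitive attribute is exposed.
print(uniqueness_attack(pseudonymized, "13068", 29, "American"))  # ['heart disease']
```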
2.1.2. k-Anonymity
k-Anonymity is a countermeasure used to prevent uniqueness attacks.
In k-anonymity, features are divided into quasi-identifiers and sensitive information, and the quasi-identifiers are modified such that each combination of quasi-identifier values is shared by no fewer than k users. Table 2 shows anonymized data that have been k-anonymized ($k=4$) using the quasi-identifiers zip code, age, and nationality.
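This property is mechanical to verify. A minimal sketch of a k-anonymity check, assuming records are tuples and quasi-identifiers are given by column indices (the toy table is ours, not Table 2):

```python
from collections import Counter

def is_k_anonymous(records, quasi_idx, k):
    """Check that every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(rec[i] for i in quasi_idx) for rec in records)
    return all(count >= k for count in groups.values())

# Toy table: (zip, age band, nationality, disease); quasi-identifiers are columns 0-2.
table = [("130**", "<30", "*", "heart"), ("130**", "<30", "*", "flu"),
         ("130**", "<30", "*", "flu"),   ("130**", "<30", "*", "cancer")]
print(is_k_anonymous(table, quasi_idx=(0, 1, 2), k=4))  # True
```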
2.1.3. Homogeneity Attack
At a cursory glance, k-anonymity appears to be secure; however, even if k-anonymity is employed, a homogeneity attack is still feasible. This attack becomes possible when all the sensitive information in an anonymized group is the same. Taking the k-anonymized dataset shown on the right side of Figure 1 as an example and assuming the presence of the attacker on the left side of Figure 1, we can make the following statements. Here, the attacker has the necessary information (zip, age) = (13053, 37); all of the sensitive information in the matching records corresponds to cancer. The attacker can therefore deduce that Bob has cancer.
2.1.4. Background Knowledge Attack
Homogeneity attacks reveal a problem when records with the same quasi-identifier have the same sensitive information; however, a previous study [3] also argued that a problem exists even in cases in which the records are not the same. The k-anonymized dataset on the right side of Figure 2 shows four records with quasi-identifiers (130, <30, *) and two types of sensitive information, i.e., (heart, flu). Here, we can assume that the attacker has background knowledge of the data similar to that shown on the left side of Figure 2. In this case, both heart disease and flu are certainly possible; however, if the probability of Japanese individuals experiencing heart disease is extremely low, Umeko is estimated to instead have the flu. Thus, it must be acknowledged that k-anonymity does not provide a high degree of security.
ℓ-diversity: ℓ-diversity is a measure used to counteract homogeneity attacks. A k-anonymized table is said to be ℓ-diverse if each similar class of quasi-identifiers has at least ℓ “well-represented” values for the sensitive information. There are several different interpretations of the term “well-represented” [2,3]. In this paper, we adopt distinct ℓ-diversity. Distinct ℓ-diversity means that there are at least ℓ distinct values for the sensitive information in each similar class of quasi-identifiers. Table 3 shows anonymized data that are two-diverse ($\ell =2$).
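Distinct ℓ-diversity can be checked the same way, per class of quasi-identifiers. Again, a sketch assuming tuple records with column indices; the toy table is ours:

```python
from collections import defaultdict

def is_distinct_l_diverse(records, quasi_idx, sens_idx, l):
    """Each quasi-identifier class must contain at least l distinct sensitive values."""
    classes = defaultdict(set)
    for rec in records:
        classes[tuple(rec[i] for i in quasi_idx)].add(rec[sens_idx])
    return all(len(values) >= l for values in classes.values())

table = [("130**", "<30", "heart"),  ("130**", "<30", "flu"),
         ("148**", ">=40", "cancer"), ("148**", ">=40", "cancer")]
# The second class has only one sensitive value, so the table is not 2-diverse.
print(is_distinct_l_diverse(table, quasi_idx=(0, 1), sens_idx=2, l=2))  # False
```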
2.2. Decision Trees
2.2.1. Basic Decision Trees
Decision trees are supervised learning methods that are primarily used for classification tasks, and a tree structure is created while learning from data (Figure 3). When predicting the label y of $\mathbf{x}$, the process begins from the root of the tree, and the corresponding leaf is searched for while referring to each feature of $\mathbf{x}$. Finally, through this referral process, y is predicted.
The label determined by the leaf is derived from the dataset D used to generate the tree structure. In other words, after the tree structure is created, for each element $({\mathbf{x}}_{i},{y}_{i})$ in dataset D, the corresponding leaf $\mathtt{leaf}$ is found, and the value of ${y}_{i}$ is stored. If ${y}_{i}\in Y=\{0,1\}$, then in each $\mathtt{leaf}$, the number of ${y}_{i}$ labeled 0 and the number of ${y}_{i}$ labeled 1 are preserved. More precisely, $[n\left({\mathtt{leaf}}^{\left(1\right)}\right),n\left({\mathtt{leaf}}^{\left(0\right)}\right)]$ are preserved for each leaf $\mathtt{leaf}$, where $n\left({\mathtt{leaf}}^{\left(0\right)}\right)$ and $n\left({\mathtt{leaf}}^{\left(1\right)}\right)$ represent the numbers of data points with label y that have values of 0 and 1, respectively. Table 4 shows the notations used in the paper.
For a given input $\mathbf{x}$ to be predicted, we first search for the corresponding leaf; the prediction is 1 if $\frac{n(\mathtt{leaf}^{(1)})}{n(\mathtt{leaf}^{(0)})+n(\mathtt{leaf}^{(1)})}>\frac{1}{2}$, and 0 otherwise. Here, the threshold $1/2$ can be set flexibly depending on where the decision tree is applied, and when providing the learned decision tree to a third party, it is possible to pass $n(\mathtt{leaf}^{(0)})$ and $n(\mathtt{leaf}^{(1)})$ together for each leaf $\mathtt{leaf}$. In this paper, we consider the security of decision trees in such situations.
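The leaf-level prediction rule can be sketched as follows; the empty-leaf fallback is our own choice, which the text leaves open:

```python
def predict_leaf_label(n_leaf_0, n_leaf_1, threshold=0.5):
    """Predict 1 when the fraction of positive training records in the leaf
    exceeds the threshold (1/2 in the text); otherwise predict 0."""
    total = n_leaf_0 + n_leaf_1
    if total == 0:
        return 0  # empty leaf: default to label 0 (a design choice of ours)
    return 1 if n_leaf_1 / total > threshold else 0

print(predict_leaf_label(3, 7))  # 7/10 > 1/2 -> 1
```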
Generally, the deeper the tree structure, the more likely the tree is to overfit; thus, trees are frequently pruned. A similar technique is employed in this paper to preserve data privacy.
2.2.2. Random Decision Tree
The random decision tree was proposed by Fan et al. [9]. Notably, in their approach, the tree is randomly generated without depending on the data. Furthermore, sufficient performance can be ensured through the appropriate selection of parameters.
The shape of a normal (not random) decision tree depends on the data used; the eventual shape may cause private information to be leaked from the tree. However, the random decision tree avoids this leakage due to its random generation. Therefore, its performance is expected to match the performance of other proposed security methods.
A random decision tree is described in Algorithms 1 and 2. Algorithm 1 shows that the generated tree does not depend on dataset D, except for $n_i(\mathtt{leaf}^{(y)})$, which is created by $\mathtt{UpdateStatistics}$. Here, $n_i(\mathtt{leaf}^{(y)})$ denotes a 2D array that represents the number of records with each label y reaching each leaf $\mathtt{leaf}$ of tree $T_i$. However, the preservation of privacy is necessary because $n_i(\mathtt{leaf}^{(y)})$ depends on D.
As a general rule for parameter selection, the depth of each tree is set to half the dimension of the feature vectors, and the number of trees is set to 10.
2.3. Security Definitions
Part of this work adopts differential privacy to evaluate the security and efficiency of the model.
Definition 1.
A randomized algorithm A satisfies $(\epsilon,\delta)$-DP if, for any pair of neighboring datasets D and ${D}^{\prime}$ and any $O\subseteq \mathcal{O}$:
$$\Pr[A(D)\in O]\le e^{\epsilon}\cdot \Pr[A({D}^{\prime})\in O]+\delta,$$
where $\mathcal{O}$ is the range of A.
When studying differential privacy, it is assumed that the attacker knows all the elements in D. However, such an assumption may not be realistic. Taking this into consideration, the following definition is given in [13]:
Definition 2
(DP under sampling, $(\beta,\epsilon,\delta)$-DPS). An algorithm A satisfies $(\beta,\epsilon,\delta)$-DPS if and only if the algorithm ${A}^{\beta}$ satisfies $(\epsilon,\delta)$-DP, where ${A}^{\beta}$ denotes the algorithm that first samples each tuple in the input dataset independently with probability β and then runs A.
In other words, the definition describes the output of A by inputting ${D}^{\prime}$, which is a result of sampling dataset D. Hence, the attacker may know D but not ${D}^{\prime}$.
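The wrapper $A^{\beta}$ can be sketched directly. The helper name and the use of `len` as a stand-in for the underlying algorithm A are our own illustrations:

```python
import random

def sample_then_run(algorithm, dataset, beta, seed=None):
    """A^beta: include each tuple of the input dataset independently with
    probability beta, then run the underlying algorithm A on the sample."""
    rng = random.Random(seed)
    sampled = [rec for rec in dataset if rng.random() < beta]
    return algorithm(sampled)

# Example: `len` plays the role of A, showing how many records survive sampling.
print(sample_then_run(len, list(range(1000)), beta=0.1, seed=42))
```

Note that the attacker may know the full dataset D, but not which tuples survived the sampling step.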
Algorithm 1 Training the random decision tree [9] 
Input: Training data $D=\{({\mathbf{x}}_{1},{y}_{1}),\dots ,({\mathbf{x}}_{n},{y}_{n})\}$, the set of features $X=\{{F}_{1},\dots ,{F}_{m}\}$, and the number of random decision trees to be generated, ${N}_{t}$.
Output: Random decision trees ${T}_{1},\dots ,{T}_{{N}_{t}}$.

Algorithm 2 Classifying the random decision tree [9] 
Require: $\{{T}_{1},\dots ,{T}_{{N}_{t}}\}, \mathbf{x}$
1: return ${\sum}_{i=1}^{{N}_{t}}{n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)/\left({\sum}_{{y}^{\prime}}{\sum}_{i=1}^{{N}_{t}}{n}_{i}\left({\mathtt{leaf}}^{\left({y}^{\prime}\right)}\right)\right)$ for each $y\in Y$, where $\mathtt{leaf}$ denotes the leaf corresponding to $\mathbf{x}$.
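To make Algorithms 1 and 2 concrete, the following is a minimal Python sketch of a random decision tree over categorical features: the tree shape is drawn at random before any data are seen, training merely fills in the per-leaf counts $n_i(\mathtt{leaf}^{(y)})$ (the role of $\mathtt{UpdateStatistics}$), and classification averages the normalized counts over all trees. The class layout, feature-domain dictionaries, and parameter names are our own simplifications, not taken from [9]:

```python
import random
from collections import defaultdict

class RandomTree:
    """Random decision tree over categorical features: at each internal node a
    not-yet-used feature is picked at random, independently of the data."""

    def __init__(self, feature_domains, max_depth, rng):
        self.domains = feature_domains                        # {feature: values}
        self.counts = defaultdict(lambda: defaultdict(int))   # leaf path -> {y: n}
        self.root = self._grow(list(feature_domains), max_depth, rng, path=())

    def _grow(self, unused, depth, rng, path):
        if depth == 0 or not unused:
            return ("leaf", path)            # the path from the root names the leaf
        f = rng.choice(unused)
        rest = [g for g in unused if g != f]
        children = {v: self._grow(rest, depth - 1, rng, path + ((f, v),))
                    for v in self.domains[f]}
        return ("node", f, children)

    def _leaf_of(self, x):
        node = self.root
        while node[0] == "node":
            _, f, children = node
            node = children[x[f]]
        return node[1]

    def update_statistics(self, x, y):
        """Record one training example in its leaf (Algorithm 1's statistics)."""
        self.counts[self._leaf_of(x)][y] += 1

    def posterior(self, x, labels):
        cnt = self.counts[self._leaf_of(x)]
        return {y: cnt[y] for y in labels}

def classify(trees, x, labels):
    """Algorithm 2: normalize the leaf counts summed over all trees."""
    totals = {y: sum(t.posterior(x, labels)[y] for t in trees) for y in labels}
    z = sum(totals.values()) or 1
    return {y: totals[y] / z for y in labels}

rng = random.Random(0)
domains = {"height": ["tall", "short"], "age": ["young", "old"]}
trees = [RandomTree(domains, max_depth=1, rng=rng) for _ in range(10)]
for t in trees:
    t.update_statistics({"height": "tall", "age": "old"}, 1)
    t.update_statistics({"height": "short", "age": "young"}, 0)
print(classify(trees, {"height": "tall", "age": "old"}, labels=[0, 1]))
```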
2.4. Attacks Targeting the Decision Tree
Generally, a decision tree is constructed from a given dataset; however, it is also possible to partially reconstruct the dataset using the decision tree. This kind of attack is considered in [7]. In this section, we explain the workings of such attacks.
Figure 4 shows the reconstruction of a dataset from the decision tree shown in Figure 3. As shown, it is impossible to reconstruct the original data completely from a binary tree model; however, it is possible to extract some of the data. By exploiting this essential property, it is possible to mount some attacks against reconstructed data, as discussed in Section 2.1. In the following, taking Figure 4 as an example, we present a uniqueness attack, a homogeneous attack, and a background knowledge attack.
Uniqueness attack: in the dataset recovered from the model (Figure 4), there is one user whose height is greater than 170 and who is under 15 years of age, as shown in the sixth row (the second red box in the figure), where the relevant leaf is converted via $[{n}_{3}\left({\mathtt{leaf}}^{\left(1\right)}\right),{n}_{3}\left({\mathtt{leaf}}^{\left(0\right)}\right)]=[1,0]$; thus, it is possible to target this user with a uniqueness attack.
Homogeneous attack: similarly, in the fourth and fifth rows (see the 1st red box shown on the figure), which are converted from the relevant leaf via $[{n}_{2}\left({\mathtt{leaf}}^{\left(1\right)}\right),{n}_{2}\left({\mathtt{leaf}}^{\left(0\right)}\right)]=[0,2]$, the height $<170$, weight $\ge 60$, and health status are the same (i.e., “unhealthy”); therefore, a homogeneous attack is possible.
Background knowledge attack: Similarly, in the seventh, eighth, and ninth rows, there are three users whose data match in both height $\ge 170$ and age $\ge 15$. Among these users, one is healthy (yes) and two are unhealthy (no). As an attacker, we can consider the following:
(Background knowledge of user A) height: 173, age: 33, healthy;
(Background knowledge of target user B) height: 171, age: 19.
In this case, if the adversary knows that user A is healthy, he/she can identify that user B is unhealthy.
3. Proposal: Applying $k$-Anonymity to a (Random) Decision Tree
3.1. Construction of $k\mathtt{AnonToRdt}$
In this section, we demonstrate how one might achieve differential privacy from k-anonymity. More specifically, we present a proposal based on a random decision tree, a variant of the decision tree outlined in Section 2.2.2. The proposal is shown as Algorithm 3. It differs from the original random decision tree in the following ways:
(Pruning): for some threshold k, if there exists a tree ${T}_{i}$, a leaf $\mathtt{leaf}$, and a label y, satisfying ${n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)<k$, then let ${n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)$ equal 0.
(Sampling): training ${T}_{i}$ using ${D}_{i}$, which is the result obtained after sampling dataset D with probability $\beta$, independently for each tree ${T}_{i}$.
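The two modifications can be sketched in isolation. The dictionary-of-dictionaries leaf-count representation is ours, not notation from Algorithm 3:

```python
import random

def k_anonymize_leaf_counts(counts, k):
    """Pruning step: zero out any per-leaf, per-label count below threshold k."""
    return {leaf: {y: (n if n >= k else 0) for y, n in per_label.items()}
            for leaf, per_label in counts.items()}

def sample_dataset(dataset, beta, rng):
    """Sampling step: each tree is trained on an independent beta-sample of D."""
    return [rec for rec in dataset if rng.random() < beta]

counts = {"leaf_a": {0: 12, 1: 3}, "leaf_b": {0: 1, 1: 9}}
print(k_anonymize_leaf_counts(counts, k=5))
# {'leaf_a': {0: 12, 1: 0}, 'leaf_b': {0: 0, 1: 9}}
```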
3.2. Security: Strongly Safe k-Anonymization Ensures Differential Privacy
In the field of data anonymization, the authors of [13] demonstrated that performing k-anonymization after sampling achieves differential privacy; our proposal develops this core principle. Below, we outline the items necessary to evaluate data security.
Definition 3
(Strongly safe k-anonymization algorithm [13]). Suppose that g is a function $g:\mathcal{D}\to \mathcal{T}$, where $\mathcal{D}$ and $\mathcal{T}$ are the domain and range of g, respectively, and that g does not depend on $D\subseteq \mathcal{D}$, i.e., g is constant. The strongly safe k-anonymization algorithm A with input $D\subseteq \mathcal{D}$ is defined as follows:
Compute ${Y}_{1}=\{g(\mathbf{x}) \mid \mathbf{x}\in D\}$;
Compute ${Y}_{2}=\{(y,\,|\{\mathbf{x}\in D:g(\mathbf{x})=y\}|):y\in {Y}_{1}\}$;
For each element $(y,c)$ in ${Y}_{2}$, if $c<k$, then replace the element with $(y,0)$; the result is ${Y}_{3}$, which is output.
Algorithm 3 Proposed training process 
Input: Training data $D=\{({\mathbf{x}}_{1},{y}_{1}),\dots ,({\mathbf{x}}_{n},{y}_{n})\}$, the set of features $X=\{{F}_{1},\dots ,{F}_{m}\}$, and the number of random decision trees to be generated, ${N}_{t}$.
Output: Random decision trees ${T}_{1},\dots ,{T}_{{N}_{t}}$.

Assume that $f(j;n,\beta )$ denotes the binomial probability mass function, i.e., the probability of succeeding j times in n attempts, where the probability of success in one attempt is $\beta $. Furthermore, the cumulative distribution function is expressed as follows in Equation (2):
$${f}^{\prime}(j;n,\beta )=\sum _{i=0}^{j}f(i;n,\beta ).$$
Theorem 1
(Theorem 5 in [13]). Any strongly safe k-anonymization algorithm satisfies $(\beta,\epsilon,\delta)$-DPS for any $0<\beta <1$, $\epsilon \ge -\ln(1-\beta)$, and
$$\delta =d(k,\beta ,\epsilon)=\max_{n:\,n\ge \lceil \frac{k}{\gamma}-1\rceil}\ \sum _{j>\gamma n}^{n}f(j;n,\beta ),$$
where $\gamma =\frac{{e}^{\epsilon}-1+\beta}{{e}^{\epsilon}}$.
Equation (3) shows the relationship between $\beta $ and $\delta $ in determining the value of $\u03f5$ when k is fixed.
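The quantity $d(k,\beta,\epsilon)$ can be evaluated numerically. The sketch below implements the Theorem 1 formula directly; the maximum over n is searched only up to a finite bound n_max, a practical cutoff we introduce (the binomial tail shrinks as n grows, so the maximum sits near the lower bound on n; [13] does not prescribe a cutoff):

```python
from math import comb, ceil, exp

def binom_pmf(j, n, beta):
    """f(j; n, beta): probability of j successes in n Bernoulli(beta) trials."""
    return comb(n, j) * beta**j * (1 - beta)**(n - j)

def d(k, beta, eps, n_max=500):
    """delta = d(k, beta, eps): the maximal binomial tail over admissible n."""
    gamma = (exp(eps) - 1 + beta) / exp(eps)
    n_min = max(ceil(k / gamma - 1), 1)
    best = 0.0
    for n in range(n_min, n_max + 1):
        # Tail probability P[Binomial(n, beta) > gamma * n].
        tail = sum(binom_pmf(j, n, beta) for j in range(int(gamma * n) + 1, n + 1))
        best = max(best, tail)
    return best

# A larger k forces larger leaf counts and hence yields a smaller delta.
print(d(k=5, beta=0.1, eps=0.2), d(k=20, beta=0.1, eps=0.2))
```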
Let us consider the case where record $(\mathbf{x},y)\in D$ is applied to a random decision tree ${T}_{i}$, and the leaf reached is denoted by $\mathtt{leaf}$. If a function ${g}_{i}$ is defined as
$${g}_{i}\left((\mathbf{x},y)\right)=(\mathtt{leaf},y),$$
then ${g}_{i}$ in Equation (4) is constant, that is, it does not depend on D. Therefore, ${n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)$, which is generated using ${g}_{i}$, can be regarded as the output of a strongly safe k-anonymization algorithm; consequently, Theorem 1 can be applied.
However, Theorem 1 can be applied in its original form only when there is a single tree ${T}_{i}$, i.e., when the number of trees ${N}_{t}=1$. Theorem 2 applies when ${N}_{t}>1$.
Theorem 2
(Theorem 3.16 in [17]). Assume ${A}_{i}$ is an $({\epsilon}_{i},{\delta}_{i})$-DP algorithm for $1\le i\le {N}_{t}$. Then, the algorithm
$$A\left(D\right):=({A}_{1}\left(D\right),{A}_{2}\left(D\right),\dots ,{A}_{{N}_{t}}\left(D\right))$$
satisfies $({\sum}_{i=1}^{{N}_{t}}{\epsilon}_{i},\ {\sum}_{i=1}^{{N}_{t}}{\delta}_{i})$-DP.
In Algorithm 3, each ${T}_{i}$ is selected randomly, and sampling is performed for each tree. Hence, the following conclusion can be reached.
Corollary 1.
The proposed algorithm satisfies $(\beta ,{N}_{t}\epsilon,{N}_{t}\delta )$-DPS for any $0<\beta <1$, $\epsilon \ge -\ln(1-\beta)$, and
$$\delta =d(k,\beta ,\epsilon)=\max_{n:\,n\ge \lceil \frac{k}{\gamma}-1\rceil}\ \sum _{j>\gamma n}^{n}f(j;n,\beta ),$$
where $\gamma =\frac{{e}^{\epsilon}-1+\beta}{{e}^{\epsilon}}$.
Table 5 shows the relationship, derived from Equation (6), between $\beta $ and ${N}_{t}\delta $ in determining the value of ${N}_{t}\u03f5$ when k and ${N}_{t}$ are fixed. The cells in the table represent the approximate value of ${N}_{t}\delta $. For k and ${N}_{t}$, we chose $(k,{N}_{t})=(5,10),(10,10)$, and $(20,10)$, as shown in Table 5.
3.3. Experiments on $k\mathtt{AnonToRdt}$
The efficiency of the proposal was verified using the Nursery dataset [18], the Adult dataset [19], the Mushroom dataset [20], and the Loan dataset [21]. The characteristics of each dataset are as follows.
The Nursery dataset contains 12,960 records with eight features, with a maximum of five values for each feature;
The Adult dataset contains 48,842 records with 14 features. Here, each feature has more possible values, and there are more records than in the Nursery dataset;
The Mushroom dataset contains 8124 records with 22 features. Compared to the above two datasets, there are more features, but the number of records is small. In general, applying kanonymity to this kind of dataset is challenging.
The Loan dataset [21] contains 9578 records with 13 features. We used the “not.fully.paid” feature to label classes. There are four binary attributes and seven numerical attributes in this dataset.
Appendix A contains the evaluation of the basic decision tree with each dataset. Firstly, $\delta ,{N}_{t},k$ were set to $\delta =0.4$, ${N}_{t}\in \{8,\dots ,11\}$, and $k\in \{5,10\}$.
The results from the Nursery dataset obtained via $k\mathtt{AnonToRdt}$ with these parameters are as follows:
The accuracy of the original decision tree was 0.933, as shown in Table A1.
As shown in Table 6 (a), for a tree depth equal to four, the accuracy obtained was 0.84, which was inferior to that of the original decision tree.
As shown in Table 6 (a), for a tree depth equal to five, the accuracy decreased drastically as k increased.
The results from the Adult dataset obtained via $k\mathtt{AnonToRdt}$ with the same $\delta ,{N}_{t},k$ were as follows. (This dataset contains numerical values; to handle them, a threshold t was chosen randomly from the feature’s domain, and the tree produced two children for ≤t and >t.)
The accuracy of the original decision tree was 0.855, as shown in Table A1.
As shown in Table 6 (b), the achieved accuracy when $k=5$ was 0.817.
The results from the Mushroom dataset obtained via $k\mathtt{AnonToRdt}$ with the same $\delta ,{N}_{t},k$ were as follows:
The accuracy of the original decision tree was 0.995, as shown in Table A1.
As shown in Table 6 (c), the achieved accuracy when $k=5$ was 0.98.
The results from the Loan dataset obtained via $k\mathtt{AnonToRdt}$ with the same $\delta ,{N}_{t},k$ were as follows:
The accuracy of the original decision tree was 0.738 in our experiment.
As shown in Table 6 (d), the achieved accuracy was around 0.84.
In summary, the accuracy achieved by $k\mathtt{AnonToRdt}$ was slightly inferior to that of the original decision tree.
Changing sampling rate β: To achieve secure differential privacy, the sampling rate should remain small. Maintaining values of ${N}_{t}=10$, ${N}_{t}\u03f5=2.0$, which were small enough for our practical application, Table 7 shows how the values of ${N}_{t}\delta $ changed according to the sampling rate $\beta $. As shown, for some parameters, the accuracy of the proposed method was relatively good even when ${N}_{t}\u03f5=2.0$ and ${N}_{t}\delta $ were small.
4. Discussion
In another highly relevant study [10], Jagannathan et al. proposed a variant of the random decision tree that achieves differential privacy. The accuracy of their proposal is shown in Figures 1 and 2 of [10] for the same datasets with the same class labels (in [10], instead of five class labels, three were used for the Nursery dataset, i.e., some similar labels were merged); their method resulted in similar precision. Because our proposal employs sampling, it is limited by the size of the dataset being utilized; the smaller the dataset (e.g., the Mushroom dataset), the lower the resulting accuracy. However, it must be noted that their approach is very different from ours: Laplace noise is added instead of pruning and sampling. Notably, within their proposal, the released count is
$${n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)+\text{Laplace noise}$$
for all trees ${T}_{i}$, all leaves $\mathtt{leaf}$, and all labels y. Even in this context, if ${n}_{i}\left({\mathtt{leaf}}^{\left(y\right)}\right)$ is small for a certain ${T}_{i}$, $\mathtt{leaf}$, and y, it may be regarded almost as a personal record. A good general approach to handling such cases is to remove the rare records, i.e., to “remove the leaves containing fewer records”. This is a broadly accepted data anonymization technique [22] that is commonly used to avoid legal difficulties. Our proposal shows that pruning and sampling can be combined to ensure differential privacy. If rare sensitive records need to be removed, our method may therefore represent an excellent option.
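For contrast, the [10]-style release of a leaf count can be sketched as follows. The Laplace sampler below is standard inverse-CDF sampling; the scale value is purely illustrative and is not the budget calibration used in [10]:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_leaf_count(n, scale, rng):
    """Release n(leaf^(y)) + Laplace noise instead of pruning small leaves.
    The scale would be derived from the privacy budget; 2.0 is illustrative."""
    return n + laplace_noise(scale, rng)

rng = random.Random(0)
print(noisy_leaf_count(12, scale=2.0, rng=rng))
```

Note that even a noisy count near zero still reveals that a leaf exists; our pruning-and-sampling route instead removes such rare leaves outright.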
5. Conclusions
In this paper, we aimed to show the close relationship between the security of data anonymization and decision trees. Specifically, we showed how to obtain a differentially private (random) decision tree from k-anonymity. Our proposal consists of applying sampling and k-anonymity to the original random decision tree method, which results in differential privacy. Compared to existing schemes, the advantage of our proposal is its ability to achieve differential privacy without adding Laplace or Gaussian noise, which provides trained decision trees with a new route to differential privacy. We believe that, in addition to random decision trees, other similar algorithms can be augmented to achieve differential privacy for general decision trees. In future studies, we will also explore the differential privacy of federated learning decision trees by extending the proposed method.
Author Contributions
Conceptualization, A.W. and R.N.; methodology, A.W., R.N. and L.W.; software, A.W. and R.N.; formal analysis, A.W., R.N. and L.W.; investigation, A.W., R.N. and L.W.; resources, R.N.; data curation, A.W.; writing—original draft preparation, A.W., R.N. and L.W; writing—review and editing, L.W.; supervision, R.N.; project administration, R.N. and L.W.; funding acquisition, R.N. and L.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by JST CREST Grant Number JPMJCR21M1, Japan.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author. The source code used in this study is available on request from the corresponding author since January 2025.
Acknowledgments
We would like to thank Ryousuke Wakabayashi and Yusaku Ito for their technical assistance with the experiments.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Experiments on Attacks against the Decision Tree
We used three datasets to evaluate the vulnerability of decision trees to uniqueness and homogeneous attacks: the Nursery dataset [18], the Adult dataset [19], and the Mushroom dataset [20]. In these experiments, we used Python 3 and the $\mathtt{sklearn}$ library to train the decision trees.
In the experiments, the tree depths were set to four, five, six, seven, and eight. We divided each dataset into a training set and an evaluation set. The training set, which was used to train the decision tree, contained 80% of the records in the dataset. Here, the decision tree was trained 10 times, and averages of the following numbers were computed.
${\mathcal{H}}_{\mathtt{leaf}}$: the number of leaves that can be identified by a homogeneous attack, that is, the number of leaves $\mathtt{leaf}$ for which $\exists {y}^{*}\in Y,\ n\left({\mathtt{leaf}}^{\left({y}^{*}\right)}\right)\ge 2$ and $\forall y\in Y\setminus \{{y}^{*}\},\ n\left({\mathtt{leaf}}^{\left(y\right)}\right)=0$;
${\mathcal{H}}_{u}$: the number of users who can be identified by a homogeneous attack, i.e.,
$$\mathcal{H}_u=\sum _{\mathtt{leaf}}n\left({\mathtt{leaf}}^{\left({y}^{*}\right)}\right),$$
where the sum is taken over the leaves that suffer the homogeneous attack.
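Counting ${\mathcal{H}}_{\mathtt{leaf}}$ and ${\mathcal{H}}_{u}$ from the stored leaf counts can be sketched as follows (the dictionary representation of leaf counts is our own):

```python
def homogeneity_exposure(leaf_counts, labels):
    """Count leaves (H_leaf) and users (H_u) exposed to a homogeneous attack:
    a leaf is vulnerable when one label y* has >= 2 records and all others 0."""
    h_leaf, h_u = 0, 0
    for counts in leaf_counts.values():
        nonzero = [y for y in labels if counts.get(y, 0) > 0]
        if len(nonzero) == 1 and counts[nonzero[0]] >= 2:
            h_leaf += 1
            h_u += counts[nonzero[0]]
    return h_leaf, h_u

leaves = {"a": {0: 2, 1: 0}, "b": {0: 3, 1: 1}, "c": {0: 0, 1: 5}}
print(homogeneity_exposure(leaves, labels=[0, 1]))  # leaves a and c: (2, 7)
```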
Note that in a uniqueness attack, the number of leaves that can be identified is equal to the number of identifiable users. Table A1 shows the experimental results of the homogeneous attack. Regarding the homogeneous attack, even if the tree depth is small, information can be leaked in all datasets. In addition, susceptibility to homogeneous attacks increases as the tree depth increases, as shown in Figure A1.
Table A1. Number of users (${\mathcal{H}}_{u}$) and leaves (${\mathcal{H}}_{\mathtt{leaf}}$) that can be identified by homogeneity attacks.
Tree  Nursery  Adult  Mushroom  

Depth  $\mathtt{ACC}$  ${\mathcal{H}}_{\mathbf{u}}$  ${\mathcal{H}}_{\mathtt{leaf}}$  $\mathtt{ACC}$  ${\mathcal{H}}_{\mathbf{u}}$  ${\mathcal{H}}_{\mathtt{leaf}}$  $\mathtt{ACC}$  ${\mathcal{H}}_{\mathbf{u}}$  ${\mathcal{H}}_{\mathtt{leaf}}$ 
4  0.863  3965.5  2  0.843  215.2  2.4  0.979  3293.6  9 
5  0.880  5217  4.9  0.851  829.5  7.1  0.980  3568.4  12 
6  0.888  5747.1  9.9  0.853  1621.7  18.3  0.995  6122.6  16 
7  0.921  6837.4  19.5  0.855  1957.4  35.2  1.000  6499  20 
8  0.933  7912.7  35.7  0.855  2316.1  60.8  1.000  6499  20 
Figure A1. Tree depth and the proportion of users who can be identified by a homogeneous attack.
References
 Sweeney, L. k-Anonymity: A Model for Protecting Privacy. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2002, 10, 557–570. [Google Scholar] [CrossRef]
 Machanavajjhala, A.; Gehrke, J.; Kifer, D.; Venkitasubramaniam, M. ℓ-Diversity: Privacy Beyond k-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, Atlanta, GA, USA, 3–8 April 2006; IEEE Computer Society: Washington, DC, USA, 2006; p. 24. [Google Scholar] [CrossRef]
 Li, N.; Li, T.; Venkatasubramanian, S. t-Closeness: Privacy Beyond k-Anonymity and ℓ-Diversity. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, 15–20 April 2007; IEEE Computer Society: Washington, DC, USA, 2007; pp. 106–115. [Google Scholar] [CrossRef]
 Yao, A.C. How to Generate and Exchange Secrets (Extended Abstract). In Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Toronto, ON, Canada, 27–29 October 1986; IEEE Computer Society: Washington, DC, USA, 1986; pp. 162–167. [Google Scholar] [CrossRef]
 Praveena Priyadarsini, R.; Sivakumari, S.; Amudha, P. Enhanced ℓ-Diversity Algorithm for Privacy Preserving Data Mining. In Digital Connectivity—Social Impact, Proceedings of the 51st Annual Convention of the Computer Society of India, CSI 2016, Coimbatore, India, 8–9 December 2016; Subramanian, S., Nadarajan, R., Rao, S., Sheen, S., Eds.; Springer: Singapore, 2016; pp. 14–23. [Google Scholar] [CrossRef]
 Stadler, T.; Oprisanu, B.; Troncoso, C. Synthetic Data – Anonymisation Groundhog Day. In Proceedings of the 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, 10–12 August 2022; USENIX Association: Berkeley, CA, USA, 2022; pp. 1451–1468. [Google Scholar]
 Friedman, A.; Wolff, R.; Schuster, A. Providing k-anonymity in data mining. VLDB J. 2008, 17, 789–804. [Google Scholar] [CrossRef]
 Ciriani, V.; di Vimercati, S.D.C.; Foresti, S.; Samarati, P. k-Anonymous Data Mining: A Survey. In Privacy-Preserving Data Mining—Models and Algorithms; Aggarwal, C.C., Yu, P.S., Eds.; Advances in Database Systems; Springer: Berlin/Heidelberg, Germany, 2008; Volume 34, pp. 105–136. [Google Scholar] [CrossRef]
 Fan, W.; Wang, H.; Yu, P.S.; Ma, S. Is random model better? On its accuracy and efficiency. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, FL, USA, 19–22 December 2003; IEEE Computer Society: Washington, DC, USA, 2003; pp. 51–58. [Google Scholar] [CrossRef]
 Jagannathan, G.; Pillaipakkamnatt, K.; Wright, R.N. A Practical Differentially Private Random Decision Tree Classifier. Trans. Data Priv. 2012, 5, 273–295. [Google Scholar]
 Fletcher, S.; Islam, M.Z. A Differentially Private Random Decision Forest Using Reliable SignaltoNoise Ratios. In AI 2015: Advances in Artificial Intelligence, Proceedings of the 28th Australasian Joint Conference, Canberra, ACT, Australia, 30 November–4 December 2015, Proceedings; Pfahringer, B., Renz, J., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9457, pp. 192–203. [Google Scholar] [CrossRef]
 Fletcher, S.; Islam, M.Z. Decision Tree Classification with Differential Privacy: A Survey. ACM Comput. Surv. 2019, 52, 83:1–83:33. [Google Scholar] [CrossRef]
 Li, N.; Qardaji, W.H.; Su, D. On sampling, anonymization, and differential privacy or, kanonymization meets differential privacy. In Proceedings of the 7th ACM Symposium on Information, Compuer and Communications Security, ASIACCS ’12, Seoul, Republic of Korea, 2–4 May 2012; ACM: New York, NY, USA, 2012; pp. 32–33. [Google Scholar] [CrossRef]
 Blum, A.; Dwork, C.; McSherry, F.; Nissim, K. Practical privacy: The SuLQ framework. In PODS, Proceedings of the TwentyFourth ACM SIGMODSIGACTSIGART Symposium on Principles of Database Systems, Baltimore, Maryland, 13–15 June 2005; Li, C., Ed.; ACM: New York, NY, USA, 2005; pp. 128–138. [Google Scholar]
 Friedman, A.; Schuster, A. Data mining with differential privacy. In KDD ’10, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; Rao, B., Krishnapuram, B., Tomkins, A., Yang, Q., Eds.; ACM: New York, NY, USA, 2010; pp. 493–502. [Google Scholar] [CrossRef]
 Machanavajjhala, A.; Kifer, D.; Gehrke, J.; Venkitasubramaniam, M. Ldiversity: Privacy beyond kanonymity. ACM Trans. Knowl. Discov. Data 2007, 1, 3. [Google Scholar] [CrossRef]
 Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407. [Google Scholar] [CrossRef]
 Rajkovic, V. Nursery. UCI Machine Learning Repository. 1997. Available online: https://archive.ics.uci.edu/dataset/76/nursery (accessed on 24 August 2024).
 Becker, B.; Kohavi, R. Adult. UCI Machine Learning Repository. 1996. Available online: https://archive.ics.uci.edu/dataset/2/adult (accessed on 24 August 2024).
 Mushroom. UCI Machine Learning Repository. 1987. Available online: https://archive.ics.uci.edu/dataset/73/mushroom (accessed on 24 August 2024).
 Mahdi, N. Bank_Personal_Loan_Modelling. Available online: https://www.kaggle.com/datasets/ngnnguynthkim/bankpersonalloanmodellingcsv (accessed on 24 August 2024).
 Mobile Kukan Toukei (Guidelines). Available online: https://www.intage.co.jp/english/service/platform/mobilekukantoukei/ (accessed on 24 August 2024).
Figure 1. Homogeneity attack. The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
Figure 2. Background knowledge attack. The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
Figure 3. An example of a decision tree for a dataset with two labels, 1 and 0. The vectors in the leaves show the numbers of data points classified into each leaf, where the two entries of each vector correspond to the numbers of data points labeled 1 and 0, respectively.
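As a toy illustration (not code from the paper), the count vectors shown in the leaves of Figure 3 can be tallied as follows, using the notation $n\left({\mathtt{leaf}}_{i}^{\left(y\right)}\right)$ from Table 4; the leaf names and data points here are hypothetical:

```python
from collections import Counter

# Hypothetical leaf assignments: each data point is routed to a leaf
# and carries a binary label, as in Figure 3.
data = [("leaf1", 1), ("leaf1", 0), ("leaf1", 1),
        ("leaf2", 0), ("leaf2", 0), ("leaf3", 1)]

# n(leaf_i^{(y)}): number of data points in leaf_i with label y.
counts = Counter(data)

# Count vector (n(leaf^{(1)}), n(leaf^{(0)})) for each leaf.
vectors = {leaf: (counts[(leaf, 1)], counts[(leaf, 0)])
           for leaf in {leaf for leaf, _ in data}}
print(vectors["leaf1"])  # (2, 1)
```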
Figure 4. Example of conversion from a decision tree (Figure 3) to anonymized data, where the * in the Weight and Age columns means that the corresponding attribute is not partitioned.
Table 1. Example of a dataset *.
Zip  Age  Nationality  Disease 

13053  28  Russian  Heart 
13068  29  American  Heart 
13068  21  Japanese  Flu 
13053  23  American  Flu 
14853  50  Indian  Cancer 
14853  55  Russian  Heart 
14850  47  American  Flu 
14850  59  American  Flu 
13053  31  American  Cancer 
13053  37  Indian  Cancer 
13068  36  Japanese  Cancer 
13068  32  American  Cancer 
* Excerpt from [16].
Table 2. k-anonymity ($k=4$) of the dataset shown in Table 1.
Zip  Age  Nationality  Disease 

130 **  <30  *  Heart 
130 **  <30  *  Heart 
130 **  <30  *  Flu 
130 **  <30  *  Flu 
1485 *  >40  *  Cancer 
1485 *  >40  *  Heart 
1485 *  >40  *  Flu 
1485 *  >40  *  Flu 
130 **  30–40  *  Cancer 
130 **  30–40  *  Cancer 
130 **  30–40  *  Cancer 
130 **  30–40  *  Cancer 
The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
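The grouping in Table 2 can be verified mechanically. The following is a minimal sketch (not the paper's implementation) of a k-anonymity check: a table is k-anonymous when every combination of quasi-identifier values occurs at least k times. The record list transcribes Table 2, with the 1485* group generalized to the >40 age range:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs at least k times in the table."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Generalized records from Table 2: (Zip, Age, Nationality, Disease).
records = [
    ("130**", "<30", "*", "Heart"),    ("130**", "<30", "*", "Heart"),
    ("130**", "<30", "*", "Flu"),      ("130**", "<30", "*", "Flu"),
    ("1485*", ">40", "*", "Cancer"),   ("1485*", ">40", "*", "Heart"),
    ("1485*", ">40", "*", "Flu"),      ("1485*", ">40", "*", "Flu"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Cancer"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Cancer"),
]
rows = [dict(zip(("Zip", "Age", "Nationality", "Disease"), r)) for r in records]

print(is_k_anonymous(rows, ("Zip", "Age", "Nationality"), k=4))  # True
```

Each of the three quasi-identifier groups contains exactly four records, so the check passes for $k=4$ but would fail for $k=5$.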
Table 3. ℓ-diversity (ℓ = 2) after k-anonymity of the dataset shown in Table 1.
Zip  Age  Nationality  Disease 

130 **  <30  *  Heart 
130 **  <30  *  Heart 
130 **  <30  *  Flu 
130 **  <30  *  Flu 
1485 *  >40  *  Cancer 
1485 *  >40  *  Heart 
1485 *  >40  *  Flu 
1485 *  >40  *  Flu 
130 **  30–40  *  Cancer 
130 **  30–40  *  Cancer 
130 **  30–40  *  Cancer 
130 **  30–40  *  Flu 
The * and ** in the Zip column mean that the last one and two digits of the original data are hidden, respectively, while the * in the Nationality column means that the corresponding attribute is not partitioned.
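Distinct ℓ-diversity can be checked per quasi-identifier group in the same spirit. A minimal sketch (again, not the paper's implementation), with the records transcribed from Table 3, where the last 30–40 record carries Flu so that every group holds at least two distinct diseases:

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """Return True if every quasi-identifier group contains at least
    l distinct values of the sensitive attribute (distinct l-diversity)."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Records from Table 3: (Zip, Age, Nationality, Disease).
records = [
    ("130**", "<30", "*", "Heart"),    ("130**", "<30", "*", "Heart"),
    ("130**", "<30", "*", "Flu"),      ("130**", "<30", "*", "Flu"),
    ("1485*", ">40", "*", "Cancer"),   ("1485*", ">40", "*", "Heart"),
    ("1485*", ">40", "*", "Flu"),      ("1485*", ">40", "*", "Flu"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Cancer"),
    ("130**", "30-40", "*", "Cancer"), ("130**", "30-40", "*", "Flu"),
]
rows = [dict(zip(("Zip", "Age", "Nationality", "Disease"), r)) for r in records]

print(is_l_diverse(rows, ("Zip", "Age", "Nationality"), "Disease", l=2))  # True
```

In contrast, the 30–40 group of Table 2 contains only Cancer, so Table 2 would fail this check for ℓ = 2.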
Table 4. The notations used in this paper.
Notation  Description 

k  Parameter for kanonymity 
ℓ  Parameter for ℓdiversity 
D  Dataset $\left\{({\mathbf{x}}_{i},{y}_{i})\right\}$, for $i\in \{1,\dots ,n\}$ 
$\mathbf{X}$  Data space with m features ${F}_{1},\dots ,{F}_{m}$ 
$\mathbf{x}$  Data $({x}_{1},\dots ,{x}_{m})\in \mathbf{X}=({F}_{1},\dots ,{F}_{m})$ 
y  Label of data $\mathbf{x}$ 
Y  Label space 
$\mathtt{leaf}$, ${\mathtt{leaf}}_{i}$  Leaf or leaves 
$n\left({\mathtt{leaf}}_{i}^{\left(y\right)}\right)$  The number of data points classified into ${\mathtt{leaf}}_{i}$ with label $y\in Y$ 
$n\left({\mathtt{leaf}}_{i}\right)$  The number of data points classified into ${\mathtt{leaf}}_{i}$, i.e., $n\left({\mathtt{leaf}}_{i}\right)={\sum}_{y\in Y}n\left({\mathtt{leaf}}_{i}^{\left(y\right)}\right)$ 
${n}_{j}\left({\mathtt{leaf}}_{i}^{\left(y\right)}\right)$  The number of data points with label $y\in Y$ classified into ${\mathtt{leaf}}_{i}$ of the jth tree; used for the random decision tree 
${\mathcal{H}}_{u}$  Total number of users who can be identified by a homogeneity attack 
${\mathcal{H}}_{\mathtt{leaf}}$  Number of leaves that can be identified by a homogeneity attack 
${N}_{t}$  Number of random decision trees 
ACC  Accuracy 
Table 5. Approximate value of ${N}_{t}\delta$ for $k=5,10,20$ and ${N}_{t}=10$.
${\mathit{N}}_{\mathit{t}}\mathit{\epsilon}\setminus \mathit{\beta}$  $\mathit{k}=5$  $\mathit{k}=10$  $\mathit{k}=20$  

$\mathbf{0.01}$  $\mathbf{0.1}$  $\mathbf{0.4}$  $\mathbf{0.01}$  $\mathbf{0.1}$  $\mathbf{0.4}$  $\mathbf{0.01}$  $\mathbf{0.1}$  $\mathbf{0.4}$  
$1.0$  $0.001$      $4.66$ × ${10}^{7}$  $0.355$    $1.20\times {10}^{13}$  $0.045$   
$2.0$  $5.52\times {10}^{5}$  0.352    $1.08\times {10}^{9}$  $0.034$    $7.00\times {10}^{19}$  $0.000$   
$3.0$  $7.68\times {10}^{6}$  0.127    $2.72\times {10}^{11}$  $0.005$    $4.75\times {10}^{22}$  $7.82\times {10}^{6}$  $0.426$ 
$4.0$  $1.86\times {10}^{6}$  0.043    $1.68\times {10}^{12}$  $0.001$  $0.583$  $1.92\times {10}^{24}$  $2.37\times {10}^{7}$  $0.134$ 
$5.0$  $7.47\times {10}^{7}$  0.028  0.994  $2.85\times {10}^{13}$  $0.000$  $0.348$  $3.54\times {10}^{26}$  $1.61\times {10}^{8}$  $0.051$ 
$6.0$  $2.417\times {10}^{7}$  0.009  0.963  $3.19\times {10}^{14}$  $3.93\times {10}^{5}$  $0.191$  $7.71\times {10}^{28}$  $1.03\times {10}^{9}$  $0.016$ 
$7.0$  $1.22\times {10}^{7}$  0.009  0.963  $8.51\times {10}^{15}$  $2.05\times {10}^{5}$  $0.175$  $5.75\times {10}^{29}$  $1.48\times {10}^{10}$  $0.008$ 
$8.0$  $1.22\times {10}^{7}$  0.004  0.498  $4.07\times {10}^{15}$  $4.53\times {10}^{6}$  $0.093$  $6.27\times {10}^{30}$  $1.56\times {10}^{11}$  $0.003$ 
$9.0$  $5.47\times {10}^{8}$  0.002  0.410  $7.58\times {10}^{16}$  $1.87\times {10}^{6}$  $0.078$  $5.06\times {10}^{31}$  $2.82\times {10}^{12}$  $0.001$ 
Table 6. Accuracy of $k\mathtt{AnonToRdt}$.
(a)  (b)  
k  #Trees  Nursery Dataset (with 5 class labels)  Adult Dataset  
    $d=3$  $d=4$  $d=5$  $d=7$  $d=8$  $d=9$  
0 *  8  0.748  0.832  0.854  0.781  0.806  0.814  
5  8  0.788  0.818  0.783  0.781  0.807  0.811  
10  8  0.749  0.782  0.598  0.794  0.801  0.814  
0  9  0.807  0.819  0.867  0.789  0.813  0.817 
5  9  0.798  0.843  0.816  0.789  0.813  0.817  
10  9  0.808  0.775  0.626  0.804  0.812  0.813  
0  10  0.815  0.841  0.860  0.795  0.791  0.816 
5  10  0.808  0.841  0.818  0.795  0.791  0.816  
10  10  0.816  0.801  0.673  0.786  0.803  0.813  
0  11  0.832  0.847  0.868  0.786  0.807  0.818 
5  11  0.836  0.847  0.823  0.786  0.807  0.817  
10  11  0.830  0.818  0.663  0.793  0.797  0.813  
(c)  (d)  
k  #Trees  Mushroom Dataset  Loan Dataset  
    $d=3$  $d=4$  $d=5$  $d=4$  $d=5$  $d=6$  
0 *  8  0.935  0.964  0.979  0.838  0.843  0.838  
5  8  0.935  0.964  0.979  0.843  0.843  0.840  
10  8  0.903  0.964  0.974  0.838  0.845  0.838  
0  9  0.940  0.967  0.980  0.839  0.839  0.842 
5  9  0.940  0.967  0.980  0.839  0.839  0.842  
10  9  0.854  0.963  0.978  0.839  0.839  0.840  
0  10  0.957  0.972  0.978  0.841  0.846  0.835 
5  10  0.957  0.972  0.978  0.836  0.841  0.839  
10  10  0.957  0.949  0.975  0.841  0.846  0.835  
0  11  0.944  0.973  0.978  0.844  0.841  0.840 
5  11  0.944  0.973  0.978  0.839  0.841  0.840  
10  11  0.921  0.968  0.977  0.844  0.841  0.840 
* $k=0$ means the original decision tree without pruning. #Trees denotes the number of trees.
Table 7. Accuracy and approximate ${N}_{t}\delta$ of $k\mathtt{AnonToRdt}$ when ${N}_{t}\epsilon=2.0$.
k  Nursery Dataset (with 3 Class Labels)  Mushroom Dataset (with 2 Class Labels)  

$\mathit{\beta}\mathbf{=}\mathbf{0.01}$  $\mathit{\beta}\mathbf{=}\mathbf{0.1}$  $\mathit{\beta}\mathbf{=}\mathbf{0.4}$  $\mathit{\beta}\mathbf{=}\mathbf{0.01}$  $\mathit{\beta}\mathbf{=}\mathbf{0.1}$  $\mathit{\beta}\mathbf{=}\mathbf{0.4}$  
${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  ${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  ${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  ${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  ${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  ${\mathit{N}}_{\mathit{t}}\mathit{\delta}$  $\mathbf{ACC}$  
0    0.971    0.971    0.971    0.945    0.945    0.945 
5  $5.52\times {10}^{5}$  0.942  0.352  0.958    0.972  $5.52\times {10}^{5}$  0.900  0.352  0.942    0.914 
10  $1.08\times {10}^{9}$  0.774  0.034  0.969    0.971  $1.08\times {10}^{9}$  0.833  0.034  0.930    0.933 
20  $7.00\times {10}^{19}$  0.383  0  0.965    0.963  $7.00\times {10}^{19}$  0.631  0  0.913    0.932 
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. 
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).