
SoftwareX 24 (2023) 101549

Contents lists available at ScienceDirect

SoftwareX
journal homepage: www.elsevier.com/locate/softx

Original software publication

boostingDEA: A boosting approach to Data Envelopment Analysis in R


Maria D. Guillen a, Juan Aparicio a,b,∗, Victor J. España a

a Center of Operations Research (CIO), Miguel Hernandez University of Elche (UMH), 03202 Elche (Alicante), Spain
b Valencian Graduate School and Research Network of Artificial Intelligence (valgrAI), 46022 Valencia, Spain

ARTICLE INFO

Keywords: R; Data Envelopment Analysis; Free Disposal Hull; Boosting

ABSTRACT

boostingDEA is a new package for R that includes functions to estimate production frontiers and make ideal output predictions in the Data Envelopment Analysis (DEA) context. The package implements both standard models from DEA and Free Disposal Hull (FDH) and, for the first time, incorporates boosting techniques. Boosting is a method used in machine learning that attempts to overcome the overfitting issue, typically sustained in standard methods, by training multiple models sequentially to improve the accuracy of the overall system. Moreover, the package includes code for estimating several technical efficiency measures using different models such as the input and output-oriented radial measures, the input and output-oriented Russell measures, the Directional Distance Function (DDF), the Weighted Additive Measure (WAM) and the Slacks-Based Measure (SBM).

Code metadata

Current code version: 0.1.0
Permanent link to code/repository used for this code version: https://github.com/ElsevierSoftwareX/SOFTX-D-23-00466
Permanent link to Reproducible Capsule: NA
Legal Code License: AGPL (≥ 3)
Code versioning system used: Git
Software code languages, tools, and services used: R (≥ 3.5.0)
Compilation requirements, operating environments & dependencies: R (≥ 3.5.0)
Link to developer documentation/manual: https://cran.r-project.org/web/packages/boostingDEA/boostingDEA.pdf and https://cran.r-project.org/web/packages/boostingDEA/vignettes/boostingDEA.html
Support email for questions: maria.guilleng@umh.es

1. Motivation and significance

Efficiency is a crucial aspect of any production process, as it determines how effectively resources are used to generate output. In the context
of organizations, technical efficiency refers to the ability to produce maximum output from a given set of inputs. Measuring technical efficiency
provides insights into the performance of firms and can help identify areas where productivity improvements can be made [1,2]. In this context, both
parametric (e.g., Stochastic Frontier Analysis [3]) and non-parametric methodologies (e.g., Data Envelopment Analysis [4]) have been developed.
However, non-parametric approaches are more appealing than their parametric counterpart due to their flexibility, the mild conditions required
and their natural treatment of multi-input multi-output production contexts [5].
One widely-used non-parametric method for measuring technical efficiency is Data Envelopment Analysis (DEA). DEA [4] determines the relative
efficiency of a set of decision-making units (DMUs) by taking into account multiple inputs and outputs and comparing the performance of a given
DMU to the other DMUs in the set. To do so, DEA relies on the construction of a technology in the input–output space that satisfies free disposability
(i.e., it is always possible to do worse), envelopmentness (i.e., including all observed DMUs), convexity and minimal extrapolation (i.e., the smallest
set) [6]. On the other hand, another renowned non-parametric approach is Free Disposal Hull (FDH) [7], which is based upon the construction of

∗ Corresponding author at: Center of Operations Research (CIO), Miguel Hernandez University of Elche (UMH), 03202 Elche (Alicante), Spain.
E-mail address: j.aparicio@umh.es (Juan Aparicio).

https://doi.org/10.1016/j.softx.2023.101549
Received 20 July 2023; Received in revised form 14 September 2023; Accepted 2 October 2023
Available online 10 October 2023
2352-7110/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Maria D. Guillen et al. SoftwareX 24 (2023) 101549

Fig. 1. Software architecture of boostingDEA.

a technology that satisfies only free disposability, envelopmentness and minimal extrapolation. In fact, FDH may be considered the ‘‘skeleton’’ of
DEA, as the convex hull of the frontier estimated by FDH is identical to the DEA frontier [8].
To make a fair comparison between DMUs, a key aspect is knowing the maximum output a DMU can ideally produce given a certain amount
of inputs. However, DEA and FDH tend to overfit due to the minimal extrapolation principle [9], since this axiom ensures that the estimator
of the production function is as close to the data set of DMUs as possible. To overcome this issue, a topic of growing interest is the adaptation of
Machine Learning techniques to the efficiency analysis context. In this sense, one major standard machine learning technique is gradient boosting.
The basic idea of gradient boosting [10] is to iteratively train a series of weak models, where each model tries to correct the errors made by the
previous models. In each iteration, the model is trained on the residuals, which are the differences between the true values and the predicted values
of the previous model. The ‘‘gradient’’ in gradient boosting refers to the use of the gradient of the loss function with respect to the model’s output
to adjust the weights of the data points. This helps the model to focus on the data points that are most difficult to predict. The main advantage of
gradient boosting is its ability to produce highly accurate models, especially in case of noisy data or when relationships between the variables are
complex [11].
In this paper, we introduce a new R package called boostingDEA that, besides implementing the most popular DEA and FDH models, includes
two different boosting algorithms for estimating production frontiers: an adaptation of the Gradient Tree Boosting known as EATBoosting [12,13]
and the adaptation of the LS-Boosting algorithm [10] using adapted Multivariate Adaptive Regression Splines (MARS) models [14] as base learners
(from now on referred to as MARSBoosting [15]). EATBoosting shares similarities with FDH since graphically both generate a step function, while
MARSBoosting resembles DEA. However, both algorithms overcome the overfitting problems that characterize standard techniques. Furthermore,
in this package, we show how to calculate different technical efficiency measures defined in the literature using different models. In particular,
the input and output-oriented radial measures [6], the input and output-oriented Russell measures [16], the Directional Distance Function (DDF)
[17], the Weighted Additive Measure (WAM) [18] and the Enhanced Russell Graph measure (ERG) [19], also known as the Slacks-Based Measure
(SBM) [20], are included.

2. Software description

In the following subsections, we provide details of the boostingDEA architecture and outline the software functionalities, which include creating the efficiency models, making output predictions and measuring efficiency with them.

2.1. Software architecture

boostingDEA implements functions to create four different efficiency models, separated into four entities: DEA, FDH, EATBoost, and MARSBoost.
Once created, these models can then be used to make predictions, using the function predict(), and to calculate their related efficiency measures,
using the function efficiency(). This is explained visually in Fig. 1.

2.2. Software functionalities

In the data envelopment analysis context, the productivity and economic performance of a set of $i = 1, \dots, n$ observed DMUs is measured. Each of these DMUs consumes a vector of $j = 1, \dots, m$ positive inputs $\mathbf{x} \in \mathbb{R}^{m}_{+}$ to produce a vector of $r = 1, \dots, s$ positive outputs $\mathbf{y} \in \mathbb{R}^{s}_{+}$. In the package, DMUs' data are managed as matrix and/or data.frame and their input/output indexes as vector. The set that includes all the technically feasible combinations of $(\mathbf{x}, \mathbf{y})$ is known as the production possibility set or technology, formally defined as $\psi = \{(\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_{+} : \mathbf{x} \text{ can produce } \mathbf{y}\}$ [21]. Technical efficiency is then defined as the distance from a point belonging to $\psi$ to the production frontier, where the production frontier $\partial(\psi)$ is the ''upper'' border of the technology. There are various ways of estimating the true frontier of $\psi$ and the true efficiency scores, hence the different models (Section 2.2.1) and their respective predictions (Section 2.2.2). Furthermore, since there are several ways of calculating distances, different efficiency measures are available in the efficiency literature, as explained in Section 2.2.3.
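As a minimal sketch of this data representation, using the banks dataset that ships with the package (described in Section 3):

```r
library(boostingDEA)

data("banks")  # data.frame of n = 31 DMUs (banks), one row per DMU
x <- 1:3       # vector of column indexes for the m = 3 inputs
y <- 4:5       # vector of column indexes for the s = 2 outputs

inputs  <- banks[, x]  # input matrix
outputs <- banks[, y]  # output matrix
```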


2.2.1. Creating the models


In the context of DEA, the production technology is calculated under assumptions of free disposability, convexity, envelopmentness and minimal
extrapolation. For this case, [6] provide an estimate of the variable returns to scale technology 𝜓 as:

$$
\psi_{DEA} = \Big\{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_{+} :\; y^{(r)} \le \sum_{i=1}^{n} \lambda_i y_i^{(r)},\; r = 1, \dots, s;\quad x^{(j)} \ge \sum_{i=1}^{n} \lambda_i x_i^{(j)},\; j = 1, \dots, m;\quad \sum_{i=1}^{n} \lambda_i = 1,\; \lambda_i \ge 0,\; i = 1, \dots, n \Big\} \tag{1}
$$

This can be computed in the package using the DEA(data,x,y) function. This function only requires the data containing the units for the
analysis and the indexes of the input and output variables (x and y respectively).
Similarly, FDH also estimates production frontiers, but it is based on just three axioms: free disposability, envelopmentness and minimal
extrapolation [7]. In this case, an estimate of the technology is computed as:

$$
\psi_{FDH} = \Big\{ (\mathbf{x}, \mathbf{y}) \in \mathbb{R}^{m+s}_{+} :\; y^{(r)} \le \sum_{i=1}^{n} \lambda_i y_i^{(r)},\; r = 1, \dots, s;\quad x^{(j)} \ge \sum_{i=1}^{n} \lambda_i x_i^{(j)},\; j = 1, \dots, m;\quad \sum_{i=1}^{n} \lambda_i = 1,\; \lambda_i \in \{0, 1\},\; i = 1, \dots, n \Big\} \tag{2}
$$

Likewise, the FDH model can be computed in R using the FDH(data,x,y) function.
As an alternative to the standard FDH technique, the EATBoosting algorithm [12,13] was introduced. This algorithm is an adaptation of the
machine learning Gradient Tree Boosting algorithm [10] to estimate production frontiers while satisfying the usual axioms from production theory.
In the EATBoosting algorithm, many weak tree-like EAT [9] models are combined in a forward stagewise strategy (where a new predictive model
is added in sequence) to generate one strong learner. The final prediction fulfills free disposability and envelopmentness and generates a step-wise
surface as an estimator, just as FDH does, but avoids the typical overfitting problem linked to the latter.
The EATBoosting model can be computed in R using the function EATBoost(data, x, y, num.iterations, num.leaves, learning.rate). This function requires the data containing the DMUs for the analysis, the indexes of the input and output variables (x and y respectively) and a set of hyperparameters that control the performance of the model. Those parameters are:

• num.iterations, which refers to the number of iterations that the algorithm will perform (i.e., the number of trees in the model).
• num.leaves, which limits the number of final leaves of each tree at every iteration.
• learning.rate, which applies a weighting factor for the corrections by new trees when added to the model.
MARSBoosting is an adaptation of the Least Squares Boosting (LS-Boosting) algorithm by [10] to estimate production functions, based on the idea of using boosting to ensemble adapted Multivariate Adaptive Regression Splines (MARS) models [14] as base learners. In particular, the adapted LS-Boosting constructs additive regression models by sequentially fitting the adapted MARS models to the current pseudo-residuals by least squares at each iteration. The MARS algorithm fits piece-wise linear regressions using discrete regression splines at various intervals of the predictor variable space. As a result, the final estimator satisfies envelopmentness, monotonicity and concavity, which allows the approach to be compared to DEA. The final prediction obtained does not have a continuous first derivative, so sharp trend changes can occur. As an alternative, a smoothing procedure can also be applied.
In R, the MARSBoosting model is implemented using the function MARSBoost(data, x, y, num.iterations, num.terms, learning.rate). This function requires the data and the indexes of the input (x) and output (y) variables. As hyperparameters, it requires:
• num.iterations: the number of iterations the algorithm will perform (i.e., the number of MARS models in the final model).
• num.terms: the number of splines created by the MARS algorithm.
• learning.rate: the shrinkage factor.
Moreover, one of the key aspects of both the EATBoosting and MARSBoosting algorithms is hyperparameter tuning. To find the best values for the hyperparameters, we can resort to a grid of parameter values, which can be tested through training and test samples in a user-specified proportion.
In the case of the EATBoosting algorithm, this grid search can be done through the function bestEATBoost(training, test, x, y,
num.iterations, learning.rate, num.leaves, verbose), while in the case of the MARSBoosting algorithm this can be done through
the function bestMARSBoost(training, test, x, y, num.iterations, learning.rate, num.terms, verbose). These functions
receive a vector instead of a single value for each hyperparameter as arguments and evaluate each possible combination in the grid through Mean
Squared Error (MSE). Furthermore, through the verbose argument, we can specify if we want to ‘‘see’’ the training progress.

2.2.2. Making predictions


In the efficiency context, it is common to measure the performance of one DMU against another DMU [22]. However, to make a fair comparison,
an important issue is knowing the maximum output that a DMU can ideally produce given a certain amount of inputs. These estimations can


Fig. 2. Example comparing the FDH and the EATBoosting predictions in a single-input single-output scenario with $y = x^{0.5} e^{-u}$, where $x \sim Uni[0, 1]$ and $u \sim |N(0, 0.4)|$.

be computed in R using predict(object, newdata, x), where object refers to the previously described DEA, FDH, EATBoost, and MARSBoost models, newdata refers to a data.frame of DMUs and x to the set of input variable indexes. The data.frame used as the newdata parameter can be the same one used to create the model, or a new one containing different DMUs from the same economic context.
Furthermore, in the case of the MARSBoosting algorithm, the models with or without the smoothing procedure can both be used to compute the predictions. This is specified by an additional parameter class. If class equals 1 or is unspecified, the model without the smoothing procedure is used, while if it equals 2, the smoothed model is applied. Finally, it should be noted that predictions can only be made in single-output scenarios for the DEA, FDH, and MARSBoost models [15], whereas EATBoost models can be used in both single-output and multi-output scenarios [12].
Regarding DEA and FDH, these techniques were not originally defined for providing output predictions. Nevertheless, the current version of our
package is able to provide these predictions in the single-output case. The multi-output framework will be implemented in future versions of the
package. In the case of MARSBoosting, as far as we know, the development of the algorithm is only focused on the single-output scenario (see [15]).
Additional developments will be incorporated into the package in future releases when a multi-output version of MARSBoosting is available. In
contrast, EATBoosting was originally introduced to provide output predictions for both the single-output scenario and the multi-output production
context (see [13]).
To further illustrate the prediction ability of the different algorithms, Fig. 2 compares FDH with EATBoosting and Fig. 3, DEA with MARSBoosting
(with and without the smoothing procedure). In both cases, we can see how the prediction made by the boosting algorithms is closer to the true
frontier than standard techniques.

2.2.3. Measuring efficiency


In the technical efficiency context, efficiency scores are used to measure the relative efficiency of different DMUs. In the boostingDEA package,
these scores can be calculated thanks to the function efficiency(model, measure, data, x, y, . . . ). For this function, it is necessary
to specify the model we want to calculate the score for (i.e., DEA, FDH, EATBoost and MARSBoost) and the mathematical programming model
which is used to calculate the score, that is, the measure. This is specified using the model and the measure parameters. If no measure is specified,
the radial output measure is calculated. Moreover, the given dataset (data) we want to calculate the efficiency scores for and its corresponding
input index(es) (x) and output index(es) (y) must be specified. For best results, it is suggested that the data set with the DMUs whose efficiency is
to be calculated match the data set used to estimate the frontier. However, it is also possible to calculate efficiency scores for new data (although
infeasibilities can occur if the new unobserved DMUs are outside the computed technologies, resulting in −∞ or +∞ scores depending on if the
corresponding optimization program is associated with maximizing or minimizing the objective function, respectively). Additionally, depending on
the measure used, extra parameters are required.
The implemented measures are:
• rad.out: Output-oriented radial measure [6], which determines the efficiency score for a DMU $k$ with $(\mathbf{x}_k, \mathbf{y}_k) \in \mathbb{R}^{m+s}_{+}$ by equiproportionally increasing all its outputs while maintaining inputs constant.
• rad.in: Input-oriented radial measure [6], which determines the efficiency score for an evaluated point $(\mathbf{x}_k, \mathbf{y}_k)$ by equiproportionally decreasing all its inputs while maintaining outputs constant.
• Russell.out: Output-oriented Russell measure [16], which considers that there might be slack in some, but not all, outputs after the output-oriented radial efficiency is achieved.


Fig. 3. Example comparing the DEA and the MARSBoosting predictions in a single-input single-output scenario with $y = x^{0.5} e^{-u}$, where $x \sim Uni[0, 1]$ and $u \sim |N(0, 0.4)|$.

• Russell.in: Input-oriented Russell measure [16], which considers that there might be slack in some, but not all, inputs after the input-oriented radial efficiency is achieved.
• DDF: Directional Distance Function [17], which projects a DMU $(\mathbf{x}_k, \mathbf{y}_k)$ in a preassigned direction $\mathbf{g} = (-\mathbf{g}_x, \mathbf{g}_y) \neq 0_{m+s}$, with $\mathbf{g}_x \in \mathbb{R}^{m}_{+}$ and $\mathbf{g}_y \in \mathbb{R}^{s}_{+}$. The direction vector is specified using the parameter direction.vector. Several directional vectors used in the literature are included:

– The unit vector: $(-\mathbf{g}_x, \mathbf{g}_y) = (\mathbf{1}, \mathbf{1})$.
– The values of the evaluated observation: $(-\mathbf{g}_x, \mathbf{g}_y) = (\mathbf{x}_k, \mathbf{y}_k)$.
– The input and output variables' means: $(-\mathbf{g}_x, \mathbf{g}_y) = (\bar{\mathbf{x}}, \bar{\mathbf{y}})$.
– A user-specified vector.

• WAM: Weighted Additive Models [18], which consider that not all units should be of equal importance and introduce some weights $\mathbf{w}^- = (w_1^-, \dots, w_m^-) \in \mathbb{R}^{m}_{+}$ and $\mathbf{w}^+ = (w_1^+, \dots, w_s^+) \in \mathbb{R}^{s}_{+}$ representing the relative importance of unit inputs and unit outputs, respectively. The package can compute a set of models known as General Efficiency Measures (GEMs) using the parameter weights. Those are:

– The measure of inefficiency proportions (MIP) [23], which uses the weights $(\mathbf{w}^-, \mathbf{w}^+) = \left(\dfrac{1}{\mathbf{x}_k}, \dfrac{1}{\mathbf{y}_k}\right)$.
– The range adjusted measure (RAM) [24], which uses the weights $(\mathbf{w}^-, \mathbf{w}^+) = \left(\dfrac{1}{(m+s)R^-}, \dfrac{1}{(m+s)R^+}\right)$, where $R^-$ and $R^+$ are the input and output variables' ranges.
– The bounded adjusted measure (BAM) [25], which uses the weights $(\mathbf{w}^-, \mathbf{w}^+) = \left(\dfrac{1}{(m+s)(\mathbf{x}_k - \underline{\mathbf{x}})}, \dfrac{1}{(m+s)(\bar{\mathbf{y}} - \mathbf{y}_k)}\right)$, where $\underline{\mathbf{x}}$ and $\bar{\mathbf{y}}$ are the minimum and maximum observed values of inputs and outputs, respectively.
– The normalized weighted additive DEA model [18], which uses the weights $(\mathbf{w}^-, \mathbf{w}^+) = \left(\dfrac{1}{\sigma^-}, \dfrac{1}{\sigma^+}\right)$, where $\sigma^-$ and $\sigma^+$ are the standard deviations of inputs and outputs, respectively.
– A user-specified vector of weights.

• ERG: The Enhanced Russell Graph measure [19], which deals directly with both the input excesses and the output shortfalls of the DMU
concerned.


Calculating various measures for EATBoosting models can be highly time-consuming [13]. Therefore, a heuristic approach to compute these scores is provided. The parameter heuristic (which defaults to FALSE) specifies whether this approach is used. Additionally, when using the MARSBoosting algorithm as the model, only the output-oriented radial measure is currently calculable [15]; other efficiency measures have not yet been defined for this algorithm. Once such measures are defined for MARSBoosting, the package will be updated accordingly.
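As a minimal sketch, using the banks dataset that ships with the package (with Revenue as the single output), the default output-oriented radial score for a DEA model can be computed as:

```r
library(boostingDEA)

data("banks")
DEA_model <- DEA(data = banks, x = 1:3, y = 6)
# measure defaults to the output-oriented radial measure ("rad.out")
scores <- efficiency(model = DEA_model, measure = "rad.out",
                     data = banks, x = 1:3, y = 6)
```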

3. Illustrative example

To demonstrate all the functions available in this package, we utilize a single data set that is included in the package itself. The data set
comprises data on banks that were operational in Taiwan in 2010 (originally sourced from [26]). The data set banks contains 31 banks with 6
variables. Those variables are:

• Financial.funds: deposits and borrowed funds (in millions of TWD).


• Labor: number of employees.
• Physical.capital: net amount of fixed assets (in millions of TWD).
• Financial.investments: financial assets, securities, and equity investments (in millions of TWD).
• Loans: loans and discounts (in millions of TWD).
• Revenue: interests from financial investments and loans.

The first three variables are considered as inputs for the models, while the following two are considered as outputs. Moreover, the variable
Revenue is interpreted as a combination of the other two output variables, and can be used as the target variable for a single-output scenario.
# Load the database
R> data("banks")
# Save the input and output indexes
R> x <- 1:3
R> y <- 4:5
# Save the matrix of inputs and outputs
R> input <- banks[, x]
R> output <- banks[, y]
# Create training and test sets
R> N <- nrow(banks)
R> selected <- sample(1:N, N * 0.8)  # Training indexes
R> training <- banks[selected, ]     # Training set
R> test <- banks[-selected, ]        # Test set

Here is an example of how to create all the different models:

# Creates a DEA model with:
# 3 inputs: Financial.funds, Labor and Physical.capital
# 1 output: Revenue
R> DEA_model <- DEA(data = banks, x = 1:3, y = 6)

# Creates a FDH model with:
# 3 inputs: Financial.funds, Labor and Physical.capital
# 1 output: Revenue
R> FDH_model <- FDH(data = banks, x = 1:3, y = 6)

# Creates an EATBoosting model with:
# 3 inputs: Financial.funds, Labor and Physical.capital
# 2 outputs: Financial.investments, Loans
# Search of best hyperparameters for an EATBoosting model
R> grid_EATBoost <- bestEATBoost(
     training = training, test = test,
     x = 1:3, y = 4:5,
     num.iterations = c(5, 6, 7),
     learning.rate = c(0.4, 0.5, 0.6),
     num.leaves = c(6, 7, 8),
     verbose = FALSE
   )
# Best EATBoosting model
R> EATBoost_best <- EATBoost(
     data = banks, x = 1:3, y = 4:5,
     num.iterations = grid_EATBoost[1, "num.iterations"],
     learning.rate = grid_EATBoost[1, "learning.rate"],
     num.leaves = grid_EATBoost[1, "num.leaves"]
   )


# Creates a MARSBoosting model with:
# 3 inputs: Financial.funds, Labor and Physical.capital
# 1 output: Revenue
# Search of best hyperparameters for a MARSBoosting model
R> grid_MARSBoost <- bestMARSBoost(
     training = training, test = test,
     x = 1:3, y = 6,
     num.iterations = c(5, 6, 7),
     learning.rate = c(0.4, 0.5, 0.6),
     num.terms = c(6, 8, 10),
     verbose = FALSE
   )
# Best MARSBoosting model
R> MARSBoost_best <- MARSBoost(
     data = banks, x = 1:3, y = 6,
     num.iterations = grid_MARSBoost[1, "num.iterations"],
     learning.rate = grid_MARSBoost[1, "learning.rate"],
     num.terms = grid_MARSBoost[1, "num.terms"]
   )

Next, we can see how to make output predictions.

# DEA prediction
R> predict(object = DEA_model, newdata = banks,
           x = 1:3, y = 6)
  Revenue_pred
1     1056.000
2    41007.000
3    23610.840
4     4267.654
5    31506.000
6    35510.000

# MARSBoosting prediction (with the smoothing procedure)
R> predict(object = MARSBoost_best, newdata = banks,
           x = 1:3, class = 2)
  Revenue_pred
1     1121.456
2    43058.293
3    25857.201
4     5156.947
5    32379.240
6    35519.006

Finally, here is an example of how to calculate some efficiency measures:

R> efficiency(model = EATBoost_best, heuristic = FALSE,
              measure = "Russell.in", data = banks,
              x = 1:3, y = 4:5)
                   EATBoost.Russell.in
Export-Import Bank           1.0000000
Bank of Taiwan               0.3440001
Taipei Fubon Bank            0.3068248
Bank of Kaohsiung            0.4822518
Land Bank                    0.3525174
Cooperative Bank             0.2897980

R> efficiency(model = EATBoost_best, heuristic = FALSE,
              measure = "DDF", direction.vector = "mean",
              data = banks, x = 1:3, y = 4:5)
                   EATBoost.DDF
Export-Import Bank 0.0000000000
Bank of Taiwan     0.0000000000
Taipei Fubon Bank  0.2346833493
Bank of Kaohsiung  0.0000000000
Land Bank          0.0000000000
Cooperative Bank   0.1363531594


4. Impact and conclusions

Although there are currently several R packages available for estimating technical efficiency, including deaR [27] for the estimation of technical efficiency using both classic and fuzzy DEA models; Benchmarking [28] to calculate DEA under different technology assumptions; productivity [29], providing various indexes related to DEA; and FEAR [30], which allows computing non-parametric efficiency estimates, making inference, and testing hypotheses in frontier models; the boostingDEA package differs by offering an alternative based on machine learning to estimate production frontiers. This machine learning approach relies on boosting to overcome the overfitting problems that characterize DEA and FDH (see e.g. [9]). Specifically, boostingDEA implements an adaptation of the Gradient Tree Boosting algorithm as well as an adaptation of the LS-Boosting algorithm using MARS models as base learners. Besides, it is shown how to tune these models to determine the best combination of hyperparameters.
The package provides functions to make ideal output estimations given a certain amount of inputs, and functions to calculate different efficiency
measures such as the input and output-oriented radial measures, the input and output-oriented Russell measures, the Directional Distance Function
(DDF), the Weighted Additive Measure (WAM) and the Enhanced Russell Graph measure (ERG). Furthermore, to illustrate all the models in the
package, a data set consisting of banks operating in Taiwan is also provided. Finally, the code is open-source and freely available on a repository
(https://github.com/itsmeryguillen/boostingDEA) for users to modify, extend, and adapt according to their needs.

Declaration of competing interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: The
authors report financial support was provided by the Spanish Ministry of Science and Innovation and by the Valencian Community (Spain).

Data availability

Data will be made available on request.

Acknowledgments

The authors thank the financial support from the Spanish Ministry of Science and Innovation and the State Research Agency under grants
PID2019-105952GB-I00 and PID2022-136383NB-I00/ AEI /10.13039/501100011033. Additionally, Juan Aparicio, Maria D. Guillen and Victor J.
España thank the grants PROMETEO/2021/063, CIACIF/2021/345 and ACIF/2021/135 respectively, funded by the Valencian Community (Spain).

References

[1] Pastor JT, Aparicio J, Zofío JL. Benchmarking economic efficiency. Internat Ser Oper Res Management Sci 2022.
[2] O’Donnell CJ. Productivity and efficiency analysis. Springer; 2018.
[3] Aigner DJ, Chu S-f. On estimating the industry production function. Am Econ Rev 1968;58(4):826–39.
[4] Charnes A, Cooper WW, Rhodes E. Measuring the efficiency of decision making units. European J Oper Res 1978;2(6):429–44.
[5] Orea L, Zofío JL. Common methodological choices in nonparametric and parametric analyses of firms’ performance. Palgrave Handb Econ Perform Anal 2019;419–84.
[6] Banker RD, Charnes A, Cooper WW. Some models for estimating technical and scale inefficiencies in data envelopment analysis. Manag Sci 1984;30(9):1078–92.
[7] Deprins D, Simar L, Tulkens H. Measuring labor-efficiency in post offices, no. 571. LIDAM Reprints CORE, Université catholique de Louvain, Center for Operations Research
and Econometrics (CORE); 1984.
[8] Daraio C, Simar L. Introducing environmental variables in nonparametric frontier models: a probabilistic approach. J Prod Anal 2005;24(1):93–121.
[9] Esteve M, Aparicio J, Rabasa A, Rodriguez-Sala JJ. Efficiency analysis trees: A new methodology for estimating production frontiers through decision trees. Expert Syst Appl
2020;162:113783.
[10] Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat 2001;1189–232.
[11] Hastie T, Tibshirani R, Friedman JH, Friedman JH. The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer; 2009.
[12] Guillen MD, Aparicio J, Esteve M. Gradient tree boosting and the estimation of production frontiers. Expert Syst Appl 2023;214:119134.
[13] Guillen MD, Aparicio J, Esteve M. Performance evaluation of decision making units through boosting methods in the context of free disposal hull: some exact and heuristic
algorithms. Int J Inf Technol Decis Mak 2022.
[14] Friedman JH. Multivariate adaptive regression splines. Ann Stat 1991;19(1):1–67.
[15] España VJ, Aparicio J, Barber X, Esteve M. Estimating production functions through additive models based on regression splines. Eur J Oper Res 2024;312(2):684–99.
[16] Färe R, Lovell CK. Measuring the technical efficiency of production. J Econom Theory 1978;19(1):150–62.
[17] Chambers RG, Chung Y, Färe R. Profit, directional distance functions, and Nerlovian efficiency. J Optim Theory Appl 1998;98:351–64.
[18] Lovell CK, Pastor JT. Units invariant and translation invariant DEA models. Oper Res Lett 1995;18(3):147–51.
[19] Pastor JT, Ruiz JL, Sirvent I. An enhanced DEA Russell graph efficiency measure. European J Oper Res 1999;115(3):596–607.
[20] Tone K. A slacks-based measure of efficiency in data envelopment analysis. European J Oper Res 2001;130(3):498–509.
[21] Färe R, Primont D. Multi-output production and duality: Theory and applications. Springer science & business media; 1995.
[22] Bogetoft P, Otto L. Benchmarking with Dea, Sfa, and R, vol. 157. Springer Science & Business Media; 2010.
[23] Charnes A, Cooper WW, Wei Q. A semi-infinite multicriteria programming approach to data envelopment analysis with infinitely many decision-making units. Tech. rep.,
Texas univ. at Austin center for cybernetic studies; 1987.
[24] Cooper WW, Park KS, Pastor JT. RAM: a range adjusted measure of inefficiency for use with additive models, and relations to other models and measures in DEA. J Prod
Anal 1999;11:5–42.
[25] Cooper WW, Pastor JT, Borras F, Aparicio J, Pastor D. BAM: a bounded adjusted measure of efficiency for use with bounded additive models. J Prod Anal 2011;35:85–94.
[26] Juo J-C, Fu T-T, Yu M-M, Lin Y-H. Profit-oriented productivity change. Omega 2015;57:176–87.
[27] Coll-Serrano V, Bolos V, Suarez RB. deaR: Conventional and fuzzy data envelopment analysis. 2022, R package version 1.3.1.
[28] Bogetoft P, Otto L. Benchmarking with DEA and SFA. 2022, R package version 0.31.
[29] Dakpo KH, Desjeux Y, Latruffe L. productivity: Indices of productivity and profitability using data envelopment analysis (DEA). 2018, R package version 1.1.0.
[30] Wilson PW. FEAR: A software package for frontier efficiency analysis with R. Socio-Econ Plan Sci 2008;42(4):247–54.
