
IEEE RELIABILITY SOCIETY SECTION

Received October 12, 2020, accepted October 22, 2020, date of publication October 29, 2020, date of current version November 11, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3034680

A Novel Framework for Semiconductor Manufacturing Final Test Yield Classification Using Machine Learning Techniques
DAN JIANG 1,2, WEIHUA LIN 2, AND NAGARAJAN RAGHAVAN 1, (Member, IEEE)
1 Engineering Product Development (EPD) Pillar, Singapore University of Technology and Design, Singapore 487372
2 Silicon Laboratories International, Singapore 539775

Corresponding author: Dan Jiang (dan_jiang@mymail.sutd.edu.sg)


This work was supported by the Economic Development Board of Singapore (EDB) and Silicon Laboratories,
Inc. (SLAB) through the Industry Postgraduate Program (IPP) under Grant IGIPSILICON1901.

ABSTRACT Advanced data analysis tools and techniques are important for semiconductor companies to gain competitive advantage. In particular, yield prediction tools, which fully utilize production data, help to improve operational efficiency and reduce production costs. This paper introduces a novel and scalable framework for semiconductor manufacturing Final Test (FT) yield prediction leveraging machine learning techniques. This framework is able to predict FT yield at the wafer fabrication stage, so that FT low yield problems can be caught at an earlier production stage compared to past studies. Our work presents a robust solution to automatically handle both numerical and categorical production related data without prior knowledge of the low yield root cause. Gaussian Mixture Models, One Hot Encoder and Label Encoder techniques are adopted for data pre-processing. To improve model performance for both binary and multi-class classification, model selection and model ensemble using the F1-macro method are demonstrated. The framework has been applied to three mass production products with different wafer technologies and manufacturing flows. All of them achieved high F1-macro test scores, indicative of the robustness of our framework.

INDEX TERMS Semiconductor manufacturing, smart manufacturing, yield prediction, final test, Gaussian
mixture models, clustering, ensemble methods.

I. INTRODUCTION
The semiconductor manufacturing process flow involves hundreds of processes, and the production life-cycle from raw material to packaged chips can take 8-16 weeks in all. In general, Wafer Fabrication (WF), Wafer Sort (WS) and Final Test (FT) are the three major stages where huge amounts of production data are generated every day, but most of these data are not fully utilized. During the WF stage, the Wafer Acceptance Test (WAT) is conducted to monitor important WF process related parameters. Wafers that have passed WAT then proceed to the WS stage, where functional defects are filtered out before assembly. FT is done on packaged chips and has the largest test coverage and longest test time, to make sure defective parts are not shipped to customers. Normally, FT has more low yield problems and higher test cost compared to WAT and WS. Therefore, FT yield control is one of the most important factors contributing to manufacturing cost and forgone production losses, especially for fabless semiconductor companies.

Current practice for FT low yield analysis is to monitor production FT yield. If there is any low yield problem, engineers need to manually review all related production data and identify the root cause. There are two major categories of root causes. The first is front-end WF process variation. The second is backend manufacturing flow problems, involving package types, product configurations, test facilities, human interference, etc. However, due to the lack of high-dimensional and unstructured data analysis capability, it is very time consuming to carry out manual root cause analysis, which results in a prolonged corrective action process.

In this paper, we propose a holistic framework for FT yield prediction using a suite of machine learning techniques. The framework is able to predict FT yield at the WF stage itself, which implies that FT low yield problems can be identified two months earlier when compared to current practice.

The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Li.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 197885

FIGURE 1. Semiconductor manufacturing flow and final test yield prediction at wafer fabrication stage.

The novelty of our framework is that it takes into consideration all manufacturing related parameters and is able to automatically handle numeric, categorical, nominal and cardinal types of manufacturing data. Based on the output from the framework, corrective actions can be taken to reduce yield loss at an earlier stage, as illustrated in Fig. 1. Using the WAT measurements and backend manufacturing flow parameters, our proposed model is able to classify wafer material into different yield sub-populations. Based on the binning or multi-modal classification of the yield, wafer process adjustment or backend related manufacturing flow adjustment can be selectively carried out. For example, low to moderate yield sub-population wafers can be intentionally used for fabrication of low-end non-critical application products (such as home-based security and IoT solutions) and sold to customers in such markets. Moreover, the testing priority can be adjusted for each yield sub-population, and resource allocation and shipment forecasting are made easier based on the prediction results. Our data-driven decision-making process is able to overcome the limitations of manual work for low yield materials' data review and disposition.

II. OVERVIEW OF RELATED WORK
Recent semiconductor yield prediction studies focus mainly on the WF and WS stages. The prediction targets are wafer map design optimization [1], [2], wafer map defect pattern monitoring [3]-[5] as well as wafer yield prediction [6]. Jang et al. [1] introduced a die level yield prediction model with wafer die spatial features as input parameters. The model was used to evaluate the productivity of wafer maps considering yield variations based on the die positions and die sizes of a wafer map. Studies by Kim et al. [2] also used wafer die spatial features as input parameters, and the model focused on evaluation of lithography process related yield problems. Both these studies used the Deep Neural Network (DNN) algorithm. A Convolutional Neural Network (CNN) was also applied by Nakata et al. [3] for wafer map failure pattern monitoring. The authors proposed a framework to classify failure patterns taking the wafer map as input. Their results showed that CNN outperforms the Support Vector Machine (SVM). A novel DNN model was proposed by J. Wang et al. in Ref. [4] to resolve the wafer map imbalanced data problem. DNN is widely used in wafer map related studies because it is suitable for pattern recognition applications and wafer maps can be treated as images [5]. However, DNN is more suitable for large datasets and may not be suitable for yield prediction due to the over-fitting problem and poor model visibility, as mentioned in [6]. Kong and Ni [6] used SVM and a partial least squares algorithm to predict functional block based die yield with inline metrology data as input parameters. Their yield model was based on the assumption that wafer yield loss is dominated by inline defects, which may not be suitable for other situations in a real production environment. Kim et al. in Ref. [7] discussed equipment-related variable selection for WS yield prediction. Partial least squares and least absolute shrinkage and selection operator regression were utilized as prediction models. The limitation of their work is that the proposed model only allowed for prediction of two wafer process parameters' performance instead of the overall yield. Besides, the model does not show the relationship between the two measurements. In practice, if the two measurements are not independent, adjustment of one of the equipment variables will affect the other measurement as well. Therefore, the overall yield may not necessarily increase when the correlation between the measurements is ignored.

There are few studies on FT yield prediction. WS measurements and wafer spatial features were used as input by S. Kang et al. in Ref. [8] to predict two types of die level FT yield. To the best of our knowledge, there is no prior study on FT yield prediction using WAT data yet. Besides, most of the past studies mentioned above tend to predict a specific failure mode, which does not cover all failure types. The input data were closely related to low yield problems based on engineers' past experience or prior knowledge of the root cause. Their input data, including wafer die features, WS measurements, inline metrology data or process equipment information, only represent certain manufacturing stages. For example, wafer map design is one of the most important factors for wafer low yield problems. However, wafer low yield root causes are not limited to wafer map design. They can be related to product design constraints, human error, equipment or subcontractors' performance deviations, etc. In our proposed framework, all production related parameters, including both numerical and categorical


parameters are considered as input parameters to account for all manufacturing stages' factors in the yield model. Moreover, no manual data labelling or filtering is required; the data can be automatically fed into our FT yield prediction model framework.

III. METHODOLOGIES ADOPTED IN OUR FRAMEWORK
A. OUTPUT DISCRETIZATION
Production FT yield is a continuous random variable ranging from 0% to 100%. The yield distribution can sometimes be multi-modal, highly skewed or long tailed. The distribution variation is caused by different wafer fab technologies and product-specific manufacturing flows. For individual products, the majority of materials' FT yield is acceptable to ship to customers when all the processes are well controlled. Only a small portion of low yield material needs further analysis and reprocessing. Normally, the low yield threshold is decided based on engineers' past experience and manual review of historical production data. In order to define material as high yield or low yield, output discretization is required to convert the numeric yield into categorical classes. Past semiconductor yield studies used the quartile discretization method, which aimed to identify excursions for points outside the region of (Q1 − k × IQR, Q3 + k × IQR), where k is a constant and the interquartile range (IQR) is the difference between the third and the first quartiles (Q3 and Q1) of the yield [9]. However, this method is not suitable for multi-modal yield distributions. The equal width and equal frequency binning approaches are also commonly used discretization methods. However, they require users to decide on the number of intervals, k, and then discretize the continuous attribute into k intervals; a hard-coded k may not be suitable for all products. Other discretization algorithms focus on either minimizing the number of identified intervals or maximizing the classification accuracy.

The purpose of our framework is not only to optimize prediction accuracy, but also to correctly identify different yield classes and provide guided material disposition and root cause analysis. Materials subject to similar manufacturing flows tend to have similar yield distributions. For products with the same manufacturing flow but different FT subcontractors, the major low yield root cause is that one subcontractor's test performance is worse than the others'. The FT yield distribution then becomes bi-modal due to that particular subcontractor. For output discretization, we need to differentiate the target beforehand due to subcontractor variations. Therefore, in our framework, we propose using Gaussian Mixture Models (GMM) to automatically cluster and identify the optimal number of FT yield classes.

B. GAUSSIAN MIXTURE MODELS (GMM)
GMM is a probabilistic model representing a mixture of Gaussian distributions. It is a popular statistical technique and widely used for clustering problems, heterogeneous populations and multivariate density estimation. Let M denote the number of Gaussian components, which represent the FT yield classes in our proposed framework. Assume we have a training dataset with N FT yield values {X_1, ..., X_N}. Let Z be the latent variable, where

    p(Z = m) = \pi_m, \quad m = 1, \ldots, M,    (1)

\pi_m are the mixture weights for the M components, and therefore

    \sum_{m=1}^{M} \pi_m = 1    (2)

The joint probability of X with the latent variable Z is

    P(X, Z) = \sum_{m=1}^{M} \pi_m \mathcal{N}(X_i \mid \mu_m, \Sigma_m)    (3)

where \pi_m, \mu_m, \Sigma_m are unknown parameters representing the m-th Gaussian component's mixture weight, mean and covariance. Therefore, the log-likelihood is

    L(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \sum_{m=1}^{M} \pi_m \mathcal{N}(X_i \mid \mu_m, \Sigma_m)    (4)

One of the popular algorithms to estimate the maximum likelihood of the GMM's unknown parameters is Expectation-Maximization (EM) [10]. The algorithm optimizes the log-likelihood function L(\pi, \mu, \Sigma) iteratively by repeating the following two steps until no further parameter update is required or the update falls below a predefined threshold.

1) EXPECTATION-STEP
The first step is to compute the posterior distribution of the latent variable Z:

    P(Z_i = m \mid X_i) = \frac{P(X_i \mid Z_i = m) P(Z_i = m)}{P(X_i)}
                        = \frac{\pi_m \mathcal{N}(X_i \mid \mu_m, \Sigma_m)}{\sum_{m'=1}^{M} \pi_{m'} \mathcal{N}(X_i \mid \mu_{m'}, \Sigma_{m'})} = \gamma_{Z_i}(m)    (5)

2) MAXIMIZATION-STEP
Once we compute the value of \gamma_{Z_i}(m), the parameters \pi_m, \mu_m, \Sigma_m can be updated using the equations below:

    \pi_m = \frac{\sum_{i=1}^{N} \gamma_{Z_i}(m)}{N}    (6)

    \mu_m = \frac{\sum_{i=1}^{N} \gamma_{Z_i}(m) X_i}{\sum_{i=1}^{N} \gamma_{Z_i}(m)}    (7)

    \Sigma_m = \frac{\sum_{i=1}^{N} \gamma_{Z_i}(m) (X_i - \mu_m)^2}{\sum_{i=1}^{N} \gamma_{Z_i}(m)}    (8)

C. ONE HOT ENCODER AND LABEL ENCODER
In our FT yield prediction framework, part of the input data are descriptive text data which need to be converted to a numerical representation in order to fit into the machine learning models. One Hot Encoding and Label Encoding are two of the most popular techniques for categorical parameter pre-processing. Both of them have benefits and drawbacks. The characteristics of the dataset determine which one should be used for categorical data pre-processing.
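To make the clustering step of Section III-B concrete, the E-step of Eq. (5) and the M-step updates of Eqs. (6)-(8) can be sketched in plain Python for the one-dimensional yield case. This is an illustrative sketch on synthetic data, not the implementation used in the paper (which additionally selects the number of components via BIC, as described in Section IV):

```python
import math
import random

def normal_pdf(x, mu, var):
    # Gaussian density N(x | mu, var); var is the scalar variance.
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, m, iters=200):
    """EM for a 1-D Gaussian mixture: the E-step computes responsibilities
    gamma (Eq. 5); the M-step re-estimates weights, means and variances
    (Eqs. 6-8)."""
    n = len(xs)
    # Crude initialization: split the sorted data into m equal slices.
    srt = sorted(xs)
    slices = [srt[i * n // m:(i + 1) * n // m] for i in range(m)]
    pis = [len(s) / n for s in slices]
    mus = [sum(s) / len(s) for s in slices]
    vars_ = [max(sum((x - mu) ** 2 for x in s) / len(s), 1e-6)
             for s, mu in zip(slices, mus)]
    for _ in range(iters):
        # E-step: gamma[i][k] proportional to pi_k * N(x_i | mu_k, var_k).
        gamma = []
        for x in xs:
            w = [pis[k] * normal_pdf(x, mus[k], vars_[k]) for k in range(m)]
            tot = sum(w) + 1e-300  # guard against numerical underflow
            gamma.append([wk / tot for wk in w])
        # M-step: update each component's weight, mean and variance.
        for k in range(m):
            nk = sum(g[k] for g in gamma)
            pis[k] = nk / n
            mus[k] = sum(g[k] * x for g, x in zip(gamma, xs)) / nk
            vars_[k] = max(sum(g[k] * (x - mus[k]) ** 2
                               for g, x in zip(gamma, xs)) / nk, 1e-9)
    return pis, mus, vars_

# Synthetic bi-modal "FT yield": a dominant high-yield mode near 98% and a
# small low-yield mode near 90% (illustrative numbers only).
random.seed(0)
yields = [random.gauss(98.0, 0.5) for _ in range(300)] + \
         [random.gauss(90.0, 1.0) for _ in range(60)]
pis, mus, _ = fit_gmm_1d(yields, m=2)
```

On this synthetic sample the two recovered component means land near 90% and 98%, mirroring the subcontractor-induced bi-modal case described above; the smaller mixture weight corresponds to the low-yield sub-population.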


The benefit of the Label Encoder is that it does not increase the dimension of the input data. The Label Encoder directly uses integers to represent the different text values under the same categorical parameter. For example, product A, to be discussed in a later section, has one categorical parameter named Tester Type, which includes 6 different types. After applying the Label Encoder, integers from 0 to 5 are used to represent the different Tester Types. One of the drawbacks of the Label Encoder, however, is that it introduces a meaningless numerical comparison between the different types. Practically, there should be no weight or ordering difference between the testers, and they should be treated equally during model training. The ordering problem can bring misinterpretation, thereby affecting model performance.

The One Hot Encoder is able to avoid this problem. It creates new parameters to represent the different categorical values, and assigns 0 or 1 to indicate whether or not each data point belongs to the particular category. All the processed columns have an independent relationship. For example, after applying the One Hot Encoder to the Tester Type column, the parameter column vector size increases from 1 to 6, where each column denotes one type of Tester. For a particular data point, if it uses TESTER_03, then the third element of the vector, TESTER_03, is 1 and the entries in the other five columns are all zero (0). However, the disadvantage of the One Hot Encoder is that it leads to a huge increase in the input dimension space. For input parameters with high cardinality, this will result in sparse and high dimensional data, thereby resulting in poor model performance.

D. F1 MACRO MEASUREMENT METRIC
FT yield prediction is either a binary or a multi-class problem with an imbalanced dataset. Normally, production yield distributions are skewed towards the high yield end, while low yield wafers tend to be the minority. The common model accuracy metric is not suitable in such a case because the majority class will tend to dominate the accuracy result. However, the low yield wafers' analysis and disposition are more important from a cost perspective.

Precision and recall are two effective metrics for evaluating model performance on imbalanced datasets. Let TP, FP, FN denote True Positives, False Positives and False Negatives. The precision for any individual class m is defined as

    P_m = \frac{TP_m}{TP_m + FP_m}    (9)

It measures the probability that a wafer classified as the ''positive class'' is truly positive. It focuses on reducing FP. For example, let the positive class stand for low yield wafers. By increasing Precision, it helps reduce cases where high yield class wafers are misclassified as low yield wafers. From a production control point of view, it helps reduce wastage of good material and avoid excessive wafer fabrication.

Recall evaluates the ratio of TP over all positive class wafers. It is defined as

    R_m = \frac{TP_m}{TP_m + FN_m}    (10)

It aims to reduce FN, which is the case when actual low yield class wafers are not fully identified. Both metrics are equally important, as we need a balance between Precision and Recall to better control production cost and shipment forecast. Therefore, in this framework, we use the F1 metric, which takes into account both Precision and Recall. The F1 score is defined in Ref. [11] as

    F1_m = 2 \times \frac{P_m \times R_m}{P_m + R_m}    (11)

For a multi-class classification problem, metrics can be computed using a micro averaging or a macro averaging method. The micro averaging method computes the metric from the total counts of TP, FP and FN over all classes. In contrast, the macro averaging method takes the average of the per-class performances. It is known that macro-averaged scores are more influenced by the performance of rare categories, as mentioned by Y. Yang et al. in Ref. [12]. Therefore, F1-macro is the apt model evaluation metric for our proposed framework. Since F1-macro treats all classes equally, it can be mathematically defined as

    F1(macro) = \frac{\sum_{m=1}^{M} F1_m}{M} = \frac{2}{M} \sum_{m=1}^{M} \frac{P_m \times R_m}{P_m + R_m}    (12)

IV. FINAL TEST YIELD CLASSIFICATION FLOW
The overall FT yield classification flow framework is illustrated in Fig. 2. In this section, we explain the following steps in detail.

A. DATA PREPROCESSING
1) NUMERICAL AND CATEGORICAL INPUT DATA
The numeric input data are the WAT parameters, whose range varies widely between 10^{-13} and 10^{3}. It is important to keep all the numerical values in a similar range of value or magnitude. To do this, first, a WAT parameter standard scaler is generated by fitting with historical production WAT data. For dimension reduction purposes, the Pearson correlation is calculated for the WAT data, and highly correlated WAT parameters are removed if their correlation coefficient exceeds 0.9. Thereby, a WAT standard scaler with reduced dimension is generated. Any new incoming production WAT data are transformed using this scaler.

The categorical input data describe the variety in the manufacturing configurations. In general, the categorical data include wafer technology, RAM/ROM versions, firmware versions, package types, product functionality, fab and test locations, test program versions, tester and handler types, etc. They are descriptive string type data and can be either nominal or ordinal. The number of parameters and their values differ across products and production lines. For example, IoT products tend to have many customized firmware versions, while audio products' firmware versions are relatively more standardized. The most significant categorical parameters are selected using ANOVA analysis. Categorical parameters with significance levels corresponding to a p-value ≥ 0.05 are removed.
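The numerical pre-processing steps just described — standardize each WAT parameter, then drop one of every pair whose Pearson correlation coefficient exceeds 0.9 — can be sketched as follows. This is a simplified pure-Python illustration with made-up values; the paper's pipeline fits the scaler on historical production WAT data:

```python
def pearson(a, b):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (va * vb)

def standardize(col):
    # Zero-mean, unit-variance scaling (the "standard scaler" step).
    n = len(col)
    mu = sum(col) / n
    sd = (sum((x - mu) ** 2 for x in col) / n) ** 0.5
    return [(x - mu) / sd for x in col]

def prune_correlated(columns, threshold=0.9):
    """Keep a column only if its |correlation| with every already retained
    column stays at or below the threshold (greedy, in column order)."""
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[j])) <= threshold for j in kept):
            kept.append(idx)
    return kept

# Three hypothetical WAT parameters: p1 nearly duplicates p0, so it is pruned.
p0 = [1.0, 2.0, 3.0, 4.0, 5.0]
p1 = [1.1, 2.0, 2.9, 4.2, 5.0]   # correlation with p0 well above 0.9
p2 = [5.0, 1.0, 4.0, 2.0, 3.0]   # unrelated pattern, kept
kept = prune_correlated([standardize(c) for c in (p0, p1, p2)])
```

Standardization does not change correlations (it is an affine rescaling), so the pruning result is the same whether it is applied before or after scaling; the order above simply mirrors the scaler-then-reduce flow described in the text.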

FIGURE 2. Flow chart for final test yield classification.

To overcome the limitations imposed by the Label Encoder and One Hot Encoder methods as discussed previously, both of them are used for categorical input conversion. Based on the overview from Kline [13], a sample size of at least 10 per parameter is required to obtain trustworthy results. Therefore, we propose here a guideline that when the ratio between the number of data points and


the number of input parameters is larger than 10, it is recommended to use the Label Encoder; for all other cases, the One Hot Encoder is preferred.

2) OUTPUT DATA
The output data ''FT Yield'' is a continuous random variable. To convert yield prediction into a classification problem, we need to carry out output labeling. In this paper, we propose to use the GMM for FT yield clustering and labeling. The number of classes for binning and evaluation can range anywhere from one to four. The GMM model with the lowest Bayesian Information Criterion (BIC) score is selected. BIC is a common model selection criterion which allows for a penalty based approach for fitting a statistical mixture distribution to sample data. The penalty term in BIC prevents redundant overfitting of the data. It is easy to interpret by visual inspection and can be used generally with any prior [14]. We avoid considering a class size higher than four, as it may result in poor prediction performance due to highly imbalanced datasets and is practically unnecessary for production wafer disposition in the semiconductor domain.

B. MODEL TRAINING AND VALIDATION
After data preprocessing, the train-test dataset split is carried out. 90% of the data are used for training and validation, while 10% of the data are used for testing at the final stage. A ten-split stratified cross validation step is used for model selection, because its bias and variance are relatively lower compared to regular cross validation methods, as concluded by Kohavi et al. in Ref. [15]. Several popular and diversified classifiers are applied in this step, as listed below along with the justification for their choice.

The Support Vector Machine Classifier (SVC) uses hyperplanes and kernel tricks to classify different groups. It was used in several semiconductor yield related studies [3], [6]. The K Nearest Neighbor (KNN) classifier determines the result based on a majority vote from the K closest neighbors. The neighbors are selected based on the distance metric function. It is a non-parametric algorithm [16] and easy to implement, but very sensitive to training samples. It is suitable for classification problems without any prior knowledge of the training dataset because the KNN algorithm does not require any assumptions on the underlying data distribution. A Gaussian process is a natural way of defining prior distributions over functions of one or more input variables [17]. It is widely used in statistical settings and machine learning applications due to its high flexibility, ability to render interpretable results and its conceptual simplicity [18]. For model selection here, we use the Gaussian Process Classifier (GP) with Laplace's method for approximating the Bayesian inference. The reason we choose GP is that it is robust to noisy data and able to work even for small datasets. Moreover, models defined with GP can discover higher-level properties of the data, such as which inputs are relevant to predicting the response [17]. Logistic Regression (LR) models are used to understand data from a wide variety of disciplines. LR is best known and used in the medical and healthcare domains. It is also commonly used in the social sciences, economic research and the physical sciences. LR is one of the statistical tools used in Six Sigma quality control analyses, and it plays an important role in the data mining domain [19]. LR is able to explain the relationship between one dependent data variable and one or more nominal and ordinal independent variables, which is suitable for FT yield classification. Therefore, it is used as the benchmark for model performance measurement.

The remaining three classifiers are ensemble learning methods based on decision trees. The Extra Trees Classifier (XT) builds an ensemble of unpruned decision trees according to the classical top-down procedure. It essentially consists of strongly randomizing both attribute and cut-point choice while splitting a tree node [20]. The benefit of XT is that its variance is smaller compared to weak randomization methods like Random Forest. The other main strength of the XT algorithm is its high computational efficiency. The Gradient Boosting (GB) model utilizes a machine learning based boosting method. It can be used as a generic algorithm to find approximate solutions to the additive modeling problem [21]. It improves weak learners' performance by minimizing the loss function. The reason we choose GB is that it produces competitive, highly robust and interpretable procedures for classification [22]. Finally, XGBoost (XGB) provides an efficient and scalable implementation of the gradient boosting framework with L1 and L2 regularization to improve model generalization [23]. It is widely used by data scientists to achieve state-of-the-art results on many machine learning challenge datasets [24]. XGB is able to handle sparse and noisy data, and its parallel and distributed computing makes execution faster than traditional GB, thereby enabling quicker model exploration.

C. MODEL OPTIMIZATION AND ENSEMBLE
Based on the above cross validation results, the top three high performance models are selected using F1-macro measurements. Grid search with cross validation is applied for hyperparameter tuning and optimization of the top models. Hard voting and soft voting grid search results are compared to decide which of them is to be used as the final voting classifier (VC). The hard voting result is computed based on an average weight for each classifier, whereas soft voting sums up the prediction probability values of each classifier to obtain the prediction result. Finally, the test result is generated using the 10% test dataset.

V. EXPERIMENTS AND RESULTS
We have conducted experiments for three different products in this study. All production manufacturing data are provided by Silicon Laboratories. In this section, we discuss product A's classification procedure in detail. The other two products' prediction flow is similar, and therefore a summary of their results is presented in tabular format.

Product A has 1887 backend lots, with FT yield ranging from 82.36% to 99.27%.
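The model selection logic of Sections III-D and IV-C — score each candidate classifier by the macro-averaged F1 of Eq. (12), then keep the top three for tuning and voting — can be sketched as follows. The classifier names are the paper's, but the cross validation scores here are made up purely for illustration:

```python
def f1_macro(y_true, y_pred, classes):
    """Macro-averaged F1 (Eq. 12): the unweighted mean of per-class F1,
    so rare (low-yield) classes count as much as the majority class."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def top_k_models(cv_scores, k=3):
    # Rank models by mean cross validation F1-macro and keep the best k.
    ranked = sorted(cv_scores, key=cv_scores.get, reverse=True)
    return ranked[:k]

# Hypothetical mean CV F1-macro scores for the seven candidate classifiers.
cv_scores = {"SVC": 0.35, "KNN": 0.61, "GP": 0.80,
             "LR": 0.52, "XT": 0.78, "GB": 0.70, "XGB": 0.78}
top3 = top_k_models(cv_scores)
```

Note how, on an imbalanced labeling, a classifier that always predicts the majority class gets a high plain accuracy but a poor F1-macro, since the minority (low-yield) class contributes an F1 of zero to the average — exactly the failure mode the metric is chosen to expose.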


TABLE 1. Input and output parameters after data pre-processing. Categorical input pre-processing using one hot encoder.

FIGURE 4. Product A cross validation: F1-macro comparison for One Hot


Encoder, Label Encoder and without categorical input parameters.

the encoded number is numerically much larger than WAT


parameters which ranges from −3 to 3. SVC using Label
Encoder score has the lowest value of 0.345 in Fig.4. As it
FIGURE 3. Product A - FT yield GMM clustering result. is based on the Euclidean distance, a high cardinality cate-
gorical parameter can distort the decision plane. The KNN is
three classes are identified using the GMM clustering method, as shown in Fig 3. Each backend lot consists of 74 WAT parameters and 18 categorical parameters. After numerical and categorical input data pre-processing, the number of WAT parameters is reduced to 57 and the number of categorical parameters to 3. After applying the One Hot Encoder technique to the significant categorical parameters, the categorical dimension increases from 3 to 86. The total input dimension is therefore 143, still less than 10% of the training dataset size. Therefore, the One Hot Encoder is preferred for this product. The overall input and output data after pre-processing are presented in Table.1.

To compare the performance of the different encoding methods, Fig.4 shows the cross validation F1-macro results for the One Hot Encoder, the Label Encoder, and the case without categorical inputs. The F1-macro result for each model is the mean value over a 10-split cross validation. Using the Label Encoder, the total number of input parameters is 60; in the case without categorical inputs, the inputs are only the WAT parameters. Comparing the mean F1-macro across all 7 models, the One Hot Encoder method has the highest score of 0.714, while the Label Encoder has the lowest score of 0.652 and performs even worse than omitting the categorical inputs, which scores 0.672. The reason for this trend is that product A's WAT parameters are the major root cause of the low yield problem, while the Label Encoder introduces an ordering problem for high cardinality categorical parameters; this artificial ordering also affects KNN, which is based on Euclidean distance. Even so, KNN's performance is better than SVC's. The reason is that Product A's dataset is not linearly separable, as illustrated by the poor LR results, and KNN is more suitable for this non-linear problem than the Support Vector Machine Classifier (SVC) model. Besides, KNN is able to handle noisy datasets [25], which is another reason it outperforms SVC. The tree-based models, including GB, XT and XGB, perform better than the non-tree-based models, because tree-based models are better at handling a mixture of categorical and continuous numerical parameter values.

During the model validation and selection step, the three top models are selected for further optimization. Based on the validation results in Fig.4, the top 3 models are XT, GP and XGB for both encoding methods. The three models' F1-macro scores are 0.776, 0.799 and 0.784 using the One Hot Encoder, with an average of 0.786, and 0.783, 0.744 and 0.765 using the Label Encoder, with an average of 0.764. Although the One Hot Encoder XT model's F1-macro score is slightly lower than that of the Label Encoder, the average performance is 2.88% higher than the Label Encoder's. Therefore, the One Hot Encoder method is confirmed to be preferred for product A. The detailed numbers of input parameters and the comparison of cross validation results are summarized in Table.2 for the three products. For Product B and Product C, the cross validation results of all models are presented in Fig.5 and Fig.6.
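The encoding comparison described above can be sketched with scikit-learn. The sketch below is illustrative only: the lot count, category cardinalities and the ExtraTrees stand-in classifier are assumptions mirroring the text, not the actual production data or pipeline.

```python
# Illustrative sketch of the One Hot vs. Label (ordinal) encoding comparison
# with 10-split cross-validated F1-macro. Data is synthetic: 57 "WAT"
# columns and 3 "categorical" columns are assumptions, not real lot data.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(0)
n = 1500                                    # backend lots (made-up size)
X_num = rng.normal(size=(n, 57))            # 57 retained WAT parameters
X_cat = rng.integers(0, 30, size=(n, 3))    # 3 significant categorical params
X = np.hstack([X_num, X_cat.astype(float)])
y = rng.integers(0, 3, size=n)              # three GMM-derived yield classes

cat_cols = list(range(57, 60))
encoders = {
    "one_hot": OneHotEncoder(handle_unknown="ignore"),
    "label": OrdinalEncoder(),              # per-column integer codes
}
for name, enc in encoders.items():
    pre = ColumnTransformer([("cat", enc, cat_cols)], remainder="passthrough")
    model = make_pipeline(pre, ExtraTreesClassifier(n_estimators=50,
                                                    random_state=0))
    scores = cross_val_score(model, X, y, cv=10, scoring="f1_macro")
    print(f"{name}: mean F1-macro over 10 splits = {scores.mean():.3f}")
```

With real lots, the one-hot branch expands the categorical columns to their combined cardinality (3 columns becoming 86 in the text above), while the ordinal branch keeps the input at 60 columns but imposes an artificial ordering.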

VOLUME 8, 2020 197891


D. Jiang et al.: Novel Framework for Semiconductor Manufacturing FT Yield Classification Using Machine Learning Techniques

TABLE 2. Three product input parameters and cross validation results’ comparison with different encoding methods.

TABLE 3. Test result for 3 products.

FIGURE 5. Product B cross validation: F1-macro comparison for One Hot Encoder, Label Encoder and without categorical input parameters.

FIGURE 6. Product C cross validation: F1-macro comparison for One Hot Encoder, Label Encoder and without categorical input parameters.

For product B, categorical parameters play an important role in FT yield variation based on engineers' past experience, and this is reflected in the cross validation results: the validation results using categorical inputs are better than those without them for all models except SVC, whose performance is the worst when handling a mixture of continuous numeric and categorical inputs. For the One Hot Encoder method, the average F1-macro score of the 7 models is 0.772, higher than the Label Encoder's 0.751. If we evaluate only the top 3 models, namely XT, KNN and XGB, the Label Encoder's average of 0.799 is slightly better than the One Hot Encoder's 0.798. For Product C, the Label Encoder F1-macro results are better for both the seven models and the top three models. Overall, the results show that choosing the encoding method according to our proposed guideline achieves better performance than the alternative encoding method or omitting the categorical parameters.

Based on the three products' validation results, it can be seen that prediction performance varies between models and between encoding methods. Therefore, during data pre-processing and machine learning model selection, it is important to consider the following factors, which guide the choice of model and its computational demand: the number of input parameters, the size of the training dataset, and the model's capability to handle mixed input types.

At the model optimization step, the hyperparameters of product A's top three models (XT, GP, XGB) are fine tuned with the grid search cross-validation method. Finally, the VC is generated from the optimized top three models. The VC F1-macro results on the 10% test dataset are compared with those of the other models in Fig.7. The detailed test results, including precision-macro, recall-macro and F1-macro, are presented in Table.3, with the best outcomes highlighted in bold. In general, the VC's performance on all three measurement metrics is top among all the models. Product A's recall-macro score of 0.809 is slightly lower than GP's 0.828; however, its precision-macro score of 0.859 is much higher than GP's 0.786, so the VC's overall performance is still better than GP's. It can be inferred from the table that F1-macro is a suitable metric for model selection and optimization for all three products.
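The optimization step described above can be sketched as a grid search over each shortlisted model followed by a soft-voting ensemble. Everything below is an illustrative assumption: the data is synthetic, the grids are made up, and logistic regression stands in for XGBoost so the sketch needs only scikit-learn.

```python
# Illustrative sketch: tune three candidate classifiers with grid-search
# cross-validation (scoring on F1-macro), then combine the tuned models
# in a voting classifier and score it on a held-out 10% test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1,
                                          stratify=y, random_state=0)

# Candidate models and made-up hyper parameter grids (not the paper's).
candidates = {
    "xt": (ExtraTreesClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [None, 10]}),
    "gp": (GaussianProcessClassifier(random_state=0),
           {"max_iter_predict": [50, 100]}),
    "lr": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
}
tuned = []
for name, (est, grid) in candidates.items():
    search = GridSearchCV(est, grid, scoring="f1_macro", cv=3).fit(X_tr, y_tr)
    tuned.append((name, search.best_estimator_))

# Soft voting averages the class probabilities of the tuned estimators.
vc = VotingClassifier(tuned, voting="soft").fit(X_tr, y_tr)
print("VC test F1-macro:",
      round(f1_score(y_te, vc.predict(X_te), average="macro"), 3))
```

Soft voting requires every base model to expose `predict_proba`, which is one reason probabilistic classifiers such as GP pair well with tree ensembles here.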


FIGURE 7. Products A, B and C test dataset prediction result with a comparison between VC (voting classifier) and other classifiers.

FIGURE 8. Product A ExtraTree Classifier Top 15 Feature Importance Ranking.

The VC performance is the best among all the models for all three products, with F1-macro scores of 0.831, 0.889 and 0.961. This represents F1-macro improvements of 28.17%, 3.44% and 23.48% over a pure LR model for Products A, B and C, respectively.
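The macro-averaged metrics used throughout weight every yield class equally, so a rare low-yield class counts as much as the dominant high-yield class. A minimal sketch with made-up labels, where 0/1/2 are the three yield classes:

```python
# Illustrative computation of the macro-averaged metrics reported above.
# y_true / y_pred are made-up labels over the three yield classes (0, 1, 2).
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0, 0]

# "macro" averages the per-class scores with equal class weights, so
# minority low-yield classes are not swamped by the majority class.
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f = f1_score(y_true, y_pred, average="macro")
print(f"precision-macro={p:.3f} recall-macro={r:.3f} F1-macro={f:.3f}")
```

All three come out to 25/36 ≈ 0.694 here, because each class happens to have equal precision and recall in this toy example.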
VI. FEATURE IMPORTANCE ANALYSIS
Based on the above analysis, we have now generated suitable classifiers for FT yield prediction. With this, we can proceed to carry out a feature importance analysis of the data using Gini importance [26]. Gini importance is a general indicator of feature relevance: it describes the importance of a feature by computing the normalized total reduction of the criterion (the objective function, in our case the discrete class value of the yield sub-population, i.e. 0, 1 or 2) brought about by that feature [26]. The best classifier for product A turns out to be XT. We carry out the feature importance ranking and visualisation of the fitted XT model using the method prescribed in Ref. [27]. The resulting 15 most important features are plotted in Fig 8. It can be seen that the top three features are all categorical: PACKAGE_MSOP_3, PROGRAM_37 and TESTER_024. Based on the production data investigation, we confirm that the material tested with PROGRAM_37 and TESTER_024 did exhibit low yield problems. Corrective actions can now be taken to modify the test program, as well as the tester, for yield improvements.

One of the three important features identified, PACKAGE_MSOP3, stands for the assembly package type, which is in turn related to the product functionality. Yield variations between different product functionalities are most likely caused by different WAT parameters. Therefore, the WAT parameter feature importance analysis can now be done under two conditions: the first considers the dataset with assembly package PACKAGE_MSOP3 only, and the second uses the dataset without PACKAGE_MSOP3. The same model optimization process is then repeated for these two conditions, and the resulting top 15 feature importance results are shown in Fig 9(a) and Fig 9(b). The top three most important features are given descriptive names in Fig 9 for better understanding. It is worth noting that the top three most important features are completely different between the two conditions. For the product with only one package type, PACKAGE_MSOP3, the top three WAT parameters are all contact resistance related, including Contact_Resistance_DNW, Contact_Resistance_Nw and Contact_Resistance_TV, while for the other package types the top three WAT parameters are Continuity_M6, Threshold_Voltage_N4H and Contact_Resistance_V3. These results can now be used for fine tuning the top three WAT parameters, separately for the different package types, so as to optimize the product yield further.

FIGURE 9. Product A ExtraTree Classifier based feature importance ranking analysis results (a) without PACKAGE_MSOP3 and (b) with PACKAGE_MSOP3, showing the 15 most sensitive input parameters. There is a significant change in the order of importance of the input parameters for these two cases.

VII. CONCLUSION
In this paper, we have introduced a novel framework for final test yield prediction at the wafer fabrication stage. This is a challenging task since there are many unknown factors between WF and FT that can cause FT low yield problems.
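The Gini-importance ranking used in the feature importance analysis can be sketched with scikit-learn's `feature_importances_` attribute. The data and feature names below are synthetic stand-ins (only the three categorical names echo the ones discussed above), so the printed ranking is illustrative only.

```python
# Illustrative sketch of an ExtraTrees Gini-importance top-15 ranking.
# feature_importances_ is the normalized mean decrease in Gini impurity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           n_classes=3, random_state=0)
# Made-up feature names; only the last three echo the text above.
names = ([f"WAT_{i:02d}" for i in range(27)]
         + ["PACKAGE_MSOP3", "PROGRAM_37", "TESTER_024"])

xt = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(X, y)
order = np.argsort(xt.feature_importances_)[::-1][:15]   # top 15 features
for rank, idx in enumerate(order, start=1):
    print(f"{rank:2d}. {names[idx]:14s} {xt.feature_importances_[idx]:.4f}")
```

Because the importances are normalized to sum to one, the same ranking can be compared across the with/without-PACKAGE_MSOP3 conditions described above.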


Three different products' production data were fitted into the framework and all of them achieved a good F1-macro score. This framework can be extended to predict the yield of any semiconductor production stage based on the production data collected prior to that stage. The major contribution of our framework is that it automatically converts semiconductor production related data into a yield prediction model without prior knowledge of the data. Furthermore, the framework introduces a model selection and ensemble method to achieve good model performance for both binary and multi-class problems.

The novelty of our study is that we propose a generic framework that can be applied to address several semiconductor manufacturing yield problems for advanced logic / memory technology nodes. The framework is robust, scalable and configurable: it accepts both numerical and categorical inputs, maps their relationships to the output yield of multiple product lines, and allows for automated feature importance analysis, sensitivity analysis and multi-modal yield classification. Our framework takes all the manufacturing related parameters as input data, with no need for manual filtering, in contrast to existing yield prediction models [9], [28]. Additional machine learning models can also be added as candidate models during the model selection step to accommodate other types of yield problems.

Future work will involve further methodological exploration into improving the model performance and enabling dedicated low yield root cause analysis. The low yield root cause analysis task should automatically identify whether the causal factor is WAT related or production flow related, so as to provide corrective and effective recommendations for faster turnaround on yield improvements.

REFERENCES
[1] S.-J. Jang, J.-H. Lee, T.-W. Kim, J.-S. Kim, H.-J. Lee, and J.-B. Lee, "A wafer map yield model based on deep learning for wafer productivity enhancement," in Proc. 29th Annu. SEMI Adv. Semiconductor Manuf. Conf. (ASMC), Apr. 2018, pp. 29–34.
[2] J.-S. Kim, S.-J. Jang, T.-W. Kim, H.-J. Lee, and J.-B. Lee, "A productivity-oriented wafer map optimization using yield model based on machine learning," IEEE Trans. Semiconductor Manuf., vol. 32, no. 1, pp. 39–47, Feb. 2019.
[3] K. Nakata, R. Orihara, Y. Mizuoka, and K. Takagi, "A comprehensive big-data-based monitoring system for yield enhancement in semiconductor manufacturing," IEEE Trans. Semiconductor Manuf., vol. 30, no. 4, pp. 339–344, Nov. 2017.
[4] J. Wang, Z. Yang, J. Zhang, Q. Zhang, and W.-T.-K. Chien, "AdaBalGAN: An improved generative adversarial network with imbalanced learning for wafer defective pattern recognition," IEEE Trans. Semiconductor Manuf., vol. 32, no. 3, pp. 310–319, Aug. 2019.
[5] T. Ishida, I. Nitta, D. Fukuda, and Y. Kanazawa, "Deep learning-based wafer-map failure pattern recognition framework," in Proc. 20th Int. Symp. Qual. Electron. Design (ISQED), Mar. 2019, pp. 291–297.
[6] Y. Kong and D. Ni, "A practical yield prediction approach using inline defect metrology data for system-on-chip integrated circuits," in Proc. 13th IEEE Conf. Autom. Sci. Eng. (CASE), Aug. 2017, pp. 744–749.
[7] K.-J. Kim, K.-J. Kim, C.-H. Jun, I.-G. Chong, and G.-Y. Song, "Variable selection under missing values and unlabeled data in semiconductor processes," IEEE Trans. Semiconductor Manuf., vol. 32, no. 1, pp. 121–128, Feb. 2019.
[8] S. Kang, S. Cho, D. An, and J. Rim, "Using wafer map features to better predict die-level failures in final test," IEEE Trans. Semiconductor Manuf., vol. 28, no. 3, pp. 431–437, Aug. 2015.
[9] C.-F. Chien, C.-W. Liu, and S.-C. Chuang, "Analysing semiconductor manufacturing big data for root cause detection of excursion for yield enhancement," Int. J. Prod. Res., vol. 55, no. 17, pp. 5095–5107, Sep. 2017.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., B (Methodol.), vol. 39, no. 1, pp. 1–22, 1977.
[11] C. J. van Rijsbergen, Information Retrieval. London, U.K.: Butterworths, 1979.
[12] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proc. 22nd Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 1999, pp. 42–49.
[13] R. B. Kline, Principles and Practice of Structural Equation Modeling. New York, NY, USA: Guilford Publications, 2015.
[14] K. P. Burnham and D. R. Anderson, "Multimodel inference: Understanding AIC and BIC in model selection," Sociol. Methods Res., vol. 33, no. 2, pp. 261–304, 2004.
[15] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. IJCAI, vol. 14. Montreal, QC, Canada, 1995, pp. 1137–1145.
[16] A. Kataria and M. Singh, "A review of data classification using k-nearest neighbour algorithm," Int. J. Emerg. Technol. Adv. Eng., vol. 3, no. 6, pp. 354–360, 2013.
[17] R. M. Neal, "Monte Carlo implementation of Gaussian process models for Bayesian regression and classification," 1997, arXiv:physics/9701026. [Online]. Available: https://arxiv.org/abs/physics/9701026
[18] M. Seeger, "PAC-Bayesian generalisation error bounds for Gaussian process classification," J. Mach. Learn. Res., vol. 3, no. 2, pp. 233–269, Feb. 2003.
[19] J. M. Hilbe, Logistic Regression Models. Boca Raton, FL, USA: CRC Press, 2009.
[20] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees," Mach. Learn., vol. 63, no. 1, pp. 3–42, Apr. 2006.
[21] P. Bühlmann and T. Hothorn, "Boosting algorithms: Regularization, prediction and model fitting," Stat. Sci., vol. 22, no. 4, pp. 477–505, Nov. 2007.
[22] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Ann. Statist., vol. 29, no. 5, pp. 1189–1232, Oct. 2001.
[23] J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)," Ann. Statist., vol. 28, no. 2, pp. 337–407, Apr. 2000.
[24] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 785–794.
[25] H. A. Abu Alfeilat, A. B. Hassanat, O. Lasassmeh, A. S. Tarawneh, M. B. Alhasanat, H. S. Eyal Salman, and V. S. Prasath, "Effects of distance measure choice on K-nearest neighbor classifier performance: A review," Big Data, vol. 7, no. 4, pp. 221–248, Dec. 2019.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Oct. 2011.
[27] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[28] B. Lenz and B. Barak, "Data mining and support vector regression machine learning in semiconductor manufacturing to improve virtual metrology," in Proc. 46th Hawaii Int. Conf. Syst. Sci., Jan. 2013, pp. 3447–3456.

DAN JIANG received the B.Eng. degree in electrical and electronics engineering and the M.Eng. degree in communications engineering from Nanyang Technological University (NTU), Singapore, in 2013 and 2017, respectively. She is currently pursuing the Ph.D. degree with the Engineering Product Development (EPD) Pillar, Singapore University of Technology and Design (SUTD). Since 2013, she has been working as a Product Test Engineer at Silicon Laboratories International, on semiconductor product test and data analytics tools development. Her current research interests include semiconductor manufacturing yield prediction as well as big data and artificial intelligence for semiconductor process and quality optimization.


WEIHUA LIN received the B.Eng. degree in nuclear electronics from the University of Science and Technology of China, in 1991, and the M.Sc. degree in electronics engineering from the National University of Singapore (NUS), in 2001. He is currently the Senior Product Test Engineering (PTE) Director of Silicon Laboratories International. He has 30 years of working experience in several semiconductor companies, such as National Semiconductor, Lucent Microelectronics, and Silicon Labs. He has been focusing on IC test and product engineering as well as field application and customer support. He currently provides technical consultation in IC and module test development, product qualification, and yield optimization to achieve good product quality at the lowest possible cost.

NAGARAJAN RAGHAVAN (Member, IEEE) received the Ph.D. degree in microelectronics from the Division of Microelectronics, Nanyang Technological University (NTU), Singapore, in 2012. He is currently an Assistant Professor with the Engineering Product Development (EPD) Pillar, Singapore University of Technology and Design (SUTD). Prior to this, he was a Postdoctoral Fellow with the Massachusetts Institute of Technology (MIT), Boston, and at IMEC, Belgium, in joint association with the Katholieke Universiteit Leuven (KUL). His work focuses on integrated machine learning and physics-based reliability assessment, characterization and lifetime prediction for nanoelectronic devices, as well as material design for reliability, uncertainty quantification for additive manufacturing, and prognostics and health management of electronic devices and systems. He has authored/coauthored more than 190 international peer-reviewed publications and five invited book chapters. He was an Invited Member of the IEEE GOLD Committee (2012–2014). He was a recipient of the IEEE Electron Device Society (EDS) Early Career Award in 2016, an Asia–Pacific recipient of the IEEE EDS Ph.D. Student Fellowship in 2011, and the IEEE Reliability Society Graduate Scholarship Award in 2008. He serves as the General Chair for IEEE IPFA 2021 in Singapore, and has consistently served on the review committees of various IEEE journals and conferences, including IRPS, IIRW, IPFA and ESREF. He is currently serving as an Associate Editor for IEEE ACCESS.

