Report
Nguyen_Tra_My
2024-01-04
library(readxl)
german_credit <- read_excel("C:/Users/LUONG/Downloads/German_credit.xlsx")
head(german_credit)
## # A tibble: 6 x 11
## Creditability Duration_of_Credit_(mon~1 Purpose Instalment_per_cent Guarantors
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 18 2 4 1
## 2 1 9 0 2 1
## 3 1 12 9 2 1
## 4 1 12 0 3 1
## 5 1 12 0 4 1
## 6 1 10 0 1 1
## # i abbreviated name: 1: ‘Duration_of_Credit_(month)‘
## # i 6 more variables: Length_of_current_employment <dbl>,
## # ‘Sex_Marital Status‘ <dbl>, ‘Age_(years)‘ <dbl>, Occupation <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>
2.VARIABLES EXPLANATION
We will remove the following variables before the analysis:
Purpose: a qualitative variable with eleven levels (0-10) representing possible reasons for taking out a loan.
Guarantors: a qualitative variable with 3 categories: 1 - none, 2 - co-applicant, 3 - guarantor.
Sex_Marital Status: personal status and gender, a qualitative variable with 4 categories: 1 - male (divorced/separated), 2 - female (divorced/separated/married), 3 - male (single), 4 - male (married/widowed).
Occupation: a qualitative variable with 4 categories: 1 - unemployed/unskilled - non-resident, 2 - unskilled - resident, 3 - skilled employee/official, 4 - management/self-employed/highly qualified employee/officer.
The analysis keeps the following variables:
Creditability: a binary variable: 1 - applicant has creditability, 0 - applicant does not have creditability.
Duration_of_Credit_(month): a numerical variable, the duration of the credit in months.
Instalment_per_cent: a qualitative variable with 4 categories.
Length_of_current_employment: a qualitative variable with 5 categories: 1 - unemployed, 2 - < 1 year, 3 - 1 <= ... < 4 years, 4 - 4 <= ... < 7 years, 5 - >= 7 years.
Age_(years): the age of the applicant in years.
No_of_dependents: a qualitative variable with 2 categories.
Credit_Amount: the amount of the credit.
3.DESCRIPTION OF THE DATA
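After removing the four variables listed above, we obtain the new dataset below.
# Drop columns 3, 5, 7 and 9 (Purpose, Guarantors, Sex_Marital Status, Occupation),
# then remove any rows with missing values
german_credit1 <- na.omit(subset(german_credit, select = -c(3, 5, 7, 9)))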
head(german_credit1)
## # A tibble: 6 x 7
## Creditability ‘Duration_of_Credit_(month)‘ Instalment_per_cent
## <dbl> <dbl> <dbl>
## 1 1 18 4
## 2 1 9 2
## 3 1 12 2
## 4 1 12 3
## 5 1 12 4
## 6 1 10 1
## # i 4 more variables: Length_of_current_employment <dbl>, ‘Age_(years)‘ <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>
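Dropping columns by position works here, but it silently changes meaning if the column order of the file changes. A minimal sketch of the same step done by name, on a small made-up data frame (toy, drop_cols and all values are assumptions for illustration, not the real data):

```r
# Toy stand-in for german_credit; the values are made up for illustration.
toy <- data.frame(
  Creditability = c(1, 1, 0, 1),
  Purpose       = c(2, 0, 9, NA),
  Guarantors    = c(1, 1, 2, 1),
  Credit_Amount = c(1049, 2799, 841, 2122)
)

# Name the columns to drop instead of relying on their positions.
drop_cols <- c("Purpose", "Guarantors")
toy1 <- na.omit(toy[, !(names(toy) %in% drop_cols)])

dim(toy1)    # rows and columns kept
names(toy1)  # remaining variable names
```

Because the columns are dropped before na.omit() is applied, a missing value in a removed column does not cost a row.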
The dim() function gives the dimensions of the data set german_credit1.
dim(german_credit1)
## [1] 1000 7
The names() function lists the variable names of the data set german_credit1.
names(german_credit1)
4.CHARACTERISTICS OF VARIABLES
In this part of the study, each variable in the dataset german_credit1 is examined through its descriptive statistics, so that the characteristics of all the variables considered are known.
For each variable, the authors use the functions summary(), table(), hist() and boxplot(), as appropriate, to analyse and visualize the data.
attach(german_credit1)
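attach() makes the columns available by name, but it can mask other objects with the same names. A possible alternative, sketched on made-up data (the data frame toy and its values are assumptions), is to pass the data frame explicitly with with() or the data argument of lm():

```r
# Toy data with the same non-syntactic column name as the report.
toy <- data.frame(
  `Age_(years)` = c(21, 36, 23, 39, 38, 48),
  Credit_Amount = c(1049, 2799, 841, 2122, 2171, 2241),
  check.names = FALSE
)

# Evaluate an expression inside the data frame without attaching it.
age_summary <- with(toy, summary(`Age_(years)`))

# Model functions accept the data frame through the data argument.
fit <- lm(Credit_Amount ~ `Age_(years)`, data = toy)

age_summary
coef(fit)
```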
Creditability
This is a binary variable, so measures of central tendency and dispersion are not meaningful. For that reason the authors do not use summary() for this variable and use the functions table() and hist() instead.
table(Creditability)
## Creditability
## 0 1
## 300 700
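For a binary variable, proportions and a bar chart are often easier to read than a histogram. A minimal sketch that rebuilds the variable from the two counts in the table above (the plot labels are our own choice):

```r
# Rebuild the variable from the reported counts: 300 zeros and 700 ones.
Creditability <- rep(c(0, 1), times = c(300, 700))

tab <- table(Creditability)
prop.table(tab)  # share of each category

# One bar per category, with no binning involved.
barplot(tab, xlab = "Creditability", ylab = "Frequency",
        main = "Bar chart of Creditability")
```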
[Figure: Histogram of Creditability. x-axis: Creditability; y-axis: Frequency.]
Duration_of_Credit_(month)
This is a continuous numeric variable, so the function summary() is used.
summary(`Duration_of_Credit_(month)`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
[Figure: Histogram of Duration_of_Credit_(month), with the count printed above each bar. x-axis: Duration_of_Credit_(month); y-axis: Frequency.]
Instalment_per_cent
Instalment rate as a percentage of disposable income. It is a qualitative variable with 4 categories, so summary() is not useful.
The frequency table of Instalment_per_cent is given by table().
table(Instalment_per_cent)
## Instalment_per_cent
## 1 2 3 4
## 136 231 157 476
hist(Instalment_per_cent, col="blue")
[Figure: Histogram of Instalment_per_cent. x-axis: Instalment_per_cent; y-axis: Frequency.]
Length_of_current_employment
table(Length_of_current_employment)
## Length_of_current_employment
## 1 2 3 4 5
## 62 172 339 174 253
[Figure: Histogram of Length_of_current_employment. x-axis: Length_of_current_employment (1 to 5); y-axis: Frequency.]
The mode of this variable is 3: the largest group of applicants (339) has been in their current employment for at least 1 year and less than 4 years.
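The mode can also be read from the frequency table programmatically. A small sketch using the counts reported by table() above (the short labels are our own wording for the five categories):

```r
# Counts copied from the table() output for Length_of_current_employment.
counts <- c("1" = 62, "2" = 172, "3" = 339, "4" = 174, "5" = 253)
labels <- c("unemployed", "< 1 year", "1 to < 4 years",
            "4 to < 7 years", ">= 7 years")

# The mode is the category with the highest count.
mode_level <- names(counts)[which.max(counts)]
mode_label <- labels[which.max(counts)]
share      <- counts[[mode_level]] / sum(counts)

mode_level
mode_label
share
```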
Age_(years)
This is a continuous numeric variable. We use summary() to obtain its descriptive statistics.
summary(`Age_(years)`)
The authors use the functions hist() and boxplot() to visualize this variable.
[Figure: Histogram of Age_(years). x-axis: Age_(years), 20 to 70; y-axis: Frequency.]
boxplot(`Age_(years)`)
[Figure: Boxplot of Age_(years). y-axis: 20 to 70.]
The histogram and the boxplot suggest that most applicants in the sample are between 20 and 40 years old.
No_of_dependents
This is a qualitative variable with 2 categories; we use table() to obtain the frequencies.
table(No_of_dependents)
## No_of_dependents
## 1 2
## 845 155
hist(No_of_dependents, col="blue")
[Figure: Histogram of No_of_dependents. x-axis: No_of_dependents; y-axis: Frequency.]
Credit_Amount
summary(Credit_Amount)
The authors use the functions hist() and boxplot() to visualize this variable.
[Figure: Histogram of Credit_Amount. x-axis: Credit_Amount; y-axis: Frequency.]
boxplot(Credit_Amount)
[Figure: Boxplot of Credit_Amount. y-axis: 0 to 15000.]
5.SIMPLE REGRESSION
In this part of the research, we consider the relationships between pairs of variables and fit simple regressions where it makes sense.
We use pairs() and cor() to get a general view of the relationships between the variables.
pairs(german_credit1)
[Figure: Scatter plot matrix produced by pairs(german_credit1). Panels: Creditability, Duration_of_Credit_(month), Instalment_per_cent, Length_of_current_employment, Age_(years), No_of_dependents, Credit_Amount.]
cor(german_credit1)
## Creditability Duration_of_Credit_(month)
## Creditability 1.000000000 -0.21492667
## Duration_of_Credit_(month) -0.214926665 1.00000000
## Instalment_per_cent -0.072403937 0.07474882
## Length_of_current_employment 0.116002036 0.05738103
## Age_(years) 0.091271949 -0.03754986
## No_of_dependents 0.003014853 -0.02383448
## Credit_Amount -0.154740146 0.62498846
## Instalment_per_cent Length_of_current_employment
## Creditability -0.07240394 0.116002036
## Duration_of_Credit_(month) 0.07474882 0.057381027
## Instalment_per_cent 1.00000000 0.126161307
## Length_of_current_employment 0.12616131 1.000000000
## Age_(years) 0.05727075 0.259116153
## No_of_dependents -0.07120694 0.097192004
## Credit_Amount -0.27132228 -0.008376109
## Age_(years) No_of_dependents Credit_Amount
## Creditability 0.09127195 0.003014853 -0.154740146
## Duration_of_Credit_(month) -0.03754986 -0.023834475 0.624988461
## Instalment_per_cent 0.05727075 -0.071206943 -0.271322281
## Length_of_current_employment 0.25911615 0.097192004 -0.008376109
## Age_(years) 1.00000000 0.118589183 0.032272677
## No_of_dependents 0.11858918 1.000000000 0.017143582
## Credit_Amount 0.03227268 0.017143582 1.000000000
The charts and correlations given by pairs() and cor() show some promising linear relationships, such as the one between Duration_of_Credit_(month) and Credit_Amount, whose correlation is approximately 0.625, or the one between Instalment_per_cent and Credit_Amount, with a correlation of -0.271.
The aim of the report is to find the relationships between Credit_Amount and the other variables. The authors therefore consider five simple linear models, between Credit_Amount and Duration_of_Credit_(month), Instalment_per_cent, Creditability, Age_(years) and No_of_dependents respectively. The relationship between Credit_Amount and Length_of_current_employment is not considered, because the scatter plot and the correlation (approximately -0.0084) suggest that the two variables are not related.
Credit_Amount = β0 + β1 × Duration_of_Credit_(month).
plot(`Duration_of_Credit_(month)`, Credit_Amount)
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Duration_of_Credit_(month) (x-axis, 10 to 70).]
cor(`Duration_of_Credit_(month)`, Credit_Amount)
## [1] 0.6249885
The chart and the correlation indicate a positive relationship between the two variables.
The authors use summary() and lm() to fit the simple linear regression model between them.
summary(lm(Credit_Amount~`Duration_of_Credit_(month)`))
##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5151.7 -1260.0 -432.9 653.2 13805.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.169 139.569 1.527 0.127
## ‘Duration_of_Credit_(month)‘ 146.299 5.784 25.292 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2205 on 998 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.39
## F-statistic: 639.7 on 1 and 998 DF, p-value: < 2.2e-16
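In a simple linear regression with an intercept, the Multiple R-squared equals the square of the correlation between the two variables. As a quick check, the sketch below squares the correlations of Credit_Amount reported by cor() in this report (the short vector names are our own); the first value should match the 0.3906 above, and the others the R-squared values of the later simple models.

```r
# Correlations with Credit_Amount, copied from the cor() output.
r <- c(Duration      = 0.6249885,
       Instalment    = -0.2713223,
       Creditability = -0.1547401,
       Age           = 0.03227268,
       Dependents    = 0.01714358)

# Squared correlation = R-squared of the simple regression with intercept.
r_squared <- signif(r^2, 4)
r_squared
```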
The p-value for the intercept of the model is 0.127, which is above the 5% threshold, so we cannot reject the hypothesis that the intercept is zero. The authors therefore remove the intercept from the model and fit it again.
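The t value and p-value in the coefficient table can be recomputed from the estimate and its standard error. A minimal sketch for the intercept, using the figures printed in the summary above (the variable names est, se and df are our own):

```r
# Intercept estimate, standard error and residual degrees of freedom
# as printed by summary().
est <- 213.169
se  <- 139.569
df  <- 998

t_val <- est / se                 # t statistic
p_val <- 2 * pt(-abs(t_val), df)  # two-sided p-value

round(c(t = t_val, p = p_val), 3)
```

The p-value is the probability of a t statistic at least this far from zero if the true coefficient were zero.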
summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+0))
##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5489.5 -1156.5 -341.3 744.4 13972.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ‘Duration_of_Credit_(month)‘ 153.952 2.891 53.25 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2206 on 999 degrees of freedom
## Multiple R-squared: 0.7395, Adjusted R-squared: 0.7392
## F-statistic: 2835 on 1 and 999 DF, p-value: < 2.2e-16
In the new model, the p-value of the slope is very small (< 2e-16), so we reject the hypothesis that the coefficient is zero and conclude that there is a linear relationship between Credit_Amount and Duration_of_Credit_(month).
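The Multiple R-squared of this model is 0.7395, against 0.3906 for the model with an intercept. The two figures are not directly comparable, however: for a model without an intercept, R computes R-squared relative to zero rather than to the mean of Credit_Amount, which inflates the value. The residual standard error is essentially unchanged (2206 against 2205), so the quality of the fit is about the same.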
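R computes R-squared differently when a model has no intercept, which matters when the two R-squared values above are compared. A minimal sketch on simulated data (x, y and every number below are made up and are not the German credit data) recomputes both versions by hand:

```r
set.seed(1)
# Simulated data with a non-zero mean response (hypothetical values).
x <- runif(200, min = 4, max = 72)
y <- 200 + 150 * x + rnorm(200, sd = 2200)

fit_int   <- lm(y ~ x)      # with intercept
fit_noint <- lm(y ~ x + 0)  # intercept removed

r2_int   <- summary(fit_int)$r.squared
r2_noint <- summary(fit_noint)$r.squared

# Without an intercept, summary() measures variation around zero:
rss <- sum(residuals(fit_noint)^2)
r2_uncentered <- 1 - rss / sum(y^2)
# Measuring the same residuals around the mean of y gives a value
# that can be compared with r2_int:
r2_centered <- 1 - rss / sum((y - mean(y))^2)

round(c(with_intercept = r2_int, no_intercept = r2_noint,
        no_intercept_centered = r2_centered), 4)
```

On the centered scale the model without an intercept can never beat the model with one, since least squares with an intercept minimizes the residual sum of squares.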
The model given is:
Credit_Amount = 153.952 × Duration_of_Credit_(month)
To visualize the model, the functions plot() and abline() are used.
plot(`Duration_of_Credit_(month)`, Credit_Amount)
abline(0,153.952, col="blue")
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Duration_of_Credit_(month) (x-axis, 10 to 70), with the fitted line in blue.]
Credit_Amount = β0 + β1 × Instalment_per_cent.
plot(Instalment_per_cent, Credit_Amount)
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Instalment_per_cent (x-axis).]
cor(Credit_Amount, Instalment_per_cent)
## [1] -0.2713223
There seems to be a negative linear relationship between the two variables. The functions summary() and lm() are used to fit the model.
summary(lm(Credit_Amount~Instalment_per_cent))
##
## Call:
## lm(formula = Credit_Amount ~ Instalment_per_cent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4021.0 -1659.6 -854.5 788.9 13802.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5306.57 244.18 21.732 <2e-16 ***
## Instalment_per_cent -684.60 76.87 -8.905 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2718 on 998 degrees of freedom
## Multiple R-squared: 0.07362, Adjusted R-squared: 0.07269
## F-statistic: 79.31 on 1 and 998 DF, p-value: < 2.2e-16
The p-values of both coefficients are very small (< 2e-16), so the authors reject the hypothesis that they are zero and conclude that there is a linear relationship between Credit_Amount and Instalment_per_cent.
The model given is:
Credit_Amount = 5306.57 − 684.60 × Instalment_per_cent
However, the Multiple R-squared is only 0.07362: the model explains just 7.362% of the variation in Credit_Amount, so Instalment_per_cent alone accounts for little of the dispersion of the responses.
The model is visualized as follows.
plot(Instalment_per_cent, Credit_Amount)
abline(5306.57,-684.60, col="blue")
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Instalment_per_cent (x-axis), with the fitted line in blue.]
Credit_Amount = β0 + β1 × Creditability.
The authors consider the scatter plot and the correlation between Credit_Amount and Creditability.
plot(Creditability, Credit_Amount)
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Creditability (x-axis).]
cor(Creditability, Credit_Amount)
## [1] -0.1547401
summary(lm(Credit_Amount~Creditability))
##
## Call:
## lm(formula = Credit_Amount ~ Creditability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3505.1 -1765.6 -858.4 771.8 14485.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3938.1 161.1 24.447 < 2e-16 ***
## Creditability -952.7 192.5 -4.948 8.8e-07 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2790 on 998 degrees of freedom
## Multiple R-squared: 0.02394, Adjusted R-squared: 0.02297
## F-statistic: 24.48 on 1 and 998 DF, p-value: 8.795e-07
The p-values of both coefficients are very small, so the authors reject the hypothesis that they are zero and conclude that there is a linear relationship between Credit_Amount and Creditability.
The model proposed is:
Credit_Amount = 3938.1 − 952.7 × Creditability
The Multiple R-squared is only 0.02394: the model explains just 2.394% of the variation in Credit_Amount, so Creditability alone accounts for very little of the dispersion of the responses.
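With a binary predictor coded 0/1, the intercept of a simple regression is the mean response in group 0 and the slope is the difference between the two group means. A minimal sketch on simulated data (g, y and all numbers are made up; only the 30/70 split mirrors the report):

```r
set.seed(42)
# Simulated binary predictor and response (hypothetical values).
g <- rep(c(0, 1), times = c(30, 70))
y <- 3900 - 950 * g + rnorm(100, sd = 500)

fit   <- lm(y ~ g)
means <- tapply(y, g, mean)  # mean of y within each group

coef(fit)  # intercept and slope
means      # group means for g = 0 and g = 1
```

Read this way, the model for Creditability says that the average Credit_Amount is about 3938.1 when Creditability is 0 and about 952.7 lower when it is 1.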
The model is visualized as follows:
plot(Creditability, Credit_Amount)
abline(3938.1,-952.7, col="blue")
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Creditability (x-axis), with the fitted line in blue.]
Credit_Amount = β0 + β1 × Age_(years).
plot(`Age_(years)`, Credit_Amount)
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against Age_(years) (x-axis, 20 to 70).]
cor(Credit_Amount, `Age_(years)`)
## [1] 0.03227268
The correlation is positive but very close to zero (0.032), so any linear relationship between the two variables is weak. The functions summary() and lm() are used to fit the regression model.
summary(lm(Credit_Amount~`Age_(years)`))
##
## Call:
## lm(formula = Credit_Amount ~ ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3066.5 -1896.1 -956.4 717.8 15181.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2986.047 293.495 10.17 <2e-16 ***
## ‘Age_(years)‘ 8.024 7.867 1.02 0.308
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2823 on 998 degrees of freedom
## Multiple R-squared: 0.001042, Adjusted R-squared: 4.057e-05
## F-statistic: 1.041 on 1 and 998 DF, p-value: 0.3079
The slope of this model has a p-value of 0.308, well above the 5% threshold, so we cannot reject the hypothesis that the slope is zero.
The authors conclude that the data show no significant simple linear relationship between Credit_Amount and Age_(years).
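The same conclusion can be read from a confidence interval: if the 95% interval for the slope contains zero, the slope is not significant at the 5% level. A short sketch using the estimate and standard error printed in the summary above (the names est, se, df and ci are our own):

```r
# Slope estimate for Age_(years), its standard error and the residual
# degrees of freedom, as printed by summary().
est <- 8.024
se  <- 7.867
df  <- 998

# 95% confidence interval: estimate +/- t quantile * standard error.
ci <- est + c(-1, 1) * qt(0.975, df) * se
round(ci, 2)

# TRUE when the interval contains zero.
ci[1] < 0 && ci[2] > 0
```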
Credit_Amount = β0 + β1 × No_of_dependents.
The authors consider the scatter plot and the correlation of these variables using the plot() and cor() functions.
plot(No_of_dependents,Credit_Amount)
[Figure: Scatter plot of Credit_Amount (y-axis, 0 to 15000) against No_of_dependents (x-axis).]
cor(Credit_Amount, No_of_dependents)
## [1] 0.01714358
The correlation is close to zero (0.017), so any positive linear relationship between Credit_Amount and No_of_dependents is at most very slight. The authors fit the regression model with the summary() and lm() functions.
summary(lm(Credit_Amount~No_of_dependents))
##
## Call:
## lm(formula = Credit_Amount ~ No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3000.5 -1896.3 -938.9 717.0 15173.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3116.9 298.6 10.437 <2e-16 ***
## No_of_dependents 133.6 246.7 0.542 0.588
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2824 on 998 degrees of freedom
## Multiple R-squared: 0.0002939, Adjusted R-squared: -0.0007078
## F-statistic: 0.2934 on 1 and 998 DF, p-value: 0.5882
The slope of the model has a p-value of 0.588, well above the 5% threshold, so we cannot reject the hypothesis that the slope is zero.
The conclusion is that the data show no significant simple linear relationship between Credit_Amount and No_of_dependents.
6.MULTIPLE REGRESSION
In the sixth part of the research, we consider the multiple regression of Credit_Amount on all the other variables in the dataset german_credit1, using summary() and lm() to fit the model.
summary(lm(Credit_Amount~Creditability+`Duration_of_Credit_(month)`+Instalment_per_cent+
  Length_of_current_employment+`Age_(years)`+No_of_dependents))
##
## Call:
## lm(formula = Credit_Amount ~ Creditability + ‘Duration_of_Credit_(month)‘ +
## Instalment_per_cent + Length_of_current_employment + ‘Age_(years)‘ +
## No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6145.5 -1167.3 -222.7 592.6 11866.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2157.829 370.128 5.830 7.49e-09 ***
## Creditability -277.692 143.304 -1.938 0.052933 .
## ‘Duration_of_Credit_(month)‘ 150.748 5.408 27.874 < 2e-16 ***
## Instalment_per_cent -819.499 57.596 -14.228 < 2e-16 ***
## Length_of_current_employment -49.431 55.314 -0.894 0.371727
## ‘Age_(years)‘ 21.003 5.823 3.607 0.000325 ***
## No_of_dependents 12.013 177.285 0.068 0.945988
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2001 on 993 degrees of freedom
## Multiple R-squared: 0.5005, Adjusted R-squared: 0.4975
## F-statistic: 165.8 on 6 and 993 DF, p-value: < 2.2e-16
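Creditability, Length_of_current_employment and No_of_dependents have p-values above 5% (0.053, 0.372 and 0.946), so the authors fit the model again without these three variables.
summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+Instalment_per_cent+`Age_(years)`))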
##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + Instalment_per_cent +
## ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6055.7 -1142.4 -252.4 586.8 12178.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1848.255 281.547 6.565 8.38e-11 ***
## ‘Duration_of_Credit_(month)‘ 152.636 5.275 28.937 < 2e-16 ***
## Instalment_per_cent -818.473 56.911 -14.382 < 2e-16 ***
## ‘Age_(years)‘ 18.731 5.596 3.347 0.000847 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2003 on 996 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4965
## F-statistic: 329.3 on 3 and 996 DF, p-value: < 2.2e-16
All the coefficients of this model have very small p-values, so the authors reject the hypothesis that they are zero and conclude that Credit_Amount is related to the variables in this model.
The new model given is:
Credit_Amount = 1848.255 + 152.636 × Duration_of_Credit_(month) − 818.473 × Instalment_per_cent + 18.731 × Age_(years)
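As a worked example, the fitted coefficients can be used to predict the credit amount for a given applicant. The coefficients below come from the summary above; the applicant (24 months, instalment category 2, age 35) is hypothetical and chosen only for illustration.

```r
# Coefficients of the reduced model, copied from summary().
b <- c(intercept = 1848.255, duration = 152.636,
       instalment = -818.473, age = 18.731)

# Hypothetical applicant (values chosen for illustration only).
applicant <- c(duration = 24, instalment = 2, age = 35)

# Prediction = intercept + sum of coefficient * value.
pred <- b[["intercept"]] + sum(b[names(applicant)] * applicant)
pred
```

The result is 1848.255 + 3663.264 − 1636.946 + 655.585 = 4530.158.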
The multiple R-squared of this model is 0.498, which is reasonably good: 49.8% of the variation in Credit_Amount is explained by the variables in the new model.
Compared with the previous model, this one is slightly worse in terms of Multiple R-squared (0.498 < 0.5005) and has a slightly higher residual standard error (2003 > 2001). However, the gaps are too small to matter.
The adjusted R-squared also decreases slightly after removing the three variables (from 0.4975 to 0.4965), which could suggest that the removed variables carried some information. Again, the decrease is too small to matter.
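A formal way to compare two nested models is an F-test with anova(), which asks whether the extra variables reduce the residual sum of squares by more than chance would. The report does not run this test; the sketch below only illustrates the idea on simulated data (d, x1, x2, noise and all coefficients are made up):

```r
set.seed(7)
n <- 300
# Simulated data: y depends on x1 and x2 but not on noise.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = d)
reduced <- lm(y ~ x1 + x2, data = d)

# F-test of the reduced model against the full model.
cmp <- anova(reduced, full)
cmp

# Adjusted R-squared of both models, for comparison.
c(full    = summary(full)$adj.r.squared,
  reduced = summary(reduced)$adj.r.squared)
```

A large p-value in the second row of the anova() table would support keeping the smaller model.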
7.CONCLUSION
After analysing the data, the authors reach the following conclusions about the relationship between Credit_Amount and the other variables.
In the simple regressions, Duration_of_Credit_(month), Instalment_per_cent and Creditability each have a significant linear relationship with Credit_Amount. Instalment_per_cent has a clearly significant negative effect but explains only 7.362% of the variation, whereas Duration_of_Credit_(month) explains the largest share (Multiple R-squared of 0.3906 with an intercept). No significant simple linear relationship is found between Credit_Amount and either Age_(years) or No_of_dependents.
The multiple regression model relates Credit_Amount to the following variables: Duration_of_Credit_(month), Instalment_per_cent and Age_(years). The fit is reasonably good, with a multiple R-squared of 0.498. In this model Instalment_per_cent has the largest coefficient in absolute value (-818.473), although the coefficients are expressed in different units and Duration_of_Credit_(month) has the largest t value (28.937).
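The size of a raw coefficient depends on the units of its variable, so it is not a direct measure of importance. One common way to compare predictors is to standardize all variables first. The sketch below illustrates this on simulated data (duration, instalment, amount, the helper z() and all numbers are made up and only loosely mimic the report's variables):

```r
set.seed(3)
n <- 500
# Simulated predictors on very different scales (hypothetical values).
duration   <- runif(n, min = 4, max = 72)
instalment <- sample(1:4, n, replace = TRUE)
amount     <- 1800 + 150 * duration - 800 * instalment + rnorm(n, sd = 2000)

# Helper (our own): centre a variable and divide by its standard deviation.
z <- function(v) (v - mean(v)) / sd(v)

raw <- coef(lm(amount ~ duration + instalment))
std <- coef(lm(z(amount) ~ z(duration) + z(instalment)))

round(raw, 2)  # coefficients in original units
round(std, 3)  # coefficients in standard-deviation units
```

In this simulation the instalment variable has the larger raw coefficient while the duration variable has the larger standardized one, which is why raw magnitudes alone should be interpreted with care.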