Group_2_Project-German_Credit_Data_Analysis

Nguyen_Tra_My

2024-01-04

GROUP 2 APPLIED STATISTICS PROJECT

GERMAN CREDIT DATA ANALYSIS
1. INTRODUCTION
Credit is a contractual agreement in which a borrower receives a sum of money or something of value and repays the lender at a later date, generally with interest.
The dataset was collected in Germany and contains information about loan applicants, their financial history, and loan outcomes. It is used for research, machine learning, and education on the German credit system. Popular versions include Statlog and German Credit Risk.
The goal of the study is to find the relationship between the amount of the credit and other features of the loan applicants, such as the duration of the credit, the instalment rate as a percentage of disposable income, age in years, number of dependents and creditability.

library(readxl)  # provides read_excel()
# Adjust the path to wherever German_credit.xlsx is stored.
german_credit <- read_excel("C:/Users/LUONG/Downloads/German_credit.xlsx")
head(german_credit)

## # A tibble: 6 x 11
## Creditability Duration_of_Credit_(mon~1 Purpose Instalment_per_cent Guarantors
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 18 2 4 1
## 2 1 9 0 2 1
## 3 1 12 9 2 1
## 4 1 12 0 3 1
## 5 1 12 0 4 1
## 6 1 10 0 1 1
## # i abbreviated name: 1: ‘Duration_of_Credit_(month)‘
## # i 6 more variables: Length_of_current_employment <dbl>,
## # ‘Sex_Marital Status‘ <dbl>, ‘Age_(years)‘ <dbl>, Occupation <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>

2. VARIABLES EXPLANATION
We remove the following variables from the analysis:
Purpose: a qualitative variable with eleven levels (0-10) representing possible reasons for taking out a loan.
Guarantors: a qualitative variable with 3 categories: 1 - none, 2 - co-applicant, 3 - guarantor.
Sex_Marital Status: personal status and gender, a qualitative variable with 4 categories: 1 - male (divorced/separated), 2 - female (divorced/separated/married), 3 - male (single), 4 - male (married/widowed).
Occupation: a qualitative variable with 4 categories: 1 - unemployed/unskilled - non-resident, 2 - unskilled - resident, 3 - skilled employee/official, 4 - management/self-employed/highly qualified employee/officer.
We keep the following variables for the analysis:
Creditability: a binary variable: 1 - applicant has creditability, 0 - applicant does not have creditability.
Duration_of_Credit_(month): a numerical variable, the duration of the credit in months.
Instalment_per_cent: the instalment rate as a percentage of disposable income, a qualitative variable with 4 categories.
Length_of_current_employment: a qualitative variable with 5 categories: 1 - unemployed, 2 - < 1 year, 3 - 1 <= ... < 4 years, 4 - 4 <= ... < 7 years, 5 - >= 7 years.
Age_(years): the age of the applicant in years.
No_of_dependents: the number of dependents, a qualitative variable with 2 categories (coded 1 and 2).
Credit_Amount: the amount of the credit.
3. DESCRIPTION OF THE DATA
After removing the variables listed above, we obtain the following dataset.

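# Drop Purpose (3), Guarantors (5), Sex_Marital Status (7) and Occupation (9), then remove rows with missing values.
german_credit1 <- na.omit(subset(german_credit, select = -c(3, 5, 7, 9)))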
head(german_credit1)

## # A tibble: 6 x 7
## Creditability ‘Duration_of_Credit_(month)‘ Instalment_per_cent
## <dbl> <dbl> <dbl>
## 1 1 18 4
## 2 1 9 2
## 3 1 12 2
## 4 1 12 3
## 5 1 12 4
## 6 1 10 1
## # i 4 more variables: Length_of_current_employment <dbl>, ‘Age_(years)‘ <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>
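
Dropping columns by position works, but it silently breaks if the column order of the spreadsheet changes. As a minimal sketch, assuming a small made-up data frame named toy in place of the real file (the values and the name toy are illustrative only), the same selection can be written with column names:

```r
# Toy data frame standing in for the real spreadsheet (values are made up).
toy <- data.frame(
  Creditability = c(1, 1, 0),
  Purpose       = c(2, 0, 9),
  Guarantors    = c(1, 1, 2),
  Credit_Amount = c(1049, 2799, 841),
  check.names   = FALSE
)

# Drop columns by name rather than by position.
drop_cols <- c("Purpose", "Guarantors")
toy_small <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

names(toy_small)
```

The same pattern would apply to german_credit with the four removed variable names in drop_cols.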

To check the dimensions of the dataset german_credit1, we use the dim() function.

dim(german_credit1)

## [1] 1000 7

We use the function names() to list the variable names of the dataset german_credit1.

names(german_credit1)

## [1] "Creditability" "Duration_of_Credit_(month)"
## [3] "Instalment_per_cent" "Length_of_current_employment"
## [5] "Age_(years)" "No_of_dependents"
## [7] "Credit_Amount"

4. CHARACTERISTICS OF VARIABLES
In this part of the study, we compute descriptive statistics for every variable in the dataset german_credit1 in order to understand its characteristics.
For each variable, the authors use the functions summary(), table(), hist() and boxplot(), as appropriate, to analyse and visualize the data.

attach(german_credit1)

Creditability
This is a binary variable, so measures of central tendency and dispersion are not very informative. For that reason, the authors do not use summary() for this variable and use the functions table() and hist() instead.

table(Creditability)

## Creditability
## 0 1
## 300 700

hist(Creditability, col = "blue")

[Figure: Histogram of Creditability (x-axis: Creditability, y-axis: Frequency)]
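
For a categorical variable, a bar chart with one bar per category is often easier to read than a histogram over a numeric axis. The following is only a sketch of that alternative, not the authors' method; it reuses the counts printed by table(Creditability) above, and the object names are illustrative.

```r
# Counts taken from the table(Creditability) output above.
creditability_counts <- c("0" = 300, "1" = 700)

# Share of each class among the 1000 applicants.
creditability_share <- creditability_counts / sum(creditability_counts)

# One bar per category instead of a histogram over a numeric axis.
barplot(creditability_counts,
        col  = "blue",
        xlab = "Creditability",
        ylab = "Frequency")

creditability_share
```

With the attached data, barplot(table(Creditability)) would draw the same chart directly.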

Duration_of_Credit_(month)
This is a numeric variable measured in months, so the function summary() is appropriate.

summary(`Duration_of_Credit_(month)`)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0

We use hist() to visualize the data.

hist(`Duration_of_Credit_(month)`, labels = TRUE, col = "blue")

[Figure: Histogram of Duration_of_Credit_(month), with the count shown above each bar (x-axis: Duration_of_Credit_(month), y-axis: Frequency)]

Instalment_per_cent
Instalment rate as a percentage of disposable income. It is a qualitative variable with 4 categories, so summary() is not useful here.
The frequency table of Instalment_per_cent is given by table().

table(Instalment_per_cent)

## Instalment_per_cent
## 1 2 3 4
## 136 231 157 476

We visualize the data with the function hist().

hist(Instalment_per_cent, col="blue")

[Figure: Histogram of Instalment_per_cent (x-axis: Instalment_per_cent, y-axis: Frequency)]

We can see the mode of Instalment_per_cent is 4.
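
The mode can also be read off programmatically rather than from the plot. A small sketch, using the counts from the table(Instalment_per_cent) output above (the object names are illustrative):

```r
# Counts copied from the table(Instalment_per_cent) output above.
instalment_counts <- c("1" = 136, "2" = 231, "3" = 157, "4" = 476)

# Mode: the category with the highest frequency.
instalment_mode <- names(instalment_counts)[which.max(instalment_counts)]

# Relative frequencies (shares of the 1000 applicants).
instalment_share <- instalment_counts / sum(instalment_counts)

instalment_mode
round(instalment_share, 3)
```

Category 4 accounts for 476 of the 1000 applicants, which is why it is the mode.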

Length_of_current_employment
This variable is qualitative and has 5 categories. Measures of central tendency and dispersion are not meaningful for it, so we compute the frequency table and the mode instead.

table(Length_of_current_employment)

## Length_of_current_employment
## 1 2 3 4 5
## 62 172 339 174 253

hist(Length_of_current_employment, col = "blue")

[Figure: Histogram of Length_of_current_employment (x-axis: Length_of_current_employment, y-axis: Frequency)]

The mode of this variable is 3. Thus, the largest group of applicants (339 out of 1000) has been in their current employment for between 1 and 4 years.
Age_(years)
This is a continuous numeric variable. We use summary() to obtain its descriptive statistics.

summary(`Age_(years)`)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 19.00 27.00 33.00 35.54 42.00 75.00

The authors use the functions hist() and boxplot() to visualize this variable.

hist(`Age_(years)`, col = "blue")

[Figure: Histogram of Age_(years) (x-axis: Age_(years), y-axis: Frequency)]

boxplot(`Age_(years)`)

[Figure: Boxplot of Age_(years)]

The histogram and the boxplot suggest that applicants are concentrated between roughly 20 and 40 years old (the median age is 33), so people in this age range appear more likely to apply for loans than others.
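
To read the boxplot, it helps to know where the whiskers can end. The sketch below applies the usual 1.5 × IQR rule to the quartiles reported by summary() above. It is an approximation: boxplot() works from hinges, which can differ slightly from the quartiles printed by summary().

```r
# Quartiles taken from the summary() output for Age_(years) above.
q1 <- 27
q3 <- 42

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # points below this would be drawn as outliers
upper_fence <- q3 + 1.5 * iqr  # points above this would be drawn as outliers

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

Since the minimum age is 19 and the maximum is 75, only the upper end of the distribution can contain points beyond the fences.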
No_of_dependents
This is a qualitative variable with 2 categories, so we use table() to describe the frequencies.

table(No_of_dependents)

## No_of_dependents
## 1 2
## 845 155

We use hist() to visualize the data.

hist(No_of_dependents, col="blue")

[Figure: Histogram of No_of_dependents (x-axis: No_of_dependents, y-axis: Frequency)]

The mode of No_of_dependents is 1.

Credit_Amount
This variable is continuous. Therefore, we use the function summary() to obtain its descriptive statistics.

summary(Credit_Amount)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 250 1366 2320 3271 3972 18424

The authors use the functions hist() and boxplot() to visualize this variable.

hist(Credit_Amount, col = "blue")

[Figure: Histogram of Credit_Amount (x-axis: Credit_Amount, y-axis: Frequency)]

boxplot(Credit_Amount)

[Figure: Boxplot of Credit_Amount]

5. SIMPLE REGRESSION
In this part of the study, we consider the pairwise relationships between the variables and fit simple regressions where they make sense.
We use pairs() and cor() to get a general view of the relationships between the variables.

pairs(german_credit1)

[Figure: Scatter plot matrix of german_credit1 (panels: Creditability, Duration_of_Credit_(month), Instalment_per_cent, Length_of_current_employment, Age_(years), No_of_dependents, Credit_Amount)]

cor(german_credit1)

## Creditability Duration_of_Credit_(month)
## Creditability 1.000000000 -0.21492667
## Duration_of_Credit_(month) -0.214926665 1.00000000
## Instalment_per_cent -0.072403937 0.07474882
## Length_of_current_employment 0.116002036 0.05738103
## Age_(years) 0.091271949 -0.03754986
## No_of_dependents 0.003014853 -0.02383448
## Credit_Amount -0.154740146 0.62498846
## Instalment_per_cent Length_of_current_employment
## Creditability -0.07240394 0.116002036
## Duration_of_Credit_(month) 0.07474882 0.057381027
## Instalment_per_cent 1.00000000 0.126161307
## Length_of_current_employment 0.12616131 1.000000000
## Age_(years) 0.05727075 0.259116153
## No_of_dependents -0.07120694 0.097192004
## Credit_Amount -0.27132228 -0.008376109
## Age_(years) No_of_dependents Credit_Amount
## Creditability 0.09127195 0.003014853 -0.154740146
## Duration_of_Credit_(month) -0.03754986 -0.023834475 0.624988461
## Instalment_per_cent 0.05727075 -0.071206943 -0.271322281
## Length_of_current_employment 0.25911615 0.097192004 -0.008376109
## Age_(years) 1.00000000 0.118589183 0.032272677
## No_of_dependents 0.11858918 1.000000000 0.017143582
## Credit_Amount 0.03227268 0.017143582 1.000000000

From the plots and correlations given by pairs() and cor(), some simple linear relationships look promising, such as the one between Duration_of_Credit_(month) and Credit_Amount, with a correlation of approximately 0.625, or the one between Instalment_per_cent and Credit_Amount, with a correlation of -0.271.
The aim of the report is to find the relationships between Credit_Amount and the other variables. The authors therefore consider five simple linear models, regressing Credit_Amount on Duration_of_Credit_(month), Instalment_per_cent, Creditability, Age_(years) and No_of_dependents in turn. The relationship between Credit_Amount and Length_of_current_employment is not considered because the scatter plot and the correlation (approximately -0.0084) suggest that the two variables are not linearly related.
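
To compare the candidate predictors at a glance, the correlations with Credit_Amount can be ranked by absolute value. This sketch copies the last column of the cor() output above into a named vector; the object names are illustrative.

```r
# Correlations with Credit_Amount, copied from the cor() output above.
cor_with_amount <- c(
  "Creditability"                = -0.154740146,
  "Duration_of_Credit_(month)"   =  0.624988461,
  "Instalment_per_cent"          = -0.271322281,
  "Length_of_current_employment" = -0.008376109,
  "Age_(years)"                  =  0.032272677,
  "No_of_dependents"             =  0.017143582
)

# Order by absolute value, strongest first.
ranked <- cor_with_amount[order(abs(cor_with_amount), decreasing = TRUE)]
round(ranked, 3)
```

With the full dataset, the same vector could be obtained as cor(german_credit1)[, "Credit_Amount"], dropping the entry for Credit_Amount itself.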

• Credit_Amount and Duration_of_Credit_(month)

Credit_Amount = β0 + β1 × Duration_of_Credit_(month).

The scatter plot of these two variables is drawn with plot().

plot(`Duration_of_Credit_(month)`, Credit_Amount)
[Figure: Scatter plot of Credit_Amount against Duration_of_Credit_(month)]

The correlation is computed with cor().

cor(`Duration_of_Credit_(month)`, Credit_Amount)

## [1] 0.6249885

The plot and the correlation suggest a positive relationship between the two variables.
The authors use lm() and summary() to fit and report the simple linear regression model for these two variables.

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5151.7 -1260.0 -432.9 653.2 13805.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.169 139.569 1.527 0.127
## ‘Duration_of_Credit_(month)‘ 146.299 5.784 25.292 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2205 on 998 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.39
## F-statistic: 639.7 on 1 and 998 DF, p-value: < 2.2e-16
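
In a simple linear regression with an intercept, the Multiple R-squared equals the square of the correlation between the two variables. A quick check with the correlation reported by cor() above:

```r
# Correlation between Duration_of_Credit_(month) and Credit_Amount, from cor() above.
r <- 0.6249885

# For a one-predictor model with an intercept, R-squared is the squared correlation.
r_squared <- r^2
round(r_squared, 4)
```

The result agrees with the Multiple R-squared of 0.3906 printed in the summary() output.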

The p-value of the intercept is 0.127, which is greater than 0.05, so at the 5% level we fail to reject the null hypothesis that the intercept is zero. The authors therefore remove the intercept from the model and refit it.

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+0))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5489.5 -1156.5 -341.3 744.4 13972.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ‘Duration_of_Credit_(month)‘ 153.952 2.891 53.25 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2206 on 999 degrees of freedom
## Multiple R-squared: 0.7395, Adjusted R-squared: 0.7392
## F-statistic: 2835 on 1 and 999 DF, p-value: < 2.2e-16

In the new result, the p-value of the slope is very small (< 2e-16), so we reject the null hypothesis that the coefficient is zero and conclude that there is a linear relationship between Credit_Amount and Duration_of_Credit_(month).
The Multiple R-squared is now 0.7395, but it is not directly comparable with the 0.3906 of the previous model: for a model without an intercept, R computes R-squared from the uncentered sum of squares of Credit_Amount rather than from the variation around its mean. The residual standard error is essentially unchanged (2206 versus 2205), so the quality of the fit is about the same as before.
The model given is:

Credit_Amount = 153.952 × Duration_of_Credit_(month).
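
One caveat applies when comparing the R-squared of a model with an intercept to that of a model through the origin: summary() uses a different total sum of squares in the two cases. The sketch below illustrates this on simulated data (not the German credit data; the coefficients and noise level are arbitrary assumptions) by recomputing both values by hand.

```r
set.seed(1)  # simulated data, not the German credit data

x <- runif(200, min = 4, max = 72)
y <- 200 + 150 * x + rnorm(200, sd = 2000)

fit_with    <- lm(y ~ x)      # model with intercept
fit_without <- lm(y ~ x + 0)  # model through the origin

rss_with    <- sum(residuals(fit_with)^2)
rss_without <- sum(residuals(fit_without)^2)

# With an intercept: variation around the mean of y.
r2_with <- 1 - rss_with / sum((y - mean(y))^2)
# Without an intercept: uncentered sum of squares of y.
r2_without <- 1 - rss_without / sum(y^2)

c(reported_with    = summary(fit_with)$r.squared,
  manual_with      = r2_with,
  reported_without = summary(fit_without)$r.squared,
  manual_without   = r2_without)
```

The residual sum of squares of the model through the origin can never be smaller than that of the model with an intercept, so a higher R-squared in the second model reflects the change of denominator, not a better fit.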

To visualize the model, the functions plot() and abline() are considered.

plot(`Duration_of_Credit_(month)`, Credit_Amount)
abline(0,153.952, col="blue")
[Figure: Scatter plot of Credit_Amount against Duration_of_Credit_(month) with the fitted line through the origin]

• Credit_Amount and Instalment_per_cent

The scatter plot and the correlation of these two variables are computed.

plot(Instalment_per_cent, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Instalment_per_cent]

cor(Credit_Amount, Instalment_per_cent)

## [1] -0.2713223

There seems to be a negative linear relationship between the two variables. The functions lm() and summary() are used to fit the model.

summary(lm(Credit_Amount~Instalment_per_cent))

##
## Call:
## lm(formula = Credit_Amount ~ Instalment_per_cent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4021.0 -1659.6 -854.5 788.9 13802.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5306.57 244.18 21.732 <2e-16 ***
## Instalment_per_cent -684.60 76.87 -8.905 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##

## Residual standard error: 2718 on 998 degrees of freedom
## Multiple R-squared: 0.07362, Adjusted R-squared: 0.07269
## F-statistic: 79.31 on 1 and 998 DF, p-value: < 2.2e-16

Both p-values are very small (< 2e-16), so the authors reject the null hypothesis that the coefficients are zero and conclude that there is a linear relationship between Credit_Amount and Instalment_per_cent.
The model given is:

Credit_Amount = 5306.57 − 684.60 × Instalment_per_cent.

However, the Multiple R-squared is only 0.07362: the model explains just 7.362% of the variation in Credit_Amount, so Instalment_per_cent alone accounts for little of the dispersion of the responses.
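
To make the fitted line concrete, the sketch below evaluates it at the four coded values of Instalment_per_cent, using the coefficients from the summary() output (object names are illustrative):

```r
b0 <- 5306.57   # intercept from the summary() output
b1 <- -684.60   # slope for Instalment_per_cent

instalment_levels <- 1:4
fitted_amount <- b0 + b1 * instalment_levels

data.frame(Instalment_per_cent  = instalment_levels,
           fitted_Credit_Amount = fitted_amount)
```

Each step up in the instalment category lowers the fitted credit amount by 684.60, from 4621.97 at category 1 to 2568.17 at category 4.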
The model is visualized as follows.

plot(Instalment_per_cent, Credit_Amount)
abline(5306.57,-684.60, col="blue")
[Figure: Scatter plot of Credit_Amount against Instalment_per_cent with the fitted regression line]

• Credit_Amount and Creditability

The authors consider the scatter plot and the correlation between Credit_Amount and Creditability.

plot(Creditability, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Creditability]

cor(Creditability, Credit_Amount)

## [1] -0.1547401

To fit the regression model, we use lm() and summary().

summary(lm(Credit_Amount~Creditability))

##
## Call:
## lm(formula = Credit_Amount ~ Creditability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3505.1 -1765.6 -858.4 771.8 14485.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3938.1 161.1 24.447 < 2e-16 ***
## Creditability -952.7 192.5 -4.948 8.8e-07 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

##
## Residual standard error: 2790 on 998 degrees of freedom
## Multiple R-squared: 0.02394, Adjusted R-squared: 0.02297
## F-statistic: 24.48 on 1 and 998 DF, p-value: 8.795e-07

Both p-values are very small (the p-value of the slope is 8.8e-07), so the authors reject the null hypothesis that the coefficients are zero and conclude that Credit_Amount is related to Creditability.
The proposed model is:

Credit_Amount = 3938.1 − 952.7 × Creditability.
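
Because Creditability only takes the values 0 and 1, the two coefficients have a direct reading: the intercept is the fitted mean credit amount for Creditability = 0, and the slope is the difference between the two group means. The sketch below works this out from the reported coefficients and checks it against the group sizes from table(Creditability).

```r
b0 <- 3938.1   # intercept: fitted mean when Creditability = 0
b1 <- -952.7   # slope: difference between the two group means

mean_0 <- b0
mean_1 <- b0 + b1

# Group sizes from the table(Creditability) output.
n_0 <- 300
n_1 <- 700

# Weighted average of the two group means.
overall_mean <- (n_0 * mean_0 + n_1 * mean_1) / (n_0 + n_1)
c(mean_0 = mean_0, mean_1 = mean_1, overall_mean = overall_mean)
```

The weighted average, about 3271.21, is consistent with the mean of 3271 reported by summary(Credit_Amount).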

The Multiple R-squared is only 0.02394: the model explains just 2.394% of the variation in Credit_Amount, so Creditability alone accounts for very little of the dispersion of the responses.
The model is visualized as follows:

plot(Creditability, Credit_Amount)
abline(3938.1,-952.7, col="blue")
[Figure: Scatter plot of Credit_Amount against Creditability with the fitted regression line]

• Credit_Amount and Age_(years)

The scatter plot and the correlation between these two variables are considered.

plot(`Age_(years)`, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Age_(years)]

cor(Credit_Amount, `Age_(years)`)

## [1] 0.03227268

The correlation is positive but very close to zero, so any linear relationship between the two variables is weak. The functions lm() and summary() are used to fit the regression model.

summary(lm(Credit_Amount~`Age_(years)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3066.5 -1896.1 -956.4 717.8 15181.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2986.047 293.495 10.17 <2e-16 ***
## ‘Age_(years)‘ 8.024 7.867 1.02 0.308
## ---

## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2823 on 998 degrees of freedom
## Multiple R-squared: 0.001042, Adjusted R-squared: 4.057e-05
## F-statistic: 1.041 on 1 and 998 DF, p-value: 0.3079

The slope of this model has a p-value of 0.308, which is greater than 0.05, so at the 5% level we fail to reject the null hypothesis that the slope is zero.
The authors conclude that there is no evidence of a simple linear relationship between Credit_Amount and Age_(years).
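
The t value and p-value in the coefficient table can be reproduced from the estimate and its standard error. A short sketch using the rounded figures printed for the Age_(years) slope, so the results match only up to rounding:

```r
estimate  <- 8.024   # slope estimate for Age_(years)
std_error <- 7.867   # its standard error
df        <- 998     # residual degrees of freedom

t_value <- estimate / std_error
# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

c(t_value = t_value, p_value = p_value)
```

The p-value is the probability, when the true slope is zero, of observing a t value at least this far from zero. It is not the probability that rejecting the null hypothesis would be a mistake.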

• Credit_Amount and No_of_dependents

The authors consider the scatter plot and the correlation of these two variables, using the plot() and cor() functions.

plot(No_of_dependents,Credit_Amount)
[Figure: Scatter plot of Credit_Amount against No_of_dependents]

cor(Credit_Amount, No_of_dependents)

## [1] 0.01714358

The correlation is positive but very close to zero, which suggests at most a very weak linear relationship between Credit_Amount and No_of_dependents. The authors fit the regression model with the lm() and summary() functions.

summary(lm(Credit_Amount~No_of_dependents))

##
## Call:
## lm(formula = Credit_Amount ~ No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3000.5 -1896.3 -938.9 717.0 15173.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3116.9 298.6 10.437 <2e-16 ***
## No_of_dependents 133.6 246.7 0.542 0.588
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2824 on 998 degrees of freedom
## Multiple R-squared: 0.0002939, Adjusted R-squared: -0.0007078
## F-statistic: 0.2934 on 1 and 998 DF, p-value: 0.5882

The p-value of the slope is 0.588, which is greater than 0.05, so at the 5% level we fail to reject the null hypothesis that the slope is zero.
The conclusion is that there is no evidence of a simple linear relationship between Credit_Amount and No_of_dependents.
6. MULTIPLE REGRESSION
In this part of the study, we consider the multiple regression of Credit_Amount on all the other variables in the dataset german_credit1. We use lm() and summary() to fit and report the model.

summary(lm(Credit_Amount~Creditability+`Duration_of_Credit_(month)`+Instalment_per_cent+
           Length_of_current_employment+`Age_(years)`+No_of_dependents))

##
## Call:
## lm(formula = Credit_Amount ~ Creditability + ‘Duration_of_Credit_(month)‘ +
## Instalment_per_cent + Length_of_current_employment + ‘Age_(years)‘ +
## No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6145.5 -1167.3 -222.7 592.6 11866.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2157.829 370.128 5.830 7.49e-09 ***
## Creditability -277.692 143.304 -1.938 0.052933 .
## ‘Duration_of_Credit_(month)‘ 150.748 5.408 27.874 < 2e-16 ***
## Instalment_per_cent -819.499 57.596 -14.228 < 2e-16 ***
## Length_of_current_employment -49.431 55.314 -0.894 0.371727
## ‘Age_(years)‘ 21.003 5.823 3.607 0.000325 ***
## No_of_dependents 12.013 177.285 0.068 0.945988

## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2001 on 993 degrees of freedom
## Multiple R-squared: 0.5005, Adjusted R-squared: 0.4975
## F-statistic: 165.8 on 6 and 993 DF, p-value: < 2.2e-16

The p-values of the variables Creditability, Length_of_current_employment and No_of_dependents are 0.052933, 0.371727 and 0.945988 respectively. All three are greater than 0.05, so at the 5% level we fail to reject the null hypothesis that the corresponding coefficients are zero.
The authors decide to remove these three variables: Creditability, Length_of_current_employment and No_of_dependents.
The new multiple regression model is as follows:

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+Instalment_per_cent+`Age_(years)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + Instalment_per_cent +
## ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6055.7 -1142.4 -252.4 586.8 12178.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1848.255 281.547 6.565 8.38e-11 ***
## ‘Duration_of_Credit_(month)‘ 152.636 5.275 28.937 < 2e-16 ***
## Instalment_per_cent -818.473 56.911 -14.382 < 2e-16 ***
## ‘Age_(years)‘ 18.731 5.596 3.347 0.000847 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2003 on 996 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4965
## F-statistic: 329.3 on 3 and 996 DF, p-value: < 2.2e-16

All p-values in this model are very small (below 0.001), so the authors reject the null hypothesis that the coefficients are zero and conclude that Credit_Amount is linearly related to the variables in this model.
The new model given is:

Credit_Amount = 1848.255 + 152.636 × Duration_of_Credit_(month) − 818.473 × Instalment_per_cent + 18.731 × Age_(years).

The Multiple R-squared of this model is 0.498: about 49.8% of the variation in Credit_Amount is explained by the three predictors, so the model accounts for roughly half of the dispersion of the responses.
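
As an illustration of how the fitted equation is used, the sketch below computes the predicted credit amount for a hypothetical applicant. The applicant's values (24 months, instalment category 2, age 35) are assumptions chosen for the example, not a row of the dataset.

```r
# Coefficients from the summary() output of the reduced model.
coefs <- c(intercept  = 1848.255,
           duration   = 152.636,
           instalment = -818.473,
           age        = 18.731)

# Hypothetical applicant: 1 for the intercept, then duration, instalment category, age.
applicant <- c(1, 24, 2, 35)

predicted_amount <- sum(coefs * applicant)
predicted_amount
```

The terms add up as 1848.255 + 3663.264 − 1636.946 + 655.585 = 4530.158. With the fitted model object, predict() with a newdata data frame would give the same kind of result.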

Compared with the previous model, this model is slightly worse in terms of Multiple R-squared (0.498 < 0.5005) and has a slightly higher residual standard error (2003 > 2001). However, the differences are small.
The adjusted R-squared also decreases slightly after removing the three variables (from 0.4975 to 0.4965), which suggests that they carried a little explanatory information. The decrease is small, so the authors keep the simpler model.
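
The comparison between the two models can be made more formal with a partial F-test for the three removed variables. The sketch below is only an approximation, because it starts from the rounded R-squared values printed in the two summary() outputs rather than from the fitted models.

```r
r2_full    <- 0.5005  # Multiple R-squared of the model with six predictors
r2_reduced <- 0.498   # Multiple R-squared of the model with three predictors
df_full    <- 993     # residual degrees of freedom of the full model
q          <- 3       # number of variables removed

# Partial F statistic for testing that the q removed coefficients are all zero.
f_stat  <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)
p_value <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

c(f_stat = f_stat, p_value = p_value)
```

The resulting p-value is above 0.05, which is in line with the authors' decision to drop the three variables. With the two fitted lm objects at hand, anova(reduced_model, full_model) performs this test from the unrounded values (the two object names are placeholders).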
7. CONCLUSION
After analysing the data, the authors come to the following conclusions about the relationship between Credit_Amount and the other variables.

• About Simple Regression

Duration_of_Credit_(month) shows the strongest simple linear relationship with Credit_Amount (correlation 0.625, Multiple R-squared 0.3906). Instalment_per_cent and Creditability are statistically significant predictors but explain little of the variation (Multiple R-squared of 0.07362 and 0.02394 respectively). No significant simple linear relationship is found between Credit_Amount and either Age_(years) or No_of_dependents.

• About Multiple Regression

The multiple regression model relates Credit_Amount to the following variables: Duration_of_Credit_(month), Instalment_per_cent and Age_(years). Its Multiple R-squared is 0.498, so it explains about half of the variation in Credit_Amount. Instalment_per_cent has the largest coefficient in absolute value (-818.473), but the coefficients are measured in different units. Judged by t value, Duration_of_Credit_(month) (28.937) has the strongest effect, followed by Instalment_per_cent (-14.382).
