
Group_2_Project-German_Credit_Data_Analysis

Nguyen_Tra_My

2024-01-04

GROUP 2 APPLIED STATISTICS PROJECT

GERMAN CREDIT DATA ANALYSIS
1.INTRODUCTION
Credit is a contractual agreement in which a borrower receives a sum of money or something of value and
repays the lender at a later date, generally with interest.
The dataset was collected in Germany and contains information about loan applicants, their financial
history, and loan outcomes. Datasets of this kind are widely used for research, machine learning, and
teaching; well-known versions include Statlog and German Credit Risk.
The goal of the study is to find the relationship between the credit amount and other features of the loan
applicants: the duration of the credit, the instalment rate as a percentage of disposable income, age in
years, number of dependents, and creditability.

library(readxl)  # provides read_excel()
# Load the workbook (path as used on the authors' machine)
german_credit <- read_excel("C:/Users/LUONG/Downloads/German_credit.xlsx")
head(german_credit)

## # A tibble: 6 x 11
## Creditability Duration_of_Credit_(mon~1 Purpose Instalment_per_cent Guarantors
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 18 2 4 1
## 2 1 9 0 2 1
## 3 1 12 9 2 1
## 4 1 12 0 3 1
## 5 1 12 0 4 1
## 6 1 10 0 1 1
## # i abbreviated name: 1: ‘Duration_of_Credit_(month)‘
## # i 6 more variables: Length_of_current_employment <dbl>,
## # ‘Sex_Marital Status‘ <dbl>, ‘Age_(years)‘ <dbl>, Occupation <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>

2.VARIABLES EXPLANATION
We will remove the following variables from the analysis:
Purpose: a qualitative variable with eleven levels (0-10) representing possible reasons for taking out a loan.
Guarantors: a qualitative variable with 3 categories: 1 - none, 2 - co-applicant, 3 - guarantor.
Sex_Marital Status: personal status and gender, a qualitative variable with 4 categories: 1 - male
(divorced/separated), 2 - female (divorced/separated/married), 3 - male (single), 4 - male (married/widowed).

Occupation: a qualitative variable with 4 categories: 1 - unemployed / unskilled - non-resident, 2 - unskilled
- resident, 3 - skilled employee / official, 4 - management / self-employed / highly qualified employee / officer.
We keep the following variables for the analysis:
Creditability: a binary variable: 1 - applicant has creditability, 0 - applicant does not have creditability.
Duration_of_Credit_(month): a numerical variable, the duration of the credit in months.
Instalment_per_cent: a qualitative variable with 4 categories (1-4), the instalment rate as a percentage of
disposable income.
Length_of_current_employment: a qualitative variable with 5 categories: 1 - unemployed, 2 - < 1 year,
3 - 1 <= ... < 4 years, 4 - 4 <= ... < 7 years, 5 - >= 7 years.
Age_(years): the age of the applicant in years.
No_of_dependents: a qualitative variable with 2 categories (1 and 2).
Credit_Amount: the amount of the credit.
3.DESCRIPTION OF THE DATA
After removing the variables listed above, we obtain the following new dataset.

german_credit1=na.omit(subset(german_credit,select = -c(3,5,7,9)));
head(german_credit1)

## # A tibble: 6 x 7
## Creditability ‘Duration_of_Credit_(month)‘ Instalment_per_cent
## <dbl> <dbl> <dbl>
## 1 1 18 4
## 2 1 9 2
## 3 1 12 2
## 4 1 12 3
## 5 1 12 4
## 6 1 10 1
## # i 4 more variables: Length_of_current_employment <dbl>, ‘Age_(years)‘ <dbl>,
## # No_of_dependents <dbl>, Credit_Amount <dbl>
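Dropping columns by position works here, but it silently selects the wrong columns if the column order of the workbook ever changes. As a hedged alternative, the same selection can be written by name. The sketch below uses a small made-up data frame (toy, with invented values) standing in for german_credit; only the column names come from the report.

```r
# Hypothetical stand-in for german_credit; the values are invented
toy <- data.frame(
  Creditability        = c(1, 1, 0),
  Purpose              = c(2, 0, 9),
  Guarantors           = c(1, 1, 1),
  `Sex_Marital Status` = c(2, 3, 2),
  Occupation           = c(3, 3, 2),
  Credit_Amount        = c(1000, 2500, 800),
  check.names = FALSE  # keep the space in "Sex_Marital Status"
)

# Columns to remove, given by name instead of by position
drop_cols <- c("Purpose", "Guarantors", "Sex_Marital Status", "Occupation")
toy1 <- na.omit(toy[, !(names(toy) %in% drop_cols), drop = FALSE])

names(toy1)
```

Applied to the real data, the same pattern would replace toy with german_credit.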

To check the dimensions of the data set german_credit1, we use the dim() function as below.

dim(german_credit1)

## [1] 1000 7

We use the names() function to list the variable names of the data set german_credit1.

names(german_credit1)

## [1] "Creditability" "Duration_of_Credit_(month)"
## [3] "Instalment_per_cent" "Length_of_current_employment"
## [5] "Age_(years)" "No_of_dependents"
## [7] "Credit_Amount"

4.CHARACTERISTICS OF VARIABLES
In this part of the study, we investigate every variable in the dataset german_credit1 and compute
descriptive statistics for each one, in order to understand the characteristics of all the variables
considered.
For each variable, the authors use some of the following functions to analyse and visualize the data:
summary(), table(), hist() and boxplot().

attach(german_credit1)

Creditability
This is a binary variable, so measures of central tendency and dispersion are not meaningful. For that
reason, the authors do not use summary() for this variable and use the functions table() and hist()
instead.

table(Creditability)

## Creditability
## 0 1
## 300 700

hist(Creditability, col = "blue")

[Figure: Histogram of Creditability; x-axis: Creditability, y-axis: Frequency]
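Because Creditability takes only the values 0 and 1, a bar chart of the frequency table may be easier to read than a histogram, which places the two codes on a continuous axis. A minimal sketch, rebuilding the variable from the counts reported by table() above (the names creditability_codes and counts are introduced here for illustration only):

```r
# Rebuild the 0/1 codes from the reported frequencies (300 zeros, 700 ones)
creditability_codes <- rep(c(0, 1), times = c(300, 700))
counts <- table(creditability_codes)

# One bar per category instead of a continuous axis from 0 to 1
barplot(counts, col = "blue",
        xlab = "Creditability", ylab = "Frequency",
        main = "Bar chart of Creditability")

# Relative frequencies: 0.3 for code 0 and 0.7 for code 1
prop.table(counts)
```

The same idea would apply to the other coded variables, such as Instalment_per_cent and No_of_dependents.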

Duration_of_Credit_(month)
This is a continuous variable with numeric data, so the function summary() is used.

summary(`Duration_of_Credit_(month)`)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0

We use hist() to visualize the data.

hist(`Duration_of_Credit_(month)`, labels = TRUE, col = "blue")

[Figure: Histogram of Duration_of_Credit_(month) with bar counts labelled; x-axis: Duration_of_Credit_(month), y-axis: Frequency]

Instalment_per_cent
Instalment rate as a percentage of disposable income. This is a qualitative variable with 4 categories, so
summary() is not useful here.
The frequency table of Instalment_per_cent is given by table().

table(Instalment_per_cent)

## Instalment_per_cent
## 1 2 3 4
## 136 231 157 476

We visualize the data with the function hist().

hist(Instalment_per_cent, col="blue")

[Figure: Histogram of Instalment_per_cent; x-axis: Instalment_per_cent, y-axis: Frequency]

We can see the mode of Instalment_per_cent is 4.


Length_of_current_employment
This variable is qualitative with 5 categories, so measures of central tendency and dispersion are not
meaningful. For qualitative data we compute the frequency table and the mode.

table(Length_of_current_employment)

## Length_of_current_employment
## 1 2 3 4 5
## 62 172 339 174 253

hist(Length_of_current_employment, col = "blue")

[Figure: Histogram of Length_of_current_employment; x-axis: Length_of_current_employment, y-axis: Frequency]

The mode of this variable is 3. Thus, the largest group of applicants (339 of 1000) has been in current
employment for at least 1 year but less than 4 years.
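Base R has no built-in function for the statistical mode, so it is usually read off the frequency table. A minimal sketch, assuming the variable is rebuilt from the counts reported above (the names employment, freq and mode_value are illustrative):

```r
# Rebuild the coded variable from the reported frequency table
employment <- rep(1:5, times = c(62, 172, 339, 174, 253))

freq <- table(employment)

# The mode is the category with the largest frequency
mode_value <- names(freq)[which.max(freq)]
mode_value  # "3"
```

Note that which.max() returns only the first maximum, so ties would need separate handling.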
Age_(years)
This is a continuous variable with numeric data. We use summary() to obtain its descriptive statistics.

summary(`Age_(years)`)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 27.00 33.00 35.54 42.00 75.00

The authors use the functions hist() and boxplot() to visualize the data.

hist(`Age_(years)`, col = "blue")

[Figure: Histogram of Age_(years); x-axis: Age_(years), y-axis: Frequency]

boxplot(`Age_(years)`)

[Figure: Boxplot of Age_(years)]

The applicants in this sample are concentrated between roughly 20 and 40 years old (half of them are
between 27 and 42), which suggests that people in this age range apply for loans more often than others.
No_of_dependents
This is a qualitative variable with 2 categories, so we use table() to obtain the frequencies.

table(No_of_dependents)

## No_of_dependents
## 1 2
## 845 155

We use hist() to visualize the data.

hist(No_of_dependents, col="blue")

[Figure: Histogram of No_of_dependents; x-axis: No_of_dependents, y-axis: Frequency]

The mode of No_of_dependents is 1.


Credit_Amount
This variable is continuous. Therefore, we use the function summary() to obtain its descriptive statistics.

summary(Credit_Amount)

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424

The authors use the functions hist() and boxplot() to visualize the data.

hist(Credit_Amount, col = "blue")

[Figure: Histogram of Credit_Amount; x-axis: Credit_Amount, y-axis: Frequency]

boxplot(Credit_Amount)

[Figure: Boxplot of Credit_Amount]
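By default, boxplot() draws as separate points the observations lying more than 1.5 times the interquartile range beyond the box. The sketch below works this rule out approximately for Credit_Amount, using the quartiles printed by summary() above; boxplot() itself uses hinges, which can differ slightly from these quartiles, and the variable names are illustrative.

```r
# Quartiles and extremes of Credit_Amount, copied from the summary() output
q1 <- 1366
q3 <- 3972
min_amount <- 250
max_amount <- 18424

iqr <- q3 - q1                   # 2606
upper_fence <- q3 + 1.5 * iqr    # 7881
lower_fence <- q1 - 1.5 * iqr    # -2543

# Are there observations beyond either fence?
c(high_outliers = max_amount > upper_fence,   # TRUE
  low_outliers  = min_amount < lower_fence)   # FALSE
```

Under these assumptions, only large credit amounts (above about 7881) would be flagged as outliers, which is consistent with the right-skewed histogram.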

5.SIMPLE REGRESSION
In this part of the research, we consider the pairwise relationships between the variables and fit simple
regressions where it makes sense.
We use pairs() and cor() to get a general view of the relationships between the variables.

pairs(german_credit1)

[Figure: Scatterplot matrix from pairs(german_credit1); panels: Creditability, Duration_of_Credit_(month), Instalment_per_cent, Length_of_current_employment, Age_(years), No_of_dependents, Credit_Amount]

cor(german_credit1)

## Creditability Duration_of_Credit_(month)
## Creditability 1.000000000 -0.21492667
## Duration_of_Credit_(month) -0.214926665 1.00000000
## Instalment_per_cent -0.072403937 0.07474882
## Length_of_current_employment 0.116002036 0.05738103
## Age_(years) 0.091271949 -0.03754986
## No_of_dependents 0.003014853 -0.02383448
## Credit_Amount -0.154740146 0.62498846
## Instalment_per_cent Length_of_current_employment
## Creditability -0.07240394 0.116002036
## Duration_of_Credit_(month) 0.07474882 0.057381027
## Instalment_per_cent 1.00000000 0.126161307
## Length_of_current_employment 0.12616131 1.000000000
## Age_(years) 0.05727075 0.259116153
## No_of_dependents -0.07120694 0.097192004
## Credit_Amount -0.27132228 -0.008376109
## Age_(years) No_of_dependents Credit_Amount
## Creditability 0.09127195 0.003014853 -0.154740146
## Duration_of_Credit_(month) -0.03754986 -0.023834475 0.624988461
## Instalment_per_cent 0.05727075 -0.071206943 -0.271322281
## Length_of_current_employment 0.25911615 0.097192004 -0.008376109
## Age_(years) 1.00000000 0.118589183 0.032272677
## No_of_dependents 0.11858918 1.000000000 0.017143582
## Credit_Amount 0.03227268 0.017143582 1.000000000

From the charts and correlations given by pairs() and cor(), we can see some promising simple linear
relationships, such as the one between Duration_of_Credit_(month) and Credit_Amount, whose correlation
is approximately 0.625, or the one between Instalment_per_cent and Credit_Amount, with a correlation
of -0.271.
The aim of the report is to find the relationships between Credit_Amount and the other variables. Thus, the
authors consider five simple linear models relating Credit_Amount to Duration_of_Credit_(month),
Instalment_per_cent, Creditability, Age_(years) and No_of_dependents respectively. The relationship
between Credit_Amount and Length_of_current_employment is not considered, because the scatter plot and
the correlation (approximately -0.0084) suggest that the two variables are not linearly related.
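To compare the candidate predictors at a glance, the correlations with Credit_Amount can be ranked by absolute value. The sketch below copies the values from the cor() output above into a named vector (the names r and ranked are illustrative).

```r
# Correlations with Credit_Amount, copied from the cor() output
r <- c(
  "Creditability"                = -0.154740146,
  "Duration_of_Credit_(month)"   =  0.624988461,
  "Instalment_per_cent"          = -0.271322281,
  "Length_of_current_employment" = -0.008376109,
  "Age_(years)"                  =  0.032272677,
  "No_of_dependents"             =  0.017143582
)

# Order by absolute value, strongest linear association first
ranked <- r[order(abs(r), decreasing = TRUE)]
round(ranked, 3)
```

With the data loaded, the same vector could presumably be taken directly from the correlation matrix, for example with cor(german_credit1)[, "Credit_Amount"].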

• Credit_Amount and Duration_of_Credit_(month)

Credit_Amount = β0 + β1 × Duration_of_Credit_(month).

The scatter plot of these variables is produced with plot().

plot(`Duration_of_Credit_(month)`, Credit_Amount)
[Figure: Scatter plot of Credit_Amount against Duration_of_Credit_(month)]

The correlation is computed with cor().

cor(`Duration_of_Credit_(month)`, Credit_Amount)

## [1] 0.6249885

The chart and the correlation suggest a positive relationship between the two variables.
The authors use summary() and lm() to fit the simple linear regression model between these two variables.

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5151.7 -1260.0 -432.9 653.2 13805.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.169 139.569 1.527 0.127
## ‘Duration_of_Credit_(month)‘ 146.299 5.784 25.292 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2205 on 998 degrees of freedom
## Multiple R-squared: 0.3906, Adjusted R-squared: 0.39
## F-statistic: 639.7 on 1 and 998 DF, p-value: < 2.2e-16
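In a simple linear regression with an intercept, the Multiple R-squared equals the square of the correlation between the two variables. A quick check using the correlation printed by cor() earlier (variable names are illustrative):

```r
# Correlation between Duration_of_Credit_(month) and Credit_Amount, from cor()
r <- 0.6249885

# Squared correlation, to compare with the Multiple R-squared in the summary
r_squared <- r^2
round(r_squared, 4)  # 0.3906
```

This matches the value 0.3906 reported in the regression output.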

The p-value for the intercept of the model is 0.127, which is larger than the 5% significance level, so we
cannot reject the null hypothesis that the intercept is zero. The authors therefore remove the intercept
from the model and refit it.

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+0))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5489.5 -1156.5 -341.3 744.4 13972.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ‘Duration_of_Credit_(month)‘ 153.952 2.891 53.25 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2206 on 999 degrees of freedom
## Multiple R-squared: 0.7395, Adjusted R-squared: 0.7392
## F-statistic: 2835 on 1 and 999 DF, p-value: < 2.2e-16

With the new result, we reject the null hypothesis that the coefficient is zero and support a linear
relationship between Credit_Amount and Duration_of_Credit_(month): the p-value is very small (< 2e-16),
so a slope estimate this far from zero would be very unlikely if the true coefficient were zero.

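Besides, the Multiple R-squared of 0.7395 looks much higher than that of the previous model (0.3906).
However, for a model without an intercept R computes R-squared relative to the uncentered sum of squares
of Credit_Amount, so the two values are not directly comparable, and 0.7395 should not be read as 73.95% of
the variation around the mean being explained. The residual standard errors (2206 versus 2205) show that
the two models fit about equally well.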
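The difference between the two R-squared definitions can be illustrated on simulated data. The sketch below does not use the credit data: x and y are invented values on a similar scale, and the helper functions r2_centered and r2_uncentered are introduced here for illustration.

```r
set.seed(1)
x <- runif(200, min = 4, max = 72)        # invented durations in months
y <- 150 * x + rnorm(200, sd = 2000)      # invented credit amounts

fit1 <- lm(y ~ x)       # model with intercept
fit0 <- lm(y ~ x + 0)   # model through the origin

# R-squared relative to variation around the mean of y
r2_centered <- function(fit, y) 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
# R-squared relative to variation around zero
r2_uncentered <- function(fit, y) 1 - sum(resid(fit)^2) / sum(y^2)

c(reported_with_intercept = summary(fit1)$r.squared,
  reported_no_intercept   = summary(fit0)$r.squared,
  centered_no_intercept   = r2_centered(fit0, y))
```

The value reported by summary() for the no-intercept fit coincides with the uncentered definition, while its centered R-squared cannot exceed that of the model with an intercept.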
The model given is:

Credit_Amount = 153.952 × Duration_of_Credit_(month).

To visualize the model, the functions plot() and abline() are used.

plot(`Duration_of_Credit_(month)`, Credit_Amount)
abline(0,153.952, col="blue")
[Figure: Scatter plot of Credit_Amount against Duration_of_Credit_(month) with the fitted line through the origin in blue]

• Credit_Amount and Instalment_per_cent

The scatter plot and the correlation of these variables are computed.

plot(Instalment_per_cent, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Instalment_per_cent]

cor(Credit_Amount, Instalment_per_cent)

## [1] -0.2713223

There seems to be a negative linear relationship between the two variables. The functions summary() and
lm() are used to fit the model.

summary(lm(Credit_Amount~Instalment_per_cent))

##
## Call:
## lm(formula = Credit_Amount ~ Instalment_per_cent)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4021.0 -1659.6 -854.5 788.9 13802.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5306.57 244.18 21.732 <2e-16 ***
## Instalment_per_cent -684.60 76.87 -8.905 <2e-16 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##

## Residual standard error: 2718 on 998 degrees of freedom
## Multiple R-squared: 0.07362, Adjusted R-squared: 0.07269
## F-statistic: 79.31 on 1 and 998 DF, p-value: < 2.2e-16

The authors reject the null hypothesis that the coefficients are zero and support a linear relationship
between Credit_Amount and Instalment_per_cent, because the p-values are very small (< 2e-16).
The model given is:

Credit_Amount = 5306.57 − 684.60 × Instalment_per_cent.

However, the Multiple R-squared is only 0.07362, so the model explains little of the dispersion of the
responses: only 7.362% of the variation in Credit_Amount is explained by the variability in
Instalment_per_cent.
The model is visualized as follows.

plot(Instalment_per_cent, Credit_Amount)
abline(5306.57,-684.60, col="blue")
[Figure: Scatter plot of Credit_Amount against Instalment_per_cent with the fitted regression line in blue]

• Credit_Amount and Creditability

The authors consider the scatter plot and the correlation between Credit_Amount and Creditability.

plot(Creditability, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Creditability]

cor(Creditability, Credit_Amount)

## [1] -0.1547401

To fit the regression model, we use summary() and lm().

summary(lm(Credit_Amount~Creditability))

##
## Call:
## lm(formula = Credit_Amount ~ Creditability)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3505.1 -1765.6 -858.4 771.8 14485.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3938.1 161.1 24.447 < 2e-16 ***
## Creditability -952.7 192.5 -4.948 8.8e-07 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

##
## Residual standard error: 2790 on 998 degrees of freedom
## Multiple R-squared: 0.02394, Adjusted R-squared: 0.02297
## F-statistic: 24.48 on 1 and 998 DF, p-value: 8.795e-07

The authors reject the null hypothesis that the coefficients are zero and support a linear relationship
between Credit_Amount and Creditability, because the p-values are very small (8.8e-07 for the slope).
The proposed model is:

Credit_Amount = 3938.1 − 952.7 × Creditability.

The Multiple R-squared is only 0.02394, so the model explains little of the dispersion of the responses:
only 2.394% of the variation in Credit_Amount is explained by the variability in Creditability.
The model is visualized as follows:

plot(Creditability, Credit_Amount)
abline(3938.1,-952.7, col="blue")
[Figure: Scatter plot of Credit_Amount against Creditability with the fitted regression line in blue]
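Because Creditability is coded 0/1, the fitted line simply connects two group means: the intercept is the mean Credit_Amount when Creditability = 0, and the slope is the difference between the two group means. The sketch below checks this reading against figures reported earlier in the report (group sizes from table(Creditability), overall mean from summary(Credit_Amount)); the variable names are illustrative.

```r
# Coefficients from the regression output
b0 <- 3938.1    # intercept
b1 <- -952.7    # slope for Creditability

mean_credit_0 <- b0         # fitted mean amount when Creditability = 0
mean_credit_1 <- b0 + b1    # fitted mean amount when Creditability = 1: 2985.4

# Group sizes from table(Creditability)
n0 <- 300
n1 <- 700

# The weighted average of the group means should reproduce the overall mean
overall_mean <- (n0 * mean_credit_0 + n1 * mean_credit_1) / (n0 + n1)
round(overall_mean)  # 3271, the mean printed by summary(Credit_Amount)
```

In other words, applicants with Creditability = 1 hold credits that are on average about 953 smaller than those of applicants with Creditability = 0.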

• Credit_Amount and Age_(years)

The scatter plot and the correlation between these variables are considered.

plot(`Age_(years)`, Credit_Amount)

[Figure: Scatter plot of Credit_Amount against Age_(years)]

cor(Credit_Amount, `Age_(years)`)

## [1] 0.03227268

The correlation is positive but very close to zero, so any linear relationship between the two variables is
weak. The functions summary() and lm() are used to fit the regression model.

summary(lm(Credit_Amount~`Age_(years)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3066.5 -1896.1 -956.4 717.8 15181.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2986.047 293.495 10.17 <2e-16 ***
## ‘Age_(years)‘ 8.024 7.867 1.02 0.308
## ---

## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2823 on 998 degrees of freedom
## Multiple R-squared: 0.001042, Adjusted R-squared: 4.057e-05
## F-statistic: 1.041 on 1 and 998 DF, p-value: 0.3079

It can be clearly seen that the slope coefficient of this model has a p-value of 0.308, which is larger than
5%. We therefore cannot reject the null hypothesis that the slope is zero.
The authors conclude that the data show no evidence of a simple linear relationship between Credit_Amount
and Age_(years).
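The t value and p-value in the coefficient table can be reproduced by hand: the t value is the estimate divided by its standard error, and the two-sided p-value comes from the t distribution with the residual degrees of freedom. A short sketch using the Age_(years) row above (variable names are illustrative):

```r
# Slope estimate, standard error and residual degrees of freedom from the output
estimate  <- 8.024
std_error <- 7.867
df        <- 998

t_value <- estimate / std_error

# Two-sided p-value under the null hypothesis that the slope is zero
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

round(c(t = t_value, p = p_value), 3)
```

Up to rounding of the inputs, this agrees with the reported t value of 1.02 and p-value of 0.308.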

• Credit_Amount and No_of_dependents

The authors consider the scatter plot and the correlation of these variables using the plot() and cor() functions.

plot(No_of_dependents,Credit_Amount)
[Figure: Scatter plot of Credit_Amount against No_of_dependents]

cor(Credit_Amount, No_of_dependents)

## [1] 0.01714358

The correlation is positive but very close to zero, so any linear relationship between Credit_Amount and
No_of_dependents is very slight. The authors fit the regression model with the summary() and lm() functions.

summary(lm(Credit_Amount~No_of_dependents))

##
## Call:
## lm(formula = Credit_Amount ~ No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3000.5 -1896.3 -938.9 717.0 15173.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3116.9 298.6 10.437 <2e-16 ***
## No_of_dependents 133.6 246.7 0.542 0.588
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2824 on 998 degrees of freedom
## Multiple R-squared: 0.0002939, Adjusted R-squared: -0.0007078
## F-statistic: 0.2934 on 1 and 998 DF, p-value: 0.5882

The p-value of the slope coefficient is 0.588, which is larger than 5%, so we cannot reject the null
hypothesis that the slope is zero.
The conclusion is that the data show no evidence of a simple linear relationship between Credit_Amount
and No_of_dependents.
6.MULTIPLE REGRESSION
In the sixth part of the research, we consider the multiple regression of Credit_Amount on all other
variables in the dataset german_credit1. To do that, we use summary() and lm() to fit the multiple
regression model.

summary(lm(Credit_Amount ~ Creditability + `Duration_of_Credit_(month)` +
             Instalment_per_cent + Length_of_current_employment + `Age_(years)` +
             No_of_dependents))

##
## Call:
## lm(formula = Credit_Amount ~ Creditability + ‘Duration_of_Credit_(month)‘ +
## Instalment_per_cent + Length_of_current_employment + ‘Age_(years)‘ +
## No_of_dependents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6145.5 -1167.3 -222.7 592.6 11866.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2157.829 370.128 5.830 7.49e-09 ***
## Creditability -277.692 143.304 -1.938 0.052933 .
## ‘Duration_of_Credit_(month)‘ 150.748 5.408 27.874 < 2e-16 ***
## Instalment_per_cent -819.499 57.596 -14.228 < 2e-16 ***
## Length_of_current_employment -49.431 55.314 -0.894 0.371727
## ‘Age_(years)‘ 21.003 5.823 3.607 0.000325 ***
## No_of_dependents 12.013 177.285 0.068 0.945988

## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2001 on 993 degrees of freedom
## Multiple R-squared: 0.5005, Adjusted R-squared: 0.4975
## F-statistic: 165.8 on 6 and 993 DF, p-value: < 2.2e-16

The p-values of the variables Creditability, Length_of_current_employment and No_of_dependents are
5.2933%, 37.1727% and 94.5988% respectively. All of them are larger than 5%, so we cannot reject the null
hypothesis that the corresponding coefficients are zero, although Creditability is only marginally above
the threshold.
The authors decide to remove the following variables: Creditability, Length_of_current_employment,
No_of_dependents.
The new multiple regression model is fitted as follows:

summary(lm(Credit_Amount~`Duration_of_Credit_(month)`+Instalment_per_cent+`Age_(years)`))

##
## Call:
## lm(formula = Credit_Amount ~ ‘Duration_of_Credit_(month)‘ + Instalment_per_cent +
## ‘Age_(years)‘)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6055.7 -1142.4 -252.4 586.8 12178.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1848.255 281.547 6.565 8.38e-11 ***
## ‘Duration_of_Credit_(month)‘ 152.636 5.275 28.937 < 2e-16 ***
## Instalment_per_cent -818.473 56.911 -14.382 < 2e-16 ***
## ‘Age_(years)‘ 18.731 5.596 3.347 0.000847 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2003 on 996 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4965
## F-statistic: 329.3 on 3 and 996 DF, p-value: < 2.2e-16

The authors reject the null hypothesis that the coefficients are zero and support the relationship between
Credit_Amount and the other variables in this model, because all p-values are very small (below 0.001).
The new model given is:

Credit_Amount = 1848.255 + 152.636 × Duration_of_Credit_(month) − 818.473 × Instalment_per_cent
                + 18.731 × Age_(years).

The multiple R-squared of this model is 0.498, which is reasonably good: 49.8% of the variation in
Credit_Amount can be explained by the variability in the other variables in the new model.
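To show how the fitted equation is used, the sketch below evaluates it for one hypothetical applicant. The applicant's values (24 months, instalment category 2, age 35) are invented for illustration; only the coefficients come from the output above.

```r
# Coefficients of the reduced model, copied from the regression output
coefs <- c(intercept = 1848.255,
           duration  = 152.636,
           instalment = -818.473,
           age       = 18.731)

# Hypothetical applicant (values invented for illustration)
duration_months <- 24
instalment_code <- 2
age_years       <- 35

predicted_amount <- coefs[["intercept"]] +
  coefs[["duration"]]   * duration_months +
  coefs[["instalment"]] * instalment_code +
  coefs[["age"]]        * age_years

round(predicted_amount, 3)  # 4530.158
```

With the fitted model object available, predict() with a newdata data frame would give the same point prediction without retyping the coefficients.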

Compared with the previous model, this model is slightly worse in terms of Multiple R-squared
(0.498 < 0.5005) and has a slightly higher residual standard error (2003 > 2001). However, the gaps are
small, and Multiple R-squared can never increase when predictors are removed.
The adjusted R-squared also decreases slightly after removing the three variables (from 0.4975 to 0.4965),
which suggests that the removed variables carried a little explanatory information. The decrease is small,
so the authors keep the simpler model.
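The comparison of the two models can be made more formal. The sketch below first reproduces the adjusted R-squared values from the Multiple R-squared values, then computes an approximate partial F statistic for dropping the three variables. It relies only on the rounded figures printed in the two summaries, so the F statistic and its p-value are approximate; the function and variable names are illustrative.

```r
n <- 1000  # number of observations

# Adjusted R-squared for a model with p predictors plus an intercept
adj_r2 <- function(r2, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_full    <- adj_r2(0.5005, p = 6)   # about 0.4975
adj_reduced <- adj_r2(0.498,  p = 3)   # about 0.4965

# Partial F test for the q = 3 removed predictors, from the two R-squared values
q       <- 3
df_full <- 993
f_stat  <- ((0.5005 - 0.498) / q) / ((1 - 0.5005) / df_full)
p_val   <- pf(f_stat, df1 = q, df2 = df_full, lower.tail = FALSE)

round(c(adj_full = adj_full, adj_reduced = adj_reduced,
        F = f_stat, p = p_val), 4)
```

Under these assumptions the p-value is above 5%, which is consistent with the decision to drop the three variables. With both model objects available, anova() on the reduced and full fits would give the exact test.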
7.CONCLUSION
After analysing the data, the authors reach the following conclusions about the relationship between
Credit_Amount and the other variables.

• About Simple Regression

Among the simple regressions, Duration_of_Credit_(month) has the strongest linear relationship with
Credit_Amount (correlation 0.625), followed by Instalment_per_cent (correlation -0.271) and Creditability
(correlation -0.155). No simple linear relationship is found between Credit_Amount and either Age_(years)
or No_of_dependents.

• About Multiple Regression

The multiple regression model relates Credit_Amount to the following variables:
Duration_of_Credit_(month), Instalment_per_cent and Age_(years). The model has a multiple R-squared of
0.498, so it explains about half of the variation in Credit_Amount. Instalment_per_cent has the largest
coefficient in absolute value (-818.473), although coefficients of predictors measured on different scales
are not directly comparable; Duration_of_Credit_(month) has the largest t value (28.937).
