Credit Risk Management
Internal Strategies
The majority of tools for controlling credit risk fall into this category. They include:
• Vetting: assessing the credit risk of prospective counterparties before dealing with them. This is the oldest and most basic means of managing credit risk exposure.
• Position limits: imposing limits on the credit to be granted to any individual counterparty. These position limits can be either ‘soft’ or ‘hard’; the former are akin to targets that might occasionally be breached, while the latter are hard-and-fast limits that should not be exceeded under any circumstances.
• Monitoring: Firms should always monitor their ongoing credit risk exposures, especially to
counterparties to whom they are heavily exposed. Monitoring systems should send warning
signals as a counterparty approaches or breaches a position limit.
• Netting arrangements: to ensure that if one party defaults, the amounts owed are the net rather
than gross amounts.
• Credit enhancement: techniques include periodic settlement of outstanding debts; imposing
margin and collateral requirements, and arranging to make or receive further collateral
payments if one party suffers a credit downgrade; purchasing credit guarantees from third
parties; and credit triggers (arrangements to terminate contracts if one party’s credit rating hits
a critical level).
Risk Sharing Strategies
Purchase of a credit guarantee: the purchase from a third party, usually a bank, of a guarantee
of payment. One example is an export credit guarantee, which is often issued by governments
as a way of encouraging a growth in exports to developing markets in which credit risks may
be relatively high.
Credit derivatives: these mitigate downside risk by transfer to an external party. Examples of
credit derivatives include:
– Credit default swap: A swap in which one payment leg is contingent on a specified credit event such as a default or downgrading.
Credit insurance: Credit insurance works in exactly the same way as any other form of
insurance, whereby premiums are paid for the purchase of a policy that then pays out in the
event of a specified credit default. Such insurance is, however, likely to be expensive.
Total return swap: A swap in which one leg is the total return on a credit-related reference
asset. Total return is defined as the coupon rate plus capital gain or loss.
– Credit-linked note: A security which includes an embedded credit default swap. The issuer
offers a higher rate of return to the purchaser, but retains the right not to pay back the par value
on maturity if a specified credit default occurs.
However, credit derivatives come with a large health warning: they entail their own credit risk because the counterparty may default, and they can also entail substantial basis risk.
CREDIT RISK POLICY
For the broader financial sector, having a solid credit risk policy framework is just as important as it is for banks and MFIs. The same goes for the major topics that must be covered by such policies (as well as the accompanying manuals and procedures), which can be broadly divided into transaction and portfolio risk management actions. For small providers it may be appropriate to integrate all of this into one credit (risk management) policy. The following outlines the core ingredients of a possible credit policy, although companies may wish to add further elements:
• Introduction (vision, mission, risk management strategy, risk appetite)
• Organization of credit (organizational structure, staffing, job descriptions,
roles/responsibilities)
• Loan/lease products
• Processes (assessment, onboarding, monitoring, cross-selling, collections, recovery)
• Standards and compliance (consumer protection principles, internal code of conduct)
• Portfolio management (segmentation, limits)
• Reporting (KRIs, reporting frequency)
Credit Risk Assessment
Before any business sells a product to any client, it wants to know if it will be paid. A credit
risk assessment is meant to evaluate a potential borrower’s ability to repay the obligation, their
character and willingness to repay, and any risks that may endanger repayment. This
assessment tells the company how likely they are to be paid, which should inform the decision
to finance a product or not, as well as how they will price the risk on that loan/lease. In high-
income countries, consumer credit assessments are often automated and rely heavily on credit
reference bureaus. However, even when reference bureaus are operational, few customers will
have a record to analyze. In microfinance, creditworthiness is usually assessed through in-
person interviews, reference checks and other high-touch interactions.
Banks must strike their own balance between mitigating the risk of non-payment and
minimizing the cost of the mitigation. Often, the approach they take is heavily influenced by
the size and type of asset being financed.
The cost of more thorough assessments may be hard to justify for small assets with a low profit
margin, while higher-value income generating assets and consumer goods may require
different approaches to credit assessment.
This section focuses on the key principles for conducting credit risk assessments in the financial sector. It covers:
• Willingness and ability to pay
• Credit scoring, and the difference between expert and statistical scoring
• Methods for streamlining assessments and verifying information
• The importance of secure data gathering and storage
WILLINGNESS TO PAY AND ABILITY TO PAY
In finance for low-income customers, a credit assessment is meant to establish two things:
1. Willingness to pay. Is the customer willing to meet their contractual obligations?
2. Ability to pay. Are they able to do that?
Where one or both of these elements are lacking, default is likely to occur. Under the constraints
faced by low-income or poor households, willingness and ability to pay are often closely
intertwined. There is rarely a month in the life of a low-income borrower where each and every
financial obligation has been met and significant money is left over. This is what being poor
means. Even if one accumulates actual cash savings, this is the result of making real sacrifices
in order to build up an emergency buffer. If having the ability to pay means that clients reach
each loan payment date with all other bills paid, all household needs met and a few hundred
dollars emergency cash left over after making the loan payment, then most low-income
borrowers would be excluded. Therefore, ability to pay almost always comes down to the
discipline required to prioritize a loan installment over many other competing needs and wants.
In other words, willingness to pay.
Assessing willingness to pay
For this reason, an important factor in assessing the credit risk in consumer loans is determining
the moral character of the client. Moral integrity serves as a proxy for the willingness to pay
even in the face of hard personal choices. This is both the most important factor to assess, and
the most difficult. There are many proxies for willingness to pay that an agent or credit officer
can try to assess in the credit decision process. These could be factors that indicate stable life
circumstances and responsible behavior: being married and caring for children, more life
experience, living in a rural area with deep community links, are all good for stability.
Having some assets or owning land or a home can also be predictive of responsible, disciplined
behavior. The question: “Do you own or rent the home you live in?” provides helpful
indications. If a poor household at least owns the modest shelter they live in, that is an indicator
of industriousness, discipline, responsibility, personal pride, and an aspiration to better one’s
life. These are often good predictors of honest and reliable borrower behavior. Given the
subjective nature of this assessment, it is vital that an organization’s criteria should be as
comprehensive and standardized as possible (and put in writing) to avoid the agent taking a
decision/assessment based on their personal bias. Nor should any one factor be unilaterally
disqualifying, unless the failure is so uniquely egregious that it necessitates rejection. Staff and
agents should be clear on the policies and their importance, as well as the consequences of not
following them. Successful lenders can also try to cultivate willingness to pay among their
target clientele through financial education.
Measuring ability to pay
That does not mean that ability to pay is unimportant when conducting credit risk assessments. The fact that ex-ante ability to pay is typically already marginal does not mean that it cannot get worse.
If ability to pay goes from marginal to impossible—if the cash simply is not there—then even
the most willing and disciplined borrower will default. Collecting income-related data that can
be used to assess ability to repay is inherently difficult. Informal worker income is often
irregular and comes from a number of sources, and as a result many customers struggle to
provide reliable estimates of their earnings. In the absence of other methods for calculating
income, self-reported estimates should be subjected to automated verification rules that flag,
for example, when reported income is less than reported expenses. This can help agents or
credit officers spot issues and dig deeper before making a decision.
Assessing ability to pay is not just about credit risk management. It is also about impact: this
is the critical step that companies must take to avoid over-indebting clients.
But often an asset-based approach is easier to implement and more reliable than income figures.
For example, home visits can be used to check on living conditions, the existence of
motorcycles, TVs, smartphones, etc. The Poverty Probability Index (PPI) is an example of such
an approach, and offers some promise as a tool in credit scoring.
CREDIT SCORING AND EXPERT VS. STATISTICAL SCORING MODELS
For ease of decision-making, most companies deploy a scoring model or ‘scorecard’ that
enables various factors to be standardized and tabulated. These scoring models do not
determine disbursement, but usually act as a factor in decision-making. However, the inputs
and structure of that scorecard vary based on the maturity of the company, the nature of the
asset, and the target client.
In general, there are two types of scoring models used in credit risk assessment:
• Expert-based models, where an agent or loan officer grades the potential borrower on a number of pre-set criteria (asset holdings, income, character, references, etc.).
• Statistical models, where data is captured on various indicators, then an overall score is
calculated based on historical predictors of customer default. These two are not mutually
exclusive. Both can be used in credit scoring.
Expert-based scoring
For early-stage companies, an agent or credit officer (the ‘expert’) will typically collect data
and calculate a score based on their impressions and experience. This score and the collected
data should be checked by at least one centralized credit analyst who is not financially
motivated (an example of the Four Eyes Principle at work). As companies grow and collect
more data, they can begin to run regression analyses on customer data to identify patterns and
relationships between variables. This can be used as a plausibility check, revealing dependence
between variables like family income and asset ownership, which in turn can flag applications
where reported values deviate from past experience.
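As a rough illustration of how an expert-based scorecard might tabulate graded criteria, the sketch below computes a weighted score from an officer's grades. The criteria, weights, grade scale and any cut-off are illustrative assumptions only, not prescribed values.

```python
# Minimal expert-based scorecard sketch. Criteria, weights and the 1-5 grade
# scale are illustrative assumptions, not prescribed values.

CRITERIA_WEIGHTS = {
    "character_references": 0.30,
    "income_stability":     0.25,
    "asset_holdings":       0.20,
    "household_situation":  0.15,
    "credit_history":       0.10,
}

def expert_score(grades: dict) -> float:
    """Weighted average of the credit officer's grades (each on a 1-5 scale)."""
    return sum(CRITERIA_WEIGHTS[c] * grades[c] for c in CRITERIA_WEIGHTS)

# Example applicant as graded by the credit officer
applicant = {
    "character_references": 4,
    "income_stability": 3,
    "asset_holdings": 2,
    "household_situation": 4,
    "credit_history": 3,
}

print(f"Expert score: {expert_score(applicant):.2f} / 5")   # 3.25
# A second, centralized credit analyst would review both the grades and the
# resulting score before any decision is taken (Four Eyes Principle).
```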
Statistical credit scoring
Once sufficient data has been collected, companies can go further and run regressions
comparing customer data to portfolio data in order to isolate variables that are highly predictive
of repayment behavior. This should, over time, enable them to build a statistical scoring model
that can be tailored to their specific context and continuously improved upon as more and more
data is fed into the model.
Statistical scoring requires the company to:
1. Collect customer data (quantitative, qualitative, and/or psychometric, although not necessarily all three together).
2. Store data in a well-maintained customer database.
3. Do so for a large enough sample of cases (rule of thumb >5,000, but the more the better) to
be representative.
4. Gather data for long enough to observe outcomes (i.e., for borrowers to complete their loan cycles or default).
5. Hold key loan features more or less constant.
Developing an in-house scoring model requires a significant initial investment, including long
interviews and in-person visits. However, this investment can have a significant long-term
return. A good scoring model can reasonably predict the likelihood of default, limit the number
of data points collected to only the most predictive, enable remote assessments (e.g., phone
interviews, submission of pictures), and automatically flag inconsistencies that require physical
verification. Statistical scoring is also an important use case for machine learning. Though the
developed algorithms may differ, the same methodology is generally applicable regardless of
context. It is up to each institution whether to use their algorithm as the main decision-making
tool or simply as a supplemental input (e.g., in addition to a basic financial analysis), to come
to the final credit decision.
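To make the statistical approach concrete, the sketch below fits a logistic regression on historical loan outcomes and converts the predicted probability of default into a score. The feature names, the simulated data and the 300-850 score scale are illustrative assumptions; they are not the methodology of any particular provider.

```python
# Minimal statistical scoring sketch: logistic regression on historical
# outcomes, with the predicted default probability mapped to a score.
# Feature names, data and the 300-850 score scale are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000                                     # rule of thumb: >5,000 observed cases
X = np.column_stack([
    rng.integers(18, 76, n),                 # age
    rng.normal(200, 80, n).clip(20, None),   # monthly income (USD)
    rng.integers(0, 2, n),                   # owns home (0/1)
])
# Synthetic "observed default" outcome loosely related to the features
z = 0.02 * X[:, 0] + 0.01 * X[:, 1] + 1.0 * X[:, 2] - 3
p_default = 1 / (1 + np.exp(z))              # higher income/age/ownership -> lower PD
y = rng.binomial(1, p_default)

model = LogisticRegression(max_iter=1000).fit(X, y)

applicant = np.array([[35, 180.0, 1]])
pd_hat = model.predict_proba(applicant)[0, 1]        # estimated probability of default
score = 300 + (1 - pd_hat) * 550                     # map to a 300-850 scale
print(f"Estimated PD: {pd_hat:.2%}, score: {score:.0f}")
```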
Scoring is not an end state, but rather a process. When done right, a good scoring model
typically improves its performance with each loan cycle observed. Therefore, having an
efficient, secure data collection process is the pre-condition to develop a statistically proven
algorithm in due time. In other words, both quantity and quality of data matter, as does patience.
Lastly, while reliance on algorithms and machine learning can increase with time, it is
important to first develop internal capacity and set up strong management and governance
frameworks. Without these structures in place, technical innovations will not deliver their full
value.
CREDIT DATA COLLECTION
Credit is a data-driven activity, and so the importance of a well-elaborated data strategy cannot
be overemphasized. Establishing a thorough data collection strategy requires answers to the
following questions:
1. Which data to collect?
2. Which format – e.g., date, integer, string from dropdown, etc.?
3. How can data entry be managed to minimize the need for future cleaning? For example, managers can add rules to date of birth fields that reject any entry falling outside 18 to 75 years prior to the date of entry (i.e., a potential client must be between 18 and 75 years old); a minimal validation sketch follows this list.
4. How to collect data and when?
5. How to verify data?
6. How to protect personal and sensitive data?
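The sketch below illustrates one way such an entry rule might be implemented; the function name and the age bounds simply echo the example above and are not a prescribed design.

```python
# Minimal date-of-birth validation sketch mirroring the rule described above:
# reject any entry implying an age outside 18-75 at the date of entry.
from datetime import date
from typing import Optional

MIN_AGE, MAX_AGE = 18, 75

def validate_date_of_birth(dob: date, entry_date: Optional[date] = None) -> bool:
    """Return True only if the implied age is within the allowed range."""
    entry_date = entry_date or date.today()
    age = entry_date.year - dob.year - (
        (entry_date.month, entry_date.day) < (dob.month, dob.day)
    )
    return MIN_AGE <= age <= MAX_AGE

print(validate_date_of_birth(date(1990, 6, 1)))   # True: applicant in range
print(validate_date_of_birth(date(2015, 6, 1)))   # False: younger than 18
```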
When prioritizing the data to collect on customers, assessors should first focus on data that are
minimally invasive, can be easily collected and are difficult to falsify. Socio-demographic
indicators are a good example of such data, including age, gender, number of children, and
time spent in an area. A provider can and should collect the same data points for each customer,
and they should do so directly into a digital system with a forced set of responses for each
question. Over time this will result in a comparatively large database, showing the same elements
for most customers. Such a database is a key resource for data analytics and data-based risk
management tools, and should be accompanied by a data dictionary that describes each
variable, its derivation and its intended use.
Typically, data storage capacity is not a limiting factor, especially not in IT-driven companies
like many newer AFCs, who typically rely on cloud storage. Most of the assessed companies
had a strong IT environment in place, enabling them to record and analyze complex datasets
and visualize them in dashboards and trends.
GATHERING AND VERIFYING DATA
Many banks in our work were skeptical of the value of self-reported information from
customers. This skepticism is understandable: income and expenses, details about one’s family
and lifestyles—these are sensitive topics for anyone. Executives at companies were also
worried about agents ‘coaching’ clients to give ‘correct’ responses to questions.
These concerns are important, but they can be partially addressed. The following strategies are
employed by effective financial institutions in order to gather customer data and verify its
accuracy:
• Have someone besides the selling agent conduct the data gathering. Ideally this would be a
credit officer conducting an in-person visit; more likely it will be an analyst on the phone.
• Ask borrowers for both references and guarantors. The former will provide insight into their
moral integrity.
• Review credit bureau records (if available), allowing providers to check not only customers’
repayment history but also their truthfulness about their self-reported loan history.
• Incorporate logic checks into the scorecard which compare different information points and examine them for plausibility (e.g., stated income minus stated expenses should roughly match the stated saving rate, which should match household assets, etc.); a small sketch of such a check follows this list.
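A minimal sketch of such a plausibility check is shown below; the field names, the 25 percent tolerance and the flag wording are illustrative assumptions.

```python
# Minimal plausibility-check sketch: flag applications where stated income,
# expenses and savings do not roughly add up. Field names and the 25 percent
# tolerance are illustrative assumptions.

def plausibility_flags(app, tolerance=0.25):
    flags = []
    if app["stated_income"] < app["stated_expenses"]:
        flags.append("Reported income is lower than reported expenses.")
    implied_savings = app["stated_income"] - app["stated_expenses"]
    gap = abs(implied_savings - app["stated_monthly_savings"])
    if gap > tolerance * max(app["stated_income"], 1):
        flags.append("Stated savings do not match income minus expenses.")
    return flags

application = {
    "stated_income": 250,            # all figures in USD per month
    "stated_expenses": 230,
    "stated_monthly_savings": 120,   # implausibly high given the margin above
}
for flag in plausibility_flags(application):
    print("REVIEW:", flag)
```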
Importantly, the algorithm should not be a black box. Clear policies are needed for updating,
evaluating, and re-calibrating the algorithm, all of which consider the actual outcomes
produced and their impact. However, its internal logic should not be too widely revealed to
frontline staff, nor every question equally weighted. A bit of mystery goes a long way in
reducing potential ‘gaming’ of the algorithm. One way of doing this is to include several
psychometric questions (e.g., “how often do you think suppliers cheat their buyers by
overcharging or under-delivering?”) into the questionnaire. Metadata (e.g., how long did a
customer take to answer a given question) should also be evaluated as part of the assessment,
as it can be highly useful and may have predictive power.
STREAMLINING THE CREDIT ASSESSMENT
Providers should explore ways of conducting assessments and gathering KYC data remotely, and make use of local referees wherever possible.
Decision-Making and Disbursement
The credit assessment informs a credit decision that is taken post-assessment by the person or
algorithm empowered to do so. This section covers the possible outcomes of that decision, as
well as the management of risk in the organizational setup around the credit decision.
APPROVAL OR REJECTION
Once the assessment is complete, companies must take the most important step in asset finance:
decide whether or not to give the customer an asset on credit. In CGAP’s experience, companies
leasing larger assets (e.g., vehicles) manage to filter out many unqualified applicants through
their process, and still reject a significant portion of finished applications. On the other hand,
companies financing smaller assets tend to have looser criteria: the majority of potential
customers who can pay the deposit are approved. Credit decisions should be based directly on
the risk appetite and tolerance of the company. However, companies should not overestimate
their ability to control repayment behavior post-decision. The most important place to manage
credit risk is before it begins (in other words, pre-disbursement). Practically, this means that
not every person who wants an asset should receive one on credit. A large part of risk is
knowing when to say ‘no.’ That said, companies who are loath to reject willing customers may
want to consider risk-pricing their deposit.
Asking those clients who are assessed to be riskier to pay 30 percent upfront as opposed to 20 percent, for example, may serve two purposes: (1) decrease the probability of default by filtering out higher-risk clients, and (2) reduce the exposure at default (EaD) for those who still want the asset.
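A simple illustration of the second effect is sketched below; the asset price is a purely hypothetical figure used only for demonstration.

```python
# Illustration of how a risk-priced deposit reduces exposure at default (EaD).
# The asset price is a hypothetical figure used only for demonstration.
asset_price = 500.0                               # USD, hypothetical

for deposit_rate in (0.20, 0.30):
    deposit = asset_price * deposit_rate
    amount_financed = asset_price - deposit       # maximum exposure at disbursement
    print(f"Deposit {deposit_rate:.0%}: financed {amount_financed:.0f} USD")
# Deposit 20%: financed 400 USD
# Deposit 30%: financed 350 USD  -> 12.5 percent lower exposure at default
```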
DISBURSEMENT
In asset finance “disbursement” means physically handing over the asset to the customer.
Sometimes this may require an installation in a remote place (e.g., solar pumps), which also
offers an opportunity to conduct some final verifications as suggested above. This has an
indirect operational risk: if it takes too long to deliver and/or install an asset, it is reasonable to
assume that a customer may be more likely to default. This would typically apply more to
productive assets, especially those tied to seasonal use like water pumps purchased to irrigate
crops.
Monitoring & Repayment
Once an asset has been installed, the goal for a company is to keep the customer paying on (or
ahead of) schedule until the obligation has been completed. This is accomplished through
regular and pro-active monitoring of the client.
MONITORING
Credit monitoring is done to ensure that in the borrower’s daily battle of balancing priorities under tight financial constraints, the credit payment comes first.
If done properly, monitoring serves two objectives:
(1) maintaining the borrower’s willingness to pay and
(2) finding out early if the financial situation of the borrower has deteriorated and is threatening
their ability to pay.
The first is the more important rationale: monitoring reminds the client that they still have
an obligation, that the company (or even better, the relationship manager) will be personally
disappointed if an installment is late, and that severe consequences and additional costs will
follow if a loan falls into arrears.
In finance, monitoring should entail three main components:
1. Regular notifications (through calls or SMS) of upcoming or recent payments due. Such
notifications should be triggered automatically by the credit risk management system, and
should highlight consequences if the payment is not made.
2. Physical monitoring of the device. Though this essentially refers more to operational risk
management, it is highly recommended to establish mechanisms that allow lenders to track an
asset’s location (e.g., GPS tracker), as well as its proper functioning. This can help to mitigate
reputational risk, as well as fraud and theft.
3. Personal check-ins with the client. Ideally done in person, but over the phone is also an
option if necessary. These check-ins help identify emerging risk factors in the personal life or
business. They are also a great chance to inquire into the customer’s further needs and
satisfaction with the offered service. Scheduling and implementing this monitoring requires a
MIS that is able to trigger notifications, identify clients for check-in, and flag cases that need
attention. That system must also be able to track the payment status in (almost) real-time and
calculate due dates and amounts. A well-integrated call center function is also crucial in credit monitoring.
Every monitoring action, whether regular or ad hoc, should also result in an update to the
client’s rating/score. This provides additional data to study and use when developing early
warning indicators.
REPAYMENT
For clients who are in repayment, we recommend establishing an easy-to-understand payment
schedule with equal installments over a contract’s lifetime.
Collections
Credit Escalations
Credit escalations are steps in the process of getting ever more serious with a delinquent or
defaulted client, with the goal of having them pay off their arrears and resume regular payment.
These steps can escalate from automated SMS to voice calls to home visits to repossession to
legal action, and may include many steps in between.
For almost any AFC, the first steps will be to remind the client that a payment date is at hand or has just passed, and to find out why they have not paid. Early-stage actions may then vary
depending on the client’s risk profile and repayment history. This is why AFCs need to
regularly update their data, and use that data to segment clients by the degree of risk they
represent. For less-risky clients, it may be sufficient to get a ‘promise to pay’ or even think
about rescheduling, while for riskier clients it may be necessary to schedule a home visit.
The challenge with escalations is that while more serious actions (such as house visits or
repossession) may be good ways of collecting value from an outstanding asset, they
are also expensive and time-consuming. This is why segmentation is so important: it allows companies to prioritize collections actions by risk segment.
As clients move through various stages of escalation, the person responsible for engaging with
the client may change as well:
1. Early-stage arrears may be managed by the staff member or agent who originated the loan
and who has a financial interest in keeping the client in good standing.
2. Moderate risk clients could also be reached by the company’s call center, who may want to
obtain a promise to pay and explain future steps, including repossession.
3. Higher risk clients can be handled by a specialized collections team who may conduct
additional calls or visits to push for an amicable arrangement. This could include rescheduling
or even partial forgiveness of the loan to get a borrower back on track.
4. Repossessions may be handled by the same team or by a different group of specialists. At
this point, nothing short of a major payment should suffice to prevent repossession.
5. Legal action. Where repossession is not feasible, the final stage is often outsourcing collections to a law office, collections agency or a specialized subsidiary of the lender. This may also be the time to report a client to the credit bureau.
Splitting out the tasks in arrears management and collections among different units as above
also makes it easier to measure the performance of each team and avoid
“perverse incentives.” For reasons of consumer protection and reputation, all of these tasks, as well as the people responsible for them, the timeframes and the applicable code of conduct, must
be explicitly laid out in a Collections Policy. This is an area particularly vulnerable to abuse,
so being clear about collection policy and auditing the actions of collections staff is vital.
Contracts should clearly highlight the company’s right to switch off or repossess an asset, as
well as the precise terms under which they would do so (e.g., after 90 days of consecutive nonpayment). Contracts should also be explicit about the criteria for reporting clients to a credit reference bureau. They should also clearly communicate the customer’s rights, as well as the
mechanism to deal with customer disputes, including requests for corrections or updates to
account information.
Credit reference bureaus
Independent credit reference databases play important roles in financial inclusion:
• They allow providers to look up new borrowers to see what loans they have outstanding and
how they have paid in the past.
• They are a mechanism for clients to leverage good repayment behavior into additional credit
for home/business.
• They are a means of enforcing discipline on existing borrowers.
Repossession
In theory, the decision whether or not to repossess a defaulted asset should be a question of straightforward math (a worked sketch follows the list below):
• What residual value does the asset have?
• How much is a repossession likely to cost (staff time,
fuel, warehouse space, etc.)?
• How much value has historically been recovered in resale of similar repossessed assets?
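The sketch below works through that arithmetic with purely hypothetical figures; the residual value, cost estimate and recovery rate are assumptions for illustration only.

```python
# Repossession economics sketch with purely hypothetical figures.
residual_value = 180.0            # estimated current value of the asset (USD)
repossession_cost = 60.0          # staff time, fuel, warehouse space, etc.
historical_recovery_rate = 0.55   # share of residual value recovered at resale

expected_recovery = residual_value * historical_recovery_rate
net_benefit = expected_recovery - repossession_cost
print(f"Expected resale recovery: {expected_recovery:.0f} USD")
print(f"Net benefit of repossession: {net_benefit:.0f} USD")
# A positive net benefit argues for repossession on purely financial grounds,
# but the signaling effect discussed below may justify it even when negative.
```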
However, there are complicating factors to consider. First is the signaling effect. For many
lower-value items, or assets in the later stages of repayment, repossession may appear
uneconomical on its face, as the remaining value of a used asset may be low and the asset
resalable only at a steep discount, if at all. Yet no one should underestimate the signaling effect of collection actions in general.
Basics of Financial Data Analytics
Data science is the process of examining raw data to extract useful information and knowledge. Big data carries enormous potential: it contains information and knowledge about the real world that are unknown to the researcher and must be uncovered. The challenge is knowing how to use the data and the information it holds. Using scientific methods to analyze big data and extract the information it contains is therefore called data science. Data science covers all steps of the collection, processing, presentation, and interpretation of measurements or observations. Data, in turn, is the main material that data science works with.
Big data is defined as large and complex data and sets of information from a variety of sources that grow rapidly. It is too large and too complex to be handled by traditional methods and software, so more advanced, computer-intensive techniques are needed. Although there are many definitions of big data, most include the “4 Vs”: Volume, Variety, Velocity, and Veracity (Das, 2016). Cackett (2016) adds Value to these.
• Volume: It is related to how much data it contains. Big data can range from terabytes to
petabytes.
• Variety: It is about how many kinds of data are in the data set. It refers to the different
variations of the big data. It contains a wide variety of formats and sources such as e-commerce
and online transactions, financial transactions, social media interactions, etc.
• Velocity (Speed): It is related to how fast data is produced. Since the data generating speed
is very high, data needs to be gathered, stored, processed, handled, and analyzed in relatively
short windows.
• Veracity: It is related to data uncertainty. The data may be inconsistent, incomplete,
ambiguous, and delayed, so it is thought that the data is uncertain.
• Value: This represents the utility that can be extracted from the data.
Financial data also comes in a huge variety. In financial data science, one can come across many different kinds of data, and each of them requires different approaches and
techniques. For instance, macroeconomic variables, price quotes, common stock prices,
commodity prices, spot prices, futures prices, stock indices’ values, other financial market
indexes, financial tables information such as balance sheets, income statements, and cash flow
statements, financial news, financial analyst opinions, financial transactions, research reports,
signals, or any other data or information whatsoever available through the trading platforms
can be financial data.
Classifications of Data
In big data analysis, analysts come across many different types of data. They can be categorized
depending on many aspects. According to these aspects, data can be classified as Qualitative vs. Quantitative data; Structured vs. Unstructured data; Cross-sectional vs. Time Series vs. Panel data; and Deterministic vs. Stochastic Time Series data.
Qualitative (categorical) data is obtained by dividing data into non-numerical groups. For example, data on whether a person is a woman or a man can be turned into categorical data by coding 1 for male and 2 for female; the result of a mathematical operation such as 1 + 2 is not meaningful. Qualitative data types are further divided into nominal (classified), ordinal (sorted/ordered), and binary data.
• Nominal data: a data type that has two or more answer categories with no sequential order. Marital status (married, single), gender (female, male), and eye color (blue, green, brown) are examples.
• Ordinal data (ordered data): a data type that has two or more categories and specifies an order. Examples are education level (primary school, secondary school, high school, university, and postgraduate), competition rank (1st, 2nd, 3rd), development level of provinces (Region 1, Region 2, Region 3, Region 4), and income status (low, medium, and high). In summary, nominal data can only be categorized, whereas ordinal data can be both categorized and ranked.
Quantitative (numeric) data consists of numbers on which arithmetic operations such as addition, subtraction, multiplication, and division can be performed. Examples of quantitative data are stock prices, GDP rates, money supply, interest rates, exchange rates, sales volumes, and salaries. Quantitative data is further divided into two classes: continuous and discrete. Continuous data is obtained by measuring a continuously varying quantity; if infinitely many values are possible between any two measurements, the data is continuous. Salary and income are examples of continuous data. Discrete data is expressed as integers, with no intermediate values.
Qualitative variables are often discrete. For example, one can speak of 1, 2, or 3 women, but not of 2.5 women. Discrete data can be binary, nominal, or ordinal. When numbers are assigned to the categories, the data become quantitative binary, nominal, or ordinal data. For instance, if 1 and 0 are assigned to the yes and no responses, the result is quantitative discrete binary data.
Continuous data can be classified as interval or ratio data. Interval data has equal distances between values but no natural zero. For a variable such as annual income measured in dollars, four people earning $5,000, $10,000, $15,000, and $20,000 are separated by equal intervals. Ratio data also has equal distances between values, but with a natural zero. In other words, interval data can be categorized, ranked, and is equally spaced; ratio data can be categorized, ranked, is equally spaced, and has a natural zero value. Since ratio data has a real/absolute zero, ratios between values can be calculated.
The second classification depends on whether data is structured or unstructured. If data is organized so that it can easily be searched in databases, it is called structured data. Structured data is generally stored in database tables.
Unstructured data has no defined format and is stored in more irregular chunks. Typical examples are plain text files, images, and videos. It is a heterogeneous data source that combines many formats; a Google Search results page is an example. Unstructured data becomes structured data only after appropriate processing.
Today, the size of unstructured data runs to multiple zettabytes and continues to increase drastically. One zettabyte equals 10^21 bytes, or a billion terabytes. These numbers explain why the name Big Data is used, and hint at the difficulties that arise in storage and processing.
Natural Language
Natural language is a kind of unstructured data. Tweets, social media posts, blog posts, forum posts, and other text data are examples. It is more difficult to process because it requires knowledge of specific data science techniques. Natural Language Processing (NLP) techniques are used to extract meaningful and desired information from this kind of data (Cielen et al., 2016).
Machine-Generated Data
Machine-generated data is all data that is created by computers, operating systems, processes, software infrastructure, or other machines without any human intervention (Cielen et al., 2016). Web page requests, clickstreams, telecom call records, and network management logs are examples of machine-generated data.
Another classification depends on whether the data is cross-sectional, time series, or panel data.
Cross-Sectional vs. Time Series vs. Panel Data
Time Series Data: The series that indicate the distribution of the values of the variables/units
according to any time intervals such as day, week, month, and year are called time series. In
other words, data indicating the change of values of one or more variables over time are called
time series data. Stock market indices, exchange rates, and interest rates are some examples of financial time series. In finance and economics, most data takes the form of financial time series.
Continuous time series vs. discrete time series: Time series with data recorded continuously
over time are called continuous time series, and data that can only be observed at certain
intervals, usually equal intervals, are called discrete time series. Series from engineering fields, such as electrical signals, voltage, and sound vibrations, are examples of continuous time series, while interest rate, exchange rate, sales, and production volume data are examples of discrete time series. In finance, asset and derivative prices are often treated as continuous time series: stock prices and option prices are mostly modelled with continuous-time financial models.
Another classification divides time series into deterministic and stochastic series. Deterministic vs. Stochastic Time Series: time series that can be predicted exactly are deterministic. When a time series is only partly determined by its past values, so that exact prediction is not possible, it is called a stochastic time series.
Cross-Sectional Data: Data collected from different units at a certain point of time is called
cross-sectional data. In such data, time is fixed, but there are different units monitored in fixed
time. In general, surveys provide cross-sectional data. This is a one-dimensional data set. Trading volume data for 100 common stocks at the end of a year is an example of financial cross-sectional data. Similarly, if one wants to examine the financial position of many companies at a certain point in time, e.g., in 2020, one examines those companies’ financial statements for 2020, which are cross-sectional data. In general, analyzing cross-sectional data aims to examine similarities and differences between different units at a certain point in time.
Panel Data: Panel data captures variation across both time and cross-sectional units; an example is wheat production in ten US states over a five-year period. Panel data differs from cross-sectional data collected over time because it deals with observations on the same subjects at different times, whereas the latter observes different subjects in different time periods.
Other important issues are the Frequency (Time Scale) of the data and the Time Period (Duration) of the data, both of which affect the results of an empirical study. Frequency (Time Scale) of the data: financial time series can be examined at different time scales such as daily, weekly, monthly, and annual (Taylor, 2007). Depending on the frequency of the data, daily, weekly, monthly, or annual returns are calculated. Time period (Duration) of the data: the duration of calendar time covered by a time series should be as long as possible (Taylor, 2007); the minimum number of years of data required is a controversial issue. Closing prices, opening prices, high and low prices, and trading volume can also provide useful additional information.
As can be seen, there are many different types of financial data, each with its own characteristics, and financial data analysis should take shape according to the data type. Time series analysis is one major area; cross-sectional data analysis and modelling is another. Different approaches are also used depending on whether the time series is discrete or continuous, and each type of analysis has its own applications in finance. For example, cross-sectional modelling plays an important role in empirical investigations of the Capital Asset Pricing Model (CAPM), while continuous-time models of financial time series are popular for derivative asset pricing (Mills & Markellos, 2008). Time series should also be modelled differently depending on their characteristics. For example, if the kurtosis coefficient of a financial time series is greater than 3, many observations are concentrated in the tails, which indicates that it is appropriate to analyze the series with GARCH models (Tsay, 2002). When all these variations are considered, it is clear that “financial data analytics” is quite a wide field. Special types of data need special types of processing.
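As a rough illustration of the kurtosis check mentioned above, the sketch below computes kurtosis for a simulated return series and, if the series looks fat-tailed, fits a GARCH(1,1) model using the third-party arch package. The data is simulated and the package choice is an assumption, not something prescribed in the text.

```python
# Rough illustration of the kurtosis check described above, using simulated
# fat-tailed returns. The arch package (a common third-party choice, assumed
# here) is used to fit a GARCH(1,1) model when kurtosis exceeds 3.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)
returns = rng.standard_t(df=4, size=2000) * 0.01   # fat-tailed simulated returns

k = kurtosis(returns, fisher=False)   # Pearson kurtosis; normal distribution = 3
print(f"Kurtosis: {k:.2f}")

if k > 3:
    from arch import arch_model       # pip install arch
    model = arch_model(returns * 100, vol="GARCH", p=1, q=1)
    result = model.fit(disp="off")
    print(result.params)              # omega, alpha[1], beta[1] estimates
```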
Data Analytics Types and Data Modelling
Data Analytics Methods and Techniques
In the literature, there are many different approaches to grouping methods and techniques. The main purpose of any data analysis is to suggest policies and pathways. Therefore, no matter which group a method belongs to, the main purpose of data analytics is to describe data, model, predict, and suggest policies; these operations constitute the data analysis process. There are four main types of data analytics in general: Descriptive, Diagnostic, Predictive, and Prescriptive Analytics.
(a) Descriptive Analytics: This analytics type is expected to answer the question: “What is happening?” It is the first step of data analysis and uses historical data. The analysis shows the general patterns in the data, summarizing what has already happened and providing the basis from which trends and future probabilities can later be estimated.
(b) Diagnostic Analytics: This analytics type is expected to answer the question: “Why did it happen?” This type of analysis tries to identify the root cause of a defined problem or business question. It is used to determine why something happened.
(c) Predictive Analytics: It is used to answer the question: “What is likely to happen in the future?” It generally uses past data and patterns in order to forecast the future; forecasting and prediction are its main aims. Data mining, artificial intelligence, and time series analysis are some of the techniques used in predictive analytics.
(d) Prescriptive Analytics: This analytics type is expected to answer the question: “What should be done?” This step is about providing pathways and suggesting policies, aiming to indicate and develop the right action for policy makers. Prescriptive analytics builds on the outputs of the other analytics types to find the best course of action.
In summary, Descriptive analytics uses historical data and shows the patterns of the data, and
predictive analytics helps to predict what might happen in the future.
On Predictive Analytics
Throughout the chapter, the reader is invited to contemplate a variety of scenarios, in which
the same data may be used to tell a different story, sometimes by two professionals employing
the same methodology. Predictive models attempt to offer an array of possible outcomes, paired
with plausible explanations on why a certain phenomenon is more likely to occur than another.
No one can predict the future with certifiable certainty, but one can empower decision makers
with the ability to assess the likelihood of a scenario and prepare contingency plans for a variety
of outcomes. Market stability and predictability is a coveted state by all business and finance
professionals. However, business success depends on the ability of decision makers to detect
trends, undercurrents, emerging patterns, and the likelihood of change. The scenarios that follow also show how the same data may be analyzed from a different perspective, using a different model. In some cases, the source of
data is a widely used Internet-based repository. In other cases, the data is synthetic and was
created specifically to illustrate certain points in this chapter.
There are some similarities across models in the approaches to treating data, visualizing, and
interpreting the analysis results. Therefore, an attempt has been made to reduce duplication in the
interest of exposing the reader to a broader range of tools. However, the reader should view
individual steps presented in each model as potential methods that can be applied in other
models. Thus, the chapter provides a richer set of tools and potential mixtures of approaches,
which enable one to tackle different, more complex tasks than the ones described here.
The context has been adapted to better illustrate an application in finance, but the underlying
mathematical, statistical, and computational aspects are the same as in any discipline.
The first model is Logistic Regression. Often, there is a need to categorize an item or a
phenomenon in a binary manner: “yes/no,” “accept/deny,” “buy/sell,” “true/fake news,” etc.
This example analyzes a dataset of loan applications, with the objective of predicting whether
a particular loan would be approved or declined. Logistic regression is widely used in such
predictions, and the model provides a peek behind the scenes at how banks can make decisions within seconds, provided appropriate machine learning software is in place. The model can be
adapted to address any data and context seeking binary predictions.
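A minimal sketch of this kind of binary prediction is shown below, using scikit-learn on a small synthetic loan dataset; the feature names and data are illustrative assumptions, not the dataset used in the chapter.

```python
# Minimal logistic regression sketch for binary loan approval prediction.
# The synthetic features and labels are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 1000
income = rng.normal(50_000, 15_000, n)
debt_ratio = rng.uniform(0.0, 0.8, n)
X = np.column_stack([income, debt_ratio])
# Approval more likely with higher income and lower debt ratio (synthetic rule)
y = ((income / 100_000 - debt_ratio + rng.normal(0, 0.1, n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("P(approve) for one applicant:", clf.predict_proba([[45_000, 0.35]])[0, 1])
```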
The second model is Time Series Analysis. Information can be static, like car engine
specifications, or dynamic, like changes in the value of an asset over time. This model is
analyzing historical data documenting the performance of the Microsoft stock
(NASDAQ:MSFT). It employs techniques like moving averages, exponential smoothing, and
ARIMA to predict the future value of the stock. While the context of the application is finance,
the same model can be applied to any scenario (often in conjunction with other models) in
which time-based data is available. This could be predicting traffic volume, climate change, or
demands of computer storage.
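A minimal sketch of the techniques mentioned (moving averages, exponential smoothing, ARIMA) is shown below, applied to a simulated price series rather than actual MSFT data; the ARIMA order and smoothing window are illustrative assumptions.

```python
# Minimal time series sketch: moving average, exponential smoothing and ARIMA
# on a simulated price series (not actual MSFT data). Model orders and
# smoothing parameters are illustrative assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
prices = pd.Series(100 + np.cumsum(rng.normal(0.05, 1.0, 500)),
                   index=pd.date_range("2022-01-03", periods=500, freq="B"))

ma_20 = prices.rolling(window=20).mean()          # 20-day moving average
ewma_20 = prices.ewm(span=20).mean()              # exponential smoothing

model = ARIMA(prices, order=(1, 1, 1)).fit()      # ARIMA(p=1, d=1, q=1)
forecast = model.forecast(steps=5)                # next 5 business days
print(forecast)
```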
The third model is Decision Tree. The prediction of an outcome in a scenario can be
accomplished by multiple methods. This example tackles the same problem and synthetic
dataset as the logistic regression model, to demonstrate that analysts have choices in
methodology and implementation. Unlike logistic regression, decision trees can lead to one of
multiple outcomes. In this case, the model predicts whether a bank loan will be approved. The
model can be easily adapted to be used in any context for which data is available like predicting
which employee will become a manager, whether a company will choose to merge with
another, or whether a newly introduced product will be successful.
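A minimal decision tree sketch for the same kind of approval prediction is shown below, again on synthetic data; scikit-learn's DecisionTreeClassifier is used here as one common choice, and the features and depth limit are illustrative assumptions.

```python
# Minimal decision tree sketch for loan approval prediction on synthetic data.
# Features, the synthetic approval rule and the depth limit are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n = 1000
income = rng.normal(50_000, 15_000, n)
dependents = rng.integers(0, 6, n)
X = np.column_stack([income, dependents])
y = ((income > 40_000) & (dependents < 4)).astype(int)   # synthetic approval rule

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "dependents"]))
print("Prediction for one applicant:", tree.predict([[55_000, 2]])[0])
```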
The fourth model is Multiple Linear Regression. As the name implies, this model shows how
multiple factors can be analyzed simultaneously to predict an outcome. In this example, the
context is predicting the quality of wine, using information about its chemistry. The nature of
the prediction is a continuous variable, ranking the wine on a quality scale. Linear regression
is used in practically every discipline, ranging from assessing efficacy of vaccines, car engine
failures, university admissions, or the fate of a recently planted tree. The model is sufficiently
detailed that it can be repurposed to any scenario.
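A minimal multiple linear regression sketch in the spirit of the wine quality example follows; the chemistry variables and their relationship to the quality score are simulated assumptions, not the chapter's dataset.

```python
# Minimal multiple linear regression sketch in the spirit of the wine example.
# The chemistry variables and their link to quality are simulated assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
alcohol = rng.normal(10.5, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.15, n)
X = np.column_stack([alcohol, volatile_acidity])
quality = 3 + 0.35 * alcohol - 2.0 * volatile_acidity + rng.normal(0, 0.3, n)

reg = LinearRegression().fit(X, quality)
print("Coefficients:", reg.coef_)          # effect of each predictor
print("R^2:", reg.score(X, quality))       # share of variance explained
print("Predicted quality:", reg.predict([[11.0, 0.4]])[0])
```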
The fifth and last model is Recency, Frequency, Monetary (RFM) Segmentation with k-means.
This is a more complex approach, consisting of two models that are often used separately. RFM is a model
for ranking the outcomes of analysis of observations. The k-means model is used to categorize
observed events. The combination of the two creates a two-step prediction: (1) outcomes
ranking; and (2) grouping outcomes into categories. The model analyzes a large dataset of
historical shopping patterns in a store. It then ranks all customers based on their store visit
patterns and amount of money spent. The customers are then grouped into categories, which
can later be harnessed by store managers. For example, one category could be targeted with credit card offers, while another could be targeted with a coupon whose value is tailored to the customer category.
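A minimal RFM plus k-means sketch on synthetic customer summaries is shown below; the column names, the scaling and the number of clusters are illustrative assumptions.

```python
# Minimal RFM + k-means sketch on synthetic customer summaries. Column names,
# scaling and the number of clusters are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
customers = pd.DataFrame({
    "recency_days": rng.integers(1, 365, 500),     # days since last purchase
    "frequency":    rng.integers(1, 50, 500),      # number of store visits
    "monetary":     rng.gamma(2.0, 150.0, 500),    # total spend
})

scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaled)
customers["segment"] = kmeans.labels_

print(customers.groupby("segment")[["recency_days", "frequency", "monetary"]].mean())
```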
Prescriptive Analytics
One would be hard-pressed to identify the five representative topics or scenarios in prescriptive
analytics. This chapter aims at addressing a broad range of topics, in the attempt to provide a
good grasp on what the issue is, how to address it, and how to construct a full solution in code.
Every scenario described can be found in multiple disciplines, some not related to finance at
all. However, in today’s world—and for a very long time to come—everything either produces
or consumes data, or both. The methods for analyzing what one thinks about a song on a music app are like those for finding out what a stockbroker thinks about a particular investment. The underlying
foundation of both is math, statistics, and computer programming. Each model presented here
could have been addressed in many other contexts and disciplines. This is another contributing
factor in their selection. Most likely, the title of this book will attract readers from the field of
finance and closely related areas. However, it is important to realize that the term global village,
so often cited in the media, applies to science and technology disciplines as well. It would be
very useful for a student of finance to realize and appreciate the methods and tools used by
mechanical engineers (for example), as it would for cybersecurity professionals to learn how
decisions are being made by marketing executives.
The first model covered is Sentiment Analysis. It presents an approach to crowdsourcing
feedback about a (hotel) business, to empower business managers to devise a better business
strategy. The model uses a dataset of diverse opinions but demonstrates how in the end it
presents a clear actionable piece of information: one hotel is better than others and another one
is worse. The information can be used to decide which area of hotel improvement to finance and, in general, what needs to be addressed. The model does not prescribe what to do, just what
is the nature of the problem that needs to be addressed. Human decision-makers must determine
the course of action.
The second model discusses Association Rules. It presents a method for understanding the
shopping habits of customers in a grocery store and produces actionable information that can
be used in inventory management, customer acquisition, financial planning, advertising, and
others. A basket of groceries is no different than a set of tourist sites, a stock portfolio, or a
song list. The method for determining if purchasing bread is associated with purchasing sugar
is identical to the one used to detect items association in any field. This empowers finance
professionals to focus on elements of the model most relevant to their discipline, with the
realization that they can gain insight into the customer’s psychology (for example) and not only
looking at “dry” numbers.
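A minimal sketch of market-basket association rules is shown below, using the third-party mlxtend package on a tiny invented basket dataset; the package choice, the baskets and the thresholds are assumptions, not those of the chapter.

```python
# Minimal association-rules sketch on a tiny invented basket dataset, using
# the third-party mlxtend package (an assumed choice). Thresholds are
# illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [
    ["bread", "sugar", "milk"],
    ["bread", "sugar"],
    ["milk", "eggs"],
    ["bread", "sugar", "eggs"],
    ["bread", "milk"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```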
The third model is Network Analysis. The field of network theory is not often associated with
finance, which is why it was chosen. Everything is connected these days. Whether companies
connect on social media, individuals connect in games, countries are bound by treaties, or
athletes compete in a team, one can look at almost anything through the lens of network theory.
The model focuses on analysing relationships among business entities, often hidden in plain
sight. The premise is that it is essential to acquire knowledge about which companies are
influential in a sector, who has partnered with whom, and about the conduits that enable the
flow of capital and information. The model clearly identifies the leaders and gatekeepers in a network
of large companies and provides actionable information about the connections within the
network.
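A minimal sketch of this kind of analysis with networkx on an invented set of business relationships is shown below; the company names and edges are purely illustrative.

```python
# Minimal network-analysis sketch on invented business relationships.
# Company names and edges are purely illustrative.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("AlphaBank", "BetaPay"), ("AlphaBank", "GammaTel"),
    ("BetaPay", "GammaTel"), ("GammaTel", "DeltaRetail"),
    ("DeltaRetail", "EpsilonLogistics"),
])

degree = nx.degree_centrality(G)             # how connected each company is
betweenness = nx.betweenness_centrality(G)   # who sits on the most shortest paths

for company in G.nodes:
    print(f"{company}: degree={degree[company]:.2f}, "
          f"betweenness={betweenness[company]:.2f}")
```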
The fourth model is a Recommender System. The model focuses on analyzing the opinion of
individual stockbrokers and uses that information in investment decisions. Rating a stock is no
different than rating a movie, a dish in a restaurant, or the suitability of politicians for a job. In
this scenario, we explore how seemingly unrelated individuals, who state their opinions about
an item, can be viewed as one cohesive unit that expresses its aggregate opinion, as one. The
model shows how one can identify trustworthy financial advisors, how to corroborate an
opinion, and how to assess similarities among recommenders, as well as among the items they
recommend.
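A minimal collaborative-filtering sketch follows, in which stockbrokers' ratings of stocks are compared using cosine similarity; the rating matrix is invented for illustration.

```python
# Minimal collaborative-filtering sketch: compare stockbrokers' ratings of
# stocks with cosine similarity. The rating matrix is invented.
import numpy as np

# Rows = brokers, columns = stocks, values = ratings on a 1-5 scale (0 = unrated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between broker 0 and the others
for j in range(1, ratings.shape[0]):
    sim = cosine_similarity(ratings[0], ratings[j])
    print(f"Similarity(broker 0, broker {j}) = {sim:.2f}")
# A broker highly similar to broker 0 can be used to corroborate broker 0's
# opinion or to fill in ratings for stocks broker 0 has not yet rated.
```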
The fifth and last model is Principal Components Analysis (PCA). The model was chosen
because it is fundamental to improving computational efficiency, data collection, and the
construction of predictive and prescriptive models. The model uses the context of determining
the essential factors that explain or predict a phenomenon. It shows how to create artificial,
alternative, and fewer predictors that would perform as well as existing ones. As such, it
prescribes a more efficient course of action, one that requires fewer factors while deciding on
an action, thus saving time, resources, and overall streamlining a process. PCA demonstrates
how a complex process can result in halving (or better) the number of variables in a model,
without sacrificing the amount and quality of information produced.
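A minimal PCA sketch on simulated, correlated predictors is shown below, illustrating how a handful of components can capture most of the variance; the data and the 95 percent variance threshold are illustrative assumptions.

```python
# Minimal PCA sketch on simulated, correlated predictors. The data and the
# 95 percent variance threshold are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
base = rng.normal(size=(300, 3))                   # 3 underlying factors
mixing = rng.normal(size=(3, 10))                  # expand to 10 observed variables
X = base @ mixing + rng.normal(scale=0.1, size=(300, 10))

pca = PCA(n_components=0.95)                       # keep 95% of the variance
X_reduced = pca.fit_transform(X)

print("Original variables:", X.shape[1])
print("Components kept:", pca.n_components_)
print("Explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
```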