Chapter 1 New

Visualization of STEM education among minorities

Uploaded by

temmy
Copyright
© All Rights Reserved

CHAPTER ONE

INTRODUCTION

1.0 Background of the Study

The pendulum of educational training has swung from traditional, lecture-style teaching and fact

memorization to exploratory, hands-on, guided learning, and back again throughout

the past decade. More recently, learning has shifted toward project-based and problem-based

learning (PjBL and PBL), with kinesthetic activities for students. Not only are many schools

diving into PjBL/PBL, but there has also been an increase in incorporating digital technologies

into daily educational practice.

The PjBL/PBL and technological shift has challenged the current structure of several education

programs and professional development to focus more on ‘doing’. This has led to the integration

of makerspaces into the conventional educational system and has enabled the advancement of

STEM education.

In recent years, the emphasis on Science, Technology, Engineering, and Mathematics (STEM)

education in the world has had a tremendous impact on the influx of enthusiasts into the field.

This has led to the almost universal preoccupation with STEM education to shape innovation and

development. In the USA, the 2013 report from the Committee on STEM Education stressed that

"The jobs of the future are STEM jobs," with STEM competencies increasingly required not only

within but also outside of specific STEM occupations.

As a result of this, developing competencies in STEM disciplines is regarded as an urgent goal of

many education systems, fueled by the actual shortages in the current and future STEM

workforce. It is widely recognized that STEM is dominated by one gender and by majority

population groups. Descriptive statistics show that a smaller percentage of women and minorities

persist in a STEM field major compared with male and non-minority students. This has limited the

growth of the STEM workforce, compounding the problems surrounding STEM

employment shortages and STEM education in general, and hindering global development.

What is STEM and STEM education?

Before the introduction of the current acronym, “STEM”, the National Science Foundation

(NSF) was using an acronym of “SMET” that referred to four distinct fields: Science,

Mathematics, Engineering and Technology. In recent years, STEM has been a buzzword among

stakeholders.

Despite its buzzword status, an ambiguity exists in the definition of STEM (Madden, Beyers, &

O’Brien, 2016). The ambiguity has led to different definitions and occupational applications

among stakeholders across the United States (Ntemngwa & Oliver, 2018), because several

programs within various scientific communities have utilized it. Thus, the definition differs

depending on who has employed it.

STEM and STEM education have gained considerable ground in this century, being adopted

by individuals, people groups and institutions. Despite this wide use, “STEM education” is

often used interchangeably with the term “STEM” in the literature. However, STEM and STEM

education are two different terms with different meanings, because STEM education

means a lot more than the four-letter acronym STEM. As a result, no standard definition

exists, owing to the varied opinions of its stakeholders.

STEM is a curriculum based on the idea of educating students in four specific disciplines,

Science, Technology, Engineering and Mathematics, in an interdisciplinary and applied approach.

The combining of the disciplines was a strategic decision made by scientists, technologists,

engineers, and mathematicians to combine forces and create a stronger political voice (Kocabas,

Ozfidan, & Burlbaw, 2019). STEM education, according to the National Science Teachers

Association (NSTA), is an interdisciplinary approach to learning

where rigorous academic concepts are coupled with real-world lessons as students apply Science,

Technology, Engineering and Mathematics in contexts that make the connection between school,

community, work and the global enterprise, enabling the development of STEM literacy and

with it the ability to compete in the new economy.

One of the issues for researchers and curriculum developers lies in the different interpretations of

STEM education and STEM integration. Variants of STEM include, but are not

limited to, SMET, STEAM, METALS, STREAM, STEEM, THAMES and MINT. This

research, however, focuses on the STEM variant.

1.1 Statement of Problem

Persistence is a particular concern for students entering college as Science, Technology,

Engineering, or Mathematics (STEM) majors. Women and minority groups are less likely to persist in a

STEM field major during college than their peers (National Science Board). It is believed that a

strong STEM workforce is important for future development. Thus, it is essential to understand

the reasons for the under-representation of certain groups within this workforce.

Over the last few decades, representation of women and minorities in STEM fields post-college

has increased, but gaps still remain (National Center for Education Statistics, NCES). Much

of this may be due to supply – there are fewer women and minorities receiving bachelor’s

degrees in STEM fields. This is for two reasons: both groups are less likely to pick a STEM

major initially, and if they do, less likely to remain in that major (NCES).

This research responds to that gap, on the view that widening the STEM pipeline

should be a core goal of every stakeholder. The factors

that affect persistence in STEM majors during college are still not well understood, hence this research.

1.2 Aim of the Study

The aim of the study is to develop a hybrid chain model that combines statistical models to

predict patterns in STEM Education among Minorities and to illustrate the results using

visualization techniques.

1.3 Objectives of the Study

The following are the objectives of the research.

1. To appraise and select from a cross section of statistical models the best-fit models for

STEM institution data set based on some predefined metrics.

2. To combine the selected models into a hybrid chain model, validating each using standard

validation techniques and applying the new hybrid model to the data set.

3. To visualize the result of the data analysis performed with the hybrid chain model with

various visualization techniques.

1.4 Significance of Study

Scientific and technological innovations have become increasingly important as we face the

benefits and challenges of both globalization and a knowledge-based economy. The recruitment

of women to STEM fields has been a difficult battle historically with “pipeline” methods. In

order to accelerate global growth and development it is important to highlight the areas of

dependency in STEM education and integration.

To achieve this, the study explains the need for an increase in the population

of STEM professionals, not just among the minority populace but also among the majority, as this will lead to

unparalleled economic growth and infrastructural development at a global scale.

It is strongly believed that the more minority and underrepresented groups move into a

field of study, the more attractive and encouraging it becomes to the general public.

Specifically, this study extends the current frameworks of gender and people groups by exploring

first sample data on professionals and undergraduates on the subtopic of women and minorities

in STEM integrated fields.

1.5 Scope and Limitations of the study

The scope of the system involves cleaning data acquired in various file formats such as

csv, dat, db, sql, mbd, ddf, dta and oid, to name a few, and also includes tests for granularity and analysis

using different statistical models and techniques. The analysis ranges from

regression and association to correlation analysis. Information derived from this stage is then

communicated and presented to the target audience in the form of graphs, area charts, scatterplots,

and 3D visualization techniques.

This research covers, but is not limited to, patterns concerned with visualization and

storytelling with data, including model appraisal and elimination. It also includes the

exploration of data for analysis and understanding, alongside explanation, which entails

communicating a specific story about the data to the target audience.

However, owing to limited availability of data, this research is conducted using data from Lagos

State Polytechnic (full time) and Lagos State University. It is also assumed that the target

audience includes undergraduates, professionals, and stakeholders in STEM-integrated fields of

study.

1.6 Definition of Terms

Association: is a relationship between two random variables which makes them statistically

dependent. It refers to a rather general relationship, without the specifics of the relationship being

mentioned.

Correlation: Correlation is a statistical measure that indicates the extent to which two or more

variables fluctuate together. A positive correlation indicates the extent to which those variables

increase or decrease in parallel; a negative correlation indicates the extent to which one variable

increases as the other decreases.

K-12 education: for kindergarten to 12th grade, is an American expression that indicates the

range of years of supported primary and secondary education found in the United States, which

is similar to publicly supported school grades prior to college in several other countries, such as

Afghanistan, Australia, Canada, Ecuador, China, Egypt, India, Iran, Philippines, South Korea,

Turkey.

Makerspaces: A makerspace is a collaborative workspace inside a school, library or separate

public/private facility for making, learning, exploring and sharing that uses high tech to no tech

tools.

Model: a model is a representation of an idea, an object or even a process or a system that is

used to describe and explain phenomena that cannot be experienced directly.

Regression: Regression is a statistical measurement used in finance, investing, and other

disciplines that attempts to determine the strength of the relationship between one dependent

variable (usually denoted by Y) and a series of other changing variables (known as independent

variables).

SMET: Science, Mathematics, Engineering and Technology, the earlier acronym for what is now

called STEM (Science, Technology, Engineering and Mathematics). It is a term used to group together these academic

disciplines.

STEM: Science, technology, engineering, and math, taken together as an interdisciplinary whole.

Other subjects such as Art also play a large role (we chose to use the acronym

STEM instead of STEAM for the purpose of this chapter).

STEM education: STEM education is an interdisciplinary approach to learning where rigorous

academic concepts are coupled with real-world lessons as students apply science, technology,

engineering, and mathematics in contexts that make connections between school, community,

work, and the global enterprise enabling the development of STEM literacy and with it the

ability to compete in the new economy. (Tsupros, 2009)

STEM integration: an effort to combine some or all of the four disciplines of science,

technology, engineering, and mathematics into one class, unit, or lesson that is based on

connections between the subjects and real-world problems.

STEM jobs: are careers where STEM workers use their knowledge of science, technology,

engineering, or math to try to understand how the world works and to solve problems.

STEM pipeline: is the educational pathway for students in the fields of science, technology,

engineering, and mathematics (STEM).

STEM workforce: the STEM workforce includes 74 occupations including computer and

mathematical occupations, engineers and architects, physical scientists, life scientists, and health-

related jobs such as healthcare practitioners and technicians (but not health care support workers

such as nursing aides and medical assistants). It includes workers with associate degrees and

other credentials as well as those with bachelor’s and advanced degrees.

Problem-Based Learning (PBL): Student-driven and teacher guided in-depth inquiry that

encourages problem-solving, collaboration, and critical thinking skills reflective of real-world

problems focusing on the problem and the process.

Project-Based Learning (PjBL): Student-driven and teacher guided in-depth inquiry that

encourages problem-solving, collaboration, and critical thinking skills reflective of real-world

problems focusing on a specific end product.

Kinesthetic learning: or tactile learning is a learning style in which learning takes place by the

students carrying out physical activities, rather than listening to a lecture or watching

demonstrations.

Underrepresented groups: An underrepresented group describes a subset of a population that

holds a smaller percentage within a significant subgroup than the subset holds in the general

population. Specific characteristics of an underrepresented group vary depending on the

subgroup being considered. Underrepresented groups in science, technology, engineering, and

mathematics in the United States include women and some minorities.

Visualization: Visualization is any technique for creating images, diagrams, or animations to

communicate a message. Visualization through visual imagery has been an effective way to

communicate both abstract and concrete ideas since the dawn of humanity.

CHAPTER TWO

LITERATURE REVIEW

2.1 Data Analysis

Data exist in various forms and numerous capacities. The availability and adoption of newer,

more powerful devices, coupled with ubiquitous access to global networks, has driven the creation

of new sources of data which can be independently managed, searched and analyzed. Data can

be broadly classified into three categories based on the nature of its properties and structure:

structured data, semi-structured data and unstructured data. These

categories form the basis of data analysis.

Structured data is data that adheres to a predefined data model, is usually stored in a traditional

relational database, and is therefore straightforward to analyze. Structured data conforms to a

tabular format with relationships between the different rows and columns. Common examples of

structured data are Excel files or SQL databases. Unstructured data does not have a predefined

data model and cannot be understood by typical data storage programs. Unstructured

simply means that the datasets (typically large collections of files) are not stored in a

structured database format. Unstructured data has an internal structure, but it is not predefined

through data models.

Semi-structured data is a form of structured data that does not conform to the formal structure

of the data models associated with relational databases. It can be stored in several ways: in

applications, NoSQL (non-relational) databases, data lakes, and data warehouses.
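
As a minimal sketch of the difference between structured and semi-structured data, the Python snippet below parses a small table (CSV) and a small collection of self-describing records (JSON). The field names and values are hypothetical and are not drawn from the study's data.

```python
import csv
import io
import json

# Structured data: every row follows the same predefined columns.
# (Hypothetical sample values, for illustration only.)
csv_text = "student_id,gender,major\n1,F,Physics\n2,M,Statistics\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: each record describes itself, and the set of
# fields may differ from one record to the next.
json_text = (
    '[{"student_id": 1, "major": "Physics", "mentors": ["A", "B"]},'
    ' {"student_id": 2, "gender": "M"}]'
)
records = json.loads(json_text)

# Field names seen in each source.
csv_fields = {key for row in rows for key in row}
json_fields = [sorted(record) for record in records]

print("CSV columns (same for every row):", sorted(csv_fields))
print("JSON fields per record:", json_fields)
```

Every CSV row shares the same three columns, whereas the two JSON records carry different fields, which is why semi-structured data usually needs an extra flattening step before tabular analysis.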

2.1.2 Data Management

Managing and analyzing data have always offered the greatest benefits and the greatest

challenges for organizations of all sizes and across all industries. The convergence of emerging

technologies and reduction in costs for everything from storage to compute cycles have

transformed the data landscape and made new opportunities possible. As all these technology

factors converge, they are transforming the way we manage and leverage data (Hurwitz, Nugent,

Halper, & Kaufman, 2013). Data management is an administrative process that includes acquiring,

validating, storing, protecting, and processing required data to ensure the accessibility, reliability,

and timeliness of data. It comprises all disciplines related to managing data as a valuable

resource.

What is Data Analysis?

Data Analysis is the systematic application of statistical and logical techniques to describe the

data scope, modularize the data structure, condense the data representation, illustrate via images,

tables, and graphs, and evaluate statistical inclinations and probability data in order to derive meaningful

conclusions (Simran, 2020). These procedures enable inferences to be drawn while eliminating

redundancy and chaos and ensuring integrity in the continual process of data generation.

It is a process of inspecting, cleansing, transforming and modeling data with the goal of

discovering useful information, informing conclusions and supporting decision-making

(Wikipedia, 2020). Data analysis uses analytical and logical reasoning to find

meaning in data so that the derived knowledge can be used to make informed

decisions.

2.2 Statistical Data Analysis

Statistics is a form of mathematical analysis that uses quantified models, representations and

synopses for a given set of experimental data or real-life studies. Statistics is basically a science

that involves data collection, data interpretation and finally, data validation. Statistical data

analysis is therefore a procedure for performing various statistical operations. It is a kind of

quantitative research, which seeks to quantify the data, and typically, applies some form of

statistical analysis. Quantitative data basically involves descriptive data, such as survey data and

observational data.

Statistical data analysis generally involves some form of statistical tools. The most well-known

statistical tools are the mean (the arithmetic average of a set of numbers), median, mode, range,

dispersion, standard deviation, interquartile range, coefficient of variation and regression.

Although these tools span a wide variety of applications, each comes with its own

pitfalls.
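
The descriptive tools listed above can be illustrated with a short Python sketch using the standard `statistics` module. The `scores` list is a hypothetical sample, not data from this study, and the quartiles use the module's default ("exclusive") method, which is only one of several conventions.

```python
import statistics

# Hypothetical sample of seven test scores (illustration only).
scores = [52, 61, 61, 70, 75, 83, 90]

mean = statistics.mean(scores)            # arithmetic average
median = statistics.median(scores)        # middle value of the sorted data
mode = statistics.mode(scores)            # most frequent value
value_range = max(scores) - min(scores)   # spread between the extremes
std_dev = statistics.stdev(scores)        # sample standard deviation

# Quartiles (default "exclusive" method) and the interquartile range.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Coefficient of variation: standard deviation relative to the mean.
coeff_var = std_dev / mean

print(f"mean={mean:.2f} median={median} mode={mode} range={value_range}")
print(f"std_dev={std_dev:.2f} iqr={iqr} coeff_var={coeff_var:.3f}")
```

One pitfall is visible even in this small sample: the mean is pulled toward extreme values, while the median and interquartile range are not, so the choice of summary depends on how skewed the data is.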

Data in statistical data analysis consists of one or more variables. The data may be univariate, bivariate or

multivariate. Univariate data is a type of data which consists of observations on only a single

characteristic or attribute. The analysis of univariate data is thus the simplest form of analysis

since the information deals with only one quantity that changes. It does not deal with causes or

relationships and the main purpose of the analysis is to describe the data and find patterns that

exist within it.

Bivariate data is a type of data that involves two different variables. The analysis of this type of data

deals with causes and relationships, and is done to find out the relationship between

the two variables. Multivariate data involves more than two variables. It is like

bivariate data but may contain more than one dependent variable. The way to perform analysis on

this data depends on the goals to be achieved. Some of the techniques are regression analysis,

path analysis, factor analysis and multivariate analysis of variance (MANOVA).
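
For the bivariate case, the strength of a linear relationship between two variables is commonly summarized with Pearson's correlation coefficient. The sketch below computes it from first principles; the function name `pearson_r` and the sample values are hypothetical and serve only as an illustration.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical bivariate data: study hours and test scores.
hours = [2, 4, 6, 8, 10]
marks = [50, 58, 61, 70, 79]
r = pearson_r(hours, marks)
print(f"r = {r:.3f}")
```

A value close to +1 or -1 indicates a strong linear relationship, and a value near 0 indicates a weak one; the coefficient by itself says nothing about causation.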

2.3 Visualization of Data

Visualization is the mental image or a visual representation of an object or scene or person or

abstraction that is like visual perception. Visualization has many definitions but the most

common one, which is found in literature, is the use of computer-supported, interactive, visual

representations of data to amplify cognition. Cognition means the power of human perception or

the acquisition or the use of knowledge. Visualization is the graphical representation that best

conveys the complicated ideas clearly, precisely, and efficiently. These graphical depictions are

easily understood and interpreted effectively. The main goal of visualization is to analyze,

explore, discover, illustrate, and communicate information in an easily understandable form.

The process of visualization can be broken into six steps which include: Mapping, Selection,

Presentation, Interactivity, Human Factors and Evaluation. These steps provide a systematic

pattern and structure for the representation of data and information. Visualization uses various

techniques to represent data and information. These techniques fall into the following major

categories:

1. Data Visualization: includes standard quantitative methods such as Tables, Pie Charts,

Area Charts, Line Graphs, etc. These are visual representations of quantitative data in

schematic form (either with or without axes); they are all-purpose and mainly used for

getting an overview of data.

2. Information Visualization: such as semantic networks or tree maps, entity-relationship

diagrams, flow charts, Venn diagrams, dataflow diagrams. It is defined as the use of

interactive visual representation of data to amplify cognition. This means that the data is

transformed into an image (not quantitative); it is mapped to a screen space.

3. Concept Visualization: like a concept map or Gantt chart; these are methods to elaborate

(mostly) qualitative concepts, ideas, plans, and analysis through the help of rule-guided

mapping procedures. For example, decision tree, PERT chart.

4. Metaphor Visualization: like a metro map or story template; these are effective and simple

templates to convey complex insights. Visual metaphors fulfil a dual function: first, they

position information graphically to organize and structure it; second, they convey insight about the

information through the key characteristics of the metaphor that is employed.

5. Strategy Visualization: like a strategy canvas or technology roadmap is defined as the

systematic use of complementary visual representations to improve the analysis,

development, formulation, communication, and implementation of strategies in

organizations. E.g., Organizational chart, Strategy map.

6. Compound Visualization: consists of two or more of the above formats. They can be complex

knowledge maps that contain diagrammatic and metaphoric elements, conceptual

cartoons with quantitative charts, or wall sized info murals.

2.4 Review of Related Literature

A. Persistence of women and minorities in STEM field majors: is it the school that matters?

Persistence in any of the STEM majors is much lower for women and minorities suggesting that

this may be a leaky joint in the STEM pipeline for these two groups of students. Descriptive

analysis statistics show that a smaller percentage of women and minorities persist in a STEM

field major as compared to male and non-minority students. Regression analysis shows that the

differences in preparation and educational experiences of these students explain much of the

differences in persistence rates.

From the analysis of this research the following were deduced:

i. A higher percentage of female STEM field graduates should positively

impact the persistence of female students.

ii. There is little evidence that having a larger percentage of female STEM

field faculty members increases the likelihood of persistence for women in

STEM major.

iii. The differences in the background have a significant impact on persistence

rates.

B. A Longitudinal Study of how quality mentorship and research experience integrate

underrepresented minorities in STEM careers.

Underrepresented minorities do not integrate into the STEM academic community at the same

rate as their non-minority counterparts. This research longitudinally examined the integration of

underrepresented minorities into the STEM community by using growth-curve analysis to

measure the development of TIMSI's key variables (science efficacy, identity, and values) from

junior year through the postbaccalaureate year. From the research it was clearly shown that

quality mentorship and research experience occurring in the junior and senior years were

positively related to student science efficacy, identity and values at the same time period.

CHAPTER THREE

RESEARCH METHODOLOGY

3.1 Methodology and Analysis of the Existing System

Before advancing into the research and analysis of the problem space, a review was

carried out of material on research conducted on the problem space in various countries,

especially the United States.

This was a result of the initial lack of evidence to support the claims of previous studies on the

research topic within the specified country. After a careful feasibility study, it was discovered

that notable scholars have worked on similar research projects on the problem space. These

research works were done using sample populations, premade data or even purely based on

speculation. This is the rationale behind this study, as the previous ones were not adequate to

support the presupposed conclusions.

3.2 Description of the Existing System

One of the most comprehensive studies on the problem space was performed by Ugwuanyi

Okeke on the Determinants of University Students Interest in Science, Technology, Engineering

and Mathematics Education in Nigeria: A case of structural Equation modeling using a sample of

255 undergraduate students in universities in Enugu state Nigeria. The study used

instruments such as questionnaires and an interest scale for data collection; a structural equation

modeling approach was used for the analysis of the data, and the result was tested using the

Root Mean Square Error of Approximation (RMSEA) and the Comparative Fit Index (CFI). In his

recommendation he noted that parental support for academic work greatly improves

student interest in STEM fields of study.
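
For reference, RMSEA and CFI (the latter usually expanded as Comparative Fit Index) can be computed from chi-square statistics as sketched below. The formulas follow the commonly used definitions, and some software uses N rather than N - 1 in the RMSEA denominator. The chi-square values and degrees of freedom are hypothetical illustrations, not results from the cited study; only the sample size of 255 comes from the text above.

```python
import math


def rmsea(chi2, df, n):
    """Root Mean Square Error of Approximation for a fitted model."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index of a model against a baseline (null) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    if d_base == 0:
        return 1.0  # neither model shows any misfit
    return 1.0 - d_model / d_base


# Hypothetical fit statistics (assumed for illustration only).
model_chi2, model_df = 85.2, 40
base_chi2, base_df = 900.0, 55
sample_size = 255  # sample size reported for the study

fit_rmsea = rmsea(model_chi2, model_df, sample_size)
fit_cfi = cfi(model_chi2, model_df, base_chi2, base_df)
print(f"RMSEA = {fit_rmsea:.3f}, CFI = {fit_cfi:.3f}")
```

Lower RMSEA and higher CFI indicate better fit; the cut-offs used to judge them vary between authors, so they are not hard-coded here.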

Another study was conducted by Simon Walter Umoh on the problems and prospects of effective

Science Technology Engineering Mathematics Education Delivery in Nigeria. He looked closely

at the educational practices in Nigeria today and how delivery in the K-12 educational

system has impeded the performance and interest of students in the STEM pipeline. He

clearly stated the problems of STEM delivery in schools, such as a deficient curriculum, poor

teacher supply, unavailability of teaching facilities and an overloaded syllabus, to name a few. He

then recommended that only professional teachers be allowed to teach STEM courses and

encouraged efforts towards curriculum development by experts.

O.A. Akinsowon and F.Y. Osisanwo, in a study on Enhancing Interest in Science

Technology and Mathematics for Nigerian Female folk, showed the general statistics of the influx

and educational level of the female gender as compared to their male counterparts. They also gave the

rationale behind the need to close the gap, noting the factors causing the lack of interest, such as

the individual interest factor, the teachers' factor, curriculum development and the home factor. The

study then gave recommendations on how to close the gap and increase the representation of

the female folk in STEM courses.

3.3 Problem of the Existing System

In the first study, performed by Ugwuanyi Okeke on the Determinants of University

Students Interest in Science, Technology, Engineering and Mathematics Education in Nigeria,

careful analysis showed that the study did not consider the limitations of the statistical model

used in the analysis of the data gathered. Also, the student population data used was biased, and

there was no consideration of the role of professionals and mentors in influencing STEM

migration or of the effect of gender roles on migration patterns.

Simon Walter Umoh conducted a study on the problems and prospects of effective Science

Technology Engineering Mathematics Education Delivery in Nigeria. This study was also a

milestone towards the education of stakeholders regarding the problem space. After careful

consideration it was discovered that the study was based on pure speculation, without

scientific evidence to support its claims.

Finally, the study conducted by O.A. Akinsowon and F.Y. Osisanwo on Enhancing

Interest in Science Technology and Mathematics for Nigerian Female folk seemed the closest to

the research topic due to its direct similarity to the problem space. The research was conducted

with premade data from the National Bureau of Statistics and there was little statistical data

analysis done. Also, the data available was from 2013 and so there is a need for a new analysis to

be performed.

3.4 Justification of the New System

From the analysis of the pre-existing system, it can be deduced that little scientific research has been done

on the analysis of gender roles in the STEM pipeline. There has also been no information on

the use of statistical learning algorithms in the existing system.

3.5 Rationale of Study

This study is planned towards a detailed scientific analysis of the problem space, adequately

modelling and simulating a view of the result through visualization techniques. Using the

many-model approach, this study aims to combine three different analytical modeling techniques. This

is intended to make the result less biased and to make up for the pitfalls of each individual model.

3.6 Research Methods

The following are the methods to be used in the study:

i. Tools: The software used during this research includes RStudio, GNU

Octave, Excel, and Tableau. The model proposed follows the many-model approach and will

employ the correlation and regression models.

ii. Source materials: The research materials were obtained from

the internet through various sources such as Google Scholar, Z-library and Wikipedia, to name a

few.

3.6.1 Methods of Data Collection

This research uses primary data obtained from the Lagos State Polytechnic

information communication center, supported by pre-made data from

the National Bureau of Statistics.

3.6.2 Method of Data Analysis

The method used in the analysis of data is broadly divided into:

i. Descriptive Data Analysis

ii. Predictive Data Analysis

3.7 Process Analysis

The first step is data gathering; the data is obtained from the institutional repository of Lagos

State Polytechnic and the Internet. After this, the data is inspected and cleaned. This stage

eliminates chaos, anomalies and redundancy, and normalizes the data based on the first to third

normal forms.

Then the data is queried using the research questions prepared beforehand, and the results are

collected and analyzed using statistical analysis techniques, applying various models to

determine the most suitable for the research.

After correlation of the result of the analysis, the information is then displayed using 3D

visualization techniques to properly illustrate the information to the stakeholders.

3.8 Analysis of Current System

The new system is to be implemented on student data from Lagos State Polytechnic (full time) over

a 3-year period, showing the correlation patterns, gender roles and the effect of role model

minority mentors on the choice of study.

The new system uses statistical techniques to derive information from the data gathered; such

techniques are used to derive patterns and show relationships between collected data. Then

statistical models and methods are applied to the information.

3.9 Activity Diagram

Activity diagrams are graphical representations of the workflow activities. The diagram below

models the computational activities carried out in the project.

Fig 2.1 Activity Diagram of Workflow

3.10 Flowchart Diagram

The diagram below shows a diagrammatic representation of the processes involved in the

research.

Fig 2.2 Flowchart Diagram of Processes

CHAPTER FOUR

DATA ANALYSIS AND RESULTS

4.1 Data Analysis

Data Analysis is the science of evaluating data using statistical analytical tools to discover

information that is timely and useful.

4.1.1 Data Analysis of STEM Education

The Analysis of data was carried out in the following order.

1. Data Gathering and Sorting.

2. Data Cleaning and Inspection.

3. Model Elimination and Appraisal.

4. Model Selection and Statistical Methods

5. Coding and Debugging.

6. Validation of model.

Data Gathering and Sorting: This was carried out using Excel and other Spreadsheet packages.

Data from the National Bureau of Statistics was organized in tables but there were still anomalies

associated with it. Also, data from the school database was raw and it had to be sorted,

summarized, and placed in tables.

Data Cleaning and Inspection: This was done in a spreadsheet environment, where the data

was normalized and placed into categories. Also, all missing data were replaced with a non-null

value of zero and then categorized with the rest of the data.
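The zero-replacement step described above was done in a spreadsheet, but it can be sketched in R as follows. This is only an illustration: the data frame `enrolment` and its columns `department`, `male` and `female` are hypothetical and do not come from the study's data sets.

```r
# Hypothetical enrolment table with missing counts (illustrative values only)
enrolment <- data.frame(
  department = c("A", "B", "C"),
  male       = c(120, NA, 45),
  female     = c(NA, 60, 30)
)

# Identify the numeric columns, then replace every missing count with zero
numeric_cols <- sapply(enrolment, is.numeric)
enrolment[numeric_cols] <- lapply(enrolment[numeric_cols], function(col) {
  col[is.na(col)] <- 0
  col
})

print(enrolment)
```

Note that zero-filling treats a missing count as "no students", so it lowers any mean computed afterwards.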

Model Elimination and Appraisal: Four models were initially selected namely, Markov Chain

Monte Carlo, Linear Regression Model, Logistic Regression Model and Correlation Model.

a. Markov Chain Monte Carlo Model: This model was discarded after careful study because its characteristics did not suit the problem space.

b. Linear Regression Model: This model was selected based on its assumptions and its ability

to predict.

c. Logistic Regression Model: This model was also selected based on its assumptions and its

ability to model a binary dependent variable.

d. Correlation Model: This model was selected based on its ability to determine the

relationship between parameters.

Model Selection and Statistical Methods: After the appraisal of the models, three out of the

four models were selected based on their individual characteristics, assumptions and parameters

as the best fit for the problem space. The models were implemented in the following order.

a. Correlation model: This defines the degree of similarity between two variables and monitors their variation. This research employed three types of correlation: Pearson, Spearman and Kendall.

i. Pearson Correlation: It is used to measure the degree of relationship between variables that are linearly related. The formula is

$$ r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}} $$

$r_{xy}$ = Pearson r correlation coefficient between x and y

$n$ = number of observations

$x_i$ = value of x (for the ith observation)

$y_i$ = value of y (for the ith observation)

Eqn 4.1 Pearson Correlation Equation

Pearson correlation assumes that both variables are normally distributed, that is, that each follows a bell-shaped curve.
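As a worked illustration of the Pearson coefficient, the sketch below computes r from the sums in its usual computational form and compares it with R's built-in `cor()`. The vectors `x` and `y` are made-up values, not data from either institution.

```r
# Illustrative values only (not taken from the study's data sets)
x <- c(10, 20, 30, 40, 50)
y <- c(12, 25, 29, 48, 51)
n <- length(x)

# Pearson r computed term by term from the sums of x, y, x*y, x^2 and y^2
r_manual <- (n * sum(x * y) - sum(x) * sum(y)) /
  (sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2))

# Built-in equivalent
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree, which is a quick check that the hand computation follows the same definition as `cor()`.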

ii. Spearman Correlation: This is used to calculate the degree of association between variables. It does not make any assumption about the distribution of the variables. The formula is

$$ \rho = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)} $$

$\rho$ = Spearman rank correlation

$d_i$ = the difference between the ranks of corresponding variables

$n$ = number of observations

Eqn 4.2 Spearman Rank Correlation Equation
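The rank-difference form of Spearman's coefficient can be checked with a short sketch. It assumes made-up vectors `x` and `y` with no tied values, since the rank-difference formula is exact only when there are no ties.

```r
# Illustrative values only, chosen so that there are no ties
x <- c(35, 23, 47, 17, 10)
y <- c(30, 33, 45, 23, 8)
n <- length(x)

# d holds the difference between the ranks of each pair of observations
d <- rank(x) - rank(y)

# Spearman rho from the rank differences
rho_manual <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))

# Built-in equivalent
rho_builtin <- cor(x, y, method = "spearman")

print(c(manual = rho_manual, builtin = rho_builtin))  # both 0.9
```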

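iii. Kendall Correlation: This is used to test statistical association based on the ranks of the data. It assumes that the data are ordinal and follow a monotonic relationship. The formula is

$$ \tau = \frac{N_c - N_d}{\tfrac{1}{2}\,n(n - 1)} $$

$\tau$ = Kendall rank correlation

$N_c$ = number of concordant pairs

$N_d$ = number of discordant pairs

$n$ = number of observations

Eqn 4.3 Kendall Correlation Equation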
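To make the concordant and discordant pair counts concrete, the sketch below counts them directly and compares the result with `cor(..., method = "kendall")`. The vectors are illustrative and contain no ties; with ties, the built-in function applies a correction that this simple count does not.

```r
# Illustrative values only, chosen so that there are no ties
x <- c(35, 23, 47, 17, 10)
y <- c(30, 33, 45, 23, 8)
n <- length(x)

# Every unordered pair of observations: each column is one pair (i, j), i < j
pairs_idx <- combn(n, 2)

# A pair is concordant when x and y move in the same direction, discordant otherwise
sign_product <- sign(x[pairs_idx[1, ]] - x[pairs_idx[2, ]]) *
  sign(y[pairs_idx[1, ]] - y[pairs_idx[2, ]])
n_concordant <- sum(sign_product > 0)
n_discordant <- sum(sign_product < 0)

# Kendall tau: net concordance divided by the total number of pairs
tau_manual <- (n_concordant - n_discordant) / (n * (n - 1) / 2)
tau_builtin <- cor(x, y, method = "kendall")

print(c(concordant = n_concordant, discordant = n_discordant, tau = tau_manual))
```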

b. Linear Regression Model: This models the relationship between two variables by fitting a linear equation to the data. A linear regression model uses the equation of a line,

$$ Y = a + bX $$

where Y is the dependent variable, X is the independent variable, a is the intercept and b is the slope. The linear regression model assumes that there is a linear relationship between the two variables.

Eqn 4.4 Linear Regression Model Equation.
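As an illustration of fitting the line Y = a + bX, the sketch below computes the least-squares slope and intercept by hand and compares them with `lm()`. The values of `x` and `y` are invented for the example.

```r
# Illustrative values only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)

# Least-squares estimates of the slope b and the intercept a
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)

# Built-in fit of the same model
fit <- lm(y ~ x)

print(c(intercept = a, slope = b))
print(coef(fit))
```

The hand-computed pair (a, b) should match the coefficients reported by `lm()`.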

c. Logistic Regression Model: This is used to model the probability of a certain event occurring.

$$ P(x) = \frac{1}{1 + e^{-(a + bx)}} $$

$P(x)$ = logistic regression function

$a$, $b$ = intercept and slope, as in Eqn 4.4

Eqn 4.5 Logistic Regression Model Equation.
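The logistic function maps any value of a + bx to a probability between 0 and 1. A minimal sketch, assuming illustrative coefficients `a` and `b` that are not estimates from the study:

```r
# Logistic function with intercept a and slope b
logistic <- function(x, a, b) 1 / (1 + exp(-(a + b * x)))

# Illustrative coefficients and inputs (not estimates from the study)
a <- -2
b <- 0.5
x <- c(0, 2, 4, 6, 8)

p_manual  <- logistic(x, a, b)
p_builtin <- plogis(a + b * x)  # R's built-in logistic distribution function

print(round(p_manual, 3))
```

At x = 4 the linear part a + bx equals zero, so the probability is exactly 0.5; smaller x gives a lower probability and larger x a higher one.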

Hybrid Chain Model: This is the final model, derived by chaining the three models used in the analysis into a data pipeline in which the output of one process is the input of the next. The data set for each institution is first run through a correlation analysis of its individual parameters to determine the most correlated variables and the strength of each relationship. The result is then validated using the AIC test to determine the best-fitting parameters. Finally, the selected parameters are run through linear and logistic regression models, which are used for predictive analysis.

Data Set → Correlation Analysis → Validation Test → Regression Analysis

Fig 4 Hybrid Chain Model
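The chain in Fig 4 can be sketched end to end in a few lines of R. This is a simplified illustration on synthetic data, not the study's implementation (which is listed in Appendix B): the data frame `toy` and its columns `male`, `staff` and `female` are hypothetical, and only single-predictor candidate models are compared.

```r
set.seed(1)

# Synthetic stand-in for one institution's table (hypothetical columns)
toy <- data.frame(
  male  = round(runif(30, 50, 500)),
  staff = round(runif(30, 5, 40))
)
toy$female <- round(0.6 * toy$male + rnorm(30, sd = 20))

# Step 1: correlation of each candidate predictor with the response
predictors   <- c("male", "staff")
correlations <- sapply(predictors, function(p) cor(toy[[p]], toy$female))

# Step 2: fit one candidate model per predictor and compare them by AIC
# (the model with the lowest AIC is preferred)
candidates <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "female"), data = toy)
})
names(candidates) <- predictors
aic_values <- sapply(candidates, AIC)
best <- names(which.min(aic_values))

# Step 3: use the selected model for prediction
predicted <- predict(candidates[[best]], newdata = toy)

print(correlations)
print(aic_values)
print(best)
```

Each step consumes the output of the previous one, which is the sense in which the model is a chain.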

Statistical Methods: The statistical methods used for the research include skewness, mean estimate, standard deviation from the mean, slope and intercept.

Validation of model: Each model was validated to confirm its accuracy and performance. After validation, only two models performed satisfactorily: the correlation model and the linear regression model. The Akaike information criterion (AIC) was used to test the parameters of the models. It involves a comparative analysis of different candidate models to determine the best fit for the data. The AIC is calculated from the number of estimated parameters in the model and the maximum value of the model's likelihood function.

$$ \mathrm{AIC} = 2K - 2\ln(\hat{L}) $$

AIC = Akaike Information Criterion.

$K$ = number of estimated parameters in the model.

$\hat{L}$ = maximum value of the likelihood function for the model.

Eqn 4.6 Akaike information criterion (AIC) Equation.
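The AIC calculation can be reproduced by hand from a fitted model's log-likelihood. The sketch below uses invented values of `x` and `y`; note that for a linear model R counts the error variance as an estimated parameter, so a model with an intercept and one slope has K = 3.

```r
# Illustrative values only
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1)
fit <- lm(y ~ x)

# Maximized log-likelihood and number of estimated parameters
log_lik <- logLik(fit)
k <- attr(log_lik, "df")  # intercept, slope and error variance

# AIC = 2K - 2 ln(L-hat)
aic_manual  <- 2 * k - 2 * as.numeric(log_lik)
aic_builtin <- AIC(fit)

print(c(K = k, manual = aic_manual, builtin = aic_builtin))
```

When several candidate models are fitted to the same data, the one with the lowest AIC is preferred.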

Coding and Debugging: This was carried out using the RStudio IDE. The code for each model was written in the R programming language, run with R 4.0.3, and debugged against the syntax and semantics of the language.

4.2 Data and Methods

This research uses data from two sources. The first is the Lagos State Polytechnic database, which contains raw student enrollment data covering a five-year period across departments. Data was collected on the number of male and female students enrolled in the different departments of the polytechnic, dividing the departments into categories and testing for granularity.

The second source is the National Bureau of Statistics, which provided longitudinal data on Lagos State University for a period of three years. This data surveyed students in their various departments and the staff of the institution. The data was presented in tables and was not normalized.

The gender composition of the students in their various departments is measured as the average percentage of STEM undergraduate majors. The percentage of STEM undergraduate majors who are female is normalized, to avoid measuring a general trend at the university towards more or fewer women. To measure the female faculty members available to serve as mentors for undergraduate students interested in STEM, one would ideally want the faculty gender composition of the STEM departments. A higher percentage of female faculty members would mean more opportunity for female students to identify with female role models in the field. However, data on faculty gender composition by department is not readily available for all the years needed here. This led to the exclusion of some parts of the available data set.

Descriptive statistics for the two institutions of study are shown in Tables 1 and 2. Table 1 shows the distribution of the male and female student population of Lagos State Polytechnic, together with a summary. The summary shows the standard deviation of the population and a skewness of 1.09, which indicates a highly skewed distribution rather than a normal one. Panel A of the table shows the distribution of the sum of male and female students across all departments; the total number of female students exceeds that of male students, with no correlation between the sums. Chart A of the table shows the distribution of the departments and the total number of students, together with the summary of the male and female values.

Table 2 shows the distribution of the male and female student population of Lagos State University and its summary. The summary shows a skewness of 0.45, which indicates an approximately symmetrical distribution. Panel A shows the total number of students in the distribution; here the male students outnumber their female counterparts. Chart A of the table shows the distribution of the departments and the total number of students, together with the summary of the male and female values.

The statistics show that departments like Engineering have a higher percentage of male students than Management Sciences. Tables 3 and 4 show the classification of the students by category and department, illustrating which departments fall under the STEM classification and visualizing the tables' data alongside the faculty members of each institution.

4.2.1 Statistical Methods and Models

The descriptive statistics indicate a variance in population density across departments. The data show that the average number of females in managerial courses is considerably higher than in traditional STEM courses. To predict the student population in the categories effectively, a hybrid chain model was employed, which combines the following:

1. Pearson Correlation Coefficient Model

2. Spearman Correlation Coefficient Method

3. Kendall Correlation Coefficient Method

4. Linear Regression Model

The correlation coefficient was measured to determine the relationship between the variables and their relative movement. Pearson, Spearman and Kendall rank correlations were used in the analysis of the data. Figs 4.1 and 4.1.1 show the correlation analysis of the student population in the institutions. The result of this analysis was then validated using an AIC test to determine the best-fit model parameters, which were then used in the regression model. The linear regression model was used to predict future trends in the population.

4.3 Results and Inference

The correlation analysis was done using three correlation coefficients: Pearson, Spearman and Kendall. The data sets showed different results for the correlation test. For the Lagos State Polytechnic data set, Pearson correlation showed a positive relationship between the female and male students, a weak positive relationship between the male academic staff and the female students, and a weak negative relationship between the male non-academic staff and the female students. For the Spearman and Kendall correlations, both the female academic staff and the male non-academic staff showed correlation, although Spearman gave the stronger coefficients, both positive and negative.

The data set was then passed to a validation criterion, the AIC, which determines the best-fit model parameters for the regression model. For the Lagos State Polytechnic data set, the AIC test determined the parameters that best predict the data set. The result gave a min-max accuracy of 52.4%. The Lagos State University data set behaved differently. All three correlation coefficients showed a strong positive relationship between the male and female students and a very weak correlation between the staff and the minority students. The best fit under the AIC test for this data set was a combination of parameters. The result gave a min-max accuracy of 34.08%.
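The accuracy figures above are min-max accuracies, computed in Appendix B together with the mean absolute percentage error (MAPE). The sketch below shows how both measures are obtained from paired actual and predicted values, using the same expressions as the appendix. The four rows are invented for illustration and are not results from the study.

```r
# Illustrative actual and predicted counts (not results from the study)
actual_preds <- data.frame(
  actuals    = c(100, 200, 50, 80),
  predicteds = c(80, 250, 50, 100)
)

# Min-max accuracy: for each row, the smaller value divided by the larger one,
# averaged over all rows (1 means perfect prediction)
min_max_accuracy <- mean(apply(actual_preds, 1, min) / apply(actual_preds, 1, max))

# MAPE: average absolute error as a fraction of the actual value (0 means perfect)
mape <- mean(abs(actual_preds$predicteds - actual_preds$actuals) / actual_preds$actuals)

print(c(min_max_accuracy = min_max_accuracy, mape = mape))  # 0.85 and 0.175
```

A min-max accuracy of 52.4% therefore means that, on average, the smaller of each actual and predicted pair was about half the size of the larger.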

4.3.1 Inference

From the analysis the following can be deduced.

The distribution of female minority students in STEM and Non-STEM fields of study is highly dependent on:

i. The higher percentage of male students in STEM fields of study.

ii. The departments and categories of the fields of study (STEM and Non-STEM).

Also, the distribution of male and female staff members does not necessarily impede the influx of female minority students into the field of study.

4.3.2 Assumptions

i. The data set is assumed to represent the population of students and staff in higher institutions of learning within Lagos State.

ii. The data set is assumed to be unbiased.

4.4 Visualization of Results

The relationship between the parameters that show a positive correlation is illustrated using 3D scatter plots in Graphs 3 and 4. Graphs 5 and 6 show box plots of the parameters in view, and Graphs 7 and 8 show density plots, which describe the level of skewness of the parameters. Graphs 9 and 10 illustrate the relationship between the parameters in the data set for both institutions.

CHAPTER FIVE

SUMMARY AND CONCLUSION

5.1 Summary

At the end of the project the data sets were explored and analyzed. The analyzed data was then

visualized using visualization techniques like the 3D Scatter plot.

The first chapter introduced the concepts of STEM and STEM education. It described the aim and objectives of the research and gave an overview of the project.

The second chapter described data analysis, data visualization and statistical modeling, explaining how the concepts fit together in the research work. It also described the work of two foreign researchers on the problem space.

The third chapter gave extensive detail on the methods used in the research, explaining the limitations of past research and the justification for this research work. It also illustrated the research process with diagrams.

The fourth chapter explained the process of the research work in detail, describing each step and presenting the results with reference to the diagrams and figures.

The last chapter, which is the present chapter, summarizes the previous chapters, concludes the research and gives recommendations to readers.

5.2 Conclusion

This research examines the relationship between female students in STEM fields of study and other variables in the data set. It examines cross-sectional and longitudinal data on students in two institutions across three years of study, carefully appraising different models to determine the best-fit model. The data set is passed through a Hybrid Chain model pipeline, which is a combination of the selected best-fit models.

Results show a positive correlation between the male and female students. The AIC determined the best fit for the linear regression model before the predictive analysis of the data set. The result was then visualized using techniques such as density plots, box plots and 3D scatter plots.

REFERENCES

Akinsowon, O. A., & Osisanwo, F. Y. (2014). Enhancing interest in sciences, technology and
mathematics (STEM) for the Nigerian female folk. International Journal of Information
Science, 4(1), 8-12.
Griffith, A. L. (2010). Persistence of women and minorities in STEM field majors: Is it the
school that matters? Economics of Education Review, 29(6), 911-922.
Hackman, S. T., Zhang, D., & He, J. (2021). Secondary school science teachers’ attitudes
towards STEM education in Liberia. International Journal of Science Education, 1-24.
Hurwitz, J. S., Nugent, A., Halper, F., & Kaufman, M. (2013). Big data for dummies. John Wiley
& Sons.
Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business
professionals. John Wiley & Sons.
Kocabas, S., Ozfidan, B., & Burlbaw, L. M. (2019). American STEM Education in Its Global,
National, and Linguistic Contexts. EURASIA Journal of Mathematics, Science and
Technology Education, 16(1), em1810.
Madden, L., Beyers, J., & O'Brien, S. (2016). The importance of STEM education in the
elementary grades: Learning from pre-service and novice teachers’ perspectives. The
Electronic Journal for Research in Science & Mathematics Education, 20(5).
Ntemngwa, C., & Oliver, S. (2018). The implementation of integrated science technology,
engineering and mathematics (STEM) instruction using robotics in the middle school
science classroom. International Journal of Education in Mathematics, Science and
Technology, 6(1), 12-40.
Page, S. E. (2018). The model thinker: what you need to know to make data work for you.
Hachette UK.
Ralph, R., MacDowell, P., Lee, Y. L., & Ng, D. (2020). STEM Education for Girls: Perspectives
of Teachers During a Makeathon. In Overcoming Current Challenges in the P-12
Teaching Profession (pp. 73-95). IGI Global.
Sainz Sujet, P. (2020). Review of Data Visualisation: A Handbook for Data Driven Design: Data
Visualisation. A Handbook for Data Driven Design, by Andy Kirk, LA, Sage
publications, 2019, 312 pp., $106.93 (hardcover), ISBN: 978-1-5264-6892-5. Structural
Equation Modeling: A Multidisciplinary Journal, 1-3.
Shanshan, H. (2020). Inspiration from STEM Education Research in American Primary and
Secondary Schools (No. 2063). EasyChair.
Simran, K. (2020, May 14). What is Data Analysis? Methods, Techniques & Tools. Hackr.io
blog. https://hackr.io/blog/what-is-data-analysis-methods-techniques-tools.

Tsupros, N., Kohler, R., & Hallinen, J. (2009). STEM education in Southwestern Pennsylvania:
Report of a project to identify the missing components. Unpublished Report. Pittsburgh,
PA: Intermediate Unit, 1.
Ugwuanyi, C. S., & Okeke, C. I. (2020). Determinants of university students’ interest in science,
technology, engineering and mathematics education in Nigeria: a case of a structural
equation modeling. International Journal of Mechanical and Production Engineering
Research and Development, 10(3), 6209–6218.
http://dx.doi.org/10.24247/ijmperdjun2020590
Umoh, S. W. (2016). Problems And Prospects Of Effective Science, Technology, Engineering
And Mathematics (Stem) Education Delivery In Nigeria. Knowledge Review, 35(1), 2-
13.
Zykina, A., Kaneva, O., & Sharun, I. (2020, January). Application of the descriptor approach for
clustering entities from education sector. In Journal of Physics: Conference Series (Vol.
1441, No. 1, p. 012184). IOP Publishing.

APPENDIX A

Lagos State Polytechnic Data Set

Lagos State University Data Set

Table 1

Panel A of Table 1

Chart A of Table 1

Table 2

Panel A of Table 2

Chart A of Table 2

Table 3

Panel B

Table 4

Chart B

Lagos State Polytechnic Data Analysis Results

Fig 7.1 Lagos State Polytechnic Correlation Results

Fig 7.2 Lagos State Polytechnic AIC Validation Results

Fig 7.3 Lagos State Polytechnic Linear Regression model

Fig 7.4 Lagos State Polytechnic Minimum and Maximum Accuracy

Lagos State University Data Analysis Results

Fig 7.1.1 Lagos State University Correlation Results

Fig 7.2.2 Lagos State University AIC Validation Results

Fig 7.3.3 Lagos State University Linear Regression model

Fig 7.4.4 Lagos State University Minimum and Maximum Accuracy

Graph 1 Lagos State Polytechnic Logistic Result

Graph 2 Lagos State University Logistic Result

Graph 3 Lagos State Polytechnic 3D Scatter plot

Graph 4 Lagos State University 3D Scatter plot

Graph 5 Lagos State Polytechnic Box Plot

Graph 6 Lagos State University Box Plot

Graph 7 Lagos State Polytechnic Density Plot

Graph 8 Lagos State University Density Plot

Graph 9 Lagos State Polytechnic Scatter Plot Matrix

Graph 10 Lagos State University Scatter Plot Matrix

APPENDIX B

# Lagos State Polytechnic Data Analysis

library(car)

library(readr)

library(e1071)

library(AICcmodavg)

# Read data from directory

setwd("C:/Users/Ogunbekun/Desktop/ProjectApp/app")

mydata <- read_csv("laspotechfinal.csv")

View(mydata)

mydata

# Correlation Analysis

# Pearson Correlation

cor(mydata$`TOTAL MALE`, mydata$`TOTAL FEMALE`)

cor(mydata$`ACADEMIC STAFF MALE`, mydata$`TOTAL FEMALE`)

cor(mydata$`ACADEMIC STAFF

FEMALE`, mydata$`TOTAL FEMALE`)

cor(mydata$`NON-ACADEMIC

MALE`, mydata$`TOTAL FEMALE`)

cor(mydata$`NON-ACADEMIC

FEMALE`, mydata$`TOTAL FEMALE`)

# Spearman Correlation

cor(mydata$`TOTAL MALE`, mydata$`TOTAL FEMALE`, method = c("spearman"))

cor(mydata$`ACADEMIC STAFF MALE`, mydata$`TOTAL FEMALE`, method =

c("spearman"))

cor(mydata$`ACADEMIC STAFF

FEMALE`, mydata$`TOTAL FEMALE`, method = c("spearman"))

cor(mydata$`NON-ACADEMIC

MALE`, mydata$`TOTAL FEMALE`, method = c("spearman"))

cor(mydata$`NON-ACADEMIC

FEMALE`, mydata$`TOTAL FEMALE`, method = c("spearman"))

# Kendall Correlation

cor(mydata$`TOTAL MALE`, mydata$`TOTAL FEMALE`, method = c("kendall"))

cor(mydata$`ACADEMIC STAFF MALE`, mydata$`TOTAL FEMALE`, method =

c("kendall"))

cor(mydata$`ACADEMIC STAFF

FEMALE`, mydata$`TOTAL FEMALE`, method = c("kendall"))

cor(mydata$`NON-ACADEMIC

MALE`, mydata$`TOTAL FEMALE`, method = c("kendall"))

cor(mydata$`NON-ACADEMIC

FEMALE`, mydata$`TOTAL FEMALE`, method = c("kendall"))

# Model Validation AIC test

female.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`TOTAL MALE`, data = mydata)

feacad.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`ACADEMIC STAFF

FEMALE`, data = mydata)

maacd.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`ACADEMIC STAFF MALE`, data =

mydata)

menonacad.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`NON-ACADEMIC

MALE`, data = mydata)

fenonacad.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`NON-ACADEMIC

FEMALE`, data = mydata)

fecat.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$CATEGORIES, data = mydata)

fesch.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$SCHOOLS, data = mydata)

combine1.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`ACADEMIC STAFF MALE` +

mydata$`ACADEMIC STAFF

FEMALE` + mydata$`TOTAL MALE`, data = mydata)

combine2.mod <- lm(mydata$`TOTAL FEMALE` ~ mydata$`NON-ACADEMIC

MALE` + mydata$`NON-ACADEMIC

FEMALE` + mydata$`TOTAL MALE`, data = mydata)

combine3.mod <- lm(mydata$`TOTAL FEMALE` ~ CATEGORIES + SCHOOLS + `TOTAL MALE`, data = mydata)

combine4.mod <- lm(mydata$`TOTAL FEMALE` ~ CATEGORIES + SCHOOLS + `TOTAL MALE` +
  mydata$`ACADEMIC STAFF MALE` + mydata$`ACADEMIC STAFF
FEMALE`, data = mydata)

# Final Model Test

models <- list(female.mod, feacad.mod, maacd.mod, menonacad.mod, fenonacad.mod,

fecat.mod, fesch.mod,

combine1.mod, combine2.mod, combine3.mod, combine4.mod)

model.names <- c("female.mod", "feacad.mod", "maacd.mod", "menonacad.mod",

"fenonacad.mod", "fecat.mod", "fesch.mod",

"combine1.mod", "combine2.mod", "combine3.mod", "combine4.mod")

aictab(cand.set = models, modnames = model.names)

# Linear Regression Model

fecat.mod

summary(fecat.mod)

set.seed(100)

trainingrowindex <- sample(1:nrow(mydata), 0.8*nrow(mydata))

trainingdata <- mydata[trainingrowindex, ]

testdata <- mydata[-trainingrowindex, ]

femalepred <- predict(fecat.mod, testdata)

summary(fecat.mod)

AIC (fecat.mod)

actual_preds <- data.frame(cbind(actuals=testdata$`TOTAL FEMALE`,

predicteds=femalepred))

correlation_accuracy <- cor(actual_preds)

head(actual_preds)

actual_preds

min_max_accuracy <- mean(apply(actual_preds, 1, min) / apply(actual_preds, 1, max))

mape <- mean(abs((actual_preds$predicteds - actual_preds$actuals))/actual_preds$actuals)

min_max_accuracy

mape

correlation_accuracy

# Visualization of data

scatter.smooth(x = mydata$`TOTAL MALE`, y = mydata$`TOTAL FEMALE`, main

="`TOTAL MALE` ~ `TOTAL FEMALE`", col="blue")

par(mfrow=c(1,2))

boxplot(mydata$`TOTAL MALE`, main ="male", sub = paste("Outlier rows: ",

boxplot.stats(mydata$`TOTAL MALE`)$out))

boxplot(mydata$`TOTAL FEMALE`, main ="female", sub = paste("Outlier rows: ",

boxplot.stats(mydata$`TOTAL FEMALE`)$out))

# Density plot

par(mfrow=c(1,2))

# Titles and labels are arguments of plot(), not of density()

plot(density(mydata$`TOTAL MALE`), main="Density plot: male", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(mydata$`TOTAL MALE`), 2)))

polygon(density(mydata$`TOTAL MALE`), col = "red")

plot(density(mydata$`TOTAL FEMALE`), main="Density plot: female", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(mydata$`TOTAL FEMALE`), 2)))

polygon(density(mydata$`TOTAL FEMALE`), col = "red")

# Scatter Matrix

pairs(~`TOTAL FEMALE`+`TOTAL MALE`+`ACADEMIC STAFF MALE`+ `ACADEMIC STAFF
FEMALE`,data=mydata,main="Simple Scatterplot Matrix")

# 3D Scatter Plot

Staff_male <- mydata$`ACADEMIC STAFF MALE`

Student_female <- mydata$`TOTAL FEMALE`

Student_male <- mydata$`TOTAL MALE`

Staff_female <- mydata$`ACADEMIC STAFF

FEMALE`

scatter3d(x=Staff_male, y=Student_female, z=Staff_female)

# Lagos State University Data Analysis

library(car)

library(readr)

library(e1071)

library(AICcmodavg)

# Read data from directory

setwd("C:/Users/Ogunbekun/Desktop/ProjectApp/app")

lasufinalcopy <- read_csv("lasufinalcopy.csv")

View(lasufinalcopy)

# Correlation Analysis

# Pearson Correlation

cor(lasufinalcopy$`TOTAL MALE`, lasufinalcopy$`TOTAL FEMALE`)

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF MALE`)

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`)

# Spearman Correlation

cor(lasufinalcopy$`TOTAL MALE`, lasufinalcopy$`TOTAL FEMALE`, method =

c("spearman"))

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF MALE`,

method = c("spearman"))

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`,

method = c("spearman"))

# Kendall Correlation

cor(lasufinalcopy$`TOTAL MALE`, lasufinalcopy$`TOTAL FEMALE`, method = c("kendall"))

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF MALE`,

method = c("kendall"))

cor(lasufinalcopy$`TOTAL FEMALE`, lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`,

method = c("kendall"))

# Model Validation AIC test

female.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$`TOTAL MALE`, data =

lasufinalcopy)

menonacad.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$`NON-ACADEMIC STAFF MALE`,
  data = lasufinalcopy)

fenonacad.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`,
  data = lasufinalcopy)

fecat.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$CATEGORIES, data =

lasufinalcopy)

fesch.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$DEPARTMENTS, data =

lasufinalcopy)

combine1.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$`TOTAL MALE` +
  lasufinalcopy$`NON-ACADEMIC STAFF MALE` +
  lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`, data = lasufinalcopy)

combine2.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ lasufinalcopy$`TOTAL MALE` +

lasufinalcopy$CATEGORIES + lasufinalcopy$DEPARTMENTS, data = lasufinalcopy)

combine3.mod <- lm(lasufinalcopy$`TOTAL FEMALE` ~ CATEGORIES + DEPARTMENTS

+ `TOTAL MALE` + lasufinalcopy$`NON-ACADEMIC STAFF MALE` +

lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`, data = lasufinalcopy)

# Final Model Test

models <- list(female.mod, menonacad.mod, fenonacad.mod, fecat.mod, fesch.mod,

combine1.mod, combine2.mod, combine3.mod)

model.names <- c("female.mod", "menonacad.mod", "fenonacad.mod", "fecat.mod",

"fesch.mod",

"combine1.mod", "combine2.mod", "combine3.mod")

aictab(cand.set = models, modnames = model.names)

# Linear Regression Model

set.seed(100)

trainingrowindex <- sample(1:nrow(lasufinalcopy), 0.8*nrow(lasufinalcopy))

trainingdata <- lasufinalcopy[trainingrowindex, ]

testdata <- lasufinalcopy[-trainingrowindex, ]

femalepred <- predict(combine2.mod, testdata)

summary(combine2.mod)

AIC (combine2.mod)

actual_preds <- data.frame(cbind(actuals=testdata$`TOTAL FEMALE`,

predicteds=femalepred))

correlation_accuracy <- cor(actual_preds)

head(actual_preds)

min_max_accuracy <- mean(apply(actual_preds, 1, min) / apply(actual_preds, 1, max))

mape <- mean(abs((actual_preds$predicteds - actual_preds$actuals))/actual_preds$actuals)

min_max_accuracy

mape

correlation_accuracy

# Visualization of data

# Scatter Plot

scatter.smooth(x = lasufinalcopy$`TOTAL MALE`, y = lasufinalcopy$`TOTAL FEMALE`,

main ="male ~ female")

# Box Plot

par(mfrow=c(1,2))

boxplot(lasufinalcopy$`TOTAL MALE`, main ="male", sub = paste("Outlier rows: ",

boxplot.stats(lasufinalcopy$`TOTAL MALE`)$out))

boxplot(lasufinalcopy$`TOTAL FEMALE`, main ="female", sub = paste("Outlier rows: ",

boxplot.stats(lasufinalcopy$`TOTAL FEMALE`)$out))


# Density plot

par(mfrow=c(1,2))

# Titles and labels are arguments of plot(), not of density()

plot(density(lasufinalcopy$`TOTAL MALE`), main="Density plot: male", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(lasufinalcopy$`TOTAL MALE`), 2)))

polygon(density(lasufinalcopy$`TOTAL MALE`), col = "red")

plot(density(lasufinalcopy$`TOTAL FEMALE`), main="Density plot: female", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(lasufinalcopy$`TOTAL FEMALE`), 2)))

polygon(density(lasufinalcopy$`TOTAL FEMALE`), col = "red")

# Scatter Matrix

pairs(~`TOTAL FEMALE`+`TOTAL MALE`+`NON-ACADEMIC STAFF MALE`+`NON-ACADEMIC STAFF FEMALE`,
      data=lasufinalcopy,main="Simple Scatterplot Matrix")

# 3D Scatter Plot

Staff_male <- lasufinalcopy$`NON-ACADEMIC STAFF MALE`

Student_female <- lasufinalcopy$`TOTAL FEMALE`

Student_male <- lasufinalcopy$`TOTAL MALE`

Staff_female <- lasufinalcopy$`NON-ACADEMIC STAFF FEMALE`

scatter3d(x=Staff_male, y=Student_female, z=Staff_female)

# Lagos State University Logistic Regression Analysis

setwd("C:/Users/Ogunbekun/Desktop/PROJECTDATACOLLECTION")

library(readr)

library(car) # provides vif(), used below

lasufinalcopy <- read_csv("lasufinalcopy.csv")

View(lasufinalcopy)

table(lasufinalcopy$ACTION)

input_ones <- lasufinalcopy[which(lasufinalcopy$ACTION == 1), ]

input_zeros <- lasufinalcopy[which(lasufinalcopy$ACTION == 0), ]

set.seed(100)

input_ones_training_rows <- sample(1:nrow(input_ones), 0.7*nrow(input_ones))

input_zeros_training_rows <- sample(1:nrow(input_zeros), 0.7*nrow(input_zeros))

training_ones <- input_ones[input_ones_training_rows, ]

training_zeros <- input_zeros[input_zeros_training_rows, ]

trainingdatab <- rbind(training_ones, training_zeros)

test_ones <- input_ones[-input_ones_training_rows, ]

test_zeros <- input_zeros[-input_zeros_training_rows, ]

testdatab <- rbind(test_ones, test_zeros)

library(smbinning)

factor_vars <- c("DEPARTMENT", "CATEGORIES", "YEAR", "ACTION")

# Column names are passed as plain strings, without backticks

continious_vars <- c("YEARI MALE", "YEARI FEMALE", "YEARI TOTAL",
                     "YEARII MALE", "YEARII FEMALE", "YEARII TOTAL",
                     "YEARIII MALE", "YEARIII FEMALE", "YEARIII TOTAL",
                     "YEARIV MALE", "YEARIV FEMALE", "YEARIV TOTAL",
                     "YEARV MALE", "YEARV FEMALE", "YEARV TOTAL",
                     "YEARVI MALE", "YEARVI FEMALE", "YEARVI TOTAL",
                     "TOTAL MALE", "TOTAL FEMALE", "TOTAL SUM",
                     "NON-ACADEMIC STAFF MALE", "NON-ACADEMIC STAFF FEMALE")

iv_df <- data.frame(vars = c(factor_vars, continious_vars), IV = numeric(27))

# Information value of each categorical variable

for (factor_var in factor_vars) {
  smb <- smbinning.factor(trainingdatab, y="ACTION", x=factor_var)
  # A character result is a message that binning failed, so it is skipped
  if (!is.character(smb)) {
    iv_df[iv_df$vars == factor_var, "IV"] <- smb$iv
  }
}

# Information value of each continuous variable

for (continious_var in continious_vars) {
  smb <- smbinning(trainingdatab, y="ACTION", x=continious_var)
  if (!is.character(smb)) {
    iv_df[iv_df$vars == continious_var, "IV"] <- smb$iv
  }
}

# Sort the variables by decreasing information value

iv_df <- iv_df[order(-iv_df$IV), ]

iv_df

logitMod <- glm(ACTION ~`TOTAL MALE` + `TOTAL FEMALE` + `TOTAL SUM` +

`NON-ACADEMIC STAFF MALE` +

`NON-ACADEMIC STAFF FEMALE`, data = trainingdatab, family = binomial(link

= "logit"))

predicted <- plogis(predict(logitMod, testdatab))

library(InformationValue)

optCutOff <- optimalCutoff(testdatab$ACTION, predicted)[1]

summary(logitMod)

vif(logitMod)

misClassError(testdatab$ACTION, predicted, threshold = optCutOff)

library(plotROC)

plotROC(testdatab$ACTION, predicted)

sensitivity(testdatab$ACTION, predicted, threshold = optCutOff)

specificity(testdatab$ACTION, predicted, threshold = optCutOff)

confusionMatrix(testdatab$ACTION, predicted, threshold = optCutOff)
