
Content 1

The document outlines a project analyzing the Citi Bike system using real-world trip data to identify usage patterns and user behavior through data science and machine learning techniques. It details the project's aims, methodologies including data preprocessing, exploratory data analysis, feature engineering, and predictive modeling, as well as the tools used such as Python and Power BI. The project seeks to enhance urban mobility by providing actionable insights for stakeholders while addressing challenges in existing bike-sharing systems.

Uploaded by

Siva Ranjini H

NMKRV COLLEGE FOR WOMEN DEPARTMENT OF BCA

1. INTRODUCTION

The Introduction chapter lays the foundation for the project. It explains the motivation, scope, and
core objectives of the study. It helps the reader understand what the project is, why it's important,
and how it was approached. It also connects the problem to real-world needs and academic
research.

1.1 Aim

The primary aim of this project is to conduct a comprehensive analysis of the Citi Bike system
using real-world trip data, with the objective of identifying key usage patterns, understanding user
behavior, and building predictive models. This includes:

• Analyzing ride trends by user type.
• Forecasting trip duration using linear regression.
• Classifying user types (member vs. casual) using logistic regression.
• Visualizing key insights through interactive dashboards using Power BI.

The insights derived from this study aim to support stakeholders — including urban planners,
transportation agencies, and service operators — in enhancing operational efficiency, improving
user experience, and promoting sustainable urban mobility.

As cities worldwide face challenges related to traffic congestion, environmental concerns, and
last-mile connectivity, bike-sharing services offer a practical, low-emission alternative. Projects
like this empower decision-makers by converting raw data into actionable intelligence.

1.2 Project Description

Citi Bike is a large-scale, dock-based bike-sharing service operating in New York City and
neighboring areas. The dataset used in this project (citibike newww.csv) includes hundreds of
thousands of anonymized trip records, with features such as:

• Trip Start/End Timestamps
• Start/End Station Coordinates
• Ride Type (classic/electric)
• User Type (member/casual)

PROJECT 1 2024-2025

This project analyzes the dataset through the lens of data science and machine learning by
building a data pipeline that includes:

1. Data Preprocessing
The data is cleaned by:

• Parsing timestamp columns (started_at, ended_at) into datetime objects.
• Handling missing values (for example, by backfilling) and removing duplicate entries.
• Creating new features like trip_duration_minutes.

2. Exploratory Data Analysis (EDA)

Using both Python (Seaborn, Matplotlib) and Power BI, the data is explored to uncover:

• Popular ride times and days
• Geographic ride distribution
• Differences in usage between casual users and members
• Preferred ride types (electric vs. classic)

Key visuals include heatmaps, boxplots, bar graphs, and geospatial maps created using Power BI
for interactive insight.
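
As a rough illustration of how the "popular ride times and days" view might be tabulated before it is drawn as a heatmap, the sketch below counts rides per hour of day and day of week with pandas. The three-row sample is made up for illustration, the column name started_at follows the dataset description, and the helper name hourly_weekday_counts is hypothetical rather than taken from the project notebooks.

```python
import pandas as pd


def hourly_weekday_counts(trips: pd.DataFrame) -> pd.DataFrame:
    """Count rides per (hour of day, day of week).

    Rows are hours (0-23 where present), columns are weekday names.
    """
    started = pd.to_datetime(trips["started_at"])
    return pd.crosstab(
        started.dt.hour,
        started.dt.day_name(),
        rownames=["hour"],
        colnames=["weekday"],
    )


# Made-up sample rows; 2024-01-01 was a Monday.
sample = pd.DataFrame({
    "started_at": ["2024-01-01 08:05", "2024-01-01 08:40", "2024-01-02 17:15"],
})
counts = hourly_weekday_counts(sample)
print(counts)
```

A table of this shape could then be passed to seaborn.heatmap or exported for use in Power BI.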

3. Feature Engineering

Categorical variables such as member_casual and rideable_type are label encoded. Continuous
features like latitude and longitude are standardized or normalized for use in machine learning
models.
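
A minimal sketch of this step, assuming scikit-learn's LabelEncoder and StandardScaler are the tools used (the report names the techniques but not the exact classes). The sample values and the column names start_lat and start_lng are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up sample rows for illustration only.
trips = pd.DataFrame({
    "member_casual": ["member", "casual", "member", "casual"],
    "rideable_type": ["classic", "electric", "electric", "classic"],
    "start_lat": [40.71, 40.73, 40.75, 40.77],
    "start_lng": [-74.00, -73.99, -73.98, -73.97],
})

# Label-encode each categorical column; classes are numbered in sorted order.
encoders = {}
for col in ["member_casual", "rideable_type"]:
    encoders[col] = LabelEncoder()
    trips[col + "_code"] = encoders[col].fit_transform(trips[col])

# Standardize coordinates to zero mean and unit variance.
scaled = StandardScaler().fit_transform(trips[["start_lat", "start_lng"]])
trips["start_lat_std"] = scaled[:, 0]
trips["start_lng_std"] = scaled[:, 1]

print(trips[["member_casual_code", "rideable_type_code", "start_lat_std"]])
```

Keeping the fitted encoders in a dictionary makes it possible to map predicted codes back to the original labels with inverse_transform.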

4. Machine Learning Modeling

Two main predictive modeling approaches are applied:

• Linear Regression (from Linear Regression New.ipynb) to predict trip_duration_minutes
• Logistic Regression (from MainProjectNew.ipynb) to classify user types

Model performance is assessed using standard metrics:

• For Linear Regression: MSE, R²
• For Logistic Regression: Accuracy, Precision, Recall


5. Visualization (Power BI)

Dashboards created in Power BI (MainProject.pbix) present key insights interactively, including:

• Daily and hourly trip distribution
• Station-wise ride volume
• Ride type usage patterns
• User segmentation breakdown

These visualizations bridge technical results and decision-maker understanding, allowing
stakeholders to explore trends without needing code.

1.3 Highlights

This chapter established the foundation of the Citi Bike analysis project by outlining its
motivation, goals, and scope. The project aims to harness real-world trip data to uncover user
behavior patterns, predict trip durations, and classify user types using machine learning
techniques. By integrating analytical tools like Python and Power BI, the project transforms raw
data into meaningful insights that can inform policy-making and service optimization. The
approach is both data-driven and practically oriented, addressing modern urban mobility
challenges through the lens of technology.

With a structured pipeline, from data preprocessing to visualization, the project not only
highlights the potential of shared mobility systems but also provides a replicable framework for
similar urban transportation studies. The next chapter delves into existing research and related
work in the domain of bike-sharing systems, predictive modeling, and urban mobility analytics.


2. LITERATURE SURVEY

A literature survey is a structured review of existing knowledge, research, tools, systems, and
technologies related to a particular topic or project. In other words, it is the background
research done before starting a project. It helps to:

• Understand what has already been done in the field
• Identify gaps or limitations in existing approaches
• Justify why the project is necessary or valuable
• Learn from past methods, tools, and systems that are relevant to the solution

2.1 Existing and Proposed System


1. Existing System

Modern urban transportation systems increasingly adopt bike-sharing programs to reduce
congestion, promote eco-friendly commuting, and support last-mile connectivity. One such
system is Citi Bike in New York City, which provides public datasets of all trips taken. However,
the current system only offers basic analytics such as:

• Trip counts by hour/day
• Station popularity
• Membership breakdown

These metrics are descriptive, not predictive, and often presented via static reports. There’s
minimal use of machine learning or AI techniques to forecast demand or understand user behavior
dynamically.

Additionally, many city reports rely on manual planning decisions without integrating data-driven
tools like Power BI dashboards or regression analysis.

This study highlights how existing systems focus on fleet deployment and user convenience, but
often lack analytical depth and prediction models for usage patterns or operational insights.


2. Proposed System

The proposed system builds a data-driven decision support framework that improves upon
existing analytics in the following key areas:

Area | Traditional System | Citi Bike Trip Data
Analysis Type | Descriptive only | Descriptive + Predictive
User Insight | Count-based reports | Behavioral classification via Logistic Regression
Forecasting | Absent | Trip duration prediction via Linear Regression
Visualization | Static charts | Dynamic dashboards (Power BI)
Data Usage | Limited to raw insights | Full preprocessing, feature engineering, ML modeling

Table 2.1 Proposed System

With Python notebooks and Power BI dashboards, this project introduces:

• Trip duration predictions using linear and logistic regression
• User-type classification using logistic regression
• Interactive dashboards for operational decisions

It integrates machine learning, data visualization, and real-world urban mobility data — a
combination rarely seen in current city dashboards.

2.2 Feasibility Study

The feasibility study assesses whether the Citi Bike data analysis and machine learning project
is practical to implement with the available resources, tools, and data within the project
timeline. This evaluation considers the following key aspects:


1. Technical Feasibility

• The dataset (citibike newww.csv) is clean, structured, and includes useful features like start
  time, end time, station info, and user type.
• Models implemented in Python notebooks (e.g., MainProjectNew.ipynb, Linear Regression
  New.ipynb) are based on standard libraries (e.g., scikit-learn, pandas, seaborn).
• Visualizations built in Power BI (MainProject.pbix) are highly interactive and suitable for
  decision-makers with no coding background.

2. Economic Feasibility

• All core tools (Python, Jupyter, Power BI Desktop) are free or open-source.
• No paid software, data licensing, or cloud resources are required.
• The solution runs on standard computing infrastructure.

3. Operational Feasibility

• The system is user-friendly and designed for real-world use (e.g., transit planning, city analytics).
• Outputs are clear and interpretable for city stakeholders and data analysts.
• The system can be easily expanded to include more stations or time frames.

2.3 Tools and Technology used

This project uses a modern tech stack optimized for data science, modeling, and visualization.


Technologies

Area | Tool/Technology | Purpose
Language | Python 3.x | Programming, analysis, modeling
Notebook | Jupyter | Interactive coding environment
ML Libraries | scikit-learn, pandas, numpy | Regression models, preprocessing
Visualization | matplotlib, seaborn, Power BI | Graphs, dashboards
Data | CSV files (Citi Bike data) | Real-world trip data
Modeling | Linear and Logistic Regression | Prediction & classification
Power BI | MainProject.pbix | Business-level dashboards

Table 2.2 Technologies

All technologies are widely supported and have large communities and documentation.

2.4 Hardware and Software Requirements

Hardware Requirements

Component | Minimum Requirement
Processor | Intel i3 or above
RAM | 4 GB minimum (8 GB recommended)
Storage | 2–5 GB free disk space
OS | Windows 11

Table 2.3 Hardware Requirements


Software Requirements

Software | Version/Type
Python | 3.8+
Jupyter Notebook | Any IDE or Anaconda
Power BI Desktop | Latest free version
Libraries | scikit-learn, matplotlib, pandas, seaborn
File Types | .csv, .ipynb, .pbix

Table 2.4 Software Requirements

No proprietary tools or expensive infrastructure are needed. Everything can run on a student or
professional machine.

2.5 Highlights

This chapter provided a comprehensive review of the current state of bike-sharing systems,
focusing on Citi Bike as a case study. It highlighted the limitations of existing systems, which
mostly rely on descriptive analytics, and justified the need for predictive and interactive approaches.
The proposed system enhances analytical depth by incorporating machine learning techniques and
dynamic visualization tools such as Power BI.

The feasibility study confirmed that the project is realistic and implementable with available data,
tools, and infrastructure. The use of open-source technologies and minimal hardware requirements
further strengthens its practicality. The tools and technologies adopted—ranging from Python and
Jupyter to Power BI—are well-established in the data science ecosystem, ensuring long-term
support and scalability.

Having established the background, feasibility, and technological framework of the project, the
next chapter will define the Software Requirements Specifications (SRS). It will outline both the
functional and non-functional requirements essential for successful implementation.


3. SOFTWARE REQUIREMENTS SPECIFICATION (SRS)

The Software Requirements Specification (SRS) defines both the functional and non-functional
expectations for the system. It acts as a blueprint for developers, data scientists, stakeholders, and
users to ensure clarity and successful implementation.

3.1 Functional Requirements

Functional requirements define what the system should do — the features and operations it must
support. For this project, the system includes data analysis, machine learning prediction, user
classification, and visualization.

Key Functional Requirements:

ID | Requirement | Description
FR1 | Data Ingestion | The system must import and preprocess the Citi Bike dataset (citibike newww.csv).
FR2 | Trip Duration Prediction | Implement a Linear Regression model to predict how long a trip will take based on input features like age, gender, and start station. (Implemented in Linear Regression New.ipynb)
FR3 | User Type Classification | Implement a Logistic Regression model to classify users as either Subscribers or Customers based on trip behavior. (Implemented in MainProjectNew.ipynb)
FR4 | Data Cleaning & Feature Engineering | Clean missing values, convert date-time columns, and create derived features like age and trip duration.
FR5 | Visualization Dashboard | Design an interactive Power BI dashboard (MainProject.pbix) showing metrics like peak trip hours, gender distribution, popular stations, etc.
FR6 | Output Interpretation | Provide visual feedback on model performance using confusion matrix (logistic) and R² score (linear regression).

Table 3.1 Key Functional Requirements


3.2 Non-Functional Requirements

Non-functional requirements define how the system performs, rather than what it does. These
requirements address the quality attributes such as usability, performance, and maintainability.

Key Non-Functional Requirements:

ID | Requirement | Description
1 | Usability | The system (especially Power BI reports) must be intuitive and user-friendly, suitable for non-technical stakeholders.
2 | Performance | The models must process typical-sized datasets (~100,000+ rows) within a reasonable time (under 30 seconds for prediction tasks).
3 | Portability | The Python notebooks and dashboards must run on any standard Windows 10 or above machine without high-end hardware.
4 | Scalability | The system should allow the dataset to be extended with new time periods or cities without major code changes.
5 | Accuracy | Regression and classification models should reach reasonable accuracy (e.g., R² > 0.7 for linear; accuracy > 80% for logistic).
6 | Maintainability | Code should be modular, documented, and reusable for future enhancements (e.g., adding SVM or decision trees).

Table 3.2 Key Non-Functional Requirements
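
The performance requirement (row 2) could be checked with a simple timing harness along the following lines. This is only a sketch: the 100,000-row feature matrix is synthetic, the four features and their coefficients are invented, and the actual notebooks may measure time differently.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for ~100,000 trips with four numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100_000)

model = LinearRegression().fit(X, y)

# Time only the prediction step, as the requirement targets prediction tasks.
start = time.perf_counter()
predictions = model.predict(X)
elapsed = time.perf_counter() - start

within_budget = elapsed < 30.0  # 30-second budget from the requirement table
print(f"Predicted {len(predictions)} rows in {elapsed:.3f} s; within budget: {within_budget}")
```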

3.3 Highlights

In the Citi Bike project, the functional requirements focus on loading, processing, analyzing,
modeling, and visualizing using Python and Power BI. The non-functional requirements ensure
that the system is accurate, scalable, and easy to use. Together, these requirements form a robust
foundation for implementation and evaluation.


4. SYSTEM DESIGN AND ANALYSIS

System design and analysis is the phase of project development in which the intended behavior of
a system is analyzed and its structure, components, and workflows are designed. It is like
creating a blueprint before building the actual system.

4.1 System Perspective

The system designed for this Citi Bike trip data project is a data-driven analytical pipeline for
understanding and predicting patterns in a bike-sharing service (here, Citi Bike). It integrates
several stages, including:

1. Data Collection

The system ingests real-world data from Citi Bike (citibike newww.csv), which includes details
such as ride duration, bike type, station info, and user category (member vs. casual).

2. Data Preprocessing

Data preprocessing was implemented in Python notebooks (MainProjectNew.ipynb), where
missing values are handled, timestamps are converted to datetime format, and categorical
variables are encoded.
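
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the sample rows are made up, the function name preprocess is hypothetical, missing values are dropped here (imputation is the other option the report mentions), and the is_member column is an illustrative encoding.

```python
import pandas as pd


def preprocess(trips: pd.DataFrame) -> pd.DataFrame:
    """Clean raw trip rows and derive trip_duration_minutes."""
    out = trips.drop_duplicates().copy()

    # Convert timestamp strings to datetime objects.
    out["started_at"] = pd.to_datetime(out["started_at"])
    out["ended_at"] = pd.to_datetime(out["ended_at"])

    # Handle missing values by dropping rows without an end station.
    out = out.dropna(subset=["end_station_name"])

    # Derived feature: trip duration in minutes.
    delta = out["ended_at"] - out["started_at"]
    out["trip_duration_minutes"] = delta.dt.total_seconds() / 60

    # Simple binary encoding of the user type.
    out["is_member"] = (out["member_casual"] == "member").astype(int)
    return out.reset_index(drop=True)


# Made-up sample: one duplicate row and one row with a missing end station.
raw = pd.DataFrame({
    "started_at": ["2024-01-01 10:00", "2024-01-01 10:00", "2024-01-01 11:00"],
    "ended_at": ["2024-01-01 10:20", "2024-01-01 10:20", "2024-01-01 11:05"],
    "end_station_name": ["Station A", "Station A", None],
    "member_casual": ["member", "member", "casual"],
})
clean = preprocess(raw)
print(clean)
```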

3. Exploratory Data Analysis (EDA)

EDA was conducted using visualization libraries (matplotlib, seaborn) to identify patterns such as
peak usage hours, popular stations, and user type behavior.

4. Modeling and Prediction

• Logistic Regression: Used to classify user types (member vs. casual)
• Linear Regression: Used to predict trip duration (trip_duration_minutes)
• Both models are implemented in the respective notebooks with scikit-learn.


5. Dashboard and Reporting

Using Power BI (MainProject.pbit), insights are visualized across:

• Hourly ride patterns
• Member vs. casual analysis

6. Decision Support

The final output of the Citi Bike analysis helps business stakeholders understand customer usage,
optimize resource allocation, and increase profitability.


4.2 Data Flow Diagram

A Data Flow Diagram (DFD) is a graphical tool used to represent the flow of data within a system.
It shows how input data is transformed into output through various processes, and how data moves
between external entities, system components, and data stores.

Figure 4.1 Data Flow Diagram (diagram labels: User, Citi Bike Trip Data)


Description:

• External Entities: Users and stakeholders
• System: Processes data for predictions and visualizations
• Data Store: .csv and .ipynb act as storage and processing modules
• Output: Predictions, Insights, Visual Dashboards

4.3 Highlights

The system design and analysis phase provided a structured blueprint for developing a robust and
intelligent Citi Bike Trip data pipeline. From data collection to decision support, each stage was
carefully planned and implemented using industry-standard tools like Python, scikit-learn, and
Power BI. The integration of data preprocessing, exploratory analysis, machine learning models,
and visual dashboards ensures a seamless and insightful flow of information. The Data Flow
Diagram clearly illustrates how data moves through the system, from user input to final insights.
Overall, the system is designed to be scalable, data-driven, and decision-oriented, supporting
meaningful business analysis and prediction in urban bike-sharing systems.


5. DETAILED DESIGN

5.1 Use Case Diagram

A Use Case Diagram is a visual representation that shows how users (actors) interact with a
system to achieve specific goals. It helps identify system functionalities from the user’s
perspective and clarifies user-system relationships.

Figure 5.1 Use Case Diagram (diagram label: Admin)


This Use Case Diagram represents the interaction between two main actors, Admin and
Stakeholder, and the system processes involved in a bike-sharing analytics project. The Admin
initiates the workflow by uploading CSV data into the system. Once the data is uploaded, it
undergoes data preprocessing to clean and transform the input. The processed data is then used to
train and evaluate machine learning models, such as logistic and linear regression. The results are
visualized in Power BI, which is accessed by both Admin and Stakeholders to explore insights and
observe trends in bike usage, rider behavior, and operational performance. This diagram
highlights the user-driven flow from raw data upload to final trend analysis.


5.2 Sequence Diagram

A Sequence Diagram is a type of UML diagram that shows the order of interactions between
system components or objects over time. It visualizes how processes or functions communicate
through messages to complete a task, step by step.

Figure 5.2 Sequence Diagram (messages: Load Dataset, Preprocessing, Train Models, Results, Visualizations)

This Sequence Diagram illustrates the interaction between the User and the Google Colab
Notebook during the data analysis and machine learning workflow. The process begins with the
user initiating the system by uploading or loading the dataset. This is followed by the
preprocessing phase, where the data is cleaned and transformed; the results of this step are then
sent back to the user for review or confirmation. Next, the system moves into the model training
phase, where algorithms such as logistic and linear regression are applied. The results of the
models, including evaluation metrics like accuracy and R² score, are returned to the user. Finally,
the system generates visualizations, including charts and graphs, which are displayed back to
the user for interpretation and decision-making. This sequence ensures a smooth and interactive
flow between data operations and user feedback.
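
The "train models" and "results" steps of this sequence might look roughly like the sketch below, which fits both model types with scikit-learn and collects the metrics named in this report (MSE and R² for linear regression; accuracy, precision, and recall for logistic regression). The data is synthetic and the variable names are illustrative, so the printed scores say nothing about the real Citi Bike results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic features and targets standing in for the preprocessed trip data.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
duration = 10 + 4 * X[:, 0] + rng.normal(size=500)            # minutes
is_member = (X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

X_tr, X_te, d_tr, d_te, m_tr, m_te = train_test_split(
    X, duration, is_member, test_size=0.2, random_state=42
)

# Linear regression: predict trip duration.
reg = LinearRegression().fit(X_tr, d_tr)
d_pred = reg.predict(X_te)
regression_metrics = {
    "mse": mean_squared_error(d_te, d_pred),
    "r2": r2_score(d_te, d_pred),
}

# Logistic regression: classify member (1) vs. casual (0).
clf = LogisticRegression().fit(X_tr, m_tr)
m_pred = clf.predict(X_te)
classification_metrics = {
    "accuracy": accuracy_score(m_te, m_pred),
    "precision": precision_score(m_te, m_pred),
    "recall": recall_score(m_te, m_pred),
}

print(regression_metrics)
print(classification_metrics)
```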


5.3 Activity Diagram

An Activity Diagram is a UML diagram that illustrates the workflow of actions or activities in a
system. It represents the sequence of steps, decisions, and parallel processes involved in
performing a task or function.

Figure 5.3 Activity Diagram

This Activity Diagram represents the step-by-step workflow of the bike-sharing data analysis
project. The process begins with the acquisition of the Data Set, followed by the Data Cleaning
activity, where inconsistencies and missing values are handled. After cleaning, the workflow
proceeds to Exploratory Data Analysis (EDA), where the structure, trends, and key insights in the
data are uncovered. From there, the flow diverges into two parallel activities: Model Building for
Prediction, where machine learning models are developed to forecast ride durations or classify
users, and Dashboards, where the analyzed data is visualized using Power BI to present trends and
patterns. This diagram emphasizes the logical flow of tasks involved in preparing data, deriving
insights, and delivering predictive and visual outputs.


5.4 Class Diagram

A Class Diagram is a UML diagram that shows the static structure of a system by representing its
classes, attributes, methods, and the relationships between them. It is used to model the blueprint
of objects in object-oriented design.

Figure 5.4 Class Diagram (classes: User, Trip, Preprocessing; the Preprocessing class covers Cleaning & EDA)

This Class Diagram represents the object-oriented structure of the bike-sharing data analysis
system. It consists of three main classes: User, Trip, and Preprocessing. The User class includes
attributes like Ride_time, Station_Name, and Started_at, which describe when and where a user
begins a ride. The Trip class contains attributes such as Ride_id, Member_Casual, and Ride_type,
representing the trip's unique identifier, user membership status, and type of bike used. Both the
User and Trip classes are associated with the Preprocessing class, which handles operations such
as data cleaning and exploratory data analysis (EDA). This diagram captures the key components
and their relationships in the system, emphasizing how raw data is structured and processed before
being used for machine learning and visualization.
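
One way the three classes described above could be expressed in Python is sketched below. The attribute names follow the diagram description; the attribute types, the unit of ride_time, and the clean method are assumptions added for illustration and are not taken from the project code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class User:
    ride_time: float        # assumed to be minutes; the report does not say
    station_name: str
    started_at: datetime


@dataclass
class Trip:
    ride_id: str
    member_casual: str      # "member" or "casual"
    ride_type: str          # e.g. "classic" or "electric"


class Preprocessing:
    """Groups cleaning operations applied to trip records."""

    @staticmethod
    def clean(trips: List[Trip]) -> List[Trip]:
        """Hypothetical rule: keep trips with a ride_id and a known membership value."""
        valid = {"member", "casual"}
        return [t for t in trips if t.ride_id and t.member_casual in valid]


trips = [
    Trip("r1", "member", "classic"),
    Trip("", "casual", "electric"),     # dropped: empty ride_id
    Trip("r3", "guest", "classic"),     # dropped: unknown membership value
]
cleaned = Preprocessing.clean(trips)
print(cleaned)
```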


5.5 ER Diagram

An ER (Entity-Relationship) Diagram is a visual representation of entities (like users or trips) and
the relationships between them in a database. It is used to design and model the logical structure
of a database.

Figure 5.5 ER Diagram (entities: User with Started_at, Member_casual; Bike with Ride_id, Ride_type; start Station with Station_id, Station_start; end Station with Ended_at, Station_end; relationships: Has, Starts At, Destination At)

This is an Entity-Relationship (ER) Diagram that models the structure of a bike-sharing database
system. It illustrates the relationships between the entities User, Bike, and Station. The User entity
includes attributes like Started_at and Member_casual, representing the start time of the ride and
the rider’s membership status. Each user initiates a ride using a Bike, which has attributes such as
Ride_id and Ride_type. The relationship between User and Bike is shown through a connector
with a diamond symbol, indicating a logical association. The ride starts at one Station, defined by
Station_id and Station_start, and ends at another Station, represented by Station_end. The two
station entities represent the origin and destination of the trip. This ER diagram effectively
captures the lifecycle of a trip in the bike-sharing system, from user initiation and bike
assignment to station tracking.


5.6 Highlights

The collection of diagrams presented in Chapter 5 provides a comprehensive visual understanding
of the bike-sharing analytics system's design and operation. The Use Case Diagram outlines how
Admins and Stakeholders interact with the system to upload data, preprocess it, train models,
and visualize trends. The Sequence Diagram captures the step-by-step flow of data between the
user and the analysis environment (Google Colab notebook), highlighting key phases like data
loading, preprocessing, model training, and result interpretation. The Activity Diagram
illustrates the entire workflow, from raw dataset acquisition to parallel processes of model
building and dashboard creation. The Class Diagram breaks down the static structure of the system
into classes like User, Trip, and Preprocessing, showing how data is organized and processed.
Finally, the ER Diagram maps the relationships between real-world entities such as users, bikes,
and stations, providing a logical blueprint of the database. Together, these diagrams form a
detailed visual architecture that supports efficient data analysis, machine learning, and
decision-making within the project.


6. IMPLEMENTATION

This chapter describes the modular structure of the system and provides relevant visual evidence
generated using both Python-based libraries (such as Matplotlib and Seaborn) and Power BI
dashboards. These visuals reflect the key stages of data exploration, preprocessing, modeling, and
result presentation based on the Citi Bike dataset. The combined use of coding environments and
business intelligence tools ensures a comprehensive analytical workflow with both technical depth
and stakeholder-friendly insights.

6.1 Modules Description

The system is divided into distinct modules for easier implementation and debugging:

1. Data Acquisition Module

• Input: citibike newww.csv
• Purpose: Load and explore raw bike-sharing data.
• Key Operations:
  ◦ Checking for null values
  ◦ Basic statistics and data types

Figure 6.1 Data Load & Explore


2. Data Preprocessing Module

• Purpose: Clean and transform raw data into model-ready format.
• Key Operations:
  ◦ Encoding categorical variables (rideable_type, member_casual)
  ◦ Handling missing values
  ◦ Creating new features (trip duration, hour of day, etc.)

Figure 6.2 Data Preprocessing

3. Exploratory Data Analysis (EDA) Module

• Purpose: Visualize usage patterns and user behaviors.
• Tools: Matplotlib, Seaborn, Power BI
• Insights Included:
  ◦ Member vs. casual users
  ◦ Popular start stations
  ◦ Hourly ride patterns
  ◦ Correlation heat map


4. Model Building Module

• Logistic Regression:
  ◦ Goal: Predict whether a rider is a member or casual user
  ◦ Input Features: rideable_type, start_station, day_of_week, etc.
• Linear Regression:
  ◦ Goal: Predict trip duration
  ◦ Input Features: Hour of day, start/end station, user type, etc.

5. Dashboard Visualization Module

• Tool: Power BI (MainProject.pbit)
• Purpose: Interactive business insights for stakeholders
• Visuals Included:
  ◦ Ride counts
  ◦ Counts by user type
  ◦ Start/end station bar charts
  ◦ Hourly distribution of rides

6.2 Visualizations of Citi Bike Trip Data

Figure 6.3 Member vs Casual


Figure 6.4 Top 10 Popular Stations

Figure 6.5 Hour Rides


Figure 6.6 Correlation Matrix Heat Map

Figure 6.7 Linear Regression


Figure 6.8 Logistic Regression

Figure 6.9 KNN


Figure 6.10 Decision Tree

Figure 6.11 Power BI Dashboard Without Prediction


Figure 6.12 Power BI Dashboard With Prediction


7. SOFTWARE TESTING

Software testing is the process of evaluating a software application to identify and fix defects,
ensuring it meets the required functionality and quality standards.

Types of software testing include:

• Manual Testing (performed by humans without automation)
• Automated Testing (using tools/scripts)

Further categories include Unit Testing, Integration Testing, System Testing, and User
Acceptance Testing (UAT).

7.1 Type of Testing Performed in Citi Bike Trip Data

For the Citi Bike ride classification project, several types of software testing were performed to
ensure reliability, correctness, and performance of the model and associated data preprocessing
pipeline:

1. Unit Testing
• Purpose: To test individual functions such as duration calculation, label
  creation, and data cleaning steps.
• Example: Verifying that trip_duration_minutes is correctly computed
  from timestamps.
2. Integration Testing
• Purpose: To verify that combined components (data processing + model
  training + evaluation) work seamlessly.
• Example: Checking that the cleaned dataset passes correctly into the ML
  model and gives valid outputs.
3. System Testing
• Purpose: Testing the entire pipeline from CSV loading →
  preprocessing → model training → prediction.
• Example: Ensuring the notebook runs without errors and all steps execute in
  order.


4. Performance Testing
• Purpose: Measuring model performance in terms of accuracy and execution
  time.
• Tools: scikit-learn (classification_report and confusion_matrix).
• Results:
  ◦ Logistic Regression Accuracy: 71.2%
  ◦ Linear Regression Accuracy: 5.8% (Failed; not suitable for classification)
  ◦ Execution time: Acceptable for datasets of this size.

5. Error Handling Testing
• Purpose: Ensuring the system reacts gracefully to invalid or missing
  inputs.
• Example: Testing how the pipeline handles non-numeric values or missing
  columns.
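
The kind of input check that error handling testing exercises could be sketched as follows. The function name validate and the list of required columns are hypothetical; the sketch only shows one possible way to fail clearly on missing columns and to neutralize non-numeric values with pandas.

```python
import pandas as pd

# Illustrative subset of columns the pipeline is assumed to need.
REQUIRED_COLUMNS = ["started_at", "ended_at", "start_lat"]


def validate(trips: pd.DataFrame) -> pd.DataFrame:
    """Raise on missing columns; drop rows whose start_lat is not numeric."""
    missing = [c for c in REQUIRED_COLUMNS if c not in trips.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = trips.copy()
    # Non-numeric values such as "abc" become NaN instead of failing later.
    out["start_lat"] = pd.to_numeric(out["start_lat"], errors="coerce")
    return out.dropna(subset=["start_lat"]).reset_index(drop=True)


# Made-up rows: the second has an invalid latitude.
rows = pd.DataFrame({
    "started_at": ["2024-01-01 10:00", "2024-01-01 11:00"],
    "ended_at": ["2024-01-01 10:20", "2024-01-01 11:05"],
    "start_lat": ["40.71", "abc"],
})
checked = validate(rows)
print(checked)
```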


7.2 Test Cases and Results

The table below summarizes the test cases used for validation of the system:

| Test Case ID | Test Scenario | Input | Expected Output | Actual Output | Status |
|---|---|---|---|---|---|
| TC01 | Dataset loads correctly | citibike newww.csv | DataFrame with 1M rows, 15 columns | As expected | Pass |
| TC02 | Missing values handled | Missing end_station_name | Missing rows dropped or imputed | As expected | Pass |
| TC03 | trip_duration_minutes calculated | "2024-01-01 10:00" to "2024-01-01 10:20" | trip_duration_minutes = 20 | 20 | Pass |
| TC04 | Label long_ride created correctly | trip_duration_minutes = 18 | long_ride = 1 | 1 | Pass |
| TC05 | Logistic Regression trains with expected accuracy | Train/test split | Accuracy ~0.71 | 0.712 | Pass |
| TC06 | Model fails with invalid input types | start_lat = "abc" | ValueError or handled gracefully | Error | Pass |
| TC07 | Model handles extreme ride durations | trip_duration_minutes = 999 | long_ride = 1 | 1 | Pass |
| TC08 | Model prediction output is binary | Valid X_test input | Only 0 or 1 | 0/1 | Pass |
| TC09 | Accuracy threshold met | Full model training | Accuracy ≥ 0.60 | 0.712 | Pass |
| TC10 | Visualization runs without error | sns.heatmap(conf_matrix) | No exceptions thrown | No error | Pass |
| TC11 | Linear Regression performs poorly | Linear Regression on classification task | Accuracy ≥ 0.90 | Accuracy = 0.058 | Fail |
Table 7.1 Test Case Results

The testing phase of the Citi Bike project validated both data preprocessing and machine learning
model performance through a structured set of test cases. The test cases confirmed that the
dataset loaded correctly (TC01), missing values were handled (TC02), and new features
such as trip_duration_minutes and the long_ride label were derived accurately (TC03, TC04).
Logistic Regression performed as expected, achieving an accuracy of 0.712 (TC05),
surpassing the minimum threshold of 0.60 (TC09), and handling edge cases and invalid inputs
appropriately (TC06, TC07, TC08). Visualizations such as heatmaps also rendered
without errors (TC10). However, one test case (TC11) highlighted a critical insight: Linear
Regression, when applied to a classification task, performed poorly, with an accuracy of only
0.058. This indicates a clear mismatch between the algorithm and the problem type and reinforces
the importance of selecting a model that fits the task. Overall, the system is stable, well tested, and
aligned with the project goals. The single failure, Linear Regression used for classification, was
tested intentionally to demonstrate its inadequacy.
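
The model-level checks (TC05, TC08, TC09) and the mismatch seen in TC11 can be illustrated with a small sketch. It runs on synthetic data, not the Citi Bike dataset, and the two features are made up for illustration. The report does not state how accuracy was computed for Linear Regression, so the sketch only demonstrates the underlying issue: a regression model returns continuous scores, whereas a classifier returns class labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and a binary label (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Classifier: predictions are class labels, so accuracy is well defined.
clf = LogisticRegression().fit(X_train, y_train)
class_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, class_pred)
conf_matrix = confusion_matrix(y_test, class_pred)

# Regressor trained on the same 0/1 labels: predictions are continuous values.
reg = LinearRegression().fit(X_train, y_train)
raw_pred = reg.predict(X_test)

print("Logistic Regression accuracy:", round(accuracy, 3))
print("Binary outputs only:", set(np.unique(class_pred).tolist()) <= {0, 1})
print("Linear Regression output range:", raw_pred.min(), raw_pred.max())
```

Because the regressor's outputs are rarely exactly 0 or 1, comparing them directly with the labels produces a very low score unless a threshold is applied first.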


7.3 Highlights

The software testing conducted on the Citi Bike Trip Classification Project demonstrates that the
system is both functionally sound and robust. Multiple testing methodologies (unit,
integration, system, performance, and error handling tests) were applied to
validate every stage of the data pipeline and machine learning workflow. The test cases confirm
that the system handles data preprocessing, model training, and visualization accurately and
efficiently. With a Logistic Regression model reaching 71.2% accuracy and erroneous inputs
handled gracefully, the system proved stable across the tested scenarios. The only failing case,
involving Linear Regression on a classification task, was an intentional demonstration of
inappropriate model selection. Overall, the testing process provided critical assurance of the
system’s quality and reliability.


8. CONCLUSION

This project highlights the importance of data analytics in enhancing the
operational efficiency of bike-sharing systems. The project aimed to gain insights from Citi Bike’s
real-world dataset, build predictive models, and present interactive dashboards to help
stakeholders make informed decisions.

Key Outcomes:

1. Successful Data Cleaning and Preprocessing

The dataset used in this project (citibike newww.csv) included vital trip-level information such as
started_at, ended_at, rideable_type, start_station_name, and member_casual. These attributes
were essential for understanding user behavior and ride trends. Before analysis began,
the dataset underwent thorough cleaning. This included handling missing values to ensure data
integrity and engineering new features such as ride duration, hour of the day, and day of the week,
which provided contextual insights for modeling. Additionally, categorical variables such as
rideable_type and member_casual were label encoded to convert them into a numerical format
suitable for machine learning algorithms. These preprocessing steps ensured that the data was
structured, consistent, and ready for downstream analysis.
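
The cleaning and feature engineering steps described above could look roughly like the sketch below. The three sample rows are invented for illustration, and the derived column names (hour, day_of_week, and the _code suffix) are assumptions; ride_length follows the name used later in this chapter. The source column names come from the dataset description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up sample with the trip-level columns named in the report.
rides = pd.DataFrame({
    "started_at": ["2024-01-06 08:15:00", "2024-01-08 17:40:00", "2024-01-09 09:05:00"],
    "ended_at": ["2024-01-06 08:45:00", "2024-01-08 17:52:00", None],
    "rideable_type": ["classic_bike", "electric_bike", "classic_bike"],
    "start_station_name": ["Station A", "Station B", "Station C"],
    "member_casual": ["casual", "member", "member"],
})

# Handle missing values: here, rows without both timestamps are dropped.
rides = rides.dropna(subset=["started_at", "ended_at"]).copy()
rides["started_at"] = pd.to_datetime(rides["started_at"])
rides["ended_at"] = pd.to_datetime(rides["ended_at"])

# Engineer ride duration (minutes), hour of the day, and day of the week.
rides["ride_length"] = (rides["ended_at"] - rides["started_at"]).dt.total_seconds() / 60
rides["hour"] = rides["started_at"].dt.hour
rides["day_of_week"] = rides["started_at"].dt.dayofweek  # Monday = 0, Sunday = 6

# Label encode the categorical variables into integer codes.
for col in ["rideable_type", "member_casual"]:
    rides[col + "_code"] = LabelEncoder().fit_transform(rides[col])
```

LabelEncoder assigns codes in sorted order of the category names, so the mapping should be recorded if the encoded data is later interpreted or reused.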

2. Exploratory Data Analysis (EDA) and Insights

With the data prepared, the next phase involved exploratory data analysis using
Python libraries such as Seaborn and Matplotlib, alongside Power BI for interactive reporting.
This analysis revealed meaningful behavior patterns among users. Notably, casual users were
more active on weekends, while members preferred weekdays, especially during commuting
hours (7–9 AM and 5–7 PM). Ride durations were also typically longer for
casual users. These insights were reinforced through Power BI dashboards, whose visualizations
included heatmaps identifying the most popular stations, line charts depicting monthly
ride frequency trends, and histograms showing the distribution of ride durations across user types
and time frames.
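
The kind of aggregation behind these observations can be sketched with pandas. The six rows below are toy values chosen only to make the output easy to read; they are not results from the project, and the derived column names day_type and hour are assumptions.

```python
import pandas as pd

# Toy rides (made-up values) with the columns needed for the comparison.
rides = pd.DataFrame({
    "started_at": pd.to_datetime([
        "2024-01-06 11:00", "2024-01-07 14:30", "2024-01-08 08:10",
        "2024-01-09 17:45", "2024-01-10 08:20", "2024-01-13 16:00",
    ]),
    "member_casual": ["casual", "casual", "member", "member", "member", "casual"],
    "ride_length": [35.0, 42.0, 11.0, 14.0, 9.0, 28.0],
})

# Saturday (5) and Sunday (6) count as the weekend.
rides["day_type"] = (
    rides["started_at"].dt.dayofweek.ge(5).map({True: "weekend", False: "weekday"})
)
rides["hour"] = rides["started_at"].dt.hour

# Ride counts per user type and day type.
day_type_counts = pd.crosstab(rides["member_casual"], rides["day_type"])

# Average ride duration per user type.
mean_length = rides.groupby("member_casual")["ride_length"].mean()

# Hourly ride counts for members, useful for spotting commuting peaks.
member_hourly = rides[rides["member_casual"] == "member"].groupby("hour").size()

print(day_type_counts)
print(mean_length)
print(member_hourly)
```

The same tables can be passed to Seaborn or Matplotlib for plotting, or exported for use in Power BI.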


3. Predictive Modelling

To derive predictive insights from the dataset, two supervised machine learning models were
built and evaluated. The first was a Logistic Regression model, used to classify riders as either
member or casual based on features like the hour of the day, rideable type, and day of the week.
This classification model was assessed using metrics such as the accuracy score and confusion
matrix to ensure its reliability. The second model was a Linear Regression model, aimed at
predicting the ride duration (ride_length) in minutes. The key input features for this model
included start time, rider type, and bike type. It was evaluated using Mean Squared Error (MSE)
and the R² score, both of which provided insights into the model’s prediction accuracy and fit.
These models contributed to understanding user behavior and forecasting demand.
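
As an illustration of the regression evaluation described above, the sketch below fits a Linear Regression model and reports MSE and R². The data is synthetic and the relationship between the features and ride_length is invented for the example, so the printed numbers say nothing about the project's actual results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for start hour, rider type, and bike type (assumptions).
rng = np.random.default_rng(0)
n = 500
hour = rng.integers(0, 24, size=n)
is_member = rng.integers(0, 2, size=n)
is_electric = rng.integers(0, 2, size=n)

# Invented target: shorter rides for members and electric bikes, plus noise.
ride_length = 25 - 10 * is_member - 3 * is_electric + rng.normal(0, 2, size=n)

X = np.column_stack([hour, is_member, is_electric])
X_train, X_test, y_train, y_test = train_test_split(
    X, ride_length, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MSE is in squared minutes; R² is the share of variance explained.
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"MSE: {mse:.2f}  R²: {r2:.3f}")
```

Taking the square root of the MSE gives an error in minutes, which is often easier to communicate to stakeholders than the squared value.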

4. Power BI Dashboard Integration

The insights derived from data analysis and modeling were brought to life using a professionally
designed Power BI dashboard (from MainProject.pbit). This interactive platform enabled
stakeholders to explore trends visually, with detailed breakdowns by month, station usage, and
hourly ride patterns. One of the standout features of this dashboard was its ability to distinguish
usage behavior based on membership type, helping identify profit potential from various user
segments. The dashboards were equipped with dynamic filters that allowed users to drill down
into specific time frames or categories, providing a flexible and user-friendly experience. Overall,
Power BI served as a powerful decision support tool, giving stakeholders actionable insights for
resource allocation, marketing, and fleet optimization.

This integrated workflow of data acquisition → model building → dashboard visualization
exemplifies the power of data science in solving urban mobility challenges.


9. FUTURE ENHANCEMENTS

Though the project successfully achieves its objectives, several enhancements can make it more
robust, intelligent, and scalable.

1. Integration of Real-Time Data

Current Limitation: The dataset is static.

Enhancement: Integrate Citi Bike’s real-time APIs to capture:

• Live bike availability
• Ongoing trip data
• Station statuses

This allows:

• Dynamic dashboard updates
• Real-time prediction and alerts
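
A real-time alert could start from a station status feed in the General Bikeshare Feed Specification (GBFS) format. The sketch below is only an assumption about how such a check might look: it works on an inline sample payload with made-up station IDs and counts instead of calling a live endpoint, and the function name low_stock_stations and the min_bikes parameter are hypothetical.

```python
from typing import Any


def low_stock_stations(payload: dict[str, Any], min_bikes: int = 3) -> list[str]:
    """Return IDs of stations that are renting but have fewer than min_bikes bikes."""
    stations = payload.get("data", {}).get("stations", [])
    return [
        s["station_id"]
        for s in stations
        if s.get("is_renting", 1) and s.get("num_bikes_available", 0) < min_bikes
    ]


# Sample payload shaped like a GBFS station_status feed (values are invented).
sample_status = {
    "last_updated": 1704103200,
    "ttl": 60,
    "data": {
        "stations": [
            {"station_id": "station-1", "num_bikes_available": 12,
             "num_docks_available": 7, "is_renting": 1},
            {"station_id": "station-2", "num_bikes_available": 1,
             "num_docks_available": 18, "is_renting": 1},
            {"station_id": "station-3", "num_bikes_available": 0,
             "num_docks_available": 20, "is_renting": 0},
        ]
    },
}

alerts = low_stock_stations(sample_status)
print(alerts)
```

In a live setup the payload would be fetched periodically and the resulting list pushed to the dashboard or a notification service.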

2. Advanced Machine Learning Algorithms

Current Models: Logistic and Linear Regression

Future Models:

• Random Forest Classifier – for higher accuracy
• XGBoost – to handle large, imbalanced datasets
• Neural Networks – for deeper feature interaction
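
Why a tree-based model might help can be shown on a synthetic example in which the label depends on an interaction between two features. This is a sketch under that assumption only; it does not predict how a Random Forest would perform on the Citi Bike data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label is 1 when both features share the same sign,
# a pattern that no single straight-line boundary can separate.
rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Linear baseline versus a Random Forest on the same split.
log_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
forest = RandomForestClassifier(n_estimators=100, random_state=7)
rf_acc = forest.fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic Regression accuracy: {log_acc:.3f}")
print(f"Random Forest accuracy:       {rf_acc:.3f}")
```

On real ride data the gain depends on whether such interactions exist, so any new model should be compared against the Logistic Regression baseline on the same train/test split.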

3. Geospatial & Route Analysis

Include map-based visuals and route clustering:

• Start and end station geolocation mapping
• Route heatmaps for optimization
• Tools: GeoPandas, Folium, and Power BI ArcGIS maps
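
A simple building block for route analysis is the straight-line distance between a start and an end station. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; the function name and the coordinates are illustrative assumptions and do not correspond to real stations.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


# Hypothetical start and end coordinates for one trip.
trip_km = haversine_km(40.7500, -73.9900, 40.7300, -73.9800)
print(f"Straight-line trip distance: {trip_km:.2f} km")
```

The result is a lower bound on the distance actually ridden, since riders follow streets; it can still be useful for clustering routes or flagging implausible trips.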


4. Mobile and Web Interface Development

A user-friendly dashboard (outside Power BI) can improve accessibility:

• Flask/Django web application
• React dashboard frontend
• Push notifications and filters

5. Time Series Forecasting

Apply models such as:

• ARIMA
• Prophet (developed by Facebook)
• LSTM (Long Short-Term Memory networks)

Use case: Forecast next month’s demand and station-specific usage.

6. Revenue and Resource Optimization

Combine ride data with cost analysis to optimize:

• Bike maintenance schedules
• Station stocking
• Peak-time staff allocation

