1. INTRODUCTION
The Introduction chapter lays the foundation for the project. It explains the motivation, scope, and
core objectives of the study. It helps the reader understand what the project is, why it's important,
and how it was approached. It also connects the problem to real-world needs and academic
research.
1.1 Aim
The primary aim of this project is to conduct a comprehensive analysis of the Citi Bike system
using real-world trip data, with the objective of identifying key usage patterns, understanding user
behavior, and building predictive models. This includes:
The insights derived from this study aim to support stakeholders — including urban planners,
transportation agencies, and service operators — in enhancing operational efficiency, improving
user experience, and promoting sustainable urban mobility.
As cities worldwide face challenges related to traffic congestion, environmental concerns, and
last-mile connectivity, bike-sharing services offer a practical, low-emission alternative. Projects
like this empower decision-makers by converting raw data into actionable intelligence.
Citi Bike is a large-scale, dock-based bike-sharing service operating in New York City and
neighboring areas. The dataset used in this project (citibike newww.csv) includes hundreds of
thousands of anonymized trip records, with features such as:
PROJECT 1 2024-2025
NMKRV COLLEGE FOR WOMEN DEPARTMENT OF BCA
This project analyzes the dataset through the lens of data science and machine learning by
building a data pipeline that includes:
1. Data Preprocessing
The data is cleaned by:
Using both Python (Seaborn, Matplotlib) and Power BI, the data is explored to uncover:
Key visuals include heatmaps, boxplots, bar graphs, and geospatial maps created using Power BI
for interactive insight.
3. Feature Engineering
Categorical variables such as member_casual and rideable_type are label encoded. Continuous
features like latitude and longitude are standardized or normalized for use in machine learning
models.
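The feature engineering step above can be sketched in plain Python. This is a minimal illustration, not the project's notebook code: the helper names label_encode and standardize and the sample values are hypothetical, and the project itself would more likely rely on scikit-learn's LabelEncoder and StandardScaler, which follow the same conventions (sorted class order, population standard deviation).

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer, using sorted class order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical sample values; real values come from the trip CSV.
member_casual = ["member", "casual", "member", "member"]
start_lat = [40.70, 40.72, 40.74, 40.76]

encoded, mapping = label_encode(member_casual)
scaled_lat = standardize(start_lat)

print(mapping)   # {'casual': 0, 'member': 1}
print(encoded)   # [1, 0, 1, 1]
```

Standardizing coordinates this way keeps features on a comparable scale, which matters for models that are sensitive to feature magnitude.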
1.3 Highlights
This chapter established the foundation of the Citi Bike analysis project by outlining its
motivation, goals, and scope. The project aims to harness real-world trip data to uncover user
behavior patterns, predict trip durations, and classify user types using machine learning
techniques. By integrating analytical tools like Python and Power BI, the project transforms raw
data into meaningful insights that can inform policy-making and service optimization. The
approach is both data-driven and practically oriented, addressing modern urban mobility
challenges through the lens of technology.
With a structured pipeline, from data preprocessing to visualization, the project not only
highlights the potential of shared mobility systems but also provides a replicable framework for
similar urban transportation studies. The next chapter delves into existing research and related
work in the domain of bike-sharing systems, predictive modeling, and urban mobility analytics.
2. LITERATURE SURVEY
A literature survey is a structured review of existing knowledge, research, tools, systems, and
technologies related to a particular topic or project. In other words, it is the background research
done before starting the project. It helps in the following:
These metrics are descriptive, not predictive, and often presented via static reports. There’s
minimal use of machine learning or AI techniques to forecast demand or understand user behavior
dynamically.
Additionally, many city reports rely on manual planning decisions without integrating data-driven
tools like Power BI dashboards or regression analysis.
This study highlights how existing systems focus on fleet deployment and user convenience, but
often lack analytical depth and prediction models for usage patterns or operational insights.
2. Proposed System
The proposed system builds a data-driven decision support framework that improves upon
existing analytics in three key areas:
Area | Traditional System | Citi Bike Trip Data
Analysis Type | Descriptive only | Descriptive + Predictive
It integrates machine learning, data visualization, and real-world urban mobility data — a
combination rarely seen in current city dashboards.
Feasibility determines whether the proposed solution is practical to implement with the available
tools, time, and data. The feasibility study assesses whether the Citi Bike data analysis and
machine learning project can be successfully executed with the available resources, tools, and data
within the project timeline. This evaluation considers the following key aspects:
1. Technical Feasibility
The dataset (citibike newww.csv) is clean, structured, and includes useful features like start
time, end time, station info, and user type.
Models implemented in Python notebooks (e.g., MainProjectNew.ipynb, Linear Regression
New.ipynb) are based on standard libraries (e.g., scikit-learn, pandas, seaborn).
Visualizations built in Power BI (MainProject.pbix) are highly interactive and suitable for
decision-makers with no coding background.
2. Economic Feasibility
All core tools (Python, Jupyter, Power BI Desktop) are free or open-source.
No paid software, data licensing, or cloud resources are required.
The solution runs on standard computing infrastructure.
3. Operational Feasibility
The system is user-friendly and designed for real-world use (e.g., transit planning, city analytics).
Outputs are clear and interpretable for city stakeholders and data analysts.
The system can be easily expanded to include more stations or time frames.
This project uses a modern tech stack optimized for data science, modeling, and visualization.
Technologies
Category | Tool | Purpose
Language | Python 3.x | Programming, analysis, modeling
Notebook | Jupyter | Interactive coding environment
Visualization | matplotlib, seaborn, Power BI | Graphs, dashboards
Power BI | MainProject.pbix | Business-level dashboards
All technologies are widely supported and have large communities and documentation.
Hardware Requirements
OS Windows 11
Software Requirements
Software Version/Type
Python 3.8+
No proprietary tools or expensive infrastructure are needed. Everything can run on a student or
professional machine.
2.5 Highlights
This chapter provided a comprehensive review of the current state of bike-sharing systems,
focusing on Citi Bike as a case study. It highlighted the limitations of existing systems, which
mostly rely on descriptive analytics, and justified the need for predictive and interactive approaches.
The proposed system enhances analytical depth by incorporating machine learning techniques and
dynamic visualization tools such as Power BI.
The feasibility study confirmed that the project is realistic and implementable with available data,
tools, and infrastructure. The use of open-source technologies and minimal hardware requirements
further strengthens its practicality. The tools and technologies adopted—ranging from Python and
Jupyter to Power BI—are well-established in the data science ecosystem, ensuring long-term
support and scalability.
Having established the background, feasibility, and technological framework of the project, the
next chapter will define the Software Requirements Specification (SRS). It will outline both the
functional and non-functional requirements essential for successful implementation.
The Software Requirements Specification (SRS) defines both the functional and non-functional
expectations for the system. It acts as a blueprint for developers, data scientists, stakeholders, and
users to ensure clarity and successful implementation.
Functional requirements define what the system should do — the features and operations it must
support. For this project, the system includes data analysis, machine learning prediction, user
classification, and visualization.
FR4 | Data Cleaning & Feature Engineering | Clean missing values, convert date-time columns, and create derived features like age and trip duration.
Non-functional requirements define how the system performs, rather than what it does. These
requirements address the quality attributes such as usability, performance, and maintainability.
ID Requirement Description
3.3 Highlights
In the Citi Bike project, the functional requirements focus on loading, processing, analyzing,
modeling, and visualizing using Python and Power BI. The non-functional requirements ensure
that the system is accurate, scalable, and easy to use. Together, these requirements form a robust
foundation for implementation and evaluation.
System and Design Analysis is a phase in project development where you analyze how a system
will function and design its structure, components, and workflows. It's like creating a blueprint
before building the actual system.
The system designed in this Citi Bike Trip Data project is a data-driven analytical pipeline for
understanding and predicting patterns in a bike-sharing business (e.g., Citi Bike). It integrates
several stages, including:
1. Data Collection
The system ingests real-world data from Citi Bike (citibike newww.csv), which includes details
such as ride duration, bike type, station info, and user category (member vs. casual).
2. Data Preprocessing
EDA was conducted using visualization libraries (matplotlib, seaborn) to identify patterns such as
peak usage hours, popular stations, and user type behavior.
6. Decision Support
The final output of the Citi Bike analysis helps business stakeholders understand customer usage, optimize
resource allocation, and increase profitability.
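The data collection and exploratory analysis stages described above can be sketched as follows. This is an illustrative outline under stated assumptions: the in-memory sample rows, the timestamp format, and the helper names load_trips and start_hour are hypothetical, and the project's notebooks would more likely use pandas, matplotlib, and seaborn for the same counts.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample standing in for the project's CSV file.
SAMPLE_CSV = """ride_id,rideable_type,started_at,start_station_name,member_casual
A1,classic_bike,2024-05-01 08:05:00,Station A,member
A2,classic_bike,2024-05-01 08:40:00,Station A,member
A3,electric_bike,2024-05-01 17:15:00,Station B,member
A4,electric_bike,2024-05-04 14:10:00,Station B,casual
A5,classic_bike,2024-05-04 08:30:00,Station A,casual
"""


def load_trips(handle):
    """Read trip rows from an open text file into a list of dicts."""
    return list(csv.DictReader(handle))


def start_hour(trip):
    """Extract the hour of day from started_at (timestamp format is assumed)."""
    return datetime.strptime(trip["started_at"], "%Y-%m-%d %H:%M:%S").hour


trips = load_trips(io.StringIO(SAMPLE_CSV))

# Simple frequency counts behind the "peak hours" and "popular stations" views.
rides_per_hour = Counter(start_hour(t) for t in trips)
rides_per_station = Counter(t["start_station_name"] for t in trips)
rides_per_user_type = Counter(t["member_casual"] for t in trips)

peak_hour, peak_count = rides_per_hour.most_common(1)[0]
print(peak_hour, peak_count)
print(rides_per_station.most_common(1))
print(dict(rides_per_user_type))
```

For a real file, the same load_trips function can be called on a handle opened with open(path, newline="").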
A Data Flow Diagram (DFD) is a graphical tool used to represent the flow of data within a system.
It shows how input data is transformed into output through various processes, and how data moves
between external entities, system components, and data stores.
[Figure: Data Flow Diagram. Recoverable label: User]
Description:
4.3 Highlights
The system design and analysis phase provided a structured blueprint for developing a robust and
intelligent Citi Bike Trip data pipeline. From data collection to decision support, each stage was
carefully planned and implemented using industry-standard tools like Python, scikit-learn, and
Power BI. The integration of data preprocessing, exploratory analysis, machine learning models,
and visual dashboards ensures a seamless and insightful flow of information. The Data Flow
Diagram clearly illustrates how data moves through the system, from user input to final insights.
Overall, the system is designed to be scalable, data-driven, and decision-oriented, supporting
meaningful business analysis and prediction in urban bike-sharing systems.
5. DETAILED DESIGN
A Use Case Diagram is a visual representation that shows how users (actors) interact with a
system to achieve specific goals. It helps identify system functionalities from the user’s
perspective and clarifies user-system relationships.
[Figure: Use Case Diagram. Recoverable actor: Admin]
A Sequence Diagram is a type of UML diagram that shows the order of interactions between
system components or objects over time. It visualizes how processes or functions communicate
through messages to complete a task, step by step.
[Figure: Sequence Diagram. Messages, in order: Load Dataset, Preprocessing, Train Models, Results, Visualizations]
This Sequence Diagram illustrates the interaction between the User and the Google Colab
Notebook during the data analysis and machine learning workflow. The process begins with the
user initiating the system by uploading or loading the dataset. This is followed by the
preprocessing phase, where the data is cleaned and transformed; the results of this step are then
sent back to the user for review or confirmation. Next, the system moves into the model training
phase, where algorithms such as logistic and linear regression are applied. The results of the
models, including evaluation metrics like accuracy and R² score, are returned to the user. Finally,
the system generates visualizations—including charts and graphs—which are displayed back to
the user for interpretation and decision-making. This sequence ensures a smooth and interactive
flow between data operations and user feedback.
An Activity Diagram is a UML diagram that illustrates the workflow of actions or activities in a
system. It represents the sequence of steps, decisions, and parallel processes involved in
performing a task or function.
This Activity Diagram represents the step-by-step workflow of the bike-sharing data analysis
project. The process begins with the acquisition of the Data Set, followed by the Data Cleaning
activity, where inconsistencies and missing values are handled. After cleaning, the workflow
proceeds to Exploratory Data Analysis (EDA), where the structure, trends, and key insights in the
data are uncovered. From there, the flow diverges into two parallel activities: Model Building for
Prediction, where machine learning models are developed to forecast ride durations or classify
users, and Dashboards, where the analyzed data is visualized using Power BI to present trends and
patterns. This diagram emphasizes the logical flow of tasks involved in preparing data, deriving
insights, and delivering predictive and visual outputs.
A Class Diagram is a UML diagram that shows the static structure of a system by representing its
classes, attributes, methods, and the relationships between them. It is used to model the blueprint
of objects in object-oriented design.
[Figure: Class Diagram. User: Started_at, Station_name, Started_at. Trip: Ride_id, Member_Casual, Ride_type. Preprocessing: Cleaning & EDA]
This Class Diagram represents the object-oriented structure of the bike-sharing data analysis
system. It consists of three main classes: User, Trip, and Preprocessing. The User class includes
attributes like Ride_time, Station_Name, and Started_at, which describe when and where a user
begins a ride. The Trip class contains attributes such as Ride_id, Member_Casual, and Ride_type,
representing the trip's unique identifier, user membership status, and type of bike used. Both the
User and Trip classes are associated with the Preprocessing class, which handles operations such
as data cleaning and exploratory data analysis (EDA). This diagram captures the key components
and their relationships in the system, emphasizing how raw data is structured and processed before
being used for machine learning and visualization.
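One way to read the class structure described above is as a set of Python dataclasses. This is only a sketch of the diagram, not code from the project: the field types, the unit of ride_time, and the clean method are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    ride_time: Optional[float]     # unit is an assumption (minutes)
    station_name: Optional[str]
    started_at: Optional[str]


@dataclass
class Trip:
    ride_id: str
    member_casual: str             # "member" or "casual"
    ride_type: str                 # type of bike used


class Preprocessing:
    """Stands in for the cleaning step that User and Trip records pass through."""

    @staticmethod
    def clean(users: List[User]) -> List[User]:
        # Hypothetical rule: drop records with any missing attribute.
        return [
            u for u in users
            if None not in (u.ride_time, u.station_name, u.started_at)
        ]


# Hypothetical records for illustration only.
users = [
    User(18.0, "Station A", "2024-05-01 08:05:00"),
    User(None, "Station B", "2024-05-01 09:00:00"),
]
trip = Trip("A1", "member", "classic_bike")

cleaned = Preprocessing.clean(users)
print(len(cleaned), trip.ride_type)
```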
5.5 ER Diagram
[Figure: ER Diagram. Entities: User (Started_at, Member_casual), Bike (Ride_id, Ride_type), Station (Station_id, Station_start), Station (Ended_at, Station_end). Relationships: Has, Starts At, Destination At]
This is an Entity-Relationship (ER) Diagram that models the structure of a bike-sharing database
system. It illustrates the relationships between the entities User, Bike, and Station. The User entity
includes attributes like Started_at and Member_casual, representing the start time of the ride and
the rider’s membership status. Each user initiates a ride using a Bike, which has attributes such as
Ride_id and Ride_type. The relationship between User and Bike is shown through a connector
with a diamond symbol, indicating a logical association. The ride starts at one Station, defined by
Station_id and Station_start, and ends at another Station, represented by Station_end. The two
station entities represent the origin and destination of the trip. This ER diagram effectively
captures the lifecycle of a trip in the bike-sharing system—from user initiation, bike assignment,
to station tracking.
5.6 Highlights
6. IMPLEMENTATION
This chapter describes the modular structure of the system and provides relevant visual evidence
generated using both Python-based libraries (such as Matplotlib and Seaborn) and Power BI
dashboards. These visuals reflect the key stages of data exploration, preprocessing, modeling, and
result presentation based on the Citi Bike dataset. The combined use of coding environments and
business intelligence tools ensures a comprehensive analytical workflow with both technical depth
and stakeholder-friendly insights.
The system is divided into distinct modules for easier implementation and debugging:
Logistic Regression:
Linear Regression:
Ride counts
Count user type
Start/End station Bar charts
Hourly distribution of rides
7. SOFTWARE TESTING
Software testing is the process of evaluating a software application to identify and fix defects,
ensuring it meets the required functionality and quality standards.
For the Citi Bike ride classification project, several types of software testing were performed to
ensure reliability, correctness, and performance of the model and associated data preprocessing
pipeline:
1. Unit Testing
Purpose: To test individual functions such as duration calculation, label
creation, and data cleaning steps.
Example: Verifying that trip_duration_minutes is correctly computed
from timestamps.
2. Integration Testing
Purpose: To verify that combined components (data processing + model
training + evaluation) work seamlessly.
Example: Checking that the cleaned dataset passes correctly into the ML
model and gives valid outputs.
3. System Testing
Purpose: Testing the entire pipeline from CSV loading →
preprocessing → model training → prediction.
Example: Ensuring the notebook runs without errors and all steps execute in
order.
4. Performance Testing
Purpose: Measuring model performance in terms of accuracy and execution
time.
Tools: scikit-learn, classification_report, and confusion_matrix.
Results:
Logistic Regression Accuracy: 71.2%
Linear Regression Accuracy: 5.8% (Failed — not suitable for classification)
Execution time: Acceptable for datasets of this size.
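The unit testing example listed above (checking that trip_duration_minutes is computed correctly from timestamps) could look like the sketch below. The function body and the timestamp format are assumptions, since the report does not show the notebook's implementation.

```python
from datetime import datetime

# Assumed format of the started_at / ended_at columns.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_minutes(started_at, ended_at):
    """Return the ride length in minutes between two timestamp strings."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(ended_at, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60


def test_trip_duration_minutes():
    # An 18-minute ride within the same hour.
    assert trip_duration_minutes("2024-05-01 08:05:00", "2024-05-01 08:23:00") == 18
    # A ride that crosses midnight must still give a positive duration.
    assert trip_duration_minutes("2024-05-01 23:50:00", "2024-05-02 00:10:00") == 20


test_trip_duration_minutes()
print("trip_duration_minutes tests passed")
```

Keeping the calculation in a small function like this is what makes it testable in isolation from the rest of the pipeline.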
The table below summarizes the test cases used for validation of the system:
ID | Description | Input | Expected Output | Actual Output | Status
TC04 | Label long_ride created correctly | trip_duration_minutes = 18 | long_ride = 1 | 1 | Pass
TC05 | Logistic Regression trains with expected accuracy | Train/test split | Accuracy ~0.71 | 0.712 | Pass
TC06 | Model fails with invalid input types | start_lat = "abc" | ValueError or handled gracefully | Error | Pass
TC07 | Model handles extreme ride durations | trip_duration_minutes = 999 | long_ride = 1 | 1 | Pass
TC08 | Model prediction output is binary | Valid X_test input | Only 0 or 1 | 0/1 | Pass
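Test cases TC04, TC06, TC07, and TC08 can be expressed as simple assertions, as in the sketch below. The 15-minute cutoff for long_ride is a hypothetical value chosen only so that the documented inputs (18 and 999 minutes) map to 1; the report does not state the actual threshold. The parse_latitude helper is also an assumption, and it validates the input before it would reach a model rather than exercising the model itself.

```python
# Hypothetical cutoff; the report does not state the real threshold.
LONG_RIDE_THRESHOLD_MINUTES = 15


def long_ride_label(trip_duration_minutes):
    """Return 1 for a long ride and 0 otherwise."""
    return int(trip_duration_minutes > LONG_RIDE_THRESHOLD_MINUTES)


def parse_latitude(raw):
    """Convert a raw start_lat value to float; raises ValueError on bad input."""
    return float(raw)


# TC04: an 18-minute ride is labelled as long.
assert long_ride_label(18) == 1

# TC07: an extreme duration is still labelled consistently.
assert long_ride_label(999) == 1

# TC08-style check: the label is always binary.
assert all(long_ride_label(m) in (0, 1) for m in (0, 5, 15, 16, 999))

# TC06: a non-numeric latitude is rejected with ValueError.
try:
    parse_latitude("abc")
    invalid_input_rejected = False
except ValueError:
    invalid_input_rejected = True
assert invalid_input_rejected

print("test case sketches passed")
```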
The testing phase of the Citi Bike project validated both data preprocessing and machine learning
model performance through a structured set of test cases. The test cases confirmed that the dataset
loaded correctly (TC01), missing values were effectively handled (TC02), and new features
like trip_duration_minutes and the long_ride label were accurately derived (TC03, TC04).
Logistic Regression performed as expected, achieving a consistent accuracy of 0.712 (TC05),
surpassing the minimum threshold of 0.60 (TC09), and handling edge cases and invalid inputs
appropriately (TC06, TC07, TC08). Visualizations such as heatmaps also rendered successfully
without errors (TC10). However, one test case (TC11) highlighted a critical insight — Linear
Regression, when applied to a classification task, performed poorly with an accuracy of only
0.058. This indicates a clear mismatch between the algorithm and the problem type, reinforcing
the importance of model selection based on the task. Overall, the system is stable, well-tested, and
aligned with the project goals, with the exception of using regression for classification, which was
intentionally tested to demonstrate its inadequacy.
7.3 Highlights
The software testing conducted on the Citi Bike Trip Classification Project demonstrates that the
system is both functionally sound and robust. Multiple testing methodologies — including unit,
integration, system, performance, and error handling tests — were effectively implemented to
validate every stage of the data pipeline and machine learning workflow. The test cases confirm
that the system handles data preprocessing, model training, and visualization accurately and
efficiently. With a logistic regression model achieving reliable accuracy and graceful handling of
erroneous inputs, the system proves to be stable and production-ready. The only failing case,
involving linear regression on a classification task, served as an intentional validation of
inappropriate model selection. Overall, the testing process provided critical assurance of the
system’s quality and reliability.
8. CONCLUSION
The culmination of this project highlights the importance of data analytics in enhancing the
operational efficiency of bike-sharing systems. The project aimed to gain insights from Citi Bike’s
real-world dataset, build predictive models, and present interactive dashboards to help
stakeholders make informed decisions.
Key Outcomes:
The dataset used in this project (citibike newww.csv) included vital trip-level information such as
started_at, ended_at, rideable_type, start_station_name, and member_casual. These attributes
were essential for understanding user behavior and ride trends. Before proceeding with analysis,
the dataset underwent thorough cleaning. This included handling missing values to ensure data
integrity, and engineering new features such as ride duration, hour of the day, and day of the week,
which provided contextual insights for modeling. Additionally, categorical variables like
rideable_type and member_casual were label encoded to convert them into numerical format
suitable for machine learning algorithms. These preprocessing steps ensured that the data was
structured, consistent, and ready for downstream analysis.
With the data prepared, the next phase involved performing exploratory data analysis using
Python libraries such as Seaborn and Matplotlib, alongside Power BI for interactive reporting.
This analysis revealed meaningful behavior patterns among users. Notably, casual users were
more active on weekends, while members preferred weekdays, especially during commuting
hours (7–9 AM and 5–7 PM). It was also observed that ride durations were typically longer for
casual users. These insights were reinforced through Power BI dashboards, where visualizations
included heatmaps for identifying the most popular stations, line charts that depicted monthly
ride frequency trends, and histograms showing the distribution of ride durations across user types
and time frames.
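The weekday, weekend, and commuting-hour patterns described above depend on time features derived from started_at. A minimal sketch of that derivation follows. The function name time_features, the timestamp format, and the reading of the commuting windows as half-open hour ranges (rides starting from 7:00 up to 8:59, and from 17:00 up to 18:59) are assumptions.

```python
from datetime import datetime

# 7-9 AM and 5-7 PM as (start_hour, end_hour) pairs; the end hour is excluded.
COMMUTE_WINDOWS = [(7, 9), (17, 19)]


def time_features(started_at):
    """Derive hour of day, day of week, and two flags from a start timestamp."""
    ts = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S")
    hour = ts.hour
    day_of_week = ts.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "hour": hour,
        "day_of_week": day_of_week,
        "is_weekend": day_of_week >= 5,
        "is_commute_hour": any(lo <= hour < hi for lo, hi in COMMUTE_WINDOWS),
    }


# 1 May 2024 was a Wednesday; 4 May 2024 was a Saturday.
weekday_ride = time_features("2024-05-01 08:05:00")
weekend_ride = time_features("2024-05-04 14:10:00")
print(weekday_ride)
print(weekend_ride)
```

Grouping rides by is_weekend and member_casual is then enough to compare the two user segments.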
3. Predictive Modelling
To derive predictive insights from the dataset, two supervised machine learning models were
built and evaluated. The first was a Logistic Regression model, used to classify riders as either
member or casual based on features like the hour of the day, rideable type, and day of the week.
This classification model was assessed using metrics such as the accuracy score and confusion
matrix to ensure its reliability. The second model was a Linear Regression model, aimed at
predicting the ride duration (ride_length) in minutes. The key input features for this model
included start time, rider type, and bike type. It was evaluated using Mean Squared Error (MSE)
and the R² score, both of which provided insights into the model’s prediction accuracy and fit.
These models contributed to understanding user behavior and forecasting demand.
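The evaluation metrics named above can be written out directly to show what each one measures. The sketch below uses plain Python with hypothetical labels and ride lengths. In the project these values would normally come from scikit-learn's metrics module, so the functions here are illustrations of the definitions rather than the code that was run.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical labels (1 = member, 0 = casual) and ride lengths in minutes.
labels_true, labels_pred = [1, 0, 1, 1], [1, 0, 0, 1]
minutes_true, minutes_pred = [10.0, 20.0, 30.0], [12.0, 18.0, 30.0]

print(accuracy(labels_true, labels_pred))          # 0.75
print(confusion_matrix(labels_true, labels_pred))  # [[1, 0], [1, 2]]
print(mean_squared_error(minutes_true, minutes_pred))
print(r2_score(minutes_true, minutes_pred))
```

Accuracy and the confusion matrix apply to the classification model, while MSE and R² apply to the ride-length regression; the two pairs are not interchangeable.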
The insights derived from data analysis and modeling were brought to life using a professionally
designed Power BI dashboard (from MainProject.pbit). This interactive platform enabled
stakeholders to explore trends visually, with detailed breakdowns by month, station usage, and
hourly ride patterns. One of the standout features of this dashboard was its ability to distinguish
usage behavior based on membership type, helping identify profit potential from various user
segments. The dashboards were equipped with dynamic filters that allowed users to drill down
into specific time frames or categories, providing a flexible and user-friendly experience. Overall,
Power BI served as a powerful decision support tool, giving stakeholders actionable insights for
resource allocation, marketing, and fleet optimization.
9. FUTURE ENHANCEMENTS
Though the project successfully achieves its objectives, several enhancements can make it more
robust, intelligent, and scalable.
This allows:
ARIMA
Prophet by Facebook
LSTM (Long Short-Term Memory networks)
References
1. Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
(2nd ed.). O'Reilly Media.
2. Sommerville, I. (2011). Software Engineering (9th ed.). Pearson Education.
3. Pressman, R. S., & Maxim, B. R. (2014). Software Engineering: A Practitioner’s Approach
(8th ed.). McGraw-Hill Education.
4. Shelly, G. B., & Rosenblatt, H. J. (2012). Systems Analysis and Design (9th ed.). Cengage
Learning.
5. Microsoft Power BI Documentation.
URL: https://learn.microsoft.com/en-us/power-bi/
6. Citibike NYC. General Bikeshare Feed Specification (GBFS).
URL: https://gbfs.citibikenyc.com/gbfs/gbfs.json
7. IBM. CRISP-DM Methodology.
URL: https://www.ibm.com/docs/en/spss-modeler/SaaS?topic=overview-crisp-dm-modeling-
tool
8. Python Library Docs: Pandas, NumPy, Seaborn, Matplotlib, Scikit-learn.
URL: https://pypi.org
9. Lucidchart UML Guide.
URL: https://www.lucidchart.com/pages/uml-diagram
10. GeeksforGeeks.
URL: https://www.geeksforgeeks.org/
11. Kaggle. Bike Share Dataset Examples.
URL: https://www.kaggle.com