ML PROJECT
Online Payment Fraud Detection using Machine Learning
DONE BY:
GAUTHAM GD
AISHWARYA R
INTRODUCTION
In the digital age, online payments have become an integral part of our daily lives. As online transactions grow, fraud cases are rising with them, causing significant financial losses. To combat this, we have developed a machine learning-based system for online payment fraud detection. This project aims to provide a robust and accurate solution that detects fraudulent transactions in real time, reducing financial losses and increasing confidence in online payments. By applying machine learning algorithms to historical data, our system can identify patterns and anomalies that flag potential fraud cases, providing a secure and streamlined transaction experience for users.
PROBLEM DEFINITION:
Online payment fraud detection is a critical issue in the digital payment ecosystem. With the increasing volume of online transactions, fraudulent activity is also rising, resulting in significant financial losses. The problem is to develop a system that can accurately detect fraudulent transactions in real time, preventing financial losses and enhancing customer trust.
PROPOSED SYSTEM:
Our proposed system detects fraudulent transactions in real time using advanced machine learning algorithms and multi-parameter analysis. It automatically updates its models with new data and algorithms, staying ahead of evolving fraud patterns. This improves detection accuracy and reduces false positives, enhancing online payment security.
MODULES:
1. Data Preprocessing:
• Checking for missing values and preprocessing the data to remove missing values, outliers, and inconsistencies ensures data quality and prevents errors in later stages.
• Creating a correlation matrix helps identify relationships between numeric columns, which can inform feature selection (see the sketch below).
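A minimal sketch of this module (the CSV path here is illustrative; the full program at the end of this document loads the file from Google Drive):

import pandas as pd

df = pd.read_csv('onlinefraud.csv')              # illustrative path; adjust to your environment
print(df.isnull().sum())                         # count missing values per column
df = df.dropna()                                 # drop rows with missing values
correlation_matrix = df.corr(numeric_only=True)  # correlations between numeric columns
print(correlation_matrix)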
2. Feature Engineering:
• Applies feature engineering techniques to extract relevant features from the data, such as transaction amount, location, and time, and transforms and selects the key attributes (see the sketch below).
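The final program keeps the raw numeric columns as its features; purely as an illustration (the derived column balanceErrorOrig is our own invention, not part of the dataset), a feature capturing inconsistent balance updates could be engineered like this:

# Hypothetical derived feature: for a consistent debit, newbalanceOrig = oldbalanceOrg - amount,
# so a nonzero error here can signal an anomalous transaction
df['balanceErrorOrig'] = df['oldbalanceOrg'] - df['amount'] - df['newbalanceOrig']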
3. Model Training and Evaluation:
• This module trains five different machine learning models (XGBoost, Logistic Regression, Random Forest, K-Neighbors, and AdaBoost) on the training data.
• Each model's performance is evaluated on the testing data using accuracy and loss metrics, providing insight into its effectiveness (see the sketch below).
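A minimal sketch of the evaluation step for a single model, using the same scikit-learn metrics as the full program (X_train, X_test, y_train, y_test are assumed to exist from the preprocessing module):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss

clf = LogisticRegression()
clf.fit(X_train, y_train)            # fit on the training split
test_pred = clf.predict(X_test)      # hard class predictions
print("Accuracy:", accuracy_score(y_test, test_pred))
print("Loss:", log_loss(y_test, clf.predict_proba(X_test)))  # log loss needs class probabilities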
4. User Input and Prediction:
• This module takes user input for transaction details, such as step, amount, old balance, and new balance.
• A DataFrame is created from the user input, mimicking the format of the training data. The trained model is then used to predict whether the transaction is fraudulent or not.
    MODULES:
5. Visualization:
• This module plots the training and testing accuracy for each model, providing a visual comparison of their performance.
• The correlation matrix is also visualized, helping to identify relationships between numeric columns.
ARCHITECTURE
DATA PREPROCESSING → FEATURE SELECTION → MODEL TRAINING AND EVALUATION → USER INPUT AND PREDICTION → VISUALIZATION
HARDWARE REQUIREMENTS:
• Processor: Intel Core i5 or equivalent
• RAM: 8 GB or more
• Storage: 500 GB or more
• Graphics card: NVIDIA GeForce GTX 1060 or equivalent (for visualization)
SOFTWARE REQUIREMENTS:
• Operating System: Windows 10 or macOS High Sierra or later
• Python: Version 3.8 or later
• Libraries: Pandas, NumPy, Matplotlib, Seaborn, XGBoost, Scikit-learn
• IDE: Jupyter Notebook or equivalent
• Database: CSV file or equivalent (for storing the dataset)
LIBRARIES:
The project uses the following libraries (an install command follows the list):
• Pandas: for data manipulation and analysis
• NumPy: for numerical computations
• Matplotlib and Seaborn: for data visualization
• Scikit-learn: for machine learning models
• XGBoost: for gradient boosting
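All of these can be installed with a single command (on Google Colab most of them come preinstalled, so only XGBoost may be needed, as in Step 4 below):

pip install pandas numpy matplotlib seaborn scikit-learn xgboost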
FEATURE DESCRIPTION:
1. Step: This feature represents the unit of time (1 hour) in which
the transaction occurred. It can help identify patterns or anomalies
in transaction behavior over time.
2. Amount: This feature represents the transaction amount, which
can be a key indicator of fraud. Large or unusual transaction
amounts may be flagged as potential fraud.
3. OldbalanceOrg: This feature represents the old balance of the
origin account before the transaction occurred. It can help identify
changes in account behavior or unusual activity.
4. NewbalanceOrig: This feature represents the new balance of the
origin account after the transaction occurred. It can help identify
changes in account behavior or unusual activity.
FEATURE DESCRIPTION:
5. OldbalanceDest: This feature represents the old balance of the
destination account before the transaction occurred. It can help
identify changes in account behavior or unusual activity.
6. NewbalanceDest: This feature represents the new balance of the
destination account after the transaction occurred. It can help
identify changes in account behavior or unusual activity.
7. IsFlaggedFraud: This feature indicates whether the transaction was flagged as potentially fraudulent by the payment system (1) or not (0). Note that the target variable the machine learning models are trained to predict is the separate isFraud column (1 = fraudulent, 0 = legitimate).
• These features are used together to train machine learning models to detect fraudulent transactions. By analyzing patterns and relationships between these features, the models can identify potential fraud and flag it for further investigation (see the snippet below).
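In code, these columns become the model input X, and the separate isFraud column becomes the target y, mirroring the full program at the end of this document:

# Numeric feature columns as input; isFraud is the label to predict
X = df.drop(['isFraud', 'nameOrig', 'nameDest', 'type'], axis=1)
y = df['isFraud']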
PROGRAM FUNCTIONALITY:
• Loads a dataset of transactions
• Preprocesses the data by removing missing values and selecting numeric columns
• Splits the data into training and testing sets
• Trains five different machine learning models (XGBoost, Logistic Regression, Random Forest, K-Neighbors, and AdaBoost) on the training data
• Evaluates the performance of each model on the testing data
• Plots the training and testing accuracies for each model
• Prompts the user to input transaction details
• Makes a prediction using the trained model based on the user input
• Indicates whether the transaction is predicted as fraud or non-fraud
ARCHITECTURE:
• Data Preprocessing: The dataset is loaded, and missing values are removed.
• Feature Selection: Non-numeric columns are dropped.
• Model Training: Five machine learning models are trained on the dataset.
• Model Evaluation: The models are evaluated on the testing set.
• Prediction: The best-performing model is used to predict fraudulent transactions.
MACHINE LEARNING MODELS:
1. XGBoost: Gradient-boosted decision trees for classification.
2. Logistic Regression: Predicts the probability of fraud.
3. Random Forest: Ensemble learning for classification.
4. K-Neighbors: Finds similar transactions to known fraudulent ones.
5. AdaBoost: Combines multiple weak models to improve accuracy.
MACHINE LEARNING MODELS:
1. XGBoost:
• Import: import xgboost as xgb
• Initialize: model = xgb.XGBClassifier()
• Train: model.fit(X_train, y_train)
• Predict: model.predict(X_test)
• Hyperparameter tuning: Use xgb.XGBClassifier() with GridSearchCV or RandomizedSearchCV to optimize parameters like max_depth, learning_rate, and n_estimators (see the sketch below).
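A concrete sketch of the tuning step (the grid values here are illustrative, and X_train/y_train are assumed to come from the preprocessing step):

from sklearn.model_selection import GridSearchCV
import xgboost as xgb

# Illustrative grid; GridSearchCV tries every combination with 3-fold cross-validation
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [100, 200],
}
search = GridSearchCV(xgb.XGBClassifier(), param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_)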
MACHINE LEARNING MODELS:
2. Logistic Regression:
• Import: from sklearn.linear_model import LogisticRegression
• Initialize: model = LogisticRegression()
• Train: model.fit(X_train, y_train)
• Predict: model.predict(X_test)
• Hyperparameter tuning: Use LogisticRegression() with GridSearchCV or RandomizedSearchCV to optimize parameters like C and penalty.
MACHINE LEARNING MODELS:
3. Random Forest:
• Import: from sklearn.ensemble import RandomForestClassifier
• Initialize: model = RandomForestClassifier()
• Train: model.fit(X_train, y_train)
• Predict: model.predict(X_test)
• Hyperparameter tuning: Use RandomForestClassifier() with GridSearchCV or RandomizedSearchCV to optimize parameters like n_estimators, max_depth, and min_samples_split.
MACHINE LEARNING MODELS:
4. K-Neighbors:
• Import: from sklearn.neighbors import KNeighborsClassifier
• Initialize: model = KNeighborsClassifier()
• Train: model.fit(X_train, y_train)
• Predict: model.predict(X_test)
• Hyperparameter tuning: Use KNeighborsClassifier() with GridSearchCV or RandomizedSearchCV to optimize parameters like n_neighbors and weights.
MACHINE LEARNING MODELS:
5. AdaBoost:
• Import: from sklearn.ensemble import AdaBoostClassifier
• Initialize: model = AdaBoostClassifier()
• Train: model.fit(X_train, y_train)
• Predict: model.predict(X_test)
• Hyperparameter tuning: Use AdaBoostClassifier() with GridSearchCV or RandomizedSearchCV to optimize parameters like n_estimators and learning_rate (see the sketch below).
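RandomizedSearchCV works the same way as GridSearchCV but samples a fixed number of parameter combinations instead of trying all of them, which is cheaper on large datasets. A sketch with illustrative values:

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import AdaBoostClassifier

# Illustrative search space; n_iter controls how many combinations are sampled
param_distributions = {
    'n_estimators': [50, 100, 200, 400],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
}
search = RandomizedSearchCV(AdaBoostClassifier(), param_distributions,
                            n_iter=5, cv=3, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)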
CORRELATION MATRIX:
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
OUTPUT
• Accuracy for XGBoost: 0.9996605172083198, Loss: 0.0010340594057733003
• Accuracy for Logistic Regression: 0.9983128019589415, Loss: 0.01829306096263862
• Accuracy for Random Forest: 0.9995677881124443, Loss: 0.0051479299492394335
• Accuracy for K-neighbours: 0.9994499121431109, Loss: 0.009307683764686884
• Accuracy for AdaBoost: 0.9992015867677152, Loss: 0.5698750028631996
CONCLUSION:
We conclude that the XGBoost classifier achieves the highest accuracy and the lowest loss during training and testing for online payment fraud detection.
INNOVATION:
• Machine learning algorithms adapt to new fraud patterns
• Real-time detection responds to transactions as they occur
• External data integration enhances detection accuracy
• A user feedback mechanism improves model accuracy
STEPS TO IMPLEMENT
Step 1: Upload Dataset to Google Drive
1. Go to https://drive.google.com/ and log in with your Google account.
2. Upload your dataset ('onlinefraud.csv') to your Google Drive. Make sure to
remember the path where you upload the file.
Step 2: Set Up Google Colab
1. Go to https://colab.research.google.com/ and log in with the same Google
account.
2. Create a new notebook by clicking on "File" > "New Notebook" or "File" >
"Upload Notebook" if you have a notebook file.
3. If you are creating a new notebook, you will see a new cell. You can start typing
code in this cell.
Step 3: Mount Google Drive
1. In a new cell in Colab, run the following code:
from google.colab import drive
drive.mount('/content/drive')
2. Click on the link generated, allow access to your Google Drive, and copy the
authentication code. Paste this code into the cell and press Enter.
Step 4: Install Required Libraries
1. In a new cell in Colab, run the following code to install the necessary libraries:
!pip install xgboost
Step 5: Copy and Paste the Code
1. Copy the entire program from the PROGRAM TO RUN section below.
Step 6: Modify File Path
1. Find the line where the dataset is loaded (`df = pd.read_csv('/content/drive/MyDrive/ML /onlinefraud.csv')`).
2. Replace `'/content/drive/MyDrive/ML /onlinefraud.csv'` with the path to your dataset in your Google Drive. The path should start with `'/content/drive/MyDrive/'`.
Step 7: Run the Code
1. Paste the modified code into a new cell in Colab.
2. Run the cell, either by clicking the play button next to the cell or pressing
Shift+Enter.
Step 8: Check Results
1. After running the code, you will see the results of model evaluation and
predictions in the output cells.
2. Look for the prediction result for the user input transaction to see if it's predicted
as fraud or non-fraud.
PROGRAM TO RUN
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv('/content/drive/MyDrive/ML /onlinefraud.csv')
print(df.shape)
print(df.head(5))
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
df = df.dropna() # Remove rows with missing values
# Create a correlation matrix
correlation_matrix = df.corr(numeric_only=True)  # correlation over numeric columns only
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.show()
# Feature selection and splitting
X = df.drop(['isFraud'], axis=1)
y = df['isFraud']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Exclude non-numeric columns from the training and testing data
non_numeric_columns = ['nameOrig', 'nameDest', 'type']
X_train = X_train.drop(columns=non_numeric_columns)
X_test = X_test.drop(columns=non_numeric_columns)
model = xgb.XGBClassifier()
model1 = LogisticRegression()
model2 = RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=7)
model3 = KNeighborsClassifier()
model4 = AdaBoostClassifier(random_state=42)
models = [model, model1, model2, model3, model4]
model_names = ['XGBoost', 'Logistic Regression', 'Random Forest', 'K-neighbours', 'AdaBoost']
train_accuracy = []
test_accuracy = []
train_losses = []
test_losses = []
# Use a separate loop variable ('clf') so the name 'model' keeps pointing at the
# XGBoost classifier, which is used for the final prediction below
for clf, name in zip(models, model_names):
  clf.fit(X_train, y_train)
  # Training accuracy and loss
  train_pred = clf.predict(X_train)
  train_acc = accuracy_score(y_train, train_pred)
  train_loss = log_loss(y_train, clf.predict_proba(X_train))
  train_accuracy.append(train_acc)
  train_losses.append(train_loss)
  # Testing accuracy and loss
  test_pred = clf.predict(X_test)
  test_acc = accuracy_score(y_test, test_pred)
  test_loss = log_loss(y_test, clf.predict_proba(X_test))
  test_accuracy.append(test_acc)
  test_losses.append(test_loss)
  print(f"Accuracy for {name}: {test_acc}, Loss: {test_loss}")
# Plotting
plt.figure(figsize=(12, 8))
plt.plot(model_names, train_accuracy, marker='o', label='Training Accuracy')
plt.plot(model_names, test_accuracy, marker='o', label='Testing Accuracy')
plt.title('Training and Testing Accuracies for Different Models')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.legend()
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
user_input = {
    'step': 1,
    'amount': 10000.00,
    'oldbalanceOrg': 30000.00,
    'newbalanceOrig': 60000.0,
    'oldbalanceDest': 3000.00,
    'newbalanceDest': 33000.00,
    'isFlaggedFraud': df['isFlaggedFraud'].values[0]  # extract from your dataset
}
# Create a DataFrame from the user input, with columns in the same order as the training data
user_df = pd.DataFrame([user_input])[X_train.columns]
# Make the prediction with the fitted XGBoost model ('model' was trained in the loop above)
user_predictions = model.predict(user_df)
# Check if the user input resulted in fraud or not
if user_predictions[0] == 1:
    print("The transaction is predicted as fraud.")
else:
    print("The transaction is predicted as non-fraud.")