Understanding and
Implementing Faster R-CNN
Many of the current SOTA (state-of-the-art) detectors are built on top of the
groundwork laid by Faster R-CNN. Faster R-CNN is an object detection model
that identifies objects in an image and draws bounding boxes around them,
while also classifying what those objects are. It's a two-stage detector:
1. Stage 1: Proposes potential regions in the image that
might contain objects. This is handled by the Region
Proposal Network (RPN).
2. Stage 2: Uses these proposed regions to predict the
class of the object and refines the bounding box to
better match the object.
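Both stages are visible directly in torchvision's implementation of the model, the same one we fine-tune later in this article. A quick sketch, just to show the structure:

import torchvision

# The detector is composed of a backbone, an RPN (stage 1), and ROI heads (stage 2)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn()
print(type(model.backbone).__name__)   # feature extractor (ResNet-50 + FPN)
print(type(model.rpn).__name__)        # RegionProposalNetwork -- stage 1
print(type(model.roi_heads).__name__)  # RoIHeads -- stage 2 (classification + box refinement)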
The Architecture of Faster R-CNN
[Figure: Faster R-CNN architecture]
Stage 1: Region Proposal Network (RPN):
Backbone Network:
● The image passes through a convolutional network (like
ResNet or VGG16).
● This extracts important features from the image and
creates a feature map.
Anchors:
● Anchors are boxes of different sizes and shapes placed
over points on the feature map.
● Each anchor box represents a possible object location.
● At every point on the feature map, anchor boxes are
generated with different sizes (scales) and aspect ratios, as
sketched below.
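A minimal sketch of anchor generation at a single feature-map location (the scales and aspect ratios here are illustrative, not the exact configuration of any particular implementation):

import torch

scales = torch.tensor([128.0, 256.0, 512.0])  # anchor sizes in pixels
ratios = torch.tensor([0.5, 1.0, 2.0])        # height/width aspect ratios

anchors = []
for s in scales:
    for r in ratios:
        # Keep the anchor's area close to s**2 while varying its shape
        w = s / r.sqrt()
        h = s * r.sqrt()
        # (x1, y1, x2, y2) centered at the origin; in practice these are
        # shifted to every location on the feature map
        anchors.append(torch.tensor([-w / 2, -h / 2, w / 2, h / 2]))

anchors = torch.stack(anchors)
print(anchors.shape)  # torch.Size([9, 4]) -- 3 scales x 3 ratios per location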
Classification of Anchors:
● The RPN predicts whether each anchor box is
background (no object) or foreground (contains an
object).
● Positive (foreground) anchors: Boxes with high
overlap (IoU) with a ground-truth object.
● Negative (background) anchors: Boxes with little
or no overlap with any object. A sketch of this labeling
follows below.
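Here is a small sketch of that labeling using torchvision.ops.box_iou, with the 0.7/0.3 IoU thresholds from the Faster R-CNN paper (the boxes themselves are made up for illustration):

import torch
from torchvision.ops import box_iou

# Illustrative anchors and one ground-truth box, in (x1, y1, x2, y2) format
anchors = torch.tensor([[0., 0., 100., 100.],
                        [90., 90., 200., 200.],
                        [10., 10., 110., 110.]])
gt_boxes = torch.tensor([[5., 5., 105., 105.]])

iou = box_iou(anchors, gt_boxes).squeeze(1)  # IoU of each anchor with the GT box

# IoU > 0.7 -> foreground, IoU < 0.3 -> background, in between -> ignored
labels = torch.full_like(iou, -1)  # -1 = ignored during training
labels[iou > 0.7] = 1              # positive (foreground)
labels[iou < 0.3] = 0              # negative (background)
print(labels)  # tensor([1., 0., 1.])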
Bounding Box Refinement:
● The RPN also refines the anchor boxes to better align
them with the actual objects by predicting offsets
(adjustments), which are decoded as in the sketch below.
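The offsets follow the standard box parameterization from the R-CNN family of papers; a minimal decode step might look like this:

import torch

def decode_offsets(anchors, deltas):
    # anchors: (N, 4) boxes as (x1, y1, x2, y2); deltas: (N, 4) as (dx, dy, dw, dh)
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa
    ya = anchors[:, 1] + 0.5 * ha

    # dx/dy shift the center relative to the anchor size;
    # dw/dh scale the width/height in log space
    x = xa + deltas[:, 0] * wa
    y = ya + deltas[:, 1] * ha
    w = wa * torch.exp(deltas[:, 2])
    h = ha * torch.exp(deltas[:, 3])

    return torch.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], dim=1)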
Loss functions:
I) Classification loss: Helps the model decide if the anchor is
background or foreground.
II) Regression loss: Helps adjust the anchor boxes to fit the
objects more precisely. A sketch combining both follows below.
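A hedged sketch of how the two losses might be combined (torchvision's internals differ in the details, e.g. how anchors are sampled):

import torch.nn.functional as F

def rpn_loss(objectness, labels, pred_deltas, target_deltas):
    # Classification: is each sampled anchor foreground (1) or background (0)?
    # Anchors labeled -1 (ignored) are assumed to be filtered out beforehand.
    cls_loss = F.binary_cross_entropy_with_logits(objectness, labels.float())

    # Regression: only positive (foreground) anchors contribute to the box loss
    pos = labels == 1
    reg_loss = F.smooth_l1_loss(pred_deltas[pos], target_deltas[pos])

    return cls_loss + reg_loss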
Stage 2: Object Classification and Box
Refinement:
Region Proposals:
● After RPN, we get region proposals (refined boxes
that likely contain objects).
ROI Pooling:
● The region proposals have different sizes, but the
classification head needs fixed-size inputs.
● ROI Pooling resizes all region proposals to a fixed size
by dividing each one into a grid of smaller regions and
applying pooling within each cell, making them uniform
(see the sketch below).
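torchvision exposes this operation directly as torchvision.ops.roi_pool (modern implementations usually use the closely related roi_align). A small sketch with a fake feature map:

import torch
from torchvision.ops import roi_pool

# Fake feature map: batch of 1, 256 channels, 50x50 spatial grid
feature_map = torch.randn(1, 256, 50, 50)

# Proposals as (batch_index, x1, y1, x2, y2), in image coordinates
proposals = torch.tensor([[0., 10., 10., 200., 150.],
                          [0., 50., 60., 300., 360.]])

# spatial_scale maps image coordinates to feature-map coordinates
# (1/8 here, assuming the backbone downsamples the image by a factor of 8)
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 8)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- every proposal is now uniform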
Object Classification:
● Each region proposal is passed through a small network
to predict the category (e.g., dog, car, etc.) of the object
inside it.
● Cross-entropy loss is used to classify the objects into
categories.
Bounding Box Refinement (Again):
● The region proposals are refined again to better match
the actual objects, using offsets.
● This uses regression loss to adjust the proposals.
Multi-task Learning:
● The network in stage 2 learns to predict object
categories and refine bounding boxes at the same time,
combining both losses as sketched below.
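A sketch of the stage-2 multi-task loss, assuming per-proposal class logits, box deltas, and a foreground mask are already available:

import torch.nn.functional as F

def roi_head_loss(class_logits, class_targets, pred_deltas, target_deltas, fg_mask):
    # Multi-class classification over (dataset classes + background)
    cls_loss = F.cross_entropy(class_logits, class_targets)

    # Box refinement, computed only on foreground proposals
    box_loss = F.smooth_l1_loss(pred_deltas[fg_mask], target_deltas[fg_mask])

    # Both objectives are optimized jointly (multi-task learning)
    return cls_loss + box_loss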
Inference (Testing/Prediction Time):
● Top Region Proposals: During testing, the model
generates a large number of region proposals, but only
the top proposals (those with the highest objectness
scores) are passed to the second stage.
● Final Predictions: The second stage predicts the final
categories and bounding boxes.
● Non-Max Suppression: A technique called
Non-Max Suppression (NMS) is applied to remove
duplicate or overlapping boxes, keeping only the best
ones (see the sketch below).
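NMS is available out of the box as torchvision.ops.nms; a tiny example with made-up detections:

import torch
from torchvision.ops import nms

# Two overlapping detections of the same object, plus one separate detection
boxes = torch.tensor([[10., 10., 110., 110.],
                      [12., 12., 112., 112.],
                      [300., 300., 400., 400.]])
scores = torch.tensor([0.95, 0.80, 0.90])

# Keep only the highest-scoring box among boxes that overlap with IoU > 0.5
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]) -- the second box is suppressed by the first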
Training:
Two ways to train:
1. Train in stages: First, train the region proposal
network (RPN) and then the classifier and regressor.
2. Train together: Train both stages at the same time
(faster and more efficient; the implementation below trains
this way, since the summed loss covers both stages).
Implement and Fine-Tune Faster R-CNN in
PyTorch
Step 1: Install Required Libraries
pip install torch torchvision
Step 2: Import Required Modules
import torch
from torch.utils.data import DataLoader
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.datasets import ImageFolder
from torchvision import transforms
import torchvision.transforms as T
from torchvision.models.detection.faster_rcnn import
FastRCNNPredictor
Step 3: Load Pre-trained Faster R-CNN Model
PyTorch’s torchvision provides a Faster R-CNN model
pre-trained on COCO. You can modify this for your own dataset
by changing the number of classes in the final layer.
# Load the pre-trained Faster R-CNN model with a ResNet-50 backbone
# (newer torchvision versions use weights="DEFAULT" instead of pretrained=True)
model = fasterrcnn_resnet50_fpn(pretrained=True)

# Number of classes (your dataset classes + 1 for background)
num_classes = 3  # For example, 2 classes + background

# Get the number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features

# Replace the head of the model with a new one
# (sized for the number of classes in your dataset)
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
Step 4: Prepare the Dataset
● Faster R-CNN requires images and corresponding
annotations (bounding boxes and labels).
● Your dataset should return images and target
dictionaries that include bounding boxes (boxes) and
labels (labels).
Create a custom Dataset class if necessary. You can use
torchvision.datasets.ImageFolder with bounding boxes provided in
annotation files, or write a custom Dataset class like the one below.
# Define transformations (e.g., resizing, normalization)
transform = T.Compose([
    T.ToTensor(),
])

# Custom Dataset class or using an existing one
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, transforms=None):
        # Initialize dataset paths and annotations here
        self.transforms = transforms
        # Your dataset logic (image paths, annotations, etc.)

    def __getitem__(self, idx):
        # Load image
        img = ...  # Load your image here

        # Load corresponding bounding boxes and labels
        boxes = ...   # Load or define bounding boxes
        labels = ...  # Load or define labels

        # Create a target dictionary
        target = {}
        target["boxes"] = torch.tensor(boxes, dtype=torch.float32)
        target["labels"] = torch.tensor(labels, dtype=torch.int64)

        # Apply transforms
        if self.transforms is not None:
            img = self.transforms(img)

        return img, target

    def __len__(self):
        # Return the length of your dataset
        return len(self.data)
Step 5: Set Up Data Loader
# Load dataset
dataset = CustomDataset(transforms=transform)

# Split into train and validation sets
indices = torch.randperm(len(dataset)).tolist()
train_dataset = torch.utils.data.Subset(dataset, indices[:-50])
valid_dataset = torch.utils.data.Subset(dataset, indices[-50:])

# Detection batches hold images and targets of varying sizes, so the collate
# function packs each batch as tuples of lists instead of stacking tensors
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True,
                          collate_fn=lambda x: tuple(zip(*x)))
valid_loader = DataLoader(valid_dataset, batch_size=4, shuffle=False,
                          collate_fn=lambda x: tuple(zip(*x)))
Step 6: Set Up Training Loop
Now set up the optimizer and training loop. For Faster R-CNN,
it’s common to use SGD or Adam as the optimizer.
# Move model to GPU if available
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# Set up the optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9,
                            weight_decay=0.0005)

# Learning rate scheduler
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3,
                                               gamma=0.1)
# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0

    # Training loop
    for images, targets in train_loader:
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass: in training mode, the model returns a dict of losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        # Backward pass
        losses.backward()
        optimizer.step()

        train_loss += losses.item()

    # Update the learning rate
    lr_scheduler.step()
    print(f'Epoch: {epoch + 1}, Loss: {train_loss / len(train_loader)}')

print("Training complete!")
Step 7: Evaluate the Model
After training, you can evaluate the model on the validation set
or use it for inference on new images.
# Set the model to evaluation mode
model.eval()

# Run the model on the validation set
with torch.no_grad():
    for images, targets in valid_loader:
        images = list(img.to(device) for img in images)
        predictions = model(images)

        # Example: print the bounding boxes and labels for the first image
        print(predictions[0]['boxes'])
        print(predictions[0]['labels'])
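For a quantitative evaluation you would typically compute COCO-style mAP. One hedged option is the third-party torchmetrics package (pip install torchmetrics; not used elsewhere in this tutorial):

from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision()

model.eval()
with torch.no_grad():
    for images, targets in valid_loader:
        images = list(img.to(device) for img in images)
        preds = model(images)

        # torchmetrics expects predictions and targets on the same device (CPU here)
        preds = [{k: v.cpu() for k, v in p.items()} for p in preds]
        metric.update(preds, list(targets))

print(metric.compute())  # dict with 'map', 'map_50', and related metrics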
Step 8: Inference
To run inference on a new image:
from PIL import Image

# Load image (convert to RGB in case the file has an alpha channel)
img = Image.open("path/to/your/image.jpg").convert("RGB")

# Apply the same transformation as for training
img = transform(img).to(device)

# Model prediction: the model expects a list of 3D (C, H, W) tensors,
# so the image is passed inside a list rather than batched with unsqueeze
model.eval()
with torch.no_grad():
    prediction = model([img])

# Print the predicted bounding boxes and labels
print(prediction[0]['boxes'])
print(prediction[0]['labels'])
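To visualize the result, torchvision.utils.draw_bounding_boxes can render the predicted boxes; the 0.5 score threshold below is an arbitrary choice:

import torch
import torchvision.transforms as T
from torchvision.utils import draw_bounding_boxes

# Keep only confident detections
keep = prediction[0]['scores'] > 0.5
boxes = prediction[0]['boxes'][keep].cpu()

# draw_bounding_boxes expects a uint8 (C, H, W) image
img_uint8 = (img.cpu() * 255).to(torch.uint8)
drawn = draw_bounding_boxes(img_uint8, boxes, colors="red", width=3)

# Convert back to PIL to save or display
T.ToPILImage()(drawn).save("prediction.jpg")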