This project provides a set of Python scripts for extracting text from PDF files and combining the extracted content into a single text file. It is designed to process multiple PDF files across various categories of legal acts, including Biosafety, Climate, Energy Laws, Environment, Fisheries, Forestry, Land, Mining, Water, and Wildlife.
The repository is organized into the following structure:
.
├── Acts
│ ├── Biosafety
│ ├── Climate
│ ├── EnergyLaws
│ ├── Environment
│ ├── Fisheries
│ ├── Forestry
│ ├── Land
│ ├── Mining
│ ├── water
│ ├── Wildlife
│ └── download.py
├── combined.py
└── test.py
- combined.py: The main script for extracting text from PDFs and combining them.
- Acts/download.py: A script for downloading PDF files from a specific webpage.
- Acts/[Category]/combined.py: Category-specific scripts for text extraction and combination.
- Ensure you have Python 3.6 or later installed.
- Install the required libraries:
    pip install PyPDF2 requests beautifulsoup4
- To download PDF files:
  - Navigate to the Acts directory.
  - Run the download.py script:
        python download.py
    This downloads the PDF files to a forestry folder in the root directory.
- To extract and combine text from PDFs:
  - Navigate to the directory containing the PDF files you want to process.
  - Run the appropriate combined.py script:
        python combined.py
    This creates a combined text file in the same directory.
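To give a rough picture of what the download step involves, here is a minimal sketch. It is not the code in Acts/download.py: the names find_pdf_links and download_pdfs are hypothetical, and link extraction uses the standard library's html.parser in place of BeautifulSoup so that the part which finds links has no third-party dependency. Only the actual download relies on requests.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href="..."> links that end in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return all PDF links found in an HTML string (hypothetical helper)."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


def download_pdfs(page_url, target_dir):
    """Download every PDF linked from page_url into target_dir (assumed behaviour)."""
    import requests  # imported here so find_pdf_links works without it

    os.makedirs(target_dir, exist_ok=True)
    page = requests.get(page_url, timeout=30)
    page.raise_for_status()
    for link in find_pdf_links(page.text, page_url):
        response = requests.get(link, timeout=60)
        response.raise_for_status()
        filename = os.path.basename(urlparse(link).path)
        with open(os.path.join(target_dir, filename), "wb") as handle:
            handle.write(response.content)
```

Keeping link discovery separate from downloading makes it possible to check which files would be fetched before any network traffic happens.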
- Extracting text from all PDFs in the current directory:
    import os
    from combined import extract_text_from_pdfs_to_single_file

    folder_path = os.getcwd()
    output_file = os.path.join(folder_path, "combined_output.txt")
    extract_text_from_pdfs_to_single_file(folder_path, output_file)
- Processing PDFs from a specific category:
    import os
    from combined import extract_text_from_pdfs_to_single_file

    # The relative path assumes this runs from the repository root.
    category = "Biosafety"
    folder_path = os.path.join("Acts", category)
    output_file = os.path.join(folder_path, f"combined_{category.lower()}.txt")
    extract_text_from_pdfs_to_single_file(folder_path, output_file)
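The same pattern can be extended to every category in one run. The sketch below only builds the folder and output paths; the helper name category_jobs is hypothetical, and the category names are taken from the directory listing above (including the lowercase water folder).

```python
import os

# Folder names as they appear under Acts/ in the repository layout.
CATEGORIES = [
    "Biosafety", "Climate", "EnergyLaws", "Environment", "Fisheries",
    "Forestry", "Land", "Mining", "water", "Wildlife",
]


def category_jobs(acts_root="Acts", categories=CATEGORIES):
    """Return (folder_path, output_file) pairs, one per category."""
    jobs = []
    for category in categories:
        folder_path = os.path.join(acts_root, category)
        output_file = os.path.join(folder_path, f"combined_{category.lower()}.txt")
        jobs.append((folder_path, output_file))
    return jobs
```

Each pair can then be passed to extract_text_from_pdfs_to_single_file(folder_path, output_file) in a loop, assuming the script is run from the repository root.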
- If you encounter PyPDF2.errors.PdfReadError, ensure the PDF file is not corrupted or password-protected.
- For UnicodeDecodeError, try specifying the correct encoding when opening the output file.
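As an illustration of the encoding advice, the sketch below opens the output file with an explicit encoding for both writing and reading. The helper names are hypothetical, and UTF-8 is an assumption; the right encoding depends on how the file was actually written. Passing errors="replace" substitutes a placeholder for any character that cannot be encoded or decoded instead of raising an exception.

```python
def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding rather than the platform default."""
    with open(path, "w", encoding=encoding, errors="replace") as handle:
        handle.write(text)


def read_text(path, encoding="utf-8"):
    """Read text back using the same explicit encoding."""
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read()
```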
To enable verbose logging:
- Add the following import at the beginning of the script:
    import logging
- Set the logging level to DEBUG:
    logging.basicConfig(level=logging.DEBUG)
- Add log statements in the extract_text_from_pdfs_to_single_file function:
    logging.debug(f"Processing file: {filename}")
Log messages are written to the console by default. To save them to a file instead, modify the logging.basicConfig call:
    logging.basicConfig(filename='pdf_extraction.log', level=logging.DEBUG)
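If both console and file output are wanted at the same time, one option is to attach two handlers to a named logger. This is a sketch rather than part of the existing scripts; the logger name pdf_extraction and the function configure_logging are assumptions made for the example.

```python
import logging


def configure_logging(log_file=None, level=logging.DEBUG):
    """Return a logger that writes to the console and, optionally, to a file."""
    logger = logging.getLogger("pdf_extraction")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate output if called more than once

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file:
        file_handler = logging.FileHandler(log_file, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger
```

A call such as configure_logging("pdf_extraction.log").debug("Processing file: %s", filename) would then appear in both places.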
The data flow in this application follows these steps:
- PDF files are downloaded from a specified webpage using Acts/download.py.
- The combined.py script (or category-specific scripts) reads PDF files from a specified folder.
- Each PDF file is processed using PyPDF2 to extract text content.
- Extracted text is written to a single output file, with each PDF's content separated by headers.
- Any errors during processing are logged, and problematic files are skipped.
[Web Source] -> [download.py] -> [PDF Files] -> [combined.py] -> [Extracted Text] -> [Combined Output File]
Note: Error handling is implemented to ensure the process continues even if individual files fail to process.
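The extraction and error-handling steps described above can be sketched as follows. This is not the actual implementation in combined.py: the names read_pdf_text and combine_pdfs, the header format, and the returned lists are assumptions made for illustration. The PDF reader is passed in as a parameter so the skip-on-error behaviour can be exercised without real PDF files.

```python
import logging
import os


def read_pdf_text(pdf_path):
    """Extract the text of every page with PyPDF2 (imported lazily)."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return None or an empty string for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def combine_pdfs(folder_path, output_file, read_text=read_pdf_text):
    """Write the text of each PDF in folder_path to output_file.

    Returns (processed, skipped) lists of file names.
    """
    processed, skipped = [], []
    with open(output_file, "w", encoding="utf-8") as out:
        for filename in sorted(os.listdir(folder_path)):
            if not filename.lower().endswith(".pdf"):
                continue
            pdf_path = os.path.join(folder_path, filename)
            try:
                text = read_text(pdf_path)
            except Exception as exc:
                # Log the problem and move on so one bad file does not stop the run.
                logging.error("Skipping %s: %s", filename, exc)
                skipped.append(filename)
                continue
            # Header line separating each PDF's content (format is an assumption).
            out.write(f"===== {filename} =====\n{text}\n\n")
            processed.append(filename)
    return processed, skipped
```

Returning the skipped file names makes it easy to review afterwards which PDFs need attention, for example because they are corrupted or password-protected.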