This repository contains all the code, configuration files, and documentation for the Invoice Annotation Project. The objective is to build an end-to-end workflow for processing a sample invoice using OCR, refining annotations with Label Studio, and preparing a dataset for AI training.
- Overview
- Project Workflow
- Requirements
- Setup and Installation
- Step-by-Step Instructions
- License
- Contact
The project demonstrates:
- Extraction of text from a scanned invoice using Tesseract OCR.
- Creation of a custom XML configuration for Label Studio.
- Refinement of OCR output and manual labeling of invoice fields.
- Compilation of a final annotated dataset and a reflection on the process.
- Document Analysis & Label Identification: Identify key invoice fields.
- XML Label Configuration: Create an XML file to define labels for annotation.
- Generate OCR Data: Use a Colab Notebook to extract text and bounding boxes.
- Label Refinement in Label Studio: Import and refine annotations.
- Reflection and Final Report: Write a reflection on your process and improvements.
- Python 3.7+
- Tesseract OCR (installed and in your PATH)
- Google Colab account (for running the OCR notebook)
- Label Studio (installed locally)
- Clone the Repository:
git clone https://github.com/yourusername/Invoice-Annotation-Project.git
cd Invoice-Annotation-Project
- Install Label Studio:
pip install label-studio
label-studio start
- Set Up Tesseract OCR: Follow instructions for your OS to install Tesseract OCR.
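Once the setup steps above are done, a quick check can confirm that the tools are reachable. The sketch below is only an illustration: the helper name check_environment is hypothetical, and it assumes the executables are called tesseract and label-studio, as in the commands above.

```python
import shutil
import sys


def check_environment(min_python=(3, 7), tools=("tesseract", "label-studio")):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A MISSING entry for tesseract usually means the install directory has not been added to PATH yet.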
Task:
Review the sample invoice image and identify key fields.
Deliverable:
A bullet list of fields saved in document_analysis.txt.
Example:
- Invoice Number
- Invoice Date
- Customer Name
- Item Descriptions
- Quantity, Unit Price
- Tax
- Total Amount
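If you prefer to keep the field names in one place, the list can be written to document_analysis.txt from a short script. This is a minimal sketch, not part of the repository: INVOICE_FIELDS and write_field_list are hypothetical names, and the fields are copied from the example list above.

```python
from pathlib import Path

# Field names copied from the example list above.
INVOICE_FIELDS = [
    "Invoice Number",
    "Invoice Date",
    "Customer Name",
    "Item Descriptions",
    "Quantity, Unit Price",
    "Tax",
    "Total Amount",
]


def write_field_list(fields, path="document_analysis.txt"):
    """Write one '- field' bullet per line and return the text that was written."""
    text = "\n".join(f"- {field}" for field in fields) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text
```

Calling write_field_list(INVOICE_FIELDS) creates the file in the current directory.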
Task:
Create an XML configuration file for Label Studio to define the labels for annotation.
Deliverable:
Save the following as invoice_label_config.xml.
Example:
<View>
  <Image name="image" value="$image" zoom="true" />
  <RectangleLabels name="label" toName="image">
    <Label value="Invoice Number" background="blue" />
    <Label value="Invoice Date" background="green" />
    <Label value="Customer Name" background="yellow" />
    <Label value="Item Descriptions" background="orange" />
    <Label value="Quantity, Unit Price" background="red" />
    <Label value="Tax" background="pink" />
    <Label value="Total Amount" background="purple" />
  </RectangleLabels>
  <TextArea name="transcription" toName="image" editable="true" perRegion="true"
            rows="3" placeholder="Enter recognized text here" />
</View>
Task:
Run the provided Google Colab Notebook to extract text and bounding boxes from the sample invoice.
Instructions:
- Open the Pre-Configured Colab Notebook.
- Update the image_url variable to: https://allies-assets.s3.us-east-1.amazonaws.com/birthplan_builder_assets/extra_images/invoice_sample.png
- Click Runtime > Run all to execute all cells.
- When prompted (or via the Files sidebar), download the generated file as invoice_ocr_output.json.
- Deliverable: invoice_ocr_output.json
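The notebook's code is not reproduced in this README, so the following is only a sketch of the kind of conversion such a notebook might perform. It assumes the pytesseract and pillow packages and a local copy of the image. The function names ocr_words_to_boxes and run_ocr are hypothetical, and the output is an intermediate list of word boxes, not necessarily the exact schema of invoice_ocr_output.json. Coordinates are converted to percentages of the image size, which is how Label Studio expresses rectangle positions.

```python
def ocr_words_to_boxes(data, image_width, image_height, min_conf=0.0):
    """Convert pytesseract image_to_data output (dict form) into word boxes
    whose coordinates are percentages of the image size."""
    boxes = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        conf = float(data["conf"][i])
        # Tesseract reports conf = -1 for non-word rows (pages, blocks, lines).
        if not text or conf < min_conf:
            continue
        boxes.append({
            "text": text,
            "conf": conf,
            "x": 100.0 * data["left"][i] / image_width,
            "y": 100.0 * data["top"][i] / image_height,
            "width": 100.0 * data["width"][i] / image_width,
            "height": 100.0 * data["height"][i] / image_height,
        })
    return boxes


def run_ocr(image_path):
    """Run Tesseract on a local image file and return percentage-based word boxes."""
    # Imported lazily so the converter above works without these packages.
    from PIL import Image
    import pytesseract

    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return ocr_words_to_boxes(data, image.width, image.height)
```

Raising min_conf drops low-confidence words before they reach Label Studio, at the cost of possibly losing faint but valid text.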
Task: Import the OCR JSON file into Label Studio, refine the bounding boxes, correct OCR text, and assign the correct labels.
Instructions:
- Launch Label Studio at http://localhost:8080 and create a new project named "Invoice Annotation".
- In the project settings, paste the XML configuration from Step 2 (invoice_label_config.xml) into the labeling configuration and click Save.
- Import the invoice_ocr_output.json file into your project.
- Open the task, adjust bounding boxes, delete any irrelevant annotations, and manually correct any OCR errors.
- Once satisfied, export the final labeled dataset as label_studio_output.json.
- Deliverable: label_studio_output.json
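Before submitting, it can help to confirm that the configuration is well-formed XML and that the export only uses labels it declares. The sketch below assumes the export follows Label Studio's JSON format, that is, a list of tasks, each with an annotations list whose result entries carry a type and a value. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def config_labels(config_xml):
    """Return the label values declared in a Label Studio XML config string."""
    root = ET.fromstring(config_xml)  # raises ParseError if the XML is not well-formed
    return [label.get("value") for label in root.iter("Label")]


def exported_label_counts(tasks):
    """Count rectangle labels across all annotations in an exported task list."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                if region.get("type") == "rectanglelabels":
                    counts.update(region["value"].get("rectanglelabels", []))
    return counts


def compare(config_xml, tasks):
    """Return (declared labels never used, used labels not declared in the config)."""
    declared = set(config_labels(config_xml))
    used = set(exported_label_counts(tasks))
    return sorted(declared - used), sorted(used - declared)
```

To run it on the deliverables, read invoice_label_config.xml as text, load label_studio_output.json with json.load, and pass both to compare. An unused label may simply mean the field does not appear on the sample invoice.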
Task: Write a reflection (150–200 words) discussing:
- Label Selection: Why you chose the specific labels.
- Challenges: What issues you encountered during OCR and annotation, and how you addressed them.
- Workflow Improvements: Suggestions for streamlining the process in the future.
- Deliverable: Save your reflection in reflection.md.
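The 150–200 word target can be checked with a few lines of Python. This is a rough sketch with hypothetical function names. It counts whitespace-separated tokens, so Markdown symbols such as a standalone "-" or "#" are counted as words.

```python
from pathlib import Path


def word_count(text):
    """Count whitespace-separated words in a string."""
    return len(text.split())


def check_reflection(path="reflection.md", low=150, high=200):
    """Return (count, within_range) for the reflection file."""
    count = word_count(Path(path).read_text(encoding="utf-8"))
    return count, low <= count <= high
```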
This project is licensed under the MIT License.
For any questions, please contact omoiza@ttu.edu.