Entity Match Platform

A production-ready data pipeline for entity matching, web data enrichment, and analytics visualization using Apache Airflow, PySpark, and Docker.

🎯 Overview

This platform enables organizations to:

  • Match entities across different data sources using fuzzy matching algorithms
  • Enrich data by scraping external sources for additional information
  • Visualize results through an interactive analytics dashboard
  • Track matching quality with comprehensive metrics and tracing

Perfect for:

  • Customer data deduplication
  • Third-party data integration
  • Entity resolution across databases
  • Data quality improvement initiatives
  • Master data management (MDM)

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Source A      β”‚ (Internal Database)
β”‚  (Primary Data) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Ί β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚            β”‚  Data Preparation   β”‚
         β”‚            β”‚  - Normalization    β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  - Deduplication    β”‚
β”‚   Source B      │─►│  - Web Enrichment   β”‚
β”‚ (External Data) β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚
                                β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Entity Matching    β”‚
                    β”‚  - Exact Matching    β”‚
                    β”‚  - LSH Fuzzy Match   β”‚
                    β”‚  - Multi-stage       β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚    Visualization     β”‚
                    β”‚  - Match Analytics   β”‚
                    β”‚  - Quality Metrics   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✨ Features

1. Multi-Stage Matching Engine

  • Exact Matching: Fast matching on unique identifiers (tax IDs, registration numbers)
  • Fuzzy Matching: LSH-based similarity matching for name fields
  • Weighted Scoring: Configurable weights for different field importance
  • Deduplication: Automatic removal of duplicate matches
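
The weighted-scoring idea can be illustrated with a minimal stdlib sketch (using Python's `difflib` for string similarity rather than the platform's Spark LSH pipeline; the weights mirror the matching configuration shown later in this README):

```python
from difflib import SequenceMatcher

# Illustrative field weights (mirroring MatchingConfig in src/matching/matcher.py)
WEIGHTS = {"name": 0.45, "registration_id": 0.25, "tax_id": 0.25, "other": 0.05}

def field_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; identical identifiers score 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(record_a: dict, record_b: dict) -> float:
    """Combine per-field similarities into one weighted match score."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if field in record_a and field in record_b:
            score += weight * field_similarity(record_a[field], record_b[field])
    return score

a = {"name": "Acme Corp", "tax_id": "TX123456"}
b = {"name": "Acme Corporation", "tax_id": "TX123456"}
print(weighted_score(a, b))  # high score: identical tax_id, similar names
```

A pair is kept when its score clears the configured `similarity_threshold`.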

2. Web Data Enrichment

  • Configurable web scraping for external data sources
  • Intelligent search strategies (by ID, name, etc.)
  • Automatic data validation and cleaning
  • Rate limiting and error handling
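
Rate limiting and retry-on-error can be sketched like this (a stdlib illustration, not the platform's actual scraper code; `fetch` stands in for whatever HTTP call the enricher makes):

```python
import time

class RateLimiter:
    """Enforce a minimum delay between successive calls (e.g. scraper requests)."""
    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last_call = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval_s:
            time.sleep(self.min_interval_s - elapsed)
        self._last_call = time.monotonic()

def fetch_with_retries(fetch, url: str, limiter: RateLimiter, retries: int = 3):
    """Call fetch(url) with rate limiting; retry on transient I/O errors."""
    for attempt in range(retries):
        limiter.wait()
        try:
            return fetch(url)
        except IOError:
            if attempt == retries - 1:
                raise
```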

3. Production-Ready Pipeline

  • Docker containerization for consistent deployment
  • Apache Airflow for workflow orchestration
  • PySpark for distributed data processing
  • Incremental processing to avoid reprocessing
  • Tracing & auditing of all matching attempts
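
Incremental processing can be approximated by persisting the set of already-processed entity IDs between runs (a sketch of the idea; the pipeline's actual state-tracking mechanism may differ):

```python
import json
from pathlib import Path

def load_processed_ids(state_file: Path) -> set:
    """Read IDs handled by previous runs (empty set on the first run)."""
    if state_file.exists():
        return set(json.loads(state_file.read_text()))
    return set()

def filter_new_records(records, state_file: Path, id_field: str = "entity_id"):
    """Keep only records not seen before, then persist the updated ID set."""
    seen = load_processed_ids(state_file)
    new = [r for r in records if r[id_field] not in seen]
    seen.update(r[id_field] for r in new)
    state_file.write_text(json.dumps(sorted(seen)))
    return new
```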

4. Interactive Dashboard

  • Real-time matching statistics
  • Data quality metrics
  • Field-level mismatch analysis
  • Tracing analytics with attempt history

πŸš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • 8GB+ RAM recommended
  • Python 3.9+

Installation

  1. Clone the repository
git clone https://github.com/Chaimaaorg/entity-match-platform.git
cd entity-match-platform
  2. Configure environment
cp .env.example .env
# Edit .env with your settings
  3. Update the configuration in config/config.ini:
[dev]
db = local_db
source_a_main = /app/data/source_a/entities.csv
source_b_main = /app/data/source_b/entities.csv
  4. Start the platform
docker-compose up -d
  5. Access services
  • Airflow UI: http://localhost:8080 (user: admin, password: admin)
  • Dashboard: open visualization/analytics_dashboard.html in a browser

Running the Pipeline

  1. Prepare your data

    • Place Source A data in data/source_a/
    • Place Source B data in data/source_b/
    • Ensure CSV files have proper headers
  2. Trigger the DAG

    • Open Airflow UI
    • Enable the entity_matching_pipeline DAG
    • Trigger manually or wait for scheduled run
  3. View results

    • Matched records: data/matched/results.csv
    • Processing logs: Check Airflow task logs
    • Analytics: Load CSV files into the dashboard

πŸ“Š Data Format

Source A (Primary Dataset)

entity_id,name,tax_id,registration_number,city
001,Acme Corp,TX123456,REG789,New York
002,Beta LLC,TX234567,REG890,Los Angeles

Source B (Secondary Dataset)

entity_id,name,tax_id,registration_number,activity
B001,Acme Corporation,TX123456,REG789,Manufacturing
B002,Beta Limited,TX234567,REG890,Retail

Required Fields

  • Unique identifier (entity_id, company_id, etc.)
  • Name field (company name, organization name, etc.)
  • Optional identifiers (tax_id, registration_number, etc.)
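
Using the sample rows above, the exact-matching stage on a unique identifier can be sketched in plain Python (illustrative only; the platform performs this join in PySpark):

```python
import csv, io

def exact_match(rows_a, rows_b, key="tax_id"):
    """Index Source B by a unique identifier, then join Source A against it."""
    index = {r[key]: r for r in rows_b if r.get(key)}
    return [(a, index[a[key]]) for a in rows_a if a.get(key) in index]

source_a = list(csv.DictReader(io.StringIO(
    "entity_id,name,tax_id\n001,Acme Corp,TX123456\n002,Beta LLC,TX234567\n")))
source_b = list(csv.DictReader(io.StringIO(
    "entity_id,name,tax_id\nB001,Acme Corporation,TX123456\n")))

pairs = exact_match(source_a, source_b)
print(len(pairs))  # 1 exact match on tax_id
```

Records that fail the exact stage fall through to the LSH fuzzy-matching stage.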

βš™οΈ Configuration

Matching Parameters (src/matching/matcher.py)

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class MatchingConfig:
    lsh_num_hash_tables: int = 3
    lsh_distance_threshold: float = 0.5
    similarity_threshold: float = 0.65
    max_candidates_per_record: int = 10

    # Mutable defaults require default_factory in a dataclass
    weights: Dict[str, float] = field(default_factory=lambda: {
        "name": 0.45,
        "registration_id": 0.25,
        "tax_id": 0.25,
        "other": 0.05,
    })

Web Enrichment (src/enrichment/web_scraper.py)

enricher = EntityEnricher(
    base_url="https://your-data-source.com/search",
    search_params_mapping={
        "tax_id": "tax_param",
        "registration_id": "reg_param",
        "name": "name_param"
    }
)
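
The `search_params_mapping` translates local field names into the remote site's query parameters. How the enricher might build a search URL from it can be sketched as follows (an assumption about the internals, stdlib only):

```python
from urllib.parse import urlencode

def build_search_url(base_url: str, params_mapping: dict, entity: dict) -> str:
    """Map entity fields to the remote site's query-parameter names."""
    params = {remote: entity[local]
              for local, remote in params_mapping.items() if entity.get(local)}
    return f"{base_url}?{urlencode(params)}"

mapping = {"tax_id": "tax_param", "name": "name_param"}
url = build_search_url("https://your-data-source.com/search", mapping,
                       {"tax_id": "TX123456", "name": "Acme Corp"})
print(url)
```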

πŸ“ Project Structure

entity-match-platform/
β”œβ”€β”€ airflow/                 # Airflow DAGs and configuration
β”‚   β”œβ”€β”€ dags/
β”‚   β”‚   └── entity_matching_pipeline.py
β”‚   └── Dockerfile
β”œβ”€β”€ config/                  # Configuration files
β”‚   └── config.ini
β”œβ”€β”€ src/                     # Source code
β”‚   β”œβ”€β”€ enrichment/          # Web scraping modules
β”‚   β”‚   └── web_scraper.py
β”‚   └── matching/            # Matching engine
β”‚       β”œβ”€β”€ data_preparation.py
β”‚       β”œβ”€β”€ matcher.py
β”‚       └── utils.py
β”œβ”€β”€ data/                    # Data storage
β”‚   β”œβ”€β”€ source_a/            # Primary dataset
β”‚   β”œβ”€β”€ source_b/            # Secondary dataset
β”‚   β”œβ”€β”€ processed/           # Cleaned data
β”‚   └── matched/             # Matching results
β”œβ”€β”€ visualization/           # Analytics dashboard
β”‚   └── analytics_dashboard.html
└── tests/                   # Unit tests

πŸ”§ Customization

Adding New Data Sources

  1. Update config/config.ini with new paths
  2. Create data loaders in src/matching/data_preparation.py
  3. Adjust field mappings in matching configuration
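
For step 1, a data loader might read the new path from the `[dev]` section with the stdlib `configparser` (a sketch; `source_c_main` is a hypothetical new source key, not one the project defines):

```python
import configparser

# Inline stand-in for config/config.ini with a hypothetical new source added
SAMPLE = """
[dev]
db = local_db
source_a_main = /app/data/source_a/entities.csv
source_c_main = /app/data/source_c/entities.csv
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)
path = config.get("dev", "source_c_main")
print(path)  # /app/data/source_c/entities.csv
```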

Custom Matching Logic

Override methods in src/matching/matcher.py:

class CustomMatcher(LSHMatcher):
    def _build_feature_pipeline(self, company_col: str):
        # Custom text processing logic
        pass

Web Scraping Configuration

Implement scrape_entities_from_html() for your target website:

from bs4 import BeautifulSoup

def scrape_entities_from_html(self, html: str):
    """Parse HTML specific to your data source and extract entity records."""
    soup = BeautifulSoup(html, "html.parser")
    entities = []
    # Extract entity information from the parsed tree, e.g.:
    # for row in soup.select("table.results tr"): ...
    return entities

🀝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

πŸ“„ License

MIT License - see LICENSE file for details
