This project provides a PySpark pipeline for merchant transaction analysis, with every step set up to run in a Dockerized environment. The workflow covers data cleaning, deduplication, and a suite of business analyses, and produces reports and data exports.
merchant_analysis/
├── app/
│   ├── __init__.py
│   ├── cleaning.py
│   ├── download_data.py
│   ├── preprocess_data.py
│   ├── run_all.py
│   ├── utils.py
│   ├── README.md
│   ├── notebooks/
│   │   └── eda_basic.ipynb
│   └── analysis/
│       ├── __init__.py
│       ├── q1_top_merchants.py
│       ├── q2_avg_sale.py
│       ├── q3_top_hours.py
│       ├── q4_popular_merchants.py
│       └── q5_advice.py
├── data/
│   ├── in/
│   │   ├── .gitkeep
│   │   ├── data_dictionary.xlsx
│   │   ├── merchants-subset.csv
│   │   └── historical_transactions.parquet
│   ├── clean/
│   │   ├── .gitkeep
│   │   ├── merchants_cleaned.parquet/
│   │   └── historical_transactions_cleaned.parquet/
│   └── out/
│       └── .gitkeep
├── reports/
│   ├── .gitkeep
│   ├── q1_top_merchants.csv/
│   ├── q2_avg_sale.csv/
│   ├── q3_top_hours.csv/
│   ├── q4_popular_merchants.csv/
│   ├── q4_city_category_crosstab.csv/
│   ├── q5_top_cities.csv/
│   ├── q5_top_categories.csv/
│   ├── q5_interesting_months.csv/
│   ├── q5_top_hours.csv/
│   └── q5_advice.md
├── .gitignore
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
└── README.md
- Docker and Docker Compose
- (Optional) Python 3.8+ if running locally without Docker
- Clone the repository:
  git clone <repo-url>
  cd merchant_analysis
- Download the data:
  - Place the following files in data/in/: merchants-subset.csv, historical_transactions.parquet, data_dictionary.xlsx
  - (If not present, use the provided app/download_data.py or follow project instructions.)
- Build and start the Spark container:
  docker compose up
- Run the full pipeline:
  docker compose exec spark python app/run_all.py
  This will:
  - Clean and deduplicate all raw data (one-time ETL step)
  - Run all analysis scripts in sequence
  - Generate all reports in the reports/ directory
- Script: app/preprocess_data.py
- Loads raw data from data/in/
- Cleans (trims, standardizes, fills missing values, deduplicates)
- Saves cleaned data to data/clean/ as Parquet files: merchants_cleaned.parquet and historical_transactions_cleaned.parquet
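The cleaning steps listed above could look roughly like the following in PySpark. This is a minimal sketch, not the contents of app/preprocess_data.py: the function names `standardize_name` and `clean_dataframe`, the upper-casing rule, and the column arguments are all assumptions made for illustration.

```python
from typing import Dict, List


def standardize_name(name: str) -> str:
    """Lower-case a column name and replace spaces and hyphens with underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def clean_dataframe(df, string_cols: List[str], fill_values: Dict[str, object], key_cols: List[str]):
    """Trim, standardize, fill missing values, and deduplicate a Spark DataFrame.

    Hypothetical helper: the real preprocessing script may apply different rules.
    """
    # Imported here so the helper above stays usable without a Spark install.
    from pyspark.sql import functions as F

    # Standardize column names first so later steps can rely on them.
    for old in df.columns:
        df = df.withColumnRenamed(old, standardize_name(old))

    # Trim whitespace and normalize case in the chosen string columns.
    for col in string_cols:
        df = df.withColumn(col, F.upper(F.trim(F.col(col))))

    # Fill missing values, then drop rows that repeat the same key.
    return df.fillna(fill_values).dropDuplicates(key_cols)
```

A caller would read a raw file with `spark.read`, pass it through `clean_dataframe` with column names taken from the data dictionary, and write the result with `df.write.mode("overwrite").parquet(...)` into data/clean/.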
- Scripts in app/analysis/ use only the cleaned data.
- Each script produces a CSV (in a subfolder) in reports/ and prints a summary.
- Script: app/analysis/q5_advice.py
- Produces a Markdown report: reports/q5_advice.md
- Summarizes key findings and recommendations for new merchants.
- Cleaned Data: data/clean/merchants_cleaned.parquet and data/clean/historical_transactions_cleaned.parquet
- Analysis Results:
  - CSVs in reports/ (e.g., q1_top_merchants.csv/part-*.csv)
  - Markdown report: reports/q5_advice.md
- CSV Output Note: Each CSV output is in a subdirectory (e.g., reports/q1_top_merchants.csv/) and may be partitioned (look for part-*.csv).
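Because each report is a directory of part files rather than a single CSV, reading one back takes an extra step. The sketch below uses only the Python standard library; the helper name `read_spark_csv_dir` is hypothetical, and it assumes each part file was written with a header row.

```python
import csv
import glob
import os
from typing import Dict, List


def read_spark_csv_dir(directory: str) -> List[Dict[str, str]]:
    """Read every part-*.csv file in a Spark output directory into a list of rows."""
    rows: List[Dict[str, str]] = []
    # Sort the paths so the row order is stable across runs.
    for path in sorted(glob.glob(os.path.join(directory, "part-*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            # Assumption: every part file starts with its own header row.
            rows.extend(csv.DictReader(handle))
    return rows
```

For example, `read_spark_csv_dir("reports/q1_top_merchants.csv")` would return the rows of the first report as dictionaries keyed by column name, skipping marker files such as `_SUCCESS`.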
- Full pipeline:
  python app/run_all.py
  (Or use the Docker Compose command above.)
- Individual analysis:
  python -m app.analysis.q1_top_merchants
- CSV files: Contain tabular results for each analysis question (top merchants, average sales, top hours, etc.).
- Markdown report (q5_advice.md): Provides business recommendations, including:
  - Top cities and categories by sales
  - Seasonal/monthly trends
  - Most profitable hours
  - Installment profitability analysis
- Add new analysis scripts to app/analysis/ and include them in run_all.py.
- Adjust cleaning logic in app/preprocess_data.py as needed for new data sources.
- The pipeline is modular and ready for scaling or integration with additional Spark services.
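One way a runner can pick up a new analysis script is to keep the steps in a list and execute each one as if it had been started with `python -m`. The sketch below is an assumption about how that could be wired, not the actual contents of run_all.py; `ANALYSIS_MODULES` and `run_steps` are hypothetical names.

```python
import runpy
from typing import Iterable, List

# Hypothetical registry: the module paths follow the project layout, but the
# real run_all.py may list or invoke its steps differently.
ANALYSIS_MODULES = [
    "app.analysis.q1_top_merchants",
    "app.analysis.q2_avg_sale",
    "app.analysis.q3_top_hours",
    "app.analysis.q4_popular_merchants",
    "app.analysis.q5_advice",
]


def run_steps(module_names: Iterable[str]) -> List[str]:
    """Run each module as a script, in order, and return the names that completed."""
    completed: List[str] = []
    for name in module_names:
        # Similar to `python -m <name>`: the module runs with __name__ == "__main__".
        runpy.run_module(name, run_name="__main__")
        completed.append(name)
    return completed
```

Under this layout, adding a new analysis would mean creating the script in app/analysis/ and appending its module path to the list. An exception in any step stops the loop, so later steps do not run on a failed earlier one.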
- Performance: All cleaning and deduplication is performed once during preprocessing for maximum efficiency.
- Data issues: Check the data/in/ and data/clean/ folders for input/output consistency.
- Logs: Spark logs are output to the console; adjust log level in scripts as needed.
I'd love to hear from you! For questions, suggestions, or collaboration, please reach out or create a pull request:
- Name: Milovan Minic
- Email: milovan.minic@gmail.com
- LinkedIn: https://www.linkedin.com/in/milovan-minic
- GitHub: https://github.com/milovan-minic
Feel free to connect or open an issue/pull request if you have ideas to improve this project!