pytics is an interactive data profiling library for Python that generates comprehensive HTML reports with rich visualizations and PDF export capabilities.
- 📊 Interactive Visualizations: Built with Plotly for dynamic, interactive charts
- 📱 Responsive Design: Reports adapt to different screen sizes
- 📄 PDF Export: Generate publication-ready PDF reports
- 🎯 Target Analysis: Special insights for classification/regression tasks
- 🔍 Comprehensive Profiling: Detailed statistics and distributions
- ⚡ Performance Optimized: Efficient handling of large datasets
- 🛠️ Customizable: Configure sections and visualization options
- ↔️ DataFrame Comparison: Compare two datasets for differences in schema, stats, and distributions
```bash
pip install pytics
```

```python
import pandas as pd
from pytics import profile, compare
# --- Basic Profiling ---
# Method 1: Profile a DataFrame object
df = pd.read_csv('your_data.csv')
profile(df, output_file='report.html')
# Method 2: Profile directly from a file path
# Supports CSV and Parquet files
profile('path/to/your_data.csv', output_file='report.html')
profile('path/to/your_data.parquet', output_file='report.html')
# --- Advanced Profiling ---
# Generate a PDF report
profile(df, output_format='pdf', output_file='report.pdf')
# Profile with a target variable for enhanced analysis
profile(
    df,
    target='target_column',            # Enables target-specific analysis
    output_file='targeted_report.html'
)
# Select specific sections to include/exclude
profile(
    df,
    include_sections=['overview', 'correlations'],
    exclude_sections=['target_analysis'],
    output_file='custom_report.html'
)
# --- DataFrame Comparison ---
# Method 1: Compare two DataFrame objects
df_train = pd.read_csv('train_data.csv')
df_test = pd.read_csv('test_data.csv')
compare(
    df_train,
    df_test,
    name1='Train Set',  # Optional: Custom names for the datasets
    name2='Test Set',
    output_file='comparison.html'
)
# Method 2: Compare directly from file paths
compare(
    'path/to/train_data.csv',
    'path/to/test_data.csv',
    name1='Train Set',
    name2='Test Set',
    output_file='comparison.html'
)
```

When you specify a target variable using the `target` parameter, pytics enhances the analysis with:
- Target distribution visualization
- Feature importance analysis
- Target-specific correlations
- Conditional distributions of features
- Statistical tests for feature-target relationships
Example:

```python
# Profile with target variable analysis
profile(
    df,
    target='target_column',
    output_file='targeted_report.html'
)
```

The `profile()` function's main parameters:

```python
profile(
    df,
    target='target_column',            # Target variable for supervised learning
    include_sections=['overview'],     # Sections to include
    exclude_sections=['correlations'], # Sections to exclude
    output_format='html',              # 'html' or 'pdf'
    output_file='report.html',         # Output file path
    theme='light',                     # Report theme ('light' or 'dark')
    title='Custom Report Title'        # Report title
)
```

The `compare()` function's main parameters:

```python
compare(
    df1,
    df2,
    name1='First Dataset',         # Custom name for first dataset
    name2='Second Dataset',        # Custom name for second dataset
    output_file='comparison.html', # Output file path
    theme='light',                 # Report theme ('light' or 'dark')
    title='Dataset Comparison'     # Report title
)
```

Available sections for `include_sections` and `exclude_sections`:

- `overview`: Dataset summary and memory usage
- `variables`: Detailed variable analysis
- `correlations`: Correlation analysis
- `target_analysis`: Target-specific insights (requires the `target` parameter)
- `interactions`: Feature interaction analysis
- `missing_values`: Missing value patterns
- `duplicates`: Duplicate record analysis
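For example, a data-quality focused report can be generated by passing a subset of these names; the filename below is illustrative:

```python
import pandas as pd
from pytics import profile

df = pd.read_csv('your_data.csv')

# Keep only the data-quality oriented sections of the report
profile(
    df,
    include_sections=['overview', 'missing_values', 'duplicates'],
    output_file='data_quality_report.html'
)
```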
Each generated report includes the following sections:

- Overview
  - Dataset summary
  - Memory usage
  - Data types distribution
  - Missing values summary
- DataFrame Summary
  - Complete DataFrame info output
  - Numerical and categorical statistics
  - Data preview (head/tail)
  - Memory usage details
- Variable Analysis
  - Detailed statistics
  - Distribution plots
  - Missing value patterns
  - Unique values analysis
- Correlations
  - Correlation matrix
  - Feature relationships
  - Interactive heatmaps
- Target Analysis (when a target is specified)
  - Target distribution
  - Feature importance
  - Target correlations
- Missing Values
  - Missing value patterns
  - Distribution analysis
  - Correlation with other features
- Duplicates
  - Duplicate record analysis
  - Pattern identification
  - Impact assessment
- About
  - Project information
  - Feature overview
  - GitHub repository links
Recommended dataset size limits:

- Maximum rows: 1 million
- Maximum columns: 1,000
- Larger datasets may require increased memory allocation
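For datasets above these limits, one common workaround (plain pandas, not a pytics feature) is to profile a random sample. The file name, sample size, and seed below are illustrative:

```python
import pandas as pd
from pytics import profile

# Hypothetical large dataset, e.g. tens of millions of rows
df = pd.read_parquet('large_data.parquet')

# Draw a 1-million-row random sample to stay within the recommended limits
sample = df.sample(n=1_000_000, random_state=42)

profile(
    sample,
    title='Large Dataset (1M-row sample)',
    output_file='large_data_sample_report.html'
)
```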
When exporting reports to PDF format:
- Plots are intentionally omitted due to a known issue with Kaleido version >= 0.2.1 that causes PDF export to hang indefinitely
- A message is displayed in place of each plot indicating it has been omitted
- All other report content (statistics, tables, etc.) remains fully functional
- For viewing plots, use the HTML export format which provides fully interactive visualizations
- If PDF plots are required, consider using pytics version 1.1.3 which supports them
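If you need both interactive plots and a distributable document, a simple pattern is to write the same report twice, once per format. This sketch only uses the `output_format` and `output_file` parameters shown above:

```python
import pandas as pd
from pytics import profile

df = pd.read_csv('your_data.csv')

# HTML keeps the interactive Plotly charts
profile(df, output_format='html', output_file='report.html')

# PDF omits plots on current versions (see the note above) but keeps all
# statistics and tables
profile(df, output_format='pdf', output_file='report.pdf')
```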
Data type handling:

- Missing Values: Automatically handled and reported
- Categorical Variables: Limited to 1000 unique values by default
- Date/Time: Automatically detected and analyzed
- Mixed Data Types: Handled with appropriate warnings
Error handling:

- Custom exceptions for clear error reporting
- Warning system for non-critical issues
- Graceful degradation for memory constraints
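A minimal sketch of defensive usage. The specific exception classes raised by pytics are not listed here, so this catches the broad `Exception` type; substitute the library's own exception classes for finer-grained handling.

```python
import pandas as pd
from pytics import profile

df = pd.read_csv('your_data.csv')

try:
    profile(df, output_file='report.html')
except MemoryError:
    # Fall back to a sampled subset if the full dataset exhausts memory
    profile(df.sample(n=100_000, random_state=0), output_file='report_sampled.html')
except Exception as exc:
    # pytics raises custom exceptions for clearer error reporting; replace this
    # broad handler with those classes once you know which ones matter to you
    print(f'Profiling failed: {exc}')
```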
Best practices (a combined sketch follows this list):

- Memory Management
  - Sample large datasets if needed
  - Use section selection for focused analysis
  - Monitor memory usage for big datasets
- Performance Optimization
  - Limit categorical variables when possible
  - Use targeted section selection
  - Consider data sampling for initial exploration
- Report Generation
  - Choose an appropriate output format
  - Use meaningful report titles
  - Save reports with descriptive filenames
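A short sketch pulling these practices together; the input file, sample fraction, section choice, and output names are placeholders:

```python
import pandas as pd
from pytics import profile

df = pd.read_csv('customer_data.csv')  # placeholder dataset

# Initial exploration: profile a 10% sample, restricted to a few sections,
# with a descriptive title and filename
profile(
    df.sample(frac=0.1, random_state=0),
    include_sections=['overview', 'variables', 'missing_values'],
    title='Customer Data - 10% Sample, Initial Exploration',
    output_file='customer_data_sample_overview.html'
)
```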
Contributions are welcome! Please feel free to submit a Pull Request. See the CONTRIBUTING.md file for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.