Unsupervised Machine Learning: Footballer Attacking Productivity Clustering & Similarity Recommendation System
This project applies unsupervised learning techniques to analyze attacking footballers' productivity metrics from the 2024/25 season across Europe’s Big 5 leagues:
- Clustering with K-Means: Players are grouped by similarity in goal contributions, expected goals/assists, and progressive actions to identify meaningful player archetypes.
- Dimensionality Reduction and Similarity Search with NMF: Extracts latent feature representations from non-negative attacking stats to recommend players with similar attacking productivity profiles.
The deliverables include:
- An exploratory data analysis (EDA) Jupyter notebook covering distributional analysis, team-level attacking efficiency, clustering, and player profiling.
- A Streamlit web app that provides an interactive interface to recommend and visualize similar forwards using NMF-based embeddings.
- Source: Detailed player performance data extracted from Europe’s top five leagues (Premier League, La Liga, Serie A, Bundesliga, Ligue 1) for the 2024/25 season.
- Focus on forwards only (`Pos` includes `"FW"`).
- Selected features reflect attacking productivity and contributions, including:

  | Metric | Description |
  | --- | --- |
  | `Gls` | Goals scored |
  | `Ast` | Assists made |
  | `G+A` | Goals plus assists |
  | `G-PK` | Goals excluding penalties |
  | `PK` | Penalty goals |
  | `PKatt` | Penalty attempts |
  | `xG` | Expected goals |
  | `npxG` | Non-penalty expected goals |
  | `xAG` | Expected assists |
  | `npxG+xAG` | Composite expected goal contribution |
  | `PrgC` | Progressive carries |
  | `PrgP` | Progressive passes |
  | `PrgR` | Progressive receptions |
- Preprocessing:
  - Applied `MinMaxScaler` to normalize feature ranges to [0, 1].
  - For clustering, standardized with `StandardScaler` where appropriate.
  - Dimensionality reduction using `NMF` to leverage non-negativity and parts-based representation.
- EDA steps:
  - Univariate histograms reveal strong right (positive) skewness typical for attacking metrics, with many low-scoring players and a few high performers.
  - Aggregated team-level attacking outputs via sums of goals, assists, xG, and xAG.
  - Constructed an Efficiency Score composite metric, weighted to reflect contributions beyond raw stats.
  - Visualized relationships between actual and expected goals/assists, identifying over- and under-performers.
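As a rough illustration of the EDA steps above, the sketch below filters forwards, sums team-level output, and computes the goals-minus-xG gap that separates over- from under-performers. The tiny DataFrame holds invented values, and the `Player` and `Squad` column names are assumptions (the remaining column names come from the feature table); it is not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
# "Player" and "Squad" are assumed column names; the rest follow the feature table.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Squad":  ["Team X", "Team X", "Team Y", "Team Y"],
    "Pos":    ["FW", "MF,FW", "DF", "FW"],
    "Gls":    [20, 8, 1, 5],
    "Ast":    [5, 10, 0, 2],
    "xG":     [15.0, 9.5, 0.8, 6.0],
    "xAG":    [4.0, 8.0, 0.2, 2.5],
})

# Keep forwards only: rows whose Pos string contains "FW".
forwards = df[df["Pos"].str.contains("FW", na=False)].copy()

# Team-level attacking output as sums of goals, assists, xG and xAG.
team_totals = forwards.groupby("Squad")[["Gls", "Ast", "xG", "xAG"]].sum()

# Positive gap: scored more than expected (over-performer); negative: under-performer.
forwards["G_minus_xG"] = forwards["Gls"] - forwards["xG"]
overperformers = forwards.sort_values("G_minus_xG", ascending=False)
```

Plotting `Gls` against `xG` with a diagonal reference line would show the same gap visually: points above the line are over-performers.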
- K-Means clustering:
  - Performed on standardized attacking metrics to segment forwards into 4 clusters.
  - Cluster separation and compactness were evaluated with silhouette scores and inertia.
  - Cluster centroids visualized via radar/spider plots to interpret trait differences.
  - Identified clusters roughly correspond to:
    - Elite “lethal strikers” with dominant finishing and expected goals.
    - Creative, well-rounded attackers combining assists and progressive play.
    - Average contributors with moderate stats.
    - Low-productivity or peripheral forwards.
  - Outlier analysis isolates standout players within clusters (e.g., Mbappé as an outlier in the lethal-striker cluster).
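The clustering step can be sketched roughly as follows. This is a minimal example on synthetic data, not the project's code: the random matrix stands in for the forwards-by-13-metrics table, and `n_init` and `random_state` are assumed settings that the text does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic non-negative matrix standing in for forwards x 13 attacking metrics.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=3.0, size=(200, 13))

# Standardize so that high-volume metrics do not dominate the distance computation.
X_std = StandardScaler().fit_transform(X)

# Four clusters, as in the text; n_init and random_state are assumed settings.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_std)

# Inertia measures compactness; the silhouette score (in [-1, 1]) measures separation.
inertia = kmeans.inertia_
silhouette = silhouette_score(X_std, labels)

# Centroids in standardized units; these are the values a radar plot would display.
centroids = kmeans.cluster_centers_
```

Repeating the fit over a range of `n_clusters` values and comparing inertia and silhouette scores is a common way to sanity-check the choice of 4 clusters.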
- NMF recommendation workflow:
  - Scale selected features with `MinMaxScaler`.
  - Fit NMF (`max_iter=500`) to factorize the player-feature matrix into latent components.
  - Normalize the resulting latent vectors to unit (L2) length.
  - Compute cosine similarity between the target player and all others in latent space.
  - Return top-N similar players ranked by similarity score.
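The workflow above might look roughly like the sketch below. Only `max_iter=500` comes from the text; the function name `recommend_similar`, the values of `n_components`, `init`, and `random_state`, and the synthetic data with placeholder player names are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler, normalize

def recommend_similar(X, names, target, n_components=5, top_n=5):
    """Return the top_n players most similar to `target` in NMF latent space."""
    # Scale each feature to [0, 1]; NMF requires non-negative input.
    X_scaled = MinMaxScaler().fit_transform(X)

    # Factorize into latent components; max_iter=500 follows the text,
    # while n_components, init and random_state are assumed values.
    nmf = NMF(n_components=n_components, init="nndsvda",
              max_iter=500, random_state=42)
    W = nmf.fit_transform(X_scaled)

    # L2-normalize rows so that a dot product equals cosine similarity.
    W_norm = normalize(W)

    idx = names.index(target)
    sims = W_norm @ W_norm[idx]

    # Rank by similarity, skipping the target player itself.
    order = [i for i in np.argsort(-sims) if i != idx][:top_n]
    return [(names[i], float(sims[i])) for i in order]

# Synthetic data and placeholder names, for illustration only.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=3.0, size=(50, 13))
names = [f"Player {i}" for i in range(50)]
results = recommend_similar(X, names, "Player 0", top_n=3)
```

Because the NMF latent vectors are non-negative, the resulting cosine similarities fall between 0 and 1, which makes the scores easy to display as a ranked list in an app.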
- Player scouting: Find hidden gems or comparable alternatives by productivity profile, useful when positional data is unavailable or ambiguous.
- Transfer market: Data-driven similarity can inform recruitment strategy and mitigate risk.
- Player development: Track progression in latent productivity traits over time.
- Football analytics: Extends typical metrics to non-negative latent factors that capture attacking style nuances.
- Clone the repo.
- Install dependencies: `pip install -r requirements.txt`
- Open and explore the Jupyter Notebook for detailed EDA and clustering: `jupyter notebook football_analysis.ipynb`
- Launch the Streamlit app for real-time recommendations: `streamlit run football.py`
- pandas for data manipulation
- scikit-learn for scaling, clustering, NMF, and similarity metrics
- matplotlib and seaborn for visualization
- streamlit for interactive web app
This project is licensed under the MIT License.