Comprehensive Guide to Football Data
A hands-on guide to building a football data pipeline—from raw event ingestion and feature engineering to predictive scoring and role clustering—using the Football Performance API as the practical backbone.
Football data creates value only when it feeds a system that turns raw numbers into better decisions on the pitch, in the transfer market, and on the training ground. The challenge most clubs and independent analysts face is not access to data—dozens of providers now serve European leagues—but the absence of a structured pipeline that moves from ingestion through modeling to actionable output. This guide walks through that pipeline end-to-end, using the open-source Football Performance API as the practical backbone for every stage.
The Football Performance API is a self-hosted scoring and analytics platform built on top of real match-level event data. It processes raw player statistics through a feature engineering layer, feeds them into CatBoost-based prediction models, assigns players to tactical role clusters via Gaussian Mixture Models, and exposes the results through a comprehensive REST API with over 30 endpoints covering database queries, predictions, visualizations, and reports. Whether you are building an analytics department from scratch or integrating a scoring layer into an existing scouting workflow, the architecture below provides a proven, extensible foundation.
1) The Data Layer: What Football Data Actually Looks Like in Production
Most guides describe football data in abstract categories—“event data,” “tracking data,” “physical data.” In a production system, these abstractions must resolve into concrete database tables with defined schemas, consistent identifiers, and queryable fields. The Football Performance API structures its data layer around three core database endpoints that serve distinct analytical purposes.
Player Basic Information — /database/players/basic
The foundation of any football data system is a clean, deduplicated player registry. The API’s /database/players/basic endpoint provides the canonical player record with fields including player_id, name, position, league, season, and club. This endpoint supports fuzzy name matching (name parameter uses contains-match logic) alongside exact filters for position, league, and season, enabling both exploratory search and precise lookup.
Why this matters operationally: player identity resolution is the single most common source of data quality failure in football analytics. A player who moves clubs mid-season, a youth player promoted to the first team, or a player whose name transliterates differently across providers can fragment longitudinal analysis unless the registry enforces a single canonical player_id. The API’s registry solves this by maintaining stable integer IDs across seasons and clubs.
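A registry lookup can be sketched in a few lines of standard-library Python. The API_BASE URL and the two wrapper function names below are illustrative assumptions (adjust them to your own deployment); the endpoint path and its fuzzy name parameter come from the API description above:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of a self-hosted deployment; adjust to your instance.
API_BASE = "http://localhost:8000"

def build_player_query(name_fragment, **exact_filters):
    """Combine a fuzzy name fragment with exact filters, dropping unset ones."""
    params = {"name": name_fragment}
    params.update({k: v for k, v in exact_filters.items() if v is not None})
    return params

def find_players(name_fragment, position=None, league=None, season=None):
    """Search the canonical player registry via /database/players/basic."""
    params = build_player_query(
        name_fragment, position=position, league=league, season=season
    )
    url = f"{API_BASE}/database/players/basic?{urlencode(params)}"
    with urlopen(url) as resp:  # requires a running instance
        return json.load(resp)

# Example (against a live deployment):
# find_players("silva", position="Midfielder", season="2024/2025")
```

Keeping the query-building step separate from the HTTP call makes the lookup logic easy to unit-test without a live server.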
Player Feature Vectors — /database/player-features
Raw event counts—goals, assists, shots—are not directly comparable across players in different tactical systems. The feature engineering layer transforms raw statistics into normalized, position-aware feature vectors that capture a player’s true contribution profile.
The /database/player-features endpoint returns comprehensive skill profiles for each player, filterable by player_id, name, league, season, position, and club. The critical parameter is source, which selects between two model variants:
- pro — The professional model variant with enhanced feature engineering, proprietary normalization, and higher predictive accuracy. This is the default and recommended source for production analytical workflows.
- base — The community model variant included in the open-source repository, using the base feature pipeline from base/features.py. Suitable for prototyping and independent research.
The feature vectors include raw performance counts (goals, assists, shots, dribbles, tackles, interceptions, clearances, blocks, aerials_won, passes_completed, passes_attempted, fouls, yellow_cards, red_cards, minutes) alongside derived percentage-based scores:
- attack_score_score_pct — Offensive contribution percentile within the player's position cohort.
- defense_score_score_pct — Defensive contribution percentile.
- creation_score_score_pct — Chance creation and playmaking percentile.
- rating_display_score_pct — Overall composite rating percentile.
These percentile scores are what make cross-league, cross-position comparison meaningful. A midfielder with an attack_score_score_pct of 85 is outperforming 85% of midfielders in the database on offensive metrics—regardless of whether they play in the Premier League or the Eredivisie.
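Because the percentile fields are already position-normalized, a cross-league shortlist reduces to a sort with a sample-size floor. A minimal sketch over the JSON list returned by /database/player-features (the 900-minute floor is an illustrative choice, echoed in the data-quality rules later in this guide):

```python
def top_by_percentile(players, dim="attack_score_score_pct", n=10, min_minutes=900):
    """Rank feature records by a percentile dimension, with a minutes floor.

    `players` is a list of dicts as returned by /database/player-features;
    records below `min_minutes` are excluded as small-sample noise.
    """
    eligible = [p for p in players if p.get("minutes", 0) >= min_minutes]
    return sorted(eligible, key=lambda p: p[dim], reverse=True)[:n]
```

Swapping `dim` for defense_score_score_pct or creation_score_score_pct reuses the same ranking logic for other profiles.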
Player Ratings — /database/ratings
The /database/ratings endpoint provides the final computed rating records, filterable by player_id, full_name, season, league, and team. These ratings represent the output of the prediction model—the composite performance score that the CatBoost model assigns based on the full feature vector.
Ratings are seasonally versioned, meaning you can track a player’s trajectory across multiple seasons by querying different season values (format: "2024/2025"). This longitudinal view is essential for development curve analysis, regression detection, and transfer target validation.
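Ordering those seasonal records chronologically is the first step of any trajectory analysis. A small sketch that parses the documented "2024/2025" season format; the "rating" key on each record is an assumption about the response shape, so rename it to match your deployment:

```python
def season_start_year(season):
    """Parse the starting year from a season string like '2024/2025'."""
    return int(season.split("/")[0])

def rating_trajectory(rating_records):
    """Order a player's rating records chronologically by season.

    Assumes each record carries 'season' and 'rating' keys (the latter is
    an assumed field name for the composite score).
    """
    ordered = sorted(rating_records, key=lambda r: season_start_year(r["season"]))
    return [(r["season"], r["rating"]) for r in ordered]
```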
2) The Modeling Layer: How Raw Stats Become Predictions
Raw statistics describe what happened. Predictions describe what a player’s statistical profile implies about their overall quality. The gap between these two is where the modeling layer operates.
The Scoring Pipeline
The Football Performance API implements a two-stage scoring pipeline, documented in the codebase as score_building.py:
1. Feature processing (base/features.py) — Ingests raw match-level statistics, applies position-aware normalization, handles missing data imputation, and generates the feature vectors stored in the database.
2. Prediction (model/base_model.py) — A CatBoost gradient boosting model trained on the processed features, with performance targets generated by base/target.py. The model produces the composite rating score and the dimensional sub-scores (attack, defense, creation).
The model exists in two variants:
- Base model — Included in the open-source repository as a .cbm file. You can retrain this locally using your own data and feature engineering choices. The /score/predict-base endpoint serves predictions from this model.
- Pro model — The production variant hosted by Futrix Metrics with enhanced features and higher accuracy. The /score/predict endpoint serves predictions from this model.
Making Predictions via the API
The /score/predict endpoint accepts a POST request with a player’s feature object:
POST /score/predict

```json
{
  "features": {
    "goals": 12,
    "assists": 7,
    "shots": 68,
    "dribbles": 45,
    "tackles": 22,
    "interceptions": 18,
    "clearances": 5,
    "blocks": 8,
    "aerials_won": 30,
    "passes_completed": 1420,
    "passes_attempted": 1680,
    "fouls": 28,
    "yellow_cards": 4,
    "red_cards": 0,
    "minutes": 2850,
    "position": "Midfielder",
    "league": "Premier League",
    "season": "2024/2025"
  }
}
```
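The same request can be issued programmatically. A stdlib-only sketch, with API_BASE and the helper names as assumptions; the required-field set mirrors the documented feature object, and validating it client-side catches malformed requests before they hit the network:

```python
import json
from urllib.request import Request, urlopen

API_BASE = "http://localhost:8000"  # assumed self-hosted deployment

# Mirrors the documented feature object for /score/predict.
REQUIRED_FIELDS = {
    "goals", "assists", "shots", "dribbles", "tackles", "interceptions",
    "clearances", "blocks", "aerials_won", "passes_completed",
    "passes_attempted", "fouls", "yellow_cards", "red_cards",
    "minutes", "position", "league", "season",
}

def make_payload(features):
    """Fail fast on an incomplete feature dict, then wrap it for the API."""
    missing = REQUIRED_FIELDS - features.keys()
    if missing:
        raise ValueError(f"missing feature fields: {sorted(missing)}")
    return {"features": features}

def predict(features):
    """POST one player's features to /score/predict and return the JSON."""
    req = Request(
        f"{API_BASE}/score/predict",
        data=json.dumps(make_payload(features)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:  # requires a running instance
        return json.load(resp)
```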
For bulk operations—evaluating an entire squad or running scenario analysis across a shortlist—the /score/predict-batch endpoint accepts up to 500 prediction requests in a single call, reducing network overhead and enabling large-scale scouting workflows.
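When a shortlist exceeds the 500-request cap, it has to be split into compliant chunks before posting. A minimal chunking helper (the exact batch payload shape expected by /score/predict-batch is not shown here, so wire each chunk into your request body as your deployment documents it):

```python
def batch_requests(feature_dicts, max_batch=500):
    """Split a list of prediction requests into chunks within the batch cap.

    Each returned chunk can be sent as one /score/predict-batch call.
    """
    return [
        feature_dicts[i : i + max_batch]
        for i in range(0, len(feature_dicts), max_batch)
    ]
```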
Understanding Feature Importance
The /charts/features endpoint visualizes which raw statistics contribute most to the model’s predictions. The top parameter (default: 30) controls how many features are displayed. This transparency is critical for building trust with coaching staff—when a scout asks “why does the model rate this player highly?”, the feature importance chart provides a concrete, verifiable answer.
The /charts/models/base and /charts/models/pro endpoints provide comparative views of the two model variants, showing where the pro model’s enhanced features produce different rankings and why.
3) The Clustering Layer: Tactical Role Classification
Position labels—“midfielder,” “forward,” “defender”—are too coarse for modern tactical analysis. A box-to-box #8 and a deep-lying #6 both carry the “midfielder” label but demand entirely different skill profiles. The Football Performance API addresses this with a Gaussian Mixture Model (GMM) clustering system that assigns players to tactical roles based on their actual statistical profiles rather than their nominal positions.
How the Clustering Works
The GMM clustering operates on the normalized feature vectors from the player-features database. Unlike hard clustering methods like K-Means, GMM produces probabilistic assignments—a player might be 72% “Advanced Playmaker” and 28% “Box-to-Box Carrier”—which captures the reality that elite players often operate across role boundaries.
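Those membership probabilities can be turned into a simple pure-versus-hybrid label. A sketch where the 0.6 purity threshold is an illustrative analytical choice, not part of the API:

```python
def classify_profile(memberships, purity_threshold=0.6):
    """Label a player as a pure archetype or a hybrid from GMM memberships.

    `memberships` maps cluster names to probabilities summing to ~1.0.
    The 0.6 threshold is an assumption you should tune to your use case.
    """
    top_cluster, top_prob = max(memberships.items(), key=lambda kv: kv[1])
    kind = "pure" if top_prob >= purity_threshold else "hybrid"
    return top_cluster, top_prob, kind
```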
The clustering results are accessible through two endpoints:
- /database/role-cluster-results — Returns individual player assignments with cluster_id, cluster_name, cluster_group, and cluster_local fields. Filterable by player_id, player_name, season, league, club, position, and cluster_name.
- /database/role-cluster-summary — Returns aggregate cluster metadata: the characteristic feature profile of each cluster, the player distribution across clusters, and the mean performance metrics per cluster.
Why This Matters for Scouting
Traditional scouting searches start with position: “find me a right winger.” Cluster-based scouting starts with role: “find me an inverted winger who cuts inside to create from half-spaces.” The difference is transformative.
By querying /database/role-cluster-results with a specific cluster_name, you retrieve every player in the database who matches that tactical archetype—across all leagues and seasons. Combined with the feature percentile scores, this creates a scouting shortlist that is both tactically precise and statistically validated.
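The join between cluster assignments and percentile scores can be sketched as a merge on player_id. The function name and the exact field names on each row are assumptions based on the endpoint descriptions above:

```python
def role_shortlist(cluster_rows, feature_rows, n=10):
    """Join role-cluster results to feature vectors and rank by rating.

    `cluster_rows` come from /database/role-cluster-results (filtered to one
    cluster_name); `feature_rows` come from /database/player-features.
    Players without a feature record are dropped.
    """
    features_by_id = {row["player_id"]: row for row in feature_rows}
    matched = [
        {**row, **features_by_id[row["player_id"]]}
        for row in cluster_rows
        if row["player_id"] in features_by_id
    ]
    matched.sort(key=lambda r: r["rating_display_score_pct"], reverse=True)
    return matched[:n]
```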
The /player-charts/cluster-heatmap endpoint visualizes the mean z-score pattern for each cluster, making it immediately clear which statistical dimensions define each role. The /player-charts/cluster-profile/{player_id} endpoint shows an individual player’s GMM membership probabilities across all clusters, revealing whether they are a pure archetype or a hybrid profile.
4) The Visualization Layer: From Numbers to Communication
Data that cannot be communicated to coaching staff cannot change decisions. The Football Performance API includes a comprehensive visualization system with 10 chart endpoints, each designed to answer a specific analytical question.
| Chart endpoint | What it answers | Key parameters |
|---|---|---|
| /player-charts/radar/{player_id} | How does this player’s skill profile compare to the position average? | season, source (pro/base), format (png/svg/html) |
| /player-charts/timeline/{player_id} | How has this player’s rating evolved across seasons? | dims (e.g., attack_score_score_pct, defense_score_score_pct), format |
| /player-charts/compare | How do 2-5 players compare on the same radar? | player_ids (comma-separated), season, format |
| /player-charts/top-players | Who are the highest-rated players in a specific context? | n, position, league, season, format |
| /player-charts/score-distribution | How are scores distributed across positions? | dim (which score dimension), season, format |
| /player-charts/score-scatter | How do players cluster on two scoring dimensions? | x_dim, y_dim, color_by (position/cluster_group/cluster_name), format |
| /player-charts/cluster-heatmap | What defines each tactical role cluster? | group (e.g., “Forwards”), format |
| /player-charts/cluster-distribution | How many players fall into each cluster? | source, season, format |
| /player-charts/cluster-profile/{player_id} | What is this player’s role probability distribution? | season, source, format |
| /player-charts/league-scores | How do leagues compare on average skill scores? | top_leagues, season, format |
All chart endpoints support multiple output formats: png, jpg, webp, svg, and html. The HTML format is particularly valuable for interactive exploration—embedding a score-scatter chart with format=html produces a zoomable, hoverable plot where clicking a data point reveals the player’s identity and full metric profile.
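Since every chart endpoint shares the same format parameter, one small downloader covers all ten. API_BASE and the helper names are assumptions for a self-hosted deployment:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "http://localhost:8000"  # assumed self-hosted deployment

def chart_request(endpoint, fmt="png", **params):
    """Build the URL and query string for any chart endpoint."""
    params["format"] = fmt
    return f"{API_BASE}{endpoint}", urlencode(params)

def save_chart(endpoint, path, fmt="png", **params):
    """Download a chart in the requested format and write it to disk."""
    url, query = chart_request(endpoint, fmt=fmt, **params)
    with urlopen(f"{url}?{query}") as resp:  # requires a running instance
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)

# Example (against a live deployment):
# save_chart("/player-charts/radar/102", "radar.html", fmt="html", season="2024/2025")
```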
5) The Report Layer: Structured Analytical Output
For scouting and recruitment workflows that require structured, shareable documents rather than individual charts, the API provides a report generation system:
- /report/data — Returns structured JSON containing a player’s complete analytical profile: features, scores, cluster assignments, and cross-season trajectories. This endpoint is designed for programmatic consumption—feeding into custom dashboards, PDF generators, or recruitment management systems.
- /report — Generates a complete HTML player report combining visualizations, statistical summaries, and contextual benchmarks in a single, shareable document. This is the “one-click scouting report” that sporting directors and recruitment committees can review without accessing the analytical platform directly.
- /report/customize — Provides a personalization interface for tailoring report content and formatting to organizational preferences.
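Generating reports for a whole shortlist is then a loop over /report URLs. In this sketch the query parameter names (player_id, season) are assumptions modeled on the database endpoints' conventions, so verify them against your deployment before relying on them:

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8000"  # assumed self-hosted deployment

def report_urls(player_ids, season=None):
    """Build /report request URLs for a shortlist of players.

    Parameter names are assumed from the database endpoints' conventions;
    adjust them if your /report route documents different ones.
    """
    urls = []
    for pid in player_ids:
        params = {"player_id": pid}
        if season:
            params["season"] = season
        urls.append(f"{API_BASE}/report?{urlencode(params)}")
    return urls
```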
6) Building a Weekly Pipeline on Top of the API
With the data, modeling, clustering, visualization, and reporting layers in place, a practical weekly pipeline looks like this:
- Monday — Post-match update. Query /database/player-features for your squad’s latest feature vectors. Run /score/predict-batch to update composite ratings. Compare to the previous week’s scores to identify performance shifts.
- Tuesday — Scouting focus. Use /database/role-cluster-results to pull players matching the target tactical role. Filter by league and season. Rank by rating_display_score_pct. Generate /player-charts/compare overlays for the top 5 candidates against your current starter.
- Wednesday — Tactical preparation. Pull the upcoming opponent’s player profiles via /database/player-features. Generate radar charts for their key players. Identify where their feature profiles show vulnerabilities relative to position averages.
- Thursday — Report generation. Use /report to generate shareable HTML reports for the recruitment committee. Use /player-charts/timeline/{player_id} to show development trajectories for shortlisted targets.
- Friday — Cross-league benchmarking. Use /player-charts/league-scores to contextualize performance across competitions. Use /player-charts/score-scatter with color_by=cluster_name to map tactical archetypes across leagues.
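The Monday comparison step reduces to a week-over-week diff of rating maps. A minimal sketch where both arguments map player_id to a composite rating and the 2.0-point alert threshold is an illustrative choice:

```python
def rating_shifts(last_week, this_week, threshold=2.0):
    """Flag players whose composite rating moved by at least `threshold`.

    Both inputs map player_id -> composite rating (e.g. collected from
    successive /score/predict-batch runs). Players new this week are
    skipped because they have no baseline.
    """
    shifts = {}
    for pid, new in this_week.items():
        old = last_week.get(pid)
        if old is not None and abs(new - old) >= threshold:
            shifts[pid] = round(new - old, 2)
    return shifts
```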
7) Data Quality Rules That Apply to Any Pipeline
Even with a well-structured API, analytical quality depends on disciplined data hygiene:
- Always verify sample size before acting on metrics. The /database/player-features endpoint includes minutes as a field—players with fewer than 900 minutes (approximately 10 full matches) should be flagged with confidence caveats. Short-sample metrics are unreliable regardless of how sophisticated the model is.
- Use the source parameter deliberately. Switching between pro and base sources mid-analysis produces inconsistent results. Choose one source for each analytical workflow and document it. Use the /charts/models/base and /charts/models/pro comparison to understand where the models diverge.
- Cross-validate with public benchmarks. Use FBref as an independent reference for raw event counts. If your API-derived feature vectors diverge significantly from FBref’s per-90 statistics for the same player, investigate the normalization pipeline before drawing conclusions.
- Version your analytical outputs. The API’s season parameter enables longitudinal tracking, but your downstream analysis must also be versioned. A scouting shortlist generated in October should be re-evaluated in January using updated feature vectors—player performance is non-stationary.
- Separate exploration from production queries. Use generous limit values (up to 5000) for exploratory analysis but lock production dashboards to specific, validated query parameters. The API supports offset-based pagination for large result sets—use it to avoid overwhelming downstream systems.
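The offset-based pagination mentioned above fits naturally into a generator. A sketch where `fetch_page` wraps one API call taking `limit` and `offset` keyword arguments (e.g. a closure around a /database/player-features GET); the stop-on-short-page convention is an assumption about the endpoint's behavior:

```python
def paginate(fetch_page, limit=1000):
    """Walk an offset-paginated endpoint, yielding rows until exhaustion.

    `fetch_page(limit=..., offset=...)` must return one page as a list;
    iteration stops on an empty or short page.
    """
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            return
        yield from page
        if len(page) < limit:
            return
        offset += limit
```

Because the page-fetching call is injected, the traversal logic can be tested against an in-memory list before pointing it at the live API.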
Further Reading
- How to Analyze Football Statistics for Performance Insights — Deep dive into the prediction model and scoring methodology.
- Exploring Football Analytics: Tools and Techniques — Comprehensive guide to the API’s visualization and reporting capabilities.
- Best Soccer Data Providers in Europe — How the Football Performance API compares to commercial providers for European league coverage.
- FBref — Independent statistical reference for cross-validation.
- StatsBomb Articles — Analytical methodology and metric definitions.
- Transfermarkt — Financial and personnel context for recruitment workflows.
Conclusion
A football data pipeline is successful when every layer—from raw ingestion through feature engineering, prediction, clustering, visualization, and reporting—is transparent, reproducible, and connected to a specific decision. The Football Performance API provides this full stack as an open-source, self-hosted platform with a production-grade managed option from Futrix Metrics. Build the pipeline once, validate it against public benchmarks, and then let the compound effect of marginally better weekly decisions create genuine competitive advantage across a full season.