Comprehensive Guide to Football Data

A hands-on guide to building a football data pipeline—from raw event ingestion and feature engineering to predictive scoring and role clustering—using the Football Performance API as the practical backbone.

2026-03-08

Football data workflow overview

Football data creates value only when it feeds a system that turns raw numbers into better decisions on the pitch, in the transfer market, and on the training ground. The challenge most clubs and independent analysts face is not access to data—dozens of providers now serve European leagues—but the absence of a structured pipeline that moves from ingestion through modeling to actionable output. This guide walks through that pipeline end-to-end, using the open-source Football Performance API as the practical backbone for every stage.

The Football Performance API is a self-hosted scoring and analytics platform built on top of real match-level event data. It processes raw player statistics through a feature engineering layer, feeds them into CatBoost-based prediction models, assigns players to tactical role clusters via Gaussian Mixture Models, and exposes the results through a comprehensive REST API with over 30 endpoints covering database queries, predictions, visualizations, and reports. Whether you are building an analytics department from scratch or integrating a scoring layer into an existing scouting workflow, the architecture below provides a proven, extensible foundation.

1) The Data Layer: What Football Data Actually Looks Like in Production

Most guides describe football data in abstract categories—“event data,” “tracking data,” “physical data.” In a production system, these abstractions must resolve into concrete database tables with defined schemas, consistent identifiers, and queryable fields. The Football Performance API structures its data layer around three core database endpoints that serve distinct analytical purposes.

Player Basic Information — /database/players/basic

The foundation of any football data system is a clean, deduplicated player registry. The API’s /database/players/basic endpoint provides the canonical player record with fields including player_id, name, position, league, season, and club. This endpoint supports fuzzy name matching (name parameter uses contains-match logic) alongside exact filters for position, league, and season, enabling both exploratory search and precise lookup.

Why this matters operationally: player identity resolution is the single most common source of data quality failure in football analytics. A player who moves clubs mid-season, a youth player promoted to the first team, or a player whose name transliterates differently across providers can fragment longitudinal analysis unless the registry enforces a single canonical player_id. The API’s registry solves this by maintaining stable integer IDs across seasons and clubs.
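In practice, the contains-match lookup and exact filters are easy to wrap in a small client helper. A minimal sketch using only the Python standard library; the BASE_URL (a local self-hosted instance) is an assumption, and the response shape depends on your deployment:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: your self-hosted instance

def registry_query_url(name=None, position=None, league=None, season=None):
    """Build a /database/players/basic query; `name` is contains-match,
    the remaining filters are exact."""
    raw = {"name": name, "position": position, "league": league, "season": season}
    params = {k: v for k, v in raw.items() if v is not None}
    return f"{BASE_URL}/database/players/basic?{urlencode(params)}"

def find_players(**filters):
    """Fetch matching registry records as parsed JSON."""
    with urlopen(registry_query_url(**filters)) as resp:
        return json.load(resp)

# Fuzzy lookup: "Silva" also matches "da Silva", "Silvano", and so on.
url = registry_query_url(name="Silva", league="Premier League", season="2024/2025")
```

Resolve the returned player_id once, then key every downstream query on it rather than on the display name.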

Player Feature Vectors — /database/player-features

Raw event counts—goals, assists, shots—are not directly comparable across players in different tactical systems. The feature engineering layer transforms raw statistics into normalized, position-aware feature vectors that capture a player’s true contribution profile.

The /database/player-features endpoint returns comprehensive skill profiles for each player, filterable by player_id, name, league, season, position, and club. The critical parameter is source, which selects between two model variants:

  • pro — The professional model variant with enhanced feature engineering, proprietary normalization, and higher predictive accuracy. This is the default and recommended source for production analytical workflows.
  • base — The community model variant included in the open-source repository, using the base feature pipeline from base/features.py. Suitable for prototyping and independent research.

The feature vectors include raw performance counts (goals, assists, shots, dribbles, tackles, interceptions, clearances, blocks, aerials_won, passes_completed, passes_attempted, fouls, yellow_cards, red_cards, minutes) alongside derived percentage-based scores:

  • attack_score_score_pct — Offensive contribution percentile within the player’s position cohort.
  • defense_score_score_pct — Defensive contribution percentile.
  • creation_score_score_pct — Chance creation and playmaking percentile.
  • rating_display_score_pct — Overall composite rating percentile.

These percentile scores are what make cross-league, cross-position comparison meaningful. A midfielder with an attack_score_score_pct of 85 is outperforming 85% of midfielders in the database on offensive metrics—regardless of whether they play in the Premier League or the Eredivisie.
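To make the percentile fields concrete, here is a small illustration of how a percentile rank relates one score to a position cohort. This mirrors the reading of attack_score_score_pct described above; the API's exact computation is not published here, so treat this as a sketch of the semantics, not the implementation:

```python
def percentile_rank(value, cohort):
    """Share of the position cohort scoring strictly below `value`, as 0-100.
    Illustrative only; the API may use a different percentile convention."""
    below = sum(1 for v in cohort if v < value)
    return 100.0 * below / len(cohort)

# A midfielder outscoring 17 of 20 cohort peers sits at the 85th percentile.
cohort = list(range(20))            # toy offensive scores for 20 midfielders
print(percentile_rank(17, cohort))  # → 85.0
```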

Player Ratings — /database/ratings

The /database/ratings endpoint provides the final computed rating records, filterable by player_id, full_name, season, league, and team. These ratings represent the output of the prediction model—the composite performance score that the CatBoost model assigns based on the full feature vector.

Ratings are seasonally versioned, meaning you can track a player’s trajectory across multiple seasons by querying different season values (format: "2024/2025"). This longitudinal view is essential for development curve analysis, regression detection, and transfer target validation.
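Because ratings are versioned by season, trajectory analysis reduces to ordering the per-season records and differencing them. A sketch over records shaped like /database/ratings output; the rating field name and the values here are illustrative assumptions:

```python
def rating_trajectory(records):
    """Order seasonally versioned rating records and compute season-over-season
    deltas for regression detection; record shape is an assumption."""
    ordered = sorted(records, key=lambda r: r["season"])
    deltas = [
        (b["season"], b["rating"] - a["rating"])
        for a, b in zip(ordered, ordered[1:])
    ]
    return ordered, deltas

records = [
    {"season": "2024/2025", "rating": 74.1},
    {"season": "2022/2023", "rating": 68.0},
    {"season": "2023/2024", "rating": 71.5},
]
_, deltas = rating_trajectory(records)
# Positive deltas across both transitions indicate an improving trajectory.
```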

2) The Modeling Layer: How Raw Stats Become Predictions

Raw statistics describe what happened. Predictions describe what a player’s statistical profile implies about their overall quality. The gap between these two is where the modeling layer operates.

The Scoring Pipeline

The Football Performance API implements a two-stage scoring pipeline, documented in the codebase as score_building.py:

  1. Feature processing (base/features.py) — Ingests raw match-level statistics, applies position-aware normalization, handles missing data imputation, and generates the feature vectors stored in the database.

  2. Prediction (model/base_model.py) — A CatBoost gradient boosting model trained on the processed features with performance targets generated by base/target.py. The model produces the composite rating score and the dimensional sub-scores (attack, defense, creation).
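The normalization stage can be pictured with a toy version of per-90 scaling followed by a within-position z-score. This is a sketch of the kind of transformation base/features.py performs, not a reproduction of the actual pipeline:

```python
from statistics import mean, pstdev

def per90(count, minutes):
    """Scale a raw count to a per-90-minute rate."""
    return 90.0 * count / minutes if minutes else 0.0

def zscore_within_position(values):
    """Normalize a per-90 metric within one position cohort, so midfielders
    are compared to midfielders rather than to strikers."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

# Toy cohort: (tackles, minutes) for three midfielders
rates = [per90(t, m) for t, m in [(22, 2850), (40, 3000), (10, 1800)]]
z = zscore_within_position(rates)
```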

The model exists in two variants:

  • Base model — Included in the open-source repository as a .cbm file. You can retrain this locally using your own data and feature engineering choices. The /score/predict-base endpoint serves predictions from this model.
  • Pro model — The production variant hosted by Futrix Metrics with enhanced features and higher accuracy. The /score/predict endpoint serves predictions from this model.

Making Predictions via the API

The /score/predict endpoint accepts a POST request with a player’s feature object:

POST /score/predict
{
  "features": {
    "goals": 12,
    "assists": 7,
    "shots": 68,
    "dribbles": 45,
    "tackles": 22,
    "interceptions": 18,
    "clearances": 5,
    "blocks": 8,
    "aerials_won": 30,
    "passes_completed": 1420,
    "passes_attempted": 1680,
    "fouls": 28,
    "yellow_cards": 4,
    "red_cards": 0,
    "minutes": 2850,
    "position": "Midfielder",
    "league": "Premier League",
    "season": "2024/2025"
  }
}
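With the standard library, the request above can be assembled as follows. BASE_URL points at an assumed self-hosted instance, and the feature dict is truncated for brevity; in practice you would send the full object shown above:

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:8000"  # assumption: your self-hosted instance

def predict_request(features):
    """Build the POST request for /score/predict with one feature object."""
    body = json.dumps({"features": features}).encode("utf-8")
    return Request(f"{BASE_URL}/score/predict", data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")

features = {"goals": 12, "assists": 7, "minutes": 2850,
            "position": "Midfielder", "league": "Premier League",
            "season": "2024/2025"}  # truncated; send the full object above
req = predict_request(features)
# with urlopen(req) as resp:
#     rating = json.load(resp)
```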

For bulk operations—evaluating an entire squad or running scenario analysis across a shortlist—the /score/predict-batch endpoint accepts up to 500 prediction requests in a single call, reducing network overhead and enabling large-scale scouting workflows.
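Chunking a large shortlist into batch calls is straightforward. Only the 500-request ceiling comes from the API description; the payload shape with a "requests" key is an assumption for illustration:

```python
def batch_payloads(feature_objects, max_batch=500):
    """Split a squad or shortlist into /score/predict-batch payloads of at
    most 500 requests each; the payload shape is assumed, not documented here."""
    return [
        {"requests": [{"features": f} for f in feature_objects[i:i + max_batch]]}
        for i in range(0, len(feature_objects), max_batch)
    ]

payloads = batch_payloads([{"minutes": m} for m in range(1200)])
# 1200 players → 3 payloads of 500, 500, and 200 requests
```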

Understanding Feature Importance

The /charts/features endpoint visualizes which raw statistics contribute most to the model’s predictions. The top parameter (default: 30) controls how many features are displayed. This transparency is critical for building trust with coaching staff—when a scout asks “why does the model rate this player highly?”, the feature importance chart provides a concrete, verifiable answer.

The /charts/models/base and /charts/models/pro endpoints provide comparative views of the two model variants, showing where the pro model’s enhanced features produce different rankings and why.

3) The Clustering Layer: Tactical Role Classification

Position labels—“midfielder,” “forward,” “defender”—are too coarse for modern tactical analysis. A box-to-box #8 and a deep-lying #6 both carry the “midfielder” label but demand entirely different skill profiles. The Football Performance API addresses this with a Gaussian Mixture Model (GMM) clustering system that assigns players to tactical roles based on their actual statistical profiles rather than their nominal positions.

How the Clustering Works

The GMM clustering operates on the normalized feature vectors from the player-features database. Unlike hard clustering methods like K-Means, GMM produces probabilistic assignments—a player might be 72% “Advanced Playmaker” and 28% “Box-to-Box Carrier”—which captures the reality that elite players often operate across role boundaries.
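The soft-assignment idea is just Bayes' rule over mixture components. A toy sketch, with made-up likelihoods and cluster names, of how per-cluster densities and mixture weights become membership probabilities:

```python
def gmm_memberships(likelihoods, weights):
    """Posterior membership probabilities for one player: normalize the
    product of per-cluster likelihood and mixture weight. Values are toys."""
    joint = [l * w for l, w in zip(likelihoods, weights)]
    total = sum(joint)
    return [j / total for j in joint]

# Two role clusters with equal mixture weights; the result is roughly
# 72% "Advanced Playmaker" and 28% "Box-to-Box Carrier".
probs = gmm_memberships([0.72, 0.28], [0.5, 0.5])
```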

The clustering results are accessible through two endpoints:

  • /database/role-cluster-results — Returns individual player assignments with cluster_id, cluster_name, cluster_group, and cluster_local fields. Filterable by player_id, player_name, season, league, club, position, and cluster_name.

  • /database/role-cluster-summary — Returns aggregate cluster metadata: the characteristic feature profile of each cluster, the player distribution across clusters, and the mean performance metrics per cluster.

Why This Matters for Scouting

Traditional scouting searches start with position: “find me a right winger.” Cluster-based scouting starts with role: “find me an inverted winger who cuts inside to create from half-spaces.” The difference is transformative.

By querying /database/role-cluster-results with a specific cluster_name, you retrieve every player in the database who matches that tactical archetype—across all leagues and seasons. Combined with the feature percentile scores, this creates a scouting shortlist that is both tactically precise and statistically validated.
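The cross-reference can be sketched as a join between the two endpoints' outputs. The player IDs, scores, and record shapes are illustrative assumptions; the field names mirror the endpoints described above:

```python
def shortlist(cluster_results, features_by_id, min_pct=80.0):
    """Keep only players in the target role whose overall rating percentile
    clears a threshold, sorted best-first. Record shapes are assumptions."""
    keep = []
    for row in cluster_results:
        feats = features_by_id.get(row["player_id"], {})
        pct = feats.get("rating_display_score_pct", 0.0)
        if pct >= min_pct:
            keep.append((row["player_id"], pct))
    return sorted(keep, key=lambda t: -t[1])

# Toy rows: cluster matches from /database/role-cluster-results plus
# percentile scores from /database/player-features
results = [{"player_id": 1}, {"player_id": 2}, {"player_id": 3}]
feats = {1: {"rating_display_score_pct": 91.0},
         2: {"rating_display_score_pct": 64.0},
         3: {"rating_display_score_pct": 88.0}}
print(shortlist(results, feats))  # → [(1, 91.0), (3, 88.0)]
```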

The /player-charts/cluster-heatmap endpoint visualizes the mean z-score pattern for each cluster, making it immediately clear which statistical dimensions define each role. The /player-charts/cluster-profile/{player_id} endpoint shows an individual player’s GMM membership probabilities across all clusters, revealing whether they are a pure archetype or a hybrid profile.

4) The Visualization Layer: From Numbers to Communication

Data that cannot be communicated to coaching staff cannot change decisions. The Football Performance API includes a comprehensive visualization system with 10 chart endpoints, each designed to answer a specific analytical question.

  • /player-charts/radar/{player_id} — How does this player’s skill profile compare to the position average? Key parameters: season, source (pro/base), format (png/svg/html).
  • /player-charts/timeline/{player_id} — How has this player’s rating evolved across seasons? Key parameters: dims (e.g., attack_score_score_pct,defense_score_score_pct), format.
  • /player-charts/compare — How do 2–5 players compare on the same radar? Key parameters: player_ids (comma-separated), season, format.
  • /player-charts/top-players — Who are the highest-rated players in a specific context? Key parameters: n, position, league, season, format.
  • /player-charts/score-distribution — How are scores distributed across positions? Key parameters: dim (which score dimension), season, format.
  • /player-charts/score-scatter — How do players cluster on two scoring dimensions? Key parameters: x_dim, y_dim, color_by (position/cluster_group/cluster_name), format.
  • /player-charts/cluster-heatmap — What defines each tactical role cluster? Key parameters: group (e.g., “Forwards”), format.
  • /player-charts/cluster-distribution — How many players fall into each cluster? Key parameters: source, season, format.
  • /player-charts/cluster-profile/{player_id} — What is this player’s role probability distribution? Key parameters: season, source, format.
  • /player-charts/league-scores — How do leagues compare on average skill scores? Key parameters: top_leagues, season, format.

All chart endpoints support multiple output formats: png, jpg, webp, svg, and html. The HTML format is particularly valuable for interactive exploration—embedding a score-scatter chart with format=html produces a zoomable, hoverable plot where clicking a data point reveals the player’s identity and full metric profile.

5) The Report Layer: Structured Analytical Output

For scouting and recruitment workflows that require structured, shareable documents rather than individual charts, the API provides a report generation system:

  • /report/data — Returns structured JSON containing a player’s complete analytical profile: features, scores, cluster assignments, and cross-season trajectories. This endpoint is designed for programmatic consumption—feeding into custom dashboards, PDF generators, or recruitment management systems.

  • /report — Generates a complete HTML player report combining visualizations, statistical summaries, and contextual benchmarks in a single, shareable document. This is the “one-click scouting report” that sporting directors and recruitment committees can review without accessing the analytical platform directly.

  • /report/customize — Provides a personalization interface for tailoring report content and formatting to organizational preferences.

6) Building a Weekly Pipeline on Top of the API

With the data, modeling, clustering, visualization, and reporting layers in place, a practical weekly pipeline looks like this:

  1. Monday — Post-match update. Query /database/player-features for your squad’s latest feature vectors. Run /score/predict-batch to update composite ratings. Compare to the previous week’s scores to identify performance shifts.

  2. Tuesday — Scouting focus. Use /database/role-cluster-results to pull players matching the target tactical role. Filter by league and season. Rank by rating_display_score_pct. Generate /player-charts/compare overlays for the top 5 candidates against your current starter.

  3. Wednesday — Tactical preparation. Pull the upcoming opponent’s player profiles via /database/player-features. Generate radar charts for their key players. Identify where their feature profiles show vulnerabilities relative to position averages.

  4. Thursday — Report generation. Use /report to generate shareable HTML reports for the recruitment committee. Use /player-charts/timeline/{player_id} to show development trajectories for shortlisted targets.

  5. Friday — Cross-league benchmarking. Use /player-charts/league-scores to contextualize performance across competitions. Use /player-charts/score-scatter with color_by=cluster_name to map tactical archetypes across leagues.
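The Monday step, comparing this week's batch predictions against last week's, can be sketched as a simple delta check. The player IDs, scores, and three-point threshold are illustrative assumptions:

```python
def performance_shifts(last_week, this_week, threshold=3.0):
    """Flag players whose composite rating moved by more than `threshold`
    points week-over-week; inputs map player_id to rating."""
    shifts = {}
    for pid, new in this_week.items():
        old = last_week.get(pid)
        if old is not None and abs(new - old) > threshold:
            shifts[pid] = round(new - old, 2)
    return shifts

# Player 101 jumped 4.5 points; player 102's 1-point dip is below threshold.
print(performance_shifts({101: 70.0, 102: 65.0}, {101: 74.5, 102: 64.0}))
```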

7) Data Quality Rules That Apply to Any Pipeline

Even with a well-structured API, analytical quality depends on disciplined data hygiene:

  • Always verify sample size before acting on metrics. The /database/player-features endpoint includes minutes as a field—players with fewer than 900 minutes (approximately 10 full matches) should be flagged with confidence caveats. Short-sample metrics are unreliable regardless of how sophisticated the model is.

  • Use the source parameter deliberately. Switching between pro and base sources mid-analysis produces inconsistent results. Choose one source for each analytical workflow and document it. Use the /charts/models/base and /charts/models/pro comparison to understand where the models diverge.

  • Cross-validate with public benchmarks. Use FBref as an independent reference for raw event counts. If your API-derived feature vectors diverge significantly from FBref’s per-90 statistics for the same player, investigate the normalization pipeline before drawing conclusions.

  • Version your analytical outputs. The API’s season parameter enables longitudinal tracking, but your downstream analysis must also be versioned. A scouting shortlist generated in October should be re-evaluated in January using updated feature vectors—player performance is non-stationary.

  • Separate exploration from production queries. Use generous limit values (up to 5000) for exploratory analysis but lock production dashboards to specific, validated query parameters. The API supports offset-based pagination for large result sets—use it to avoid overwhelming downstream systems.
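The first rule above can be enforced mechanically. A sketch that annotates feature records with a low-sample flag; apart from the documented minutes field, the record shape is an assumption:

```python
MIN_MINUTES = 900  # roughly 10 full matches, per the sample-size rule above

def with_confidence_flags(feature_records):
    """Annotate /database/player-features records with a low-sample caveat
    when minutes fall below the 900-minute threshold."""
    return [
        {**rec, "low_sample": rec.get("minutes", 0) < MIN_MINUTES}
        for rec in feature_records
    ]

flagged = with_confidence_flags([{"player_id": 1, "minutes": 2850},
                                 {"player_id": 2, "minutes": 420}])
# Player 2 is flagged: 420 minutes is too small a sample to trust.
```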

Conclusion

A football data pipeline is successful when every layer—from raw ingestion through feature engineering, prediction, clustering, visualization, and reporting—is transparent, reproducible, and connected to a specific decision. The Football Performance API provides this full stack as an open-source, self-hosted platform with a production-grade managed option from Futrix Metrics. Build the pipeline once, validate it against public benchmarks, and then let the compound effect of marginally better weekly decisions create genuine competitive advantage across a full season.