Open Source Football Player Performance API: Local Setup, Architecture, and Contribution Guide

Deploy our open-source football analytics API locally. Explore the architecture, customize models, and contribute to the community.

2026-05-02T00:00:00.000Z

Open source football analytics has matured from scattered Jupyter notebooks into production-grade systems that clubs, researchers, and independent analysts can deploy, extend, and contribute to. The Football Player Performance API is an open-source REST API that processes raw player statistics through a CatBoost scoring pipeline and a Gaussian Mixture Model (GMM) clustering system, then serves the results through 35+ endpoints covering database queries, predictions, visualizations, and scouting reports.

This guide walks through local deployment, explains the architecture and design decisions, and shows you how to customize the system for your own analytical requirements. Whether you want to self-host a private instance, contribute new features, or study a working example of production football analytics engineering, this is your starting point.

For the hosted version with the enhanced pro model, see Futrix Metrics. For the full API reference, see the Football Performance API Documentation.

1) Repository Overview

The repository is structured around four functional layers that mirror the data flow from raw statistics to analytical output:

football-player-performance-API/
├── base/                    # Base model feature engineering
│   └── features.py          # Feature pipeline for community model
├── database/                # Data layer and schema definitions
├── models/                  # CatBoost model artifacts
├── charts/                  # Visualization endpoint logic
├── reports/                 # HTML scouting report generation
├── clustering/              # GMM role cluster pipeline
├── api/                     # FastAPI route definitions
│   ├── database_routes.py   # /database/* endpoints
│   ├── score_routes.py      # /score/* prediction endpoints
│   ├── chart_routes.py      # /charts/* and /player-charts/*
│   └── report_routes.py     # /report/* endpoints
├── main.py                  # Application entry point
├── requirements.txt         # Python dependencies
└── README.md

Each layer can be modified independently. You can swap the prediction model without touching the chart endpoints. You can add new visualization types without modifying the database layer. You can extend the clustering system with custom features without changing the API routes. This separation of concerns is deliberate — it enables contribution without requiring understanding of the full system.

2) Local Setup: From Clone to Running API

Prerequisites

Python 3.10+
pip or conda for package management
Git

Step 1: Clone and Install

git clone https://github.com/liv-ynwa/football-player-performance-API.git
cd football-player-performance-API
pip install -r requirements.txt

The dependency stack is intentionally conservative to minimize setup friction:

Package	Purpose
FastAPI	ASGI web framework for route handling
Uvicorn	ASGI server
CatBoost	Gradient-boosted tree predictions
scikit-learn	GMM clustering, preprocessing utilities
pandas	Data manipulation and feature engineering
plotly	Interactive chart generation
kaleido	Static image export from Plotly charts

Step 2: Configure Environment

Create a .env file in the project root:

DATABASE_URL=your_database_connection_string
API_SECRET_KEY=your_secret_for_jwt

The database stores player records, feature vectors, computed ratings, and cluster assignments. The API reads from this database at query time — there is no in-memory cache that needs warming. If you are running locally for development, a SQLite or PostgreSQL instance works.
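
A minimal sketch of how startup code might validate these two variables before the server boots. The function name load_settings is hypothetical and not taken from the repository, and the sketch assumes the values have already been exported into the process environment (how the project itself reads .env is not covered here).

```python
import os


def load_settings(env=None):
    """Return the two settings from the .env example, failing fast if one is missing."""
    env = os.environ if env is None else env
    required = ("DATABASE_URL", "API_SECRET_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {key: env[key] for key in required}
```

Running a check like this at startup turns a misconfigured environment into an immediate, readable error rather than a failure on the first database query.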

Step 3: Run the Server

uvicorn main:app --reload --host 0.0.0.0 --port 8000

The API is now live at http://localhost:8000. Verify with the health check:

curl http://localhost:8000/health
# {"status": "ok"}
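
When the server is started from a script or a CI job, it can help to poll the health check rather than call it once. The sketch below assumes the endpoint returns the JSON body shown above; fetch_health and wait_until_healthy are hypothetical helper names, not part of the repository.

```python
import json
import time
import urllib.request


def fetch_health(url="http://localhost:8000/health"):
    """Return the parsed JSON body of the health endpoint."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(fetch=fetch_health, attempts=10, delay=0.5):
    """Poll until the health check reports status 'ok' or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Server not accepting connections yet, or the body was not JSON.
            pass
        time.sleep(delay)
    return False
```

Passing the fetch function as a parameter keeps the polling logic testable without a running server.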

The interactive documentation is available at:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
OpenAPI JSON: http://localhost:8000/openapi.json

These are the same documentation interfaces offered by the hosted version (footballperformanceapi.site/redoc and footballperformanceapi.site/playground), but they run against your local data.

3) Architecture Deep Dive

The system follows a layered architecture where data flows in one direction: raw statistics → feature engineering → model prediction → composite scores and cluster assignments → API delivery.

Layer 1: Data Ingestion

Raw player statistics enter the system through the database layer. Each player record contains season-level aggregates: minutes played, goals, assists, shots, passes completed, passes attempted, tackles, interceptions, clearances, blocks, dribbles, aerials, fouls, and cards. These are standard event-derived statistics available from most football data providers.

The /database/players/basic endpoint provides the player registry — the canonical identity layer that links player_id to name, position, club, league, and season. This registry supports both substring matching (name parameter) and exact matching (full_name, position, league), enabling both exploratory search and precise lookup. For a detailed explanation of the data layer design, see the Comprehensive Guide to Football Data.
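
The difference between the two matching modes can be sketched over an in-memory list. This is an illustration of the semantics described above, not the repository's query code; the function name, the record layout, the player_id values, and the case-insensitive substring comparison are all assumptions.

```python
def search_players(players, name=None, full_name=None, position=None, league=None):
    """Filter player dicts: `name` is a substring match, the other filters are exact."""
    results = []
    for player in players:
        # Substring match (assumed case-insensitive) for exploratory search.
        if name is not None and name.lower() not in player["name"].lower():
            continue
        # Exact matches for precise lookup.
        if full_name is not None and player["name"] != full_name:
            continue
        if position is not None and player["position"] != position:
            continue
        if league is not None and player["league"] != league:
            continue
        results.append(player)
    return results


# Illustrative registry rows; the IDs are made up.
registry = [
    {"player_id": 1, "name": "Virgil van Dijk", "position": "CB", "league": "Premier League"},
    {"player_id": 2, "name": "Jude Bellingham", "position": "CM", "league": "La Liga"},
]

partial = search_players(registry, name="belling")      # substring hit
exact = search_players(registry, full_name="Bellingham")  # no exact match
```

Here the substring query finds the player, while the exact query with only a surname returns nothing, which is the behavior to expect when choosing between name and full_name.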

Layer 2: Feature Engineering

The base/features.py module transforms raw statistics into normalized, position-aware feature vectors. This layer handles:

Per-90 normalization — converting raw season totals (goals, assists, tackles) into per-90-minutes rates for fair cross-player comparison.
Positional context — adjusting features relative to positional baselines so that a center-back’s 2 goals are evaluated differently from a striker’s 2 goals.
Derived ratios — pass completion rate, tackle success rate, aerial win percentage, shot accuracy, and other compound metrics.
Missing value handling — imputation strategies for players with incomplete statistical records.
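
The four steps above can be sketched in plain Python. This is not the code in base/features.py: the function names and output keys are hypothetical, the z-score is only one possible way to express a positional baseline, and the season totals are illustrative.

```python
import statistics


def per_90(total, minutes):
    """Convert a season total into a per-90-minutes rate (0.0 with no minutes)."""
    return total / minutes * 90 if minutes > 0 else 0.0


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def engineer_features(raw):
    """Build per-90 rates and derived ratios from raw season totals."""
    minutes = raw.get("minutes", 0)
    return {
        "goals_per90": per_90(raw.get("goals", 0), minutes),
        "tackles_per90": per_90(raw.get("tackles", 0), minutes),
        "pass_completion": safe_ratio(
            raw.get("passes_completed", 0), raw.get("passes_attempted", 0)
        ),
        "aerial_win_pct": safe_ratio(raw.get("aerials_won", 0), raw.get("aerials", 0)),
        "shot_accuracy": safe_ratio(raw.get("shots_on_target", 0), raw.get("shots", 0)),
    }


def position_z_score(value, baseline_values):
    """Standardize a value against same-position players (assumed z-score scheme)."""
    mean = statistics.fmean(baseline_values)
    spread = statistics.pstdev(baseline_values)
    return (value - mean) / spread if spread else 0.0


# Illustrative season totals for a center-back.
raw = {
    "minutes": 2800, "goals": 3, "tackles": 45,
    "passes_completed": 2200, "passes_attempted": 2400,
    "aerials": 150, "aerials_won": 105,
    "shots": 22, "shots_on_target": 10,
}
features = engineer_features(raw)
```

Returning None for an undefined ratio, rather than zero, leaves the choice of imputation strategy to a later step instead of silently treating "no attempts" as "0% success".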

The source parameter available on most endpoints controls which feature engineering pipeline is used. The base source uses the open-source pipeline in base/features.py. The pro source — available on the hosted platform — uses an enhanced pipeline with additional normalization and proprietary feature engineering.

GET /database/player-features?name=Bellingham&season=2025-2026&source=base

This returns the base-model feature vector for the matched player, using the community feature engineering pipeline.
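
From Python, the same query string can be assembled with the standard library so that parameter values are always percent-encoded. The helper name build_url is hypothetical; the path and parameters are the ones shown above.

```python
from urllib.parse import urlencode


def build_url(base_url, path, **params):
    """Join base URL, path, and query parameters, skipping any that are None."""
    # urlencode escapes reserved characters and encodes spaces as "+".
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + path
    return f"{url}?{query}" if query else url


url = build_url(
    "http://localhost:8000",
    "/database/player-features",
    name="Bellingham",
    season="2025-2026",
    source="base",
)
```

The resulting string can be passed to any HTTP client, together with whichever authentication header your local instance expects.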

Layer 3: Prediction Model

The CatBoost prediction model maps feature vectors to 13 performance score dimensions:

Score	What it measures
`attack_score`	Goals, shots, shots on target contribution
`shooting_quality_score`	Shot accuracy and conversion quality
`assist_score`	Direct assist output
`creation_score`	Chance creation and key passing
`passing_score`	Pass volume and completion
`progression_score`	Ball progression toward goal
`dribble_score`	Dribbling success and attempt frequency
`ball_security_score`	Possession retention under pressure
`defense_score`	Tackles, interceptions, clearances
`aerial_score`	Aerial duel frequency and success rate
`discipline_score`	Foul and card discipline
`appearance_score`	Minutes and availability
`rating_pred`	Composite overall rating prediction

CatBoost was selected over alternatives (XGBoost, LightGBM, random forests) for three reasons: native categorical feature handling eliminates the need for manual encoding of position, league, and season; ordered boosting reduces overfitting on smaller league datasets; and the model provides well-calibrated probability estimates that translate cleanly into percentile scores.

The POST /score/predict endpoint exposes this model directly:

import requests

response = requests.post(
    "http://localhost:8000/score/predict",
    headers={"X-API-Key": "YOUR_KEY"},
    json={
        "features": {
            "position": "CB",
            "season": "2025-2026",
            "league": "Premier League",
            "player_name": "Virgil van Dijk",
            "club": "Liverpool",
            "minutes": 2800,
            "goals": 3,
            "assists": 2,
            "shots": 22,
            "shots_on_target": 10,
            "passes_completed": 2200,
            "passes_attempted": 2400,
            "tackles": 45,
            "tackles_won": 32,
            "interceptions": 55,
            "clearances": 160,
            "blocks": 30,
            "dribbles": 15,
            "dribbles_successful": 10,
            "aerials": 150,
            "aerials_won": 105,
            "fouls_committed": 15,
            "fouls_drawn": 20,
            "yellow_cards": 2,
            "red_cards": 0
        }
    }
)

response.raise_for_status()  # surface HTTP errors before parsing the body
result = response.json()
print(result["scores_0_10"])
print(result["cluster"])

Only position, season, and league are mandatory. All statistical fields default to zero if omitted, but providing complete data produces more accurate scores. The base model variant is available via POST /score/predict-base, which returns a reduced set of scoring dimensions using the community feature pipeline.
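
A client-side sketch of the same rule: check the three mandatory fields and make the zero defaults explicit before sending. The helper name build_predict_payload is hypothetical, and the list of statistical fields is copied from the example request above; whether the API accepts further fields is not covered here.

```python
REQUIRED_FIELDS = ("position", "season", "league")

# Statistical fields taken from the example request in this guide.
STAT_FIELDS = (
    "minutes", "goals", "assists", "shots", "shots_on_target",
    "passes_completed", "passes_attempted", "tackles", "tackles_won",
    "interceptions", "clearances", "blocks", "dribbles",
    "dribbles_successful", "aerials", "aerials_won",
    "fouls_committed", "fouls_drawn", "yellow_cards", "red_cards",
)


def build_predict_payload(**fields):
    """Build a /score/predict request body with explicit zero defaults."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("Missing mandatory fields: " + ", ".join(missing))
    features = {name: 0 for name in STAT_FIELDS}
    features.update(fields)  # supplied values override the zero defaults
    return {"features": features}


payload = build_predict_payload(
    position="CB", season="2025-2026", league="Premier League", goals=3
)
```

Spelling out the zeros makes it visible which statistics the score was computed without, which is useful when comparing predictions built from partial data.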

For the model design rationale and evaluation methodology, see the Player Rating and Role Clustering Model Design and the Football Statistics and Performance Analysis.

Layer 4: GMM Role Clustering

The Gaussian Mixture Model clustering system assigns each player to a tactical role archetype based on their multi-dimensional score profile. Unlike hard-clustering methods (k-means), GMM produces soft assignments with probability distributions — a player might be 72% “Deep-Lying Playmaker” and 18% “Box-to-Box Engine,” reflecting the reality that tactical roles exist on a spectrum.
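
To make the soft-assignment idea concrete, here is a one-dimensional sketch of how a mixture model turns a single score into membership probabilities. The real pipeline works on a multi-dimensional score profile; the component weights, means, and standard deviations below are made-up illustration values, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def soft_assignments(x, components):
    """Posterior membership probabilities for x.

    components is a list of (label, weight, mean, std) tuples.
    """
    weighted = {
        label: weight * gaussian_pdf(x, mean, std)
        for label, weight, mean, std in components
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}


# Two illustrative role components on one hypothetical score axis.
components = [
    ("Deep-Lying Playmaker", 0.5, 7.0, 1.0),
    ("Box-to-Box Engine", 0.5, 5.0, 1.0),
]
probs = soft_assignments(6.8, components)
```

A hard-clustering method would report only the label with the highest probability; the mixture model keeps the full distribution, so a player sitting between two archetypes is visible as such.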

Query cluster assignments via:

GET /database/role-cluster-results?position=CM&league=Premier%20League&season=2025-2026&limit=50

Get cluster summary definitions:

GET /database/role-cluster-summary?cluster_group=Midfielders

Visualize cluster heatmaps to understand what defines each archetype:

GET /player-charts/cluster-heatmap?group=Forwards&format=html

The clustering module in clustering/ contains the GMM fitting logic, cluster labeling, and the probability assignment pipeline. You can modify the number of clusters, the features used for clustering, or the labeling strategy by editing this module. Cluster IDs are deterministic given the same input data and a fixed random seed, so changes propagate cleanly through the system.

For visual exploration of role clusters, see the Rating and Player Cluster Project on the Futrix Metrics platform.

4) Customizing the System

Adding a New Feature

To add a new derived feature to the base model:

Open base/features.py.
Add the computation logic in the feature engineering pipeline. For example, to add a “goal involvement rate”:

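def compute_goal_involvement(row):
    """Goals plus assists per 90 minutes; 0.0 for players with no minutes."""
    total = row.get("goals", 0) + row.get("assists", 0)
    minutes = row.get("minutes", 0)
    if minutes <= 0:
        return 0.0
    return (total / minutes) * 90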

Include the new feature in the feature vector output.
Retrain the CatBoost model with the updated feature set (see “Retraining Models” below).

Adding a New Chart Endpoint

The chart layer uses Plotly for all visualizations. To add a custom chart:

Create the chart logic function in charts/.
Register a new route in api/chart_routes.py:

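from fastapi import Depends

# `router`, `verify_api_key`, and `export_chart` are assumed to be available
# in api/chart_routes.py; `create_custom_chart` is the function from step 1.
@router.get("/player-charts/your-custom-chart")
async def custom_chart(
    player_id: int,
    format: str = "png",  # "png", "svg", or "html"
    api_key: str = Depends(verify_api_key)
):
    """Render a custom chart for one player in the requested format."""
    fig = create_custom_chart(player_id)
    return export_chart(fig, format)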

The export_chart utility handles format conversion (PNG, SVG, HTML) automatically.

All existing chart endpoints follow this pattern — study /player-charts/radar/{player_id} as a reference implementation. The Futrix Metrics Charts Project showcases the full range of built-in chart types.
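
For orientation, a format-dispatch helper of this kind might look like the sketch below. It is not the repository's export_chart: the name export_chart_sketch and the returned (payload, media type) pair are assumptions. It relies only on the to_html and to_image methods that Plotly figures provide, with to_image requiring kaleido for static export.

```python
def export_chart_sketch(fig, fmt="png"):
    """Return (payload, media_type) for a Plotly-style figure."""
    fmt = fmt.lower()
    if fmt == "html":
        # Interactive output: a full standalone HTML document.
        return fig.to_html(full_html=True), "text/html"
    media_types = {"png": "image/png", "svg": "image/svg+xml"}
    if fmt not in media_types:
        raise ValueError(f"Unsupported format: {fmt}")
    # Static output: image bytes rendered through kaleido.
    return fig.to_image(format=fmt), media_types[fmt]
```

In a route handler, the payload and media type would then be wrapped in whatever response class the project uses.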

Retraining Models

To retrain the CatBoost model with modified features or additional training data:

Prepare the training dataset as a pandas DataFrame with feature columns and target scores.
Use the CatBoost Pool and CatBoostRegressor API:

from catboost import CatBoostRegressor, Pool

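# X_train: DataFrame of feature columns; y_train: target scores (from step 1).
# If y_train has several score columns, CatBoost needs a multi-target loss
# such as loss_function="MultiRMSE"; the default RMSE expects one column.
cat_features = ["position", "league", "season"]
train_pool = Pool(X_train, y_train, cat_features=cat_features)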

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    random_seed=42
)

model.fit(train_pool, verbose=100)
model.save_model("models/your_model.cbm")

Update the model loading path in the score routes to point to your new model file.

The base model artifacts in the models/ directory are version-controlled. When you retrain, keep the previous model file as a reference — this enables A/B comparison between model versions using the source parameter.
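
Before switching the API over to a retrained model, a simple offline check on held-out players can show whether the new file is actually better. This sketch compares root-mean-square error for one score dimension; it is an offline comparison rather than the source-parameter A/B route mentioned above, and the function names are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("Sequences must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def compare_models(actual, old_preds, new_preds):
    """Summarize whether the retrained model improves held-out error."""
    old_error = rmse(old_preds, actual)
    new_error = rmse(new_preds, actual)
    return {
        "old_rmse": old_error,
        "new_rmse": new_error,
        "improved": new_error < old_error,
    }
```

The prediction lists would come from calling predict on the previous and the retrained model with the same held-out feature rows.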

5) API Endpoint Reference for Local Development

Once the server is running, every endpoint from the hosted version is available locally. Here is the quick reference organized by functional area:

Database Endpoints

Endpoint	Method	Purpose
`/database/players/basic`	GET	Player registry search
`/database/player-features`	GET	Engineered feature vectors
`/database/ratings`	GET	Composite rating scores
`/database/role-cluster-results`	GET	Player cluster assignments
`/database/role-cluster-summary`	GET	Cluster definitions and profiles

Prediction Endpoints

Endpoint	Method	Purpose
`/score/`	GET	Service metadata and example request
`/score/predict`	POST	Single player prediction (pro model)
`/score/predict-batch`	POST	Batch prediction (1-500 players)
`/score/predict-base`	POST	Single player prediction (base model)
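
Because the batch endpoint accepts between 1 and 500 players per request, larger datasets have to be split on the client side. A minimal chunking sketch follows; the helper name is hypothetical, and the request body shape for /score/predict-batch is not shown here, so check /redoc on your instance before wiring it up.

```python
def chunk_players(players, max_batch=500):
    """Yield consecutive slices of at most max_batch players."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(players), max_batch):
        yield players[start:start + max_batch]
```

Each yielded slice can then be sent as one batch request, with the results concatenated in order.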

Chart Endpoints

Endpoint	Method	Purpose
`/charts/overview`	GET	System overview chart
`/charts/positions`	GET	Position distribution
`/charts/leagues`	GET	League distribution
`/charts/models/base`	GET	Base model visualization
`/charts/models/pro`	GET	Pro model visualization
`/charts/features`	GET	Feature importance diff
`/player-charts/radar/{id}`	GET	Player skill radar
`/player-charts/timeline/{id}`	GET	Season-by-season trajectory
`/player-charts/compare`	GET	Multi-player overlay radar
`/player-charts/top-players`	GET	Top N rankings chart
`/player-charts/score-distribution`	GET	Score box-plot by position
`/player-charts/cluster-heatmap`	GET	Cluster z-score heatmap
`/player-charts/cluster-distribution`	GET	Player count per cluster
`/player-charts/cluster-profile/{id}`	GET	GMM membership probabilities
`/player-charts/league-scores`	GET	League-level score heatmap
`/player-charts/score-scatter`	GET	Two-dimension scatter plot

Report Endpoints

Endpoint	Method	Purpose
`/report/data`	GET	Structured JSON report
`/report`	GET	Rendered HTML report

All endpoints support authentication via either an Authorization: Bearer <token> header or an X-API-Key header. Full parameter documentation is available at /redoc on your local instance, matching the hosted API documentation.
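
A small helper can keep the choice of header in one place on the client side. The function name and the scheme labels are hypothetical, and which credential each header expects (API key or JWT) is an assumption to verify against your instance's documentation.

```python
def auth_headers(credential, scheme="x-api-key"):
    """Build the request headers for one of the two supported schemes."""
    if scheme == "bearer":
        return {"Authorization": f"Bearer {credential}"}
    if scheme == "x-api-key":
        return {"X-API-Key": credential}
    raise ValueError(f"Unknown auth scheme: {scheme}")
```

The returned dictionary can be passed directly as the headers argument of an HTTP client call.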

6) Contributing to the Project

Issue Reporting

Open issues on GitHub with:

Clear reproduction steps for bugs.
Expected vs. actual behavior.
API version and Python version.
Request/response examples where applicable.

Pull Request Guidelines

Fork the repository and create a feature branch from main.
Follow the existing code structure — routes in api/, logic in the corresponding functional module.
Add endpoint documentation strings that mirror the existing pattern (FastAPI auto-generates OpenAPI docs from these).
Test your changes against the local server before submitting.
Keep PRs focused on a single feature or fix.

Areas Open for Contribution

New chart types — Custom visualizations (shot maps, pass networks, progressive carry maps) that build on the existing Plotly infrastructure.
Additional clustering methods — Alternative clustering algorithms (HDBSCAN, spectral clustering) alongside the existing GMM pipeline.
League coverage expansion — Data integration for leagues not currently in the database.
Feature engineering extensions — New derived features in the base pipeline that improve prediction accuracy.
Client libraries — Official SDK wrappers in JavaScript, R, or other languages to simplify integration.

The community benefits most from contributions that extend the system’s capabilities without breaking the existing API contract. New endpoints are preferred over modifications to existing response schemas.

7) From Local to Production

The open-source version provides the complete analytical pipeline. The hosted version at Futrix Metrics adds:

Pro model — Enhanced feature engineering and prediction accuracy beyond the base community model.
Managed infrastructure — No server management, automatic updates, and guaranteed uptime.
Expanded data coverage — Broader league and season coverage.
No-code platform — Player search, comparison, and AI scouting reports via the Futrix Metrics Explorer without any programming.
Stripe-managed subscriptions — From free tier to production scale via the Service Plans.

For teams that want to evaluate the platform before committing to self-hosted infrastructure, the free tier provides access to core database and prediction endpoints at zero cost. For teams that need full control, the open-source repository provides everything needed to deploy, customize, and operate the system independently.

Futrix Metrics API Documentation — Complete hosted API reference.
Comprehensive Guide to Football Data — End-to-end data pipeline architecture.
Football Statistics and Performance Analysis — CatBoost model methodology and per-90 normalization.
Football Decision Intelligence Report — Advanced analytical modules for match analysis.
Project Development Environment — Technical deep dive into the development workflow.
Running Analysis Project — Applied analytics example using the API infrastructure.

Clone the repo, run the server, make your first API call. The code is open, the documentation is live, and the community is building.