Open Source Football Player Performance API: Local Setup, Architecture, and Contribution Guide
Deploy our open-source football analytics API locally. Explore the architecture, customize models, and contribute to the community.
Open source football analytics has matured from scattered Jupyter notebooks into production-grade systems that clubs, researchers, and independent analysts can deploy, extend, and contribute to. The Football Player Performance API is an open-source REST API that processes raw player statistics through a CatBoost scoring pipeline and Gaussian Mixture Model clustering system, then serves the results through 35+ endpoints covering database queries, predictions, visualizations, and scouting reports.
This guide walks through local deployment, explains the architecture and design decisions, and shows you how to customize the system for your own analytical requirements. Whether you want to self-host a private instance, contribute new features, or study a working example of production football analytics engineering, this is your starting point.
For the hosted version with the enhanced pro model, see Futrix Metrics. For the full API reference, see the Football Performance API Documentation.
1) Repository Overview
The repository is structured around four functional layers that mirror the data flow from raw statistics to analytical output:
football-player-performance-API/
├── base/ # Base model feature engineering
│ └── features.py # Feature pipeline for community model
├── database/ # Data layer and schema definitions
├── models/ # CatBoost model artifacts
├── charts/ # Visualization endpoint logic
├── reports/ # HTML scouting report generation
├── clustering/ # GMM role cluster pipeline
├── api/ # FastAPI route definitions
│ ├── database_routes.py # /database/* endpoints
│ ├── score_routes.py # /score/* prediction endpoints
│ ├── chart_routes.py # /charts/* and /player-charts/*
│ └── report_routes.py # /report/* endpoints
├── main.py # Application entry point
├── requirements.txt # Python dependencies
└── README.md
Each layer can be modified independently. You can swap the prediction model without touching the chart endpoints. You can add new visualization types without modifying the database layer. You can extend the clustering system with custom features without changing the API routes. This separation of concerns is deliberate — it enables contribution without requiring understanding of the full system.
2) Local Setup: From Clone to Running API
Prerequisites
- Python 3.10+
- pip or conda for package management
- Git
Step 1: Clone and Install
git clone https://github.com/liv-ynwa/football-player-performance-API.git
cd football-player-performance-API
pip install -r requirements.txt
The dependency stack is intentionally conservative to minimize setup friction:
| Package | Purpose |
|---|---|
| FastAPI | ASGI web framework for route handling |
| Uvicorn | ASGI server |
| CatBoost | Gradient-boosted tree predictions |
| scikit-learn | GMM clustering, preprocessing utilities |
| pandas | Data manipulation and feature engineering |
| plotly | Interactive chart generation |
| kaleido | Static image export from Plotly charts |
Step 2: Configure Environment
Create a .env file in the project root:
DATABASE_URL=your_database_connection_string
API_SECRET_KEY=your_secret_for_jwt
The database stores player records, feature vectors, computed ratings, and cluster assignments. The API reads from this database at query time — there is no in-memory cache that needs warming. If you are running locally for development, a SQLite or PostgreSQL instance works.
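How the project itself loads these values is not shown in this guide, so the following is only a minimal sketch of reading the two settings with the standard library. The helper names `parse_env` and `get_setting` are hypothetical and are not part of the repository.

```python
import os
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines from .env-style text into a dict (hypothetical helper)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_setting(name, file_values):
    """Prefer a real environment variable, then fall back to the .env value."""
    value = os.environ.get(name, file_values.get(name))
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Read .env from the project root if it exists; otherwise use an empty mapping.
env_path = Path(".env")
file_values = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
```

Calling `get_setting("DATABASE_URL", file_values)` at startup surfaces a missing setting immediately rather than at the first database query.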
Step 3: Run the Server
uvicorn main:app --reload --host 0.0.0.0 --port 8000
The API is now live at http://localhost:8000. Verify with the health check:
curl http://localhost:8000/health
# {"status": "ok"}
The interactive documentation is available at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- OpenAPI JSON: http://localhost:8000/openapi.json
These are the same documentation interfaces available at the hosted version — footballperformanceapi.site/redoc and footballperformanceapi.site/playground — but running against your local data.
3) Architecture Deep Dive
The system follows a layered architecture where data flows in one direction: raw statistics → feature engineering → model prediction → composite scores and cluster assignments → API delivery.
Layer 1: Data Ingestion
Raw player statistics enter the system through the database layer. Each player record contains season-level aggregates: minutes played, goals, assists, shots, passes completed, passes attempted, tackles, interceptions, clearances, blocks, dribbles, aerials, fouls, and cards. These are standard event-derived statistics available from most football data providers.
The /database/players/basic endpoint provides the player registry — the canonical identity layer that links player_id to name, position, club, league, and season. This registry supports both substring matching (name parameter) and exact matching (full_name, position, league), enabling both exploratory search and precise lookup. For a detailed explanation of the data layer design, see the Comprehensive Guide to Football Data.
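The difference between the two matching modes can be illustrated with a small in-memory filter. This is a sketch of the semantics only, not the endpoint's implementation: the `search_players` function, the case-insensitive substring comparison, and the `player_id` values are all assumptions made for illustration.

```python
def search_players(players, name=None, full_name=None, position=None, league=None):
    """Filter player dicts: `name` is a substring match, the rest are exact."""
    results = []
    for player in players:
        # Substring match; treating it as case-insensitive is an assumption.
        if name is not None and name.lower() not in player["name"].lower():
            continue
        # Exact matches on the remaining parameters.
        if full_name is not None and player["name"] != full_name:
            continue
        if position is not None and player["position"] != position:
            continue
        if league is not None and player["league"] != league:
            continue
        results.append(player)
    return results


# Illustrative registry rows; the player_id values are made up.
registry = [
    {"player_id": 1, "name": "Jude Bellingham", "position": "CM", "league": "La Liga"},
    {"player_id": 2, "name": "Virgil van Dijk", "position": "CB", "league": "Premier League"},
    {"player_id": 3, "name": "Micky van de Ven", "position": "CB", "league": "Premier League"},
]

exploratory = search_players(registry, name="van")              # two matches
precise = search_players(registry, full_name="Virgil van Dijk")  # one match
```

Substring search suits exploration when the spelling is uncertain, while the exact parameters are the safer choice once a specific player needs to be pinned down.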
Layer 2: Feature Engineering
The base/features.py module transforms raw statistics into normalized, position-aware feature vectors. This layer handles:
- Per-90 normalization — converting raw season totals (goals, assists, tackles) into per-90-minutes rates for fair cross-player comparison.
- Positional context — adjusting features relative to positional baselines so that a center-back’s 2 goals are evaluated differently from a striker’s 2 goals.
- Derived ratios — pass completion rate, tackle success rate, aerial win percentage, shot accuracy, and other compound metrics.
- Missing value handling — imputation strategies for players with incomplete statistical records.
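The four steps above can be sketched in a few lines of standard-library Python. The function names, the z-score approach to positional context, and the median imputation are assumptions chosen for illustration; the actual logic in base/features.py may differ.

```python
from statistics import mean, median, pstdev


def per_90(total, minutes):
    """Convert a season total into a per-90-minutes rate; 0.0 if no minutes."""
    return (total / minutes) * 90 if minutes > 0 else 0.0


def safe_ratio(numerator, denominator):
    """Ratio such as pass completion; None when the denominator is zero."""
    return numerator / denominator if denominator else None


def positional_z_score(value, peer_values):
    """Standardize a value against players in the same position."""
    spread = pstdev(peer_values)
    if spread == 0:
        return 0.0
    return (value - mean(peer_values)) / spread


def impute_with_median(value, peer_values):
    """Replace a missing value with the median of positional peers."""
    return median(peer_values) if value is None else value


# Season totals reused from the prediction example in this guide.
row = {"minutes": 2800, "goals": 3, "passes_completed": 2200,
       "passes_attempted": 2400, "aerials": 150, "aerials_won": 105}

features = {
    "goals_per_90": per_90(row["goals"], row["minutes"]),
    "pass_completion_rate": safe_ratio(row["passes_completed"], row["passes_attempted"]),
    "aerial_win_pct": safe_ratio(row["aerials_won"], row["aerials"]),
}

# Hypothetical goals-per-90 values for other center-backs, for illustration only.
cb_goals_per_90 = [0.02, 0.05, 0.08, 0.11]
features["goals_per_90_z"] = positional_z_score(features["goals_per_90"], cb_goals_per_90)
```

Returning `None` from `safe_ratio` keeps "no attempts" distinct from "zero success", which lets the imputation step decide how to fill the gap.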
The source parameter available on most endpoints controls which feature engineering pipeline is used. The base source uses the open-source pipeline in base/features.py. The pro source — available on the hosted platform — uses an enhanced pipeline with additional normalization and proprietary feature engineering.
GET /database/player-features?name=Bellingham&season=2025-2026&source=base
This returns the base-model feature vector for the matched player, using the community feature engineering pipeline.
Layer 3: Prediction Model
The CatBoost prediction model maps feature vectors to 13 performance score dimensions:
| Score | Dimension |
|---|---|
| attack_score | Goals, shots, shots on target contribution |
| shooting_quality_score | Shot accuracy and conversion quality |
| assist_score | Direct assist output |
| creation_score | Chance creation and key passing |
| passing_score | Pass volume and completion |
| progression_score | Ball progression toward goal |
| dribble_score | Dribbling success and attempt frequency |
| ball_security_score | Possession retention under pressure |
| defense_score | Tackles, interceptions, clearances |
| aerial_score | Aerial duel frequency and success rate |
| discipline_score | Foul and card discipline |
| appearance_score | Minutes and availability |
| rating_pred | Composite overall rating prediction |
CatBoost was selected over alternatives (XGBoost, LightGBM, random forests) for three reasons: native categorical feature handling eliminates the need for manual encoding of position, league, and season; ordered boosting reduces overfitting on smaller league datasets; and its regression outputs translate cleanly into percentile scores.
The POST /score/predict endpoint exposes this model directly:
import requests

# Send one player's season totals to the local prediction endpoint.
response = requests.post(
    "http://localhost:8000/score/predict",
    headers={"X-API-Key": "YOUR_KEY"},
    json={
        "features": {
            "position": "CB",
            "season": "2025-2026",
            "league": "Premier League",
            "player_name": "Virgil van Dijk",
            "club": "Liverpool",
            "minutes": 2800,
            "goals": 3,
            "assists": 2,
            "shots": 22,
            "shots_on_target": 10,
            "passes_completed": 2200,
            "passes_attempted": 2400,
            "tackles": 45,
            "tackles_won": 32,
            "interceptions": 55,
            "clearances": 160,
            "blocks": 30,
            "dribbles": 15,
            "dribbles_successful": 10,
            "aerials": 150,
            "aerials_won": 105,
            "fouls_committed": 15,
            "fouls_drawn": 20,
            "yellow_cards": 2,
            "red_cards": 0
        }
    },
    timeout=30,
)
# Raise on 4xx/5xx responses instead of trying to parse an error body.
response.raise_for_status()

result = response.json()
print(result["scores_0_10"])
print(result["cluster"])
The prediction endpoint requires only position, season, and league as mandatory fields. All statistical fields default to zero if omitted — but providing complete data produces more accurate scores. The base model variant is available via POST /score/predict-base, which returns a reduced set of scoring dimensions using the community feature pipeline.
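A small client-side helper can make the mandatory-versus-optional split explicit before a request is sent. This is a sketch under stated assumptions: `build_features` is a hypothetical helper, the stat field list is copied from the example request above, and the server already applies the zero defaults on its own.

```python
REQUIRED_FIELDS = ("position", "season", "league")

# Statistical fields taken from the example request in this guide.
STAT_FIELDS = (
    "minutes", "goals", "assists", "shots", "shots_on_target",
    "passes_completed", "passes_attempted", "tackles", "tackles_won",
    "interceptions", "clearances", "blocks", "dribbles",
    "dribbles_successful", "aerials", "aerials_won", "fouls_committed",
    "fouls_drawn", "yellow_cards", "red_cards",
)


def build_features(**fields):
    """Validate mandatory fields and fill omitted statistics with zero."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"Missing mandatory fields: {', '.join(missing)}")
    # Start from explicit zeros so the payload shows what the model will see.
    features = {name: 0 for name in STAT_FIELDS}
    features.update(fields)
    return {"features": features}


payload = build_features(
    position="CB", season="2025-2026", league="Premier League",
    minutes=2800, tackles=45, interceptions=55,
)
```

The resulting `payload` can be passed as the `json=` argument of the request shown above. Writing the zeros out explicitly makes it easier to spot a statistic that was left out by accident.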
For the model design rationale and evaluation methodology, see the Player Rating and Role Clustering Model Design and the Football Statistics and Performance Analysis.
Layer 4: GMM Role Clustering
The Gaussian Mixture Model clustering system assigns each player to a tactical role archetype based on their multi-dimensional score profile. Unlike hard-clustering methods (k-means), GMM produces soft assignments with probability distributions — a player might be 72% “Deep-Lying Playmaker” and 18% “Box-to-Box Engine,” reflecting the reality that tactical roles exist on a spectrum.
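To make the idea of soft assignment concrete, the sketch below computes membership probabilities for a two-component mixture with diagonal covariances. The component means, variances, weights, and the two-feature input are hypothetical numbers, not the fitted clusters. In practice scikit-learn's `GaussianMixture.predict_proba` performs this calculation.

```python
import math


def log_gaussian_diag(x, mean, variance):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (xi - mu) ** 2 / var)
        for xi, mu, var in zip(x, mean, variance)
    )


def soft_assign(x, components):
    """Return {label: probability} across all mixture components."""
    log_terms = {
        label: math.log(c["weight"]) + log_gaussian_diag(x, c["mean"], c["variance"])
        for label, c in components.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_terms.values())
    unnormalized = {label: math.exp(v - peak) for label, v in log_terms.items()}
    total = sum(unnormalized.values())
    return {label: v / total for label, v in unnormalized.items()}


# Hypothetical archetypes over two features: (passing_score, defense_score).
components = {
    "Deep-Lying Playmaker": {"weight": 0.5, "mean": [8.0, 5.0], "variance": [1.0, 1.0]},
    "Box-to-Box Engine": {"weight": 0.5, "mean": [6.0, 7.0], "variance": [1.0, 1.0]},
}

probabilities = soft_assign([7.5, 5.5], components)
# Roughly 0.88 Deep-Lying Playmaker and 0.12 Box-to-Box Engine.
```

A k-means assignment would report only the nearest archetype for this player. The mixture model keeps the secondary role visible, which matters for hybrid profiles.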
Query cluster assignments via:
GET /database/role-cluster-results?position=CM&league=Premier League&season=2025-2026&limit=50
Get cluster summary definitions:
GET /database/role-cluster-summary?cluster_group=Midfielders
Visualize cluster heatmaps to understand what defines each archetype:
GET /player-charts/cluster-heatmap?group=Forwards&format=html
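Query values such as `Premier League` contain spaces, so it is safer to build these URLs programmatically than by hand. The sketch below uses only the standard library. `build_url` is a hypothetical helper, and the base URL assumes the local server from Step 3.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # local instance started in Step 3


def build_url(path, **params):
    """Join a path and query parameters, URL-encoding values and dropping None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


results_url = build_url(
    "/database/role-cluster-results",
    position="CM", league="Premier League", season="2025-2026", limit=50,
)
summary_url = build_url("/database/role-cluster-summary", cluster_group="Midfielders")
heatmap_url = build_url("/player-charts/cluster-heatmap", group="Forwards", format="html")
```

`urlencode` turns the space in `Premier League` into `+`, which the server decodes back to a space when it parses the query string.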
The clustering module in clustering/ contains the GMM fitting logic, cluster labeling, and the probability assignment pipeline. You can modify the number of clusters, the features used for clustering, or the labeling strategy by editing this module. The cluster IDs are deterministic given the same input data, so changes propagate cleanly through the system.
For visual exploration of role clusters, see the Rating and Player Cluster Project on the Futrix Metrics platform.
4) Customizing the System
Adding a New Feature
To add a new derived feature to the base model:
- Open base/features.py.
- Add the computation logic in the feature engineering pipeline. For example, to add a “goal involvement rate”:
def compute_goal_involvement(row):
    # Goals plus assists per 90 minutes; 0.0 when the player has no minutes.
    total = row.get("goals", 0) + row.get("assists", 0)
    minutes = row.get("minutes", 0)
    if minutes <= 0:
        return 0.0
    return (total / minutes) * 90
- Include the new feature in the feature vector output.
- Retrain the CatBoost model with the updated feature set (see “Retraining Models” below).
Adding a New Chart Endpoint
The chart layer uses Plotly for all visualizations. To add a custom chart:
- Create the chart logic function in charts/.
- Register a new route in api/chart_routes.py:
# Assumes router, verify_api_key, create_custom_chart, and export_chart
# are already defined or imported in api/chart_routes.py.
from fastapi import Depends

@router.get("/player-charts/your-custom-chart")
async def custom_chart(
    player_id: int,
    format: str = "png",
    api_key: str = Depends(verify_api_key),
):
    fig = create_custom_chart(player_id)
    return export_chart(fig, format)
- The export_chart utility handles format conversion (PNG, SVG, HTML) automatically.
All existing chart endpoints follow this pattern — study /player-charts/radar/{player_id} as a reference implementation. The Futrix Metrics Charts Project showcases the full range of built-in chart types.
Retraining Models
To retrain the CatBoost model with modified features or additional training data:
- Prepare the training dataset as a pandas DataFrame with feature columns and target scores.
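- Use the CatBoost Pool and CatBoostRegressor API:
from catboost import CatBoostRegressor, Pool

# Columns CatBoost should treat as categorical (no manual encoding needed).
cat_features = ["position", "league", "season"]
# X_train and y_train come from the DataFrame prepared in the previous step.
# With this default setup y_train is a single target column; CatBoost's
# MultiRMSE loss is one option for fitting several targets at once.
train_pool = Pool(X_train, y_train, cat_features=cat_features)
model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    l2_leaf_reg=3,
    random_seed=42
)
model.fit(train_pool, verbose=100)
model.save_model("models/your_model.cbm")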
- Update the model loading path in the score routes to point to your new model file.
The base model artifacts in the models/ directory are version-controlled. When you retrain, keep the previous model file as a reference — this enables A/B comparison between model versions using the source parameter.
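One simple way to run that comparison offline is to score both versions against the same held-out targets. The sketch below is an assumption about how such a check might look, not a utility shipped with the repository, and the target and prediction values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def compare_models(y_true, predictions_by_version):
    """Return {version: {"mae": ..., "rmse": ...}} for each model version."""
    return {
        version: {"mae": mae(y_true, preds), "rmse": rmse(y_true, preds)}
        for version, preds in predictions_by_version.items()
    }


# Hypothetical held-out ratings and predictions from two model files.
held_out = [7.0, 6.0, 8.0]
report = compare_models(held_out, {
    "previous": [6.5, 6.5, 7.0],
    "retrained": [7.0, 6.0, 7.5],
})
```

In practice the prediction lists would come from calling `predict` on each loaded `.cbm` file with the same validation features.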
5) API Endpoint Reference for Local Development
Once the server is running, every endpoint from the hosted version is available locally. Here is the quick reference organized by functional area:
Database Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /database/players/basic | GET | Player registry search |
| /database/player-features | GET | Engineered feature vectors |
| /database/ratings | GET | Composite rating scores |
| /database/role-cluster-results | GET | Player cluster assignments |
| /database/role-cluster-summary | GET | Cluster definitions and profiles |
Prediction Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /score/ | GET | Service metadata and example request |
| /score/predict | POST | Single player prediction (pro model) |
| /score/predict-batch | POST | Batch prediction (1-500 players) |
| /score/predict-base | POST | Single player prediction (base model) |
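
The 1-500 player limit on /score/predict-batch means larger datasets have to be split before submission. The sketch below shows only the chunking step. `chunk_players` is a hypothetical helper, and the request body shape for the batch endpoint is not covered here, so consult /redoc for it.

```python
def chunk_players(players, max_batch_size=500):
    """Yield consecutive slices of at most max_batch_size players."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(players), max_batch_size):
        yield players[start:start + max_batch_size]


# Placeholder feature dicts standing in for real player records.
all_players = [{"position": "CM", "season": "2025-2026", "league": "Premier League"}
               for _ in range(1201)]

batch_sizes = [len(batch) for batch in chunk_players(all_players)]
```

Each yielded batch can then be posted in turn, keeping every request inside the documented limit.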
Chart Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /charts/overview | GET | System overview chart |
| /charts/positions | GET | Position distribution |
| /charts/leagues | GET | League distribution |
| /charts/models/base | GET | Base model visualization |
| /charts/models/pro | GET | Pro model visualization |
| /charts/features | GET | Feature importance diff |
| /player-charts/radar/{id} | GET | Player skill radar |
| /player-charts/timeline/{id} | GET | Season-by-season trajectory |
| /player-charts/compare | GET | Multi-player overlay radar |
| /player-charts/top-players | GET | Top N rankings chart |
| /player-charts/score-distribution | GET | Score box-plot by position |
| /player-charts/cluster-heatmap | GET | Cluster z-score heatmap |
| /player-charts/cluster-distribution | GET | Player count per cluster |
| /player-charts/cluster-profile/{id} | GET | GMM membership probabilities |
| /player-charts/league-scores | GET | League-level score heatmap |
| /player-charts/score-scatter | GET | Two-dimension scatter plot |
Report Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /report/data | GET | Structured JSON report |
| /report | GET | Rendered HTML report |
All endpoints support authentication via Authorization: Bearer or X-API-Key header. Full parameter documentation is available at /redoc on your local instance, matching the hosted API documentation.
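Either header style can be produced by one small helper, so client code does not hard-code the scheme. This is a sketch: `auth_headers` is a hypothetical function, and whether the same credential is accepted under both schemes is an assumption to verify against your instance.

```python
def auth_headers(credential, scheme="x-api-key"):
    """Build request headers for one of the two documented auth styles."""
    if scheme == "x-api-key":
        return {"X-API-Key": credential}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {credential}"}
    raise ValueError(f"Unsupported auth scheme: {scheme}")


key_headers = auth_headers("YOUR_KEY")
bearer_headers = auth_headers("YOUR_TOKEN", scheme="bearer")
```

The returned dict can be passed directly as the `headers=` argument of a `requests` call like the prediction example earlier in this guide.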
6) Contributing to the Project
Issue Reporting
Open issues on GitHub with:
- Clear reproduction steps for bugs.
- Expected vs. actual behavior.
- API version and Python version.
- Request/response examples where applicable.
Pull Request Guidelines
- Fork the repository and create a feature branch from main.
- Follow the existing code structure: routes in api/, logic in the corresponding functional module.
- Add endpoint documentation strings that mirror the existing pattern (FastAPI auto-generates OpenAPI docs from these).
- Test your changes against the local server before submitting.
- Keep PRs focused on a single feature or fix.
Areas Open for Contribution
- New chart types — Custom visualizations (shot maps, pass networks, progressive carry maps) that build on the existing Plotly infrastructure.
- Additional clustering methods — Alternative clustering algorithms (HDBSCAN, spectral clustering) alongside the existing GMM pipeline.
- League coverage expansion — Data integration for leagues not currently in the database.
- Feature engineering extensions — New derived features in the base pipeline that improve prediction accuracy.
- Client libraries — Official SDK wrappers in JavaScript, R, or other languages to simplify integration.
The community benefits most from contributions that extend the system’s capabilities without breaking the existing API contract. New endpoints are preferred over modifications to existing response schemas.
7) From Local to Production
The open-source version provides the complete analytical pipeline. The hosted version at Futrix Metrics adds:
- Pro model — Enhanced feature engineering and prediction accuracy beyond the base community model.
- Managed infrastructure — No server management, automatic updates, and guaranteed uptime.
- Expanded data coverage — Broader league and season coverage.
- No-code platform — Player search, comparison, and AI scouting reports via the Futrix Metrics Explorer without any programming.
- Stripe-managed subscriptions — From free tier to production scale via the Service Plans.
For teams that want to evaluate the platform before committing to self-hosted infrastructure, the free tier provides access to core database and prediction endpoints at zero cost. For teams that need full control, the open-source repository provides everything needed to deploy, customize, and operate the system independently.
Related Resources
- Futrix Metrics API Documentation — Complete hosted API reference.
- Comprehensive Guide to Football Data — End-to-end data pipeline architecture.
- Football Statistics and Performance Analysis — CatBoost model methodology and per-90 normalization.
- Football Decision Intelligence Report — Advanced analytical modules for match analysis.
- Project Development Environment — Technical deep dive into the development workflow.
- Running Analysis Project — Applied analytics example using the API infrastructure.
Clone the repo, run the server, make your first API call. The code is open, the documentation is live, and the community is building.