CardValueML

CardValueML is a local-first machine learning and MLOps showcase that mirrors the workflows of a trading-card pricing team. The repository ships with realistic sample data, reproducible pipelines, and production-style tooling so you can demonstrate ingestion, modelling, deployment, and monitoring end to end without a cloud account.

Highlights

  • Data ingestion & validation - Scrapers and generators (scripts/fetch_real_sales_data_v2.py, scripts/generate_realistic_dataset.py) produce more than 2,000 rows of card sales, validated with Great Expectations in cardvalue_ml.data.validate before landing in data/processed/sample_sales.csv.
  • Quality-controlled storage - SQLite helpers in cardvalue_ml.data.database and Make targets (make ingest, make sqlite) persist cleaned data for downstream workloads; feature stores (scripts/build_feature_store.py) support SQLite, DuckDB, and Redis.
  • Feature engineering - cardvalue_ml.features.build_features standardises sale dates, one-hot encodes card metadata, and exposes helpers for rolling stats so models stay tabular-friendly.
  • Modelling & experiments - cardvalue_ml.models.train trains a Random Forest baseline, models/benchmark.py compares XGBoost/CatBoost, models/backtest.py runs rolling evaluations, and models/tracking.py logs runs to MLflow.
  • Explainability & risk - SHAP workflows (scripts/generate_shap.py, cardvalue_ml.models.explain) and ensemble variance utilities in cardvalue_ml.models.risk provide underwriting-friendly transparency for Streamlit and the API.
  • Serving & user interfaces - FastAPI (src/cardvalue_ml/api/app.py) exposes /predict, /feature-insights, /latest-sales, /metrics, and /health; Streamlit (apps/streamlit_app.py) offers an Alt-branded pricing console fed by the same artifacts.
  • Pipelines & ops - Prefect flow (src/cardvalue_ml/pipelines/sample_flow.py), Airflow DAG (airflow_dags/cardvalue_pipeline.py), smoke tests, drift monitoring (cardvalue_ml.monitoring.drift), and Docker Compose manifests cover orchestration and deployment scenarios.
  • Documentation library - MkDocs site (mkdocs.yml) stitches together architecture notes, MLOps strategy, monitoring guides, recruiter collateral, and more inside docs/.

Repository Layout

CardValueML/
|-- airflow_dags/           # Airflow DAGs that call Makefile tasks
|-- apps/                   # Streamlit valuation interface
|-- artifacts/              # Model metrics, SHAP plots, drift reports (generated)
|-- data/                   # Processed datasets (generated)
|-- docker/                 # Dockerfile, docker-compose.yaml, supporting scripts
|-- docs/                   # Architecture, ops strategy, collateral for interviews
|-- infra/                  # AWS ECS task + service boilerplate
|-- models/                 # Trained model artifacts (generated)
|-- notebooks/              # EDA and experiment notebooks
|-- raw_data/               # Raw exports and scraped data (generated)
|-- scripts/                # CLI utilities for ingestion, modelling, ops
|-- src/cardvalue_ml/       # Python package with data, features, models, API
`-- tests/                  # Unit, integration, and smoke-style tests

Setup

  1. Use Python 3.10+ (pyproject.toml) and create a virtual environment (python3 -m venv .venv && source .venv/bin/activate).
  2. Install dependencies with make dev-install (installs -e .[dev] and pre-commit hooks). Alternative: python3 -m pip install -r requirements.txt.
  3. Optional extras:
    • Redis or DuckDB if you want to exercise non-SQLite feature stores.
    • OPENAI_API_KEY for scripts/llm_enrich_metadata.py.
    • Docker/Compose for container demos (docker/).
  4. Run pre-commit install if you skipped make dev-install.

Quickstart Workflow

  1. Generate data
    • Preferred: python3 scripts/fetch_real_sales_data_v2.py (fetches public sales and backfills to ~2,000 rows).
    • Offline fallback: make generate-data (deterministic synthetic set in raw_data/synthetic_sales.csv).
  2. Ingest & validate - make ingest cleans the raw CSV, runs Great Expectations checks, and writes data/processed/sample_sales.csv. Run make sqlite to persist into data/cardvalue_ml.db.
  3. Train the baseline - make train fits the Random Forest, saving models/random_forest.joblib plus metrics/feature importances under artifacts/.
  4. Explainability - make shap writes SHAP plots (artifacts/explainability/shap_summary.png) and mean absolute values JSON.
  5. Serve predictions - uvicorn cardvalue_ml.api.app:app --reload exposes the FastAPI service; hit http://localhost:8000/docs (see the example request after this list).
  6. Front-end workflow - make streamlit launches apps/streamlit_app.py, reading the metrics, model, and SHAP artifacts generated above.
  7. Pipeline run - python3 scripts/run_pipeline.py (or make pipeline) executes the Prefect flow for ingestion -> training.
  8. Smoke tests - make smoke builds artifacts if missing and exercises /predict, /feature-insights, and /latest-sales.
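
As a quick sanity check after step 5, you can call /predict directly from Python. The payload below is illustrative only; the field names are assumptions, so check the request model at http://localhost:8000/docs before relying on them.

import requests

# Hypothetical payload - the real field names live in the OpenAPI schema at /docs.
payload = {
    "player": "Example Player",
    "year": 2020,
    "grade": 9.5,
    "card_set": "Example Set",
}

response = requests.post("http://localhost:8000/predict", json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # expected to include a price prediction and a 95% interval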

Make Targets & Script Shortcuts

  • make dev-install / make install - install dependencies (dev adds linting/test tooling).
  • make format, make lint, make test - run Ruff+Black, lint checks, and pytest (see tests/ for data, feature, API, pipeline, and monitoring coverage).
  • make ingest, make sqlite, make generate-data - manage raw/processed datasets and SQLite persistence.
  • make train, make benchmark, make backtest, make experiment - train the baseline, compare with XGBoost/CatBoost, run rolling evaluations, and execute YAML-driven experiments (experiments/configs).
  • make shap, make profile, make feature-store, make drift - produce explainability artifacts and profiling reports, populate the SQLite-backed feature store, and generate Evidently drift dashboards. (Use python3 scripts/build_feature_store.py --backend duckdb|redis for non-SQLite backends.)
  • make pipeline, make smoke, make streamlit - orchestration helpers and interactive demos.
  • make docker-build, make docker-up, make docker-down - containerise the API and seed artifacts for deployment trials.

Many targets are thin wrappers around scripts/*.py; inspect those scripts for CLI flags (e.g., --dataset, --backend, --days).

Data & Feature Pipelines

  • Great Expectations validation - cardvalue_ml.data.validate enforces schema checks before storing any sale; violations halt ingestion.
  • Cleaning & storage - cardvalue_ml.data.process.clean_sales_dataframe standardises numeric columns and datetimes; cardvalue_ml.data.database writes to SQLite for both API responses and downstream batch processing.
  • Feature engineering - cardvalue_ml.features.build_features converts sale dates to ordinals and one-hot encodes card/player metadata. Utilities such as add_rolling_stat are available for extending time-aware signals. (A sketch of this pattern appears after this list.)
  • Feature store options - Populate SQLite/DuckDB/Redis stores with scripts/build_feature_store.py; see docs/feature_store.md for usage patterns.
  • Streaming & enrichment demos - scripts/stream_simulator.py, scripts/fetch_game_stats.py, scripts/fetch_multisport_stats.py, and scripts/fetch_alternative_assets.py illustrate how live signals or adjacent asset classes feed into the same schema.
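
A minimal sketch of the feature-engineering step described above, using pandas directly. Column names such as sale_date, player, and card_set are assumptions; the canonical transformations live in cardvalue_ml.features.build_features.

import pandas as pd

# Load the validated dataset produced by `make ingest`.
df = pd.read_csv("data/processed/sample_sales.csv", parse_dates=["sale_date"])

# Convert sale dates to ordinals so tree-based models can consume them directly.
df["sale_date_ordinal"] = df["sale_date"].map(pd.Timestamp.toordinal)

# One-hot encode card/player metadata into a purely numeric feature matrix.
features = pd.get_dummies(df.drop(columns=["sale_date"]), columns=["player", "card_set"])
print(features.head())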

Modelling, Experiments & Explainability

  • Random Forest baseline - cardvalue_ml.models.train.train_random_forest produces metrics.json, feature_importances.json, and feature_columns.json for downstream apps (see the sketch after this list).
  • Benchmarking - cardvalue_ml.models.benchmark evaluates Random Forest, XGBoost, and CatBoost on identical feature matrices; results can be logged to MLflow via cardvalue_ml.models.tracking.log_benchmark_results.
  • Rolling backtests - cardvalue_ml.models.backtest.rolling_backtest simulates underwriting behaviour over sliding windows (make backtest).
  • Experiments - Define configs in experiments/configs/*.yaml and run them through scripts/run_experiment.py; metrics land in artifacts/experiments/.
  • MLflow integration - cardvalue_ml.models.tracking initialises a local MLflow instance in artifacts/mlruns for experiment review.
  • Explainability & risk - SHAP plots via cardvalue_ml.models.explain and ensemble uncertainty via cardvalue_ml.models.risk feed Streamlit as well as the /feature-insights endpoint.
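
For orientation, the baseline follows a pattern roughly like the sketch below. This is not the repository's train_random_forest implementation; the target column name (sale_price) and output paths are assumptions borrowed from the artifact names above.

import json
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/sample_sales.csv")
y = df["sale_price"]                                   # hypothetical target column
X = pd.get_dummies(df.drop(columns=["sale_price"]))    # crude stand-in for build_features

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

metrics = {"mae": float(mean_absolute_error(y_test, model.predict(X_test)))}
joblib.dump(model, "models/random_forest.joblib")
with open("artifacts/metrics.json", "w") as handle:
    json.dump(metrics, handle, indent=2)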

Serving & Interfaces

  • FastAPI service - src/cardvalue_ml/api/app.py loads artifacts on startup and exposes:
    • POST /predict - price prediction + 95% interval,
    • POST /feature-insights - per-feature SHAP contributions and risk stats,
    • GET /latest-sales - recent sales from SQLite or fallback CSV,
    • GET /metrics, /feature-importances, and /health.
  • Streamlit dashboard - apps/streamlit_app.py reads model artifacts, displays feature importances, and calculates loan-to-value scenarios for high-value cards.
  • Automated checks - scripts/run_smoke_tests.py provisions model artifacts if needed and validates key endpoints with FastAPI's TestClient.
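
The smoke-test pattern is easy to reproduce interactively with FastAPI's TestClient. The snippet below only exercises the GET endpoints listed above and assumes the model artifacts have already been generated (for example via make train and make shap).

from fastapi.testclient import TestClient

from cardvalue_ml.api.app import app

client = TestClient(app)

# Liveness first, then the read-only endpoints that back the Streamlit dashboard.
assert client.get("/health").status_code == 200
print(client.get("/metrics").json())
print(client.get("/latest-sales").json())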

Monitoring & Operations

  • Prefect pipeline - src/cardvalue_ml/pipelines/sample_flow.py orchestrates ingestion, SQLite persistence, training, and artifact logging; scripts/run_pipeline.py provides a CLI wrapper.
  • Airflow DAG - airflow_dags/cardvalue_pipeline.py mirrors the Prefect flow for scheduler parity.
  • Evidently drift reports - scripts/drift_report.py (and cardvalue_ml.monitoring.drift) compare reference vs. current datasets, placing HTML output under artifacts/monitoring/ (the general pattern is sketched after this list).
  • Docker & AWS - docker/docker-compose.yml runs inference containers locally; infra/ecs/ contains task/service definitions for ECS proof-of-concept deployments.
  • Validation checklist - docs/local_validation_checklist.md and docs/pipeline_monitoring.md expand on operational readiness.
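
The drift check follows the standard Evidently report pattern sketched below. Module paths differ between Evidently versions, and the file paths here are placeholders rather than the actual arguments of scripts/drift_report.py.

import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Reference = data the model was trained on; current = newly observed sales.
reference = pd.read_csv("data/processed/sample_sales.csv")
current = pd.read_csv("raw_data/synthetic_sales.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("artifacts/monitoring/drift_report.html")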

Documentation & Collateral

  • Launch the MkDocs site locally with mkdocs serve to browse the curated documentation.
  • Key references:
    • docs/architecture.md - system design overview,
    • docs/mlops_strategy.md - infra and cost strategy,
    • docs/data_sources.md and docs/dataset_summary.md - data sourcing details,
    • docs/feature_store.md, docs/mlflow_tracking.md, docs/mlflow_ui_demo.md,
    • docs/recruiter_summary.md, docs/pitch_template.md, docs/presentation_outline.md,
    • docs/prefect_agent.md, docs/aws_deployment.md, docs/role_alignment.md.
  • Presentation-ready summaries live under docs/ and private_docs/ to help tailor interviews.

Optional Integrations & Notes

  • Network-required scripts (fetch_real_sales_data_v2, fetch_game_stats, fetch_multisport_stats) use public APIs or pages; fall back to synthetic generators if connectivity is limited.
  • scripts/llm_enrich_metadata.py demonstrates GPT-powered card descriptions and requires OPENAI_API_KEY.
  • Redis-backed feature stores (run python3 scripts/build_feature_store.py --backend redis) and Docker-based Redis (docker run -p 6379:6379 redis) are optional but supported; a minimal redis-py sketch follows this list.
  • Pre-generate artifacts with python3 scripts/bootstrap_sample_artifacts.py if you need ready-to-demo models.
  • Need deep-learning extras (Transformers, PyTorch) for future experiments? Install with pip install .[full].
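
With the Docker-based Redis running, reading and writing features follows the usual redis-py hash pattern. The key naming and fields below are hypothetical; scripts/build_feature_store.py defines the real layout.

import redis

# Connect to the local instance started with `docker run -p 6379:6379 redis`.
store = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Hypothetical per-card feature hash.
store.hset("card:example-123", mapping={"grade": 9.5, "rolling_mean_price_30d": 152.4})
print(store.hgetall("card:example-123"))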

Contributing

  • Run make format, make lint, and make test before opening a PR; CI (.github/workflows/ci.yml) enforces the same gates.
  • Use make smoke or targeted pytest tests/test_* runs to validate changes to pipelines or APIs.
  • Capture larger ideas in docs/spec_alignment.md or extend Make targets as needed; the repo is structured to be fork-friendly for interview loops.
  • Optional but recommended: point Git to the bundled hooks with git config core.hooksPath .githooks so the pre-push hook runs make lint automatically before every push.
