CardValueML is a local-first machine learning and MLOps showcase that mirrors the workflows of a trading-card pricing team. The repository ships with realistic sample data, reproducible pipelines, and production-style tooling so you can demonstrate ingestion, modelling, deployment, and monitoring end to end without a cloud account.
- Data ingestion & validation - scrapers and generators (`scripts/fetch_real_sales_data_v2.py`, `scripts/generate_realistic_dataset.py`) create 2K+ card sales rows, validated with Great Expectations in `cardvalue_ml.data.validate` before landing in `data/processed/sample_sales.csv`.
- Quality-controlled storage - SQLite helpers in `cardvalue_ml.data.database` and Make targets (`make ingest`, `make sqlite`) persist cleaned data for downstream workloads; feature stores (`scripts/build_feature_store.py`) support SQLite, DuckDB, and Redis.
- Feature engineering - `cardvalue_ml.features.build_features` standardises sale dates, one-hot encodes card metadata, and exposes helpers for rolling stats so models stay tabular-friendly.
- Modelling & experiments - `cardvalue_ml.models.train` trains a Random Forest baseline, `models/benchmark.py` compares XGBoost/CatBoost, `models/backtest.py` runs rolling evaluations, and `models/tracking.py` logs runs to MLflow.
- Explainability & risk - SHAP workflows (`scripts/generate_shap.py`, `cardvalue_ml.models.explain`) and ensemble variance utilities in `cardvalue_ml.models.risk` provide underwriting-friendly transparency for Streamlit and the API.
- Serving & user interfaces - FastAPI (`src/cardvalue_ml/api/app.py`) exposes `/predict`, `/feature-insights`, `/latest-sales`, `/metrics`, and `/health`; Streamlit (`apps/streamlit_app.py`) offers an Alt-branded pricing console fed by the same artifacts.
- Pipelines & ops - Prefect flow (`src/cardvalue_ml/pipelines/sample_flow.py`), Airflow DAG (`airflow_dags/cardvalue_pipeline.py`), smoke tests, drift monitoring (`cardvalue_ml.monitoring.drift`), and Docker Compose manifests cover orchestration and deployment scenarios.
- Documentation library - MkDocs site (`mkdocs.yml`) stitches together architecture notes, MLOps strategy, monitoring guides, recruiter collateral, and more inside `docs/`.
```text
CardValueML/
|-- airflow_dags/     # Airflow DAGs that call Makefile tasks
|-- apps/             # Streamlit valuation interface
|-- artifacts/        # Model metrics, SHAP plots, drift reports (generated)
|-- data/             # Processed datasets (generated)
|-- docker/           # Dockerfile, docker-compose.yaml, supporting scripts
|-- docs/             # Architecture, ops strategy, collateral for interviews
|-- infra/            # AWS ECS task + service boilerplate
|-- models/           # Trained model artifacts (generated)
|-- notebooks/        # EDA and experiment notebooks
|-- raw_data/         # Raw exports and scraped data (generated)
|-- scripts/          # CLI utilities for ingestion, modelling, ops
|-- src/cardvalue_ml/ # Python package with data, features, models, API
`-- tests/            # Unit, integration, and smoke-style tests
```
- Use Python 3.10+ (see `pyproject.toml`) and create a virtual environment (`python3 -m venv .venv && source .venv/bin/activate`).
- Install dependencies with `make dev-install` (installs `-e .[dev]` and pre-commit hooks). Alternative: `python3 -m pip install -r requirements.txt`.
- Optional extras:
  - Redis or DuckDB if you want to exercise non-SQLite feature stores.
  - `OPENAI_API_KEY` for `scripts/llm_enrich_metadata.py`.
  - Docker/Compose for container demos (`docker/`).
- Run `pre-commit install` if you skipped `make dev-install`.
- Generate data
  - Preferred: `python3 scripts/fetch_real_sales_data_v2.py` (fetches public sales and backfills to ~2,000 rows).
  - Offline fallback: `make generate-data` (deterministic synthetic set in `raw_data/synthetic_sales.csv`).
- Ingest & validate - `make ingest` cleans the raw CSV, runs Great Expectations checks, and writes `data/processed/sample_sales.csv`. Run `make sqlite` to persist into `data/cardvalue_ml.db`.
- Train the baseline - `make train` fits the Random Forest, saving `models/random_forest.joblib` plus metrics/feature importances under `artifacts/`.
- Explainability - `make shap` writes SHAP plots (`artifacts/explainability/shap_summary.png`) and a mean-absolute-values JSON.
- Serve predictions - `uvicorn cardvalue_ml.api.app:app --reload` exposes the FastAPI service; hit `http://localhost:8000/docs`.
- Front-end workflow - `make streamlit` launches `apps/streamlit_app.py`, reading the metrics, model, and SHAP artifacts generated above.
- Pipeline run - `python3 scripts/run_pipeline.py` (or `make pipeline`) executes the Prefect flow for ingestion -> training.
- Smoke tests - `make smoke` builds artifacts if missing and exercises `/predict`, `/feature-insights`, and `/latest-sales`.
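Once the service is up, you can exercise `/predict` directly from Python. Note the payload fields below (`card_name`, `grade`, `sale_date`) are illustrative guesses, not the documented schema; check `http://localhost:8000/docs` for the real request model.

```python
import json
import urllib.request
import urllib.error

# Hypothetical payload -- consult the OpenAPI docs at /docs for the real schema.
payload = {"card_name": "2003 Topps LeBron James", "grade": 9.5, "sale_date": "2024-01-15"}

request = urllib.request.Request(
    "http://localhost:8000/predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(request, timeout=5) as response:
        # The endpoint returns a price prediction plus a 95% interval.
        print(json.loads(response.read()))
except (urllib.error.URLError, OSError):
    print("API not reachable -- start it with: uvicorn cardvalue_ml.api.app:app --reload")
```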
- `make dev-install` / `make install` - install dependencies (dev adds linting/test tooling).
- `make format`, `make lint`, `make test` - run Ruff+Black, lint checks, and `pytest` (see `tests/` for data, feature, API, pipeline, and monitoring coverage).
- `make ingest`, `make sqlite`, `make generate-data` - manage raw/processed datasets and SQLite persistence.
- `make train`, `make benchmark`, `make backtest`, `make experiment` - train the baseline, compare with XGBoost/CatBoost, run rolling evaluations, and execute YAML-driven experiments (`experiments/configs`).
- `make shap`, `make profile`, `make feature-store`, `make drift` - produce explainability artifacts and profiling reports, populate the SQLite-backed feature store, and generate Evidently drift dashboards. (Use `python3 scripts/build_feature_store.py --backend duckdb|redis` for non-SQLite backends.)
- `make pipeline`, `make smoke`, `make streamlit` - orchestration helpers and interactive demos.
- `make docker-build`, `make docker-up`, `make docker-down` - containerise the API and seed artifacts for deployment trials.

Many targets are thin wrappers around `scripts/*.py`; inspect those scripts for CLI flags (e.g., `--dataset`, `--backend`, `--days`).
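As a rough idea of what those wrappers look like, here is a hypothetical reconstruction of such a CLI using `argparse`. The flag names come from this README, but the defaults, choices, and help text are assumptions, not the repo's actual interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the kind of CLI the scripts/*.py wrappers expose (defaults assumed)."""
    parser = argparse.ArgumentParser(description="Example CardValueML script interface")
    parser.add_argument("--dataset", default="data/processed/sample_sales.csv",
                        help="Path to the input dataset")
    parser.add_argument("--backend", choices=["sqlite", "duckdb", "redis"],
                        default="sqlite", help="Feature-store backend")
    parser.add_argument("--days", type=int, default=30,
                        help="Lookback window in days")
    return parser

# Parse a sample invocation instead of sys.argv so the snippet is self-contained.
args = build_parser().parse_args(["--backend", "duckdb", "--days", "7"])
print(args.dataset, args.backend, args.days)
```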
- Great Expectations validation - `cardvalue_ml.data.validate` enforces schema checks before storing any sale; violations halt ingestion.
- Cleaning & storage - `cardvalue_ml.data.process.clean_sales_dataframe` standardises numeric columns and datetimes; `cardvalue_ml.data.database` writes to SQLite for both API responses and downstream batch processing.
- Feature engineering - `cardvalue_ml.features.build_features` converts sale dates to ordinals and one-hot encodes card/player metadata. Utilities such as `add_rolling_stat` are available for extending time-aware signals.
- Feature store options - populate SQLite/DuckDB/Redis stores with `scripts/build_feature_store.py`; see `docs/feature_store.md` for usage patterns.
- Streaming & enrichment demos - `scripts/stream_simulator.py`, `scripts/fetch_game_stats.py`, `scripts/fetch_multisport_stats.py`, and `scripts/fetch_alternative_assets.py` illustrate how live signals or adjacent asset classes feed into the same schema.
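To make the rolling-stat idea concrete, here is a minimal sketch in the spirit of `add_rolling_stat`. The helper name comes from the list above, but the signature and behaviour shown here are assumptions, not the package's actual implementation.

```python
import pandas as pd

def add_rolling_stat(df: pd.DataFrame, group_col: str, value_col: str,
                     window: int = 3) -> pd.DataFrame:
    """Add a per-group rolling mean over time-ordered sales (illustrative signature)."""
    df = df.sort_values("sale_date").copy()
    df[f"{value_col}_roll_mean_{window}"] = (
        df.groupby(group_col)[value_col]
        .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return df

# Tiny example: two cards with interleaved sale dates.
sales = pd.DataFrame({
    "card_id": ["a", "a", "a", "b", "b"],
    "sale_date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02", "2024-01-06"]),
    "price": [100.0, 120.0, 140.0, 50.0, 70.0],
})
out = add_rolling_stat(sales, "card_id", "price", window=2)
print(out[["card_id", "price", "price_roll_mean_2"]])
```

Keeping the rolling computation inside `groupby(...).transform` ensures each card's history never leaks into another card's features.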
- Random Forest baseline - `cardvalue_ml.models.train.train_random_forest` produces `metrics.json`, `feature_importances.json`, and `feature_columns.json` for downstream apps.
- Benchmarking - `cardvalue_ml.models.benchmark` evaluates Random Forest, XGBoost, and CatBoost on identical feature matrices; results can be logged to MLflow via `cardvalue_ml.models.tracking.log_benchmark_results`.
- Rolling backtests - `cardvalue_ml.models.backtest.rolling_backtest` simulates underwriting behaviour over sliding windows (`make backtest`).
- Experiments - define configs in `experiments/configs/*.yaml` and run them through `scripts/run_experiment.py`; metrics land in `artifacts/experiments/`.
- MLflow integration - `cardvalue_ml.models.tracking` initialises a local MLflow instance in `artifacts/mlruns` for experiment review.
- Explainability & risk - SHAP plots via `cardvalue_ml.models.explain` and ensemble uncertainty via `cardvalue_ml.models.risk` feed Streamlit as well as the `/feature-insights` endpoint.
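A common way to turn ensemble variance into an uncertainty estimate is to look at the spread of per-estimator predictions; the sketch below illustrates that idea with hard-coded numbers. Whether `cardvalue_ml.models.risk` computes intervals exactly this way is an assumption.

```python
import numpy as np

# Illustrative only: per-estimator predictions for a single card, as you might
# collect via [tree.predict(X) for tree in model.estimators_] on a fitted forest.
tree_predictions = np.array([980.0, 1010.0, 995.0, 1040.0, 975.0, 1000.0])

point_estimate = tree_predictions.mean()
spread = tree_predictions.std(ddof=1)
# Simple empirical 95% interval from the ensemble's prediction distribution.
lower, upper = np.percentile(tree_predictions, [2.5, 97.5])

print(f"price={point_estimate:.2f} interval=({lower:.2f}, {upper:.2f}) std={spread:.2f}")
```

A wide interval relative to the point estimate is exactly the kind of signal an underwriting workflow would flag before extending a loan against a card.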
- FastAPI service - `src/cardvalue_ml/api/app.py` loads artifacts on startup and exposes:
  - `POST /predict` - price prediction + 95% interval,
  - `POST /feature-insights` - per-feature SHAP contributions and risk stats,
  - `GET /latest-sales` - recent sales from SQLite or fallback CSV,
  - `GET /metrics`, `/feature-importances`, and `/health`.
- Streamlit dashboard - `apps/streamlit_app.py` reads model artifacts, displays feature importances, and calculates loan-to-value scenarios for high-value cards.
- Automated checks - `scripts/run_smoke_tests.py` provisions model artifacts if needed and validates key endpoints with FastAPI's TestClient.
- Prefect pipeline - `src/cardvalue_ml/pipelines/sample_flow.py` orchestrates ingestion, SQLite persistence, training, and artifact logging; `scripts/run_pipeline.py` provides a CLI wrapper.
- Airflow DAG - `airflow_dags/cardvalue_pipeline.py` mirrors the Prefect flow for scheduler parity.
- Evidently drift reports - `scripts/drift_report.py` (and `cardvalue_ml.monitoring.drift`) compare reference vs. current datasets, placing HTML output under `artifacts/monitoring/`.
- Docker & AWS - `docker/docker-compose.yml` runs inference containers locally; `infra/ecs/` contains task/service definitions for ECS proof-of-concept deployments.
- Validation checklist - `docs/local_validation_checklist.md` and `docs/pipeline_monitoring.md` expand on operational readiness.
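Evidently produces rich HTML reports, but the core reference-vs-current comparison can be sketched with a population stability index (PSI), a standard drift metric. This is only an illustration of the concept; it is not necessarily what `cardvalue_ml.monitoring.drift` computes.

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between two samples; values above ~0.2 are commonly read as drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid log(0) on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 15, 2000)  # e.g. training-time sale prices
stable = rng.normal(100, 15, 2000)     # same distribution: low PSI
shifted = rng.normal(130, 15, 2000)    # market moved upward: high PSI

print(f"stable PSI:  {population_stability_index(reference, stable):.3f}")
print(f"shifted PSI: {population_stability_index(reference, shifted):.3f}")
```

Binning against the reference edges keeps the comparison anchored to the training-time distribution, so a sustained price shift shows up as mass draining out of the reference bins.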
- Launch the MkDocs site locally with `mkdocs serve` to browse the curated documentation.
- Key references: `docs/architecture.md` - system design overview; `docs/mlops_strategy.md` - infra and cost strategy; `docs/data_sources.md` and `docs/dataset_summary.md` - data sourcing details; plus `docs/feature_store.md`, `docs/mlflow_tracking.md`, `docs/mlflow_ui_demo.md`, `docs/recruiter_summary.md`, `docs/pitch_template.md`, `docs/presentation_outline.md`, `docs/prefect_agent.md`, `docs/aws_deployment.md`, and `docs/role_alignment.md`.
- Presentation-ready summaries live under `docs/` and `private_docs/` to help tailor interviews.
- Network-required scripts (`fetch_real_sales_data_v2`, `fetch_game_stats`, `fetch_multisport_stats`) use public APIs or pages; fall back to the synthetic generators if connectivity is limited.
- `scripts/llm_enrich_metadata.py` demonstrates GPT-powered card descriptions and requires `OPENAI_API_KEY`.
- Redis-backed feature stores (run `python3 scripts/build_feature_store.py --backend redis`) and Docker-based Redis (`docker run -p 6379:6379 redis`) are optional but supported.
- Pre-generate artifacts with `python3 scripts/bootstrap_sample_artifacts.py` if you need ready-to-demo models.
- Need deep-learning extras (Transformers, PyTorch) for future experiments? Install with `pip install .[full]`.
- Run `make format`, `make lint`, and `make test` before opening a PR; CI (`.github/workflows/ci.yml`) enforces the same gates.
- Use `make smoke` or targeted `pytest tests/test_*` runs to validate changes to pipelines or APIs.
- Capture larger ideas in `docs/spec_alignment.md` or extend Make targets as needed; the repo is structured to be fork-friendly for interview loops.
- Optional but recommended: point Git to the bundled hooks with `git config core.hooksPath .githooks` so the `pre-push` hook runs `make lint` automatically before every push.