CardValueML is a local-first machine learning and MLOps showcase that mirrors the workflows of a trading-card pricing team. The repository ships with realistic sample data, reproducible pipelines, and production-style tooling so you can demonstrate ingestion, modelling, deployment, and monitoring end to end without a cloud account.
- Data ingestion & validation - scrapers and generators (`scripts/fetch_real_sales_data_v2.py`, `scripts/generate_realistic_dataset.py`) create 2K+ card sales rows, validated with Great Expectations in `cardvalue_ml.data.validate` before landing in `data/processed/sample_sales.csv`.
- Quality-controlled storage - SQLite helpers in `cardvalue_ml.data.database` and Make targets (`make ingest`, `make sqlite`) persist cleaned data for downstream workloads; feature stores (`scripts/build_feature_store.py`) support SQLite, DuckDB, and Redis.
- Feature engineering - `cardvalue_ml.features.build_features` standardises sale dates, one-hot encodes card metadata, and exposes helpers for rolling stats so models stay tabular-friendly.
- Modelling & experiments - `cardvalue_ml.models.train` trains a Random Forest baseline, `models/benchmark.py` compares XGBoost/CatBoost, `models/backtest.py` runs rolling evaluations, and `models/tracking.py` logs runs to MLflow.
- Explainability & risk - SHAP workflows (`scripts/generate_shap.py`, `cardvalue_ml.models.explain`) and ensemble variance utilities in `cardvalue_ml.models.risk` provide underwriting-friendly transparency for Streamlit and the API.
- Serving & user interfaces - FastAPI (`src/cardvalue_ml/api/app.py`) exposes `/predict`, `/feature-insights`, `/latest-sales`, `/metrics`, and `/health`; Streamlit (`apps/streamlit_app.py`) offers an Alt-branded pricing console fed by the same artifacts.
- Pipelines & ops - Prefect flow (`src/cardvalue_ml/pipelines/sample_flow.py`), Airflow DAG (`airflow_dags/cardvalue_pipeline.py`), smoke tests, drift monitoring (`cardvalue_ml.monitoring.drift`), and Docker Compose manifests cover orchestration and deployment scenarios.
- Documentation library - MkDocs site (`mkdocs.yml`) stitches together architecture notes, MLOps strategy, monitoring guides, recruiter collateral, and more inside `docs/`.
```text
CardValueML/
|-- airflow_dags/     # Airflow DAGs that call Makefile tasks
|-- apps/             # Streamlit valuation interface
|-- artifacts/        # Model metrics, SHAP plots, drift reports (generated)
|-- data/             # Processed datasets (generated)
|-- docker/           # Dockerfile, docker-compose.yaml, supporting scripts
|-- docs/             # Architecture, ops strategy, collateral for interviews
|-- infra/            # AWS ECS task + service boilerplate
|-- models/           # Trained model artifacts (generated)
|-- notebooks/        # EDA and experiment notebooks
|-- raw_data/         # Raw exports and scraped data (generated)
|-- scripts/          # CLI utilities for ingestion, modelling, ops
|-- src/cardvalue_ml/ # Python package with data, features, models, API
`-- tests/            # Unit, integration, and smoke-style tests
```
- Use Python 3.10+ (see `pyproject.toml`) and create a virtual environment (`python3 -m venv .venv && source .venv/bin/activate`).
- Install dependencies with `make dev-install` (installs `-e .[dev]` and pre-commit hooks). Alternative: `python3 -m pip install -r requirements.txt`.
- Optional extras:
  - Redis or DuckDB if you want to exercise non-SQLite feature stores.
  - `OPENAI_API_KEY` for `scripts/llm_enrich_metadata.py`.
  - Docker/Compose for container demos (`docker/`).
- Run `pre-commit install` if you skipped `make dev-install`.
- Generate data
  - Preferred: `python3 scripts/fetch_real_sales_data_v2.py` (fetches public sales and backfills to ~2,000 rows).
  - Offline fallback: `make generate-data` (deterministic synthetic set in `raw_data/synthetic_sales.csv`).
- Ingest & validate - `make ingest` cleans the raw CSV, runs Great Expectations checks, and writes `data/processed/sample_sales.csv`. Run `make sqlite` to persist into `data/cardvalue_ml.db`.
- Train the baseline - `make train` fits the Random Forest, saving `models/random_forest.joblib` plus metrics/feature importances under `artifacts/`.
- Explainability - `make shap` writes SHAP plots (`artifacts/explainability/shap_summary.png`) and a mean-absolute-values JSON.
- Serve predictions - `uvicorn cardvalue_ml.api.app:app --reload` exposes the FastAPI service; hit `http://localhost:8000/docs`.
- Front-end workflow - `make streamlit` launches `apps/streamlit_app.py`, reading the metrics, model, and SHAP artifacts generated above.
- Pipeline run - `python3 scripts/run_pipeline.py` (or `make pipeline`) executes the Prefect flow for ingestion -> training.
- Smoke tests - `make smoke` builds artifacts if missing and exercises `/predict`, `/feature-insights`, and `/latest-sales`.
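Once the service is up, you can exercise `/predict` directly from Python. Note the payload fields below (`card_name`, `grade`, `sale_date`) are illustrative guesses, not the documented schema; check `http://localhost:8000/docs` for the real request model.

```python
import json
import urllib.request
import urllib.error

# Hypothetical payload -- consult the OpenAPI docs at /docs for the real schema.
payload = {"card_name": "2003 Topps LeBron James", "grade": 9.5, "sale_date": "2024-01-15"}

request = urllib.request.Request(
    "http://localhost:8000/predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(request, timeout=5) as response:
        # The endpoint returns a price prediction plus a 95% interval.
        print(json.loads(response.read()))
except (urllib.error.URLError, OSError):
    print("API not reachable -- start it with: uvicorn cardvalue_ml.api.app:app --reload")
```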
- `make dev-install` / `make install` - install dependencies (dev adds linting/test tooling).
- `make format`, `make lint`, `make test` - run Ruff+Black, lint checks, and `pytest` (see `tests/` for data, feature, API, pipeline, and monitoring coverage).
- `make ingest`, `make sqlite`, `make generate-data` - manage raw/processed datasets and SQLite persistence.
- `make train`, `make benchmark`, `make backtest`, `make experiment` - train the baseline, compare with XGBoost/CatBoost, run rolling evaluations, and execute YAML-driven experiments (`experiments/configs`).
- `make shap`, `make profile`, `make feature-store`, `make drift` - produce explainability artifacts and profiling reports, populate the SQLite-backed feature store, and generate Evidently drift dashboards. (Use `python3 scripts/build_feature_store.py --backend duckdb|redis` for non-SQLite backends.)
- `make pipeline`, `make smoke`, `make streamlit` - orchestration helpers and interactive demos.
- `make docker-build`, `make docker-up`, `make docker-down` - containerise the API and seed artifacts for deployment trials.

Many targets are thin wrappers around `scripts/*.py`; inspect those scripts for CLI flags (e.g., `--dataset`, `--backend`, `--days`).
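As a rough idea of what those wrappers look like, here is a hypothetical reconstruction of such a CLI using `argparse`. The flag names come from this README, but the defaults, choices, and help text are assumptions, not the repo's actual interface.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the kind of CLI the scripts/*.py wrappers expose (defaults assumed)."""
    parser = argparse.ArgumentParser(description="Example CardValueML script interface")
    parser.add_argument("--dataset", default="data/processed/sample_sales.csv",
                        help="Path to the input dataset")
    parser.add_argument("--backend", choices=["sqlite", "duckdb", "redis"],
                        default="sqlite", help="Feature-store backend")
    parser.add_argument("--days", type=int, default=30,
                        help="Lookback window in days")
    return parser

# Parse a sample invocation instead of sys.argv so the snippet is self-contained.
args = build_parser().parse_args(["--backend", "duckdb", "--days", "7"])
print(args.dataset, args.backend, args.days)
```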
- Great Expectations validation - `cardvalue_ml.data.validate` enforces schema checks before storing any sale; violations halt ingestion.
- Cleaning & storage - `cardvalue_ml.data.process.clean_sales_dataframe` standardises numeric columns and datetimes; `cardvalue_ml.data.database` writes to SQLite for both API responses and downstream batch processing.
- Feature engineering - `cardvalue_ml.features.build_features` converts sale dates to ordinals and one-hot encodes card/player metadata. Utilities such as `add_rolling_stat` are available for extending time-aware signals.
- Feature store options - populate SQLite/DuckDB/Redis stores with `scripts/build_feature_store.py`; see `docs/feature_store.md` for usage patterns.
- Streaming & enrichment demos - `scripts/stream_simulator.py`, `scripts/fetch_game_stats.py`, `scripts/fetch_multisport_stats.py`, and `scripts/fetch_alternative_assets.py` illustrate how live signals or adjacent asset classes feed into the same schema.
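To make the rolling-stat idea concrete, here is a minimal sketch in the spirit of `add_rolling_stat`. The helper name comes from the list above, but the signature and behaviour shown here are assumptions, not the package's actual implementation.

```python
import pandas as pd

def add_rolling_stat(df: pd.DataFrame, group_col: str, value_col: str,
                     window: int = 3) -> pd.DataFrame:
    """Add a per-group rolling mean over time-ordered sales (illustrative signature)."""
    df = df.sort_values("sale_date").copy()
    df[f"{value_col}_roll_mean_{window}"] = (
        df.groupby(group_col)[value_col]
        .transform(lambda s: s.rolling(window, min_periods=1).mean())
    )
    return df

# Tiny example: two cards with interleaved sale dates.
sales = pd.DataFrame({
    "card_id": ["a", "a", "a", "b", "b"],
    "sale_date": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02", "2024-01-06"]),
    "price": [100.0, 120.0, 140.0, 50.0, 70.0],
})
out = add_rolling_stat(sales, "card_id", "price", window=2)
print(out[["card_id", "price", "price_roll_mean_2"]])
```

Keeping the rolling computation inside `groupby(...).transform` ensures each card's history never leaks into another card's features.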
- Random Forest baseline - `cardvalue_ml.models.train.train_random_forest` produces `metrics.json`, `feature_importances.json`, and `feature_columns.json` for downstream apps.
- Benchmarking - `cardvalue_ml.models.benchmark` evaluates Random Forest, XGBoost, and CatBoost on identical feature matrices; results can be logged to MLflow via `cardvalue_ml.models.tracking.log_benchmark_results`.
- Rolling backtests - `cardvalue_ml.models.backtest.rolling_backtest` simulates underwriting behaviour over sliding windows (`make backtest`).
- Experiments - define configs in `experiments/configs/*.yaml` and run them through `scripts/run_experiment.py`; metrics land in `artifacts/experiments/`.
- MLflow integration - `cardvalue_ml.models.tracking` initialises a local MLflow instance in `artifacts/mlruns` for experiment review.
- Explainability & risk - SHAP plots via `cardvalue_ml.models.explain` and ensemble uncertainty via `cardvalue_ml.models.risk` feed Streamlit as well as the `/feature-insights` endpoint.
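A common way to turn ensemble variance into an uncertainty estimate is to look at the spread of per-estimator predictions; the sketch below illustrates that idea with hard-coded numbers. Whether `cardvalue_ml.models.risk` computes intervals exactly this way is an assumption.

```python
import numpy as np

# Illustrative only: per-estimator predictions for a single card, as you might
# collect via [tree.predict(X) for tree in model.estimators_] on a fitted forest.
tree_predictions = np.array([980.0, 1010.0, 995.0, 1040.0, 975.0, 1000.0])

point_estimate = tree_predictions.mean()
spread = tree_predictions.std(ddof=1)
# Simple empirical 95% interval from the ensemble's prediction distribution.
lower, upper = np.percentile(tree_predictions, [2.5, 97.5])

print(f"price={point_estimate:.2f} interval=({lower:.2f}, {upper:.2f}) std={spread:.2f}")
```

A wide interval relative to the point estimate is exactly the kind of signal an underwriting workflow would flag before extending a loan against a card.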
- FastAPI service - `src/cardvalue_ml/api/app.py` loads artifacts on startup and exposes:
  - `POST /predict` - price prediction + 95% interval,
  - `POST /feature-insights` - per-feature SHAP contributions and risk stats,
  - `GET /latest-sales` - recent sales from SQLite or fallback CSV,
  - `GET /metrics`, `/feature-importances`, and `/health`.
- Streamlit dashboard - `apps/streamlit_app.py` reads model artifacts, displays feature importances, and calculates loan-to-value scenarios for high-value cards.
- Automated checks - `scripts/run_smoke_tests.py` provisions model artifacts if needed and validates key endpoints with FastAPI's TestClient.
- Prefect pipeline - `src/cardvalue_ml/pipelines/sample_flow.py` orchestrates ingestion, SQLite persistence, training, and artifact logging; `scripts/run_pipeline.py` provides a CLI wrapper.
- Airflow DAG - `airflow_dags/cardvalue_pipeline.py` mirrors the Prefect flow for scheduler parity.
- Evidently drift reports - `scripts/drift_report.py` (and `cardvalue_ml.monitoring.drift`) compare reference vs. current datasets, placing HTML output under `artifacts/monitoring/`.
- Docker & AWS - `docker/docker-compose.yml` runs inference containers locally; `infra/ecs/` contains task/service definitions for ECS proof-of-concept deployments.
- Validation checklist - `docs/local_validation_checklist.md` and `docs/pipeline_monitoring.md` expand on operational readiness.
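Evidently produces rich HTML reports, but the core reference-vs-current comparison can be sketched with a population stability index (PSI), a standard drift metric. This is only an illustration of the concept; it is not necessarily what `cardvalue_ml.monitoring.drift` computes.

```python
import numpy as np

def population_stability_index(reference, current, bins: int = 10) -> float:
    """PSI between two samples; values above ~0.2 are commonly read as drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor the proportions to avoid log(0) on empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(100, 15, 2000)  # e.g. training-time sale prices
stable = rng.normal(100, 15, 2000)     # same distribution: low PSI
shifted = rng.normal(130, 15, 2000)    # market moved upward: high PSI

print(f"stable PSI:  {population_stability_index(reference, stable):.3f}")
print(f"shifted PSI: {population_stability_index(reference, shifted):.3f}")
```

Binning against the reference edges keeps the comparison anchored to the training-time distribution, so a sustained price shift shows up as mass draining out of the reference bins.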
- Launch the MkDocs site locally with `mkdocs serve` to browse the curated documentation.
- Key references: `docs/architecture.md` - system design overview; `docs/mlops_strategy.md` - infra and cost strategy; `docs/data_sources.md` and `docs/dataset_summary.md` - data sourcing details; plus `docs/feature_store.md`, `docs/mlflow_tracking.md`, `docs/mlflow_ui_demo.md`, `docs/recruiter_summary.md`, `docs/pitch_template.md`, `docs/presentation_outline.md`, `docs/prefect_agent.md`, `docs/aws_deployment.md`, and `docs/role_alignment.md`.
- Presentation-ready summaries live under `docs/` and `private_docs/` to help tailor interviews.
- Network-required scripts (`fetch_real_sales_data_v2`, `fetch_game_stats`, `fetch_multisport_stats`) use public APIs or pages; fall back to the synthetic generators if connectivity is limited.
- `scripts/llm_enrich_metadata.py` demonstrates GPT-powered card descriptions and requires `OPENAI_API_KEY`.
- Redis-backed feature stores (run `python3 scripts/build_feature_store.py --backend redis`) and Docker-based Redis (`docker run -p 6379:6379 redis`) are optional but supported.
- Pre-generate artifacts with `python3 scripts/bootstrap_sample_artifacts.py` if you need ready-to-demo models.
- Need deep-learning extras (Transformers, PyTorch) for future experiments? Install with `pip install .[full]`.
- Run `make format`, `make lint`, and `make test` before opening a PR; CI (`.github/workflows/ci.yml`) enforces the same gates.
- Use `make smoke` or targeted `pytest tests/test_*` runs to validate changes to pipelines or APIs.
- Capture larger ideas in `docs/spec_alignment.md` or extend Make targets as needed; the repo is structured to be fork-friendly for interview loops.
- Optional but recommended: point Git to the bundled hooks with `git config core.hooksPath .githooks` so the `pre-push` hook runs `make lint` automatically before every push.