This project demonstrates comprehensive data curation practices on Chicago 311 Service Request data, following the USGS Data Lifecycle Model. The workflow includes data collection, cleaning, de-identification using K-anonymity, analysis, and complete provenance documentation.
This README is not the final report. The final report PDF can be viewed in our Coursera submission.
Course: CS 598 - Foundations of Data Curation
Institution: University of Illinois at Urbana-Champaign
Team: Murali Natarajan, Ramitha Kotarkonda, Matthew Guan
Original Dataset: Chicago 311 Service Requests
Source: City of Chicago Open Data Portal
URL: https://data.cityofchicago.org/Service-Requests/311-Service-Requests/v6vf-nfxy/data_preview
License: Public Domain
Access Date: December 3, 2025
CS598_FDC_Spring2025_Project/
├── Curated Dataset/
│ ├── 1_Raw/ # Raw sampled data (199,999 records)
│ ├── 2_Cleaned/ # Cleaned data (194,104 records)
│ └── 3_Deidentified/ # K-anonymized data (193,841 records)
│
├── Data Cleaning/ # Processing scripts
│ ├── config.py # Configuration parameters
│ ├── cleanRawData.py # Data cleaning script
│ └── deIdentification.py # K-anonymity implementation
│
├── Data Analysis Jupyter Notebooks/ # Analysis notebooks
│ ├── segmentation_model.ipynb # K-Means clustering
│ └── CS_598_Project_Feature_Importance (2).ipynb
│
├── Metadata/ # Documentation
│ ├── metadata.json # DataCite metadata
│ ├── data_dictionary.csv # Field descriptions
│ └── Codebook.md # Codebook
│
├── Data Models and Abstractions/ # Data models
│ ├── schema.json # JSON schema for the curated dataset
│ └── ontology.jsonld # Ontology
│
├── Docs/ # Project reports
│
├── Provenance/ # Generated provenance (after running scripts)
│ ├── chicago_311_provenance.json # W3C PROV data
│ ├── chicago_311_provenance.png # Provenance graph
│ └── provenance_summary.md # Human-readable summary
│
├── Workflow/ # Generated workflow (after running scripts)
│ ├── workflow_detailed.png # Workflow diagram
│ └── workflow_documentation.md # Workflow guide
│
├── generate_provenance.py # Provenance generator
├── workflow_diagram.py # Workflow diagram generator
├── validate_provenance.py # Provenance validator
├── run_all_steps.sh # Complete automation script
├── requirements.txt # Python dependencies
└── README.md # This file
City of Chicago Open Data Portal
↓
[1] Data Collection & Sampling (random, n=199,999, seed=42)
↓
Raw Dataset (199,999 records, 39 columns)
↓
[2] Data Cleaning (cleanRawData.py)
• Remove duplicates: -5,690
• Drop unlocatable: -205
• Standardize fields
• Feature engineering: RESOLUTION_TIME_HOURS
↓
Cleaned Dataset (194,104 records, 36 columns)
↓
[3] De-identification (deIdentification.py)
• K-Anonymity (k=5)
• Generalize ZIP codes (3-digit)
• Round coordinates (3 decimals)
• Drop 8 identifier columns
• Suppress 263 records
↓
K-Anonymized Dataset (193,841 records, 28 columns)
↓
[4] Analysis
• Segmentation (K-Means Clustering)
• Feature Importance (Random Forest)
↓
Analysis Results (Clusters + Feature Rankings)
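Step [1] above draws a reproducible random sample with a fixed seed. A minimal sketch of that idea with pandas (the frame and column name here are invented stand-ins for the full 311 extract):

```python
import pandas as pd

# Toy stand-in for the full 311 extract; SR_NUMBER is illustrative only.
full = pd.DataFrame({"SR_NUMBER": range(1_000)})

# random_state=42 matches the documented seed, so the same rows are
# drawn on every run of the collection step.
sample = full.sample(n=200, random_state=42)
```

Fixing `random_state` is what makes the 199,999-record sample reproducible end to end.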
- Python 3.11
- macOS, Linux, or Windows with WSL
Run everything with one command:

```bash
./run_all_steps.sh
```

This script will:
- Install all dependencies
- Run data cleaning
- Run de-identification
- Run analysis notebooks
- Generate provenance documentation
- Generate workflow diagrams
1. Install Dependencies

   ```bash
   pip install -r requirements.txt
   ```

2. Run Data Cleaning

   ```bash
   cd "Data Cleaning"
   python cleanRawData.py
   ```

3. Run De-identification

   ```bash
   python deIdentification.py
   cd ..
   ```

4. Run Analysis

   ```bash
   cd "Data Analysis Jupyter Notebooks"
   jupyter lab
   # Open and run both notebooks
   cd ..
   ```

5. Generate Documentation

   ```bash
   python generate_provenance.py
   python workflow_diagram.py
   ```

| Stage | Records | Columns | Change |
|---|---|---|---|
| Raw | 199,999 | 39 | Baseline |
| Cleaned | 194,104 | 36 | -5,895 (-2.9%) |
| K-Anonymized | 193,841 | 28 | -263 (-0.14%) |
| Total Retention | 193,841 | 28 | 96.9% |
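The record counts in the table above are internally consistent, which a few lines of arithmetic can confirm:

```python
# Sanity check of the record counts reported in the retention table.
raw, cleaned, anonymized = 199_999, 194_104, 193_841

drop_cleaning = raw - cleaned        # 5,690 duplicates + 205 unlocatable
drop_kanon = cleaned - anonymized    # suppressed by k-anonymity
retention = anonymized / raw         # overall retention rate
```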
- Duplicates removed: 5,690 (2.8%)
- Unlocatable records removed: 205 (0.1%)
- All geographic fields standardized
- New analytical feature created: RESOLUTION_TIME_HOURS
- K-anonymity enforced: k=5
- Quasi-identifiers generalized: 4 fields
- Direct identifiers removed: 8 columns
- All records belong to groups of ≥5
1. Duplicate Removal
- Identified via `DUPLICATE` flag
- Removed: 5,690 records
2. Unlocatable Records
- Missing both address and coordinates
- Removed: 205 records
3. Standardization
- `CITY`: Title case, fill missing with 'Chicago'
- `STATE`: Expand to 'Illinois', fill missing
- `ZIP_CODE`: Convert to string, 'NA' for missing
- `CREATED_DEPARTMENT`: Fill missing with 'Unknown'
4. Feature Engineering
- Created `RESOLUTION_TIME_HOURS` = (CLOSED_DATE - CREATED_DATE) in hours
- Split `CREATED_DATE` into date and time components
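The four cleaning steps above can be sketched in pandas on a toy frame. The column names follow the 311 schema, but the rows here are invented; the real logic lives in `cleanRawData.py`:

```python
import pandas as pd

# Invented sample rows mimicking the raw 311 schema.
df = pd.DataFrame({
    "DUPLICATE": [False, True, False],
    "STREET_ADDRESS": ["1 Main St", None, None],
    "LATITUDE": [41.88, None, None],
    "CITY": ["CHICAGO", None, "chicago"],
    "STATE": [None, "IL", "IL"],
    "CREATED_DATE": pd.to_datetime(
        ["2025-01-01 00:00", "2025-01-01 01:00", "2025-01-02 00:00"]),
    "CLOSED_DATE": pd.to_datetime(
        ["2025-01-01 12:00", None, "2025-01-03 00:00"]),
})

df = df[~df["DUPLICATE"]]                                   # 1. drop flagged duplicates
df = df[df["STREET_ADDRESS"].notna()                        # 2. drop rows with neither
        | df["LATITUDE"].notna()].copy()                    #    address nor coordinates
df["CITY"] = df["CITY"].fillna("Chicago").str.title()       # 3. standardize fields
df["STATE"] = df["STATE"].replace("IL", "Illinois").fillna("Illinois")
df["RESOLUTION_TIME_HOURS"] = (                             # 4. engineered feature
    df["CLOSED_DATE"] - df["CREATED_DATE"]
).dt.total_seconds() / 3600
```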
1. ZIP Code Generalization
- Method: Truncate to 3 digits
- Example: "60601" → "606"
- Preserves regional patterns
2. Coordinate Rounding
- Method: Round to 3 decimal places (~100m precision)
- Example: 41.881832 → 41.882
- Prevents exact location identification
3. Identifier Removal
- Removed: STREET_ADDRESS, STREET_NUMBER, STREET_NAME, STREET_DIRECTION, STREET_TYPE, LOCATION
- Also removed: X_COORDINATE, Y_COORDINATE (redundant)
4. K-Anonymity Enforcement
- Threshold: k=5
- QIAs: COMMUNITY_AREA, WARD, POLICE_DISTRICT, ZIP_CODE
- Suppressed 263 records in groups < 5
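The de-identification steps above can be sketched as follows, assuming the listed column names and k=5; the values are invented and the actual implementation is `deIdentification.py`:

```python
import pandas as pd

# Invented rows: five records share one quasi-identifier group,
# one record is in a group of size 1 and must be suppressed.
df = pd.DataFrame({
    "STREET_ADDRESS": ["1 Main St"] * 6,
    "ZIP_CODE": ["60601"] * 5 + ["60614"],
    "LATITUDE": [41.881832] * 6,
    "COMMUNITY_AREA": [32] * 5 + [7],
    "WARD": [42] * 5 + [43],
    "POLICE_DISTRICT": [1] * 5 + [18],
})

df["ZIP_CODE"] = df["ZIP_CODE"].str[:3]          # generalize ZIPs to 3 digits
df["LATITUDE"] = df["LATITUDE"].round(3)         # ~100 m precision
df = df.drop(columns=["STREET_ADDRESS"])         # remove direct identifiers

K = 5
QIA = ["COMMUNITY_AREA", "WARD", "POLICE_DISTRICT", "ZIP_CODE"]
group_size = df.groupby(QIA)["LATITUDE"].transform("size")
df = df[group_size >= K]                         # suppress groups smaller than k
```

After the filter, every remaining row shares its quasi-identifier combination with at least k-1 others, which is the k-anonymity guarantee stated above.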
- Notebook: `segmentation_model.ipynb`
- Purpose: Identify natural groupings in service requests
- Method: K-Means unsupervised learning
- Input: K-anonymized dataset
- Output: Cluster assignments and characteristics
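A minimal sketch of the segmentation approach with scikit-learn; the features here are synthetic stand-ins (the notebook derives its features from the k-anonymized dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two synthetic, well-separated groups standing in for request features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Scale features before clustering, then assign each record to a cluster.
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(Xs)
labels = km.labels_
```

Scaling matters because K-Means uses Euclidean distance; unscaled features with large ranges would dominate the clustering.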
- Notebook: `CS_598_Project_Feature_Importance (2).ipynb`
- Purpose: Predict resolution time and identify key factors
- Method: Random Forest regression
- Target Variable: RESOLUTION_TIME_HOURS
- Output: Feature importance rankings and model performance
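A minimal sketch of the feature-importance approach; the data is synthetic (only the first feature drives the target, standing in for RESOLUTION_TIME_HOURS), while the notebook fits on the real dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: only feature f0 influences the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 10 * X[:, 0] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance (importances sum to 1).
ranking = sorted(zip(["f0", "f1", "f2"], rf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
```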
After running the scripts, you'll have:
Provenance Files (Provenance/ directory):
- `chicago_311_provenance.json` - W3C PROV-compliant provenance data
- `chicago_311_provenance.png` - Visual provenance graph
- `provenance_summary.md` - Human-readable summary
Workflow Files (Workflow/ directory):
- `workflow_detailed.png` - Detailed workflow diagram
- `workflow_documentation.md` - Complete workflow guide
Provenance captures:
- 15+ entities (datasets, scripts, notebooks, reports)
- 6 activities (collection, cleaning, de-identification, 2 analyses, documentation)
- 7 agents (team members, City of Chicago, UIUC, Python software)
- 50+ relationships (complete data lineage)
- Timestamps for all activities
- Agent attributions
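To illustrate the entity/activity/agent structure, here is a hand-built PROV-JSON fragment mirroring one link of the lineage; the identifiers are illustrative, not the actual IDs emitted by `generate_provenance.py` (which uses the `prov` package):

```python
import json

# Minimal PROV-JSON fragment: one cleaning activity linking the raw
# dataset to the cleaned one, attributed to the project team.
prov_doc = {
    "prefix": {"ex": "http://example.org/chicago311#"},
    "entity": {
        "ex:raw_dataset": {"prov:label": "Raw 311 sample (199,999 records)"},
        "ex:cleaned_dataset": {"prov:label": "Cleaned dataset (194,104 records)"},
    },
    "activity": {"ex:cleaning": {"prov:label": "Data cleaning (cleanRawData.py)"}},
    "agent": {"ex:team": {"prov:label": "CS 598 project team"}},
    "used": {"_:u1": {"prov:activity": "ex:cleaning",
                      "prov:entity": "ex:raw_dataset"}},
    "wasGeneratedBy": {"_:g1": {"prov:entity": "ex:cleaned_dataset",
                                "prov:activity": "ex:cleaning"}},
    "wasAssociatedWith": {"_:a1": {"prov:activity": "ex:cleaning",
                                   "prov:agent": "ex:team"}},
}

serialized = json.dumps(prov_doc, indent=2)
```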
Workflow captures:
- All individual processing operations
- Data flow between operations
- Transformation impacts (records affected)
- Analysis methods and results
- Complete reproduction instructions
Hardware:
- Standard laptop/workstation
- Minimum 8GB RAM
- 1GB free disk space
Software:
- Operating System: Windows 10 / macOS / Linux
- Python: 3.11
- Key packages (see requirements.txt):
- pandas 2.1.1
- numpy 1.26.2
- matplotlib 3.8.2
- seaborn 0.12.2
- jupyterlab 4.2.1
- scikit-learn (for analysis)
- prov 2.0.0 (for provenance)
| Team Member | Responsibilities |
|---|---|
| Murali Natarajan | Data collection, cleaning, de-identification, provenance |
| Ramitha Kotarkonda | Documentation, segmentation analysis, reproducibility |
| Matthew Guan | Documentation, feature importance modeling, metadata creation, data dictionary, ontology, JSON schema |
- Provenance follows W3C PROV standard
- Machine-readable JSON format
- Interoperable with provenance tools
- Validatable structure
- Metadata follows DataCite schema
- Includes creators, contributors, dates, rights
- Prepared for dataset publication
- Threshold: k=5 (a commonly used value in de-identification practice)
- Method: Generalization + suppression
- QIAs: COMMUNITY_AREA, WARD, POLICE_DISTRICT, ZIP_CODE
- Guarantee: All records belong to groups of ≥5
If you use this dataset or methodology, please cite:

```bibtex
@dataset{chicago311_2025,
  author       = {Natarajan, Murali and Kotarkonda, Ramitha and Guan, Matthew},
  title        = {Chicago 311 Service Request Dataset (Curated Sample, K-Anonymized)},
  year         = {2025},
  publisher    = {University of Illinois at Urbana-Champaign},
  howpublished = {CS 598 - Foundations of Data Curation Project},
  url          = {https://data.cityofchicago.org/Service-Requests/311-Service-Requests/v6vf-nfxy/data_preview},
  note         = {Derived from City of Chicago Open Data Portal}
}
```

Data disclaimer from the City of Chicago:

> "This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one's own risk."
- W3C PROV: https://www.w3.org/TR/prov-overview/
- DataCite Schema: https://schema.datacite.org/
- City of Chicago Data Portal: https://data.cityofchicago.org/
- Data Dictionary: `Metadata/data_dictionary.csv`
- Metadata: `Metadata/metadata.json`
- Cleaning Report: `Curated Dataset/2_Cleaned/cleaning_summary_report.md`
- De-identification Report: `Curated Dataset/3_Deidentified/deidentification_summary_report.md`
Original Data: Public Domain (City of Chicago)
This Project: Educational use (CS 598 course project)
Code & Scripts: Available for academic use
Last Updated: December 9, 2025
Version: 1.1
Status: Complete