
git-forensics

A TypeScript library for providing insights from git commit history.

Features

  • Actionable insights
  • Fast - ~700ms to analyze 100,000 commits (retrieving the git log itself is the slow part)
  • Follows file renames and removals
  • Optimized for CI
  • Percentile-based classification — self-calibrating thresholds that work across any codebase size
  • Composite risk scoring — weighted multi-metric risk scores per file
  • Integrated (very basic) code-complexity engine
  • Bring your own code complexity score
  • Add custom metrics using full temporal history

Motivation

Existing git analysis tools (code-maat, git-of-theseus, Hercules, etc.) are great for reports but feel heavy as a backend for dev-tools. This library is designed to be lightweight, fast, and embeddable.

Tip: Focus on recent history (6-9 months). While the library handles renames and long histories correctly, older data tends to add noise.

Installation

npm install git-forensics

Quick Start

import { simpleGit } from 'simple-git';
import { computeForensics } from 'git-forensics';

const git = simpleGit('/path/to/repo');
const forensics = await computeForensics(git);

forensics.hotspots; // Files changed most often
forensics.churn; // Code volatility (lines added/deleted)
forensics.coupledPairs; // Hidden dependencies
forensics.couplingRankings; // Architectural hubs
forensics.codeAge; // Stale code detection
forensics.ownership; // Knowledge silos
forensics.communication; // Developer coordination needs
forensics.topContributors; // Per-file contributor breakdown

Example Output

Running computeForensics on a repository returns structured data across all metrics:

{
  "analyzedCommits": 842,
  "dateRange": { "from": "2024-03-10", "to": "2025-01-15" },
  "metadata": { "totalFilesAnalyzed": 134, "totalAuthors": 12 },

  "hotspots": [
    { "file": "src/api/routes.ts", "revisions": 87, "exists": true },
    { "file": "src/core/engine.ts", "revisions": 64, "exists": true },
  ],

  "coupledPairs": [
    {
      "file1": "src/api/routes.ts",
      "file2": "src/api/middleware.ts",
      "couplingPercent": 82,
      "coChanges": 34,
    },
  ],

  "ownership": [
    {
      "file": "src/core/engine.ts",
      "mainDev": "alice",
      "ownershipPercent": 34,
      "fractalValue": 0.18,
      "authorCount": 7,
    },
  ],

  // ... plus churn, codeAge, couplingRankings, communication, topContributors
}

Passing the result to generateInsights produces actionable alerts:

[
  {
    "file": "src/core/engine.ts",
    "type": "hotspot",
    "severity": "critical",
    "data": {
      "type": "hotspot",
      "revisions": 64,
      "rank": 2,
      "percentile": 95,
    },
    "fragments": {
      "title": "Hotspot",
      "finding": "64 revisions (P95), ranked #2 in repository",
      "risk": "Top-ranked churn file — prioritize for refactoring or test hardening",
      "suggestion": "Consider breaking into smaller modules or adding test coverage",
    },
  },
  {
    "file": "src/core/engine.ts",
    "type": "ownership-risk",
    "severity": "critical",
    "data": {
      "type": "ownership-risk",
      "fractalValue": 0.18,
      "authorCount": 7,
      "mainDev": "alice",
      "percentile": 92,
    },
    "fragments": {
      "title": "Fragmented Ownership",
      "finding": "7 contributors, fragmentation score 0.18 (P92)",
      "risk": "Diffuse ownership slows review cycles and increases merge conflicts",
      "suggestion": "Request review from alice (primary contributor)",
    },
  },
  // ... insights generated for each metric that exceeds thresholds
]

Actionable Insights

generateInsights transforms metrics into alerts with severity (warning, critical) and human-readable fragments (title, finding, risk, suggestion).

Insights use percentile-based thresholds — a file is flagged based on where it ranks relative to other files in the same repository. This makes thresholds self-calibrating across codebases of any size.

Insight thresholds

| Question | Metric | Insight triggers when |
| --- | --- | --- |
| Where's the riskiest code? | hotspots | Revisions in P75+ (warning) or P90+ (critical) |
| What keeps getting rewritten? | churn | Churn in P75+ or P90+ |
| What hidden dependencies exist? | coupledPairs | ≥70% co-change rate (absolute, not percentile) |
| What has ripple effects? | couplingRankings | Coupling score in P75+ or P90+ |
| What's been forgotten? | codeAge | Age in P75+ or P90+ |
| Who owns what? Any knowledge silos? | ownership | ≥3 authors, fragmentation in P75+ or P90+ |
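To make the percentile mechanics concrete, here is a minimal sketch (not the library's internal code) of how a cutoff value could be derived from a repository's own distribution and used to flag files:

```typescript
// Hypothetical sketch: derive the value at a given percentile
// (nearest-rank method) and flag entries at or above it.
// Not the library's actual internals.
function percentileCutoff(values: number[], percentile: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(
    sorted.length - 1,
    Math.ceil((percentile / 100) * sorted.length) - 1
  );
  return sorted[Math.max(0, index)];
}

const revisions = [3, 5, 8, 12, 20, 25, 40, 64, 87, 90];
const warning = percentileCutoff(revisions, 75); // 64 — files at/above are "warning"
const critical = percentileCutoff(revisions, 90); // 87 — files at/above are "critical"
```

Because the cutoffs come from the repository's own data, a small project and a monorepo both end up flagging their relative outliers.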

All thresholds are overridable — pass a partial thresholds object and only the values you specify will change:

const insights = generateInsights(forensics, {
  thresholds: {
    hotspot: { warning: 80, critical: 95 }, // percentile cutoffs
    churn: { warning: 80 },
    staleCode: { warning: 60, critical: 85 },
    coupling: { minPercent: 80 }, // stays absolute — not percentile-based
    ownershipRisk: { warning: 70, critical: 90, minAuthors: 4 },
    couplingScore: { warning: 80, critical: 95 },
  },
});

Analysis options

The analysis pipeline has its own configurable thresholds that control what data is collected:

const forensics = await computeForensics(git, {
  maxFilesPerCommit: 50, // skip large commits from coupling analysis (default: 50)
  minCoChanges: 3, // minimum co-changes to report a coupled pair (default: 3)
  minCouplingPercent: 30, // minimum coupling % to report a pair (default: 30)
  minSharedEntities: 2, // minimum shared files for communication pairs (default: 2)
});

These options are also available on computeForensicsFromData().

Build your own insights

forensics.stats contains the complete temporal history: every commit, by every author, for every file. Access stats.fileStats[file].byAuthor, authorContributions, nameHistory, etc. to build custom metrics like temporal histograms, expertise scores, or handoff detection.
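As an illustration, a custom "bus factor" style metric could be derived from per-author commit counts. The byAuthor shape below (author name mapped to commit count) is assumed for the sketch; consult the library's exported types for the real structure of forensics.stats:

```typescript
// Hypothetical sketch of a custom metric built on per-author commit
// counts. The ByAuthor shape is assumed for illustration only.
type ByAuthor = Record<string, number>; // author -> commit count (assumed shape)

// Smallest number of authors whose commits cover >= 50% of a file's history.
function busFactor(byAuthor: ByAuthor): number {
  const counts = Object.values(byAuthor).sort((a, b) => b - a);
  const total = counts.reduce((sum, c) => sum + c, 0);
  let covered = 0;
  let authors = 0;
  for (const c of counts) {
    covered += c;
    authors += 1;
    if (covered >= total / 2) break;
  }
  return authors;
}

busFactor({ alice: 30, bob: 10, carol: 5 }); // 1 — alice alone covers half
```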

Composite Risk Score

computeRiskScores produces a single 0-100 risk score per file by combining percentile ranks across all metrics with configurable weights:

import { computeRiskScores } from 'git-forensics';

const scores = computeRiskScores(forensics);
// [
//   { file: 'src/core/engine.ts', riskScore: 87.5, breakdown: { revisions: 22.5, churn: 25, ownershipRisk: 18, age: 12, couplingScore: 10 } },
//   { file: 'src/api/routes.ts', riskScore: 72.0, breakdown: { ... } },
//   ...
// ]

Default weights:

| Metric | Weight |
| --- | --- |
| Revisions | 0.25 |
| Churn | 0.25 |
| Ownership Risk | 0.20 |
| Age | 0.15 |
| Coupling Score | 0.15 |
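The composite is presumably the weighted sum of per-metric percentile ranks. A sketch of that arithmetic, using the default weights above and percentile values invented purely for illustration:

```typescript
// Hypothetical sketch of the composite arithmetic: riskScore is assumed
// to be sum(percentile * weight) across metrics. Percentile values here
// are made up for illustration.
const weights = {
  revisions: 0.25,
  churn: 0.25,
  ownershipRisk: 0.2,
  age: 0.15,
  couplingScore: 0.15,
};

const percentiles = {
  revisions: 90,
  churn: 100,
  ownershipRisk: 90,
  age: 80,
  couplingScore: 40,
};

const riskScore = (Object.keys(weights) as (keyof typeof weights)[]).reduce(
  (sum, metric) => sum + percentiles[metric] * weights[metric],
  0
);
// 90*0.25 + 100*0.25 + 90*0.2 + 80*0.15 + 40*0.15 = 83.5
```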

Override weights to match your priorities:

const scores = computeRiskScores(forensics, {
  revisions: 0.4,
  churn: 0.3,
  ownershipRisk: 0.1,
  age: 0.1,
  couplingScore: 0.1,
});

File Metrics with Percentiles

extractFileMetrics flattens forensics into per-file rows for storage. Pass includePercentiles: true to enrich each row with percentile ranks and a composite risk score:

import { extractFileMetrics } from 'git-forensics';

const metrics = extractFileMetrics(forensics, { includePercentiles: true });
// Each entry includes:
// {
//   file, revisions, ageMonths, churn, fractalValue, ...
//   percentiles: { revisions: 90, churn: 75, ownershipRisk: 85, ageMonths: 60, couplingScore: 40 },
//   riskScore: 72.5,
// }

Percentile Utilities

The underlying percentile functions are exported for building custom scoring:

import {
  percentileRank,
  createPercentileRanker,
  createInvertedPercentileRanker,
} from 'git-forensics';

// One-off calculation
percentileRank(50, [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]); // 45

// Reusable ranker for repeated lookups
const rank = createPercentileRanker([10, 20, 30, 40, 50]);
rank(30); // 50
rank(50); // 90

// Inverted ranker (lower values = higher percentile)
const riskRank = createInvertedPercentileRanker([0.1, 0.3, 0.5, 0.7, 0.9]);
riskRank(0.1); // 90 (lowest value = highest risk)

Complexity Analysis

git-forensics separates commit analysis from static code analysis. For convenience it ships optional complexity helpers (based on indent-complexity), but for best results use language-aware complexity scoring and pass its results to computeForensics.
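As an illustration of what indentation-based complexity measures, here is a sketch of the general approach (not indent-complexity's actual implementation): total indentation depth across non-blank lines serves as a cheap, language-agnostic proxy for nesting.

```typescript
// Sketch of indentation-based complexity scoring. Deeper nesting adds
// more to the score. Not the actual indent-complexity code.
function indentComplexity(source: string, indentWidth = 2): number {
  return source
    .split('\n')
    .filter((line) => line.trim().length > 0) // ignore blank lines
    .reduce((total, line) => {
      const leading = line.length - line.trimStart().length;
      return total + Math.floor(leading / indentWidth);
    }, 0);
}

indentComplexity('if (x) {\n  if (y) {\n    doThing();\n  }\n}'); // 0+1+2+1+0 = 4
```

The appeal is that it needs no parser, but it cannot distinguish, say, a long switch statement from genuinely tangled logic, which is why a language-aware scorer is preferable when available.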

CI Usage

Building a report

Loop over insights and build a PR comment or CI annotation:

const insights = generateInsights(forensics, { minSeverity: 'warning' });

for (const insight of insights) {
  const prefix = insight.severity === 'critical' ? '[CRITICAL]' : '[WARNING]';
  console.log(`${prefix} ${insight.file} - ${insight.fragments.title}`);
  console.log(`  ${insight.fragments.finding}`);
  console.log(`  ${insight.fragments.suggestion}\n`);
}

Optimization: Store & Reuse (large codebases)

For very large repos, store the computeForensics result between runs and rehydrate with generateInsights — no git scan needed:

import { generateInsights, getChangedFiles } from 'git-forensics';

// Fetch pre-computed forensics from your server/cache
const forensics = await fetch('https://your-server/api/forensics?repo=org/repo').then((r) =>
  r.json()
);

// Generate insights only for PR changed files
const changedFiles = await getChangedFiles(git, 'origin/main');
const insights = generateInsights(forensics, { files: changedFiles, minSeverity: 'warning' });

Data-Driven API

For environments without direct git access use computeForensicsFromData() with pre-fetched git data:

import { computeForensicsFromData, gitLogDataSchema, validateGitLogData } from 'git-forensics';

// Data must match the following format
const data = {
  log: {
    all: [
      {
        hash: 'abc123',
        date: '2025-01-15T10:00:00Z',
        author_name: 'Alice',
        message: 'Add feature',
        diff: {
          files: [
            { file: 'src/app.ts', insertions: 50, deletions: 10 },
            { file: 'src/utils.ts', insertions: 20, deletions: 5 },
          ],
        },
      },
      // ... more commits
    ],
  },
  trackedFiles: 'src/app.ts\nsrc/utils.ts\nsrc/index.ts', // from git ls-files
};

// Print JSON-schema if needed
console.log(gitLogDataSchema); // JSON Schema object

// Validate before processing
validateGitLogData(data); // throws if invalid

const forensics = computeForensicsFromData(data);

Migration from v1.x

v2.0.0 replaces absolute thresholds with percentile-based classification. Key changes:

  • InsightThresholds values are now percentile cutoffs (0-100), not raw metric values
  • InsightData variants (except coupling) include a percentile field
  • Stale-code severity changed from info/warning to warning/critical
  • Finding strings now include (Pxx) percentile annotations
  • Generator function signatures added a percentileRank parameter (affects direct generator importers)
  • New exports: computeRiskScores, DEFAULT_RISK_WEIGHTS, percentileRank, createPercentileRanker, createInvertedPercentileRanker
  • New types: PercentileThresholds, RiskWeights, FileRiskScore, ExtractFileMetricsOptions

Attribution

Based on concepts from Adam Tornhill's Your Code as a Crime Scene and Software Design X-Rays.

License

MIT
