Aveloxis : Augur, but 40,000 repositories, fully collected. In Three days. In Go. Hundreds of issues: Gone.
Find us at https://github.com/aveloxis/aveloxis
Read about the way Claude Code was used to create Aveloxis
Copyright © 2026 University of Missouri, Sean Goggins, and Derek Howard. This software is possible through the support of The Sloan Foundation. Learn more in the Detailed LCF.
Note: RHEL/CentOS installations are based on internet searches, as we do not have access to a machine with those OS's
- Go 1.23+ (install)
- PostgreSQL 14+ (local, Docker, or remote)
- git (for the facade/commit collection phase)
- GitHub and/or GitLab API tokens (personal access tokens with repo/read scope)
- Python 3.10+ and libmagic (optional, for ScanCode license/copyright scanning — installed automatically by
aveloxis install-tools)- macOS:
brew install libmagic - Debian/Ubuntu:
sudo apt-get install libmagic1 - RHEL/CentOS:
sudo yum install file-libs
- macOS:
- pipx (You may need to install pipx for scancode's installation to run)
- macOS:
brew install pipx - Debian/Ubuntu:
sudo apt install pipx - RHEL/CentOS:
sudo yum install pipx
- macOS:
- python-setuptools (Necessary for scancode)
- macOS:
brew install python-setuptools - Debian/Ubuntu:
sudo apt install python3-setuptools - RHEL/CentOS:
sudo dnf install python3-setuptools
- macOS:
See Development Mode Notes if you're running locally to evaluate
Option 1: Install to your PATH (recommended — lets you run aveloxis from anywhere).
Optionally install analysis tools for code complexity scanning:
git clone https://github.com/aveloxis/aveloxis.git
cd aveloxis
go mod tidy
go install ./cmd/aveloxis
# Verify it works (binary is now in $GOPATH/bin or $HOME/go/bin):
aveloxis version
# Install optional analysis tools (scc for code complexity):
aveloxis install-toolsIf
aveloxis: command not found, add Go's bin directory to your PATH:export PATH="$PATH:$(go env GOPATH)/bin"
Option 2: Build locally (binary stays in the repo directory):
git clone https://github.com/aveloxis/aveloxis.git
cd aveloxis
go mod tidy
go build -o bin/aveloxis ./cmd/aveloxis
# Must use the explicit path — the binary is NOT on your PATH:
./bin/aveloxis versionAll examples below use aveloxis (assumes Option 1). If you used Option 2, replace aveloxis with ./bin/aveloxis everywhere.
Aveloxis needs a PostgreSQL database. You can use an existing Augur database (Aveloxis creates its own aveloxis_data and aveloxis_ops schemas and does not touch Augur's schemas) or a fresh one:
Option A: Use an existing Augur database — just point aveloxis.json at the same host/port/dbname. Aveloxis creates its own schemas and does not touch Augur's.
Create an aveloxis.json file by copying aveloxis.example.json and placing your credentials in that file.
Option B: Create a fresh database (run in psql as a superuser):
CREATE DATABASE aveloxis;
CREATE USER aveloxis WITH ENCRYPTED PASSWORD 'password';
GRANT ALL PRIVILEGES ON DATABASE aveloxis TO aveloxis;
ALTER DATABASE aveloxis OWNER TO aveloxis;Option C: Docker (one command, no psql needed):
docker run -d --name aveloxis-db -p 5432:5432 \
-e POSTGRES_DB=aveloxis \
-e POSTGRES_USER=aveloxis \
-e POSTGRES_PASSWORD=aveloxis \
postgres:16Then run migrations:
aveloxis migrateThis creates 108 tables (84 in aveloxis_data, 24 in aveloxis_ops) with full parity to Augur's schema. All DDL uses CREATE ... IF NOT EXISTS and ON CONFLICT DO NOTHING, so migrate is safe to run repeatedly.
You will need a github OAUTH application for login to work on the web view. And there's nothing available without login. You can also use GitLab's OAUTH, or configure both.
Example Values for GitHub OAuth:
- Homepage URL: https://aveloxis.io (this one is not important)
- Authorization Callback URL: http://localhost:8082/auth/github/callback (This one is important, and if you are running locally this is exactly what you need)
Example Values for GitLab OAuth:
- Callback URL: http://localhost:8082/auth/gitlab/callback
Put those into your aveloxis.json file as described in the [Configuration Section](# configuration)
If you are running on bare metal, at this point you are ready to go!
# Start all three processes in the background (logs to ~/.aveloxis/*.log)
aveloxis start all
# Or start individually
aveloxis start serve # → ~/.aveloxis/aveloxis.log
aveloxis start web # → ~/.aveloxis/web.log
aveloxis start api # → ~/.aveloxis/api.logThen you can open the interfaces:
open http://localhost:8082 # Web GUI (login, visualizations, comparison)
open http://localhost:8082/monitor # Collection monitor dashboard (requires login)
open http://localhost:8383/api/v1/health # REST APIAveloxis runs in containers via Docker Compose or Podman Compose. All instructions below work with either — substitute podman for docker if you use Podman.
This file is required. It's mounted into all containers as the config file. You must add at least one API key — without keys, the scheduler (serve) will refuse to start.
# An example of the file already exists in the repo — copy, then edit it directly:
cp aveloxis.docker.example.json aveloxis.docker.json
vim aveloxis.docker.jsonMinimum changes needed:
GitHub API token: Go to github.com/settings/tokens and create a Personal Access Token with repo (or public_repo) scope. Add it to github.api_keys.
GitHub OAuth app (for web GUI login): Go to github.com/settings/developers → New OAuth App:
- Homepage URL:
http://localhost:8082 - Authorization callback URL:
http://localhost:/auth/github/callback
Copy the Client ID and Client Secret into github_client_id and github_client_secret.
dev_mode: true is needed because the containers run over plain HTTP. Without it, session cookies are marked Secure and browsers won't send them over HTTP, causing login to fail silently.
# Docker
docker compose up -d --build
# Podman
podman compose up -d --buildThis starts 5 containers:
| Container | Purpose | Port |
|---|---|---|
postgres |
PostgreSQL 16 database | 5432 |
migrate |
Runs schema migrations, then exits | — |
serve |
Collection scheduler | — |
web |
Web GUI (OAuth login, visualizations, monitor) | 8082 |
api |
REST API (stats, charts, SBOMs) | 8383 |
http://localhost:8082 # Web GUI (login with GitHub, create groups, add repos)
http://localhost:8082/monitor # Collection monitor (queue status, repo progress)
http://localhost:8383/api/v1/health # REST API health check
# Via CLI (run inside a container)
docker compose exec serve aveloxis add-repo https://github.com/chaoss/augur
# Via the web GUI
# Log in at http://localhost:8082, create a group, and add repos through the browser# View logs
docker compose logs -f serve # Follow scheduler logs
docker compose logs -f web # Follow web GUI logs
docker compose logs migrate # Check migration output
# Stop (data is preserved in volumes)
docker compose down
# Stop AND delete all data (database + clones — destructive!)
docker compose down -v
# Restart with different worker count
AVELOXIS_WORKERS=40 docker compose up -d
# Set a custom database password
AVELOXIS_DB_PASSWORD=secret docker compose up -d| Volume | Contents | Survives down? |
Destroyed by down -v? |
|---|---|---|---|
aveloxis-pgdata |
PostgreSQL database (all collected data) | Yes | Yes |
aveloxis-repos |
Bare git clones for facade/analysis | Yes | Yes |
# Docker
docker build -t aveloxis .
# Podman
podman build -t aveloxis .
# Then use the local image
docker compose up -d --buildserve exits immediately:
Check docker compose logs serve. The most common cause is missing API keys — you'll see "no API keys configured for any platform". Add at least one key to aveloxis.docker.json and restart.
Web GUI login fails (no error, just redirects back):
Set "dev_mode": true in the "web" section of aveloxis.docker.json. Without this, session cookies require HTTPS.
migrate fails:
Check docker compose logs migrate. If it says a relation doesn't exist, you may need to rebuild: docker compose down -v && docker compose up -d --build.
Monitor shows no repos:
Add repos via docker compose exec serve aveloxis add-repo <url> or through the web GUI.
If you already have Augur running with repos and API keys in its database, you can be collecting in four commands:
# 1. Point Aveloxis at your existing Augur database.
# Create a minimal config with just the database connection:
cat > aveloxis.json <<'EOF'
{
"database": {
"host": "localhost",
"port": 5432,
"user": "augur",
"password": "your-augur-db-password",
"dbname": "augur",
"sslmode": "prefer"
}
}
EOF
# 2. Create the aveloxis_data and aveloxis_ops schemas in your Augur database.
# This does NOT touch augur_data or augur_operations.
aveloxis migrate
# 3. Copy your API keys from augur_operations.worker_oauth into aveloxis_ops.worker_oauth.
aveloxis add-key --from-augur
# 4. Import your repos from augur_data.repo.
# Each URL is verified against the forge via HTTP HEAD — dead repos are skipped.
aveloxis add-repo --from-augur
# 5. Start collecting. Open http://localhost:8082/monitor to watch progress.
aveloxis start allAfter step 3, your keys live in aveloxis_ops.worker_oauth and are loaded automatically — no --augur-keys flag needed going forward. After step 4, all your verified Augur repos are in the Aveloxis queue and will be collected on the scheduler's priority order.
# 1. Create a config file
cp aveloxis.example.json aveloxis.json
# Edit aveloxis.json with your database credentials and API tokens
#
# IMPORTANT for local development: set "dev_mode": true in the "web" section
# so session cookies work over plain HTTP (without HTTPS).
# See the "Development Mode" note below.
# 2. Create the database schemas and tables
aveloxis migrate
# 3. Store your API keys in the database
aveloxis add-key ghp_your_github_token --platform github
aveloxis add-key glpat-your_gitlab_token --platform gitlab
# 4. Add repos to the collection queue (CLI method)
aveloxis add-repo https://github.com/chaoss/augur https://gitlab.com/fdroid/fdroidclient
# -- OR use the web GUI to add repos and orgs via browser --
# Configure OAuth credentials in aveloxis.json (see Configuration),
# then run: aveloxis web
# Open http://localhost:8082, log in with GitHub/GitLab, create a group,
# and add repos or orgs through the UI.
# 5. Start the scheduler
aveloxis start serve
# Open http://localhost:8082/monitor to watch collection progressCreate aveloxis.json (or copy from aveloxis.example.json):
{
"database": {
"host": "localhost",
"port": 5432,
"user": "augur",
"password": "your-password",
"dbname": "augur",
"sslmode": "prefer"
},
"github": {
"api_keys": ["ghp_your_token_here"],
"base_url": "https://api.github.com"
},
"gitlab": {
"api_keys": ["glpat-your_token_here"],
"base_url": "https://gitlab.com/api/v4",
"gitlab_hosts": ["gitlab.freedesktop.org"]
},
"collection": {
"batch_size": 1000,
"days_until_recollect": 1,
"workers": 4,
"repo_clone_dir": "/data/aveloxis-repos"
},
"web": {
"addr": ":8082",
"base_url": "http://localhost:8082",
"session_secret": "change-me-to-a-random-string",
"dev_mode": false,
"github_client_id": "your-github-oauth-app-client-id",
"github_client_secret": "your-github-oauth-app-client-secret",
"gitlab_client_id": "your-gitlab-oauth-app-client-id",
"gitlab_client_secret": "your-gitlab-oauth-app-client-secret",
"gitlab_base_url": "https://gitlab.com"
},
"log_level": "info"
}| Field | Description | Default |
|---|---|---|
database.* |
PostgreSQL connection parameters | localhost:5432 |
github.api_keys |
GitHub personal access tokens (multiple for rotation) | [] |
github.base_url |
GitHub API URL (change for GHE) | https://api.github.com |
gitlab.api_keys |
GitLab personal access tokens | [] |
gitlab.base_url |
GitLab API URL | https://gitlab.com/api/v4 |
gitlab.gitlab_hosts |
Additional hostnames to recognize as GitLab instances | [] |
collection.batch_size |
Staging flush batch size | 1000 |
collection.days_until_recollect |
Days before a repo is re-collected | 1 |
collection.workers |
Concurrent collection workers (for serve) |
12 |
collection.repo_clone_dir |
Directory for bare git clones used by the facade phase. Can grow to terabytes for large instances. | $HOME/aveloxis-repos |
collection.matview_rebuild_day |
Day of week to rebuild materialized views: "monday" through "sunday", or "disabled" |
"saturday" |
collection.matview_rebuild_on_startup |
Whether to refresh materialized views during startup migration. Slow on large DBs. | false |
web.addr |
Listen address for the web GUI | ":8082" |
web.base_url |
External URL for OAuth callback redirects | "http://localhost:8082" |
web.session_secret |
Secret key for signing session cookies | (required for aveloxis web) |
web.dev_mode |
Set true for local HTTP development. Disables Secure flag on cookies so they work without HTTPS. Do not enable in production. HttpOnly is always set regardless. |
false |
web.github_client_id |
GitHub OAuth app client ID | "" |
web.github_client_secret |
GitHub OAuth app client secret | "" |
web.gitlab_client_id |
GitLab OAuth app client ID | "" |
web.gitlab_client_secret |
GitLab OAuth app client secret | "" |
web.gitlab_base_url |
GitLab instance URL (for self-hosted GitLab) | "https://gitlab.com" |
log_level |
Log verbosity: debug, info, warn, error |
info |
If you are developing locally over HTTP (the typical case), you must set
"dev_mode": truein the"web"section ofaveloxis.json. Without this, session cookies are markedSecureand your browser will refuse to send them over plainhttp://localhost, making login fail silently."web": { "dev_mode": true, ... }Do not enable
dev_modein production. In production, run behind a TLS-terminating reverse proxy (nginx, Caddy) and leavedev_modeat its default (false) so cookies are only sent over HTTPS.HttpOnlyis always enabled regardless of this setting.
Keys are loaded from three sources, merged together:
aveloxis_ops.worker_oauth— always checked. Store keys viaaveloxis add-key.augur_operations.worker_oauth— checked when--augur-keysflag is set.aveloxis.json— lowest priority, for standalone deployments without a database pre-populated with keys.
Startup validation: aveloxis serve and aveloxis collect require at least one API key (GitHub or GitLab) to start. If no keys are found in any source, the process exits with a clear error message. If keys are found for only one platform, a warning is logged but the process continues (repos for the unconfigured platform will not be collected). If key loading fails (e.g., database connection error), the error is logged at ERROR level.
Key rotation: All keys are rotated via round-robin so every key's rate limit is fully utilized. When a key's remaining requests drop to the buffer threshold (default: 15), it's skipped until its rate-limit window resets. With N tokens at 5000 req/hr each, total throughput is N * ~4985 req/hr. For example, 74 tokens give ~368K lookups/hour. Keys that return 401 (bad credentials) are permanently invalidated.
Starts the long-running scheduler that continuously collects repos from the queue. Uses the staged collection pipeline (see Architecture). The monitor dashboard is integrated into aveloxis web at /monitor.
aveloxis serve [flags]
Flags:
--workers int Concurrent collection workers (default 1)
--augur-keys Load API keys from Augur's worker_oauth tableThe scheduler uses a Postgres-backed priority queue (aveloxis_ops.collection_queue). Jobs are claimed atomically with SELECT ... FOR UPDATE SKIP LOCKED, so multiple Aveloxis instances can share the same queue for horizontal scaling. No Redis, no RabbitMQ, no Celery.
Restart/resume: Aveloxis is safe to stop and restart at any time. On shutdown (Ctrl-C / SIGTERM / pkill aveloxis), it waits for active API calls to finish, then releases all queue locks so repos go back to queued immediately. On startup, it automatically:
- Processes any leftover staged data from the interrupted run into relational tables (so you don't lose what was already fetched from the API)
- Releases any stale locks from a previous instance
- Repos that were mid-collection resume from the beginning of their current collection cycle, but data already in the relational tables is upserted (duplicates are harmless)
Starts the web GUI for group management with OAuth login. Users can log in via GitHub or GitLab, create groups, and add repositories or entire organizations to those groups for collection.
aveloxis webNo flags -- all configuration comes from the web section of aveloxis.json (see Configuration below). Default listen address is :8082.
Requires OAuth app credentials in aveloxis.json. Create a GitHub OAuth app at https://github.com/settings/developers or a GitLab OAuth app at https://gitlab.com/-/profile/applications. Set the callback URL to {web.base_url}/auth/github/callback or {web.base_url}/auth/gitlab/callback respectively.
For ad-hoc collection of specific repos without the scheduler. Uses the direct collection pipeline (bypasses staging, writes directly to relational tables). Best for testing or collecting a handful of repos.
# Incremental (only data since last collection window)
aveloxis collect https://github.com/chaoss/augur
# Full historical collection
aveloxis collect --full https://github.com/chaoss/augur
# Multiple repos, mixed platforms
aveloxis collect \
https://github.com/torvalds/linux \
https://gitlab.com/fdroid/fdroidclient
Flags:
--full Full historical collection (ignore recollect window)
--augur-keys Load API keys from Augur's worker_oauth table# Add at default priority (100)
aveloxis add-repo https://github.com/chaoss/augur
# Add at high priority (lower number = collected sooner)
aveloxis add-repo --priority 10 https://gitlab.com/gitlab-org/gitlab
# Add multiple repos at once
aveloxis add-repo \
https://github.com/torvalds/linux \
https://github.com/chaoss/grimoirelab \
https://gitlab.com/fdroid/fdroidclient
# Import all repos from an existing Augur installation (verifies each URL is alive)
aveloxis add-repo --from-augurPlatform is auto-detected from the URL. GitLab nested subgroups are supported:
https://gitlab.com/group/subgroup/project -> owner="group/subgroup", repo="project"
Self-hosted GitLab instances are recognized if the hostname contains "gitlab" or is listed in gitlab_hosts in the config.
# Store a GitHub token
aveloxis add-key ghp_your_github_token --platform github
# Store a GitLab token
aveloxis add-key glpat-your_gitlab_token --platform gitlab
# Bulk import all keys from Augur (duplicates are skipped)
aveloxis add-key --from-auguraveloxis prioritize https://github.com/chaoss/augurSets priority to 0 and due time to now. The scheduler will collect this repo next.
Also available via the monitor dashboard's "Boost" button at /monitor, or by clicking "Boost" next to any queued repo.
aveloxis recollect https://github.com/chaoss/augur
aveloxis recollect https://github.com/a/b https://github.com/c/d # batchSets force_full_collect on the named repos' queue rows. On each repo's next scheduler cycle the collector ignores last_collected and re-collects from the beginning of time; the flag clears itself on successful completion. Use this after a bugfix that invalidates collected data, or when you want to make sure a specific repo is fully refreshed.
The scheduler also sets this flag automatically when a collection ends with a GraphQL PR batch error class (stream CANCEL, validation timeout, retry exhaustion) so that next pass backfills whatever the failed batch missed. See docs/guide/troubleshooting.md for details.
Combine with aveloxis prioritize <url> if you want the re-collection to start immediately rather than on the repo's normal cycle.
aveloxis migrateCreates 108 tables and 22 materialized views across three PostgreSQL schemas:
aveloxis_data(84 tables + 22 materialized views) — all collected data plus analytics viewsaveloxis_ops(24 tables) — operational tables: collection queue, JSONB staging store, collection status, API credentials, users/auth, config, worker stateaveloxis_augur_data(6 views) — Augur compatibility layer for 8Knot. Contains views that alias Aveloxis column names to Augur conventions (e.g.,star_count→stars_count,pr_number→pr_src_number). Only tables with column name differences have views here; identical tables resolve via search_path fallback toaveloxis_data.
Safe to run repeatedly. Does not touch Augur schemas if sharing a database. Also creates 22 materialized views for 8Knot/analytics compatibility and runs a data cleanup pass that fixes any garbage timestamps (e.g., year 0001 BC from uninitialized fields) by setting them to NULL.
8Knot integration: Set AUGUR_SCHEMA=aveloxis_augur_data,aveloxis_data (no space after comma) in 8Knot's .env. The two-schema search path resolves Augur-named columns from aveloxis_augur_data first, then falls through to aveloxis_data for tables with identical schemas. For existing Augur databases, use AUGUR_SCHEMA=augur_data as before — the compatibility schema is not needed.
aveloxis refresh-viewsManually refreshes all 22 materialized views used by 8Knot and other analytics tools. Uses REFRESH MATERIALIZED VIEW CONCURRENTLY where unique indexes exist (doesn't block reads). Views are also rebuilt automatically on a configurable schedule by aveloxis serve (default: Saturday; set collection.matview_rebuild_day in aveloxis.json to change, or "disabled" to turn off).
aveloxis install-toolsInstalls all optional third-party tools used by Aveloxis. Each tool is independently optional — if not installed, its analysis phase is silently skipped.
| Tool | Install command | Purpose |
|---|---|---|
| scc | go install github.com/boyter/scc/v3@latest |
Code complexity analysis — populates repo_labor with lines of code, comments, blanks, and complexity per file per language |
| scorecard | go install github.com/ossf/scorecard/v5/cmd/scorecard@latest |
OpenSSF Scorecard — evaluates security practices (Code-Review, Maintained, Vulnerabilities, etc.) and populates repo_deps_scorecard |
| scancode | pip3 install --user scancode-toolkit-mini |
Per-file license and copyright detection — populates aveloxis_scan.scancode_file_results with SPDX license expressions, copyrights, holders, and package data. Runs every 30 days per repo. Requires Python 3.10+ and libmagic (brew install libmagic on macOS, apt-get install libmagic1 on Debian/Ubuntu). |
Tools that are already installed are skipped. The command verifies each tool is on PATH after installation.
Automatic updates: On scheduler startup, Aveloxis checks if it has been more than 30 days since the last tool update. If so, it re-runs go install ...@latest for each installed tool to pull the latest version. Only tools already on PATH are updated — missing tools are not auto-installed. The check timestamp is stored at ~/.aveloxis-tool-check.
aveloxis start serve # scheduler + monitor → ~/.aveloxis/aveloxis.log
aveloxis start web # web GUI → ~/.aveloxis/web.log
aveloxis start api # REST API → ~/.aveloxis/api.log
aveloxis start all # all three at onceLaunches the specified component(s) as detached background processes. Output is appended to log files in ~/.aveloxis/. PID files are written to ~/.aveloxis/aveloxis-{serve,web,api}.pid for reliable process tracking. If a component is already running, the command reports it and skips the launch.
aveloxis stop serve # stop only the scheduler
aveloxis stop web # stop only the web GUI
aveloxis stop api # stop only the REST API
aveloxis stop all # stop all three
aveloxis stop # (no args) same as 'all'Sends SIGTERM to the specified component(s) using PID files in ~/.aveloxis/. Active workers finish their current API call, queue locks are released, and staging data is preserved. PID files are cleaned up automatically. Stale PID files (process no longer running) are detected and removed.
Generates a CycloneDX 1.5 or SPDX 2.3 SBOM from the dependency data collected for a repository. The repo must have been collected with dependency/libyear analysis (runs automatically during aveloxis serve).
# Generate CycloneDX JSON to stdout
aveloxis sbom 42
# Generate SPDX JSON
aveloxis sbom 42 --format spdx
# Write to file
aveloxis sbom 42 -o sbom.json
# Store in database (repo_sbom_scans table)
aveloxis sbom 42 --store
# Both file and database
aveloxis sbom 42 -o sbom.json --store
Flags:
--format string Output format: cyclonedx or spdx (default "cyclonedx")
-o, --output Write to file instead of stdout
--store Also store the SBOM in repo_sbom_scans tableWhat's in the SBOM:
| Format | Contents |
|---|---|
| CycloneDX 1.5 | bomFormat, specVersion, tool metadata (aveloxis), root component with evidence.licenses (concluded from ScanCode source analysis) and evidence.copyright (detected holders), all dependencies as library components with purl, version, license, scope (required/optional) |
| SPDX 2.3 | CC0-1.0 data license, root package with licenseConcluded from ScanCode source analysis (vs. licenseDeclared from registry), copyrightText from detected holders, all dependencies as packages with purl external refs, DEPENDS_ON relationships |
License capture from 12 package registries:
| Registry | Ecosystem | License source |
|---|---|---|
| npm | JavaScript | license field from npm view JSON |
| PyPI | Python | info.license from pypi.org API |
| crates.io | Rust | license field from crates.io version data |
| RubyGems | Ruby | licenses array from rubygems.org API |
| Go proxy | Go | Not available from proxy (would need pkg.go.dev) |
| Maven Central | Java/Scala | timestamp from search API |
| Packagist | PHP | license array from repo.packagist.org API |
| Hex.pm | Elixir | licenses from hex.pm API |
| NuGet | .NET | licenseExpression from nuget.org registration API |
| pub.dev | Dart/Flutter | From pub.dev API |
| Hackage | Haskell | Upload time from hackage.haskell.org |
| GitHub Releases | Swift (SwiftPM) | Via GitHub releases API (no central registry) |
Package URLs (purl) are generated for each dependency following the purl spec: pkg:npm/express@4.18.0, pkg:pypi/flask@2.3.0, pkg:golang/github.com/spf13/cobra@1.8.1, pkg:cargo/serde@1.0, pkg:gem/rails@7.0, pkg:maven/junit/junit@4.13, pkg:composer/laravel/framework@10.0, pkg:hex/phoenix@1.7.0, pkg:nuget/Newtonsoft.Json@13.0.3, pkg:pub/http@0.13.6, pkg:hackage/aeson@2.0.
The monitor dashboard is integrated into the web GUI at /monitor (requires login). It shows:
- Queue statistics (total, queued, collecting)
- Every repo with: status, priority, due time, last run duration
- Gathered vs Metadata columns: Gathered Issues, Meta Issues, Gathered PRs, Meta PRs, Gathered Commits, Meta Commits — so you can see collection completeness at a glance
- A Boost button to push any queued repo to the top
- Pagination at 200 repos per page
- Auto-refreshes every 10 seconds
Navigate to the monitor from any page via the "Monitor" link in the top nav bar.
Separate process (default 127.0.0.1:8383). Start alongside serve and web.
| Endpoint | Method | Description |
|---|---|---|
/api/v1/repos/{repoID}/stats |
GET | Gathered vs metadata PR/issue/commit counts for one repo |
/api/v1/repos/stats?ids=1,2,3 |
GET | Batch stats for multiple repos |
/api/v1/repos/{repoID}/sbom?format=cyclonedx|spdx |
GET | Download SBOM as JSON |
/api/v1/health |
GET | Health check with version |
The web GUI (aveloxis web) provides a browser-based interface for managing repository groups with OAuth authentication.
- OAuth login flow: Users authenticate via GitHub or GitLab OAuth apps. The login redirects to the provider's authorization page, then back to the callback URL with an auth code that is exchanged for an access token. The token is used to fetch the user's profile (login, email, avatar).
- Group management: Authenticated users create named groups and add repos or entire GitHub orgs / GitLab groups. Repos are automatically queued for collection.
- Bulk repo paste: The add-repo form accepts a textarea with line-delimited URLs — paste a list and they're all added at once.
- Gathered vs metadata stats: Each repo shows Gathered Issues, Meta Issues, Gathered PRs, Meta PRs, Gathered Commits, Meta Commits — collection completeness at a glance.
- SBOM download: CDX and SPDX download buttons per repo, generating SBOMs on-the-fly. Authenticated users only.
- Breadcrumb navigation: Home / Group Name hierarchy. 25-per-page pagination with 5-page sliding window and case-insensitive search.
- Org tracking: When a user adds an org, a scheduler task scans it every 4 hours, discovers new repos, and queues them automatically.
- URL validation: All URLs are validated before adding. GitHub and GitLab URLs require owner/repo path. Other URLs are accepted as git-only repos.
- Session management: Sessions are in-memory with 24-hour expiry. Restarting
aveloxis webclears all sessions.
The web GUI runs as a separate process from aveloxis serve — they share the same database but do not need to run on the same host. Note: aveloxis web does NOT run migrations. Use aveloxis migrate or aveloxis serve for that.
The web GUI includes built-in interactive visualizations powered by Chart.js, loaded from CDN with no build step. Requires the REST API (aveloxis api) to be running alongside aveloxis web.
Repository detail page — clicking a repo name in a group opens /groups/{gid}/repos/{rid} with:
- 4 weekly time-series charts: Commits/week, PRs Opened/week, PRs Merged/week, Issues/week (last 2 years by default)
- Summary stat cards: Issues, PRs, Commits, Vulnerabilities (critical count highlighted)
- Dependency license table: All licenses in the project's dependency tree with counts and OSI compliance indicators (checkmark for OSI-approved licenses)
- Source code license table: Per-file license detections from ScanCode with SPDX expressions, file counts, OSI compliance, and copyright holders list
- SBOM download buttons: CycloneDX 1.5 and SPDX 2.3
Comparison page (/compare) — accessible from the dashboard home page:
- Search any repo in the database via autocomplete
- Select up to 5 repos — each shown as a color-coded tag
- Three comparison modes:
- Raw Counts — actual weekly values, best for repos of similar size
- 100% — each repo normalized so its peak week = 100%, best for comparing trends regardless of size
- Z-Score — values as standard deviations from the mean, best for comparing trends while explicitly controlling for community size differences
- 4 overlaid charts with mode toggle, one per metric
- URL-shareable:
/compare?repos=1,2,3pre-populates the selection
The design follows the GHData/CHAOSS visualization principles: temporal context for all data, cross-project comparison with normalization, and rapid iteration.
Aveloxis can collect data from any git-hosted repository, not just GitHub and GitLab. When a user enters a URL that doesn't match github.com or gitlab.com (e.g., https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git), it is accepted as a git-only repository.
What's collected for git-only repos:
- Git commits (bare clone +
git log --all --numstat) — full commit history with per-file stats - Commit messages and parent relationships
- Dependencies (from manifest files in the checkout)
- Libyear (dependency age from package registries)
- Code complexity (scc)
- OpenSSF Scorecard
- SBOM generation (CycloneDX + SPDX)
What's NOT collected (requires a forge API):
- Issues, pull requests, events, messages, releases, repo info metadata
- Contributors (from API — git authors are still resolved via email)
Commit author resolution for git-only repos: Aveloxis attempts to resolve commit author emails against both the GitHub Search API and GitLab API to find platform identities. This means if a contributor uses the same email on GitHub and on a self-hosted Gitea instance, their identity can still be linked.
In the web GUI: Git-only repos are marked with a purple Git-only badge in the repository list.
URL validation: All URLs entered in the web GUI are validated. GitHub and GitLab URLs must have an owner/repo path. Other URLs are accepted if they have a valid host and path structure — the scheduler will attempt to clone them and report an error if cloning fails.
Aveloxis has two collection pipelines. The staged pipeline is used by aveloxis serve for production workloads. The direct pipeline is used by aveloxis collect for ad-hoc runs.
Designed for 400K+ repos. Eliminates database contention on the contributors table by decoupling API collection from relational persistence:
Prelim phase: Before any data collection, each repo's URL is checked with an HTTP HEAD request. If the URL redirects (repo was renamed or transferred):
- If the new URL already exists in our database: the old repo is marked as a duplicate and dequeued. This prevents collecting the same repo twice.
- If the new URL is new: the old repo's URL is updated to the canonical URL, and all stored URLs in issues, PRs, reviews, releases are bulk-updated via
REPLACE()to reflect the new org/repo path. - If the URL returns 404/410: the repo is skipped as dead.
Phase 1 — Collect (fast, no contention): Raw API responses are written to a JSONB staging table (aveloxis_ops.staging). No FK lookups, no contributor resolution. Multiple workers can blast data concurrently with zero contention on any relational table. Issues and PRs are staged as envelope types that bundle the parent entity with all its children (labels, assignees, reviewers, reviews, commits, files, head/base metadata) in a single JSONB row. Data is collected in this order:
- Repo info, releases, clone/traffic stats (collected first for commit count metadata)
- Contributors (seed from member/contributor lists)
- Issues + labels + assignees (bundled per issue)
- Pull requests + all children (bundled per PR)
- Events (issue + PR)
- Messages (issue comments, PR comments, inline review comments)
For repos with >10,000 commits (detected from repo_info metadata), steps 3-5 run in parallel across 3 goroutines (each with its own staging writer), then messages are collected after all three complete.
Heartbeat locking: During staged collection, workers send heartbeats every 30 seconds (HeartbeatJob) to update locked_at. This prevents RecoverStaleLocks (1-hour timeout) from stealing active jobs on large repos that take hours to collect. Without heartbeats, the stale lock recovery would repeatedly reclaim the lock and purge accumulated staging data.
Phase 2 — Process (single-threaded per repo): Staged data is drained in 500-row batches by entity type, in dependency order (contributors first, then issues, then PRs, then events/messages, then metadata). Contributors are resolved in bulk with an in-memory write-through cache (platform ID -> email -> login deduplication). When an envelope is processed, the parent is upserted first to obtain its database ID, then all bundled children are upserted using that ID.
- Review messages: PR review bodies are stored in the
messagestable with a link inpull_request_review_message_ref— the same bridge-table pattern used for issue comments (issue_message_ref) and PR comments (pull_request_message_ref). Only reviews with non-empty bodies get a message row. - Repo info rotation: Before inserting a new
repo_infosnapshot, the previous snapshot is moved torepo_info_history, preserving all metadata columns. The mainrepo_infotable always has only the latest data per repo. - Repo info source: GitHub uses a GraphQL query that returns PR/issue/commit counts, community profile files (CONTRIBUTING.md, CHANGELOG.md, CODE_OF_CONDUCT.md, SECURITY.md), license, and archive status in one API call. GitLab uses REST (
/projects/{id}?statistics=true) plus/issues_statisticsfor issue breakdowns and per-state MR counts viaX-Totalheaders.
Phase 2b — Gap fill: After processing, gathered issue/PR counts are compared against repo_info metadata. If the gap exceeds 5%, all issue/PR numbers are listed from the API and diffed against collected numbers in the database. Only the specific missing items are fetched, plus 2 already-collected items on each side of each gap to verify their associated data (comments, events, reviews) is complete. Handles multiple distinct gaps per repo. This catches incomplete collections from any cause — interrupted runs, API errors, rate limit exhaustion.
Phase 2c — Contributor enrichment: Thin contributor records (missing company/location from the basic Contributors API or lazy resolution) are enriched by calling GET /users/{login} for full profile data (company, location, email, name, created_at). Up to 500 contributors per pass — over multiple collection cycles, all contributors eventually get enriched.
Phase 3 — Facade (git): After API data is processed, the repo is cloned as a bare repo (or fetched if a clone already exists). git log --all --numstat is run to extract per-file commit data. For each commit:
- Per-file rows are inserted into
commits(one row per file touched per commit, matching Augur's model) - Parent-child relationships are inserted into
commit_parents - Commit messages are inserted into
commit_messages - Contributor affiliations are resolved: email domains are matched against the
contributor_affiliationstable to populatecmt_author_affiliationandcmt_committer_affiliation - After all commits are inserted, Facade aggregates are computed:
dm_repo_annual,dm_repo_monthly,dm_repo_weekly(and their repo_group counterparts) are refreshed by aggregating commit data by email, affiliation, and time period
Phase 4 — Analysis (on-demand full clone): A temporary full checkout is created from the bare clone (local, no network). Five analysis phases run against it, then the checkout is retained for scorecard before deletion:
- Dependency scanning (
repo_dependencies): walks the checkout for manifest files across 15 ecosystems — JavaScript (package.json), Python (requirements.txt, pyproject.toml, Pipfile), Go (go.mod), Rust (Cargo.toml), Ruby (Gemfile), Java/Kotlin (pom.xml, build.gradle, build.gradle.kts), PHP (composer.json), Elixir (mix.exs), Swift (Package.swift), Dart (pubspec.yaml), Scala (build.sbt), .NET (packages.config), Haskell (package.yaml), C/C++ (Makefile, CMakeLists.txt). Extracts dependency names and counts. - Libyear (
repo_deps_libyear): for each versioned dependency, queries its package registry (npm, PyPI, Go proxy, crates.io, RubyGems, Maven Central, Packagist, Hex.pm, NuGet, pub.dev, Hackage, SwiftPM/GitHub) to compare the current version against the latest. Calculates libyear = (latest_release_date - current_release_date) / 365. - Code complexity (
repo_labor): ifsccis installed, runsscc -f json --by-fileto get per-file metrics — programming language, total lines, code lines, comment lines, blank lines, and complexity. Install viaaveloxis install-tools. - ScanCode license/copyright detection (
aveloxis_scan.scancode_file_results): ifscancodeis installed, runsscancode -clpi --only-findings --jsonto detect per-file licenses (SPDX expressions), copyrights, holders, and packages. Only runs every 30 days per repo — license/copyright data changes infrequently. Results stored in dedicatedaveloxis_scanschema with history rotation. Install viapipx install scancode-toolkit-mini(requires Python 3.10+). - OpenSSF Scorecard (
repo_deps_scorecard): if thescorecardbinary is installed, runs locally against the checkout withscorecard --local <path>(much faster than remote mode). Each check (Code-Review, Maintained, Vulnerabilities, etc.) is stored with its score, reason, and details as JSONB. Previous results are rotated torepo_deps_scorecard_history. Install viaaveloxis install-tools.
Phase 5 — Commit Author Resolution (GitHub only): After facade completes, resolves git commit author emails to GitHub user accounts. This is the Go implementation of the augur-contributor-resolver scripts. Resolution strategy, cheapest first:
- Noreply email parse (free) —
12345+user@users.noreply.github.comextracts login andgh_user_iddirectly from the email format - Database lookup — checks
contributors(cntrb_email, cntrb_canonical) andcontributors_aliases(alias_email) - GitHub Commits API —
GET /repos/{owner}/{repo}/commits/{sha}returns the linked GitHub user with all profile fields (gh_user_id,gh_node_id,gh_avatar_url, allgh_*URLs, etc.) - GitHub Search API —
GET /search/users?q=email+in:emailfor remaining non-noreply emails
For each resolved commit author:
cmt_author_platform_usernameis set on all commit rows with that hash- The contributor row is created/updated with the deterministic
GithubUUID(Augur-compatible) and allgh_*profile fields are backfilled - Login renames are detected (same
gh_user_id, different login) and the contributor'sgh_loginis updated - An alias is created in
contributors_aliaseslinking the commit email to the contributor - After all commits are resolved, a bulk SQL backfill sets
cmt_ght_author_idby joiningcmt_author_platform_usernametocontributors.gh_login
Phase 6 — Canonical Email Enrichment: For contributors that have gh_login but no cntrb_canonical, calls GET /users/{login} to get their profile email and sets cntrb_canonical.
Phase 7 — SBOM Generation: Both CycloneDX 1.5 and SPDX 2.3 SBOMs are generated from the repo_deps_libyear data and stored in repo_sbom_scans with format metadata. SBOMs include dependency names, versions, licenses, and package URLs from all 12 registries. Available for download via the web GUI or REST API.
Phase 8 — Vulnerability Scanning (OSV.dev): All dependencies with package URLs (purls) are batch-queried against the OSV.dev API to identify known vulnerabilities. OSV aggregates data from NVD (CVEs), GitHub Advisory Database (GHSA), PyPI advisories, RustSec, Go Vulnerability Database, and OSS-Fuzz — providing comprehensive coverage across all supported ecosystems. Results are stored in repo_deps_vulnerabilities with:
- Vulnerability ID (GHSA, PYSEC, RUSTSEC, GO, etc.) and CVE cross-reference
- CVSS severity and score (approximated from vector)
- Affected and fixed version ranges
- Summary, details, and reference URLs
- Source attribution
The OSV.dev batch endpoint (POST /v1/querybatch) accepts purls natively — no CPE mapping needed. No API key required. NIST NVD is not queried directly because it uses CPE identifiers where the vendor field is unpredictable from package names alone.
Periodic — Contributor Breadth: Every 6 hours, the scheduler runs the breadth worker which calls GET /users/{login}/events for each contributor to discover their activity in repos outside the tracked set. Each event (PushEvent, PullRequestEvent, IssuesEvent, etc.) is stored in contributor_repo, mapping contributors to their cross-repo activity. Contributors are prioritized by those never processed first, then oldest. Up to 100 contributors are processed per cycle.
For ad-hoc single-repo runs. Writes directly to relational tables without staging. Runs the same phases (contributors, issues, PRs, events, messages, metadata, facade, commit resolution) but with inline contributor resolution and direct upserts. Best for testing or collecting a small number of repos.
The scheduler queue lives in aveloxis_ops.collection_queue and uses FOR UPDATE SKIP LOCKED for atomic job claiming:
- Durability: Queue survives process restarts — no in-memory state lost
- Horizontal scaling: Multiple
aveloxis serveinstances can share the same queue - Transparency: Queue state is queryable with plain SQL
- Stale lock recovery: Jobs locked by crashed workers are automatically re-queued after 1 hour
- Priority override: Any repo can be pushed to the top at any time via CLI or API
- No extra infrastructure: No Redis, RabbitMQ, or Celery
GitHub and GitLab implement the same platform.Client interface with 7 sub-interfaces:
| Sub-interface | Methods | Notes |
|---|---|---|
RepoCollector |
FetchRepoInfo, FetchCloneStats |
Clone stats unavailable for GitLab via API |
IssueCollector |
ListIssues, ListIssueLabels, ListIssueAssignees |
|
PullRequestCollector |
ListPullRequests, ListPRLabels, ListPRAssignees, ListPRReviewers, ListPRReviews, ListPRCommits, ListPRFiles, FetchPRMeta |
GitLab MRs mapped to PR model |
EventCollector |
ListIssueEvents, ListPREvents |
GitLab uses resource events API |
MessageCollector |
ListIssueComments, ListPRComments, ListReviewComments |
GitLab review comments use /merge_requests/:iid/discussions with diff position filtering |
ReleaseCollector |
ListReleases |
|
ContributorCollector |
ListContributors, EnrichContributor |
GitLab combines /members/all + /repository/contributors |
All methods use Go 1.23 iterators (iter.Seq2) for memory-efficient streaming pagination.
There are two layers of contributor resolution:
API-phase resolution (during issue/PR/event collection): Platform user references (login, email, avatar, etc.) are resolved to a canonical cntrb_id UUID via a three-tier strategy:
- In-memory cache: Platform ID -> UUID lookup (avoids DB round-trips for repeated contributors within a batch)
- Database lookup:
contributor_identitiestable (platform_id + platform_user_id unique key) - Create new: Insert into
contributors+contributor_identitiesif no match found
Git-phase resolution (after facade commits are inserted): Commit author emails are resolved to GitHub user accounts. This is the equivalent of the augur-contributor-resolver scripts, implemented natively in Go:
- Noreply parse (free) — extract login + user ID from GitHub noreply email format
- DB lookup — check contributors and aliases tables by email
- GitHub Commits API —
GET /repos/{owner}/{repo}/commits/{sha}for the linked GitHub user - GitHub Search API —
GET /search/users?q=emailfor remaining emails - Backfill — bulk SQL join to set
cmt_ght_author_idfrom resolved logins
Aveloxis generates cntrb_id UUIDs using Augur's deterministic scheme: the UUID encodes platform_id (byte 0) and gh_user_id (bytes 1-4, big-endian). This means:
- The same GitHub user always gets the same
cntrb_idregardless of which system created it - Aveloxis contributor IDs are byte-compatible with existing Augur data
- GitLab uses the same scheme with platform byte = 2
During the facade phase, commit author/committer emails are matched against the contributor_affiliations table to resolve organizational affiliations. The resolver:
- Loads all active affiliation rules on first use (lazy, cached in memory)
- Matches exact domain first (e.g.,
user@redhat.com->Red Hat) - Falls back to parent domains (e.g.,
user@mail.google.com->Googleviagoogle.com) - Populates
cmt_author_affiliationandcmt_committer_affiliationon every commit row
All text fields (issue titles/bodies, PR titles/bodies, message text, release descriptions, review bodies, commit messages) are sanitized before database insertion. This mirrors Augur's remove_null_characters_from_string() and UTF-8 encoding cleanup:
- Null bytes (
\x00) — removed (PostgreSQL TEXT cannot store them; these appear in bot-generated content and copy-pasted binary data) - Invalid UTF-8 sequences — replaced with U+FFFD (Unicode replacement character)
- Control characters (C0: 0x01-0x1F except tab/newline/CR; C1: 0x7F-0x9F) — stripped
- Clean strings pass through without allocation (fast path)
When the prelim phase detects a repo that returns 404 or 410 (deleted, made private, or DMCA'd):
- Data is preserved — all previously collected issues, PRs, commits, messages, etc. remain in the database
- Collection stops permanently — the repo is marked
repo_archived = TRUEand removed from the queue - No wasted API calls — unlike Augur, which keeps retrying dead repos every cycle, Aveloxis permanently sidelines them
- Re-adding: to un-sideline a repo that comes back, manually
UPDATE aveloxis_data.repos SET repo_archived = FALSE WHERE repo_id = Nthenaveloxis add-repo <url>
- Gateway error retry: 502/503/504 responses trigger exponential backoff with jitter (1s, 2s, 4s, 8s... up to 64s base + random jitter), context-aware, up to 10 retries. This handles GitHub/GitLab service degradation gracefully.
- Timestamp cleanup:
aveloxis migrateautomatically detects and nullifies garbage timestamps (year < 1970) across all tables, preventing BC-era dates from poisoning queries - Deadlock retry: All database upserts use exponential backoff retry on PostgreSQL deadlock errors (error code
40P01), up to 10 attempts - Stale lock recovery: The scheduler checks every 5 minutes for jobs that have been locked for more than 1 hour and re-queues them
- Per-entity error isolation: A failed upsert for one issue/PR/message logs a warning but does not abort collection for the entire repo
- Facade resilience: If
git fetchfails on an existing clone, the facade re-clones from scratch before giving up
Aveloxis creates 22 materialized views compatible with 8Knot and other Augur analytics tools:
| View | Purpose |
|---|---|
api_get_all_repo_prs |
Total PR count per repo |
api_get_all_repos_commits |
Total distinct commit count per repo |
api_get_all_repos_issues |
Total issue count per repo (excluding PRs) |
explorer_entry_list |
Repo list with group names |
explorer_commits_and_committers_daily_count |
Daily commit/committer counts |
explorer_contributor_actions |
All contributor actions (commits, issues, PRs, reviews, comments) with ranking |
explorer_new_contributors |
First-time contributor tracking |
augur_new_contributors |
8Knot compat alias |
explorer_pr_assignments |
PR assignment/unassignment events |
explorer_pr_response |
PR message response tracking |
explorer_pr_response_times |
Comprehensive PR metrics (time to close, response times, line/file/commit counts) |
explorer_issue_assignments |
Issue assignment events |
explorer_user_repos |
User-to-repo mapping |
explorer_repo_languages |
Language breakdown from repo_labor |
explorer_libyear_all / _summary / _detail |
Dependency age (libyear) metrics |
explorer_contributor_recent_actions |
Same as explorer_contributor_actions but limited to last 13 months |
explorer_pr_files |
PR file paths with pull_request_id and repo_id |
explorer_cntrb_per_file |
Contributors and reviewers aggregated per file path |
explorer_repo_files |
Latest SCC file listing per repo (most recent analysis date) |
issue_reporter_created_at |
Legacy issue reporter view |
Rebuild schedule: Configurable via collection.matview_rebuild_day in aveloxis.json (default: "saturday"). Set to "disabled" to turn off automatic rebuilds. Views are NOT refreshed on every startup (was causing slow starts on large databases). On first run, views are created; subsequent startups skip them. Manual rebuild: aveloxis refresh-views. The explicit aveloxis migrate command always creates/refreshes views.
Three schemas in PostgreSQL with full parity to Augur's augur_data and augur_operations, plus a dedicated schema for ScanCode results:
aveloxis_data(84 tables + 22 materialized views) — All collected data: repos, issues, PRs, commits (per-file), commit parents, commit messages, messages, events, releases, contributors, contributor identities/aliases/affiliations, dependencies/SBOM, sentiment/NLP analysis, LSTM anomaly detection, topic modeling, Facade aggregates (dm_repo_annual/monthly/weekly, dm_repo_group_annual/monthly/weekly), repo labor/complexity, DEI badging, CHAOSS metrics, network analysis, repo insights, and more. Plus 22 materialized views for 8Knot compatibility.aveloxis_ops(24 tables) — Operational tables: collection queue, JSONB staging store, collection status (tracks core/secondary/facade/ML phases independently), API credentials, users/auth/sessions, config, worker history/jobs, network weighted tables.aveloxis_scan(4 tables) — ScanCode per-file license and copyright detection:scancode_scans(scan metadata),scancode_file_results(per-file SPDX license, copyrights, holders, packages as JSONB), plus_historytables for both.
Tables omitted from Augur (junk): _transfer_testing, _transfer_training, akl;fjlk;a (renamed to dei_badging), analysis_log, all, github_users_2, worker_oauth_copy1.
Aveloxis uses cleaner column names internally but exposes Augur-compatible names in all materialized views for seamless 8Knot integration. The internal schema avoids Augur's pr_src_* and gh_* prefixes in favor of descriptive names, but view output columns are aliased to match Augur exactly.
Pull Requests:
| Augur column | Aveloxis table column | Matview output alias |
|---|---|---|
pr_src_id |
platform_pr_id |
pr_src_id |
pr_src_number |
pr_number |
— |
pr_src_state |
pr_state |
pr_src_state |
pr_src_title |
pr_title |
— |
pr_created_at |
created_at |
pr_created_at |
pr_merged_at |
merged_at |
pr_merged_at |
pr_closed_at |
closed_at |
pr_closed_at |
pr_augur_contributor_id |
author_id |
cntrb_id |
pr_src_author_association |
author_association |
pr_src_author_association |
pr_merge_commit_sha |
merge_commit_sha |
— |
Pull Request Meta:
| Augur column | Aveloxis table column | Matview output alias |
|---|---|---|
pr_repo_meta_id |
pr_meta_id |
— |
pr_head_or_base |
head_or_base |
pr_head_or_base |
pr_src_meta_label |
meta_label |
pr_src_meta_label |
pr_src_meta_ref |
meta_ref |
— |
pr_sha |
meta_sha |
— |
Pull Request Reviews:
| Augur column | Aveloxis table column |
|---|---|
pr_review_state |
review_state |
pr_review_body |
review_body |
pr_review_submitted_at |
submitted_at |
pr_review_src_id |
platform_review_id |
Issues:
| Augur column | Aveloxis table column |
|---|---|
gh_issue_id |
platform_issue_id |
gh_issue_number |
issue_number |
Repo Info:
| Augur column | Aveloxis table column |
|---|---|
stars_count |
star_count |
watchers_count |
watcher_count |
pull_request_count |
pr_count |
pull_requests_open |
prs_open |
pull_requests_closed |
prs_closed |
pull_requests_merged |
prs_merged |
committers_count |
committer_count |
Table Names:
| Augur table | Aveloxis table |
|---|---|
augur_data.repo |
aveloxis_data.repos |
augur_data.message |
aveloxis_data.messages |
augur_data.platform |
aveloxis_data.platforms |
augur_data.* (all others) |
aveloxis_data.* (same name) |
augur_operations.* |
aveloxis_ops.* |
Libyear compatibility note: Augur's repo_deps_libyear table has a typo: current_verion (missing 's'). Aveloxis fixes this to current_version in the table, but the explorer_libyear_detail materialized view aliases it back to current_verion for 8Knot compatibility.
Both platforms collect the same data types. Most fields have full parity; known gaps are documented below the table:
| Entity | GitHub Source | GitLab Source | Storage |
|---|---|---|---|
| Issues | /repos/{o}/{r}/issues |
/projects/{id}/issues |
issues |
| Issue Labels | Embedded in issues | Embedded in issues | issue_labels |
| Issue Assignees | Embedded in issues | Embedded in issues | issue_assignees |
| Pull Requests / MRs | /repos/{o}/{r}/pulls |
/projects/{id}/merge_requests |
pull_requests |
| PR/MR Labels | Embedded in PRs | Embedded in MRs | pull_request_labels |
| PR/MR Assignees | Embedded in PRs | Embedded in MRs | pull_request_assignees |
| PR/MR Reviewers | /pulls/{n}/requested_reviewers |
Embedded in MRs | pull_request_reviewers |
| PR/MR Reviews | /pulls/{n}/reviews |
/merge_requests/{n}/approvals |
pull_request_reviews + messages + pull_request_review_message_ref (review body stored as message via bridge table) |
| PR/MR Commits | /pulls/{n}/commits |
/merge_requests/{n}/commits |
pull_request_commits |
| PR/MR Files | /pulls/{n}/files |
/merge_requests/{n}/diffs |
pull_request_files |
| PR/MR Head/Base Meta | Embedded in PR response | Source/target branch from MR | pull_request_meta |
| Issue Comments | /issues/comments |
/issues/{n}/notes |
messages + issue_message_ref |
| PR/MR Comments | /issues/comments (shared endpoint) |
/merge_requests/{n}/notes |
messages + pull_request_message_ref |
| Review Comments | /pulls/comments |
/merge_requests/{n}/discussions (diff-positioned) |
messages + review_comments |
| Issue Events | /issues/events |
/projects/{id}/events + resource events |
issue_events |
| PR/MR Events | /issues/events (shared endpoint) |
/projects/{id}/events + resource events |
pull_request_events |
| Releases | /repos/{o}/{r}/releases |
/projects/{id}/releases |
releases |
| Repo Info (metadata) | GraphQL API (counts, community profile, license, status) | /projects/{id}?statistics=true + /issues_statistics + MR counts via X-Total |
repo_info (latest) + repo_info_history |
| Contributors | /repos/{o}/{r}/contributors |
/projects/{id}/members/all + /repository/contributors |
contributors + contributor_identities |
| Clone Stats | /traffic/clones |
Not available via API | repo_clones |
| Commits (git) | git clone --bare + git log --all --numstat |
Same | commits + commit_parents + commit_messages |
| Facade Aggregates | Computed from commits table | Same | dm_repo_annual/monthly/weekly |
| Commit Author Resolution | Noreply parse + Commits API + Search API (GitHub only) | N/A (GitLab identity from API) | contributors + contributor_aliases |
| Dependencies | File scan: 15 ecosystems (package.json, go.mod, pom.xml, Cargo.toml, etc.) | Same | repo_dependencies |
| Libyear | 12 registries (npm, PyPI, Go, Cargo, RubyGems, Maven, Packagist, Hex, NuGet, pub.dev, Hackage, SwiftPM) | Same | repo_deps_libyear |
| Code Complexity | scc --by-file (if installed) |
Same | repo_labor |
| OpenSSF Scorecard | scorecard --local (if installed) |
Same (works on any git URL) | repo_deps_scorecard (latest) + repo_deps_scorecard_history |
| ScanCode License/Copyright | scancode -clpi per file (every 30 days, if installed) |
Same | aveloxis_scan.scancode_scans + scancode_file_results + history |
| SBOMs | Generated from libyear data | Same | repo_sbom_scans (CycloneDX 1.5 + SPDX 2.3) |
| Vulnerability Scan | OSV.dev batch API (purls) | Same | repo_deps_vulnerabilities (CVE ID, severity, CVSS, fixed version) |
| PR/MR Fork Repos | head.repo / base.repo in PR response |
/projects/{id} per source/target |
pull_request_repo |
| Contributor Affiliations | Auto-populated from email domains + cntrb_company |
Same | contributor_affiliations |
| Contributor Breadth | GET /users/{login}/events (every 6h) |
— | contributor_repo |
| Canonical Email Enrichment | GET /users/{login} for profile email |
Same | contributors.cntrb_canonical |
The following fields are available from GitHub but not from GitLab due to platform API limitations:
| Field | GitHub | GitLab | Notes |
|---|---|---|---|
| Community profile files (CHANGELOG, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY) | GraphQL file detection | /repository/tree file detection (v0.12.2) |
Full parity |
repo_info.commit_count |
REST /repos/{o}/{r} — accurate |
GET /projects/:id?statistics=true → often 0 for mirrored or private-low-scope projects |
v0.16.9+ backfills from facade's git log count after a successful clone so the monitor/web "Metadata commits" column matches reality |
Watcher count (repo_info) |
GraphQL watchers.totalCount |
Not available | GitLab has no public "watchers" API; star_count is the closest analog |
Clone statistics (repo_clones) |
/traffic/clones (requires push access) |
Not available | GitLab exposes clone data only via admin-only endpoints |
GraphQL node IDs (pr_src_node_id) |
Available on all entities | Not applicable | GitLab uses numeric IDs, not GraphQL node IDs — architectural difference |
Contributor URL fields (gh_followers_url, etc.) |
10+ URL fields per contributor | Not available | GitLab API doesn't expose follower/following/gist/etc. URLs |
| Contributor type (User/Bot/Organization) | type field on user objects |
Not available | GitLab doesn't distinguish user types the same way |
All text content from conversations — regardless of where it originates — is stored in a single messages table. This design enables cross-cutting text analysis (sentiment, response times, contributor communication patterns) without needing to query four separate tables. The semantic origin of each message is preserved via bridge tables:
| Message type | Purpose | Bridge table | Metadata table |
|---|---|---|---|
| Issue comments | Discussion on issues | issue_message_ref |
— |
| PR/MR comments | Discussion on pull requests | pull_request_message_ref |
— |
| Inline review comments | Code-level feedback on specific diff lines | review_comments (has msg_id FK) |
review_comments (diff_hunk, file_path, line, position) |
| Review bodies | Top-level review text (e.g., "LGTM", "Changes requested because...") | pull_request_review_message_ref |
pull_request_reviews (review_state, submitted_at) |
Why this matters for analysis: A query like "all messages by contributor X" joins messages once. A query like "all inline code review feedback on file Y" joins through review_comments. A query like "average time from PR open to first review body" joins through pull_request_review_message_ref. The bridge tables give you the semantic context; the messages table gives you the text.
Review bodies are stored in both pull_request_reviews.review_body (for quick access with the review metadata) and in messages (for unified text analysis). This intentional duplication keeps the review table self-contained while enabling cross-message-type analytics.
| Aspect | Augur | Aveloxis |
|---|---|---|
| Language | Python | Go |
| Processes | Celery workers + Flask API + Flower + Redis + RabbitMQ | 3 processes: serve (scheduler), web (GUI+monitor), api (REST) |
| Queue | Celery + RabbitMQ + Redis | Postgres SKIP LOCKED (no extra infrastructure) |
| Monitoring | Flower (separate service) | Built-in dashboard with gathered vs metadata count columns |
| Testing | Lags development due to long history. | Test first programming from birth. |
| REST API | Flask/Gunicorn with Beaker cache | Separate aveloxis api with repo stats, batch stats, SBOM download |
| GitLab support | Partial (missing releases, repo info, review comments, contributor enrichment) | Full parity with GitHub — all features work on both platforms |
| Repo info | GitHub Only | GitHub: GraphQL for all counts + community profile files. GitLab: REST + /issues_statistics + MR counts via X-Total. Historical snapshots in repo_info_history. |
| Contributor model | gh_*/gl_* columns mixed on one table |
Separate contributor_identities table |
| DB write pattern | Individual upserts during collection | JSONB staging → bulk batch processing (pgx.Batch for deps, libyear, labor, breadth). Significantly lower database contention for the contributors table. |
| Repo redirect handling | Not proactively handled | Prelim phase detects renames/transfers, deduplicates, updates URLs + bulk-fixes all stored URLs regularly |
| Dependency scanning | Custom Python parsers for 12 languages | Go parsers for 15 ecosystems, on-demand full clone |
| Libyear | npm + PyPI only | 12 registries: npm, PyPI, Go, Cargo, RubyGems, Maven, Packagist, Hex, NuGet, pub.dev, Hackage, SwiftPM |
| Code complexity (scc) | Requires manual scc install + separate worker | aveloxis install-tools + automatic per-repo analysis |
| OpenSSF Scorecard | Runs scorecard binary against GitHub repos. |
Runs scorecard binary against GitHub AND GitLab repos. Results in repo_deps_scorecard with history. |
| SBOM generation | Not supported | CycloneDX 1.5 + SPDX 2.3 with license capture from 12 registries. Download via web GUI or REST API. |
| Review messages | Review body stored in messages table with pull_request_review_message_ref bridge (same pattern as issue/PR comments). |
Same. |
| History tracking | Fills repo_info on each run, grows infinitely. | repo_info_history and repo_deps_scorecard_history preserve all previous snapshots |
| Scheduling | Celery Beat + collection_status table (opaque) | Priority queue — fills ALL worker slots per tick (not one per tick) |
| Priority override | Not supported | aveloxis prioritize / POST API / dashboard button |
| Scaling | Single Celery worker per queue | Multiple instances via SKIP LOCKED; 40+ workers on one host doing all collection for each repo before moving on. |
| API key rotation | Sequential drain, single key at a time | Round-robin across all keys, 15-request buffer, full utilization |
| API key source | Keyman service + Redis | Config file and/or Augur's worker_oauth table |
| API efficiency | No conditional requests | ETag caching (304 = free), HTTP/2 multiplexing, 20 idle connections per host |
| Commit author resolution | Separate Python scripts, long process with contention. | Built-in post-facade phase: noreply parse, DB lookup, Commits API, Search API |
| Materialized views | 18 views, manual refresh or Celery task | 19 views, configurable auto-rebuild schedule (default Saturday), not refreshed on startup |
| Contributor breadth | Separate Celery worker, manual scheduling | Built-in, runs every 6 hours automatically |
| Contributor IDs | Deterministic GithubUUID from gh_user_id | Deterministic GithubUUID from gh_user_id (Augur byte-compatible) |
| Facade aggregates | Post-processing in Python | SQL-based aggregate refresh per repo after git log |
| Affiliation resolution | Python domain matching | In-memory cached resolver with parent domain fallback |
| Text sanitization | Python encode/decode with backslashreplace + null byte removal | Go sanitizer: null bytes, invalid UTF-8, control characters stripped at DB boundary |
| Dead repo handling | Keeps retrying dead repos every cycle | Permanently sidelines (archived flag + dequeued), data preserved |
| Gateway error retry | Basic retry | Exponential backoff with jitter (1s-64s) for 502/503/504 |
| User interface | CLI only, except for Admin. | Web GUI with OAuth, interactive Chart.js visualizations, cross-project comparison (100%/Z-Score), dependency license analysis, SBOM download |
| Visualizations | Requires external tool (8Knot/Dash) | Still 8Knot compatible, 100%! And, built-in weekly time-series charts, comparison page for up to 5 repos with Z-score normalization |
| Vulnerability scanning | Not supported | OSV.dev batch API: scans all dependencies by purl, aggregates NVD+GHSA+PyPI+RustSec+Go vulns |
| Non-GitHub/GitLab repos | Not supported | Git-only mode: facade, analysis, scorecard, SBOM — email resolution against both platforms |
| User org tracking | Static — orgs added once, never rescanned | Dynamic — user_org_requests tracked, new repos auto-discovered every 4h |
| Error recovery | Manual restart | Automatic stale lock recovery + deadlock retry |
aveloxis/
cmd/aveloxis/ # CLI entry point (cobra commands)
main.go # serve, web, api, collect, add-repo, add-key, prioritize, recollect, migrate, version
internal/
collector/ # Collection orchestration
collector.go # Direct pipeline (used by `collect` command)
staged.go # Staged pipeline (used by `serve` command)
facade.go # Git clone + log parsing for commits
commit_resolver.go # Git email -> GitHub user resolution (port of augur-contributor-resolver)
breadth.go # Contributor breadth worker (cross-repo activity via GitHub Events API)
analysis.go # On-demand repo analysis (dependencies, libyear, scc, scancode)
sbom.go # CycloneDX 1.5 and SPDX 2.3 SBOM generation
scorecard.go # OpenSSF Scorecard integration (local execution)
scancode.go # ScanCode Toolkit integration (license/copyright/package detection)
vulnerability.go # OSV.dev vulnerability scanning (CVE/GHSA lookup by purl)
enrich.go # Contributor profile enrichment (GET /users/{login})
gap_fill.go # Smart gap detection and targeted re-collection
refresh_open.go # Open issue/PR refresh (status, labels, assignees)
tools.go # External tool management (scc, scorecard, scancode install)
noreply.go # GitHub noreply email parser
prelim.go # Redirect detection and duplicate checking
state.go # Collection status/phase constants
config/
config.go # JSON config loading with defaults
db/
postgres.go # All upsert methods (issues, PRs, events, messages, etc.)
store.go # Store interface definition
staging.go # JSONB staging writer and batch processor
migrate.go # Schema migration (embeds schema.sql)
schema.sql # Full DDL (112+ tables across 3 schemas, 23 indexes)
matviews.sql # 22 materialized views for 8Knot/analytics
matviews.go # View creation and refresh logic
sanitize.go # Text sanitization (null bytes, invalid UTF-8, control chars)
contributors.go # Contributor resolver with in-memory cache and heartbeat
affiliations.go # Email domain -> org affiliation resolver
aggregates.go # Facade aggregate refresh (dm_repo_* tables)
github_uuid.go # Deterministic UUID generation (Augur-compatible)
commit_resolver_store.go # DB methods for commit author resolution
breadth_store.go # DB methods for contributor breadth
analysis_store.go # DB methods for dependency/libyear/scc/scorecard analysis
scancode_store.go # DB methods for ScanCode results (aveloxis_scan schema)
gap_store.go # DB methods for gap fill, open item queries, metadata counts
repo_stats.go # Gathered vs metadata counts (RepoStats, GetRepoStatsBatch)
timeseries.go # Time series queries, license aggregation, OSI compliance
history.go # History rotation (libyear, scorecard, scancode, repo_info)
vulnerability_store.go # DB methods for CVE/vulnerability storage and queries
web_store.go # DB methods for user/group/org management
queue.go # Postgres-backed priority queue operations
keys.go # API key management and Augur import
import.go # Augur repo import
version.go # Single source of truth for tool version
model/ # Platform-agnostic data types
repo.go # Repo, RepoGroup, Platform, Contributor, ContributorIdentity
issue.go # Issue, IssueLabel, IssueAssignee, IssueEvent
pullrequest.go # PullRequest + all sub-entities
message.go # Message, IssueMessageRef, PRMessageRef, ReviewComment
commit.go # Commit, CommitMessage, CommitParent
release.go # Release
repoinfo.go # RepoInfo, RepoClone
userref.go # UserRef (platform user reference for contributor resolution)
web/
server.go # Web GUI server with OAuth handlers
templates.go # Embedded HTML templates (dashboard, groups, repos, compare)
url_validation.go # URL parsing, scheme fixing, validation
api/
server.go # REST API server (stats, timeseries, licenses, scancode, SBOM, search)
monitor/
monitor.go # HTTP dashboard with sortable gathered vs metadata columns; paginated + server-side search (v0.18.6)
platform/
platform.go # Client interface (7 sub-interfaces + FetchIssueByNumber, FetchPRByNumber)
httpclient.go # HTTP client with rate limiting, key rotation, ETag caching, retries
ratelimit.go # API key pool with rate limit tracking and MarkDepleted
repourl.go # URL parsing (GitHub/GitLab detection)
github/
client.go # Full GitHub REST + GraphQL API implementation
types.go # GitHub API response types
gitlab/
client.go # Full GitLab API v4 implementation
types.go # GitLab API response types
scheduler/
scheduler.go # Queue polling, job dispatch, heartbeat, stale lock recovery, gap fill
aveloxis-story/
augur_data.sql # Reference: Augur's augur_data schema (for comparison)
augur_operations.sql # Reference: Augur's augur_operations schema (for comparison)
[pdf|pptx|png|html] # Various artifacts describing the road to Aveloxis
Dockerfile # Multi-stage Docker build
docker-compose.yml # Docker Compose with PostgreSQL
.readthedocs.yaml # ReadTheDocs build configuration
go.mod # Go module definition
go.sum # Go dependency checksums
aveloxis.example.json # Example configuration file
aveloxis.docker.json # Docker-specific configuration (uses docker service names)
install-scorecard.sh # Manual scorecard binary installation script
# Run all unit tests (no database required)
go test ./...
# Run with verbose output
go test -v ./...
# Run a specific package
go test ./internal/platform/...
# Run integration tests (requires live PostgreSQL)
# Set the connection string, then tests with t.Skip guards will run:
AVELOXIS_TEST_DB="postgres://user:pass@localhost:5432/aveloxis_test" go test ./internal/db/...The test suite has 561 tests across 75 test files in 12 packages (all pass, no database required). Coverage by area:
| Package | Tests | Coverage |
|---|---|---|
internal/collector |
344 | Dependency parsers (15 ecosystems), libyear, SBOM generation (CycloneDX + SPDX), vulnerability scanning (CVSS, OSV), facade git log parsing, git URL security validation, commit resolution, prelim URL handling, noreply/bot email detection, breadth worker, SCC complexity, scorecard |
internal/db |
61 | Queue jobs, staging, GithubUUID (incl. overflow detection), text sanitization, affiliations, batch operations, repo stats, vulnerability store, SBOM store, timeseries, licenses |
internal/api |
40 | All REST endpoints (health, stats, SBOM, timeseries, licenses, search) + all Augur-compatible metric endpoints (issues, PRs, commits, contributors, stars, forks, watchers, deps, releases, complexity), parameter validation, route registration |
internal/platform |
39 | Key pool (round-robin, exhaustion, reset wait, empty pool, invalidated keys), HTTP client (pagination, query params, retry-after), URL parsing (GitHub, GitLab, nested subgroups, self-hosted) |
internal/web |
22 | Web GUI handlers, OAuth flow, URL validation, cookie security (Secure/HttpOnly/dev_mode) |
internal/scheduler |
17 | Job lifecycle, phase orchestration, worker management |
internal/platform/gitlab |
17 | Community file detection, contributor enrichment, user reference conversion, diff line counting, discussion notes |
internal/config |
9 | Default values, connection string generation, JSON loading, merge behavior |
internal/platform/github |
7 | User reference conversion, GraphQL query building |
internal/monitor |
4 | API endpoint validation, request parsing |
internal/model |
1 | UserRef zero-value detection |
cd docs
pip install -r requirements.txt
sphinx-build -b html . _build/html
open _build/html/index.html
Or if you prefer a one-liner from the repo root:
pip install sphinx sphinx-rtd-theme myst-parser && sphinx-build -b html docs docs/_build/html && open docs/_build/html/index.htmlAveloxis is free software: you can redistribute it and/or modify it under the terms of the MIT License as published by the Open Source Initiative. See the LICENSE file for more details. This work has been funded almost entirely through the Alfred P. Sloan Foundation. Mozilla, The Reynolds Journalism Institute, VMWare, Red Hat Software, Grace Hopper's Open Source Day, GitHub, Microsoft, Twitter, Adobe, the Gluster Project, Open Source Summit (NA/Europe), and the Linux Foundation Compliance Summit have made contributions to the code and in some cases financially supported the development of Aveloxis's predecessoar, Augur, from 2017 to 2026. Aveloxis collects open source community health data from GitHub and GitLab with equal completeness, storing it in a shared PostgreSQL schema for cross-platform analysis. It is designed as an upgrade to the Augur collection pipeline. A feature and robustness comparison is available in the Comparison with Augur section of this readme.
{ "github": { "api_keys": ["ghp_YOUR_GITHUB_TOKEN"], // ← REQUIRED: at least one PAT ... }, "web": { "dev_mode": true, // ← Required for HTTP (no HTTPS) "github_client_id": "Iv1.abc123...", // ← For OAuth login (see below) "github_client_secret": "deadbeef...", // ← For OAuth login (see below) ... } }