
RANGER-5655: Dynamic unified ingestor registry for audit partition routing and service allowlists #1032

Open
ramackri wants to merge 9 commits into apache:master from ramackri:RANGER-5655-patch


@ramackri ramackri commented Jun 23, 2026

Contributor

What changes were proposed in this pull request?

Implements RANGER-5655: a dynamic unified ingestor registry for Ranger audit-ingestor so operators can change Kafka partition routing and per-repo service allowlists at runtime — without restarting ingestor pods.

The registry is a versioned JSON document in Kafka topic ranger_audit_partition_plan (1 partition, compacted). All ingestor replicas converge via PartitionPlanWatcher; AuditPartitioner routes on the hot path from in-memory state only.

Feature flag (default off): ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled=false

Problem

| Job | Static behavior today | Pain |
| --- | --- | --- |
| Service allowlist | ranger.audit.ingestor.service.*.allowed.users in site XML at startup | Onboard repo / change allowlist → XML edit + restart all ingestor pods |
| Partition routing | kafka.configured.plugins + per-plugin overrides at startup | Promote hot plugin or grow partitions → restart; contiguous ranges can reshuffle later plugins |

Solution (this PR update)

Simplified REST control plane — three endpoints only:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /api/audit/partition-plan | Read current plan (plugins, buffer, services, version) |
| POST | /api/audit/partition-plan/plugins | Onboard plugin: dedicated partitions + mandatory non-empty services map |
| PATCH | /api/audit/partition-plan/plugins/{pluginId} | Update onboarded plugin: scale and/or addServices / updateServices / removeServices |

Removed (consolidated above): PATCH /api/audit/partition-plan, POST /api/audit/partition-plan/services, separate promote-only / scale-only flows.

New request models: OnboardPlugin, UpdatePlugin. Service entries stored with optional pluginId for repo→plugin ownership.


Code changes in this commit (REST simplification slice)

| Area | Change |
| --- | --- |
| AuditREST.java | Three partition-plan endpoints only |
| PartitionPlanService.java | onboardPlugin(), updatePlugin() |
| PartitionPlanAllocator.java | Onboard/update with service allowlist mutations |
| PartitionPlanRequestValidator.java | Mandatory services on POST onboard |
| OnboardPlugin.java, UpdatePlugin.java | New REST DTOs |
| ServiceAllowlistEntry.java | Optional pluginId for ownership |
| Unit tests | PartitionPlanRequestValidatorTest + mutation/allocator updates (94 partition-plan tests) |

How was this patch tested?

Unit tests + quality gates

mvn verify -pl audit-server/audit-ingestor -Drat.skip=true \
  -Dtest='PartitionPlan*Test,ServiceAllowlist*Test,AuthToLocalRuleComposerTest'

| Gate | Result |
| --- | --- |
| Partition-plan + allowlist tests | 94/94 pass |
| Checkstyle | Pass |
| PMD | Pass |

Focused run:

mvn test -pl audit-server/audit-ingestor \
  -Dtest='PartitionPlan*Test,ServiceAllowlist*Test' -Drat.skip=true

Manual testing (local Docker audit lab)

Manual validation used a local Docker Compose audit environment that mirrors a production-style Ranger audit deployment: Kerberos (KDC + plugin keytabs), Kafka with both the audit data topic (ranger_audits) and the compacted registry topic (ranger_audit_partition_plan), a running audit-ingestor instance on port 7081, Solr (with audit dispatcher), Postgres-backed Ranger Admin, and real plugin containers for HDFS and Hive. All partition-plan REST calls used SPNEGO (Kerberos) as the ingestor HTTP service principal; plugin audit posts used each plugin’s own keytab.

The ingestor was rebuilt with this branch’s code (including mandatory services validation on onboard) before running the scenarios below.


1. Environment readiness

Before exercising the new API, the lab was brought to a healthy state: ingestor health endpoint returned 200, Kafka was reachable, the plan watcher was active after enabling dynamic mode, and GET /api/audit/partition-plan returned a coherent plan JSON (version, plugins, buffer, services, topicPartitionCount matching the live ranger_audits partition count).


2. Static mode unchanged (feature flag off)

With ranger.audit.ingestor.kafka.partition.plan.dynamic.enabled=false (default):

  • GET /api/audit/partition-plan returned 503 — partition-plan admin API correctly disabled.
  • GET /api/audit/health still returned 200.
  • Normal plugin audit delivery (HDFS smoke, Solr indexing) continued to work.

This confirms existing deployments are unaffected when the flag stays off.


3. Enabling dynamic mode and reading the registry

Dynamic mode was turned on (dynamic.enabled=true) with a fresh or reset plan topic where appropriate. After ingestor restart:

  • Kafka showed ranger_audit_partition_plan with one partition and compacted cleanup policy.
  • GET /api/audit/partition-plan returned 200 with version ≥ 1, populated services from XML bootstrap, and topicPartitionCount equal to kafka-topics --describe ranger_audits.
  • Ingestor logs confirmed PartitionPlanWatcher started and the partitioner loaded the in-memory plan.

4. Simplified REST API — onboard, validation, scale

All mutations used expectedVersion from the preceding GET.

Negative validation (new behavior)

  • POST /api/audit/partition-plan/plugins with pluginId, partitionCount, and expectedVersion but omitting services → 400 Bad Request with a message indicating that services are required. This was the primary regression guard for the API consolidation.
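
The negative case above can be sketched as a small validation routine. This is a minimal illustration, not the PR's PartitionPlanRequestValidator: the class name, method signature, and error strings are hypothetical, and the rules other than "services must be present and non-empty" are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and messages are illustrative, not the PR's code.
final class OnboardRequestCheck {
    private OnboardRequestCheck() {}

    /** Returns validation errors; an empty list means the request would pass. */
    static List<String> validate(String pluginId, int partitionCount, Long expectedVersion,
                                 Map<String, List<String>> services) {
        List<String> errors = new ArrayList<>();
        // Assumed rule: a plugin must be named.
        if (pluginId == null || pluginId.trim().isEmpty()) {
            errors.add("pluginId is required");
        }
        // Assumed rule: at least one dedicated partition is requested.
        if (partitionCount < 1) {
            errors.add("partitionCount must be >= 1");
        }
        // Assumed rule: mutations carry the version read by the preceding GET.
        if (expectedVersion == null) {
            errors.add("expectedVersion is required");
        }
        // Rule stated in the PR: the services map (repo -> allowedUsers) is mandatory and non-empty.
        if (services == null || services.isEmpty()) {
            errors.add("services is required and must be non-empty");
        }
        return errors;
    }
}
```

A REST layer built on such a check would presumably map any non-empty error list to 400 Bad Request.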

Successful onboard

  • Onboarded a buffer-only plugin (e.g. storm or ambari) in a single call with a non-empty services map (repo → allowedUsers). Response 200; plan version incremented; plugin appeared under plugins with dedicated partition IDs taken from the buffer (or tail-grown when needed); corresponding repo entries appeared under services.

Multi-repo onboard in one version bump

  • Onboarded trino with two repos in one POST (dev_trino and dev_trino2, each with its own allowedUsers) → 200; both repos present in services with pluginId ownership tagged to trino.

Optimistic locking

  • Repeated onboard with a stale expectedVersion → 409 Conflict with the current plan in the response body.
  • Attempted to onboard hdfs again when it already had dedicated partitions → 400 (conflicting state).
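
The expectedVersion check can be illustrated with an in-memory compare-and-bump. This is only a sketch under simplifying assumptions: the real plan lives in a compacted Kafka topic and is shared by several replicas, whereas VersionedPlanSketch (a hypothetical name) keeps a single counter in one JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory stand-in for the versioned plan; not the PR's PartitionPlanService.
final class VersionedPlanSketch {
    static final int OK = 200;
    static final int CONFLICT = 409;

    // Assumed starting version; the PR only states that a bootstrapped plan has version >= 1.
    private final AtomicLong version = new AtomicLong(1);

    long currentVersion() {
        return version.get();
    }

    /** Applies a mutation only if the caller read the latest version; otherwise reports a conflict. */
    int mutate(long expectedVersion) {
        // compareAndSet makes the check and the version bump one atomic step.
        boolean applied = version.compareAndSet(expectedVersion, expectedVersion + 1);
        return applied ? OK : CONFLICT;
    }
}
```

A caller that receives the conflict status would re-read the plan with GET and retry with the fresh version.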

Scale after onboard

  • PATCH /api/audit/partition-plan/plugins/{pluginId} with additionalPartitions → 200; new tail partition IDs were appended to the plugin's assignment (append-only, existing IDs unchanged); ranger_audits was grown via AdminClient when required; subsequent GET showed stable version and layout.
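
Append-only scaling can be sketched as follows. The sketch assumes the new IDs always come from growing the topic tail; the PR notes that the topic is grown only "when required", so the actual allocator may choose partitions differently. TailGrowSketch and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of append-only scaling; not the PR's PartitionPlanAllocator.
final class TailGrowSketch {
    private TailGrowSketch() {}

    /**
     * Returns the plugin's new assignment: existing IDs keep their positions,
     * and `additional` new IDs are appended starting at the current topic partition count.
     */
    static List<Integer> scale(List<Integer> current, int topicPartitionCount, int additional) {
        if (additional < 1) {
            throw new IllegalArgumentException("additional must be >= 1");
        }
        List<Integer> updated = new ArrayList<>(current);
        for (int i = 0; i < additional; i++) {
            // Kafka partition IDs are zero-based, so the first new ID equals the old count.
            updated.add(topicPartitionCount + i);
        }
        return updated;
    }
}
```

Because existing IDs are never moved or removed, other plugins' assignments stay untouched by a scale operation.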

Idempotency check

  • After mutations, GET /api/audit/partition-plan without restart showed the same version and layout as the last successful write.

5. End-to-end plugin flows (allowlist + routing)

These tests prove the full path: registry onboard → allowlist enforcement → Kafka produce → correct partition assignment.

HDFS

  1. Onboarded plugin hdfs with repo dev_hdfs and allowlist hdfs,nn via POST .../plugins (mandatory services).
  2. From the Hadoop container, posted a test audit batch to POST /api/audit/access?serviceName=dev_hdfs&appId=hdfs using the hdfs Kerberos principal → 200; authenticatedUser mapped to short name hdfs.
  3. Consumed the corresponding record from ranger_audits and verified the partition number was in the hdfs assignment list from the plan (not the buffer pool).
  4. Optionally scaled hdfs with PATCH .../plugins/hdfs and repeated the access + Kafka partition check — routing still respected the updated plan.
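
The partition check in step 3 relies on routing that prefers a plugin's dedicated partitions and falls back to the buffer pool. A minimal sketch of that idea follows; the hash-based choice within a pool and the use of appId as the plugin key are assumptions, and RoutingSketch is not the PR's AuditPartitioner.

```java
import java.util.List;
import java.util.Map;

// Hypothetical routing sketch over an in-memory plan; selection within a pool is an assumption.
final class RoutingSketch {
    private RoutingSketch() {}

    /** Picks a partition from the plugin's dedicated set, or from the buffer pool if none exists. */
    static int route(Map<String, List<Integer>> plugins, List<Integer> buffer,
                     String appId, String key) {
        List<Integer> pool = (appId == null) ? buffer : plugins.getOrDefault(appId, buffer);
        if (pool.isEmpty()) {
            throw new IllegalStateException("no partitions available for " + appId);
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(key.hashCode(), pool.size());
        return pool.get(index);
    }
}
```

With a plan that maps hdfs to a dedicated list, every record routed for hdfs lands inside that list, which is the property the Kafka consumer check verifies.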

Hive

  1. Onboarded hiveServer2 with repo dev_hive and allowlist ["hive"] in the same onboard POST.
  2. From the Hive container, posted audits with the hive principal → 200.
  3. Kafka record landed on a partition in the hiveServer2 dedicated set (e.g. partitions [7, 8] after prior lab mutations).
  4. Confirmed ingestor logs showed auth_to_local rules recomposed after the onboard (allowlist union updated).

HDFS already onboarded path

  • Where hdfs was already present in the plan from an earlier run, the lab skipped re-onboard and verified allowlist + routing still held: access accepted, partition ∈ plan.

6. Allowlist behavior (authorization layer)

Separate from partition routing:

  • A principal in services[repo].allowedUsers (after auth_to_local) → 200 on /access.
  • After tightening allowlist via PATCH .../plugins/{pluginId} with updateServices to remove the principal → 403 on the same POST.
  • Restoring the allowlist → 200 again.
  • Posting audits claiming a different repo than the principal is allowed for → 403 (cross-repo denial).

This confirms the unified services map in the registry drives authorization without XML restart.
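
The allowlist behavior above amounts to a lookup of the short user name in the repo's allowed set. The sketch below assumes auth_to_local mapping has already produced the short name and that an unknown repo is denied; AllowlistSketch is a hypothetical name, not a class from the PR.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical authorization sketch over the services map (repo -> allowedUsers).
final class AllowlistSketch {
    static final int OK = 200;
    static final int FORBIDDEN = 403;

    private AllowlistSketch() {}

    /** Returns 200 if the short name is allowed for the repo, otherwise 403. */
    static int check(Map<String, Set<String>> services, String repo, String shortName) {
        if (repo == null || shortName == null) {
            return FORBIDDEN;
        }
        // Assumption: a repo missing from the plan is denied rather than allowed by default.
        Set<String> allowed = services.get(repo);
        return (allowed != null && allowed.contains(shortName)) ? OK : FORBIDDEN;
    }
}
```

Swapping in an updated services map, as a plan update would, changes the outcome on the next request without any restart in this model.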


7. What did not change

| Area | Observation |
| --- | --- |
| Audit spool / recovery | Kafka produce failures still spool to per-pod local files; retry path unchanged. Dynamic mode does not alter recovery semantics. |
| Solr / HDFS dispatchers | No reconfiguration needed; consumers rebalance when ranger_audits grows. |
| Static mode | No plan topic usage; no watcher; partition-plan REST returns 503. |

8. Summary of manual test outcomes

| Scenario | Result |
| --- | --- |
| Static mode regression | Pass |
| Dynamic enable + plan bootstrap | Pass |
| Onboard without services → 400 | Pass |
| Onboard with mandatory services | Pass |
| Multi-repo single onboard | Pass |
| Stale version → 409 | Pass |
| Duplicate plugin → 400 | Pass |
| PATCH scale (append-only) | Pass |
| HDFS: access + Kafka partition ∈ plan | Pass |
| Hive: onboard + access + Kafka partition ∈ plan | Pass |
| Allowlist toggle via PATCH update | Pass (where exercised) |

…gestor: runtime Kafka partition routing and per-repo service allowlists via compacted topic + REST, without ingestor restarts. Feature flag default off.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ramackri ramackri requested review from mneethiraj and rameeshm June 23, 2026 13:43
ramk and others added 3 commits June 23, 2026 19:18
Use hdfs-only allowlist for dev_hdfs, remove unused dev_solr allowlist
entry, fix buffer partition example math, and add detailed manual test
documentation for PR apache#1032.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep dev_solr service allowlist property (remove only the stray blank line
the feature commit added). Retain hdfs-only dev_hdfs allowlist and buffer
partition example fix. Remove dev-support/RANGER-5655-PR-TEMPLATE.md.

Co-authored-by: Cursor <cursoragent@cursor.com>
Correct import order, remove unused import, use static requireNonNull,
drop duplicate test import, and align PartitionPlan imports with
checkstyle rules reported on PR apache#1032.

Co-authored-by: Cursor <cursoragent@cursor.com>
ramk and others added 5 commits June 23, 2026 21:20
…n layout.

Ship the standard 14-plugin lab list in ranger-audit-ingestor-site.xml with
dynamic partition plan disabled by default; update buffer partition example
to 14 × 3 + 9 = 51 total.

Co-authored-by: Cursor <cursoragent@cursor.com>
Consolidate partition-plan mutations into three endpoints: GET plan,
POST onboard plugin (mandatory non-empty services map), and PATCH update
plugin. Remove PATCH /partition-plan and POST /services. Add validator
and E2E coverage for mandatory services on onboard.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep REST simplification to Java sources and unit tests only.

Co-authored-by: Cursor <cursoragent@cursor.com>
Drop unused PromotePlugin, OnboardService, PluginScale, and
PartitionPlanReplacement after REST API consolidation. Cache partition-plan
admin users and dynamic.enabled flag in PartitionPlanService constructor.
Refactor partition-plan helpers and AuditREST partition-plan paths to
match Ranger review style with one return statement per method.