
AVRO-4247: Enforce decompression size limits #3745

Open

steveloughran wants to merge 7 commits into apache:main from steveloughran:pr/java-codec-decompression

Conversation

@steveloughran

What is the purpose of the change

#3625 with size limit checks moved into the NonCopyingByteArrayOutputStream

  • guarantees that all decompressors are covered
  • will make writing a test trivial

There's a new constructor on NonCopyingByteArrayOutputStream that sets a size limit (or no limit), and the default constructor now automatically picks up the limit from the system property, falling back to the default (sketched below).

Those choices could be discussed, with options being

  • a static function to get a "size-limited output stream"
  • move the fetch and parse of the system property into org.apache.avro.SystemLimitException, where the int parser lives.
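
For illustration, a rough sketch of the constructor shape described above. The signatures match the diff excerpts quoted later in this thread; the limit field, the write override, and the exception type/message (taken from the commit description below) are illustrative rather than the exact patch:

public class NonCopyingByteArrayOutputStream extends ByteArrayOutputStream {

  // illustrative field name; holds the configured limit, or -1 for no limit
  private final long limit;

  // default constructor: picks up the limit from the system property / fallback default
  public NonCopyingByteArrayOutputStream(int size) {
    this(size, MAX_DECOMPRESS_LENGTH);
  }

  // explicit limit, or -1 for no limit
  public NonCopyingByteArrayOutputStream(final int size, final long limit) {
    super(size);
    this.limit = limit;
  }

  // the size check on the write path; write(int) would need the same guard
  @Override
  public synchronized void write(byte[] b, int off, int len) {
    if (limit >= 0 && size() + len > limit) {
      throw new AvroRuntimeException("Decompressed size " + (size() + len)
          + " (bytes) exceeds maximum allowed size " + limit);
    }
    super.write(b, off, len);
  }
}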

AI: No AI was used for this PR.

Verifying this change

Needs tests. If people are happy with the design, I can put one in whichever module people would prefer...it's pretty straightforward.

Documentation

  • Does this pull request introduce a new feature? (yes / no)
    yes

  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
    javadocs

OwenSanzas and others added 6 commits January 14, 2026 10:31
…b DoS

Add maximum decompression size limit in DeflateCodec to prevent
OutOfMemoryError when processing maliciously crafted Avro files
with high compression ratios (decompression bombs).

The limit defaults to 200MB and can be configured via system property:
org.apache.avro.limits.decompress.maxLength
….java


Thanks!

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
….java

Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
- Move MAX_DECOMPRESS_LENGTH initialization to static block (read once at class load)
- Add WARNING log for invalid property values (NumberFormatException)
- Validate negative and zero values, reject with warning
- Add "(bytes)" to error message for clarity
- Add quotes around property name in error message

Test command:
java -Xmx64m -Dorg.apache.avro.limits.decompress.maxLength=1048576 \
  -jar avro-tools-1.13.0-SNAPSHOT.jar tojson poc.avro

Expected behavior:
Exception in thread "main" org.apache.avro.AvroRuntimeException:
Decompressed size 1056768 (bytes) exceeds maximum allowed size 1048576.
This can be configured by setting the system property 'org.apache.avro.limits.decompress.maxLength'
Change-Id: Ib24c52cdf3234a3805628041946b229b221383ad
* Automatically available to all codecs
* Does need an explicit constructor with no limit, used in DataFileWriter
* No tests, though that new constructor makes it trivial

Note: merged in main as DataFileWriter changes would otherwise stop merging
Change-Id: Ifc5b8921a00425df331a4889472b3e78c6677bde
@github-actions bot added the Java (Pull Requests for Java binding) label Apr 28, 2026
@steveloughran changed the title from [AVRO-4247] decompresson size limits to [AVRO-4247] enforce decompression size limits Apr 28, 2026
private static final long MAX_DECOMPRESS_LENGTH;

static {
String prop = System.getProperty(MAX_DECOMPRESS_LENGTH_PROPERTY);
Author

could move to SystemLimitException, as that's where the int equivalent lives.

Member
@iemejia Apr 30, 2026

Yes, MAX_DECOMPRESS_LENGTH_PROPERTY makes more sense in the SystemLimitException class
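
For context, a sketch of what that shared helper could look like once moved. The real method added later in this PR is getLongLimitFromProperty, whose exact body is not shown in this thread, so the validation and logging below simply follow the commit description above, and the SLF4J LOG field is assumed:

  // Sketch only: parse a long-valued limit from a system property, warning and
  // falling back to the default on missing, non-numeric, zero, or negative values.
  static long getLongLimitFromProperty(String property, long defaultValue) {
    String prop = System.getProperty(property);
    if (prop == null) {
      return defaultValue;
    }
    try {
      long value = Long.parseLong(prop);
      if (value <= 0) {
        LOG.warn("Ignoring non-positive value {} for '{}', using default {}", value, property, defaultValue);
        return defaultValue;
      }
      return value;
    } catch (NumberFormatException e) {
      LOG.warn("Could not parse value '{}' for '{}', using default {}", prop, property, defaultValue);
      return defaultValue;
    }
  }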

* @throws IllegalArgumentException if size is negative
*/
public NonCopyingByteArrayOutputStream(int size) {
this(size, MAX_DECOMPRESS_LENGTH);
Author

This does change the default operation. Apart from DataFileWriter, it is only ever used in decompressors.

Options

  • change the default (here)
  • change the code that uses it to take a limit
  • a private two-arg ctor and a public static creator method

Member
@iemejia left a comment

Thanks @steveloughran!

The changes look good but we need to add tests that validate that the fix works, including:

  • Compressing data whose decompressed size exceeds a small configured decompression limit (and checking that decompression is rejected).
  • Decompressing data exactly equal to the limit.

There is also another issue: the trevni module has the same unenforced decompression size limit problem, so we need to get a copy of NonCopyingByteArrayOutputStream working in that module. I suggest a copy because trevni does not depend on avro core by design; it is independent.

I was also wondering if we should be extra cautious and add some tests in TestAllCodecs that ensure all compression types are covered, but maybe that's too much.
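
As a rough illustration of the kind of test being asked for here, assuming the two-argument (size, limit) constructor shown further down in this thread, and the exception type and message quoted from the commit description (both may differ in the final patch):

  import static org.junit.jupiter.api.Assertions.assertThrows;

  import org.apache.avro.AvroRuntimeException;
  import org.junit.jupiter.api.Test;

  // assumes the test lives in the same package as NonCopyingByteArrayOutputStream
  class TestNonCopyingByteArrayOutputStreamLimit {
    @Test
    void writeBeyondLimitIsRejected() {
      // tiny limit of 8 bytes for the test
      NonCopyingByteArrayOutputStream out = new NonCopyingByteArrayOutputStream(16, 8);
      out.write(new byte[8], 0, 8); // exactly at the limit: allowed
      // one more byte pushes past the limit and must be rejected
      assertThrows(AvroRuntimeException.class, () -> out.write(new byte[1], 0, 1));
    }
  }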


* @param size buffer capacity
* @param limit size limit or -1 for no limit.
*/
public NonCopyingByteArrayOutputStream(final int size, final long limit) {
Member

Make it private; since nobody is using it, it is probably better not to encourage its use.

Author

I'll go for package-private, plus a test in this module with a tiny limit, verifying that all write ops get rejected.

@iemejia changed the title from [AVRO-4247] enforce decompression size limits to AVRO-4247: Enforce decompression size limits Apr 30, 2026
* Tests
* Option read moved to SystemLimitException

Change-Id: I01d4203ad65e0e09dde88b33f603879323b06425
@steveloughran
Author

@iemejia ok, pushed out test and the move of the property read.

If you are happy with this, I can move a copy of the stream + test + exception into the trevni module.

* parsable as an int
* @return The value from the system property
*/
private static long getLongLimitFromProperty(String property, long defaultValue) {
@RyanSkraba
Contributor

Thanks for taking this up! The bot made a good comment about parameters, simple mistake to fix then this LGTM. This needs to be cherry-picked to 1.12.2 after merge of course.

Also thanks for keeping the contributor history ❤️ Good collaboration!

@iemejia
Member

iemejia commented May 1, 2026

> @iemejia ok, pushed out test and the move of the property read.
>
> If you are happy with this, I can move a copy of the stream + test + exception into the trevni module.

Thanks a lot @steveloughran! There is an issue I did not notice during the previous review :S The SnappyCodec does not use NonCopyingByteArrayOutputStream, so we probably need to refactor SnappyCodec to use it; that way all the compression codecs will be covered.

Another doubt I have:
The DataFileWriter also uses NonCopyingByteArrayOutputStream (for the compression/write path), so the 200 MB limit now also applies to writes. This could potentially be a problem for users writing very large blocks. I am less familiar with the avro-mapred side, so we need to check whether this could be an issue there too.
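
For illustration only, and not the NonCopyingByteArrayOutputStream refactor proposed above: one lightweight alternative for SnappyCodec would be to check the advertised uncompressed length before allocating. This sketch assumes xerial snappy-java's Snappy.uncompressedLength and Snappy.uncompress; the maxLength parameter and the helper itself are placeholders, not the actual SnappyCodec code:

  // Sketch: reject a crafted Snappy block whose claimed uncompressed size exceeds the
  // limit before allocating the output array.
  static byte[] boundedSnappyUncompress(byte[] compressed, long maxLength) throws IOException {
    int uncompressedLength = Snappy.uncompressedLength(compressed, 0, compressed.length);
    if (maxLength >= 0 && uncompressedLength > maxLength) {
      throw new AvroRuntimeException("Decompressed size " + uncompressedLength
          + " (bytes) exceeds maximum allowed size " + maxLength);
    }
    byte[] uncompressed = new byte[uncompressedLength];
    Snappy.uncompress(compressed, 0, compressed.length, uncompressed, 0);
    return uncompressed;
  }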

@steveloughran
Author

The DataFileWriter was why my initial PR had the -1 'no limit' option visible. If we make that visible and call it from there: no problem.

I will look at SnappyCodec
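
For illustration, the idea is just that the write-path buffer opts out via that explicit constructor; the variable name and sizing below are placeholders, not the actual DataFileWriter code:

  // the writer's block buffer holds uncompressed records it produced itself, so there is
  // no decompression-bomb risk and it can use the explicit no-limit (-1) constructor
  NonCopyingByteArrayOutputStream buffer =
      new NonCopyingByteArrayOutputStream(initialBlockBufferSize, -1);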

@iemejia
Member

iemejia commented May 1, 2026

Ok, my dear Claude says this (which I don't know if it should give us confidence, but it somehow makes sense):

The DataFileWriter.buffer is used to accumulate raw uncompressed records before they're flushed as a block. The default sync interval is 64 KB, so normally this buffer only grows to ~64 KB. But users can set syncInterval up to 1 GB. If someone sets a sync interval > 200 MB, the write would fail.
However, there's a subtlety — this buffer goes through buffer.reset() at DataFileWriter.java:434 after each block flush, so it never accumulates more than one block's worth of data. The question is whether any real-world use case writes blocks > 200 MB.
Your concern is valid but limited in scope. Here's the precise answer:

No, this will NOT be an issue for typical Avro MapReduce usage. Here's why:

  1. Default sync interval is 64 KB (4000 * 16). The DataFileWriter.buffer is flushed whenever accumulated data reaches this threshold, so it never grows beyond ~80 KB in the default case.
  2. MapReduce defaults are the same. AvroOutputFormat uses DEFAULT_SYNC_INTERVAL (64 KB) unless explicitly overridden via avro.mapred.sync.interval.
  3. Even with a large sync interval, you'd need > 200 MB per block. The setSyncInterval API allows up to 1 GB, but the limit only fires if a single uncompressed block exceeds 200 MB. Most production jobs use intervals of 64 KB to a few MB.
  4. The compress() path is safe. Compressed output is always <= uncompressed input, so the NonCopyingByteArrayOutputStream used inside DeflateCodec.compress() etc. won't hit the limit.
Where it COULD be an issue:

  • Users who set syncInterval to a very large value (> 200 MB) to optimize for sequential read throughput — rare, but possible in batch analytics.
  • The appendAllFrom() path calls decompressUsing() then compressUsing() — the decompression step would correctly be bounded (intended behavior), but this means files with blocks > 200 MB can't be recompressed between codecs.
  • SortedKeyValueFile sets sync interval to 1 << 20 (1 MB) — safe.

The 200 MB default is generous enough that this won't affect MapReduce workloads in practice. The Avro default block size is 64 KB, Hadoop ecosystem tools typically use 1-16 MB, and even aggressive configurations rarely exceed 128 MB per block. The edge case would be someone who explicitly set syncInterval > 200 MB AND writes records that actually fill a single block to that size — an extremely uncommon configuration.

If you want to be safe, the system property org.apache.avro.limits.decompress.maxLength can be increased or set to -1 to disable the limit entirely for trusted environments.


Labels

Java (Pull Requests for Java binding)
