feat: expose arrow schema on async avro reader #9534
Add a schema method to obtain the Arrow schema from the async Avro reader.
Tests on alltypes_plain.avro
Add metadata on fields of nested records and the list type, so that the expected schema matches the one produced by the reader. Add a test reading nested_records.avro to verify the schema exposed by the reader.
alamb left a comment
Thanks @mzabaluev -- the code looks good I just had some test comments
cc @jecsand838
#[tokio::test]
async fn test_arrow_schema_from_reader_no_reader_schema() {
    // Use a very small header size hint to force multiple fetches
these comments seem out of date
Sorry, I should have cleaned these up. Done now.
let location = Path::from_filesystem_path(&file).unwrap();
let file_size = store.head(&location).await.unwrap().size;

let file_reader = AvroObjectReader::new(store, location);
Could you also reduce some of the duplication in this test so that it is easier to understand what is actually being tested and what is different between the tests?
I have added clarifying comments explaining the purpose and differences of code in each of the added cases. Hope this helps.
#[tokio::test]
async fn test_arrow_schema_from_reader_with_reader_schema() {
    // Use a very small header size hint to force multiple fetches
likewise this comment seems outdated
Remove copy-pasted comments that don't apply to the new tests. In the test with the reader schema, update the test to use a projected schema and verify that the reader schema is applied correctly. Add comments explaining the expectations for each test case.
017c28a to 4b39bef
alamb left a comment
Thanks -- looks good to me
🚢 🇮🇹
Rationale for this change
Exposes the Arrow schema produced by the async Avro file reader, similarly to the `schema` method on the synchronous reader.
This allows an application to prepare casting or other schema transformations with no need to fetch the first record batch to learn the produced Arrow schema. Since the async reader only parses OCF content for the moment, the schema does not change from batch to batch.
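As a rough illustration of the "prepare casting up front" use case, here is a minimal sketch that compares a schema known before any batch is read against the schema the application wants, and lists the columns that would need a cast. It deliberately uses simplified stand-in types: `DataType`, `Field`, `CastStep`, `plan_casts`, and `field` are hypothetical names made up for this example and are not the arrow-rs or arrow-avro API. In a real application the produced fields would presumably come from the reader's `schema` method.

```rust
// Simplified stand-ins for Arrow types (assumption: NOT the arrow-rs API).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

/// One planned conversion for a single column.
#[derive(Debug, Clone, PartialEq)]
struct CastStep {
    column: String,
    from: DataType,
    to: DataType,
}

/// Hypothetical convenience constructor for a field.
fn field(name: &str, data_type: DataType) -> Field {
    Field {
        name: name.to_string(),
        data_type,
    }
}

/// Hypothetical helper: given the fields the reader says it will produce and
/// the fields the application wants, list the columns whose types differ.
/// Columns missing from `produced` are skipped rather than reported.
fn plan_casts(produced: &[Field], wanted: &[Field]) -> Vec<CastStep> {
    wanted
        .iter()
        .filter_map(|w| {
            let p = produced.iter().find(|p| p.name == w.name)?;
            (p.data_type != w.data_type).then(|| CastStep {
                column: w.name.clone(),
                from: p.data_type.clone(),
                to: w.data_type.clone(),
            })
        })
        .collect()
}
```

Because the plan is computed from the schema alone, it can be built once before the first batch is fetched and then applied to every batch, which relies on the schema staying the same from batch to batch as described above.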
What changes are included in this PR?
The `schema` method for `AsyncAvroFileReader` exposes the Arrow schema of record batches that are produced by the reader.
Are these changes tested?
Added tests verifying that the returned schema matches the expected schema.
Are there any user-facing changes?
Added a `schema` method to `AsyncAvroFileReader`.