QDB-10908 - Add Arrow query API support to Python API #108
Conversation
tests/conftest.py (Outdated)

```python
):

index = pd.Index(
    # pd.date_range(start_date, periods=row_count, freq="s"), name="$timestamp"
```
```
tests/test_numpy.py: 634 warnings
D:\work\quasar\qdb-api-python\tests\conftest.py:685: FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead.
  pd.date_range(start_date, periods=row_count, freq="S"), name="$timestamp"
```
tests/conftest.py (Outdated)

```python
return request.param


# @pytest.fixture(params=["s"], ids=["frequency=s"])
```
same
solatis left a comment
I think this is good as a demonstration of the concept, but it lacks proper integration with the existing abstractions and, even more importantly, doesn't handle object/pointer ownership correctly between Python and C++.
I think things can/should be simplified before merging this into master.
Is the intention for this to become the "standard" API? Did we run a benchmark comparing the performance of this insertion/reading method with the old one?
Is this intended only for the bulk reader, or will the query API be adapted as well?
I'm happy to pick things up from here!
```cpp
std::vector<any_column> _columns;

std::vector<qdb_exp_batch_push_column_t> _columns_data;
std::vector<const char *> _duplicate_ptrs;
```
This appears very fragile; what is the "ownership" of `_duplicate_ptrs`? Who "owns" the `const char *`, and when does it go out of scope?
What I do to circumvent this (see the sketch below) is:
- Copy all data into a `staged_table`, whose lifetime is coupled with the batch writer session (i.e. it outlives the Python function invocation);
- Then use pointers into the copied object.
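For illustration, a minimal sketch of that pattern. The `staged_table` name comes from the comment above; the member and method names are hypothetical:

```cpp
#include <string>
#include <vector>

// Hypothetical staged_table: owns deep copies of the incoming strings, so any
// const char * handed to the C API points into storage whose lifetime is tied
// to the writer session, not to a single Python function invocation.
class staged_table
{
public:
    void stage_columns(std::vector<std::string> const & columns)
    {
        // Deep-copy into owned storage.
        columns_ = columns;

        // Rebuild the pointer view; these pointers target *our* copies and
        // stay valid as long as columns_ is not mutated again.
        ptrs_.clear();
        ptrs_.reserve(columns_.size());
        for (auto const & c : columns_)
        {
            ptrs_.push_back(c.c_str());
        }
    }

    // Stable for the lifetime of this staged_table; no release() or manual
    // delete required.
    const char * const * column_ptrs() const noexcept
    {
        return ptrs_.data();
    }

private:
    std::vector<std::string> columns_;
    std::vector<const char *> ptrs_;
};
```

Because the staged table is a member of the writer, the C API only ever borrows pointers whose owner outlives the push.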
```diff
     std::vector<std::string> const & columns,
-    qdb_exp_batch_push_table_t & out)
+    qdb_exp_batch_push_table_t & out,
+    std::vector<const char *> & duplicate_ptrs)
```
Why have 2 different output vectors? This seems like a fragile design.
In the original code you create a `unique_ptr`:

```cpp
auto where_duplicate = std::make_unique<char const *[]>(columns.size());
```

and then hand ownership of the C++ object over to a raw C pointer:

```cpp
out.where_duplicate = where_duplicate.release();
```

where the target field is declared as:

```cpp
const char ** where_duplicate;
```

Is this expected by design? Is this memory leak intentional?
It makes the ownership transfer a bit more explicit (the original `unique_ptr` is inaccessible after the `release()` call), but I agree it's not pretty. I'll make it prettier.
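One way to make it prettier, sketched under the assumption that the writer has a long-lived state object (names below are illustrative, not the actual fix in this PR): keep the allocation as a member instead of `release()`-ing it into the C struct, so ownership stays obvious and nothing leaks.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical long-lived writer state.
struct writer_state
{
    std::unique_ptr<char const *[]> where_duplicate_storage;
};

// Builds the C-style array without transferring ownership to a raw pointer.
const char ** make_where_duplicate(
    writer_state & state, std::vector<const char *> const & columns)
{
    state.where_duplicate_storage = std::make_unique<char const *[]>(columns.size());
    for (std::size_t i = 0; i < columns.size(); ++i)
    {
        state.where_duplicate_storage[i] = columns[i];
    }

    // The C struct merely borrows this pointer; the unique_ptr member frees
    // the array automatically when the writer state is destroyed.
    return state.where_duplicate_storage.get();
}
```

The caller would assign the returned pointer to `out.where_duplicate`; since `writer_state` outlives the push, nothing dangles and nothing leaks.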
What's the purpose of introducing an entirely new file? Wouldn't we want this to be in writer.cpp / writer.hpp?
```cpp
void exp_batch_push_arrow_with_options(handle_ptr handle,
    const std::string & table_name,
    const pybind11::object & reader,
    pybind11::kwargs args)
{
```
Is this a qdb-api-python-specific function, or something from our C API? It doesn't integrate well with / match the design of the existing writer.hpp / writer.cpp, and seems much more "C-style".
It seems like most of this code is duplicated from elsewhere?
Why did you introduce a whole new file for this? This is only for development, not for actual dependencies / releasing?
I think we dropped win32 support, does that make things easier?
I don't understand this idea of accumulating the table column by column at the C level. That is only needed when we work with the qdb table object.
Batch writers operate on whole tables, and we can prepare the table at the Python level.
The problem concerns memory ownership, native numpy arrays, and differences in representation.
Let's say I have this code:
```python
writer = quasar.Writer()
for row in range(rows):
    df = process_row(row)
    writer.add(df)
writer.push()
```

The lifetime of `df` (and all numpy arrays underneath it) is only scoped to the `for` loop.
Additionally, timestamps and string data are represented in entirely different ways between our C API and Python.
Additionally, Python offers no guarantees at all about memory stability between function invocations: all memory can be moved around in between multiple invocations.
The solution employed in the APIs is:
- Upon invocation, copy data into a native representation which we "own" and is stable;
- We couple the lifetime of these data structures with the batch writer;
- In practice, this means it's a member object of our batch writer object.
- When invoking functions of our C API, we avoid copies and provide a pointer to our data structures instead.
It is possible that with PyArrow we could avoid the copies, since the representation of the data is the same between Python and our C API; but even then we must tell the Python GC that our batch writer object references the PyArrow objects, otherwise we'll run into ownership issues if e.g. Python decides to garbage collect them. I.e. this is tricky and needs to be considered carefully. You can read more about this here: https://pybind11.readthedocs.io/en/stable/advanced/functions.html, especially the section "additional call policies".
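For illustration only, a minimal pybind11 sketch (hypothetical class and module names) of the `keep_alive` call policy that page describes; it ties the argument's lifetime to the writer so the GC cannot collect the Arrow data while the writer still points at it:

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical writer that stashes pointers into Python-owned (e.g. PyArrow)
// buffers instead of copying them.
struct batch_writer
{
    void add(py::object table)
    {
        // ... record pointers into `table`'s underlying buffers ...
    }
};

PYBIND11_MODULE(example, m)
{
    py::class_<batch_writer>(m, "BatchWriter")
        .def(py::init<>())
        // keep_alive<1, 2>: keep argument 2 (the table) alive at least as
        // long as argument 1 (`self`, the writer), so Python's GC cannot
        // reclaim the Arrow buffers while the writer still points into them.
        .def("add", &batch_writer::add, py::keep_alive<1, 2>());
}
```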