
Conversation

@vikonix vikonix commented Nov 24, 2025

No description provided.

):

    index = pd.Index(
        # pd.date_range(start_date, periods=row_count, freq="s"), name="$timestamp"
Contributor:

What's this?

Contributor Author:

tests/test_numpy.py: 634 warnings
  D:\work\quasar\qdb-api-python\tests\conftest.py:685: FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead.
    pd.date_range(start_date, periods=row_count, freq="S"), name="$timestamp"

    return request.param


# @pytest.fixture(params=["s"], ids=["frequency=s"])
Contributor:

What's this?

Contributor Author:

Same as above.

@vikonix vikonix requested a review from solatis December 16, 2025 19:04
@solatis solatis left a comment


I think this is good as a demonstration of the concept, but it lacks proper integration with the existing abstractions and, even more importantly, doesn't handle object/pointer/ownership correctly between Python and C++.

I think things can/should be simplified before merging this into master.

Is the intention for this to become the "standard" API? Did we run a benchmark comparing the performance of this insertion/reading method with the old one?

Is this intended only for the bulk reader, or will the query API be adapted as well?

I'm happy to pick things up from here!

std::vector<any_column> _columns;

std::vector<qdb_exp_batch_push_column_t> _columns_data;
std::vector<const char *> _duplicate_ptrs;
Contributor:

This appears very fragile: what is the ownership model of _duplicate_ptrs? Who owns the const char * values, and when do they go out of scope?

What I do to circumvent this (see the sketch below) is:

  • Copy all data into a staged_table, whose lifetime is coupled with the batch writer session (i.e. it outlives the Python function invocation);
  • Then use pointers into the copied object.
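A minimal sketch of that pattern, with illustrative names (staged_table, _column_names, and column_name_ptrs are assumptions here, not the actual types from this PR):

    #include <string>
    #include <vector>

    // Owns copies of the incoming data; its lifetime is coupled with the
    // batch writer session, so it outlives any single Python call.
    struct staged_table
    {
        std::vector<std::string> _column_names; // owned, stable storage

        // Pointers derived from the owned strings; valid for as long as
        // this staged_table is alive and _column_names is not mutated.
        std::vector<const char *> column_name_ptrs() const
        {
            std::vector<const char *> out;
            out.reserve(_column_names.size());
            for (auto const & name : _column_names)
            {
                out.push_back(name.c_str());
            }
            return out;
        }
    };

Since the C API only ever sees pointers into storage the writer owns, nothing dangles once the Python-side objects are collected.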

 std::vector<std::string> const & columns,
-    qdb_exp_batch_push_table_t & out)
+    qdb_exp_batch_push_table_t & out,
+    std::vector<const char *> & duplicate_ptrs)
Contributor:

Why have 2 different output vectors? This seems like a fragile design.

Contributor Author:

In the original code you create a unique_ptr:

auto where_duplicate = std::make_unique<char const *[]>(columns.size());

and then hand ownership of the C++ object over to a raw C pointer:

out.where_duplicate = where_duplicate.release();

where the C struct field is declared as:

const char ** where_duplicate;

Is this expected by design? Is the memory leak intentional?

Contributor:

It makes the ownership transfer a bit more explicit (as the original unique_ptr is inaccessible after the release call), but I agree it's not pretty. I'll make it prettier.
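One hedged way to make it prettier (table_state is an illustrative name, and the C struct below is a stand-in showing only the field under discussion): keep the pointer array as a member of the object that fills the struct, so the C side merely borrows and there is nothing to release() or leak.

    #include <utility>
    #include <vector>

    // Stand-in for the qdb C API struct; only the relevant field is shown.
    struct qdb_exp_batch_push_table_t
    {
        const char ** where_duplicate;
    };

    struct table_state
    {
        std::vector<const char *> _where_duplicate; // owned storage

        void fill(qdb_exp_batch_push_table_t & out, std::vector<const char *> ptrs)
        {
            _where_duplicate = std::move(ptrs);
            // `out` only borrows; ownership stays with table_state.
            out.where_duplicate = _where_duplicate.data();
        }
    };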

Contributor:

What's the purpose of introducing an entirely new file? Wouldn't we want this to be in writer.cpp / writer.hpp?

Comment on lines +154 to +158
void exp_batch_push_arrow_with_options(handle_ptr handle,
    const std::string & table_name,
    const pybind11::object & reader,
    pybind11::kwargs args)
{
Contributor:

Is this a qdb-api-python-specific function, or something from our C API? It doesn't integrate well with, or match the design of, the existing writer.hpp / writer.cpp, and seems much more "C-style"?

Contributor:

It seems like most of this code is duplicated from elsewhere?

Contributor:

Why did you introduce a whole new file for this? This is only for development, not for actual dependencies / releasing?

I think we dropped win32 support, does that make things easier?

Contributor Author (@vikonix, Feb 9, 2026):

I don't understand the idea of accumulating the table column by column at the C level; that is only needed when we work with the qdb table object.

Batch writers operate on whole tables, and we can prepare the table at the Python level.

Contributor:

The problem concerns memory ownership, native numpy arrays, and differences in representation.

Let's say I have this code:

writer = quasar.Writer()
for row in range(rows):
    df = process_row(row)
    writer.add(df)
writer.push()

The lifetime of df (and of all numpy arrays underneath it) is scoped only to the for loop.

Additionally, timestamps and string data are represented in entirely different ways between our C API and Python.

Additionally, Python offers no guarantees at all about memory stability between function invocations: all memory can be moved around in between multiple invocations.

The solution employed in the APIs is:

  • Upon invocation, copy data into a native representation which we "own" and is stable;
  • We couple the lifetime of these data structures with the batch writer;
    • In practice, this means it's a member object of our batch writer object.
  • When invoking functions of our C API, we avoid copies and provide a pointer to our data structures instead.

With PyArrow it may be possible to avoid the copy, since the representation of the data is the same between Python and our C API; but even then we must tell the Python GC that our batch writer object references the PyArrow objects, otherwise we'll run into ownership issues if, e.g., Python decides to garbage-collect them. In other words, this is tricky and needs to be considered carefully. You can read more about this over here: https://pybind11.readthedocs.io/en/stable/advanced/functions.html, especially the section "Additional call policies".
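A minimal sketch of such a call policy, assuming a hypothetical pybind11-wrapped writer (the names are illustrative, not the actual qdb-api-python bindings): keep_alive<1, 2> ties the lifetime of the table argument to the writer, so the GC cannot collect it while the writer may still borrow its buffers.

    #include <pybind11/pybind11.h>
    namespace py = pybind11;

    struct writer
    {
        void add(py::handle table)
        {
            // Borrow pointers into `table`'s buffers here; the keep_alive
            // policy below guarantees Python will not collect `table`
            // while this writer object is alive.
        }
    };

    PYBIND11_MODULE(example, m)
    {
        py::class_<writer>(m, "Writer")
            .def(py::init<>())
            // keep_alive<1, 2>: keep argument 2 (the table) alive at
            // least as long as argument 1 (self, the writer).
            .def("add", &writer::add, py::keep_alive<1, 2>());
    }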

