feat: Offline Store historical features retrieval based on datetime range in Ray #5738
base: master
Conversation
…ange in Ray
Signed-off-by: Aniket Paluskar <[email protected]>
jyejare left a comment:
Looks good initially, but I have some doubts. Tests also need to be added.
return pa.Table.from_pandas(df).schema
...
def _compute_non_entity_dates_ray(
I think we should make this a common utility function so it can be reused across all stores without repeating the code.
wdyt?
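To illustrate the suggestion, here is a minimal sketch of what a shared helper might look like, assuming its job is to validate and normalize the date-only retrieval window (the name `compute_retrieval_range` and its signature are hypothetical, not part of this PR):

```python
from datetime import datetime
from typing import Optional, Tuple


def compute_retrieval_range(
    start_date: Optional[datetime],
    end_date: Optional[datetime],
) -> Tuple[datetime, datetime]:
    """Validate and normalize a date-only retrieval window.

    Placed in a shared module so each offline store (Ray, Spark,
    Postgres, ...) does not re-implement the same checks.
    """
    if start_date is None or end_date is None:
        raise ValueError(
            "start_date and end_date are required when entity_df is None"
        )
    if start_date > end_date:
        raise ValueError("start_date must not be after end_date")
    return start_date, end_date
```

Each store-specific function (like `_compute_non_entity_dates_ray`) could then delegate to the shared helper and keep only the backend-specific parts.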
return _filter_range
...
def _make_select_distinct_keys(join_keys: List[str]):
I think we should not drop rows with duplicate IDs, because there can be multiple transactions per ID, and we need to choose the row based on timestamp when joining the columns from another table/view. I think this is the same case as in your Spark PR.
Please check the Postgres implementation to understand the case.
Or am I misreading this?
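To make the concern concrete: instead of dropping duplicate IDs, a point-in-time join should pick, per join key, the latest feature row at or before the entity's timestamp. A minimal pandas sketch of that behavior (the column names `driver_id` and `amount` are illustrative, not from this PR):

```python
import pandas as pd

# Feature rows: multiple transactions per driver_id (duplicate keys).
features = pd.DataFrame({
    "driver_id": [1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-02"]
    ),
    "amount": [10.0, 30.0, 20.0],
})

# Entity rows carrying the timestamp we must join "as of".
entities = pd.DataFrame({
    "driver_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2024-01-02", "2024-01-05"]),
})

# Point-in-time join: for each entity row, take the latest feature
# row whose timestamp is <= the entity timestamp, per join key.
result = pd.merge_asof(
    entities.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
# driver 1 @ 2024-01-02 -> amount 10.0 (the 2024-01-03 row is in the future)
# driver 2 @ 2024-01-05 -> amount 20.0
```

Dropping the duplicate `driver_id=1` rows up front would lose the information needed to make this timestamp-based choice.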
Testing the case after the discussion.
What this PR does / why we need it:
Add support for entity_df=None in RayOfflineStore.get_historical_features with start_date/end_date.
-- Derives the entity set by reading distinct join keys from each FeatureView source within the time window, applies field mappings and join_key_map, filters by timestamp, and unions the aligned schemas.
-- Adds a stable event_timestamp = end_date for point-in-time (PIT) joins.
Signature change: get_historical_features now accepts entity_df: Optional[Union[pd.DataFrame, str]] and **kwargs.
-- Why: to match the base interface and support date-only retrieval.
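The entity-set derivation described above can be sketched in pandas as follows. This is an illustrative analogue of the Ray-based logic, not the PR's actual code; the function name `derive_entity_df` and the `(DataFrame, join_keys, ts_col)` source tuples are assumptions, and it assumes all sources share aligned join keys:

```python
from datetime import datetime
from typing import List, Tuple

import pandas as pd


def derive_entity_df(
    sources: List[Tuple[pd.DataFrame, List[str], str]],
    start_date: datetime,
    end_date: datetime,
) -> pd.DataFrame:
    """Union the distinct join keys observed in each source within
    [start_date, end_date] into a synthetic entity_df."""
    parts = []
    for df, join_keys, ts_col in sources:
        # Filter each source to the requested time window.
        in_window = df[(df[ts_col] >= start_date) & (df[ts_col] <= end_date)]
        # Keep only the distinct join-key combinations.
        parts.append(in_window[join_keys].drop_duplicates())
    entity_df = pd.concat(parts, ignore_index=True).drop_duplicates()
    # Stable event_timestamp = end_date, so the PIT join selects the
    # latest feature row at or before the end of the window.
    entity_df["event_timestamp"] = end_date
    return entity_df
```

Setting every derived entity row's event_timestamp to end_date keeps the subsequent point-in-time join deterministic: each key resolves to its latest feature values within the window.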
Which issue(s) this PR fixes:
RHOAIENG-38643
Misc