-
Notifications
You must be signed in to change notification settings - Fork 44
Expand Dataset.from_files so it works properly with derived variables
#2777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…es_with_derived_vars
Co-authored-by: Bouwe Andela <[email protected]>
…es_with_derived_vars
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2777 +/- ##
==========================================
+ Coverage 95.41% 95.42% +0.01%
==========================================
Files 260 260
Lines 15409 15421 +12
==========================================
+ Hits 14703 14716 +13
+ Misses 706 705 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
valeriupredoi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another superb PR, Manu 🍻 - I really got nothing to raise here, but I'd defer a quick looksee to @bouweandela too, just one minor comment related to one test from me 🍻
|
@LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the |
Would this be ready to be merged @LisaBock ? Or another review from @bouweandela ? |
|
I would be happy to do a review, but will need some time as this is 1300 lines of new/changed code |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.
I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?
If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.
Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?
| return input_datasets | ||
|
|
||
| @property | ||
| def input_datasets(self) -> list[Dataset]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this to derived_from or something similar?
| return not copy.files | ||
|
|
||
|
|
||
| def _get_input_datasets(dataset: Dataset) -> list[Dataset]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function still needed now that the dataset provides these as an attribute?
| return input_datasets | ||
|
|
||
|
|
||
| def _representative_datasets(dataset: Dataset) -> list[Dataset]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems no longer needed either
| def input_datasets(self) -> list[Dataset]: | ||
| """Get input datasets. | ||
| For non-derived variables (i.e., those with facet ``derive=False``), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| For non-derived variables (i.e., those with facet ``derive=False``), | |
| For non-derived variables (i.e., those without a ``derive`` facet or with facet ``derive=False``), |
| all_datasets: list[list[tuple[dict, Dataset]]] = [] | ||
| for input_dataset in self._get_input_datasets(): | ||
| all_datasets.append([]) | ||
| for expanded_ds in self._get_available_datasets( | ||
| input_dataset, | ||
| ): | ||
| updated_facets = {} | ||
| for key, value in self.facets.items(): | ||
| if _isglob(value): | ||
| if key in expanded_ds.facets and not _isglob( | ||
| expanded_ds[key], | ||
| ): | ||
| updated_facets[key] = expanded_ds.facets[key] | ||
| new_ds = self.copy() | ||
| new_ds.facets.update(updated_facets) | ||
| new_ds.supplementaries = self.supplementaries | ||
|
|
||
| all_datasets[-1].append((updated_facets, new_ds)) | ||
|
|
||
| # Only consider those datasets that contain all input variables | ||
| # necessary for derivation | ||
| for updated_facets, new_ds in all_datasets[0]: | ||
| other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]] | ||
| if all(updated_facets in facets for facets in other_facets): | ||
| yield new_ds | ||
| else: | ||
| logger.debug( | ||
| "Not all necessary input variables to derive '%s' are " | ||
| "available for %s with facets %s", | ||
| self["short_name"], | ||
| new_ds.summary(shorten=True), | ||
| updated_facets, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is difficult to understand. I believe that what it intends to do, is yield a new dataset if the globs can be expanded in a similar way for all input datasets that are required to derive the dataset, did I get that right?
If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I have some concerns about how reliable it is too. What happens if some facets are different from one input dataset to another, e.g. institute or version?
| def _get_all_available_datasets(self) -> Iterator[Dataset]: # noqa: C901 | ||
| """Yield datasets based on the available files. | ||
| This function requires that self.facets['mip'] is not a glob pattern. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still the case?
Description
This PR expands
Dataset.from_filesso it works properly with derived variables. In addition, a new attributeDataset.input_datasetsis available which returns the datasets necessary for derivation (or simply the dataset itself is no derivation is required). This can also be used within thederivepreprocessor function.This PR is the second step to make
Dataset.loadwork with derived variables.Example
Related to #2769.
Link to documentation:
Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
To help with the number pull requests: