Conversation

@schlunma
Contributor

@schlunma schlunma commented Jul 16, 2025

Description

This PR expands `Dataset.from_files` so that it works properly with derived variables. In addition, a new attribute `Dataset.input_datasets` is available, which returns the datasets necessary for derivation (or simply the dataset itself if no derivation is required). This can also be used within the `derive` preprocessor function.

This PR is the second step towards making `Dataset.load` work with derived variables.

Example

dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets")  # Found 36 datasets

dataset = datasets[0]
dataset.files  # []

for d in dataset.input_datasets:
    print(d["short_name"])
    print(d.files)

# rlut
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlut/gn/v20200623/rlut_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]
# rlutcs
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlutcs/gn/v20200623/rlutcs_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de']]

Related to #2769.

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready for review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@schlunma schlunma added this to the v2.13.0 milestone Jul 16, 2025
@schlunma schlunma requested a review from bouweandela July 16, 2025 16:08
@schlunma schlunma added the variable derivation Related to variable derivation functions label Jul 16, 2025
@codecov

codecov bot commented Jul 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.42%. Comparing base (181f18d) to head (1cdfef2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2777      +/-   ##
==========================================
+ Coverage   95.41%   95.42%   +0.01%     
==========================================
  Files         260      260              
  Lines       15409    15421      +12     
==========================================
+ Hits        14703    14716      +13     
+ Misses        706      705       -1     


Contributor

@valeriupredoi valeriupredoi left a comment


another superb PR, Manu 🍻 - I really got nothing to raise here, but I'd defer a quick looksee to @bouweandela too, just one minor comment related to one test from me 🍻

@schlunma
Contributor Author

@LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

@jlenh
Contributor

jlenh commented Aug 18, 2025

> @LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

Would this be ready to be merged, @LisaBock? Or does it need another review from @bouweandela?

@bouweandela
Member

I would be happy to do a review, but I will need some time, as this is 1300 lines of new/changed code.

@jlenh jlenh modified the milestones: v2.13.0, v2.14.0 Aug 21, 2025
Member

@bouweandela bouweandela left a comment


Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.

I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?

If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?
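For illustration, the kind of helper suggested here could look roughly like the sketch below. This is purely hypothetical: the function name `get_required` is taken from the private `esmvalcore.preprocessor._derive` module mentioned in the comment, but the signature, the return type, and the hardcoded lookup table are stand-ins (only the `lwcre` entry mirrors the inputs shown in the PR example; a real implementation would query ESMValCore's derivation registry instead).

```python
# Hypothetical sketch of a public "get_required"-style helper.
# The table below is a stand-in for ESMValCore's derivation registry.
_REQUIRED_VARIABLES: dict[str, list[dict]] = {
    # Longwave cloud radiative effect is derived from rlut and rlutcs,
    # matching the input datasets shown in the PR description example.
    "lwcre": [{"short_name": "rlut"}, {"short_name": "rlutcs"}],
}


def get_required(short_name: str) -> list[dict]:
    """Return the facets of the variables needed to derive ``short_name``.

    For variables without a derivation rule, the variable itself is the
    only required input.
    """
    return _REQUIRED_VARIABLES.get(short_name, [{"short_name": short_name}])


print([d["short_name"] for d in get_required("lwcre")])  # ['rlut', 'rlutcs']
print([d["short_name"] for d in get_required("tas")])    # ['tas']
```

A free function like this would keep the "what inputs are needed" question out of the `Dataset` class entirely, which is the design concern raised above.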

return input_datasets

@property
def input_datasets(self) -> list[Dataset]:
Member


Can we rename this to derived_from or something similar?

return not copy.files


def _get_input_datasets(dataset: Dataset) -> list[Dataset]:
Member


Is this function still needed now that the dataset provides these as an attribute?

return input_datasets


def _representative_datasets(dataset: Dataset) -> list[Dataset]:
Member


This function seems no longer needed either

def input_datasets(self) -> list[Dataset]:
    """Get input datasets.

    For non-derived variables (i.e., those with facet ``derive=False``),
Member


Suggested change
For non-derived variables (i.e., those with facet ``derive=False``),
For non-derived variables (i.e., those without a ``derive`` facet or with facet ``derive=False``),

Comment on lines +311 to +343
all_datasets: list[list[tuple[dict, Dataset]]] = []
for input_dataset in self._get_input_datasets():
    all_datasets.append([])
    for expanded_ds in self._get_available_datasets(
        input_dataset,
    ):
        updated_facets = {}
        for key, value in self.facets.items():
            if _isglob(value):
                if key in expanded_ds.facets and not _isglob(
                    expanded_ds[key],
                ):
                    updated_facets[key] = expanded_ds.facets[key]
        new_ds = self.copy()
        new_ds.facets.update(updated_facets)
        new_ds.supplementaries = self.supplementaries

        all_datasets[-1].append((updated_facets, new_ds))

# Only consider those datasets that contain all input variables
# necessary for derivation
for updated_facets, new_ds in all_datasets[0]:
    other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]]
    if all(updated_facets in facets for facets in other_facets):
        yield new_ds
    else:
        logger.debug(
            "Not all necessary input variables to derive '%s' are "
            "available for %s with facets %s",
            self["short_name"],
            new_ds.summary(shorten=True),
            updated_facets,
        )
Member


This code is difficult to understand. I believe that what it intends to do is yield a new dataset if the globs can be expanded in the same way for all input datasets that are required to derive the dataset, did I get that right?

If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I also have some concerns about how reliable it is. What happens if some facets differ from one input dataset to another, e.g. institute or version?
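The matching step being discussed here can be sketched in isolation. In this sketch, plain facet dicts stand in for `Dataset` objects, and the per-input expansions are supplied directly; both are simplifications of the real code in the PR, and `common_expansions` is a name invented for this illustration.

```python
# Sketch of the facet-matching logic under review: keep an expanded
# facet combination only if every required input variable's expansion
# contains the exact same facet dict. Plain dicts stand in for the
# Dataset objects used in the actual implementation.
def common_expansions(per_input_facets: list[list[dict]]) -> list[dict]:
    """Return the facet dicts present in the expansion of *every* input."""
    first, *rest = per_input_facets
    return [
        facets
        for facets in first
        if all(facets in expansions for expansions in rest)
    ]


# E.g. rlut is available for two models, rlutcs for only one, so only
# the model with both inputs survives.
rlut_expansions = [{"dataset": "TaiESM1"}, {"dataset": "CESM2"}]
rlutcs_expansions = [{"dataset": "TaiESM1"}]
print(common_expansions([rlut_expansions, rlutcs_expansions]))
# [{'dataset': 'TaiESM1'}]
```

As the comment above notes, exact dict equality is fragile when facets such as `institute` or `version` legitimately differ between input variables; a more robust version would compare only the facets that were glob patterns in the template.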

def _get_all_available_datasets(self) -> Iterator[Dataset]:  # noqa: C901
    """Yield datasets based on the available files.

    This function requires that self.facets['mip'] is not a glob pattern.
Member


Is this still the case?


Labels

variable derivation Related to variable derivation functions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants