Conversation

@schlunma
Contributor

@schlunma schlunma commented Jul 16, 2025

Description

This PR expands `Dataset.from_files` so that it works properly with derived variables. In addition, a new attribute `Dataset.input_datasets` is available, which returns the datasets necessary for derivation (or simply the dataset itself if no derivation is required). This can also be used within the `derive` preprocessor function.

This PR is the second step towards making `Dataset.load` work with derived variables.

Example

dataset_template = Dataset(
    short_name="lwcre",
    mip="Amon",
    project="CMIP6",
    exp="historical",
    dataset="*",
    institute="*",
    ensemble="r1i1p1f1",
    grid="gn",
    derive=True,
    force_derivation=True,
)

datasets = list(dataset_template.from_files())
print(f"Found {len(datasets)} datasets")  # Found 36 datasets

dataset = datasets[0]
dataset.files  # []

for d in dataset.input_datasets:
    print(d["short_name"])
    print(d.files)

# rlut
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlut/gn/v20200623/rlut_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de', 'esgf3.dkrz.de']]
# rlutcs
# [ESGFFile:CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/Amon/rlutcs/gn/v20200623/rlutcs_Amon_TaiESM1_historical_r1i1p1f1_gn_185001-201412.nc on hosts ['esgf.ceda.ac.uk', 'esgf.rcec.sinica.edu.tw', 'esgf3.dkrz.de']]

Related to #2769.

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready for review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@schlunma schlunma added this to the v2.13.0 milestone Jul 16, 2025
@schlunma schlunma requested a review from bouweandela July 16, 2025 16:08
@schlunma schlunma added the variable derivation Related to variable derivation functions label Jul 16, 2025
@codecov

codecov bot commented Jul 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.42%. Comparing base (181f18d) to head (1cdfef2).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2777      +/-   ##
==========================================
+ Coverage   95.41%   95.42%   +0.01%     
==========================================
  Files         260      260              
  Lines       15409    15421      +12     
==========================================
+ Hits        14703    14716      +13     
+ Misses        706      705       -1     


Contributor

@valeriupredoi valeriupredoi left a comment


another superb PR, Manu 🍻 - I really got nothing to raise here, but I'd defer a quick looksee to @bouweandela too, just one minor comment related to one test from me 🍻

@schlunma
Contributor Author

@LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

@jlenh
Contributor

jlenh commented Aug 18, 2025

> @LisaBock This is the PR I talked about earlier. This is necessary to be able to properly use the derive preprocessor with public API functions. It would be very nice to have this in v2.13.0.

Would this be ready to be merged, @LisaBock? Or does it need another review from @bouweandela?

@bouweandela
Member

I would be happy to do a review, but I will need some time, as this is 1300 lines of new/changed code.

@jlenh jlenh modified the milestones: v2.13.0, v2.14.0 Aug 21, 2025
Member

@bouweandela bouweandela left a comment


Hi @schlunma, I heard you're back at work, so I made a start with reviewing this.

I'm a bit concerned that we'll make the esmvalcore.dataset.Dataset class more complicated than desirable. Where exactly is the boundary between defining input data and defining how to process it?

If we include more preprocessing in the Dataset class, it could turn into the esmvalcore.preprocessor.PreprocessorFile that we never made public because it is just too poorly designed and complicated #1847.

Maybe it's fine to include one more preprocessor function in the Dataset.load method, but maybe we could also solve this in another way too. Have you considered creating a function like esmvalcore.preprocessor._derive.get_required that would be user-friendly?
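For illustration, the kind of helper suggested here could look roughly like the sketch below. This is purely hypothetical: the function name `get_required` is taken from the private `esmvalcore.preprocessor._derive` module mentioned in the comment, but the signature, the return type, and the hardcoded lookup table are stand-ins (only the `lwcre` entry mirrors the inputs shown in the PR example; a real implementation would query ESMValCore's derivation registry instead).

```python
# Hypothetical sketch of a public "get_required"-style helper.
# The table below is a stand-in for ESMValCore's derivation registry.
_REQUIRED_VARIABLES: dict[str, list[dict]] = {
    # Longwave cloud radiative effect is derived from rlut and rlutcs,
    # matching the input datasets shown in the PR description example.
    "lwcre": [{"short_name": "rlut"}, {"short_name": "rlutcs"}],
}


def get_required(short_name: str) -> list[dict]:
    """Return the facets of the variables needed to derive ``short_name``.

    For variables without a derivation rule, the variable itself is the
    only required input.
    """
    return _REQUIRED_VARIABLES.get(short_name, [{"short_name": short_name}])


print([d["short_name"] for d in get_required("lwcre")])  # ['rlut', 'rlutcs']
print([d["short_name"] for d in get_required("tas")])    # ['tas']
```

A free function like this would keep the "what inputs are needed" question out of the `Dataset` class entirely, which is the design concern raised above.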

return input_datasets

@property
def input_datasets(self) -> list[Dataset]:
Member


Can we rename this to derived_from or something similar?

return not copy.files


def _get_input_datasets(dataset: Dataset) -> list[Dataset]:
Member


Is this function still needed now that the dataset provides these as an attribute?

return input_datasets


def _representative_datasets(dataset: Dataset) -> list[Dataset]:
Member


This function seems no longer needed either

def input_datasets(self) -> list[Dataset]:
    """Get input datasets.

    For non-derived variables (i.e., those with facet ``derive=False``),
Member


Suggested change
For non-derived variables (i.e., those with facet ``derive=False``),
For non-derived variables (i.e., those without a ``derive`` facet or with facet ``derive=False``),

Comment on lines +311 to +343
all_datasets: list[list[tuple[dict, Dataset]]] = []
for input_dataset in self._get_input_datasets():
    all_datasets.append([])
    for expanded_ds in self._get_available_datasets(
        input_dataset,
    ):
        updated_facets = {}
        for key, value in self.facets.items():
            if _isglob(value):
                if key in expanded_ds.facets and not _isglob(
                    expanded_ds[key],
                ):
                    updated_facets[key] = expanded_ds.facets[key]
        new_ds = self.copy()
        new_ds.facets.update(updated_facets)
        new_ds.supplementaries = self.supplementaries

        all_datasets[-1].append((updated_facets, new_ds))

# Only consider those datasets that contain all input variables
# necessary for derivation
for updated_facets, new_ds in all_datasets[0]:
    other_facets = [[d[0] for d in ds] for ds in all_datasets[1:]]
    if all(updated_facets in facets for facets in other_facets):
        yield new_ds
    else:
        logger.debug(
            "Not all necessary input variables to derive '%s' are "
            "available for %s with facets %s",
            self["short_name"],
            new_ds.summary(shorten=True),
            updated_facets,
        )
Member


This code is difficult to understand. I believe that what it intends to do is yield a new dataset if the globs can be expanded in the same way for all input datasets that are required to derive the dataset, did I get that right?

If yes, it could probably be simplified by bailing out as soon as you find an unexpanded glob pattern that was expanded for another dataset. Or did you intend to have all glob patterns expanded? I also have some concerns about how reliable it is. What happens if some facets differ from one input dataset to another, e.g. institute or version?
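The matching step being discussed here can be sketched in isolation. In this sketch, plain facet dicts stand in for `Dataset` objects, and the per-input expansions are supplied directly; both are simplifications of the real code in the PR, and `common_expansions` is a name invented for this illustration.

```python
# Sketch of the facet-matching logic under review: keep an expanded
# facet combination only if every required input variable's expansion
# contains the exact same facet dict. Plain dicts stand in for the
# Dataset objects used in the actual implementation.
def common_expansions(per_input_facets: list[list[dict]]) -> list[dict]:
    """Return the facet dicts present in the expansion of *every* input."""
    first, *rest = per_input_facets
    return [
        facets
        for facets in first
        if all(facets in expansions for expansions in rest)
    ]


# E.g. rlut is available for two models, rlutcs for only one, so only
# the model with both inputs survives.
rlut_expansions = [{"dataset": "TaiESM1"}, {"dataset": "CESM2"}]
rlutcs_expansions = [{"dataset": "TaiESM1"}]
print(common_expansions([rlut_expansions, rlutcs_expansions]))
# [{'dataset': 'TaiESM1'}]
```

As the comment above notes, exact dict equality is fragile when facets such as `institute` or `version` legitimately differ between input variables; a more robust version would compare only the facets that were glob patterns in the template.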

def _get_all_available_datasets(self) -> Iterator[Dataset]:  # noqa: C901
    """Yield datasets based on the available files.

    This function requires that self.facets['mip'] is not a glob pattern.
Member


Is this still the case?


Labels

variable derivation Related to variable derivation functions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants