Add Pystp netcdf driver #1

Open
RichardHitier wants to merge 7 commits into SciQLop:main from RichardHitier:pystp_netcdf

Conversation

@RichardHitier

No description provided.

jeandet requested review from brenard-irap and jeandet on April 28, 2026, 16:54
jeandet added the enhancement (New feature or request) label on Apr 28, 2026
@jeandet
Member

jeandet commented May 6, 2026

Code review

Found 1 issue:

  1. cdf_type() is inconsistent with values() for CDF_EPOCH variables. values() calls _get_units() which checks both 'units' and 'UNITS', and routes through _is_cdf_epoch() to return datetime64[ns]. cdf_type() only calls v.getncattr('units') (lowercase) and only handles the CF time ('since' in units) branch — there is no CDF_EPOCH branch at all. Verified against the bundled tests/resources/ac_h2s_mfi_cdaweb.nc (whose Epoch has UNITS='ms'): values('Epoch').dtype is datetime64[ns] while cdf_type('Epoch') returns 'CDF_DOUBLE'. Downstream consumers that key off cdf_type to identify time axes will misclassify the epoch.

def cdf_type(self, var):
    v = self._ds[var]
    # CF time variable: float with a "units" attribute containing "since"
    try:
        units = v.getncattr('units')
        if isinstance(units, str) and 'since' in units:
            return 'CDF_TIME_TT2000'
    except AttributeError:
        pass
    if v.dtype == str:
        return 'CDF_CHAR'
    dtype_str = v.dtype.str.lstrip('<>=!')
    return self._DTYPE_TO_CDF.get(dtype_str, f'CDF_UNKNOWN_{dtype_str}')

Suggest reusing _get_units() and adding a _is_cdf_epoch(var) branch returning 'CDF_EPOCH' (mirroring values()).
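
The suggestion could look roughly like the following. This is a minimal sketch, not the PR's code: the class name `NetCDFDriverSketch`, the plain-dict stand-in for the dataset, the reduced `_DTYPE_TO_CDF` table, and the `_is_cdf_epoch` heuristic (an `Epoch`-named variable with units `'ms'`) are all assumptions made so the example runs without netCDF4.

```python
class NetCDFDriverSketch:
    """Illustrative stand-in for the netCDF driver (hypothetical class)."""

    # Assumed subset of the dtype-to-CDF-type table.
    _DTYPE_TO_CDF = {'f8': 'CDF_DOUBLE', 'f4': 'CDF_FLOAT', 'i4': 'CDF_INT4'}

    def __init__(self, variables):
        # variables: {name: {'dtype_str': '<f8', 'attrs': {...}}}
        # A plain dict replaces netCDF4.Dataset so the sketch is self-contained.
        self._vars = variables

    def _get_units(self, var):
        """Return the units attribute, accepting both 'units' and 'UNITS'."""
        attrs = self._vars[var]['attrs']
        for key in ('units', 'UNITS'):
            value = attrs.get(key)
            if isinstance(value, str):
                return value
        return None

    def _is_cdf_epoch(self, var):
        """Assumed heuristic: an 'Epoch...' variable expressed in milliseconds."""
        return var.lower().startswith('epoch') and self._get_units(var) == 'ms'

    def cdf_type(self, var):
        units = self._get_units(var)  # same lookup that values() relies on
        if units is not None and 'since' in units:
            return 'CDF_TIME_TT2000'
        if self._is_cdf_epoch(var):
            return 'CDF_EPOCH'
        dtype_str = self._vars[var]['dtype_str'].lstrip('<>=!|')
        return self._DTYPE_TO_CDF.get(dtype_str, f'CDF_UNKNOWN_{dtype_str}')


driver = NetCDFDriverSketch({
    'Epoch': {'dtype_str': '<f8', 'attrs': {'UNITS': 'ms'}},
    'time': {'dtype_str': '<f8', 'attrs': {'units': 'seconds since 2000-01-01'}},
    'B': {'dtype_str': '<f4', 'attrs': {}},
})
print(driver.cdf_type('Epoch'), driver.cdf_type('time'), driver.cdf_type('B'))
```

Because both code paths share `_get_units()`, a variable that `values()` treats as an epoch can no longer be reported as a plain double by `cdf_type()`.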

Other items considered but not flagged: netCDF4.Dataset is never closed (no close()/context manager) and the _DTYPE_TO_CDF 'S' key is unreachable (v.dtype.str.lstrip('<>=!') leaves the leading |) — both real but lower-impact; worth a follow-up.
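
Both follow-up items can be illustrated with a short sketch. The helper `dtype_key` and the wrapper `ClosingDriver` are hypothetical names, and the sketch assumes the lookup table is keyed by the bare kind letter `'S'` for byte strings; NumPy reports such dtypes as `'|S<length>'` (for example `'|S4'`).

```python
def dtype_key(dtype_str):
    """Map a NumPy dtype string such as '<f8' or '|S4' to a lookup key."""
    # '|' means "byte order not applicable"; the original lstrip omits it.
    stripped = dtype_str.lstrip('<>=!|')
    # Byte strings carry their length ('S4'); reduce them to the bare kind 'S'.
    return 'S' if stripped.startswith('S') else stripped


class ClosingDriver:
    """Hypothetical wrapper that guarantees the underlying handle is closed."""

    def __init__(self, dataset):
        self._ds = dataset  # any object exposing a close() method

    def close(self):
        if self._ds is not None:
            self._ds.close()
            self._ds = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow exceptions


class FakeDataset:
    """Test double standing in for an open dataset."""
    closed = False

    def close(self):
        self.closed = True


fake = FakeDataset()
with ClosingDriver(fake):
    pass  # read variables here

print('|S4'.lstrip('<>=!'), dtype_key('|S4'), dtype_key('<f8'), fake.closed)
```

The first printed value shows why the `'S'` key is never hit with the current stripping: the leading `|` survives.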

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.


brenard-irap left a comment


I have a few remarks, mostly related to design.
We should find some time in the coming days to discuss them together.

Comment thread pyistp/__init__.py
def load(file=None, buffer=None, master_file=None, master_buffer=None) -> _ISTPLoader:
    return _ISTPLoader(file=file, buffer=buffer, master_file=master_file, master_buffer=master_buffer)



The distinction between the load (for CDF) and load_netcdf methods based on the file format bothers me.
I would have preferred to keep a single load method.

The selection of the driver to use could be handled in the constructor of ISTPLoaderImpl (as is already partially done for the CDF driver type: pycdfpp or spacepy).
The format detection could be implemented by inspecting the first 4 bytes that define the magic number, for example.
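
A minimal sketch of such magic-number detection follows. The function name `detect_format` and its return labels are assumptions for illustration. The signatures used are the classic netCDF prefixes (`CDF` followed by version byte 1, 2, or 5), the 8-byte HDF5 signature that netCDF-4 files start with, and the CDF magic numbers `0xCDF30001` (v3) and `0xCDF26002` (v2.6/2.7). Note that an HDF5 signature alone does not prove the file is netCDF-4, so the sketch treats it only as a candidate.

```python
HDF5_SIGNATURE = b'\x89HDF\r\n\x1a\n'                    # netCDF-4 is HDF5-based
NETCDF_CLASSIC = (b'CDF\x01', b'CDF\x02', b'CDF\x05')    # classic, 64-bit offset, CDF-5
CDF_MAGIC = (b'\xcd\xf3\x00\x01', b'\xcd\xf2\x60\x02')   # CDF v3, CDF v2.6/2.7


def detect_format(source):
    """Return 'cdf', 'netcdf', or 'unknown' for a path or an in-memory buffer."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        head = bytes(source[:8])
    else:
        with open(source, 'rb') as f:
            head = f.read(8)
    if head[:4] in CDF_MAGIC:
        return 'cdf'
    if head[:4] in NETCDF_CLASSIC or head.startswith(HDF5_SIGNATURE):
        return 'netcdf'
    return 'unknown'


print(detect_format(b'\xcd\xf3\x00\x01\x00\x00\xff\xff'))  # CDF v3 header bytes
print(detect_format(b'CDF\x01' + b'\x00' * 4))             # classic netCDF
print(detect_format(HDF5_SIGNATURE))                       # netCDF-4 candidate
```

A single `load` entry point could call a helper like this once for the data file and once for the master file, then pick a driver for each result.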

Additionally, in load_netcdf, we lose the ability to provide a master file, which I find problematic (see examples below).

I would also like the reading of the data file and the reading of the master file to be independent, meaning they could potentially use different drivers.

Example 1 – ICON mission from CDAWeb:

The master file is provided in CDF: https://cdaweb.gsfc.nasa.gov/pub/software/cdawlib/0MASTERS/icon_l2-6_euv_00000000_v01.cdf
The data files are in netCDF: https://spdf.gsfc.nasa.gov/pub/data/icon/l2/l2-6_euv/

The data files look like ISTP-compliant files, but they are not actually compliant.
For example, in the netCDF data files, Var_Type is used to specify whether a variable is data or support_data.
However, the specification (https://github.com/IHDE-Alliance/ISTP_metadata/blob/main/ISTP_metadata_guidelines/docs/05_metadata-variable-attributes.md#istp-variable-attributes) clearly states: "Note that attribute names are case sensitive, and the names of the ISTP variable attributes must match the case as shown."
Therefore, VAR_TYPE should have been used for the netCDF files to be directly ISTP-compliant.
The master file, on the other hand, is properly ISTP-compliant and does use VAR_TYPE to define the data type.

Example 2 - AMDA:

For AMDA, we are considering decommissioning our DDSERVER data server and replacing it with Speasy.
The data in this database is in netCDF and is not ISTP-compliant.
Regenerating the entire database is not an option (several million files and multiple terabytes in volume).
What I would like to do instead is generate CDF/ISTP-compliant master files for each dataset.
This would put us in a situation similar to the ICON mission from CDAWeb:

  • master file in CDF / ISTP-compliant
  • data files in non-ISTP-compliant netCDF

This would avoid the need for AMDA-specific development, which would be ideal.
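
The master-plus-data arrangement described in both examples can be sketched as an attribute overlay. This is only an illustration under stated assumptions: `merged_attributes` is a hypothetical helper, the dictionaries stand in for whatever each driver returns, and how pyistp actually combines master and data files is not shown in this conversation.

```python
def merged_attributes(data_attrs, master_attrs):
    """Overlay master-file attributes on data-file attributes, per variable."""
    merged = {}
    for var in set(data_attrs) | set(master_attrs):
        attrs = dict(data_attrs.get(var, {}))
        attrs.update(master_attrs.get(var, {}))  # the master wins on conflicts
        merged[var] = attrs
    return merged


# Data file (netCDF, not compliant): wrong-case 'Var_Type'.
data_attrs = {'density': {'Var_Type': 'data', 'UNITS': 'cm^-3'}}
# Master file (CDF, compliant): correctly spelled 'VAR_TYPE' plus DEPEND_0.
master_attrs = {'density': {'VAR_TYPE': 'data', 'DEPEND_0': 'Epoch'}}

merged = merged_attributes(data_attrs, master_attrs)
print(merged['density']['VAR_TYPE'], merged['density']['DEPEND_0'])
```

With this shape, the ISTP-facing code only reads the correctly cased `VAR_TYPE` supplied by the master, and the data file needs no changes.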

Comment thread pyistp/_impl.py
driver_factory = current_driver
if file is not None:
    log.debug(f"Loading {file}")
self.cdf = current_driver(file or buffer)

As mentioned in another comment, I would like to take into account the fact that the driver used to read the master file may differ from the one used to read the data file.

Comment thread pyistp/drivers/netcdf.py
        return (unix_ms * 1_000_000).astype('datetime64[ns]')

    def values(self, var, is_metadata_variable=False):  # NOSONAR
        v = self._ds[var]

In my opinion, it is not the responsibility of the pyistp driver to interpret the data and convert it into datetime64; this should instead be handled by the consuming tool (in our case, the Speasy codec).
The data should be provided as-is, exactly as they appear in the file.
The pyistp library should only identify which variable contains the time information for the other variables.
This seems even more important given that, in the case of netCDF, the type of a time variable is not always clearly defined (unlike CDF, which uses CDF_EPOCH, CDF_EPOCH16, or TT2000).
As stated in the specification:
https://github.com/IHDE-Alliance/ISTP_metadata/blob/main/ISTP_metadata_guidelines/docs/04_metadata-variables.md#netcdf-times

NetCDF files can include the CDF time variables, with CDF_TIME_TT2000 especially recommended, but will require using the CDF library time routines for conversion. Otherwise, netCDF times are typically something like seconds from some specific time epoch, with UNITS = "seconds from 2000-01-01 UTC" or similar. In either case, the ISTP time variable attributes should be added.
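
To make the "seconds from some specific time epoch" case concrete, here is a rough sketch of how a consumer could interpret such a UNITS string. The names `parse_time_units` and `decode_times` are hypothetical, and the sketch assumes UTC, a date-only epoch, and no leap-second handling.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS_RE = re.compile(
    r'^\s*(?P<unit>\w+)\s+(?:since|from)\s+(?P<epoch>\d{4}-\d{2}-\d{2})',
    re.IGNORECASE,
)
_SECONDS_PER_UNIT = {
    'milliseconds': 1e-3, 'seconds': 1.0, 'minutes': 60.0,
    'hours': 3600.0, 'days': 86400.0,
}


def parse_time_units(units):
    """Return (seconds_per_unit, epoch) or None if the string is not understood."""
    match = _UNITS_RE.match(units)
    if match is None:
        return None
    scale = _SECONDS_PER_UNIT.get(match.group('unit').lower())
    if scale is None:
        return None
    epoch = datetime.strptime(match.group('epoch'), '%Y-%m-%d')
    return scale, epoch.replace(tzinfo=timezone.utc)  # UTC is assumed


def decode_times(values, units):
    """Convert raw offsets to timezone-aware datetimes."""
    parsed = parse_time_units(units)
    if parsed is None:
        raise ValueError(f'unsupported time units: {units!r}')
    scale, epoch = parsed
    return [epoch + timedelta(seconds=v * scale) for v in values]


times = decode_times([0, 60], 'seconds from 2000-01-01 UTC')
print(times[1].isoformat())
```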

If we move this interpretation logic into Speasy, we will be able to adapt it more easily depending on the provider.
For example, in AMDA / DDSERVER, the netCDF data files in our local database use a time format called DDTIME (which is not used anywhere else - for historical reasons).
It would not make sense to implement support for this format in pyistp, whereas in Speasy we could provide a callback mechanism in the netCDF codec to handle such very specific cases.
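
One possible shape for such a callback mechanism is sketched below. Everything here is hypothetical: `TimeDecoderRegistry` is not an existing Speasy or pyistp API, and the `'amda'` decoder is a placeholder that only tags values, since the DDTIME layout is not described in this conversation.

```python
class TimeDecoderRegistry:
    """Hypothetical per-provider registry of time-decoding callbacks."""

    def __init__(self, default):
        self._default = default
        self._by_provider = {}

    def register(self, provider, decoder):
        """Associate a callable (raw_values, attrs) -> decoded with a provider."""
        self._by_provider[provider] = decoder

    def decode(self, provider, raw_values, attrs):
        decoder = self._by_provider.get(provider, self._default)
        return decoder(raw_values, attrs)


def passthrough(raw_values, attrs):
    """Default behaviour: hand the values over exactly as stored in the file."""
    return list(raw_values)


def placeholder_ddtime(raw_values, attrs):
    """Stand-in for a provider-specific decoder; real DDTIME parsing goes here."""
    return [('ddtime', value) for value in raw_values]


registry = TimeDecoderRegistry(default=passthrough)
registry.register('amda', placeholder_ddtime)

print(registry.decode('cdaweb', [1.0, 2.0], {'UNITS': 'ms'}))
print(registry.decode('amda', ['raw-a'], {}))
```

The driver stays format-agnostic in this design: it returns raw values and attributes, and the consuming codec decides which decoder applies.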

Comment thread pyproject.toml
"Programming Language :: Python :: 3.12",
]
dependencies = ['pycdfpp>=0.6.0']


Why make this dependency optional?

Comment thread tests/test_netcdf.py
except ImportError:
    pytest.skip("netcdf driver not implemented yet", allow_module_level=True)



I’ll try to provide you with additional test cases.
However, there are far fewer publicly available datasets in netCDF compared to CDF...


3 participants