Skip to content

Port ThermoML parsing, add data entry types#24

Open
mattwthompson wants to merge 29 commits intomainfrom
data-entry-types
Open

Port ThermoML parsing, add data entry types#24
mattwthompson wants to merge 29 commits intomainfrom
data-entry-types

Conversation

@mattwthompson
Copy link
Member

@mattwthompson mattwthompson commented Feb 2, 2026

Closes #2
Closes #25
Closes #26

  • Require RDKit
  • Drop ThermoML plugin machinery (in favor of built-in solutions)
  • Drop most Evaluator machinery in favor of simpler solutions
    • No ThermodynamicState, just attributes for temperature and pressure
    • Use None in place of UNDEFINED
    • Remove Attribute / AttributeClass
    • Port some pre-Pydantic classes into Pydantic
  • Define Arrow schema for entry types
  • Look at how substance submodule is used (Re-use Evaluator's Substance classes? #40) but not necessarily refactor it

@mattwthompson
Copy link
Member Author

It turns out the implementation in Evaluator requires each property to be fully set up in openff/evaluator/properties/ before the corresponding ThermoML data can be parsed

@mattwthompson
Copy link
Member Author

This test is failing (and probably the cause of other failures):

FAILED dimsim/_tests/datasets/test_thermoml.py::test_load_property_types[single_dhvap.xml-expected2] - Exception: No properties parsed

Dropping into a debugger indicates it's deeper:

dimsim/_tests/datasets/test_thermoml.py::test_load_property_types[single_dhvap.xml-expected2] > /Users/mattthompson/software/dimsim/dimsim/datasets/thermoml/thermoml.py(2041)from_xml()
   2040             import ipdb; ipdb.set_trace()
-> 2041             properties = _PureOrMixtureData.from_xml_node(node, namespace, compounds)
   2042

ipdb> node
<Element '{http://www.iupac.org/namespaces/ThermoML}PureOrMixtureData' at 0x16a186de0>
ipdb> namespace
{'ThermoML': 'http://www.iupac.org/namespaces/ThermoML'}
ipdb> compounds
{1: Compound(smiles=Fc1ccccc1Cl, index=1)}
ipdb> _PureOrMixtureData.from_xml_node(node, namespace, compounds)
ipdb> _PureOrMixtureData.from_xml_node(node, namespace, compounds) is None
True

This is the code most blatantly copied from Evaluator, a.k.a. I understand it the least

Note that similar tests for other properties are passing

@codecov-commenter
Copy link

codecov-commenter commented Feb 16, 2026

Codecov Report

❌ Patch coverage is 52.31362% with 742 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.83%. Comparing base (c91f015) to head (c417b33).

Files with missing lines Patch % Lines
dimsim/datasets/thermoml/thermoml.py 66.58% 273 Missing ⚠️
dimsim/datasets/datasets.py 31.87% 171 Missing ⚠️
dimsim/utils/serialization.py 20.61% 154 Missing ⚠️
dimsim/substances/substances.py 28.65% 122 Missing ⚠️
dimsim/substances/amounts.py 55.55% 20 Missing ⚠️
dimsim/substances/components.py 93.10% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (c91f015) and HEAD (c417b33). Click for more details.

HEAD has 16 uploads less than BASE
Flag BASE (c91f015) HEAD (c417b33)
24 8
Additional details and impacted files
@@             Coverage Diff             @@
##             main      #24       +/-   ##
===========================================
- Coverage   89.66%   57.83%   -31.84%     
===========================================
  Files           4       12        +8     
  Lines         271     1819     +1548     
===========================================
+ Hits          243     1052      +809     
- Misses         28      767      +739     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mattwthompson
Copy link
Member Author

Decision: let RDKit (but not AmberTools) be a required dependency of dimsim. So no need to test with/without RDKit

Need to:

  • Define schema for each data entry type (see similar code in Descent)
  • Think more about if substances submodule needs to be ported here (Re-use Evaluator's Substance classes? #40)
  • Try out using mapped SMILES when going from ThermoML results to data entry types

Nice to haves:

  • look through ported Evaluator code and remove dead/unused code
  • Look through some Enums/fancy base classes and see what can be removed for simplicity
    • Example: PropertyPhase enum could be simpler
    • Stretch goal: have coverage 90%+ without adding more tests
  • dimsim/utils/serialization.py could probably be replaced by existing Pydantic-based serialization code
  • intermediate ThermodynamicState class could be removed

Comment on lines 2063 to 2064
entry = DataEntry(
tag=_TYPE_TAG_MAPPING[measured_property.type_string],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to switch into different class constructors here, based on the value of measured_property.type_string

For now, keep the tag attribute - even though it's redundant now, it might be useful later on (and it can be dropped later if it is not)

@mattwthompson mattwthompson changed the title Data entry types Port ThermoML parsing, add data entry types Feb 25, 2026
@mattwthompson mattwthompson marked this pull request as ready for review February 25, 2026 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port Evaluator plugins as non-plugins Comply with Evaluator licensing Port ThermoML parsing

2 participants