
Support for Azure storage #571

Open
tramora wants to merge 2 commits into main from 244-support-azure-storage

Conversation

@tramora (Collaborator) commented Apr 10, 2026

  • Among all the Azure storage services, Khiops supports only "Files" and "Blobs" (Binary Large Objects), via its specific driver
  • The only currently supported authentication method is the AZURE_STORAGE_CONNECTION_STRING environment variable, which embeds the account name and account key

Completes #244
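As a side note, a connection string of this form is a semicolon-separated list of key/value pairs. The key names (AccountName, AccountKey, ...) are the standard Azure ones, but the helper below is only an illustrative sketch, not the PR's code:

```python
def parse_connection_string(connection_string):
    """Split an Azure connection string into a key/value mapping.

    Example input (standard Azure format):
    "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...".
    """
    mapping = {}
    for part in connection_string.split(";"):
        if not part:
            continue
        # Split on the first "=" only: account keys are base64-encoded
        # and may themselves end with "=" padding
        key, _, value = part.partition("=")
        mapping[key] = value
    return mapping


conn = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=myaccount;"
    "AccountKey=c2VjcmV0a2V5=="
)
print(parse_connection_string(conn)["AccountName"])  # myaccount
```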


TODO Before Asking for a Review

  • Rebase your branch to the latest version of main (or main-v10)
  • Make sure all CI workflows are green
  • When adding a public feature/fix: Update the Unreleased section of CHANGELOG.md (no date)
  • Self-Review: Review "Files Changed" tab and fix any problems you find
  • API Docs (only if there are changes in docstrings, rst files or samples):
    • Check the docs build without warning: see the log of the API Docs workflow
    • Check that your changes render well in HTML: download the API Docs artifact and open index.html
    • If there are any problems, it is faster to iterate by building the API Docs locally

@tramora tramora force-pushed the 244-support-azure-storage branch 3 times, most recently from be97119 to 08f91e2 Compare April 10, 2026 14:34
@tramora tramora force-pushed the 244-support-azure-storage branch 4 times, most recently from 17fe956 to 80691c3 Compare April 12, 2026 12:59
@tramora (Collaborator, Author) commented Apr 13, 2026

The automated tests in conda environments currently fail because of an outdated version of the azure-storage-file-share package in the conda-forge channel. The PR conda-forge/azure-storage-file-share-feedstock#9 should fix the issue, but it is blocked for the moment.

@tramora tramora force-pushed the 244-support-azure-storage branch from 80691c3 to 701678b Compare April 13, 2026 16:32
Comment thread khiops/core/internals/filesystems.py Outdated
Comment thread khiops/core/internals/filesystems.py Outdated
Comment thread khiops/core/internals/filesystems.py
Comment thread khiops/core/internals/filesystems.py Outdated
Comment thread khiops/core/internals/filesystems.py Outdated
Comment thread khiops/core/internals/filesystems.py Outdated
Comment thread tests/test_remote_access.py Outdated
Comment on lines +117 to +127
try:
    os.remove("/tmp/dummy")
except Exception:  # pylint: disable=broad-exception-caught
    pass
fs.copy_to_local(
    f"{proto}://{bucket_name_or_storage_prefix}/khiops-cicd/"
    f"samples/{file}",
    "/tmp/dummy",
)
# We cannot use `unittest.TestCase.assertTrue` in a class method
assert os.path.isfile("/tmp/dummy")
Member:

  • I don't get why this new code was added
  • If it stays I'd refactor it as:
    • Before the loop create /tmp/khiops-dummy
    • Save each file to /tmp/khiops-dummy/{file}
    • Do not use assert as it is not a programming error, I think it should be something like if not os.path.isfile(...): raise RuntimeError
    • Clean /tmp/khiops-dummy afterwards
      This avoids the try/except with pass in the middle.
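A sketch of that suggested refactor; names like `fs`, `proto`, `bucket_name` and `files` stand in for the test's own variables:

```python
import os
import shutil


def download_and_check(fs, proto, bucket_name, files):
    # Create a dedicated scratch directory once, before the loop
    local_dir = "/tmp/khiops-dummy"
    os.makedirs(local_dir, exist_ok=True)
    try:
        for file in files:
            local_path = os.path.join(local_dir, file)
            fs.copy_to_local(
                f"{proto}://{bucket_name}/khiops-cicd/samples/{file}", local_path
            )
            # A missing download is not a programming error, so raise
            # instead of using `assert`
            if not os.path.isfile(local_path):
                raise RuntimeError(f"Download of {file} failed")
    finally:
        # Clean the scratch directory afterwards
        shutil.rmtree(local_dir, ignore_errors=True)
```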

Collaborator (Author):

During the development phase, when fs.copy_to_local was an empty method, the test would pass even if the file download was not successful. This part had to be improved to check that the file is correctly downloaded (no error raised and the local file created).

Following your suggestions, I improved the snippet a little.

Comment on lines -315 to 320
def should_skip_in_a_conda_env(self):
# The S3 driver is now released for conda too.
# No need to skip the tests any longer in a conda environment
return False

def config_exists(self):
Member:

I think these eliminations should go in a separate commit.

Collaborator (Author):

Fixed

Comment on lines +759 to +760
class AzureStorageResourceMixin:
    """Azure compatible Storage Resource Mixin
Member:

I think all these docs should go in the AzureStorage*Resource classes.

In the docstring of this class they should point to those, or the content could be duplicated.

Also, I think we should add a URL to a relevant Azure storage page.

Collaborator (Author):

Fixed

@tramora tramora force-pushed the 244-support-azure-storage branch from 701678b to 19aa231 Compare April 15, 2026 16:39
@folmos-at-orange (Member) left a comment:

I think this is an opportunity to have better docs on file systems.

Currently, very little is documented, and we now have 3 different remote systems, of which only one (S3) is well documented.

I'd add a section for each remote system in the module's docstring. Each section would explain how to configure it and give a URI example.

Comment thread khiops/core/internals/filesystems.py Outdated


class AzureStorageResourceMixin:
    """Azure compatible Storage Resource Mixin"""
Member:

I'd add in the body: See `AzureStorageFileResource` and `AzureBlobResource` for more details.

Collaborator (Author):

Added

Comment thread khiops/core/internals/filesystems.py Outdated
Comment on lines +901 to +903
"""
Check if the target resource exists (file or directory)
"""
Member:

Move this to the first docstring line to keep the codebase style.

Collaborator (Author):

Fixed

Comment thread khiops/core/internals/filesystems.py Outdated

def remove(self):
    """
    Remove the target resource, whether it is a file or a directory.
Member:

Move this to the first docstring line to keep the codebase style.

Collaborator (Author):

Fixed


def list_dir(self):
    """
    List the files of the current directory
Member:

Move this to the first docstring line to keep the codebase style.

Collaborator (Author):

Fixed

Thierry RAMORASOAVINA added 2 commits April 16, 2026 16:39
- Among all the Azure storages, Khiops supports only (via its specific driver) "Files" and "Blobs" (Binary Large Objects)
- The only supported authentication method is currently the `AZURE_STORAGE_CONNECTION_STRING` embedding the account name and account key
@tramora tramora force-pushed the 244-support-azure-storage branch from 19aa231 to 4dc0953 Compare April 16, 2026 14:39
"<container name>/<0 to n virtual folder name(s)>/<blob name>"
)

if __class__.is_netloc_of_file_share(self.uri_info.netloc):
Collaborator:

This if / else should be split and dispatched to the __init__ methods of the relevant Azure storage classes, IMHO.

# Create the authentication object using the connection string
connection_string = os.environ.get("AZURE_STORAGE_CONNECTION_STRING")
mappings = utils.parse_connection_string(connection_string)
creds = credentials.AzureNamedKeyCredential(
Collaborator:

I would put this in an attribute named credentials, to be reused directly in the per-storage Azure classes (also see the if / else split comment below).

return create_resource(parent_uri_info(self.uri_info).geturl())

@staticmethod
def _splitall(path):
@popescu-v (Collaborator) commented Apr 17, 2026:

I would turn this method into an instance method, because it is only used on an instance attribute:

    def _uri_parts(self):
        allparts = []
        path = self.uri_info.path
        while True:
            # Split on the last "/" only, as `path` is a URI
            parts = path.rsplit("/", 1)
            if len(parts) == 1:  # no "/" left: first segment of a relative path
                allparts.insert(0, parts[0])
                break
            elif parts[0] == "":  # leading "/": absolute path
                allparts.insert(0, parts[1])
                allparts.insert(0, "")
                break
            else:
                path = parts[0]
                allparts.insert(0, parts[1])
        return allparts

Then, I would also create a uri_parts property:

@property
def uri_parts(self):
    return self._uri_parts()

and use it instead of `parts` in the per-storage-type Azure classes (also see the if / else split comment).
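For illustration only (this is not the PR's code): since a URI path always uses "/" as its separator, the whole loop above can be reduced to a plain split, as a standalone sketch shows:

```python
def uri_parts(path):
    # A URI path always uses "/" as separator, so str.split suffices;
    # an absolute path yields an empty first element, mirroring the
    # loop's sentinel for absolute paths
    return path.split("/")


print(uri_parts("/container/folder/blob"))  # ['', 'container', 'folder', 'blob']
```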

return allparts

@staticmethod
def is_netloc_of_file_share(netloc):
Collaborator:

Once the if / else in the mixin's __init__ has been split into the per-storage classes, this method is no longer needed: in create_resource (its only other usage besides the if / else mentioned above), we can use netloc.endswith("...") directly.
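The suffix check the reviewer has in mind could look like the sketch below; the function name is hypothetical, and the domain is the public-cloud default for Azure Files endpoints:

```python
def is_file_share_netloc(netloc):
    # Azure Files endpoints in the public cloud end with this suffix;
    # Blob endpoints use ".blob.core.windows.net" instead
    return netloc.endswith(".file.core.windows.net")


print(is_file_share_netloc("myaccount.file.core.windows.net"))  # True
print(is_file_share_netloc("myaccount.blob.core.windows.net"))  # False
```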

# (via its specific driver):
# - Files
# - Blobs (Binary Large Objects)
if AzureStorageResourceMixin.is_netloc_of_file_share(uri_info.netloc):
Collaborator:

Use netloc.endswith directly (see other comments related to this).

except Exception as delete_error:  # pylint: disable=broad-exception-caught
    delete_directory_error = delete_error

raise delete_directory_error or delete_file_error
@popescu-v (Collaborator) commented Apr 17, 2026:

I would rewrite this whole method thus:

        file_share_client = self.azure_share_client.get_file_client(
            self.relative_remaining_path
        )

        # Attempt to delete the file first; if this fails, attempt to delete
        # the directory; fail with the Azure exception if that fails too
        try:
            file_share_client.delete_file()
        except Exception:  # pylint: disable=broad-exception-caught
            directory_share_client = self.azure_share_client.get_directory_client(
                self.relative_remaining_path
            )
            directory_share_client.delete_directory()

Comment on lines +956 to +957
with open(local_path) as input_file:
    file_share_client.upload_file(input_file.read())
Collaborator:

Why not use file_share_client.upload_file(read(local_path)) directly?

self.relative_remaining_path
)
with open(local_path, "wb") as output_file:
data = file_share_client.download_file()
Collaborator:

Move this line above the with (thus out of its scope)?

Comment on lines +1057 to +1058
with open(local_path) as input_file:
    self.azure_blob_client.upload_blob(input_file.read(), overwrite=True)
Collaborator:

Why not use self.azure_blob_client.upload_blob(read(local_path), overwrite=True) directly?


def copy_to_local(self, local_path):
    with open(local_path, "wb") as output_file:
        data = self.azure_blob_client.download_blob()
Collaborator:

I'd move this line above the with line (thus out of its scope).

$CONDA install -y -n py${version}_conda \
khiops-driver-s3==${KHIOPS_S3_DRIVER_REVISION} \
khiops-driver-gcs==${KHIOPS_GCS_DRIVER_REVISION}; \
# hardcoded version because the latest version is not released on conda-forge \
Collaborator:

Precede this comment text with XXX in order to make it easier to locate for future removal.

&& wget -O khiops-gcs.deb https://github.com/KhiopsML/khiopsdriver-gcs/releases/download/${KHIOPS_GCS_DRIVER_REVISION}/khiops-driver-gcs_${KHIOPS_GCS_DRIVER_REVISION}-1-bookworm.amd64.deb \
&& wget -O khiops-s3.deb https://github.com/KhiopsML/khiopsdriver-s3/releases/download/${KHIOPS_S3_DRIVER_REVISION}/khiops-driver-s3_${KHIOPS_S3_DRIVER_REVISION}-1-bookworm.amd64.deb \
&& (dpkg -i --force-all khiops-gcs.deb khiops-s3.deb || true) \
# There is a typo in the latest release tag. Until it is fixed, we need to use the hardcoded folder name
Collaborator:

Precede the comment text with XXX.

$CONDA install -y -n py${version}_conda \
khiops-driver-s3==${KHIOPS_S3_DRIVER_REVISION} \
khiops-driver-gcs==${KHIOPS_GCS_DRIVER_REVISION}; \
# hardcoded version because the latest version is not released on conda-forge \
Collaborator:

Precede comment with XXX.

&& wget -O khiops-gcs.deb https://github.com/KhiopsML/khiopsdriver-gcs/releases/download/${KHIOPS_GCS_DRIVER_REVISION}/khiops-driver-gcs_${KHIOPS_GCS_DRIVER_REVISION}-1-${VERSION_CODENAME}.amd64.deb \
&& wget -O khiops-s3.deb https://github.com/KhiopsML/khiopsdriver-s3/releases/download/${KHIOPS_S3_DRIVER_REVISION}/khiops-driver-s3_${KHIOPS_S3_DRIVER_REVISION}-1-${VERSION_CODENAME}.amd64.deb \
&& (dpkg -i --force-all khiops-gcs.deb khiops-s3.deb || true) \
# There is a typo in the latest release tag. Until it is fixed, we need to use the hardcoded folder name
Collaborator:

Precede comment with XXX.

f"samples/{file}",
f"/tmp/khiops-{proto}-dummy",
)
# We cannot use `unittest.TestCase.assertTrue` in a class method
Collaborator:

I'd remove this comment from here.

@popescu-v (Collaborator) left a comment:

See the comments.
