
Hugging Face Datasets Integrates with Dagster via New Library

Hugging Face Datasets now integrates with Dagster through dagster-hf-datasets, enabling production data pipelines for ML workflows.


Hugging Face Datasets now works with Dagster through dagster-hf-datasets, a library that connects Hugging Face datasets to Dagster’s orchestration framework. The integration lets data teams manage datasets as production assets, with reproducibility, observability, and versioning across the pipeline lifecycle. The library loads datasets directly from the Hugging Face Hub and transforms them through multiple stages, persisting intermediate results as Parquet artifacts. Refined datasets can then be republished to the Hugging Face Hub as curated artifacts. The goal is to make data pipelines easier for data engineers and ML teams to build, monitor, and maintain in production.

The new library introduces several key components, including hf_dataset_asset and hf_multi_asset, which map Hugging Face datasets to Dagster assets. These abstractions allow for explicit materialization, metadata tracking, and lineage visualization. Additionally, the HFDatasetPublisher enables transformed datasets to be published back to the Hugging Face Hub as part of the orchestration process. The integration also includes HFParquetIOManager, which handles local storage serialization via Parquet, ensuring compatibility with both Hugging Face datasets and Dagster’s asset model. By separating responsibilities between Hugging Face Datasets and Dagster, the tool maintains the efficiency of Arrow-backed datasets while leveraging Dagster’s orchestration capabilities for lineage tracking, scheduling, and metadata management.
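The asset pattern described above can be illustrated with a dependency-free toy model. This is a sketch under stated assumptions, not the dagster-hf-datasets API: the names JsonIOManager and AssetGraph are hypothetical, JSON files stand in for Parquet so that the example runs with the standard library alone, and the method names handle_output and load_input are only loosely modeled on Dagster's IO manager interface, with simplified signatures.

```python
import json
import tempfile
from pathlib import Path


class JsonIOManager:
    """Hypothetical stand-in for an IO manager: one file per asset.

    JSON keeps the sketch dependency-free; the library in the article uses Parquet.
    """

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _path(self, name):
        return self.base_dir / f"{name}.json"

    def handle_output(self, name, rows):
        # Persist the rows produced by an asset.
        self._path(name).write_text(json.dumps(rows))

    def load_input(self, name):
        # Reload a previously persisted asset for a downstream stage.
        return json.loads(self._path(name).read_text())


class AssetGraph:
    """Toy registry mapping asset names to functions and upstream dependencies."""

    def __init__(self, io_manager):
        self.io = io_manager
        self.assets = {}    # name -> (function, upstream names)
        self.metadata = {}  # name -> metadata recorded at materialization

    def asset(self, deps=()):
        def register(fn):
            self.assets[fn.__name__] = (fn, tuple(deps))
            return fn
        return register

    def materialize(self, name):
        fn, deps = self.assets[name]
        inputs = []
        for dep in deps:
            # Materialize each upstream asset, then read it back from storage.
            self.materialize(dep)
            inputs.append(self.io.load_input(dep))
        rows = fn(*inputs)
        self.io.handle_output(name, rows)
        # Record row count and lineage alongside the stored artifact.
        self.metadata[name] = {"num_rows": len(rows), "upstream": list(deps)}
        return rows


graph = AssetGraph(JsonIOManager(tempfile.mkdtemp()))


@graph.asset()
def raw_reviews():
    # Stands in for loading a dataset from the Hugging Face Hub.
    return [{"text": " Great library "}, {"text": ""}, {"text": "works well"}]


@graph.asset(deps=["raw_reviews"])
def clean_reviews(raw):
    # Drop empty rows and trim whitespace.
    return [{"text": r["text"].strip()} for r in raw if r["text"].strip()]


result = graph.materialize("clean_reviews")
print(graph.metadata["clean_reviews"])
# prints {'num_rows': 2, 'upstream': ['raw_reviews']}
```

The split mirrors the separation of responsibilities the article describes: the asset functions only transform rows, while storage and lineage metadata are handled by the surrounding framework.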

The integration was introduced to address the growing need for managing datasets as evolving operational assets rather than static files. As ML systems scale, datasets must be treated as first-class assets with observable workflows and reproducible transformations. By combining Hugging Face Datasets with Dagster’s asset-oriented execution model, the new tool enables teams to track dataset lineage, monitor transformations, and maintain metadata throughout the pipeline lifecycle. This approach enhances transparency and ensures that datasets remain consistent, versioned, and easily inspectable within the orchestration framework.

Source: Hugging Face

Key points

Hugging Face Datasets now integrates with Dagster through dagster-hf-datasets, enabling production data pipelines for ML workflows.
The integration allows datasets to be treated as production assets with reproducibility, observability, and versioning.
dagster-hf-datasets introduces hf_dataset_asset and hf_multi_asset to map Hugging Face datasets to Dagster assets.
The HFDatasetPublisher enables transformed datasets to be published back to the Hugging Face Hub as curated artifacts.
HFParquetIOManager handles local storage serialization via Parquet, ensuring compatibility with Hugging Face datasets and Dagster.
The integration separates responsibilities between Hugging Face Datasets and Dagster, preserving the efficiency of Arrow-backed datasets while leveraging Dagster’s orchestration capabilities.
By combining Hugging Face Datasets with Dagster’s asset-oriented execution model, the tool enables teams to track dataset lineage, monitor transformations, and maintain metadata throughout the pipeline lifecycle.


WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.
