An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry
This is a brief for two research papers:
- “An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry” (published at ICSE 2023). The full paper is available here. The artifact for our code and data is available on GitHub and Zenodo.
- “PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages” (published at MSR 2023). The full paper is available here. The artifact for our code is available on Zenodo. If you want the full dataset, it is available through a Globus share hosted at Purdue.
This post was written by my student Wenxin Jiang, and lightly edited by me. In this post, we use the word “PTM” as shorthand for “pre-trained model”.
Summary
Deep Neural Networks (DNNs) are being adopted as components in software systems. Creating and specializing DNNs from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, machine learning engineers have begun to reuse large-scale pre-trained models (PTMs) and fine-tune these models for downstream tasks.
Prior works have studied reuse practices for traditional software packages to guide software engineers towards better package maintenance and dependency management. We lack a similar foundation of knowledge to guide behaviors in pre-trained model ecosystems. In this work, we study the process of PTM reuse. Our contributions are:
- We summarized a decision-making workflow for PTM reuse and identified reuse attributes and challenges.
- We identified unique properties of PTM package reuse to guide future research.
- We measured potential risks in the HuggingFace platform for PTM reuse.
- We published the PTMTorrent dataset of 15,913 PTM packages from 5 model hubs.
Motivation
We’re probably all familiar with traditional software packages and the registries that hold them. For example, here is the LODASH package hosted by the NPM registry for JavaScript software.
Deep learning model registries imitate traditional package registries. They can be used to help engineers better reuse these costly artifacts.
Hugging Face is the largest DL model registry. At the time of writing, there are 211,350 pre-trained DL models (PTMs) on HuggingFace. For example, here is the BERT-BASE-UNCASED package hosted by the HuggingFace registry for DL software.
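As a concrete illustration, here is a minimal sketch of how an engineer might reuse this package through the transformers library (the example sentence is ours; the model name matches the package above):

```python
# Minimal sketch: load the bert-base-uncased PTM package and run it once.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Reusing a PTM looks a lot like importing a package.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```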
The most-downloaded PTMs from HuggingFace have download rates comparable to the most popular packages in NPM and PyPI. (We note that there is a much more rapid drop-off in HuggingFace. Many people have conjectured different explanations for this drop-off, but the phenomenon bears further study).
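As a rough sketch of how such comparisons can be made, the snippet below queries npm's public downloads endpoint and the Hugging Face Hub client. The two platforms count downloads over different time windows, so the numbers are only loosely comparable, and the exact response fields are assumptions on our part.

```python
# Sketch: fetch download counts from two registries (different time windows).
import requests
from huggingface_hub import HfApi

npm = requests.get(
    "https://api.npmjs.org/downloads/point/last-week/lodash", timeout=10
).json()
print("lodash (npm, last week):", npm["downloads"])

hf = HfApi().model_info("bert-base-uncased")
print("bert-base-uncased (Hugging Face, recent):", hf.downloads)
```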
A PTM package is similar to a package for traditional software, but it includes some DL-specific parts:
The way that software engineers interact with PTM packages may be different from traditional software packages. Few studies have explored this topic yet.
Research Questions
The goal of our work is to discover the similarities and differences between PTM packages and regular software packages. There are two main gaps in this problem domain:
- The nature of reuse and trust is unexamined in DL model registries.
- Accessing data is tricky; it would be great to have a dataset of PTMs available for analysis.
In our work, we studied the reuse of PTM packages in DL model registries, considering both qualitative and quantitative aspects. We focused on one DL model registry, Hugging Face, as it is by far the largest registry at present. For PTM reuse in the Hugging Face ecosystem, we ask:
- How do engineers select PTMs?
- What are the challenges of PTM reuse?
- To what extent are the risks of reusing PTMs mitigated by HuggingFace defenses?
Methods
To answer these questions, we examined two data sources. First, we interviewed 12 “power users” of HuggingFace. Second, we reviewed the HuggingFace public documentation and mined information from the platform.
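For the mining step, the Hugging Face Hub exposes package metadata programmatically. A hedged sketch of the kind of query we mean (attribute names vary across huggingface_hub versions):

```python
# Sketch: enumerate the most-downloaded models and their registry metadata.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(sort="downloads", direction=-1, limit=5, full=True):
    # Each result carries metadata such as the model id, task tag, and downloads.
    print(m.modelId, m.pipeline_tag, m.downloads)
```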
Results
PTM selection: The interview participants shared a similar decision-making process for PTM reuse, as shown in the next figure. Our participants reported two reuse scenarios: transfer learning (e.g. fine-tuning) and quantization. When reusing, participants found PTMs from DL model registries easier to adopt than PTMs from GitHub projects.
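In code, the two scenarios look roughly like this. This is a hedged sketch using PyTorch and transformers; the hyperparameters and the choice of dynamic int8 quantization are ours, not something our participants specified.

```python
# Sketch of the two reuse scenarios participants described.
import torch
from transformers import AutoModelForSequenceClassification

# Load the PTM with a fresh two-label classification head (illustrative task).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# (1) Transfer learning: fine-tune the PTM's weights on a downstream dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... standard training loop over the downstream data goes here ...

# (2) Quantization: shrink the PTM for cheaper inference, here via dynamic
#     int8 quantization of the linear layers.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```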
PTM reuse challenges: Participants reported 6 distinct kinds of challenges. The primary two were “Missing attributes” and “Discrepancies in claims” — PTMs often omit details or make un-reproducible claims.
Risks and HuggingFace defenses: Several subjects talked about trust issues with PTMs, e.g. security and privacy risks. We investigated the extent to which HuggingFace mitigates these risks. First, by studying the HuggingFace documentation and making our own PTMs and datasets, we created the following Dataflow Diagram. The white boxes and arrows indicate the “normal” flow of entities through the system. The red boxes are defenses against malicious users.
We analyzed each of the red boxes. The following table summarizes our concerns. For example, the second-from-left red box is “Verified organizations”, similar to Twitter’s blue check-mark. This would, in principle, let users know whether they are really downloading a model from Microsoft or from an impersonator. However, only ~3% of organizations have actually completed the verification process, rendering the defense relatively useless.
The PTMTorrent Dataset
Here are three reasons that we created a new dataset:
- PTMs are distributed across many places, and the model hubs provide different APIs to facilitate model reuse. There is no one-stop shop for comparing PTMs across model hubs.
- PTMs can be really big! Some model hubs rate-limit access to PTMs to prevent server overload and service disruption, which makes it harder to collect a large number of PTMs.
- We want to preserve the data long-term to support the scientific replicability of our work.
We want researchers to have easy access to PTM packages, so we created PTMTorrent (link at the top).
Our data collection workflow is shown next. We identified several model registries, some of which have been covered in prior studies of the PTM supply chain. We built tooling to obtain PTM packages from each model registry and map them to the corresponding GitHub repositories.
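Our actual tooling is in the artifact linked at the top. As a sketch of the retrieval step it performs on Hugging Face, the Hub client can snapshot an entire PTM package (the repository name below is just an example):

```python
# Sketch: download one complete PTM package (weights, config, model card).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bert-base-uncased")
print("Package files saved under:", local_dir)
```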
We extracted metadata from each PTM package into a unified data schema:
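The exact schema is documented in the PTMTorrent paper and repository; the sketch below only illustrates the idea, and the field names are ours rather than the schema's.

```python
# Illustrative (not exact) shape of a unified PTM metadata record.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PTMPackageRecord:
    model_hub: str                     # which registry the package came from
    package_name: str                  # registry-level identifier
    downloads: Optional[int] = None    # popularity signal, when reported
    tags: List[str] = field(default_factory=list)  # task / framework labels
    github_repo: Optional[str] = None  # mapped source repository, if any
    model_card: Optional[str] = None   # README / documentation text
```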
Here is a summary of the data we collected:
Example uses: We used the dataset to measure risk mitigation in the HuggingFace model registry, including measurements of dependencies, documentation, and GPG commit signing. We also used the metadata to identify model discrepancies and build a “maintainers’ reach” graph.
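As an example of the last analysis, a “maintainers’ reach” graph connects maintainers to the packages they control. A minimal sketch of how such a graph could be assembled with networkx (the records are made-up placeholders, not real data):

```python
# Sketch: a bipartite "maintainers' reach" graph linking maintainers to packages.
import networkx as nx

records = [  # illustrative (maintainer, package) pairs
    ("org_a", "org_a/model-1"),
    ("org_a", "org_a/model-2"),
    ("user_b", "user_b/model-3"),
]

G = nx.Graph()
for maintainer, package in records:
    G.add_node(maintainer, kind="maintainer")
    G.add_node(package, kind="package")
    G.add_edge(maintainer, package)

# A maintainer's "reach" is the number of packages they can influence.
reach = {n: G.degree(n) for n, d in G.nodes(data=True) if d["kind"] == "maintainer"}
print(reach)
```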
Conclusions
Compared to traditional package registries, DL model registries have three main differences:
We suggest several future research directions, including model auditing, optimizing model registry infrastructure, improving PTM standardization, and developing adversarial attack detectors for PTM packages. Aspects of PTM reuse differ from traditional package reuse. We need more research on PTM packages!