An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry
This is a brief for two research papers:
- “An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry” (published at ICSE 2023). The full paper is available here. The artifact for our code and data is available on GitHub and Zenodo.
- “PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages” (published at MSR 2023). The full paper is available here. The artifact for our code is available on Zenodo. If you want the full dataset, it is available through a Globus share hosted at Purdue.
This post was written by my student Wenxin Jiang, and lightly edited by me. In this post, we use the word “PTM” as shorthand for “pre-trained model”.
Summary
Deep Neural Networks (DNNs) are being adopted as components in software systems. Creating and specializing DNNs from scratch has grown increasingly difficult as state-of-the-art architectures grow more complex. Following the path of traditional software engineering, machine learning engineers have begun to reuse large-scale pre-trained models (PTMs) and fine-tune these models for downstream tasks.
Prior works have studied reuse practices for traditional software packages to guide software engineers towards better package maintenance and dependency management. We lack a similar foundation of knowledge to guide behaviors in pre-trained model ecosystems. In this work, we study the process of PTM reuse. Our contributions are:
- We summarized a decision-making workflow for PTM reuse and identified reuse attributes and challenges.
- We identified unique properties of PTM package reuse to guide future research.
- We measured potential risks in the HuggingFace platform for PTM reuse.
- We published the PTMTorrent dataset of 15,913 PTM packages from 5 model hubs.
Motivation
We’re probably all familiar with traditional software packages and the registries that hold them. For example, here is the LODASH package hosted by the NPM registry for JavaScript software.
Deep learning model registries imitate traditional package registries. They can be used to help engineers better reuse these costly artifacts.
Hugging Face is the largest DL model registry. At the time of writing, there are 211,350 pre-trained DL models (PTMs) on HuggingFace. For example, here is the BERT-BASE-UNCASED package hosted by the HuggingFace registry for DL software.
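As a concrete illustration, here is a minimal sketch of how an engineer might reuse this package through the transformers library (the example sentence is ours; the model name matches the package above):

```python
# Minimal sketch: load the bert-base-uncased PTM package and run it once.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Reusing a PTM looks a lot like importing a package.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden size)
```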
The most-downloaded PTMs from HuggingFace have download rates comparable to the most popular packages in NPM and PyPI. (We note that there is a much more rapid drop-off in HuggingFace. Many people have conjectured different explanations for this drop-off, but the phenomenon bears further study).
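As a rough sketch of how such comparisons can be made, the snippet below queries npm's public downloads endpoint and the Hugging Face Hub client. The two platforms count downloads over different time windows, so the numbers are only loosely comparable, and the exact response fields are assumptions on our part.

```python
# Sketch: fetch download counts from two registries (different time windows).
import requests
from huggingface_hub import HfApi

npm = requests.get(
    "https://api.npmjs.org/downloads/point/last-week/lodash", timeout=10
).json()
print("lodash (npm, last week):", npm["downloads"])

hf = HfApi().model_info("bert-base-uncased")
print("bert-base-uncased (Hugging Face, recent):", hf.downloads)
```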
A PTM package is similar to a package for traditional software, but it includes some DL-specific parts:
The way that software engineers interact with PTM packages may be different from traditional software packages. Few studies have explored this topic yet.
Research Questions
The goal of our work is to discover the similarities and differences between PTM packages and regular software packages. There are two main gaps in this problem domain:
- The nature of reuse and trust is unexamined in DL model registries.
- Accessing data is tricky; it would be great to have a dataset of PTMs available for analysis.
In our work, we studied the reuse of PTM packages in DL model registries, considering both qualitative and quantitative aspects. We focused on one DL model registry, Hugging Face, as it is by far the largest registry at present. For PTM reuse in the Hugging Face ecosystem, we ask:
- How do engineers select PTMs?
- What are the challenges of PTM reuse?
- To what extent are the risks of reusing PTMs mitigated by HuggingFace defenses?
Methods
To answer these questions, we examined two data sources. First, we interviewed 12 “power users” of HuggingFace. Second, we reviewed the HuggingFace public documentation and mined information from the platform.
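For the mining step, the Hugging Face Hub exposes package metadata programmatically. A hedged sketch of the kind of query we mean (attribute names vary across huggingface_hub versions):

```python
# Sketch: enumerate the most-downloaded models and their registry metadata.
from huggingface_hub import HfApi

api = HfApi()
for m in api.list_models(sort="downloads", direction=-1, limit=5, full=True):
    # Each result carries metadata such as the model id, task tag, and downloads.
    print(m.modelId, m.pipeline_tag, m.downloads)
```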
Results
PTM selection: The interview participants shared a similar decision-making process for PTM reuse, as shown in the next figure. Our participants reported two reuse scenarios: transfer learning (e.g. fine-tuning) and quantization. When reusing, participants found PTMs from DL model registries easier to adopt than PTMs from GitHub projects.
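In code, the two scenarios look roughly like this. This is a hedged sketch using PyTorch and transformers; the hyperparameters and the choice of dynamic int8 quantization are ours, not something our participants specified.

```python
# Sketch of the two reuse scenarios participants described.
import torch
from transformers import AutoModelForSequenceClassification

# Load the PTM with a fresh two-label classification head (illustrative task).
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# (1) Transfer learning: fine-tune the PTM's weights on a downstream dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# ... standard training loop over the downstream data goes here ...

# (2) Quantization: shrink the PTM for cheaper inference, here via dynamic
#     int8 quantization of the linear layers.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```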
PTM reuse challenges: Participants reported 6 distinct kinds of challenges. The primary two were “Missing attributes” and “Discrepancies in claims” — PTMs often omit details or make un-reproducible claims.
Risks and HuggingFace defenses: Several subjects talked about trust issues with PTMs, e.g. security and privacy risks. We investigated the extent to which HuggingFace mitigates these risks. First, by studying the HuggingFace documentation and making our own PTMs and datasets, we created the following Dataflow Diagram. The white boxes and arrows indicate the “normal” flow of entities through the system. The red boxes are defenses against malicious users.
We analyzed each of the red boxes. The following table summarizes our concerns. For example, the second-from-left red box is “Verified organizations”, similar to Twitter’s blue check-mark. This would, in principle, let users know whether they are really downloading a model from Microsoft or from an impersonator. However, only ~3% of organizations have actually completed the verification process, rendering the defense relatively useless.
The PTMTorrent Dataset
Here are three reasons that we created a new dataset:
- PTMs are distributed across many places, and the model hubs provide different APIs to facilitate model reuse. There is no one-stop shop for comparing PTMs across model hubs.
- PTMs can be really big! Some model hubs rate-limit access to PTMs to prevent server overload and service disruption, which makes it harder to collect a large number of PTMs.
- We want to preserve the data long-term to support the scientific replicability of our work.
We want researchers to have easy access to PTM packages, so we created PTMTorrent (link at the top).
Our data collection workflow is shown next. We identified several model registries, some of which have been covered in prior studies of the PTM supply chain. We built tooling to obtain PTM packages from each model registry and map them to the corresponding GitHub repositories.
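Our actual tooling is in the artifact linked at the top. As a sketch of the retrieval step it performs on Hugging Face, the Hub client can snapshot an entire PTM package (the repository name below is just an example):

```python
# Sketch: download one complete PTM package (weights, config, model card).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bert-base-uncased")
print("Package files saved under:", local_dir)
```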
We extracted metadata from each PTM package into a unified data schema:
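The exact schema is documented in the PTMTorrent paper and repository; the sketch below only illustrates the idea, and the field names are ours rather than the schema's.

```python
# Illustrative (not exact) shape of a unified PTM metadata record.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PTMPackageRecord:
    model_hub: str                     # which registry the package came from
    package_name: str                  # registry-level identifier
    downloads: Optional[int] = None    # popularity signal, when reported
    tags: List[str] = field(default_factory=list)  # task / framework labels
    github_repo: Optional[str] = None  # mapped source repository, if any
    model_card: Optional[str] = None   # README / documentation text
```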
Here is a summary of the data we collected:
Example uses: We used the dataset to measure risk mitigation in the HuggingFace model registry, including measurements of dependencies, documentation, and GPG commit signing. We also used the metadata to identify model discrepancies and build a “maintainers’ reach” graph.
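As an example of the last analysis, a “maintainers’ reach” graph connects maintainers to the packages they control. A minimal sketch of how such a graph could be assembled with networkx (the records are made-up placeholders, not real data):

```python
# Sketch: a bipartite "maintainers' reach" graph linking maintainers to packages.
import networkx as nx

records = [  # illustrative (maintainer, package) pairs
    ("org_a", "org_a/model-1"),
    ("org_a", "org_a/model-2"),
    ("user_b", "user_b/model-3"),
]

G = nx.Graph()
for maintainer, package in records:
    G.add_node(maintainer, kind="maintainer")
    G.add_node(package, kind="package")
    G.add_edge(maintainer, package)

# A maintainer's "reach" is the number of packages they can influence.
reach = {n: G.degree(n) for n, d in G.nodes(data=True) if d["kind"] == "maintainer"}
print(reach)
```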
Conclusions
Compared to traditional package registries, DL model registries have three main differences:
We suggest several future research directions, including model auditing, optimizing model registry infrastructure, improving PTM standardization, and developing adversarial attack detectors for PTM packages. Aspects of PTM reuse differ from traditional package reuse. We need more research on PTM packages!