Downloading Hugging Face Model Repositories and Datasets Quickly and Easily

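Method 1: A CLI tool for downloading Hugging Face models and datasets with aria2/wget + git

Source: https://gist.github.com/padeoe/697678ab8e528b85a2a7bddafea1fa4f

How to use: copy hfd.sh to your machine, then follow the reference commands below to download a dataset or model.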

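🤗 Hugging Face Model Downloader

The official huggingface-cli lacks multi-threaded download support, and hf_transfer has inadequate error handling. This command-line tool therefore uses wget or aria2 to download LFS files and git clone to fetch everything else.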

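Features

⏯️ Resume from breakpoint: you can press Ctrl+C and re-run the command at any time.
🚀 Multi-threaded download: uses multiple threads to speed up the download.
🚫 File exclusion: use --exclude or --include to skip or select files, which saves time on models that ship the same weights in several formats (for example, *.bin or *.safetensors).
🔐 Authentication support: for gated models that require a Hugging Face login, authenticate with --hf_username and --hf_token.
🪞 Mirror site support: set via the HF_ENDPOINT environment variable.
🌍 Proxy support: set via the HTTPS_PROXY environment variable.
📦 Simple: depends only on git and aria2c/wget.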
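
The resume feature relies on the continue mode of the download tools (`wget -c`, `aria2c -c`), which asks the server only for the bytes that are not yet on disk. A minimal Python sketch of that idea, assuming the server honors HTTP range requests; `resume_header` is a hypothetical helper for illustration and is not part of hfd.sh:

```python
import os
import tempfile


def resume_header(path):
    """Return the HTTP Range header needed to continue a partial download.

    Hypothetical helper: returns an empty dict when nothing has been
    downloaded yet, so the whole file is requested.
    """
    size = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={size}-"} if size > 0 else {}


# Simulate a download that was interrupted after 1024 bytes.
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "model.bin")
    with open(partial, "wb") as fh:
        fh.write(b"\0" * 1024)
    print(resume_header(partial))                        # partial file on disk
    print(resume_header(os.path.join(tmp, "missing")))   # nothing downloaded yet
```

This is presumably also why hfd.sh truncates the LFS pointer files to zero bytes right after cloning: otherwise the small pointer text could be mistaken for the first bytes of the real file.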

Usage

chmod a+x hfd.sh

For convenience, you can create an alias:

alias hfd="$PWD/hfd.sh"

Usage instructions:

$ ./hfd.sh -h
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset

Download a model:

hfd bigscience/bloom-560m

Download a model that requires login:

Get a Hugging Face token from https://huggingface.co/settings/tokens, then run:

hfd meta-llama/Llama-2-7b --hf_username YOUR_HF_USERNAME_NOT_EMAIL --hf_token YOUR_HF_TOKEN
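
For gated repositories, the hfd.sh listing in this article uses the credentials in two places: the username and token are embedded in the git clone URL, and the token alone is sent as a Bearer header when the LFS files are fetched. A small Python sketch of those two pieces; the function names `clone_url` and `auth_headers` are hypothetical and only mirror the shell logic:

```python
def clone_url(endpoint, repo_id, username=None, token=None):
    """Build the git clone URL, embedding credentials when both are given."""
    if username and token:
        # Same effect as the shell expansion ${HF_ENDPOINT#https://}
        prefix = "https://"
        host = endpoint[len(prefix):] if endpoint.startswith(prefix) else endpoint
        return f"https://{username}:{token}@{host}/{repo_id}"
    return f"{endpoint}/{repo_id}"


def auth_headers(token=None):
    """Header sent with each LFS file request; empty for public repos."""
    return {"Authorization": f"Bearer {token}"} if token else {}


print(clone_url("https://huggingface.co", "meta-llama/Llama-2-7b", "myuser", "mytoken"))
print(auth_headers("mytoken"))
```

Note that git stores the clone URL, including the embedded token, in the repository's .git/config, so treat the download directory as containing a secret.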

Download a model and exclude certain files (for example, .safetensors):

hfd bigscience/bloom-560m --exclude *.safetensors
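
The filtering rule in hfd.sh is simple: an LFS file is skipped if --include is set and the file does not match it, or if --exclude is set and the file matches it. The sketch below reproduces that rule in Python, assuming that `fnmatch` wildcards are a close enough stand-in for bash pattern matching; `should_download` is a hypothetical name:

```python
from fnmatch import fnmatchcase


def should_download(filename, include=None, exclude=None):
    """Mirror the script's rule: skip files that miss --include or hit --exclude."""
    if include and not fnmatchcase(filename, include):
        return False
    if exclude and fnmatchcase(filename, exclude):
        return False
    return True


files = ["flax_model.msgpack", "model.safetensors", "onnx/decoder_model.onnx"]
for name in files:
    print(name, should_download(name, exclude="*.safetensors"))
```

Two practical notes: the filter only affects LFS files, since everything else arrives through git clone, and quoting the pattern on the command line (for example '*.safetensors') keeps the shell from expanding it against files in the current directory.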

Download with aria2c and multiple threads (these are the defaults, so no extra flags are needed):

hfd bigscience/bloom-560m

Output:
During the download, the file URLs are displayed:

$ hfd bigscience/bloom-560m --tool wget --exclude *.safetensors
...
Start Downloading lfs files, bash script:

wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/flax_model.msgpack
# wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/model.safetensors
wget -c https://huggingface.co/bigscience/bloom-560m/resolve/main/onnx/decoder_model.onnx
...
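
The URLs in this output follow a fixed pattern that hfd.sh builds from the endpoint, the repo ID, and the file path. A minimal Python sketch of the same construction; `lfs_url` is a hypothetical helper, and the fallback endpoint is the one hard-coded in the copy of the script shown in this article:

```python
import os


def lfs_url(repo_id, filename, dataset=False, endpoint=None):
    """Build the download URL for one file on the main branch."""
    if endpoint is None:
        # HF_ENDPOINT selects a mirror; this copy of hfd.sh falls back to hf-mirror.com
        endpoint = os.environ.get("HF_ENDPOINT", "https://hf-mirror.com")
    # Datasets live under a "datasets/" prefix, as in the script's --dataset branch
    prefix = "datasets/" if dataset else ""
    return f"{endpoint}/{prefix}{repo_id}/resolve/main/{filename}"


print(lfs_url("bigscience/bloom-560m", "onnx/decoder_model.onnx",
              endpoint="https://huggingface.co"))
print(lfs_url("lavita/medical-qa-shared-task-v1-toy", "data.parquet",
              dataset=True, endpoint="https://huggingface.co"))
```

The file name `data.parquet` in the second call is a placeholder, not a file known to exist in that dataset.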

# Install packages
apt update
apt-get install aria2
apt-get install iftop
apt-get install git-lfs
# Reference command
bash /xxx/xxx/hfd.sh mmaaz60/ActivityNet-QA-Test-Videos --tool aria2c -x 16 --dataset --local-dir /xxx/xxx/ActivityNet

hfd.sh

#!/usr/bin/env bash
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color

trap 'printf "${YELLOW}\nDownload interrupted. If you re-run the command, you can resume the download from the breakpoint.\n${NC}"; exit 1' INT

display_help() {
    cat << EOF
Usage:
  hfd <repo_id> [--include include_pattern] [--exclude exclude_pattern] [--hf_username username] [--hf_token token] [--tool aria2c|wget] [-x threads] [--dataset] [--local-dir path]

Description:
  Downloads a model or dataset from Hugging Face using the provided repo ID.

Parameters:
  repo_id        The Hugging Face repo ID in the format 'org/repo_name'.
  --include       (Optional) Flag to specify a string pattern to include files for downloading.
  --exclude       (Optional) Flag to specify a string pattern to exclude files from downloading.
  include/exclude_pattern The pattern to match against filenames, supports wildcard characters. e.g., '--exclude *.safetensor', '--include vae/*'.
  --hf_username   (Optional) Hugging Face username for authentication. **NOT EMAIL**.
  --hf_token      (Optional) Hugging Face token for authentication.
  --tool          (Optional) Download tool to use. Can be aria2c (default) or wget.
  -x              (Optional) Number of download threads for aria2c. Defaults to 4.
  --dataset       (Optional) Flag to indicate downloading a dataset.
  --local-dir     (Optional) Local directory path where the model or dataset will be stored.

Example:
  hfd bigscience/bloom-560m --exclude *.safetensors
  hfd meta-llama/Llama-2-7b --hf_username myuser --hf_token mytoken -x 4
  hfd lavita/medical-qa-shared-task-v1-toy --dataset
EOF
    exit 1
}

MODEL_ID=$1
shift

# Default values
TOOL="aria2c"
THREADS=4
HF_ENDPOINT=${HF_ENDPOINT:-"https://hf-mirror.com"}

while [[ $# -gt 0 ]]; do
    case $1 in
        --include) INCLUDE_PATTERN="$2"; shift 2;;
        --exclude) EXCLUDE_PATTERN="$2"; shift 2;;
        --hf_username) HF_USERNAME="$2"; shift 2;;
        --hf_token) HF_TOKEN="$2"; shift 2;;
        --tool) TOOL="$2"; shift 2;;
        -x) THREADS="$2"; shift 2;;
        --dataset) DATASET=1; shift ;;
        --local-dir) LOCAL_DIR="$2"; shift 2;;
        *) shift ;;
    esac
done

# Check if aria2, wget, curl, git, and git-lfs are installed
check_command() {
    if ! command -v $1 &>/dev/null; then
        echo -e "${RED}$1 is not installed. Please install it first.${NC}"
        exit 1
    fi
}

# Mark current repo safe when using shared file system like samba or nfs
ensure_ownership() {
    if git status 2>&1 | grep "fatal: detected dubious ownership in repository at" > /dev/null; then
        git config --global --add safe.directory "${PWD}"
        printf "${YELLOW}Detected dubious ownership in repository, mark ${PWD} safe using git, edit ~/.gitconfig if you want to reverse this.\n${NC}"
    fi
}

[[ "$TOOL" == "aria2c" ]] && check_command aria2c
[[ "$TOOL" == "wget" ]] && check_command wget
check_command curl; check_command git; check_command git-lfs

[[ -z "$MODEL_ID" || "$MODEL_ID" =~ ^-h ]] && display_help

if [[ -z "$LOCAL_DIR" ]]; then
    LOCAL_DIR="${MODEL_ID#*/}"
fi

if [[ "$DATASET" == 1 ]]; then
    MODEL_ID="datasets/$MODEL_ID"
fi
echo "Downloading to $LOCAL_DIR"

if [ -d "$LOCAL_DIR/.git" ]; then
    printf "${YELLOW}%s exists, Skip Clone.\n${NC}" "$LOCAL_DIR"
    cd "$LOCAL_DIR" && ensure_ownership && GIT_LFS_SKIP_SMUDGE=1 git pull || { printf "${RED}Git pull failed.${NC}\n"; exit 1; }
else
    REPO_URL="$HF_ENDPOINT/$MODEL_ID"
    GIT_REFS_URL="${REPO_URL}/info/refs?service=git-upload-pack"
    echo "Testing GIT_REFS_URL: $GIT_REFS_URL"
    response=$(curl -s -o /dev/null -w "%{http_code}" "$GIT_REFS_URL")
    if [ "$response" == "401" ] || [ "$response" == "403" ]; then
        if [[ -z "$HF_USERNAME" || -z "$HF_TOKEN" ]]; then
            printf "${RED}HTTP Status Code: $response.\nThe repository requires authentication, but --hf_username and --hf_token is not passed. Please get token from https://huggingface.co/settings/tokens.\nExiting.\n${NC}"
            exit 1
        fi
        REPO_URL="https://$HF_USERNAME:$HF_TOKEN@${HF_ENDPOINT#https://}/$MODEL_ID"
    elif [ "$response" != "200" ]; then
        printf "${RED}Unexpected HTTP Status Code: $response\n${NC}"
        printf "${YELLOW}Executing debug command: curl -v %s\nOutput:${NC}\n" "$GIT_REFS_URL"
        curl -v "$GIT_REFS_URL"; printf "\n${RED}Git clone failed.\n${NC}"; exit 1
    fi
    echo "GIT_LFS_SKIP_SMUDGE=1 git clone $REPO_URL $LOCAL_DIR"

    # Clone without LFS content; paths are quoted so directories with spaces work
    GIT_LFS_SKIP_SMUDGE=1 git clone "$REPO_URL" "$LOCAL_DIR" && cd "$LOCAL_DIR" || { printf "${RED}Git clone failed.\n${NC}"; exit 1; }

    ensure_ownership

    # Empty the LFS pointer files so the download tools start from byte 0.
    # The command substitution is quoted so that file names stay one per line.
    while IFS= read -r file; do
        truncate -s 0 "$file"
    done <<< "$(git lfs ls-files | cut -d ' ' -f 3-)"
fi

printf "\nStart Downloading lfs files, bash script:\ncd $LOCAL_DIR\n"
files=$(git lfs ls-files | cut -d ' ' -f 3-)
declare -a urls

while IFS= read -r file; do
    url="$HF_ENDPOINT/$MODEL_ID/resolve/main/$file"
    file_dir=$(dirname "$file")
    mkdir -p "$file_dir"
    if [[ "$TOOL" == "wget" ]]; then
        download_cmd="wget -c \"$url\" -O \"$file\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="wget --header=\"Authorization: Bearer ${HF_TOKEN}\" -c \"$url\" -O \"$file\""
    else
        download_cmd="aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
        [[ -n "$HF_TOKEN" ]] && download_cmd="aria2c --header=\"Authorization: Bearer ${HF_TOKEN}\" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c \"$url\" -d \"$file_dir\" -o \"$(basename "$file")\""
    fi
    [[ -n "$INCLUDE_PATTERN" && ! "$file" == $INCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    [[ -n "$EXCLUDE_PATTERN" && "$file" == $EXCLUDE_PATTERN ]] && printf "# %s\n" "$download_cmd" && continue
    printf "%s\n" "$download_cmd"
    urls+=("$url|$file")
done <<< "$files"

for url_file in "${urls[@]}"; do
    IFS='|' read -r url file <<< "$url_file"
    printf "${YELLOW}Start downloading ${file}.\n${NC}"
    file_dir=$(dirname "$file")
    if [[ "$TOOL" == "wget" ]]; then
        [[ -n "$HF_TOKEN" ]] && wget --header="Authorization: Bearer ${HF_TOKEN}" -c "$url" -O "$file" || wget -c "$url" -O "$file"
    else
        [[ -n "$HF_TOKEN" ]] && aria2c --header="Authorization: Bearer ${HF_TOKEN}" --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")" || aria2c --console-log-level=error --file-allocation=none -x $THREADS -s $THREADS -k 1M -c "$url" -d "$file_dir" -o "$(basename "$file")"
    fi
    [[ $? -eq 0 ]] && printf "Downloaded %s successfully.\n" "$url" || { printf "${RED}Failed to download %s.\n${NC}" "$url"; exit 1; }
done

printf "${GREEN}Download completed successfully.\n${NC}"

Method 2: Model download [personal usage notes]

This code does not preserve the directory structure; see the improved version below.

import datetime
import os
import threading

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects

# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repository to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'

    # Fetch the repository info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Resumable download command
        cmds.append(f'wget -c {url} -P {save_path}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())

Preserving the directory structure

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import hf_hub_url
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import filter_repo_objects

# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # Name of the HF repository to download
    repo_id = "Salesforce/blip2-opt-2.7b"
    # Local storage path
    save_path = './blip2-opt-2.7b'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the repository info
    _api = HfApi()
    repo_info = _api.repo_info(
        repo_id=repo_id,
        repo_type="model",
        revision='main',
        token=None,
    )

    # Collect the file list
    filtered_repo_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in repo_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_repo_files:
        # Get the download URL
        url = hf_hub_url(repo_id=repo_id, filename=file)
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command
        cmds.append(f'wget -c {url} -P {local_dir}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())
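
The difference between the two versions comes down to what `wget -P` receives. wget names the output after the last component of the URL and places it in the given directory, so passing the top-level save path flattens every file into one folder, while passing the per-file directory keeps the layout. A small sketch of that mapping; `wget_target` is a hypothetical helper, and the file name is borrowed from the bloom-560m listing earlier in the article:

```python
import posixpath


def wget_target(save_path, repo_file, keep_structure):
    """Return where `wget -P <dir> <url>` would place a repository file."""
    if keep_structure:
        # Improved version: -P points at the file's own subdirectory
        prefix = posixpath.join(save_path, posixpath.dirname(repo_file))
    else:
        # First version: -P always points at the top-level save path
        prefix = save_path
    return posixpath.join(prefix, posixpath.basename(repo_file))


print(wget_target("./bloom-560m", "onnx/decoder_model.onnx", keep_structure=False))
print(wget_target("./bloom-560m", "onnx/decoder_model.onnx", keep_structure=True))
```

With the flat layout, two files that share a base name in different subdirectories would also collide, which is another reason to prefer the second version.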

Dataset download

import datetime
import os
import threading
from pathlib import Path

from huggingface_hub import HfApi, hf_hub_url
from huggingface_hub.utils import filter_repo_objects


# Run a shell command and log its start and end times
def execCmd(cmd):
    print("Command %s started at %s" % (cmd, datetime.datetime.now()))
    os.system(cmd)
    print("Command %s finished at %s" % (cmd, datetime.datetime.now()))


if __name__ == '__main__':
    # ID of the dataset to download
    dataset_id = "openai/webtext"
    # Local storage path
    save_path = './webtext'
    # Create the local save directory
    Path(save_path).mkdir(parents=True, exist_ok=True)

    # Fetch the dataset info (the keyword argument is repo_id)
    _api = HfApi()
    dataset_info = _api.dataset_info(
        repo_id=dataset_id,
        revision='main',
        token=None,
    )

    # Collect the file list
    filtered_dataset_files = list(
        filter_repo_objects(
            items=[f.rfilename for f in dataset_info.siblings],
            allow_patterns=None,
            ignore_patterns=None,
        )
    )

    cmds = []
    threads = []

    # Build the list of commands to run
    for file in filtered_dataset_files:
        # Get the download URL; hf_hub_url with repo_type="dataset" is used here
        # because the info object does not provide a URL helper
        url = hf_hub_url(repo_id=dataset_id, filename=file, repo_type="dataset")
        # Create the matching subdirectory locally
        local_file = os.path.join(save_path, file)
        local_dir = os.path.dirname(local_file)
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        # Resumable download command
        cmds.append(f'wget -c {url} -P {local_dir}')

    print(cmds)
    print("Program started at %s" % datetime.datetime.now())

    # One thread per file
    for cmd in cmds:
        th = threading.Thread(target=execCmd, args=(cmd,))
        th.start()
        threads.append(th)

    for th in threads:
        th.join()

    print("Program finished at %s" % datetime.datetime.now())

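Shortcomings

Repositories that require authorization are not supported.

A repository with many files may spawn a large number of threads, because the scripts start one thread per file.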
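
Both shortcomings can be softened with small changes. The following is only a sketch, not the author's code: it caps concurrency with a thread pool and, assuming the server accepts a Bearer token the same way hfd.sh sends it, adds an Authorization header to each wget call. The names `build_wget_command` and `download_all`, and the `runner` parameter that makes the function testable without network access, are hypothetical:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_wget_command(url, local_dir, token=None):
    """Build a resumable wget command as an argument list (no shell quoting needed)."""
    cmd = ["wget", "-c", url, "-P", local_dir]
    if token:
        # Same header that hfd.sh passes for gated repositories
        cmd.insert(1, f"--header=Authorization: Bearer {token}")
    return cmd


def download_all(jobs, token=None, max_workers=4, runner=subprocess.call):
    """Run one command per (url, local_dir) job with at most max_workers at a time.

    Returns the exit codes in the same order as the jobs.
    """
    commands = [build_wget_command(url, local_dir, token) for url, local_dir in jobs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

In the scripts above, the (url, local_dir) pairs collected in the loop would be passed to `download_all` instead of being turned into one thread each; a non-zero exit code in the result marks a file to retry.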


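This article is reposted from: https://blog.csdn.net/qq_45934285/article/details/140792378
Copyright belongs to the original author, 旋转的油纸伞. If there is any infringement, please contact us for removal.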
