Kingfisher: a fast and flexible tool for downloading from public databases

hotlong_1998 · Published 2024-02-16 14:43:47

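Kingfisher is a fast and flexible program for downloading sequence files (and their metadata annotations) from public data sources, including the European Nucleotide Archive (ENA), NCBI SRA, Amazon AWS, and Google Cloud. Its input is one or more "Run" accessions, e.g. DRR001970, or a BioProject accession, e.g. PRJNA621514 or SRP260223.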
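To make the accession types above a little more concrete, here is a minimal sketch that sorts an identifier by its prefix. The function name `classify_accession` is hypothetical, and the patterns follow the usual INSDC naming conventions, which is an assumption on my part. This is not how Kingfisher itself validates its input.

```shell
# Classify an accession string by its prefix (illustrative helper only).
# Prints "run", "bioproject", "study", or "unknown".
classify_accession() {
    if printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'; then
        echo "run"
    elif printf '%s\n' "$1" | grep -Eq '^PRJ[NED][A-Z][0-9]+$'; then
        echo "bioproject"
    elif printf '%s\n' "$1" | grep -Eq '^[SED]RP[0-9]+$'; then
        echo "study"
    else
        echo "unknown"
    fi
}

# The three example accessions mentioned in the text.
for acc in DRR001970 PRJNA621514 SRP260223; do
    printf '%s\t%s\n' "$acc" "$(classify_accession "$acc")"
done
```

A check like this can catch typos in a long accession list before any download starts.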

Official documentation:

https://wwood.github.io/kingfisher-download/

Installation

Installing via the Docker container is recommended:

docker pull wwood/kingfisher:0.4.1
# After the pull completes, check that the image runs
docker run wwood/kingfisher:0.4.1 annotate --full-help

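Kingfisher usage

The tool has two main modes: "get" mode downloads sequence data, while "annotate" mode downloads metadata.

`annotate` mode: fetching the metadata table

In general, all raw sequencing data from a publication are stored under a single project accession in NCBI or another database (a BioProject in NCBI), so we can use that accession to download all of the data at once. For example:

Look up the study's BioProject accession in GEO:

Use the "annotate" subcommand to fetch a subset of the project's metadata from NCBI.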

docker run wwood/kingfisher:0.4.1 annotate -p PRJNA758280
# Or use the converted Singularity container
singularity exec kingfisher_0.4.1 kingfisher annotate -p PRJNA758280

To get the complete, detailed annotation, combine the following three options:

docker run -v `pwd`:/data wwood/kingfisher:0.4.1 \
 annotate -p PRJNA758280 \
 --all-columns -f tsv -o /data/PRJNA758280.metainfo.tsv

--all-columns: return the more complete set of metadata columns.
-f: output format, one of CSV, TSV, JSON, feather, or parquet.
-o: path of the output file to write.
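Once the metadata table has been written as TSV, a single column can be pulled out by header name. The sketch below is a generic helper, not part of Kingfisher. The function name `tsv_column`, the file name `metadata.tsv`, and the column name `run` in the usage comment are assumptions, so check the header of your own output file first.

```shell
# Print one column of a tab-separated file, selected by its header name.
# Usage: tsv_column FILE COLUMN_NAME
tsv_column() {
    awk -F '\t' -v col="$2" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { print $idx }
    ' "$1"
}

# Hypothetical usage, assuming the table has a column named "run":
#   tsv_column metadata.tsv run > runs.txt
```

The resulting file has one accession per line, which is the layout that --run-identifiers-list expects.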

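`get` mode: downloading and converting sequence data from public databases

In the "get" subcommand, Kingfisher downloads data from a series of redundant sources, trying them in order until one succeeds. The downloaded data is then converted as needed into the requested SRA / FASTQ / FASTA / GZIP output formats. Both the download and the extraction stage can be faster than with NCBI's SRA tools. In particular, downloading from ENA means fetching FASTQ files directly, so no extraction step is needed.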
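The "try each source in order until one succeeds" behaviour can be pictured with a small shell loop. This is only a sketch of the idea, not Kingfisher's actual code: `try_methods`, `DOWNLOAD_CMD`, and `fake_download` are hypothetical names, and the stand-in downloader simply pretends that `ena-ascp` is unavailable.

```shell
# Try each download method in order; print the first one that succeeds.
# DOWNLOAD_CMD is a hypothetical hook: a command taking METHOD and RUN
# that returns 0 on success and non-zero on failure.
# Usage: try_methods RUN METHOD [METHOD ...]
try_methods() {
    tm_run="$1"; shift
    for tm_method in "$@"; do
        if "$DOWNLOAD_CMD" "$tm_method" "$tm_run"; then
            echo "$tm_method"
            return 0
        fi
        echo "method $tm_method failed, trying next" >&2
    done
    return 1
}

# Stand-in downloader for demonstration only: every method except
# ena-ascp "succeeds". No data is actually downloaded.
fake_download() {
    [ "$1" != "ena-ascp" ]
}

DOWNLOAD_CMD=fake_download
try_methods DRR001970 ena-ascp ena-ftp prefetch   # prints "ena-ftp"
```

The order of the method list therefore matters: put the source you expect to be fastest and most reliable first.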

Download the sequencing data for an entire BioProject:

docker run -v `pwd`:/data wwood/kingfisher:0.4.1 \
 get -p PRJNA531115 --download-threads 8 --output-directory /data/PRJNA531115 \
 --download-methods prefetch ena-ascp ena-ftp aws-http aws-cp gcp-cp

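--download-threads: number of download threads.

--output-directory: directory to download into.

--download-methods: download methods (sources) to try, in the order given.

The get mode supports the following download methods: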

method	description
`ena-ascp`	Download `.fastq.gz` files from ENA using Aspera, which can then be further converted. This is the fastest method since no `fasterq-dump` is required.
`ena-ftp`	Download `.fastq.gz` files from ENA using `curl`, which can then be further converted. This is relatively fast since no `fasterq-dump` is required.
`prefetch`	Download the `.sra` file using NCBI `prefetch` from sra-tools, which is then extracted with `fasterq-dump`.
`aws-http`	Download .SRA file from AWS Open Data Program using `aria2c` with multiple connection threads, which is then extracted with `fasterq-dump`.
`aws-cp`	Download .SRA file from AWS using `aws s3 cp`, which is then extracted with fasterq-dump. Does not usually require payment or an AWS account.
`gcp-cp`	Download .SRA file from Google Cloud using `gsutil`, which is then extracted with fasterq-dump. Requires payment and a Google Cloud account.

In my own tests, the prefetch method turned out to be faster.

Other options:

-r: download a specific SRA run.

--run-identifiers-list: a file listing run accessions (e.g. SRR numbers), one per line.
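A run list file for --run-identifiers-list can be prepared with a few lines of shell. The helper below is a sketch: the function name `write_run_list` and the file name `runs.txt` are hypothetical, and the commented command only combines options already described in this post, so verify it against `--full-help` for your version.

```shell
# Write run accessions to a list file, one per line, dropping blank
# entries and duplicates while keeping the original order.
# Usage: write_run_list OUTFILE RUN [RUN ...]
write_run_list() {
    wrl_out="$1"; shift
    printf '%s\n' "$@" | awk 'NF && !seen[$0]++' > "$wrl_out"
}

# Hypothetical usage (the docker call is left as a comment so that
# sourcing this snippet does not start a download):
#   write_run_list runs.txt DRR001970
#   docker run -v `pwd`:/data wwood/kingfisher:0.4.1 \
#    get --run-identifiers-list /data/runs.txt \
#    --output-directory /data/runs --download-methods ena-ftp prefetch
```

Writing the list inside the directory mounted at /data keeps it visible to the container.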
