10x Genomics BAM Data Processing

Introduction

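A large share of the 10x data downloaded from public databases is stored as BAM files. Downstream analysis, however, generally starts from FASTQ (or FASTA) files, so the BAM has to be converted back to FASTQ. 10x Genomics provides bamtofastq, a tool that converts 10x BAMs produced by Cell Ranger, Space Ranger, Cell Ranger ATAC, Cell Ranger DNA, and Long Ranger back into FASTQ files that can be used as input for re-running an analysis.

Note that the FASTQ files produced by this tool do not preserve the original read order. Also, the bamtofastq discussed here is not the bamtofastq subcommand of bedtools; it is a converter that 10x Genomics developed specifically for its own 10x-type BAMs, and it only works on 10x Genomics data.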

Installation

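10x Genomics provides both prebuilt binaries and source code on GitHub; downloading a prebuilt binary is recommended. Builds are available for Linux and macOS.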

The releases page

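The downloaded binary must be made executable before it will run: chmod +x bamtofastq (the broader chmod 777 bamtofastq also works, but it grants write access to every user).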
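As a minimal sketch of the permission step, the helper below adds the execute bit for the file owner and confirms that it took effect. The function name make_executable and the example path are made up for illustration; the sketch assumes the binary has already been downloaded.

```shell
#!/usr/bin/env bash
# make_executable is a hypothetical helper name, not part of bamtofastq.
# It adds the execute bit for the owner and verifies the result.
make_executable() {
    local bin="$1"
    if [ ! -f "$bin" ]; then
        echo "not a file: $bin" >&2
        return 1
    fi
    chmod u+x "$bin" || return 1
    [ -x "$bin" ]
}

# Example (the path is an assumption):
#   make_executable ./bamtofastq && ./bamtofastq --help
```

Granting only the owner execute permission is usually enough for a tool that is run from a personal working directory.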

Getting Started

The header of a 10x Genomics BAM should look like this:

$ samtools view -H 4279_B_run619_cellranger_possorted_bam.bam
@HD VN:1.5 SO:coordinate
@SQ SN:chrY LN:20833752
@SQ SN:chr6 LN:27818160
@SQ SN:chr1 LN:45499013
@SQ SN:chr4 LN:24553389
@SQ SN:chr5 LN:18269195
@SQ SN:chr3 LN:25835044
@SQ SN:chrX LN:22138634
@SQ SN:chr2 LN:29246250
@RG ID:4279_B_run619_cellranger:LibraryNotSpecified:1:H7WNYBBXY:5 SM:4279_B_run619_cellranger LB:LibraryNotSpecified.1 PU:4279_B_run619_cellranger:LibraryNotSpecified:1:H7WNYBBXY:5 PL:ILLUMINA
@PG PN:crdna-stages ID:crdna-stages VN:0.1.0
@PG ID:samtools PN:samtools PP:crdna-stages VN:1.13 CL:samtools view -H 4279_B_run619_cellranger_possorted_bam.bam
@CO 10x_bam_to_fastq:R1(CR:CY,SEQ:QUAL)
@CO 10x_bam_to_fastq:R2(SEQ:QUAL)
@CO 10x_bam_to_fastq:I1(BC:QT)

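@HD is the standard BAM header line; the @SQ lines list the reference sequences (chromosomes) and their lengths; @RG identifies the read group the reads came from; the @CO lines are 10x-specific comments that describe how to recover the original FASTQ sequences from the BAM records.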
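Before converting, it can help to confirm that a BAM actually carries these @CO lines. The sketch below reads SAM header text on stdin, reports whether any 10x_bam_to_fastq comment is present, and lists the reads it declares. The helper names has_10x_header and list_10x_reads are hypothetical, and the pattern assumes the comment lines look like the ones in the header above.

```shell
#!/usr/bin/env bash
# has_10x_header is a hypothetical helper: it reads SAM header text on stdin
# and succeeds only if at least one 10x_bam_to_fastq @CO line is present.
has_10x_header() {
    grep -Eq '^@CO[[:space:]]+10x_bam_to_fastq:'
}

# list_10x_reads prints the read names (e.g. R1, R2, I1) declared in those lines.
list_10x_reads() {
    sed -nE 's/^@CO[[:space:]]+10x_bam_to_fastq:([A-Za-z0-9]+)\(.*$/\1/p'
}

# Example (sample.bam is a placeholder file name):
#   samtools view -H sample.bam | has_10x_header && echo "header looks convertible"
#   samtools view -H sample.bam | list_10x_reads
```

A BAM without these lines may still convert if it comes from one of the older pipelines covered by the options in the help text, so a failed check is a hint rather than a verdict.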

To view the help text:

$ ./bamtofastq -h
bamtofastq v1.4.1
10x Genomics BAM to FASTQ converter.

Tool for converting 10x BAMs produced by Cell Ranger or Long Ranger back to
FASTQ files that can be used as inputs to re-run analysis. The FASTQ files
emitted by the tool should contain the same set of sequences that were
input to the original pipeline run, although the order will not be
preserved. The FASTQs will be emitted into a directory structure that is
compatible with the directories created by the 'mkfastq' tool.

10x BAMs produced by Long Ranger v2.1+ and Cell Ranger v1.2+ contain header
fields that permit automatic conversion to the correct FASTQ sequences.

Older 10x pipelines require one of the arguments listed below to indicate
which pipeline created the BAM.

NOTE: BAMs created by non-10x pipelines are unlikely to work correctly,
unless all the relevant tags have been recreated.

NOTE: BAM produced by the BASIC and ALIGNER pipeline from Long Ranger 2.1.2 and earlier
are not compatible with bamtofastq

NOTE: BAM files created by CR < 1.3 do not have @RG headers, so bamtofastq will use the GEM well
annotations attached to the CB (cell barcode) tag to split data from multiple input libraries.
Reads without a valid barcode do not carry the CB tag and will be dropped. These reads would
not be included in any valid cell.

Usage:
bamtofastq [options] <bam> <output-path>
bamtofastq (-h | --help)

Options:

--nthreads=<n> Threads to use for reading BAM file [default: 4]
--locus=<locus> Optional. Only include read pairs mapping to locus. Use chrom:start-end format.
--reads-per-fastq=N Number of reads per FASTQ chunk [default: 50000000]
--relaxed Skip unpaired or duplicated reads instead of throwing an error.
--gemcode Convert a BAM produced from GemCode data (Longranger 1.0 - 1.3)
--lr20 Convert a BAM produced by Longranger 2.0
--cr11 Convert a BAM produced by Cell Ranger 1.0-1.1
--bx-list=L Only include BX values listed in text file L. Requires BX-sorted and index BAM file (see Long Ranger support for details).
--traceback Print full traceback if an error occurs.
-h --help Show this screen.

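Basic usage is simply bamtofastq bam_file out_dir. The output is a directory that contains a subdirectory named after the original sample; the fastq.gz files we need are inside it.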
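When several BAMs need converting, the basic command can be wrapped in a loop. The following is a sketch only: convert_all, the BAMTOFASTQ and DRY_RUN variables, and the one-output-directory-per-BAM layout are all assumptions made for illustration. It skips any output directory that already exists, on the assumption that bamtofastq expects to create the output path itself.

```shell
#!/usr/bin/env bash
# convert_all is a hypothetical wrapper, not part of bamtofastq.
# It prints (DRY_RUN=1) or runs one bamtofastq command per *.bam in a directory.
# Assumption: the binary is reachable as "$BAMTOFASTQ" (default: ./bamtofastq).
convert_all() {
    local bam_dir="$1" out_root="$2"
    local bin="${BAMTOFASTQ:-./bamtofastq}"
    local bam name out
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue          # no match: the glob stays literal
        name=$(basename "$bam" .bam)
        out="$out_root/$name"
        if [ -e "$out" ]; then
            echo "skip (output exists): $out" >&2
            continue
        fi
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$bin --nthreads=4 $bam $out"
        else
            mkdir -p "$out_root"
            "$bin" --nthreads=4 "$bam" "$out"
        fi
    done
}

# Example (directory names are placeholders):
#   DRY_RUN=1 convert_all ./bams ./fastqs
```

Running once with DRY_RUN=1 shows the exact commands before any long conversion starts.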

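Note which BAM files are supported: 10x BAMs produced by Long Ranger v2.1+ and Cell Ranger v1.2+ contain header fields that allow automatic conversion to the correct FASTQ sequences. BAMs from older 10x pipelines need an argument that indicates which pipeline created them, such as --gemcode, --lr20, or --cr11.

--reads-per-fastq=N sets the number of reads per FASTQ chunk (default: 50000000). If a BAM contains more reads than this, additional numbered FASTQ files are written to hold the rest.

For example, the output for a BAM with more reads than the chunk size: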
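The chunking rule can be sketched as a ceiling division. The helper name fastq_chunks is made up for illustration, and the result is only a rough estimate: it assumes all reads land in a single output subdirectory and that chunks are filled to exactly the configured size.

```shell
#!/usr/bin/env bash
# fastq_chunks is a hypothetical helper: ceil(total_reads / reads_per_fastq).
# The second argument defaults to bamtofastq's documented default of 50000000.
fastq_chunks() {
    local total="$1" per="${2:-50000000}"
    echo $(( (total + per - 1) / per ))
}

fastq_chunks 60000000            # prints 2
fastq_chunks 30000000            # prints 1
fastq_chunks 60000000 20000000   # prints 3
```

Under these assumptions, a BAM with 60 million reads and the default setting would yield files numbered _001 and _002 for each of I1, R1, and R2.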

$ tree 4350_A_run626_cellranger_possorted
4350_A_run626_cellranger_possorted
`-- 4350_A_run626_cellranger_LibraryNotSpecified_1_HFFW7BBXY
    |-- bamtofastq_S1_L008_I1_001.fastq.gz
    |-- bamtofastq_S1_L008_I1_002.fastq.gz
    |-- bamtofastq_S1_L008_R1_001.fastq.gz
    |-- bamtofastq_S1_L008_R1_002.fastq.gz
    |-- bamtofastq_S1_L008_R2_001.fastq.gz
    `-- bamtofastq_S1_L008_R2_002.fastq.gz

1 directory, 6 files

Output layout for a normal-sized BAM (fewer than 50000000 reads):

$ tree 4350_B_run626_cellranger_possorted
4350_B_run626_cellranger_possorted
`-- 4350_B_run626_cellranger_LibraryNotSpecified_1_HFFW7BBXY
    |-- bamtofastq_S1_L008_I1_001.fastq.gz
    |-- bamtofastq_S1_L008_R1_001.fastq.gz
    `-- bamtofastq_S1_L008_R2_001.fastq.gz

1 directory, 3 files

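*_I1_*: index reads. Each record carries a read name plus the index sequence and its quality, rebuilt from the BC and QT tags as declared in the @CO header.

*_R1_*: Read 1 (R1) reads.

*_R2_*: Read 2 (R2) reads.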
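A quick sanity check on the output is to count the records in each file set. The sketch below relies on the fact that one FASTQ record spans four lines; count_fastq_reads is a hypothetical helper name, and the check assumes plain four-line records with no wrapped sequence lines.

```shell
#!/usr/bin/env bash
# count_fastq_reads is a hypothetical helper: a FASTQ record is 4 lines,
# so the read count is the decompressed line count divided by 4.
# It sums the counts over every gzip-compressed file given as an argument.
count_fastq_reads() {
    local total=0 lines f
    for f in "$@"; do
        lines=$(gzip -dc "$f" | wc -l)
        total=$(( total + lines / 4 ))
    done
    echo "$total"
}

# Example (the directory name is a placeholder):
#   count_fastq_reads out_dir/*/bamtofastq_S1_L008_R1_*.fastq.gz
#   count_fastq_reads out_dir/*/bamtofastq_S1_L008_R2_*.fastq.gz
```

For paired data, the R1 and R2 totals would be expected to match; a mismatch suggests an incomplete or interrupted conversion.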

References

[1] https://support.10xgenomics.com/docs/bamtofastq

[2] https://github.com/10XGenomics/bamtofastq