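STAR's twopassMode option
If you have only one sample, adding --twopassMode Basic is generally recommended. It gives more accurate alignment around splice junctions and also helps downstream variant calling. The official documentation explains it as follows: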
Annotated junctions will be included in both the 1st and 2nd passes. To run STAR 2-pass mapping for each sample separately, use --twopassMode Basic option. STAR will perform the 1st pass mapping, then it will automatically extract junctions, insert them into the genome index, and, finally, re-map all reads in the 2nd mapping pass.
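With multiple samples, however, running twopassMode on every sample wastes time. The usual approach is to align all samples once, collect the SJ.out.tab files they produce, rebuild the reference genome index with those junctions, and then align all samples again in batch.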
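## Note: the FASTA and the GTF must come from the same assembly. As written, mm10.fa is a mouse genome
## while gencode.v25lift37 is a human (GRCh37) annotation, so adjust the paths to a matching pair.
~/biosoft/STAR/STAR-2.5.3a/bin/Linux_x86_64/STAR --runMode genomeGenerate \
  --genomeDir second_index \
  --genomeFastaFiles ~/reference/genome/mm10/mm10.fa \
  --sjdbGTFfile ~/reference/gtf/gencode/gencode.v25lift37.annotation.gtf \
  --sjdbFileChrStartEnd all_raw_star.tab --runThreadN 4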
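The post does not show how all_raw_star.tab was built from the first-pass SJ.out.tab files. The following is a minimal sketch of one way to do it, not the author's actual command. The function name merge_sj, the MIN_UNIQ variable, and the filters (novel junctions only, at least 3 uniquely mapped reads, chrM dropped) are assumptions for illustration; the column meanings follow the SJ.out.tab layout described in the STAR manual.

```shell
#!/usr/bin/env bash
# merge_sj OUT IN1 [IN2 ...]
# Merge first-pass junctions from several samples into one file
# that can be passed to --sjdbFileChrStartEnd.
merge_sj() {
    local out=$1
    shift
    # SJ.out.tab columns used here:
    #   1 chromosome, 2 intron start, 3 intron end, 4 strand,
    #   6 annotation flag (0 = not in the GTF), 7 uniquely mapped read count
    # Assumed filters: skip chrM, keep novel junctions with >= MIN_UNIQ unique reads.
    awk -F'\t' -v OFS='\t' -v min="${MIN_UNIQ:-3}" \
        '$1 != "chrM" && $6 == 0 && $7 >= min { print $1, $2, $3, $4 }' "$@" \
        | LC_ALL=C sort -u > "$out"
}
```

It could be called as `merge_sj all_raw_star.tab *SJ.out.tab`. Annotated junctions are dropped because they already enter the index through --sjdbGTFfile; loosen or remove the filters if you would rather keep every junction.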
Then align again against the new index:
$star --runThreadN 5 --genomeLoad LoadAndKeep --limitBAMsortRAM 13045315604 \
  --outSAMtype BAM SortedByCoordinate --genomeDir $second_index \
  --readFilesCommand zcat --readFilesIn $fq1 $fq2 --outFileNamePrefix ${sample}_
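To run this second-pass command over a whole batch, a loop along the following lines might work. It is only a sketch: the function name second_pass_all, the DRY_RUN switch, and the file layout (`<sample>_1.fq.gz` / `<sample>_2.fq.gz` in one directory) are hypothetical. Because --genomeLoad LoadAndKeep leaves the genome in shared memory, the sketch ends with a --genomeLoad Remove call to release it.

```shell
#!/usr/bin/env bash
# second_pass_all STAR_BINARY INDEX_DIR FASTQ_DIR
# Assumed (hypothetical) layout: FASTQ_DIR/<sample>_1.fq.gz and <sample>_2.fq.gz
# With DRY_RUN=1 the commands are printed instead of executed.
second_pass_all() {
    local star=$1 index=$2 fastq_dir=$3
    local fq1 fq2 sample
    for fq1 in "$fastq_dir"/*_1.fq.gz; do
        [ -e "$fq1" ] || continue              # glob matched nothing
        fq2=${fq1%_1.fq.gz}_2.fq.gz            # mate file of the same sample
        sample=$(basename "$fq1" _1.fq.gz)
        ${DRY_RUN:+echo} "$star" --runThreadN 5 --genomeLoad LoadAndKeep \
            --limitBAMsortRAM 13045315604 \
            --outSAMtype BAM SortedByCoordinate --genomeDir "$index" \
            --readFilesCommand zcat --readFilesIn "$fq1" "$fq2" \
            --outFileNamePrefix "${sample}_"
    done
    # Unload the genome that LoadAndKeep kept in shared memory
    ${DRY_RUN:+echo} "$star" --genomeLoad Remove --genomeDir "$index"
}
```

Running `DRY_RUN=1 second_pass_all "$star" second_index ./fastq` first is a cheap way to check the generated commands before committing hours of compute.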
A normal (single-pass) alignment of one sample looks like this:
## Sequencing data:
6.7G Dec 12 15:55 clean.1.fq.gz
6.6G Dec 12 18:03 clean.2.fq.gz
## Alignment command. The software, the reference genome, and its index must already be installed.
$star --runThreadN 5 --genomeDir $hg19_star_index --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate \
  --readFilesIn $fq1 $fq2 --outFileNamePrefix ${sample}_star ## --alignEndsType EndToEnd
The output files after alignment:
14G  Dec 30 22:38 DH01_starAligned.sortedByCoord.out.bam
1.9K Dec 30 22:38 DH01_starLog.final.out
21K  Dec 30 22:38 DH01_starLog.out
4.6K Dec 30 22:38 DH01_starLog.progress.out
8.1M Dec 30 22:38 DH01_starSJ.out.tab
Timing of the alignment:
Dec 30 21:43:03 ..... started STAR run
Dec 30 21:43:03 ..... loading genome
Dec 30 21:49:28 ..... started mapping
Dec 30 22:27:57 ..... started sorting BAM
Dec 30 22:38:20 ..... finished successfully
The two-pass alignment is:
$star --runThreadN 5 --genomeDir $hg19_star_index --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate \
  --twopassMode Basic --outReadsUnmapped None --chimSegmentMin 12 \
  --chimJunctionOverhangMin 12 --alignSJDBoverhangMin 10 --alignMatesGapMax 100000 \
  --alignIntronMax 100000 --chimSegmentReadGapMax 3 --alignSJstitchMismatchNmax 5 -1 5 5 \
  --readFilesIn $fq1 $fq2 --outFileNamePrefix ${sample}_star ## --alignEndsType EndToEnd
## Note that this uses considerably more memory
It uses more memory and also takes longer:
Dec 31 09:33:17 ..... started STAR run
Dec 31 09:33:17 ..... loading genome
Dec 31 09:38:55 ..... started 1st pass mapping
Dec 31 10:14:14 ..... finished 1st pass mapping
Dec 31 10:14:29 ..... inserting junctions into the genome indices
Dec 31 10:19:55 ..... started mapping
Dec 31 11:24:54 ..... started sorting BAM
Dec 31 11:31:11 ..... finished successfully
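As the logs show, the basic alignment took just under an hour (about 55 minutes), while the two-pass alignment took about 2 hours (1 h 58 min).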
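The durations can be checked directly from the first and last timestamps of each log. This is a small sketch; the helper name elapsed is made up here, and it assumes both timestamps fall on the same day, which holds for the two runs in this post.

```shell
#!/usr/bin/env bash
# elapsed START END
# Both arguments are HH:MM:SS on the same day; prints the difference as H:MM:SS.
elapsed() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        # convert both timestamps to seconds since midnight and subtract
        d = (b[1]*3600 + b[2]*60 + b[3]) - (a[1]*3600 + a[2]*60 + a[3])
        printf "%d:%02d:%02d\n", int(d / 3600), int((d % 3600) / 60), d % 60
    }'
}

elapsed 21:43:03 22:38:20   # single-pass run, prints 0:55:17
elapsed 09:33:17 11:31:11   # two-pass run, prints 1:57:54
```

The same helper applied to the intermediate log lines shows where the extra time goes: the first pass alone (09:38:55 to 10:14:14) takes about 35 minutes, and the second mapping pass (10:19:55 to 11:24:54) about 65 minutes.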
The resulting files:
6.7G Dec 12 15:55 clean.1.fq.gz
6.6G Dec 12 18:03 clean.2.fq.gz
12G  Dec 31 11:31 DH01_starAligned.sortedByCoord.out.bam
126M Dec 31 11:25 DH01_starChimeric.out.junction
874M Dec 31 11:25 DH01_starChimeric.out.sam
1.9K Dec 31 11:31 DH01_starLog.final.out
24K  Dec 31 11:31 DH01_starLog.out
12K  Dec 31 11:31 DH01_starLog.progress.out
8.0M Dec 31 11:31 DH01_starSJ.out.tab
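I enabled chimeric detection (the --chim* options), which is why there are a few extra output files (Chimeric.out.junction and Chimeric.out.sam). They are mainly there in preparation for finding fusion genes.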
