METHODS
GATK-Methods
----DNAseq
DNAseq OVERVIEW
Pre-processing
Local Realignment around Indels
Base Quality Score Recalibration
Base Quality Score Recalibration (BQSR)
Variant Discovery Overview
Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode
How the HaplotypeCaller's reference confidence model works
Suggested Preliminary Analyses
Variant Filtering with VQSR
Variant Quality Score Recalibration (VQSR)
Genotype Refinement
Which tools use pedigree information?
Purpose and operation of Read-backed Phasing
Functional Annotation
Adding Genomic Annotations Using SnpEff and VariantAnnotator
Where can I get a gene list in RefSeq format?
Variant Evaluation
Selecting variants of interest from a callset
Combining variants from different files into one
Using VariantEval
DNAseq OVERVIEW
Pre-processing
Local Realignment around Indels
RealignerTargetCreator / IndelRealigner
For a complete, detailed argument reference, refer to the GATK documentation pages for RealignerTargetCreator and IndelRealigner.
Running the IndelRealigner only at known sites
While we advocate running the IndelRealigner over an aggregated BAM using the full Smith-Waterman alignment algorithm, it will also work on just a single lane of sequencing data when run in known-sites-only mode (--consensusDeterminationModel KNOWNS_ONLY). Novel sites obviously won't be cleaned up, but the majority of a single individual's short indels will already have been seen in dbSNP and/or 1000 Genomes. One would employ the known-only/lane-level realignment strategy in a large-scale project (e.g. 1000 Genomes) where computation time is severely constrained. The example arguments below are modified to reflect the command lines needed for known-only/lane-level cleaning.
The RealignerTargetCreator step needs to be done just once for a single set of indels; as long as the set of known indels doesn't change, the output.intervals file produced below never needs to be recalculated.
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R /path/to/reference.fasta \
-o /path/to/output.intervals \
-known /path/to/indel_calls.vcf
The IndelRealigner step needs to be run on every BAM file.
java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
-jar /path/to/GenomeAnalysisTK.jar \
-I \
-R \
-T IndelRealigner \
-targetIntervals \
-o \
-known /path/to/indel_calls.vcf \
--consensusDeterminationModel KNOWNS_ONLY \
-LOD 0.4
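Because the target intervals are computed once and then reused for every BAM, the per-file step lends itself to a small loop. The following is a minimal Python sketch that only builds the command lines. It uses the flags shown above; the BAM paths, the `.realigned.bam` output naming, and the helper name `indel_realigner_command` are illustrative assumptions, not part of GATK.

```python
from pathlib import PurePosixPath


def indel_realigner_command(bam, reference, intervals, known_indels,
                            gatk_jar="/path/to/GenomeAnalysisTK.jar",
                            tmpdir="/path/to/tmpdir"):
    """Build the argument list for one known-sites-only IndelRealigner run."""
    bam = PurePosixPath(bam)
    # Hypothetical naming scheme: lane1.bam -> lane1.realigned.bam
    out_bam = bam.with_name(bam.stem + ".realigned.bam")
    return [
        "java", "-Xmx4g", f"-Djava.io.tmpdir={tmpdir}",
        "-jar", gatk_jar,
        "-I", str(bam),
        "-R", reference,
        "-T", "IndelRealigner",
        "-targetIntervals", intervals,
        "-o", str(out_bam),
        "-known", known_indels,
        "--consensusDeterminationModel", "KNOWNS_ONLY",
        "-LOD", "0.4",
    ]


# Hypothetical lane-level BAMs; the same intervals file is reused for each one.
lane_bams = ["/data/lane1.bam", "/data/lane2.bam"]
commands = [
    indel_realigner_command(
        bam,
        reference="/path/to/reference.fasta",
        intervals="/path/to/output.intervals",
        known_indels="/path/to/indel_calls.vcf",
    )
    for bam in lane_bams
]

for cmd in commands:
    # To actually execute a command, pass it to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Building the commands as lists rather than shell strings avoids quoting problems if a path contains spaces.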
Base Quality Score Recalibration
Base Quality Score Recalibration (BQSR)
Detailed information about command-line options for BaseRecalibrator can be found in the GATK tool documentation.
Introduction
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
New with the release of the full version of GATK 2.0 is the ability to recalibrate not only the well-known base quality scores but also base insertion and base deletion quality scores. These are per-base quantities which estimate the probability that the next base in the read was mis-incorporated or mis-deleted (due to slippage, for example). We've found that these new quality scores are very valuable in indel calling algorithms. In particular, these new probabilities fit very naturally as the gap penalties in HMM-based indel calling algorithms. We suspect there are many other fantastic uses for these data.
This process is accomplished by analyzing the covariation among several features of a base. For example:
∙ Reported quality score
∙ The position within the read
∙ The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
For example, before recalibration a file could contain only reported Q25 bases, which seems good. However, it may be that these bases actually mismatch the reference at a 1 in 100 rate, so they are actually Q20. These higher-than-empirical quality scores provide false confidence in the base calls. Moreover, as is common with sequencing-by-synthesis machines, base mismatches with the reference occur at the end of the reads more frequently than at the beginning. Mismatches are also strongly associated with sequencing context, in that the dinucleotide AC is often of much lower quality than TG. The recalibration tool will not only correct the average Q inaccuracy (shifting from Q25 to Q20) but also identify subsets of high-quality bases, by separating the low-quality AC bases at the end of the read from the high-quality TG bases at the start of the read. See below for examples of pre- and post-correction values.
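The Q25 versus Q20 arithmetic above follows from the standard Phred scale, Q = -10 * log10(p), where p is the error probability. A minimal Python sketch of the conversion (the function names are illustrative, not GATK identifiers):

```python
import math


def phred_from_error_rate(p_error):
    """Phred-scaled quality for an error probability: Q = -10 * log10(p)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_rate_from_phred(q):
    """Error probability implied by a Phred-scaled quality: p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


reported_q = 25
empirical_q = phred_from_error_rate(1 / 100)  # bases mismatch at a 1 in 100 rate

print(f"reported Q{reported_q} implies an error rate of "
      f"{error_rate_from_phred(reported_q):.5f}")
print(f"an observed 1 in 100 mismatch rate corresponds to Q{empirical_q:.0f}")
```

A reported Q25 claims roughly one error in 316 bases, so bases that actually mismatch once in 100 are overstated by about five Phred units.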
The system was designed so that users can easily add new covariates to the calculations. Users wishing to add their own covariate can look at QualityScoreCovariate.java for an idea of how to implement the required interface. Each covariate is a Java class which implements the org.broadinstitute.sting.gatk.walkers.recalibration.Covariate interface. Specifically, the class needs to define a getValue method which looks at the read and associated sequence context and pulls out the desired information, such as machine cycle.
Running the tools
BaseRecalibrator
This GATK processing step walks over all of the reads in my_reads.bam and tabulates data about the following features of the bases:
∙ read group the read belongs to
∙ assigned quality score
∙ machine cycle producing this base
∙ current base + previous base (dinucleotide)
For each bin, we count the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population according to dbSNP. After running over all reads, BaseRecalibrator produces a file called my_reads.recal_data.grp, which contains the data needed to recalibrate reads. The format of this GATK report is described below.
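The counting step can be illustrated with a small Python sketch. It is a simplification under stated assumptions: each base is represented by a hypothetical `Base` record, all four features are combined into a single joint bin, and known variant sites are supplied as a plain set. It is not the GATK implementation or its report format.

```python
from collections import defaultdict, namedtuple

# Hypothetical per-base record; the field names are illustrative only.
Base = namedtuple(
    "Base", "read_group reported_qual cycle dinucleotide site is_mismatch"
)


def tabulate(bases, known_variant_sites):
    """Count (bases, mismatches) per bin, skipping known variant sites."""
    counts = defaultdict(lambda: [0, 0])
    for b in bases:
        if b.site in known_variant_sites:
            continue  # a mismatch here may be real variation, not an error
        key = (b.read_group, b.reported_qual, b.cycle, b.dinucleotide)
        counts[key][0] += 1
        counts[key][1] += int(b.is_mismatch)
    return {key: tuple(value) for key, value in counts.items()}


bases = [
    Base("rg1", 25, 1, "TG", ("chr1", 100), False),
    Base("rg1", 25, 1, "TG", ("chr1", 200), True),
    Base("rg1", 25, 1, "TG", ("chr1", 300), True),   # known site, excluded
    Base("rg1", 25, 76, "AC", ("chr1", 175), True),
]
known_sites = {("chr1", 300)}

table = tabulate(bases, known_sites)
for key, (n_bases, n_mismatches) in sorted(table.items()):
    print(key, n_bases, n_mismatches)
```

Excluding known sites matters because a mismatch at a polymorphic locus would otherwise be counted as a sequencing error and drag the empirical quality down.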
Creating a recalibrated BAM
To create a recalibrated BAM you can use GATK's PrintReads with the engine's on-the-fly recalibration capability. Here is a typical command line to do so:
java -jar GenomeAnalysisTK.jar \
-T PrintReads \
-R reference.fasta \
-I input.bam \
-BQSR recalibration_report.grp \
-o output.bam
After computing covariates in the initial BAM file, we walk through the BAM file again and rewrite the quality scores (in the QUAL field) using the data in the recalibration_report.grp file, into a new BAM file.
This step uses the recalibration table data in recalibration_report.grp produced by BaseRecalibrator to recalibrate the quality scores in input.bam, and writes out a new BAM file, output.bam, with recalibrated QUAL field values.
Effectively the new quality score is:
∙ the sum of the global difference between reported quality scores and the empirical quality
∙ plus the quality bin specific shift
∙ plus the cycle x qual and dinucleotide x qual effect
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as SNP calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high-quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
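The additive structure of the correction can be sketched in Python as follows. This is a simplified illustration, assuming the shifts have already been computed and stored in plain dictionaries, that they are added to the reported quality, and that the result is rounded to an integer for the QUAL field. The table layout, the function name, and all numbers are hypothetical.

```python
def recalibrate_quality(reported_q, read_group, cycle, dinucleotide,
                        global_shift, qual_shift, cycle_shift, dinuc_shift):
    """Apply the additive shifts to one reported base quality."""
    # Global difference between reported and empirical quality for the read group.
    delta = global_shift.get(read_group, 0.0)
    # Shift specific to the reported-quality bin.
    delta += qual_shift.get((read_group, reported_q), 0.0)
    # Cycle x qual and dinucleotide x qual effects.
    delta += cycle_shift.get((read_group, reported_q, cycle), 0.0)
    delta += dinuc_shift.get((read_group, reported_q, dinucleotide), 0.0)
    return int(round(reported_q + delta))


# Hypothetical shift tables for a single read group.
global_shift = {"rg1": -3.0}
qual_shift = {("rg1", 25): -1.5}
cycle_shift = {("rg1", 25, 76): -1.0, ("rg1", 25, 1): 2.0}
dinuc_shift = {("rg1", 25, "AC"): 0.7, ("rg1", 25, "TG"): 3.0}

late_ac = recalibrate_quality(25, "rg1", 76, "AC",
                              global_shift, qual_shift, cycle_shift, dinuc_shift)
early_tg = recalibrate_quality(25, "rg1", 1, "TG",
                               global_shift, qual_shift, cycle_shift, dinuc_shift)
print(late_ac, early_tg)
```

With these made-up numbers, two bases that both started at Q25 end up with different qualities depending on cycle and context, which is how the recalibrated scores become more widely dispersed.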
Miscellaneous information
∙ The recalibration system is read-group aware. It separates the covariate data by read group in the recalibration_report.grp file (using @RG tags), and PrintReads will apply this data for each read group in the file. We routinely process BAM files with multiple read groups. Please note that the memory requirements scale linearly with the number of read groups in the file, so files with many read groups could require a significant amount of RAM to store all of the covariate data.
∙ A critical determinant of the quality of the recalibration is the number of observed bases and mismatches in each bin. The system will not work well on a small number of aligned reads. We usually expect well in excess of 100M bases from a next-generation DNA sequencer per read group. 1B bases yields significantly better results.
∙ Unless your database of variation is so poor and/or variation so common in your organism that most of your mismatches are real SNPs, you should always perform recalibration on your BAM file. For humans, with dbSNP and now 1000 Genomes available, almost all of the mismatches, even in cancer, will be errors, and an accurate error model (essential for downstream analysis) can be ascertained.
∙ The recalibrator applies a "Yates" correction for low-occupancy bins. Rather than inferring the true Q score from #mismatches / #bases, we actually infer it from (#mismatches + 1) / (#bases + 2). This deals very nicely with overfitting problems: the correction has only a minor impact on data sets with billions of bases but is critical to avoid overconfidence in rare bins in sparse data.
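The effect of the low-occupancy correction can be seen by comparing the two estimators directly. A minimal Python sketch, where the function name and the example counts are illustrative assumptions:

```python
import math


def empirical_quality(n_mismatches, n_bases, smoothed=True):
    """Phred-scaled empirical quality of a bin, with or without the +1/+2 correction."""
    if smoothed:
        p_error = (n_mismatches + 1) / (n_bases + 2)
    else:
        if n_bases == 0:
            raise ValueError("raw estimate is undefined for an empty bin")
        if n_mismatches == 0:
            return math.inf  # raw estimate claims the bin is error free
        p_error = n_mismatches / n_bases
    return -10.0 * math.log10(p_error)


# Sparse bin: 10 bases, no mismatches observed.
sparse_raw = empirical_quality(0, 10, smoothed=False)
sparse_smoothed = empirical_quality(0, 10)

# Well-populated bin: one billion bases at a 1 in 1000 mismatch rate.
dense_raw = empirical_quality(1_000_000, 1_000_000_000, smoothed=False)
dense_smoothed = empirical_quality(1_000_000, 1_000_000_000)

print(f"sparse bin: raw={sparse_raw}, smoothed={sparse_smoothed:.2f}")
print(f"dense bin:  raw={dense_raw:.4f}, smoothed={dense_smoothed:.4f}")
```

In the sparse bin the raw estimate is infinitely confident after only ten observations, while the corrected estimate stays near Q11. In the dense bin the two estimates are practically identical.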
Example pre- and post-recalibration results
∙ Recalibration of a lane sequenced at the Broad