RNA-seq Detailed Tutorial: Introduction to the Analysis Process (1)

Freezing factory 2022-11-24 23:37:19

Learning objectives

Understand the whole RNA-seq analysis process, from RNA extraction to obtaining the gene expression matrix.

1. Workflow

Differential gene expression analysis starts from a matrix of representative gene expression levels. Before running the analysis, you therefore need to know how that expression matrix is obtained.

This tutorial briefly walks through the steps that lead from raw sequencing reads to a gene expression count matrix. The flow chart below shows the whole analysis process.

[Figure: RNA-seq workflow]

2. RNA extraction and library preparation

Before RNA can be sequenced, it must be extracted and isolated from the cellular environment and prepared into a cDNA library. This involves a number of steps, described below, including quality checks to make sure the RNA is of high quality.

2.1. RNA enrichment

After treatment with DNase (to remove DNA sequences), the sample undergoes either mRNA enrichment (polyA enrichment) or rRNA depletion.

In general, ribosomal RNA accounts for most of the RNA present in a cell, whereas mRNA (messenger RNA) accounts for only a small fraction, about 2% in humans. So if we want to study protein-coding genes, we must enrich for mRNA or deplete rRNA. For differential gene expression analysis it is best to enrich for Poly(A)+, unless the goal is to obtain information about long non-coding RNAs, in which case ribosomal RNA depletion is recommended.

  • RNA quality check

Before starting cDNA library preparation, the integrity of the extracted RNA must be checked. Traditionally, RNA integrity was assessed by gel electrophoresis, by inspecting the ribosomal RNA bands, but this method is time-consuming and imprecise. Bioanalyzer systems can now assess RNA integrity quickly and compute an RNA Integrity Number (RIN), which makes RNA quality easier to interpret and more reproducible. Essentially, RIN provides a standardized way to compare RNA quality between samples.

2.2. Fragmentation

The remaining RNA molecules are fragmented. This is done by chemical, enzymatic (for example, RNases), or physical processes (such as mechanical shearing). The fragments are then size-selected to retain only those within the size range that Illumina sequencers handle best, that is, between 150 and 300 bp.

  • Fragment quality check

After size selection, the fragment size distribution should be assessed to make sure it is unimodal.

2.3. Reverse transcription

The RNA fragments are reverse transcribed into double-stranded cDNA. Information about which strand a fragment originated from can be preserved by creating a stranded library. The most common method incorporates deoxy-UTP during synthesis of the second cDNA strand. Once double-stranded cDNA fragments have been generated, sequencing adapters are ligated to their ends. (Fragment size selection can also be performed after this step.)

2.4. PCR amplification

If the amount of starting material is very low, or the number of cDNA molecules needs to be increased to an amount sufficient for sequencing, the library is usually PCR amplified. Use as few amplification cycles as possible to avoid PCR artefacts.

[Figure. Image credit: Zeng and Mortazavi, 2012]

3. Sequencing

Sequencing the cDNA library generates reads. A read corresponds to the nucleotide sequence at the end of a cDNA fragment in the library. You can choose to sequence one end of each cDNA fragment (single-end reads) or both ends (paired-end reads).

[Figure: Sequencing]
  • SE: single-end data > Read1 only
  • PE: paired-end data > Read1 + Read2
    • The result can be 2 separate FASTQ files, or a single file (containing both).
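For the single-file case, the sketch below shows one way the two reads of each pair could be separated again. It assumes the file is interleaved, meaning records alternate strictly as Read1, Read2, Read1, Read2, and that every record has exactly four lines. The function names and the example reads are made up for illustration.

```python
from typing import Iterator, List, Tuple

# One FASTQ record: header, sequence, separator, quality string.
Record = Tuple[str, str, str, str]


def read_fastq(lines: List[str]) -> Iterator[Record]:
    """Yield FASTQ records, assuming exactly four lines per read."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(lines), 4):
        yield (lines[i], lines[i + 1], lines[i + 2], lines[i + 3])


def split_interleaved(lines: List[str]) -> Tuple[List[Record], List[Record]]:
    """Split an interleaved paired-end file into Read1 and Read2 records.

    Assumption: records alternate strictly (Read1, Read2, Read1, ...).
    """
    records = list(read_fastq(lines))
    if len(records) % 2 != 0:
        raise ValueError("interleaved input must hold an even number of records")
    return records[0::2], records[1::2]


# Hypothetical two-pair example (names and sequences are invented).
interleaved = [
    "@pair1/1", "ACGT", "+", "IIII",
    "@pair1/2", "TTGA", "+", "IIII",
    "@pair2/1", "GGCA", "+", "IIII",
    "@pair2/2", "CCAT", "+", "IIII",
]
read1, read2 = split_interleaved(interleaved)
print(len(read1), len(read2))
```

With two separate FASTQ files no splitting is needed: the n-th record of the first file pairs with the n-th record of the second.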

In general, single-end sequencing is sufficient, unless reads are expected to match multiple locations in the genome (for example, in organisms with many homologous genes), an assembly is being performed, or alternative splicing is being analyzed. Note that paired-end sequencing usually costs about 2 times as much.

3.1. Sequencing by synthesis

Illumina sequencing uses the sequencing-by-synthesis method. To explore sequencing by synthesis in more depth, watch the YouTube video [1].

[Figure: Sequencing-by-synthesis]

A brief description of the steps follows:

  • Cluster growth (cluster amplification)

The DNA fragments in the cDNA library are denatured and hybridized to the flow cell. Each fragment is then clonally amplified, forming a cluster of double-stranded DNA. This step ensures that the sequencing signal is strong enough for each base of each fragment to be detected unambiguously.

Number of clusters ~= Number of reads

  • Sequencing

The ends of the fragments are sequenced using fluorophore-labelled dNTPs with reversible terminator elements. In each sequencing cycle, one base is incorporated into every cluster and emits fluorescence.

  • Image acquisition

Each dNTP has a distinct signal, which is captured by the camera.

  • Base calling

The base-calling program then evaluates the images captured over the many sequencing cycles and generates the base sequence of each fragment, that is, the read. Quality information for each base is recorded as well.

Number of sequencing cycles = Length of reads
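The two rules of thumb (clusters ~= reads, cycles = read length) can be combined into a rough estimate of how many bases a run produces. The sketch below is only back-of-the-envelope arithmetic. It assumes every cluster yields a usable read and, for paired-end runs, that each cluster is read twice with the same number of cycles per read. The function name and the example numbers are hypothetical.

```python
def expected_yield_bases(n_clusters: int, cycles_per_read: int,
                         paired_end: bool = False) -> int:
    """Rough number of sequenced bases for one run.

    Assumptions: one read per cluster for single-end runs, two reads per
    cluster for paired-end runs, and read length equal to the cycle count.
    """
    reads_per_cluster = 2 if paired_end else 1
    return n_clusters * reads_per_cluster * cycles_per_read


# Hypothetical run: 20 million clusters, 100 cycles per read.
single_end = expected_yield_bases(20_000_000, 100)
paired_end = expected_yield_bases(20_000_000, 100, paired_end=True)
print(single_end, paired_end)
```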

4. Quality control

The raw reads from the sequencer are stored as FASTQ files. FASTQ is the file format for sequence reads generated by next-generation sequencing technologies.

Each FASTQ file is a text file representing the sequence reads of a sample. Each read is represented by 4 lines, as shown below:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################

Line  Meaning
1     Always begins with "@", followed by information about the read
2     The actual DNA sequence
3     Always begins with "+", sometimes followed by the same information as line 1
4     A string of characters representing the quality scores; it must have the same length as line 2
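
To make the four-line layout concrete, here is a minimal sketch that parses one record and converts the quality characters into numeric Phred scores. It assumes the Phred+33 encoding (ASCII code minus 33), which is the usual encoding of current Illumina output; older files may use a different offset. The names `FastqRecord`, `parse_record` and `error_probability` are made up for this example, and the record is shortened to the first 10 bases of the read shown above.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: List[int]  # one Phred score per base


def parse_record(lines: List[str], offset: int = 33) -> FastqRecord:
    """Parse the 4 lines of one FASTQ record (assumes Phred+33 by default)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record has exactly 4 lines")
    header, sequence, separator, quality = (ln.rstrip("\n") for ln in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(quality) != len(sequence):
        raise ValueError("lines 2 and 4 must have the same length")
    scores = [ord(ch) - offset for ch in quality]
    return FastqRecord(header[1:], sequence, scores)


def error_probability(phred: int) -> float:
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


record = parse_record([
    "@HWI-ST330:304:H045HADXX:1:1101:1111:61397",
    "CACTTGTAAG",
    "+",
    "@?@DDDDDDH",
])
print(record.quality)
print(error_probability(30))
```

Under this encoding the "#" characters at the end of the example read decode to a score of 2, the lowest quality in that read, and they coincide with a stretch that contains many N base calls.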

FastQC is a commonly used tool that provides a simple way to run quality control checks on raw sequence data.

Its main features include:

  1. A quick overview that tells you in which areas there may be problems
  2. Summary graphs and tables to quickly assess your data
  3. Export of the results as an HTML-based report
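
To give a feel for what such a summary contains, the sketch below computes two per-position statistics of the kind a quality report shows: the mean quality score and the fraction of N base calls at each position. This is not FastQC's implementation, only an illustration. It assumes Phred+33 quality strings and reads of equal length; the function name and the example reads are invented.

```python
from typing import List, Tuple


def per_position_summary(records: List[Tuple[str, str]],
                         offset: int = 33) -> List[Tuple[float, float]]:
    """Return (mean Phred score, fraction of N calls) for each read position.

    `records` holds (sequence, quality string) pairs of equal length.
    """
    if not records:
        return []
    read_length = len(records[0][0])
    for seq, qual in records:
        if len(seq) != read_length or len(qual) != read_length:
            raise ValueError("all reads must have the same length")
    summary = []
    for pos in range(read_length):
        scores = [ord(qual[pos]) - offset for _, qual in records]
        n_calls = sum(1 for seq, _ in records if seq[pos] == "N")
        summary.append((sum(scores) / len(scores), n_calls / len(records)))
    return summary


# Two invented 4-base reads: quality drops at the last position.
reads = [("ACGN", "III#"), ("ACGT", "II5#")]
summary = per_position_summary(reads)
for position, (mean_q, n_fraction) in enumerate(summary, start=1):
    print(position, mean_q, n_fraction)
```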

5. Quantification

Once the quality of the raw reads has been explored, we can move on to quantifying expression at the transcript level. The goal of this step is to determine which transcript each read originates from and the total number of reads associated with each transcript.

The tools found to be most accurate for this step are known as lightweight alignment tools, which include:

  • Kallisto [2]
  • Sailfish [3]
  • Salmon [4]

Each of these tools works slightly differently. What they have in common is that they avoid base-to-base genomic alignment of the reads. Genomic alignment is the step performed by older alignment tools such as STAR [5] and HISAT2 [6]. Compared with those tools, lightweight alignment tools not only provide quantification estimates much faster (typically more than 20 times faster), but also with improved accuracy.

This tutorial uses the expression estimates obtained from Salmon (often called "pseudocounts") as the starting point for differential gene expression analysis.

[Figure: Salmon]
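
As a rough idea of what the Salmon step looks like in practice, the sketch below assembles the two command lines usually involved: building an index from a transcriptome FASTA file and quantifying one paired-end sample against it. The file names, directory names and thread count are placeholders, and the helper functions are hypothetical. The commands are only built and printed here; they could be run with `subprocess.run(cmd, check=True)` on a machine where Salmon is installed.

```python
import shlex
from typing import List


def salmon_index_cmd(transcripts_fasta: str, index_dir: str) -> List[str]:
    """Command that builds a Salmon index from a transcriptome FASTA file."""
    return ["salmon", "index", "-t", transcripts_fasta, "-i", index_dir]


def salmon_quant_cmd(index_dir: str, read1: str, read2: str,
                     out_dir: str, threads: int = 4) -> List[str]:
    """Command that quantifies one paired-end sample.

    "-l A" asks Salmon to detect the library type automatically.
    """
    return [
        "salmon", "quant",
        "-i", index_dir,
        "-l", "A",
        "-1", read1,
        "-2", read2,
        "-p", str(threads),
        "--validateMappings",
        "-o", out_dir,
    ]


# Placeholder paths, for illustration only.
index_cmd = salmon_index_cmd("transcripts.fa", "salmon_index")
quant_cmd = salmon_quant_cmd("salmon_index", "sample1_R1.fastq.gz",
                             "sample1_R2.fastq.gz", "quant/sample1")
print(" ".join(shlex.quote(part) for part in index_cmd))
print(" ".join(shlex.quote(part) for part in quant_cmd))
```

For single-end data, the `-1`/`-2` pair would be replaced by `-r` with a single file.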

6. Post-alignment quality control

As mentioned above, differential gene expression analysis will use the transcript/gene pseudocounts generated by Salmon. However, for some basic quality checks on the sequencing data, it is important to align the reads to the whole genome. STAR or HISAT2 can perform this step and produce BAM files that can be used for QC.

Qualimap explores the features of aligned reads in the context of the genomic regions they map to, providing an overall view of data quality (as an HTML file). The quality metrics assessed by Qualimap include:

  • DNA or rRNA contamination
  • 5'-3' bias
  • Coverage bias
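
The alignment and QC steps described in this section could be chained roughly as sketched below: STAR writes a coordinate-sorted BAM file, which is then passed to Qualimap's RNA-seq mode together with a gene annotation in GTF format. All paths, the thread count and the helper function names are placeholders chosen for this example, and a STAR genome index is assumed to exist already. The commands are only assembled, not executed.

```python
from typing import List


def star_align_cmd(genome_dir: str, read1: str, read2: str,
                   out_prefix: str, threads: int = 4) -> List[str]:
    """STAR command that aligns one paired-end sample and writes a sorted BAM."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


def qualimap_rnaseq_cmd(bam: str, gtf: str, out_dir: str,
                        paired_end: bool = True) -> List[str]:
    """Qualimap RNA-seq QC command for one BAM file."""
    cmd = ["qualimap", "rnaseq", "-bam", bam, "-gtf", gtf, "-outdir", out_dir]
    if paired_end:
        cmd.append("-pe")  # count fragments instead of single reads
    return cmd


# Placeholder paths, for illustration only.
prefix = "results/sample1_"
star_cmd = star_align_cmd("star_index", "sample1_R1.fastq",
                          "sample1_R2.fastq", prefix)
# STAR appends this suffix to the prefix for coordinate-sorted BAM output.
bam_path = prefix + "Aligned.sortedByCoord.out.bam"
qualimap_cmd = qualimap_rnaseq_cmd(bam_path, "genes.gtf",
                                   "results/qualimap_sample1")
print(" ".join(star_cmd))
print(" ".join(qualimap_cmd))
```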

7. Aggregating quality control results

Throughout the workflow we perform various quality checks on the data. These need to be done for every sample in the dataset, making sure the metrics are consistent across the samples of a given experiment. Samples that stand out should be flagged for further investigation or removed.
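
One simple way to check whether a metric is consistent across samples is to compare each sample with the median of the group. The sketch below flags samples that lie more than a chosen number of median absolute deviations (MAD) from the median. This is only one possible rule, not something the tools above prescribe; the threshold of 3, the function name and the mapping rates are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def flag_outliers(metric: Dict[str, float], threshold: float = 3.0) -> List[str]:
    """Return sample names whose value is far from the group median.

    A sample is flagged when |value - median| / MAD exceeds `threshold`.
    If the MAD is zero, every sample that differs from the median is flagged.
    """
    values = list(metric.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return sorted(name for name, v in metric.items() if v != med)
    return sorted(name for name, v in metric.items()
                  if abs(v - med) / mad > threshold)


# Invented mapping rates (%) for six samples of one experiment.
mapping_rate = {
    "ctrl_1": 91.2, "ctrl_2": 90.5, "ctrl_3": 92.0,
    "treat_1": 89.9, "treat_2": 61.3, "treat_3": 90.8,
}
flagged = flag_outliers(mapping_rate)
print(flagged)
```

A flagged sample is a candidate for a closer look at its reports, not an automatic exclusion.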

Manually tracking these metrics and browsing multiple HTML reports (FastQC, Qualimap) and log files (Salmon, STAR) for each sample is both tedious and error-prone. MultiQC aggregates the results of multiple tools and generates a single HTML report with plots, so that the various QC metrics can be visualized and compared between samples. If necessary, evaluation of the QC metrics may lead to samples being removed before proceeding to the next step.

Once QC has been performed on all samples, differential gene expression analysis with DESeq2 can begin.

[Figure: count_data]
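
The count data that differential expression analysis starts from is a matrix with one row per gene and one column per sample. The sketch below shows how per-sample estimates could be assembled into that shape. It assumes the estimates are already summarized per gene, fills genes missing from a sample with 0, and rounds to integers because DESeq2 works on integer counts. In practice this import is usually done in R, for example with the tximport package; the Python function and the numbers here are purely illustrative.

```python
from typing import Dict, List, Tuple


def build_count_matrix(
    per_sample: Dict[str, Dict[str, float]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Combine per-sample gene estimates into a genes x samples matrix.

    Returns (gene names, sample names, matrix rows in gene order).
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    matrix = [
        [int(round(per_sample[sample].get(gene, 0.0))) for sample in samples]
        for gene in genes
    ]
    return genes, samples, matrix


# Invented pseudocounts for two samples.
estimates = {
    "sample_A": {"geneX": 10.4, "geneY": 0.0},
    "sample_B": {"geneX": 25.7, "geneZ": 3.2},
}
genes, samples, matrix = build_count_matrix(estimates)
print("gene", *samples)
for gene, row in zip(genes, matrix):
    print(gene, *row)
```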



References

[1] Sequencing by synthesis: https://www.youtube.com/watch?v=fCd6B5HRaZ8

[2] Kallisto: https://pachterlab.github.io/kallisto/about

[3] Sailfish: http://www.nature.com/nbt/journal/v32/n5/full/nbt.2862.html

[4] Salmon: https://combine-lab.github.io/salmon/

[5] STAR: https://academic.oup.com/bioinformatics/article/29/1/15/272537

[6] HISAT2: https://daehwankimlab.github.io/hisat2/

Copyright statement
Author of this article: [Freezing factory]. When reprinting, please include the original link. Thank you.
