DESeq2 normalization vs VST vs rlog (2024)

Jonas B.

@jonas-b-14652

Belgium, Antwerp, University of Antwerp

Hi all,

after consulting the manual on data normalization, I have one question left to ask:

The way I see it, there are 4 ways described to obtain normalized data:

The first one is to extract the counts normalized either by normalization factors (a gene x sample matrix) or by size factors (a single number per sample). This can be done using the following code:
counts(dds, normalized=TRUE)
The second way is to apply a log2 transformation, log2(n + 1), to the normalized counts, using the following function:
normTransform(dds)
The third and fourth ways are the VST and rlog transformations, using the following functions respectively:
vst(dds, blind=FALSE)
rlog(dds, blind=FALSE)
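
For illustration, here is a minimal R sketch of the four calls side by side. It assumes DESeq2 is installed and uses simulated data from makeExampleDESeqDataSet as a stand-in for a real dds object; the variable names are hypothetical.

```r
library(DESeq2)

# Simulated data standing in for a real DESeqDataSet (assumption, not the poster's data)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)
dds <- estimateSizeFactors(dds)

# Way 1: size-factor-normalized counts, returned as a plain matrix on the count scale
norm_counts <- counts(dds, normalized = TRUE)

# Way 2: log2(normalized count + 1), returned as a DESeqTransform object
ntd <- normTransform(dds)

# Ways 3 and 4: variance stabilizing and regularized log transformations
vsd <- vst(dds, blind = FALSE)
rld <- rlog(dds, blind = FALSE)

# The transformed matrices are accessed with assay()
head(assay(vsd), 3)

# normTransform is the same as taking log2(n + 1) of the normalized counts by hand
max(abs(assay(ntd) - log2(norm_counts + 1)))
```

Note that only the first call returns a matrix directly; the other three return objects whose values are extracted with assay().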

When I just got started, I used the first function, counts(dds, normalized=TRUE), to obtain the normalized data, which I later used for clustering etc. However, I now doubt that this was the correct decision. I suspect that the normalized data obtained this way are only used during the DE analysis, and that for clustering the second, third, or fourth way is preferred.

I was hoping that one of you could share a more expert opinion on which normalization to use, and on whether counts(dds, normalized=TRUE) is a viable option as well.

Thank you a lot in advance.

Kind regards,
Jonas

Posted 4.8 years ago by Jonas B.; updated 18 months ago by Jiahua

As a side note: I did find a recent question addressing normalization (https://support.bioconductor.org/p/123651/). However, it leaves unanswered whether I could also use the counts function (I guess it's wrong, but I am not sure; maybe it is still usable) and which one is most commonly used or advised. Any opinions shared are much appreciated!

Reply posted 4.8 years ago by Jonas B.

It came to mind that counts(dds, normalized=TRUE) might already return log2-transformed data. (This is not described, however, at https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/counts)
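
One way to check this directly is to compare the output with a manual division of the raw counts by the size factors. The sketch below assumes DESeq2 is installed and uses simulated data rather than a real experiment; `manual` and `norm` are hypothetical names.

```r
library(DESeq2)

# Simulated stand-in for a real DESeqDataSet (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 200, m = 4)
dds <- estimateSizeFactors(dds)

raw  <- counts(dds)                       # integer counts
norm <- counts(dds, normalized = TRUE)    # normalized counts

# Divide each column (sample) of the raw counts by that sample's size factor
manual <- sweep(raw, 2, sizeFactors(dds), "/")

# If this is TRUE, the normalized counts are raw counts / size factor, with no log applied
all.equal(manual, norm, check.attributes = FALSE)

# Values remain on the count scale
range(norm)
```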

Reply posted 4.8 years ago by Jonas B.

Dear Michael, thank you for your time and answers. It's all clear now.

Kind regards,
Jonas

Reply posted 4.8 years ago by Jonas B.

Guandong Shang

@shangguandong1996-21805

China

In my opinion, the main role of the normalization factors (your first way) is in DE analysis: you have to normalize your counts to account for sequencing or biological factors before you do DE analysis. In practice, you can extract the normalized counts to show to your collaborators, or plot the expression trend of a single gene. But you cannot use these normalized counts directly for operations like heatmaps, hierarchical clustering, or k-means, because genes differ by orders of magnitude.

For vst or rlog, the main role is described in the DESeq2 paper:

The results, shown in Additional file 1: Figure S17, revealed that when the size factors were equal for all samples, the Poisson distance and the Euclidean distance of rlog-transformed or VST counts outperformed other methods. However, when the size factors were not equal across samples, the rlog approach generally outperformed the other methods. Finally, we note that the rlog transformation provides normalized data, which can be used for a variety of applications, of which distance calculation is one.

When you do an operation based on distance calculation (maybe some machine learning application?), you can choose vst or rlog, or even log(normCount + 1). You can also plot single-gene expression trends with these values, but I do not recommend it, because they have less biological meaning than normCount. For PCA, clustering, or k-means, however, they are more suitable than normCount.
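
As a rough sketch of a distance-based operation on transformed data, the R code below computes Euclidean distances between samples from VST values and clusters them hierarchically. It assumes DESeq2 is installed and uses simulated data; `sample_dist` and `hc` are hypothetical names.

```r
library(DESeq2)

# Simulated stand-in for a real DESeqDataSet (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# Variance stabilizing transformation; size factors are estimated if missing
vsd <- vst(dds, blind = FALSE)

# dist() works on rows, so transpose to get sample-to-sample distances
sample_dist <- dist(t(assay(vsd)))

# Hierarchical clustering of the samples on those distances
hc <- hclust(sample_dist)
plot(hc, main = "Samples clustered on VST values")
```

The same pattern applies with assay(rlog(dds)) or assay(normTransform(dds)) in place of the VST matrix.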

But I also have a question. Some people use z-scaled normCount for k-means or heatmaps. I am wondering what the pros and cons are of z-scaled normCount compared with vst or rlog?

Posted 3.1 years ago by Guandong Shang

Z-scaled normalized counts are not variance stabilized with respect to the systematic trend; z-scaling just forces every SD to 1, whether the variance across samples is predominantly made up of shot noise or signal (DE). I don’t recommend unit scaling all genes without first having removed low biological signal genes from the matrix under study.
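
A small base R sketch can make this point concrete. The toy matrix is invented for illustration (two hypothetical genes, six samples), and the variance filter at the end is only one possible way to drop low-signal genes, not a method stated in this thread.

```r
set.seed(42)

# Hypothetical toy values on a log-like scale: 2 genes x 6 samples
noise_gene  <- rnorm(6, mean = 8, sd = 0.05)                        # no group difference
signal_gene <- c(rnorm(3, mean = 6, sd = 0.05),
                 rnorm(3, mean = 10, sd = 0.05))                    # clear group difference
mat <- rbind(noise_gene, signal_gene)

# Row-wise z-score: scale() works on columns, so transpose before and after
z <- t(scale(t(mat)))

apply(mat, 1, sd)   # spreads differ greatly before scaling
apply(z, 1, sd)     # both rows have SD 1 after scaling

# One possible filter (assumption): keep only the most variable gene(s) before scaling
keep <- order(apply(mat, 1, var), decreasing = TRUE)[1]
rownames(mat)[keep]
```

After scaling, the noise-only gene contributes as much spread to a heatmap or k-means as the gene with a real group difference, which is why filtering first matters.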

Reply posted 3.1 years ago by Michael Love

guyho

@guyho-15677

Israel

Hi,

I hope it is fine to add my question here. I did rlog and VST followed by PCA. The experimental design has two factors with 3 levels each, and there are 45 samples. With rlog, the PCA clustered 43 samples together and 2 samples were outliers. With the VST, the PCA plot corresponds to the experimental design. My questions are: what can I learn from this result about the data, and can I use this information to improve the differential analysis? I upload the PCA images below.

Thanks in advance,

Guy

Posted 3.1 years ago by guyho

Note: you posted this as an “Answer” to the top Question not a comment.

I recommend the VST in general. Depending on how you ran the code, the rlog may be over-shrinking the changes between groups.

Reply posted 3.1 years ago by Michael Love

Thank you very much for the prompt reply. I thought my question was related to this post. I can move it to be a comment on the original post if that is more appropriate.

I ran the default DESeq pipeline, then I did the transformations and PCAs with and without blinding (which did not affect the PCA results). Clearly the VST is better here, but does it make any difference to the differential analysis? For example, should I be concerned about the two samples that are outliers in the rlog?
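
A comparison along these lines could be sketched as follows. This is not the poster's code: it assumes DESeq2 is installed, uses simulated data with a single `condition` factor, and the object names are hypothetical.

```r
library(DESeq2)

# Simulated stand-in for a real DESeqDataSet (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 2000, m = 6)

# VST with and without blinding to the design, and rlog without blinding
vsd_blind   <- vst(dds, blind = TRUE)
vsd_unblind <- vst(dds, blind = FALSE)
rld_unblind <- rlog(dds, blind = FALSE)

# returnData = TRUE gives the PC coordinates instead of drawing the plot
pca_vst_blind   <- plotPCA(vsd_blind,   intgroup = "condition", returnData = TRUE)
pca_vst_unblind <- plotPCA(vsd_unblind, intgroup = "condition", returnData = TRUE)
pca_rlog        <- plotPCA(rld_unblind, intgroup = "condition", returnData = TRUE)

# Fraction of variance explained by the plotted components
attr(pca_vst_unblind, "percentVar")
attr(pca_rlog, "percentVar")

# Draw one of the plots
plotPCA(vsd_unblind, intgroup = "condition")
```

With a two-factor design, both factor names would be passed to intgroup as a character vector.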

Reply posted 3.1 years ago by guyho

No difference I think. I’m not convinced those are outliers. You can use plotCounts on DE genes for further inspection.
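
As an illustration of that kind of inspection, the sketch below runs the default pipeline on simulated data and plots the counts for the gene with the smallest adjusted p-value. It assumes DESeq2 is installed; the data, the choice of gene, and the names `top_gene` and `d` are assumptions made for the example.

```r
library(DESeq2)

# Simulated stand-in for a real DESeqDataSet, with some true DE genes (assumption)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 500, m = 6, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)
res <- results(dds)

# Pick the gene with the smallest adjusted p-value (which.min ignores NA values)
top_gene <- rownames(res)[which.min(res$padj)]

# Plot normalized counts per sample, grouped by condition
plotCounts(dds, gene = top_gene, intgroup = "condition")

# Or return the underlying data frame for custom plotting
d <- plotCounts(dds, gene = top_gene, intgroup = "condition", returnData = TRUE)
d
```

Samples that look like outliers on a PCA can be checked here by seeing whether their counts for DE genes sit far from the rest of their group.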

Reply posted 3.1 years ago by Michael Love
