---
title: "Introduction to MetaCycle"
author: "Gang Wu, Xavier Li, Matthew Carlucci, Ron Anafi, Michael Hughes, Karl Kornacker, and John Hogenesch"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to MetaCycle}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
The **MetaCycle** package is mainly used for detecting rhythmic signals from large scale time-series data. Depending on features of each time-series data, **MetaCycle** incorporates [ARSER(ARS)](https://github.com/cauyrd/ARSER), [JTK_CYCLE(JTK)](http://openwetware.org/wiki/HughesLab:JTK_Cycle), and [Lomb-Scargle(LS)](https://academic.oup.com/bioinformatics/article/22/3/310/220284/) properly for periodic signal detection, and it could also output integrated analysis results if required. 

This vignette introduces implementation of method selection and integration steps of **MetaCycle**, which are not explained in detail in help files. For looking at how to use two main functions--**meta2d** and **meta3d** of this package, please see 'Examples' part of each function's help file. 

The **MetaCycle** source code is available on [github](https://github.com/gangwug/MetaCycle).

## DATASETS
### Two main categories
A typical time-series dataset from a non-human organism is a two-dimensional matrix. Each row indicates one molecular's profile along time, and all molecule at any one time point are detected from the same sample. It is usually not necessary to keep track of which individual organism does a sample come from. For ease of explanation, we named this kind of dataset as 2D time-series dataset. Take the time-series transcriptome dataset from mouse liver as an example.

```{r}
library(MetaCycle)
head(cycMouseLiverRNA[,1:5])
```

For time-series datasets from human, it is usually essential to keep track of the subject identity and group for each sample. In this case, one matrix stores experimental values of detected molecule from all samples, while another matrix is necessary to store subject identity and group for each sample. This kind of dataset is named as 3D time-series dataset. For example, a time-series dataset from human blood is shown as below. 

The subject information matrix:

```{r}
set.seed(100)
row_index <- sample(1:nrow(cycHumanBloodDesign), 4)
cycHumanBloodDesign[row_index,]
```

The corresponding experimental values:

```{r}
sample_id <- cycHumanBloodDesign[row_index,1]
head(cycHumanBloodData[,c("ID_REF", sample_id)])
```

A 3D time-series dataset could be divided into multiple 2D time-series datasets, with all experimental values for an individual under the same treatment in one 2D time-series dataset. For example, we could extract all experimental values from "AF0004" under "SleepExtension" into one 2D time-series dataset.

```{r}
group_index <- which(cycHumanBloodDesign[, "group"] == "SleepExtension")
cycHumanBloodDesignSE <- cycHumanBloodDesign[group_index,]
sample_index <- which(cycHumanBloodDesignSE[, "subject"] == "AF0004")
sample_AF0004 <- cycHumanBloodDesignSE[sample_index, "sample_library"]
cycHumanBloodDataSE_AF0004 <- cycHumanBloodData[, c("ID_REF", sample_AF0004)]
head(cycHumanBloodDataSE_AF0004)
```

### Detail types of 2D time-series dataset
One kind of usual 2D time-series dataset is evenly sampled once at each time point, and the interval between neighbour time points is integer. Not all datasets are as simple as this. There are datasets with replicate samples, or with missing values, or un-evenly sampled, or sampled with a non-integer interval. Examples of these types of dataset are shown in the below table. 

Data Type                    | Point 1| Point 2| Point 3| Point 4| Point 5| Point 6
---------------------------  | -------| -------| -------| -------| -------| -------
The usual data               | CT0    | CT4    | CT8    | CT12   | CT16   | CT20  
With missing value           | CT0    | NA     | CT8    | CT12   | CT16   | CT20  
With replicates              | CT0    | CT0    | CT8    | CT8    | CT16   | CT16  
With un-even interval        | CT0    | CT2    | CT8    | CT10   | CT16   | CT20
With non-integer interval    | CT0    | CT4.5  | CT9    | CT13.5 | CT18   | CT22.5  

Of course, some datasets may be a combination of two or more of above types of data.

Data Type                             | Point 1| Point 2| Point 3| Point 4| Point 5| Point 6
------------------------------------  | -------| -------| -------| -------| -------| -------
With replicates and missing value     | CT0    | CT0    | CT8    | NA     | CT16   | CT16  
With un-even interval and replicates  | CT0    | CT2    | CT2    | CT10   | CT16   | CT20

## METHOD SELECTION
### The Pros and Cons of meta2d methods
Each method (i.e. JTK, ARS, LS) has its pros and cons. It is ideal to know these trade-offs prior to decisions on study design. For example, the features of meta2d methods are listed in the below figure. 

```{r, echo=FALSE, warning=FALSE, out.width = "640px"}
knitr::include_graphics("./images/image1.png")
```

### Method selection based on the sampling pattern
As shown in the above figure, the accuracy of rhythmic analysis is influenced a lot by the sampling pattern. Although higher sampling resolution usually gives more accurate results, the ideal experimental design depends on the finial research aim. For example, biological replicates are important for identifying differentially expressed genes between two time points (e.g. midnight and noon). The high sampling resolution (2h/2days) is necessary for accurately reporting the percentage of rhythmic genes at genome-wide level. If the goal is to get accurate phase or period length of the rhythmic gene, much higher sampling resolution is required. A brief summary of method selection based on the sampling pattern is shown in the below figure. 

```{r, echo=FALSE, warning=FALSE, out.width = "640px"}
knitr::include_graphics("./images/image2.png")
```

**We suggest to think about method selection during the experimental design step, which will make the latter analysis much easier.** 

**Experimental design >>> statistical methods.**

### The automatic method selection strategy used by meta2d 
The **meta2d** function in **MetaCycle** is designed to analyze 2D time-series datasets, and it could automatically select proper methods to analyze different types of input datasets. The implementation strategy used for **meta2d** is shown in the flow chart (drawn with "diagram" package).

```{r, echo=FALSE, warning=FALSE, out.width = "640px"}
knitr::include_graphics("./images/image3.png")
```

For analyzing 3D time-series datasets, the **meta3d** function in **MetaCycle** is suggested. It firstly divides the input dataset into multiple 2D time-series datasets based on individual information, and then use the defined method through calling **meta2d** to analyze each divided dataset. 

## INTEGRATION
In addition to selecting proper methods to analyze different kinds of datasets, **MetaCycle** could also output integrated results. In detail, **meta2d** integrates analysis results from multiple methods and **meta3d** integrates analysis results from multiple individuals.

### Pvalue
[Fisher's method](https://en.wikipedia.org/wiki/Fisher%27s_method) is implemented in both **meta2d** and **meta3d** for integrating multiple p-values. The below formula is used to combine multiple p-values into one test statistic (X^2^). 

$$X^2_{2k} \sim -2\sum_{i=1}^k ln(p_i)$$

X^2^ has a chi-squared distribution with 2k degrees of freedom (k is the number of p-values), when all the null hypotheses are true, and each p-value is independent. The combined p-value is determined by the p-value of X^2^.

### Notes on the Fisher's method
We acknowledge that Hutchison A.L., et al. pointed out the p-value integration issue using Fisher's method in **meta2d** function in two papers ([*Correcting for Dependent P-values in Rhythm Detection*](https://www.biorxiv.org/content/10.1101/118547v1.abstract) and [*Bootstrapping and empirical Bayes methods improve rhythm detection in sparsely sampled data*](https://journals.sagepub.com/doi/abs/10.1177/0748730418789536)). There is a similar issue mentioned in the [MetaCycle](https://academic.oup.com/bioinformatics/article/32/21/3351/2415176) paper discussing about the performance of N-version programming (NVP). *"In some cases, NVP may not give better results than the single most suitable method, but it will rarely give the worst results. For example, in analyzing time-series datasets with high temporal resolution (every 1 h over 2 days; Supplementary Fig.S3), NVP does better than ARS but not as well as LS or JTK (Under these conditions, all three methods do relatively well but have similar mode failure, while ARS has relatively more false positive observations.)."*

Hutchison A.L., et al. also mentioned in the paper, *"While the combinations of the methods integrated improperly with the Fisher method appear to outperform the individual methods at p-values below typical significance cutoffs (Figs. 4C andD, dark gray), once the p-values are accurately calculated using the Brown method, the combined methods (Figs. 4C and D, light gray) underperform the individual methods for low p-values."*. The advantage of Fisher method comparing with single method is compreshensively evaluated in the [MetaCycle](https://academic.oup.com/bioinformatics/article/32/21/3351/2415176) paper using different sampling patterns and rhythmic curves. We agree that Fisher's method is not ideal. But it shows its power in practice. So in the current version of **MetaCycle**, Fisher's method is still the default selection. We will update it immediately whenever we find an integration method that is beautiful in theory and powerful in practice.

As we know, each method has its pros and cons. There is not a perfect rhythmic detection method for all time-series datasets. As more rhythmic detection methods are published in coming years, more comprehensive evaluation works are required in this field. In our opinion, one of the most important thing is to generate a series of gold standard evaluation datasets. Multiple methods should be evaluated on the same datasets. Otherwise, it is not an apples to apples comparison when claiming one method is better than others. 

### Period and phase
The integrated period from **MetaCycle** is an arithmetic mean value of multiple periods, while phase integration based on [mean of circular quantities](https://en.wikipedia.org/wiki/Mean_of_circular_quantities) is implemented in **meta2d** and **meta3d**. The detailed steps are as below.

* convert phase values to polar coordinates $\alpha_j$
* convert polar coordinates to cartesian coordinates ($cos\alpha_j$, $sin\alpha_j$)
* compute the arithmetic mean of these points and its corresponding polar coordinate $\bar{\alpha}$
$$\bar{\alpha} = atan2(\frac{\sum_{j=1}^n sin\alpha_j}{n}, \frac{\sum_{j=1}^n cos\alpha_j}{n})$$
* convert the resulting polar coordinate to a integrated phase value

```{r, warning=FALSE}
# given three phases
pha <- c(0.9, 0.6, 23.6)
# their corresponding periods
per <- c(23.5, 24, 24.5)
# mean period length
per_mean <- mean(per)
# covert to polar coordinate
polar <- 2*pi*pha/per
# get averaged ploar coordinate
polar_mean <- atan2(mean(sin(polar)), mean(cos(polar)))
# get averaged phase value
pha_mean <- per_mean*polar_mean/(2*pi)
pha_mean
```

### Amplitude calculation
**meta2d** recalculates the amplitude with the following model:

$$Y_i = B + TRE*(t_i - \frac{\sum_{i=1}^n t_i}{n}) + A*cos(2*\pi*\frac{t_i - PHA}{PER})$$

where $Y_i$ is the observed value at time $t_i$; B is baseline level of the time-series profile; TRE is trend level of the time-series profile; A is the amplitude of the waveform. PHA and PER are integrated period and phase mentioned above. In this model, only B, TRE and A are unknown parameters, which could be calculated with the ordinary least squares (OLS) method. The baseline and trend level are explained in the below example.

```{r, echo=FALSE, warning=FALSE, fig.width=6.65, fig.height=5}
getAMP <- function(expr, per, pha, tim=18:65)
{ 
    trendt <- tim - mean(tim[!is.na(tim) & !is.nan(tim)])
    cost <- cos(2*pi/per*(tim - pha))
    fit <- lm(expr~trendt + cost)
    fitcoef <- fit$coefficients
    basev <- fitcoef[1]
    trendv <- fitcoef[2]
    ampv <- fitcoef[3]
    fitexp <- basev + trendv*trendt + ampv*cost
    outL <- list("base"=basev, "trend"=trendv, "amp"=ampv, "fit"=fitexp)
    return(outL)
}

cirD <- cycVignettesAMP
ampL <- getAMP(expr=as.numeric(cirD[1,24:71]), per=cirD[1, "meta2d_period"], pha=cirD[1, "meta2d_phase"])

lay<-layout(cbind(1, 2), widths=c( lcm(cm(4.5)), lcm(cm(1.5)) ), heights=lcm(cm(4.5)) )
par(mai=c(0.65,0.6,0.4,0.05),mgp=c(2,0.5,0),tck=-0.01)
xrange <- c(18, 65)
yrange <- c(200, 2350)
plot(18:65, cirD[1,24:71], type="b", xlim=xrange, ylim=yrange, xlab="Circadian time(CT)", ylab="Expression value",  main=cirD[1,1], cex.main=1.2)
par(new=T)
plot(18:65, ampL[[4]], type="b", xlim=xrange, ylim=yrange, col="red", xlab="", ylab="", main="")
abline(h=ampL[[1]], lty=3, col="purple", lwd=1.5)
lines(18:65, 500+ampL[[2]]*(18:65-mean(18:65)), lty=4, col="orange", lwd=1.5)
legend("topleft", legend=c("Raw value", "OLS fitted value"), col=c("black", "red"), pch=1, bty="n")
legend("topright", legend=c("Baseline", "Trend"), col=c("purple", "orange"), lty=c(3, 4), lwd=1.5, bty="n" )

par(mai=c(0.5,0.05,0.4,0.1),mgp=c(2,0.3,0),tck=-0.01);
plot(x=NULL,y=NULL,xlim=c(0,10),ylim=c(0,10),type="n", xaxt="n",yaxt="n",bty="n",xlab="",ylab="",main="")
text(rep(1,3), c(8, 5, 2), c("Base = ", "Trend = ", "AMP = "), adj=0)
text_value <- unlist(ampL)
text(rep(6,6), c(8, 5, 2), round(text_value[1:3], 1), adj=0)
```

In addition, **meta2d** also outputs a relative amplitude value (rAMP), which could be easily taken as the ratio between amplitude and baseline (if |B| >= 1). The amplitude value is associated with the general expression level, which indicates highly expressed genes may always have larger amplitude than lowly expressed genes. The rAMP may be used to compare the amplitude values among genes with different expression levels. For example, *Ugt2b34* has a larger amplitude than *Arntl*, but its rAMP is smaller than *Arntl*.

```{r, echo=FALSE, warning=FALSE, fig.width=6.65, fig.height=5}
cirD <- cycVignettesAMP
cirM <- as.matrix(cirD[2:3, 24:71])
expmax <- apply(cirM, 1, max)
cirM <- cirM/expmax

lay<-layout(cbind(1, 2), widths=c( lcm(cm(4.5)), lcm(cm(1.5)) ), heights=lcm(cm(4.5)) )
par(mai=c(0.65,0.6,0.4,0.05),mgp=c(2,0.5,0),tck=-0.01)
xrange <- c(18, 65)
yrange <- c(0, 1)
colname <- c("red", "blue")
grey_trans <- rgb(191/255,191/255,191/255,0.65);

par(mai=c(0.65,0.6,0.2,0.05),mgp=c(2,0.5,0),tck=-0.01)
plot(NULL,NULL,xlim=xrange,ylim=yrange,xaxt="n",yaxt="n",xlab="Circadian time(CT)",ylab="Exp/Max", main="");
rect_xL <- c(18, 36, 60)
rect_yL <- c(24, 48, 65)
rect(rect_xL, rep(-0.1, 3), rect_yL, rep(1.1,3), col=grey_trans, border=NA, bty="n")

for (i in 1:2)
{
  loessD <- data.frame(expd=as.numeric(cirM[i,]),tp=18:65);
  exploess <- loess(expd~tp, loessD, span = 0.2);
  expsmooth <- predict(exploess, data.frame(tp=18:65));
  lines(18:65,expsmooth,lwd=1.2,col=colname[i]);
}

xpos <- c(seq(18,60,by=6), 65)
axis(side=1,at=xpos,labels=xpos,mgp=c(0,0.2,0),tck=-0.01,cex.axis=0.8)
ypos <- seq(0, 1, by=0.2)
axis(side=2,at=ypos,labels=ypos,mgp=c(0,0.2,0),tck=-0.01,cex.axis=0.8)

par(mai=c(0.5,0.05,0.2,0.1),mgp=c(2,0.3,0),tck=-0.01)
plot(x=NULL,y=NULL,xlim=c(0,10),ylim=c(0,10),type="n", xaxt="n",yaxt="n",bty="n",xlab="",ylab="",main="")
lines(c(0.2,2.3), c(9.5,9.5), col="blue", lwd=1.5)
text(c(5,0,0), c(9.5, 8, 6.5), c("Ugt2b34", "AMP = ", "rAMP = "), col="blue", adj=0)
text(c(5,5), c(8, 6.5), round(as.numeric(cirD[3, 22:23]), 2), col="blue", adj=0)

lines(c(0.2,2.3), c(4.5,4.5), col="red", lwd=1.5)
text(c(5,0,0), c(4.5, 3, 1.5), c("Arntl", "AMP = ", "rAMP = "), col="red", adj=0)
text(c(5,5), c(3,1.5), round(as.numeric(cirD[2, 22:23]), 2), col="red", adj=0)
```

Based on the calculated baseline, amplitude and relative amplitude values by **meta2d**, **meta3d** calculates the corresponding integrated values with arithmetic mean of multiple individuals in each group.