---
title: "Lab-Regression"
author: "J. Lucas McKay"
date: "2023-03-29"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = T, collapse = T, warning = F, message = F)
```


***
#### 1. Equivalence of F and t tests when there are two groups

```{r}
library(tidyverse)
mean0 = function(x) round(mean(x,na.rm = T),0)
sd0 = function(x) round(sd(x,na.rm = T),0)
ss0 = function(x) round(sum((x-mean(x))^2),0)

n = 50
set.seed(510)
durations = bind_rows(
  tibble(Gender = "Male", Years = rnorm(n,74,10)),
  tibble(Gender = "Female", Years = rnorm(n,82,10))) |> 
  mutate(Gender = forcats::fct_rev(factor(Gender)))

# make a dot plot
durations |> ggplot(aes(Gender,Years)) +
  geom_jitter(width = 0.1) +
  expand_limits(y=0)

# box plot. note that outlier.shape=NA prevents outliers from being drawn, which is useful.
durations |> ggplot(aes(Gender,Years)) +
  geom_boxplot(outlier.shape=NA) +
  theme_minimal()
```

```{r}
# Analysis of Variance
lm(Years~Gender,data = durations) |> anova()

# The default t test is actually a Welch, as you recall. 
t.test(Years~Gender,data = durations)

# In order to exactly approximate the F, we must use the "var.equal" flag.
t.test(Years~Gender,data = durations,var.equal=T)
```
***
#### 2. Working with regression diagnostics

One other thing to note is that the fitted object contains several useful pieces of functionality.
First of all, plotting the object shows several diagnostic plots.

```{r}
mod = lm(Years~Gender,data = durations)
plot(mod)
# The residual plot is probably the most important, as it allows you to assess the assumptions of the linear regression.
# Note that observations with a lot of leverage are noted by row number.
```

```{r}
# Let's try something with more of a dense x axis.
set = read_csv("https://jlucasmckay.bmi.emory.edu/global/bmi510/Labs-Materials/set-shifting.csv")

# This dataset contains various outcomes from a paper I did in 2018. 
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5960619/

m_set = lm(trails_a~moca_score,data = set)

plot(m_set)
# 1. We can see that there is no major slope or nonlinearity in residuals vs. fitted, which means that the linear relationship is probably ok. If there was a parabola here we might consider a different function for the relationship, like a polynomial.

# 2. The Q-Q plot shows that some of the extreme values are laying far from the linear relationship. To ameliorate this you could perhaps do a variance transformation, or just iterate the analysis on the middle of the data and demonstrate that your slope and p value were not particularly sensitive to these cases.

# 3. Scale-Location is really designed to show the homogeneity of the residuals, I do not think it adds too much to the first plot. This kind of has a squashed oval shape, which you would expect if fewer x values were at the extrema.

# 4. The Cook's distance plot is used to identify data that are exerting undue leverage on the relationship. In this case we cannot actually see the Cook's distance cutoffs.

# Let's try another relationship that clearly should not be a linear regression
x = set$moca_score
y = set$trails_a + 17*set$trails_b^3
plot(lm(y~x))


x = set$moca_score
y = set$trails_a + 17*set$trails_b^3+5*set$trails_b>60
plot(lm(y~x))


# We would guess that overall cognitive performance would predict psychomotor speed.
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth()


# To get a nice linear fit, we set se=F and set the model to linear.
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth(se = F, method = "lm", fullrange=T)

# To get a nice linear fit, we set se=F and set the model to linear.
set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm")

# Note function of fullrange argument
set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm", fullrange=T)


```

A note - we will get to interaction terms soon. How would we test whether the presence of PD modifies the relationship between trails_a and MoCA score?


***
#### 3. Extracting elements of models

```{r}
mod = lm(trails_a~moca_score,data = set)
# extract coefficients
betas = coef(mod)

# extract variance of coefficients - need to pull diagonal for this
beta_vars = diag(vcov(mod))
beta_ses = sqrt(beta_vars)
t_stats = betas / beta_ses
# these are distributed as a t with n-2 degrees of freedom
p_vals = 2 * pt(abs(t_stats), df = df.residual(mod), lower.tail = FALSE)

# you can access the residuals in two ways
residuals(mod)
mod$residuals

# what will this be?
mean(mod$residuals)

# either way is sufficient to calculate the residual sum of squares
rss = sum(mod$residuals^2)

# we can extract the degrees of freedom for the residuals as well
df.residual(mod)

rse = rss/df.residual(mod)

# how could we get the SSMOD?
sst = sum((set$trails_a - mean(set$trails_a))^2)

# notice that there are two missing values in this vector, which lm has already accommodated
sst = sum((set$trails_a - mean(set$trails_a,na.rm=T))^2,na.rm=T)

# we can get the model sum of squares by subtracting the residuals from the total.
ssmod = sst - rss

# compare to ANOVA
anova(mod)
```

***
#### 4. R^2 and the correlation coefficient

```{r}
# just for nomenclature purposes let's say "sum squares residuals"
ssr = rss

r_squared = ssmod/sst

cor(set$trails_a,set$moca_score, use = "com")

cor(set$trails_a,set$moca_score, use = "com")^2

```


***
#### 5. Confidence and prediction intervals.

```{r}
mod = lm(trails_a~moca_score,data = set)

newdata = tibble(moca_score = seq(10,30,1))

# confidence interval for average predicted value for each value of x
predict(mod, newdata, interval="confidence")

# confidence interval for predicted values for each value of x
predict(mod, newdata, interval="predict")


# plot
predict(mod, newdata, interval="confidence") |>
  as_tibble() |>
  bind_cols(newdata) |>
  pivot_longer(-moca_score) |> 
  arrange(name,value) |> 
  ggplot(aes(moca_score,value,color = name)) +
  geom_line(show.legend = F) +
  theme_minimal() +
  scale_color_viridis_d(end=0.8)

# cute plot
predict(mod, newdata, interval="prediction") |>
  as_tibble() |>
  bind_cols(newdata) |>
  pivot_longer(-moca_score) |> 
  arrange(name,value) |> 
  ggplot(aes(moca_score,value,color = name)) +
  geom_line(show.legend = F) +
  theme_minimal() +
  scale_color_viridis_d(end=0.8)


```


***
#### 6. Variance-normalizing transforms.

The Box-Cox transformation depends on the model at hand as well as the individual y and x data.
```{r}
mod = lm(trails_a~moca_score,data = set)
bc = MASS::boxcox(mod)

# you could find the unique value of lambda
bc$x[which.max(bc$y)]

# this is suggesting a natural log transformation, which doesn't change the answer all that much aside from resulting in a smaller p value
broom::tidy(lm(trails_a~moca_score,data = set))

broom::tidy(lm(log(trails_a)~moca_score,data = set))
```