---
title: "Lab-Regression"
author: "J. Lucas McKay"
date: "2023-03-29"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = T, collapse = T, warning = F, message = F)
```
***
#### 1. Equivalence of F and t tests when there are two groups
```{r}
library(tidyverse)
# helper functions: rounded mean, sd, and sum of squares
mean0 = function(x) round(mean(x,na.rm = T),0)
sd0 = function(x) round(sd(x,na.rm = T),0)
ss0 = function(x) round(sum((x-mean(x))^2),0)
n = 50
set.seed(510)
durations = bind_rows(
tibble(Gender = "Male", Years = rnorm(n,74,10)),
tibble(Gender = "Female", Years = rnorm(n,82,10))) |>
mutate(Gender = forcats::fct_rev(factor(Gender)))
# make a dot plot
durations |> ggplot(aes(Gender,Years)) +
geom_jitter(width = 0.1) +
expand_limits(y=0)
# box plot. note that outlier.shape = NA prevents outliers from being drawn, which is useful when the raw points are plotted separately (as above) so they are not shown twice.
durations |> ggplot(aes(Gender,Years)) +
geom_boxplot(outlier.shape=NA) +
theme_minimal()
```
```{r}
# Analysis of Variance
lm(Years~Gender,data = durations) |> anova()
# Recall that by default, t.test() runs Welch's t test, which does not assume equal variances.
t.test(Years~Gender,data = durations)
# To exactly reproduce the F test, we must set var.equal = TRUE, giving the pooled-variance t test.
t.test(Years~Gender,data = durations,var.equal=T)
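# a quick numeric check (a sketch using the objects above): with two groups,
# the squared pooled-variance t statistic equals the F statistic
tt = t.test(Years~Gender,data = durations,var.equal=T)
ff = anova(lm(Years~Gender,data = durations))
c(t_squared = unname(tt$statistic^2), F = ff$`F value`[1])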
```
***
#### 2. Working with regression diagnostics
One other thing to note is that the fitted model object provides several useful pieces of functionality.
First, plotting the object produces a series of diagnostic plots.
```{r}
mod = lm(Years~Gender,data = durations)
plot(mod)
# The residual plot is probably the most important, as it lets you assess the assumptions of the linear regression.
# Note that the most extreme observations (large residuals or high leverage) are labeled by row number.
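# to draw a single panel, pass the which argument; e.g., residuals vs. fitted only:
plot(mod, which = 1)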
```
```{r}
# Let's try something with more of a dense x axis.
set = read_csv("https://jlucasmckay.bmi.emory.edu/global/bmi510/Labs-Materials/set-shifting.csv")
# This dataset contains various outcomes from a paper I did in 2018.
# https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5960619/
m_set = lm(trails_a~moca_score,data = set)
plot(m_set)
# 1. Residuals vs. fitted shows no major slope or nonlinearity, so the linear relationship is probably ok. If there were a parabola here, we might consider a different functional form, like a polynomial.
# 2. The Q-Q plot shows that some of the extreme values lie far from the line. To ameliorate this, you could apply a variance-stabilizing transformation, or repeat the analysis on the middle of the data and show that the slope and p value are not particularly sensitive to these cases.
# 3. Scale-Location is designed to show the homogeneity of the residual variance; I do not think it adds much beyond the first plot. It has a somewhat squashed oval shape, which you would expect if fewer x values sit at the extrema.
# 4. The residuals vs. leverage plot, with its Cook's distance contours, identifies observations exerting undue leverage on the fit. In this case the Cook's distance cutoffs fall outside the plotting region, so we cannot see them.
# Let's try another relationship that clearly should not be a linear regression
x = set$moca_score
y = set$trails_a + 17*set$trails_b^3
plot(lm(y~x))
x = set$moca_score
# add a step at trails_b > 60 (the parentheses matter; without them, > would compare the entire sum to 60)
y = set$trails_a + 17*set$trails_b^3 + 5*(set$trails_b > 60)
plot(lm(y~x))
# We would guess that overall cognitive performance would predict psychomotor speed.
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth()
# To get a clean linear fit, we set se = F and set the smoothing method to "lm".
set |> ggplot(aes(moca_score,trails_a)) + geom_point() + geom_smooth(se = F, method = "lm")
# We can also fit separate lines by group; here, color by PD status.
set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm")
# Note the effect of the fullrange argument: the fitted lines now extend across the full range of the plot rather than stopping at each group's data.
set |> ggplot(aes(moca_score,trails_a,color = factor(haspd))) + geom_point() + geom_smooth(se = F, method = "lm", fullrange=T)
```
A note - we will get to interaction terms soon. How would we test whether the presence of PD modifies the relationship between trails_a and MoCA score?
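As a preview, here is a minimal sketch of such a test (assuming `haspd` codes PD status, as in the plots above): the `moca_score:factor(haspd)` interaction term estimates how the slope differs between groups.
```{r}
# preview sketch: the interaction term tests whether PD status modifies
# the slope of trails_a on MoCA score
broom::tidy(lm(trails_a ~ moca_score * factor(haspd), data = set))
```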
***
#### 3. Extracting elements of models
```{r}
mod = lm(trails_a~moca_score,data = set)
# extract coefficients
betas = coef(mod)
# extract variances of the coefficients - vcov() returns the full covariance matrix, so pull the diagonal
beta_vars = diag(vcov(mod))
beta_ses = sqrt(beta_vars)
t_stats = betas / beta_ses
# these are distributed as a t with n-2 degrees of freedom
p_vals = 2 * pt(abs(t_stats), df = df.residual(mod), lower.tail = FALSE)
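# check the hand computations against the full coefficient table
summary(mod)$coefficients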
# you can access the residuals in two ways
residuals(mod)
mod$residuals
# what will this be?
mean(mod$residuals)
# either way is sufficient to calculate the residual sum of squares
rss = sum(mod$residuals^2)
# we can extract the degrees of freedom for the residuals as well
df.residual(mod)
# the residual standard error is the square root of RSS over its degrees of freedom
rse = sqrt(rss/df.residual(mod))
# how could we get the SSMOD?
sst = sum((set$trails_a - mean(set$trails_a))^2)
# notice that there are two missing values in this vector, which lm has already accommodated
sst = sum((set$trails_a - mean(set$trails_a,na.rm=T))^2,na.rm=T)
# we can get the model sum of squares by subtracting the residuals from the total.
ssmod = sst - rss
# compare to ANOVA
anova(mod)
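# reconstructing the F statistic by hand matches the ANOVA table:
# F = (SSMOD/df_model) / (RSS/df_residual), with df_model = 1 here
(ssmod/1) / (rss/df.residual(mod))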
```
***
#### 4. R^2 and the correlation coefficient
```{r}
# nomenclature note: the residual sum of squares (RSS) is also written SSR, for "sum of squared residuals"
ssr = rss
r_squared = ssmod/sst
cor(set$trails_a,set$moca_score, use = "complete.obs")
cor(set$trails_a,set$moca_score, use = "complete.obs")^2
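# for simple linear regression, R^2 equals the squared Pearson correlation;
# summary() reports the same value (up to rows dropped for missingness)
summary(mod)$r.squared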
```
***
#### 5. Confidence and prediction intervals
```{r}
mod = lm(trails_a~moca_score,data = set)
newdata = tibble(moca_score = seq(10,30,1))
# confidence interval for the mean response at each value of x
predict(mod, newdata, interval="confidence")
# prediction interval for individual new observations at each value of x
predict(mod, newdata, interval="prediction")
# plot
predict(mod, newdata, interval="confidence") |>
as_tibble() |>
bind_cols(newdata) |>
pivot_longer(-moca_score) |>
arrange(name,value) |>
ggplot(aes(moca_score,value,color = name)) +
geom_line(show.legend = F) +
theme_minimal() +
scale_color_viridis_d(end=0.8)
# the same plot with prediction intervals, which are wider than the confidence intervals
predict(mod, newdata, interval="prediction") |>
as_tibble() |>
bind_cols(newdata) |>
pivot_longer(-moca_score) |>
arrange(name,value) |>
ggplot(aes(moca_score,value,color = name)) +
geom_line(show.legend = F) +
theme_minimal() +
scale_color_viridis_d(end=0.8)
```
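Overlaying the two intervals makes the difference in width explicit. A minimal sketch, reusing `mod` and `newdata` from above (ribbon layering is one way to draw this, not the only one):
```{r}
conf_int = predict(mod, newdata, interval = "confidence") |> as_tibble() |> bind_cols(newdata)
pred_int = predict(mod, newdata, interval = "prediction") |> as_tibble() |> bind_cols(newdata)
ggplot(conf_int, aes(moca_score, fit)) +
  # the wider prediction band is drawn first, underneath
  geom_ribbon(data = pred_int, aes(ymin = lwr, ymax = upr), fill = "grey85") +
  geom_ribbon(aes(ymin = lwr, ymax = upr), fill = "grey65") +
  geom_line() +
  theme_minimal()
```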
***
#### 6. Variance-normalizing transforms
The Box-Cox transformation depends on the model at hand as well as the individual y and x data.
```{r}
mod = lm(trails_a~moca_score,data = set)
bc = MASS::boxcox(mod)
# find the value of lambda that maximizes the profile log-likelihood
bc$x[which.max(bc$y)]
# a lambda near 0 suggests a natural log transformation, which does not change the answer much aside from yielding a smaller p value
broom::tidy(lm(trails_a~moca_score,data = set))
broom::tidy(lm(log(trails_a)~moca_score,data = set))
```
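For reference, the transform itself is `(y^lambda - 1)/lambda` for nonzero lambda, with `log(y)` in the limit as lambda goes to 0. A minimal sketch applying the estimated lambda directly (`boxcox_transform` is our own helper for illustration, not part of MASS):
```{r}
# the Box-Cox transform: (y^lambda - 1)/lambda, with log(y) in the limit lambda -> 0
# (boxcox_transform is a hypothetical helper, not from MASS)
boxcox_transform = function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1)/lambda
}
lambda = bc$x[which.max(bc$y)]
broom::tidy(lm(boxcox_transform(trails_a, lambda) ~ moca_score, data = set))
```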