STAT 306: Finding Relationships in Data, January 2018


credit: Rafael Irizarry

Course Information

Instructor: Harlan Campbell

TAs:

Course description

Modeling a response (output) variable as a function of several explanatory (input) variables: multiple regression for a continuous response, logistic regression for a binary response, and log-linear models for count data. Finding low-dimensional structure: principal components analysis. Cluster analysis.

This course emphasizes (i) applications of statistical methods such as multiple regression, binary regression, principal component analysis; (ii) the use of statistical software to do the computations; and (iii) interpretation of statistical analysis and output of statistical software. There is some linear algebra (with matrix representations) to show how multiple regression is computed in software, and there is some probability (mainly expected values, variances and covariances for linear combinations) to show how standard errors are determined for parameter estimates and predictions.

Course evaluation

Your grade will be calculated according to the following weighting scheme: Labs 9%, Homework 21%, Midterm 30%, Final 40%. If you miss the midterm exam, its weight is transferred to the final exam.
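The weighting scheme above amounts to a weighted average of the four components. Here is a minimal sketch in Python (the course itself uses R), assuming each component is scored out of 100. The function name `course_grade` and the sample scores are hypothetical and only illustrate the arithmetic, including the transfer of the midterm weight to the final.

```python
from typing import Optional

# Weights from the course evaluation scheme (fractions of the final grade).
WEIGHTS = {"labs": 0.09, "homework": 0.21, "midterm": 0.30, "final": 0.40}


def course_grade(labs: float, homework: float, final: float,
                 midterm: Optional[float] = None) -> float:
    """Return the overall grade (0-100) from component scores (each 0-100).

    Passing midterm=None models a missed midterm: its weight moves to the final.
    """
    weights = dict(WEIGHTS)
    if midterm is None:
        weights["final"] += weights["midterm"]
        weights["midterm"] = 0.0
        midterm = 0.0
    return (weights["labs"] * labs
            + weights["homework"] * homework
            + weights["midterm"] * midterm
            + weights["final"] * final)


# Hypothetical scores, for illustration only.
print(round(course_grade(labs=90, homework=80, midterm=70, final=60), 2))  # 69.9
print(round(course_grade(labs=90, homework=80, final=60), 2))              # 66.9
```

In the second call the midterm is missed, so the final exam counts for 70% of the grade.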

Course notes

We will be using custom course notes by Prof. Harry Joe, available at the UBC bookstore.

Course schedule




Lectures

Lecture 1 (January 4)

Introduction to Course

Lecture 2 (January 9)

Section 2.1 Least Squares for one predictor

Lecture 3 (January 11)

Residuals + Section 2.2 Statistical linear regression model

Lecture 4 (January 16)

Section 2.2 (continued) + Section 2.5 Intervals for simple linear regression

Lecture4_Rcode

Lecture 5 (January 18)

Section 2.5 (continued)

Lecture 6 (January 23)

Section 2.6 Explanation of Student t quantiles used in the interval estimates

Lecture 7 (January 25)

Section 3.1 Least squares with two or more explanatory variables

Rcode for Section 2.8 Question 2.1

Rcode for 3d plot with surface

Lecture 8 (January 30)

Section 3.4 + Section 3.6

short video on adjR2

Lecture 9 (February 1)

Section 3.6 + Section 3.7 + Section 3.8 + Section 3.9

Lecture 10 (February 6)

Section 3.9 - Categorical explanatory variables

lm_function.R

lm_function_updated.R

Short YouTube video on categorical variables in R

Lecture 11 (February 8)

Section 3.9 (recap) + quadratic terms + Section 3.10 - Partial correlation

Note: there was originally a typo on slides 44-45 that has now been corrected.

Search online for “mtcars linear regression” and you will find some great material, for example:

https://rstudio-pubs-static.s3.amazonaws.com/193417_4e1f9d5b1c6f472885fc5b03df9d4331.html

http://rstudio-pubs-static.s3.amazonaws.com/20516_29b941670a4b42688292b4bb892a660f.html

Lecture 12 (February 13)

Excellent website on Simpson’s paradox: http://vudlab.com/simpsons/

Section 3.11 - Multicollinearity (updated)

Lecture 12 R-code

Note: Lecture 12 slides were originally posted with an error on slide 14. This has now been fixed.

Lecture 13 (February 15)

Section 3.12 - Interpretations

Have a listen: podcast about causation

Have a read: Jagsi2012.pdf, Giuli2014.pdf, Haby2011.pdf, Promislow2002.pdf, Gupta2010.pdf, Fallowfield2002.pdf

Lecture 14 (February 27)

Section 3.13 - Summary for multiple regression

burnaby_condos.csv

lecture14.R

Midterm (March 1)

Here is a list of suggested questions from Chapter 2 and Chapter 3:

Chapter 2: 2.1, 2.5, 2.6, and 2.8 (for scale changes/location shifts: 2.3, 2.4, and 2.10)

Chapter 3: 3.1, 3.8, 3.9, 3.10, and 3.11

Midterm 2016

Lecture 15 (March 6)

4.1 (briefly) and 4.2

NYT article on funding for gun violence research, StarkShaw2017.pdf, gunresearchdata.csv, gunresearch.R

YouTube video on cross-validation 1, YouTube video on cross-validation 2

Lecture 16 (March 8)

4.3 Additional diagnostics (Cook’s distance, influential observations)

Great website about diagnostics

Great tips for model diagnostics in R: https://www.statmethods.net/stats/rdiagnostics.html

Lecture 17 (March 13)

4.4 Transforms and nonlinearity and 4.5 Diagnostics for data collected sequentially in time

Note for the interpretations of log-linear and linear-log models: the original lecture 17 slides were correct. My apologies for thinking that there were mistakes! Clearly, interpretation of these models can be difficult. I have now added some extra explanations and derivations on the slides. Let me know if you have any questions.

stat306_lecture17.R

price_data.csv

Good, simple explanations of how to interpret models with log-transformations:

http://kenbenoit.net/assets/courses/ME104/logmodels2.pdf

Lecture 18 (March 15)

Chapter 5: case studies

stat306_lecture18.R

HDATA.csv

carSalesPrice.csv

Lecture 19 (March 20)

6.2 Logistic regression (Part 1)

stat306_lecture19.R

Good website that explains basics of logistic regression

Lecture 20 (March 22)

6.2 Logistic regression (Part 2)

lecture20.R

Good website that explains Likelihood for logistic regression

Lecture 21 (March 27)

6.3 Logistic regression (Part 3) and Principal Component Regression (PCR)

lecture21.R

PCR in the news: https://motherboard.vice.com/en_us/article/mg9vvn/how-our-likes-helped-trump-win

Kosinski2013.pdf

wine_csv.csv

Lecture 22 (March 29)

PCA and Count response

lecture22.R

Data: wine_csv.csv

Data: carsub.csv

Lecture 23 (April 3)

6.5 Count response

lecture23.R

Lecture 24 (April 5)

Summary of regression

lecture24.R

Homework

Each homework assignment is worth 3% of your final grade. Homework can be found and completed on https://webwork.elearning.ubc.ca/ “STAT306-201_2017W2”.

Homework 1 (due Jan. 26)

Homework 2 (due Feb. 2)

Homework 3 (due Feb. 9)

Homework 4 (due Feb. 16)

Optional Questions Set 1

Homework 5 (due Mar. 15)

BCL Data Competition!

Consider data listing the prices and attributes of 465 randomly selected BCL products priced under $100. Using the data in BCL_data_available.csv, define a linear regression model (an lm() object) with the goal of making accurate predictions. Your model will be used to predict the prices of 35 BCL products that we have set aside in a secret holdout dataset.

The accuracy of the predictions will be measured by root mean squared error (RMSE). The outcome variable for your model should be “log(CURRENT_DISPLAY_PRICE)”, and the predictor variables should be limited to the variables in the BCL_data_available data frame and functions of those variables.
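To make the accuracy criterion concrete, here is a minimal sketch of an RMSE calculation in Python (the course itself uses R). It assumes the error is measured on the log-price scale, since the outcome variable is log(CURRENT_DISPLAY_PRICE); the contest's actual scoring script may differ. The function name `rmse_log_price` and the prices below are hypothetical.

```python
import math
from typing import Sequence


def rmse_log_price(actual_prices: Sequence[float],
                   predicted_log_prices: Sequence[float]) -> float:
    """Root mean squared error between log(actual price) and predicted log price."""
    if len(actual_prices) != len(predicted_log_prices):
        raise ValueError("inputs must have the same length")
    if len(actual_prices) == 0:
        raise ValueError("inputs must not be empty")
    squared_errors = [
        (math.log(price) - predicted) ** 2
        for price, predicted in zip(actual_prices, predicted_log_prices)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical holdout prices and model predictions on the log scale.
holdout_prices = [10.0, 20.0]
predictions = [math.log(10.0) + 0.3, math.log(20.0) - 0.4]
print(rmse_log_price(holdout_prices, predictions))
```

Here the two log-scale errors are 0.3 and 0.4 in absolute value, so the RMSE is the square root of (0.09 + 0.16) / 2 = 0.125.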

test.R

Sample submission: 12345678.R

Please email your submission to stat306bclcontest@gmail.com

Please email any questions about submissions to Bo Chang (bchang@stat.ubc.ca)

Results:

BCL_data_secret.csv

BCLtopten.R

New grading scheme:

RMSE < 0.45: 3.5 pts; RMSE 0.45-0.46: 3 pts; RMSE 0.46-0.51: 2.5 pts; RMSE 0.51-0.53: 2 pts; RMSE 0.53-0.55: 1.5 pts; RMSE > 0.55: 0 pts.
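The bands in the grading scheme can be read as a simple lookup from RMSE to points. The sketch below is in Python (the course itself uses R) and rests on two assumptions that the scheme does not state: the last band is taken to mean an RMSE above 0.55, and a value exactly on a boundary falls into the lower-scoring band. The function name `contest_points` is hypothetical.

```python
# (upper bound on RMSE, points awarded), in increasing order of RMSE.
BANDS = [(0.45, 3.5), (0.46, 3.0), (0.51, 2.5), (0.53, 2.0), (0.55, 1.5)]


def contest_points(rmse: float) -> float:
    """Return the points for a given RMSE under the banded grading scheme.

    Assumption: a band covers values strictly below its upper bound, so an
    RMSE exactly on a boundary receives the points of the next band.
    """
    for upper_bound, points in BANDS:
        if rmse < upper_bound:
            return points
    return 0.0  # assumed to apply to any RMSE of 0.55 or more


print(contest_points(0.50))
```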

Homework 6 (Due March 27)

Homework 7 (Due April 4)

gala.csv


Labs

Labs will focus on learning R. Each Lab Quiz (posted to Webwork) is worth 1% of your final grade. Lab quizzes can be found and completed on Webwork https://webwork.elearning.ubc.ca/ “STAT306-201_2017W2”.

Lab 1 (Jan 12 and Jan 16) : Lab1.R

data: Age_vs_Money_data.csv

Some R Basics; Reading in data; calculating sample statistics; t-test.

Lab Quiz 1 due January 18.

Lab 2 (Jan 19 and Jan 23) : Lab2.R

data: Age_vs_Money_data.csv, data: Hubble.txt

Simple plot of the data; Simple Linear Regression with lm(); linear regression with “Hubble” data set.

Lab Quiz 2 due January 25.

Lab 3 (Jan 26 and Jan 30) : Lab3.R

Lab Quiz 3 due February 1.

Lab 4 (Feb 2 and Feb 6) : Lab4.R

Lab Quiz due February 8.

Lab 5 (Feb 9 and Feb 13) : Lab5.R

data: moviegross.txt, Lab Quiz due February 15.

Special Lab : Midterm prep. (Feb 16 and Feb 27)

Lab 6 (Mar 2 and Mar 6) : Lab6.R

Variance inflation factor (VIF); signs of multicollinearity; adjusted R-squared; interaction terms.

data: burnaby_condos.csv

Lab 7 (Mar 9 and Mar 13) : Lab7.R

Lab quiz due March 15.

Leave-one-out cross-validation; K-fold cross-validation.

Lab 8 (Mar 16 and Mar 20) : Lab8.R

data: bwt.txt

Lab quiz 8 due on March 22.

Logistic Regression, prediction and cross-validation.

Step by Step logistic regression example with R

Lab 9 (Mar 23 and Mar 27): Lab9.R

data: newbie.txt

Lab quiz 9 due on March 29.

Special Lab : Final prep. (April 3 and April 6)

Here is a copy of last year’s final exam. The TAs will go through some of the solutions during this lab. The final question is especially challenging and should get you thinking.

Online resources

Piazza

Learning R

Understanding statistics

http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/7-t-tests

http://www.stat.columbia.edu/~martin/W2024/R2.pdf

https://www.seas.upenn.edu/~ese302/extra_mtls/Regression_Notes.pdf