An archive of the questions from Mark's Fall 2018 Stat 225.

Linear regression on road race times

Mark

(10 pts)

I’ve got a CSV file on my web space that contains data on the 2015 Peach Tree Road Race. You can grab it and take a look like so:

df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()
Unnamed: 0 Div Place Name Bib Age Place Gender Place Clock Time Net Time Hometown Gender
0 6451 1 SCOTT OVERALL 72 32 1 1 29.500 29.500 SUTTON, UNITED KINGDOM M
1 6452 2 BEN PAYNE 74 33 2 2 29.517 29.517 COLORADO SPRINGS, CO M
2 4092 1 GRIFFITH GRAVES 79 25 3 3 29.633 29.633 BLOWING ROCK, NC M
3 4093 2 SCOTT MACPHERSON 87 28 4 4 29.800 29.783 COLUMBIA, MO M
4 6453 3 ELKANAH KIBET 77 32 5 5 29.883 29.883 FAYETTEVILLE, NC M

Filter this data so that you’ve got a data frame with just the men or women, as you prefer.
Grab a sample of size 100 from this data and store the sample in a variable. When grabbing the random sample, use the sum of the place in the alphabet of the letters in your name as the random_state. For example, ‘Mark’ yields

13 + 1 + 18 + 11 = 43.
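
If you'd rather not add the letters by hand, here is a minimal sketch of the same computation. The helper name `name_seed` is made up for illustration and is not part of the assignment.

```python
def name_seed(name: str) -> int:
    """Sum the 1-based alphabet positions of the letters in a name.

    Anything that is not one of the letters a-z (spaces, hyphens, digits)
    is ignored.
    """
    return sum(ord(c) - ord("a") + 1 for c in name.lower() if "a" <= c <= "z")


print(name_seed("Mark"))  # 13 + 1 + 18 + 11 = 43
```

The result can then be passed as `random_state` when drawing the sample.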

After getting the data, run a linear regression to examine the relationship between Age and Net Time. Answer the following questions:

  1. To a 99% level of confidence, is there a genuine relationship between Age and Net Time?
  2. What Net Time does the model predict for a 54 year old person of the gender in your sample?
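
The whole task can be sketched as follows. To keep the sketch self-contained it runs on made-up data, not the real race results: the `fake` frame, its coefficients, and the seed 43 are assumptions for illustration only, so the numbers it prints say nothing about the actual race.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic stand-in for the race data (NOT the real Peach Tree results).
rng = np.random.default_rng(0)
n = 2000
fake = pd.DataFrame({
    "Age": rng.integers(18, 75, size=n),
    "Gender": rng.choice(["M", "F"], size=n),
})
fake["Net Time"] = 50 + 0.4 * fake["Age"] + rng.normal(0, 12, size=n)

# Step 1: keep one gender only.
men = fake[fake["Gender"] == "M"]

# Step 2: draw a reproducible sample of size 100.
sam = men.sample(100, random_state=43)

# Step 3: regress Net Time on Age using the sample, not the full frame.
lr = linregress(sam["Age"], sam["Net Time"])

# Question 1: compare the p-value for the null hypothesis slope = 0 with alpha.
alpha = 0.01
significant = lr.pvalue < alpha

# Question 2: evaluate the fitted line at age 54.
prediction = lr.slope * 54 + lr.intercept

print(lr)
print(significant, prediction)
```

With the real data, `fake` would be replaced by the frame returned from `pd.read_csv`.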
vscala
from numpy.random import seed
import pandas as pd
from scipy.stats import linregress
seed(87) #22+9+14+3+5+14+20
df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
size = 100
sd = df[df.Gender == 'M'].sample(size)
sd.head()
Unnamed: 0 Div Place Name Bib Age Place Gender Place Clock Time Net Time Hometown Gender
30939 17508 2223 HAPPY GUADALUPE 47001 45 30938 17983 177.233 75.017 AUSTELL, GA M
10345 10166 952 MICHAEL COHEN 1427 35 10342 7467 57.317 57.150 ATLANTA, GA M
14228 16465 1180 MICHAEL RINGO 25740 49 14228 9736 91.233 60.300 NEWPORT BEACH, CA M
53894 25505 1586 RICHARD HANSEN 90124 63 53895 26906 199.517 130.500 DECATUR, GA M
12468 7538 1088 RONGKAI GUO 20045 30 12468 8740 78.917 58.867 ATLANTA, GA M
Age = sd['Age']
NetTime = sd['Net Time']
linregress(Age, NetTime) 

LinregressResult(slope=0.4769202942971197, intercept=51.43572916619079, rvalue=0.2916508401797609, pvalue=0.0032386368770628083, stderr=0.15800316645853096)

#p-value in the case of slope == 0
from scipy.stats import norm
alpha = .01
linregress(Age, NetTime).pvalue < alpha

True, thus with a 99% confidence the slope is not 0
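
For anyone wondering where that p-value comes from, here is a hedged sketch: it is the two-sided tail probability of the t statistic slope/stderr with n - 2 degrees of freedom. The value n = 100 is taken from the problem statement, and the slope and stderr are copied from the output above.

```python
from scipy.stats import t

# Values copied from the LinregressResult above.
slope = 0.4769202942971197
stderr = 0.15800316645853096
n = 100        # sample size stated in the problem
dof = n - 2    # degrees of freedom for simple linear regression

# Test statistic for the null hypothesis slope = 0.
t_stat = slope / stderr

# Two-sided p-value from the t distribution.
p_two_sided = 2 * t.sf(abs(t_stat), dof)

print(t_stat, p_two_sided)
```

This should agree closely with the `pvalue` field reported by `linregress`.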

#At age of 54 the predicted net time would be...
linregress(Age, NetTime).slope * 54 + linregress(Age, NetTime).intercept

77.18942505823526 (predicted net time at age 54)
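
A side note on reproducibility: the code above seeds NumPy's global generator, while the problem statement asks for `random_state`. A small sketch of the `random_state` form, using a hypothetical toy frame in place of the race data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the filtered race data.
toy = pd.DataFrame({
    "Age": np.arange(20, 70),
    "Net Time": np.linspace(40, 90, 50),
})

# The same random_state gives the same rows in the same order.
first = toy.sample(10, random_state=43)
second = toy.sample(10, random_state=43)
same_rows = first.equals(second)

print(same_rows)
```

Passing the seed directly to `sample` keeps the draw reproducible even if other code consumes random numbers in between.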

megan
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import linregress
import pandas as pd

13+5+7+1+14
>>40

df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()
Unnamed: 0 Div Place Name Bib Age Place Gender Place Clock Time Net Time Hometown Gender
0 6451 1 SCOTT OVERALL 72 32 1 1 29.500 29.500 SUTTON, UNITED KINGDOM M
1 6452 2 BEN PAYNE 74 33 2 2 29.517 29.517 COLORADO SPRINGS, CO M
2 4092 1 GRIFFITH GRAVES 79 25 3 3 29.633 29.633 BLOWING ROCK, NC M
3 4093 2 SCOTT MACPHERSON 87 28 4 4 29.800 29.783 COLUMBIA, MO M
4 6453 3 ELKANAH KIBET 77 32 5 5 29.883 29.883 FAYETTEVILLE, NC M
df_men = df[df['Gender'] == 'M']

sam = df_men.sample(100, random_state=74)
sam

sam.plot.scatter('Age', 'Net Time')

df_men['Net Time']


lr = linregress(sam.Age, sam['Net Time'])
lr

LinregressResult(slope=0.36514077076110596, intercept=58.551656739521093, rvalue=0.25502534899813828, pvalue=0.010446748003158068, stderr=0.13984950584707564)

Therefore, I fail to reject the null hypothesis that m=0 because my p-value is 0.0104, which is greater than 0.01.
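
The decision rule used here can be written as a small sketch. The function name `slope_decision` is made up for illustration; the two p-values are the ones reported in the posts above.

```python
def slope_decision(pvalue: float, alpha: float = 0.01) -> str:
    """Compare a p-value with alpha for the null hypothesis slope = 0."""
    if pvalue < alpha:
        return "reject H0 (evidence of a relationship)"
    return "fail to reject H0 (no evidence of a relationship at this level)"


# p-values reported earlier in the thread
print(slope_decision(0.0032386368770628083))  # below 0.01
print(slope_decision(0.010446748003158068))   # just above 0.01
```

A p-value just above the cutoff, like the second one, leads to "fail to reject" even though it would pass at the 95% level.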

def f(x): return lr.slope*x + lr.intercept
sam.plot.scatter('Age', 'Net Time')
plt.plot([10,80], [f(10), f(80)], 'black')

f(54)

78.26925836062081

The predicted net time at age 54 is 78.26925836062081
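
As a quick sanity check, the prediction is just the fitted line evaluated at 54. The slope and intercept below are copied from the regression output above.

```python
# Coefficients copied from the LinregressResult above.
slope = 0.36514077076110596
intercept = 58.551656739521093


def predict(age: float) -> float:
    """Evaluate the fitted line at a given age."""
    return slope * age + intercept


pred_54 = predict(54)
print(pred_54)
```

The result should match the value of `f(54)` reported in the post.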

dennis

Setup:

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import linregress

import pandas as pd


df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()

Data head:

 	Unnamed: 0 	Div Place 	Name 	Bib 	Age 	Place 	Gender Place 	Clock Time 	Net Time 	Hometown 	Gender
0 	6451 	1 	SCOTT OVERALL 	72 	32 	1 	1 	29.500 	29.500 	SUTTON, UNITED KINGDOM 	M
1 	6452 	2 	BEN PAYNE 	74 	33 	2 	2 	29.517 	29.517 	COLORADO SPRINGS, CO 	M
2 	4092 	1 	GRIFFITH GRAVES 	79 	25 	3 	3 	29.633 	29.633 	BLOWING ROCK, NC 	M
3 	4093 	2 	SCOTT MACPHERSON 	87 	28 	4 	4 	29.800 	29.783 	COLUMBIA, MO 	M
4 	6453 	3 	ELKANAH KIBET 	77 	32 	5 	5 	29.883 	29.883 	FAYETTEVILLE, NC 	M

Select Sample:

df_men = df[df['Gender']== 'M']
sam = df_men.sample(100, random_state=65)

Linear Regression:

lr = linregress(sam['Age'], sam['Net Time'])
lr

LR Results:

LinregressResult(slope=0.41763022606104783, intercept=50.092612883896166, rvalue=0.27871103449308648, pvalue=0.0049860094629450959, stderr=0.14536691461906104)

Plot:

def f(x): return lr.slope*x + lr.intercept
sam.plot.scatter('Age', 'Net Time')
plt.plot([10,80], [f(10), f(80)], 'black')

image

pvalue=0.0049860094629450959 is <.01 :. we reject the null of Slope = 0, meaning there is a relationship between age and time. 
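
Another way to look at the same conclusion is a 99% confidence interval for the slope. This is a sketch assuming n = 100 (so 98 degrees of freedom), with slope and stderr copied from the LR results above.

```python
from scipy.stats import t

# Values copied from the LinregressResult above.
slope = 0.41763022606104783
stderr = 0.14536691461906104
n = 100  # sample size stated in the problem

# Critical value leaving 0.5% in each tail.
t_star = t.ppf(0.995, n - 2)

# 99% confidence interval for the slope.
lo = slope - t_star * stderr
hi = slope + t_star * stderr

print(lo, hi)
```

If the interval excludes 0, that agrees with rejecting the null hypothesis of zero slope at the 0.01 level.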

f(54) = 72.644645091192757
john

Get my sample:

df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()
df_men = df[df['Gender'] == 'M']
seed = 47
sam = df_men.sample(100, random_state=seed)

Do a linear regression on it

lr = linregress(sam.Age, sam['Net Time'])
lr
LinregressResult(slope=0.17594741498783259, intercept=27.813346527079492, rvalue=0.22667833522917125, pvalue=0.023335850843621341, stderr=0.076366919876287603)

Determine if there is a genuine relationship between age and time.

from scipy.stats import norm
alpha = 0.01
lr.pvalue < alpha
False

The determination is that there is not a genuine relationship between those two variables.

Now find the Net Time the model predicts for a 54 year old person of the male category.

lr.slope * 54 + lr.intercept
72.165429062746014
mac

I imported my regression tools:

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import linregress
import pandas as pd

I grabbed my data set:

df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()

I excluded the women:

df_men = df[df['Gender'] == 'M']

I calculated my seed from my name:

mac = 13 + 1 + 3 = 17

I grabbed my sample using my seed:

sam = df_men.sample(100, random_state=17)
sam.plot.scatter('Age', 'Net Time')

I run my regression:

lr = linregress(sam['Age'], sam['Net Time'])
lr

This results:

LinregressResult(slope=0.27146012399691261, intercept=59.802783068017803, 
rvalue=0.19807325500622364, pvalue=8.5568045460350696e-239, stderr=0.0081442090319174126)

I use:

from scipy.stats import norm
alpha = .01
linregress(Age, NetTime).pvalue < alpha

Which results in “True”, therefore with a 99% CI, the slope is not 0.
Thus:

linregress(Age, NetTime).slope * 54 + linregress(Age, NetTime).intercept

This equals f(54) = 74.461629763851079

Rebecca

My seed is 18+5+2+5+3+3+1 = 37 and I’ll look at women’s data.

The scatter plot looks kind of like somebody threw darts at a piece of posterboard.

image

Plotting the linear regression (and wasn’t getting that linreg an adventure!) gives this:

image

with the following outputs to describe my regression line:

LinregressResult(slope=0.7931034482758621, intercept=2.7241379310344827, 
rvalue=0.9320070332440739, pvalue=0.06799296675592614, stderr=0.21808811449437113)

If I understand right, then I should reject the hypothesis that there’s a genuine relationship between age and net time to within a 99% level of confidence, because my pvalue of about .068 is greater than the desired pvalue of .010. Please let me know if I’m reading that right…

I think that to answer (2), I should plug 54 into the equation for my line and see what I get. So, I’d expect a 54-year-old woman to finish the race in 85.205 minutes (assuming this is minutes). I should probably call up my mother and ask how that compares the next time she runs a 10k, since she’s a 53 year old woman. That could be kind of funny.

goodmorning
df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()

gives me the data points and I use

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import linregress
import pandas as pd
df_men = df[df['Gender'] =='M']

to pull just the males from the data set. then we pull a sample using

sam = df_men.sample(100, random_state=55)
sam.plot.scatter('Age', 'Net Time')

then to get the linear regression we use

Age = df['Age']
NetTime = df['Net Time']
linregress(Age, NetTime) 

LinregressResult(slope=0.26313774106668503, intercept=65.514897976413067, rvalue=0.17531979108711121, pvalue=0.0, stderr=0.0063125743413930107)

from scipy.stats import norm
alpha = .01
linregress(Age, NetTime).pvalue < alpha

True
then for the time of a 54 year old i got

linregress(Age, NetTime).slope * 54 + linregress(Age, NetTime).intercept

79.724335994014055
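
One thing worth checking in the code above: `Age` and `NetTime` are read from `df` rather than from `sam`, so the regression seems to use every runner instead of the 100-runner sample. A pvalue of 0.0 and a very small stderr are what one would expect in that case. The sketch below illustrates the effect on made-up data; the `pop` frame and its coefficients are assumptions, not the race results.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Synthetic "population" (NOT the real race data).
rng = np.random.default_rng(1)
n = 20000
pop = pd.DataFrame({"Age": rng.integers(18, 80, size=n)})
pop["Net Time"] = 65 + 0.25 * pop["Age"] + rng.normal(0, 20, size=n)

# A sample of 100, as the assignment asks for.
sam = pop.sample(100, random_state=55)

# Regression on everything versus regression on the sample.
lr_full = linregress(pop["Age"], pop["Net Time"])
lr_sam = linregress(sam["Age"], sam["Net Time"])

print("full:  ", lr_full.stderr, lr_full.pvalue)
print("sample:", lr_sam.stderr, lr_sam.pvalue)
```

The standard error shrinks as the number of rows grows, so running the test on the full frame answers a different question than the one posed for the sample.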

Tripp

Grabbed my data

 sam = df_men.sample(100, random_state=79)
 sam

sam = df_men.sample(100, random_state=79)
sam.plot.scatter('Age', 'Net Time')
lr = linregress(sam['Age'], sam['Net Time'])
lr
def f(x): return lr.slope*x + lr.intercept
sam.plot.scatter('Age', 'Net Time')
plt.plot([10,80], [f(10), f(80)], 'black')

 lr = linregress(sam['Age'], sam['Net Time'])
 lr

LinregressResult(slope=0.054607277704340491, intercept=71.024197498113011, rvalue=0.036126765315246286, pvalue=0.72120998218749843, stderr=0.15258955886241984)

pvalue=0.72120998218749843
fail to reject since 99% --> .01

age54 = f(54)
age54

73.972990494147396

dpulse
  %matplotlib inline
  import numpy as np
  from matplotlib import pyplot as plt
  from scipy.stats import linregress
  import pandas as pd

Grabbed my Data:

  df = pd.read_csv("https://www.marksmath.org/data/peach_tree2015.csv")
  df.head()

Displayed my Data with just men:

  Mdf = df[df["Gender"]=="M"]
  def f(x): return lr.slope*x + lr.intercept
  sam = Mdf.sample(100, random_state=40)
  sam.plot.scatter("Age","Net Time")
  plt.plot([10,80], [f(10),f(80)], "black")


Got my P-Value:

 lr = linregress(sam["Age"], sam["Net Time"])
 lr

P-Value = .0003
Therefore, at the 99% confidence level, I can reject the null hypothesis.

Got my Value for a 54yo Male:

 age54 = f(54)
 age54

time = 76.565164896981926

joshua

First I import my tools…

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import linregress
import pandas as pd

and read the CSV file…

df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
df.head()

Then I Filter the data to just men…

df_men=df[df['Gender']=='M']

Then I get a sample of size 100 from this data and store the sample in a variable… Seeded at (Joshua=10+15+19+8+21+1=74)

sam = df.sample(100, random_state=74)
sam

A Scatter plot would look like

sam = df.sample(100, random_state=74)
sam.plot.scatter('Age', 'Net Time')

image
Then do linear Regression…

lr = linregress(df.Age, df['Net Time'])
lr
OUTPUT:LinregressResult(slope=0.263137741066685, intercept=65.51489797641307, rvalue=0.1753197910871112, pvalue=0.0, stderr=0.006312574341393011)

Then input…

#p-value in the case of slope == 0
from scipy.stats import norm
alpha = .01
linregress(df.Age, df['Net Time']).pvalue < alpha

OUTPUT: True

Which means that at the 99% confidence level the slope is not zero. This means there is a relationship between the variables.

Then model predicts that net time for age 54 male is…

linregress(df.Age, df['Net Time']).slope * 54 + linregress(df.Age, df['Net Time']).intercept

OUTPUT:79.72433599401406

79.72433599401406 is the predicted net time of a 54 year old

btucker
  %matplotlib inline
  import numpy as np
  from matplotlib import pyplot as plt
  from scipy.stats import linregress
  import pandas as pd

Import data:

 df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv')
 df.head()

Look at net time of men

df_men = df[df['Gender'] == 'M']
df_men['Net Time']

Plot the sample

sam = df.sample(100, random_state=21)
sam.plot.scatter('Age', 'Net Time')

image

Find regression

lr = linregress(sam['Age'], sam['Net Time'])
lr

Output

LinregressResult(slope=0.26756112592527004, intercept=65.86908317750806, rvalue=0.19814113196858557, pvalue=0.048139016938270704, stderr=0.13370212024809247)

My p-value is 0.048, which is greater than .01; therefore I fail to reject the null.

Plot regression line

def f(x): return lr.slope*x + lr.intercept
sam.plot.scatter('Age', 'Net Time')
plt.plot([0,80], [f(0), f(80)], 'black')

image

f(54)=80.317

Mark

I think you’ve got that backwards!

Mark

@goodmorning I’m not sure what your conclusion is?

Mark

@john
I’d really like to see

# formatted code, like
x = 42**2

rather than a quote, like
x = 42**2

Garrett

Load Data:

%matplotlib inline
import numpy as np 
from matplotlib import pyplot as plt 
from scipy.stats import linregress 
import pandas as pd 
df = pd.read_csv('https://www.marksmath.org/data/peach_tree2015.csv') 
df.head()

My seed is (7+1+18+18+5+20+20) = 89

Set:

Unnamed: 0 Div Place Name Bib Age Place Gender Place Clock Time Net Time Hometown Gender
0 6451 1 SCOTT OVERALL 72 32 1 1 29.500 29.500 SUTTON, UNITED KINGDOM M
1 6452 2 BEN PAYNE 74 33 2 2 29.517 29.517 COLORADO SPRINGS, CO M
2 4092 1 GRIFFITH GRAVES 79 25 3 3 29.633 29.633 BLOWING ROCK, NC M
3 4093 2 SCOTT MACPHERSON 87 28 4 4 29.800 29.783 COLUMBIA, MO M
4 6453 3 ELKANAH KIBET 77 32 5 5 29.883 29.883 FAYETTEVILLE, NC M

Choose Sample:

df_men = df[df['Gender'] == 'M']
sam = df_men.sample(100, random_state=89)
sam
sam.plot.scatter('Age', 'Net Time')

Linear Regression:

df_men['Net Time']
lr = linregress(sam.Age, sam['Net Time'])
lr

Out[7]: LinregressResult(slope=0.18065766695564817, intercept=63.73191177806217, rvalue=0.1365497283170158, pvalue=0.17551815836994897, stderr=0.1323931165493825)

My p-value is 0.176, which is larger than 0.01, so I fail to reject the null

def f(x): return lr.slope*x + lr.intercept
sam.plot.scatter('Age', 'Net Time')
plt.plot([10,80], [f(10), f(80)], 'black')

john

The problem is fixed.