Again, this is paired data; each point corresponds to a single book which has two numeric values: its bkstr.com price and its amazon.com price.
We talked about the correlation $R$ before. Now we'll focus on the regression line $A = 0.68\times B + 3.18$, which is supposedly the "best linear fit" for the data. Specifically, we'll address the following question:
What makes the regression line better than some other line?
Answer: It minimizes the total squared error.
More specifically, if we have observations $$(x_1, y_1),(x_2, y_2),\ldots,(x_n, y_n)$$ and we assume that our model is $Y=aX+b$, then the regression line chooses $a$ and $b$ so that $$(y_1-(ax_1+b))^2 + (y_2-(ax_2+b))^2 + \cdots + (y_n-(ax_n+b))^2$$ is as small as possible.
The terms $y_i-(ax_i+b)$ are called the residuals; they measure how far off our regression model is at the data points.
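To make "minimizes the total squared error" concrete, here is a minimal sketch in plain Python. The data points are made up purely for illustration, the function names are hypothetical, and the slope and intercept come from the standard closed-form least squares solution rather than from anything specific to these notes.

```python
def sum_squared_error(xs, ys, a, b):
    """Total squared error of the line y = a*x + b on the points (xs, ys)."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Slope a and intercept b that minimize the total squared error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


# Toy data, invented for this illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
best = sum_squared_error(xs, ys, a, b)

# Any other line, such as this slightly nudged one, has a larger error
other = sum_squared_error(xs, ys, a + 0.1, b - 0.2)
print(best, other)
```

Nudging $a$ or $b$ in any direction should only increase the printed error, which is what "best linear fit" means here.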
First and foremost, linear regression might be useful if you want to understand the relationship between two numerical variables. Furthermore, the data should satisfy a few more conditions:
It seems that correlation should be connected with slope - and it is, via the formula $$m = r\frac{s_y}{s_x}.$$ Furthermore, the regression line should go through the point $(\bar{x},\bar{y})$. These formulae make it relatively easy to compute a regression line for given data.
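As a rough check of these two facts, the sketch below computes $r$, $s_x$, and $s_y$ for a small data set and builds the line from them. The data are made up for illustration and the helper names are hypothetical; the sample standard deviation (dividing by $n-1$) is assumed throughout.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: average product of z-scores, dividing by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)
    ) / (n - 1)


def regression_line(xs, ys):
    """Slope from m = r * s_y / s_x; intercept chosen so the line hits the means."""
    r = correlation(xs, ys)
    m = r * stdev(ys) / stdev(xs)
    b = mean(ys) - m * mean(xs)
    return m, b


# Toy data, invented for this illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = regression_line(xs, ys)
print(m, b)
print(m * mean(xs) + b, mean(ys))  # the line passes through (x-bar, y-bar)
```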
Suppose that $X$ is a data set with mean $90$ and standard deviation $5$; $Y$ is a data set with mean $74$ and standard deviation $4$. Furthermore, $X$ and $Y$ have a tight correlation of $r=0.85$. Find the equation of the regression line, then use it to predict the value of $y$ when $x=85$.
Solution
We know the regression line has the form $y=mx+b$. The slope is $$m = r\frac{s_y}{s_x} = 0.85 \times \frac{4}{5} = 0.68.$$ We now know the regression line has the more specific form $y=0.68x+b$. Once we know this, we can plug in the means for $X$ and $Y$ to get $b$: $$74 = 0.68 \times 90 + b, $$ so $$b = 74 - 0.68 \times 90 = 12.8.$$ Thus, the final equation of the line is $$y = 0.68x +12.8.$$
To solve the second part, we simply plug $x=85$ into our formula for the line to find the corresponding value of $y$: $$y = 0.68 \times 85 + 12.8 = 70.6.$$
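The same arithmetic can be checked in a few lines of Python. This is only a sketch; the function name `line_from_summary` is made up for this example, and the inputs are the summary statistics given in the problem.

```python
def line_from_summary(x_mean, x_sd, y_mean, y_sd, r):
    """Regression line from summary statistics alone.

    Uses m = r * s_y / s_x and the fact that the line passes
    through the point of means (x_mean, y_mean).
    """
    m = r * y_sd / x_sd
    b = y_mean - m * x_mean
    return m, b


# Summary statistics from the problem statement
m, b = line_from_summary(x_mean=90, x_sd=5, y_mean=74, y_sd=4, r=0.85)

# Second part: the model's prediction at x = 85
prediction = m * 85 + b

print(round(m, 2), round(b, 2), round(prediction, 2))
```

The values are rounded when printed because floating point arithmetic can leave a tiny error in the last digits.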
Let's discuss how we might interpret the following:
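import pandas as pd
from scipy.stats import linregress

# Load the CDC data and draw a reproducible random sample of 50 rows
df = pd.read_csv("https://www.marksmath.org/data/cdc.csv")
sam = df.sample(50, random_state=1)

# Regress weight (response) on height (explanatory variable)
lr = linregress(sam.height, sam.weight)
lr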
Here are the answers to the first couple of questions at least:
%matplotlib inline
import matplotlib.pyplot as plt

# Scatter plot of the sampled data
plt.plot(sam.height, sam.weight, '.')

# Draw the regression line from x = 60 to x = 77
x1 = 60
x2 = 77
m = lr.slope
b = lr.intercept
y1 = m*x1+b
y2 = m*x2+b
plt.plot([x1,x2],[y1,y2]);
Finally, here's a little code that might simplify the data entry portion of your HW:
value_string = """15.9 9.3
15.8 7
17.4 10.9
16.2 11.1
16.4 11.1
17.6 27.9
19 19
19.9 18
19.1 18.1
19.4 19.7
18.2 27.8
21.8 21.1
21.5 31.9
21 30.2
22.4 30.6
23 26
22 28.8
24.8 30.1
26.5 38.5
26.9 36.1"""
# split() with no argument separates on any whitespace (tabs or spaces)
values = [s.split() for s in value_string.split('\n')]
pp = [float(v[0]) for v in values]
mm = [float(v[1]) for v in values]
[pp,mm]
And here's an online version, in case you need it later: https://sagecell.sagemath.org/?q=uskwtw