Fitting height and weight data

mark · February 2020

(15 pts)

I've got a fun program on my webpage that generates random CSV data for people. You can access it via Python like so:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=mark')
df.tail()

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Frank	Parker	45	male	65.49	130.90	39050	high
96	Clyde	Botti	43	male	74.95	156.12	1952	moderate
97	Donald	Hollack	31	male	67.11	206.31	44204	moderate
98	Cheryl	Hamilton	23	female	65.32	188.72	86	none
99	Doug	Garcia	23	male	68.27	148.50	722	high

Here's the cool thing - the data is randomly generated but the random number generator is seeded using the username query parameter in the URL. Thus, if I execute that command several times, I get the same result every time. That result depends upon the username, however. If you do it with your forum username, you'll get a different result. Thus, we all have our own randomly generated data file!
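
A minimal sketch of the seeding idea, assuming nothing about how the actual server is written: the username is hashed to an integer, and that integer seeds a random number generator. The function name `rows_for`, the hashing step, and the height and weight distributions are all hypothetical.

```python
import hashlib
import random

def rows_for(username, n=5):
    """Return n (height, weight) pairs determined entirely by the username."""
    # Hypothetical seeding scheme: derive an integer seed from the username.
    seed = int(hashlib.sha256(username.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    # Made-up distributions, for illustration only.
    return [(round(rng.gauss(67, 4), 2), round(rng.gauss(170, 30), 2))
            for _ in range(n)]

print(rows_for("mark") == rows_for("mark"))     # same username, same data
print(rows_for("mark") == rows_for("gabriel"))  # a different username
```

Because the generator's state depends only on the seed, repeated calls with the same username reproduce the same rows.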

The problem

This Python assignment comes in several parts:

Download your own personal CSV and display the tail of that data
(I used the command print(df.tail().to_html()) to generate code for the table)
Extract and plot the height and weight columns,
Plot that data with height on the horizontal axis and weight on the vertical,
Use Python/Numpy/Scipy to set up and solve the normal equations to find a function of the form $f(x)=ax+b$ that models the data, and
Plot your function with the data.
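
The fitting steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the data is synthetic (generated locally rather than downloaded), and `np.linalg.solve` stands in for `scipy.linalg.solve` so that the block needs only NumPy.

```python
import numpy as np

# Synthetic stand-in for the height/weight columns (not real course data).
rng = np.random.default_rng(0)
height = rng.normal(67, 4, size=100)
weight = 1.5 * height + 70 + rng.normal(0, 25, size=100)

# Columns of A: the heights and a column of ones, so A @ [a, b] = a*height + b.
A = np.column_stack([height, np.ones_like(height)])

# Normal equations: (A^T A) [a, b]^T = A^T weight.
a, b = np.linalg.solve(A.T @ A, A.T @ weight)

def f(x):
    """The fitted line f(x) = a*x + b."""
    return a * x + b

print(a, b)
```

With real data, `height` and `weight` would be the corresponding columns of the downloaded data frame.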

You'll turn in the assignment as a response to this post. Of course, collaboration is encouraged - even unavoidable. You should certainly find some helpful code on our class web pages:

Least Squares and
Galileos Ramp

gabriel · March 2020

Using

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve, eig
GA = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=gabriel')
GA.tail()

The tail of my data was:

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Wanda	Hammack	44	female	59.33	166.46	27163	none
96	Latoya	Farmer	53	female	64.22	130.05	3252	high
97	Stacie	Sartwell	26	female	64.61	163.25	8762	high
98	Elizabeth	Lyman	33	female	60.76	150.13	3550	moderate
99	Mary	Ivey	35	female	60.63	137.10	261661	moderate

Edit: This isn't the entire data set. The entire data set, when plotted using
import matplotlib.pyplot as plt

produced

I extracted the height and weight values and stored them as H1 and W1, respectively. Edit: I reran the code with the entire data set. These were the new H1 and W1:

H1=GA.height
W1=GA.weight

Using code found in the Least Squares tutorial,

A = np.array([H1,np.ones(len(H1))]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(W1))
a,b

which gave the solution

(1.1819982447405388, 88.9482338086134)

which roughly corresponds to the equation

$$ y = 1.18x + 89 $$

Using this solution, I plotted the line with

def f(x): return a*x+b
fy = [f(x) for x in H1]
plt.plot(H1,W1, '.')
plt.plot(H1, fy)

The corresponding plot gave

PrinceHumperdinck · March 2020

Generating data under the glorious username LordFarquaad resulted in

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Mary	Phillips	51	female	66.57	183.45	666	high
96	Samantha	Grimes	26	female	65.76	165.57	5359	moderate
97	Russell	Dwelle	33	male	71.36	200.36	3117	none
98	Maxwell	Rodrigues	34	male	70.88	163.93	19932	none
99	Jim	Creasey	57	male	67.60	144.44	4417	moderate

I then extracted the height and weight columns to their own variables

height = df.loc[:, "height"]
weight = df.loc[:, "weight"]
plt.plot(height, weight, '.')

and plotted them.

Now to fit the data to a line $f(x)=ax+b$ by solving the normal equations $A^TAx=A^Tb$, where the matrix $A$ holds the height data in one column and ones in the other, and the vector $b$ holds the weights.
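
Written out for this problem (a sketch, with $h_i$ and $w_i$ denoting the $m$ heights and weights, symbols that are not used in the original post), the first column of $A$ contains the heights and the second contains ones, so the normal equations for the slope $a$ and intercept $b$ become

$$
\begin{pmatrix} \sum h_i^2 & \sum h_i \\ \sum h_i & m \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} \sum h_i w_i \\ \sum w_i \end{pmatrix}.
$$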

# Columns of A: the heights and a column of ones (height**0)
A = np.array([height, height**0]).transpose()
# Solve the normal equations for the slope a and intercept b
a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))

The result was a=1.389714565661591 and b=75.88285967631238.
And finally plotting the line $f(x)=1.39x+75.9$

def f(x): return a*x + b
xs = np.linspace(57.5,77.5)
ys = f(xs)
plt.plot(xs,ys)
plt.plot(height,weight, '.')

displays the best linear fit in the least-squares sense.

mark · March 2020

@dan

I like your answer the best so far. It's the most complete and presents an analysis using all the data. I do have some comments, though.

The plots with just four points are unnecessary.
Your domain for the last plot that displays the points and the line is way too big and obscures the fit. It's hard to say what's going on there, since you don't show the code.
It looks like you're trying to use LaTeX at some inappropriate spots.
- You should use HTML to show the table. The command print(df.tail().to_html()) should yield exactly what you want.
- You should indicate most other computer output as a code comment. In Python, a comment is preceded with a hash (#). A simple example looks like:
a = 1
b = 2
a+b

# Out:
# 3
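
On the point about the plot domain, one way to keep the line from obscuring the fit is to build the x values from the data itself. The following is a small sketch; the helper name `line_domain` and its `pad` parameter are hypothetical, not something from the class pages.

```python
import numpy as np

def line_domain(x, pad=0.5, num=50):
    """Evenly spaced x values spanning the data, padded slightly on each side."""
    x = np.asarray(x, dtype=float)
    return np.linspace(x.min() - pad, x.max() + pad, num)

heights = [59.33, 64.22, 64.61, 60.76, 60.63]  # sample values from the thread
xs = line_domain(heights)
print(xs[0], xs[-1])
```

The result can then be used as in `plt.plot(xs, f(xs))`, so the line covers just the range of the data.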

mark · March 2020

@gabriel and @LordFarquaad

We're interested in all your data - not just five points.

Donkey · March 2020

I began this problem by setting up the necessary libraries and retrieving the data.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve, eig
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=Donkey')
df.tail()

This gave the output:

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Richard	Schmidt	40	male	69.04	184.16	425	moderate
96	Paul	Fleury	29	male	64.96	206.06	11212	moderate
97	Agnes	Pollard	39	female	61.78	233.80	12416	moderate
98	Diane	Morrison	42	female	61.50	179.02	17823	none
99	Frances	Horn	38	female	60.43	132.64	336	moderate

Note that this is not the complete data set, but only the final 5 rows. I then set up vectors containing the height and weight data for the entire data set, then plotted weight vs. height.

h = df.loc[:, "height"]
w = df.loc[:, "weight"]
plt.plot(h,w,'.')

Next I set up a matrix using the height values as the first column, with the second column all ones. I then used the Least Squares formula to generate a fit line.

A = np.array([h,np.ones(len(h))]).transpose()
bv = np.array([w]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(bv))
print(a,b)
#output: [0.69005625] [127.93110201]
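
The brackets in that output appear because the right-hand side was passed as a column vector, so each unpacked value is a length-1 array rather than a plain number. A small sketch of the difference, using a made-up 2×2 system and `np.linalg.solve` in place of `scipy.linalg.solve`:

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])
rhs_flat = np.array([3.0, 5.0])        # shape (2,)
rhs_col = rhs_flat.reshape(-1, 1)      # shape (2, 1), a column vector

a1, b1 = np.linalg.solve(M, rhs_flat)  # two scalars
a2, b2 = np.linalg.solve(M, rhs_col)   # two length-1 arrays

print(a1, b1)
print(a2, b2)
print(a2.item(), b2.item())            # extract plain floats
```

Either form gives the same numbers; passing the one-dimensional weights directly simply avoids the extra brackets.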

Therefore the equation for the fit line is $\widehat{\text{weight}} = 0.69\,(\text{height}) + 127.93$. Plotting this line with the original plot gives the following:

Opie · March 2020

I started by importing the necessary libraries, as well as the data:

import pandas as pd
import numpy as np
from scipy.linalg import solve, eig
import matplotlib.pyplot as plt
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=opie')
df.tail()

The tail of my data was the following:

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Albina	Donaldson	25	female	61.62	164.53	10570	none
96	Joanne	Garza	26	female	62.06	153.76	2252	moderate
97	Eva	Garrett	38	female	63.44	134.73	54855	moderate
98	Avril	To	40	female	63.29	186.23	4817454	none
99	George	Weekley	33	male	70.16	190.69	33424	high

Next, I pulled the height and weight from the table, and labeled them H and W, respectively:

H = df.height
W = df.weight

I then plotted the whole set of the data and found the following

plt.plot(H, W, '.')

Then, using the Least Squares Formula, I was able to find an equation for the line of best fit:

A = np.array([H,np.ones(len(H))]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(W))
a,b
# Out: (0.6337213350857557, 131.17989401737339)

Thus, we find that our line of best fit for this data is $$y = 0.6337x + 131.1799.$$

We can then add this line to the earlier plot to get a good visual on what we should expect:

def f(x): return a*x+b
fy = [f(x) for x in H]
plt.plot(H,W, '.')
plt.plot(H,fy)

joshua · March 2020

Import libraries, download CSV data and print tail of data;

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve, eig
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=joshua')
df.tail()

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Gary	Bejjani	45	male	73.28	157.28	2270	none
96	Sara	Lamb	28	female	64.03	149.05	287	high
97	Michelle	Martinez	43	female	58.38	169.22	68610	moderate
98	Thomas	Hart	40	male	69.84	182.22	45682	high
99	Delbert	Bueche	46	male	68.87	219.97	1463	high

Extract and plot height and weight columns of data,

H = df.loc[:, "height"]
W = df.loc[:, "weight"]
plt.plot(H,W,'.')

Set up and solve the normal equations to find a linear function that models the data,

A = np.array([H,np.ones(len(H))]).transpose()
bv = np.array([W]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(bv))
print(a,b)

$$f(x) = ax + b$$

$$a = 1.58265964 \ , \ \ b = 67.31526454$$

$$\boxed{f(x) = (1.58265964)x + (67.31526454)}$$

Plot function with data,

def f(x): return a*x + b
xs = np.linspace(57.0,77.0)
ys = f(xs)
plt.plot(xs,ys)
plt.plot(H,W, '.')

Ben · March 2020

Once upon a time, there was a tabular-formatted collection of data. This table was full of information. Much of that information related to the heights and weights of an (assumed) fictional group of 100 people.

This data was acquired like so:

import pandas as pd

def getData(name):
    path =  'https://www.marksmath.org/cgi-bin/random_data.csv?username=' + name
    return pd.read_csv(path)

df = getData('ben')

The tail of that data-table looked like this:
print(df.tail().to_html())

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Herbert	Jurist	43	male	66.43	191.74	8464	none
96	Mae	Forster	21	female	63.53	170.32	19963	high
97	Han	Martens	50	female	60.17	168.36	18925	none
98	Cole	Hills	38	male	70.04	147.05	23065	high
99	Fred	Powell	32	female	67.75	169.37	14074	high

This data raises many important questions, such as "Is there a direct, linear correlation between sampled heights and weights?", and "Why is Fred Powell female?". The first question can be addressed through a least-squares solution. The latter we leave as an exercise for the reader.

To find a linear best fit, we first set up our environment:

import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve

and generate a quick plot of our data (using matplotlib.pyplot):

xs = df.height
ys = df.weight

plt.xlabel('Height')
plt.ylabel('Weight')
plt.suptitle('Subject Height vs. Weight')

plt.plot(xs,ys, '.')
plt.show()

Once the data is visually verified, we briefly define a line:

def f(x, a, b):
    return a*x + b

and build a least-squares linear regression of our data solving $A^{T}A\vec{x} = A^{T}\vec{b}$, where $A$ is an $m \times 2$ matrix comprised of our height values and a column of ones, and $\vec{b}$ is a column of our weight values.

# Assemble and solve matrix of datapoints
A = np.array([xs, np.ones(len(xs))]).transpose()
a, b = solve(A.transpose().dot(A), A.transpose().dot(ys))

Solving, we get a = -0.18073773896209944 and b = 186.0236653724885. Analyzing our a, or slope of fitted line, shows a small negative correlation between height and weight.

# Construct and plot the line
fy = [f(x, a, b) for x in xs]

Plotting our line $ax+b$ against our data we see:

plt.plot(xs,ys, '.')
plt.plot(xs, fy)
plt.show()

The negative correlation is visualized as a downward sloping line against our point cloud. Broadly extrapolating from our sample data, this suggests that short people tend to be heavier than tall people.
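
The slope alone does not say how strong the relationship is. One way to quantify it is the Pearson correlation coefficient $r$, which is tied to the least-squares slope by $a = r\,s_y/s_x$. The sketch below illustrates this on synthetic data (generated with a slope of -0.2 plus heavy noise, not the data from this post), with `np.linalg.solve` standing in for `scipy.linalg.solve`.

```python
import numpy as np

# Synthetic data: a slope of -0.2 buried in heavy noise (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(66, 4, size=100)
y = -0.2 * x + 186 + rng.normal(0, 25, size=100)

# Least-squares slope and intercept via the normal equations.
A = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.solve(A.T @ A, A.T @ y)

# Pearson correlation coefficient; the slope equals r * (std of y / std of x).
r = np.corrcoef(x, y)[0, 1]
print(a, r, r * y.std() / x.std())
```

A value of $r$ close to zero would indicate that the fitted slope, whatever its sign, reflects only a weak linear relationship.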

mark · March 2020

@Ben - Nice story telling! I've used this program in Stat 185/225 a few times but yours is the first time I've wondered if I should double check my programming.

dan · March 2020

(I figured it would be easier to just make a new comment to show the changes you recommended.)

Going off the boys' previous work, I plugged this in:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=dan')
df.tail()

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve, eig

which produces this tail end of the table:

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Gregory	Hernandez	21	male	68.33	191.08	4010	moderate
96	Kay	Walsh	25	female	67.69	134.76	4108	none
97	Lottie	Hoggins	32	female	63.14	171.32	170724	none
98	Norris	Wagers	54	male	67.36	209.84	478	high
99	Howard	Gonzalez	24	male	67.01	202.34	3957	none

Importing a package to plot the data:

import matplotlib.pyplot as plt

To consider all the data I made these changes:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=dan')
df.tail(99)  # shows the last 99 rows, i.e. every row except row 0

Plotting all of the data:

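# Note: .loc[1:99] selects rows 1 through 99 and leaves out row 0;
# df.loc[:, "height"] would select all 100 rows.
height = df.loc[1:99, "height"]
weight = df.loc[1:99, "weight"]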
plt.plot(height, weight, '.')
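
A note on the slicing used here: `.loc` slices by label and includes both endpoints, so a slice starting at 1 skips the row labeled 0. A small sketch on a made-up five-row frame:

```python
import pandas as pd

# Small stand-in frame with the default integer index 0..4.
df_small = pd.DataFrame({"height": [60.0, 62.0, 64.0, 66.0, 68.0]})

partial = df_small.loc[1:4, "height"]  # label-based and inclusive: rows 1, 2, 3, 4
full = df_small.loc[:, "height"]       # every row, including row 0

print(len(partial), len(full))
```

Using `df.loc[:, "height"]` or simply `df.height` picks up every row of the data set.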

Using the least squares method and the provided code, the slope and y-intercept that best fit all of the data can be found:

# Columns of A: the heights and a column of ones (height**0)
A = np.array([height, height**0]).transpose()
# Solve the normal equations for the slope a and intercept b
a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))
print(a)
print(b)

# Out:
# a = 1.7673603008572196
# b = 51.52382533705932

From here a best fit line can be plotted:

def f(x): return a*x + b
xs = np.linspace(50,80)
ys = f(xs)
plt.plot(xs,ys)
plt.plot(height,weight, '.')

frank · March 2020

I started by importing the required libraries,

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.linalg import solve, eig

and then the data itself.

df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=frank')
df.tail()

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Daniel	Howard	36	male	68.22	188.49	6231	moderate
96	Gina	Richard	27	female	66.72	107.76	58860	high
97	Jonathan	Daniel	37	male	67.42	164.60	4933	moderate
98	Oralia	Duvall	22	female	62.43	230.45	14265	moderate
99	Cristina	Mateer	39	female	60.73	151.24	473	none

Putting the data into vector form and plotting the data resulted in the following graph.

h = df.loc[:, "height"]
w = df.loc[:, "weight"]
plt.plot(h,w,'.')

Using the least squares method, I created a matrix $A$ to hold the vector $h$ in the first column, and a column of ones in the second. Solving $A^{\text{T}}A\vec{x}=A^{\text{T}}\vec{w}$ resulted in values for $a\approx1.34$ and $b\approx83.4$.

A = np.array([h,np.ones(len(h))]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(w))

Using the values of $a$ and $b$ in the function $f(x)=ax+b$ yields the line of best fit for the given data.

def f(x): return a*x+b
fy = [f(x) for x in h]
plt.plot(h,w,'.')
plt.plot(h,fy)
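
A quick sanity check on a fit like this one: the normal equations say exactly that the residual vector $\vec{w} - A\vec{x}$ is orthogonal to both columns of $A$. The sketch below checks that on synthetic data (not the data from this post), with `np.linalg.solve` standing in for `scipy.linalg.solve`.

```python
import numpy as np

# Synthetic heights and weights (illustrative only).
rng = np.random.default_rng(2)
h = rng.normal(67, 4, size=100)
w = 1.3 * h + 83 + rng.normal(0, 25, size=100)

A = np.column_stack([h, np.ones_like(h)])
coeffs = np.linalg.solve(A.T @ A, A.T @ w)

# Residuals of the fitted line; A^T residual should vanish up to rounding error.
residual = w - A @ coeffs
print(A.T @ residual)
```

Both printed entries should be tiny compared with the size of the data, which is what the orthogonality condition predicts.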

joshuam · March 2020

First, all of the necessary libraries are imported:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.linalg import solve

The next step of the assignment is to download the data collection so that it can be used in the script:

# Import data collection
data = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=joshuam')
data.tail()
# print(data.tail())
print(data.tail().to_html())

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Elaine	Ferguson	38	female	65.03	219.24	170650	none
96	Micki	Davis	31	female	59.64	139.71	13543	none
97	Brian	Landin	22	male	65.52	140.60	573025	none
98	Michael	Skinner	58	male	69.05	233.09	15852	high
99	Ryan	Hess	29	male	73.72	168.47	4887	none

Once the data is able to be manipulated, the height and weight values are extracted and plotted on a graph to display all of the data. This generated the following plot:

# Create variables to store the height and weight values from the collection
height = data.height
weight = data.weight
plt.xlabel('Height')
plt.ylabel('Weight')
plt.plot(height, weight, '.')

The next step is to determine the line of best fit for the data. This will be found by finding the solution to the system $A^T A\vec{x} = A^T\vec{b}$.

A = np.array([height,np.ones(len(height))]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))
a,b

Running this code will calculate the slope and y-intercept for the line of best fit $ y = ax+b $

a:
1.2843741017609727
b:
80.09348856482262

As a result, the line of best fit will be y = 1.284x + 80.093, where x represents height and y represents weight.
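
One way to double check coefficients like these is to compare the normal-equations solution with a library least-squares routine. The sketch below does so on synthetic data; `np.linalg.lstsq` is a swapped-in technique that works on $A$ directly and is not the method the assignment asks for.

```python
import numpy as np

# Synthetic stand-in for the height and weight columns.
rng = np.random.default_rng(3)
height = rng.normal(67, 4, size=100)
weight = 1.28 * height + 80 + rng.normal(0, 25, size=100)

A = np.column_stack([height, np.ones_like(height)])

# Normal equations, as in the post.
a_ne, b_ne = np.linalg.solve(A.T @ A, A.T @ weight)

# Library least-squares solver applied to A directly.
(a_ls, b_ls), *_ = np.linalg.lstsq(A, weight, rcond=None)

print(a_ne, b_ne)
print(a_ls, b_ls)
```

The two pairs should agree to many digits for a well-conditioned problem like this one.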

eli · April 2020

Using

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=eli')
df.tail()

I got the tail of

	first_name	last_name	age	sex	height	weight	income	activity_level
95	Cheryl	Henderson	35	female	62.37	204.73	89693	high
96	Brooks	Snow	42	male	70.33	200.56	4479	moderate
97	Barbara	Johnson	56	female	64.78	155.15	1032	high
98	Ashley	Mathias	28	male	70.52	181.02	66083	none
99	Deborah	Schiller	26	female	68.21	193.43	11676	none

Plotting Height vs Weight gave

additionally using

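import numpy as np
from scipy.linalg import solve
H, W = df.height, df.weight  # height and weight columns
A = np.array([H,np.ones(len(H))]).transpose()
a,b = solve(A.transpose().dot(A), A.transpose().dot(W))
a,b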

I got a= 0.27654818250610624 and b= 152.14317669408973
Therefore,

y= 0.28x + 152.14.

Furthermore, I used

def f(x): return a*x+b
fy = [f(x) for x in H]
plt.plot(H,W, '.')
plt.plot(H, fy)

to get the fit line
