Fitting height and weight data

edited March 2020 in Assignments

(15 pts)

I've got a fun program on my webpage that generates random CSV data for people. You can access it via Python like so:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=mark')
df.tail()
first_name last_name age sex height weight income activity_level
95 Frank Parker 45 male 65.49 130.90 39050 high
96 Clyde Botti 43 male 74.95 156.12 1952 moderate
97 Donald Hollack 31 male 67.11 206.31 44204 moderate
98 Cheryl Hamilton 23 female 65.32 188.72 86 none
99 Doug Garcia 23 male 68.27 148.50 722 high

Here's the cool thing - the data is randomly generated but the random number generator is seeded using the username query parameter in the URL. Thus, if I execute that command several times, I get the same result every time. That result depends upon the username, however. If you do it with your forum username, you'll get a different result. Thus, we all have our own randomly generated data file!
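
To illustrate the idea, here is a minimal sketch of seeding by username. This is only a guess at the mechanism, since the server's actual code isn't shown: the function name `fake_heights`, the mean of 67 and the spread of 4 are all made up for the example.

```python
import random

def fake_heights(username, n=5):
    # Hypothetical stand-in for the server script (its real code is not shown).
    # Seeding the generator with the username makes the output reproducible.
    rng = random.Random(username)
    return [round(rng.gauss(67.0, 4.0), 2) for _ in range(n)]

# Calling twice with the same username repeats the same "random" numbers.
print(fake_heights("mark") == fake_heights("mark"))
print(fake_heights("mark"))
print(fake_heights("gabriel"))
```

The first line printed compares two calls with the same seed; the other two let you compare the streams for two different usernames.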

The problem

This Python assignment comes in several parts:

  • Download your own personal CSV and display the tail of that data
    (I used the command print(df.tail().to_html()) to generate code for the table)
  • Extract the height and weight columns,
  • Plot that data with height on the horizontal axis and weight on the vertical,
  • Use Python/Numpy/Scipy to set up and solve the normal equations to find a function of the form $f(x)=ax+b$ that models the data, and
  • Plot your function with the data.
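
As a rough sketch of the normal-equations step in the list above, here is the computation on a small made-up sample. The numbers are invented for illustration and are not from anyone's CSV. The sketch uses `np.linalg.solve` (rather than `scipy.linalg.solve`) and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Made-up sample (not from the CSV): heights in inches, weights in pounds.
x = np.array([60.0, 62.0, 64.0, 66.0, 68.0, 70.0])
y = np.array([115.0, 126.0, 130.0, 142.0, 149.0, 160.0])

# Design matrix: one column of x values, one column of ones.
A = np.column_stack([x, np.ones_like(x)])

# Normal equations (A^T A) [a, b]^T = A^T y, solved as a 2x2 linear system.
a, b = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check against NumPy's built-in degree-1 polynomial fit.
a_ref, b_ref = np.polyfit(x, y, 1)
print(a, b)
print(np.allclose([a, b], [a_ref, b_ref]))
```

Swapping `x` and `y` for your own height and weight columns should give your personal `a` and `b`. Plotting is left out so the block runs without a display.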

You'll turn in the assignment as a response to this post. Of course, collaboration is encouraged - even unavoidable. You should certainly find some helpful code on our class web pages:

Comments

  • edited March 2020

    Using

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve, eig
    GA = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=gabriel')
    GA.tail()
    

    The tail of my data was:

    first_name last_name age sex height weight income activity_level
    95 Wanda Hammack 44 female 59.33 166.46 27163 none
    96 Latoya Farmer 53 female 64.22 130.05 3252 high
    97 Stacie Sartwell 26 female 64.61 163.25 8762 high
    98 Elizabeth Lyman 33 female 60.76 150.13 3550 moderate
    99 Mary Ivey 35 female 60.63 137.10 261661 moderate

    Edit: This isn't the entire data set, just the last five rows. I plotted the entire data set using

    import matplotlib.pyplot as plt

    which produced

    I stored the height and weight columns in H1 and W1, respectively. Edit: I reran the code with the entire data set. These are the new H1 and W1:

    H1=GA.height
    W1=GA.weight
    

    Using code found in the Least Squares tutorial,

    A = np.array([H1,np.ones(len(H1))]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(W1))
    a,b
    

    which gave the solution

    (1.1819982447405388, 88.9482338086134)
    

    which roughly corresponds to the equation

    $$ y = 1.18x + 89 $$

    Using this solution, I defined the fitted line and plotted it over the data:

    def f(x): return a*x+b
    fy = [f(x) for x in H1]
    plt.plot(H1,W1, '.')
    plt.plot(H1, fy)
    

    The corresponding plot gave

  • edited March 2020

    Generating data under the glorious username LordFarquaad resulted in

    first_name last_name age sex height weight income activity_level
    95 Mary Phillips 51 female 66.57 183.45 666 high
    96 Samantha Grimes 26 female 65.76 165.57 5359 moderate
    97 Russell Dwelle 33 male 71.36 200.36 3117 none
    98 Maxwell Rodrigues 34 male 70.88 163.93 19932 none
    99 Jim Creasey 57 male 67.60 144.44 4417 moderate

    I then extracted the height and weight columns to their own variables

    height = df.loc[:, "height"]
    weight = df.loc[:, "weight"]
    

    and plotted them.

    plt.plot(height, weight, '.')

    Now to fit the data to a line $f(x)=ax+b$ by solving $A^TAx=A^Tb$, where the matrix $A$ consists of a column of height data and a column of ones, and the vector $b$ holds the weight data.

    # Columns of A: the heights and a column of ones (height**0)
    A = np.array([height, height**0]).transpose()
    # Solve the normal equations A^T A [a, b] = A^T weight
    a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))
    

    The result was a=1.389714565661591 and b=75.88285967631238.
    And finally plotting the line $f(x)=1.39x+75.9$

    def f(x): return a*x + b
    xs = np.linspace(57.5,77.5)
    ys = f(xs)
    plt.plot(xs,ys)
    plt.plot(height,weight, '.')
    

    displays the line of best fit in the least-squares sense.
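
    The normal equations solved above come from minimizing the sum of squared vertical distances between the line and the data. As a short sketch, write $h_i$ and $w_i$ for the individual heights and weights and $w$ for the weight vector that the post calls $b$ (this notation is introduced here to avoid clashing with the intercept $b$). With $x=(a,b)^T$, the quantity to minimize is

    $$E(x) = \|Ax-w\|^2 = \sum_{i=1}^{m}\left(a h_i + b - w_i\right)^2.$$

    Setting the gradient to zero gives

    $$\nabla E(x) = 2A^T(Ax-w) = 0 \quad\Longleftrightarrow\quad A^TAx = A^Tw.$$

    Provided the heights are not all equal, $A^TA$ is invertible, so this system has exactly one solution, and that solution is the minimizer.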

  • edited March 2020

    @dan

    I like your answer the best so far. It's the most complete and presents an analysis using all the data. I do have some comments, though.

    • The plots with just four points are unnecessary.
    • Your domain for the last plot that displays the points and the line is way too big and obscures the fit. It's hard to say what's going on there, since you don't show the code.
    • It looks like you're trying to use LaTeX at some inappropriate spots.

      • You should use HTML to show the table. The command print(df.tail().to_html()) should yield exactly what you want.
      • You should indicate most other computer output as a code comment. In Python, a comment is preceded by a hash (#). A simple example looks like:

      a = 1
      b = 2
      a+b

      # Out:
      # 3

  • @gabriel and @LordFarquaad

    We're interested in all your data - not just five points. :smile:

  • I began this problem by setting up the necessary libraries and retrieving the data.

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve, eig
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=Donkey')
    df.tail()
    

    This gave the output:

    first_name last_name age sex height weight income activity_level
    95 Richard Schmidt 40 male 69.04 184.16 425 moderate
    96 Paul Fleury 29 male 64.96 206.06 11212 moderate
    97 Agnes Pollard 39 female 61.78 233.80 12416 moderate
    98 Diane Morrison 42 female 61.50 179.02 17823 none
    99 Frances Horn 38 female 60.43 132.64 336 moderate

    Note that this is not the complete data set, but only the final 5 rows. I then set up vectors containing the height and weight data for the entire data set, then plotted weight vs. height.

    h = df.loc[:, "height"]
    w = df.loc[:, "weight"]
    plt.plot(h,w,'.')
    

    Next I set up a matrix using the height values as the first column, with the second column all ones. I then used the Least Squares formula to generate a fit line.

    A = np.array([h,np.ones(len(h))]).transpose()
    bv = np.array([w]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(bv))
    print(a,b)
    #output: [0.69005625] [127.93110201]
    

    Therefore the equation for the fit line is $\widehat{\text{weight}} = 0.69\,(\text{height}) + 127.93$. Plotting this line with the original plot gives the following:

  • I started by importing the necessary libraries, as well as the data:

    import pandas as pd
    import numpy as np
    from scipy.linalg import solve, eig
    import matplotlib.pyplot as plt
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=opie')
    df.tail()
    

    The tail of my data was the following:

    first_name last_name age sex height weight income activity_level
    95 Albina Donaldson 25 female 61.62 164.53 10570 none
    96 Joanne Garza 26 female 62.06 153.76 2252 moderate
    97 Eva Garrett 38 female 63.44 134.73 54855 moderate
    98 Avril To 40 female 63.29 186.23 4817454 none
    99 George Weekley 33 male 70.16 190.69 33424 high

    Next, I pulled the height and weight from the table, and labeled them H and W, respectively

    H = df.height
    W = df.weight
    

    I then plotted the whole set of the data and found the following

    plt.plot(H, W, '.')
    

    Then, using the Least Squares Formula, I was able to find an equation for the line of best fit:

    A = np.array([H,np.ones(len(H))]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(W))
    a,b
    # Out: (0.6337213350857557, 131.17989401737339)
    

    Thus, we find that our line of best fit for this data is $$y = 0.6337x + 131.1799.$$

    We can then add this line to the earlier plot to get a good visual on what we should expect:

    def f(x): return a*x+b
    fy = [f(x) for x in H]
    plt.plot(H,W, '.')
    plt.plot(H,fy)
    

  • edited March 2020

    Import libraries, download CSV data and print tail of data:

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve, eig
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=joshua')
    df.tail()
    
    first_name last_name age sex height weight income activity_level
    95 Gary Bejjani 45 male 73.28 157.28 2270 none
    96 Sara Lamb 28 female 64.03 149.05 287 high
    97 Michelle Martinez 43 female 58.38 169.22 68610 moderate
    98 Thomas Hart 40 male 69.84 182.22 45682 high
    99 Delbert Bueche 46 male 68.87 219.97 1463 high

    Extract and plot height and weight columns of data:

    H = df.loc[:, "height"]
    W = df.loc[:, "weight"]
    plt.plot(H,W,'.')
    

    Set up and solve the normal equations to find a linear function that models the data:

    A = np.array([H,np.ones(len(H))]).transpose()
    bv = np.array([W]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(bv))
    print(a,b)
    

    $$f(x) = ax + b$$

    $$a = 1.58265964 \ , \ \ b = 67.31526454$$

    $$\boxed{f(x) = (1.58265964)x + (67.31526454)}$$

    Plot function with data:

    def f(x): return a*x + b
    xs = np.linspace(57.0,77.0)
    ys = f(xs)
    plt.plot(xs,ys)
    plt.plot(H,W, '.')
    

  • Ben
    edited March 2020

    Once upon a time, there was a tabular-formatted collection of data. This table was full of information. Much of that information related to the heights and weights of an (assumed) fictional group of 100 people.

    This data was acquired like so:

    import pandas as pd
    
    def getData(name):
        path =  'https://www.marksmath.org/cgi-bin/random_data.csv?username=' + name
        return pd.read_csv(path)
    
    df = getData('ben')
    

    The tail of that data-table looked like this:
    print(df.tail().to_html())

    first_name last_name age sex height weight income activity_level
    95 Herbert Jurist 43 male 66.43 191.74 8464 none
    96 Mae Forster 21 female 63.53 170.32 19963 high
    97 Han Martens 50 female 60.17 168.36 18925 none
    98 Cole Hills 38 male 70.04 147.05 23065 high
    99 Fred Powell 32 female 67.75 169.37 14074 high

    This data raises many important questions, such as "Is there a direct, linear correlation between sampled heights and weights?", and "Why is Fred Powell female?". The first question can be addressed through a least-squares solution. The latter we leave as an exercise for the reader.

    To find a linear best fit, we first set up our environment:

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve
    

    and generate a quick plot of our data (using matplotlib.pyplot):

    xs = df.height
    ys = df.weight
    
    plt.xlabel('Height')
    plt.ylabel('Weight')
    plt.suptitle('Subject Height vs. Weight')
    
    plt.plot(xs,ys, '.')
    plt.show()
    

    Once the data is visually verified, we briefly define a line:

    def f(x, a, b):
        return a*x + b
    

    and build a least-squares linear regression of our data by solving $A^{T}A\vec{x} = A^{T}\vec{b}$, where $A$ is an $m \times 2$ matrix composed of our height values and a column of ones, and $\vec{b}$ is a column of our weight values.

    # Assemble and solve matrix of datapoints
    A = np.array([xs, np.ones(len(xs))]).transpose()
    a, b = solve(A.transpose().dot(A), A.transpose().dot(ys))
    

    Solving, we get a = -0.18073773896209944 and b = 186.0236653724885. Analyzing our a, or slope of fitted line, shows a small negative correlation between height and weight.

    # Construct and plot the line
    fy = [f(x, a, b) for x in xs]
    

    Plotting our line $ax+b$ against our data we see:

    plt.plot(xs,ys, '.')
    plt.plot(xs, fy)
    plt.show()
    

    The negative correlation is visualized as a downward sloping line against our point cloud. Broadly extrapolating from our sample data, this suggests that short people tend to be heavier than tall people.
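
    The link between the sign of the slope and the sign of the correlation can be made explicit. For a least-squares line through points $(x_i, y_i)$, with sample means $\bar{x}, \bar{y}$, sample standard deviations $s_x, s_y$, and Pearson correlation coefficient $r$ (notation introduced here, not used in the post above), the solution of the normal equations can be written as

    $$a = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2} = r\,\frac{s_y}{s_x}, \qquad b = \bar{y} - a\,\bar{x}.$$

    Since $s_x$ and $s_y$ are positive, $a$ and $r$ always have the same sign. The size of $a$ alone does not determine the size of $r$, though, so judging how weak the correlation is would require computing $r$ itself.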

  • @Ben - Nice story telling! I've used this program in Stat 185/225 a few times but yours is the first time I've wondered if I should double check my programming.

  • (Figured it would be easier to just make a new comment to show the changes you recommended.)

    Going off the others' previous work, I started by running this:

    import pandas as pd
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=dan')
    df.tail()
    
    %matplotlib inline
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve, eig
    

    which produces this tail end of the table:

    first_name last_name age sex height weight income activity_level
    95 Gregory Hernandez 21 male 68.33 191.08 4010 moderate
    96 Kay Walsh 25 female 67.69 134.76 4108 none
    97 Lottie Hoggins 32 female 63.14 171.32 170724 none
    98 Norris Wagers 54 male 67.36 209.84 478 high
    99 Howard Gonzalez 24 male 67.01 202.34 3957 none

    Importing a package to plot the data:

    import matplotlib.pyplot as plt
    

    To consider all the data I made these changes:

    import pandas as pd
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=dan')
    df.tail(99)  # shows the last 99 rows; row 0 is left out
    

    Plotting all of the data:

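    # Note: .loc[1:99] selects rows 1 through 99 (inclusive), so row 0 is skipped;
    # df.loc[:, "height"] would select all rows.
    height = df.loc[1:99, "height"]
    weight = df.loc[1:99, "weight"]
    plt.plot(height, weight, '.')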
    

    Using the least squares method and the provided code, the slope and y-intercept of the line that best fits the data can be found:

    A = np.array([height, height**0]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))
    print(a)
    print(b)
    
    # Out: a = 1.7673603008572196
    #      b = 51.52382533705932
    

    From here a best fit line can be plotted:

    def f(x): return a*x + b
    xs = np.linspace(50,80)
    ys = f(xs)
    plt.plot(xs,ys)
    plt.plot(height,weight, '.')
    

  • edited March 2020

    I started by importing the required libraries,

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.linalg import solve, eig
    

    and then the data itself.

    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=frank')
    df.tail()
    
    first_name last_name age sex height weight income activity_level
    95 Daniel Howard 36 male 68.22 188.49 6231 moderate
    96 Gina Richard 27 female 66.72 107.76 58860 high
    97 Jonathan Daniel 37 male 67.42 164.60 4933 moderate
    98 Oralia Duvall 22 female 62.43 230.45 14265 moderate
    99 Cristina Mateer 39 female 60.73 151.24 473 none

    Putting the data into vector form and plotting the data resulted in the following graph.

    h = df.loc[:, "height"]
    w = df.loc[:, "weight"]
    plt.plot(h,w,'.')
    


    Using the least squares method, I created a matrix $A$ to hold the vector $h$ in the first column, and a column of ones in the second. Solving $A^{\text{T}}A\vec{x}=A^{\text{T}}\vec{w}$ resulted in values for $a\approx1.34$ and $b\approx83.4$.

    A = np.array([h,np.ones(len(h))]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(w))
    

    Using the values of $a$ and $b$ in the function $f(x)=ax+b$ yields the line of best fit for the given data.

    def f(x): return a*x+b
    fy = [f(x) for x in h]
    plt.plot(h,w,'.')
    plt.plot(h,fy)
    

  • edited March 2020

    First, all of the necessary libraries are imported:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from scipy.linalg import solve
    

    The next step of the assignment is to download the data collection so that it can be used in the script:

    # Import data collection
    data = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=joshuam')
    data.tail()
    # print(data.tail())
    print(data.tail().to_html())
    
    first_name last_name age sex height weight income activity_level
    95 Elaine Ferguson 38 female 65.03 219.24 170650 none
    96 Micki Davis 31 female 59.64 139.71 13543 none
    97 Brian Landin 22 male 65.52 140.60 573025 none
    98 Michael Skinner 58 male 69.05 233.09 15852 high
    99 Ryan Hess 29 male 73.72 168.47 4887 none

    Once the data is able to be manipulated, the height and weight values are extracted and plotted on a graph to display all of the data. This generated the following plot:

    # Create variables to store the height and weight values from the collection
    height = data.height
    weight = data.weight
    plt.xlabel('Height')
    plt.ylabel('Weight')
    plt.plot(height, weight, '.')
    

    The next step is to determine the line of best fit for the data. This will be found by finding the solution to the system $A^T A\vec{x} = A^T\vec{b}$.

    A = np.array([height,np.ones(len(height))]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(weight))
    a,b
    

    Running this code will calculate the slope and y-intercept for the line of best fit $ y = ax+b $

    a:
    1.2843741017609727
    b:
    80.09348856482262
    

    As a result, the line of best fit will be y = 1.284x + 80.093, where x represents height and y represents weight.
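
    For a line, the system $A^TA\vec{x} = A^T\vec{b}$ can also be written out entry by entry. Writing $x_i$ and $y_i$ for the $m$ individual heights and weights (subscripted notation introduced here), with unknowns $\vec{x} = (a, b)^T$, the two products are

    $$A^TA = \begin{pmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & m \end{pmatrix}, \qquad A^T\vec{b} = \begin{pmatrix} \sum x_i y_i \\ \sum y_i \end{pmatrix},$$

    so the normal equations are the pair

    $$a\sum x_i^2 + b\sum x_i = \sum x_i y_i, \qquad a\sum x_i + b\,m = \sum y_i.$$

    Dividing the second equation by $m$ shows that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.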

  • Using

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.linalg import solve
    df = pd.read_csv('https://www.marksmath.org/cgi-bin/random_data.csv?username=eli')
    df.tail()
    

    I got the tail of

    first_name last_name age sex height weight income activity_level
    95 Cheryl Henderson 35 female 62.37 204.73 89693 high
    96 Brooks Snow 42 male 70.33 200.56 4479 moderate
    97 Barbara Johnson 56 female 64.78 155.15 1032 high
    98 Ashley Mathias 28 male 70.52 181.02 66083 none
    99 Deborah Schiller 26 female 68.21 193.43 11676 none

    Plotting Height vs Weight gave

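    Additionally, using

    # Assumed definitions (not shown in the post): H and W are the height and weight columns
    H = df.height
    W = df.weight
    A = np.array([H,np.ones(len(H))]).transpose()
    a,b = solve(A.transpose().dot(A), A.transpose().dot(W))
    a,b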
    

    I got a = 0.27654818250610624 and b = 152.14317669408973.
    Therefore,

    $$y = 0.28x + 152.14.$$

    Furthermore, I used

    def f(x): return a*x+b
    fy = [f(x) for x in H]
    plt.plot(H,W, '.')
    plt.plot(H, fy)
    

    to get the fit line
