Issues in dealing with data
Mon, Feb 16, 2026
We’ve talked plenty about linear regression. It’s almost time to actually do a real-world problem. Before doing so, though, we need to mention a few other issues that often arise in the context of real world problems.
Most of these issues are not super complicated - at least, not at the level of linear regression as we’ve worked on to this point. It’s also the case that a good library, like SciKit Learn, will take care of most of this automatically. These things do require a bit of discussion, though.
Here are the libraries we’ll be using in this presentation.
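Based on the code that follows, something like this should cover it (the exact import list is an assumption reconstructed from later cells):

```python
# Presumed imports for this presentation; reconstructed from the code
# that follows, so the original list may have differed slightly.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge
```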
We’re going to work with some real data taken from Kaggle’s House Prices competition for part of this presentation.
The original dataset was taken from the Ames, Iowa Assessor’s Office, which maintains detailed property records used for tax assessment. Dean De Cock, a statistics professor at Truman State, took that data and prepared it for use in intro statistics around 2011.
Kaggle started using the data set in 2016 and licenses it under the open source MIT license.
I’ve got the data on my website so we can easily read it into Python. There are 81 variables across 1460 observations. We’ll read all of the data but display just a bit:
```python
train = pd.read_csv('https://marksmath.org/data/house-prices/train.csv')
my_variables = ["Neighborhood", "GrLivArea", "BedroomAbvGr", "FullBath",
                "HalfBath", "KitchenQual", "GarageCars", "SalePrice"]
print([len(train), len(train.columns)])
train.head(4)[my_variables]
```

[1460, 81]
| | Neighborhood | GrLivArea | BedroomAbvGr | FullBath | HalfBath | KitchenQual | GarageCars | SalePrice |
|---|---|---|---|---|---|---|---|---|
| 0 | CollgCr | 1710 | 3 | 2 | 1 | Gd | 2 | 208500 |
| 1 | Veenker | 1262 | 3 | 2 | 0 | TA | 2 | 181500 |
| 2 | CollgCr | 1786 | 3 | 2 | 1 | Gd | 2 | 223500 |
| 3 | Crawfor | 1717 | 3 | 1 | 0 | Gd | 3 | 140000 |
It turns out that there’s a lot of missing data.
It’s likely that most of that data is missing simply because it’s not applicable to a given house and nobody checked the N/A box. Nonetheless, we need a strategy to deal with it if we want to use this data to make predictions on other houses.
The strategy for dealing with missing data is pretty simple. Use a K-Nearest Neighbors algorithm to find similar houses and simply replace the missing data with the average (for numerical data) or most common occurrence (for categorical data).
This process is called imputation.
When we do our lab next week, you’ll notice a KNNImputer function that takes care of this for us!
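Here’s a minimal sketch of how that looks (the tiny table below is made up; note that KNNImputer works on numerical columns):

```python
import numpy as np
from sklearn.impute import KNNImputer

# A made-up numerical table with one missing entry.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [3.0, 4.0]])

# Replace the NaN with the average of that column over the
# two most similar rows (as measured on the non-missing columns).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The entries that were present to begin with pass through unchanged; only the NaN is filled in.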
Ultimately, most machine learning algorithms work with numbers. When working with categorical data, there needs to be some sensible way to translate values into numbers. The standard technique for doing so depends on whether the variable is ordinal (its values have a natural order) or nominal (they don’t).
Consider the KitchenQual variable, which (according to the data documentation) takes on the five ordered quality ratings Po, Fa, TA, Gd, and Ex, running from poor up through excellent.
In this case, a simple 1 through 5 should work just fine.
Many categorical variables cannot be ordered in that fashion; these are called nominal. A simple example could be color, which might take on one of four values
red, yellow, blue, purple
While I, personally, think that purple is the best, others may differ. There’s no natural ranking of the colors leading to a sensible order.
The most commonly used encoding strategy is called one-hot encoding.
The strategy of one-hot encoding is to replace a single categorical variable (which I’ll call the main variable) with a vector of variables (that I’ll call slots).
If the main variable has \(n\) possible values, then the corresponding vector of slots has length \(n-1\). Each slot refers to one of the main values, and the value in each slot can be either \(1\) or \(0\), depending on whether the main variable takes on that value or not.
There’s a default value so that all zeros indicates that the variable takes on that default value.
Let’s suppose I’ve got a variable called color, which can take on one of the four values
red, yellow, blue, purple
If we choose purple to be the default value, then our new “color” vector takes on the values \((1,0,0)\) for red, \((0,1,0)\) for yellow, \((0,0,1)\) for blue, and \((0,0,0)\) for purple.
The Neighborhood variable in the housing data is a good example of a nominal variable. It takes on 25 possible values including names like ‘Blmngtn’, ‘BrkSide’, ‘CollgCr’, ‘Sawyer’, and ‘Veenker’.
Encoding in SciKit Learn will be handled with OrdinalEncoder and OneHotEncoder.
Overfitting is a constant problem in machine learning; it’s an issue with just about every ML algorithm. Accordingly, every such algorithm should have standard techniques to deal with it.
The standard technique to combat overfitting in the context of linear regression is called regularization.
Regularization to counter overfitting is based on the observation that the coefficients of a polynomial fit tend to grow as the order of the polynomial grows. Thus, to avoid overfitting, we place a penalty on the size of the coefficients.
In functional terms, we might minimize \[ \|X\mathbf{a} - \mathbf{y}\|^2 + \lambda \|\mathbf{a}\|^2. \]
The constant \(\lambda\) is a parameter that influences the process: the larger \(\lambda\), the stronger the regularization. Generally, \(\lambda\) is determined experimentally or even as part of the fitting process.
Recall that a high-degree polynomial doesn’t work well here:

```python
plt.figure(figsize=(8, 3))
N = 9
f = lambda x: 1/(1 + x**2)
x = np.linspace(-8, 8, N)
y = f(x)
xs = np.linspace(-8.2, 8.2, 200)
ys = f(xs)
plt.plot(xs, ys, '-')
linear_model = LinearRegression(fit_intercept=True)
X = np.array([x**(N - 1 - p) for p in range(len(x))]).T
linear_model.fit(X, y)
coefs = np.round(linear_model.coef_, 4)
print("Non-zero coefs: ", coefs[coefs != 0])
Xs = np.array([xs**(N - 1 - p) for p in range(len(x))]).T
Ys = linear_model.predict(Xs)
plt.plot(xs, Ys, '-')
plt.plot(x, y, 'ok')
```

Non-zero coefs: [-0.0006 0.022 -0.2787]
Here’s what it looks like if we apply ridge regularization:

```python
plt.figure(figsize=(8, 3))
plt.plot(xs, ys, '-')
ridge_model = Ridge(alpha=10, fit_intercept=True)
X = np.array([x**(N - 1 - p) for p in range(len(x))]).T
ridge_model.fit(X, y)
coefs = np.round(ridge_model.coef_, 4)
print("Non-zero coefs: ", coefs[coefs != 0])
Xs = np.array([xs**(N - 1 - p) for p in range(len(x))]).T
Ys = ridge_model.predict(Xs)
plt.plot(xs, Ys, '-')
plt.plot(x, y, 'ok')
ax = plt.gca()
ax.set_ylim([-1.6, 1.6])
```

Non-zero coefs: [-0.0001 0.0044 -0.0902]
Recall that we have the following expression for the norm of a vector \(\mathbf{x} = [x_1\:\:\:x_2\:\:\:\cdots\:\:\:x_n]^{\mathsf{T}}\):
\[ \|\mathbf{x}\| = \sqrt{x_1^2+x_2^2+\cdots+x_n^2}. \]
We use the concept of distance frequently in numerical analysis and machine learning. For example, we quantify the error \(E\) of a model \(\mathbf{F}\) given input \(\mathbf{x}\) with expected output \(\mathbf{y}\) as \[E = \|\mathbf{F}(\mathbf{x}) - \mathbf{y}\|.\] It turns out that there are other ways magnitude can be measured that can affect the results and efficiency of a model. This leads to a generalization of the notion of magnitude called a “norm”.
The Manhattan norm of \(\mathbf{x}\in\mathbb R^n\) is defined by \[ \|\mathbf{x}\|_1 = |x_1| + |x_2|+\cdots+|x_n|. \] We might think of the Manhattan norm as the distance from the tip of a vector back to its origin as measured along a grid of streets in space. I’ve also seen it called the “taxicab norm”.
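Both norms are easy to compute with numpy; for a made-up vector:

```python
import numpy as np

v = np.array([3.0, -4.0])
manhattan = np.linalg.norm(v, ord=1)  # |3| + |-4| = 7
euclidean = np.linalg.norm(v, ord=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```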

You might notice that we denoted the Manhattan norm as \(\|\mathbf{x}\|_1\).
It’s also common to denote the standard Euclidean norm as \(\|\mathbf{x}\|_2\).
In fact, they both lie in a whole family of norms called the \(L_p\) norms. The parameter \(p\) can be any real number in \([1,\infty]\), both “endpoints” included!
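For finite \(p\), the general formula interpolates between our two examples: \[ \|\mathbf{x}\|_p = \left(|x_1|^p + |x_2|^p + \cdots + |x_n|^p\right)^{1/p}, \] which reduces to the Manhattan norm at \(p=1\) and the Euclidean norm at \(p=2\). The endpoint \(p=\infty\) is interpreted as the limit \(\|\mathbf{x}\|_{\infty} = \max_i |x_i|\).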
Given a vector space \(V\) with norm \(\|\cdot\|\), the corresponding unit ball is \[ \{\vec{x}\in V: \|\vec{x}\| \leq 1\}. \] The shape of the ball depends on the norm, as well as on the space itself. Here are pictures of the unit ball in \(\mathbb R^2\) with respect to three different norms.
As we’ll see, the shape of the unit ball in \(L_1\) norm is of particular interest.
A large number of input variables can lead to several problems, including overfitting, extra computational expense, and models that are harder to interpret.
Thus, we have good reason to attempt to reduce the number of variables, where appropriate.
It turns out that the process of regularization can shed light on variable selection, and L1 regularization is particularly well suited to this task.
The common term LASSO, which refers to L1 regularization, is an acronym standing for
Least Absolute Shrinkage and Selection Operator
In this column of slides, we’ll illustrate how this might work.
In order to understand how variable selection works, it helps to reformulate the regularized minimization problem as follows:
We minimize \[\|X\mathbf{a} - \mathbf{y}\|^2 \text{ subject to } \|\mathbf{a}\| \leq s,\] where \(s\) is a parameter related to \(\lambda\) from our original formulation.
This reformulation leads to a nice visualization illustrating why this might work.
Now, let’s suppose we have a list of data points \(\{(x_i,y_i)\}\) and we hope to fit a line of the form \(f_{a,b}(x)=ax+b\) to that data. In this case, \[ E(a,b) = \|X\mathbf{a} - \mathbf{y}\|^2 \] will be a quadratic function of \(a\) and \(b\), and its graph \(z=E(a,b)\) in three-dimensional space will be an upward-opening paraboloid.
The contours of the error function are ellipses that collapse down to the minimum:
Now, the L2 regularized minimum occurs at the point of tangency between a contour and a circle in the L2 norm. If the \(x\) and \(y\) inputs have a common scale, then it certainly seems that the \(y\)-coordinate is the more important of the two.
Here’s the corresponding picture in the L1 norm. We can see that the \(x\)-coordinate is effectively zero and (presumably) much less important than the \(y\)-coordinate. More importantly, we can see why this happens.
Here’s the key observation coming out of all this:
Regularization with the LASSO can be very effective at identifying variables that can be removed from the data without serious degradation of results.
Often \(L_1\) regularization (the LASSO) is used in combination with \(L_2\) regularization. SciKit Learn has functions to automate all of this, including
- sklearn.linear_model.Ridge for \(L_2\) regularization,
- sklearn.linear_model.Lasso for \(L_1\) regularization,
- sklearn.linear_model.ElasticNet for a combination.

These all have CV versions, which brings us (almost) to our last topic.
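Here’s a small synthetic illustration (the data is made up) of the variable-selection effect: the response depends only on the first two of five inputs, and the Lasso fit zeroes out the coefficients of the irrelevant ones.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends only on the first two of five inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3*X[:, 0] - 2*X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(np.round(lasso.coef_, 3))  # the last three should collapse to zero
```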
Our last two topics are brief and much simpler.
First, it is very common practice to scale data before passing it to a machine learning algorithm. This is very simple, given numerical data - which all data is, once encoded; we simply rescale each variable so that its mean is zero and its standard deviation is one.
Think of it this way: culmen length shouldn’t contribute more to a model than flipper length just because you happen to measure one in millimeters and the other in meters.
That’s it.
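In scikit-learn, this is StandardScaler; here’s a tiny sketch on made-up measurements where the two columns differ only in units:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 in millimeters, column 1 in meters.
X = np.array([[180.0, 0.181],
              [195.0, 0.195],
              [210.0, 0.210]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column now has mean 0
print(X_scaled.std(axis=0))   # ... and standard deviation 1
```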
Finally, let’s mention a bit about cross-validation.
Generally, you fit your model to the training data but you evaluate your model with testing data.
In the context of a Kaggle competition, they ultimately evaluate your model using the testing data that they provide. There’s no reason you can’t evaluate your model on your own, though.
For the rest of this column, let’s refer to the two types of data as labeled data (where we know the value of the response variable) and unlabeled data (where we don’t).
The idea is to fit and evaluate your model using the labeled data. Ultimately, you want to use it on unlabeled data.
To fit and evaluate a model using labeled data, we’ll break it into two parts:
We’ll fit the training data and evaluate using the (labeled) testing data.
In \(k\)-fold cross-validation, we randomly partition the labeled data into \(k\) pieces. We then set aside \(1\) piece for testing and use the other \(k-1\) pieces for training. This is done \(k\) times using each of the \(k\) pieces for testing.
For \(5\)-fold cross-validation, we run the process five times training with 80% of the data and testing with 20% each time.
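scikit-learn’s cross_val_score automates exactly this; here’s a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with a bit of noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)

# Five folds: fit on 80% and score (R^2 by default) on the held-out 20%.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)
```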
Functions like ElasticNetCV in scikit-learn automate this procedure. Selected hyperparameters (like the regularization strength or the L1 ratio) are optimized using the cross-validation results, and the final fit is then obtained by refitting the full data with those chosen hyperparameters.
In addition, information on the cross-validation results is returned in the regression object.
You can detect overfitting in the context of linear regression quite concretely using cross-validation. Suppose we’ve got \(n\) observations and \(p\) predictors. If \(\text{TRAIN\_MSE}\) and \(\text{TEST\_MSE}\) denote the mean squared error on the training and testing sets, then one can show that, in expectation, \[ \text{TRAIN\_MSE} = \frac{n-p-1}{n+p+1} \times \text{TEST\_MSE}. \] Thus, a training MSE significantly smaller than this indicates overfitting.
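Here’s a synthetic sketch of that symptom: an overly flexible polynomial fit whose training MSE ends up far below its testing MSE (all data below is made up).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a smooth function.
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=60)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0)

# A degree-15 polynomial has plenty of room to chase the training noise.
poly = PolynomialFeatures(degree=15)
model = LinearRegression().fit(poly.fit_transform(x_train), y_train)

train_mse = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
test_mse = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
print(train_mse, test_mse)  # training error well below testing error
```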
So much of this will make so much more sense, once we really play with it in the lab!