Sampling vs. Resampling With Python: Key Differences and Applications

Learn the key differences between sampling and resampling in data science with real-world examples, Python code and best practices.

Apr 16th, 2025 4:00pm by Jack Wallen


Have you ever watched or listened to the news during election times and heard mention of sampling or sample size when referencing polls? Those samples are essentially a small subset of voters used to represent the entire population of a country.

Sampling is an important aspect of data science and is used everywhere. And then there’s resampling.

What are these things, and why are they so important? Let’s dive in and find out.

What Is Sampling?

In the wonderful world of Python, sampling is the process of selecting a subset of data points from an original data set to represent the entire data set. The ultimate goal of sampling is to reduce the size of a data set while preserving its essential characteristics. Sampling is widely used in data science, machine learning (ML) and statistics. Python provides multiple methods and libraries for sampling, including random sampling techniques.

Sampling is a crucial aspect of using data sets when programming and can be done using one of the following methods:

Uniform random sampling: Selecting data points at random, with every point equally likely to be chosen
Stratified sampling: Dividing the data into subsets (strata) and randomly selecting from each subset
Systematic sampling: Selecting data points based on a fixed interval or pattern
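
The three methods above can be sketched with pandas. This is only an illustration under assumptions: the toy data set, the column names ("value" and "group") and the 10% sampling rate are made up for the example and are not part of any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data set: 1,000 rows with an imbalanced "group" column.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "value": rng.random(1000),
    "group": rng.choice(["a", "b", "c"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Uniform random sampling: every row has the same chance of selection.
random_sample = df.sample(n=100, random_state=42)

# Stratified sampling: draw 10% from each group so proportions are kept.
stratified_sample = df.groupby("group").sample(frac=0.1, random_state=42)

# Systematic sampling: take every 10th row, starting from the first.
systematic_sample = df.iloc[::10]

print(len(random_sample), len(stratified_sample), len(systematic_sample))
```

Systematic sampling is the simplest of the three, but it can mislead if the data has a repeating pattern that lines up with the chosen interval.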

When working with large data sets, sampling becomes even more important because it can:

Reduce computational complexity.
Improve storage efficiency.
Facilitate analysis of smaller data sets.

Sampling has applications across several use cases, including:

Data reduction: When dealing with massive data sets, sampling can help reduce the size while preserving essential characteristics of the data set.
Model training: Within the realm of ML, sampling is used to create new training data sets for model development and evaluation.
Oversampling: Oversampling is used in some ML algorithms where more samples are generated from underrepresented classes to improve performance.
Data augmentation: Sampling can be used for data augmentation techniques in image and speech processing for rotation, scaling or flipping images, and even adding noise to audio signals.
Surveys and research: Sampling is essential in surveys and research studies where a representative subset of participants is selected from the target population.

What Is Resampling?

Unlike sampling, resampling involves changing the size or density of a data set by interpolating or extrapolating new data points from existing values. Resampling is often used to fill gaps between observations, reduce noise, filter out high-frequency components and change the rate at which a signal or time series is sampled.

There are different resampling methods, such as:

Linear interpolation: Estimating missing values between existing points.
Polynomial regression: Using polynomial equations to estimate missing values.
Spline-based resampling: Interpolating data with smooth curves.
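
As a rough sketch of the first two methods, the example below estimates new values on a denser grid using numpy alone. The sine-curve data and the polynomial degree of 5 are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical signal: 10 known points on a sine curve.
x_known = np.linspace(0, 2 * np.pi, 10)
y_known = np.sin(x_known)

# Denser grid on which to estimate new values.
x_new = np.linspace(0, 2 * np.pi, 50)

# Linear interpolation: straight lines between neighboring points.
y_linear = np.interp(x_new, x_known, y_known)

# Polynomial fit: a single degree-5 polynomial fitted by least squares.
coeffs = np.polyfit(x_known, y_known, deg=5)
y_poly = np.polyval(coeffs, x_new)

# Compare each estimate against the true curve.
y_true = np.sin(x_new)
print("max linear error:", np.max(np.abs(y_linear - y_true)))
print("max polynomial error:", np.max(np.abs(y_poly - y_true)))
```

Spline-based resampling follows the same pattern: fit on the known points, then evaluate on the new grid. SciPy's scipy.interpolate.CubicSpline is one option for that step.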

As for how resampling can be applied, consider this list:

Image resizing (using the Pillow library)
Audio resampling (using the scipy library)
Data interpolation (using numpy)
Time series resampling (using pandas)
Data augmentation (using the torch library)
Signal processing (using the scipy library)
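
To make the time series case concrete, here is a minimal sketch using pandas. The readings and timestamps are invented for the example; only the resample(), asfreq(), interpolate() and mean() calls come from pandas itself.

```python
import pandas as pd

# Hypothetical hourly readings over six hours.
index = pd.date_range("2025-01-01", periods=6, freq="60min")
series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], index=index)

# Upsample to 30-minute intervals; the new slots are NaN until filled.
upsampled = series.resample("30min").asfreq()

# Fill the gaps with linear interpolation between neighboring readings.
interpolated = upsampled.interpolate(method="linear")

# Downsample to 2-hour intervals by averaging each window.
downsampled = series.resample("120min").mean()

print(interpolated.head())
print(downsampled)
```

Upsampling creates values that were never measured, while downsampling summarizes values that were, so the two directions carry different risks.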

Key Differences Between Sampling and Resampling

Sampling	Resampling
Used for exploration, modeling or feature engineering.	Employed for data augmentation, noise reduction or signal processing.
Does not modify existing values.	May introduce new estimates based on interpolation/extrapolation.
Generally preserves statistical properties and distribution.	Can alter statistical properties and distribution.
Involves randomly selecting a subset of elements from a larger data set.	Involves estimating missing values by interpolating or extrapolating (e.g., using a polynomial fit) between existing data points.
Samples do not contain interpolated values between existing data points.	Resampling alters the number of elements in the data set, which can affect its statistical properties and distribution.
Sampling preserves the underlying statistical properties and distribution of the original data.	Resampling captures underlying patterns and relationships within the original data.

When To Use Sampling vs. Resampling

Use sampling for:

Exploring or investigating the underlying distribution to aid in understanding the characteristics of an original data set without altering its statistical properties.
Generating training data, validating model performance and exploring different scenarios.
Creating new features from existing ones, such as converting categorical variables into numerical representations.

Use resampling for:

Expanding the training set by generating additional samples with varying characteristics.
Modifying the sample rate of a signal while preserving its essential features, such as frequency content.
Adjusting the sampling interval and creating new estimates based on past observations.

How To Perform Sampling and Resampling in Python

Here’s an example of sampling with Python, using pandas and numpy:

import numpy as np
import pandas as pd

# Create a large array of random values (e.g., 10,000 rows)
np.random.seed(42) # To ensure reproducibility of the results.
data = np.random.rand(10000)

# Convert data to pandas DataFrame for sampling and analysis.
df = pd.DataFrame(data, columns=['Value'])

# Sample from the larger dataset using random indices
sample_indices = np.random.choice(df.index, size=20)
sampled_df = df.loc[sample_indices]

print(sampled_df.head()) # Print the first five sampled rows.

Here’s a breakdown of the above code:

Create an array data containing 10,000 random values between 0 and 1.
Convert the data into pandas DataFrame (df) for easier manipulation and analysis using sampling capabilities provided by DataFrames.
Randomly select 20 index labels. Note that np.random.choice() samples with replacement by default, so the same row can be picked more than once; pass replace=False to avoid duplicates.
Use the .loc[] method on DataFrame df and sample_indices variable as arguments to yield a new dataframe (sampled_df) containing just sampled elements from the original data.
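
The same kind of draw can also be sketched with pandas' built-in DataFrame.sample() method, which samples without replacement by default. The sample sizes and the random_state value below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.rand(10000), columns=["Value"])

# Draw 20 rows without replacement; random_state makes it repeatable.
sampled_df = df.sample(n=20, random_state=42)

# Alternatively, draw a fraction of the rows (here 1%).
fraction_df = df.sample(frac=0.01, random_state=42)

# Compare the sample mean to the full data set's mean.
print(df["Value"].mean(), sampled_df["Value"].mean())
```

With only 20 rows, the sample mean can drift noticeably from the full mean. Larger samples usually track it more closely.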

Here’s an example of resampling with Python, using pandas and numpy:

import numpy as np
import pandas as pd

# Create an array of 1,000 evenly spaced values
data = np.linspace(-10, 30, 1000)

# Convert data to pandas DataFrame for resampling and analysis
df = pd.DataFrame(data, columns=['Value'])

# Resample using linear interpolation to get double the number of samples
ratio = 2
old_positions = np.arange(len(df))
new_positions = np.linspace(0, len(df) - 1, len(df) * ratio)
resampled_values = np.interp(new_positions, old_positions, df['Value'].to_numpy())

resampled_df = pd.DataFrame(resampled_values, columns=['Value'])

print(resampled_df.head()) # Print the first five resampled values.

The breakdown of the above code looks like this:

Create a data array containing 1,000 evenly spaced values between -10 and 30.
Convert the data into pandas DataFrame (df) for easier manipulation and analysis.
Build a new set of positions that covers the same range as the original rows but contains twice as many points (a ratio of 2).
Call np.interp() to estimate the value at each new position by linear interpolation between the original values, then store the result in a new DataFrame (resampled_df).

Common Challenges and Best Practices

There are a few common challenges you should consider when using sampling and resampling in Python. First, let’s look at sampling:

Sampling can sometimes result in under- or oversampling, where the sample size is too small or too large compared to the original data set.
If the sampling ratio is set too low, many data points are left out of the sample, which could lead to biased results.
Random sampling makes results vary from run to run unless you fix a random seed.

Next, let’s consider these resampling challenges:

When using interpolation methods like linear or cubic spline, resampling may lose some information and details in data points, causing artifacts in the resampled values that are far from the original ones.
Resampling can sometimes lead to over-smoothing, which might not be desirable if you want to capture some level of noise or variability in a data set.
Interpolation methods may struggle when dealing with edge cases such as data points near the boundaries.
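
The boundary problem is easy to see in a small sketch. The data below is hypothetical and follows y = 2x exactly, so the correct values outside the known range are known in advance.

```python
import numpy as np

x_known = np.array([0.0, 1.0, 2.0, 3.0])
y_known = np.array([0.0, 2.0, 4.0, 6.0])  # y = 2x

# Query points include values outside the known range [0, 3].
x_query = np.array([-1.0, 1.5, 4.0])

# np.interp does not extrapolate: out-of-range queries are clamped
# to the first and last known y values.
y_clamped = np.interp(x_query, x_known, y_known)

# A fitted line can extrapolate instead, though estimates outside
# the data are only as good as the assumed trend.
slope, intercept = np.polyfit(x_known, y_known, deg=1)
y_extrapolated = slope * x_query + intercept

print(y_clamped)
print(y_extrapolated)
```

Neither behavior is right in general. Clamping silently flattens the ends, while extrapolation trusts a trend that may not hold beyond the observed data.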

Handling Bias in Sampling and Resampling

There are a few strategies for handling bias in both sampling and resampling, such as:

Randomize the sampling process.
Use stratified sampling.
Use oversampling techniques.
Use data augmentation.
Use regularization.

Ensuring Representative Data

Here are ways you can ensure representative data:

Use stratified sampling to ensure that each class or category in the data set is represented proportionally.
Group the data by class first, then sample from each group.
Sample without replacement when possible.
Use data augmentation techniques like rotation, flipping and scaling.
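
One way to check whether a sample is representative is to compare class proportions before and after sampling. The sketch below assumes a made-up "label" column with a rare class. It is an illustration, not a complete validation procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical data set in which the "rare" class makes up about 5%.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "label": rng.choice(["common", "rare"], size=2000, p=[0.95, 0.05]),
})

# Stratified 10% sample, drawn without replacement within each class.
sample = df.groupby("label").sample(frac=0.1, random_state=0)

# Class proportions in the full data and in the sample.
full_share = df["label"].value_counts(normalize=True)
sample_share = sample["label"].value_counts(normalize=True)

print(pd.DataFrame({"full": full_share, "sample": sample_share}))
```

If the two columns differ by more than a rounding error, the sampling step deserves a closer look.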

Avoiding Overfitting With Proper Resampling

Instead of reducing the sample size, use oversampling techniques like SMOTE or RandomOverSampler (both available in the imbalanced-learn library) to artificially increase the number of minority class samples.
Use undersampling techniques like TomekLinks or EditedNearestNeighbours to reduce the number of majority class samples by removing only borderline or noisy data points.
Apply random transformations such as rotation, flipping and scaling to create new training examples from existing ones.
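
Random oversampling, the simplest of these techniques, can be sketched with pandas alone. The data set, column names and class sizes below are hypothetical. Dedicated libraries such as imbalanced-learn offer more complete implementations.

```python
import pandas as pd

# Hypothetical imbalanced data set: 90 majority rows, 10 minority rows.
df = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: redraw minority rows with replacement until
# the two classes are the same size.
minority_upsampled = minority.sample(
    n=len(majority), replace=True, random_state=42
)

balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(balanced["label"].value_counts())
```

Because the duplicated rows are exact copies, oversampling is usually applied only to the training split. Otherwise copies of the same row can end up in both training and test data and inflate the evaluation scores.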

Conclusion

Sampling and resampling are crucial concepts in data science that can greatly impact the accuracy and reliability of our analysis. Sampling involves selecting a subset of data points from an original data set to represent the entire data set, while resampling involves changing the size or density of a data set by interpolating or extrapolating data points between existing values.

By understanding the key differences between sampling and resampling, including their applications, advantages and limitations, we can make informed decisions about when to use each technique. Sampling is often used for exploration, modeling and feature engineering, while resampling is employed for data augmentation, noise reduction or signal processing.
