Sampling is an important technique in your statistical arsenal. It isn’t always appropriate though—you need to know when to use it and when to work with the whole dataset.
Which of the following is not a good scenario for using sampling?
When the dataset is small.
Simple sampling with dplyr
Throughout this chapter you’ll be exploring song data from Spotify. Each row of the dataset represents a song, and there are 41656 rows. Columns include the name of the song, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. We’ll start by looking at the durations.
Your first task is to sample the song dataset and compare a calculation on the whole population and on a sample.
spotify_population is available and dplyr is loaded.
library(tidyverse)
library(fst)
library(knitr)

spotify_population <- read_fst("data/spotify_2000_2020.fst")

# Sample 10 rows from spotify_population
spotify_sample <- slice_sample(spotify_population, n = 10)

# See the result
kable(spotify_sample)
| acousticness | artists | danceability | duration_ms | duration_minutes | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo | valence | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.17100 | ['John K'] | 0.958 | 144427 | 2.407117 | 0.376 | 0 | 5HBDBBr5OV90lubqW8ctJF | 0.000000 | 8 | 0.1270 | -7.063 | 1 | if we never met | 76 | 2019-04-26 | 0.0523 | 107.964 | 0.331 | 2019 |
| 0.20400 | ['The Chainsmokers', 'Bebe Rexha'] | 0.585 | 217640 | 3.627333 | 0.696 | 0 | 2oejEp50ZzPuQTQ6v54Evp | 0.000000 | 4 | 0.3440 | -5.600 | 0 | Call You Mine | 83 | 2019-12-06 | 0.0307 | 104.010 | 0.522 | 2019 |
| 0.38800 | ['Never Shout Never'] | 0.768 | 178144 | 2.969067 | 0.683 | 0 | 2waLDWGLc4Q14ZVyDNrxLM | 0.000000 | 7 | 0.2860 | -7.173 | 1 | cheatercheaterbestfriendeater | 57 | 2010-08-23 | 0.0604 | 99.955 | 0.669 | 2010 |
| 0.57200 | ['Conway Twitty'] | 0.638 | 199600 | 3.326667 | 0.377 | 0 | 4VF5XIZ7EiMfLElRzYG2E8 | 0.000000 | 2 | 0.1020 | -15.640 | 1 | I'd Love To Lay You Down | 52 | 2001-01-01 | 0.0341 | 81.911 | 0.778 | 2001 |
| 0.00879 | ['Los Angeles Azules', 'Jay de la Cueva'] | 0.526 | 336147 | 5.602450 | 0.765 | 0 | 6rEA2GbBkGC9GqOOrBwvza | 0.000463 | 0 | 0.9670 | -4.544 | 1 | 17 Años - Concierto Sinfónico Cumbia Fuzion | 51 | 2015-10-16 | 0.0368 | 91.035 | 0.627 | 2015 |
| 0.00575 | ['Modest Mouse'] | 0.521 | 229867 | 3.831117 | 0.931 | 0 | 72zsr1jSMnaMPtl713jXeJ | 0.000000 | 2 | 0.2430 | -4.549 | 1 | Bury Me With It | 44 | 2004-04-05 | 0.0654 | 170.048 | 0.816 | 2004 |
| 0.99400 | ['Mae Ji-Yoon'] | 0.671 | 173719 | 2.895317 | 0.200 | 0 | 1COWV6U3LKqCBpiHPthbiB | 0.939000 | 8 | 0.1140 | -23.978 | 1 | Vibrations | 65 | 2017-09-10 | 0.0365 | 107.517 | 0.530 | 2017 |
| 0.13200 | ['Dr. Dog'] | 0.454 | 234800 | 3.913333 | 0.820 | 0 | 0UV5zxRMz6AO4ZwUOZNIKI | 0.000969 | 2 | 0.1150 | -4.193 | 1 | Where'd All the Time Go? | 65 | 2010-11-02 | 0.0567 | 166.303 | 0.575 | 2010 |
| 0.03890 | ['Lil Baby'] | 0.902 | 162791 | 2.713183 | 0.850 | 1 | 2XEsbmynS9dLSzNSuZzfXF | 0.000000 | 7 | 0.0838 | -6.390 | 1 | Same Thing | 64 | 2020-02-28 | 0.3580 | 120.132 | 0.556 | 2020 |
| 0.68400 | ['invention_'] | 0.543 | 184000 | 3.066667 | 0.467 | 0 | 00DTeE4nekCTgYz1QYHXSl | 0.006040 | 4 | 0.3510 | -11.223 | 1 | Nature Bump 000 | 54 | 2015-06-06 | 0.2740 | 75.293 | 0.328 | 2015 |
Next, compare the mean song duration in the whole population with the mean in your sample.
# Calculate the mean duration in mins from spotify_population
mean_dur_pop <- summarize(spotify_population, mean(duration_minutes))

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp <- summarize(spotify_sample, mean(duration_minutes))

# See the results
mean_dur_pop
mean(duration_minutes)
1 3.852152
mean_dur_samp
mean(duration_minutes)
1 3.435225
Simple sampling with base R
While dplyr provides great tools for sampling data frames, you can use base R if you want to work with vectors.
Let’s turn it up to eleven and look at the loudness property of each song.
spotify_population is available.
# From previous step
loudness_pop <- spotify_population$loudness
loudness_samp <- sample(loudness_pop, size = 100)

# Calculate the standard deviation of loudness_pop
sd_loudness_pop <- sd(loudness_pop)

# Calculate the standard deviation of loudness_samp
sd_loudness_samp <- sd(loudness_samp)

# See the results
sd_loudness_pop
[1] 4.524076
sd_loudness_samp
[1] 4.184483
Are findings from the sample generalizable?
You just saw how convenience sampling (collecting data via the easiest method) can result in samples that aren't representative of the whole population. Equivalently, this means findings from the sample are not generalizable to the whole population. Visualizing the distributions of the population and the sample can help determine whether the sample is representative of the population.
The Spotify dataset contains a column named acousticness, which is a confidence measure from zero to one of whether the track is acoustic, that is, it was made with instruments that aren’t plugged in. Here, you’ll look at acousticness in the total population of songs, and in a sample of those songs.
spotify_population and spotify_mysterious_sample are available; dplyr and ggplot2 are loaded.
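The exercise code isn't reproduced here, but a minimal sketch of the comparison, assuming both data frames contain an acousticness column and that spotify_mysterious_sample is the convenience sample to inspect (the binwidth is an arbitrary choice), might look like this:

# Sketch (not the original exercise code): compare the two distributions visually
# Visualize the distribution of acousticness in the population
ggplot(spotify_population, aes(acousticness)) +
  geom_histogram(binwidth = 0.01) +
  xlim(0, 1)

# Visualize the distribution of acousticness in the sample on the same x-axis
ggplot(spotify_mysterious_sample, aes(acousticness)) +
  geom_histogram(binwidth = 0.01) +
  xlim(0, 1)

If the two histograms have clearly different shapes, the sample is unlikely to be representative of the population.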
You’ve seen sample() and its dplyr cousin, slice_sample(), for drawing pseudo-random samples from a set of values. A related task is to generate random numbers that follow a statistical distribution, like the uniform distribution or the normal distribution.
Each random number generation function has a name beginning with “r”. Its first argument is the number of values to generate; the other arguments are distribution-specific. Free hint: try args(runif) and args(rnorm) to see which arguments you need to pass to those functions.
n_numbers is available and set to 5000; ggplot2 is loaded.
n_numbers <- 5000

# Generate random numbers from ...
randoms <- data.frame(
  # a uniform distribution from -3 to 3
  uniform = runif(n_numbers, -3, 3),
  # a normal distribution with mean 5 and sd 2
  normal = rnorm(n_numbers, mean = 5, sd = 2)
)

# Plot a histogram of uniform values, binwidth 0.25
ggplot(randoms, aes(uniform)) +
  geom_histogram(binwidth = 0.25)

# Plot a histogram of normal values, binwidth 0.5
ggplot(randoms, aes(normal)) +
  geom_histogram(binwidth = 0.5)
Understanding random seeds
While random numbers are important for many analyses, they create a problem: the results you get can vary slightly. This can cause awkward conversations with your boss when your script for calculating the sales forecast gives different answers each time.
Setting the seed for R's random number generator helps avoid such problems by making the random number generation reproducible. Without setting a seed, two separate calls to a function like rnorm() produce different results: the values of x are different from those of y.
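A small sketch of the idea (the names x and y and the seed value 123 are just illustrative choices):

# Without a seed, two calls give different numbers
x <- rnorm(5)
y <- rnorm(5)
# x and y are different

# Setting the same seed before each call makes the output reproducible
set.seed(123)  # 123 is an arbitrary seed; any integer works
x <- rnorm(5)
set.seed(123)
y <- rnorm(5)
# x and y are now identical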