Sampling in R

Published

September 9, 2022

Reasons for sampling

Sampling is an important technique in your statistical arsenal. It isn’t always appropriate though—you need to know when to use it and when to work with the whole dataset.

  • Which of the following is not a good scenario for using sampling?
  • When the dataset is small.
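To see why sampling pays off in the other scenarios, here is a minimal sketch. It uses simulated durations, not the Spotify dataset, so all the numbers are illustrative: a modest sample can stand in for a much larger population at a fraction of the cost.

```r
# Simulated stand-in population: one million song durations in minutes
# (synthetic values, not the real Spotify data)
set.seed(2022)
population <- rnorm(1e6, mean = 3.85, sd = 0.8)

# A sample of 1000 values is far cheaper to store and process
duration_sample <- sample(population, size = 1000)

mean(population)
mean(duration_sample)  # close to the population mean
```

The sample mean lands close to the population mean, which is the whole point: when computing on the full population is expensive, a well-drawn sample gives nearly the same answer.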

Simple sampling with dplyr

Throughout this chapter you’ll be exploring song data from Spotify. Each row of the dataset represents a song, and there are 41656 rows. Columns include the name of the song, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. We’ll start by looking at the durations.

Your first task is to sample the song dataset and compare a calculation on the whole population and on a sample.

spotify_population is available and dplyr is loaded.

library(tidyverse)
library(fst)
library(knitr)
spotify_population <- read_fst("data/spotify_2000_2020.fst")
# View the whole population dataset
glimpse(spotify_population)

# Sample 10 rows from spotify_population
spotify_sample <- slice_sample(spotify_population, n = 10)


# See the result
kable(spotify_sample)
| acousticness | artists | danceability | duration_ms | duration_minutes | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo | valence | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.17100 | [‘John K’] | 0.958 | 144427 | 2.407117 | 0.376 | 0 | 5HBDBBr5OV90lubqW8ctJF | 0.000000 | 8 | 0.1270 | -7.063 | 1 | if we never met | 76 | 2019-04-26 | 0.0523 | 107.964 | 0.331 | 2019 |
| 0.20400 | [‘The Chainsmokers’, ‘Bebe Rexha’] | 0.585 | 217640 | 3.627333 | 0.696 | 0 | 2oejEp50ZzPuQTQ6v54Evp | 0.000000 | 4 | 0.3440 | -5.600 | 0 | Call You Mine | 83 | 2019-12-06 | 0.0307 | 104.010 | 0.522 | 2019 |
| 0.38800 | [‘Never Shout Never’] | 0.768 | 178144 | 2.969067 | 0.683 | 0 | 2waLDWGLc4Q14ZVyDNrxLM | 0.000000 | 7 | 0.2860 | -7.173 | 1 | cheatercheaterbestfriendeater | 57 | 2010-08-23 | 0.0604 | 99.955 | 0.669 | 2010 |
| 0.57200 | [‘Conway Twitty’] | 0.638 | 199600 | 3.326667 | 0.377 | 0 | 4VF5XIZ7EiMfLElRzYG2E8 | 0.000000 | 2 | 0.1020 | -15.640 | 1 | I’d Love To Lay You Down | 52 | 2001-01-01 | 0.0341 | 81.911 | 0.778 | 2001 |
| 0.00879 | [‘Los Angeles Azules’, ‘Jay de la Cueva’] | 0.526 | 336147 | 5.602450 | 0.765 | 0 | 6rEA2GbBkGC9GqOOrBwvza | 0.000463 | 0 | 0.9670 | -4.544 | 1 | 17 Años - Concierto Sinfónico Cumbia Fuzion | 51 | 2015-10-16 | 0.0368 | 91.035 | 0.627 | 2015 |
| 0.00575 | [‘Modest Mouse’] | 0.521 | 229867 | 3.831117 | 0.931 | 0 | 72zsr1jSMnaMPtl713jXeJ | 0.000000 | 2 | 0.2430 | -4.549 | 1 | Bury Me With It | 44 | 2004-04-05 | 0.0654 | 170.048 | 0.816 | 2004 |
| 0.99400 | [‘Mae Ji-Yoon’] | 0.671 | 173719 | 2.895317 | 0.200 | 0 | 1COWV6U3LKqCBpiHPthbiB | 0.939000 | 8 | 0.1140 | -23.978 | 1 | Vibrations | 65 | 2017-09-10 | 0.0365 | 107.517 | 0.530 | 2017 |
| 0.13200 | [‘Dr. Dog’] | 0.454 | 234800 | 3.913333 | 0.820 | 0 | 0UV5zxRMz6AO4ZwUOZNIKI | 0.000969 | 2 | 0.1150 | -4.193 | 1 | Where’d All the Time Go? | 65 | 2010-11-02 | 0.0567 | 166.303 | 0.575 | 2010 |
| 0.03890 | [‘Lil Baby’] | 0.902 | 162791 | 2.713183 | 0.850 | 1 | 2XEsbmynS9dLSzNSuZzfXF | 0.000000 | 7 | 0.0838 | -6.390 | 1 | Same Thing | 64 | 2020-02-28 | 0.3580 | 120.132 | 0.556 | 2020 |
| 0.68400 | [‘invention_’] | 0.543 | 184000 | 3.066667 | 0.467 | 0 | 00DTeE4nekCTgYz1QYHXSl | 0.006040 | 4 | 0.3510 | -11.223 | 1 | Nature Bump 000 | 54 | 2015-06-06 | 0.2740 | 75.293 | 0.328 | 2015 |
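Besides a fixed row count via `n`, `slice_sample()` can draw a proportion of rows via `prop`. A self-contained sketch, using dplyr's built-in `starwars` data frame (87 rows) as a stand-in since the Spotify file may not be at hand:

```r
library(dplyr)

# Sample a fixed number of rows...
starwars_n <- slice_sample(starwars, n = 10)

# ...or a proportion of rows (here roughly 10% of 87)
starwars_prop <- slice_sample(starwars, prop = 0.1)

nrow(starwars_n)
nrow(starwars_prop)
```

`slice_sample()` also accepts `replace = TRUE` for sampling with replacement and `weight_by` for weighted sampling, which come up later in resampling workflows.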


# Calculate the mean duration in mins from spotify_population
mean_dur_pop <- summarize(spotify_population, mean(duration_minutes))


# Calculate the mean duration in mins from spotify_sample
mean_dur_samp <- summarize(spotify_sample, mean(duration_minutes))


# See the results
mean_dur_pop
  mean(duration_minutes)
1               3.852152
mean_dur_samp
  mean(duration_minutes)
1               3.435225
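The sample mean differs from the population mean because of sampling variation. A quick sketch with simulated durations (the real column isn't bundled here, so the parameters are made up) shows that sample means of size 1000 cluster tightly around the population mean:

```r
# Simulated stand-in for the duration_minutes column
set.seed(42)
durations <- rnorm(41656, mean = 3.85, sd = 1)

# Draw 500 samples of size 1000 and record each sample mean
sample_means <- replicate(500, mean(sample(durations, size = 1000)))

mean(durations)     # population mean
mean(sample_means)  # centered on the population mean
sd(sample_means)    # roughly sd(durations) / sqrt(1000)
```

The spread of the sample means shrinks with the square root of the sample size, which is why larger samples give more reliable point estimates.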

Simple sampling with base R

While dplyr provides great tools for sampling data frames, base R's sample() function works directly on vectors.

Let’s turn it up to eleven and look at the loudness property of each song.

spotify_population is available.

# From previous step
loudness_pop <- spotify_population$loudness
loudness_samp <- sample(loudness_pop, size = 100)

# Calculate the standard deviation of loudness_pop
sd_loudness_pop <- sd(loudness_pop)

# Calculate the standard deviation of loudness_samp
sd_loudness_samp <- sd(loudness_samp)

# See the results
sd_loudness_pop
[1] 4.524076
sd_loudness_samp
[1] 4.184483
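sample() has two more arguments worth knowing: `replace` for sampling with replacement and `prob` for weighted sampling. A small sketch with made-up loudness values (not the real `loudness_pop` vector):

```r
# Made-up loudness values in dB, standing in for loudness_pop
loudness <- c(-4.5, -7.1, -15.6, -5.6, -23.9, -6.4)

set.seed(1)
# Resample with replacement (the basis of bootstrapping)
boot <- sample(loudness, size = length(loudness), replace = TRUE)
boot

# Weighted sampling: here, quieter (more negative) tracks get
# larger weights, so they are drawn more often
sample(loudness, size = 3, prob = abs(loudness))
```

Without `replace = TRUE`, asking for a sample larger than the vector is an error, so with-replacement sampling is also the only way to "upsample" a vector.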

Are findings from the sample generalizable?

You just saw how convenience sampling (collecting data via the easiest method) can result in samples that aren’t representative of the whole population. Equivalently, this means findings from the sample are not generalizable to the whole population. Visualizing the distributions of the population and the sample can help determine whether or not the sample is representative of the population.

The Spotify dataset contains a column named acousticness, which is a confidence measure from zero to one of whether the track is acoustic, that is, it was made with instruments that aren’t plugged in. Here, you’ll look at acousticness in the total population of songs, and in a sample of those songs.

spotify_population and spotify_mysterious_sample are available; dplyr and ggplot2 are loaded.

ggplot(spotify_population, aes(acousticness)) +
    geom_histogram(binwidth = 0.01)

ggplot(spotify_mysterious_sample, aes(acousticness)) +
    geom_histogram(binwidth = 0.01)
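spotify_mysterious_sample isn't reproduced here, but the comparison it illustrates can be sketched with simulated acousticness scores, where the "sample" is deliberately biased toward acoustic tracks (all distributions below are invented for illustration):

```r
library(ggplot2)

# Simulated population (skewed toward low acousticness) and a
# convenience sample biased toward high acousticness
set.seed(123)
pop  <- data.frame(acousticness = rbeta(10000, 1, 3), group = "population")
samp <- data.frame(acousticness = rbeta(1000, 3, 1), group = "sample")

# Overlaid histograms make the mismatch obvious
p <- ggplot(rbind(pop, samp), aes(acousticness, fill = group)) +
  geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.5)
p
```

When the two histograms have visibly different shapes, as they do here by construction, findings from the sample should not be generalized to the population.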

Generating random numbers

You’ve seen sample() and its dplyr cousin, slice_sample(), for sampling from a set of values. A related task is to generate random numbers that follow a statistical distribution, like the uniform distribution or the normal distribution.

Each random number generation function has a name beginning with “r”. Its first argument is how many numbers to generate, and the remaining arguments are distribution-specific. Free hint: try args(runif) and args(rnorm) to see what arguments you need to pass to those functions.

n_numbers is available and set to 5000; ggplot2 is loaded.

n_numbers <- 5000
# Generate random numbers from ...
randoms <- data.frame(
  # a uniform distribution from -3 to 3
  uniform = runif(n_numbers, -3, 3),
  # a normal distribution with mean 5 and sd 2
  normal = rnorm(n_numbers, mean = 5, sd = 2)
)


# Plot a histogram of uniform values, binwidth 0.25
ggplot(randoms, aes(uniform)) +
    geom_histogram(binwidth = 0.25)

# Plot a histogram of normal values, binwidth 0.5
ggplot(randoms, aes(normal)) +
    geom_histogram(binwidth = 0.5)
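The same pattern extends to every distribution in base R: the count comes first, then the distribution-specific parameters, which args() will list for you. A short sketch:

```r
# First argument: how many numbers; the rest are distribution-specific
args(runif)  # function (n, min = 0, max = 1)
args(rnorm)  # function (n, mean = 0, sd = 1)

set.seed(1)
rbinom(5, size = 10, prob = 0.5)  # binomial: successes out of 10 trials
rpois(5, lambda = 3)              # Poisson: counts with mean 3
```

Swapping the leading "r" for "d", "p", or "q" gives the density, cumulative probability, and quantile functions of the same distribution.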

Understanding random seeds

While random numbers are important for many analyses, they create a problem: the results you get can vary slightly. This can cause awkward conversations with your boss when your script for calculating the sales forecast gives different answers each time.

Setting the seed of R’s random number generator helps avoid such problems by making the random number generation reproducible.

  • The values of x are different from those of y.

set.seed(123)
x <- rnorm(5)
y <- rnorm(5)
x
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
y
[1]  1.7150650  0.4609162 -1.2650612 -0.6868529 -0.4456620
  • x and y have identical values.
set.seed(123)
x <- rnorm(5)
set.seed(123)
y <- rnorm(5)
x
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
y
[1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774
  • x and y have identical values.
set.seed(123)
x <- c(rnorm(5), rnorm(5))
set.seed(123)
y <- rnorm(10)
x
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
y
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
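set.seed() works because R keeps the generator state in a variable called .Random.seed; capturing and restoring that state replays the same stream. A sketch of that mechanism:

```r
set.seed(2022)
a <- runif(3)

# Capture the generator state, draw numbers, then restore the state
state <- .Random.seed
b <- runif(3)
.Random.seed <- state
b_again <- runif(3)

identical(b, b_again)  # TRUE: restoring the state replays the stream
```

This also explains the third example above: the seed fixes the whole stream of draws, so it doesn't matter whether you take ten numbers in one call or in two calls of five.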