Scalable Data Processing in R

Author

Mburu

Published

April 13, 2021

How does processing time vary by data size?

If you process every element of two data sets and one is larger, the larger one will take longer. How much longer, however, is not always proportional to the difference in size. A data set twice the size of another is not guaranteed to take twice as long to process: it could take 1.5 times as long, or four times as long, depending on which operations are applied to it.

In this exercise, you’ll use the microbenchmark package, which was covered in the Writing Efficient R Code course.

Note: Vector lengths are specified using scientific notation (for example, 1e5 is 100,000).

# Load the microbenchmark package
library(microbenchmark)

# Compare the timings for sorting different sizes of vector
mb <- microbenchmark(
  # Sort a random normal vector length 1e5
  "1e5" = sort(rnorm(1e5)),
  # Sort a random normal vector length 2.5e5
  "2.5e5" = sort(rnorm(2.5e5)),
  # Sort a random normal vector length 5e5
  "5e5" = sort(rnorm(5e5)),
  # Sort a random normal vector length 7.5e5
  "7.5e5" = sort(rnorm(7.5e5)),
  # Sort a random normal vector length 1e6
  "1e6" = sort(rnorm(1e6)),
  times = 10
)

# Plot the resulting benchmark object
plot(mb)
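
To read the scaling numerically rather than from the plot, one option is to compare each median timing against the smallest case. The following is a minimal sketch and not part of the original exercise: the names `sizes`, `inputs`, `mb_sort`, and `scaling` are made up for illustration, and the vectors are generated up front so that only `sort()` is timed (the benchmark above also times `rnorm()`).

```r
library(microbenchmark)

set.seed(1)

# Illustrative sizes; a subset of those used in the exercise
sizes <- c(1e5, 2.5e5, 5e5)

# Generate the inputs once so the benchmark measures sorting only
inputs <- lapply(sizes, rnorm)

mb_sort <- microbenchmark(
  "1e5"   = sort(inputs[[1]]),
  "2.5e5" = sort(inputs[[2]]),
  "5e5"   = sort(inputs[[3]]),
  times = 5
)

# summary() returns one row per expression, in the order given above
timing <- summary(mb_sort, unit = "ms")

# Compare growth in data size with growth in median run time
scaling <- data.frame(
  size       = sizes,
  size_ratio = sizes / sizes[1],
  median_ms  = timing$median,
  time_ratio = timing$median / timing$median[1]
)
print(scaling)
```

If `time_ratio` stays close to `size_ratio`, the operation scales roughly linearly over this range. A `time_ratio` that grows faster than `size_ratio` suggests superlinear cost. Exact timings will differ between machines and runs.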

Reading a big.matrix object

In this exercise, you’ll create your first file-backed big.matrix object using the read.big.matrix() function. The function is designed to resemble read.table(), but it needs three additional pieces of information: the type of numeric values to read (“char”, “short”, “integer”, or “double”), the name of the file that will hold the matrix’s data (the backing file), and the name of the file that will hold information about the matrix (the descriptor file). The result is a file on disk holding the values read in, along with a descriptor file that holds extra information about the resulting big.matrix object, such as the number of rows and columns.

# Load the bigmemory package
library(bigmemory)

# Create the big.matrix object: x
x <- read.big.matrix("mortgage-sample.csv", header = TRUE, 
                     type = "integer", 
                     backingfile = "mortgage-sample.bin", 
                     descriptorfile = "mortgage-sample.desc")
    
# Find the dimensions of x
dim(x)
[1] 70000    16
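
The mortgage file may not be at hand, so here is a small self-contained sketch of the same call on a toy CSV written to a temporary directory. The file names (`toy.csv`, `toy.bin`, `toy.desc`), the data, and the variable names are hypothetical. The `backingpath` argument is used here to keep the backing and descriptor files in that temporary directory.

```r
library(bigmemory)

# Work in a fresh temporary directory so nothing is left in the project folder
tmp_dir <- tempfile("bm_demo_")
dir.create(tmp_dir)

# Write a small all-integer CSV (hypothetical data)
csv_path <- file.path(tmp_dir, "toy.csv")
toy <- data.frame(id = 1:5,
                  year = c(2008L, 2009L, 2010L, 2013L, 2014L))
write.csv(toy, csv_path, row.names = FALSE, quote = FALSE)

# Read it into a file-backed big.matrix
x_toy <- read.big.matrix(csv_path, header = TRUE,
                         type = "integer",
                         backingfile = "toy.bin",
                         backingpath = tmp_dir,
                         descriptorfile = "toy.desc")

# The matrix dimensions match the CSV
dim(x_toy)

# The backing file and descriptor file now sit next to the CSV
list.files(tmp_dir)
```

The backing file stores the values themselves, and the descriptor file is what a later session needs in order to find and interpret them.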

Attaching a big.matrix object

Now that the big.matrix object is on disk, we can use the information stored in the descriptor file to make it available almost instantly in an R session. This means you don’t have to reimport the data set, a step that takes longer the larger the file is. You can simply point the bigmemory package at the existing files on disk and begin accessing the data without the wait.

# Attach mortgage-sample.desc
mort <- attach.big.matrix("mortgage-sample.desc")

# Find the dimensions of mort
dim(mort)
[1] 70000    16

# Look at the first 6 rows of mort
head(mort)
     enterprise record_number msa perc_minority tract_income_ratio
[1,]          1           566   1             1                  3
[2,]          1           116   1             3                  2
[3,]          1           239   1             2                  2
[4,]          1            62   1             2                  3
[5,]          1           106   1             2                  3
[6,]          1           759   1             3                  3
     borrower_income_ratio loan_purpose federal_guarantee borrower_race
[1,]                     1            2                 4             3
[2,]                     1            2                 4             5
[3,]                     3            8                 4             5
[4,]                     3            2                 4             5
[5,]                     3            2                 4             9
[6,]                     2            2                 4             9
     co_borrower_race borrower_gender co_borrower_gender num_units
[1,]                9               2                  4         1
[2,]                9               1                  4         1
[3,]                5               1                  2         1
[4,]                9               2                  4         1
[5,]                9               3                  4         1
[6,]                9               1                  2         2
     affordability year type
[1,]             3 2010    1
[2,]             3 2008    1
[3,]             4 2014    0
[4,]             4 2009    1
[5,]             4 2013    1
[6,]             4 2010    1
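
The create-then-attach round trip can be sketched end to end on a toy file. This is an illustration under assumptions, not the exercise's own code: the file names and data are hypothetical, and the descriptor is attached by its full path in a temporary directory. It also shows that indexing a big.matrix with `[` returns an ordinary in-memory R matrix.

```r
library(bigmemory)

# Set up a toy file-backed big.matrix in a temporary directory
tmp_dir <- tempfile("bm_attach_")
dir.create(tmp_dir)
csv_path <- file.path(tmp_dir, "toy.csv")
toy <- data.frame(id = 1:5,
                  year = c(2008L, 2009L, 2010L, 2013L, 2014L))
write.csv(toy, csv_path, row.names = FALSE, quote = FALSE)

x_toy <- read.big.matrix(csv_path, header = TRUE,
                         type = "integer",
                         backingfile = "toy.bin",
                         backingpath = tmp_dir,
                         descriptorfile = "toy.desc")

# Drop the R object; the data stays on disk in toy.bin
rm(x_toy)

# Re-attach from the descriptor file without re-reading the CSV
y_toy <- attach.big.matrix(file.path(tmp_dir, "toy.desc"))
dim(y_toy)

# Subsetting with [ copies the selected rows into a regular R matrix
first_rows <- y_toy[1:3, ]
class(first_rows)
first_rows
```

Because `[` copies the selected values into memory, it is best used on subsets small enough to fit comfortably in RAM.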