Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it’s very important to understand how to ask the right question to properly unlock the potential of your dataset.
In this lesson, you will learn:

- How to prepare your data for model-building.
- How to use ggplot2 for data visualization.
The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.
Let’s see this by working through a practical exercise.
Artwork by @allison_horst
We’ll require the following packages to slice and dice this lesson:

- tidyverse: The tidyverse is a collection of R packages designed to make data science faster, easier and more fun!

You can have them installed as:

install.packages(c("tidyverse"))
The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)
Now, let’s fire up some packages and load the data provided for this lesson!
# Load the core Tidyverse packages
library(tidyverse)
# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")
# Get a glimpse and dimensions of the data
glimpse(pumpkins)
## Rows: 1,757
## Columns: 26
## $ `City Name` <chr> "BALTIMORE", "BALTIMORE", "BALTIMORE", "BALTIMORE", ~
## $ Type <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Package <chr> "24 inch bins", "24 inch bins", "24 inch bins", "24 ~
## $ Variety <chr> NA, NA, "HOWDEN TYPE", "HOWDEN TYPE", "HOWDEN TYPE",~
## $ `Sub Variety` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Grade <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Date <chr> "4/29/17", "5/6/17", "9/24/16", "9/24/16", "11/5/16"~
## $ `Low Price` <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
## $ `High Price` <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
## $ `Mostly Low` <dbl> 270, 270, 160, 160, 90, 90, 160, 160, 160, 160, 160,~
## $ `Mostly High` <dbl> 280, 280, 160, 160, 100, 100, 170, 160, 170, 160, 17~
## $ Origin <chr> "MARYLAND", "MARYLAND", "DELAWARE", "VIRGINIA", "MAR~
## $ `Origin District` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ `Item Size` <chr> "lge", "lge", "med", "med", "lge", "lge", "med", "lg~
## $ Color <chr> NA, NA, "ORANGE", "ORANGE", "ORANGE", "ORANGE", "ORA~
## $ Environment <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ `Unit of Sale` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Quality <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Condition <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Appearance <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Storage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Crop <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Repack <chr> "E", "E", "N", "N", "N", "N", "N", "N", "N", "N", "N~
## $ `Trans Mode` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ...25 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ ...26 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
# Print the first 50 rows of the data set
pumpkins %>%
  slice_head(n = 50)
A quick glimpse() immediately shows that there are blanks and a mix of strings (chr) and numeric data (dbl). The Date column is of type character, and there’s also a strange column called Package where the data is a mix between sacks, bins and other values. The data, in fact, is a bit of a mess 😤.
Indeed, it is not very common to be gifted a dataset that is completely ready to use for creating an ML model out of the box. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques to visualize the data. 📈📊
A refresher: The pipe operator (%>%) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying “and then” in your code.
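For instance, here’s a quick sketch (using the pumpkins data already loaded above) of how a nested call compares with its piped equivalent:

# Nested call: reads inside out
slice_head(select(pumpkins, `Low Price`, `High Price`), n = 5)

# Piped equivalent: "take pumpkins, AND THEN select two columns, AND THEN slice the head"
pumpkins %>%
  select(`Low Price`, `High Price`) %>%
  slice_head(n = 5)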
One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing, or unknown, values with a special sentinel value: NA (Not Available).
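Since NA tends to propagate through computations, here’s a quick illustration (with made-up values) before we hunt for it in the data:

# NA is contagious: most operations involving NA return NA
NA > 5                            # NA, not FALSE
mean(c(1, 2, NA))                 # NA
mean(c(1, 2, NA), na.rm = TRUE)   # 1.5, after removing missing values

# Even NA == NA is NA; use is.na() to test for missingness
is.na(c(1, NA))                   # FALSE TRUE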
So how would we know that the data frame contains missing values? A handy way is anyNA(), which returns a single logical value: TRUE or FALSE.

pumpkins %>%
  anyNA()
## [1] TRUE
Great, there seems to be some missing data! That’s a good place to start.
Another way would be to use is.na(), which indicates which individual column elements are missing with a logical TRUE.

pumpkins %>%
  is.na() %>%
  head(n = 7)
## City Name Type Package Variety Sub Variety Grade Date Low Price
## [1,] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## [2,] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## [3,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [5,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [6,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## [7,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## High Price Mostly Low Mostly High Origin Origin District Item Size Color
## [1,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [3,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## Environment Unit of Sale Quality Condition Appearance Storage Crop Repack
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## Trans Mode ...25 ...26
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE
## [5,] TRUE TRUE TRUE
## [6,] TRUE TRUE TRUE
## [7,] TRUE TRUE TRUE
Okay, that got the job done, but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually 😴. A better way is to compute the sum of missing values per column:
pumpkins %>%
  is.na() %>%
  colSums()
## City Name Type Package Variety Sub Variety
## 0 1712 0 5 1461
## Grade Date Low Price High Price Mostly Low
## 1757 0 0 0 103
## Mostly High Origin Origin District Item Size Color
## 103 3 1626 279 616
## Environment Unit of Sale Quality Condition Appearance
## 1757 1595 1757 1757 1757
## Storage Crop Repack Trans Mode ...25
## 1757 1757 0 1757 1757
## ...26
## 1654
Much better! There is missing data, but maybe it won’t matter for the task at hand. Let’s see what further analysis brings forth.
Along with its awesome set of packages and functions, R has very good documentation. For instance, use help(colSums) or ?colSums to find out more about the function.
Artwork by @allison_horst
dplyr, a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we’ll explore some of dplyr’s verbs!
select() is a function in the package dplyr which helps you pick columns to keep or exclude.

To make your data frame easier to work with, drop several of its columns, using select(), keeping only the columns you need.
For instance, in this exercise, our analysis will involve the columns Package, Low Price, High Price and Date. Let’s select these columns.
# Select desired columns
pumpkins <- pumpkins %>%
  select(Package, `Low Price`, `High Price`, Date)

# Print data set
pumpkins %>%
  slice_head(n = 5)
mutate() is a function in the package dplyr which helps you create or modify columns, while keeping the existing columns.

The general structure of mutate is:

data %>% mutate(new_column_name = what_it_contains)
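As a little sketch of that structure on a made-up toy tibble (not our pumpkins data):

# Toy example: derive a total from two existing columns
tibble(fruit = c("pumpkin", "squash"), price = c(4, 3), qty = c(2, 5)) %>%
  mutate(total = price * qty)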
Let’s take mutate out for a spin using the Date column by doing the following operations:

- Convert the dates (currently of type character) to a month format (these are US dates, so the format is MM/DD/YYYY).
- Extract the month from the dates to a new column.
In R, the package lubridate makes it easier to work with date-time data. So, let’s use dplyr::mutate(), lubridate::mdy() and lubridate::month() to achieve the above objectives. We can drop the Date column since we won’t be needing it again in subsequent operations.
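Before applying these to the whole column, a quick sanity check of what these functions return, using the first date we saw in the glimpse above:

# mdy() parses "month/day/year" strings into proper Date objects
lubridate::mdy("4/29/17")                     # "2017-04-29"

# month() then extracts the month number from a Date
lubridate::month(lubridate::mdy("4/29/17"))   # 4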
# Load lubridate
library(lubridate)
pumpkins <- pumpkins %>%
  # Convert the Date column to a date object
  mutate(Date = mdy(Date)) %>%
  # Extract month from Date
  mutate(Month = month(Date)) %>%
  # Drop Date column
  select(-Date)

# View the first few rows
pumpkins %>%
  slice_head(n = 7)
Woohoo! 🤩
Next, let’s create a new column Price, which represents the average price of a pumpkin. We’ll take the average of the Low Price and High Price columns to populate it.
# Create a new column Price
pumpkins <- pumpkins %>%
  mutate(Price = (`Low Price` + `High Price`)/2)

# View the first few rows of the data
pumpkins %>%
  slice_head(n = 5)
Yeees!💪
“But wait!”, you’ll say after skimming through the whole data set with View(pumpkins), “There’s something odd here!” 🤔

If you look at the Package column, pumpkins are sold in many different configurations. Some are sold in 1 1/9 bushel measures, some in 1/2 bushel measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
Let’s verify this:
# Verify the distinct observations in Package column
pumpkins %>%
  distinct(Package)
Amazing!👏
Pumpkins seem to be very hard to weigh consistently, so let’s filter them by selecting only pumpkins with the string bushel in the Package column and put this in a new data frame new_pumpkins.

- dplyr::filter(): creates a subset of the data only containing rows that satisfy your conditions, in this case, pumpkins with the string bushel in the Package column.
- stringr::str_detect(): detects the presence or absence of a pattern in a string (illustrated just below).

The stringr package provides simple functions for common string operations.
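Here’s a tiny illustration of str_detect() at work on a couple of package strings like the ones in our data:

# str_detect() works elementwise, returning TRUE wherever the pattern appears
str_detect(c("24 inch bins", "1/2 bushel cartons"), "bushel")   # FALSE TRUE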
# Retain only pumpkins with "bushel"
new_pumpkins <- pumpkins %>%
  filter(str_detect(Package, "bushel"))
# Get the dimensions of the new data
dim(new_pumpkins)
## [1] 415 5
# View a few rows of the new data
new_pumpkins %>%
  slice_head(n = 5)
You can see that we have narrowed down to 415 rows of data containing pumpkins by the bushel. 🤩
But wait! There’s one more thing to do.
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.
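To make the math concrete with a hypothetical $15 package: a 1 1/9 bushel package holds 10/9 of a bushel, so its per-bushel price is 15 ÷ (10/9) = $13.50, while a 1/2 bushel package at the same price works out to 15 ÷ 0.5 = $30 per bushel:

# Per-bushel price = package price / bushels per package (hypothetical $15)
15 / (1 + 1/9)   # 13.5
15 / (1/2)       # 30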
We’ll use the function case_when() to mutate the Price column depending on some conditions. case_when() allows you to vectorise multiple if_else() statements.
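To see what “vectorising multiple if_else() statements” means, here’s a toy comparison on made-up values (not our pumpkins data):

x <- c(1, 5, 12)

# Nested if_else() calls quickly get hard to read...
if_else(x < 3, "small", if_else(x < 10, "medium", "large"))

# ...while case_when() lists the conditions flat; the first match wins
case_when(
  x < 3  ~ "small",
  x < 10 ~ "medium",
  TRUE   ~ "large"
)

# Both return: "small" "medium" "large"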
# Convert the price if the Package contains fractional bushel values
new_pumpkins <- new_pumpkins %>%
  mutate(Price = case_when(
    str_detect(Package, "1 1/9") ~ Price/(1 + 1/9),
    str_detect(Package, "1/2") ~ Price/(1/2),
    TRUE ~ Price))

# View the first few rows of the data
new_pumpkins %>%
  slice_head(n = 30)
Now, we can analyze the pricing per unit based on their bushel measurement.
✅ According to The Spruce Eats, a bushel’s weight depends on the type of produce, as it’s a volume measurement. “A bushel of tomatoes, for example, is supposed to weigh 56 pounds… Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.” It’s all pretty complicated! Let’s not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
Now lastly, for the sheer sake of adventure 💁‍♀️, let’s also move the Month column to the first position, i.e. before the Package column.

dplyr::relocate() is used to change column positions.
# Move the Month column before the Package column
new_pumpkins <- new_pumpkins %>%
  relocate(Month, .before = Package)

# View the first few rows
new_pumpkins %>%
  slice_head(n = 7)
Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!
Infographic by Dasani Madipalli
There is a wise saying that goes like this:
“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
Part of the data scientist’s role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
R offers several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 allows you to compose graphs by combining independent components.
Let’s start with a simple scatter plot for the Price and Month columns.
So in this case, we’ll start with ggplot(), supply a dataset and aesthetic mapping (with aes()), then add layers (like geom_point()) for scatter plots.
# Set a theme for the plots
theme_set(theme_light())
# Create a scatter plot
p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))
p + geom_point()
Is this a useful plot 🤷? Does anything about it surprise you?
It’s not particularly useful, as all it does is display your data as a spread of points in a given month.
To get charts to display useful data, you usually need to group the data somehow. For instance, in our case, finding the average price of pumpkins for each month would provide more insight into the underlying patterns in our data. This leads us to one more dplyr flyby:
dplyr::group_by() %>% summarize()

Grouped aggregation in R can be easily computed using dplyr::group_by() %>% summarize():

- dplyr::group_by() changes the unit of analysis from the complete dataset to individual groups, such as per month.
- dplyr::summarize() creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.
For example, we can use dplyr::group_by() %>% summarize() to group the pumpkins based on the Month column and then find the mean price for each month.
# Find the average price of pumpkins per month
new_pumpkins %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price))
Succinct!✨
Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are geom_bar() and geom_col(): geom_bar() counts the number of cases in each group by default, while geom_col() plots the values in the data as the bar heights, which is what we want here. Consult ?geom_bar to find out more.
Let’s whip up one!
# Find the average price of pumpkins per month then plot a bar chart
new_pumpkins %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price)) %>%
  ggplot(aes(x = Month, y = mean_price)) +
  geom_col(fill = "midnightblue", alpha = 0.7) +
  ylab("Pumpkin Price")
🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
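And recalling the earlier point that a scatterplot following a line hints at linear regression, here’s an optional extra (not part of the original walkthrough): overlaying a least-squares trend line on the monthly means with geom_smooth() to eyeball any linear pattern.

# Overlay a linear trend on the monthly mean prices
new_pumpkins %>%
  group_by(Month) %>%
  summarise(mean_price = mean(Price)) %>%
  ggplot(aes(x = Month, y = mean_price)) +
  geom_point(color = "midnightblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkorange") +
  ylab("Pumpkin Price")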
Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!