Cyclistic Project

Summary

This case study was completed by Osbaldo Albornoz in February 2023 as part of the Google Data Analytics Professional Certificate capstone unit. R has been used to complete this case study and then hosted online through Github.

This is a case study, where we perform thinking as real-world tasks of a junior data analyst. In this case we are working for a fictional company, Cyclistic a Bike-share company in Chicago, where we meet different characters and team members.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. 

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. 

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. 

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends. 

Case Details

Businesses Task

How to convert casual riders into annual members?

Cyclistic currently offers:

  • 5,800 bicycles.

  • 690 stations.

Bikes can be unlocked from one station and returned to another station in the system anytime.

As of August 2021, Cyclistic offers the following pricing plans (USD):

  • Single ride passes, $3.30/trip

  • Full day passes, $15/day

  • Annual memberships, $9/month

The Cyclistic team have recently concluded that annual memberships are more profitable than casual riders. Furthermore, the team have noted that while 30% of users use the bikes for their work commute, the majority of Cyclistic users ride for leisure. This report will assess how existing Cyclistic causal riders can be encouraged to convert to annual memberships.

M E T H O D O L O G Y S T E P S

ASK

  1. How annual members and casual riders differ?

  2. Why casual riders would buy a membership?

  3. How digital media could affect their marketing tactics?

Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Specific assignment from my team:

Answer the first Question:

How do annual members and casual riders use Cyclistic bikes differently?

PREPARE

We will use R language, R Notebook to document and share insights.

We will use Cyclistic’s historical trip data to elaborate metrics, analyze it , and if is possible identify patterns and trends. In order to create a 12 months dataset, we need to download some months from year (August 2020 to July 2021), we will download from Motivate international Inc at our desktop, then process it in RStudio). We will use Cyclistic’s historical trip data to analyze and identify trends. (Note: The datasets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license:. This is public data that you can use to explore how different customer types are using Cyclistic bikes. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. . Our date is ROCCC, because it’s verifiable as Reliable (data is unbiased), Original (data comes from original source), Comprehensive (contains necessary data to answer the questions), Current (data is generated continuously,on January 5th, will be ready to download total 2022), and Cited . As we said data was obtaining from Motivate International Inc under license. download from divvy-tripdata, and credible source of information.


Data Limitations: data-privacy issues prohibit you from using riders’ personally identifiable information. This means that we won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.

options(repos = c(CRAN = "https://cloud.r-project.org"))

PROCESS

Installing and loading packages

# Install and Load necessary packages

install.packages("tidyverse")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("skimr")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("janitor")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("dplyr")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("lubridate")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("ggmap")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("data.table")
## 
## The downloaded binary packages are in
##  /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(lubridate)
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(dplyr)
library(ggmap)
## ℹ Google's Terms of Service: <]8;;https://mapsplatform.google.comhttps://mapsplatform.google.com]8;;>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

Loading Raw Data

# Loading Raw data

August_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202008-divvy-tripdata.csv")
## Rows: 622361 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
September_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202009-divvy-tripdata.csv")
## Rows: 532958 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
October_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202010-divvy-tripdata.csv")
## Rows: 388653 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
November_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202011-divvy-tripdata.csv")
## Rows: 259716 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
December_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202012-divvy-tripdata.csv")
## Rows: 131573 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
January_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
February_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
March_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
April_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
May_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
June_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
July_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check data structure all datasets

This will help us identify the data types and column names in the datasets for consistency.

str(August_20)
## spc_tbl_ [622,361 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:622361] "322BD23D287743ED" "2A3AEF1AB9054D8B" "67DC1D133E8B5816" "C79FBBD412E578A7" ...
##  $ rideable_type     : chr [1:622361] "docked_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:622361], format: "2020-08-20 18:08:14" "2020-08-27 18:46:04" ...
##  $ ended_at          : POSIXct[1:622361], format: "2020-08-20 18:17:51" "2020-08-27 19:54:51" ...
##  $ start_station_name: chr [1:622361] "Lake Shore Dr & Diversey Pkwy" "Michigan Ave & 14th St" "Columbus Dr & Randolph St" "Daley Center Plaza" ...
##  $ start_station_id  : num [1:622361] 329 168 195 81 658 658 196 67 153 177 ...
##  $ end_station_name  : chr [1:622361] "Clark St & Lincoln Ave" "Michigan Ave & 14th St" "State St & Randolph St" "State St & Kinzie St" ...
##  $ end_station_id    : num [1:622361] 141 168 44 47 658 658 49 229 225 305 ...
##  $ start_lat         : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:622361] "member" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(September_20)
## spc_tbl_ [532,958 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:532958] "2B22BD5F95FB2629" "A7FB70B4AFC6CAF2" "86057FA01BAC778E" "57F6DC9A153DB98C" ...
##  $ rideable_type     : chr [1:532958] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:532958], format: "2020-09-17 14:27:11" "2020-09-17 15:07:31" ...
##  $ ended_at          : POSIXct[1:532958], format: "2020-09-17 14:44:24" "2020-09-17 15:07:45" ...
##  $ start_station_name: chr [1:532958] "Michigan Ave & Lake St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Ashland Ave & Belle Plaine Ave" ...
##  $ start_station_id  : num [1:532958] 52 NA NA 246 24 94 291 NA NA NA ...
##  $ end_station_name  : chr [1:532958] "Green St & Randolph St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Montrose Harbor" ...
##  $ end_station_id    : num [1:532958] 112 NA NA 249 24 NA 256 NA NA NA ...
##  $ start_lat         : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
##  $ start_lng         : num [1:532958] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:532958] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:532958] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(October_20)
## spc_tbl_ [388,653 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:388653] "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
##  $ rideable_type     : chr [1:388653] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:388653], format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
##  $ ended_at          : POSIXct[1:388653], format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
##  $ start_station_name: chr [1:388653] "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
##  $ start_station_id  : num [1:388653] 313 227 102 165 190 359 313 125 NA 174 ...
##  $ end_station_name  : chr [1:388653] "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : num [1:388653] 125 260 423 256 185 53 125 313 199 635 ...
##  $ start_lat         : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
##  $ start_lng         : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
##  $ end_lng           : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:388653] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(November_20)
## spc_tbl_ [259,716 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
##  $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
##  $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(December_20)
## spc_tbl_ [131,573 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
##  $ rideable_type     : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
##  $ ended_at          : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
##  $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
##  $ start_station_id  : chr [1:131573] "13157" NA NA NA ...
##  $ end_station_name  : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
##  $ end_station_id    : chr [1:131573] "TA1306000003" NA NA NA ...
##  $ start_lat         : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:131573] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(January_21)
## spc_tbl_ [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(February_21)
## spc_tbl_ [49,622 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
##  $ ended_at          : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
##  $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49622] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:49622] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(March_21)
## spc_tbl_ [228,496 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
##  $ ended_at          : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
##  $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:228496] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:228496] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(April_21)
## spc_tbl_ [337,230 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
##  $ rideable_type     : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
##  $ ended_at          : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
##  $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
##  $ start_station_id  : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
##  $ end_station_name  : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
##  $ end_station_id    : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
##  $ start_lat         : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ start_lng         : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ end_lng           : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:337230] "member" "casual" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(May_21)
## spc_tbl_ [531,633 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
##  $ rideable_type     : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
##  $ ended_at          : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
##  $ start_station_name: chr [1:531633] NA NA NA NA ...
##  $ start_station_id  : chr [1:531633] NA NA NA NA ...
##  $ end_station_name  : chr [1:531633] NA NA NA NA ...
##  $ end_station_id    : chr [1:531633] NA NA NA NA ...
##  $ start_lat         : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:531633] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(June_21)
## spc_tbl_ [729,595 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
##  $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
##  $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
##  $ start_station_name: chr [1:729595] NA NA NA NA ...
##  $ start_station_id  : chr [1:729595] NA NA NA NA ...
##  $ end_station_name  : chr [1:729595] NA NA NA NA ...
##  $ end_station_id    : chr [1:729595] NA NA NA NA ...
##  $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(July_21)
## spc_tbl_ [822,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
##  $ rideable_type     : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
##  $ ended_at          : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
##  $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
##  $ end_station_name  : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
##  $ end_station_id    : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
##  $ start_lat         : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:822410] "casual" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Changing the data types

The structure outputs showed that the datasets from August_20 to November_20 datasets listed the start_station_id and end_station_id columns as ‘int’ types instead of ‘chr’ types. We will change them to ‘chr’ so that all columns have the same data types.

# August 2020
August_20 <- mutate(
  August_20, 
  start_station_id = as.character(start_station_id),
  end_station_id = as.character(end_station_id)
  )

# September 2020
September_20 <- mutate(
  September_20, 
  start_station_id = as.character(start_station_id),
  end_station_id = as.character(end_station_id)
  )

# October 2020
October_20 <- mutate(
  October_20, 
  start_station_id = as.character(start_station_id),
  end_station_id = as.character(end_station_id)
  )

# Nombember 2020
November_20 <- mutate(
  November_20, 
  start_station_id = as.character(start_station_id),
  end_station_id = as.character(end_station_id)
  )

Column names consistency

We will use the clean_names function just to make sure that all datasets have only characters, number and underscores in the column names.

# applying the clean_names function

clean_names(August_20)
## # A tibble: 622,361 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 322BD23D2877… docked… 2020-08-20 18:08:14 2020-08-20 18:17:51 Lake S… 329    
##  2 2A3AEF1AB905… electr… 2020-08-27 18:46:04 2020-08-27 19:54:51 Michig… 168    
##  3 67DC1D133E8B… electr… 2020-08-26 19:44:14 2020-08-26 21:53:07 Columb… 195    
##  4 C79FBBD412E5… electr… 2020-08-27 12:05:41 2020-08-27 12:53:45 Daley … 81     
##  5 13814D3D661E… electr… 2020-08-27 16:49:02 2020-08-27 16:59:49 Leavit… 658    
##  6 56349A5A42F0… electr… 2020-08-27 17:26:23 2020-08-27 18:07:50 Leavit… 658    
##  7 EB6ABC5570C2… electr… 2020-08-26 20:14:02 2020-08-26 20:34:00 Cityfr… 196    
##  8 B4ECE389A1DE… electr… 2020-08-26 21:59:50 2020-08-26 22:12:35 Sheffi… 67     
##  9 0B355B0FE076… electr… 2020-08-26 19:17:42 2020-08-26 19:32:14 Southp… 153    
## 10 1ECE04F779E9… electr… 2020-08-27 15:13:57 2020-08-27 15:41:59 Theate… 177    
## # … with 622,351 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(September_20)
## # A tibble: 532,958 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 2B22BD5F95FB… electr… 2020-09-17 14:27:11 2020-09-17 14:44:24 Michig… 52     
##  2 A7FB70B4AFC6… electr… 2020-09-17 15:07:31 2020-09-17 15:07:45 W Oakd… <NA>   
##  3 86057FA01BAC… electr… 2020-09-17 15:09:04 2020-09-17 15:09:35 W Oakd… <NA>   
##  4 57F6DC9A153D… electr… 2020-09-17 18:10:46 2020-09-17 18:35:49 Ashlan… 246    
##  5 B9C4712F78C1… electr… 2020-09-17 15:16:13 2020-09-17 15:52:55 Fairba… 24     
##  6 378BBCE1E444… electr… 2020-09-17 18:37:04 2020-09-17 19:23:28 Clark … 94     
##  7 0CB5E2496B6F… electr… 2020-09-16 21:39:57 2020-09-16 21:53:44 Wells … 291    
##  8 9673F5D39BDB… electr… 2020-09-17 12:18:06 2020-09-17 12:18:19 <NA>    <NA>   
##  9 54B91F5C95B2… electr… 2020-09-17 17:09:17 2020-09-17 17:34:20 <NA>    <NA>   
## 10 91CEBB66076D… electr… 2020-09-17 12:20:25 2020-09-17 12:29:47 <NA>    <NA>   
## # … with 532,948 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(October_20)
## # A tibble: 388,653 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 ACB6B40CF5B9… electr… 2020-10-31 19:39:43 2020-10-31 19:57:12 Lakevi… 313    
##  2 DF450C72FD10… electr… 2020-10-31 23:50:08 2020-11-01 00:04:16 Southp… 227    
##  3 B6396B54A15A… electr… 2020-10-31 23:00:01 2020-10-31 23:08:22 Stony … 102    
##  4 44A4AEE261B9… electr… 2020-10-31 22:16:43 2020-10-31 22:19:35 Clark … 165    
##  5 10B7DD76A6A2… electr… 2020-10-31 19:38:19 2020-10-31 19:54:32 Southp… 190    
##  6 DA6C37596601… electr… 2020-10-29 17:38:04 2020-10-29 17:45:43 Larrab… 359    
##  7 C2F3808FD56B… electr… 2020-10-29 09:03:06 2020-10-29 09:17:56 Lakevi… 313    
##  8 15B13B5A508B… electr… 2020-10-29 16:37:21 2020-10-29 16:52:40 Rush S… 125    
##  9 285D224410C1… electr… 2020-10-28 23:12:03 2020-10-28 23:24:32 <NA>    <NA>   
## 10 E1FB79FFE6DB… electr… 2020-10-29 16:38:44 2020-10-29 16:50:17 Canal … 174    
## # … with 388,643 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(November_20)
## # A tibble: 259,716 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 BD0A6FF6FFF9… electr… 2020-11-01 13:36:00 2020-11-01 13:45:40 Dearbo… 110    
##  2 96A7A7A4BDE4… electr… 2020-11-01 10:03:26 2020-11-01 10:14:45 Frankl… 672    
##  3 C61526D06582… electr… 2020-11-01 00:34:05 2020-11-01 01:03:06 Lake S… 76     
##  4 E533E89C3208… electr… 2020-11-01 00:45:16 2020-11-01 00:54:31 Leavit… 659    
##  5 1C9F4EF18C16… electr… 2020-11-01 15:43:25 2020-11-01 16:16:52 Buckin… 2      
##  6 7259585D8276… electr… 2020-11-14 15:55:17 2020-11-14 16:44:38 Wabash… 72     
##  7 91FE5C8F8A67… electr… 2020-11-14 16:47:29 2020-11-14 17:03:03 Lake S… 76     
##  8 9E7A79ADA90C… electr… 2020-11-14 16:04:15 2020-11-14 16:19:33 <NA>    <NA>   
##  9 A5B02C0D41DB… electr… 2020-11-14 16:24:09 2020-11-14 16:51:34 Marshf… 58     
## 10 8234407C29FE… electr… 2020-11-14 01:24:22 2020-11-14 01:31:42 Clark … 394    
## # … with 259,706 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(December_20)
## # A tibble: 131,573 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 70B6A9A437D4… classi… 2020-12-27 12:44:29 2020-12-27 12:55:06 Aberde… 13157  
##  2 158A465D4E74… electr… 2020-12-18 17:37:15 2020-12-18 17:44:19 <NA>    <NA>   
##  3 5262016E0F1F… electr… 2020-12-15 15:04:33 2020-12-15 15:11:28 <NA>    <NA>   
##  4 BE119628E44F… electr… 2020-12-15 15:54:18 2020-12-15 16:00:11 <NA>    <NA>   
##  5 69AF78D57854… electr… 2020-12-22 12:08:17 2020-12-22 12:10:59 <NA>    <NA>   
##  6 C1DECC4AB488… electr… 2020-12-22 13:26:37 2020-12-22 13:34:50 <NA>    <NA>   
##  7 B014A60B856C… electr… 2020-12-03 16:23:48 2020-12-03 16:33:39 <NA>    <NA>   
##  8 1E127B1929C0… electr… 2020-12-03 15:03:38 2020-12-03 15:12:39 <NA>    <NA>   
##  9 05F41F5137B5… electr… 2020-12-12 09:26:17 2020-12-12 09:26:35 <NA>    <NA>   
## 10 BB807646588D… electr… 2020-12-18 12:52:06 2020-12-18 12:52:23 <NA>    <NA>   
## # … with 131,563 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(January_21)
## # A tibble: 96,834 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 E19E6F1B8D4C… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660  
##  2 DC88F20C2C55… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660  
##  3 EC45C94683FE… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660  
##  4 4FA453A75AE3… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660  
##  5 BE5E8EB4E726… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660  
##  6 5D8969F88C77… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660  
##  7 09275CC10F85… electr… 2021-01-04 05:05:04 2021-01-04 05:10:39 Califo… 17660  
##  8 DF7A32A217AE… electr… 2021-01-14 15:07:00 2021-01-14 15:13:40 Califo… 17660  
##  9 C2EFC62379EB… electr… 2021-01-09 09:57:55 2021-01-09 10:00:26 Califo… 17660  
## 10 B9F73448DFBE… classi… 2021-01-24 19:15:38 2021-01-24 19:22:51 Califo… 17660  
## # … with 96,824 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(February_21)
## # A tibble: 49,622 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 89E7AA6C2922… classi… 2021-02-12 16:14:56 2021-02-12 16:21:43 Glenwo… 525    
##  2 0FEFDE260356… classi… 2021-02-14 17:52:38 2021-02-14 18:12:09 Glenwo… 525    
##  3 E6159D746B2D… electr… 2021-02-09 19:10:18 2021-02-09 19:19:10 Clark … KA1503…
##  4 B32D3199F1C2… classi… 2021-02-02 17:49:41 2021-02-02 17:54:06 Wood S… 637    
##  5 83E463F23575… electr… 2021-02-23 15:07:23 2021-02-23 15:22:37 State … 13216  
##  6 BDAA7E3494E8… electr… 2021-02-24 15:43:33 2021-02-24 15:49:05 Fairba… 18003  
##  7 A77274235117… classi… 2021-02-01 17:47:42 2021-02-01 17:48:33 LaSall… KP1705…
##  8 295476889D9B… classi… 2021-02-11 18:33:53 2021-02-11 18:35:09 Fairba… 18003  
##  9 362087194BA4… classi… 2021-02-27 15:13:39 2021-02-27 15:36:36 LaSall… KP1705…
## 10 21630F715038… classi… 2021-02-20 08:59:42 2021-02-20 09:17:04 LaSall… KP1705…
## # … with 49,612 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(March_21)
## # A tibble: 228,496 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 CFA86D4455AA… classi… 2021-03-16 08:32:30 2021-03-16 08:36:34 Humbol… 15651  
##  2 30D9DC61227D… classi… 2021-03-28 01:26:28 2021-03-28 01:36:55 Humbol… 15651  
##  3 846D87A15682… classi… 2021-03-11 21:17:29 2021-03-11 21:33:53 Shield… 15443  
##  4 994D05AA75A1… classi… 2021-03-11 13:26:42 2021-03-11 13:55:41 Winthr… TA1308…
##  5 DF7464FBE92D… classi… 2021-03-21 09:09:37 2021-03-21 09:27:33 Glenwo… 525    
##  6 CEBA8516FD17… classi… 2021-03-20 11:08:47 2021-03-20 11:29:39 Glenwo… 525    
##  7 297268586B79… classi… 2021-03-20 14:10:41 2021-03-20 14:22:13 State … 13050  
##  8 F39301858B60… electr… 2021-03-23 07:56:51 2021-03-23 08:05:50 Shore … TA1308…
##  9 D297F199D875… electr… 2021-03-31 15:31:19 2021-03-31 15:35:58 Clinto… 13021  
## 10 36B877141175… classi… 2021-03-11 17:37:37 2021-03-11 17:52:44 Michig… TA1305…
## # … with 228,486 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(April_21)
## # A tibble: 337,230 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 6C992BD37A98… classi… 2021-04-12 18:25:36 2021-04-12 18:56:55 State … TA1307…
##  2 1E0145613A20… docked… 2021-04-27 17:27:11 2021-04-27 18:31:29 Dorche… KA1503…
##  3 E498E15508A8… docked… 2021-04-03 12:42:45 2021-04-07 11:40:24 Loomis… 20121  
##  4 1887262AD101… classi… 2021-04-17 09:17:42 2021-04-17 09:42:48 Honore… TA1305…
##  5 C123548CAB2A… docked… 2021-04-03 12:42:25 2021-04-03 14:13:42 Loomis… 20121  
##  6 097E76F3651B… classi… 2021-04-25 18:43:18 2021-04-25 18:43:59 Clinto… 15542  
##  7 53C38EB01E6F… classi… 2021-04-03 16:28:21 2021-04-03 16:29:47 Ashlan… 16948  
##  8 D53AC014EFD6… electr… 2021-04-06 16:35:06 2021-04-06 17:00:56 Dorche… KA1503…
##  9 6E2F7CA1FA9E… classi… 2021-04-12 15:22:54 2021-04-12 16:15:48 Ashlan… 16948  
## 10 04218447AAC8… classi… 2021-04-24 15:04:55 2021-04-24 15:06:16 Dorche… KA1503…
## # … with 337,220 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(May_21)
## # A tibble: 531,633 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 C809ED75D616… electr… 2021-05-30 11:58:15 2021-05-30 12:10:39 <NA>    <NA>   
##  2 DD59FDCE0ACA… electr… 2021-05-30 11:29:14 2021-05-30 12:14:09 <NA>    <NA>   
##  3 0AB83CB88C43… electr… 2021-05-30 14:24:01 2021-05-30 14:25:13 <NA>    <NA>   
##  4 7881AC6D3911… electr… 2021-05-30 14:25:51 2021-05-30 14:41:04 <NA>    <NA>   
##  5 853FA701B458… electr… 2021-05-30 18:15:39 2021-05-30 18:22:32 <NA>    <NA>   
##  6 F5E63DFD96B2… electr… 2021-05-30 11:33:41 2021-05-30 11:57:17 <NA>    <NA>   
##  7 C884951E3665… electr… 2021-05-30 10:51:37 2021-05-30 11:06:20 <NA>    <NA>   
##  8 48B60B250FE7… electr… 2021-05-05 13:57:03 2021-05-05 14:14:58 <NA>    <NA>   
##  9 E3D0CC2FE135… electr… 2021-05-05 11:31:26 2021-05-05 11:34:03 <NA>    <NA>   
## 10 4382735758AB… electr… 2021-05-04 19:51:05 2021-05-04 20:17:26 <NA>    <NA>   
## # … with 531,623 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(June_21)
## # A tibble: 729,595 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 99FEC93BA843… electr… 2021-06-13 14:31:28 2021-06-13 14:34:11 <NA>    <NA>   
##  2 06048DCFC852… electr… 2021-06-04 11:18:02 2021-06-04 11:24:19 <NA>    <NA>   
##  3 9598066F6804… electr… 2021-06-04 09:49:35 2021-06-04 09:55:34 <NA>    <NA>   
##  4 B03C0FE48C41… electr… 2021-06-03 19:56:05 2021-06-03 20:21:55 <NA>    <NA>   
##  5 B9EEA89F8FEE… electr… 2021-06-04 14:05:51 2021-06-04 14:09:59 <NA>    <NA>   
##  6 62B943CEAAA4… electr… 2021-06-03 19:32:01 2021-06-03 19:38:46 <NA>    <NA>   
##  7 7E2546FBA79C… electr… 2021-06-10 16:30:10 2021-06-10 16:36:21 <NA>    <NA>   
##  8 3DDF3BBF6C4C… electr… 2021-06-10 17:00:30 2021-06-10 17:06:48 <NA>    <NA>   
##  9 260880563715… electr… 2021-06-10 12:46:16 2021-06-10 12:55:02 <NA>    <NA>   
## 10 AF529C946F28… electr… 2021-06-23 17:57:29 2021-06-23 18:06:40 <NA>    <NA>   
## # … with 729,585 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(July_21)
## # A tibble: 822,410 × 13
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 0A1B623926EF… docked… 2021-07-02 14:44:36 2021-07-02 15:19:58 Michig… 13001  
##  2 B2D5583A5A5E… classi… 2021-07-07 16:57:42 2021-07-07 17:16:09 Califo… 17660  
##  3 6F264597DDBF… classi… 2021-07-25 11:30:55 2021-07-25 11:48:45 Wabash… SL-012 
##  4 379B58EAB20E… classi… 2021-07-08 22:08:30 2021-07-08 22:23:32 Califo… 17660  
##  5 6615C1E4EB08… electr… 2021-07-28 16:08:06 2021-07-28 16:27:09 Califo… 17660  
##  6 62DC2B32872F… electr… 2021-07-29 17:09:08 2021-07-29 17:15:00 Califo… 17660  
##  7 4BBB6E80E6A2… classi… 2021-07-28 16:51:47 2021-07-28 17:03:45 Califo… 17660  
##  8 22CA03D32C6B… classi… 2021-07-03 12:44:50 2021-07-03 12:52:55 Clark … 13128  
##  9 61F0D07D1EEE… classi… 2021-07-02 18:18:22 2021-07-02 18:38:21 Sherid… TA1307…
## 10 09B4551386A8… classi… 2021-07-29 21:54:05 2021-07-29 22:07:26 Sherid… TA1307…
## # … with 822,400 more rows, 7 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id

Merging the datasets

Once we have the datasets clean and ready, it is time to unite all the information in a single dataset to facilitate processing. For this we will use the bind_rows function which adds NA in case of missing data.

# Merge all datasets 
all_datasets <- bind_rows(
  August_20, September_20, October_20, November_20, December_20, 
  January_21, February_21, March_21, April_21, May_21, June_21, July_21
  )

head(all_datasets)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 322BD23D28774… docked… 2020-08-20 18:08:14 2020-08-20 18:17:51 Lake S… 329    
## 2 2A3AEF1AB9054… electr… 2020-08-27 18:46:04 2020-08-27 19:54:51 Michig… 168    
## 3 67DC1D133E8B5… electr… 2020-08-26 19:44:14 2020-08-26 21:53:07 Columb… 195    
## 4 C79FBBD412E57… electr… 2020-08-27 12:05:41 2020-08-27 12:53:45 Daley … 81     
## 5 13814D3D661EC… electr… 2020-08-27 16:49:02 2020-08-27 16:59:49 Leavit… 658    
## 6 56349A5A42F0A… electr… 2020-08-27 17:26:23 2020-08-27 18:07:50 Leavit… 658    
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id

Checking the final dataset

Now we are going to do some checks on the final dataset for further processing.

Removing Missing Values

We want to remove the rows with missing values (NA), and sorting in descending order.

# Removing rows with missing values
all_datasets <- na.omit(all_datasets)

# Order by date in descending order (newest first)
all_datasets <- all_datasets %>%
  arrange(desc(started_at))

Rides Length

We will add an additional column named rides_length which contains the total length of a single ride in minutes. This column will be useful for a later process.

all_datasets$ride_length <- difftime(
  all_datasets$ended_at, 
  all_datasets$started_at,
  units = "mins"
  ) 

# convert ride_length column type to numeric
all_datasets$ride_length <- as.numeric(all_datasets$ride_length)
                          
# select rows only with valid ridelength
all_datasets <- all_datasets[all_datasets$ride_length > 5 & all_datasets$ride_length < 60 , ]
             
head(all_datasets)
## # A tibble: 6 × 14
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 1AFC6C416B7A8… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636    
## 2 713A7722A6E58… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
## 3 C6F886992E60E… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059  
## 4 1ED34A91A245A… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
## 5 EA999FB37E6E6… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
## 6 33E5E40576D4E… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063  
## # … with 8 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, ride_length <dbl>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id

Separating the dates

To make a more detailed analysis we are going to identify the days, months, years and time in different columns.

# Year 
all_datasets$year <- format(
    all_datasets$started_at, 
    "%Y"
    )



# Month 
all_datasets$month <- format(
    all_datasets$started_at, 
    "%m"
    )


# Day
all_datasets$day <- format(
  all_datasets$started_at, 
  "%d"
  )

# Day of week 
all_datasets$day_of_week <- format(
  all_datasets$started_at, 
  "%A"
  )

# Date
all_datasets$y_m_d <- format(
  all_datasets$started_at, 
  "%Y-%m-%d"
  )

# Time
all_datasets$time <- format(
  all_datasets$started_at, 
  "%H:%M:%S"
  )

head(all_datasets)
## # A tibble: 6 × 20
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 1AFC6C416B7A8… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636    
## 2 713A7722A6E58… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
## 3 C6F886992E60E… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059  
## 4 1ED34A91A245A… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
## 5 EA999FB37E6E6… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
## 6 33E5E40576D4E… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063  
## # … with 14 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, ride_length <dbl>, year <chr>, month <chr>, day <chr>,
## #   day_of_week <chr>, y_m_d <chr>, time <chr>, and abbreviated variable names
## #   ¹​rideable_type, ²​start_station_name, ³​start_station_id

Cleaning the dataset

# Formatting ride lenght column to 2 decimal places.
all_datasets$ride_length<-format(round(all_datasets$ride_length,2),nsmall=2)

#convert ride_lenght to numeric values
all_datasets$ride_length <- as.numeric(all_datasets$ride_length)

Checking for duplicates

We must make sure that there are no duplicate rows, the ride_id column must contain different numbers in each row. This ensures that each row is a different observation.

# We will use distinct() function to keep only unique/distinct rows from the data frame using the ride_id column
all_datasets %>% distinct(ride_id, .keep_all = TRUE)
## # A tibble: 3,394,790 × 20
##    ride_id       ridea…¹ started_at          ended_at            start…² start…³
##    <chr>         <chr>   <dttm>              <dttm>              <chr>   <chr>  
##  1 1AFC6C416B7A… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636    
##  2 713A7722A6E5… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
##  3 C6F886992E60… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059  
##  4 1ED34A91A245… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
##  5 EA999FB37E6E… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
##  6 33E5E40576D4… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063  
##  7 D3A2460169E1… classi… 2021-07-31 23:59:32 2021-08-01 00:05:57 Ravens… TA1307…
##  8 AAB02049AEF4… classi… 2021-07-31 23:59:30 2021-08-01 00:35:58 Centra… 15687  
##  9 C6DE6C1BCAEB… docked… 2021-07-31 23:59:27 2021-08-01 00:09:00 Green … TA1307…
## 10 7791FAFFF849… electr… 2021-07-31 23:59:25 2021-08-01 00:19:10 Sheffi… TA1309…
## # … with 3,394,780 more rows, 14 more variables: end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>, ride_length <dbl>, year <chr>,
## #   month <chr>, day <chr>, day_of_week <chr>, y_m_d <chr>, time <chr>, and
## #   abbreviated variable names ¹​rideable_type, ²​start_station_name,
## #   ³​start_station_id

The output of this code results in the same rows count. This means there are no duplicate rows.

Saving the Dataset

Once the data set is clean and free of errors, it is ready for analysis and it is good practice to keep it in a safe place.

# Saving the Dataset in our system
fwrite(
  all_datasets, 
  "/Users/osbaldoealbornoz/Documents/Cyclistc_Project/dataset_cleaned/all_datasets_cleaned.csv", 
  col.names = TRUE,
  row.names = FALSE
  )

ANALISYS

Trips comparison by casual and annual members

We will use the aggregated data to get information, to get answers in order to deliver insights at our stakeholders. 

Let’s see the number of rides, median, average and max ride length in minutes…

# number of rides, median, average and max ride length
all_datasets %>% 
  group_by(member_casual ) %>%
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 2 × 5
##   member_casual number_trips median_ride_length avg_ride_length max_ride_length
##   <chr>                <int>              <dbl>           <dbl>           <dbl>
## 1 casual             1490512               17.5            21.1            60.0
## 2 member             1904278               12.5            15.6            60.0

Here the two types of riders , the annual members ride more trips, however the casual riders ride more time in each trip they take.

Riders preferences

Let’s see what type of bikes do they prefer and how they use them …

Bike types

For casual riders

all_datasets %>% 
  filter(member_casual=="casual" ) %>% 
  group_by(rideable_type ) %>%
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 3 × 5
##   rideable_type number_trips median_ride_length avg_ride_length max_ride_length
##   <chr>                <int>              <dbl>           <dbl>           <dbl>
## 1 classic_bike        582142               16.6            20.2            60.0
## 2 docked_bike         593017               20.4            23.4            60.0
## 3 electric_bike       315353               14.6            18.5            60.0

We can see that the casual rides prefer to use the docked bikes which make sense because they use it with a friend or a relative, if not they use the classic bikes. The electric bikes are not in high demand for this type of riders.

For annual members

all_datasets %>% 
  filter(member_casual=="member" ) %>% 
  group_by(rideable_type ) %>%
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 3 × 5
##   rideable_type number_trips median_ride_length avg_ride_length max_ride_length
##   <chr>                <int>              <dbl>           <dbl>           <dbl>
## 1 classic_bike        886489               12.5            15.5            60.0
## 2 docked_bike         656382               13.3            16.4            60.0
## 3 electric_bike       361407               11.4            14.4            60.0

We can see that the members prefer to use much more the classic bikes which make sense because they probably use it to commute to work. The electric bikes are not in high demand for this type of riders.

Rides by Weekday

For casual riders

all_datasets %>% 
  filter(member_casual=="casual") %>% 
  group_by(day_of_week) %>% 
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 7 × 5
##   day_of_week number_trips median_ride_length avg_ride_length max_ride_length
##   <chr>              <int>              <dbl>           <dbl>           <dbl>
## 1 Friday            215814               16.6            20.2            60.0
## 2 Monday            160503               17.4            21.2            60.0
## 3 Saturday          352289               18.9            22.3            60.0
## 4 Sunday            277371               19.5            22.8            60.0
## 5 Thursday          164315               15.6            19.3            60.0
## 6 Tuesday           156435               16.2            20.0            60.0
## 7 Wednesday         163785               15.9            19.6            60.0

Here we can clearly see that casual riders use the service mainly on Saturdays and Sundays that order, with an average about 41 minutes by trip.

For members

all_datasets %>% 
  filter(member_casual=="member") %>% 
  group_by(day_of_week) %>% 
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 7 × 5
##   day_of_week number_trips median_ride_length avg_ride_length max_ride_length
##   <chr>              <int>              <dbl>           <dbl>           <dbl>
## 1 Friday            279162               12.2            15.2            60.0
## 2 Monday            250266               12.1            15.2            60.0
## 3 Saturday          293655               13.6            16.8            60.0
## 4 Sunday            244022               13.9            17.1            60.0
## 5 Thursday          273067               11.9            14.9            60.0
## 6 Tuesday           274376               12.0            15.1            60.0
## 7 Wednesday         289730               12.1            15.0            60.0

Here we can see a very different behavior from that of casual riders because the demand remains stable throughout the week where it only drops a little on Sunday. However, this day is when the duration of the trips is longer.


Rides by Month

For casual riders

all_datasets %>% 
  filter(member_casual=="casual")  %>% 
    group_by(month) %>%
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 12 × 5
##    month number_trips median_ride_length avg_ride_length max_ride_length
##    <chr>        <int>              <dbl>           <dbl>           <dbl>
##  1 01           12330               13.4            17.1            60.0
##  2 02            7114               16.2            20.0            60.0
##  3 03           61272               18.0            21.7            60.0
##  4 04           97491               17.4            21.3            60.0
##  5 05          175205               18.2            21.8            60.0
##  6 06          250942               17.2            20.7            60.0
##  7 07          307747               16.6            20.2            60.0
##  8 08          223235               19.5            22.8            60.0
##  9 09          174379               18              21.5            60.0
## 10 10          101142               16.2            20.1            60.0
## 11 11           59224               16.5            20.5            60.0
## 12 12           20431               14.4            18.3            60.0

Regarding the months, casual riders use the service more in the summer season, that is, in the months of June, July and August, which makes a lot of sense since they use it for recreational purposes. The duration of the trips also increase in these months.

For members

all_datasets %>% 
  filter(member_casual=="member")  %>% 
    group_by(month) %>%
  summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 12 × 5
##    month number_trips median_ride_length avg_ride_length max_ride_length
##    <chr>        <int>              <dbl>           <dbl>           <dbl>
##  1 01           53283               10.9            13.9            60.0
##  2 02           28319               11.8            15.1            59.9
##  3 03          104407               12.1            15.2            60.0
##  4 04          144225               12.4            15.5            60.0
##  5 05          190905               12.6            15.7            60.0
##  6 06          250498               12.5            15.5            60.0
##  7 07          263183               12.2            15.2            60.0
##  8 08          274471               14.1            17.1            60.0
##  9 09          234567               13.0            16.2            60.0
## 10 10          172910               12.0            15.1            60.0
## 11 11          118145               11.9            15.2            60.0
## 12 12           69365               11.2            14.2            59.9

The behavior of the members throughout the months is more balanced with a not so great decrease in the winter months. The duration of the trips is much shorter compared to casual riders.

Analyzing Routes

It is important to know and analyze in more detail the routes used by users to establish relevancy zones and identify trends and behavior patterns.

We will create a new dataset that contains only the columns that refer to the routes.

Create a routes dataset

routes_dataset <- all_datasets %>% 
  select(member_casual, start_station_name, end_station_name, start_lat, end_lat, start_lng, end_lng, ride_length) %>% 
  unite(route,c("start_station_name","end_station_name"), sep=" To ") %>% 
  group_by(route)


Identify routes by rider type

Casual rider routes

routes_dataset  %>% 
  filter(member_casual=="casual") %>%  
select(route) %>% 
  summarise(number_of_rides=(ride_length=n())) %>% 
group_by(route) %>%
arrange(desc(number_of_rides))
## # A tibble: 111,759 × 2
## # Groups:   route [111,759]
##    route                                                    number_of_rides
##    <chr>                                                              <int>
##  1 Streeter Dr & Grand Ave To Streeter Dr & Grand Ave                  5701
##  2 Lake Shore Dr & Monroe St To Lake Shore Dr & Monroe St              5201
##  3 Millennium Park To Millennium Park                                  3590
##  4 Lake Shore Dr & Monroe St To Streeter Dr & Grand Ave                3350
##  5 Michigan Ave & Oak St To Michigan Ave & Oak St                      3048
##  6 Buckingham Fountain To Buckingham Fountain                          2798
##  7 Streeter Dr & Grand Ave To Millennium Park                          2168
##  8 Theater on the Lake To Theater on the Lake                          2034
##  9 Indiana Ave & Roosevelt Rd To Indiana Ave & Roosevelt Rd            2031
## 10 Millennium Park To Streeter Dr & Grand Ave                          1794
## # … with 111,749 more rows

The result is arranged in such a way that it shows the most used routes, that is, we can clearly appreciate that casual riders use more the routes of parks, museums, lakes, among others, which indicates a pattern of mobility towards recreational sites.

Member routes

routes_dataset  %>% 
  filter(member_casual=="member") %>%  
select(route) %>% 
  summarise(number_of_rides=(ride_length=n())) %>% 
group_by(route) %>%
arrange(desc(number_of_rides))
## # A tibble: 119,516 × 2
## # Groups:   route [119,516]
##    route                                                              number_o…¹
##    <chr>                                                                   <int>
##  1 MLK Jr Dr & 29th St To State St & 33rd St                                1342
##  2 State St & 33rd St To MLK Jr Dr & 29th St                                1202
##  3 Loomis St & Lexington St To Halsted St & Polk St                          926
##  4 Lakefront Trail & Bryn Mawr Ave To Theater on the Lake                    868
##  5 Halsted St & Polk St To Loomis St & Lexington St                          804
##  6 Theater on the Lake To Lakefront Trail & Bryn Mawr Ave                    769
##  7 Cornell Ave & Hyde Park Blvd To University Ave & 57th St                  750
##  8 University Ave & 57th St To Kimbark Ave & 53rd St                         739
##  9 Lakefront Trail & Bryn Mawr Ave To Lakefront Trail & Bryn Mawr Ave        730
## 10 Broadway & Waveland Ave To Broadway & Barry Ave                           687
## # … with 119,506 more rows, and abbreviated variable name ¹​number_of_rides

The routes of the members are very different since they use the service mostly in commercial and business areas, which indicates that they are commuters riders at their jobs and others use bikes to get to their businesses or study places.

SHARE

After cleaning, sorting, filtering, and analyzing the data, it’s time to give it some life with some graphs that will help us visualize, understand, and identify trends, patterns, and correlations between rider types.

Casual riders vs annual members (by ride type)

ggplot(all_datasets)+geom_bar(mapping = aes(x=member_casual,fill=rideable_type))+ 
  scale_fill_manual(values = c("#2196F3", "orange", "#66BB6A")) +
  labs(title= "Annual trips",subtitle = "Total riders")

Day of the Week behaviors

For casual riders

ggplot(all_datasets)+geom_bar(mapping = aes(x=day_of_week,fill=rideable_type),filter(all_datasets,all_datasets$member_casual=="casual"))+  theme(axis.text.x = element_text(angle = 45))+ 
  scale_fill_manual(values = c("#2196F3", "orange", "#66BB6A")) +
  labs(title= "Behavior by day of the week",subtitle = "trips by casual riders")

For members

ggplot(all_datasets)+geom_bar(mapping = aes(x=day_of_week,fill=rideable_type),filter(all_datasets,all_datasets$member_casual=="member"))+  theme(axis.text.x = element_text(angle = 45))+ 
  scale_fill_manual(values = c("#2196F3", "orange", "#66BB6A")) +
  labs(title= "Behavior by day of the week",subtitle = "trips by annual members")

Months Behavior

For casual members

ggplot(all_datasets)+geom_bar(mapping = aes(x=month,fill=rideable_type),filter(all_datasets,all_datasets$member_casual=="casual"))+  theme(axis.text.x = element_text(angle = 45))+ 
  scale_fill_manual(values = c("#2196F3", "orange", "#66BB6A")) +
  labs(title= "Behavior by month of year",subtitle = "trips by casual riders")

For annual members

ggplot(all_datasets)+geom_bar(mapping = aes(x=month,fill=rideable_type),filter(all_datasets,all_datasets$member_casual=="member"))+  theme(axis.text.x = element_text(angle = 45))+ 
  scale_fill_manual(values = c("#2196F3", "orange", "#66BB6A")) +
  labs(title= "Behavior by month of year",subtitle = "trips by members")

Casual riders vs Annual members (by the of the week)

ggplot(all_datasets)+geom_bar(mapping = aes(x=member_casual,fill=member_casual))+ 
facet_wrap(~day_of_week)+ labs(title= "Annual trips",subtitle = "count of trips by weekday") +
  scale_fill_manual(values = c("orange", "#2196F3"))

Casual riders vs Annual members (by Month)

ggplot(all_datasets)+geom_bar(mapping = aes(x=member_casual,fill=member_casual))+ 
facet_wrap(~month)+ labs(title= "Annual trips",subtitle = "count of trips by month") +
  scale_fill_manual(values = c("#2196F3", "orange"))

Casual riders vs Annual member (by zones, imported from Tableau)

Casual riders and Annual member favorite destination.

For casual riders

all_datasets %>% filter(member_casual=="casual") %>% 
    group_by(end_station_name) %>% 
    summarise(n=n()) %>% 
    arrange(desc(n)) %>% 
    slice_max(n,n=6) %>% 
    ggplot(aes(x=end_station_name,y=n))+geom_col(fill="orange")+
    geom_text(aes(label = n),vjust=1.5,color="black")+
    labs(title = "Casual rider's favorite destination",x="Destination",y="count")+
    theme_light()+theme(axis.text.x = element_text(angle = 90))

For members

all_datasets %>% filter(member_casual=="member") %>% 
    group_by(end_station_name) %>% 
    summarise(n=n()) %>% 
    arrange(desc(n)) %>% 
    slice_max(n,n=6) %>% 
    ggplot(aes(x=end_station_name,y=n))+geom_col(fill="#2196F3")+
    geom_text(aes(label = n),vjust=1.5,color="black")+
    labs(title = "Annual members favorite destination",x="Destination",y="count")+
    theme_light()+theme(axis.text.x = element_text(angle = 90))

Casual riders and annual members ride length by day of the week

For Casual riders

ggplot(all_datasets[all_datasets$member_casual == "casual", ], aes(x=month, y=day_of_week, fill=ride_length)) +
  geom_tile() +
  scale_fill_gradient(low="#F9D2B4", high="#EE7811") +
  labs(x="Year", y="Day of Week", fill="Ride Lenght")

For members

ggplot(all_datasets[all_datasets$member_casual == "member", ], aes(x=month, y=day_of_week, fill=ride_length)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="#0D47A1") +
  labs(x="Year", y="Day of Week", fill="Ride Lenght")

FINAL OBSERVATIONS AND CONCLUSIONS

Type of bike

Classic bike is the most favorite bike among the users, irrespective of their categories. However in the summer months The most favorite bike is the docked one.

Days of the week

Casual riders are much more active on weekends than weekdays. Annual members have a very balanced and consistent activity throughout the week.

Months

The casual users had their greatest activity in the summer months, especially in July, however they had very low or almost no activity in the months of December, January and February. The annual members had a more balanced activity throughout the year in which they have their greatest activity in the months of June, July and August.

Routes, favorite destinations and zones.

The destinations and routes differentiate very well the behavior of the two types of users. While the destinations and the routes of casual riders are mostly focused on recreation and entertainment sites such as lakes, parks and theaters in whose areas are closer to the coast. The routes and destinations of annual users are more distributed both in the coastal area and in the center and whose destinations are commercial, residential or university areas.

Ride lengths

The duration of trips for casual users is clearly seen as they last longer in the summer months regardless of the day of the week, while annual users use the service for a stable duration throughout all the months in which they are used. observe a slightly longer duration at the beginning and end of the year.

ACT

RECOMENDATIONS

To encourage casual users to become subscription users, there are a few recommendations:

  1. Offer discounted rates: One way to entice casual users to become subscribers is to offer discounted rates for those who sign up for an annual subscription. Make the subscription price more attractive than paying for each ride individually.

  2. Emphasize convenience: Highlight the convenience of having a subscription, such as not having to worry about paying for each ride, and being able to access the bikes whenever they need them.

  3. Offer rewards or incentives: Consider offering rewards or incentives for users who make the switch from casual to subscription, such as free rides or discounts on future subscriptions. For example, make an agreement or alliance with museums, parks or recreational sites that are favorite destinations to obtain cheaper tickets or benefits for those who use the bike service on weekends by subscription.

  4. Market the benefits: Make sure to communicate the benefits of being a subscription user, such as access to more bikes and priority service during peak hours.

by Osbaldo Albornoz