This case study was completed by Osbaldo Albornoz in February 2023 as part of the Google Data Analytics Professional Certificate capstone unit. R has been used to complete this case study and then hosted online through Github.
This is a case study, where we perform thinking as real-world tasks of a junior data analyst. In this case we are working for a fictional company, Cyclistic a Bike-share company in Chicago, where we meet different characters and team members.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.Â
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.Â
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.Â
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.Â
Businesses Task
How to convert casual riders into annual members?
Cyclistic currently offers:
5,800 bicycles.
690 stations.
Bikes can be unlocked from one station and returned to another station in the system anytime.
As of August 2021, Cyclistic offers the following pricing plans (USD):
Single ride passes, $3.30/trip
Full day passes, $15/day
Annual memberships, $9/month
The Cyclistic team have recently concluded that annual memberships are more profitable than casual riders. Furthermore, the team have noted that while 30% of users use the bikes for their work commute, the majority of Cyclistic users ride for leisure. This report will assess how existing Cyclistic causal riders can be encouraged to convert to annual memberships.
How annual members and casual riders differ?
Why casual riders would buy a membership?
How digital media could affect their marketing tactics?
Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Specific assignment from my team:
Answer the first Question:
How do annual members and casual riders use Cyclistic bikes differently?
We will use R language, R Notebook to document and share insights.
We will use Cyclistic’s historical trip data to elaborate metrics, analyze it , and if is possible identify patterns and trends. In order to create a 12 months dataset, we need to download some months from year (August 2020 to July 2021), we will download from Motivate international Inc at our desktop, then process it in RStudio). We will use Cyclistic’s historical trip data to analyze and identify trends. (Note: The datasets have a different name because Cyclistic is a fictional company. The data has been made available by Motivate International Inc. under this license:. This is public data that you can use to explore how different customer types are using Cyclistic bikes. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. . Our date is ROCCC, because it’s verifiable as Reliable (data is unbiased), Original (data comes from original source), Comprehensive (contains necessary data to answer the questions), Current (data is generated continuously,on January 5th, will be ready to download total 2022), and Cited . As we said data was obtaining from Motivate International Inc under license. download from divvy-tripdata, and credible source of information.
Data Limitations: data-privacy issues prohibit
you from using riders’ personally identifiable information. This means
that we won’t be able to connect pass purchases to credit card numbers
to determine if casual riders live in the Cyclistic service area or if
they have purchased multiple single passes.
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install and Load necessary packages
install.packages("tidyverse")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("skimr")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("janitor")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("dplyr")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("lubridate")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("ggmap")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
install.packages("data.table")
##
## The downloaded binary packages are in
## /var/folders/rd/mzrbpdj13ljdlvdnnjr3hjym0000gn/T//Rtmp7OTp51/downloaded_packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(lubridate)
library(skimr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(ggmap)
## ℹ Google's Terms of Service: <]8;;https://mapsplatform.google.comhttps://mapsplatform.google.com]8;;>
## ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
# Loading Raw data
August_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202008-divvy-tripdata.csv")
## Rows: 622361 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
September_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202009-divvy-tripdata.csv")
## Rows: 532958 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
October_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202010-divvy-tripdata.csv")
## Rows: 388653 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
November_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202011-divvy-tripdata.csv")
## Rows: 259716 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
December_20 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202012-divvy-tripdata.csv")
## Rows: 131573 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
January_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202101-divvy-tripdata.csv")
## Rows: 96834 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
February_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202102-divvy-tripdata.csv")
## Rows: 49622 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
March_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202103-divvy-tripdata.csv")
## Rows: 228496 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
April_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202104-divvy-tripdata.csv")
## Rows: 337230 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
May_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202105-divvy-tripdata.csv")
## Rows: 531633 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
June_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202106-divvy-tripdata.csv")
## Rows: 729595 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
July_21 <- read_csv("/Users/osbaldoealbornoz/Documents/Cyclistc_Project/datasets/202107-divvy-tripdata.csv")
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This will help us identify the data types and column names in the datasets for consistency.
str(August_20)
## spc_tbl_ [622,361 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:622361] "322BD23D287743ED" "2A3AEF1AB9054D8B" "67DC1D133E8B5816" "C79FBBD412E578A7" ...
## $ rideable_type : chr [1:622361] "docked_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:622361], format: "2020-08-20 18:08:14" "2020-08-27 18:46:04" ...
## $ ended_at : POSIXct[1:622361], format: "2020-08-20 18:17:51" "2020-08-27 19:54:51" ...
## $ start_station_name: chr [1:622361] "Lake Shore Dr & Diversey Pkwy" "Michigan Ave & 14th St" "Columbus Dr & Randolph St" "Daley Center Plaza" ...
## $ start_station_id : num [1:622361] 329 168 195 81 658 658 196 67 153 177 ...
## $ end_station_name : chr [1:622361] "Clark St & Lincoln Ave" "Michigan Ave & 14th St" "State St & Randolph St" "State St & Kinzie St" ...
## $ end_station_id : num [1:622361] 141 168 44 47 658 658 49 229 225 305 ...
## $ start_lat : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:622361] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:622361] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:622361] "member" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(September_20)
## spc_tbl_ [532,958 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:532958] "2B22BD5F95FB2629" "A7FB70B4AFC6CAF2" "86057FA01BAC778E" "57F6DC9A153DB98C" ...
## $ rideable_type : chr [1:532958] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:532958], format: "2020-09-17 14:27:11" "2020-09-17 15:07:31" ...
## $ ended_at : POSIXct[1:532958], format: "2020-09-17 14:44:24" "2020-09-17 15:07:45" ...
## $ start_station_name: chr [1:532958] "Michigan Ave & Lake St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Ashland Ave & Belle Plaine Ave" ...
## $ start_station_id : num [1:532958] 52 NA NA 246 24 94 291 NA NA NA ...
## $ end_station_name : chr [1:532958] "Green St & Randolph St" "W Oakdale Ave & N Broadway" "W Oakdale Ave & N Broadway" "Montrose Harbor" ...
## $ end_station_id : num [1:532958] 112 NA NA 249 24 NA 256 NA NA NA ...
## $ start_lat : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
## $ start_lng : num [1:532958] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:532958] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:532958] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:532958] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(October_20)
## spc_tbl_ [388,653 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:388653] "ACB6B40CF5B9044C" "DF450C72FD109C01" "B6396B54A15AC0DF" "44A4AEE261B9E854" ...
## $ rideable_type : chr [1:388653] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:388653], format: "2020-10-31 19:39:43" "2020-10-31 23:50:08" ...
## $ ended_at : POSIXct[1:388653], format: "2020-10-31 19:57:12" "2020-11-01 00:04:16" ...
## $ start_station_name: chr [1:388653] "Lakeview Ave & Fullerton Pkwy" "Southport Ave & Waveland Ave" "Stony Island Ave & 67th St" "Clark St & Grace St" ...
## $ start_station_id : num [1:388653] 313 227 102 165 190 359 313 125 NA 174 ...
## $ end_station_name : chr [1:388653] "Rush St & Hubbard St" "Kedzie Ave & Milwaukee Ave" "University Ave & 57th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : num [1:388653] 125 260 423 256 185 53 125 313 199 635 ...
## $ start_lat : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
## $ start_lng : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:388653] 41.9 41.9 41.8 42 41.9 ...
## $ end_lng : num [1:388653] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:388653] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(November_20)
## spc_tbl_ [259,716 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
## $ rideable_type : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
## $ ended_at : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
## $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
## $ start_station_id : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
## $ end_station_name : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
## $ end_station_id : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
## $ start_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:259716] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(December_20)
## spc_tbl_ [131,573 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
## $ rideable_type : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
## $ ended_at : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
## $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
## $ start_station_id : chr [1:131573] "13157" NA NA NA ...
## $ end_station_name : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
## $ end_station_id : chr [1:131573] "TA1306000003" NA NA NA ...
## $ start_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ start_lng : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ end_lng : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:131573] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(January_21)
## spc_tbl_ [96,834 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(February_21)
## spc_tbl_ [49,622 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
## $ ended_at : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
## $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:49622] "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ start_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ end_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:49622] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(March_21)
## spc_tbl_ [228,496 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
## $ rideable_type : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
## $ ended_at : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
## $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
## $ start_station_id : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
## $ end_station_name : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
## $ start_lat : num [1:228496] 41.9 41.9 41.8 42 42 ...
## $ start_lng : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
## $ end_lng : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:228496] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(April_21)
## spc_tbl_ [337,230 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
## $ rideable_type : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
## $ started_at : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
## $ ended_at : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
## $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
## $ start_station_id : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
## $ end_station_name : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
## $ end_station_id : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
## $ start_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ start_lng : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ end_lng : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:337230] "member" "casual" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(May_21)
## spc_tbl_ [531,633 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
## $ rideable_type : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
## $ ended_at : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
## $ start_station_name: chr [1:531633] NA NA NA NA ...
## $ start_station_id : chr [1:531633] NA NA NA NA ...
## $ end_station_name : chr [1:531633] NA NA NA NA ...
## $ end_station_id : chr [1:531633] NA NA NA NA ...
## $ start_lat : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
## $ end_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:531633] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(June_21)
## spc_tbl_ [729,595 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
## $ rideable_type : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
## $ ended_at : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
## $ start_station_name: chr [1:729595] NA NA NA NA ...
## $ start_station_id : chr [1:729595] NA NA NA NA ...
## $ end_station_name : chr [1:729595] NA NA NA NA ...
## $ end_station_id : chr [1:729595] NA NA NA NA ...
## $ start_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:729595] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(July_21)
## spc_tbl_ [822,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
## $ ended_at : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
## $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:822410] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The structure outputs showed that the datasets from August_20 to November_20 datasets listed the start_station_id and end_station_id columns as ‘int’ types instead of ‘chr’ types. We will change them to ‘chr’ so that all columns have the same data types.
# August 2020
August_20 <- mutate(
August_20,
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)
)
# September 2020
September_20 <- mutate(
September_20,
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)
)
# October 2020
October_20 <- mutate(
October_20,
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)
)
# Nombember 2020
November_20 <- mutate(
November_20,
start_station_id = as.character(start_station_id),
end_station_id = as.character(end_station_id)
)
We will use the clean_names function just to make sure that all datasets have only characters, number and underscores in the column names.
# applying the clean_names function
clean_names(August_20)
## # A tibble: 622,361 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 322BD23D2877… docked… 2020-08-20 18:08:14 2020-08-20 18:17:51 Lake S… 329
## 2 2A3AEF1AB905… electr… 2020-08-27 18:46:04 2020-08-27 19:54:51 Michig… 168
## 3 67DC1D133E8B… electr… 2020-08-26 19:44:14 2020-08-26 21:53:07 Columb… 195
## 4 C79FBBD412E5… electr… 2020-08-27 12:05:41 2020-08-27 12:53:45 Daley … 81
## 5 13814D3D661E… electr… 2020-08-27 16:49:02 2020-08-27 16:59:49 Leavit… 658
## 6 56349A5A42F0… electr… 2020-08-27 17:26:23 2020-08-27 18:07:50 Leavit… 658
## 7 EB6ABC5570C2… electr… 2020-08-26 20:14:02 2020-08-26 20:34:00 Cityfr… 196
## 8 B4ECE389A1DE… electr… 2020-08-26 21:59:50 2020-08-26 22:12:35 Sheffi… 67
## 9 0B355B0FE076… electr… 2020-08-26 19:17:42 2020-08-26 19:32:14 Southp… 153
## 10 1ECE04F779E9… electr… 2020-08-27 15:13:57 2020-08-27 15:41:59 Theate… 177
## # … with 622,351 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(September_20)
## # A tibble: 532,958 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 2B22BD5F95FB… electr… 2020-09-17 14:27:11 2020-09-17 14:44:24 Michig… 52
## 2 A7FB70B4AFC6… electr… 2020-09-17 15:07:31 2020-09-17 15:07:45 W Oakd… <NA>
## 3 86057FA01BAC… electr… 2020-09-17 15:09:04 2020-09-17 15:09:35 W Oakd… <NA>
## 4 57F6DC9A153D… electr… 2020-09-17 18:10:46 2020-09-17 18:35:49 Ashlan… 246
## 5 B9C4712F78C1… electr… 2020-09-17 15:16:13 2020-09-17 15:52:55 Fairba… 24
## 6 378BBCE1E444… electr… 2020-09-17 18:37:04 2020-09-17 19:23:28 Clark … 94
## 7 0CB5E2496B6F… electr… 2020-09-16 21:39:57 2020-09-16 21:53:44 Wells … 291
## 8 9673F5D39BDB… electr… 2020-09-17 12:18:06 2020-09-17 12:18:19 <NA> <NA>
## 9 54B91F5C95B2… electr… 2020-09-17 17:09:17 2020-09-17 17:34:20 <NA> <NA>
## 10 91CEBB66076D… electr… 2020-09-17 12:20:25 2020-09-17 12:29:47 <NA> <NA>
## # … with 532,948 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(October_20)
## # A tibble: 388,653 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 ACB6B40CF5B9… electr… 2020-10-31 19:39:43 2020-10-31 19:57:12 Lakevi… 313
## 2 DF450C72FD10… electr… 2020-10-31 23:50:08 2020-11-01 00:04:16 Southp… 227
## 3 B6396B54A15A… electr… 2020-10-31 23:00:01 2020-10-31 23:08:22 Stony … 102
## 4 44A4AEE261B9… electr… 2020-10-31 22:16:43 2020-10-31 22:19:35 Clark … 165
## 5 10B7DD76A6A2… electr… 2020-10-31 19:38:19 2020-10-31 19:54:32 Southp… 190
## 6 DA6C37596601… electr… 2020-10-29 17:38:04 2020-10-29 17:45:43 Larrab… 359
## 7 C2F3808FD56B… electr… 2020-10-29 09:03:06 2020-10-29 09:17:56 Lakevi… 313
## 8 15B13B5A508B… electr… 2020-10-29 16:37:21 2020-10-29 16:52:40 Rush S… 125
## 9 285D224410C1… electr… 2020-10-28 23:12:03 2020-10-28 23:24:32 <NA> <NA>
## 10 E1FB79FFE6DB… electr… 2020-10-29 16:38:44 2020-10-29 16:50:17 Canal … 174
## # … with 388,643 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(November_20)
## # A tibble: 259,716 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 BD0A6FF6FFF9… electr… 2020-11-01 13:36:00 2020-11-01 13:45:40 Dearbo… 110
## 2 96A7A7A4BDE4… electr… 2020-11-01 10:03:26 2020-11-01 10:14:45 Frankl… 672
## 3 C61526D06582… electr… 2020-11-01 00:34:05 2020-11-01 01:03:06 Lake S… 76
## 4 E533E89C3208… electr… 2020-11-01 00:45:16 2020-11-01 00:54:31 Leavit… 659
## 5 1C9F4EF18C16… electr… 2020-11-01 15:43:25 2020-11-01 16:16:52 Buckin… 2
## 6 7259585D8276… electr… 2020-11-14 15:55:17 2020-11-14 16:44:38 Wabash… 72
## 7 91FE5C8F8A67… electr… 2020-11-14 16:47:29 2020-11-14 17:03:03 Lake S… 76
## 8 9E7A79ADA90C… electr… 2020-11-14 16:04:15 2020-11-14 16:19:33 <NA> <NA>
## 9 A5B02C0D41DB… electr… 2020-11-14 16:24:09 2020-11-14 16:51:34 Marshf… 58
## 10 8234407C29FE… electr… 2020-11-14 01:24:22 2020-11-14 01:31:42 Clark … 394
## # … with 259,706 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(December_20)
## # A tibble: 131,573 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 70B6A9A437D4… classi… 2020-12-27 12:44:29 2020-12-27 12:55:06 Aberde… 13157
## 2 158A465D4E74… electr… 2020-12-18 17:37:15 2020-12-18 17:44:19 <NA> <NA>
## 3 5262016E0F1F… electr… 2020-12-15 15:04:33 2020-12-15 15:11:28 <NA> <NA>
## 4 BE119628E44F… electr… 2020-12-15 15:54:18 2020-12-15 16:00:11 <NA> <NA>
## 5 69AF78D57854… electr… 2020-12-22 12:08:17 2020-12-22 12:10:59 <NA> <NA>
## 6 C1DECC4AB488… electr… 2020-12-22 13:26:37 2020-12-22 13:34:50 <NA> <NA>
## 7 B014A60B856C… electr… 2020-12-03 16:23:48 2020-12-03 16:33:39 <NA> <NA>
## 8 1E127B1929C0… electr… 2020-12-03 15:03:38 2020-12-03 15:12:39 <NA> <NA>
## 9 05F41F5137B5… electr… 2020-12-12 09:26:17 2020-12-12 09:26:35 <NA> <NA>
## 10 BB807646588D… electr… 2020-12-18 12:52:06 2020-12-18 12:52:23 <NA> <NA>
## # … with 131,563 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(January_21)
## # A tibble: 96,834 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 E19E6F1B8D4C… electr… 2021-01-23 16:14:19 2021-01-23 16:24:44 Califo… 17660
## 2 DC88F20C2C55… electr… 2021-01-27 18:43:08 2021-01-27 18:47:12 Califo… 17660
## 3 EC45C94683FE… electr… 2021-01-21 22:35:54 2021-01-21 22:37:14 Califo… 17660
## 4 4FA453A75AE3… electr… 2021-01-07 13:31:13 2021-01-07 13:42:55 Califo… 17660
## 5 BE5E8EB4E726… electr… 2021-01-23 02:24:02 2021-01-23 02:24:45 Califo… 17660
## 6 5D8969F88C77… electr… 2021-01-09 14:24:07 2021-01-09 15:17:54 Califo… 17660
## 7 09275CC10F85… electr… 2021-01-04 05:05:04 2021-01-04 05:10:39 Califo… 17660
## 8 DF7A32A217AE… electr… 2021-01-14 15:07:00 2021-01-14 15:13:40 Califo… 17660
## 9 C2EFC62379EB… electr… 2021-01-09 09:57:55 2021-01-09 10:00:26 Califo… 17660
## 10 B9F73448DFBE… classi… 2021-01-24 19:15:38 2021-01-24 19:22:51 Califo… 17660
## # … with 96,824 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(February_21)
## # A tibble: 49,622 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 89E7AA6C2922… classi… 2021-02-12 16:14:56 2021-02-12 16:21:43 Glenwo… 525
## 2 0FEFDE260356… classi… 2021-02-14 17:52:38 2021-02-14 18:12:09 Glenwo… 525
## 3 E6159D746B2D… electr… 2021-02-09 19:10:18 2021-02-09 19:19:10 Clark … KA1503…
## 4 B32D3199F1C2… classi… 2021-02-02 17:49:41 2021-02-02 17:54:06 Wood S… 637
## 5 83E463F23575… electr… 2021-02-23 15:07:23 2021-02-23 15:22:37 State … 13216
## 6 BDAA7E3494E8… electr… 2021-02-24 15:43:33 2021-02-24 15:49:05 Fairba… 18003
## 7 A77274235117… classi… 2021-02-01 17:47:42 2021-02-01 17:48:33 LaSall… KP1705…
## 8 295476889D9B… classi… 2021-02-11 18:33:53 2021-02-11 18:35:09 Fairba… 18003
## 9 362087194BA4… classi… 2021-02-27 15:13:39 2021-02-27 15:36:36 LaSall… KP1705…
## 10 21630F715038… classi… 2021-02-20 08:59:42 2021-02-20 09:17:04 LaSall… KP1705…
## # … with 49,612 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(March_21)
## # A tibble: 228,496 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 CFA86D4455AA… classi… 2021-03-16 08:32:30 2021-03-16 08:36:34 Humbol… 15651
## 2 30D9DC61227D… classi… 2021-03-28 01:26:28 2021-03-28 01:36:55 Humbol… 15651
## 3 846D87A15682… classi… 2021-03-11 21:17:29 2021-03-11 21:33:53 Shield… 15443
## 4 994D05AA75A1… classi… 2021-03-11 13:26:42 2021-03-11 13:55:41 Winthr… TA1308…
## 5 DF7464FBE92D… classi… 2021-03-21 09:09:37 2021-03-21 09:27:33 Glenwo… 525
## 6 CEBA8516FD17… classi… 2021-03-20 11:08:47 2021-03-20 11:29:39 Glenwo… 525
## 7 297268586B79… classi… 2021-03-20 14:10:41 2021-03-20 14:22:13 State … 13050
## 8 F39301858B60… electr… 2021-03-23 07:56:51 2021-03-23 08:05:50 Shore … TA1308…
## 9 D297F199D875… electr… 2021-03-31 15:31:19 2021-03-31 15:35:58 Clinto… 13021
## 10 36B877141175… classi… 2021-03-11 17:37:37 2021-03-11 17:52:44 Michig… TA1305…
## # … with 228,486 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(April_21)
## # A tibble: 337,230 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 6C992BD37A98… classi… 2021-04-12 18:25:36 2021-04-12 18:56:55 State … TA1307…
## 2 1E0145613A20… docked… 2021-04-27 17:27:11 2021-04-27 18:31:29 Dorche… KA1503…
## 3 E498E15508A8… docked… 2021-04-03 12:42:45 2021-04-07 11:40:24 Loomis… 20121
## 4 1887262AD101… classi… 2021-04-17 09:17:42 2021-04-17 09:42:48 Honore… TA1305…
## 5 C123548CAB2A… docked… 2021-04-03 12:42:25 2021-04-03 14:13:42 Loomis… 20121
## 6 097E76F3651B… classi… 2021-04-25 18:43:18 2021-04-25 18:43:59 Clinto… 15542
## 7 53C38EB01E6F… classi… 2021-04-03 16:28:21 2021-04-03 16:29:47 Ashlan… 16948
## 8 D53AC014EFD6… electr… 2021-04-06 16:35:06 2021-04-06 17:00:56 Dorche… KA1503…
## 9 6E2F7CA1FA9E… classi… 2021-04-12 15:22:54 2021-04-12 16:15:48 Ashlan… 16948
## 10 04218447AAC8… classi… 2021-04-24 15:04:55 2021-04-24 15:06:16 Dorche… KA1503…
## # … with 337,220 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(May_21)
## # A tibble: 531,633 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 C809ED75D616… electr… 2021-05-30 11:58:15 2021-05-30 12:10:39 <NA> <NA>
## 2 DD59FDCE0ACA… electr… 2021-05-30 11:29:14 2021-05-30 12:14:09 <NA> <NA>
## 3 0AB83CB88C43… electr… 2021-05-30 14:24:01 2021-05-30 14:25:13 <NA> <NA>
## 4 7881AC6D3911… electr… 2021-05-30 14:25:51 2021-05-30 14:41:04 <NA> <NA>
## 5 853FA701B458… electr… 2021-05-30 18:15:39 2021-05-30 18:22:32 <NA> <NA>
## 6 F5E63DFD96B2… electr… 2021-05-30 11:33:41 2021-05-30 11:57:17 <NA> <NA>
## 7 C884951E3665… electr… 2021-05-30 10:51:37 2021-05-30 11:06:20 <NA> <NA>
## 8 48B60B250FE7… electr… 2021-05-05 13:57:03 2021-05-05 14:14:58 <NA> <NA>
## 9 E3D0CC2FE135… electr… 2021-05-05 11:31:26 2021-05-05 11:34:03 <NA> <NA>
## 10 4382735758AB… electr… 2021-05-04 19:51:05 2021-05-04 20:17:26 <NA> <NA>
## # … with 531,623 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(June_21)
## # A tibble: 729,595 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 99FEC93BA843… electr… 2021-06-13 14:31:28 2021-06-13 14:34:11 <NA> <NA>
## 2 06048DCFC852… electr… 2021-06-04 11:18:02 2021-06-04 11:24:19 <NA> <NA>
## 3 9598066F6804… electr… 2021-06-04 09:49:35 2021-06-04 09:55:34 <NA> <NA>
## 4 B03C0FE48C41… electr… 2021-06-03 19:56:05 2021-06-03 20:21:55 <NA> <NA>
## 5 B9EEA89F8FEE… electr… 2021-06-04 14:05:51 2021-06-04 14:09:59 <NA> <NA>
## 6 62B943CEAAA4… electr… 2021-06-03 19:32:01 2021-06-03 19:38:46 <NA> <NA>
## 7 7E2546FBA79C… electr… 2021-06-10 16:30:10 2021-06-10 16:36:21 <NA> <NA>
## 8 3DDF3BBF6C4C… electr… 2021-06-10 17:00:30 2021-06-10 17:06:48 <NA> <NA>
## 9 260880563715… electr… 2021-06-10 12:46:16 2021-06-10 12:55:02 <NA> <NA>
## 10 AF529C946F28… electr… 2021-06-23 17:57:29 2021-06-23 18:06:40 <NA> <NA>
## # … with 729,585 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
clean_names(July_21)
## # A tibble: 822,410 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 0A1B623926EF… docked… 2021-07-02 14:44:36 2021-07-02 15:19:58 Michig… 13001
## 2 B2D5583A5A5E… classi… 2021-07-07 16:57:42 2021-07-07 17:16:09 Califo… 17660
## 3 6F264597DDBF… classi… 2021-07-25 11:30:55 2021-07-25 11:48:45 Wabash… SL-012
## 4 379B58EAB20E… classi… 2021-07-08 22:08:30 2021-07-08 22:23:32 Califo… 17660
## 5 6615C1E4EB08… electr… 2021-07-28 16:08:06 2021-07-28 16:27:09 Califo… 17660
## 6 62DC2B32872F… electr… 2021-07-29 17:09:08 2021-07-29 17:15:00 Califo… 17660
## 7 4BBB6E80E6A2… classi… 2021-07-28 16:51:47 2021-07-28 17:03:45 Califo… 17660
## 8 22CA03D32C6B… classi… 2021-07-03 12:44:50 2021-07-03 12:52:55 Clark … 13128
## 9 61F0D07D1EEE… classi… 2021-07-02 18:18:22 2021-07-02 18:38:21 Sherid… TA1307…
## 10 09B4551386A8… classi… 2021-07-29 21:54:05 2021-07-29 22:07:26 Sherid… TA1307…
## # … with 822,400 more rows, 7 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
Once we have the datasets clean and ready, it is time to unite all the information in a single dataset to facilitate processing. For this we will use the bind_rows function which adds NA in case of missing data.
# Merge all datasets
all_datasets <- bind_rows(
August_20, September_20, October_20, November_20, December_20,
January_21, February_21, March_21, April_21, May_21, June_21, July_21
)
head(all_datasets)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 322BD23D28774… docked… 2020-08-20 18:08:14 2020-08-20 18:17:51 Lake S… 329
## 2 2A3AEF1AB9054… electr… 2020-08-27 18:46:04 2020-08-27 19:54:51 Michig… 168
## 3 67DC1D133E8B5… electr… 2020-08-26 19:44:14 2020-08-26 21:53:07 Columb… 195
## 4 C79FBBD412E57… electr… 2020-08-27 12:05:41 2020-08-27 12:53:45 Daley … 81
## 5 13814D3D661EC… electr… 2020-08-27 16:49:02 2020-08-27 16:59:49 Leavit… 658
## 6 56349A5A42F0A… electr… 2020-08-27 17:26:23 2020-08-27 18:07:50 Leavit… 658
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## # ²​start_station_name, ³​start_station_id
Now we are going to do some checks on the final dataset for further processing.
We want to remove the rows with missing values (NA), and sorting in descending order.
# Removing rows with missing values
all_datasets <- na.omit(all_datasets)
# Order by date in descending order (newest first)
all_datasets <- all_datasets %>%
arrange(desc(started_at))
We will add an additional column named rides_length which contains the total length of a single ride in minutes. This column will be useful for a later process.
all_datasets$ride_length <- difftime(
all_datasets$ended_at,
all_datasets$started_at,
units = "mins"
)
# convert ride_length column type to numeric
all_datasets$ride_length <- as.numeric(all_datasets$ride_length)
# select rows only with valid ridelength
all_datasets <- all_datasets[all_datasets$ride_length > 5 & all_datasets$ride_length < 60 , ]
head(all_datasets)
## # A tibble: 6 × 14
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 1AFC6C416B7A8… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636
## 2 713A7722A6E58… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
## 3 C6F886992E60E… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059
## 4 1ED34A91A245A… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
## 5 EA999FB37E6E6… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
## 6 33E5E40576D4E… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063
## # … with 8 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, ride_length <dbl>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
To make a more detailed analysis we are going to identify the days, months, years and time in different columns.
# Year
all_datasets$year <- format(
all_datasets$started_at,
"%Y"
)
# Month
all_datasets$month <- format(
all_datasets$started_at,
"%m"
)
# Day
all_datasets$day <- format(
all_datasets$started_at,
"%d"
)
# Day of week
all_datasets$day_of_week <- format(
all_datasets$started_at,
"%A"
)
# Date
all_datasets$y_m_d <- format(
all_datasets$started_at,
"%Y-%m-%d"
)
# Time
all_datasets$time <- format(
all_datasets$started_at,
"%H:%M:%S"
)
head(all_datasets)
## # A tibble: 6 × 20
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 1AFC6C416B7A8… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636
## 2 713A7722A6E58… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
## 3 C6F886992E60E… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059
## 4 1ED34A91A245A… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
## 5 EA999FB37E6E6… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
## 6 33E5E40576D4E… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063
## # … with 14 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, ride_length <dbl>, year <chr>, month <chr>, day <chr>,
## # day_of_week <chr>, y_m_d <chr>, time <chr>, and abbreviated variable names
## # ¹​rideable_type, ²​start_station_name, ³​start_station_id
# Formatting ride lenght column to 2 decimal places.
all_datasets$ride_length<-format(round(all_datasets$ride_length,2),nsmall=2)
#convert ride_lenght to numeric values
all_datasets$ride_length <- as.numeric(all_datasets$ride_length)
We must make sure that there are no duplicate rows, the ride_id column must contain different numbers in each row. This ensures that each row is a different observation.
# We will use distinct() function to keep only unique/distinct rows from the data frame using the ride_id column
all_datasets %>% distinct(ride_id, .keep_all = TRUE)
## # A tibble: 3,394,790 × 20
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 1AFC6C416B7A… classi… 2021-07-31 23:59:57 2021-08-01 00:08:38 Orlean… 636
## 2 713A7722A6E5… electr… 2021-07-31 23:59:57 2021-08-01 00:19:10 Sheffi… TA1309…
## 3 C6F886992E60… classi… 2021-07-31 23:59:55 2021-08-01 00:15:17 Bissel… 13059
## 4 1ED34A91A245… classi… 2021-07-31 23:59:55 2021-08-01 00:06:01 Wells … TA1306…
## 5 EA999FB37E6E… electr… 2021-07-31 23:59:41 2021-08-01 00:05:44 Wells … TA1306…
## 6 33E5E40576D4… classi… 2021-07-31 23:59:39 2021-08-01 00:31:39 Sherid… 13063
## 7 D3A2460169E1… classi… 2021-07-31 23:59:32 2021-08-01 00:05:57 Ravens… TA1307…
## 8 AAB02049AEF4… classi… 2021-07-31 23:59:30 2021-08-01 00:35:58 Centra… 15687
## 9 C6DE6C1BCAEB… docked… 2021-07-31 23:59:27 2021-08-01 00:09:00 Green … TA1307…
## 10 7791FAFFF849… electr… 2021-07-31 23:59:25 2021-08-01 00:19:10 Sheffi… TA1309…
## # … with 3,394,780 more rows, 14 more variables: end_station_name <chr>,
## # end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## # end_lng <dbl>, member_casual <chr>, ride_length <dbl>, year <chr>,
## # month <chr>, day <chr>, day_of_week <chr>, y_m_d <chr>, time <chr>, and
## # abbreviated variable names ¹​rideable_type, ²​start_station_name,
## # ³​start_station_id
The output of this code results in the same rows count. This means there are no duplicate rows.
Once the data set is clean and free of errors, it is ready for analysis and it is good practice to keep it in a safe place.
# Saving the Dataset in our system
fwrite(
all_datasets,
"/Users/osbaldoealbornoz/Documents/Cyclistc_Project/dataset_cleaned/all_datasets_cleaned.csv",
col.names = TRUE,
row.names = FALSE
)
We will use the aggregated data to get information, to get answers in order to deliver insights at our stakeholders.Â
Let’s see the number of rides, median, average and max ride length in minutes…
# number of rides, median, average and max ride length
all_datasets %>%
group_by(member_casual ) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 2 × 5
## member_casual number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 casual 1490512 17.5 21.1 60.0
## 2 member 1904278 12.5 15.6 60.0
Here the two types of riders , the annual members ride more trips, however the casual riders ride more time in each trip they take.
Let’s see what type of bikes do they prefer and how they use them …
Bike types
For casual riders
all_datasets %>%
filter(member_casual=="casual" ) %>%
group_by(rideable_type ) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 3 × 5
## rideable_type number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 classic_bike 582142 16.6 20.2 60.0
## 2 docked_bike 593017 20.4 23.4 60.0
## 3 electric_bike 315353 14.6 18.5 60.0
We can see that the casual rides prefer to use the docked bikes which make sense because they use it with a friend or a relative, if not they use the classic bikes. The electric bikes are not in high demand for this type of riders.
For annual members
all_datasets %>%
filter(member_casual=="member" ) %>%
group_by(rideable_type ) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 3 × 5
## rideable_type number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 classic_bike 886489 12.5 15.5 60.0
## 2 docked_bike 656382 13.3 16.4 60.0
## 3 electric_bike 361407 11.4 14.4 60.0
We can see that the members prefer to use much more the classic bikes which make sense because they probably use it to commute to work. The electric bikes are not in high demand for this type of riders.
Rides by Weekday
For casual riders
all_datasets %>%
filter(member_casual=="casual") %>%
group_by(day_of_week) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 7 × 5
## day_of_week number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Friday 215814 16.6 20.2 60.0
## 2 Monday 160503 17.4 21.2 60.0
## 3 Saturday 352289 18.9 22.3 60.0
## 4 Sunday 277371 19.5 22.8 60.0
## 5 Thursday 164315 15.6 19.3 60.0
## 6 Tuesday 156435 16.2 20.0 60.0
## 7 Wednesday 163785 15.9 19.6 60.0
Here we can clearly see that casual riders use the service mainly on Saturdays and Sundays that order, with an average about 41 minutes by trip.
For members
all_datasets %>%
filter(member_casual=="member") %>%
group_by(day_of_week) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 7 × 5
## day_of_week number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Friday 279162 12.2 15.2 60.0
## 2 Monday 250266 12.1 15.2 60.0
## 3 Saturday 293655 13.6 16.8 60.0
## 4 Sunday 244022 13.9 17.1 60.0
## 5 Thursday 273067 11.9 14.9 60.0
## 6 Tuesday 274376 12.0 15.1 60.0
## 7 Wednesday 289730 12.1 15.0 60.0
Here we can see a very different behavior from that of casual riders because the demand remains stable throughout the week where it only drops a little on Sunday. However, this day is when the duration of the trips is longer.
Rides by Month
For casual riders
all_datasets %>%
filter(member_casual=="casual") %>%
group_by(month) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 12 × 5
## month number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 01 12330 13.4 17.1 60.0
## 2 02 7114 16.2 20.0 60.0
## 3 03 61272 18.0 21.7 60.0
## 4 04 97491 17.4 21.3 60.0
## 5 05 175205 18.2 21.8 60.0
## 6 06 250942 17.2 20.7 60.0
## 7 07 307747 16.6 20.2 60.0
## 8 08 223235 19.5 22.8 60.0
## 9 09 174379 18 21.5 60.0
## 10 10 101142 16.2 20.1 60.0
## 11 11 59224 16.5 20.5 60.0
## 12 12 20431 14.4 18.3 60.0
Regarding the months, casual riders use the service more in the summer season, that is, in the months of June, July and August, which makes a lot of sense since they use it for recreational purposes. The duration of the trips also increase in these months.
For members
all_datasets %>%
filter(member_casual=="member") %>%
group_by(month) %>%
summarise(number_trips=n(), median_ride_length=round(median(ride_length),2), avg_ride_length=round(mean(ride_length),2), max_ride_length=round(max(ride_length),2))
## # A tibble: 12 × 5
## month number_trips median_ride_length avg_ride_length max_ride_length
## <chr> <int> <dbl> <dbl> <dbl>
## 1 01 53283 10.9 13.9 60.0
## 2 02 28319 11.8 15.1 59.9
## 3 03 104407 12.1 15.2 60.0
## 4 04 144225 12.4 15.5 60.0
## 5 05 190905 12.6 15.7 60.0
## 6 06 250498 12.5 15.5 60.0
## 7 07 263183 12.2 15.2 60.0
## 8 08 274471 14.1 17.1 60.0
## 9 09 234567 13.0 16.2 60.0
## 10 10 172910 12.0 15.1 60.0
## 11 11 118145 11.9 15.2 60.0
## 12 12 69365 11.2 14.2 59.9
The behavior of the members throughout the months is more balanced with a not so great decrease in the winter months. The duration of the trips is much shorter compared to casual riders.
It is important to know and analyze in more detail the routes used by users to establish relevancy zones and identify trends and behavior patterns.
We will create a new dataset that contains only the columns that refer to the routes.
Create a routes dataset
routes_dataset <- all_datasets %>%
select(member_casual, start_station_name, end_station_name, start_lat, end_lat, start_lng, end_lng, ride_length) %>%
unite(route,c("start_station_name","end_station_name"), sep=" To ") %>%
group_by(route)
Identify routes by rider type
Casual rider routes
routes_dataset %>%
filter(member_casual=="casual") %>%
select(route) %>%
summarise(number_of_rides=(ride_length=n())) %>%
group_by(route) %>%
arrange(desc(number_of_rides))
## # A tibble: 111,759 × 2
## # Groups: route [111,759]
## route number_of_rides
## <chr> <int>
## 1 Streeter Dr & Grand Ave To Streeter Dr & Grand Ave 5701
## 2 Lake Shore Dr & Monroe St To Lake Shore Dr & Monroe St 5201
## 3 Millennium Park To Millennium Park 3590
## 4 Lake Shore Dr & Monroe St To Streeter Dr & Grand Ave 3350
## 5 Michigan Ave & Oak St To Michigan Ave & Oak St 3048
## 6 Buckingham Fountain To Buckingham Fountain 2798
## 7 Streeter Dr & Grand Ave To Millennium Park 2168
## 8 Theater on the Lake To Theater on the Lake 2034
## 9 Indiana Ave & Roosevelt Rd To Indiana Ave & Roosevelt Rd 2031
## 10 Millennium Park To Streeter Dr & Grand Ave 1794
## # … with 111,749 more rows
The result is arranged in such a way that it shows the most used routes, that is, we can clearly appreciate that casual riders use more the routes of parks, museums, lakes, among others, which indicates a pattern of mobility towards recreational sites.
Member routes
routes_dataset %>%
filter(member_casual=="member") %>%
select(route) %>%
summarise(number_of_rides=(ride_length=n())) %>%
group_by(route) %>%
arrange(desc(number_of_rides))
## # A tibble: 119,516 × 2
## # Groups: route [119,516]
## route number_o…¹
## <chr> <int>
## 1 MLK Jr Dr & 29th St To State St & 33rd St 1342
## 2 State St & 33rd St To MLK Jr Dr & 29th St 1202
## 3 Loomis St & Lexington St To Halsted St & Polk St 926
## 4 Lakefront Trail & Bryn Mawr Ave To Theater on the Lake 868
## 5 Halsted St & Polk St To Loomis St & Lexington St 804
## 6 Theater on the Lake To Lakefront Trail & Bryn Mawr Ave 769
## 7 Cornell Ave & Hyde Park Blvd To University Ave & 57th St 750
## 8 University Ave & 57th St To Kimbark Ave & 53rd St 739
## 9 Lakefront Trail & Bryn Mawr Ave To Lakefront Trail & Bryn Mawr Ave 730
## 10 Broadway & Waveland Ave To Broadway & Barry Ave 687
## # … with 119,506 more rows, and abbreviated variable name ¹​number_of_rides
The routes of the members are very different since they use the service mostly in commercial and business areas, which indicates that they are commuters riders at their jobs and others use bikes to get to their businesses or study places.
To encourage casual users to become subscription users, there are a few recommendations:
Offer discounted rates: One way to entice casual users to become
subscribers is to offer discounted rates for those who sign up for an
annual subscription. Make the subscription price more attractive than
paying for each ride individually.
Emphasize convenience: Highlight the convenience of having a
subscription, such as not having to worry about paying for each ride,
and being able to access the bikes whenever they need them.
Offer rewards or incentives: Consider offering rewards or incentives for users who make the switch from casual to subscription, such as free rides or discounts on future subscriptions. For example, make an agreement or alliance with museums, parks or recreational sites that are favorite destinations to obtain cheaper tickets or benefits for those who use the bike service on weekends by subscription.
Market the benefits: Make sure to communicate the benefits of being a subscription user, such as access to more bikes and priority service during peak hours.
by Osbaldo Albornoz