This case study was completed by Osbaldo Albornoz in February 2023 as part of the Google Data Analytics Professional Certificate capstone unit. The analysis was carried out in R and is hosted online on GitHub.
In this case study we work through the real-world tasks of a junior data analyst. In this fictional scenario we work for Bellabeat, where we meet different characters and team members.
You are a junior data analyst working on the marketing analyst team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but they have the potential to become a larger player in the market. Urška Sršen, cofounder and Chief Creative Officer of Bellabeat, believes that analyzing smart device fitness data could help unlock new growth opportunities for the company. You have been asked to focus on one of Bellabeat’s products and analyze smart device data to gain insight into how consumers are using their smart devices. The insights you discover will then help guide marketing strategy for the company. You will present your analysis to the Bellabeat executive team along with your high-level recommendations for Bellabeat’s marketing strategy.
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products. Bellabeat products became available through a growing number of online retailers in addition to their own e-commerce channel on their website. The company has invested in traditional advertising media, such as radio, out-of-home billboards, print, and television, but focuses on digital marketing extensively. Bellabeat invests year-round in Google Search, maintains active Facebook and Instagram pages, and consistently engages consumers on Twitter. Additionally, Bellabeat runs video ads on YouTube and display ads on the Google Display Network to support campaigns around key marketing dates.
Sršen knows that an analysis of Bellabeat’s available consumer data would reveal more opportunities for growth. She has asked the marketing analytics team to focus on a Bellabeat product and analyze smart device usage data in order to gain insight into how people are already using their smart devices. Then, using this information, she would like high-level recommendations for how these trends can inform Bellabeat marketing strategy.
Stakeholders
Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer
Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team
Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy. You joined this team six months ago and have been busy learning about Bellabeat’s mission and business goals, as well as how you, as a junior data analyst, can help Bellabeat achieve them.
Products
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
Business task: Analyze smart device usage data in order to gain insight into how consumers use non-Bellabeat smart devices.
What are some trends in smart device usage?
How could these trends apply to Bellabeat customers?
How could these trends help influence Bellabeat marketing strategy?
A Kaggle data set: FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains consented personal fitness tracker data from thirty Fitbit users which includes minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.
No data is available on Bellabeat customers or on usage of Bellabeat products, so proxy data is used for this analysis. Public data on Fitbit users is the primary source. The dataset was obtained from Kaggle and is released under a CC0 public-domain license. Because it covers only about thirty users, the findings should be read as indicative rather than representative.
# Loading Raw data
activity <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/dailyActivity_merged.csv")
calories <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/dailyCalories_merged.csv")
intensities <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/dailyIntensities_merged.csv")
sleep <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/sleepDay_merged.csv")
weight <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/weightLogInfo_merged.csv")
steps <- read.csv("/Users/osbaldoealbornoz/Documents/Bellabeat_Project/dailySteps_merged.csv")
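The absolute paths above only resolve on the author's machine. A minimal sketch of a more portable loading pattern follows. The names `data_dir` and `read_fitbit` are hypothetical and introduced here for illustration, and a temporary folder with a one-row file stands in for the real data folder.

```r
# Hypothetical stand-in for the project folder; point this at the real
# Bellabeat_Project directory when running against the Kaggle files.
data_dir <- tempdir()

# Write a one-row demo file so the sketch runs without the Kaggle download.
write.csv(
  data.frame(Id = 1503960366, ActivityDay = "4/12/2016", Calories = 1985L),
  file.path(data_dir, "dailyCalories_merged.csv"),
  row.names = FALSE
)

# Hypothetical helper: build the full path from the folder and the file stem.
read_fitbit <- function(stem, dir = data_dir) {
  read.csv(file.path(dir, paste0(stem, "_merged.csv")))
}

calories_demo <- read_fitbit("dailyCalories")
str(calories_demo)
```

With this pattern only `data_dir` needs to change when the project moves to another machine.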
Let's take a look at the datasets. We start with the activity dataset, using the head and str functions.
activity dataset
head(activity)
## Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
str(activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
calories dataset
head(calories)
## Id ActivityDay Calories
## 1 1503960366 4/12/2016 1985
## 2 1503960366 4/13/2016 1797
## 3 1503960366 4/14/2016 1776
## 4 1503960366 4/15/2016 1745
## 5 1503960366 4/16/2016 1863
## 6 1503960366 4/17/2016 1728
str(calories)
## 'data.frame': 940 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDay: chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
Note: The information in the calories dataset appears to be contained in the activity dataset as well: every field of calories is present in activity, and only the name of the date field differs. We can check this with the all function together with %in%, which here compares the datasets column by column.
all(calories %in% activity)
## [1] TRUE
The result TRUE indicates that the calories dataset is contained in the activity dataset.
Let's check the other datasets as well.
all(intensities %in% activity)
## [1] TRUE
all(sleep %in% activity)
## [1] FALSE
all(weight %in% activity)
## [1] FALSE
all(steps %in% activity)
## [1] TRUE
So the calories, intensities, and steps datasets are all contained in the activity dataset. This means that we will work with the following datasets: activity, sleep, and weight.
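The containment check can also be made explicit at the row level. The sketch below is one possible approach, not the method used in this report: it uses three-row stand-ins built from the values printed above, and `is_contained` is a hypothetical helper name.

```r
# Small stand-ins built from the first rows printed above.
activity_demo <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/14/2016"),
  TotalSteps = c(13162L, 10735L, 10460L),
  Calories = c(1985L, 1797L, 1776L)
)
calories_demo <- data.frame(
  Id = rep(1503960366, 3),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/14/2016"),
  Calories = c(1985L, 1797L, 1776L)
)

# Align the differently named date column before comparing.
names(calories_demo)[names(calories_demo) == "ActivityDay"] <- "ActivityDate"

# Hypothetical helper: TRUE if every row of `small` appears in `big`
# when both are restricted to the columns of `small`.
is_contained <- function(small, big) {
  shared <- intersect(names(small), names(big))
  if (length(shared) < ncol(small)) return(FALSE)  # a column is missing from `big`
  row_key <- function(df) do.call(paste, c(df[shared], sep = "\r"))
  all(row_key(small) %in% row_key(big))
}

is_contained(calories_demo, activity_demo)
```

Comparing row keys rather than whole columns makes the check independent of row order.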
We use the clean_names function (from the janitor package) to make sure that the column names in all datasets contain only letters, numbers, and underscores.
str(sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
The date columns in the activity, sleep, and weight datasets are of type chr. Let's convert them to date types and rename them to Date.
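The conversion code is not shown in this report. One way to do it in base R is sketched below on two-row stand-ins. The format strings are assumptions based on the values printed by str above, and parsing the AM/PM marker with `%p` depends on the session locale.

```r
# Two-row stand-ins with the original column names and string formats.
activity_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  ActivityDate = c("4/12/2016", "4/13/2016")
)
sleep_demo <- data.frame(
  Id = c(1503960366, 1503960366),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")
)

# Date only: month/day/four-digit year.
activity_demo$Date <- as.Date(activity_demo$ActivityDate, format = "%m/%d/%Y")
activity_demo$ActivityDate <- NULL  # drop the old chr column

# Date and 12-hour time; the time zone is set explicitly (UTC is an assumption).
sleep_demo$Date <- as.POSIXct(
  sleep_demo$SleepDay,
  format = "%m/%d/%Y %I:%M:%S %p",
  tz = "UTC"
)
sleep_demo$SleepDay <- NULL

str(activity_demo)
str(sleep_demo)
```

The weight dataset's Date column uses the same date-time layout as SleepDay, so the second pattern would apply to it as well.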
Verifying the changes
str(activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : Date, format: "2016-04-12" "2016-04-13" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
str(sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ Date : POSIXct, format: "2016-04-12" "2016-04-13" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
str(weight)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : POSIXct, format: "2016-05-02 23:59:59" "2016-05-03 23:59:59" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
Counting the distinct participant Ids helps us determine whether each dataset has enough subjects for the analysis. We use the n_distinct function.
n_distinct(activity$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(weight$Id)
## [1] 8
This output tells us that the weight dataset, with only 8 participants, does not have enough subjects for the analysis.
sum(duplicated(activity))
## [1] 0
sum(duplicated(sleep))
## [1] 3
There are 3 duplicates in the sleep dataset. Let's remove them.
sleep <- unique(sleep)
sum(duplicated(sleep))
## [1] 0
activity dataframe
sum(is.na(activity))
## [1] 0
sleep dataframe
sum(is.na(sleep))
## [1] 0
There are no missing values in the dataframes.
Now that we have checked and cleaned the activity and sleep datasets, we can merge them into a single dataset.
# Merging the activity and sleep datasets
all_activity <- merge(activity, sleep, by = c("Id", "Date"), all.x = TRUE)
head(all_activity)
## Id Date TotalSteps TotalDistance TrackerDistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-14 10460 6.74 6.74
## 4 1503960366 2016-04-15 9762 6.28 6.28
## 5 1503960366 2016-04-16 12669 8.16 8.16
## 6 1503960366 2016-04-17 9705 6.48 6.48
## LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
## TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## 1 1 327 346
## 2 2 384 407
## 3 NA NA NA
## 4 1 412 442
## 5 2 340 367
## 6 1 700 712
Because all.x = TRUE keeps every row from the activity dataframe, days without a matching sleep record have NA in the sleep columns. This is acceptable for this analysis.
Once the dataset is clean and free of errors, it is ready for analysis, and it is good practice to save a copy in a safe place.
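To illustrate how the all.x argument affects the result, here is a small sketch on stand-in data. The three activity rows and two sleep rows reuse values printed above, and both key columns are assumed to be of class Date for simplicity.

```r
# Three activity days, but sleep records for only two of them.
activity_demo <- data.frame(
  Id = rep(1503960366, 3),
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14")),
  TotalSteps = c(13162L, 10735L, 10460L)
)
sleep_demo <- data.frame(
  Id = rep(1503960366, 2),
  Date = as.Date(c("2016-04-12", "2016-04-13")),
  TotalMinutesAsleep = c(327L, 384L)
)

# Left join: keep every activity row, fill unmatched sleep columns with NA.
left_joined <- merge(activity_demo, sleep_demo, by = c("Id", "Date"), all.x = TRUE)

# Inner join (the default): keep only days present in both tables.
inner_joined <- merge(activity_demo, sleep_demo, by = c("Id", "Date"))

nrow(left_joined)                           # 3
nrow(inner_joined)                          # 2
sum(is.na(left_joined$TotalMinutesAsleep))  # 1
```

The left join preserves the full activity record at the cost of NA values, which later summaries of the sleep columns must account for.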
# Saving the Dataset in our system
fwrite(
all_activity,
"/Users/osbaldoealbornoz/Documents/Bellabeat_Project/all_activity.csv",
col.names = TRUE,
row.names = FALSE
)
Let's summarize the data!
summary(all_activity)
## Id Date TotalSteps TotalDistance
## Min. :1.504e+09 Min. :2016-04-12 Min. : 0 Min. : 0.000
## 1st Qu.:2.320e+09 1st Qu.:2016-04-19 1st Qu.: 3790 1st Qu.: 2.620
## Median :4.445e+09 Median :2016-04-26 Median : 7406 Median : 5.245
## Mean :4.855e+09 Mean :2016-04-26 Mean : 7638 Mean : 5.490
## 3rd Qu.:6.962e+09 3rd Qu.:2016-05-04 3rd Qu.:10727 3rd Qu.: 7.713
## Max. :8.878e+09 Max. :2016-05-12 Max. :36019 Max. :28.030
##
## TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## Min. : 0.000 Min. :0.0000 Min. : 0.000
## 1st Qu.: 2.620 1st Qu.:0.0000 1st Qu.: 0.000
## Median : 5.245 Median :0.0000 Median : 0.210
## Mean : 5.475 Mean :0.1082 Mean : 1.503
## 3rd Qu.: 7.710 3rd Qu.:0.0000 3rd Qu.: 2.053
## Max. :28.030 Max. :4.9421 Max. :21.920
##
## ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## Min. :0.0000 Min. : 0.000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.: 1.945 1st Qu.:0.000000
## Median :0.2400 Median : 3.365 Median :0.000000
## Mean :0.5675 Mean : 3.341 Mean :0.001606
## 3rd Qu.:0.8000 3rd Qu.: 4.782 3rd Qu.:0.000000
## Max. :6.4800 Max. :10.710 Max. :0.110000
##
## VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.: 729.8
## Median : 4.00 Median : 6.00 Median :199.0 Median :1057.5
## Mean : 21.16 Mean : 13.56 Mean :192.8 Mean : 991.2
## 3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:1229.5
## Max. :210.00 Max. :143.00 Max. :518.0 Max. :1440.0
##
## Calories TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
## Min. : 0 Min. :1.00 Min. : 58.0 Min. : 61.0
## 1st Qu.:1828 1st Qu.:1.00 1st Qu.:361.0 1st Qu.:403.8
## Median :2134 Median :1.00 Median :432.5 Median :463.0
## Mean :2304 Mean :1.12 Mean :419.2 Mean :458.5
## 3rd Qu.:2793 3rd Qu.:1.00 3rd Qu.:490.0 3rd Qu.:526.0
## Max. :4900 Max. :3.00 Max. :796.0 Max. :961.0
## NA's :530 NA's :530 NA's :530
Some observations and conclusions that can be made from the summary statistics:
On average, participants sleep once per day for 419.2 minutes, or about 7 hours.
On average, individuals take 7,638 steps per day and cover a distance of 5.49 km.
The median very active distance is 0.21 km, which suggests that on most recorded days users cover little distance at very active intensity.
The mean number of total sleep records is 1.12 with a minimum of 1, so most logged days contain a single sleep record. The 530 NA values show that many activity days have no sleep record at all.
The maximum values for total steps and distance are quite high, which suggests that some individuals in the dataset are very active.
The average number of sedentary minutes per day is 991.2 (about 16.5 hours), which is quite high and indicates that many people in the dataset may have a sedentary lifestyle.
The maximum calories burned per day is 4,900, which is a very high value. It may reflect unusually active individuals, such as athletes or people with physically demanding jobs, although the data does not confirm this.
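The unit conversions behind these observations can be reproduced with a few lines of arithmetic. The sketch below uses values read off the summary table above. The total of 940 rows is the row count of the activity dataset and is assumed to carry over to the merged table.

```r
# Values read off the summary output above.
mean_minutes_asleep    <- 419.2
mean_sedentary_minutes <- 991.2
total_rows             <- 940  # rows in activity (assumed unchanged by the merge)
rows_without_sleep     <- 530  # NA count in the sleep columns

# Convert minutes to hours.
hours_asleep    <- mean_minutes_asleep / 60
hours_sedentary <- mean_sedentary_minutes / 60

# Share of activity days that have no sleep record.
share_without_sleep <- rows_without_sleep / total_rows

round(hours_asleep, 2)               # 6.99
round(hours_sedentary, 1)            # 16.5
round(100 * share_without_sleep, 1)  # 56.4
```

More than half of the activity days carry no sleep record, so the sleep averages describe a subset of the days.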
Based on the visualizations, we can see some trends in smart device usage, such as:
Users tend to be more active during weekdays than weekends.
TotalSteps and TotalDistance are positively correlated, meaning that users who take more steps also tend to travel farther distances.
SedentaryMinutes and VeryActiveMinutes are negatively correlated, meaning that users who are more active tend to be less sedentary.
There is a wide range of variability in the amount of sleep users get each night.
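Trends like these can be quantified with cor and a weekday grouping. The sketch below uses only the six rows printed by head(activity) above, which belong to a single user, so it shows the mechanics rather than supporting any conclusion. The column name `IsWeekend` is introduced here for illustration.

```r
# The six rows shown by head(activity), with dates already converted.
demo <- data.frame(
  Date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-14",
                   "2016-04-15", "2016-04-16", "2016-04-17")),
  TotalSteps = c(13162, 10735, 10460, 9762, 12669, 9705),
  TotalDistance = c(8.50, 6.97, 6.74, 6.28, 8.16, 6.48)
)

# Pearson correlation between steps and distance.
steps_distance_cor <- cor(demo$TotalSteps, demo$TotalDistance)

# ISO weekday number (1 = Monday, 7 = Sunday) avoids locale-dependent day names.
demo$IsWeekend <- as.integer(format(demo$Date, "%u")) >= 6

# Mean steps for weekdays versus weekends.
mean_steps <- aggregate(TotalSteps ~ IsWeekend, data = demo, FUN = mean)

steps_distance_cor
mean_steps
```

Running the same calls on the full all_activity table would give the figures that the trend statements rely on.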
Use social media platforms to create a community of users who can share their progress, tips and experiences with the device. This can create a sense of accountability and motivation for users.
Offer discounts and promotions to customers who reach certain activity goals, or who consistently use the device over a certain period of time. This can incentivize customers to continue using the device and achieve their fitness goals.
Partner with fitness influencers or athletes to showcase the device and its benefits to a wider audience. This can help establish credibility and trust with potential customers.
Use targeted online advertising to reach potential customers who have shown an interest in fitness or health-related products. This can help increase awareness of the device and drive sales.
Provide customers with personalized insights and recommendations based on their activity data, such as suggesting specific workouts or tips to improve sleep quality. This can help create a more engaging and personalized experience for users.
by Osbaldo Albornoz