The World Happiness Report is a publication that ranks countries based on their happiness levels. The United Nations Sustainable Development Solutions Network publishes the report. The rankings are determined by the responses to the Gallup World Poll, an annual survey of individuals in over 150 countries, and these individual responses are turned into a score on a scale of 0 to 10. The measured variables include real GDP per capita, social support, life expectancy, freedom to choose, generosity, and perceptions of corruption. The report also includes an analysis of the data, focusing on how happiness is affected by factors such as economic and social development, health, and good governance. I looked at the 2017 report as it contained the most data.
Based on data collected in the 2017 World Happiness Report, which features are most impactful in determining a happiness score, and how much do these correlations differ across regions?
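#packages used below: countrycode(), corrplot(), ggplot(), and vif()/avPlots() from car
library(countrycode); library(corrplot); library(ggplot2); library(car)
#add dataset called 2017.csv in the zipfile
X2017 = read.csv("2017.csv")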
This model uses the World Happiness Report 2017 dataset, which is called `X2017` in this markdown file (or the local file `2017.csv` from the attached CSV file). I needed to clean up the dataset given to us to make it easier to understand. I first removed `Whisker.high`, `Whisker.low`, and `Dystopia.Residual`, which are unnecessary to my data analysis. I also renamed multiple variables to simplify them into a single word or phrase (for example, `Economy..GDP.per.Capita.` to `Economy`). The final piece of cleanup is categorizing the country variable into separate regions to see if certain regions have different correlations. I named the cleaned-up data `newhappiness`, so let us look at the differences between the two datasets.
str(X2017)
## 'data.frame': 155 obs. of 12 variables:
## $ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
## $ Happiness.Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness.Score : num 7.54 7.52 7.5 7.49 7.47 ...
## $ Whisker.high : num 7.59 7.58 7.62 7.56 7.53 ...
## $ Whisker.low : num 7.48 7.46 7.39 7.43 7.41 ...
## $ Economy..GDP.per.Capita. : num 1.62 1.48 1.48 1.56 1.44 ...
## $ Family : num 1.53 1.55 1.61 1.52 1.54 ...
## $ Health..Life.Expectancy. : num 0.797 0.793 0.834 0.858 0.809 ...
## $ Freedom : num 0.635 0.626 0.627 0.62 0.618 ...
## $ Generosity : num 0.362 0.355 0.476 0.291 0.245 ...
## $ Trust..Government.Corruption.: num 0.316 0.401 0.154 0.367 0.383 ...
## $ Dystopia.Residual : num 2.28 2.31 2.32 2.28 2.43 ...
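#renaming variables
#note: the 11 names are assigned by position to columns 1-11 of 12; column 7 (Family) gets the label Life.Expectancy, and each later label lands one column early
colnames(X2017)[1:11]=c("Country", "Happiness.Rank","Happiness.Score","Whisker.high", "Whisker.low", "Economy", "Life.Expectancy", "Freedom", "Generosity", "Trust in Gov","Dystopian.Residual")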
newhappiness<-X2017
#creating regions
newhappiness$Regionnc = countrycode(sourcevar = newhappiness$Country, origin = "country.name",destination = "region")
#getting rid of the whisker variables and the column still named Dystopia.Residual
newhappiness$Whisker.high=NULL
newhappiness$Whisker.low=NULL
newhappiness$Dystopia.Residual=NULL
newhappiness$Region <- as.factor(newhappiness$Regionnc)
newhappiness$Regionnc=NULL
str(newhappiness)
## 'data.frame': 155 obs. of 10 variables:
## $ Country : chr "Norway" "Denmark" "Iceland" "Switzerland" ...
## $ Happiness.Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Happiness.Score : num 7.54 7.52 7.5 7.49 7.47 ...
## $ Economy : num 1.62 1.48 1.48 1.56 1.44 ...
## $ Life.Expectancy : num 1.53 1.55 1.61 1.52 1.54 ...
## $ Freedom : num 0.797 0.793 0.834 0.858 0.809 ...
## $ Generosity : num 0.635 0.626 0.627 0.62 0.618 ...
## $ Trust in Gov : num 0.362 0.355 0.476 0.291 0.245 ...
## $ Dystopian.Residual: num 0.316 0.401 0.154 0.367 0.383 ...
## $ Region : Factor w/ 7 levels "East Asia & Pacific",..: 2 2 2 2 2 2 5 1 2 1 ...
summary(newhappiness[1:8], title = "Statistics Summary of Variables")
## Country Happiness.Rank Happiness.Score Economy
## Length:155 Min. : 1.0 Min. :2.693 Min. :0.0000
## Class :character 1st Qu.: 39.5 1st Qu.:4.505 1st Qu.:0.6634
## Mode :character Median : 78.0 Median :5.279 Median :1.0646
## Mean : 78.0 Mean :5.354 Mean :0.9847
## 3rd Qu.:116.5 3rd Qu.:6.101 3rd Qu.:1.3180
## Max. :155.0 Max. :7.537 Max. :1.8708
## Life.Expectancy Freedom Generosity Trust in Gov
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.043 1st Qu.:0.3699 1st Qu.:0.3037 1st Qu.:0.1541
## Median :1.254 Median :0.6060 Median :0.4375 Median :0.2315
## Mean :1.189 Mean :0.5513 Mean :0.4088 Mean :0.2469
## 3rd Qu.:1.414 3rd Qu.:0.7230 3rd Qu.:0.5166 3rd Qu.:0.3238
## Max. :1.611 Max. :0.9495 Max. :0.6582 Max. :0.8381
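Renaming by position only works when the vector of new names lines up with every column, so one alternative is to rename by name. The following is a minimal sketch on a made-up data frame: `rename_by_name`, `new_names`, and the toy values are hypothetical, and the old column names are the ones shown in the `str(X2017)` output above.

```r
# Toy data frame that reuses some of the original column names (values are made up)
toy <- data.frame(
  Country = c("A", "B"),
  Economy..GDP.per.Capita. = c(1.6, 1.4),
  Family = c(1.5, 1.4),
  Health..Life.Expectancy. = c(0.80, 0.79),
  Trust..Government.Corruption. = c(0.3, 0.4)
)

# Hypothetical lookup table: old name -> new name
new_names <- c(
  "Economy..GDP.per.Capita." = "Economy",
  "Health..Life.Expectancy." = "Life.Expectancy",
  "Trust..Government.Corruption." = "Trust.in.Gov"
)

# Rename only the columns that appear in the lookup; all others keep their names
rename_by_name <- function(df, lookup) {
  hits <- names(df) %in% names(lookup)
  names(df)[hits] <- unname(lookup[names(df)[hits]])
  df
}

renamed <- rename_by_name(toy, new_names)
print(names(renamed))
```

Because each new name is tied to an old name rather than to a column index, a column that is missing from the lookup (here `Family`) keeps its own label and values.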
I have now cleaned the dataset for my analysis by removing unnecessary variables and categorizing the countries into regions, so I can begin the data analysis of the World Happiness Report 2017.
The variables in the dataset that I will be looking at are:
- `Country`: A categorical variable for each country name.
- `Happiness.Rank`: Rank of the country by happiness score.
- `Happiness.Score`: A metric measured in 2017 by asking the sampled people: “How would you rate your happiness on a scale of 0 to 10 where 10 is the happiest.”
- `Economy`: The extent to which GDP contributes to the calculation of the happiness score.
- `Life.Expectancy`: The extent to which life expectancy contributed to the calculation of the happiness score.
- `Freedom`: The extent to which freedom contributed to the calculation of the happiness score.
- `Generosity`: The extent to which generosity contributes to happiness score.
- `Trust in Gov`: The extent to which perception of trust in government contributes to happiness score.
- `Region`: What region the country falls within.

When I cleaned the dataset, my initial thought was to see what variables correlate with `Happiness.Score`. So, I created a correlation matrix of those variables and `Happiness.Score`.
#Correlation Matrix for 2017 World Happiness Report
corPearson = cor(newhappiness[3:8])
corrplot(corPearson)
This correlation plot highlights that the three variables most correlated with `Happiness.Score` are the `Economy`, `Life.Expectancy`, and `Freedom` scores. `Economy` is the most highly correlated, followed by `Freedom` and `Life.Expectancy`. While this correlation matrix was made for the full dataset, I wanted to look deeper into the data to see if this message would change depending on the specific region. Countries may have different sources of happiness for a variety of reasons. One reason is that countries may have different cultural values and beliefs, affecting what people consider essential for their happiness. For example, in some cultures, family and community connections may be more important for happiness than individual achievement, while personal success and achievement may be more valued in other cultures. I decided to see if specific regions had different correlations than the entire dataset.
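Reading a ranking off a correlation plot can also be done numerically. The sketch below uses simulated data rather than the report data; `toy`, `x1`, `x2`, and `score` are hypothetical names, and the only point is the pattern of pulling one row out of `cor()` and sorting it by absolute value.

```r
set.seed(1)
n <- 200

# Simulated predictors and a response that depends more on x1 than on x2
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$score <- 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

# Pearson correlation matrix of all columns
cor_matrix <- cor(toy)

# Correlations of each predictor with the response, strongest first
score_cors <- cor_matrix["score", setdiff(colnames(cor_matrix), "score")]
ranked <- sort(abs(score_cors), decreasing = TRUE)
print(round(ranked, 2))
```

Sorting on the absolute value keeps strongly negative correlations from being ranked below weak positive ones.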
Before I began to study the individual correlations of specific regions, I wanted to look at the difference in `Happiness.Score` per region. This would give me a bigger picture of each region’s `Happiness.Score` distribution. Keeping this in the back of my mind, I could then look at how the correlations of specific scores like `Economy`, `Freedom`, and `Life.Expectancy` with `Happiness.Score` differ by region.
#Happiness Score v Region boxplot
ggplot(newhappiness, aes(x=Region,y=Happiness.Score, fill=Region)) +
geom_boxplot(width=0.5, outlier.shape = NA) +
geom_jitter(width = 0.1, alpha=0.5) +
labs(title="Happiness Score vs Region", y = "Happiness Score", x="Region") +
theme(legend.position = 'none') +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
#creating subsets for each region
EUCAShappiness<- subset(newhappiness, Region=="Europe & Central Asia")
NAhappiness<-subset(newhappiness, Region=="North America")
EASPhappiness<-subset(newhappiness, Region=="East Asia & Pacific")
LAChappiness<-subset(newhappiness, Region=="Latin America & Caribbean")
MENAhappiness<-subset(newhappiness, Region=="Middle East & North Africa")
SAhappiness<-subset(newhappiness, Region=="South Asia")
SSAhappiness<-subset(newhappiness, Region=="Sub-Saharan Africa")
This boxplot gives us information on the distribution of `Happiness.Score`. The region with the highest reported `Happiness.Score` was North America, with a mean of 7.1545, and the lowest region was Sub-Saharan Africa, with a mean of 4.1119487. I created a table containing the means of each region to contextualize these differences better. Because it overlays the individual points with `geom_jitter`, the boxplot also gives us information about the number and spread of `Happiness.Score` observations per region. For example, North America and South Asia have very small samples compared to the other regions: North America contained only two data points, and South Asia included seven.
Region | Mean Happiness Score |
---|---|
East Asia & Pacific | 5.7523125 |
Europe & Central Asia | 5.93278 |
Latin America & Caribbean | 5.9578182 |
Middle East & North Africa | 5.4237368 |
North America | 7.1545 |
South Asia | 4.6284286 |
Sub-Saharan Africa | 4.1119487 |
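A table of regional means like the one above can be built with `tapply`, and adding the group size next to each mean makes small samples visible. This is a minimal sketch on made-up values; `toy`, `region_summary`, and the two region labels are hypothetical and are not the report data.

```r
# Made-up scores for two hypothetical regions
toy <- data.frame(
  Region = c("North", "North", "South", "South", "South"),
  Happiness.Score = c(7.0, 7.4, 4.0, 4.5, 5.0)
)

# Mean score and number of countries per region
region_mean <- tapply(toy$Happiness.Score, toy$Region, mean)
region_n <- tapply(toy$Happiness.Score, toy$Region, length)

region_summary <- data.frame(
  Region = names(region_mean),
  Mean = as.numeric(region_mean),
  N = as.integer(region_n)
)

# Highest mean first
region_summary <- region_summary[order(-region_summary$Mean), ]
print(region_summary)
```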
Now that I have taken a brief look at the dataset, I can look toward answering the SMART question: “based on data collected in the 2017 World Happiness Report, which features are most impactful in determining a happiness score, and how much do these correlations differ across regions?”
#corrplot for each region
EUCAScorplot<-cor(EUCAShappiness[3:8])
corrplot(EUCAScorplot, type="upper", method = 'number')
In the Europe and Central Asia correlation matrix, `Happiness.Score` was strongly correlated with `Economy` and `Generosity` scores, while the remaining scores were moderately correlated with `Happiness.Score`.
NAcorplot<-cor(NAhappiness[3:8])
corrplot(NAcorplot, type="upper", method = 'number')
The numbers produced by this correlation matrix are uninformative because the sample contains only two countries. Any correlation computed from two points is exactly +1 or -1, so the actual relationship between particular variables and `Happiness.Score` cannot be seen.
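The two-country limitation can be checked directly. In this small sketch the values are made up and the variable names are hypothetical; it only shows that two observations always lie on a straight line, so `cor()` returns a perfect correlation whichever way the line slopes.

```r
# Two made-up observations for a predictor and two possible responses
x <- c(1.0, 2.0)
y_rising <- c(3.0, 5.0)
y_falling <- c(5.0, 3.0)

r_rising <- cor(x, y_rising)    # line through two points sloping up
r_falling <- cor(x, y_falling)  # line through two points sloping down

print(c(rising = r_rising, falling = r_falling))
```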
EASPcorplot<-cor(EASPhappiness[3:8])
corrplot(EASPcorplot, type="upper", method = 'number')
In the East Asia & Pacific correlation matrix, `Happiness.Score` was correlated most with `Economy` and `Life.Expectancy`. `Happiness.Score` was moderately correlated with `Freedom` scores and was least correlated with `Trust in Gov` and `Generosity` scores.
LACcorplot<-cor(LAChappiness[3:8])
corrplot(LACcorplot, type="upper", method = 'number')
In the Latin America and the Caribbean correlation matrix, `Happiness.Score` was strongly correlated with `Economy` and `Freedom` scores. `Generosity` and `Life.Expectancy` are moderately correlated with `Happiness.Score`.
MENAcorplot<-cor(MENAhappiness[3:8])
corrplot(MENAcorplot, type="upper", method = 'number')
In the Middle East and North Africa correlation matrix, `Happiness.Score` was strongly correlated with the `Economy`, `Life.Expectancy`, and `Generosity` scores. These scores were all similarly correlated with `Happiness.Score`, with correlations between 0.80 and 0.81. `Happiness.Score` was least correlated with `Trust in Gov` scores.
SAcorplot<-cor(SAhappiness[3:8])
corrplot(SAcorplot, type="upper", method = 'number')
South Asia had the same issue as North America: with only seven countries, the sample is too small, so I do not feel confident in making conclusions about the specific correlations.
SSAcorplot<-cor(SSAhappiness[3:8])
corrplot(SSAcorplot, type="upper", method = 'number')
There were no strong correlations for `Happiness.Score` in Sub-Saharan Africa. However, `Economy` and `Life.Expectancy` scores were moderately correlated with `Happiness.Score`.
The correlation of specific scores with `Happiness.Score` varied by region. I also saw noticeable trends when I looked at each matrix. First, the variable most commonly strongly correlated with `Happiness.Score` was `Economy`. However, other variables might be strongly correlated depending on the region. I created a table of the variables strongly and moderately correlated with `Happiness.Score` per region.
Region | Strongly Correlated | Moderately Correlated |
---|---|---|
East Asia & Pacific | Economy and Life Expectancy | Freedom |
Europe & Central Asia | Economy and Generosity | Trust in Gov, Freedom, & Life Expectancy |
Latin America & Caribbean | Economy and Freedom | Trust in Gov, Generosity, & Life Expectancy |
Middle East & North Africa | Economy, Life Expectancy, Freedom, & Generosity | None |
North America | N/A | N/A |
South Asia | N/A | N/A |
Sub-Saharan Africa | None | Economy, Life Expectancy, & Generosity |
This table shows that `Economy`, `Freedom`, and `Life.Expectancy` are the most commonly strongly or moderately correlated variables. This statement aligns with the initial correlation plot I looked at.
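The region-by-region comparison can also be written as a single loop over regions instead of one subset per region. The sketch below is only an illustration on simulated data: `make_region`, `label_strength`, the minimum sample size `min_n`, and the 0.7 and 0.4 cutoffs are all assumptions, since the analysis above does not state numeric thresholds for "strong" or "moderate".

```r
set.seed(42)

# Simulate one hypothetical region with a chosen Economy effect
make_region <- function(name, n, slope) {
  economy <- runif(n)
  data.frame(
    Region = name,
    Economy = economy,
    Happiness.Score = slope * economy + rnorm(n, sd = 0.1)
  )
}

toy <- rbind(
  make_region("A", 30, 3),
  make_region("B", 30, 0.2),
  make_region("C", 2, 1)   # too small to interpret
)

min_n <- 5  # assumed minimum sample size for reporting a correlation

# Correlation of Economy with Happiness.Score within each region
region_cor <- vapply(split(toy, toy$Region), function(d) {
  if (nrow(d) < min_n) return(NA_real_)
  cor(d$Economy, d$Happiness.Score)
}, numeric(1))

# Assumed cutoffs for labelling the strength of a correlation
label_strength <- function(r, strong = 0.7, moderate = 0.4) {
  if (is.na(r)) return("N/A")
  if (abs(r) >= strong) "strong" else if (abs(r) >= moderate) "moderate" else "weak"
}

labels <- vapply(region_cor, label_strength, character(1))
print(labels)
```

Writing the cutoffs down in one place makes the strong/moderate labels reproducible, and the size guard produces the same "N/A" entries used for the two smallest regions in the table.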
Now that I found the differences between each region and which variables are most correlated with `Happiness.Score` per region, I looked at the linear relationship between `Happiness.Score` and the variables I found to be most correlated.
plot1<- ggplot(newhappiness, aes(x=Freedom, y=Happiness.Score))+
geom_point()+
geom_smooth(method = 'lm')+
labs(title="Scatter plot of happiness score vs freedom score", x = "Freedom Score", y = "Happiness Score")
suppressMessages(print(plot1))
I first examined the scatter plot between `Freedom` scores and `Happiness.Score`. There appeared to be a positive relationship between the two variables.
plot2<-ggplot(newhappiness, aes(x=Economy, y=Happiness.Score))+
geom_point()+
geom_smooth(method = 'lm')+
labs(title="Scatter plot of happiness score vs economy score", x = "Economy Score", y = "Happiness Score")
suppressMessages(print(plot2))
The relationship between `Happiness.Score` and `Economy` was also positive, and it looked about as strong as the relationship with `Freedom` scores. This is consistent with my correlation matrix analysis, which showed that `Happiness.Score` correlated most with `Economy` and `Freedom`.
plot3<-ggplot(newhappiness, aes(x=Life.Expectancy, y=Happiness.Score))+
geom_point()+
geom_smooth(method = 'lm')+
labs(title="Scatter plot of happiness score vs life expectancy score", x = "Life Expectancy Score", y = "Happiness Score")
suppressMessages(print(plot3))
The relationship between `Life.Expectancy` and `Happiness.Score` was also positive, though the slope was not as steep as in the last two scatterplots. This was supported by the correlation matrix, which placed `Life.Expectancy` as the least correlated of the three.
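When comparing scatterplots, it helps to remember that the fitted slope and the correlation are related but not the same quantity: for a simple regression the slope equals r times sd(y)/sd(x), so it changes with the units of x while r does not. The following sketch uses simulated data with hypothetical names (`x`, `y`, `x_scaled`) to illustrate that relationship.

```r
set.seed(3)

# Simulated predictor and response
x <- runif(100)
y <- 2 + 1.5 * x + rnorm(100, sd = 0.3)

# Slope from a simple linear regression, and the Pearson correlation
slope <- unname(coef(lm(y ~ x))["x"])
r <- cor(x, y)

# The slope implied by the correlation and the two standard deviations
implied_slope <- r * sd(y) / sd(x)

# Re-expressing x in different units changes the slope but not r
x_scaled <- x * 10
slope_scaled <- unname(coef(lm(y ~ x_scaled))["x_scaled"])
r_scaled <- cor(x_scaled, y)

print(c(slope = slope, implied = implied_slope, slope_scaled = slope_scaled))
print(c(r = r, r_scaled = r_scaled))
```

This is why slopes from predictors measured on different scales are hard to compare by eye, and why standardizing predictors is a common next step.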
Once I saw what was most correlated with `Happiness.Score` per region and the linear relationships between the most correlated variables and `Happiness.Score`, I asked myself which of these had the largest effect on `Happiness.Score`. To answer this question, I decided to test the feature importance of those variables. To do this, I took the z-scores of the most correlated variables. Taking z-scores is a standard way of standardizing variables: a z-score indicates how many standard deviations a given value is above or below the variable’s mean. When one takes the z-scores of the independent variables and runs a regression, the coefficients of the regression model represent the standardized effects of the variables on the response variable. Each coefficient indicates how much the response variable is expected to change for a one-standard-deviation change in the predictor while holding all other variables constant. Because every predictor is then on the same scale, the coefficients can be compared directly with each other, which also makes the regression results easier to interpret.
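The effect of standardizing on coefficient comparisons can be seen in a small simulation. This is a sketch under stated assumptions, not the analysis itself: the data are simulated, and `a`, `b`, `y`, and the helper `z` are hypothetical names. The helper wraps `scale()` in `as.numeric()` because `scale()` returns a one-column matrix.

```r
set.seed(7)
n <- 200

# Two simulated predictors on very different scales
toy <- data.frame(a = rnorm(n, mean = 10, sd = 5), b = rnorm(n, mean = 0, sd = 0.1))
toy$y <- 0.2 * toy$a + 5 * toy$b + rnorm(n, sd = 0.5)

# z-score helper: subtract the mean, divide by the standard deviation
z <- function(x) as.numeric(scale(x))
toy$a_z <- z(toy$a)
toy$b_z <- z(toy$b)

fit_raw <- lm(y ~ a + b, data = toy)      # coefficients per raw unit
fit_std <- lm(y ~ a_z + b_z, data = toy)  # coefficients per standard deviation

print(round(coef(fit_raw), 3))
print(round(coef(fit_std), 3))
```

In the raw fit `b` has the larger coefficient only because its units are small; once both predictors are z-scored, `a_z` has the larger coefficient. Every predictor has to go through the same standardization for this comparison to hold, and when they all do, the intercept equals the mean of the response.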
newhappiness$economy_z<-scale(newhappiness$Economy)
newhappiness$freedom_z<-scale(newhappiness$Freedom)
newhappiness$life.expectancy_z<-(newhappiness$Life.Expectancy) #note: no scale() call here, so this column keeps its original units
#linear regression modeling
Economylifeexpectancyfreedomonscore <- lm(Happiness.Score ~ economy_z+freedom_z+life.expectancy_z , data = newhappiness)
Economylifeexpectancyfreedomonscoresum <-summary(Economylifeexpectancyfreedomonscore, title = "Summary of the linear model for feature importance")
Economylifeexpectancyfreedomonscoresum
##
## Call:
## lm(formula = Happiness.Score ~ economy_z + freedom_z + life.expectancy_z,
## data = newhappiness)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.49825 -0.35335 -0.04934 0.38729 1.89215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.71633 0.26390 14.082 < 2e-16 ***
## economy_z 0.36362 0.09237 3.936 0.000126 ***
## freedom_z 0.33581 0.08474 3.963 0.000114 ***
## life.expectancy_z 1.37748 0.21868 6.299 3.11e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5636 on 151 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7517
## F-statistic: 156.4 on 3 and 151 DF, p-value: < 2.2e-16
vif(Economylifeexpectancyfreedomonscore)
## economy_z freedom_z life.expectancy_z
## 4.136195 3.480671 1.912949
I also looked at the variance inflation factor (VIF) of the variables within the model to check that multicollinearity is not a concern for my data. All three VIF values are below 5, a common rule-of-thumb threshold, so I can conclude that collinearity is not a major concern in this model.
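For intuition about what a VIF measures, the textbook definition for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the other predictors. The sketch below computes this by hand on simulated data; `vif_manual` and the `x1`, `x2`, `x3` columns are hypothetical and are meant as an illustration of the formula, not as a replacement for the `vif` call above.

```r
set.seed(11)
n <- 200

# x2 is built from x1, so the two are collinear; x3 is independent of both
toy <- data.frame(x1 = rnorm(n), x3 = rnorm(n))
toy$x2 <- toy$x1 + rnorm(n, sd = 0.5)

# VIF for each predictor: regress it on the others and use 1 / (1 - R^2)
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  }, numeric(1))
}

v <- vif_manual(toy, c("x1", "x2", "x3"))
print(round(v, 2))
```

A VIF near 1 means a predictor shares almost no variance with the others, while larger values mean its coefficient is estimated less precisely because of overlap with the other predictors.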
#linear regression graph
avPlots(Economylifeexpectancyfreedomonscore)
These plots show the relationship between each specified variable in the regression and `Happiness.Score`, while holding all other variables constant.
To see which variables are most important in changing the `Happiness.Score`, I looked at the coefficients of the standardized variables. All the included variables are statistically significant (p-value < 0.05), so I can analyze each coefficient. The highest coefficient was `Life.Expectancy` (1.38). After `Life.Expectancy`, `Economy` (0.36) was the second most important, followed closely by `Freedom` (0.34). One caveat is that `life.expectancy_z` was created without `scale()`, so its coefficient is per raw score unit rather than per standard deviation and is not strictly comparable with the other two. This result caught me off guard because I thought the most correlated variable, `Economy`, would be the most important in changing the `Happiness.Score`. However, the result makes sense: people with a longer life expectancy can be expected to have more time to pursue the things that make them happy, such as hobbies, careers, and relationships. Additionally, people with a longer life expectancy are more likely to have good health, contributing to happiness. Good health allows people to engage in the activities and experiences that bring them joy and can also reduce the stress and worry that can come with poor health and negatively impact happiness. A study of happiness and life expectancy published in the Proceedings of the National Academy of Sciences examined the relationship between the two. The report stated, “People who had higher levels of happiness had a longer life span” (Lee, 2019). Even though this study looked at the inverse relationship, it still applied to my analysis. As I looked at the relationship between the two, I found that the happiest countries in my dataset had high life expectancy scores.
My SMART question set out to find how the correlations between the variables and `Happiness.Score` differ depending on the region. I discovered that certain regions had more strongly correlated variables than others, but overall trends were still seen across the correlation matrices. `Economy`, `Life.Expectancy`, and `Freedom` were the dataset’s most commonly strongly correlated variables. Knowing the most frequently correlated variables, I looked to see which were most important in changing the `Happiness.Score`. By standardizing these variables and running a linear regression, I found that `Life.Expectancy` is the most important of the three variables, followed by `Economy` and `Freedom`.
These results are supported by outside research on happiness. In an article on the 2021 World Happiness Report results, Lyndsey Matthews identified the variables life expectancy and economy “to be highly predictive of happiness” (Matthews, 2022). She also found that the happiest countries always have strong scores in these variables and high levels of freedom. In addition, another report on the effect of government policy on happiness from the London School of Economics reaffirmed Matthews’s statement. It claimed that a country’s happiness is most affected by “their incomes and employment” (Stewart, 2020). This suggests that a country with low income and employment levels will tend to be less happy, which is supported in my analysis as I found `Economy` among the three most important variables. Thus, these outside sources supported the results of my data analysis.
United Nations Sustainable Development Solutions Network. (n.d.). World happiness [Data set]. Kaggle. Retrieved from https://www.kaggle.com/datasets/unsdsn/world-happiness?select=2017.csv
Sustainable Development Solutions Network. (2017). World Happiness Report [Data set]. Retrieved from https://worldhappiness.report/ed/2017/
Lee, L. O., James, P., Zevon, E. S., Kim, E. S., Trudel-Fitzgerald, C., Spiro, A., 3rd, Grodstein, F., & Kubzansky, L. D. (2019). Optimism is associated with exceptional longevity in 2 epidemiologic cohorts of men and women. Proceedings of the National Academy of Sciences of the United States of America, 116(37), 18357–18362. https://doi.org/10.1073/pnas.1900712116
Matthews, L. (2022, August 25). The world’s happiest country is all about reading, coffee, and saunas. AFAR Media. Retrieved December 12, 2022, from https://www.afar.com/magazine/the-worlds-happiest-country-is-all-about-reading-coffee-and-saunas
Stewart, K. (2020, March 17). We Can Increase Happiness Through Public Policy (And In Our Jobs and Private Lives Too) [Blog post]. Retrieved from https://blogs.lse.ac.uk/businessreview/2020/03/17/we-can-increase-happiness-through-public-policy-and-in-our-jobs-and-private-lives-too/
1.2 Social Good
The World Happiness Report can help policymakers and others interested in promoting well-being in their countries. By providing a ranking of countries based on their happiness levels, the report can be used to identify areas where improvements can be made. For example, if a country ranks low in the report, policymakers can use the data and analysis provided to determine what factors contribute to the low ranking and take action to address those factors. Additionally, the report can be used to raise awareness about the importance of promoting well-being and happiness in society. By highlighting the role that factors such as economic and social development, health, and good governance play in determining happiness, the report can promote a broader understanding of what contributes to well-being and inspire action to improve those factors.