class: center, middle, inverse, title-slide .title[ # Data Visualization for Economic Historians ] .author[ ### Jonathan Jayes ] .institute[ ### Lund University ] .date[ ### 2023/02/01 (updated: 2022-12-19) ] --- class: inverse, center, bottom background-image: url(images/LU_Ekonomiho╠êgskolan_BLACK_ENG.png) ## Follow along with the slides - URL in Chat --- # Purpose .pull-left[ ### 1. Get you excited about drawing maps ### 2. Show some tips and tricks that make your maps pop This is not about the software but the theory behind communicating well with maps ] .pull-right[ ![](images/hans-rosling.jpg) .footnote[ Photo from [Hans Rosling's TED Talk](https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen) ] ] --- # Structure .pull-left[ ### 1. Exploratory Data Analysis ### 2. Why maps ### 3. Polishing Figures for Publication ### 4. Recreating Published Figures ] .pull-right[ ![](images/structure.jpg) ] --- class: center, top # Everything is a story ### /ˈstɔːri/ ![](images/abed.PNG) --- class: center, top ![](images/story-circle.jpg) ###[Dan Harmon's Story circle](https://youtu.be/UdxX_Kljrq8) --- class: center, top ## Our Own Story Circle .pull-left[ ### Figure here ] .pull-right[ 1. Excel as zone of comfort. 2. You want to make poppin' maps 3. You explore some new data 4. You learn some tricks 5. You get what you wanted 6. It's been a lot, time for a break 7. You return to your own data, with a fresh perspective ] --- # 1. The zone of comfort Our stomping ground **MS Excel**: ![](images/excel-comfort.PNG) --- # 1. The zone of comfort **MS Excel** has useful features that suggest how to visualize your data ![](images/excel-reccomend.PNG) --- # Why leave? .center[ ![](images/shire.gif) ] --- class: inverse, center, middle # 2. You want something --- # You want to learn about your data by .panelset[ .panel[.panel-name[Counting things] ```r df <- read_rds("data/baptism_data.rds") df %>% count(female_profession, sort = T) ``` ``` ## # A tibble: 148 × 2 ## female_profession n ## <chr> <int> ## 1 <NA> 595 ## 2 Domestic duties 228 ## 3 Machinist 154 ## 4 Domestic service 46 ## 5 Domestic servant 41 ## 6 Household duties 41 ## 7 Clerk 30 ## 8 School Teacher 26 ## 9 Factory worker 24 ## 10 Shorthand Typist 21 ## # … with 138 more rows ``` ] .panel[.panel-name[Drawing maps] <img src="index_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" /> ] .panel[.panel-name[Making plots] ![](index_files/figure-html/unnamed-chunk-4-1.svg)<!-- --> ] ] --- # You want to make beautiful charts with .panelset[ .panel[.panel-name[Informative colours]
] .panel[.panel-name[Interactive elements]
] .panel[.panel-name[Engaging animations] <center> <iframe width="500" height="500" src="https://www.youtube.com/embed/D_Aakjiadm4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> </center> ] ] --- class: inverse, center, middle # 3. You enter an unfamiliar situation --- ## Where do we start? ] --- ## How can we tell what type of pregnancy we have for each record? -- ### We have the date of birth and date of marriage -- ### We can do some subtraction -- <img src="images/baby.jpg" width="80%" style="display: block; margin: auto;" /> --- ## What's the plan? -- ### Get the date that the parents were married -- ### Get the date that the child was born -- ### If there is fewer than 280 days between marriage and birth, we know the child was most likely a prenuptial pregnancy. --- ### Dealing with dates Dates are difficult: - There are three parts to the [rule determining a leap year](https://r4ds.had.co.nz/dates-and-times.html) - There are many formats to write a date in (e.g. dmy vs mdy) The best resource for dealing with dates in R is in [Chapter 16](https://r4ds.had.co.nz/dates-and-times.html) of [R for Data Science](https://r4ds.had.co.nz/index.html) entitled **Dates and times**. ### We will explore a brief example with our baptism data. --- ## When are the children born? ### The months are recorded differently! .pull-left[ ```r df %>% count(month_baptized, sort = T) %>% head() ``` ``` ## # A tibble: 6 × 2 ## month_baptized n ## <ord> <int> ## 1 Feb 157 ## 2 Dec 153 ## 3 Nov 150 ## 4 Mar 143 ## 5 Apr 126 ## 6 Aug 123 ``` ] .pull-right[ ```r df %>% count(month_of_birth, sort = T) %>% head() ``` ``` ## # A tibble: 6 × 2 ## month_of_birth n ## <chr> <int> ## 1 9 159 ## 2 10 145 ## 3 11 145 ## 4 12 141 ## 5 1 136 ## 6 8 135 ``` ] .footnote[ Note that the months are recorded as characters not numbers, a result of importing them from Excel. ] --- ### When are the children born? ```r library(glue) library(lubridate) df <- df %>% # this glues together three individual variables into # one new variable called `birth_date` mutate(birth_date = glue("{day_of_birth}-{month_of_birth}-{year_of_birth}"), # dmy here means read the date as "day-month-year" birth_date = dmy(birth_date)) df %>% select(christian_name, birth_date) %>% head() ``` ``` ## # A tibble: 6 × 2 ## christian_name birth_date ## <chr> <date> ## 1 Islay Deirdre 1921-02-07 ## 2 William Isaac 1939-02-20 ## 3 Douglas Walter Ellis 1914-01-22 ## 4 Gordon Rhodes 1916-10-22 ## 5 Ronald Overton 1914-01-13 ## 6 Hugh Andrew 1915-11-09 ``` --- ## When are the children baptized? ```r df <- df %>% mutate(baptism_date = glue("{day_baptized}-{month_baptized}-{year_baptized}"), baptism_date = dmy(baptism_date)) ``` ``` ## Warning: 76 failed to parse. ``` ```r df %>% select(christian_name, birth_date, baptism_date) %>% head() ``` ``` ## # A tibble: 6 × 3 ## christian_name birth_date baptism_date ## <chr> <date> <date> ## 1 Islay Deirdre 1921-02-07 1921-04-08 ## 2 William Isaac 1939-02-20 1939-05-14 ## 3 Douglas Walter Ellis 1914-01-22 1914-02-21 ## 4 Gordon Rhodes 1916-10-22 1916-11-12 ## 5 Ronald Overton 1914-01-13 1914-02-08 ## 6 Hugh Andrew 1915-11-09 1915-12-22 ``` --- ### Why did 76 dates fail? ```r df %>% filter(is.na(baptism_date)) %>% select(year_baptized:day_baptized) ``` ``` ## # A tibble: 76 × 3 ## year_baptized month_baptized day_baptized ## <dbl> <ord> <chr> ## 1 1937 Jun <NA> ## 2 NA Sep 30 ## 3 1952 <NA> 14 ## 4 1924 <NA> 5 ## 5 1921 <NA> 18 ## 6 1927 <NA> 4 ## 7 1922 <NA> 5 ## 8 1920 <NA> 24 ## 9 1922 <NA> 1 ## 10 1923 <NA> 5 ## # … with 66 more rows ``` We see a number of NAs ### Data quality issues are important to catch! --- ## When are the parents married? ```r df <- df %>% mutate(married_date = glue("{day_married}-{month_married}-{year_married}"), married_date = dmy(married_date)) df %>% select(christian_name, birth_date, married_date) %>% head() ``` ``` ## # A tibble: 6 × 3 ## christian_name birth_date married_date ## <chr> <date> <date> ## 1 Islay Deirdre 1921-02-07 1916-12-15 ## 2 William Isaac 1939-02-20 1938-11-12 ## 3 Douglas Walter Ellis 1914-01-22 1912-12-18 ## 4 Gordon Rhodes 1916-10-22 1914-08-18 ## 5 Ronald Overton 1914-01-13 1912-04-24 ## 6 Hugh Andrew 1915-11-09 1915-02-17 ``` --- ### How many prenuptial pregnancies do we have? ```r df <- df %>% mutate(days_marriage_to_birth = as.numeric(as.duration(birth_date - married_date), "days"), # if gap between parents' marriage and child's birth # is less than 280 days we classify it as a prenup_preg prenup_preg = ifelse(days_marriage_to_birth < 280, "Prenuptial", "Conventional")) df %>% count(prenup_preg, sort = T) ``` ``` ## # A tibble: 2 × 2 ## prenup_preg n ## <chr> <int> ## 1 Conventional 949 ## 2 Prenuptial 655 ``` --- ### What does the distribution of pregnancies look like by parish? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/baptism-3-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(days_marriage_to_birth, fill = parish_name)) + geom_density(show.legend = F) + geom_vline(xintercept = 280, lty = 2) + facet_wrap(~ parish_name) + scale_fill_brewer(palette = "Dark2") + scale_x_continuous(limits = c(0, 1000)) + theme(axis.text.y = element_blank()) + labs(x = "Days from parents' marriage to birth of child", y = NULL, caption = "Note: dotted line indicates 280 days") ``` ] ] --- ### What does the distribution of pregnancies look like over time? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/baptism-4-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(ggridges) df %>% mutate(decade = year_baptized - year_baptized %% 10, decade = as.factor(str_c(decade, "s"))) %>% select(days_marriage_to_birth, decade) %>% filter(!is.na(decade)) %>% ggplot(aes(days_marriage_to_birth, y = decade, fill = decade)) + geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1., show.legend = F) + geom_vline(xintercept = 280, lty = 2) + scale_fill_brewer(palette = "Dark2") + scale_x_continuous(limits = c(0, 1000)) + labs(x = "Days from parents' marriage to birth of child", y = NULL, caption = "Note: dotted line indicates 280 days") ``` ] ] --- ## Is our story clear? .pull-left[ ![](index_files/figure-html/baptism-4-out-small-1.svg)<!-- --> ] .pull-right[ ### What can we see ## .can-edit[idea] ### What can we not see? ## .can-edit[idea] ] --- ### How many pregnancies of each type do we have over time? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/baptism-5-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(decade = year_baptized - year_baptized %% 10, decade = as.factor(str_c(decade, "s"))) %>% select(prenup_preg, decade) %>% filter(!is.na(decade)) %>% group_by(decade, prenup_preg) %>% summarise(n_obs = n()) %>% ungroup() %>% ggplot(aes(y = decade, x = n_obs, fill = as.factor(prenup_preg))) + geom_col(position = "fill", show.legend = F) + scale_x_continuous(labels = percent_format()) + scale_fill_manual(values = c("#af8dc3", "#7fbf7b")) + labs(y = NULL, x = NULL, title = "Share of <span style='color:#7fbf7b'>Prenuptial</span> and <span style='color:#af8dc3'>Conventional</span> Pregnancies by decade") + theme(plot.title = element_markdown()) ``` ] ] --- ## What are the ages of the parents? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/ages-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x = male_age, y = female_age, colour = prenup_preg)) + geom_abline() + geom_jitter(alpha = .4, show.legend = F) + geom_smooth(se = F, show.legend = F) + scale_colour_manual(values = c("#af8dc3", "#7fbf7b")) + labs(x = "Age of husband", y = "Age of wife", colour = "Type of pregnancy", title = "Partner's age comparison between <span style='color:#7fbf7b'>Prenuptial</span> and<br><span style='color:#af8dc3'>Conventional</span> pregnancies") + theme(plot.title = element_markdown()) ``` ] ] --- ## Do ages differ by type of pregnancy? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/ages-2-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(patchwork) # This allows us to paste the two plots below one another a <- df %>% ggplot(aes(male_age, fill = prenup_preg)) + geom_density(alpha = .5) + scale_fill_brewer(palette = "Pastel2") + scale_x_continuous(limits = c(NA, 40)) + labs(x = "Husband's age", y = NULL, fill = "Type of pregnancy") b <- df %>% ggplot(aes(female_age, fill = prenup_preg)) + geom_density(alpha = .5) + scale_fill_brewer(palette = "Pastel2") + scale_x_continuous(limits = c(NA, 40)) + labs(x = "Wife's age", y = NULL, fill = "Type of pregnancy") patchwork <- (a / b) patchwork + plot_annotation(title = "More younger partners have prenuptial pregnancies than old:") + theme(plot.title = element_markdown()) ``` ] ] --- ## Is the age gap between partners correlated with prenuptial pregnancy? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/ages-3-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(age_gap = male_age - female_age) %>% ggplot(aes(age_gap, fill = prenup_preg)) + geom_density(alpha = .5) + geom_vline(xintercept = 0, lty = 2) + theme(legend.position = "bottom") + labs(x = "Age gap in years\n(husband's age minus wife's age)", y = NULL, title = "There appears to be little difference in partners' age gaps\nbetween conventional and prenuptial pregnancies", fill = "Type of pregnancy", caption = "Note: dotted line represents partners of equal age") ``` ] ] --- ### Dealing with categories #### Often we will have many different categories - For example, in our dataset we have 147 different occupations for women ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) ``` ``` ## # A tibble: 147 × 2 ## female_profession n ## <chr> <int> ## 1 Domestic duties 228 ## 2 Machinist 154 ## 3 Domestic service 46 ## 4 Domestic servant 41 ## 5 Household duties 41 ## 6 Clerk 30 ## 7 School Teacher 26 ## 8 Factory worker 24 ## 9 Shorthand Typist 21 ## 10 Shop assistant 20 ## # … with 137 more rows ``` --- ### It's not possible to put them all on one chart .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/professions-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n)) %>% ggplot(aes(x = n, y = female_profession)) + geom_col() + labs(x = "Number of observations", y = NULL) ``` ] ] --- # What do we like and dislike? .pull-left[ ![](index_files/figure-html/professions-1-out-small-1.svg)<!-- --> ] .pull-right[ ### Improvements: ## .can-edit[idea] ## .can-edit[idea] ] --- ### Dealing with categories We can `filter` for only professions with 7 or more observations. .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/professions-2-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% filter(!is.na(female_profession)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n)) %>% * filter(n > 7) %>% ggplot(aes(x = n, y = female_profession, * fill = str_detect(female_profession, "Domestic"))) + geom_col() + labs(x = "Number of observations", y = NULL, fill = "Contains 'Domestic'", title = "Most common female professions", * caption = "Limited to professions with more than 7 observations") + scale_fill_brewer(palette = "Dark2") + * theme(legend.position = "right") ``` ] ] --- ### Dealing with categories Or we can choose 12 categories and lump with `fct_lump` the remaining categories into "Other". .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/professions-3-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% filter(!is.na(female_profession)) %>% * mutate(female_profession = fct_lump(female_profession, 12)) %>% count(female_profession, sort = T) %>% mutate(female_profession = fct_reorder(female_profession, n), female_profession = fct_relevel(female_profession, "Other")) %>% ggplot(aes(x = n, y = female_profession, * fill = str_detect(female_profession, "Domestic"))) + geom_col() + labs(x = "Number of observations", y = NULL, fill = "Contains 'Domestic'", title = "Most common female professions") + scale_fill_brewer(palette = "Dark2") + * theme(legend.position = "right") ``` ] ] --- ### Dealing with categories Or we can `group_by` more categories, for example, race. .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/professions-4-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% * filter(!is.na(female_profession)) %>% group_by(race) %>% count(female_profession, sort = T) %>% slice_max(n, n = 6, with_ties = F) %>% ungroup() %>% * mutate(female_profession = reorder_within(female_profession, n, race)) %>% ggplot(aes(x = n, y = female_profession, * fill = race)) + geom_col() + * scale_y_reordered() + * facet_wrap(~ race, scales = "free") + labs(x = "Number of observations", y = NULL, fill = "Domestic", * title = "Six most common female professions by race") + scale_fill_brewer(palette = "Spectral") + theme(legend.position = "none") ``` ] ] --- ## Where are the churches? ```r df %>% distinct(parish_address) ``` ``` ## # A tibble: 7 × 1 ## parish_address ## <chr> ## 1 16 Summerley Rd, Kenilworth, Cape Town, 7708, South Africa ## 2 30 Bamford Ave, Athlone, Cape Town, 7760, South Africa ## 3 Victoria Rd, Mowbray, Cape Town, 7700, South Africa ## 4 St James Rd, Sea Point, Cape Town, 8005, South Africa ## 5 2B Papu St, Langa, Cape Town, 7456, South Africa ## 6 25 Jansen St, Cape Town, 7500, South Africa ## 7 3 Fontana St, Brooklyn, Cape Town, 7405, South Africa ``` ## But how can we plot these addresses on a map? --- ## Using the `tidygeocoder` package to get coordinates: ```r library(tidygeocoder) locations <- df %>% select(parish_address) %>% # `distinct` means take only one of each address distinct() %>% tidygeocoder::geocode(address = parish_address, method = "osm") # Here "osm" stands for Open Street Map # The package also supports Google Maps, amongst others ``` [Here is a link](https://github.com/jessecambon/tidygeocoder) to the `tidygeocoder` package. --- ## Using the `tidygeocoder` package to get coordinates: This is what we get back: ```r locations ``` ``` ## # A tibble: 7 × 3 ## parish_address lat long ## <chr> <dbl> <dbl> ## 1 16 Summerley Rd, Kenilworth, Cape Town, 7708, South Africa -34.0 18.5 ## 2 30 Bamford Ave, Athlone, Cape Town, 7760, South Africa -34.0 18.5 ## 3 Victoria Rd, Mowbray, Cape Town, 7700, South Africa -33.9 18.5 ## 4 St James Rd, Sea Point, Cape Town, 8005, South Africa -33.9 18.4 ## 5 2B Papu St, Langa, Cape Town, 7456, South Africa -33.9 18.5 ## 6 25 Jansen St, Cape Town, 7500, South Africa -33.9 18.6 ## 7 3 Fontana St, Brooklyn, Cape Town, 7405, South Africa -33.9 18.5 ``` --- # Where are these points? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/parish-points-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(parish_address = str_replace_all(parish_address, ",", ",\n")) %>% distinct(lat, long, parish_address) %>% ggplot(aes(x = long, y = lat, # tell `geom_text` to display the address as text label = parish_address)) + geom_point() + geom_text() ``` ] ] --- class: inverse, center, middle # Can we do better than this? --- ### .panelset[ .panel[.panel-name[Plot] ![](images/CT-map.PNG) ] .panel[.panel-name[Code] ```r library(ggmap) # create boudning box for map image b_box <- c(left = 18.35, bottom = -34.02, right = 18.65, top = -33.85) # create map CT_map <- get_map(b_box, zoom = 12) ggmap(CT_map) + geom_point(aes(long, lat), data = locations, size = 3, show.legend = F) + geom_text(aes(x = long, y = lat, label = parish_name), vjust = 1.1, data = df %>% distinct(parish_name, long, lat) %>% mutate(parish_name = str_remove(parish_name, "Cape Town,"), parish_name = str_replace(parish_name, ",", "\n"))) + theme_void() + labs(title = "Where are our churches located?") ``` ] ] --- ### Plotting results We get this output from a linear probability model with prenuptial pregnancy as our dependent variable .panelset[ .panel[.panel-name[Plot] ``` ## # A tibble: 11 × 4 ## term estimate std.error statistic ## <chr> <dbl> <dbl> <dbl> ## 1 Female Age -0.0181 0.00345 -5.24 ## 2 Male Age -0.00262 0.00284 -0.922 ## 3 Black 0.437 0.452 0.967 ## 4 White -0.342 0.0429 -7.96 ## 5 St Anne's, Maitland 0.0543 0.0755 0.720 ## 6 St Cyprian's, Langa -0.644 0.459 -1.40 ## 7 St James The Great, Sea Point 0.127 0.0670 1.90 ## 8 St Margaret's, Parow -0.0148 0.0641 -0.231 ## 9 St Mark's, Athlone 0.0116 0.0543 0.213 ## 10 St Peter's, Mowbray 0.0983 0.0490 2.00 ## 11 Year Baptized 0.00101 0.00114 0.888 ``` ] .panel[.panel-name[Code] ```r df %>% mutate(prenup_preg = as.numeric(as.factor(prenup_preg))) %>% # Makes prenup_preg numeric for regression lm(prenup_preg ~ female_age + male_age + race + parish_name + year_baptized, data = .) %>% # Does linear regression tidy(conf.int = T) %>% # Gets confidence intervals with `conf.int = T` filter(term != "(Intercept)") %>% # Remove intercept from regression output mutate(type = case_when( # Create a type variable for colours later str_detect(term, "parish_name") ~ "Parish FE", str_detect(term, "race") ~ "Race", str_detect(term, "age") ~ "Age", TRUE ~ "Time trend" )) %>% mutate(term = str_remove_all(term, "race|parish_name"), # Tidy the names term = str_replace_all(term, "_", " "), term = str_to_title(term)) %>% select(term, estimate, std.error, statistic) # Select the ones you want ``` ] ] --- The coefficient plot (tie-fighter plot) is a great way to complement a table .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/results-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(prenup_preg = as.numeric(as.factor(prenup_preg))) %>% lm(prenup_preg ~ female_age + male_age + race + parish_name + year_baptized, data = .) %>% tidy(conf.int = T) %>% filter(term != "(Intercept)") %>% mutate(type = case_when( str_detect(term, "parish_name") ~ "Parish FE", str_detect(term, "race") ~ "Race", str_detect(term, "age") ~ "Age", TRUE ~ "Time trend" )) %>% mutate(term = str_remove_all(term, "race|parish_name"), term = str_replace_all(term, "_", " "), term = str_to_title(term)) %>% mutate(t_stat = ifelse(abs(statistic) > 2, "Significant", "Insignificant")) %>% # If t-stat greater than 2 ggplot(aes(estimate, y = term, colour = type)) + geom_point() + geom_errorbarh(aes(xmin = conf.low, xmax = conf.high, y = term)) + # `geom_errorbarh` gives horizontal error bars geom_vline(xintercept = 0, lty = 2) +labs(x = NULL, y = NULL, colour = "Var type", title = "Correlates of prenuptial pregnancy") + facet_wrap(~ t_stat, nrow = 2, scales = "free_y") # facet wrap gives a panel for insig. and sig. results ``` ] ] --- class: inverse, center, middle # 5. You get what you wanted --- # What charts to use -- ## Choosing charts for exploratory data analysis isn't hard -- ### Stick to the winners: - line charts - scatter plots - column charts - density plots -- ### Sometimes nice: - small multiples - simple maps --- class: inverse, center, middle # 6. You pay a heavy price --- <iframe width="800" height="450" src="https://www.youtube.com/embed/5Zg-C8AAIGg?controls=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- ## Polishing figures for publication -- ### Maps ### Time on a third axis ### Giving context with colour ### Interactivity ### Reproducing figures for publication --- # Coffee rating map .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/coffee-map-demo-0-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/coffee_data.rds") ggplot(data = df) + borders() + geom_sf(aes(fill = mean_country_rating, geometry = geometry)) + scale_fill_viridis_c(trans = "sqrt") + theme_void() + theme(legend.position = "bottom") + coord_sf(ylim = c(-50, 80)) + facet_wrap(~ species, nrow = 2) + labs(title = "Average Coffee Rating", subtitle = "By country and species", fill = "Average Rating / 100", caption = "Data: James LeDoux - https://github.com/jldbc/coffee-quality-database", x = "", y = "") ``` ] ] --- ## What are the problems with this map? -- - Lots of missing data -- - Only some countries produce coffee, very few produce Robusta beans -- - **Crucially, we cannot see the countries with small land area** --- ### What shall we do about this? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/coffee-map-demo-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(tidytext) df %>% select(-geometry) %>% mutate(country_name = countrycode::countrycode(iso_a3, "iso3c", "country.name")) %>% mutate(country_name = reorder_within(country_name, mean_country_rating, species)) %>% group_by(species) %>% slice_max(mean_country_rating, n = 10) %>% ungroup() %>% ggplot(aes(x = mean_country_rating, y = country_name, fill = region_un)) + geom_col() + facet_wrap(~ species, scales = "free") + scale_y_reordered() + coord_cartesian(xlim = c(70, 90)) + labs(x = "Mean coffee rating", y = NULL, fill = "Region") ``` ] ] --- ## The trusty column chart gets us some of the way there -- - We still have some information about the geography of the countries that produce the best coffee -- - The chart is very legible and the comparisons are clear -- - But, we have lost the effect of latitude that we saw in Africa. --- ## We may care about spatial autocorrelation We see in a zoomed in section that as we move North and towards the Ethiopian highlands, the coffee rating improves.
--- class: inverse, center, middle # How can we show this on a map? ## But still retain the legibility of the column chart? --- class: inverse, center, middle # Represent each country with the same size ## And arrange the countries in 'roughly' the correct place geographically --- .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/coffee-map-demo-2-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/coffee_map_tiles.rds") df %>% ggplot(aes(xmin = x, ymin = y, xmax = x + 1, ymax = y + 1)) + geom_rect(color = "#ffffff") + geom_rect(color = "#ffffff", aes(fill = mean_country_rating)) + scale_fill_viridis_c(trans = "sqrt") + geom_text(aes(x = x, y = y, label = alpha.2), color = "#ffffff", alpha = 0.5, nudge_x = 0.5, nudge_y = 0.5, size = 2.8) + theme(panel.grid = element_blank()) + theme_void() + theme(legend.position = "bottom") + coord_equal() + labs(fill = "Average Arabica coffee rating by country") ``` ] ] --- ## What have we improved? .pull-left[ ![](index_files/figure-html/coffee-map-demo-2-out-small-1.svg)<!-- --> ] .pull-right[ - We can see the spatial correlation in Africa - Geographically small countries are not forgotten - Comparing countries is not difficult with the diverging colour scale ] --- # Time on a third axis Sometimes it is interesting to show an evolution of two variables over time. For example, say we have information on fertility rates and the share of children born outside of marriage in Europe. The data looks like this: ```r df <- read.csv("data/df_denmark_greece.csv") %>% as_tibble() df %>% head() ``` ``` ## # A tibble: 6 × 4 ## country year tfr pbom ## <chr> <int> <dbl> <dbl> ## 1 Denmark 1960 2.57 7.8 ## 2 Denmark 1961 2.55 8 ## 3 Denmark 1962 2.55 8.3 ## 4 Denmark 1963 2.67 8.9 ## 5 Denmark 1964 2.6 9.3 ## 6 Denmark 1965 2.61 9.5 ``` --- ## The obvious choice is a line chart .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-demo-0-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% # gives meaningful variable names rename(`Total fertility rate` = tfr, `Proportion of births outside marriage` = pbom) %>% # makes it into a longer dataset so that we can facet # wrap by indicator pivot_longer(-c(country, year), names_to = "indicator") %>% ggplot(aes(year, value, colour = country)) + # here we say nrow = 2 so that they are above one another facet_wrap(~ indicator, nrow = 2, scales = "free_y") + geom_line() + # remove unnecessary axis labels labs(y = NULL, x = "Year", colour = "Country") + theme(legend.position = "right") ``` ] ] --- ### Pros and cons with a line chart .pull-left[ ![](index_files/figure-html/greece-denmark-demo-0-out-small-1.svg)<!-- --> ] .pull-right[ ### Pros - Familiar ### Cons - Hard to keep track of each series - Difficult to compare movements across short periods ] --- ### An alternative: time on a third axis .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-demo-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% mutate(year_flag = ifelse(test = year %% 7 == 0, yes = year, no = NA)) %>% ggplot(aes(tfr, pbom, colour = country, label = year_flag)) + geom_point() + geom_segment(aes(x= tfr, y= pbom, xend=c(tail(tfr, n=-1), NA), yend=c(tail(pbom, n=-1), NA)), arrow = arrow(length=unit(0.25,"cm"))) + geom_text_repel(colour = "black") + scale_y_continuous(labels = percent_format(scale = 1)) + scale_colour_manual(values = c("#C60C30", "#0D5EAF")) + labs(x = "Total fertility rate", y = "Proportion of births\noutside of marriage", title = "Fertility vs births outside of marriage in<br><span style='color:#C60C30'>Denmark</span> and <span style='color:#0D5EAF'>Greece</span>") + theme(legend.position = "none", plot.title = element_markdown()) ``` ] ] --- ### What have we learned? .pull-left[ ![](index_files/figure-html/greece-denmark-demo-1-out-small-1.svg)<!-- --> ] .pull-right[ - Both countries saw a large drop in fertility from the 1960s until the 1980s - In Denmark, after 1970 we see an increase in the share of children born outside of marriage - In contrast, Greek families have relatively few children outside of marriage. - After 1990, Danish fertility increased from 1.3 to 1.8, while Greek fertility remained at 'lowest-low' levels, below replacement. ] --- ### What have we changed? .pull-left[ ![](index_files/figure-html/greece-denmark-demo-1-out-comparison-1.svg)<!-- --> ] .pull-right[ - Indicators on the x- and y-axis and then show time with text labels - Legend is replaced with colour coded title - Colours have meaning (main colour of country flag) - Percentage labels on the y-axis ] --- # Giving context with colour ## Sometimes we may want to show a particular series of data in its correct context. ### For instance, in our line graph above which showed the evoltuion of the share of births outside of marriage in **Denmark and Greece**, we might want to know if these two represent the **extremes** within Europe. --- ## Do **Denmark and Greece** represent the **extremes** of the share of children born outside of marriage in Europe? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-marriage-demo-0-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df <- read_rds("data/births_outside_marriage.rds") df %>% filter(flag == 1) %>% ggplot(aes(x = year, y = pbom, group = country, colour = country)) + geom_line() + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- ### One way to do this would be to show an average and then how these compare .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-marriage-demo-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% * group_by(year) %>% * mutate(mean_pbom = mean(pbom, na.rm = T)) %>% * ungroup() %>% ggplot() + geom_line(aes(x = year, y = pbom, group = country, colour = country), data = df %>% filter(flag == 1)) + * geom_line(aes(x = year, y = mean_pbom, colour = "European average")) + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- ### We can improve upon this by introducing an interval ribbon .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-marriage-demo-2-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% group_by(year) %>% mutate(mean_pbom = mean(pbom, na.rm = T), * pct_10 = quantile(pbom, .1, na.rm = T), * pct_90 = quantile(pbom, .9, na.rm = T)) %>% ungroup() %>% ggplot() + geom_line(aes(x = year, y = pbom, group = country, colour = country), data = df %>% filter(flag == 1)) + * geom_ribbon(aes(x = year, ymin = pct_10, ymax = pct_90, * fill = "Interval \n(10th to 90th percentile)"), alpha = .3) + geom_line(aes(x = year, y = mean_pbom, colour = "European average")) + scale_y_continuous(labels = percent_format(scale = 1)) + * scale_fill_manual(values = "#90ee90") + * guides(fill = guide_legend(order = 2), * col = guide_legend(order = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country", * fill = "") ``` ] ] --- ### An alternative shows the series for each country .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-marriage-demo-3-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df %>% ggplot(aes(x = year, y = pbom, group = country, colour = country)) + geom_line() + scale_y_continuous(labels = percent_format(scale = 1)) + labs(y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") + theme(legend.text = element_text(size = 4), axis.title = element_text(size = 8), legend.position = "bottom") ``` ] ] --- class: center, middle, inverse # This is silly --- class: center, middle, inverse ## The legend is gigantic ## There are far too many colours ## We cannot see Denmark or Greece --- ### We can **highlight** the counties we are interested in And gray out the other lines .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/greece-denmark-marriage-demo-4-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r library(gghighlight) df %>% ggplot() + geom_line(aes(x = year, y = pbom, colour = country)) + gghighlight(flag == 1) + scale_y_continuous(labels = percent_format(scale = 1)) + theme(legend.text = element_text(size = 4), legend.position = "bottom", plot.title = element_markdown(size = 14)) + scale_colour_manual(values = c("#C60C30", "#0D5EAF")) + labs(title = "Proportion of births outside of marriage in <span style='color:#C60C30'>Denmark</span> and <span style='color:#0D5EAF'>Greece</span>", y = "Proportion of births\noutside of marriage", x = NULL, colour = "Country") ``` ] ] --- # Advantages of gghighlight .pull-left[ ![](index_files/figure-html/greece-denmark-marriage-demo-4-out-small-1.svg)<!-- --> ] .pull-right[ - Shows each of the series - We can see that Denmark is a leader in the beginning, but is caught up by other nations - Does not hide outliers - Makes clear the trends in your countries of interest ] --- # Interactivity Publishing your research on the web is a good way to get it out there. Making your visualizations interactive is a fantastic way to have your readers explore the story you tell. .pull-left[ I'm not an expert in this, but there are [lots of resources](https://www.r-graph-gallery.com/interactive-charts.html) already and [many more](https://davidgohel.github.io/ggiraph/) are on the way! The solutions in R make it easy to apply one command (for example `ggplotly`) and get back an interactive chart from a `ggplot`. ] .pull-right[ ![](images/plotly.PNG) ] --- ### Interactive map for Anglican church locations .panelset[ .panel[.panel-name[Plot]
] .panel[.panel-name[Code] ```r library(leaflet) library(glue) df <- read_rds("data/baptism_data.rds") %>% group_by(parish_name) %>% mutate(n_obs = n()) %>% ungroup() df <- df %>% # this popup will contain the parsh name in bold (<b> ... </b> makes bold) # and the address in the next line (with a <br/> break) mutate(popup = glue("<b>{parish_name}</b><br/>Located at {parish_address},<br/>this parish has <b>{n_obs}</b> baptisms in our dataset.")) df %>% # makes a leaflet map leaflet() %>% # adds a background addTiles() %>% # makes the map grey in colour addProviderTiles(providers$CartoDB.Positron) %>% # sets the centre of the map with the mean values of lat and long setView(mean(df$long), mean(df$lat), zoom = 11) %>% # here we add a marker with the popup specified above, inclduing the name of the parish and its address. addMarkers(unique(df$long), unique(df$lat), popup = unique(df$popup)) ``` ] ] --- ## Interactive plots ### Have a look at [Claus Wilke's slides here](https://wilkelab.org/SDS375/slides/interactive-plots.html#1) for more info on Interactive plots ![](images/int-fig.PNG) --- ### Recreating published figures ![](images/GDP.png) --- ### How do you recreate it without the underlying data? .panelset[ .panel[.panel-name[Plot] ![](index_files/figure-html/recreate-1-out-1.svg)<!-- --> ] .panel[.panel-name[Code] ```r df <- read.csv("data/Fouquet_Broadberry.csv") %>% as_tibble() order <- df %>% group_by(country) %>% filter(x == max(x)) %>% arrange(desc(y)) %>% ungroup() %>% select(country) %>% as.list() df %>% mutate(country = fct_relevel(country, order)) %>% ggplot(aes(x, y, colour = country, lty = country)) + geom_point() + geom_line() + scale_y_continuous(labels = dollar_format()) + scale_color_brewer(palette = "Dark2") + labs(title = "GDP Per Capita in Selected European Countries", y = "GDP Per Capita (1990 USD)", x = NULL, colour = "Country", lty = "Country") ``` ] ] --- <iframe width="800" height="450" src="https://www.youtube.com/embed/TVCFedRgg5Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- class: inverse, center, middle # 7. Return to your familiar situation --- ### Resources Here is a link the resources at my website [Interlude One](https://interludeone.com/posts/2021-05-21-ggplot-resources/). ![](images/resources.PNG) --- ### Thank you! ![](images/library.jpg) ---