EOSE09 – Data visualization

Purpose

Get you excited about storytelling with data
Show some tips and tricks to make your maps and charts pop

Hans Rosling’s “The best stats you’ve ever seen”

Structure

Improving your maps
Overcoming Excel
Telling a story with data
Reproducing figures for publication

Everything is a story

Dan Harmon’s Story Circle

Our Story Circle

Improving your maps

Legend breaks

Recap from Lab 1 exercises

Make a map of the share of employment in industry in the year 2010 across the whole dataset

Recap from Lab 1 exercises

01:00

Discuss with your neighbour:

What do we like?
What is confusing?

* Share of employment in industry, 2010, with five equal-width classes
* (/// continues a command onto the next line in a do-file)
spmap employment_share_industry using "nutscoord.dta" ///
  if year == 2010, ///
  id(_ID) fcolor(Spectral) legstyle(2) ///
    title("Employment Share Industry - 2010", size(large)) ///
    osize(0.02 ..) ocolor(white ..) ///
    clmethod(custom) clbreaks(0 (0.2) 1) ///
    legend(pos(9) size(medium) rowgap(1.5) ///
    label(6 "80-100 %") label(5 "60-80 %") ///
    label(4 "40-60 %") label(3 "20-40 %") label(2 "0-20 %") ///
    label(1 "No Data")) ///
    ndfcolor(gray) ndocolor(white ..) ndsize(0.02 ..)

Let’s plot the distribution of the data

* Histogram of the 2010 values, used to see where the data are concentrated
histogram employment_share_industry if year == 2010, ///
  color(midblue)

Let’s plot the distribution of the data

kdensity employment_share_industry if year == 2010

Now let’s make breaks based on this information

* Same map, with breaks every 7.5 percentage points between 0 and 45 %
spmap employment_share_industry using "nutscoord.dta" ///
  if year == 2010, id(_ID) fcolor(Spectral) legstyle(2) ///
    title("Employment Share Industry - 2010", size(large)) ///
    osize(0.02 ..) ocolor(white ..) ///
    clmethod(custom) clbreaks(0 (0.075) 0.45) ///
    legend(pos(9) size(medium) rowgap(1.5) ///
    label(7 "37-45 %") label(6 "30-37 %") ///
    label(5 "23-30 %") label(4 "15-23 %") ///
    label(3 "8-15 %") label(2 "0-8 %") ///
    label(1 "No Data")) ///
    ndfcolor(gray) ndocolor(white ..) ndsize(0.02 ..)

Colour scales

Uses of color in data visualization

Distinguish categories (qualitative)

Qualitative scale example

Palette name: Okabe-Ito

In this graph I have chosen to plot some data about the number of solar panels installed in various Swedish towns, and the installed capacity of those solar panels.

The data isn’t important in this case, but we can see that there is a strong linear relationship between the number of solar panels installed and the installed capacity of those solar panels.

In this case, we might want to distinguish between different counties in Sweden, and so we use a qualitative palette to do so.

We can see that there are both the largest number of panels and the largest installed capacity in Gothenburg, followed by Stockholm and Malmö - which makes sense.

In this instance, we can choose colours that distinguish between the different counties, and we can see that the Okabe-Ito palette is a good choice for this. We aren’t saying that this town is ‘more Skåne’ than others, so a sequential palette is not appropriate.
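As a rough sketch of how a qualitative palette can be applied in Stata, the code below colours three groups with the first three non-black Okabe-Ito colours, written as "R G B" triplets. The data are simulated and the variable names (`group`, `panels`, `capacity`) are assumptions made for illustration; they are not the data behind the slide. It also assumes Stata 14 or later for the two-argument `runiform()`.

```stata
* Simulated data: 3 hypothetical groups of 20 towns each (NOT the lecture data)
clear
set seed 2021
set obs 60
generate group    = ceil(_n / 20)
generate panels   = runiform(100, 5000)
generate capacity = 0.01 * panels + rnormal(0, 2)

* One scatter layer per group, each with its own Okabe-Ito colour
twoway (scatter capacity panels if group == 1, mcolor("230 159 0"))  ///
       (scatter capacity panels if group == 2, mcolor("86 180 233")) ///
       (scatter capacity panels if group == 3, mcolor("0 158 115")), ///
       legend(order(1 "Group A" 2 "Group B" 3 "Group C"))            ///
       xtitle("Number of solar panels") ytitle("Installed capacity")
```

The colours carry no order: swapping them between groups would not change the message, which is what we want for categories.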

Qualitative scale example

Palette name: Brewer Set1

Qualitative scale example

Palette name: Brewer Dark2

Uses of color in data visualization

Distinguish categories (qualitative)
Represent numeric values (sequential)

Sequential scale example

Palette name: inferno

Here we are looking at the number of hours of sunlight in different cities around the world.

We can see that in Lund, because we are quite far North, we have very few hours of sunlight in the winter, and a lot of hours of sunlight in the summer.

In contrast, Panama City has a very consistent number of hours of sunlight throughout the year as it lies near the equator.

We can also see that Cape Town and Livingstone, Zambia, show the reverse pattern to Lund: they get more hours of sunlight during Lund’s winter months than during its summer months, because they lie in the Southern Hemisphere.

The inferno sequential palette is a good choice for this data, as it shows the progression of the number of hours of sunlight in a clear way.

For a sequential palette, you want the person looking at your plot to see the progression of the data clearly, including which values are high and which are low. Here it makes sense to use the brighter end of the scale for high values, because we associate brightness with sunlight. Quite neat!

Sequential scale example

Palette name: viridis

Uses of color in data visualization

Distinguish categories (qualitative)
Represent numeric values (sequential)
Represent numeric values (diverging)

Diverging scale example

Uses of color in data visualization

Distinguish categories (qualitative)
Represent numeric values (sequential)
Represent numeric values (diverging)
Highlight

Highlight example

In this example, we have information on the share of children born outside of marriage in Europe.

We have a lot of lines, and we are interested in highlighting two countries in order to compare them over time.

We can see that in Denmark the share of children born outside of marriage was already higher than in Greece in 1960. It rose sharply from 1970 to about 1990, before levelling off at about 50%.

In contrast, in Greece the share of children born outside of marriage was very low in 1960 and, because of the importance of the Orthodox church in Greece, it has remained low throughout the period, rising only to about 12.5% in 2021.

I did another neat thing here: I used markdown in the title to colour the country names so that they match the colours of the lines. This way we don’t need a separate legend, and we save some space on the plot. I also used colours from the countries’ flags, which is a nice touch, if I do say so myself.

Using density plots to set your legend breaks: quick example

Dataset: Solar panels in Sweden

Installed solar capacity in Sweden
Year: 2021
Swedish county	Installed solar capacity (megawatts)
Västra Götalands län	266.21
Skåne län	256.25
Stockholms län	182.25
Östergötlands län	106.81
Hallands län	94.31
Jönköpings län	88.53
Södermanlands län	79.71
Uppsala län	79.11
Kalmar län	59.01
Västmanlands län	49.45
Source: Energimyndigheten

How to decide on values for the bins?

Use a histogram or a density plot to see where the weight of the distribution is.
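A minimal sketch of this step in Stata, using the ten counties from the table above (so it covers only those counties, not all of Sweden). The variable names `county` and `capacity_mw` are assumptions made for this example. Quartile cut points from `_pctile` are one possible starting point for breaks, not the only reasonable choice.

```stata
* Enter the ten counties shown in the table (capacity in megawatts, 2021)
clear
input str30 county capacity_mw
"Västra Götalands län" 266.21
"Skåne län"            256.25
"Stockholms län"       182.25
"Östergötlands län"    106.81
"Hallands län"          94.31
"Jönköpings län"        88.53
"Södermanlands län"     79.71
"Uppsala län"           79.11
"Kalmar län"            59.01
"Västmanlands län"      49.45
end

* Summary statistics, then quartile cut points as candidate legend breaks
summarize capacity_mw, detail
_pctile capacity_mw, nquantiles(4)
display "Quartile cut points: " r(r1) ", " r(r2) ", " r(r3)

* Look at where the weight of the distribution sits
histogram capacity_mw, width(25) frequency
```

Once you have looked at the plot, you can round the cut points to friendlier numbers and pass them to `clbreaks()` with `clmethod(custom)`, as in the employment map earlier.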

Map with appropriate breaks

Ask your neighbour:

what kind of palette is this?
Is it appropriate to use with this data?

01:00

Improving your maps

Great Choropleths

Examples of great maps

Financial Times analysis of Italian election results in 2018

This is a map from the FT showing the results of the Italian elections in 2018. You can all get access to the FT through Lund and I think it’s a really great resource for data journalism and data visualization.

Maps work best when they show an emerging spatial pattern, as was the case with this map from the recent Italian elections.

Showing the winner at municipality level clearly shows the political divisions in the country. In the north, the Northern League party triumphed largely on the back of an anti-immigration and anti-EU agenda. In the south, the anti-establishment Five Star Movement was even more successful, gaining a majority of votes in many areas.

What is also clever is that they have used a column chart in the map in order to help us understand the aggregate vote share for the main parties.

What kind of palette do you think they have used here? Sequential, qualitative or diverging?

A: qualitative, because they are showing the winner in each municipality.

Examples of great maps

Human Terrain from The Pudding

The other place I really recommend you look for inspiration for good data journalism is a website called “The Pudding”. They have some really great data visualizations and data journalism pieces, including this one that shows where people live in countries.

Here I have a screenshot of the map showing the Öresund region, and we can see how concentrated the population is in Copenhagen and Malmö, and how the population density drops off as we move away from the cities to smaller towns.

I will say that this is based on buildings and not people, so it is not a perfect representation of the population, but it is a really interesting way to show where people live in a country. Here you can see the Öresund bridge looks like it has people living on it, but that is not the case in reality.

Let’s have a look at some other places - who has visited Japan before? Compare population growth…

Examples of great maps

The Coming Crisis: Exploring the U.S. Physician Shortage by Daniel Snow

Overcoming Excel

Motivation

Overcoming Excel

Formby et al. (2017) Microsoft Excel: Is It An Important Job Skill for College Graduates?

Overcoming Excel

Takeaways:

You will likely use Excel in the future 📊
Excel’s default plots and tables can be improved upon 📈
Simple rules can help you make your message clear 💎

Overcoming Excel

Charts

Overcoming Excel: Column plot

We often encounter datasets containing simple amounts 🤏
Here is some data on a sample of Swedish musical artists 🎵
I put this data into Excel, and asked for a recommended chart 📊

Swedish musical artists
Rank	Artist	Monthly listeners (m)
1	Avicii	29.47
2	ABBA	23.48
3	José González	4.07
4	Robyn	3.11
5	Timbuktu	0.38
Data source: Spotify charts Nov 2022

Your turn

02:30

Discuss with your neighbour:

What do we like?
What is confusing?

Tip 1: Avoid rotated axis labels

Ugly 🤢

Tip 1: Avoid rotated axis labels

Flip axes so that the text is easier to read 👓

Tip 2: Pay attention to the order of the bars

Bad 👎

Tip 2: Pay attention to the order of the bars

It is clear that José González receives more streams than Robyn

Tip 3: Consider your titles, labels and axes

Uninformative️ ❗

Tip 3: Consider your titles, labels and axes

Note the title, x-axis title, x-axis labels 📙
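The three tips so far can be sketched in a single Stata command, using the Spotify table above. The variable names `artist` and `listeners_m` and the title wording are assumptions made for this example. `graph hbar` draws horizontal bars so the artist names read left to right, `sort(1) descending` orders the bars by value, and the title options label the chart.

```stata
* Enter the five artists from the table (monthly listeners in millions)
clear
input str20 artist listeners_m
"Avicii"        29.47
"ABBA"          23.48
"José González"  4.07
"Robyn"          3.11
"Timbuktu"       0.38
end

* Horizontal bars (Tip 1), sorted by value (Tip 2), with informative text (Tip 3)
graph hbar (asis) listeners_m,                          ///
    over(artist, sort(1) descending)                    ///
    ytitle("Monthly listeners (millions)")              ///
    title("Monthly listeners for five Swedish artists") ///
    note("Data source: Spotify charts, Nov 2022")
```

Note that Stata still calls the numeric axis of a horizontal bar chart the y-axis, which is why the axis title is set with `ytitle()`.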

Tip 3: Consider your titles, labels and axes

Titles and captions have different application areas

Figure 1: Monthly streams for Swedish musical artists. Data source: Spotify charts, November 2022

We can use dots instead of bars

Dots are preferable if we want to truncate the axes

Dataset: Solar panels in Sweden
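A minimal sketch of a dot plot with a truncated axis in Stata, using the ten counties from the solar capacity table shown earlier. The variable names are assumptions made for this example. The `exclude0` option tells `graph dot` not to force the value axis to start at zero; with dots this is less misleading than with bars, because a dot encodes position rather than length.

```stata
* Enter the ten counties from the solar capacity table (megawatts, 2021)
clear
input str30 county capacity_mw
"Västra Götalands län" 266.21
"Skåne län"            256.25
"Stockholms län"       182.25
"Östergötlands län"    106.81
"Hallands län"          94.31
"Jönköpings län"        88.53
"Södermanlands län"     79.71
"Uppsala län"           79.11
"Kalmar län"            59.01
"Västmanlands län"      49.45
end

* Dot plot sorted by value; exclude0 lets the axis start above zero
graph dot (asis) capacity_mw,                       ///
    over(county, sort(1) descending) exclude0       ///
    ytitle("Installed solar capacity (MW), 2021")   ///
    note("Source: Energimyndigheten")
```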

Dots are preferable if we want to truncate the axes

Bar lengths do not accurately represent the data values

Dots are preferable if we want to truncate the axes

Key features of the data are obscured

Dots are preferable if we want to truncate the axes

Overcoming Excel

Tables

Overcoming Excel: Tables

We often encounter datasets containing simple amounts 🤏
Here is some data on a sample of Swedish musical artists 🎵
I put this data into Excel, and asked it to insert a table 🗃️

Swedish musical artists
Rank	Artist	Monthly listeners (m)
1	Avicii	29.47
2	ABBA	23.48
3	José González	4.07
4	Robyn	3.11
5	Timbuktu	0.38
Data source: Spotify charts Nov 2022

Your turn again

02:30

Discuss with your neighbour:

What do we like?
What is confusing?

Key rules for table layout
Number	Rule
1	Do not use vertical lines.
2	Do not use heavy horizontal lines between data rows. (Horizontal lines as separator between the title row and the first data row or as frame for the entire table are fine.)
3	Text columns should be left aligned.
4	Number columns should be right aligned and should use the same number of decimal digits throughout.
5	Columns containing single characters are centred.
6	The header fields are aligned with their data, i.e., the heading for a text column will be left aligned and the heading for a number column will be right aligned.
Source: Claus Wilke’s Fundamentals of Data Visualization
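Some of these rules can be approximated when printing a table from Stata. The sketch below is one way to do it, under the assumption that the Spotify data are stored in variables named `rank`, `artist` and `listeners_m`. A left-justified string format covers rule 3, a fixed two-decimal numeric format covers rule 4, and `list` with `clean` drops the divider lines that rules 1 and 2 warn against.

```stata
* Enter the five artists from the table
clear
input rank str20 artist listeners_m
1 "Avicii"        29.47
2 "ABBA"          23.48
3 "José González"  4.07
4 "Robyn"          3.11
5 "Timbuktu"       0.38
end

* Rule 3: left-align the text column
format artist %-20s
* Rule 4: right-align numbers with the same number of decimals
format listeners_m %9.2f

* Rules 1 and 2: no vertical lines, no lines between data rows
list rank artist listeners_m, noobs clean
```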

Let’s apply these rules

01:30


Storytelling with data

Related time series

Plotting related time series

Dataset: Fertility and births outside of marriage in Denmark and Greece.

Default choice for plotting is two line plots

Plotting related time series

Pros 👍

Familiar

Cons 👎

Hard to keep track of each series
Difficult to compare movements across short periods

An alternative: time on a third axis

What have we learned?

Both countries saw a large drop in fertility from the 1960s until the 1980s
In Denmark, after 1970 we see an increase in the share of children born outside of marriage
In contrast, Greek families have relatively few children outside of marriage.
After 1990, Danish fertility increased from 1.3 to 1.8, while Greek fertility remained at ‘lowest-low’ levels, below replacement.

What have we changed?

Indicators on the x- and y-axis and then show time with text labels
Legend is replaced with colour coded title
Colours have meaning (main colour of country flag)
Percentage labels on the y-axis
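A rough sketch of this kind of plot in Stata: one indicator on each axis, points connected in time order, and the year shown as a text label. The data are made-up values for two hypothetical countries, and all variable names are assumptions for illustration; they are not the Danish and Greek series used in the lecture.

```stata
* Made-up data: 2 hypothetical countries, one observation per decade
clear
set obs 14
generate country = cond(_n <= 7, 1, 2)
bysort country: generate year = 1960 + 10 * (_n - 1)
generate decade = (year - 1960) / 10
generate fertility     = cond(country == 1, 2.5, 2.3) - 0.15 * decade
generate share_outside = cond(country == 1, 8 + 7 * decade, 1 + 1.5 * decade)

* Connect the points in time order and label each point with its year
twoway (connected share_outside fertility if country == 1,          ///
            sort(year) mlabel(year) mlabposition(12))                ///
       (connected share_outside fertility if country == 2,          ///
            sort(year) mlabel(year) mlabposition(12)),               ///
       legend(order(1 "Country A" 2 "Country B"))                    ///
       xtitle("Total fertility rate")                                ///
       ytitle("Births outside marriage (%)")
```

The `sort(year)` option matters here: without it the points would be connected in the order of the x variable, not in time order.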

Storytelling with data

Giving context

Sometimes we may want to show a particular series of data in its correct context.

For instance, in our line graph above which showed the evolution of the share of births outside of marriage in Denmark and Greece, we might want to know if these two represent the extremes within Europe.

Giving context

Do Denmark and Greece represent the extremes of the share of children born outside of marriage in Europe?

Giving context with an average

One way to do this would be to show an average for Europe

Giving context with an interval ribbon
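A minimal sketch of both ideas in Stata, an average line and an interval ribbon around it. The panel is simulated (ten hypothetical countries, seven time points) and the variable names are assumptions for illustration. The 10th to 90th percentile band is just one possible choice of interval.

```stata
* Simulated panel: 10 hypothetical countries, one observation per decade
clear
set seed 42
set obs 10
generate country = _n
expand 7
bysort country: generate year = 1960 + 10 * (_n - 1)
generate share = 2 + country * (year - 1960) / 15 + rnormal(0, 1)

* Yearly mean and 10th/90th percentiles across countries
egen mean_share = mean(share), by(year)
egen p10 = pctile(share), p(10) by(year)
egen p90 = pctile(share), p(90) by(year)
egen one_per_year = tag(year)      // keep one row per year for plotting

* Ribbon for the interval, line for the average
twoway (rarea p10 p90 year if one_per_year, sort color(gs13))       ///
       (line mean_share year if one_per_year, sort lcolor(black)),  ///
       legend(order(1 "10th to 90th percentile" 2 "Mean"))          ///
       xtitle("") ytitle("Share (%)")
```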

Giving context with all of the data

This is silly

Giving context with all of the data

Here we highlight the series we are interested in and draw in the remaining series in grey
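One way to sketch this in Stata is to draw all the background series in grey in a single layer and then overlay the highlighted series in colour. The data below are simulated and the country codes, variable names and colours are assumptions for illustration. The `connect(L)` option only joins points while the x value increases, so with the data sorted by country and year each country gets its own line.

```stata
* Simulated panel: 10 hypothetical countries, one observation per decade
clear
set seed 12345
set obs 10
generate country = _n
expand 7
bysort country: generate year = 1960 + 10 * (_n - 1)
generate share = 2 + 0.5 * country * (year - 1960) / 10 + rnormal(0, 1)
sort country year

* Grey background series first, then the two highlighted countries on top
twoway (line share year if country > 2, connect(L) lcolor(gs12))      ///
       (line share year if country == 1, lcolor("200 16 46") lwidth(thick)) ///
       (line share year if country == 2, lcolor("13 94 175") lwidth(thick)), ///
       legend(order(2 "Country A" 3 "Country B"))                      ///
       xtitle("") ytitle("Share (%)")
```

Drawing the grey layer first keeps the highlighted lines on top, so they are never hidden by the background series.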

What have we changed?

Shows each of the series
We can see that Denmark is a leader at the beginning, but other nations catch up with it
Does not hide outliers
Makes clear the trends in your countries of interest

Storytelling with data

Tips for polished figures

Tips for polishing your figures

Where to get great colours from for your plots:

* Look for the palettes listed under fcolor()
help spmap

Recreating published figures

An FT chart published without the underlying data

You pay a heavy price

Additional data demo