Altair: Interactive Plots
Introduction to Altair¶
Altair is a Python library that generates JSON output complying with the Vega-Lite specification. Thus, to understand the former, one must master the latter. The Vega-Lite specification describes how data are aggregated and transformed, and which form of plot is used (histogram, bar, line, scatter, and others). Vega-Lite is itself based on a more verbose specification simply named Vega.
The JSON is then embedded within an HTML page, where a JavaScript library reads it and renders the plots into a canvas.
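To make the embedding step concrete, here is a minimal sketch of such a page, built by hand in Python. It assumes the vega-embed JavaScript library loaded from the jsDelivr CDN (the version numbers are illustrative) and a tiny hypothetical specification with inline values; the page Altair itself generates differs in its details.

```python
import json

# A tiny, hypothetical Vega-Lite specification with inline data.
spec = {
    "data": {"values": [{"category": "A", "amount": 3},
                        {"category": "B", "amount": 7}]},
    "mark": "bar",
    "encoding": {
        "x": {"field": "amount", "type": "quantitative"},
        "y": {"field": "category", "type": "nominal"},
    },
}

# Assumed page layout: vega-embed reads the JSON and renders it into #vis.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>vegaEmbed('#vis', SPEC_PLACEHOLDER);</script>
</body>
</html>"""

# Serialize the specification and drop it into the page.
html = PAGE_TEMPLATE.replace("SPEC_PLACEHOLDER", json.dumps(spec))
```

In practice there is rarely a need to write this page by hand: a chart's save method, for example chart.save('chart.html'), writes a comparable standalone page.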
What makes Altair/Vega so different from other plotting frameworks (such as plotly and bokeh, which also use a JSON specification format) is how the Vega specification has been engineered: as an end in itself rather than an internal object meant to be hidden at all costs. Thus, Vega has very good documentation, with many examples, and it is very easy to dive into. From the Vega point of view, Altair is a very pleasant way to write Vega-Lite plots, with high-level objects and a one-to-one mapping with low-level objects.
In this article, we'll simply go through a data visualization with Altair, explaining the logic Vega uses to plot data.
Packages and Setup¶
import numpy as np
import pandas as pd
Furthermore, we load altair under the alias alt:
import altair as alt
alt.renderers.set_embed_options(theme='light');
As a practice dataset, we are going to use a sample of 500 records from a larger one, the French 2016 marriage records (see the Appendix for the normalization process).
data_url = 'data/mar2016.json'
Simple Statistics¶
alt.Chart(data_url).mark_bar().encode(
    x='count()',
    y='SEXE1:O'
).properties(
    height=125,
    width=250
).configure(autosize='fit')
The basic bricks of Altair are:
- alt.Chart: creates a Chart with data_url as the source of the data
- mark_bar: describes the wanted plot, here a bar plot
- encode: describes the data to display
  - x: aggregates by counting
  - y: groups by SEXE1; the :O suffix tells Vega that SEXE1 is an ordinal (ordered categorical) variable
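Since Altair maps one-to-one onto Vega-Lite, it can help to see roughly what specification the bar chart above corresponds to. The dictionary below is a hand-written approximation, not Altair's actual output: the real specification (which can be inspected with the chart's to_dict() or to_json() methods) carries additional keys such as the schema URL and the configuration.

```python
import json

# Hand-written approximation of the Vega-Lite specification for the bar chart.
spec = {
    "data": {"url": "data/mar2016.json"},        # alt.Chart(data_url)
    "mark": "bar",                                # .mark_bar()
    "encoding": {
        # x='count()': aggregate by counting records
        "x": {"aggregate": "count", "type": "quantitative"},
        # y='SEXE1:O': group by SEXE1, treated as an ordinal field
        "y": {"field": "SEXE1", "type": "ordinal"},
    },
    "height": 125,                                # .properties(...)
    "width": 250,
}

print(json.dumps(spec, indent=2))
```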
Maps with Altair¶
We used a simplified GeoJSON file (thanks to MapShaper) taken from France-GeoJSON.
france_url='data/departements-avec-outre-mer.json'
france = alt.Data(url=france_url, format=alt.DataFormat(property='features', type='json'))
france_base = alt.Chart(france).mark_geoshape(fill='lightgray')
plot_france = (
    france_base
    + alt.Chart(data_url).transform_lookup(
        default='0',
        as_='geo',
        lookup='DEPDOM',
        from_=alt.LookupData(data=france, key='properties.code')
    ).mark_geoshape().encode(
        alt.Color('count()', scale=alt.Scale(type='log')),
        alt.Shape(field='geo', type='geojson'),
        tooltip=['count()', 'DEPDOM:N']
    )
).project(
    scale=900,
    center=[2, 46.5]
).properties(
    title='Departement'
).configure(autosize='fit')
plot_france
Broadly, we have done two things. First, we defined the geographic data we will use. Then, we joined them to aggregated data from our marriage dataset.
Data Definition¶
- url: where the data is stored
- format: a DataFormat
  - property: the property of the JSON document containing the data
  - type: specifies JSON as the expected format
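The role of the property setting can be illustrated in plain Python. The snippet below is only a sketch on a hypothetical, heavily truncated FeatureCollection (point geometries stand in for the real department outlines, and the "nom" property is an assumption); it shows that the records Vega works with are the items of the "features" array, each carrying its department code under properties.code.

```python
import json

# Hypothetical, heavily truncated GeoJSON FeatureCollection.
raw = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"code": "01", "nom": "Ain"},
     "geometry": {"type": "Point", "coordinates": [5.35, 46.1]}},
    {"type": "Feature",
     "properties": {"code": "75", "nom": "Paris"},
     "geometry": {"type": "Point", "coordinates": [2.35, 48.86]}}
  ]
}
"""

document = json.loads(raw)

# property='features' means: the records are the items of this array.
records = document["features"]

# Each record exposes its department code as a nested field.
codes = [record["properties"]["code"] for record in records]
print(codes)
```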
Join two datasets together¶
To extend one dataset with data from another one, we must use what is called a Lookup transform, which Altair wraps in the transform_lookup function. It is basically a LEFT JOIN. Let's describe the parameters we used:
- default: the default value when there is no matching right-hand record
- as_: the name of the appended field
- lookup: the name of the field we are joining on, in the dataset we want to expand
- from_: should be a LookupData, which defines the secondary dataset and the field (key) of that dataset we want to join on
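The semantics of these parameters can be sketched in plain Python. This is an analogy under stated assumptions, not Vega's implementation: the lookup helper and the toy records are hypothetical, and only the parameter names mirror those of transform_lookup.

```python
def lookup(primary, secondary, lookup_field, key, as_, default):
    """Attach to each primary record the secondary record whose key matches."""
    # Index the secondary dataset by its join key.
    index = {key(record): record for record in secondary}
    joined = []
    for record in primary:
        # LEFT JOIN: every primary record is kept; unmatched ones get the default.
        match = index.get(record[lookup_field], default)
        joined.append({**record, as_: match})
    return joined


# Toy data: three marriages, one of them in an unknown department.
marriages = [{"DEPDOM": "75"}, {"DEPDOM": "01"}, {"DEPDOM": "99"}]
features = [
    {"properties": {"code": "01"}, "geometry": "<shape of Ain>"},
    {"properties": {"code": "75"}, "geometry": "<shape of Paris>"},
]

joined = lookup(
    marriages,
    features,
    lookup_field="DEPDOM",
    key=lambda feature: feature["properties"]["code"],
    as_="geo",
    default="0",
)
```

Every marriage record survives the join; the one with the unknown code simply carries the default value in its geo field.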
Encode¶
- Color: we count the number of marriages (displayed on a logarithmic color scale).
- Shape: since the primary dataset is not imported using alt.topo_feature, we need to tell Altair where to find the geojson data. Thanks to how we did the Lookup transform, we simply set it to "geo".
- Tooltip: defines the fields displayed when the cursor passes over a department.
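What the Color encoding computes can be mimicked in a few lines. The sketch below uses hypothetical department codes; it counts records per department, as count() does, and then takes a base-10 logarithm. One plausible reason for the log scale is that a few very populated departments would otherwise dominate the color range.

```python
import math
from collections import Counter

# Hypothetical department codes, one per marriage record.
depdom = ["75"] * 100 + ["69"] * 10 + ["01"]

# What 'count()' computes for each group.
counts = Counter(depdom)

# A log scale compresses the range: 1..100 becomes 0..2.
log_counts = {code: math.log10(n) for code, n in counts.items()}
```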
Appendix: Data Normalization¶
We describe how we normalized the dataset.
The source of the data is the 2016 French marriage records provided by INSEE, the French national statistics bureau, which provides the following files:
- mar2016.dbf: the data itself
- varlist_mariages.dbf: describes the variables (type, name, length)
- varmod_mariages.dbf: describes the encoding of the categorical variables
According to the documentation, there are 232,725 records.
Since the data are stored as dbf, we use the simpledbf package to convert them to pandas dataframes.
from simpledbf import Dbf5
mar2016_dbf = Dbf5('data/mar2016.dbf', codec='latin1')
varlist_mariages_dbf = Dbf5('data/varlist_mariages.dbf', codec='latin1')
varmod_mariages_dbf = Dbf5('data/varmod_mariages.dbf', codec='latin1')
mar2016 = mar2016_dbf.to_dataframe()
varlist_mariages = varlist_mariages_dbf.to_dataframe()
varmod_mariages = varmod_mariages_dbf.to_dataframe()
Let's check the dimensions:
mar2016.shape
mar2016.head(5)
We set the right type to the categorical variables:
categorical_variables = ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM', 'NBENFCOM']
mar2016[categorical_variables] = mar2016[categorical_variables].replace(np.nan, 'nan').astype('category')
mar2016['NBENFCOM'] = mar2016['NBENFCOM'].cat.rename_categories({'O': 'Y'})
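The effect of this conversion is easier to see on a toy column. The sketch below uses a hypothetical series in which 'O' and 'N' presumably stand for oui and non; fillna is used here as a close equivalent of replacing NaN values.

```python
import pandas as pd

# Hypothetical column: 'O' / 'N' flags, with one missing value.
children = pd.Series(['O', 'N', None, 'O'])

# Missing values become an explicit 'nan' category instead of being left out.
children = children.fillna('nan').astype('category')

# Renaming a category relabels every matching value at once.
children = children.cat.rename_categories({'O': 'Y'})
print(list(children.cat.categories))
```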
We append an approximate English translation of the variables.
varlist_mariages['ENGLISH'] = ['Marriage Year',
'Birth Year Spouse 1',
'Birth Year Spouse 2',
'Residence Department after Marriage',
'Marriage Department',
'Birth Department Spouse 1',
'Birth Department Spouse 2',
'Prior Matrimonial Status Spouse 1',
'Prior Matrimonial Status Spouse 2',
'Nationality Spouse 1',
'Nationality Spouse 2',
'Marriage Weekday',
'Marriage Month',
'Children before Marriage',
'Sex Spouse 1',
'Sex Spouse 2',
'Commune',
'Urban Unit']
varlist_mariages
For varmod_mariages, we focus on the int-encoded variables.
replace = {'INDNA': {'Française':'French',
'Étrangère':'Foreigner'},
'ETAMA': {'Divorcé': 'Divorced',
'Célibataire': 'Single',
'Veuf': 'Widow'},
'TUDOM': {'Agglomération de Paris': 'Paris Agglomeration',
'COM ou étranger': 'Oversea',
'Commune rurale': 'Country side'
' (fewer than 2,000 inhabitants)',
'Indéterminé': 'Unknown'},
'TUCOM': {'Indéterminé ou pays étranger': 'Unknown or foreign country',
'Terres australes et antarctiques,'
' COM non précisé': 'Austral or Antarctic, unknown oversea'}
}
str_replace = {
'TUDOM': [('Unité urbaine de', 'Urban Unit from'),
('à', 'to'),
('habitant', 'inhabitant')],
'TUCOM': [('de plus de', 'with more than'),
('de moins de', 'with fewer than'),
('habitant', 'inhabitant')]
}
translation = varmod_mariages.query("VARIABLE in ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM']").copy()
def translate(v):
    # Variables such as INDNAT1/INDNAT2 share a five-character prefix.
    var = v['VARIABLE'][:5]
    lib = v['MODLIBELLE']
    # First try an exact translation of the whole label.
    if var in replace and lib in replace[var]:
        return replace[var][lib]
    # Otherwise, apply the substring replacements, if any.
    str_rep = str_replace[var] if var in str_replace else None
    if str_rep is not None:
        ret = lib
        for (old, new) in str_rep:
            ret = ret.replace(old, new)
        return ret
    return lib
translation['ENGLISH'] = translation.apply(translate, axis=1).str.replace('\xa0', ',')
translation
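To see what the substring rules and the final '\xa0' replacement do, here is a walk-through on a single label. The label itself is hypothetical (it only imitates the style of the INSEE labels); the rules are those of the TUDOM entry above. The non-breaking space is presumably there because French typography uses it as a thousands separator.

```python
# Hypothetical label; '\xa0' is a non-breaking space used as thousands separator.
label = 'Unité urbaine de 2\xa0000 à 4\xa0999 habitants'

# Same substring rules as the TUDOM entry of str_replace.
rules = [('Unité urbaine de', 'Urban Unit from'),
         ('à', 'to'),
         ('habitant', 'inhabitant')]

translated = label
for old, new in rules:
    translated = translated.replace(old, new)

# Final clean-up, mirroring the .str.replace('\xa0', ',') call.
translated = translated.replace('\xa0', ',')
print(translated)
```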
english_by_variable = (
    translation.replace(np.nan, 'nan')
    .set_index('MODALITE')
    .groupby('VARIABLE')
    .apply(lambda d: d['ENGLISH'].to_dict())
)
for var, cat in english_by_variable.items():
    mar2016[var] = mar2016[var].cat.rename_categories(cat)
Now we have a fully English dataset:
mar2016.sample(5, random_state=5010)
We convert the year and month variables to integers:
year_variables = ['ANAIS1', 'ANAIS2', 'AMAR', 'MMAR']
mar2016[year_variables] = mar2016[year_variables].astype(int)
We do the same with the weekday:
mar2016['JSEMAINE'] = mar2016['JSEMAINE'].astype('category').cat.rename_categories(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)
We create a datetime column for the date of marriage (although we don't have the day, so we set it to the first of the month). We then drop the source columns to avoid duplicating the data.
mar2016['DateMAR'] = pd.to_datetime(mar2016[['AMAR', 'MMAR'
]].assign(day=1
).rename(columns={'AMAR': 'year',
'MMAR': 'month'}))
mar2016.drop(columns=['AMAR',
'MMAR'],
inplace=True)
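The trick used here is that pd.to_datetime can assemble dates from a dataframe whose columns are named year, month and day. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical frame with the same column names as the marriage data.
toy = pd.DataFrame({'AMAR': [2016, 2016], 'MMAR': [6, 12]})

# Add a constant day, then rename to the column names pd.to_datetime expects.
dates = pd.to_datetime(
    toy.assign(day=1).rename(columns={'AMAR': 'year', 'MMAR': 'month'})
)
```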
We can consider the dataset as ready.
mar2016.sample(5, random_state=5010)
So that this article displays easily on any device (especially mobile ones), we subsample the dataset and save it once and for all. It is advised to save the data to an external file when using Altair.
from pathlib import Path
if not Path(data_url).is_file():
    mar2016 = mar2016.sample(500, random_state=5010)
    mar2016.to_json(data_url, orient='records')
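The same pattern can be tried on a toy frame. The sketch below uses hypothetical data and a temporary directory; it shows that orient='records' writes a JSON array with one object per row, which is the layout a chart pointed at a JSON URL reads.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the marriage dataset.
toy = pd.DataFrame({'DEPDOM': ['75', '01', '69', '13'],
                    'SEXE1': ['M', 'F', 'F', 'M']})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'sample.json'
    # Write the sample only if the file does not exist yet.
    if not target.is_file():
        toy.sample(2, random_state=5010).to_json(target, orient='records')
    # Read it back: a list with one dictionary per sampled row.
    records = json.loads(target.read_text())
```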
References¶
- Altair Home Page Altair: Declarative Visualization in Python
- Vega-Lite Home Page Vega-Lite – A Grammar of Interactive Graphics
- Insee 2016 Marriage Data Les mariages en 2016
- France-GeoJSON Contours des régions, départements, arrondissements, cantons et communes de France (métropole et départements d'outre-mer) au format GeoJSON
- MapShaper Tools for editing Shapefile, GeoJSON, TopoJSON and CSV files
- Vega-Lite Documentation: Data Describe JSON Data formatting