Altair: Interactive Plots

Introduction to Altair¶

Altair is a Python library that generates JSON output complying with the Vega-Lite specification, so understanding Altair requires understanding Vega-Lite. The Vega-Lite specification describes how data are aggregated and transformed, and which form of plot is used (histogram, bar, line, scatter, and others). Vega-Lite is itself based on a more verbose specification simply named Vega.

The JSON is then embedded in an HTML page, where a JavaScript library reads it and renders the plots into a canvas.
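To make this concrete, here is a minimal sketch of what such a specification and its host page can look like. The spec is written by hand as a Python dictionary; the SEXE1 field and the data path come from the charts in this article, while the build_page helper, the "vis" element id and the CDN major versions are assumptions made for illustration.

```python
import json

# A hand-written Vega-Lite specification: count the records per value of SEXE1.
spec = {
    "data": {"url": "data/mar2016.json"},
    "mark": "bar",
    "encoding": {
        "x": {"aggregate": "count", "type": "quantitative"},
        "y": {"field": "SEXE1", "type": "ordinal"},
    },
}

# Hypothetical page template; the CDN major versions are assumptions.
PAGE = """<!DOCTYPE html>
<html>
<head>
  <script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
  <script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
</head>
<body>
  <div id="vis"></div>
  <script>vegaEmbed("#vis", {spec});</script>
</body>
</html>"""


def build_page(spec):
    """Return an HTML page that asks vega-embed to render the given spec."""
    return PAGE.replace("{spec}", json.dumps(spec))


html = build_page(spec)
```

In practice Altair and its renderers take care of this step; the sketch only shows that the chart is, in the end, a plain JSON document handed to a JavaScript renderer.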

What sets Altair/Vega apart from other plotting frameworks (such as plotly and bokeh, which also use a JSON specification format) is how the Vega specification has been engineered: as an end in itself rather than an internal object meant to be hidden at all costs. As a result, Vega is very well documented, with many examples, and it is easy to dive into. From the Vega point of view, Altair is a very pleasant way to write Vega-Lite plots, with high-level objects and a one-to-one mapping with low-level objects.

In this article, we'll simply go through a data visualization with Altair, explaining the logic Vega uses to plot data.

Packages and Setup¶

In [1]:

import numpy as np
import pandas as pd

We also import Altair under the alias alt:

In [2]:

import altair as alt
alt.renderers.set_embed_options(theme='light');

As a practice dataset, we use a sample of 500 records from a larger one, the French 2016 marriage records (see the Appendix for the normalization process).

In [3]:

data_url = 'data/mar2016.json'

Simple Statistics¶

In [4]:

alt.Chart(data_url
).mark_bar(
).encode(x='count()',
         y='SEXE1:O'
).properties(height=125,
             width=250
).configure(autosize='fit')

Out[4]:

The basic bricks of Altair are:

alt.Chart creates a Chart with data_url as the source of the data
mark_bar describes the kind of plot wanted, here a bar plot
encode describes the data to display
- x aggregates by counting
- y groups by SEXE1; :O tells Vega-Lite that SEXE1 is an ordinal (ordered categorical) variable

Maps with Altair¶

We used a simplified GeoJSON (thanks to MapShaper) taken from France-GeoJSON.

In [5]:

france_url='data/departements-avec-outre-mer.json'
france = alt.Data(url=france_url, format=alt.DataFormat(property='features', type='json'))
france_base = alt.Chart(france).mark_geoshape(fill='lightgray')

plot_france = (france_base +
alt.Chart(data_url
).transform_lookup(default='0',
                  as_='geo',
                  lookup='DEPDOM',
                  from_=alt.LookupData(data=france, key='properties.code')
).mark_geoshape().encode(alt.Color('count()',scale=alt.Scale(type='log')),
                        alt.Shape(field="geo", type="geojson"),
                        tooltip=['count()', 'DEPDOM:N'])
).project(scale=900,
         center=[2, 46.5]).properties(
    title='Departement'
).configure(autosize='fit')

plot_france

Out[5]:

Broadly, we have done two things. First, we defined the geographic data we will use. Then, we joined them to aggregated data from our marriage dataset.

Data Definition¶

url location of the data file
format a DataFormat describing how to parse the file
- property: the property of the JSON document that contains the records (here features)
- type: declares JSON as the expected format
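As an illustration of what the property setting selects, the sketch below parses a tiny hand-made GeoJSON FeatureCollection and keeps only the list stored under features. The extract_records helper, the "nom" property, the point geometries and their coordinates are placeholders for illustration; the real file holds department outlines.

```python
import json

# A tiny hand-made FeatureCollection standing in for the real file.
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"code": "01", "nom": "Ain"},
         "geometry": {"type": "Point", "coordinates": [5.3, 46.1]}},
        {"type": "Feature",
         "properties": {"code": "38", "nom": "Isère"},
         "geometry": {"type": "Point", "coordinates": [5.6, 45.3]}},
    ],
})


def extract_records(text, prop):
    """Parse JSON text and return the list stored under the given property."""
    return json.loads(text)[prop]


# Each record is one feature; its department code lives in properties.code.
records = extract_records(geojson_text, "features")
codes = [record["properties"]["code"] for record in records]
```

Each extracted record is a whole feature, which is why the join key used later is written properties.code.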

Join two datasets together¶

To extend one dataset with data from another, we use a Lookup transform, which Altair wraps in the transform_lookup function. It is essentially a LEFT JOIN. Here are the parameters we used:

default value used when no matching record exists in the secondary dataset
as_ name of the appended field
lookup name of the field we are joining on in the dataset we want to extend
from_ a LookupData, which names the secondary dataset and the field (key) of that dataset to join on
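The semantics of these four parameters can be sketched in plain Python. This is not how Vega implements the transform, only an illustration of the left-join behaviour under the same parameter names; the lookup and get_path helpers and the toy records are hypothetical.

```python
def get_path(record, dotted):
    """Follow a dotted path such as 'properties.code' inside nested dicts."""
    for part in dotted.split('.'):
        record = record[part]
    return record


def lookup(primary, secondary, lookup_field, key, as_, default=None):
    """Attach to each primary row the secondary record whose key matches."""
    index = {get_path(rec, key): rec for rec in secondary}
    return [{**row, as_: index.get(row[lookup_field], default)}
            for row in primary]


# Toy data: two marriages, only one of which has a known department.
marriages = [{'DEPDOM': '01'}, {'DEPDOM': '99'}]
departments = [{'properties': {'code': '01'}, 'geometry': None}]

joined = lookup(marriages, departments,
                lookup_field='DEPDOM', key='properties.code',
                as_='geo', default='0')
```

Every primary row is kept: the first one receives the whole matching feature under geo, while the second one, which has no match, receives the default value.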

Encode¶

Color the number of marriages, counted per department and shown on a log scale.
Shape since the primary dataset is not imported using alt.topo_feature, we need to tell Altair where to find the GeoJSON data. Because the Lookup transform stored it in the geo field, we simply set the field to "geo".
Tooltip defines the fields displayed when the cursor passes over a department.
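The Color encoding in the map uses alt.Scale(type='log'). A plausible reason, stated here as an assumption, is that counts per department differ by orders of magnitude, so a linear scale would squeeze most departments into the same shade. The sketch below compares where a value falls on a linear and on a log scale; the helper names and the sample counts are illustrative.

```python
import math


def linear_position(value, lo, hi):
    """Relative position of value in [lo, hi] on a linear scale."""
    return (value - lo) / (hi - lo)


def log_position(value, lo, hi):
    """Relative position of value in [lo, hi] on a log scale (0 = lo, 1 = hi)."""
    return (math.log(value) - math.log(lo)) / (math.log(hi) - math.log(lo))


counts = [1, 10, 100]
lin_pos = [linear_position(c, 1, 100) for c in counts]
log_pos = [log_position(c, 1, 100) for c in counts]
```

On the linear scale a count of 10 sits very close to the minimum, whereas on the log scale it lands halfway between 1 and 100, which keeps small counts distinguishable.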

Appendix: Data Normalization¶

We describe how we normalized the dataset.

The data source is the 2016 French marriage records published by INSEE, the French national statistics bureau, which provides the following files:

mar2016.dbf: the data itself
varlist_mariages.dbf: describes the variables (type, name, length)
varmod_mariages.dbf: describes the encoding of the categorical variables

According to the documentation, there are 232,725 records.

Since the data are stored as dbf, we use the simpledbf package to convert them to pandas dataframes.

In [6]:

from simpledbf import Dbf5


mar2016_dbf = Dbf5('data/mar2016.dbf', codec='latin1')
varlist_mariages_dbf = Dbf5('data/varlist_mariages.dbf', codec='latin1')
varmod_mariages_dbf = Dbf5('data/varmod_mariages.dbf', codec='latin1')

mar2016 = mar2016_dbf.to_dataframe()
varlist_mariages = varlist_mariages_dbf.to_dataframe()
varmod_mariages = varmod_mariages_dbf.to_dataframe()

PyTables is not installed. No support for HDF output.

Let's check the dimensions:

In [7]:

mar2016.shape

Out[7]:

(232725, 18)

In [8]:

mar2016.head(5)

Out[8]:

	ANAIS1	DEPNAIS1	SEXE1	INDNAT1	ETAMAT1	ANAIS2	DEPNAIS2	SEXE2	INDNAT2	ETAMAT2	AMAR	MMAR	JSEMAINE	DEPMAR	DEPDOM	TUDOM	TUCOM	NBENFCOM
0	1986	01	F	1	1	1984	70	M	1	1	2016	06	6	01	99	9	NaN	O
1	1982	72	M	1	1	1983	99	F	2	1	2016	07	6	01	99	9	NaN	N
2	1976	01	M	1	1	1974	38	M	1	1	2016	09	6	01	99	9	NaN	N
3	1988	01	M	1	1	1988	57	F	1	1	2016	08	6	01	99	9	NaN	N
4	1986	99	M	1	1	1989	13	F	1	1	2016	11	6	01	99	9	NaN	N

We set the right type to the categorical variables:

In [9]:

categorical_variables = ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM', 'NBENFCOM']
mar2016[categorical_variables] = mar2016[categorical_variables].replace(np.nan, 'nan').astype('category')
mar2016['NBENFCOM'] = mar2016['NBENFCOM'].cat.rename_categories({'O': 'Y'})
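The same steps can be tried on a toy frame. This is a small sketch under assumptions: the three rows are made up, and fillna is used here in place of replace to substitute the missing values before the conversion.

```python
import numpy as np
import pandas as pd

# Made-up rows using two of the real column names.
toy = pd.DataFrame({'TUCOM': ['P', np.nan, 'M'],
                    'NBENFCOM': ['O', 'N', 'N']})

# Turn missing values into an explicit 'nan' label, then convert to category.
for column in ['TUCOM', 'NBENFCOM']:
    toy[column] = toy[column].fillna('nan').astype('category')

# Rename the French 'O' (oui) category to 'Y' (yes).
toy['NBENFCOM'] = toy['NBENFCOM'].cat.rename_categories({'O': 'Y'})

tucom_categories = list(toy['TUCOM'].cat.categories)
nbenfcom_values = list(toy['NBENFCOM'])
```

Keeping 'nan' as a regular category means the missing codes can later be renamed like any other category.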

We append an approximate English translation of the variable names.

In [10]:

varlist_mariages['ENGLISH'] = ['Marriage Year',
                              'Birth Year Spouse 1',
                              'Birth Year Spouse 2',
                              'Residence Department after Marriage',
                              'Marriage Department',
                              'Birth Departement Spouse 1',
                              'Birth Departement Spouse 2',
                              'Prior Matrimonial Status Spouse 1',
                              'Prior Matrimonial Status Spouse 2',
                              'Nationality Spouse 1',
                              'Nationality Spouse 2',
                              'Marriage Weekday',
                              'Marriage Month',
                              'Children before Marriage',
                              'Sex Spouse 1',
                              'Sex Spouse 2',
                              'Commune',
                              'Urban Unit']

varlist_mariages

Out[10]:

	VARIABLE	LIBELLE	TYPE	LONGUEUR	ENGLISH
0	AMAR	Année du mariage	CHAR	4.0	Marriage Year
1	ANAIS1	Année de naissance du conjoint 1	CHAR	4.0	Birth Year Spouse 1
2	ANAIS2	Année de naissance du conjoint 2	CHAR	4.0	Birth Year Spouse 2
3	DEPDOM	Département de domicile après le mariage	CHAR	3.0	Residence Department after Marriage
4	DEPMAR	Département de mariage	CHAR	3.0	Marriage Department
5	DEPNAIS1	Département de naissance du conjoint 1	CHAR	3.0	Birth Departement Spouse 1
6	DEPNAIS2	Département de naissance du conjoint 2	CHAR	3.0	Birth Departement Spouse 2
7	ETAMAT1	État matrimonial antérieur du conjoint 1	CHAR	1.0	Prior Matrimonial Status Spouse 1
8	ETAMAT2	État matrimonial antérieur du conjoint 2	CHAR	1.0	Prior Matrimonial Status Spouse 2
9	INDNAT1	Indicateur de nationalité du conjoint 1	CHAR	1.0	Nationality Spouse 1
10	INDNAT2	Indicateur de nationalité du conjoint 2	CHAR	1.0	Nationality Spouse 2
11	JSEMAINE	Jour du mariage dans la semaine	CHAR	1.0	Marriage Weekday
12	MMAR	Mois du mariage	CHAR	2.0	Marriage Month
13	NBENFCOM	Enfants en commun avant le mariage	CHAR	1.0	Children before Marriage
14	SEXE1	Sexe du conjoint 1	CHAR	1.0	Sex Spouse 1
15	SEXE2	Sexe du conjoint 2	CHAR	1.0	Sex Spouse 2
16	TUCOM	Tranche de commune du lieu de domicile des époux	CHAR	1.0	Commune
17	TUDOM	Tranche d'unité urbaine 2010 du lieu de domici...	CHAR	1.0	Urban Unit

For varmod_mariages, we focus on the code-encoded variables (mostly integer codes).

In [11]:

replace = {'INDNA': {'Française':'French',
                     'Étrangère':'Foreigner'},
           'ETAMA': {'Divorcé': 'Divorced',
                     'Célibataire': 'Single',
                     'Veuf': 'Widow'},
           'TUDOM': {'Agglomération de Paris': 'Paris Agglomeration',
                     'COM ou étranger': 'Oversea',
                     'Commune rurale': 'Country side'
                                       ' (fewer than 2,000 inhabitants)',
                     'Indéterminé': 'Unknown'},
           'TUCOM': {'Indéterminé ou pays étranger': 'Unknown or foreign country',
                     'Terres australes et antarctiques,'
                     ' COM non précisé': 'Austral or Antartic, unknown oversea'}
}
str_replace = {
    'TUDOM': [('Unité urbaine de', 'Urban Unit from'),
              ('à', 'to'),
              ('habitant', 'inhabitant')],
    'TUCOM': [('de plus de', 'with more than'),
              ('de moins de', 'with fewer than'),
              ('habitant', 'inhabitant')]
}

translation = varmod_mariages.query("VARIABLE in ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM']").copy()

def translate(v):
    var = v['VARIABLE'][:5]
    lib = v['MODLIBELLE']
    if var in replace and lib in replace[var]:
        return replace[var][lib]
    str_rep = str_replace[var] if var in str_replace else None
    if str_rep is not None:
        ret = lib
        for (old, new) in str_rep:
            ret = ret.replace(old, new)
        return ret
    
    return lib

translation['ENGLISH'] = translation.apply(translate, 1).str.replace('\xa0', ',')
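The translation works in two stages: an exact match on the whole label first, then ordered substring replacements. The standalone sketch below reproduces that logic on a reduced rule set so it can be run without the INSEE files. The translate_label name and the 'Masculin' label are hypothetical, and the input assumes that thousands are separated by a non-breaking space, as the final str.replace suggests.

```python
# Reduced rule set, copied from the TUDOM entries above.
replace = {'TUDOM': {'Commune rurale':
                     'Country side (fewer than 2,000 inhabitants)'}}
str_replace = {'TUDOM': [('Unité urbaine de', 'Urban Unit from'),
                         ('à', 'to'),
                         ('habitant', 'inhabitant')]}


def translate_label(var, lib):
    """Translate one label: exact match first, then substring rules in order."""
    if lib in replace.get(var, {}):
        return replace[var][lib]
    for old, new in str_replace.get(var, []):
        lib = lib.replace(old, new)
    return lib


exact = translate_label('TUDOM', 'Commune rurale')
partial = translate_label(
    'TUDOM', 'Unité urbaine de 5\xa0000 à 9\xa0999 habitants'
).replace('\xa0', ',')
untouched = translate_label('SEXE1', 'Masculin')
```

Labels of variables without any rule are returned unchanged, which is why the function can be applied to every row safely.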

In [12]:

translation

Out[12]:

	VARIABLE	MODALITE	MODLIBELLE	ENGLISH
433	ETAMAT1	1	Célibataire	Single
434	ETAMAT1	3	Veuf	Widow
435	ETAMAT1	4	Divorcé	Divorced
436	ETAMAT2	1	Célibataire	Single
437	ETAMAT2	3	Veuf	Widow
438	ETAMAT2	4	Divorcé	Divorced
439	INDNAT1	1	Française	French
440	INDNAT1	2	Étrangère	Foreigner
441	INDNAT2	1	Française	French
442	INDNAT2	2	Étrangère	Foreigner
457	TUCOM	NaN	Indéterminé ou pays étranger	Unknown or foreign country
458	TUCOM	P	Commune de plus de 10 000 habitants	Commune with more than 10,000 inhabitants
459	TUCOM	M	Commune de moins de 10 000 habitants	Commune with fewer than 10,000 inhabitants
460	TUCOM	A	Terres australes et antarctiques, COM non précisé	Austral or Antartic, unknown oversea
461	TUDOM	NaN	Indéterminé	Unknown
462	TUDOM	0	Commune rurale	Country side (fewer than 2,000 inhabitants)
463	TUDOM	1	Unité urbaine de 2 000 à 4 999 habitants	Urban Unit from 2 000 to 4,999 inhabitants
464	TUDOM	2	Unité urbaine de 5 000 à 9 999 habitants	Urban Unit from 5,000 to 9,999 inhabitants
465	TUDOM	3	Unité urbaine de 10 000 à 19 999 habitants	Urban Unit from 10,000 to 19,999 inhabitants
466	TUDOM	4	Unité urbaine de 20 000 à 49 999 habitants	Urban Unit from 20,000 to 49,999 inhabitants
467	TUDOM	5	Unité urbaine de 50 000 à 99 999 habitants	Urban Unit from 50,000 to 99,999 inhabitants
468	TUDOM	6	Unité urbaine de 100 000 à 199 999 habitants	Urban Unit from 100,000 to 199,999 inhabitants
469	TUDOM	7	Unité urbaine de 200 000 à 1 999 999 habitants	Urban Unit from 200,000 to 1,999,999 inhabitants
470	TUDOM	8	Agglomération de Paris	Paris Agglomeration
471	TUDOM	9	COM ou étranger	Oversea

In [20]:

for (var, cat) in translation.replace(np.nan, 'nan'
                                     ).set_index('MODALITE'
                                     ).groupby('VARIABLE'
                                     ).apply(lambda d:
                                             d['ENGLISH'].to_dict()
                                     ).items():
    mar2016[var] = mar2016[var].cat.rename_categories(cat)

Now we have a fully English dataset:

In [14]:

mar2016.sample(5, random_state=5010)

Out[14]:

	ANAIS1	DEPNAIS1	SEXE1	INDNAT1	ETAMAT1	ANAIS2	DEPNAIS2	SEXE2	INDNAT2	ETAMAT2	AMAR	MMAR	JSEMAINE	DEPMAR	DEPDOM	TUDOM	TUCOM	NBENFCOM
36783	1983	78	M	French	Single	1985	83	F	French	Single	2016	10	6	23	23	Urban Unit from 5,000 to 9,999 inhabitants	Commune with fewer than 10,000 inhabitants	Y
64365	1986	34	M	French	Single	1983	34	F	French	Single	2016	05	6	34	34	Country side (fewer than 2,000 inhabitants)	Commune with fewer than 10,000 inhabitants	Y
73942	1988	38	F	French	Single	1979	38	M	French	Divorced	2016	06	6	38	38	Urban Unit from 200,000 to 1,999,999 inhabitants	Commune with more than 10,000 inhabitants	N
97354	1992	51	M	French	Single	1986	51	F	French	Single	2016	04	6	51	51	Country side (fewer than 2,000 inhabitants)	Commune with fewer than 10,000 inhabitants	N
232683	1987	69	M	French	Single	1982	69	F	French	Single	2016	05	6	38	987	Oversea	Commune with more than 10,000 inhabitants	Y

We convert the year and month variables to integers:

In [15]:

year_variables = ['ANAIS1', 'ANAIS2', 'AMAR', 'MMAR']
mar2016[year_variables] = mar2016[year_variables].astype(int)

We give the weekday codes readable names:

In [16]:

mar2016['JSEMAINE'] = mar2016['JSEMAINE'
                             ].astype('category'
                             ).cat.rename_categories(['Monday',
                                                      'Tuesday',
                                                      'Wednesday',
                                                      'Thursday',
                                                      'Friday',
                                                      'Saturday',
                                                      'Sunday'])

We create a datetime column for the date of marriage (set to the first of the month, since we don't have the day). We then drop the year and month columns to avoid duplicating the data.

In [17]:

mar2016['DateMAR'] = pd.to_datetime(mar2016[['AMAR', 'MMAR'
                                            ]].assign(day=1
                                             ).rename(columns={'AMAR': 'year',
                                                               'MMAR': 'month'}))
mar2016.drop(columns=['AMAR',
                      'MMAR'],
             inplace=True)
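pd.to_datetime can assemble dates from a frame whose columns are named year, month and day, which is why the columns are renamed and a constant day is added. A minimal sketch on two made-up rows:

```python
import pandas as pd

# Two made-up marriages: June and November 2016.
toy = pd.DataFrame({'AMAR': [2016, 2016], 'MMAR': [6, 11]})

# to_datetime expects columns named year, month and day.
parts = toy[['AMAR', 'MMAR']].rename(columns={'AMAR': 'year',
                                              'MMAR': 'month'}).assign(day=1)
toy['DateMAR'] = pd.to_datetime(parts)

# The original columns are now redundant.
toy = toy.drop(columns=['AMAR', 'MMAR'])
```

Every date falls on the first of its month, so DateMAR should only be read at month resolution.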

We can consider the dataset as ready.

In [18]:

mar2016.sample(5, random_state=5010)

Out[18]:

	ANAIS1	DEPNAIS1	SEXE1	INDNAT1	ETAMAT1	ANAIS2	DEPNAIS2	SEXE2	INDNAT2	ETAMAT2	JSEMAINE	DEPMAR	DEPDOM	TUDOM	TUCOM	NBENFCOM	DateMAR
36783	1983	78	M	French	Single	1985	83	F	French	Single	Saturday	23	23	Urban Unit from 5,000 to 9,999 inhabitants	Commune with fewer than 10,000 inhabitants	Y	2016-10-01
64365	1986	34	M	French	Single	1983	34	F	French	Single	Saturday	34	34	Country side (fewer than 2,000 inhabitants)	Commune with fewer than 10,000 inhabitants	Y	2016-05-01
73942	1988	38	F	French	Single	1979	38	M	French	Divorced	Saturday	38	38	Urban Unit from 200,000 to 1,999,999 inhabitants	Commune with more than 10,000 inhabitants	N	2016-06-01
97354	1992	51	M	French	Single	1986	51	F	French	Single	Saturday	51	51	Country side (fewer than 2,000 inhabitants)	Commune with fewer than 10,000 inhabitants	N	2016-04-01
232683	1987	69	M	French	Single	1982	69	F	French	Single	Saturday	38	987	Oversea	Commune with more than 10,000 inhabitants	Y	2016-05-01

To keep this article easy to display on any device (especially mobile ones), we subsample the dataset and save it once and for all. It is advisable to save the data to an external file when using Altair.

In [19]:

from pathlib import Path
if not Path(data_url).is_file():
    mar2016 = mar2016.sample(500, random_state=5010)

    mar2016.to_json(data_url, orient='records')
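With orient='records', the file contains a JSON array with one object per row, which is the shape a chart can load directly from a URL. The sketch below writes a sample of a made-up frame to a temporary file and reads it back with the standard json module; the frame contents and the file name are illustrative.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# A made-up frame with 20 rows standing in for the full dataset.
full = pd.DataFrame({'DEPDOM': [f'{i:02d}' for i in range(1, 21)],
                     'SEXE1': ['M', 'F'] * 10})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'sample.json'
    # Only write the file if it does not exist yet, as in the cell above.
    if not out_path.is_file():
        sample = full.sample(5, random_state=5010)
        sample.to_json(out_path, orient='records')
    records = json.loads(out_path.read_text())
```

Fixing random_state makes the sample reproducible, and the existence check keeps an already published sample from being overwritten on a later run.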

References¶

Altair Home Page Altair: Declarative Visualization in Python
Vega-Lite Home Page Vega-Lite – A Grammar of Interactive Graphics
Insee 2016 Marriage Data Les mariages en 2016
France-GeoJSON Contours des régions, départements, arrondissements, cantons et communes de France (métropole et départements d'outre-mer) au format GeoJSON
MapShaper Tools for editing Shapefile, GeoJSON, TopoJSON and CSV files
Vega-Lite Documentation: Data Describe JSON Data formatting