Altair: Interactive Plots

Introduction to Altair

Altair is a Python library that generates JSON output conforming to the Vega-Lite specification, so understanding Altair largely comes down to understanding Vega-Lite. A Vega-Lite specification describes how data are aggregated and transformed, and which form of plot is used (histogram, bar, line, scatter, and others). Vega-Lite is itself based on a more verbose specification simply named Vega.

The JSON is then embedded within an HTML page, where a JavaScript library reads it and renders the plots into a canvas.

What sets Altair/Vega apart from other plotting frameworks (such as plotly and bokeh, which also use a JSON specification format) is how the Vega specification has been engineered: as an end in itself rather than an internal object meant to be hidden at all costs. As a result, Vega has very good documentation, with many examples, and it is easy to dive into. From the Vega point of view, Altair is a very pleasant way to write Vega-Lite plots, with high-level objects and a one-to-one mapping to the low-level ones.

In this article, we simply go through a data visualization with Altair, explaining the Vega logic used to plot the data.

Packages and Setup

In [1]:
import numpy as np
import pandas as pd

We also import Altair under the alias alt:

In [2]:
import altair as alt
alt.renderers.set_embed_options(theme='light');

As a practice dataset, we use a sample of 500 records from a larger one, the French 2016 marriage records (see the Appendix for the normalization process).

In [3]:
data_url = 'data/mar2016.json'

Simple Statistics

In [4]:
alt.Chart(data_url
).mark_bar(
).encode(x='count()',
         y='SEXE1:O'
).properties(height=125,
             width=250
).configure(autosize='fit')
Out[4]:

The basic building blocks of Altair are:

  • alt.Chart creates a chart with data_url as the source of the data
  • mark_bar describes the kind of plot wanted, here a bar plot
  • encode describes the data to display
    • x aggregates by counting
    • y groups by SEXE1; :O tells Vega-Lite that SEXE1 is an ordinal (discrete) variable
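
To make the mapping to Vega-Lite concrete, the sketch below writes by hand a specification roughly equivalent to the chart above, using only the standard library. The exact JSON emitted by Altair differs in detail (it adds a schema reference and configuration, and the layout varies between versions), so treat this as an illustration rather than Altair's literal output.

```python
import json

# Hand-written Vega-Lite specification approximating the bar chart above.
# This is an illustration, not the literal output of Altair.
spec = {
    "data": {"url": "data/mar2016.json"},
    "mark": "bar",
    "encoding": {
        # x='count()' -> aggregate the rows by counting them
        "x": {"aggregate": "count", "type": "quantitative"},
        # y='SEXE1:O' -> one bar per value of SEXE1, typed as ordinal
        "y": {"field": "SEXE1", "type": "ordinal"},
    },
    "height": 125,
    "width": 250,
}

spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Reading the dictionary next to the Altair call shows the one-to-one mapping: each method or keyword corresponds to one key of the specification.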

Maps with Altair

We used a simplified GeoJSON (thanks to MapShaper) taken from France-GeoJSON.

In [5]:
france_url='data/departements-avec-outre-mer.json'
france = alt.Data(url=france_url, format=alt.DataFormat(property='features', type='json'))
france_base = alt.Chart(france).mark_geoshape(fill='lightgray')

plot_france = (france_base +
alt.Chart(data_url
).transform_lookup(default='0',
                  as_='geo',
                  lookup='DEPDOM',
                  from_=alt.LookupData(data=france, key='properties.code')
).mark_geoshape().encode(alt.Color('count()',scale=alt.Scale(type='log')),
                        alt.Shape(field="geo", type="geojson"),
                        tooltip=['count()', 'DEPDOM:N'])
).project(scale=900,
         center=[2, 46.5]).properties(
    title='Departement'
).configure(autosize='fit')

plot_france
Out[5]:

Broadly, we have done two things. First, we defined the geographic data we will use. Then, we joined them to aggregated data from our marriage dataset.

Data Definition

  • url where the data is stored
  • format a DataFormat object describing how to parse the file
    • property: property of the JSON containing the data
    • type: specify JSON as the expected format
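
As an illustration of what property='features' selects, the sketch below parses a tiny GeoJSON FeatureCollection with the standard library. The two features are made up and carry no geometry; only the properties.code field mirrors the key used in the chart above.

```python
import json

# Hypothetical, heavily trimmed FeatureCollection (geometries omitted).
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"code": "01", "nom": "Ain"}, "geometry": null},
    {"type": "Feature", "properties": {"code": "02", "nom": "Aisne"}, "geometry": null}
  ]
}
"""

document = json.loads(geojson_text)

# property='features' asks Vega-Lite to use this array as the data rows,
# rather than the top-level object.
rows = document["features"]

# Each row then exposes nested fields such as properties.code.
codes = [row["properties"]["code"] for row in rows]
print(codes)
```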

Join two datasets together

To extend one dataset with data from another one, we must use what's named a Lookup transform, wrapped by Altair with the transform_lookup function. It is basically a LEFT JOIN. Let's describe the parameters we used:

  • default value used when the secondary dataset has no matching record
  • as_ name of the appended field
  • lookup name of the field we are joining on from the dataset we want to expand
  • from_ should be a LookupData, which is merely a way to define the field of the secondary dataset we want to join on
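
The LEFT JOIN behaviour of these parameters can be sketched in plain Python. The function lookup below is a hypothetical helper written for this illustration, not part of Altair or Vega-Lite, and the records are toy values.

```python
# Primary dataset: one record per marriage (toy values).
marriages = [
    {"DEPDOM": "01"},
    {"DEPDOM": "02"},
    {"DEPDOM": "99"},  # no matching department in the secondary dataset
]

# Secondary dataset: one feature per department (toy values).
departments = [
    {"properties": {"code": "01"}},
    {"properties": {"code": "02"}},
]


def lookup(primary, secondary, lookup_field, key, as_, default):
    """Append to each primary record the secondary record with a matching key."""
    index = {key(record): record for record in secondary}
    return [
        {**record, as_: index.get(record[lookup_field], default)}
        for record in primary
    ]


joined = lookup(
    marriages,
    departments,
    lookup_field="DEPDOM",
    key=lambda feature: feature["properties"]["code"],
    as_="geo",
    default="0",
)
print(joined)
```

Every primary record is kept, which is what makes this a left join: the third record has no matching department and receives the default value instead.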

Encode

  • Color we count the number of marriages and map it to a color on a logarithmic scale.
  • Shape since the primary dataset is not imported using alt.topo_feature, we need to tell Altair where to find the GeoJSON data. Thanks to the Lookup transform, we simply set it to "geo".
  • Tooltip defines the fields displayed when the cursor passes over a department.
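
The aggregation behind the Color channel can be sketched with the standard library. The department codes below are toy values, and the sketch only mimics what count() and a logarithmic scale compute; it is not how Vega-Lite implements them.

```python
import math
from collections import Counter

# Toy DEPDOM values: ten marriages in one department, one in another.
depdom = ["75"] * 10 + ["01"]

# What count() computes once the records are grouped by department.
counts = Counter(depdom)

# A logarithmic scale works on the logarithm of the counts, so that
# departments with few marriages remain distinguishable from each other.
log_counts = {dep: math.log10(n) for dep, n in counts.items()}
print(counts, log_counts)
```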

Appendix: Data Normalization

We describe how we normalized the dataset.

The data source is the 2016 French marriage records published by INSEE, the French national statistics bureau, which provides the following files:

  • mar2016.dbf: the data itself
  • varlist_mariages.dbf: describes the variables (type, name, length)
  • varmod_mariages.dbf: describes the encoding of the categorical variables

According to the documentation, there are 232,725 records.

Since the data are stored as dbf, we use the simpledbf package to convert them to pandas dataframes.

In [6]:
from simpledbf import Dbf5


mar2016_dbf = Dbf5('data/mar2016.dbf', codec='latin1')
varlist_mariages_dbf = Dbf5('data/varlist_mariages.dbf', codec='latin1')
varmod_mariages_dbf = Dbf5('data/varmod_mariages.dbf', codec='latin1')

mar2016 = mar2016_dbf.to_dataframe()
varlist_mariages = varlist_mariages_dbf.to_dataframe()
varmod_mariages = varmod_mariages_dbf.to_dataframe()
PyTables is not installed. No support for HDF output.

Let's check the dimensions:

In [7]:
mar2016.shape
Out[7]:
(232725, 18)
In [8]:
mar2016.head(5)
Out[8]:
ANAIS1 DEPNAIS1 SEXE1 INDNAT1 ETAMAT1 ANAIS2 DEPNAIS2 SEXE2 INDNAT2 ETAMAT2 AMAR MMAR JSEMAINE DEPMAR DEPDOM TUDOM TUCOM NBENFCOM
0 1986 01 F 1 1 1984 70 M 1 1 2016 06 6 01 99 9 NaN O
1 1982 72 M 1 1 1983 99 F 2 1 2016 07 6 01 99 9 NaN N
2 1976 01 M 1 1 1974 38 M 1 1 2016 09 6 01 99 9 NaN N
3 1988 01 M 1 1 1988 57 F 1 1 2016 08 6 01 99 9 NaN N
4 1986 99 M 1 1 1989 13 F 1 1 2016 11 6 01 99 9 NaN N

We set the proper type for the categorical variables:

In [9]:
categorical_variables = ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM', 'NBENFCOM']
mar2016[categorical_variables] = mar2016[categorical_variables].replace(np.nan, 'nan').astype('category')
mar2016['NBENFCOM'] = mar2016['NBENFCOM'].cat.rename_categories({'O': 'Y'})
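
The effect of these two steps can be checked on a toy column. This is a minimal sketch with made-up values; it uses fillna, which for this purpose does the same job as replacing NaN with the string 'nan'.

```python
import pandas as pd

# Toy column with one missing value.
toy = pd.DataFrame({"NBENFCOM": ["O", "N", None, "O"]})

# Missing values become the explicit string 'nan', so that they form
# their own category instead of being dropped from the categories.
toy["NBENFCOM"] = toy["NBENFCOM"].fillna("nan").astype("category")

# 'O' (oui) is renamed to 'Y' (yes); categories absent from the
# dictionary are left untouched.
toy["NBENFCOM"] = toy["NBENFCOM"].cat.rename_categories({"O": "Y"})

values = toy["NBENFCOM"].tolist()
print(values)
```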

We append an approximate English translation of the variables.

In [10]:
varlist_mariages['ENGLISH'] = ['Marriage Year',
                              'Birth Year Spouse 1',
                              'Birth Year Spouse 2',
                              'Residence Department after Marriage',
                              'Marriage Department',
                              'Birth Departement Spouse 1',
                              'Birth Departement Spouse 2',
                              'Prior Matrimonial Status Spouse 1',
                              'Prior Matrimonial Status Spouse 2',
                              'Nationality Spouse 1',
                              'Nationality Spouse 2',
                              'Marriage Weekday',
                              'Marriage Month',
                              'Children before Marriage',
                              'Sex Spouse 1',
                              'Sex Spouse 2',
                              'Commune',
                              'Urban Unit']

varlist_mariages
Out[10]:
VARIABLE LIBELLE TYPE LONGUEUR ENGLISH
0 AMAR Année du mariage CHAR 4.0 Marriage Year
1 ANAIS1 Année de naissance du conjoint 1 CHAR 4.0 Birth Year Spouse 1
2 ANAIS2 Année de naissance du conjoint 2 CHAR 4.0 Birth Year Spouse 2
3 DEPDOM Département de domicile après le mariage CHAR 3.0 Residence Department after Marriage
4 DEPMAR Département de mariage CHAR 3.0 Marriage Department
5 DEPNAIS1 Département de naissance du conjoint 1 CHAR 3.0 Birth Departement Spouse 1
6 DEPNAIS2 Département de naissance du conjoint 2 CHAR 3.0 Birth Departement Spouse 2
7 ETAMAT1 État matrimonial antérieur du conjoint 1 CHAR 1.0 Prior Matrimonial Status Spouse 1
8 ETAMAT2 État matrimonial antérieur du conjoint 2 CHAR 1.0 Prior Matrimonial Status Spouse 2
9 INDNAT1 Indicateur de nationalité du conjoint 1 CHAR 1.0 Nationality Spouse 1
10 INDNAT2 Indicateur de nationalité du conjoint 2 CHAR 1.0 Nationality Spouse 2
11 JSEMAINE Jour du mariage dans la semaine CHAR 1.0 Marriage Weekday
12 MMAR Mois du mariage CHAR 2.0 Marriage Month
13 NBENFCOM Enfants en commun avant le mariage CHAR 1.0 Children before Marriage
14 SEXE1 Sexe du conjoint 1 CHAR 1.0 Sex Spouse 1
15 SEXE2 Sexe du conjoint 2 CHAR 1.0 Sex Spouse 2
16 TUCOM Tranche de commune du lieu de domicile des époux CHAR 1.0 Commune
17 TUDOM Tranche d'unité urbaine 2010 du lieu de domici... CHAR 1.0 Urban Unit

For varmod_mariages, we focus on the integer-encoded variables.

In [11]:
replace = {'INDNA': {'Française':'French',
                     'Étrangère':'Foreigner'},
           'ETAMA': {'Divorcé': 'Divorced',
                     'Célibataire': 'Single',
                     'Veuf': 'Widow'},
           'TUDOM': {'Agglomération de Paris': 'Paris Agglomeration',
                     'COM ou étranger': 'Oversea',
                     'Commune rurale': 'Country side'
                                       ' (fewer than 2,000 inhabitants)',
                     'Indéterminé': 'Unknown'},
           'TUCOM': {'Indéterminé ou pays étranger': 'Unknown or foreign country',
                     'Terres australes et antarctiques,'
                     ' COM non précisé': 'Austral or Antartic, unknown oversea'}
}
str_replace = {
    'TUDOM': [('Unité urbaine de', 'Urban Unit from'),
              ('à', 'to'),
              ('habitant', 'inhabitant')],
    'TUCOM': [('de plus de', 'with more than'),
              ('de moins de', 'with fewer than'),
              ('habitant', 'inhabitant')]
}

translation = varmod_mariages.query("VARIABLE in ['INDNAT1', 'ETAMAT1', 'INDNAT2', 'ETAMAT2', 'TUDOM', 'TUCOM']").copy()

def translate(v):
    var = v['VARIABLE'][:5]
    lib = v['MODLIBELLE']
    if var in replace and lib in replace[var]:
        return replace[var][lib]
    str_rep = str_replace[var] if var in str_replace else None
    if str_rep is not None:
        ret = lib
        for (old, new) in str_rep:
            ret = ret.replace(old, new)
        return ret
    
    return lib

translation['ENGLISH'] = translation.apply(translate, 1).str.replace('\xa0', ',')
In [12]:
translation
Out[12]:
VARIABLE MODALITE MODLIBELLE ENGLISH
433 ETAMAT1 1 Célibataire Single
434 ETAMAT1 3 Veuf Widow
435 ETAMAT1 4 Divorcé Divorced
436 ETAMAT2 1 Célibataire Single
437 ETAMAT2 3 Veuf Widow
438 ETAMAT2 4 Divorcé Divorced
439 INDNAT1 1 Française French
440 INDNAT1 2 Étrangère Foreigner
441 INDNAT2 1 Française French
442 INDNAT2 2 Étrangère Foreigner
457 TUCOM NaN Indéterminé ou pays étranger Unknown or foreign country
458 TUCOM P Commune de plus de 10 000 habitants Commune with more than 10,000 inhabitants
459 TUCOM M Commune de moins de 10 000 habitants Commune with fewer than 10,000 inhabitants
460 TUCOM A Terres australes et antarctiques, COM non précisé Austral or Antartic, unknown oversea
461 TUDOM NaN Indéterminé Unknown
462 TUDOM 0 Commune rurale Country side (fewer than 2,000 inhabitants)
463 TUDOM 1 Unité urbaine de 2 000 à 4 999 habitants Urban Unit from 2 000 to 4,999 inhabitants
464 TUDOM 2 Unité urbaine de 5 000 à 9 999 habitants Urban Unit from 5,000 to 9,999 inhabitants
465 TUDOM 3 Unité urbaine de 10 000 à 19 999 habitants Urban Unit from 10,000 to 19,999 inhabitants
466 TUDOM 4 Unité urbaine de 20 000 à 49 999 habitants Urban Unit from 20,000 to 49,999 inhabitants
467 TUDOM 5 Unité urbaine de 50 000 à 99 999 habitants Urban Unit from 50,000 to 99,999 inhabitants
468 TUDOM 6 Unité urbaine de 100 000 à 199 999 habitants Urban Unit from 100,000 to 199,999 inhabitants
469 TUDOM 7 Unité urbaine de 200 000 à 1 999 999 habitants Urban Unit from 200,000 to 1,999,999 inhabitants
470 TUDOM 8 Agglomération de Paris Paris Agglomeration
471 TUDOM 9 COM ou étranger Oversea
In [20]:
for (var, cat) in translation.replace(np.nan, 'nan'
                                     ).set_index('MODALITE'
                                     ).groupby('VARIABLE'
                                     ).apply(lambda d:
                                             d['ENGLISH'].to_dict()
                                     ).items():
    mar2016[var] = mar2016[var].cat.rename_categories(cat)
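
The loop above builds one {code: label} dictionary per variable and uses it to rename the categories. The sketch below reproduces the idea on a toy extract of the translation table; it builds the dictionaries with a plain dict comprehension instead of groupby().apply(), which is an equivalent formulation chosen here for readability.

```python
import pandas as pd

# Toy extract of the translation table.
translation_toy = pd.DataFrame({
    "VARIABLE": ["INDNAT1", "INDNAT1", "ETAMAT1", "ETAMAT1", "ETAMAT1"],
    "MODALITE": ["1", "2", "1", "3", "4"],
    "ENGLISH": ["French", "Foreigner", "Single", "Widow", "Divorced"],
})

# One {code: label} dictionary per variable.
mappings = {
    var: group.set_index("MODALITE")["ENGLISH"].to_dict()
    for var, group in translation_toy.groupby("VARIABLE")
}

# Toy dataset whose categorical columns still hold the raw codes.
toy = pd.DataFrame({
    "INDNAT1": pd.Categorical(["1", "2", "1"]),
    "ETAMAT1": pd.Categorical(["1", "4", "3"]),
})

# Replace each code by its English label.
for var, labels in mappings.items():
    toy[var] = toy[var].cat.rename_categories(labels)

print(toy)
```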

Now we have a fully English dataset:

In [14]:
mar2016.sample(5, random_state=5010)
Out[14]:
ANAIS1 DEPNAIS1 SEXE1 INDNAT1 ETAMAT1 ANAIS2 DEPNAIS2 SEXE2 INDNAT2 ETAMAT2 AMAR MMAR JSEMAINE DEPMAR DEPDOM TUDOM TUCOM NBENFCOM
36783 1983 78 M French Single 1985 83 F French Single 2016 10 6 23 23 Urban Unit from 5,000 to 9,999 inhabitants Commune with fewer than 10,000 inhabitants Y
64365 1986 34 M French Single 1983 34 F French Single 2016 05 6 34 34 Country side (fewer than 2,000 inhabitants) Commune with fewer than 10,000 inhabitants Y
73942 1988 38 F French Single 1979 38 M French Divorced 2016 06 6 38 38 Urban Unit from 200,000 to 1,999,999 inhabitants Commune with more than 10,000 inhabitants N
97354 1992 51 M French Single 1986 51 F French Single 2016 04 6 51 51 Country side (fewer than 2,000 inhabitants) Commune with fewer than 10,000 inhabitants N
232683 1987 69 M French Single 1982 69 F French Single 2016 05 6 38 987 Oversea Commune with more than 10,000 inhabitants Y

We convert the year and month variables to integers:

In [15]:
year_variables = ['ANAIS1', 'ANAIS2', 'AMAR', 'MMAR']
mar2016[year_variables] = mar2016[year_variables].astype(int)

We do the same with the weekday:

In [16]:
mar2016['JSEMAINE'] = mar2016['JSEMAINE'
                             ].astype('category'
                             ).cat.rename_categories(['Monday',
                                                      'Tuesday',
                                                      'Wednesday',
                                                      'Thursday',
                                                      'Friday',
                                                      'Saturday',
                                                      'Sunday'])

We create a datetime column for the date of marriage, using the first day of the month since the actual day is not available. We then drop the original columns to avoid duplicating the data.

In [17]:
mar2016['DateMAR'] = pd.to_datetime(mar2016[['AMAR', 'MMAR'
                                            ]].assign(day=1
                                             ).rename(columns={'AMAR': 'year',
                                                               'MMAR': 'month'}))
mar2016.drop(columns=['AMAR',
                      'MMAR'],
             inplace=True)
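
The conversion relies on pd.to_datetime accepting a dataframe whose columns are named year, month and day. A minimal sketch on two made-up rows:

```python
import pandas as pd

# Toy year and month columns, named as in the dataset.
toy = pd.DataFrame({"AMAR": [2016, 2016], "MMAR": [6, 10]})

# pd.to_datetime assembles dates from columns named year, month and day,
# so we rename the two columns and add a constant day.
parts = toy.rename(columns={"AMAR": "year", "MMAR": "month"}).assign(day=1)
toy["DateMAR"] = pd.to_datetime(parts)

print(toy["DateMAR"])
```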

We can consider the dataset as ready.

In [18]:
mar2016.sample(5, random_state=5010)
Out[18]:
ANAIS1 DEPNAIS1 SEXE1 INDNAT1 ETAMAT1 ANAIS2 DEPNAIS2 SEXE2 INDNAT2 ETAMAT2 JSEMAINE DEPMAR DEPDOM TUDOM TUCOM NBENFCOM DateMAR
36783 1983 78 M French Single 1985 83 F French Single Saturday 23 23 Urban Unit from 5,000 to 9,999 inhabitants Commune with fewer than 10,000 inhabitants Y 2016-10-01
64365 1986 34 M French Single 1983 34 F French Single Saturday 34 34 Country side (fewer than 2,000 inhabitants) Commune with fewer than 10,000 inhabitants Y 2016-05-01
73942 1988 38 F French Single 1979 38 M French Divorced Saturday 38 38 Urban Unit from 200,000 to 1,999,999 inhabitants Commune with more than 10,000 inhabitants N 2016-06-01
97354 1992 51 M French Single 1986 51 F French Single Saturday 51 51 Country side (fewer than 2,000 inhabitants) Commune with fewer than 10,000 inhabitants N 2016-04-01
232683 1987 69 M French Single 1982 69 F French Single Saturday 38 987 Oversea Commune with more than 10,000 inhabitants Y 2016-05-01

So that this article displays easily on any device (especially mobile ones), we subsample the dataset and save it once and for all. It is advisable to save the data to an external file when using Altair.

In [19]:
from pathlib import Path
if not Path(data_url).is_file():
    mar2016 = mar2016.sample(500, random_state=5010)

    mar2016.to_json(data_url, orient='records')
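
Two properties of this step are worth checking on a toy dataframe: a fixed random_state makes the sample reproducible, and orient='records' produces a JSON array with one object per row, which is the layout the charts read from data_url. The values below are made up.

```python
import json

import pandas as pd

toy = pd.DataFrame({
    "SEXE1": ["F", "M", "M", "F"],
    "DEPDOM": ["01", "38", "69", "75"],
})

# The same seed always selects the same rows.
sample = toy.sample(2, random_state=5010)
again = toy.sample(2, random_state=5010)

# Without a path, to_json returns the JSON text instead of writing a file.
records = json.loads(sample.to_json(orient="records"))
print(records)
```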

References

  1. Altair Home Page Altair: Declarative Visualization in Python
  2. Vega-Lite Home Page Vega-Lite – A Grammar of Interactive Graphics
  3. Insee 2016 Marriage Data Les mariages en 2016
  4. France-GeoJSON Contours des régions, départements, arrondissements, cantons et communes de France (métropole et départements d'outre-mer) au format GeoJSON
  5. MapShaper Tools for editing Shapefile, GeoJSON, TopoJSON and CSV files
  6. Vega-Lite Documentation: Data Describe JSON Data formatting