PdVega examples

Imports

[1]:
import numpy as np
import pandas as pd
import pdvega

Declarative description of the data visualisation

Vega-Lite can be used to describe declaratively how data should be mapped to the visual properties of a chart. pdvega makes this specification available through an API that mirrors the Matplotlib-based pandas plotting API: data.plot simply needs to be replaced by data.vgplot, where data is a pandas Series or DataFrame object.
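To make the idea of a declarative description more concrete, the following minimal sketch writes a small Vega-Lite specification by hand as a Python dictionary. The field names date, price and symbol and the data values are made up for illustration; this is not the specification that pdvega generates.

```python
import json

# Hand-written Vega-Lite specification for a line chart.
# Field names and values are illustrative assumptions, not real data.
spec = {
    "mark": "line",  # which kind of graphical mark to draw
    "encoding": {  # how data fields map to visual channels
        "x": {"field": "date", "type": "temporal"},
        "y": {"field": "price", "type": "quantitative"},
        "color": {"field": "symbol", "type": "nominal"},
    },
    "data": {
        "values": [
            {"date": "2000-01-01", "price": 10.0, "symbol": "X"},
            {"date": "2000-02-01", "price": 12.0, "symbol": "X"},
            {"date": "2000-01-01", "price": 20.0, "symbol": "Y"},
            {"date": "2000-02-01", "price": 18.0, "symbol": "Y"},
        ]
    },
}

# The specification is plain JSON and can be serialised as such.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The specification only states what should be shown (mark and encodings); how it is drawn is left to the Vega-Lite renderer.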

Loading a DataFrame with time series of share prices:

[2]:
from vega_datasets import data


stocks = data.stocks(pivoted=True)

Matplotlib API:

[3]:
stocks.plot.line()
[3]:
<Axes: xlabel='date'>
../../_images/vega_pdvega_examples_7_1.png

pdvega API

[5]:
stocks.vgplot.line()
../../_images/vega_pdvega_examples_9_2.png

The result is an attractive data visualisation with a minimum of boilerplate. In addition, the diagrams created with pdvega are interactive: they can be panned and zoomed.

Simple data visualisations with data.vgplot

The central interface of pdvega is the vgplot attribute, which is added to pandas DataFrame and Series objects.

As with the pandas plots, there are two ways to create diagrams:

  1. The vgplot attribute of a pandas object can be called directly, for example:

iris.vgplot(kind="scatter", x="sepalLength", y="petalLength", c="species")
  2. Alternatively, the specific method assigned to each chart type can also be called:

iris.vgplot.scatter(x="sepalLength", y="petalLength", c="species")

The second approach has the advantage that the available plot types can be discovered via tab completion. The individual methods also provide more detailed documentation of the arguments available for each chart type.

Diagram types

The vgplot API provides nine basic diagram types:

Line charts with vgplot.line

The default chart type for vgplot is a line chart.

Unless otherwise specified, the index of the DataFrame or Series is used as the x-axis variable, and a separate line is drawn for the values of each column of the DataFrame. If you only want to plot a subset of the columns, you can use pandas indexing to select the columns you are interested in:

[6]:
stocks[['AAPL', 'AMZN']].vgplot.line()
../../_images/vega_pdvega_examples_14_2.png
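One way to picture the mapping of index and columns to a line chart is to reshape the wide DataFrame into long form, with one row per (x, column, value) combination. The following is a minimal sketch with made-up numbers; the column names variable and value are assumptions, and pdvega may prepare the data differently internally.

```python
import pandas as pd

# Small wide-form DataFrame: the index is the x variable, one column per line.
# The numbers are made up for illustration.
wide = pd.DataFrame(
    {"AAPL": [1.0, 2.0, 3.0], "AMZN": [4.0, 5.0, 6.0]},
    index=pd.Index(
        pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]), name="date"
    ),
)

# Long form: "date" becomes x, "value" becomes y, "variable" selects the line.
long_df = wide.reset_index().melt(
    id_vars="date", var_name="variable", value_name="value"
)
print(long_df)
```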

Line diagrams can be further customised; see the pdvega documentation for details.

Scatter plots with vgplot.scatter

[7]:
stocks.vgplot.scatter(x="AAPL", y="AMZN")
../../_images/vega_pdvega_examples_17_2.png

To further customise scatter plots, check out pdvega.FramePlotMethods.scatter().

Area plots with vgplot.area

[8]:
stocks[["MSFT", "AAPL", "AMZN"]].vgplot.area()
../../_images/vega_pdvega_examples_20_2.png

Area diagrams can also be drawn unstacked with stacked=False, so that the areas overlap. In this case, transparent areas are often helpful.

[9]:
stocks[["MSFT", "AAPL", "AMZN"]].vgplot.area(stacked=False, alpha=0.4)
../../_images/vega_pdvega_examples_22_2.png
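The difference between stacked and overlaid areas comes down to the baseline of each area. The following sketch computes the lower and upper edge of every area in the stacked case from made-up numbers. It illustrates the idea only and is not pdvega's implementation.

```python
import pandas as pd

# Made-up values for three columns at two points in time.
df_demo = pd.DataFrame(
    {"MSFT": [1.0, 2.0], "AAPL": [3.0, 1.0], "AMZN": [2.0, 2.0]}
)

# Stacked: each area sits on top of the running total of the columns before it.
upper = df_demo.cumsum(axis=1)
lower = upper - df_demo

# Overlaid (unstacked): every area starts at zero and ends at its own value,
# which is why transparency helps to keep all areas visible.
print(lower)
print(upper)
```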

Area diagrams can be further customised; see the pdvega documentation for details.

Bar charts with vgplot.bar

[10]:
np.random.seed(1234)
df = pd.DataFrame(np.random.rand(10, 2), columns=["a", "b"])

df.vgplot.bar()
../../_images/vega_pdvega_examples_25_2.png

As with area diagrams, you can stack the bars with stacked=True:

[11]:
df.vgplot.bar(stacked=True)
../../_images/vega_pdvega_examples_27_2.png

In addition, horizontal bar charts can be created with barh:

[12]:
df.vgplot.barh(stacked=True)
../../_images/vega_pdvega_examples_29_2.png

Histograms with vgplot.hist

[13]:
df = pd.DataFrame(
    {
        "a": np.random.randn(1000) + 1,
        "b": np.random.randn(1000),
        "c": np.random.randn(1000) - 1,
    },
    columns=["a", "b", "c"],
)

df.vgplot.hist(bins=50, alpha=0.5)
../../_images/vega_pdvega_examples_31_2.png
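To compare several columns in one histogram, it helps if all columns share the same bin edges. The following sketch shows this with NumPy on similar random data. Whether pdvega bins the data in exactly this way is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
columns = {
    "a": rng.normal(size=1000) + 1,
    "b": rng.normal(size=1000),
    "c": rng.normal(size=1000) - 1,
}

# Compute 50 bins over the combined value range of all columns ...
all_values = np.concatenate(list(columns.values()))
edges = np.histogram_bin_edges(all_values, bins=50)

# ... and count each column separately against these shared edges.
counts = {
    name: np.histogram(values, bins=edges)[0] for name, values in columns.items()
}
```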

Histograms can be further customised; see the pdvega documentation for details.

Kernel density estimation diagrams with vgplot.kde

Similar to histograms, kernel density estimation (KDE) diagrams show how the data points are distributed, but as smooth curves that indicate the density of the data points.

[14]:
df.vgplot.kde()
../../_images/vega_pdvega_examples_34_2.png
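The smooth curves arise because a small bell curve (kernel) is placed on every data point and all kernels are averaged. The following is a minimal sketch of a Gaussian KDE in NumPy; the function name gaussian_kde_curve and the bandwidth of 0.4 are chosen freely for illustration and do not reflect how pdvega selects its bandwidth.

```python
import numpy as np


def gaussian_kde_curve(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance of every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average over all samples and rescale by the bandwidth.
    return kernels.sum(axis=1) / (samples.size * bandwidth)


rng = np.random.default_rng(0)
samples = rng.normal(size=500)
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_curve(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1 (simple Riemann sum).
area = density.sum() * (grid[1] - grid[0])
```

A larger bandwidth yields a smoother but less detailed curve, a smaller one a more jagged curve.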

KDE diagrams can be further customised; see the pdvega documentation for details.

Heatmaps with vgplot.heatmap

The pandas plotting API offers DataFrame.plot.hexbin for creating a hexagonally binned heatmap of two-dimensional data. Unfortunately, neither Vega nor Vega-Lite currently supports hexagonal binning. However, they do support Cartesian heatmaps, and this functionality is also included in pdvega:

[15]:
df.vgplot.heatmap(x="a", y="b", C="c", gridsize=20)
../../_images/vega_pdvega_examples_37_2.png
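A Cartesian heatmap of this kind divides the x-y plane into a rectangular grid and aggregates a third variable in every cell. The following sketch calculates such a grid with NumPy; taking the mean as the aggregate is an assumption here, and pdvega leaves the actual binning to Vega-Lite.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000) + 1
b = rng.normal(size=1000)
c = rng.normal(size=1000) - 1

gridsize = 20

# Number of points per cell of a gridsize x gridsize grid.
counts, a_edges, b_edges = np.histogram2d(a, b, bins=gridsize)

# Sum of c per cell on the same grid, then the mean per cell.
sums, _, _ = np.histogram2d(a, b, bins=[a_edges, b_edges], weights=c)
with np.errstate(invalid="ignore", divide="ignore"):
    mean_c = np.where(counts > 0, sums / counts, np.nan)  # NaN for empty cells
```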

Heatmap diagrams can be further customised, see pdvega.FramePlotMethods.heatmap().

Statistical visualisation with pdvega.plotting

pdvega also supports many of the more complex plotting routines available in the pandas.plotting submodule. The following example shows a multi-panel scatterplot matrix of Fisher’s Iris dataset:

[16]:
iris = data.iris()
pdvega.scatter_matrix(iris, "species", figsize=(7, 7))
../../_images/vega_pdvega_examples_40_2.png

You can interactively pan and zoom in this diagram. You can also select individual data points by holding down the Shift key.

Parallel coordinates

Another way to visualise multidimensional data is to show each dimension on its own axis in a parallel coordinates diagram. This can be done with pdvega.parallel_coordinates(), whose API corresponds to that of pandas.plotting.parallel_coordinates():

[17]:
pdvega.parallel_coordinates(iris, "species")
../../_images/vega_pdvega_examples_43_2.png

At a glance, you can recognise relationships between points and, in particular, that the ‘setosa’ species differs significantly from the other two species in the width and length of the petals.

Andrews curves

A similar approach to visualising data dimensions is known as Andrews curves: the idea is to construct a Fourier series from the features of each object in order to visualise the aggregated differences between classes qualitatively. This can be done with the function pdvega.andrews_curves(), whose API corresponds to that of pandas.plotting.andrews_curves():

[17]:
pdvega.andrews_curves(iris, "species")
../../_images/vega_pdvega_examples_46_2.png

This gives a similar impression to the parallel coordinates, but provides less quantitative information on the characteristics that lead to this distinction.
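The construction of such a curve can be sketched in a few lines. The following example uses the classical definition f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + …; the function name andrews_curve and the feature values are chosen for illustration, and it is not confirmed here that pdvega uses exactly this form.

```python
import numpy as np


def andrews_curve(row, t):
    """Evaluate the classical Andrews curve of one feature vector at t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term from the first feature.
    curve = np.full_like(t, row[0] / np.sqrt(2.0))
    # Remaining features alternate between sine and cosine terms
    # with increasing frequency: sin(t), cos(t), sin(2t), cos(2t), ...
    for i, x in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2
        if i % 2 == 1:
            curve += x * np.sin(harmonic * t)
        else:
            curve += x * np.cos(harmonic * t)
    return curve


t = np.linspace(-np.pi, np.pi, 200)
features = [5.1, 3.5, 1.4, 0.2]  # illustrative values for a four-feature object
curve = andrews_curve(features, t)
```

Objects with similar features produce similar curves, which is why classes appear as bundles of curves in the diagram.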

Lag plots

Lag plots are implemented with pdvega.plotting.lag_plot(), whose API corresponds to that of pandas.plotting.lag_plot(). In the following, the share prices of Amazon and Microsoft from 1998 to 2010 are visualised with a lag of 12 months:

[18]:
pdvega.lag_plot(stocks[["AMZN", "MSFT"]], lag=12)
../../_images/vega_pdvega_examples_49_2.png

It is immediately apparent from this plot that Amazon was far more volatile during this period: the values at each point in time showed only a very low correlation with the value one year later. Conversely, Microsoft’s value was much more stable during this decade.

This interpretation is also visible in a simple time series chart of each company’s share price:

[19]:
stocks[["AMZN", "MSFT"]].vgplot.line()
../../_images/vega_pdvega_examples_51_2.png
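What a lag plot shows can also be checked numerically: each value is paired with the value that follows a fixed number of steps later, and the correlation of these pairs is then calculated. The following sketch uses two synthetic series instead of the share prices; the helper lag_pairs is hypothetical and not part of pdvega.

```python
import numpy as np
import pandas as pd


def lag_pairs(series, lag):
    """Pair each value y(t) with the value y(t + lag) that follows lag steps later."""
    pairs = pd.DataFrame({"y(t)": series, "y(t + lag)": series.shift(-lag)})
    return pairs.dropna()  # the last lag rows have no partner


rng = np.random.default_rng(0)
trend = pd.Series(np.arange(120, dtype=float))  # steadily growing series
noise = pd.Series(rng.normal(size=120))  # series without any memory

trend_pairs = lag_pairs(trend, lag=12)
noise_pairs = lag_pairs(noise, lag=12)

# Correlation between each value and the value 12 steps later.
trend_corr = trend_pairs["y(t)"].corr(trend_pairs["y(t + lag)"])
noise_corr = noise_pairs["y(t)"].corr(noise_pairs["y(t + lag)"])
print(trend_corr, noise_corr)
```

A series that develops steadily yields points close to a line in the lag plot and a high correlation, while an erratic series yields a diffuse cloud of points and a low correlation.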