datashader¶
The power of Bokeh is to render data from Python (or R) in the web browser. However, because of the way web browsers are designed, there are limits to how much data can be displayed this way. Most web browsers can handle up to 100,000 or 200,000 data points in a Bokeh chart before they slow down or run into memory problems.
The datashader library extends Bokeh to visualise very large data sets by producing a faithful representation of the overall distribution rather than of individual data points. datashader is installed with
$ pipenv install datashader
When should Datashader not be used?¶
for drawing fewer than 100,000 data points
when every individual data point is important; by default Bokeh renders every data point, while datashader does not
for full interactivity (hover tools) with each data point
When should Datashader be used?¶
for really big data, when Bokeh or Matplotlib run into problems
when the distribution is more important than individual data points
when the distribution itself is what you want to analyse
How does Datashader work?¶
Tools like Bokeh send the data directly to the browser as an HTML/JavaScript plot
Datashader instead renders the data into a screen-sized aggregate array, from which an image can be created and embedded in a Bokeh chart
only this fixed-size image has to be sent to the browser, so millions or billions of data points can be used
each step adapts automatically to the data, but can also be customised
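In outline, these steps look like the following minimal sketch, using a small stand-in dataset (the variable names are illustrative; the same pipeline is worked through in detail below):

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# A small stand-in dataset; any DataFrame with numeric columns works
df = pd.DataFrame({"x": np.random.randn(1000), "y": np.random.randn(1000)})

canvas = ds.Canvas(plot_width=300, plot_height=300)  # screen-sized aggregate grid
agg = canvas.points(df, "x", "y")                    # project and aggregate the points
img = tf.shade(agg)                                  # colour-map the aggregate into an image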
Visualisations supported by Datashader¶
Datashader currently supports
scatterplots/heatmaps
time series
connected points (trajectories)
grids
In each case, the output can easily be embedded in Bokeh charts, with interactive resampling on pan and zoom, in notebooks or apps. Legends and hover information can be generated from the aggregate arrays to provide interactivity.
Visualising big data faithfully¶
When the data is so large that individual points are not easily distinguishable, it is crucial that the visualisation is created in a principled way that faithfully presents the underlying distribution to your visual system. For example, all of these charts show the same data, but which of them, if any, shows the actual distribution?
[1]:
import numpy as np
import pandas as pd

np.random.seed(1)
num = 10000

# Five Gaussian point clouds with different centres, spreads and values
dists = {
    cat: pd.DataFrame(
        dict(
            x=np.random.normal(x, s, num),
            y=np.random.normal(y, s, num),
            val=val,
            cat=cat,
        )
    )
    for x, y, s, val, cat in [
        (2, 2, 0.01, 10, "d1"),
        (2, -2, 0.1, 20, "d2"),
        (-2, -2, 0.5, 30, "d3"),
        (-2, 2, 1.0, 40, "d4"),
        (0, 0, 3, 50, "d5"),
    ]
}

df = pd.concat(dists, ignore_index=True)
df["cat"] = df["cat"].astype("category")
df.tail()
[1]:
|       | x         | y         | val | cat |
|-------|-----------|-----------|-----|-----|
| 49995 | -1.397579 | 0.610189  | 50  | d5  |
| 49996 | -2.649610 | 3.080821  | 50  | d5  |
| 49997 | 1.933360  | 0.243676  | 50  | d5  |
| 49998 | 4.306374  | 1.032139  | 50  | d5  |
| 49999 | -0.493567 | -2.242669 | 50  | d5  |
Here we have 50,000 points, 10,000 in each of five categories, each with an associated numerical value. Data of this size can only be plotted slowly with Bokeh and similar libraries, as the complete data set has to be transferred to the web browser. In addition, plotting data of this size with standard approaches causes several problems:
Plot A suffers from overplotting: the distribution is obscured by data points drawn later.
Plot B uses smaller points to avoid overplotting, but suffers from oversaturation: differences in data point density are not visible, because all densities above a certain value are displayed in the same pure black.
Plot C uses transparency to avoid oversaturation, but suffers from undersaturation: the 10,000 data points of the largest category (at 0,0) are not visible at all. Bokeh can handle 50,000 points, but with larger data these plots would also suffer from undersampling, where the distribution becomes invisible or misleading because zoomed-in areas contain too few data points.
Plots A and B also require time-consuming and error-prone manual parameter tuning, which is problematic precisely when the data is too large to understand without visualisation. With Datashader we can avoid all of these problems by rendering the data into an array that automatically covers the range of all dimensions and then displays the actual distribution, without parameter tuning and with very little code:
[2]:
import datashader as ds
import datashader.transfer_functions as tf

%time tf.shade(ds.Canvas().points(df, "x", "y"))
CPU times: user 309 ms, sys: 20.8 ms, total: 329 ms
Wall time: 330 ms
[2]:
Projection and aggregation¶
In the first steps of the Datashader pipeline, you choose
which variables are mapped to the x and y axes
the size of the aggregate array into which the values are binned
which range of values the array should cover
which function is used for aggregation
[3]:
canvas = ds.Canvas(
plot_width=250, plot_height=250, x_range=(-4, 4), y_range=(-4, 4)
)
agg = canvas.points(df, "x", "y", agg=ds.count())
agg
[3]:
<xarray.DataArray (y: 250, x: 250)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=uint32)
Coordinates:
  * x        (x) float64 -3.984 -3.952 -3.92 -3.888 ... 3.888 3.92 3.952 3.984
  * y        (y) float64 -3.984 -3.952 -3.92 -3.888 ... 3.888 3.92 3.952 3.984
Attributes:
    x_range:  (-4, 4)
    y_range:  (-4, 4)
Here we specify that the x and y columns are mapped to the x and y axes and aggregated with count. The result is a 2D xarray of the requested size that holds one value per pixel: the number of data points that fell into it. An xarray is similar to a NumPy or pandas data structure and supports similar operations, but allows arbitrary multidimensional data.
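Since the aggregate is a regular xarray DataArray, it can be inspected directly; a small sketch (the lookups are illustrative):

agg.max().item()                     # highest count in any pixel
agg.sel(x=0, y=0, method="nearest")  # count of the pixel nearest to (0, 0)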
The available reduction functions for aggregation include
count(): integer number of data points per pixel (the default)
any(): 1 for every pixel into which at least one data point falls, otherwise 0
sum(column): sum of the given column over all data points in the pixel
count_cat(column): number of data points per category, based on the given categorical column, which must be declared with pandas' categorical data type
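For example, the same canvas can aggregate the data with any of these reductions; a short sketch using the dataset above (the variable names are illustrative):

agg_any = canvas.points(df, "x", "y", agg=ds.any())             # 1 where any point lands
agg_sum = canvas.points(df, "x", "y", agg=ds.sum("val"))        # per-pixel sum of "val"
agg_cat = canvas.points(df, "x", "y", agg=ds.count_cat("cat"))  # per-category counts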
Transformation¶
Once the data is in xarray aggregate form, it can be processed in many ways, making Datashader even more flexible and powerful. For example, instead of plotting all the data we can plot only the pixels whose counts are at or above the 99th percentile:
[4]:
tf.shade(agg.where(agg >= np.percentile(agg, 99)))
[4]:
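Because the aggregate is an ordinary xarray, arbitrary NumPy-style operations can be applied before shading; for instance, a sketch that masks out both empty and very dense pixels (the cut-off is illustrative):

tf.shade(agg.where((agg > 0) & (agg < agg.max() / 10)))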
Colour mapping¶
The values in an aggregate array can be mapped to pixel colours. Datashader supports any Bokeh palette or list of colours:
[5]:
tf.shade(agg, cmap=["yellow", "red"])
[5]:
We can also choose how data values are mapped to colours:
linear
log
eq_hist
[6]:
tf.shade(agg, cmap=["yellow", "red"], how="linear")
[6]:
[7]:
tf.shade(agg, cmap=["yellow", "red"], how="log")
[7]:
[8]:
tf.shade(agg, cmap=["yellow", "red"], how="eq_hist")
[8]:
With linear, red is used only for the single pixel with the highest density. log mapping has similar problems, though less severe, since a wide range of data values still map to yellow. The default eq_hist setting correctly conveys the density differences between the distributions by equalising the histogram of pixel values, so that each pixel colour is used equally often.
If there are several categories, the per-category aggregates can also be coloured individually:
[9]:
color_key = dict(d1="blue", d2="green", d3="yellow", d4="orange", d5="red")
aggc = canvas.points(df, "x", "y", ds.count_cat("cat"))
tf.shade(aggc, color_key)
[9]:
If the dots appear too small, you can enlarge them with spreading in the final image.
[10]:
tf.spread(tf.shade(aggc, color_key))
[10]:
tf.spread uses a fixed (though configurable) spread size, while the related tf.dynspread spreads dynamically, depending on how dense the plot is in the current view.
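A minimal sketch of dynamic spreading, where threshold controls how much pixel overlap is tolerated before spreading stops and max_px caps the spread in pixels (the values here are illustrative):

tf.dynspread(tf.shade(aggc, color_key), threshold=0.5, max_px=4)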
Embedding¶
The images generated by Datashader can be used with any plotting or display program. Bokeh additionally provides interactive zooming and panning for examining extremely large data sets. We only need to wrap the commands above in a callback function and add it to a Bokeh figure:
[11]:
import bokeh.plotting as bp

from datashader.bokeh_ext import InteractiveImage

bp.output_notebook()

p = bp.figure(tools="pan,wheel_zoom,reset", x_range=(-5, 5), y_range=(-5, 5))

def image_callback(x_range, y_range, w, h):
    # Re-aggregate and re-shade for the currently visible range and plot size
    cvs = ds.Canvas(
        plot_width=w, plot_height=h, x_range=x_range, y_range=y_range
    )
    agg = cvs.points(df, "x", "y", ds.count_cat("cat"))
    img = tf.shade(agg, color_key)
    return tf.dynspread(img, threshold=0.25)

InteractiveImage(p, image_callback)
[11]:
You can now also see the axis values, which are not visible in plain images. When you zoom in, the callback renders a new Datashader image for the enlarged area and displays it in the plot.
You can also easily overlay other Bokeh data in the same plot, or place map tiles behind geographical data in Web Mercator format.
Datashader works similarly for line plots (for example time series and trajectories), so all data points can be used without having to downsample them yourself. It can also take raster data (for example satellite weather data) and re-rasterise it onto a requested grid, which can then be analysed, coloured, or combined with other non-raster data. For example, if you have elevation data in raster form and income data as individual points, you can easily plot all pixels where the average income is above a certain threshold and the elevation below a certain value, something that would be very difficult to express in a traditional workflow.
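As a sketch of the line-plot case, Canvas.line aggregates a complete time series onto the grid in the same way that Canvas.points handles scatter data; the synthetic random-walk series here is only for illustration:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

n = 1_000_000
ts = pd.DataFrame({"t": np.arange(n), "signal": np.random.randn(n).cumsum()})

canvas = ds.Canvas(plot_width=800, plot_height=300)
agg = canvas.line(ts, "t", "signal")  # rasterise every point of the series
tf.shade(agg)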