datashader¶
The power of Bokeh is to render data from Python (or R) in the web browser. However, because of the way web browsers are designed, there are limits to how much data can be displayed this way. Most web browsers can handle up to 100,000 or 200,000 data points in a Bokeh chart before they slow down or run into memory problems.
The datashader library extends Bokeh to visualise very large data sets by producing a faithful representation of the overall distribution rather than of individual data points. datashader is installed with
$ pipenv install datashader
When should Datashader not be used?¶
for drawing fewer than 100,000 data points
when every individual data point is important; by default Bokeh renders every data point, while datashader does not
for full interactivity (hover tools) with each data point
When should Datashader be used?¶
for really big data, when Bokeh or Matplotlib run into problems
when the distribution is more important than individual data points
when the distribution itself is what you want to analyse
How does Datashader work?¶
Tools like Bokeh send the data directly to the browser as an HTML/JavaScript plot
Datashader instead renders the data into a screen-sized aggregate array, from which an image can be created and embedded in a Bokeh chart
only this fixed-size image has to be sent to the browser, so millions or billions of data points can be used
each step adapts automatically to the data, but can also be customised
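In outline, these steps look like the following minimal sketch, using a small stand-in dataset (the variable names are illustrative; the same pipeline is worked through in detail below):

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# A small stand-in dataset; any DataFrame with numeric columns works
df = pd.DataFrame({"x": np.random.randn(1000), "y": np.random.randn(1000)})

canvas = ds.Canvas(plot_width=300, plot_height=300)  # screen-sized aggregate grid
agg = canvas.points(df, "x", "y")                    # project and aggregate the points
img = tf.shade(agg)                                  # colour-map the aggregate into an image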
Visualisations supported by Datashader¶
Datashader currently supports
scatterplots/heatmaps
time series
connected points (trajectories)
grids
In each case, the output can easily be embedded in Bokeh charts, with interactive resampling on pan and zoom, in notebooks or apps. Legends and hover information can be generated from the aggregate arrays to provide interactivity.
Visualising big data faithfully¶
When the data is so large that individual points are not easily distinguishable, it is crucial that the visualisation is created in a principled way that faithfully presents the underlying distribution to your visual system. For example, all of these charts show the same data, but which of them, if any, shows the actual distribution?
[1]:
import numpy as np
import pandas as pd

np.random.seed(1)
num = 10000

# Five Gaussian point clouds with different centres, spreads and values
dists = {
    cat: pd.DataFrame(
        dict(
            x=np.random.normal(x, s, num),
            y=np.random.normal(y, s, num),
            val=val,
            cat=cat,
        )
    )
    for x, y, s, val, cat in [
        (2, 2, 0.01, 10, "d1"),
        (2, -2, 0.1, 20, "d2"),
        (-2, -2, 0.5, 30, "d3"),
        (-2, 2, 1.0, 40, "d4"),
        (0, 0, 3, 50, "d5"),
    ]
}

df = pd.concat(dists, ignore_index=True)
df["cat"] = df["cat"].astype("category")
df.tail()
[1]:
|       | x         | y         | val | cat |
|-------|-----------|-----------|-----|-----|
| 49995 | -1.397579 | 0.610189  | 50  | d5  |
| 49996 | -2.649610 | 3.080821  | 50  | d5  |
| 49997 | 1.933360  | 0.243676  | 50  | d5  |
| 49998 | 4.306374  | 1.032139  | 50  | d5  |
| 49999 | -0.493567 | -2.242669 | 50  | d5  |
Here we have 50,000 points, 10,000 in each of five categories, each with an associated numerical value. Data of this size can only be plotted slowly with Bokeh and similar libraries, as the complete data set has to be transferred to the web browser. In addition, plotting data of this size with standard approaches causes several problems:
Plot A suffers from overplotting: the distribution is obscured by data points drawn later.
Plot B uses smaller points to avoid overplotting, but suffers from oversaturation: differences in data point density are not visible, because all densities above a certain value are displayed in the same pure black.
Plot C uses transparency to avoid oversaturation, but suffers from undersaturation: the 10,000 data points of the largest category (at 0,0) are not visible at all. Bokeh can handle 50,000 points, but with larger data these plots would also suffer from undersampling, where the distribution becomes invisible or misleading because zoomed-in areas contain too few data points.
Plots A and B also require time-consuming and error-prone manual parameter tuning, which is problematic precisely when the data is too large to understand without visualisation. With Datashader we can avoid all of these problems by rendering the data into an array that automatically covers the range of all dimensions and then displays the actual distribution, without parameter tuning and with very little code:
[2]:
import datashader as ds
import datashader.transfer_functions as tf

%time tf.shade(ds.Canvas().points(df, "x", "y"))
CPU times: user 309 ms, sys: 20.8 ms, total: 329 ms
Wall time: 330 ms
[2]:
Projection and aggregation¶
In the first steps of the Datashader pipeline, you choose
which variables are mapped to the x and y axes
the size of the aggregate array into which the values are binned
which range of values the array should cover
which function is used for aggregation
[3]:
canvas = ds.Canvas(
plot_width=250, plot_height=250, x_range=(-4, 4), y_range=(-4, 4)
)
agg = canvas.points(df, "x", "y", agg=ds.count())
agg
[3]:
<xarray.DataArray (y: 250, x: 250)>
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=uint32)
Coordinates:
  * x        (x) float64 -3.984 -3.952 -3.92 -3.888 ... 3.888 3.92 3.952 3.984
  * y        (y) float64 -3.984 -3.952 -3.92 -3.888 ... 3.888 3.92 3.952 3.984
Attributes:
    x_range:  (-4, 4)
    y_range:  (-4, 4)
Here we specify that the x and y columns are mapped to the x and y axes and aggregated with count. The result is a 2D xarray of the requested size that holds one value per pixel: the number of data points that fell into it. An xarray is similar to a NumPy or pandas data structure and supports similar operations, but allows arbitrary multidimensional data.
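Since the aggregate is a regular xarray DataArray, it can be inspected directly; a small sketch (the lookups are illustrative):

agg.max().item()                     # highest count in any pixel
agg.sel(x=0, y=0, method="nearest")  # count of the pixel nearest to (0, 0)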
The available reduction functions for aggregation include
count(): integer number of data points per pixel (the default)
any(): 1 for every pixel into which at least one data point falls, otherwise 0
sum(column): sum of the given column over all data points in the pixel
count_cat(column): number of data points per category, based on the given categorical column, which must be declared with pandas' categorical data type
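For example, the same canvas can aggregate the data with any of these reductions; a short sketch using the dataset above (the variable names are illustrative):

agg_any = canvas.points(df, "x", "y", agg=ds.any())             # 1 where any point lands
agg_sum = canvas.points(df, "x", "y", agg=ds.sum("val"))        # per-pixel sum of "val"
agg_cat = canvas.points(df, "x", "y", agg=ds.count_cat("cat"))  # per-category counts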
Transformation¶
Once the data is in xarray aggregate form, it can be processed in many ways, making Datashader even more flexible and powerful. For example, instead of plotting all the data we can plot only the pixels whose counts are at or above the 99th percentile:
[4]:
tf.shade(agg.where(agg >= np.percentile(agg, 99)))
[4]:
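Because the aggregate is an ordinary xarray, arbitrary NumPy-style operations can be applied before shading; for instance, a sketch that masks out both empty and very dense pixels (the cut-off is illustrative):

tf.shade(agg.where((agg > 0) & (agg < agg.max() / 10)))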
Colour mapping¶
The values in an aggregate array can be mapped to pixel colours. Datashader supports any Bokeh palette or list of colours:
[5]:
tf.shade(agg, cmap=["yellow", "red"])
[5]:
We can also choose how data values are mapped to colours:
linear
log
eq_hist
[6]:
tf.shade(agg, cmap=["yellow", "red"], how="linear")
[6]:
[7]:
tf.shade(agg, cmap=["yellow", "red"], how="log")
[7]:
[8]:
tf.shade(agg, cmap=["yellow", "red"], how="eq_hist")
[8]:
With linear, red is used only for the single pixel with the highest density. log mapping has similar problems, though less severe, since a wide range of data values still map to yellow. The default eq_hist setting correctly conveys the density differences between the distributions by equalising the histogram of pixel values, so that each pixel colour is used equally often.
If there are several categories, the per-category aggregates can also be coloured individually:
[9]:
color_key = dict(d1="blue", d2="green", d3="yellow", d4="orange", d5="red")
aggc = canvas.points(df, "x", "y", ds.count_cat("cat"))
tf.shade(aggc, color_key)
[9]:
If the dots appear too small, you can enlarge them with spreading in the final image.
[10]:
tf.spread(tf.shade(aggc, color_key))
[10]:
tf.spread uses a fixed (though configurable) spread size, while the related tf.dynspread spreads dynamically, depending on how dense the plot is in the current view.
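A minimal sketch of dynamic spreading, where threshold controls how much pixel overlap is tolerated before spreading stops and max_px caps the spread in pixels (the values here are illustrative):

tf.dynspread(tf.shade(aggc, color_key), threshold=0.5, max_px=4)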
Embedding¶
The images generated by Datashader can be used with any plotting or display program. Bokeh additionally provides interactive zooming and panning for examining extremely large data sets. We only need to wrap the commands above in a callback function and add it to a Bokeh figure:
[11]:
import bokeh.plotting as bp

from datashader.bokeh_ext import InteractiveImage

bp.output_notebook()

p = bp.figure(tools="pan,wheel_zoom,reset", x_range=(-5, 5), y_range=(-5, 5))

def image_callback(x_range, y_range, w, h):
    # Re-aggregate and re-shade for the currently visible range and plot size
    cvs = ds.Canvas(
        plot_width=w, plot_height=h, x_range=x_range, y_range=y_range
    )
    agg = cvs.points(df, "x", "y", ds.count_cat("cat"))
    img = tf.shade(agg, color_key)
    return tf.dynspread(img, threshold=0.25)

InteractiveImage(p, image_callback)
[11]:
You can now also see the axis values, which are not visible in plain images. When you zoom in, the callback renders a new Datashader image for the enlarged area and displays it in the plot.
You can also easily overlay other Bokeh data in the same plot, or place map tiles behind geographical data in Web Mercator format.
Datashader works similarly for line plots (for example time series and trajectories), so all data points can be used without having to downsample them yourself. It can also take raster data (for example satellite weather data) and re-rasterise it onto a requested grid, which can then be analysed, coloured, or combined with other non-raster data. For example, if you have elevation data in raster form and income data as individual points, you can easily plot all pixels where the average income is above a certain threshold and the elevation below a certain value, something that would be very difficult to express in a traditional workflow.
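As a sketch of the line-plot case, Canvas.line aggregates a complete time series onto the grid in the same way that Canvas.points handles scatter data; the synthetic random-walk series here is only for illustration:

import numpy as np
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

n = 1_000_000
ts = pd.DataFrame({"t": np.arange(n), "signal": np.random.randn(n).cumsum()})

canvas = ds.Canvas(plot_width=800, plot_height=300)
agg = canvas.line(ts, "t", "signal")  # rasterise every point of the series
tf.shade(agg)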