Analyze web-hosted JSON data

This notebook reads and processes JSON-encoded data hosted on the web using a combination of Dask Bag and Dask Dataframe.

This data comes from mybinder.org, a web service that runs Jupyter notebooks live on the web (you may be running this notebook there now). mybinder.org publishes a record every time someone launches a live notebook like this one, and stores those records in publicly accessible JSON files, one file per day.

Introduction to the dataset

This data is stored as JSON-encoded text files on the public web. Here are some example lines.

[1]:
import dask.bag as db
db.read_text('https://archive.analytics.mybinder.org/events-2018-11-03.jsonl').take(3)
[1]:
('{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "QISKit/qiskit-tutorial/master", "status": "success"}\n')

We see that each file contains one line for every time someone started a live notebook on the site. Each line includes the time that the notebook was started, as well as the repository from which it was served.

In this notebook we’ll look at many such files, parse them from JSON to Python dictionaries, and then from there to Pandas dataframes. We’ll then do some simple analyses on this data.
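As a minimal sketch of the first parsing step, the snippet below applies json.loads from the standard library to one of the example lines shown above. It works on a single hard-coded string, so it needs no network access; the variable names are illustrative only.

```python
import json

# One raw line copied from the example output above (note the trailing newline).
line = (
    '{"timestamp": "2018-11-03T00:00:00+00:00", '
    '"schema": "binderhub.jupyter.org/launch", "version": 1, '
    '"provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", '
    '"status": "success"}\n'
)

# json.loads tolerates surrounding whitespace, so the newline needs no stripping.
record = json.loads(line)

print(type(record).__name__)  # dict
print(record["spec"])         # Qiskit/qiskit-tutorial/master
print(record["provider"])     # GitHub
```

Once a line is a dictionary, its fields can be read by key, which is what the later steps of this notebook rely on.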

Start Dask Client for Dashboard

Starting the Dask Client is optional. It starts the dashboard, which is useful for gaining insight into the computation.

[2]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=1,
                n_workers=4,
                memory_limit='2GB')
client
[2]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 8.00 GB

Get a list of files on the web

The mybinder.org team maintains an index file that points to all other available JSON files of data. Let’s convert this to a list of URLs that we’ll read in the next section.

[3]:
import dask.bag as db
import json
[4]:
db.read_text('https://archive.analytics.mybinder.org/index.jsonl').map(json.loads).compute()
[4]:
[{'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
 {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
 {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
 {'name': 'events-2018-11-06.jsonl', 'date': '2018-11-06', 'count': '13920'},
 {'name': 'events-2018-11-07.jsonl', 'date': '2018-11-07', 'count': '12766'},
 {'name': 'events-2018-11-08.jsonl', 'date': '2018-11-08', 'count': '14105'},
 {'name': 'events-2018-11-09.jsonl', 'date': '2018-11-09', 'count': '11843'},
 {'name': 'events-2018-11-10.jsonl', 'date': '2018-11-10', 'count': '7047'},
 {'name': 'events-2018-11-11.jsonl', 'date': '2018-11-11', 'count': '6940'},
 {'name': 'events-2018-11-12.jsonl', 'date': '2018-11-12', 'count': '16322'},
 {'name': 'events-2018-11-13.jsonl', 'date': '2018-11-13', 'count': '16530'},
 {'name': 'events-2018-11-14.jsonl', 'date': '2018-11-14', 'count': '14099'},
 {'name': 'events-2018-11-15.jsonl', 'date': '2018-11-15', 'count': '13182'},
 {'name': 'events-2018-11-16.jsonl', 'date': '2018-11-16', 'count': '12863'},
 {'name': 'events-2018-11-17.jsonl', 'date': '2018-11-17', 'count': '6490'},
 {'name': 'events-2018-11-18.jsonl', 'date': '2018-11-18', 'count': '7310'},
 {'name': 'events-2018-11-19.jsonl', 'date': '2018-11-19', 'count': '13348'},
 {'name': 'events-2018-11-20.jsonl', 'date': '2018-11-20', 'count': '13982'},
 {'name': 'events-2018-11-21.jsonl', 'date': '2018-11-21', 'count': '13165'},
 {'name': 'events-2018-11-22.jsonl', 'date': '2018-11-22', 'count': '12217'},
 {'name': 'events-2018-11-23.jsonl', 'date': '2018-11-23', 'count': '9070'},
 {'name': 'events-2018-11-24.jsonl', 'date': '2018-11-24', 'count': '6798'},
 {'name': 'events-2018-11-25.jsonl', 'date': '2018-11-25', 'count': '6796'},
 {'name': 'events-2018-11-26.jsonl', 'date': '2018-11-26', 'count': '13617'},
 {'name': 'events-2018-11-27.jsonl', 'date': '2018-11-27', 'count': '14964'},
 {'name': 'events-2018-11-28.jsonl', 'date': '2018-11-28', 'count': '14434'},
 {'name': 'events-2018-11-29.jsonl', 'date': '2018-11-29', 'count': '13845'},
 {'name': 'events-2018-11-30.jsonl', 'date': '2018-11-30', 'count': '12109'},
 {'name': 'events-2018-12-01.jsonl', 'date': '2018-12-01', 'count': '6785'},
 {'name': 'events-2018-12-02.jsonl', 'date': '2018-12-02', 'count': '7119'},
 {'name': 'events-2018-12-03.jsonl', 'date': '2018-12-03', 'count': '13946'},
 {'name': 'events-2018-12-04.jsonl', 'date': '2018-12-04', 'count': '13765'},
 {'name': 'events-2018-12-05.jsonl', 'date': '2018-12-05', 'count': '13106'},
 {'name': 'events-2018-12-06.jsonl', 'date': '2018-12-06', 'count': '12249'},
 {'name': 'events-2018-12-07.jsonl', 'date': '2018-12-07', 'count': '10687'},
 {'name': 'events-2018-12-08.jsonl', 'date': '2018-12-08', 'count': '6269'},
 {'name': 'events-2018-12-09.jsonl', 'date': '2018-12-09', 'count': '6639'},
 {'name': 'events-2018-12-10.jsonl', 'date': '2018-12-10', 'count': '12782'},
 {'name': 'events-2018-12-11.jsonl', 'date': '2018-12-11', 'count': '13442'},
 {'name': 'events-2018-12-12.jsonl', 'date': '2018-12-12', 'count': '13069'},
 {'name': 'events-2018-12-13.jsonl', 'date': '2018-12-13', 'count': '15279'},
 {'name': 'events-2018-12-14.jsonl', 'date': '2018-12-14', 'count': '9941'},
 {'name': 'events-2018-12-15.jsonl', 'date': '2018-12-15', 'count': '5358'},
 {'name': 'events-2018-12-16.jsonl', 'date': '2018-12-16', 'count': '6441'},
 {'name': 'events-2018-12-17.jsonl', 'date': '2018-12-17', 'count': '11332'},
 {'name': 'events-2018-12-18.jsonl', 'date': '2018-12-18', 'count': '11971'},
 {'name': 'events-2018-12-19.jsonl', 'date': '2018-12-19', 'count': '10818'},
 {'name': 'events-2018-12-20.jsonl', 'date': '2018-12-20', 'count': '9408'},
 {'name': 'events-2018-12-21.jsonl', 'date': '2018-12-21', 'count': '7741'},
 {'name': 'events-2018-12-22.jsonl', 'date': '2018-12-22', 'count': '4818'},
 {'name': 'events-2018-12-23.jsonl', 'date': '2018-12-23', 'count': '4870'},
 {'name': 'events-2018-12-24.jsonl', 'date': '2018-12-24', 'count': '5974'},
 {'name': 'events-2018-12-25.jsonl', 'date': '2018-12-25', 'count': '4737'},
 {'name': 'events-2018-12-26.jsonl', 'date': '2018-12-26', 'count': '6725'},
 {'name': 'events-2018-12-27.jsonl', 'date': '2018-12-27', 'count': '7998'},
 {'name': 'events-2018-12-28.jsonl', 'date': '2018-12-28', 'count': '8155'},
 {'name': 'events-2018-12-29.jsonl', 'date': '2018-12-29', 'count': '5108'},
 {'name': 'events-2018-12-30.jsonl', 'date': '2018-12-30', 'count': '4428'},
 {'name': 'events-2018-12-31.jsonl', 'date': '2018-12-31', 'count': '4561'},
 {'name': 'events-2019-01-01.jsonl', 'date': '2019-01-01', 'count': '4194'},
 {'name': 'events-2019-01-02.jsonl', 'date': '2019-01-02', 'count': '8559'},
 {'name': 'events-2019-01-03.jsonl', 'date': '2019-01-03', 'count': '9687'},
 {'name': 'events-2019-01-04.jsonl', 'date': '2019-01-04', 'count': '10048'},
 {'name': 'events-2019-01-05.jsonl', 'date': '2019-01-05', 'count': '6012'},
 {'name': 'events-2019-01-06.jsonl', 'date': '2019-01-06', 'count': '6019'},
 {'name': 'events-2019-01-07.jsonl', 'date': '2019-01-07', 'count': '11903'},
 {'name': 'events-2019-01-08.jsonl', 'date': '2019-01-08', 'count': '12777'},
 {'name': 'events-2019-01-09.jsonl', 'date': '2019-01-09', 'count': '13294'},
 {'name': 'events-2019-01-10.jsonl', 'date': '2019-01-10', 'count': '13112'},
 {'name': 'events-2019-01-11.jsonl', 'date': '2019-01-11', 'count': '10327'},
 {'name': 'events-2019-01-12.jsonl', 'date': '2019-01-12', 'count': '6434'},
 {'name': 'events-2019-01-13.jsonl', 'date': '2019-01-13', 'count': '7004'},
 {'name': 'events-2019-01-14.jsonl', 'date': '2019-01-14', 'count': '12898'},
 {'name': 'events-2019-01-15.jsonl', 'date': '2019-01-15', 'count': '12363'},
 {'name': 'events-2019-01-16.jsonl', 'date': '2019-01-16', 'count': '13444'},
 {'name': 'events-2019-01-17.jsonl', 'date': '2019-01-17', 'count': '14452'},
 {'name': 'events-2019-01-18.jsonl', 'date': '2019-01-18', 'count': '12056'},
 {'name': 'events-2019-01-19.jsonl', 'date': '2019-01-19', 'count': '7590'},
 {'name': 'events-2019-01-20.jsonl', 'date': '2019-01-20', 'count': '6740'},
 {'name': 'events-2019-01-21.jsonl', 'date': '2019-01-21', 'count': '12507'},
 {'name': 'events-2019-01-22.jsonl', 'date': '2019-01-22', 'count': '15355'},
 {'name': 'events-2019-01-23.jsonl', 'date': '2019-01-23', 'count': '16319'},
 {'name': 'events-2019-01-24.jsonl', 'date': '2019-01-24', 'count': '16732'},
 {'name': 'events-2019-01-25.jsonl', 'date': '2019-01-25', 'count': '13642'},
 {'name': 'events-2019-01-26.jsonl', 'date': '2019-01-26', 'count': '6976'},
 {'name': 'events-2019-01-27.jsonl', 'date': '2019-01-27', 'count': '7570'},
 {'name': 'events-2019-01-28.jsonl', 'date': '2019-01-28', 'count': '15906'},
 {'name': 'events-2019-01-29.jsonl', 'date': '2019-01-29', 'count': '15534'},
 {'name': 'events-2019-01-30.jsonl', 'date': '2019-01-30', 'count': '15183'},
 {'name': 'events-2019-01-31.jsonl', 'date': '2019-01-31', 'count': '14421'},
 {'name': 'events-2019-02-01.jsonl', 'date': '2019-02-01', 'count': '12352'},
 {'name': 'events-2019-02-02.jsonl', 'date': '2019-02-02', 'count': '7113'},
 {'name': 'events-2019-02-03.jsonl', 'date': '2019-02-03', 'count': '7331'},
 {'name': 'events-2019-02-04.jsonl', 'date': '2019-02-04', 'count': '14493'},
 {'name': 'events-2019-02-05.jsonl', 'date': '2019-02-05', 'count': '14053'},
 {'name': 'events-2019-02-06.jsonl', 'date': '2019-02-06', 'count': '15600'},
 {'name': 'events-2019-02-07.jsonl', 'date': '2019-02-07', 'count': '17158'},
 {'name': 'events-2019-02-08.jsonl', 'date': '2019-02-08', 'count': '14107'},
 {'name': 'events-2019-02-09.jsonl', 'date': '2019-02-09', 'count': '7209'},
 {'name': 'events-2019-02-10.jsonl', 'date': '2019-02-10', 'count': '7422'},
 {'name': 'events-2019-02-11.jsonl', 'date': '2019-02-11', 'count': '17085'},
 {'name': 'events-2019-02-12.jsonl', 'date': '2019-02-12', 'count': '17286'},
 {'name': 'events-2019-02-13.jsonl', 'date': '2019-02-13', 'count': '17181'},
 {'name': 'events-2019-02-14.jsonl', 'date': '2019-02-14', 'count': '19298'},
 {'name': 'events-2019-02-15.jsonl', 'date': '2019-02-15', 'count': '13387'},
 {'name': 'events-2019-02-16.jsonl', 'date': '2019-02-16', 'count': '8182'},
 {'name': 'events-2019-02-17.jsonl', 'date': '2019-02-17', 'count': '8142'},
 {'name': 'events-2019-02-18.jsonl', 'date': '2019-02-18', 'count': '16364'},
 {'name': 'events-2019-02-19.jsonl', 'date': '2019-02-19', 'count': '18090'},
 {'name': 'events-2019-02-20.jsonl', 'date': '2019-02-20', 'count': '17441'},
 {'name': 'events-2019-02-21.jsonl', 'date': '2019-02-21', 'count': '18844'},
 {'name': 'events-2019-02-22.jsonl', 'date': '2019-02-22', 'count': '15400'},
 {'name': 'events-2019-02-23.jsonl', 'date': '2019-02-23', 'count': '8879'},
 {'name': 'events-2019-02-24.jsonl', 'date': '2019-02-24', 'count': '9342'},
 {'name': 'events-2019-02-25.jsonl', 'date': '2019-02-25', 'count': '16999'},
 {'name': 'events-2019-02-26.jsonl', 'date': '2019-02-26', 'count': '18514'},
 {'name': 'events-2019-02-27.jsonl', 'date': '2019-02-27', 'count': '15799'},
 {'name': 'events-2019-02-28.jsonl', 'date': '2019-02-28', 'count': '18702'},
 {'name': 'events-2019-03-01.jsonl', 'date': '2019-03-01', 'count': '14222'},
 {'name': 'events-2019-03-02.jsonl', 'date': '2019-03-02', 'count': '8990'},
 {'name': 'events-2019-03-03.jsonl', 'date': '2019-03-03', 'count': '8503'},
 {'name': 'events-2019-03-04.jsonl', 'date': '2019-03-04', 'count': '17427'},
 {'name': 'events-2019-03-05.jsonl', 'date': '2019-03-05', 'count': '17732'},
 {'name': 'events-2019-03-06.jsonl', 'date': '2019-03-06', 'count': '17532'},
 {'name': 'events-2019-03-07.jsonl', 'date': '2019-03-07', 'count': '17622'},
 {'name': 'events-2019-03-08.jsonl', 'date': '2019-03-08', 'count': '13110'},
 {'name': 'events-2019-03-09.jsonl', 'date': '2019-03-09', 'count': '9132'},
 {'name': 'events-2019-03-10.jsonl', 'date': '2019-03-10', 'count': '8989'},
 {'name': 'events-2019-03-11.jsonl', 'date': '2019-03-11', 'count': '16334'},
 {'name': 'events-2019-03-12.jsonl', 'date': '2019-03-12', 'count': '18637'},
 {'name': 'events-2019-03-13.jsonl', 'date': '2019-03-13', 'count': '18355'},
 {'name': 'events-2019-03-14.jsonl', 'date': '2019-03-14', 'count': '18657'},
 {'name': 'events-2019-03-15.jsonl', 'date': '2019-03-15', 'count': '15206'},
 {'name': 'events-2019-03-16.jsonl', 'date': '2019-03-16', 'count': '8606'},
 {'name': 'events-2019-03-17.jsonl', 'date': '2019-03-17', 'count': '8110'},
 {'name': 'events-2019-03-18.jsonl', 'date': '2019-03-18', 'count': '15846'},
 {'name': 'events-2019-03-19.jsonl', 'date': '2019-03-19', 'count': '17909'},
 {'name': 'events-2019-03-20.jsonl', 'date': '2019-03-20', 'count': '15610'},
 {'name': 'events-2019-03-21.jsonl', 'date': '2019-03-21', 'count': '14671'},
 {'name': 'events-2019-03-22.jsonl', 'date': '2019-03-22', 'count': '12962'},
 {'name': 'events-2019-03-23.jsonl', 'date': '2019-03-23', 'count': '7941'},
 {'name': 'events-2019-03-24.jsonl', 'date': '2019-03-24', 'count': '7248'},
 {'name': 'events-2019-03-25.jsonl', 'date': '2019-03-25', 'count': '16775'},
 {'name': 'events-2019-03-26.jsonl', 'date': '2019-03-26', 'count': '18064'},
 {'name': 'events-2019-03-27.jsonl', 'date': '2019-03-27', 'count': '17773'},
 {'name': 'events-2019-03-28.jsonl', 'date': '2019-03-28', 'count': '17945'},
 {'name': 'events-2019-03-29.jsonl', 'date': '2019-03-29', 'count': '13126'},
 {'name': 'events-2019-03-30.jsonl', 'date': '2019-03-30', 'count': '7315'},
 {'name': 'events-2019-03-31.jsonl', 'date': '2019-03-31', 'count': '7750'},
 {'name': 'events-2019-04-01.jsonl', 'date': '2019-04-01', 'count': '16049'},
 {'name': 'events-2019-04-02.jsonl', 'date': '2019-04-02', 'count': '18909'},
 {'name': 'events-2019-04-03.jsonl', 'date': '2019-04-03', 'count': '17629'},
 {'name': 'events-2019-04-04.jsonl', 'date': '2019-04-04', 'count': '17635'},
 {'name': 'events-2019-04-05.jsonl', 'date': '2019-04-05', 'count': '14057'},
 {'name': 'events-2019-04-06.jsonl', 'date': '2019-04-06', 'count': '8297'},
 {'name': 'events-2019-04-07.jsonl', 'date': '2019-04-07', 'count': '8726'},
 {'name': 'events-2019-04-08.jsonl', 'date': '2019-04-08', 'count': '18217'},
 {'name': 'events-2019-04-09.jsonl', 'date': '2019-04-09', 'count': '17833'},
 {'name': 'events-2019-04-10.jsonl', 'date': '2019-04-10', 'count': '19018'},
 {'name': 'events-2019-04-11.jsonl', 'date': '2019-04-11', 'count': '19173'},
 {'name': 'events-2019-04-12.jsonl', 'date': '2019-04-12', 'count': '15502'},
 {'name': 'events-2019-04-13.jsonl', 'date': '2019-04-13', 'count': '7839'},
 {'name': 'events-2019-04-14.jsonl', 'date': '2019-04-14', 'count': '8119'},
 {'name': 'events-2019-04-15.jsonl', 'date': '2019-04-15', 'count': '14567'},
 {'name': 'events-2019-04-16.jsonl', 'date': '2019-04-16', 'count': '16254'},
 {'name': 'events-2019-04-17.jsonl', 'date': '2019-04-17', 'count': '15211'},
 {'name': 'events-2019-04-18.jsonl', 'date': '2019-04-18', 'count': '15989'},
 {'name': 'events-2019-04-19.jsonl', 'date': '2019-04-19', 'count': '11296'},
 {'name': 'events-2019-04-20.jsonl', 'date': '2019-04-20', 'count': '8527'},
 {'name': 'events-2019-04-21.jsonl', 'date': '2019-04-21', 'count': '7861'},
 {'name': 'events-2019-04-22.jsonl', 'date': '2019-04-22', 'count': '13118'},
 {'name': 'events-2019-04-23.jsonl', 'date': '2019-04-23', 'count': '16865'},
 {'name': 'events-2019-04-24.jsonl', 'date': '2019-04-24', 'count': '17125'},
 {'name': 'events-2019-04-25.jsonl', 'date': '2019-04-25', 'count': '18687'},
 {'name': 'events-2019-04-26.jsonl', 'date': '2019-04-26', 'count': '16476'},
 {'name': 'events-2019-04-27.jsonl', 'date': '2019-04-27', 'count': '9517'},
 {'name': 'events-2019-04-28.jsonl', 'date': '2019-04-28', 'count': '9435'},
 {'name': 'events-2019-04-29.jsonl', 'date': '2019-04-29', 'count': '15896'},
 {'name': 'events-2019-04-30.jsonl', 'date': '2019-04-30', 'count': '16116'},
 {'name': 'events-2019-05-01.jsonl', 'date': '2019-05-01', 'count': '11664'},
 {'name': 'events-2019-05-02.jsonl', 'date': '2019-05-02', 'count': '15713'},
 {'name': 'events-2019-05-03.jsonl', 'date': '2019-05-03', 'count': '14162'},
 {'name': 'events-2019-05-04.jsonl', 'date': '2019-05-04', 'count': '8356'},
 {'name': 'events-2019-05-05.jsonl', 'date': '2019-05-05', 'count': '8610'},
 {'name': 'events-2019-05-06.jsonl', 'date': '2019-05-06', 'count': '15230'},
 {'name': 'events-2019-05-07.jsonl', 'date': '2019-05-07', 'count': '16286'},
 {'name': 'events-2019-05-08.jsonl', 'date': '2019-05-08', 'count': '17393'},
 {'name': 'events-2019-05-09.jsonl', 'date': '2019-05-09', 'count': '16657'},
 {'name': 'events-2019-05-10.jsonl', 'date': '2019-05-10', 'count': '13726'},
 {'name': 'events-2019-05-11.jsonl', 'date': '2019-05-11', 'count': '8098'},
 {'name': 'events-2019-05-12.jsonl', 'date': '2019-05-12', 'count': '8217'},
 {'name': 'events-2019-05-13.jsonl', 'date': '2019-05-13', 'count': '16635'},
 {'name': 'events-2019-05-14.jsonl', 'date': '2019-05-14', 'count': '17309'},
 {'name': 'events-2019-05-15.jsonl', 'date': '2019-05-15', 'count': '15230'},
 {'name': 'events-2019-05-16.jsonl', 'date': '2019-05-16', 'count': '15208'},
 {'name': 'events-2019-05-17.jsonl', 'date': '2019-05-17', 'count': '13078'},
 {'name': 'events-2019-05-18.jsonl', 'date': '2019-05-18', 'count': '7788'},
 {'name': 'events-2019-05-19.jsonl', 'date': '2019-05-19', 'count': '7587'},
 {'name': 'events-2019-05-20.jsonl', 'date': '2019-05-20', 'count': '14891'},
 {'name': 'events-2019-05-21.jsonl', 'date': '2019-05-21', 'count': '16516'},
 {'name': 'events-2019-05-22.jsonl', 'date': '2019-05-22', 'count': '18627'},
 {'name': 'events-2019-05-23.jsonl', 'date': '2019-05-23', 'count': '16218'},
 {'name': 'events-2019-05-24.jsonl', 'date': '2019-05-24', 'count': '12376'},
 {'name': 'events-2019-05-25.jsonl', 'date': '2019-05-25', 'count': '8312'},
 {'name': 'events-2019-05-26.jsonl', 'date': '2019-05-26', 'count': '6938'},
 {'name': 'events-2019-05-27.jsonl', 'date': '2019-05-27', 'count': '13366'},
 {'name': 'events-2019-05-28.jsonl', 'date': '2019-05-28', 'count': '15430'},
 {'name': 'events-2019-05-29.jsonl', 'date': '2019-05-29', 'count': '14477'},
 {'name': 'events-2019-05-30.jsonl', 'date': '2019-05-30', 'count': '13264'},
 {'name': 'events-2019-05-31.jsonl', 'date': '2019-05-31', 'count': '11721'},
 {'name': 'events-2019-06-01.jsonl', 'date': '2019-06-01', 'count': '6994'},
 {'name': 'events-2019-06-02.jsonl', 'date': '2019-06-02', 'count': '6808'},
 {'name': 'events-2019-06-03.jsonl', 'date': '2019-06-03', 'count': '9141'},
 {'name': 'events-2019-06-04.jsonl', 'date': '2019-06-04', 'count': '14414'},
 {'name': 'events-2019-06-05.jsonl', 'date': '2019-06-05', 'count': '13852'},
 {'name': 'events-2019-06-06.jsonl', 'date': '2019-06-06', 'count': '15534'},
 {'name': 'events-2019-06-07.jsonl', 'date': '2019-06-07', 'count': '11335'},
 {'name': 'events-2019-06-08.jsonl', 'date': '2019-06-08', 'count': '6799'},
 {'name': 'events-2019-06-09.jsonl', 'date': '2019-06-09', 'count': '7062'},
 {'name': 'events-2019-06-10.jsonl', 'date': '2019-06-10', 'count': '12834'},
 {'name': 'events-2019-06-11.jsonl', 'date': '2019-06-11', 'count': '14359'},
 {'name': 'events-2019-06-12.jsonl', 'date': '2019-06-12', 'count': '14899'},
 {'name': 'events-2019-06-13.jsonl', 'date': '2019-06-13', 'count': '15819'},
 {'name': 'events-2019-06-14.jsonl', 'date': '2019-06-14', 'count': '11579'},
 {'name': 'events-2019-06-15.jsonl', 'date': '2019-06-15', 'count': '6267'},
 {'name': 'events-2019-06-16.jsonl', 'date': '2019-06-16', 'count': '6274'},
 {'name': 'events-2019-06-17.jsonl', 'date': '2019-06-17', 'count': '12672'},
 {'name': 'events-2019-06-18.jsonl', 'date': '2019-06-18', 'count': '14996'},
 {'name': 'events-2019-06-19.jsonl', 'date': '2019-06-19', 'count': '17509'},
 {'name': 'events-2019-06-20.jsonl', 'date': '2019-06-20', 'count': '14436'},
 {'name': 'events-2019-06-21.jsonl', 'date': '2019-06-21', 'count': '13387'},
 {'name': 'events-2019-06-22.jsonl', 'date': '2019-06-22', 'count': '6912'},
 {'name': 'events-2019-06-23.jsonl', 'date': '2019-06-23', 'count': '6485'},
 {'name': 'events-2019-06-24.jsonl', 'date': '2019-06-24', 'count': '14478'},
 {'name': 'events-2019-06-25.jsonl', 'date': '2019-06-25', 'count': '15199'},
 {'name': 'events-2019-06-26.jsonl', 'date': '2019-06-26', 'count': '15114'},
 {'name': 'events-2019-06-27.jsonl', 'date': '2019-06-27', 'count': '16424'},
 {'name': 'events-2019-06-28.jsonl', 'date': '2019-06-28', 'count': '15936'},
 {'name': 'events-2019-06-29.jsonl', 'date': '2019-06-29', 'count': '7213'},
 {'name': 'events-2019-06-30.jsonl', 'date': '2019-06-30', 'count': '6855'},
 {'name': 'events-2019-07-01.jsonl', 'date': '2019-07-01', 'count': '16461'},
 {'name': 'events-2019-07-02.jsonl', 'date': '2019-07-02', 'count': '15384'},
 {'name': 'events-2019-07-03.jsonl', 'date': '2019-07-03', 'count': '15709'},
 {'name': 'events-2019-07-04.jsonl', 'date': '2019-07-04', 'count': '14922'},
 {'name': 'events-2019-07-05.jsonl', 'date': '2019-07-05', 'count': '16336'},
 {'name': 'events-2019-07-06.jsonl', 'date': '2019-07-06', 'count': '6732'},
 {'name': 'events-2019-07-07.jsonl', 'date': '2019-07-07', 'count': '6954'},
 {'name': 'events-2019-07-08.jsonl', 'date': '2019-07-08', 'count': '18121'},
 {'name': 'events-2019-07-09.jsonl', 'date': '2019-07-09', 'count': '18321'},
 {'name': 'events-2019-07-10.jsonl', 'date': '2019-07-10', 'count': '15141'},
 {'name': 'events-2019-07-11.jsonl', 'date': '2019-07-11', 'count': '15025'},
 {'name': 'events-2019-07-12.jsonl', 'date': '2019-07-12', 'count': '13490'},
 {'name': 'events-2019-07-13.jsonl', 'date': '2019-07-13', 'count': '7508'},
 {'name': 'events-2019-07-14.jsonl', 'date': '2019-07-14', 'count': '7056'},
 {'name': 'events-2019-07-15.jsonl', 'date': '2019-07-15', 'count': '13588'},
 {'name': 'events-2019-07-16.jsonl', 'date': '2019-07-16', 'count': '15043'},
 {'name': 'events-2019-07-17.jsonl', 'date': '2019-07-17', 'count': '13545'},
 {'name': 'events-2019-07-18.jsonl', 'date': '2019-07-18', 'count': '13197'},
 {'name': 'events-2019-07-19.jsonl', 'date': '2019-07-19', 'count': '12350'},
 {'name': 'events-2019-07-20.jsonl', 'date': '2019-07-20', 'count': '8074'},
 {'name': 'events-2019-07-21.jsonl', 'date': '2019-07-21', 'count': '7701'},
 {'name': 'events-2019-07-22.jsonl', 'date': '2019-07-22', 'count': '13099'},
 {'name': 'events-2019-07-23.jsonl', 'date': '2019-07-23', 'count': '15365'},
 {'name': 'events-2019-07-24.jsonl', 'date': '2019-07-24', 'count': '14878'},
 {'name': 'events-2019-07-25.jsonl', 'date': '2019-07-25', 'count': '13480'},
 {'name': 'events-2019-07-26.jsonl', 'date': '2019-07-26', 'count': '11324'},
 {'name': 'events-2019-07-27.jsonl', 'date': '2019-07-27', 'count': '7142'},
 {'name': 'events-2019-07-28.jsonl', 'date': '2019-07-28', 'count': '7413'},
 {'name': 'events-2019-07-29.jsonl', 'date': '2019-07-29', 'count': '12181'},
 {'name': 'events-2019-07-30.jsonl', 'date': '2019-07-30', 'count': '13921'},
 {'name': 'events-2019-07-31.jsonl', 'date': '2019-07-31', 'count': '13653'},
 {'name': 'events-2019-08-01.jsonl', 'date': '2019-08-01', 'count': '12863'},
 {'name': 'events-2019-08-02.jsonl', 'date': '2019-08-02', 'count': '11907'},
 {'name': 'events-2019-08-03.jsonl', 'date': '2019-08-03', 'count': '7599'},
 {'name': 'events-2019-08-04.jsonl', 'date': '2019-08-04', 'count': '7344'},
 {'name': 'events-2019-08-05.jsonl', 'date': '2019-08-05', 'count': '12694'},
 {'name': 'events-2019-08-06.jsonl', 'date': '2019-08-06', 'count': '13990'},
 {'name': 'events-2019-08-07.jsonl', 'date': '2019-08-07', 'count': '14971'},
 {'name': 'events-2019-08-08.jsonl', 'date': '2019-08-08', 'count': '13643'},
 {'name': 'events-2019-08-09.jsonl', 'date': '2019-08-09', 'count': '12367'},
 {'name': 'events-2019-08-10.jsonl', 'date': '2019-08-10', 'count': '7689'},
 {'name': 'events-2019-08-11.jsonl', 'date': '2019-08-11', 'count': '7181'},
 {'name': 'events-2019-08-12.jsonl', 'date': '2019-08-12', 'count': '11641'},
 {'name': 'events-2019-08-13.jsonl', 'date': '2019-08-13', 'count': '14053'},
 {'name': 'events-2019-08-14.jsonl', 'date': '2019-08-14', 'count': '14120'},
 {'name': 'events-2019-08-15.jsonl', 'date': '2019-08-15', 'count': '12333'},
 {'name': 'events-2019-08-16.jsonl', 'date': '2019-08-16', 'count': '12151'},
 {'name': 'events-2019-08-17.jsonl', 'date': '2019-08-17', 'count': '4913'}]
[5]:
filenames = (db.read_text('https://archive.analytics.mybinder.org/index.jsonl')
               .map(json.loads)
               .pluck('name')
               .compute())

filenames = ['https://archive.analytics.mybinder.org/' + fn for fn in filenames]
filenames[:5]
[5]:
['https://archive.analytics.mybinder.org/events-2018-11-03.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-04.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-05.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-06.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-07.jsonl']
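The index records also carry a count field, which the output above shows is stored as a string. The sketch below uses plain Python on the first three index records to build full URLs and to total the counts. It swaps in urljoin for the string concatenation used in the cell above, and BASE_URL is an illustrative name, not something defined by the archive.

```python
from urllib.parse import urljoin

BASE_URL = "https://archive.analytics.mybinder.org/"  # illustrative constant

# First three index records, copied from the output above.
index = [
    {"name": "events-2018-11-03.jsonl", "date": "2018-11-03", "count": "7057"},
    {"name": "events-2018-11-04.jsonl", "date": "2018-11-04", "count": "7489"},
    {"name": "events-2018-11-05.jsonl", "date": "2018-11-05", "count": "13590"},
]

# Resolve each file name against the base URL.
urls = [urljoin(BASE_URL, rec["name"]) for rec in index]

# The counts are strings, so convert them before summing.
total = sum(int(rec["count"]) for rec in index)

print(urls[0])  # https://archive.analytics.mybinder.org/events-2018-11-03.jsonl
print(total)    # 28136
```

Summing the count field this way gives a quick estimate of how many events a set of files holds before downloading any of them.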

Create Bag of all events

We now create a Dask Bag around that list of URLs, and then call the json.loads function on every line to turn those lines of JSON-encoded text into Python dictionaries that can be more easily manipulated.

[6]:
events = db.read_text(filenames).map(json.loads)
events.take(2)
[6]:
({'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'Qiskit/qiskit-tutorial/master',
  'status': 'success'},
 {'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'ipython/ipython-in-depth/master',
  'status': 'success'})
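To illustrate what "more easily manipulated" can mean in practice, here is a small sketch that works on one of the dictionaries shown above. It parses the timestamp with datetime.fromisoformat (available in Python 3.7 and later) and splits the spec into three parts. The owner/repository/ref layout is an assumption based on the GitHub examples in this notebook; specs from other providers may not follow it.

```python
from datetime import datetime

# One parsed event, copied from the output above.
event = {
    "timestamp": "2018-11-03T00:00:00+00:00",
    "schema": "binderhub.jupyter.org/launch",
    "version": 1,
    "provider": "GitHub",
    "spec": "ipython/ipython-in-depth/master",
    "status": "success",
}

# Parse the ISO 8601 timestamp into a timezone-aware datetime.
launched = datetime.fromisoformat(event["timestamp"])
print(launched.date())     # 2018-11-03
print(launched.weekday())  # 5 (Monday is 0, so this is a Saturday)

# Assumed layout for GitHub specs: owner/repository/ref.
# maxsplit=2 keeps any further slashes inside the ref.
owner, repo, ref = event["spec"].split("/", 2)
print(owner, repo, ref)    # ipython ipython-in-depth master
```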

Convert to Dask Dataframe

Finally, we can convert our bag of Python dictionaries into a Dask Dataframe, and follow up with more Pandas-like computations.

We’ll count how often each spec was launched and list the 20 most popular, using Pandas syntax.

[8]:
df = events.to_dataframe()
df.head()
[8]:
   timestamp                  schema                        version  provider  spec                               status
0  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch  1        GitHub    Qiskit/qiskit-tutorial/master      success
1  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch  1        GitHub    ipython/ipython-in-depth/master    success
2  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch  1        GitHub    QISKit/qiskit-tutorial/master      success
3  2018-11-03T00:01:00+00:00  binderhub.jupyter.org/launch  1        GitHub    QISKit/qiskit-tutorial/master      success
4  2018-11-03T00:01:00+00:00  binderhub.jupyter.org/launch  1        GitHub    jupyterlab/jupyterlab-demo/master  success
[9]:
df.spec.value_counts().nlargest(20).to_frame().compute()
[9]:
spec
ipython/ipython-in-depth/master 1618386
jupyterlab/jupyterlab-demo/master 335485
ines/spacy-io-binder/live 151412
DS-100/textbook/master 124943
jupyterlab/jupyterlab-demo/try.jupyter.org 83268
bokeh/bokeh-notebooks/master 77593
binder-examples/r/master 57825
rationalmatter/juno-demo-notebooks/master 48439
binder-examples/requirements/master 44305
QuantStack/xeus-cling/stable 41522
ines/spacy-course/binder 25141
numba/numba-examples/master 23959
binder-examples/julia-python/master 21023
QISKit/qiskit-tutorial/master 19288
RasaHQ/rasa_core/master 19135
dask/dask-examples/master 14623
data-8/textbook/gh-pages 12135
wshuyi/demo-spacy-text-processing/master 11891
rasahq/docs-binder/master 10288
nteract/examples/master 10168
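One detail to keep in mind when reading these counts: value_counts treats specs that differ only in letter case as distinct values, and the sample rows above contain both Qiskit/qiskit-tutorial/master and QISKit/qiskit-tutorial/master. The sketch below uses collections.Counter on the five sample rows to show the effect of folding case. It assumes both spellings refer to the same repository, and fold_repo_case is a hypothetical helper, not part of this notebook.

```python
from collections import Counter

# The spec values of the five rows shown by df.head() above.
specs = [
    "Qiskit/qiskit-tutorial/master",
    "ipython/ipython-in-depth/master",
    "QISKit/qiskit-tutorial/master",
    "QISKit/qiskit-tutorial/master",
    "jupyterlab/jupyterlab-demo/master",
]


def fold_repo_case(spec):
    """Lowercase owner and repository, but leave the ref untouched.

    Hypothetical helper; assumes the owner/repository/ref layout.
    """
    owner, repo, ref = spec.split("/", 2)
    return f"{owner.lower()}/{repo.lower()}/{ref}"


raw = Counter(specs)
folded = Counter(fold_repo_case(s) for s in specs)

print(raw.most_common(1))     # [('QISKit/qiskit-tutorial/master', 2)]
print(folded.most_common(1))  # [('qiskit/qiskit-tutorial/master', 3)]
```

The ref is left as written because branch and tag names are case-sensitive in Git.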

Persist in memory

This dataset fits nicely into memory. Let’s avoid downloading data every time we do an operation and instead keep the data local in memory.

[10]:
df = df.persist()

Honestly, at this point it makes more sense to just switch to Pandas, but this is a Dask example, so we’ll continue with Dask dataframe.

Investigate providers other than GitHub

Most binders are specified as git repositories on GitHub, but not all. Let’s investigate other providers.

[11]:
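import urllib.parse  # import the submodule explicitly; a bare "import urllib" is not guaranteed to load urllib.parse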
[12]:
df.provider.value_counts().compute()
[12]:
GitHub    3513648
Gist        14585
GitLab      12379
Git          4756
Zenodo        166
Name: provider, dtype: int64
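To put these counts in proportion, the following worked arithmetic uses the numbers printed above to compute the share of launches that came from providers other than GitHub. The dictionary is simply a hand-copied version of that output.

```python
# Provider counts copied from the output above.
counts = {
    "GitHub": 3513648,
    "Gist": 14585,
    "GitLab": 12379,
    "Git": 4756,
    "Zenodo": 166,
}

total = sum(counts.values())
other = total - counts["GitHub"]

print(total)                   # 3545534
print(other)                   # 31886
print(f"{other / total:.2%}")  # 0.90%
```

So in this snapshot fewer than one launch in a hundred used a provider other than GitHub.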
[13]:
(df[df.provider == 'GitLab']
 .spec
 .map(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[13]:
spec
rruizz/inforfis/R 1479
rruizz/inforfis/master 1261
lfortran/web/lfortran-binder/master 1058
ul-fri/ovs/python/master 803
utt-connected-innovation/ia-course-2019/master 799
DGothrek/ipyaggrid/binder-demo 636
bhugueney/cxx-init-for-python-dev/master 461
dbernhard/PythonM1/master 400
dbernhard/pythonm1s2/master 271
shadaba/lss-handson/master 261
dbernhard/ProgHumaNumTAL/master 253
albert.van.breemen/masterclassdeeplearning/master 247
dbernhard/JavaM2/master 236
amarandon/presentation-jupyter/master 152
clemej/data601-clemens-fall18/master 132
memeplex/dhouse/master 121
wichit2s/programmingfundamentals/master 115
open-scientist/formation-data-reproductibilite/master 111
slloyd/python-introduction/master 101
kitsunix/pyHIBP/pyHIBP-binder/master 97
biehl/jscatter/master 91
kkmann/shortcourse-data-science-toolbox/master 84
energyincities/besos-public/master 81
fkohrt/mri/master 81
snowhitiger/learn_deep_learning/master 80
andrey.kovalev/imagination/master 71
rruizz/inforfis/autin 62
g2lab/fossgis2019-geopython-vector/master 60
oscar6echo/ipyupload-repo2docker/master 60
sgmarkets/sgmarkets-api-notebooks/master 60
... ...
jdiep/master-thesis/master 1
passakornC/test/master 1
butzked/equivalence/dev 1
ozborniasty1/junotry/master 1
daksh7011/metis/master 1
coobas/dask-pipelines/master 1
nmg/704p2/master 1
nmg/enee704h1/master 1
plotnips/numerical_linear_algebra_notebooks/master 1
anarcat/terms-benchmarks/master 1
katylava/minimal-binderhub/master 1
ppleskov/kaggle-days-sf/master 1
adophobr/lightwavesystens/master 1
adv-ds-ws/workshop_day3/master 1
agrumery/aGrUM/0.13.5 1
knighteq/jupyter/master 1
ruivieira/matplotrust/b64 1
rogermarkussen/test/master 1
krsna1/qml-mooc/PhaseEstimationUsingCirq 1
elfua/ipython-notebooks/master 1
airss/airss-demo/06a3ea0d 1
rdubwiley/mi_campaign_finance/master 1
rahulvigneswaran/staticdmd/master 1
alandrex/dhbw-2019-1/master 1
ecuracosta/notebooks-test/master 1
littleredridingfox/decision-theory-coursework/master 1
prokopevav85/geojsontest/master 1
praj88/deepembeddings/master 1
lokeller/matrix-bot/master 1
burdickjp/diamondknobpresentation/master 1

329 rows × 1 columns
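The call to urllib.parse.unquote in the cell above suggests that the raw GitLab spec values contain percent-encoded characters. The raw values are not shown in this notebook, so the encoded string below is a hypothetical example that assumes the slash inside the namespace/project part is encoded as %2F. The sketch only demonstrates what unquote does to such a string.

```python
from urllib.parse import quote, unquote

# Hypothetical raw spec; assumes the namespace/project slash is percent-encoded.
raw_spec = "rruizz%2Finforfis/master"

# unquote replaces %XX escapes with the characters they stand for.
print(unquote(raw_spec))  # rruizz/inforfis/master

# Strings without escapes pass through unchanged.
print(unquote("ipython/ipython-in-depth/master"))  # ipython/ipython-in-depth/master

# quote with safe="" goes the other way and encodes the slash.
print(quote("rruizz/inforfis", safe=""))  # rruizz%2Finforfis
```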

[14]:
(df[df.provider == 'Git']
 .spec
 .apply(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[14]:
spec
https://bitbucket.org/nikiubel/nikiubel.bitbucket.io.git/841046b40e936fa187b974aabb310a6ac0ecd094 794
https://gitlab.kwant-project.org/solidstate/lectures.git/e6c970126f4e819e4a3eb717a86ba9fb523d20bf 153
https://bitbucket.org/saibotk/tp2_python/098f35f50066d96f26df6131a6cff8080e384708 79
https://bitbucket.org/gaur/1820/a4027afe0aa592d49f3989b2e9c8136c36322a77 75
https://gricad-gitlab.univ-grenoble-alpes.fr/nonsmooth/siconos-tutorial.git/b08a0514b22b3927b58bddce3c4018f27ac0fc7d 74
https://gitlab.rc.uab.edu/mcbios19_single_cell/single_cell_rnaseq_hands-on_1.git/a20f707fc0b67f6eb4f9bf85a5daacc52c125df6 73
https://collaborating.tuhh.de/cip3725/ib_base.git/0a1f4f66a1a3c29ff347b2abc79bb292b0be17ca 69
https://collaborating.tuhh.de/xldrkp/jupyter-notebook-beispiel/e895284be3c0a128ed97dc007fcd7497ef19d2bc 59
https://gitlab.mech.kuleuven.be/rob-expressiongraphs/docker/etasl-binder.git/127d6c9f33a938a98607505ef30237b5223000e4 55
https://git.axisgroup.com/publicgrp/datascience/training/027147d86306078761f6fc3675b83a500e4a8954 49
https://git.unilim.fr/grossp01/notebooks.git/6d65058071a0d5a62ab032800932c2a3740a4196 45
https://gricad-gitlab.univ-grenoble-alpes.fr/chatelaf/conference-ia/d98a199f66d603b0b1e7c25fbe1341d29a40cd39 42
https://api.jovian.ai/git/e556978bda9343f3b30b3a9fd2a25012_4.git/53d6f783b1521cb8b7c94000fcf2f8b4f3b1518c 42
https://risk-engineering.org/git/notebooks.git/928cb3f525f41beb132a1eeb046e1c35f07d770e 41
https://collaborating.tuhh.de/xldrkp/jupyter-test/9b9e2fb383107ff9c4bfdb22c3f97f5c62ed1f6d 39
https://bitbucket.org/cognify-salzburg/idsc-2019/4dd0deadc6e6f5ac9b3ba46991aa3e26bd1834ff 38
https://gitlab.univ-nantes.fr/mouchere-h/ImageDeep_ED_Project/ac2523952640533c9e549d14f2eed1534b105776 37
https://api.jovian.ai/git/e5cfe043873f4f3c9287507016747ae5_7.git/6bdf62a8772359a94a41081a2b7ac011e7911c84 36
https://gitlab.oceantrack.org/otndc/fact-workshop/a174f5bc60cf9f0c5b86851204c7c85fcfb98131 35
https://risk-engineering.org/git/notebooks.git/a18f7f0e6a707ccfa4478d3fc73d53ada11605fb 34
https://git.rcc.uchicago.edu/jhskone/multiproc_py.git/38f9bb6ce3602b73a8ddd1dbcad3f5f9a8d21f6a 34
https://risk-engineering.org/git/notebooks.git/527b4fdaa5fe43294ed5de196b96262b1c60522b 34
https://[email protected]/y2kbugger/sapy.git/de5086ea943c94fec40e14478257ab2716e28c96 32
https://risk-engineering.org/git/notebooks.git/da7de9f3df67ec45713cd91e575418aae9466621 31
https://gitlab.ethz.ch/darioce/sysbio_ss2019.git/4c383c1e118678cf26ad8b6d9c0be7f41b15ad69 30
https://framagit.org/mfauvel/omp_machine_learning/77d246991186d758147fc280eeef0eed563d525f 29
https://git.rwth-aachen.de/iks-public/mlsap/79c5e4a6b2aa749308c1db4780dfc275a27560ee 29
https://risk-engineering.org/git/notebooks.git/f55ff6d75610fa3d07a85623381645153ca141f0 28
https://bitbucket.org/qlouder/lumc-ml-caffe/acc2701f43f267361594e72980ec59602dab7fd7 27
https://bitbucket.org/saibotk/tp2_python/f1c06775263afbd814e9b08139eddf792873b050 26
... ...
https://github.com/groznyjgrad/jptr-notes/df06a44ebab118e7812efc2de08294ea8ab3bccc 1
https://github.com/lizapol1/TMDB_movie_dataset.git/master 1
https://github.com/msc0953/robotkernel/b9aed42102276cd993dd237ab5f820b001afc773 1
https://github.com/norvig/pytudes/171486979aa4facf7692d29a8576ae8fbf26b884 1
https://github.com/olivierzach/GTx_6040.git/75a70770fe1ffb924c62f4a2544aeb55f3dcc789 1
https://github.com/ostapkharysh/Data_visualisation/42c00131e615ae8444d3ccf680fa24e81050f275 1
https://github.com/rmitsch/mybinder-test/51e1202b83cf0270bd5dfa7350d0193b648dc931 1
https://github.com/rwightman/pytorch-image-models/master 1
https://github.com/san-soucie/probability_and_statistics/2717a4ea2807905e46eaa02909a8c42082877cc2 1
https://github.com/san-soucie/probability_and_statistics/3b7d868cbc55cf0386682ff5935c5387d4423fca 1
https://github.com/sazack/Coursera-Capstone/07affc5dc4b888c6c00ccc8ded742b756dc5359a 1
https://github.com/schoend/Hello/f91ae65cf1f47c206855d04c0956677eb74e5d47 1
https://github.com/sidujjain/notebooks.git/ddf57e8d9a0eb6756a23f7e6f1e279374ac25a74 1
https://github.com/tsusetzky/GK_Prog_kmeans.git/6e0f6d55081b59954516f4a4f69a2d7d9a875782 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/59d66082573d5672237e57bd9d20b4f7a8965cb7 1
https://github.com/wbogers/notebooks.git/2887c6399c52809ba4eb98328ecb4c79e3bb4b83 1
https://github.com/x1matrix/midas_x1.git/master 1
https://github.com/y-pakorn/py101assignment.git/b65290978116de348414496e5b9ef7e5e768483b 1
https://github.com/yc-tsui/electoral-systems/3a7727641ab6f187b95ee3d95a01f6ebbfc02a22 1
https://github.molgen.mpg.de/ledwards/FLASHcalculator/9da41d5d86cd75d9f96ac779b2b64e3e79932d9f 1
https://gitlab+deploy-token-2:[email protected]/jotez/useclecture2019.git/c848bb606409dcc305d82f18f0aa3b208e13e666 1
https://gitlab.cern.ch/fast-hep/public/fast_cms_public_tutorial.git/b9b54948839739b6ed7e087fe8e1b1eec724145a 1
https://gitlab.cern.ch/fsauerbu/nn-playground.git/d49192071bf899625236759647d8f47e74d47a88 1
https://gitlab.cern.ch/fsauerbu/vartrain.git/1f5a6e0ca304582de5ed8d3cf83403eeac4f95cf 1
https://gitlab.cern.ch/fsauerbu/vartrain.git/e80ab6680cc7f9706bd939fc9173f365acb9aee6 1
https://gitlab.cern.ch/fsauerbu/vartrain.git/master 1
https://gitlab.cern.ch/mproffit/TestStandIPCalculation/df0c91788e5cc4c05a2defdfe082a97ffed16401 1
https://gitlab.climent-pommeret.red/Chopopope/inf3212019-jupyter/7a62f0124dda06a83efe2c3f1d5bb24eeba24c6a 1
https://gitlab.climent-pommeret.red/Chopopope/inf3212019-jupyter/e62b5bc9fdd81f21f887d3ffc7def168577ffe2a 1
http://code.datamode.com/datamode/binder.git/85bd134f3713f8a8cb3e22612929987b49b675a5 1

857 rows × 1 columns
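
The cell above decodes each spec with urllib.parse.unquote before counting. As a minimal sketch of that decode-then-count step outside Dask, the snippet below uses only the standard library. The spec strings are hypothetical (they use example.org and are not taken from the dataset), and they assume the raw spec holds a URL-quoted repository URL followed by a slash and a ref, which is what the decoded output above suggests.

```python
from collections import Counter
import urllib.parse

# Hypothetical percent-encoded specs, shaped like the 'Git' provider records:
# a URL-quoted repository URL, a slash, then a ref. These are not real data.
encoded_specs = [
    "https%3A%2F%2Fexample.org%2Fgroup%2Frepo.git/abc123",
    "https%3A%2F%2Fexample.org%2Fgroup%2Frepo.git/abc123",
    "https%3A%2F%2Fexample.org%2Fother%2Fproject/master",
]

# unquote turns %3A back into ':' and %2F back into '/'.
decoded_specs = [urllib.parse.unquote(s) for s in encoded_specs]

# Counter plays the role that value_counts() plays on the Dask series.
counts = Counter(decoded_specs)

for spec, n in counts.most_common():
    print(spec, n)
```

In the Dask version, the same function is applied lazily to every partition, and the meta argument tells Dask the name and dtype of the resulting series so it does not have to infer them.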
