
Analyze web-hosted JSON data

This notebook reads and processes JSON-encoded data hosted on the web using a combination of Dask Bag and Dask Dataframe.

This data comes from mybinder.org, a web service that runs Jupyter notebooks live on the web (you may be running this notebook there now). My Binder publishes a record every time someone launches a live notebook like this one, and stores those records in publicly accessible JSON files, one file per day.

Introduction to the dataset

This data is stored as JSON-encoded text files on the public web. Here are some example lines.

[1]:
import dask.bag as db
db.read_text('https://archive.analytics.mybinder.org/events-2018-11-03.jsonl').take(3)
[1]:
('{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "QISKit/qiskit-tutorial/master", "status": "success"}\n')

We see that it includes one line for every time someone started a live notebook on the site. It includes the time that the notebook was started, as well as the repository from which it was served.

In this notebook we’ll look at many such files, parse them from JSON to Python dictionaries, and then from there to Pandas dataframes. We’ll then do some simple analyses on this data.
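As a small illustration of the parsing step, the sketch below decodes one of the sample lines shown above using only the standard library. The variable names are chosen here for illustration, and datetime.fromisoformat assumes Python 3.7 or later.

```python
import json
from datetime import datetime

# One raw line, copied from the sample output above (note the trailing newline).
line = (
    '{"timestamp": "2018-11-03T00:00:00+00:00", '
    '"schema": "binderhub.jupyter.org/launch", "version": 1, '
    '"provider": "GitHub", "spec": "ipython/ipython-in-depth/master", '
    '"status": "success"}\n'
)

# json.loads tolerates surrounding whitespace, so the newline need not be stripped.
record = json.loads(line)

# The timestamp is an ISO 8601 string with an explicit UTC offset.
launched_at = datetime.fromisoformat(record["timestamp"])

print(record["provider"], record["spec"], launched_at.isoformat())
```

Each record is an ordinary Python dictionary after this step, which is what the Dask Bag will hold once json.loads is mapped over every line.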

Start Dask Client for Dashboard

Starting the Dask Client is optional. It also starts the dashboard, which is useful for gaining insight into the computation.

[2]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=1,
                n_workers=4,
                memory_limit='2GB')
client
[2]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 8.00 GB

Get a list of files on the web

The mybinder.org team maintains an index file that points to all of the other available JSON data files. Let's convert this to a list of URLs that we'll read in the next section.

[3]:
import dask.bag as db
import json
[4]:
db.read_text('https://archive.analytics.mybinder.org/index.jsonl').map(json.loads).compute()
[4]:
[{'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
 {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
 {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
 {'name': 'events-2018-11-06.jsonl', 'date': '2018-11-06', 'count': '13920'},
 {'name': 'events-2018-11-07.jsonl', 'date': '2018-11-07', 'count': '12766'},
 {'name': 'events-2018-11-08.jsonl', 'date': '2018-11-08', 'count': '14105'},
 {'name': 'events-2018-11-09.jsonl', 'date': '2018-11-09', 'count': '11843'},
 {'name': 'events-2018-11-10.jsonl', 'date': '2018-11-10', 'count': '7047'},
 {'name': 'events-2018-11-11.jsonl', 'date': '2018-11-11', 'count': '6940'},
 {'name': 'events-2018-11-12.jsonl', 'date': '2018-11-12', 'count': '16322'},
 {'name': 'events-2018-11-13.jsonl', 'date': '2018-11-13', 'count': '16530'},
 {'name': 'events-2018-11-14.jsonl', 'date': '2018-11-14', 'count': '14099'},
 {'name': 'events-2018-11-15.jsonl', 'date': '2018-11-15', 'count': '13182'},
 {'name': 'events-2018-11-16.jsonl', 'date': '2018-11-16', 'count': '12863'},
 {'name': 'events-2018-11-17.jsonl', 'date': '2018-11-17', 'count': '6490'},
 {'name': 'events-2018-11-18.jsonl', 'date': '2018-11-18', 'count': '7310'},
 {'name': 'events-2018-11-19.jsonl', 'date': '2018-11-19', 'count': '13348'},
 {'name': 'events-2018-11-20.jsonl', 'date': '2018-11-20', 'count': '13982'},
 {'name': 'events-2018-11-21.jsonl', 'date': '2018-11-21', 'count': '13165'},
 {'name': 'events-2018-11-22.jsonl', 'date': '2018-11-22', 'count': '12217'},
 {'name': 'events-2018-11-23.jsonl', 'date': '2018-11-23', 'count': '9070'},
 {'name': 'events-2018-11-24.jsonl', 'date': '2018-11-24', 'count': '6798'},
 {'name': 'events-2018-11-25.jsonl', 'date': '2018-11-25', 'count': '6796'},
 {'name': 'events-2018-11-26.jsonl', 'date': '2018-11-26', 'count': '13617'},
 {'name': 'events-2018-11-27.jsonl', 'date': '2018-11-27', 'count': '14964'},
 {'name': 'events-2018-11-28.jsonl', 'date': '2018-11-28', 'count': '14434'},
 {'name': 'events-2018-11-29.jsonl', 'date': '2018-11-29', 'count': '13845'},
 {'name': 'events-2018-11-30.jsonl', 'date': '2018-11-30', 'count': '12109'},
 {'name': 'events-2018-12-01.jsonl', 'date': '2018-12-01', 'count': '6785'},
 {'name': 'events-2018-12-02.jsonl', 'date': '2018-12-02', 'count': '7119'},
 {'name': 'events-2018-12-03.jsonl', 'date': '2018-12-03', 'count': '13946'},
 {'name': 'events-2018-12-04.jsonl', 'date': '2018-12-04', 'count': '13765'},
 {'name': 'events-2018-12-05.jsonl', 'date': '2018-12-05', 'count': '13106'},
 {'name': 'events-2018-12-06.jsonl', 'date': '2018-12-06', 'count': '12249'},
 {'name': 'events-2018-12-07.jsonl', 'date': '2018-12-07', 'count': '10687'},
 {'name': 'events-2018-12-08.jsonl', 'date': '2018-12-08', 'count': '6269'},
 {'name': 'events-2018-12-09.jsonl', 'date': '2018-12-09', 'count': '6639'},
 {'name': 'events-2018-12-10.jsonl', 'date': '2018-12-10', 'count': '12782'},
 {'name': 'events-2018-12-11.jsonl', 'date': '2018-12-11', 'count': '13442'},
 {'name': 'events-2018-12-12.jsonl', 'date': '2018-12-12', 'count': '13069'},
 {'name': 'events-2018-12-13.jsonl', 'date': '2018-12-13', 'count': '15279'},
 {'name': 'events-2018-12-14.jsonl', 'date': '2018-12-14', 'count': '9941'},
 {'name': 'events-2018-12-15.jsonl', 'date': '2018-12-15', 'count': '5358'},
 {'name': 'events-2018-12-16.jsonl', 'date': '2018-12-16', 'count': '6441'},
 {'name': 'events-2018-12-17.jsonl', 'date': '2018-12-17', 'count': '11332'},
 {'name': 'events-2018-12-18.jsonl', 'date': '2018-12-18', 'count': '11971'},
 {'name': 'events-2018-12-19.jsonl', 'date': '2018-12-19', 'count': '10818'},
 {'name': 'events-2018-12-20.jsonl', 'date': '2018-12-20', 'count': '9408'},
 {'name': 'events-2018-12-21.jsonl', 'date': '2018-12-21', 'count': '7741'},
 {'name': 'events-2018-12-22.jsonl', 'date': '2018-12-22', 'count': '4818'},
 {'name': 'events-2018-12-23.jsonl', 'date': '2018-12-23', 'count': '4870'},
 {'name': 'events-2018-12-24.jsonl', 'date': '2018-12-24', 'count': '5974'},
 {'name': 'events-2018-12-25.jsonl', 'date': '2018-12-25', 'count': '4737'},
 {'name': 'events-2018-12-26.jsonl', 'date': '2018-12-26', 'count': '6725'},
 {'name': 'events-2018-12-27.jsonl', 'date': '2018-12-27', 'count': '7998'},
 {'name': 'events-2018-12-28.jsonl', 'date': '2018-12-28', 'count': '8155'},
 {'name': 'events-2018-12-29.jsonl', 'date': '2018-12-29', 'count': '5108'},
 {'name': 'events-2018-12-30.jsonl', 'date': '2018-12-30', 'count': '4428'},
 {'name': 'events-2018-12-31.jsonl', 'date': '2018-12-31', 'count': '4561'},
 {'name': 'events-2019-01-01.jsonl', 'date': '2019-01-01', 'count': '4194'},
 {'name': 'events-2019-01-02.jsonl', 'date': '2019-01-02', 'count': '8559'},
 {'name': 'events-2019-01-03.jsonl', 'date': '2019-01-03', 'count': '9687'},
 {'name': 'events-2019-01-04.jsonl', 'date': '2019-01-04', 'count': '10048'},
 {'name': 'events-2019-01-05.jsonl', 'date': '2019-01-05', 'count': '6012'},
 {'name': 'events-2019-01-06.jsonl', 'date': '2019-01-06', 'count': '6019'},
 {'name': 'events-2019-01-07.jsonl', 'date': '2019-01-07', 'count': '11903'},
 {'name': 'events-2019-01-08.jsonl', 'date': '2019-01-08', 'count': '12777'},
 {'name': 'events-2019-01-09.jsonl', 'date': '2019-01-09', 'count': '13294'},
 {'name': 'events-2019-01-10.jsonl', 'date': '2019-01-10', 'count': '13112'},
 {'name': 'events-2019-01-11.jsonl', 'date': '2019-01-11', 'count': '10327'},
 {'name': 'events-2019-01-12.jsonl', 'date': '2019-01-12', 'count': '6434'},
 {'name': 'events-2019-01-13.jsonl', 'date': '2019-01-13', 'count': '7004'},
 {'name': 'events-2019-01-14.jsonl', 'date': '2019-01-14', 'count': '12898'},
 {'name': 'events-2019-01-15.jsonl', 'date': '2019-01-15', 'count': '12363'},
 {'name': 'events-2019-01-16.jsonl', 'date': '2019-01-16', 'count': '13444'},
 {'name': 'events-2019-01-17.jsonl', 'date': '2019-01-17', 'count': '14452'},
 {'name': 'events-2019-01-18.jsonl', 'date': '2019-01-18', 'count': '12056'},
 {'name': 'events-2019-01-19.jsonl', 'date': '2019-01-19', 'count': '7590'},
 {'name': 'events-2019-01-20.jsonl', 'date': '2019-01-20', 'count': '6740'},
 {'name': 'events-2019-01-21.jsonl', 'date': '2019-01-21', 'count': '12507'},
 {'name': 'events-2019-01-22.jsonl', 'date': '2019-01-22', 'count': '15355'},
 {'name': 'events-2019-01-23.jsonl', 'date': '2019-01-23', 'count': '16319'},
 {'name': 'events-2019-01-24.jsonl', 'date': '2019-01-24', 'count': '16732'},
 {'name': 'events-2019-01-25.jsonl', 'date': '2019-01-25', 'count': '13642'},
 {'name': 'events-2019-01-26.jsonl', 'date': '2019-01-26', 'count': '6976'},
 {'name': 'events-2019-01-27.jsonl', 'date': '2019-01-27', 'count': '7570'},
 {'name': 'events-2019-01-28.jsonl', 'date': '2019-01-28', 'count': '15906'},
 {'name': 'events-2019-01-29.jsonl', 'date': '2019-01-29', 'count': '15534'},
 {'name': 'events-2019-01-30.jsonl', 'date': '2019-01-30', 'count': '15183'},
 {'name': 'events-2019-01-31.jsonl', 'date': '2019-01-31', 'count': '14421'},
 {'name': 'events-2019-02-01.jsonl', 'date': '2019-02-01', 'count': '12352'},
 {'name': 'events-2019-02-02.jsonl', 'date': '2019-02-02', 'count': '7113'},
 {'name': 'events-2019-02-03.jsonl', 'date': '2019-02-03', 'count': '7331'},
 {'name': 'events-2019-02-04.jsonl', 'date': '2019-02-04', 'count': '14493'},
 {'name': 'events-2019-02-05.jsonl', 'date': '2019-02-05', 'count': '14053'},
 {'name': 'events-2019-02-06.jsonl', 'date': '2019-02-06', 'count': '15600'},
 {'name': 'events-2019-02-07.jsonl', 'date': '2019-02-07', 'count': '17158'},
 {'name': 'events-2019-02-08.jsonl', 'date': '2019-02-08', 'count': '14107'},
 {'name': 'events-2019-02-09.jsonl', 'date': '2019-02-09', 'count': '7209'},
 {'name': 'events-2019-02-10.jsonl', 'date': '2019-02-10', 'count': '7422'},
 {'name': 'events-2019-02-11.jsonl', 'date': '2019-02-11', 'count': '17085'},
 {'name': 'events-2019-02-12.jsonl', 'date': '2019-02-12', 'count': '17286'},
 {'name': 'events-2019-02-13.jsonl', 'date': '2019-02-13', 'count': '17181'},
 {'name': 'events-2019-02-14.jsonl', 'date': '2019-02-14', 'count': '19298'},
 {'name': 'events-2019-02-15.jsonl', 'date': '2019-02-15', 'count': '13387'},
 {'name': 'events-2019-02-16.jsonl', 'date': '2019-02-16', 'count': '8182'},
 {'name': 'events-2019-02-17.jsonl', 'date': '2019-02-17', 'count': '8142'},
 {'name': 'events-2019-02-18.jsonl', 'date': '2019-02-18', 'count': '16364'},
 {'name': 'events-2019-02-19.jsonl', 'date': '2019-02-19', 'count': '18090'},
 {'name': 'events-2019-02-20.jsonl', 'date': '2019-02-20', 'count': '17441'},
 {'name': 'events-2019-02-21.jsonl', 'date': '2019-02-21', 'count': '18844'},
 {'name': 'events-2019-02-22.jsonl', 'date': '2019-02-22', 'count': '15400'},
 {'name': 'events-2019-02-23.jsonl', 'date': '2019-02-23', 'count': '8879'},
 {'name': 'events-2019-02-24.jsonl', 'date': '2019-02-24', 'count': '9342'},
 {'name': 'events-2019-02-25.jsonl', 'date': '2019-02-25', 'count': '16999'},
 {'name': 'events-2019-02-26.jsonl', 'date': '2019-02-26', 'count': '18514'},
 {'name': 'events-2019-02-27.jsonl', 'date': '2019-02-27', 'count': '15799'},
 {'name': 'events-2019-02-28.jsonl', 'date': '2019-02-28', 'count': '18702'},
 {'name': 'events-2019-03-01.jsonl', 'date': '2019-03-01', 'count': '14222'},
 {'name': 'events-2019-03-02.jsonl', 'date': '2019-03-02', 'count': '8990'},
 {'name': 'events-2019-03-03.jsonl', 'date': '2019-03-03', 'count': '8503'},
 {'name': 'events-2019-03-04.jsonl', 'date': '2019-03-04', 'count': '17427'},
 {'name': 'events-2019-03-05.jsonl', 'date': '2019-03-05', 'count': '17732'},
 {'name': 'events-2019-03-06.jsonl', 'date': '2019-03-06', 'count': '17532'},
 {'name': 'events-2019-03-07.jsonl', 'date': '2019-03-07', 'count': '17622'},
 {'name': 'events-2019-03-08.jsonl', 'date': '2019-03-08', 'count': '13110'},
 {'name': 'events-2019-03-09.jsonl', 'date': '2019-03-09', 'count': '9132'},
 {'name': 'events-2019-03-10.jsonl', 'date': '2019-03-10', 'count': '8989'},
 {'name': 'events-2019-03-11.jsonl', 'date': '2019-03-11', 'count': '16334'},
 {'name': 'events-2019-03-12.jsonl', 'date': '2019-03-12', 'count': '18637'},
 {'name': 'events-2019-03-13.jsonl', 'date': '2019-03-13', 'count': '18355'},
 {'name': 'events-2019-03-14.jsonl', 'date': '2019-03-14', 'count': '18657'},
 {'name': 'events-2019-03-15.jsonl', 'date': '2019-03-15', 'count': '15206'},
 {'name': 'events-2019-03-16.jsonl', 'date': '2019-03-16', 'count': '8606'},
 {'name': 'events-2019-03-17.jsonl', 'date': '2019-03-17', 'count': '8110'},
 {'name': 'events-2019-03-18.jsonl', 'date': '2019-03-18', 'count': '15846'},
 {'name': 'events-2019-03-19.jsonl', 'date': '2019-03-19', 'count': '17909'},
 {'name': 'events-2019-03-20.jsonl', 'date': '2019-03-20', 'count': '15610'},
 {'name': 'events-2019-03-21.jsonl', 'date': '2019-03-21', 'count': '14671'},
 {'name': 'events-2019-03-22.jsonl', 'date': '2019-03-22', 'count': '12962'},
 {'name': 'events-2019-03-23.jsonl', 'date': '2019-03-23', 'count': '7941'},
 {'name': 'events-2019-03-24.jsonl', 'date': '2019-03-24', 'count': '7248'},
 {'name': 'events-2019-03-25.jsonl', 'date': '2019-03-25', 'count': '16775'},
 {'name': 'events-2019-03-26.jsonl', 'date': '2019-03-26', 'count': '18064'},
 {'name': 'events-2019-03-27.jsonl', 'date': '2019-03-27', 'count': '17773'},
 {'name': 'events-2019-03-28.jsonl', 'date': '2019-03-28', 'count': '17945'},
 {'name': 'events-2019-03-29.jsonl', 'date': '2019-03-29', 'count': '13126'},
 {'name': 'events-2019-03-30.jsonl', 'date': '2019-03-30', 'count': '7315'},
 {'name': 'events-2019-03-31.jsonl', 'date': '2019-03-31', 'count': '7750'},
 {'name': 'events-2019-04-01.jsonl', 'date': '2019-04-01', 'count': '16049'},
 {'name': 'events-2019-04-02.jsonl', 'date': '2019-04-02', 'count': '18909'},
 {'name': 'events-2019-04-03.jsonl', 'date': '2019-04-03', 'count': '17629'},
 {'name': 'events-2019-04-04.jsonl', 'date': '2019-04-04', 'count': '17635'},
 {'name': 'events-2019-04-05.jsonl', 'date': '2019-04-05', 'count': '14057'},
 {'name': 'events-2019-04-06.jsonl', 'date': '2019-04-06', 'count': '8297'},
 {'name': 'events-2019-04-07.jsonl', 'date': '2019-04-07', 'count': '8726'},
 {'name': 'events-2019-04-08.jsonl', 'date': '2019-04-08', 'count': '18217'},
 {'name': 'events-2019-04-09.jsonl', 'date': '2019-04-09', 'count': '17833'},
 {'name': 'events-2019-04-10.jsonl', 'date': '2019-04-10', 'count': '19018'},
 {'name': 'events-2019-04-11.jsonl', 'date': '2019-04-11', 'count': '19173'},
 {'name': 'events-2019-04-12.jsonl', 'date': '2019-04-12', 'count': '15502'},
 {'name': 'events-2019-04-13.jsonl', 'date': '2019-04-13', 'count': '7839'},
 {'name': 'events-2019-04-14.jsonl', 'date': '2019-04-14', 'count': '8119'},
 {'name': 'events-2019-04-15.jsonl', 'date': '2019-04-15', 'count': '14567'},
 {'name': 'events-2019-04-16.jsonl', 'date': '2019-04-16', 'count': '16254'},
 {'name': 'events-2019-04-17.jsonl', 'date': '2019-04-17', 'count': '15211'},
 {'name': 'events-2019-04-18.jsonl', 'date': '2019-04-18', 'count': '15989'},
 {'name': 'events-2019-04-19.jsonl', 'date': '2019-04-19', 'count': '11296'},
 {'name': 'events-2019-04-20.jsonl', 'date': '2019-04-20', 'count': '8527'},
 {'name': 'events-2019-04-21.jsonl', 'date': '2019-04-21', 'count': '7861'},
 {'name': 'events-2019-04-22.jsonl', 'date': '2019-04-22', 'count': '13118'},
 {'name': 'events-2019-04-23.jsonl', 'date': '2019-04-23', 'count': '16865'},
 {'name': 'events-2019-04-24.jsonl', 'date': '2019-04-24', 'count': '17125'},
 {'name': 'events-2019-04-25.jsonl', 'date': '2019-04-25', 'count': '18687'},
 {'name': 'events-2019-04-26.jsonl', 'date': '2019-04-26', 'count': '16476'},
 {'name': 'events-2019-04-27.jsonl', 'date': '2019-04-27', 'count': '9517'},
 {'name': 'events-2019-04-28.jsonl', 'date': '2019-04-28', 'count': '9435'},
 {'name': 'events-2019-04-29.jsonl', 'date': '2019-04-29', 'count': '15896'},
 {'name': 'events-2019-04-30.jsonl', 'date': '2019-04-30', 'count': '16116'},
 {'name': 'events-2019-05-01.jsonl', 'date': '2019-05-01', 'count': '11664'},
 {'name': 'events-2019-05-02.jsonl', 'date': '2019-05-02', 'count': '15713'},
 {'name': 'events-2019-05-03.jsonl', 'date': '2019-05-03', 'count': '14162'},
 {'name': 'events-2019-05-04.jsonl', 'date': '2019-05-04', 'count': '8356'},
 {'name': 'events-2019-05-05.jsonl', 'date': '2019-05-05', 'count': '8610'},
 {'name': 'events-2019-05-06.jsonl', 'date': '2019-05-06', 'count': '15230'},
 {'name': 'events-2019-05-07.jsonl', 'date': '2019-05-07', 'count': '16286'},
 {'name': 'events-2019-05-08.jsonl', 'date': '2019-05-08', 'count': '17393'},
 {'name': 'events-2019-05-09.jsonl', 'date': '2019-05-09', 'count': '16657'},
 {'name': 'events-2019-05-10.jsonl', 'date': '2019-05-10', 'count': '13726'},
 {'name': 'events-2019-05-11.jsonl', 'date': '2019-05-11', 'count': '8098'},
 {'name': 'events-2019-05-12.jsonl', 'date': '2019-05-12', 'count': '8217'},
 {'name': 'events-2019-05-13.jsonl', 'date': '2019-05-13', 'count': '16635'},
 {'name': 'events-2019-05-14.jsonl', 'date': '2019-05-14', 'count': '17309'},
 {'name': 'events-2019-05-15.jsonl', 'date': '2019-05-15', 'count': '15230'},
 {'name': 'events-2019-05-16.jsonl', 'date': '2019-05-16', 'count': '15208'},
 {'name': 'events-2019-05-17.jsonl', 'date': '2019-05-17', 'count': '13078'},
 {'name': 'events-2019-05-18.jsonl', 'date': '2019-05-18', 'count': '7788'},
 {'name': 'events-2019-05-19.jsonl', 'date': '2019-05-19', 'count': '7587'},
 {'name': 'events-2019-05-20.jsonl', 'date': '2019-05-20', 'count': '14891'},
 {'name': 'events-2019-05-21.jsonl', 'date': '2019-05-21', 'count': '16516'},
 {'name': 'events-2019-05-22.jsonl', 'date': '2019-05-22', 'count': '18627'},
 {'name': 'events-2019-05-23.jsonl', 'date': '2019-05-23', 'count': '16218'},
 {'name': 'events-2019-05-24.jsonl', 'date': '2019-05-24', 'count': '12376'},
 {'name': 'events-2019-05-25.jsonl', 'date': '2019-05-25', 'count': '8312'},
 {'name': 'events-2019-05-26.jsonl', 'date': '2019-05-26', 'count': '6938'},
 {'name': 'events-2019-05-27.jsonl', 'date': '2019-05-27', 'count': '13366'},
 {'name': 'events-2019-05-28.jsonl', 'date': '2019-05-28', 'count': '15430'},
 {'name': 'events-2019-05-29.jsonl', 'date': '2019-05-29', 'count': '14477'},
 {'name': 'events-2019-05-30.jsonl', 'date': '2019-05-30', 'count': '13264'},
 {'name': 'events-2019-05-31.jsonl', 'date': '2019-05-31', 'count': '11721'},
 {'name': 'events-2019-06-01.jsonl', 'date': '2019-06-01', 'count': '6994'},
 {'name': 'events-2019-06-02.jsonl', 'date': '2019-06-02', 'count': '6808'},
 {'name': 'events-2019-06-03.jsonl', 'date': '2019-06-03', 'count': '9141'},
 {'name': 'events-2019-06-04.jsonl', 'date': '2019-06-04', 'count': '14414'},
 {'name': 'events-2019-06-05.jsonl', 'date': '2019-06-05', 'count': '13852'},
 {'name': 'events-2019-06-06.jsonl', 'date': '2019-06-06', 'count': '15534'},
 {'name': 'events-2019-06-07.jsonl', 'date': '2019-06-07', 'count': '11335'},
 {'name': 'events-2019-06-08.jsonl', 'date': '2019-06-08', 'count': '6799'},
 {'name': 'events-2019-06-09.jsonl', 'date': '2019-06-09', 'count': '7062'},
 {'name': 'events-2019-06-10.jsonl', 'date': '2019-06-10', 'count': '12834'},
 {'name': 'events-2019-06-11.jsonl', 'date': '2019-06-11', 'count': '14359'},
 {'name': 'events-2019-06-12.jsonl', 'date': '2019-06-12', 'count': '14899'},
 {'name': 'events-2019-06-13.jsonl', 'date': '2019-06-13', 'count': '14388'}]
[5]:
filenames = (db.read_text('https://archive.analytics.mybinder.org/index.jsonl')
               .map(json.loads)
               .pluck('name')
               .compute())

filenames = ['https://archive.analytics.mybinder.org/' + fn for fn in filenames]
filenames[:5]
[5]:
['https://archive.analytics.mybinder.org/events-2018-11-03.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-04.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-05.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-06.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-07.jsonl']
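Note that the count field in the index is a string, not a number. The sketch below works on three index entries copied from the output above rather than on a live download; it builds the URLs the same way and converts count to int before summing. The names index, base_url, and total_launches are illustrative.

```python
# Three entries copied from the index output above.
index = [
    {'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
    {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
    {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
]

base_url = 'https://archive.analytics.mybinder.org/'

# Same construction as in the cell above: prefix each file name with the host.
urls = [base_url + entry['name'] for entry in index]

# 'count' is stored as text, so convert before doing arithmetic.
total_launches = sum(int(entry['count']) for entry in index)

print(urls[0])
print(total_launches)
```

Summing the strings directly would raise a TypeError, so the int conversion is needed for any numeric use of the index.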

Create Bag of all events

We now create a Dask Bag around that list of URLs, and then call the json.loads function on every line to turn those lines of JSON-encoded text into Python dictionaries that can be more easily manipulated.

[6]:
events = db.read_text(filenames).map(json.loads)
events.take(2)
[6]:
({'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'Qiskit/qiskit-tutorial/master',
  'status': 'success'},
 {'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'ipython/ipython-in-depth/master',
  'status': 'success'})
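Mapping json.loads directly over the lines assumes every line is valid JSON. Nothing in this dataset suggests otherwise, but if a file ever contained a malformed line, a tolerant parser could be used instead. The sketch below shows the idea on a plain list; parse_line is a hypothetical helper, not part of Dask or of this notebook, and the second input line is invented to trigger the failure path.

```python
import json

def parse_line(line):
    """Return the decoded record, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

lines = [
    '{"provider": "GitHub", "spec": "dask/dask-examples/master"}\n',
    'this line is not JSON\n',  # invented example of a corrupt line
]

# Parse every line, then drop the ones that failed to decode.
records = [rec for rec in map(parse_line, lines) if rec is not None]

print(records)
```

With a Dask Bag the same pattern would presumably be written as a map with parse_line followed by a filter that drops the None values.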

Convert to Dask Dataframe

Finally, we can convert our bag of Python dictionaries into a Dask Dataframe, and follow up with more Pandas-like computations.

We’ll count how often each binder (the spec field) was launched and list the 20 most popular ones, using Pandas syntax.

[8]:
df = events.to_dataframe()
df.head()
[8]:
   provider  schema                        spec                               status   timestamp                  version
0  GitHub    binderhub.jupyter.org/launch  Qiskit/qiskit-tutorial/master      success  2018-11-03T00:00:00+00:00  1
1  GitHub    binderhub.jupyter.org/launch  ipython/ipython-in-depth/master    success  2018-11-03T00:00:00+00:00  1
2  GitHub    binderhub.jupyter.org/launch  QISKit/qiskit-tutorial/master      success  2018-11-03T00:00:00+00:00  1
3  GitHub    binderhub.jupyter.org/launch  QISKit/qiskit-tutorial/master      success  2018-11-03T00:01:00+00:00  1
4  GitHub    binderhub.jupyter.org/launch  jupyterlab/jupyterlab-demo/master  success  2018-11-03T00:01:00+00:00  1
[9]:
df.spec.value_counts().nlargest(20).to_frame().compute()
[9]:
spec
ipython/ipython-in-depth/master 1296146
jupyterlab/jupyterlab-demo/master 297699
ines/spacy-io-binder/live 115500
DS-100/textbook/master 101919
bokeh/bokeh-notebooks/master 59714
binder-examples/r/master 46326
rationalmatter/juno-demo-notebooks/master 38805
QuantStack/xeus-cling/stable 33280
QISKit/qiskit-tutorial/master 19175
RasaHQ/rasa_core/master 19020
numba/numba-examples/master 18881
binder-examples/julia-python/master 18538
binder-examples/requirements/master 16252
ines/spacy-course/binder 14792
dask/dask-examples/master 10459
data-8/textbook/gh-pages 9938
nteract/examples/master 9211
wshuyi/demo-spacy-text-processing/master 9085
bethgelab/bwki-notebooks/master 8613
stencila/examples/elife-30274-binder 8085
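To make clear what the value_counts call above computes, here is a plain-Python sketch of the same frequency count using collections.Counter on the five spec values from the df.head() output. It is only an illustration on a tiny sample, not how Dask performs the computation.

```python
from collections import Counter

# The 'spec' column of the five rows shown by df.head().
specs = [
    'Qiskit/qiskit-tutorial/master',
    'ipython/ipython-in-depth/master',
    'QISKit/qiskit-tutorial/master',
    'QISKit/qiskit-tutorial/master',
    'jupyterlab/jupyterlab-demo/master',
]

counts = Counter(specs)

# The single most frequent spec in this sample, with its count.
top = counts.most_common(1)

print(top)
```

Specs are compared as exact strings, so Qiskit/qiskit-tutorial/master and QISKit/qiskit-tutorial/master are counted as two different binders, both here and in the Dask result.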

Persist in memory

This dataset fits comfortably in memory. Let's avoid downloading the data every time we run an operation and instead keep it local in memory.

[10]:
df = df.persist()

Honestly, at this point it makes more sense to just switch to Pandas, but this is a Dask example, so we’ll continue with Dask dataframe.

Investigate providers other than GitHub

Most binders are specified as git repositories on GitHub, but not all. Let's investigate the other providers.

[11]:
import urllib.parse  # import the submodule explicitly; `import urllib` alone is not guaranteed to load it
[12]:
df.provider.value_counts().compute()
[12]:
GitHub    2740022
GitLab      10386
Gist         4569
Git          2798
Zenodo         19
Name: provider, dtype: int64
[13]:
(df[df.provider == 'GitLab']
 .spec
 .map(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[13]:
spec
rruizz/inforfis/R 1369
rruizz/inforfis/master 1042
lfortran/web/lfortran-binder/master 850
ul-fri/ovs/python/master 751
utt-connected-innovation/ia-course-2019/master 690
bhugueney/cxx-init-for-python-dev/master 459
DGothrek/ipyaggrid/binder-demo 427
dbernhard/PythonM1/master 398
dbernhard/pythonm1s2/master 270
shadaba/lss-handson/master 261
dbernhard/ProgHumaNumTAL/master 252
albert.van.breemen/masterclassdeeplearning/master 247
dbernhard/JavaM2/master 236
amarandon/presentation-jupyter/master 142
clemej/data601-clemens-fall18/master 132
wichit2s/programmingfundamentals/master 115
open-scientist/formation-data-reproductibilite/master 96
kkmann/shortcourse-data-science-toolbox/master 84
kitsunix/pyHIBP/pyHIBP-binder/master 72
andrey.kovalev/imagination/master 70
biehl/jscatter/master 68
slloyd/python-introduction/master 65
rruizz/inforfis/autin 62
g2lab/fossgis2019-geopython-vector/master 59
sgmarkets/sgmarkets-api-notebooks/master 57
PersonalDataIO/toronto-letter/master 55
wichit2s/pythonintro/master 54
brivadeneira/recursos-didacticos-telecomunicaciones/master 53
snowhitiger/learn_deep_learning/master 52
thoma.rey/FV_HipoDiff/master 48
... ...
littleredridingfox/decision-theory-coursework/master 1
hassakura/integracao_low_eans/master 1
hassakura/teste_exportcsv/master 1
atomap/atomap_demos/master 1
hixi/colloqium-presentation/master 1
rogermarkussen/test/master 1
g_money/folio_track/master 1
ruivieira/matplotrust/b64 1
anarcat/terms-benchmarks/master 1
jdiep/master-thesis/master 1
yerbby/scientific-python-lectures/master 1
jibe-b/crowdsource-science-improvement/dev 1
elfua/ipython-notebooks/master 1
passakornC/test/master 1
jibe-b/sabre/master 1
alandrex/dhbw-2019-1/master 1
jibe-b/sens-de-la-vie-workflow-brouillon-tests-tuto/dev 1
rdubwiley/mi_campaign_finance/master 1
rahulvigneswaran/staticdmd/master 1
katylava/minimal-binderhub/f7ba3435f01c9ad9609a7c76f4d05c5003d23037 1
coobas/dask-pipelines/master 1
uysalnet/comparex/master 1
agrumery/aGrUM/0.13.5 1
knighteq/jupyter/master 1
adophobr/lightwavesystens/master 1
valentin.queloz/jupyther/master 1
acatinon/lgo-node/master 1
butzked/equivalence/dev 1
krsna1/qml-mooc/PhaseEstimationUsingCirq 1
ozborniasty1/junotry/master 1

270 rows × 1 columns
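The unquote step above is there because the spec strings can contain percent-encoded characters. The sketch below shows what urllib.parse.unquote does to one such string. The encoded input is an assumption about how a raw GitLab spec might look (with the slash inside the namespace encoded as %2F); only the decoded form appears in the output above.

```python
from urllib.parse import unquote

# Assumed raw form of a GitLab spec; the decoded form is taken from the output above.
encoded_spec = 'rruizz%2Finforfis/master'

# unquote replaces %XX escapes with the characters they stand for.
decoded_spec = unquote(encoded_spec)

print(decoded_spec)
```

Strings without any percent escapes pass through unchanged, so applying unquote to every spec is safe.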

[14]:
(df[df.provider == 'Git']
 .spec
 .apply(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[14]:
spec
https://gitlab.kwant-project.org/solidstate/lectures.git/e6c970126f4e819e4a3eb717a86ba9fb523d20bf 123
https://bitbucket.org/gaur/1820/a4027afe0aa592d49f3989b2e9c8136c36322a77 75
https://collaborating.tuhh.de/cip3725/ib_base.git/0a1f4f66a1a3c29ff347b2abc79bb292b0be17ca 68
https://gricad-gitlab.univ-grenoble-alpes.fr/nonsmooth/siconos-tutorial.git/b08a0514b22b3927b58bddce3c4018f27ac0fc7d 65
https://gitlab.rc.uab.edu/mcbios19_single_cell/single_cell_rnaseq_hands-on_1.git/a20f707fc0b67f6eb4f9bf85a5daacc52c125df6 65
https://bitbucket.org/saibotk/tp2_python/098f35f50066d96f26df6131a6cff8080e384708 53
https://collaborating.tuhh.de/xldrkp/jupyter-notebook-beispiel/e895284be3c0a128ed97dc007fcd7497ef19d2bc 44
https://gitlab.mech.kuleuven.be/rob-expressiongraphs/docker/etasl-binder.git/127d6c9f33a938a98607505ef30237b5223000e4 41
https://gricad-gitlab.univ-grenoble-alpes.fr/chatelaf/conference-ia/d98a199f66d603b0b1e7c25fbe1341d29a40cd39 41
https://collaborating.tuhh.de/xldrkp/jupyter-test/9b9e2fb383107ff9c4bfdb22c3f97f5c62ed1f6d 39
https://bitbucket.org/cognify-salzburg/idsc-2019/4dd0deadc6e6f5ac9b3ba46991aa3e26bd1834ff 38
https://gitlab.univ-nantes.fr/mouchere-h/ImageDeep_ED_Project/ac2523952640533c9e549d14f2eed1534b105776 37
https://gitlab.oceantrack.org/otndc/fact-workshop/a174f5bc60cf9f0c5b86851204c7c85fcfb98131 35
https://risk-engineering.org/git/notebooks.git/527b4fdaa5fe43294ed5de196b96262b1c60522b 34
https://git.rcc.uchicago.edu/jhskone/multiproc_py.git/38f9bb6ce3602b73a8ddd1dbcad3f5f9a8d21f6a 34
https://git.unilim.fr/grossp01/notebooks.git/6d65058071a0d5a62ab032800932c2a3740a4196 34
https://risk-engineering.org/git/notebooks.git/a18f7f0e6a707ccfa4478d3fc73d53ada11605fb 34
https://risk-engineering.org/git/notebooks.git/da7de9f3df67ec45713cd91e575418aae9466621 30
https://gitlab.ethz.ch/darioce/sysbio_ss2019.git/4c383c1e118678cf26ad8b6d9c0be7f41b15ad69 30
https://bitbucket.org/qlouder/lumc-ml-caffe/acc2701f43f267361594e72980ec59602dab7fd7 27
https://framagit.org/mfauvel/omp_machine_learning/77d246991186d758147fc280eeef0eed563d525f 26
https://risk-engineering.org/git/notebooks.git/2d83cc0c7a2b63676185e0b36978a765a42deeac 25
https://api.jovian.ai/git/e5cfe043873f4f3c9287507016747ae5_5.git/fc3bc662bc8b8dd00a66c336081ee39e14df5f11 23
https://bitbucket.org/ml_tsu/ml.git/1bb9f4b9b349e1771000219055e597287201bfb2 22
https://api.jovian.ai/git/5bc23520933b4cc187cfe18e5dd7e2ed_7.git/2558b3811d8a9759c89ef0afce1a7cc28f58568d 21
https://risk-engineering.org/git/notebooks.git/eab9bdb1a9bbf59833b26d39b02c5a5e743807c4 20
https://github.com/camilo912/research_rating/6fe3d52d8dce4cbc548cfd2a19299beafa0eeff2 20
https://api.jovian.ai/git/e5cfe043873f4f3c9287507016747ae5_6.git/8144d421554e1119f1d20cd17d7830c2fdd4ff64 19
https://git.unilim.fr/grossp01/test-mybinder.git/ecd98d510a0bd88dd185a8723c15cc40ae28fd62 19
https://risk-engineering.org/git/notebooks.git/cc040c44c37a2cd67b0f5c2b65b029b93d178b05 19
... ...
https://gitlab.inria.fr/vdrevell/istic-robm/4b79f44a45ffbd2a52c12585d119901edd7c9b29 1
https://gitlab.inria.fr/vdrevell/istic-robm/b45ab9de12dbf06c92d3dd5dd7b884e75dad2c35 1
https://gitlab.inria.fr/vdrevell/istic-robm/e345fd81cb6f70ba1fa4646ee80c0a619dd4feac 1
https://gitlab.lrde.epita.fr/adl/binder-test/122e1b3741a2f119e878d133d15cecb89c85f96a 1
https://gitlab.lrde.epita.fr/adl/binder-test/6416e50134679feea812605a74c497d7c31a9c37 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/sudocmeta2csv/fa20c1173d669a0bc167ace2598df4de525964bb 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/sudocmeta2csv/e0ec02c4202ab76feaea8408b8780b30ce48e5ae 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/sudocmeta2csv/231832c895938f637d1a7a60981ac1f2c00cb7f2 1
https://gitlab.com/jhskone/multiproc_py.git/54224bf1159ac6d5c3b5b9268deea5c1890581b8 1
https://gitlab.climent-pommeret.red/Chopopope/inf3212019-jupyter/7a62f0124dda06a83efe2c3f1d5bb24eeba24c6a 1
https://gitlab.climent-pommeret.red/Chopopope/inf3212019-jupyter/e62b5bc9fdd81f21f887d3ffc7def168577ffe2a 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/59d66082573d5672237e57bd9d20b4f7a8965cb7 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/82c8453beb6ddc03b9ba0c29f576e98426aa1f01 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/8e3a4673e8b8c9d6a359d6231886ce5d36872b95 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/9a736c8fa13abed706a4a4c40ece0e31494f732f 1
https://gitlab.climent-pommeret.red/Chopopope/inf3712019-compilation/d0e90aa8e03727d519f4f5bac334ebe0caa5bfbd 1
https://gitlab.com/TragaMonedas/sec-information-server/abafd4e3f4956523bc835112552e334cd23ea67a 1
https://gitlab.com/airbornemint/noodling.git/e2c54e430dcd7fb3b38074d204895cd9cf75da8d 1
https://gitlab.com/ricarthor/testing_binder.git/ac37cd61fe29d17301a6380cea61b95de60cdea9 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/dicodiachro/e2560550784146ec7de155d87b2e9217b522b7ca 1
https://gitlab.com/uchicago-ime/python-programming-primer/b9c43f48082359c89a8ba2fb2bfc0d2b55c944c8 1
https://gitlab.dsp.sandvik.com/erik.sundell/verify-backup-functionality.git/266df2023da76d974b7dee9715eb002a33f302bd 1
https://gitlab.esrf.fr/tvincent/PyNX-binder.git/3b54bf76428e944bc7b712277a69364a76abab08 1
https://gitlab.ethz.ch/darioce/sysbio_ss2019.git/ed09894e7a479c526dc086dac98d02518a41bf77 1
https://gitlab.ethz.ch/darioce/sysbio_ss2019.git/fb3db94a37cef26612d6e2f70edcbd5ffbaf8660 1
https://gitlab.ethz.ch/darioce/sysbio_ss2019/838eccfa1649695b30a4aa2a2697b66f10a2cc7d 1
https://gitlab.gwdg.de/publications/NanoMAX18_JSR.git/48cbcc2390cf0fb8e85e8c8890ce693c6681d9fe 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/dicodiachro/ad5a51c4cf58494824c544fa1754c5c571453ab0 1
https://gitlab.huma-num.fr/mshs-poitiers/forellis/dicodiachro/b086f757800bff0c931d345b44d50a78e2fe95d6 1
http://code.datamode.com/datamode/binder.git/85bd134f3713f8a8cb3e22612929987b49b675a5 1

604 rows × 1 columns
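In the output above, each Git spec appears to be a repository URL followed by a commit hash, so the same repository shows up on several rows. A rough sketch of how those rows could be combined per repository follows; it assumes the last path segment is always the commit reference, which the data suggests but the source does not state. The three entries and their counts are copied from the output above.

```python
from collections import Counter

# spec -> launch count, copied from the output above.
spec_counts = {
    'https://risk-engineering.org/git/notebooks.git/527b4fdaa5fe43294ed5de196b96262b1c60522b': 34,
    'https://risk-engineering.org/git/notebooks.git/a18f7f0e6a707ccfa4478d3fc73d53ada11605fb': 34,
    'https://gitlab.kwant-project.org/solidstate/lectures.git/e6c970126f4e819e4a3eb717a86ba9fb523d20bf': 123,
}

per_repo = Counter()
for spec, n in spec_counts.items():
    # Split off the final path segment, assumed to be the commit hash.
    repo_url, ref = spec.rsplit('/', 1)
    per_repo[repo_url] += n

print(per_repo.most_common())
```

Grouping this way gives a per-repository view that is easier to compare with the GitHub and GitLab results, where specs usually end in a branch name rather than a commit.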
