Analyze web-hosted JSON data

This notebook reads and processes JSON-encoded data hosted on the web using a combination of Dask Bag and Dask Dataframe.

This data comes from mybinder.org, a web service that runs Jupyter notebooks live on the web (you may be running this notebook there now). The mybinder.org service publishes a record every time someone launches a live notebook like this one, and stores those records in publicly accessible JSON files, one file per day.

Introduction to the dataset

This data is stored as JSON-encoded text files on the public web. Here are some example lines.

[1]:
import dask.bag as db
db.read_text('https://archive.analytics.mybinder.org/events-2018-11-03.jsonl').take(3)
[1]:
('{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
 '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "QISKit/qiskit-tutorial/master", "status": "success"}\n')

We see that it includes one line for every time someone started a live notebook on the site. It includes the time that the notebook was started, as well as the repository from which it was served.

In this notebook we’ll look at many such files, parse them from JSON to Python dictionaries, and then from there to Pandas dataframes. We’ll then do some simple analyses on this data.
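Each line in these files is a self-contained JSON object, so parsing is a per-line operation. As a minimal sketch in plain Python (no Dask involved), the two lines below are copied from the sample output above and turned into dictionaries with json.loads; the variable names are chosen here for illustration only.

```python
import json

# Two raw lines copied from the sample output above (trailing newline included).
raw_lines = [
    '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "Qiskit/qiskit-tutorial/master", "status": "success"}\n',
    '{"timestamp": "2018-11-03T00:00:00+00:00", "schema": "binderhub.jupyter.org/launch", "version": 1, "provider": "GitHub", "spec": "ipython/ipython-in-depth/master", "status": "success"}\n',
]

# json.loads tolerates surrounding whitespace, so the trailing newline is harmless.
records = [json.loads(line) for line in raw_lines]

for record in records:
    print(record["provider"], record["spec"])
```

This is the same per-line transformation that the notebook later applies at scale with a Dask Bag.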

Start Dask Client for Dashboard

Starting the Dask Client is optional. It also starts the dashboard, which is useful for gaining insight into the computation.

[2]:
from dask.distributed import Client, progress
client = Client(threads_per_worker=1,
                n_workers=4,
                memory_limit='2GB')
client
[2]:

Client

Cluster

  • Workers: 4
  • Cores: 4
  • Memory: 8.00 GB

Get a list of files on the web

The mybinder.org team maintains an index file that points to all other available JSON files of data. Let’s convert this to a list of URLs that we’ll read in the next section.

[3]:
import dask.bag as db
import json
[4]:
db.read_text('https://archive.analytics.mybinder.org/index.jsonl').map(json.loads).compute()
[4]:
[{'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
 {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
 {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
 {'name': 'events-2018-11-06.jsonl', 'date': '2018-11-06', 'count': '13920'},
 {'name': 'events-2018-11-07.jsonl', 'date': '2018-11-07', 'count': '12766'},
 {'name': 'events-2018-11-08.jsonl', 'date': '2018-11-08', 'count': '14105'},
 {'name': 'events-2018-11-09.jsonl', 'date': '2018-11-09', 'count': '11843'},
 {'name': 'events-2018-11-10.jsonl', 'date': '2018-11-10', 'count': '7047'},
 {'name': 'events-2018-11-11.jsonl', 'date': '2018-11-11', 'count': '6940'},
 {'name': 'events-2018-11-12.jsonl', 'date': '2018-11-12', 'count': '16322'},
 {'name': 'events-2018-11-13.jsonl', 'date': '2018-11-13', 'count': '16530'},
 {'name': 'events-2018-11-14.jsonl', 'date': '2018-11-14', 'count': '14099'},
 {'name': 'events-2018-11-15.jsonl', 'date': '2018-11-15', 'count': '13182'},
 {'name': 'events-2018-11-16.jsonl', 'date': '2018-11-16', 'count': '12863'},
 {'name': 'events-2018-11-17.jsonl', 'date': '2018-11-17', 'count': '6490'},
 {'name': 'events-2018-11-18.jsonl', 'date': '2018-11-18', 'count': '7310'},
 {'name': 'events-2018-11-19.jsonl', 'date': '2018-11-19', 'count': '13348'},
 {'name': 'events-2018-11-20.jsonl', 'date': '2018-11-20', 'count': '13982'},
 {'name': 'events-2018-11-21.jsonl', 'date': '2018-11-21', 'count': '13165'},
 {'name': 'events-2018-11-22.jsonl', 'date': '2018-11-22', 'count': '12217'},
 {'name': 'events-2018-11-23.jsonl', 'date': '2018-11-23', 'count': '9070'},
 {'name': 'events-2018-11-24.jsonl', 'date': '2018-11-24', 'count': '6798'},
 {'name': 'events-2018-11-25.jsonl', 'date': '2018-11-25', 'count': '6796'},
 {'name': 'events-2018-11-26.jsonl', 'date': '2018-11-26', 'count': '13617'},
 {'name': 'events-2018-11-27.jsonl', 'date': '2018-11-27', 'count': '14964'},
 {'name': 'events-2018-11-28.jsonl', 'date': '2018-11-28', 'count': '14434'},
 {'name': 'events-2018-11-29.jsonl', 'date': '2018-11-29', 'count': '13845'},
 {'name': 'events-2018-11-30.jsonl', 'date': '2018-11-30', 'count': '12109'},
 {'name': 'events-2018-12-01.jsonl', 'date': '2018-12-01', 'count': '6785'},
 {'name': 'events-2018-12-02.jsonl', 'date': '2018-12-02', 'count': '7119'},
 {'name': 'events-2018-12-03.jsonl', 'date': '2018-12-03', 'count': '13946'},
 {'name': 'events-2018-12-04.jsonl', 'date': '2018-12-04', 'count': '13765'},
 {'name': 'events-2018-12-05.jsonl', 'date': '2018-12-05', 'count': '13106'},
 {'name': 'events-2018-12-06.jsonl', 'date': '2018-12-06', 'count': '12249'},
 {'name': 'events-2018-12-07.jsonl', 'date': '2018-12-07', 'count': '10687'},
 {'name': 'events-2018-12-08.jsonl', 'date': '2018-12-08', 'count': '6269'},
 {'name': 'events-2018-12-09.jsonl', 'date': '2018-12-09', 'count': '6639'},
 {'name': 'events-2018-12-10.jsonl', 'date': '2018-12-10', 'count': '12782'},
 {'name': 'events-2018-12-11.jsonl', 'date': '2018-12-11', 'count': '13442'},
 {'name': 'events-2018-12-12.jsonl', 'date': '2018-12-12', 'count': '13069'},
 {'name': 'events-2018-12-13.jsonl', 'date': '2018-12-13', 'count': '15279'},
 {'name': 'events-2018-12-14.jsonl', 'date': '2018-12-14', 'count': '9941'},
 {'name': 'events-2018-12-15.jsonl', 'date': '2018-12-15', 'count': '5358'},
 {'name': 'events-2018-12-16.jsonl', 'date': '2018-12-16', 'count': '6441'},
 {'name': 'events-2018-12-17.jsonl', 'date': '2018-12-17', 'count': '11332'},
 {'name': 'events-2018-12-18.jsonl', 'date': '2018-12-18', 'count': '11971'},
 {'name': 'events-2018-12-19.jsonl', 'date': '2018-12-19', 'count': '10818'},
 {'name': 'events-2018-12-20.jsonl', 'date': '2018-12-20', 'count': '9408'},
 {'name': 'events-2018-12-21.jsonl', 'date': '2018-12-21', 'count': '7741'},
 {'name': 'events-2018-12-22.jsonl', 'date': '2018-12-22', 'count': '4818'},
 {'name': 'events-2018-12-23.jsonl', 'date': '2018-12-23', 'count': '4870'},
 {'name': 'events-2018-12-24.jsonl', 'date': '2018-12-24', 'count': '5974'},
 {'name': 'events-2018-12-25.jsonl', 'date': '2018-12-25', 'count': '4737'},
 {'name': 'events-2018-12-26.jsonl', 'date': '2018-12-26', 'count': '6725'},
 {'name': 'events-2018-12-27.jsonl', 'date': '2018-12-27', 'count': '7998'},
 {'name': 'events-2018-12-28.jsonl', 'date': '2018-12-28', 'count': '8155'},
 {'name': 'events-2018-12-29.jsonl', 'date': '2018-12-29', 'count': '5108'},
 {'name': 'events-2018-12-30.jsonl', 'date': '2018-12-30', 'count': '4428'},
 {'name': 'events-2018-12-31.jsonl', 'date': '2018-12-31', 'count': '4561'},
 {'name': 'events-2019-01-01.jsonl', 'date': '2019-01-01', 'count': '4194'},
 {'name': 'events-2019-01-02.jsonl', 'date': '2019-01-02', 'count': '8559'},
 {'name': 'events-2019-01-03.jsonl', 'date': '2019-01-03', 'count': '9687'},
 {'name': 'events-2019-01-04.jsonl', 'date': '2019-01-04', 'count': '10048'},
 {'name': 'events-2019-01-05.jsonl', 'date': '2019-01-05', 'count': '6012'},
 {'name': 'events-2019-01-06.jsonl', 'date': '2019-01-06', 'count': '6019'},
 {'name': 'events-2019-01-07.jsonl', 'date': '2019-01-07', 'count': '11903'},
 {'name': 'events-2019-01-08.jsonl', 'date': '2019-01-08', 'count': '12777'},
 {'name': 'events-2019-01-09.jsonl', 'date': '2019-01-09', 'count': '13294'},
 {'name': 'events-2019-01-10.jsonl', 'date': '2019-01-10', 'count': '13112'},
 {'name': 'events-2019-01-11.jsonl', 'date': '2019-01-11', 'count': '10327'},
 {'name': 'events-2019-01-12.jsonl', 'date': '2019-01-12', 'count': '6434'},
 {'name': 'events-2019-01-13.jsonl', 'date': '2019-01-13', 'count': '7004'},
 {'name': 'events-2019-01-14.jsonl', 'date': '2019-01-14', 'count': '12898'},
 {'name': 'events-2019-01-15.jsonl', 'date': '2019-01-15', 'count': '12363'},
 {'name': 'events-2019-01-16.jsonl', 'date': '2019-01-16', 'count': '13444'},
 {'name': 'events-2019-01-17.jsonl', 'date': '2019-01-17', 'count': '14452'},
 {'name': 'events-2019-01-18.jsonl', 'date': '2019-01-18', 'count': '12056'},
 {'name': 'events-2019-01-19.jsonl', 'date': '2019-01-19', 'count': '7590'},
 {'name': 'events-2019-01-20.jsonl', 'date': '2019-01-20', 'count': '6740'},
 {'name': 'events-2019-01-21.jsonl', 'date': '2019-01-21', 'count': '12507'},
 {'name': 'events-2019-01-22.jsonl', 'date': '2019-01-22', 'count': '15355'},
 {'name': 'events-2019-01-23.jsonl', 'date': '2019-01-23', 'count': '16319'},
 {'name': 'events-2019-01-24.jsonl', 'date': '2019-01-24', 'count': '16732'},
 {'name': 'events-2019-01-25.jsonl', 'date': '2019-01-25', 'count': '13642'},
 {'name': 'events-2019-01-26.jsonl', 'date': '2019-01-26', 'count': '6976'},
 {'name': 'events-2019-01-27.jsonl', 'date': '2019-01-27', 'count': '7570'},
 {'name': 'events-2019-01-28.jsonl', 'date': '2019-01-28', 'count': '15906'},
 {'name': 'events-2019-01-29.jsonl', 'date': '2019-01-29', 'count': '15534'},
 {'name': 'events-2019-01-30.jsonl', 'date': '2019-01-30', 'count': '15183'},
 {'name': 'events-2019-01-31.jsonl', 'date': '2019-01-31', 'count': '14421'},
 {'name': 'events-2019-02-01.jsonl', 'date': '2019-02-01', 'count': '12352'},
 {'name': 'events-2019-02-02.jsonl', 'date': '2019-02-02', 'count': '7113'},
 {'name': 'events-2019-02-03.jsonl', 'date': '2019-02-03', 'count': '7331'},
 {'name': 'events-2019-02-04.jsonl', 'date': '2019-02-04', 'count': '14493'},
 {'name': 'events-2019-02-05.jsonl', 'date': '2019-02-05', 'count': '14053'},
 {'name': 'events-2019-02-06.jsonl', 'date': '2019-02-06', 'count': '15600'},
 {'name': 'events-2019-02-07.jsonl', 'date': '2019-02-07', 'count': '17158'},
 {'name': 'events-2019-02-08.jsonl', 'date': '2019-02-08', 'count': '14107'},
 {'name': 'events-2019-02-09.jsonl', 'date': '2019-02-09', 'count': '7209'},
 {'name': 'events-2019-02-10.jsonl', 'date': '2019-02-10', 'count': '7422'},
 {'name': 'events-2019-02-11.jsonl', 'date': '2019-02-11', 'count': '17085'},
 {'name': 'events-2019-02-12.jsonl', 'date': '2019-02-12', 'count': '17286'},
 {'name': 'events-2019-02-13.jsonl', 'date': '2019-02-13', 'count': '17181'},
 {'name': 'events-2019-02-14.jsonl', 'date': '2019-02-14', 'count': '19298'},
 {'name': 'events-2019-02-15.jsonl', 'date': '2019-02-15', 'count': '13387'},
 {'name': 'events-2019-02-16.jsonl', 'date': '2019-02-16', 'count': '8182'},
 {'name': 'events-2019-02-17.jsonl', 'date': '2019-02-17', 'count': '8142'},
 {'name': 'events-2019-02-18.jsonl', 'date': '2019-02-18', 'count': '16364'},
 {'name': 'events-2019-02-19.jsonl', 'date': '2019-02-19', 'count': '18090'},
 {'name': 'events-2019-02-20.jsonl', 'date': '2019-02-20', 'count': '17441'},
 {'name': 'events-2019-02-21.jsonl', 'date': '2019-02-21', 'count': '18844'},
 {'name': 'events-2019-02-22.jsonl', 'date': '2019-02-22', 'count': '15400'},
 {'name': 'events-2019-02-23.jsonl', 'date': '2019-02-23', 'count': '8879'},
 {'name': 'events-2019-02-24.jsonl', 'date': '2019-02-24', 'count': '9342'},
 {'name': 'events-2019-02-25.jsonl', 'date': '2019-02-25', 'count': '16999'},
 {'name': 'events-2019-02-26.jsonl', 'date': '2019-02-26', 'count': '18514'},
 {'name': 'events-2019-02-27.jsonl', 'date': '2019-02-27', 'count': '15799'},
 {'name': 'events-2019-02-28.jsonl', 'date': '2019-02-28', 'count': '18702'},
 {'name': 'events-2019-03-01.jsonl', 'date': '2019-03-01', 'count': '14222'},
 {'name': 'events-2019-03-02.jsonl', 'date': '2019-03-02', 'count': '8990'},
 {'name': 'events-2019-03-03.jsonl', 'date': '2019-03-03', 'count': '8503'},
 {'name': 'events-2019-03-04.jsonl', 'date': '2019-03-04', 'count': '17427'},
 {'name': 'events-2019-03-05.jsonl', 'date': '2019-03-05', 'count': '17732'},
 {'name': 'events-2019-03-06.jsonl', 'date': '2019-03-06', 'count': '17532'},
 {'name': 'events-2019-03-07.jsonl', 'date': '2019-03-07', 'count': '17622'},
 {'name': 'events-2019-03-08.jsonl', 'date': '2019-03-08', 'count': '13110'},
 {'name': 'events-2019-03-09.jsonl', 'date': '2019-03-09', 'count': '9132'},
 {'name': 'events-2019-03-10.jsonl', 'date': '2019-03-10', 'count': '8989'},
 {'name': 'events-2019-03-11.jsonl', 'date': '2019-03-11', 'count': '16334'},
 {'name': 'events-2019-03-12.jsonl', 'date': '2019-03-12', 'count': '18637'},
 {'name': 'events-2019-03-13.jsonl', 'date': '2019-03-13', 'count': '18355'},
 {'name': 'events-2019-03-14.jsonl', 'date': '2019-03-14', 'count': '18657'},
 {'name': 'events-2019-03-15.jsonl', 'date': '2019-03-15', 'count': '15206'},
 {'name': 'events-2019-03-16.jsonl', 'date': '2019-03-16', 'count': '8606'},
 {'name': 'events-2019-03-17.jsonl', 'date': '2019-03-17', 'count': '8110'},
 {'name': 'events-2019-03-18.jsonl', 'date': '2019-03-18', 'count': '15846'},
 {'name': 'events-2019-03-19.jsonl', 'date': '2019-03-19', 'count': '17909'},
 {'name': 'events-2019-03-20.jsonl', 'date': '2019-03-20', 'count': '15610'},
 {'name': 'events-2019-03-21.jsonl', 'date': '2019-03-21', 'count': '14671'},
 {'name': 'events-2019-03-22.jsonl', 'date': '2019-03-22', 'count': '12962'},
 {'name': 'events-2019-03-23.jsonl', 'date': '2019-03-23', 'count': '7941'},
 {'name': 'events-2019-03-24.jsonl', 'date': '2019-03-24', 'count': '7248'},
 {'name': 'events-2019-03-25.jsonl', 'date': '2019-03-25', 'count': '16775'},
 {'name': 'events-2019-03-26.jsonl', 'date': '2019-03-26', 'count': '18064'},
 {'name': 'events-2019-03-27.jsonl', 'date': '2019-03-27', 'count': '17773'},
 {'name': 'events-2019-03-28.jsonl', 'date': '2019-03-28', 'count': '17945'},
 {'name': 'events-2019-03-29.jsonl', 'date': '2019-03-29', 'count': '13126'},
 {'name': 'events-2019-03-30.jsonl', 'date': '2019-03-30', 'count': '7315'},
 {'name': 'events-2019-03-31.jsonl', 'date': '2019-03-31', 'count': '7750'},
 {'name': 'events-2019-04-01.jsonl', 'date': '2019-04-01', 'count': '16049'},
 {'name': 'events-2019-04-02.jsonl', 'date': '2019-04-02', 'count': '18909'},
 {'name': 'events-2019-04-03.jsonl', 'date': '2019-04-03', 'count': '17629'},
 {'name': 'events-2019-04-04.jsonl', 'date': '2019-04-04', 'count': '17635'},
 {'name': 'events-2019-04-05.jsonl', 'date': '2019-04-05', 'count': '14057'},
 {'name': 'events-2019-04-06.jsonl', 'date': '2019-04-06', 'count': '8297'},
 {'name': 'events-2019-04-07.jsonl', 'date': '2019-04-07', 'count': '8726'},
 {'name': 'events-2019-04-08.jsonl', 'date': '2019-04-08', 'count': '18217'},
 {'name': 'events-2019-04-09.jsonl', 'date': '2019-04-09', 'count': '17833'},
 {'name': 'events-2019-04-10.jsonl', 'date': '2019-04-10', 'count': '19018'},
 {'name': 'events-2019-04-11.jsonl', 'date': '2019-04-11', 'count': '19173'},
 {'name': 'events-2019-04-12.jsonl', 'date': '2019-04-12', 'count': '15502'},
 {'name': 'events-2019-04-13.jsonl', 'date': '2019-04-13', 'count': '7839'},
 {'name': 'events-2019-04-14.jsonl', 'date': '2019-04-14', 'count': '8119'},
 {'name': 'events-2019-04-15.jsonl', 'date': '2019-04-15', 'count': '14567'},
 {'name': 'events-2019-04-16.jsonl', 'date': '2019-04-16', 'count': '16254'},
 {'name': 'events-2019-04-17.jsonl', 'date': '2019-04-17', 'count': '15211'},
 {'name': 'events-2019-04-18.jsonl', 'date': '2019-04-18', 'count': '15989'},
 {'name': 'events-2019-04-19.jsonl', 'date': '2019-04-19', 'count': '11296'},
 {'name': 'events-2019-04-20.jsonl', 'date': '2019-04-20', 'count': '8527'},
 {'name': 'events-2019-04-21.jsonl', 'date': '2019-04-21', 'count': '7861'},
 {'name': 'events-2019-04-22.jsonl', 'date': '2019-04-22', 'count': '13118'},
 {'name': 'events-2019-04-23.jsonl', 'date': '2019-04-23', 'count': '16865'},
 {'name': 'events-2019-04-24.jsonl', 'date': '2019-04-24', 'count': '17125'},
 {'name': 'events-2019-04-25.jsonl', 'date': '2019-04-25', 'count': '18687'},
 {'name': 'events-2019-04-26.jsonl', 'date': '2019-04-26', 'count': '16476'},
 {'name': 'events-2019-04-27.jsonl', 'date': '2019-04-27', 'count': '9517'},
 {'name': 'events-2019-04-28.jsonl', 'date': '2019-04-28', 'count': '9435'},
 {'name': 'events-2019-04-29.jsonl', 'date': '2019-04-29', 'count': '15896'},
 {'name': 'events-2019-04-30.jsonl', 'date': '2019-04-30', 'count': '16116'},
 {'name': 'events-2019-05-01.jsonl', 'date': '2019-05-01', 'count': '11664'},
 {'name': 'events-2019-05-02.jsonl', 'date': '2019-05-02', 'count': '15713'},
 {'name': 'events-2019-05-03.jsonl', 'date': '2019-05-03', 'count': '14162'},
 {'name': 'events-2019-05-04.jsonl', 'date': '2019-05-04', 'count': '8356'},
 {'name': 'events-2019-05-05.jsonl', 'date': '2019-05-05', 'count': '8610'},
 {'name': 'events-2019-05-06.jsonl', 'date': '2019-05-06', 'count': '15230'},
 {'name': 'events-2019-05-07.jsonl', 'date': '2019-05-07', 'count': '16286'},
 {'name': 'events-2019-05-08.jsonl', 'date': '2019-05-08', 'count': '17393'},
 {'name': 'events-2019-05-09.jsonl', 'date': '2019-05-09', 'count': '16657'},
 {'name': 'events-2019-05-10.jsonl', 'date': '2019-05-10', 'count': '13726'},
 {'name': 'events-2019-05-11.jsonl', 'date': '2019-05-11', 'count': '8098'},
 {'name': 'events-2019-05-12.jsonl', 'date': '2019-05-12', 'count': '8217'},
 {'name': 'events-2019-05-13.jsonl', 'date': '2019-05-13', 'count': '16635'},
 {'name': 'events-2019-05-14.jsonl', 'date': '2019-05-14', 'count': '17309'},
 {'name': 'events-2019-05-15.jsonl', 'date': '2019-05-15', 'count': '15230'},
 {'name': 'events-2019-05-16.jsonl', 'date': '2019-05-16', 'count': '15208'},
 {'name': 'events-2019-05-17.jsonl', 'date': '2019-05-17', 'count': '13078'},
 {'name': 'events-2019-05-18.jsonl', 'date': '2019-05-18', 'count': '7788'},
 {'name': 'events-2019-05-19.jsonl', 'date': '2019-05-19', 'count': '7587'},
 {'name': 'events-2019-05-20.jsonl', 'date': '2019-05-20', 'count': '14891'},
 {'name': 'events-2019-05-21.jsonl', 'date': '2019-05-21', 'count': '16516'},
 {'name': 'events-2019-05-22.jsonl', 'date': '2019-05-22', 'count': '18627'},
 {'name': 'events-2019-05-23.jsonl', 'date': '2019-05-23', 'count': '16218'},
 {'name': 'events-2019-05-24.jsonl', 'date': '2019-05-24', 'count': '12376'},
 {'name': 'events-2019-05-25.jsonl', 'date': '2019-05-25', 'count': '8312'},
 {'name': 'events-2019-05-26.jsonl', 'date': '2019-05-26', 'count': '6938'},
 {'name': 'events-2019-05-27.jsonl', 'date': '2019-05-27', 'count': '13366'},
 {'name': 'events-2019-05-28.jsonl', 'date': '2019-05-28', 'count': '15430'},
 {'name': 'events-2019-05-29.jsonl', 'date': '2019-05-29', 'count': '14477'},
 {'name': 'events-2019-05-30.jsonl', 'date': '2019-05-30', 'count': '13264'},
 {'name': 'events-2019-05-31.jsonl', 'date': '2019-05-31', 'count': '11721'},
 {'name': 'events-2019-06-01.jsonl', 'date': '2019-06-01', 'count': '6994'},
 {'name': 'events-2019-06-02.jsonl', 'date': '2019-06-02', 'count': '6808'},
 {'name': 'events-2019-06-03.jsonl', 'date': '2019-06-03', 'count': '9141'},
 {'name': 'events-2019-06-04.jsonl', 'date': '2019-06-04', 'count': '14414'},
 {'name': 'events-2019-06-05.jsonl', 'date': '2019-06-05', 'count': '13852'},
 {'name': 'events-2019-06-06.jsonl', 'date': '2019-06-06', 'count': '15534'},
 {'name': 'events-2019-06-07.jsonl', 'date': '2019-06-07', 'count': '11335'},
 {'name': 'events-2019-06-08.jsonl', 'date': '2019-06-08', 'count': '6799'},
 {'name': 'events-2019-06-09.jsonl', 'date': '2019-06-09', 'count': '7062'},
 {'name': 'events-2019-06-10.jsonl', 'date': '2019-06-10', 'count': '12834'},
 {'name': 'events-2019-06-11.jsonl', 'date': '2019-06-11', 'count': '14359'},
 {'name': 'events-2019-06-12.jsonl', 'date': '2019-06-12', 'count': '14899'},
 {'name': 'events-2019-06-13.jsonl', 'date': '2019-06-13', 'count': '15819'},
 {'name': 'events-2019-06-14.jsonl', 'date': '2019-06-14', 'count': '11579'},
 {'name': 'events-2019-06-15.jsonl', 'date': '2019-06-15', 'count': '6267'},
 {'name': 'events-2019-06-16.jsonl', 'date': '2019-06-16', 'count': '6274'},
 {'name': 'events-2019-06-17.jsonl', 'date': '2019-06-17', 'count': '12672'},
 {'name': 'events-2019-06-18.jsonl', 'date': '2019-06-18', 'count': '14996'},
 {'name': 'events-2019-06-19.jsonl', 'date': '2019-06-19', 'count': '17509'},
 {'name': 'events-2019-06-20.jsonl', 'date': '2019-06-20', 'count': '14436'},
 {'name': 'events-2019-06-21.jsonl', 'date': '2019-06-21', 'count': '13387'},
 {'name': 'events-2019-06-22.jsonl', 'date': '2019-06-22', 'count': '6912'},
 {'name': 'events-2019-06-23.jsonl', 'date': '2019-06-23', 'count': '6485'},
 {'name': 'events-2019-06-24.jsonl', 'date': '2019-06-24', 'count': '14478'},
 {'name': 'events-2019-06-25.jsonl', 'date': '2019-06-25', 'count': '15199'},
 {'name': 'events-2019-06-26.jsonl', 'date': '2019-06-26', 'count': '15114'},
 {'name': 'events-2019-06-27.jsonl', 'date': '2019-06-27', 'count': '16424'},
 {'name': 'events-2019-06-28.jsonl', 'date': '2019-06-28', 'count': '15936'},
 {'name': 'events-2019-06-29.jsonl', 'date': '2019-06-29', 'count': '7213'},
 {'name': 'events-2019-06-30.jsonl', 'date': '2019-06-30', 'count': '6855'},
 {'name': 'events-2019-07-01.jsonl', 'date': '2019-07-01', 'count': '16461'},
 {'name': 'events-2019-07-02.jsonl', 'date': '2019-07-02', 'count': '15384'},
 {'name': 'events-2019-07-03.jsonl', 'date': '2019-07-03', 'count': '15709'},
 {'name': 'events-2019-07-04.jsonl', 'date': '2019-07-04', 'count': '14922'},
 {'name': 'events-2019-07-05.jsonl', 'date': '2019-07-05', 'count': '16336'},
 {'name': 'events-2019-07-06.jsonl', 'date': '2019-07-06', 'count': '6732'},
 {'name': 'events-2019-07-07.jsonl', 'date': '2019-07-07', 'count': '6954'},
 {'name': 'events-2019-07-08.jsonl', 'date': '2019-07-08', 'count': '18121'},
 {'name': 'events-2019-07-09.jsonl', 'date': '2019-07-09', 'count': '18321'},
 {'name': 'events-2019-07-10.jsonl', 'date': '2019-07-10', 'count': '15141'},
 {'name': 'events-2019-07-11.jsonl', 'date': '2019-07-11', 'count': '15025'},
 {'name': 'events-2019-07-12.jsonl', 'date': '2019-07-12', 'count': '13490'},
 {'name': 'events-2019-07-13.jsonl', 'date': '2019-07-13', 'count': '7508'},
 {'name': 'events-2019-07-14.jsonl', 'date': '2019-07-14', 'count': '7056'},
 {'name': 'events-2019-07-15.jsonl', 'date': '2019-07-15', 'count': '13588'},
 {'name': 'events-2019-07-16.jsonl', 'date': '2019-07-16', 'count': '15043'},
 {'name': 'events-2019-07-17.jsonl', 'date': '2019-07-17', 'count': '13545'},
 {'name': 'events-2019-07-18.jsonl', 'date': '2019-07-18', 'count': '13197'},
 {'name': 'events-2019-07-19.jsonl', 'date': '2019-07-19', 'count': '12350'},
 {'name': 'events-2019-07-20.jsonl', 'date': '2019-07-20', 'count': '8074'},
 {'name': 'events-2019-07-21.jsonl', 'date': '2019-07-21', 'count': '7701'},
 {'name': 'events-2019-07-22.jsonl', 'date': '2019-07-22', 'count': '13099'},
 {'name': 'events-2019-07-23.jsonl', 'date': '2019-07-23', 'count': '15365'},
 {'name': 'events-2019-07-24.jsonl', 'date': '2019-07-24', 'count': '14878'},
 {'name': 'events-2019-07-25.jsonl', 'date': '2019-07-25', 'count': '13480'},
 {'name': 'events-2019-07-26.jsonl', 'date': '2019-07-26', 'count': '11324'},
 {'name': 'events-2019-07-27.jsonl', 'date': '2019-07-27', 'count': '7142'},
 {'name': 'events-2019-07-28.jsonl', 'date': '2019-07-28', 'count': '7413'},
 {'name': 'events-2019-07-29.jsonl', 'date': '2019-07-29', 'count': '12181'},
 {'name': 'events-2019-07-30.jsonl', 'date': '2019-07-30', 'count': '13921'},
 {'name': 'events-2019-07-31.jsonl', 'date': '2019-07-31', 'count': '13653'},
 {'name': 'events-2019-08-01.jsonl', 'date': '2019-08-01', 'count': '12863'},
 {'name': 'events-2019-08-02.jsonl', 'date': '2019-08-02', 'count': '11907'},
 {'name': 'events-2019-08-03.jsonl', 'date': '2019-08-03', 'count': '7599'},
 {'name': 'events-2019-08-04.jsonl', 'date': '2019-08-04', 'count': '7344'},
 {'name': 'events-2019-08-05.jsonl', 'date': '2019-08-05', 'count': '12694'},
 {'name': 'events-2019-08-06.jsonl', 'date': '2019-08-06', 'count': '13990'},
 {'name': 'events-2019-08-07.jsonl', 'date': '2019-08-07', 'count': '14971'},
 {'name': 'events-2019-08-08.jsonl', 'date': '2019-08-08', 'count': '13643'},
 {'name': 'events-2019-08-09.jsonl', 'date': '2019-08-09', 'count': '12367'},
 {'name': 'events-2019-08-10.jsonl', 'date': '2019-08-10', 'count': '7689'},
 {'name': 'events-2019-08-11.jsonl', 'date': '2019-08-11', 'count': '7181'},
 {'name': 'events-2019-08-12.jsonl', 'date': '2019-08-12', 'count': '11641'},
 {'name': 'events-2019-08-13.jsonl', 'date': '2019-08-13', 'count': '14053'},
 {'name': 'events-2019-08-14.jsonl', 'date': '2019-08-14', 'count': '14120'},
 {'name': 'events-2019-08-15.jsonl', 'date': '2019-08-15', 'count': '12333'},
 {'name': 'events-2019-08-16.jsonl', 'date': '2019-08-16', 'count': '12151'},
 {'name': 'events-2019-08-17.jsonl', 'date': '2019-08-17', 'count': '7937'},
 {'name': 'events-2019-08-18.jsonl', 'date': '2019-08-18', 'count': '7805'},
 {'name': 'events-2019-08-19.jsonl', 'date': '2019-08-19', 'count': '13711'},
 {'name': 'events-2019-08-20.jsonl', 'date': '2019-08-20', 'count': '16311'},
 {'name': 'events-2019-08-21.jsonl', 'date': '2019-08-21', 'count': '15197'},
 {'name': 'events-2019-08-22.jsonl', 'date': '2019-08-22', 'count': '15148'},
 {'name': 'events-2019-08-23.jsonl', 'date': '2019-08-23', 'count': '13493'},
 {'name': 'events-2019-08-24.jsonl', 'date': '2019-08-24', 'count': '8584'},
 {'name': 'events-2019-08-25.jsonl', 'date': '2019-08-25', 'count': '7771'},
 {'name': 'events-2019-08-26.jsonl', 'date': '2019-08-26', 'count': '14480'},
 {'name': 'events-2019-08-27.jsonl', 'date': '2019-08-27', 'count': '16201'},
 {'name': 'events-2019-08-28.jsonl', 'date': '2019-08-28', 'count': '16800'},
 {'name': 'events-2019-08-29.jsonl', 'date': '2019-08-29', 'count': '17369'},
 {'name': 'events-2019-08-30.jsonl', 'date': '2019-08-30', 'count': '14596'},
 {'name': 'events-2019-08-31.jsonl', 'date': '2019-08-31', 'count': '9850'},
 {'name': 'events-2019-09-01.jsonl', 'date': '2019-09-01', 'count': '9254'},
 {'name': 'events-2019-09-02.jsonl', 'date': '2019-09-02', 'count': '14149'},
 {'name': 'events-2019-09-03.jsonl', 'date': '2019-09-03', 'count': '19495'},
 {'name': 'events-2019-09-04.jsonl', 'date': '2019-09-04', 'count': '19707'},
 {'name': 'events-2019-09-05.jsonl', 'date': '2019-09-05', 'count': '20233'},
 {'name': 'events-2019-09-06.jsonl', 'date': '2019-09-06', 'count': '16405'},
 {'name': 'events-2019-09-07.jsonl', 'date': '2019-09-07', 'count': '11599'},
 {'name': 'events-2019-09-08.jsonl', 'date': '2019-09-08', 'count': '10924'},
 {'name': 'events-2019-09-09.jsonl', 'date': '2019-09-09', 'count': '20269'},
 {'name': 'events-2019-09-10.jsonl', 'date': '2019-09-10', 'count': '20096'},
 {'name': 'events-2019-09-11.jsonl', 'date': '2019-09-11', 'count': '21835'},
 {'name': 'events-2019-09-12.jsonl', 'date': '2019-09-12', 'count': '19533'},
 {'name': 'events-2019-09-13.jsonl', 'date': '2019-09-13', 'count': '16485'},
 {'name': 'events-2019-09-14.jsonl', 'date': '2019-09-14', 'count': '10929'},
 {'name': 'events-2019-09-15.jsonl', 'date': '2019-09-15', 'count': '10916'},
 {'name': 'events-2019-09-16.jsonl', 'date': '2019-09-16', 'count': '21071'},
 {'name': 'events-2019-09-17.jsonl', 'date': '2019-09-17', 'count': '23236'},
 {'name': 'events-2019-09-18.jsonl', 'date': '2019-09-18', 'count': '22101'},
 {'name': 'events-2019-09-19.jsonl', 'date': '2019-09-19', 'count': '22235'},
 {'name': 'events-2019-09-20.jsonl', 'date': '2019-09-20', 'count': '17436'},
 {'name': 'events-2019-09-21.jsonl', 'date': '2019-09-21', 'count': '11592'},
 {'name': 'events-2019-09-22.jsonl', 'date': '2019-09-22', 'count': '11086'},
 {'name': 'events-2019-09-23.jsonl', 'date': '2019-09-23', 'count': '21170'},
 {'name': 'events-2019-09-24.jsonl', 'date': '2019-09-24', 'count': '23135'},
 {'name': 'events-2019-09-25.jsonl', 'date': '2019-09-25', 'count': '23756'},
 {'name': 'events-2019-09-26.jsonl', 'date': '2019-09-26', 'count': '22016'},
 {'name': 'events-2019-09-27.jsonl', 'date': '2019-09-27', 'count': '18908'},
 {'name': 'events-2019-09-28.jsonl', 'date': '2019-09-28', 'count': '12331'},
 {'name': 'events-2019-09-29.jsonl', 'date': '2019-09-29', 'count': '11831'},
 {'name': 'events-2019-09-30.jsonl', 'date': '2019-09-30', 'count': '20538'},
 {'name': 'events-2019-10-01.jsonl', 'date': '2019-10-01', 'count': '20371'},
 {'name': 'events-2019-10-02.jsonl', 'date': '2019-10-02', 'count': '19756'},
 {'name': 'events-2019-10-03.jsonl', 'date': '2019-10-03', 'count': '20469'},
 {'name': 'events-2019-10-04.jsonl', 'date': '2019-10-04', 'count': '17035'},
 {'name': 'events-2019-10-05.jsonl', 'date': '2019-10-05', 'count': '11407'},
 {'name': 'events-2019-10-06.jsonl', 'date': '2019-10-06', 'count': '11555'},
 {'name': 'events-2019-10-07.jsonl', 'date': '2019-10-07', 'count': '20748'},
 {'name': 'events-2019-10-08.jsonl', 'date': '2019-10-08', 'count': '21232'},
 {'name': 'events-2019-10-09.jsonl', 'date': '2019-10-09', 'count': '23141'},
 {'name': 'events-2019-10-10.jsonl', 'date': '2019-10-10', 'count': '21796'},
 {'name': 'events-2019-10-11.jsonl', 'date': '2019-10-11', 'count': '17924'},
 {'name': 'events-2019-10-12.jsonl', 'date': '2019-10-12', 'count': '12297'},
 {'name': 'events-2019-10-13.jsonl', 'date': '2019-10-13', 'count': '11693'},
 {'name': 'events-2019-10-14.jsonl', 'date': '2019-10-14', 'count': '20589'},
 {'name': 'events-2019-10-15.jsonl', 'date': '2019-10-15', 'count': '22692'},
 {'name': 'events-2019-10-16.jsonl', 'date': '2019-10-16', 'count': '20935'}]
[5]:
filenames = (db.read_text('https://archive.analytics.mybinder.org/index.jsonl')
               .map(json.loads)
               .pluck('name')
               .compute())

filenames = ['https://archive.analytics.mybinder.org/' + fn for fn in filenames]
filenames[:5]
[5]:
['https://archive.analytics.mybinder.org/events-2018-11-03.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-04.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-05.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-06.jsonl',
 'https://archive.analytics.mybinder.org/events-2018-11-07.jsonl']
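Note that the index stores each count as a string rather than a number. The following small sketch, using the first three index entries from the output above, converts those counts to integers to total the number of events expected across files; index_entries and expected_total are names chosen here for illustration.

```python
# First three entries copied from the index output above.
index_entries = [
    {'name': 'events-2018-11-03.jsonl', 'date': '2018-11-03', 'count': '7057'},
    {'name': 'events-2018-11-04.jsonl', 'date': '2018-11-04', 'count': '7489'},
    {'name': 'events-2018-11-05.jsonl', 'date': '2018-11-05', 'count': '13590'},
]

# 'count' is a string in the index, so convert it before summing.
expected_total = sum(int(entry['count']) for entry in index_entries)
print(expected_total)  # 28136
```

A total like this could serve as a sanity check against the number of rows actually loaded, assuming the index is kept in sync with the event files.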

Create Bag of all events

We now create a Dask Bag around that list of URLs, and then call the json.loads function on every line to turn those lines of JSON-encoded text into Python dictionaries that can be more easily manipulated.

[6]:
events = db.read_text(filenames).map(json.loads)
events.take(2)
[6]:
({'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'Qiskit/qiskit-tutorial/master',
  'status': 'success'},
 {'timestamp': '2018-11-03T00:00:00+00:00',
  'schema': 'binderhub.jupyter.org/launch',
  'version': 1,
  'provider': 'GitHub',
  'spec': 'ipython/ipython-in-depth/master',
  'status': 'success'})
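A frequency count can also be expressed directly on a Bag, before any conversion to a dataframe. The following is a minimal sketch on a small in-memory stand-in for the events bag: the records are hand-written for illustration, and scheduler="synchronous" is passed only to keep the toy example in a single process. pluck and frequencies are Dask Bag methods.

```python
import dask.bag as db

# Hypothetical stand-in for the real `events` bag: a few hand-written records.
toy_events = db.from_sequence(
    [
        {"provider": "GitHub", "spec": "ipython/ipython-in-depth/master"},
        {"provider": "GitHub", "spec": "jupyterlab/jupyterlab-demo/master"},
        {"provider": "GitHub", "spec": "ipython/ipython-in-depth/master"},
        {"provider": "GitHub", "spec": "ipython/ipython-in-depth/master"},
        {"provider": "GitHub", "spec": "jupyterlab/jupyterlab-demo/master"},
        {"provider": "GitHub", "spec": "dask/dask-examples/master"},
    ],
    npartitions=2,
)

# Pull out the 'spec' field and count occurrences, most frequent first.
top_specs = (
    toy_events.pluck("spec")
    .frequencies(sort=True)
    .compute(scheduler="synchronous")
)
print(top_specs)
```

The result is a list of (spec, count) tuples; on the real bag one would typically call take(20) instead of compute() to look only at the most frequent entries.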

Convert to Dask Dataframe

Finally, we can convert our bag of Python dictionaries into a Dask Dataframe, and follow up with more Pandas-like computations.

We’ll count how often each repository spec was launched and list the 20 most popular binders, using Pandas syntax.

[8]:
df = events.to_dataframe()
df.head()
[8]:
                   timestamp                        schema  version  provider                               spec   status
0  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch        1    GitHub      Qiskit/qiskit-tutorial/master  success
1  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch        1    GitHub    ipython/ipython-in-depth/master  success
2  2018-11-03T00:00:00+00:00  binderhub.jupyter.org/launch        1    GitHub      QISKit/qiskit-tutorial/master  success
3  2018-11-03T00:01:00+00:00  binderhub.jupyter.org/launch        1    GitHub      QISKit/qiskit-tutorial/master  success
4  2018-11-03T00:01:00+00:00  binderhub.jupyter.org/launch        1    GitHub  jupyterlab/jupyterlab-demo/master  success
[9]:
df.spec.value_counts().nlargest(20).to_frame().compute()
[9]:
spec
ipython/ipython-in-depth/master 1966412
jupyterlab/jupyterlab-demo/master 340568
jupyterlab/jupyterlab-demo/try.jupyter.org 285266
ines/spacy-io-binder/live 185166
DS-100/textbook/master 150260
bokeh/bokeh-notebooks/master 92133
binder-examples/requirements/master 87843
binder-examples/r/master 71266
rationalmatter/juno-demo-notebooks/master 57190
QuantStack/xeus-cling/stable 50517
ines/spacy-course/binder 33963
numba/numba-examples/master 28830
rasahq/docs-binder/master 21936
binder-examples/julia-python/master 21041
QISKit/qiskit-tutorial/master 19453
RasaHQ/rasa_core/master 19253
dask/dask-examples/master 19003
ELC/8fdc0f490b3058872a7014f01416dfb6/master 18548
wshuyi/demo-spacy-text-processing/master 15618
data-8/textbook/gh-pages 15591

Persist in memory

This dataset fits nicely into memory. Let’s avoid downloading data every time we do an operation and instead keep the data local in memory.

[10]:
df = df.persist()

Honestly, at this point it makes more sense to just switch to Pandas, but this is a Dask example, so we’ll continue with Dask dataframe.

Investigate providers other than GitHub

Most binders are specified as git repositories on GitHub, but not all. Let’s investigate other providers.

[11]:
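import urllib.parse  # import the submodule explicitly; a bare `import urllib` does not guarantee urllib.parse is loaded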
[12]:
df.provider.value_counts().compute()
[12]:
GitHub      4497792
Gist          27980
GitLab        17084
Git           10473
Zenodo          332
Figshare        124
Name: provider, dtype: int64
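To put these counts in proportion, the share of each provider can be worked out directly from the numbers printed above. A small plain-Python sketch (the dictionary simply copies the output; shares is an illustrative name):

```python
# Counts copied from the provider value_counts output above.
provider_counts = {
    "GitHub": 4497792,
    "Gist": 27980,
    "GitLab": 17084,
    "Git": 10473,
    "Zenodo": 332,
    "Figshare": 124,
}

total = sum(provider_counts.values())

# Percentage of all launches attributed to each provider.
shares = {name: 100 * n / total for name, n in provider_counts.items()}

for name, share in shares.items():
    print(f"{name:<10}{share:6.2f}%")
```

In this snapshot GitHub accounts for roughly 98.8% of the 4,553,785 recorded launches, so the remaining providers together make up a little over 1%.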
[13]:
(df[df.provider == 'GitLab']
 .spec
 .map(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[13]:
spec
rruizz/inforfis/R 1480
rruizz/inforfis/master 1263
lfortran/web/lfortran-binder/master 1192
JC_Bonnefoy/snt_2019/master 1124
DGothrek/ipyaggrid/binder-demo 845
... ...
smendez2/minimal-notebook-test/124ced540ed5587ea5565c20d8f6c10eff1b024f 1
rogermarkussen/test/master 1
atomap/atomap_demos/master 1
elfua/ipython-notebooks/master 1
plotnips/numerical_linear_algebra_notebooks/master 1

453 rows × 1 columns
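The call to urllib.parse.unquote suggests that the raw GitLab specs are percent-encoded. As a sketch of what that decoding does, the encoded string below is a hypothetical reconstruction of how the first spec in the output might look before decoding; it is not a value copied from the data.

```python
from urllib.parse import unquote

# Hypothetical percent-encoded spec; '%2F' is the encoding of '/'.
encoded_spec = "rruizz%2Finforfis/R"

# unquote replaces %xx escapes with the characters they represent.
decoded_spec = unquote(encoded_spec)
print(decoded_spec)
```

Strings without any escapes pass through unquote unchanged, so applying it to every spec is safe even if only some are encoded.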

[14]:
(df[df.provider == 'Git']
 .spec
 .apply(urllib.parse.unquote, meta=('spec', object))
 .value_counts()
 .to_frame()
 .compute())
[14]:
spec
https://git.unilim.fr/grossp01/notebooks.git/master 944
https://bitbucket.org/nikiubel/nikiubel.bitbucket.io.git/841046b40e936fa187b974aabb310a6ac0ecd094 794
https://gitlab.science.ru.nl/roesner_public/programming_2_notebooks.git/master 521
https://framagit.org/debimax/cours-debimax/master 410
https://gitlab.tudelft.nl/aj-lab/teaching.git/master 355
... ...
https://dominikschroeck.de/gitea/dominik/stockquote_analyses/92445b46d65253c165d57809f3a5298e0551e9bf 1
https://dominikschroeck.de/gitea/dominik/stockquote_analyses/961d014669a4fd6e3cbce7bd8c4b716c19fd724b 1
https://gitlab.in2p3.fr/gregoire.henning/python-class-for-nuclear-physicist/9a46ec05bd72f30f366e9aa9e354c7d55ce363e8 1
https://elmord.org/cgit/mybinder-test.git/master 1
https://bitbucket.org/therealMatteo/newspopolarityproject/src/master/e62cfdaad8759cb3266e37f2dfcd8defc091543a 1

1117 rows × 1 columns
