Homework
The constraints of a tutorial environment keep us from working with moderately large, real-world datasets, which makes for a less than fully satisfying experience. To remedy this, we recommend playing with the following datasets. Please wait until you're off the conference WiFi before downloading them.
NYC Taxi
Taxi trips taken in 2013, released through a FOIA request. Around 20 GB of uncompressed CSV.
Try the following:
- Use `dask.dataframe` with pandas-style queries
- Store in HDF5 both with and without categoricals; measure the size of the file and the query times
- Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an "interactive" experience? (A rough sketch of the first steps follows below.)
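Here is a minimal sketch of one way to start, assuming the trip-data CSV files have been downloaded into a local nyc-taxi/ directory; the file pattern and column names are assumptions to check against the actual files:

```python
import dask.dataframe as dd

# Lazily read every CSV file; the path pattern and the parsed date
# column are assumptions about the downloaded files
df = dd.read_csv('nyc-taxi/trip_data_*.csv',
                 parse_dates=['pickup_datetime'])

# A pandas-style query: mean trip distance per passenger count
print(df.groupby('passenger_count').trip_distance.mean().compute())

# Convert a low-cardinality string column to a categorical, then store as
# HDF5; repeat without categorize() to compare file sizes and query times
df = df.categorize(columns=['vendor_id'])
df.to_hdf('nyc-taxi.hdf5', '/taxi')

# Set the index on a date-time column so that range queries only touch
# the relevant partitions
df = df.set_index('pickup_datetime')
january = df.loc['2013-01-01':'2013-01-31'].compute()
```

Setting a datetime index is what makes the range queries cheap: dask records the partition boundaries and only reads the partitions that overlap the requested date range.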
GitHub Archive
Every public GitHub event for the last few years, stored as gzip-compressed, line-delimited JSON. Watch out: the schema switches at the 2014-2015 transition.
Try the following:
- Use `dask.bag` to inspect the data
- Drill down using functions like `pluck` and `filter`
- Find who the most popular committers were in 2015 (one possible approach is sketched below)
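A minimal sketch, assuming the archive has been downloaded as gzip-compressed, line-delimited JSON files into a github-archive/ directory; the path pattern is an assumption, and the field names follow the post-2015 schema:

```python
import json
import dask.bag as db

# Gzip-compressed, line-delimited JSON; the path pattern is an assumption
events = db.read_text('github-archive/2015-*.json.gz').map(json.loads)

# Peek at a couple of records to learn the schema
print(events.take(2))

# Drill down with filter and pluck: count push events per user login
pushes = events.filter(lambda e: e['type'] == 'PushEvent')
top_committers = (pushes.pluck('actor')
                        .pluck('login')
                        .frequencies()
                        .topk(10, key=lambda pair: pair[1])
                        .compute())
print(top_committers)
```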
Reddit Comments
Every publicly available Reddit comment, available as a large torrent.
Try the following:
- Use `dask.bag` to inspect the data
- Combine `dask.bag` with `nltk` or `gensim` to perform textual analysis on the data (a minimal starting point is sketched below)
- Reproduce the work of Daniel Rodriguez and see if you can improve upon his speeds when analyzing this data.
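A minimal sketch using nltk, assuming the comment dumps sit in a reddit/ directory as compressed, line-delimited JSON; the path pattern and compression are assumptions, and nltk's punkt tokenizer data must be installed separately:

```python
import json
import dask.bag as db
from nltk.tokenize import word_tokenize   # requires nltk's 'punkt' data

# The path pattern and bz2 compression are assumptions about how the
# torrent was unpacked
comments = db.read_text('reddit/RC_2015-*.bz2').map(json.loads)

# A first textual analysis: the most frequent tokens across comment bodies
top_words = (comments.pluck('body')
                     .map(word_tokenize)
                     .flatten()
                     .map(str.lower)
                     .frequencies()
                     .topk(20, key=lambda pair: pair[1])
                     .compute())
print(top_words)
```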
European Centre for Medium-Range Weather Forecasts
Download historical global weather data from the ECMWF.
Try the following:
- What is the variance in temperature over time?
- What areas experienced the largest temperature swings in the last month relative to their previous history?
- Plot the temperature of the earth as a function of latitude and then of longitude (a sketch for loading the data follows below)
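A minimal sketch of loading the data with netCDF4 and dask.array, assuming one netCDF file per time period, each holding a 't2m' (2-metre temperature) variable of shape (time, latitude, longitude); the file pattern, variable name, and chunk sizes are assumptions to adjust:

```python
from glob import glob

import dask.array as da
import netCDF4

# Open every file lazily; 'weather/*.nc' and 't2m' are assumptions
filenames = sorted(glob('weather/*.nc'))
temps = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]

# Wrap each on-disk array lazily and stack them along the time axis
arrays = [da.from_array(t, chunks=(4, 200, 200)) for t in temps]
t2m = da.concatenate(arrays, axis=0)          # (time, lat, lon)

# Variance in temperature over time at every grid point
variance = t2m.var(axis=0).compute()

# Mean temperature as a function of latitude, then of longitude
by_latitude = t2m.mean(axis=(0, 2)).compute()
by_longitude = t2m.mean(axis=(0, 1)).compute()
```

The by_latitude and by_longitude arrays can then be plotted with matplotlib, and the variance array imaged to find the regions with the largest swings.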