Homework
The constraints of a tutorial environment make it hard to work with real-world, moderately large datasets, which keeps the experience from being fully satisfying. To remedy this, we recommend playing with the following datasets. Please wait until you're off the conference WiFi before downloading them.
NYCTaxi
Taxi trips taken in New York City in 2013, released in response to a FOIA request. Around 20 GB of uncompressed CSV.
Try the following:
- Use dask.dataframe with pandas-style queries
- Store in HDF5 both with and without categoricals, measure the size of the file and query times 
- Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an “interactive” experience? 
GitHub Archive
Every public GitHub event for the last few years, stored as gzip-compressed, line-delimited JSON. Watch out: the schema changes at the 2014-2015 transition.
Try the following:
- Use dask.bag to inspect the data
- Drill down using functions like pluck and filter
- Find who the most popular committers were in 2015 
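The steps above can be sketched on a handful of synthetic events. The field names used here (type, actor, login) are assumptions about the 2015-era schema, and counting push events per actor is only one possible reading of "most popular committers"; adapt both after inspecting the real records.

```python
import json
import dask.bag as db

# Synthetic line-delimited JSON standing in for the archive files.
# Field names are assumptions; inspect real records before relying on them.
lines = [
    json.dumps({"type": "PushEvent", "actor": {"login": "alice"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "bob"}}),
    json.dumps({"type": "WatchEvent", "actor": {"login": "carol"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "alice"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "bob"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "alice"}}),
    json.dumps({"type": "PushEvent", "actor": {"login": "carol"}}),
]

# With the real data you would typically use db.read_text on the
# downloaded .json.gz files instead of db.from_sequence.
events = db.from_sequence(lines, npartitions=2).map(json.loads)

# Inspect one record to learn the structure.
first = events.take(1)

# Drill down: keep push events, pluck the actor's login, count, take the top 2.
pushes = events.filter(lambda e: e["type"] == "PushEvent")
logins = pushes.pluck("actor").pluck("login")
top = logins.frequencies().topk(2, key=lambda kv: kv[1])

# The synchronous scheduler keeps this tiny example simple.
result = top.compute(scheduler="synchronous")
print(result)
```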
Reddit Comments
Every publicly available Reddit comment, available as a large torrent.
Try the following:
- Use dask.bag to inspect the data
- Combine dask.bag with nltk or gensim to perform textual analysis on the data
- Reproduce the work of Daniel Rodriguez and see if you can improve upon his speeds when analyzing this data. 
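A minimal word-count sketch shows the shape such a text pipeline might take. To stay self-contained it uses a plain str.split tokenizer in place of nltk or gensim, and it assumes each comment is a dict with body and subreddit keys; both the tokenizer and the field names are stand-ins chosen for illustration.

```python
import dask.bag as db

# Synthetic comments; the "body" and "subreddit" keys are assumed field names.
comments = [
    {"subreddit": "python", "body": "Dask makes parallel Python easy"},
    {"subreddit": "python", "body": "Python is fun"},
    {"subreddit": "datascience", "body": "Dask scales pandas workflows"},
]

bag = db.from_sequence(comments, npartitions=2)


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace.

    This is a placeholder for a real tokenizer such as one from nltk.
    """
    return text.lower().split()


# Pluck the text, tokenize each comment, flatten to one bag of words, count.
word_counts = (
    bag.pluck("body")
    .map(tokenize)
    .flatten()
    .frequencies()
    .compute(scheduler="synchronous")
)

counts = dict(word_counts)
print(counts["python"], counts["dask"])
```

Because tokenize is an ordinary Python function passed to map, replacing it with a more capable tokenizer should not require changes to the rest of the pipeline.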
European Centre for Medium Range Weather Forecasts
Download historical global weather data from the ECMWF.
Try the following:
- What is the variance in temperature over time? 
- What areas experienced the largest temperature swings in the last month relative to their previous history? 
- Plot the temperature of the earth as a function of latitude and then as a function of longitude
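The reductions behind the first and last questions can be sketched with dask.array. The example assumes the temperature field has been loaded as a three-dimensional array with axes (time, latitude, longitude); that layout, the grid size, and the random values are all assumptions, since the real files need to be opened with a suitable reader first.

```python
import numpy as np
import dask.array as da

# Synthetic temperatures in kelvin with assumed axes (time, lat, lon).
rng = np.random.default_rng(0)
temps = rng.normal(loc=288.0, scale=5.0, size=(12, 6, 8))

# Chunk along time, as if each chunk came from a separate file.
x = da.from_array(temps, chunks=(4, 6, 8))

# Variance over time at every grid point: result has shape (lat, lon).
var_over_time = x.var(axis=0).compute()

# Mean temperature as a function of latitude (average over time and lon)
# and as a function of longitude (average over time and lat).
mean_by_lat = x.mean(axis=(0, 2)).compute()
mean_by_lon = x.mean(axis=(0, 1)).compute()

print(var_over_time.shape, mean_by_lat.shape, mean_by_lon.shape)
```

The two one-dimensional results are what you would pass to a plotting library to draw the latitude and longitude profiles.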