Homework
The constraints of a tutorial environment keep us from working with moderately large, real-world datasets, which makes for a less than fully satisfying experience. To remedy this, we recommend playing with the following datasets. Please wait until you're off the conference WiFi before downloading them.
NYC Taxi
Taxi trips taken in 2013, released through a FOIA request. Around 20 GB of uncompressed CSV.
Try the following:
- Use `dask.dataframe` with pandas-style queries
- Store in HDF5 both with and without categoricals; measure the size of the file and the query times
- Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an "interactive" experience? (A rough sketch of the first steps follows below.)
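Here is a minimal sketch of one way to start, assuming the trip-data CSV files have been downloaded into a local nyc-taxi/ directory; the file pattern and column names are assumptions to check against the actual files:

```python
import dask.dataframe as dd

# Lazily read every CSV file; the path pattern and the parsed date
# column are assumptions about the downloaded files
df = dd.read_csv('nyc-taxi/trip_data_*.csv',
                 parse_dates=['pickup_datetime'])

# A pandas-style query: mean trip distance per passenger count
print(df.groupby('passenger_count').trip_distance.mean().compute())

# Convert a low-cardinality string column to a categorical, then store as
# HDF5; repeat without categorize() to compare file sizes and query times
df = df.categorize(columns=['vendor_id'])
df.to_hdf('nyc-taxi.hdf5', '/taxi')

# Set the index on a date-time column so that range queries only touch
# the relevant partitions
df = df.set_index('pickup_datetime')
january = df.loc['2013-01-01':'2013-01-31'].compute()
```

Setting a datetime index is what makes the range queries cheap: dask records the partition boundaries and only reads the partitions that overlap the requested date range.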
GitHub Archive
Every public GitHub event for the last few years, stored as gzip-compressed, line-delimited JSON. Watch out: the schema switches at the 2014-2015 transition.
Try the following:
- Use `dask.bag` to inspect the data
- Drill down using functions like `pluck` and `filter`
- Find who the most popular committers were in 2015 (one possible approach is sketched below)
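A minimal sketch, assuming the archive has been downloaded as gzip-compressed, line-delimited JSON files into a github-archive/ directory; the path pattern is an assumption, and the field names follow the post-2015 schema:

```python
import json
import dask.bag as db

# Gzip-compressed, line-delimited JSON; the path pattern is an assumption
events = db.read_text('github-archive/2015-*.json.gz').map(json.loads)

# Peek at a couple of records to learn the schema
print(events.take(2))

# Drill down with filter and pluck: count push events per user login
pushes = events.filter(lambda e: e['type'] == 'PushEvent')
top_committers = (pushes.pluck('actor')
                        .pluck('login')
                        .frequencies()
                        .topk(10, key=lambda pair: pair[1])
                        .compute())
print(top_committers)
```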
Reddit Comments
Every publicly available Reddit comment, available as a large torrent.
Try the following:
- Use `dask.bag` to inspect the data
- Combine `dask.bag` with `nltk` or `gensim` to perform textual analysis on the data (a minimal starting point is sketched below)
- Reproduce the work of Daniel Rodriguez and see if you can improve upon his speeds when analyzing this data.
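A minimal sketch using nltk, assuming the comment dumps sit in a reddit/ directory as compressed, line-delimited JSON; the path pattern and compression are assumptions, and nltk's punkt tokenizer data must be installed separately:

```python
import json
import dask.bag as db
from nltk.tokenize import word_tokenize   # requires nltk's 'punkt' data

# The path pattern and bz2 compression are assumptions about how the
# torrent was unpacked
comments = db.read_text('reddit/RC_2015-*.bz2').map(json.loads)

# A first textual analysis: the most frequent tokens across comment bodies
top_words = (comments.pluck('body')
                     .map(word_tokenize)
                     .flatten()
                     .map(str.lower)
                     .frequencies()
                     .topk(20, key=lambda pair: pair[1])
                     .compute())
print(top_words)
```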
European Centre for Medium-Range Weather Forecasts
Download historical global weather data from the ECMWF.
Try the following:
- What is the variance in temperature over time?
- What areas experienced the largest temperature swings in the last month relative to their previous history?
- Plot the temperature of the earth as a function of latitude and then of longitude (a sketch for loading the data follows below)
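A minimal sketch of loading the data with netCDF4 and dask.array, assuming one netCDF file per time period, each holding a 't2m' (2-metre temperature) variable of shape (time, latitude, longitude); the file pattern, variable name, and chunk sizes are assumptions to adjust:

```python
from glob import glob

import dask.array as da
import netCDF4

# Open every file lazily; 'weather/*.nc' and 't2m' are assumptions
filenames = sorted(glob('weather/*.nc'))
temps = [netCDF4.Dataset(fn).variables['t2m'] for fn in filenames]

# Wrap each on-disk array lazily and stack them along the time axis
arrays = [da.from_array(t, chunks=(4, 200, 200)) for t in temps]
t2m = da.concatenate(arrays, axis=0)          # (time, lat, lon)

# Variance in temperature over time at every grid point
variance = t2m.var(axis=0).compute()

# Mean temperature as a function of latitude, then of longitude
by_latitude = t2m.mean(axis=(0, 2)).compute()
by_longitude = t2m.mean(axis=(0, 1)).compute()
```

The by_latitude and by_longitude arrays can then be plotted with matplotlib, and the variance array imaged to find the regions with the largest swings.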