{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Homework\n", "========\n", "\n", "The constraints of a tutorial environment hinder the use of real-world moderately large datasets. This keeps us from a fully satisfying experience. To remedy this situation we recommend playing with the following datasets. Please wait until you're off of the conference WiFi before downloading them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NYCTaxi\n", "------\n", "\n", "[Download link](http://www.andresmh.com/nyctaxitrips/)\n", "\n", "Taxi trips taken in 2013 released by a FOIA request. Around 20GB CSV uncompressed.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.dataframe` with pandas-style queries\n", "* Store in HDF5 both with and without categoricals, measure the size of the file and query times\n", "* Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an \"interactive\" experience?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Github Archive\n", "----------------\n", "\n", "[Download link](https://www.githubarchive.org/)\n", "\n", "Every public github event for the last few years stored as gzip compressed line-delimited JSON data. 
Watch out: the schema changes at the 2014-2015 transition.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.bag` to inspect the data\n", "* Drill down using functions like `pluck` and `filter`\n", "* Find the most popular committers of 2015" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reddit Comments\n", "-----------------\n", "\n", "[Download link](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)\n", "\n", "Every publicly available Reddit comment, distributed as a large torrent.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.bag` to inspect the data\n", "* Combine `dask.bag` with `nltk` or `gensim` to perform text analysis on the data\n", "* Reproduce the work of [Daniel Rodriguez](https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/) and see if you can improve on his timings when analyzing this data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NYC 311\n", "---------\n", "\n", "[Download link](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)\n", "\n", "All 311 service requests in New York City since 2010." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "European Centre for Medium-Range Weather Forecasts\n", "----------------------------------------------------------\n", "\n", "[Download script](https://gist.github.com/mrocklin/26d8323f9a8a6a75fce0)\n", "\n", "Download historical global weather data from the ECMWF.\n", "\n", "**Try the following:**\n", "\n", "* What is the variance in temperature over time?\n", "* Which areas experienced the largest temperature swings in the last month, relative to their previous history?\n", "* Plot the temperature of the Earth as a function of latitude, and then as a function of longitude" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": 
"ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }