{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Homework\n", "========\n", "\n", "The constraints of a tutorial environment hinder the use of real-world moderately large datasets. This keeps us from a fully satisfying experience. To remedy this situation we recommend playing with the following datasets. Please wait until you're off of the conference WiFi before downloading them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NYCTaxi\n", "------\n", "\n", "[Download link](http://www.andresmh.com/nyctaxitrips/)\n", "\n", "Taxi trips taken in 2013 released by a FOIA request. Around 20GB CSV uncompressed.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.dataframe` with pandas-style queries\n", "* Store in HDF5 both with and without categoricals, measure the size of the file and query times\n", "* Set the index by one of the date-time columns and store in castra (also using categoricals). Perform range queries and measure speed. What size and complexity of query can you perform while still having an \"interactive\" experience?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Github Archive\n", "----------------\n", "\n", "[Download link](https://www.githubarchive.org/)\n", "\n", "Every public github event for the last few years stored as gzip compressed line-delimited JSON data. 
Watch out: the schema changes at the 2014-2015 transition.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.bag` to inspect the data\n", "* Drill down using functions like `pluck` and `filter`\n", "* Find the most popular committers of 2015" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reddit Comments\n", "-----------------\n", "\n", "[Download link](https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/)\n", "\n", "Every publicly available Reddit comment, distributed as a large torrent.\n", "\n", "**Try the following:**\n", "\n", "* Use `dask.bag` to inspect the data\n", "* Combine `dask.bag` with `nltk` or `gensim` to perform text analysis on the data\n", "* Reproduce the work of [Daniel Rodriguez](https://extrapolations.dev/blog/2015/07/reproduceit-reddit-word-count-dask/) and see if you can improve on his timings when analyzing this data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NYC 311\n", "---------\n", "\n", "[Download link](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9)\n", "\n", "All 311 service requests in New York City since 2010." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "European Centre for Medium-Range Weather Forecasts\n", "----------------------------------------------------------\n", "\n", "[Download script](https://gist.github.com/mrocklin/26d8323f9a8a6a75fce0)\n", "\n", "Download historical global weather data from the ECMWF.\n", "\n", "**Try the following:**\n", "\n", "* What is the variance in temperature over time?\n", "* Which areas experienced the largest temperature swings in the last month, relative to their previous history?\n", "* Plot the temperature of the Earth as a function of latitude, and then as a function of longitude" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": 
"ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }