In the Coursework, you will apply Spark techniques to the NYC Rideshare dataset, which focuses on analyzing New York Uber/Lyft data from January 1, 2023, to May 31, 2023. The pre-processed source data was provided by the NYC Taxi and Limousine Commission (TLC); the Rideshare dataset is part of the New York TLC trip record data. The dataset used in the Coursework is distributed under the MIT license. The source of the datasets is available at the TLC link below.
Useful Resources: Lectures, Labs, and other materials are shared in the module, along with the following links:
- https://sparkbyexamples.com/pyspark/
- https://sparkbyexamples.com/pyspark-tutorial/
- https://spark.apache.org/docs/3.1.2/api/python/getting_started/index.html
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
You can find two CSV files under the path /data-repository-bkt/ECS765/rideshare_2023/: rideshare_data.csv and taxi_zone_lookup.csv. Use the ccc method bucket ls command to check them.
Dataset Schema
The two CSV files are described below. Please read the descriptions carefully; they will help you understand the dataset and the tasks.
taxi_zone_lookup.csv The taxi zone lookup CSV provides the details for each pickup_location/dropoff_location value in rideshare_data.csv. taxi_zone_lookup.csv has the following schema:
- LocationID: string (nullable = true)
- Borough: string (nullable = true)
- Zone: string (nullable = true)
- Service_zone: string (nullable = true)
The table below shows sample rows from taxi_zone_lookup.csv:
| LocationID | Borough | Zone | Service_zone |
|---|---|---|---|
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 4 | Manhattan | Alphabet City | Yellow Zone |
| 5 | Staten Island | Arden Heights | Boro Zone |
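For reference, here is a minimal PySpark sketch that reads the lookup file and prints the schema and a few sample rows; it assumes the bucket path shown earlier is readable from your Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zone-lookup-schema").getOrCreate()

# Without an explicit schema, every column is read as a nullable string,
# matching the schema listed above.
zones = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/taxi_zone_lookup.csv",
    header=True,
)
zones.printSchema()
zones.show(5)
```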
As you can see, the pickup_location/dropoff_location fields in rideshare_data.csv are encoded as numbers whose counterparts (LocationID) you can find in taxi_zone_lookup.csv. You need to join the two datasets on these fields (see the sketch after this paragraph). Before you move on to the assignment, there are three notes you need to understand about taxi_zone_lookup.csv: (1) LocationIDs 264 and 265 have "Unknown" in the Borough field; treat "Unknown" as one of the borough names. (2) In the Borough field, the same borough can appear with different LocationIDs; this does not matter because you join on LocationID (the unique key). (3) If you see an obscure value (like NW, NA, N/A, etc.) in any field, simply treat it as a valid name.
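Continuing the sketch above, one possible shape of the double join is shown below; the Pickup_/Dropoff_ column names are illustrative, not prescribed:

```python
# Read the trip data, then join it twice against the lookup table --
# once per location field.
rideshare = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/rideshare_data.csv",
    header=True,
)

# Alias the lookup columns so the pickup and dropoff joins don't collide.
pickup = zones.withColumnRenamed("Borough", "Pickup_Borough") \
              .withColumnRenamed("Zone", "Pickup_Zone") \
              .withColumnRenamed("Service_zone", "Pickup_Service_zone")
dropoff = zones.withColumnRenamed("Borough", "Dropoff_Borough") \
               .withColumnRenamed("Zone", "Dropoff_Zone") \
               .withColumnRenamed("Service_zone", "Dropoff_Service_zone")

joined = (
    rideshare
    .join(pickup, rideshare.pickup_location == pickup.LocationID, "left")
    .drop(pickup.LocationID)
    .join(dropoff, rideshare.dropoff_location == dropoff.LocationID, "left")
    .drop(dropoff.LocationID)
)
joined.show(5)
```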
Note
- Write Spark scripts to answer the questions. You are allowed to use any Spark functions/APIs, not limited to the ones you learned in the module; any solution that does not use Spark scripts is invalid.
- For drawing graphs, you need to download your outputs; you can then use any visualization toolkit: Python's matplotlib (https://matplotlib.org/stable/users/index.html), Gnuplot (http://www.gnuplot.info), or any plotting tool of your preference. Note that matplotlib is readily available in your local Jupyter environment, and you cannot run matplotlib or Gnuplot on the Spark cluster. The method of plotting should be clear: if no script is used, the plotting must be reproducible, and all steps to reproduce it need to be described in detail in the report (see the plotting sketch after this list).
- You might need to convert fields from string type to another appropriate type (e.g., integer or float) when math operations are involved (see the casting sketch after this list).
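As an illustration of the plotting step, here is a minimal matplotlib sketch that runs locally, not on the Spark cluster; the file name results.csv and its columns month and trip_count are hypothetical placeholders for whatever output a task produces:

```python
import pandas as pd
import matplotlib.pyplot as plt

# "results.csv" and its columns ("month", "trip_count") are hypothetical;
# substitute the downloaded output and field names of the actual task.
df = pd.read_csv("results.csv")

plt.figure(figsize=(8, 4))
plt.bar(df["month"], df["trip_count"])
plt.xlabel("Month")
plt.ylabel("Number of trips")
plt.title("Trips per month (Jan-May 2023)")
plt.tight_layout()
plt.savefig("trips_per_month.png", dpi=150)
```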
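And for the type-conversion note, a minimal sketch of casting a string column before doing arithmetic; the column name trip_length is a hypothetical example, not a prescribed field:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("type-casting").getOrCreate()
rideshare = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/rideshare_data.csv",
    header=True,
)

# "trip_length" is a hypothetical string column used for illustration;
# replace it with whichever numeric field a task actually needs.
rideshare = rideshare.withColumn("trip_length", col("trip_length").cast("float"))
rideshare.agg({"trip_length": "avg"}).show()
```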
There are two separate files you need to submit: (1) a PDF file, and (2) a zip file.
- Submit a single PDF report including (1) detailed explanations of each step and the APIs used to solve each task, (2) the visualization of your results (graphs or screenshots), (3) the challenges you encountered in each task and how you overcame them, and (4) the knowledge/insights you gained from each task. The report should not be included in the zip file.
- Submit a zip file including (1) a well-commented and organized Spark script for each task, and (2) your output results (screenshots or data files containing the data points).