In the Coursework, you will apply Spark techniques to the NYC Rideshare dataset, which focuses on analyzing New York Uber/Lyft data from January 1, 2023, to May 31, 2023. The pre-processed source data was provided by the NYC Taxi and Limousine Commission (TLC); the Rideshare dataset is part of the New York TLC trip record data. The dataset used in the Coursework is distributed under the MIT license. The source of the datasets is available at the TLC link below.
Useful Resources: Lectures, Labs, and other materials are shared in the module, along with the following links:
- https://sparkbyexamples.com/pyspark/
- https://sparkbyexamples.com/pyspark-tutorial/
- https://spark.apache.org/docs/3.1.2/api/python/getting_started/index.html
- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
You can find two CSV files under the path /data-repository-bkt/ECS765/rideshare_2023/: rideshare_data.csv and taxi_zone_lookup.csv. Use the ccc method bucket ls command to check them.
Dataset Schema
The two CSV files are described below. Please read the descriptions carefully; they will help you understand the dataset and the tasks.
taxi_zone_lookup.csv The taxi zone lookup CSV provides the details for each pickup_location/dropoff_location value in rideshare_data.csv. taxi_zone_lookup.csv has the following schema:
- LocationID: string (nullable = true)
- Borough: string (nullable = true)
- Zone: string (nullable = true)
- Service_zone: string (nullable = true)
The table below shows sample rows from taxi_zone_lookup.csv:
| LocationID | Borough | Zone | Service_zone |
|---|---|---|---|
| 1 | EWR | Newark Airport | EWR |
| 2 | Queens | Jamaica Bay | Boro Zone |
| 3 | Bronx | Allerton/Pelham Gardens | Boro Zone |
| 4 | Manhattan | Alphabet City | Yellow Zone |
| 5 | Staten Island | Arden Heights | Boro Zone |
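For reference, here is a minimal PySpark sketch that reads the lookup file and prints the schema and a few sample rows; it assumes the bucket path shown earlier is readable from your Spark session:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zone-lookup-schema").getOrCreate()

# Without an explicit schema, every column is read as a nullable string,
# matching the schema listed above.
zones = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/taxi_zone_lookup.csv",
    header=True,
)
zones.printSchema()
zones.show(5)
```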
As you can see, the pickup_location/dropoff_location fields in rideshare_data.csv are encoded as numbers whose counterparts (LocationID) you can find in taxi_zone_lookup.csv. You need to join the two datasets on these fields (see the sketch after this paragraph). Before you move on to the assignment, there are three notes you need to understand about taxi_zone_lookup.csv: (1) LocationIDs 264 and 265 have "Unknown" in the Borough field; treat "Unknown" as one of the borough names. (2) In the Borough field, the same borough can appear with different LocationIDs; this does not matter because you join on LocationID (the unique key). (3) If you see an obscure value (like NW, NA, N/A, etc.) in any field, simply treat it as a valid name.
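Continuing the sketch above, one possible shape of the double join is shown below; the Pickup_/Dropoff_ column names are illustrative, not prescribed:

```python
# Read the trip data, then join it twice against the lookup table --
# once per location field.
rideshare = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/rideshare_data.csv",
    header=True,
)

# Alias the lookup columns so the pickup and dropoff joins don't collide.
pickup = zones.withColumnRenamed("Borough", "Pickup_Borough") \
              .withColumnRenamed("Zone", "Pickup_Zone") \
              .withColumnRenamed("Service_zone", "Pickup_Service_zone")
dropoff = zones.withColumnRenamed("Borough", "Dropoff_Borough") \
               .withColumnRenamed("Zone", "Dropoff_Zone") \
               .withColumnRenamed("Service_zone", "Dropoff_Service_zone")

joined = (
    rideshare
    .join(pickup, rideshare.pickup_location == pickup.LocationID, "left")
    .drop(pickup.LocationID)
    .join(dropoff, rideshare.dropoff_location == dropoff.LocationID, "left")
    .drop(dropoff.LocationID)
)
joined.show(5)
```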
Note
- Write Spark scripts to answer the questions. You are allowed to use any Spark functions/APIs, not limited to the ones you learned in the module; any solution that does not use Spark scripts is invalid.
- For drawing graphs, you need to download your outputs; you can then use any visualization toolkit: Python's matplotlib (https://matplotlib.org/stable/users/index.html), Gnuplot (http://www.gnuplot.info), or any plotting tool of your preference. Note that matplotlib is readily available in your local Jupyter environment, and you cannot run matplotlib or Gnuplot on the Spark cluster. The method of plotting should be clear: if no script is used, the plotting must be reproducible, and all steps to reproduce it need to be described in detail in the report (see the plotting sketch after this list).
- You might need to convert fields from string type to another appropriate type (e.g., integer or float) when math operations are involved (see the casting sketch after this list).
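As an illustration of the plotting step, here is a minimal matplotlib sketch that runs locally, not on the Spark cluster; the file name results.csv and its columns month and trip_count are hypothetical placeholders for whatever output a task produces:

```python
import pandas as pd
import matplotlib.pyplot as plt

# "results.csv" and its columns ("month", "trip_count") are hypothetical;
# substitute the downloaded output and field names of the actual task.
df = pd.read_csv("results.csv")

plt.figure(figsize=(8, 4))
plt.bar(df["month"], df["trip_count"])
plt.xlabel("Month")
plt.ylabel("Number of trips")
plt.title("Trips per month (Jan-May 2023)")
plt.tight_layout()
plt.savefig("trips_per_month.png", dpi=150)
```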
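And for the type-conversion note, a minimal sketch of casting a string column before doing arithmetic; the column name trip_length is a hypothetical example, not a prescribed field:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("type-casting").getOrCreate()
rideshare = spark.read.csv(
    "/data-repository-bkt/ECS765/rideshare_2023/rideshare_data.csv",
    header=True,
)

# "trip_length" is a hypothetical string column used for illustration;
# replace it with whichever numeric field a task actually needs.
rideshare = rideshare.withColumn("trip_length", col("trip_length").cast("float"))
rideshare.agg({"trip_length": "avg"}).show()
```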
There are two separate files you need to submit: (1) a PDF file, and (2) a zip file.
- Submit a single PDF report including (1) detailed explanations of each step and the APIs used to solve each task, (2) the visualization of your results (graphs or screenshots), (3) the challenges you encountered in each task and how you overcame them, and (4) the knowledge/insights you gained from each task. The report should not be included in the zip file.
- Submit a zip file including (1) a well-commented and organized Spark script for each task, and (2) your output results (screenshots or data files containing the data points).