NYU Dataproc

Google Cloud Dataproc is a managed, cloud-based Hadoop and Spark service that NYU HPC provides for courses. It comes with the following interfaces for interacting with the cluster:

| Interface | Description | URL |
| --- | --- | --- |
| Command Line Interface Console | This is how you log into Dataproc. | https://dataproc.hpc.nyu.edu/ssh |
| Data Ingest Console | This is how you upload data into Dataproc. | https://dataproc.hpc.nyu.edu/ingest |
| MapReduce Job History | A web interface where you can see all MapReduce jobs that have run on the cluster. | https://dataproc.hpc.nyu.edu/jobhistory/ |
| Spark History Server | A web interface where you can see all Spark jobs that have run on the cluster. | https://dataproc.hpc.nyu.edu/sparkhistory/ |
| YARN Application Timeline | A web interface where you can see all applications (both Hadoop MapReduce and Spark) that have run on the cluster. | https://dataproc.hpc.nyu.edu/apphistory/ |
| YARN ResourceManager | YARN is the resource manager and job scheduler used by the Dataproc cluster. YARN allows you to use various data processing engines for batch, interactive, and real-time stream processing of data stored in HDFS. The YARN web interface allows you to see all current and recently executed applications, as well as information about the current state of the cluster (such as number of vCores and RAM free). | https://dataproc.hpc.nyu.edu/yarn/ |
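The cluster state shown in the YARN web interface (free vCores and RAM) is also published by a stock YARN ResourceManager through its REST API at `/ws/v1/cluster/metrics`. The sketch below shows how that JSON could be summarized in Python. It rests on assumptions not confirmed by this page: that the NYU deployment exposes the REST endpoint under the same base URL as the web interface, and that your script can authenticate (the page likely requires an NYU login, which this sketch does not handle). The helper names `summarize_cluster_metrics` and `fetch_cluster_metrics` are hypothetical, and the sample payload is made-up illustration data, not real cluster output.

```python
import json
from urllib.request import urlopen

# Assumption: the REST API is served under the same base URL as the YARN web UI.
YARN_BASE_URL = "https://dataproc.hpc.nyu.edu/yarn"


def summarize_cluster_metrics(payload: dict) -> dict:
    """Reduce a /ws/v1/cluster/metrics response to free/total vCores and RAM."""
    metrics = payload["clusterMetrics"]
    return {
        "free_vcores": metrics["availableVirtualCores"],
        "total_vcores": metrics["totalVirtualCores"],
        "free_ram_gb": round(metrics["availableMB"] / 1024, 1),
        "total_ram_gb": round(metrics["totalMB"] / 1024, 1),
        "apps_running": metrics["appsRunning"],
    }


def fetch_cluster_metrics(base_url: str = YARN_BASE_URL, timeout: float = 10.0) -> dict:
    """Fetch live metrics (hypothetical helper; does not handle NYU login)."""
    with urlopen(f"{base_url}/ws/v1/cluster/metrics", timeout=timeout) as resp:
        return json.load(resp)


# Offline illustration with a made-up payload in the shape the YARN REST API returns.
sample = {
    "clusterMetrics": {
        "availableVirtualCores": 4,
        "totalVirtualCores": 8,
        "availableMB": 8192,
        "totalMB": 16384,
        "appsRunning": 2,
    }
}
summary = summarize_cluster_metrics(sample)
print(summary)
```

If the endpoint is reachable from your session, `summarize_cluster_metrics(fetch_cluster_metrics())` would give the same summary for the live cluster; otherwise the web interface at the URL above remains the supported way to check cluster state.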