
Scala interview questions

What is RDD?
Ans: RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark. They represent the data coming into the system in object format, and are used for in-memory computations on large clusters in a fault-tolerant manner.

What is a sparse vector?
Ans: A sparse vector has two parallel arrays – one for indices and the other for values. These vectors store only the non-zero entries, to save space.

List some use cases where Spark outperforms Hadoop in processing.
Ans:
- Real-time querying – Spark is preferred over Hadoop for real-time querying of data.
- Stream processing – for processing logs and detecting fraud in live streams for alerts, Apache Spark is the best solution.
- Sensor data processing – Apache Spark's in-memory computing works best here, as data is retrieved and combined from different sources.
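The two-parallel-arrays idea behind a sparse vector can be sketched in plain Scala. Note that SparseVec below is a hypothetical illustration for this answer, not Spark's own type (Spark ships org.apache.spark.ml.linalg.SparseVector):

```scala
// Minimal sketch of a sparse vector: two parallel arrays, one holding the
// indices of the non-zero entries and one holding their values.
// (Hypothetical class for illustration only.)
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  // Look up the value at position i; an absent index means a zero entry.
  def apply(i: Int): Double = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else 0.0
  }
}

object SparseVecDemo extends App {
  // Dense form (0.0, 3.0, 0.0, 0.0, 5.0) stored sparsely:
  val v = SparseVec(5, Array(1, 4), Array(3.0, 5.0))
  println(v(1)) // 3.0 (stored entry)
  println(v(2)) // 0.0 (not stored, so zero)
}
```

Only two entries of the five are stored, which is where the space saving comes from on large, mostly-zero vectors.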


What is Shark?
Ans: Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background, to access Spark's MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries, and data.

What are the various levels of persistence in Apache Spark?
Ans: Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD they plan to reuse. Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels.
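As a sketch of how a persistence level is chosen (this assumes a Spark environment with spark-core on the classpath and an already-created SparkContext named sc, which is not shown here):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes `sc` is an existing SparkContext.
val nums = sc.parallelize(1 to 1000000)

// Common persistence levels; pick one per RDD:
//   MEMORY_ONLY      - deserialized objects in memory (what cache() uses)
//   MEMORY_AND_DISK  - spill partitions that don't fit in memory to disk
//   DISK_ONLY        - store partitions only on disk
//   MEMORY_ONLY_2    - like MEMORY_ONLY, but replicated on two nodes
nums.persist(StorageLevel.MEMORY_AND_DISK)

nums.sum() // first action computes the RDD and persists it
nums.sum() // subsequent actions reuse the persisted copy
```

The StorageLevel names above are the standard Spark ones; the point of the sketch is that persist() only marks the RDD, and the data is actually materialized the first time an action runs.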
