Azure Databricks Flashcards
What is Apache Spark, and what was it designed to replace?
Apache Spark is an open-source framework for doing big data processing.
It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses.
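The in-memory advantage shows up most clearly when the same dataset is reused by several computations. Below is a minimal PySpark sketch, assuming an existing SparkSession is passed in as spark (Databricks notebooks normally provide one under that name). The function name cache_demo and the row count are illustrative only.

```python
def cache_demo(spark, rows=1_000_000):
    """Run two actions over one cached DataFrame (illustrative helper)."""
    # spark.range builds a single-column DataFrame whose column is named "id".
    df = spark.range(rows)

    # cache() only marks the DataFrame for in-memory storage; nothing is
    # computed until the first action runs.
    df.cache()

    total_rows = df.count()  # first action: computes the data and caches it
    even_rows = df.filter(df["id"] % 2 == 0).count()  # can reuse the cached data

    df.unpersist()  # release the cached blocks when finished
    return total_rows, even_rows
```

In a notebook this could be called as cache_demo(spark). Without the cache() call, each action would recompute the DataFrame from its source.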
Main characteristics of Spark
Spark handles machine learning as well as data analytics tasks. It has a library called MLlib that includes a variety of pre-built algorithms, such as logistic regression, naive Bayes, and random forest. Its built-in neural network support is limited to a multilayer perceptron classifier, so for deep learning you would typically run another machine learning framework, such as TensorFlow, on Spark.
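As an illustration of MLlib's pre-built algorithms, here is a minimal sketch that fits a logistic regression model with the DataFrame-based pyspark.ml API. It assumes an existing SparkSession passed in as spark; the function name, the four training rows, and the hyperparameter values are made up for the example.

```python
def train_logistic_regression(spark):
    """Fit MLlib logistic regression on a tiny made-up dataset."""
    # Imported inside the function so it can be defined before a Spark
    # runtime is available; in a notebook these could sit at the top.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    # MLlib estimators expect a "label" column and a vector "features" column
    # by default.
    training = spark.createDataFrame(
        [
            (1.0, Vectors.dense([0.0, 1.1, 0.1])),
            (0.0, Vectors.dense([2.0, 1.0, -1.0])),
            (0.0, Vectors.dense([2.0, 1.3, 1.0])),
            (1.0, Vectors.dense([0.0, 1.2, -0.5])),
        ],
        ["label", "features"],
    )

    # Hyperparameters are arbitrary illustrative values.
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    model = lr.fit(training)

    # transform() appends prediction columns to the input DataFrame.
    return model.transform(training).select("label", "prediction")
```

Calling train_logistic_regression(spark).show() in a notebook would display the labels next to the model's predictions.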
What is Databricks?
In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s basically a managed implementation of Spark in the cloud, so you don’t have to worry about building clusters yourself. It also has a user-friendly interface for running code on clusters interactively.
Benefits of Azure Databricks
Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
What is the process to run Spark on Databricks?
You need a Databricks workspace, and within it you create (spin up) a compute cluster to run your Spark code.
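Clusters are usually created from the workspace UI, but the same settings can be expressed as a cluster specification. The sketch below only builds such a specification as JSON and does not send it anywhere. The field names follow the Databricks Clusters REST API as I understand it; the helper name build_cluster_spec and every value (runtime version, VM size, timeout) are placeholders to be checked against your own workspace.

```python
import json


def build_cluster_spec(name, num_workers=2):
    """Return a cluster specification dict (all values are placeholders)."""
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder Azure VM size
        "num_workers": num_workers,
        "autotermination_minutes": 30,        # shut down when idle to save cost
    }


# Print the specification as JSON, the format a create request would carry.
print(json.dumps(build_cluster_spec("flashcards-demo"), indent=2))
```

Once a cluster like this is running, a notebook in the workspace can be attached to it to execute Spark code.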
Explain Databricks Workspace and Notebooks
Azure Databricks is a data analytics platform optimized for the Microsoft Azure cloud services platform. Azure Databricks offers three environments for developing data-intensive applications: Databricks SQL, Databricks Data Science & Engineering, and Databricks Machine Learning.