Databricks Spark Developer Certification — Preparation Guide

Pavan Moganti
4 min read · Jul 22, 2020


On July 12th, 2020, I cleared my Spark certification from Databricks. I am writing this blog to share my learning experience with people who are planning to give this exam a shot, since a few people from my LinkedIn network asked me about the study strategy that helped me clear it.

Here is the link to the exam. This exam doesn't test you on Spark Streaming, machine learning, or graph processing; it concentrates only on DataFrames. I took the exam in Python, so I am not familiar with the Scala-specific questions, but I expect the Scala exam also concentrates mostly on API questions, since "DataFrame is simply a type alias of Dataset[Row]".

Preparation Tips:

  1. First, I recommend reading Spark: The Definitive Guide, chapters 1 to 19, skipping the content related to RDDs. The exam tests your ability to use DataFrames only.
  2. The exam doesn't require any working knowledge of Databricks notebooks; you can practice the API in Jupyter notebooks as well.
  3. Know which function to call for a given requirement. The PDF documentation provided during the exam will not be much help, since its search option is disabled.
  4. Get as much hands-on practice as you can by writing code for different scenarios. The multiple-choice options are often very close to each other, with only slight changes in syntax.
  5. Know which cluster configuration is more effective for a given workload: for example, 1 executor with 100 GB of RAM and 20 cores, or 10 executors with 10 GB of RAM and 2 cores each.
  6. Get a thorough understanding of Spark architecture, especially how a spark-submit application is split into jobs, stages, and tasks (see the sketch below).
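
To make point 6 concrete, here is a minimal PySpark sketch of my own (not from the exam or the book): the shuffle introduced by the groupBy creates a stage boundary, and the collect() action triggers the job you can then inspect in the Spark UI.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("job-stage-task-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)  # 8 partitions -> 8 tasks in stage 1

# Narrow transformation: no shuffle, stays in the same stage.
doubled = df.withColumn("double", col("id") * 2)

# Wide transformation: the groupBy shuffle forces a new stage.
counts = doubled.groupBy((col("id") % 10).alias("bucket")).count()

# The action triggers one job; open the Spark UI to see its two stages and
# the tasks inside each (one task per partition of that stage).
counts.collect()
```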

Resources:

While preparing for the exam, I read the Definitive Guide twice. During my first reading, I took each code chunk and the relevant dataset from the Spark Definitive Guide GitHub repo, uploaded them to DBFS in Databricks Community Edition, and executed the code to understand how the data transforms after each function call.

Running the code interactively helped me gain confidence and play with the DataFrame APIs. You can do the same as follows:

  1. Navigate to https://github.com/pavanmoganti/DataEngineering-SPARKguide and clone the SPARK Definitive Guide.dbc file.
  2. Create a free account on Databricks Community Edition and create a cluster.
  3. Import the SPARK Definitive Guide.dbc file into Databricks.
  4. Start reading the Definitive Guide and execute each code cell as you go. For example, the code from chapter 2, page 30, is much easier to follow when run in a Databricks notebook (a sketch of that kind of snippet appears after this list):
[Figure: the textbook code alongside a Databricks notebook showing 3 records and the schema of the DataFrame]

  5. This .dbc file includes code from chapters 1 to 19.
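
For reference, here is a rough sketch of the kind of snippet on that page. The file path is a placeholder: point it at wherever you uploaded the 2015-summary.csv flight dataset in DBFS.

```python
# Databricks notebooks predefine the `spark` session, so no builder is needed.
flightData2015 = (
    spark.read
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("/FileStore/tables/2015-summary.csv")  # placeholder: your DBFS upload path
)

flightData2015.take(3)        # the first three records, as in the screenshot
flightData2015.printSchema()  # the inferred schema of the DataFrame
```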

Additional Resources:

Once you are done reading the Definitive Guide, please refer to the ETL Part 1: Data Extraction and ETL Part 2: Data Transformation and Loads courses among the Databricks self-paced courses. I recommend doing their capstone projects independently, without referring to the solutions; this will gauge your ability to write code with the DataFrame APIs.

Some practice scenarios I found useful are listed below; minimal sketches of several of them follow the list.

  1. Adding a column to a DataFrame based on a condition on other columns (using when and otherwise, the DataFrame equivalent of SQL CASE)
  2. Handling null values and removing duplicates
  3. Filtering, sorting, renaming, and selecting
  4. Joining DataFrames — refer to the join challenges in chapter 8
  5. Column manipulations using regular expressions
  6. Creating UDFs and using them in subsequent steps
  7. Aggregating and using window functions
  8. Creating a schema based on the data requirements and reading CSV, JSON, and nested data formats
  9. Reading and writing different data formats with different save modes and partitioning on columns
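
To make a few of these concrete, here is a minimal sketch of scenarios 1 to 3, with made-up data and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("practice-1-3").getOrCreate()

# Hypothetical sample data for illustration.
df = spark.createDataFrame(
    [("alice", 34, None), ("bob", None, "NY"), ("alice", 34, None)],
    ["name", "age", "city"],
)

# 1. Conditional column with when/otherwise.
df = df.withColumn(
    "age_group",
    when(col("age") < 18, "minor").when(col("age") >= 18, "adult").otherwise("unknown"),
)

# 2. Handle nulls and remove duplicates.
cleaned = df.fillna({"city": "unknown"}).dropDuplicates(["name", "age"])

# 3. Filter, sort, rename, and select.
result = (
    cleaned.filter(col("age").isNotNull())
    .orderBy(col("age").desc())
    .withColumnRenamed("city", "location")
    .select("name", "age", "age_group", "location")
)
result.show()
```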
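Scenarios 4 to 6 can be sketched similarly (again, the tables and the suffix-stripping regex are purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("practice-4-6").getOrCreate()

# Hypothetical lookup tables for illustration.
people = spark.createDataFrame([(1, "Alice-01"), (2, "Bob-02")], ["id", "code"])
depts = spark.createDataFrame([(1, "eng"), (3, "sales")], ["id", "dept"])

# 4. Joins: an inner join would drop unmatched ids; left_outer keeps them with nulls.
joined = people.join(depts, on="id", how="left_outer")

# 5. Column manipulation with a regex: strip the trailing "-NN" suffix.
joined = joined.withColumn("code_clean", regexp_replace(col("code"), r"-\d+$", ""))

# 6. A UDF, then using its output in the next step.
shout = udf(lambda s: s.upper() if s else None, StringType())
joined.withColumn("code_upper", shout(col("code_clean"))).show()
```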
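And a sketch of scenarios 7 to 9 (the file paths here are placeholders for illustration):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, rank, sum as sum_
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("practice-7-9").getOrCreate()

sales = spark.createDataFrame(
    [("east", "a", 10), ("east", "b", 30), ("west", "c", 20)],
    ["region", "product", "amount"],
)

# 7. Aggregation, then a window function over each region.
totals = sales.groupBy("region").agg(sum_("amount").alias("total"))
w = Window.partitionBy("region").orderBy(col("amount").desc())
ranked = sales.withColumn("rank_in_region", rank().over(w))

# 8. An explicit schema instead of inferSchema (column names are illustrative).
schema = StructType([
    StructField("region", StringType(), True),
    StructField("amount", IntegerType(), True),
])
csv_df = spark.read.schema(schema).option("header", "true").csv("/tmp/sales.csv")

# 9. Writing with an explicit save mode and partitioning on a column.
ranked.write.mode("overwrite").partitionBy("region").parquet("/tmp/sales_parquet")
```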

Finally, get a good understanding of the Catalyst optimizer and how to use the Spark UI. Using Databricks gives you quick access to the Spark UI. Understand how Spark parallelizes a job, and how you can improve performance and minimize infrastructure cost.
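
An easy way to see Catalyst at work is DataFrame.explain(), which prints the plans the optimizer produced; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# explain() prints the physical plan Catalyst chose; extended=True also shows
# the parsed, analyzed, and optimized logical plans leading up to it.
df = spark.range(100).filter("id % 2 = 0").selectExpr("id * 2 AS doubled")
df.explain(extended=True)
```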

As mentioned in the certification FAQ, you will know the result (pass/fail) as soon as you submit your responses. The official score report becomes available about a week later; in my case, it took almost 7 days to get the official certificate.

I hope this article gives a pretty good outline of the learning path you can follow for the Spark certification exam. Please let me know in the comments which resources are helpful and which need improvement; I would be happy to update the article based on your feedback.

I also recently found a good Udemy course with practice exams, which is very helpful as a mock test for the real exam. Feel free to check out this link:

https://www.udemy.com/course/databricks-certified-developer-for-apache-spark-30-practice-exams/?referralCode=AF25CE1782C1C371DAF8

Good luck with your exam!
