
"Most of our exploration and testing in databricks can be achieved through its notebooks, but soon with scale this becomes a issue to maintain, deploy, version control these notebooks and files manually. To avoid all that we need a good repository setup with all features required to avoid these issues."
"Goal of repository should be: Easy builds for both Scala and Python, Should easily handle dependencies, Should support multiple projects / utils in same repository, Should be able to handle multiple task/jobs/pipelines in single project."
"While Scala is chosen for Gradle, you can still maintain Python files and notebooks in the same repository. Gradle here will mainly be used to support scala, but we can write some tasks in gradle to support automated builds for python files also."
"You might ask, why considering utils as seperate project instead of simple module, that is because in future we would like to use the same utils for any other repo, then we can simply build the jar for only utils."
Databricks notebooks become difficult to maintain, deploy, and version control at scale. A well-structured repository setup solves these issues by enabling easy builds for Scala and Python, handling dependencies efficiently, supporting multiple projects and utilities within one repository, and managing multiple tasks or pipelines per project. Gradle is recommended as the build tool for this setup, supporting both Scala and Python while maintaining flexibility. The initialization process involves creating a directory, initializing Git and Gradle with specific configurations (Application type, Scala language, Java 11, Groovy DSL, single application structure), then organizing projects as separate folders including dedicated utility projects for future reusability across repositories.
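The initialization step described above might look like this on the command line. A sketch under stated assumptions: the directory and project names are placeholders, and depending on the Gradle version some settings (such as the target Java version) are chosen through interactive prompts rather than flags.

```shell
# Create the repository directory and initialize version control
mkdir databricks-repo && cd databricks-repo
git init

# Initialize Gradle: application type, Scala language, Groovy DSL,
# single application structure (Java 11 selected when prompted)
gradle init --type scala-application --dsl groovy --project-name databricks-repo
```

After initialization, each project (including the dedicated utils project) is added as its own folder and registered in `settings.gradle`.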
#databricks-repository-setup #gradle-build-tool #multi-language-development #version-control #scala-and-python
Read at Medium