Databricks Contributes Spark Declarative Pipelines to Apache Spark
Databricks is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project as Spark Declarative Pipelines, simplifying the development of streaming pipelines.
Both cache() and persist() store an RDD/DataFrame/Dataset in memory (or on disk) to avoid recomputation. For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); for DataFrames/Datasets the default level is MEMORY_AND_DISK. persist() offers more control by letting you choose the storage level explicitly.
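The idea behind caching can be sketched in plain Python (no Spark needed): a lazily evaluated result is computed once and then served from memory. The `LazyDataset` class and its method names are illustrative inventions, not Spark APIs; real Spark caching is per-partition and managed by the BlockManager.

```python
# Plain-Python sketch of what cache() buys you: the expensive
# computation runs once, and later actions reuse the stored result.
class LazyDataset:
    def __init__(self, compute):
        self._compute = compute      # deferred computation, like an RDD's lineage
        self._cached = None
        self._is_cached = False
        self.compute_count = 0       # how many times we actually recomputed

    def cache(self):
        self._is_cached = True       # mark for caching, like rdd.cache()
        return self

    def collect(self):
        if self._is_cached and self._cached is not None:
            return self._cached      # served from "memory", no recomputation
        self.compute_count += 1
        result = self._compute()
        if self._is_cached:
            self._cached = result
        return result

ds = LazyDataset(lambda: [x * x for x in range(5)]).cache()
ds.collect()
ds.collect()
assert ds.compute_count == 1   # computed once; the second collect hit the cache
```

Without the `cache()` call, every `collect()` would re-run the lambda, which is exactly what happens in Spark when an uncached RDD is acted on repeatedly.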
Apache Spark: Fixing data skew with the salting technique (a practical example)
Data skew in Apache Spark is a performance issue where a few keys dominate the data distribution, leading to uneven partitions and slow queries, especially during operations that require shuffling.
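The salting idea can be shown in plain Python (no Spark cluster required): appending a random suffix splits one hot key into several sub-keys, so hash partitioning spreads its rows across partitions instead of piling them onto one. The partition counts and bucket sizes below are arbitrary choices for illustration.

```python
import random

NUM_PARTITIONS = 8   # illustrative; in Spark this would be the shuffle partition count
SALT_BUCKETS = 8     # how many sub-keys each hot key is split into

def partition_for(key, n=NUM_PARTITIONS):
    # stand-in for Spark's hash partitioner
    return hash(key) % n

# Skewed data: one dominant key, one rare key.
rows = [("hot", i) for i in range(1000)] + [("rare", i) for i in range(10)]

# Without salting, every "hot" row lands in the same partition.
unsalted = {}
for key, value in rows:
    unsalted.setdefault(partition_for(key), []).append((key, value))

# With salting: append a random suffix ("hot_0" .. "hot_7"), so the
# hot key's rows hash to several partitions. A real pipeline would
# aggregate per salted key first, then strip the salt and aggregate again.
salted = {}
for key, value in rows:
    salted_key = f"{key}_{random.randrange(SALT_BUCKETS)}"
    salted.setdefault(partition_for(salted_key), []).append((salted_key, value))

# The single hot partition holds ~1000 rows unsalted, while salting
# spreads those rows over several partitions.
print(max(len(v) for v in unsalted.values()))
print(max(len(v) for v in salted.values()))
```

In Spark the same trick is applied before a skewed join or aggregation: salt the skewed side, replicate the small side across all salt values, then remove the salt in a second aggregation step.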
Tokenization is a crucial step in natural language processing (NLP): it breaks sentences into individual tokens, the basic units that downstream machine learning models consume.
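A minimal sketch of the idea: lowercase the text and split on non-word characters, which is roughly what basic tokenizers (e.g. Spark ML's Tokenizer/RegexTokenizer) do by default. The `tokenize` helper below is an illustrative function, not a library API.

```python
import re

def tokenize(sentence: str) -> list:
    # lowercase, split on runs of non-word characters, drop empty strings
    return [t for t in re.split(r"\W+", sentence.lower()) if t]

tokens = tokenize("Tokenization is a crucial step!")
# tokens == ["tokenization", "is", "a", "crucial", "step"]
```

Production tokenizers handle many more cases (contractions, punctuation that carries meaning, subword units), but the core operation is the same splitting step.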
The Word Count program is the canonical introductory example for distributed computing frameworks, demonstrating how to count word occurrences using transformations such as flatMap and reduceByKey.
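The Word Count pipeline can be mimicked in plain Python to show what each Spark transformation contributes: `flat_map` mirrors flatMap (one line expands into many words) and `reduce_by_key` mirrors reduceByKey (values sharing a key are merged with a binary function). Both helpers are illustrative stand-ins, not Spark APIs.

```python
from itertools import chain

def flat_map(func, items):
    # like Spark's flatMap: apply func, then flatten the results
    return list(chain.from_iterable(func(x) for x in items))

def reduce_by_key(func, pairs):
    # like Spark's reduceByKey: merge values that share a key
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

lines = ["to be or not to be"]
words = flat_map(str.split, lines)                  # flatMap: lines -> words
pairs = [(w, 1) for w in words]                     # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # reduceByKey: sum the 1s
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

In real Spark the same shape reads `rdd.flatMap(...).map(lambda w: (w, 1)).reduceByKey(add)`, with the reduceByKey step triggering a shuffle so that all pairs for a given word meet on one partition.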