#apache-spark

fromInfoQ
3 weeks ago

Databricks Contributes Spark Declarative Pipelines to Apache Spark

Databricks is contributing the technology behind Delta Live Tables (DLT) to the Apache Spark project as Spark Declarative Pipelines, simplifying the development of streaming pipelines.
Data science
fromMedium
1 month ago

Leveraging Broadcast Joins in Apache Spark (Scala)

Broadcast joins optimize Spark for faster dataset joins by broadcasting smaller datasets, avoiding costly shuffle operations.
Scala
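
As a rough illustration of the broadcast-join idea summarized above, here is a minimal sketch. It assumes a local Spark session and two made-up tables; the names `orders`, `countries`, and `joinWithBroadcast` are hypothetical and do not come from the linked article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Hint Spark to ship the small table to every executor, so the large
  // table can be joined in place without being shuffled.
  def joinWithBroadcast(large: DataFrame, small: DataFrame, key: String): DataFrame =
    large.join(broadcast(small), Seq(key))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: a "large" fact table and a small lookup table.
    val orders = Seq((1, "US", 20.0), (2, "DE", 35.5), (3, "US", 12.0))
      .toDF("order_id", "country_code", "amount")
    val countries = Seq(("US", "United States"), ("DE", "Germany"))
      .toDF("country_code", "country_name")

    val joined = joinWithBroadcast(orders, countries, "country_code")
    joined.explain() // inspect the physical plan to see which join strategy was chosen
    joined.show()

    spark.stop()
  }
}
```

The `broadcast` function is only a hint; Spark may also broadcast small tables on its own when they fall under `spark.sql.autoBroadcastJoinThreshold`.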
fromMedium
1 month ago

From Frustrating to Fast: Speeding Up Spark Tests Using Shared Sessions

Using a shared Spark session significantly reduces the execution time for unit tests in Spark jobs.
fromMedium
1 month ago

RDD vs DataFrame vs Dataset in Apache Spark: Which One Should You Use and Why

Understanding Spark's APIs—RDD, DataFrame, and Dataset—saves time and boosts efficiency in big data processing.
fromMedium
1 month ago

Frequent Spark Interview Questions: Part 2

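Both cache() and persist() store an RDD/DataFrame/Dataset in memory (or on disk) to avoid recomputation. For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); for DataFrames and Datasets, cache() defaults to MEMORY_AND_DISK. persist() accepts an explicit storage level for more control.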
Scala
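
The cache() versus persist() distinction above can be checked directly. The following is a small sketch, assuming a local Spark session; the object name `CachePersistSketch` and the helper `cachedLevels` are hypothetical, chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachePersistSketch {
  // Returns the storage levels Spark assigns when cache() is called
  // on an RDD and on a DataFrame, in that order.
  def cachedLevels(spark: SparkSession): (StorageLevel, StorageLevel) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000).cache()
    val df = spark.range(1000).toDF("id").cache()
    val levels = (rdd.getStorageLevel, df.storageLevel)
    rdd.unpersist()
    df.unpersist()
    levels
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-persist-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val (rddLevel, dfLevel) = cachedLevels(spark)
    println(s"RDD cache() level:       $rddLevel")
    println(s"DataFrame cache() level: $dfLevel")

    // persist() lets the caller pick the level explicitly.
    val onDisk = spark.range(1000).toDF("id").persist(StorageLevel.DISK_ONLY)
    onDisk.count() // caching is lazy; an action materializes it
    onDisk.unpersist()

    spark.stop()
  }
}
```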
#data-engineering
fromMedium
2 months ago
Data science

Day 6-Sessionization of Web Logs using Time Difference | Apache Spark Interview Problem.

fromMedium
2 months ago
Data science

Understanding the load() Function in Apache Spark: Syntax, Examples, and Best Practices

fromMedium
2 months ago

Apache Spark: Fix data skew issue using salting technique (practical example)

Data skew in Apache Spark is a performance issue where a few keys dominate the data distribution, leading to uneven partitions and slow queries, especially during operations that require shuffling.
Data science
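
One common way to apply salting to a skewed join is sketched below. This is not taken from the linked article: the helper `saltedJoin`, the `salt` column name, and the bucket count are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit, rand}

object SaltingSketch {
  // Spreads each hot key over `buckets` sub-keys:
  //  - the skewed side gets a random salt in [0, buckets)
  //  - the other side is replicated once per salt value
  // Joining on (key, salt) returns the same rows as joining on key alone.
  def saltedJoin(skewed: DataFrame, other: DataFrame, key: String, buckets: Int): DataFrame = {
    val saltedLeft = skewed.withColumn("salt", (rand() * buckets).cast("int"))
    val allSalts = array((0 until buckets).map(i => lit(i)): _*)
    val saltedRight = other.withColumn("salt", explode(allSalts))
    saltedLeft.join(saltedRight, Seq(key, "salt")).drop("salt")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("salting-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data in which the key "hot" dominates the left side.
    val events = (Seq.fill(1000)("hot") ++ Seq("cold", "warm")).toDF("key")
    val labels = Seq(("hot", 1), ("cold", 2), ("warm", 3)).toDF("key", "label")

    saltedJoin(events, labels, "key", buckets = 8).groupBy("key").count().show()
    spark.stop()
  }
}
```

The trade-off is that the non-skewed side grows by a factor of `buckets`, so this tends to pay off only when that side is comparatively small.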
fromMedium
2 months ago

Scala #15: Spark: Text Feature Transformers

Tokenization, which breaks sentences into individual tokens, and HashingTF are essential steps in preparing text data for machine learning in Spark.
Scala
fromMedium
2 months ago

Data Quality Verification with Deequ: A Practical Approach Using Scala

Utilizing Deequ and Scala for efficient and automated data validation is highly effective for managing large datasets.
#big-data
#data-processing
fromMedium
3 months ago
Data science

Big Data for the Data Science-Driven Manager 03- Apache Spark Explained for Managers

Apache Spark is crucial for efficiently processing large datasets in modern enterprises.
fromMedium
5 months ago
Scala

Counting Files Using Spark and Scala with Regex Matching

Leveraging Apache Spark and regex can streamline the process of counting files based on naming patterns in large datasets.
#dataframe
fromawstip.com
3 months ago
Data science

Spark Scala Exercise 5: Column Operations with DataFrames: A Complete Guide for Data Engineers

fromMedium
3 months ago

Word Count Program

The Word Count program is the standard introductory example for distributed computing frameworks, demonstrating how to count word occurrences using methods such as flatMap and reduceByKey.
Data science
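
For reference, the flatMap and reduceByKey pattern mentioned above might look like the following in Spark's RDD API. This is a minimal sketch, assuming a local session and a tiny in-memory input; the helper name `countWords` is hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  // Splits each line into words, pairs every word with 1,
  // then sums the counts per word across partitions.
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    val lines = spark.sparkContext.parallelize(
      Seq("to be or not to be", "that is the question"))

    countWords(lines).collect().foreach { case (word, n) => println(s"$word: $n") }
    spark.stop()
  }
}
```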
#scala
fromMedium
4 months ago
Scala

21 Days of Spark Scala: Day 3-Exploring Case Classes: The Building Blocks of Functional...

fromMedium
4 months ago
Scala

21 Days of Spark Scala: Day 4-Immutable Collections in Scala: Why They Matter for Big Data

Embracing immutability in Scala enhances safety and predictability in big data processing.
Scala
fromMedium
5 months ago

Scala Vs. Python-What Data Engineers Need To Know

Scala improves upon Java while remaining JVM-compatible, making it attractive for organizations.
fromMedium
5 months ago

Testing MySQL in Spark: Fake It Till You Make It with H2!

MySQL is a reliable, open-source RDBMS ideal for structured data management and integrates with Apache Spark for seamless data operations.