#spark

#data-processing

Efficient Scala BigQuery Data Retrieval: A Comprehensive Guide

You can use the spark-bigquery connector to read data from BigQuery tables directly into Spark DataFrames.
It is essential to set GCP credentials, specify the table path correctly, and include necessary dependencies to connect with BigQuery.

Customer Segmentation with Scala on GCP Dataproc

Customer segmentation can be effectively performed using k-means clustering in Spark after addressing missing data.

Scala #14: Spark: Pipeline

End-to-end ML pipelines in Spark automate and streamline machine learning processes, improving productivity and efficiency.

Deploy a Scala Spark job on GCP Dataproc with IntelliJ

Creating a Scala Spark job on GCP Dataproc involves setting up IntelliJ, adding Spark dependencies, and writing the job code.

How to feel the spark (and keep it alive) from first date to 50th anniversary

The spark in relationships is a combination of initial excitement and deep contentment, vital for long-term affinity.

MLOps With Databricks and Spark - Part 1 | HackerNoon

This series provides a practical approach to implementing MLOps using Databricks and Spark.

TABLE JOIN cheat sheet

The cheat sheet is a comprehensive resource for merging datasets in SQL, Spark, and Python pandas, including cross joins.
#scala

Hadoop and Spark on Ubuntu 22.04 LTS with Canada 2021 Census data

Step-by-step guide for configuring Hadoop and Spark on Ubuntu 22.04 LTS
Demonstrating CSV file loading into HDFS and data manipulation with Spark using Scala

Scala Jobs on AWS Glue: A Practical Guide to Development, Local Testing and Deployment

AWS Glue is highly scalable, cost-effective, and integrates well with other AWS services for orchestrating complex pipelines.
Large Python-based PySpark jobs on AWS Glue can perform poorly because moving data between the JVM and Python processes is expensive.

Windows Jupyter Almond Scala

Jupyter Notebook is more effective for debugging Spark programs than IDEs such as IntelliJ IDEA.

Analisis de la Felicidad Mundial

Spark can process data in memory, which is faster than traditional disk-based systems.
Lazy evaluation in Spark defers transformations until an action needs their results, which avoids unnecessary computation and memory use.

Time Series Feature Engineering in Apache Spark for Python with Scala

Feature engineering is crucial for unlocking insights from complex data sets.
Time series feature engineering requires specialized methods due to temporal dependencies.

#data-engineering

Spark Starter Guide 4.13: Importing Data from a Relational Database (MySQL)

Relational databases are vital for operational data but can also hold valuable analytics data.
Spark simplifies accessing databases to populate Spark DataFrames for analysis.

Why to avoid multiple chaining of withColumn() function in Spark job.

Chaining multiple withColumn() calls in Spark may lead to performance issues and inefficient resource usage.

Caching in Spark | What? How? Why?

Lazy evaluation in Spark allows for optimized execution plans.
Caching within Spark helps in avoiding recomputation of RDDs.

Run your first analysis project on Apache Zeppelin using Scala (Spark), Shell, and SQL

Move Spark directory to /opt/spark
Configure JAVA_HOME and SPARK_HOME variables
Adjust Zeppelin configuration in the 'Interpreter' section

Spark Essentials: A Guide to Setting up and Running Spark projects with Scala and sbt

This article provides a detailed guide to initializing a Spark project with the Scala build tool sbt.
The guide covers creating projects, managing dependencies, local testing, compilation, and deployment on a cluster.

Working with Different Types of Data in Spark

DataFrame operations are available as DataFrame (Dataset) methods and as Column methods
Functions in pyspark.sql.functions cover a range of data types