Spark

Install

Requires a JDK; see Software.list.JDK.

# Forward the master web UI and cluster ports from the remote host
ssh -L 8080:localhost:8080 paul@10.10.64.125
ssh -L 7077:localhost:7077 paul@10.10.64.125

# Start the master, with the web UI on a custom port
./sbin/start-master.sh --webui-port 8090

# Check the web UI at localhost:8090

# Start a worker against the master; --memory caps the worker's memory
./sbin/start-worker.sh spark://pop-os.localdomain:7077 --memory 10G

# start-slave.sh is the deprecated pre-3.1 alias for start-worker.sh
./sbin/start-slave.sh spark://pop-os.localdomain:7077 --memory 4G

Stop

./sbin/stop-all.sh     # stop the master and all workers
./sbin/stop-worker.sh
./sbin/stop-slave.sh   # deprecated alias for stop-worker.sh

# Launch the PySpark shell from the Spark distribution
export SPARK_HOME=/home/paul/Projects/Software/spark-3.2.1-bin-hadoop3.2
$SPARK_HOME/bin/pyspark

$SPARK_HOME/bin/spark-submit \
  --master spark://127.0.0.1:7077 \
  --packages 'com.somesparkjar.dependency' \
  --py-files packages.zip \
  --files configs/etl_config.json \
  jobs/etl_job.py
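Inside the submitted job, files shipped with --files are staged on every node and can be resolved with SparkFiles. A minimal sketch of what jobs/etl_job.py might look like (the job body here is an assumption, not the actual script):

import json

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Hypothetical entry point; the real etl_job.py is not shown in these notes.
spark = SparkSession.builder.appName('etl_job').getOrCreate()

# --files stages etl_config.json on each node; SparkFiles.get resolves its local path.
with open(SparkFiles.get('etl_config.json')) as f:
    config = json.load(f)

print(config)
spark.stop()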
  • How do I test a connection to a Spark cluster? (see the smoke test under Connect below)
  • pip3 install pipenv
  • QPS
  • How do I get access to the dashboard?
    • localhost:8080

Connect

  • EMR: SSH in with port forwarding, then connect to localhost
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Connect to the standalone master started above
spark = SparkSession.builder.master("spark://pop-os.localdomain:7077").getOrCreate()
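A minimal smoke test for the TODO above, using the spark session just created (the row count is arbitrary):

# A trivial distributed job; if this prints 1000, the cluster accepted and ran work.
print(spark.range(1000).count())
print(spark.version)
print(spark.sparkContext.master)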

AWS EMR

aws emr create-cluster \
  --name pauls-erm-cluster \
  --use-default-roles \
  --release-label emr-5.28.0 \
  --instance-count 3 \
  --applications Name=Spark  \
  --ec2-attributes KeyName=spark-cluster \
  --instance-type m5.xlarge 
aws emr describe-cluster --cluster-id j-15X6GZ29LDDWQ
# Cluster ARN, returned in the describe output:
# arn:aws:elasticmapreduce:us-east-1:223248724517:cluster/j-15X6GZ29LDDWQ

aws emr list-clusters
aws emr ssh --cluster-id j-2P9C90OTJUUXO --key-pair-file ~/.ssh/spark-cluster.pem
aws emr socks --cluster-id j-2P9C90OTJUUXO --key-pair-file ~/.ssh/spark-cluster.pem  # SOCKS tunnel to the master node, for the web UIs
aws emr terminate-clusters --cluster-ids j-15X6GZ29LDDWQ

Core Functions

  • SparkContext, SparkConf
  • General Functions
    • select()
    • filter()
    • where()
    • sort()
    • dropDuplicates()
    • withColumn()
  • Aggregate Functions
    • count()
    • countDistinct()
    • avg()
    • max()
    • min()
    • groupBy()
    • agg()
  • UDFs and pyspark.sql.types
  • Window Functions
    • partitionBy
    • rangeBetween
    • rowsBetween
  • Spark SQL and DataFrames - Spark 3.2.1 Documentation
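A minimal sketch exercising a few of the functions listed above (local master, made-up column names and data):

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.master('local[*]').getOrCreate()

df = spark.createDataFrame(
    [('a', 1), ('a', 2), ('b', 3), ('b', 3)],
    schema=['key', 'val'],
)

# General functions
df.select('key', 'val').filter(F.col('val') > 1).sort('val').dropDuplicates().show()
df.withColumn('doubled', F.col('val') * 2).show()

# Aggregate functions
df.groupBy('key').agg(
    F.count('val').alias('n'),
    F.countDistinct('val').alias('n_distinct'),
    F.avg('val').alias('mean'),
    F.max('val').alias('max'),
    F.min('val').alias('min'),
).show()

# UDF with an explicit return type from pyspark.sql.types
shout = F.udf(lambda s: s.upper(), T.StringType())
df.select(shout('key').alias('KEY')).show()

# Window function: running sum per key
w = Window.partitionBy('key').orderBy('val').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('running_sum', F.sum('val').over(w)).show()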

Count rows matching a condition on a specific column

dataframe.select('ID').where(dataframe.ID == 4).count()

Count values by condition in PySpark Dataframe - GeeksforGeeks

Distinct Specific Column

df.select('col_name').distinct().show()

scala - Fetching distinct values on a column using Spark DataFrame - Stack Overflow

Count in specific columns

Count values by condition in PySpark Dataframe - GeeksforGeeks
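A sketch of the pattern from the link above, assuming a DataFrame df with a status column (both made up):

import pyspark.sql.functions as F

# Count rows matching a condition
df.filter(df.status == 'active').count()

# Several conditional counts in one pass; when() without otherwise() yields null,
# and count() skips nulls, so each count() only sees matching rows.
df.agg(
    F.count(F.when(F.col('status') == 'active', 1)).alias('active'),
    F.count(F.when(F.col('status') == 'inactive', 1)).alias('inactive'),
).show()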

Count Distinct

Spark DataFrame: count distinct values of every column - Stack Overflow
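Per the link above, a sketch for one column and for every column (df and col_name assumed):

import pyspark.sql.functions as F

# Distinct count of a single column
df.select(F.countDistinct('col_name')).show()

# Distinct count of every column at once
df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in df.columns)).show()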

Sorting

PySpark Sort | How PySpark Sort Function works in PySpark?
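Common sort patterns (df and column names assumed):

import pyspark.sql.functions as F

df.sort('col_name').show()                  # ascending by default
df.sort(F.col('col_name').desc()).show()    # descending
df.orderBy(F.desc('a'), F.asc('b')).show()  # orderBy is an alias of sort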