Spark
TODO
Similar Software
Bookmarks
Install
- docker-spark/docker-compose.yml at master · big-data-europe/docker-spark
- Submitting pyspark script to a remote Spark server? - Stack Overflow
- PierreKieffer/docker-spark-yarn-cluster: Docker multi-nodes Hadoop cluster with Spark 2.4.1 on Yarn
- mohsenasm/spark-on-yarn-cluster: A Procedure To Create A Yarn Cluster Based on Docker, Run Spark, And Do TPC-DS Performance Test.
ssh -L 8080:localhost:8080 paul@10.10.64.125
ssh -L 7077:localhost:7077 paul@10.10.64.125
./sbin/start-master.sh --webui-port 8090
# Check the master web UI at localhost:8090
./sbin/start-worker.sh spark://pop-os.localdomain:7077 --memory 10G
# start-slave.sh is the older name for start-worker.sh
./sbin/start-slave.sh spark://pop-os.localdomain:7077 --memory 4G
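A quick way to confirm both daemons came up (assumes the JDK's jps tool is on PATH):
# List running JVMs; the standalone daemons show up as "Master" and "Worker"
jps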
## Stop
./sbin/stop-all.sh
./sbin/stop-worker.sh
./sbin/stop-slave.sh  # older name for stop-worker.sh
export SPARK_HOME=/home/paul/Projects/Software/spark-3.2.1-bin-hadoop3.2
$SPARK_HOME/bin/pyspark
$SPARK_HOME/bin/spark-submit \
  --master spark://127.0.0.1:7077 \
  --packages 'com.somesparkjar.dependency' \
  --py-files packages.zip \
  --files configs/etl_config.json \
  jobs/etl_job.py
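To smoke-test the cluster before submitting real work, the pi example that ships in the Spark binary distribution works well (path assumes the spark-3.2.1-bin-hadoop3.2 layout above):
$SPARK_HOME/bin/spark-submit \
  --master spark://127.0.0.1:7077 \
  $SPARK_HOME/examples/src/main/python/pi.py 10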
- How do I test a connection to a Spark cluster? (see the pi smoke test above and the sanity check under Connect)
pip3 install pipenv
- QPS
- How do I get access to the dashboard?
localhost:8080
Connect
- EMR: SSH in with port forwarding, then connect via localhost
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
spark = SparkSession.builder.master("spark://pop-os.localdomain:7077").getOrCreate()
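A minimal sanity check once the session exists: run a trivial job and confirm the app shows up under Running Applications in the master UI (localhost:8080).
# If this returns, the master accepted the app and a worker executed the tasks
spark.range(1000).count()
# Confirm which master the session is attached to
print(spark.sparkContext.master)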
AWS EMR
aws emr create-cluster \
  --name pauls-emr-cluster \
  --use-default-roles \
  --release-label emr-5.28.0 \
  --instance-count 3 \
  --applications Name=Spark \
  --ec2-attributes KeyName=spark-cluster \
  --instance-type m5.xlarge
aws emr describe-cluster --cluster-id j-15X6GZ29LDDWQ
arn:aws:elasticmapreduce:us-east-1:223248724517:cluster/j-15X6GZ29LDDWQ
aws emr list-clusters
aws emr ssh --cluster-id j-2P9C90OTJUUXO --key-pair-file ~/.ssh/spark-cluster.pem
aws emr socks --cluster-id j-2P9C90OTJUUXO --key-pair-file ~/.ssh/spark-cluster.pem
aws emr terminate-clusters --cluster-ids j-15X6GZ29LDDWQ
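To run a job on the cluster rather than just SSH in, aws emr add-steps submits a Spark step. A sketch, assuming the script was uploaded to S3 first (the bucket and path are placeholders):
aws emr add-steps \
  --cluster-id j-2P9C90OTJUUXO \
  --steps Type=Spark,Name=etl-job,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/jobs/etl_job.py]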
Core Functions
- SparkContext, SparkConf
- General Functions
  - select()
  - filter()
  - where()
  - sort()
  - dropDuplicates()
  - withColumn()
- Aggregate Functions
  - count()
  - countDistinct()
  - avg()
  - max()
  - min()
  - groupBy()
  - agg()
- UDFs and pyspark.sql.types
- Window Functions
  - partitionBy
  - rangeBetween
  - rowsBetween
- Spark SQL and DataFrames - Spark 3.2.1 Documentation
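A compact sketch exercising most of the functions above on a toy DataFrame (column names and data are made up for illustration):
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F
import pyspark.sql.types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2), ("b", 2)], ["key", "val"])

# General functions: select / where / sort / dropDuplicates / withColumn
df.select("key", "val").where(F.col("val") > 1).sort("val").dropDuplicates().show()
df = df.withColumn("doubled", F.col("val") * 2)

# Aggregates: groupBy + agg
df.groupBy("key").agg(
    F.count("*").alias("n"),
    F.avg("val").alias("mean"),
    F.countDistinct("val").alias("n_distinct"),
).show()

# UDF with an explicit return type from pyspark.sql.types
shout = F.udf(lambda s: s.upper(), T.StringType())
df.select(shout("key").alias("KEY")).show()

# Window function: running sum of val within each key
w = Window.partitionBy("key").orderBy("val").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.select("key", "val", F.sum("val").over(w).alias("running_sum")).show()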
Count rows matching a value in a specific column
dataframe.select('ID').where(dataframe.ID == 4).count()
Count values by condition in PySpark Dataframe - GeeksforGeeks
Distinct Specific Column
df.select('col_name').distinct().show()
scala - Fetching distinct values on a column using Spark DataFrame - Stack Overflow
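Per that thread, to pull the distinct values back to the driver as a plain Python list (fine when cardinality is low):
distinct_vals = [row['col_name'] for row in df.select('col_name').distinct().collect()]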
Count in specific columns
Count values by condition in PySpark Dataframe - GeeksforGeeks
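No snippet was saved here; the usual conditional-count pattern looks like this (column names are hypothetical; F is pyspark.sql.functions as imported above):
# when() leaves non-matching rows null, and count() skips nulls,
# so this counts matches per condition in one pass
df.select(
    F.count(F.when(F.col("col_a") > 0, True)).alias("col_a_positive"),
    F.count(F.when(F.col("col_b").isNull(), True)).alias("col_b_nulls"),
).show()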
Count Distinct
Spark DataFrame: count distinct values of every column - Stack Overflow
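For the linked question, distinct counts for every column in one job (again with F = pyspark.sql.functions):
df.agg(*(F.countDistinct(F.col(c)).alias(c) for c in df.columns)).show()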