Welcome to our guide on how to install Apache Spark on Ubuntu 20.04/18.04 and Debian 10/9/8. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It is a fast, unified analytics engine used for big data and machine learning processing.
Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
Before we install Apache Spark on Ubuntu / Debian, let’s update our system packages.
sudo apt update
sudo apt -y upgrade
Consider a system reboot after the upgrade if one is required.
[ -f /var/run/reboot-required ] && sudo reboot -f
Now use the steps shown next to install Spark on your Ubuntu / Debian system.
Apache Spark requires Java to run, so let's make sure Java is installed on our Ubuntu / Debian system.
For default system Java:
sudo apt install curl mlocate default-jdk -y
Verify Java version using the command:
$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
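Some Spark tooling also expects the JAVA_HOME variable to be set. As a quick, hedged example, assuming the default-jdk package created the usual /usr/lib/jvm/default-java symlink on your system, you can persist it like this:
# Assumes the default-java symlink exists (created by default-jdk on Ubuntu / Debian)
echo "export JAVA_HOME=/usr/lib/jvm/default-java" >> ~/.bashrc
source ~/.bashrc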
If the add-apt-repository command is missing, check How to Install add-apt-repository on Debian / Ubuntu.
Download the latest release of Apache Spark from the downloads page. As of this update, the version used in this guide is 3.1.1.
curl -O https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
Extract the Spark tarball.
tar xvf spark-3.1.1-bin-hadoop3.2.tgz
Move the Spark folder created after extraction to the /opt/ directory.
sudo mv spark-3.1.1-bin-hadoop3.2/ /opt/spark
Open your bashrc configuration file.
vim ~/.bashrc
Add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate the changes.
source ~/.bashrc
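To confirm the variables took effect and the Spark binaries resolve from your PATH, you can run a quick check (output will vary with your installation):
echo $SPARK_HOME
spark-submit --version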
You can now start a standalone master server using the start-master.sh command.
$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out
The process will be listening on TCP port 8080.
$ sudo ss -tunelp | grep 8080
tcp LISTEN 0 1 *:8080 *:* users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <->
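If you will access the Web UI from another machine and a firewall is active, open the relevant ports first. A minimal sketch assuming UFW is in use, with 8080 for the master Web UI and 7077 for worker and client connections:
sudo ufw allow 8080/tcp
sudo ufw allow 7077/tcp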
The Web UI, accessible at http://<server-IP>:8080, looks like the screenshot below.
My Spark URL is spark://ubuntu:7077.
The start-slave.sh command is used to start a Spark worker process.
$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out
If you don’t have the script in your $PATH, you can first locate it.
$ sudo updatedb
$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh
You can also use the absolute path to run the script.
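For example, the worker can be started with the absolute path, optionally capping the resources it offers to the master. The -c (cores) and -m (memory) options are standard start-slave.sh flags; the values here are only illustrative:
# Offer 2 cores and 2 GB of memory to the master (adjust to your hardware)
/opt/spark/sbin/start-slave.sh spark://ubuntu:7077 -c 2 -m 2G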
Use the spark-shell command to access Spark Shell.
$ /opt/spark/bin/spark-shell
21/04/27 08:49:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:49:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513355938).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
If you’re more of a Python person, use pyspark.
$ /opt/spark/bin/pyspark
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/04/27 08:50:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:50:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:50:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.1
/_/
Using Python version 3.8.5 (default, Jan 27 2021 15:41:15)
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513411109).
SparkSession available as 'spark'.
>>>
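For a quick end-to-end test against the standalone master, you can submit one of the example applications bundled with the Spark distribution. The jar path below matches the Spark 3.1.1 / Scala 2.12 build used in this guide; adjust it if your version differs:
# Run the bundled SparkPi example on the standalone master
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://ubuntu:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar 10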
Easily shut down the master and slave Spark processes using the commands below.
$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-master.sh
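Alternatively, the bundled stop-all.sh script stops the master and all workers configured on the host in one step:
$ $SPARK_HOME/sbin/stop-all.sh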
There you have it. Read more in the official Spark documentation.