
Install Apache Spark on Ubuntu 20.04/18.04 & Debian 10/9 | ComputingForGeeks

Josphat Mutai

Welcome to our guide on how to install Apache Spark on Ubuntu 20.04/18.04 & Debian 10/9. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It is a fast, unified analytics engine used for big data and machine learning processing.

Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Before we install Apache Spark on Ubuntu / Debian, let’s update our system packages.

sudo apt update
sudo apt -y upgrade

Reboot the system if the upgrade requires it.

[ -f /var/run/reboot-required ] && sudo reboot -f

Now use the steps shown next to install Spark on Ubuntu 20.04/18.04 / Debian 10/9.

Step 1: Install Java

Apache Spark requires Java to run, so let’s make sure we have Java installed on our Ubuntu / Debian system.

For default system Java:

sudo apt install curl mlocate default-jdk -y

Verify Java version using the command:

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
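
If you would rather pin a specific JDK than rely on whatever default-jdk resolves to, OpenJDK 11 can be installed explicitly. A minimal sketch, assuming the openjdk-11-jdk package is available in your release’s repositories:

# Install OpenJDK 11 explicitly instead of the default-jdk metapackage
sudo apt install -y openjdk-11-jdk

# Confirm the active Java version
java -version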

If the add-apt-repository command is missing, check How to Install add-apt-repository on Debian / Ubuntu.

Step 2: Download Apache Spark

Download the latest release of Apache Spark from the downloads page. As of this update, this is 3.1.1.

curl -O https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz

Extract the Spark tarball.

tar xvf spark-3.1.1-bin-hadoop3.2.tgz

Move the Spark folder created after extraction to the /opt/ directory.

sudo mv spark-3.1.1-bin-hadoop3.2/ /opt/spark 
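
If you need a different Spark release, the same download, extract and move steps apply; only the version strings change. A hedged sketch that parameterizes them, assuming the chosen release is still published on archive.apache.org:

# Pick the Spark / Hadoop build you want (check the downloads page first)
SPARK_VERSION=3.1.1
HADOOP_VERSION=3.2

# Download, extract and move it to /opt/spark
curl -O https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
tar xvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
sudo mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}/ /opt/spark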

Set Spark environment

Open your bashrc configuration file.

vim ~/.bashrc

Add:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Activate the changes.

source ~/.bashrc
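
Before moving on, you can quickly confirm the new environment is in effect:

# SPARK_HOME should point at /opt/spark and the Spark scripts should resolve from PATH
echo $SPARK_HOME
which start-master.sh
spark-submit --version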

Step 3: Start a standalone master server

You can now start a standalone master server using the start-master.sh command.

$ start-master.sh 
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu.out

The process will be listening on TCP port 8080.

$ sudo ss -tunelp | grep 8080
tcp   LISTEN  0       1                           *:8080                *:*      users:(("java",pid=8033,fd=238)) ino:41613 sk:5 v6only:0 <-> 

The Web UI looks like below.

[Screenshot: Apache Spark master Web UI]

My Spark URL is spark://ubuntu:7077.
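
If the defaults clash with something else on the host, start-master.sh accepts options such as --host, --port and --webui-port. A hedged example with placeholder values:

# Bind the master to a specific hostname, keep the default cluster port (7077)
# and move the web UI off port 8080
start-master.sh --host ubuntu --port 7077 --webui-port 8081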

Step 4: Starting Spark Worker Process

The start-slave.sh command is used to start a Spark Worker process.

$ start-slave.sh spark://ubuntu:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

If you don’t have the script in your $PATH, you can first locate it.

$ sudo updatedb
$ locate start-slave.sh
/opt/spark/sbin/start-slave.sh

You can also use the absolute path to run the script.
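
You can also cap the resources a worker offers to the cluster with the --cores and --memory options. A sketch with placeholder values; adjust them to what the machine can spare:

# Start a worker against the master, limited to 2 cores and 2 GiB of memory
/opt/spark/sbin/start-slave.sh spark://ubuntu:7077 --cores 2 --memory 2g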

Step 5: Using Spark shell

Use the spark-shell command to access Spark Shell.

$ /opt/spark/bin/spark-shell
21/04/27 08:49:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:49:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:49:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513355938).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

If you’re more of a Python person, use pyspark.

$ /opt/spark/bin/pyspark
Python 3.8.5 (default, Jan 27 2021, 15:41:15)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
21/04/27 08:50:09 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.10.10.2 instead (on interface eth0)
21/04/27 08:50:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
21/04/27 08:50:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.8.5 (default, Jan 27 2021 15:41:15)
Spark context Web UI available at http://10.10.10.2:4040
Spark context available as 'sc' (master = local[*], app id = local-1619513411109).
SparkSession available as 'spark'.
>>>
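
Beyond the interactive shells, applications are normally submitted with spark-submit. As a quick sanity check of the standalone cluster, you can run the SparkPi example that ships with Spark; the jar path below assumes the 3.1.1 / Scala 2.12 build installed above:

# Submit the bundled SparkPi example to the standalone master
/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://ubuntu:7077 \
  /opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar 100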

You can shut down the master and worker (slave) Spark processes using the commands below.

$ $SPARK_HOME/sbin/stop-slave.sh
$ $SPARK_HOME/sbin/stop-master.sh
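
Spark also ships start-all.sh and stop-all.sh helper scripts in sbin/. A hedged shortcut, assuming the default standalone configuration, that stops the master and any local workers in one go:

# Stop the master and workers together
$SPARK_HOME/sbin/stop-all.sh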

There you have it. Read more in the Spark documentation.
