

Spark Series: Spark Installation and Deployment




Chapter 2: Spark Installation and Deployment

2.1 Version Selection

Download address:

https://archive.apache.org/dist/spark

There are four major release lines:

  1. Spark-0.x
  2. Spark-1.x (mainly Spark-1.3 and Spark-1.6)
  3. Spark-2.x (latest: Spark-2.4.8)
  4. Spark-3.x (latest: 3.2.0)

The build we use in this series is the following, which matches the Hadoop 3.2.2 version we deployed earlier:

  spark-3.1.2-bin-hadoop3.2.tgz


2.2 Scala Installation

Scala needs to be configured on all three nodes.

2.2.1 Upload, Extract, and Rename

[root@hadoop10 software]# tar -zxvf scala-2.12.14.tgz
[root@hadoop10 software]# mv scala-2.12.14 scala
[root@hadoop10 software]# ll
total 1984008
-rw-r--r--. 1 root root  67938106 Oct 18 11:00 apache-flume-1.9.0-bin.tar.gz
-rw-r--r--. 1 root root 278813748 Sep  8 19:21 apache-hive-3.1.2-bin.tar.gz
-rw-r--r--. 1 root root  12387614 Sep  8 17:40 apache-zookeeper-3.7.0-bin.tar.gz
drwxr-xr-x. 5 root root       211 Oct 20 16:46 azkaban
drwxr-xr-x. 11 root root     4096 Oct 18 16:52 flume
drwxr-xr-x. 11 aa   aa        173 Aug 29 21:16 hadoop
-rw-r--r--. 1 root root 395448622 Aug 29 21:03 hadoop-3.2.2.tar.gz
drwxr-xr-x. 8 root root       194 Oct 11 16:36 hbase
-rw-r--r--. 1 root root 272332786 Oct 11 16:20 hbase-2.3.6-bin.tar.gz
drwxr-xr-x. 10 root root      184 Sep 21 17:45 hive
drwxr-xr-x. 7 10   143        245 Dec 16  2018 jdk
-rw-r--r--. 1 root root 194042837 Aug 29 20:08 jdk-8u202-linux-x64.tar.gz
drwxr-xr-x. 2 root root      4096 Sep 18 16:35 mysql
-rw-r--r--. 1 root root 542750720 Sep 17 19:53 mysql-5.7.32-1.el7.x86_64.rpm-bundle.tar
drwxrwxr-x. 6 2000 2000        79 May 28 10:00 scala
-rw-r--r--. 1 root root  21087936 Nov  9 15:55 scala-2.12.14.tgz
-rw-r--r--. 1 root root 228834641 Nov  9 15:55 spark-3.1.2-bin-hadoop3.2.tgz
drwxr-xr-x. 9 aa   aa        4096 Dec 19  2017 sqoop
-rw-r--r--. 1 root root  17953604 Oct 19 11:29 sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
drwxr-xr-x. 8 root root       157 Sep  8 17:51 zk


2.2.2 Configure Environment Variables

Append the following to /etc/profile:

export SCALA_HOME=/software/scala
export PATH=.:$PATH:$SCALA_HOME/bin


2.2.3 Verification

[root@hadoop10 software]# source /etc/profile
[root@hadoop10 software]# scala -version
Scala code runner version 2.12.14 -- Copyright 2002-2021, LAMP/EPFL and Lightbend, Inc.
[root@hadoop10 software]# scala
Welcome to Scala 2.12.14 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202).
Type in expressions for evaluation. Or try :help.

scala> 6+6
res0: Int = 12

scala>


2.2.4 Install on All Three Nodes

Distribute Scala to the other nodes:

scp -r /software/scala hadoop11:/software/
scp -r /software/scala hadoop12:/software/

Then configure the environment variables on each of those nodes as well:

export SCALA_HOME=/software/scala
export PATH=.:$PATH:$SCALA_HOME/bin

Remember to source the profile afterwards:

[root@hadoop12 software]# source /etc/profile

2.3 Spark Installation

Cluster plan:

hadoop10    master
hadoop11    worker/slave
hadoop12    worker/slave

2.3.1 Download

Download address:

https://archive.apache.org/dist/spark/spark-3.1.2/

2.3.2 Upload, Extract, and Rename

[root@hadoop10 software]# tar -zxvf spark-3.1.2-bin-hadoop3.2.tgz
[root@hadoop10 software]# mv spark-3.1.2-bin-hadoop3.2 spark


2.3.3 Modify the Configuration Files

1. Configure slaves/workers

Enter the configuration directory:

cd /software/spark/conf

Rename the template and edit it:

[root@hadoop0 conf]# mv workers.template workers
vim workers

with the following contents:

hadoop11
hadoop12


2. Configure the master

Enter the configuration directory:

cd /software/spark/conf

Rename the template:

mv spark-env.sh.template spark-env.sh

Edit the configuration file:

vim spark-env.sh

and add the following:

## Java installation directory
JAVA_HOME=/software/jdk
## Hadoop configuration directory; needed to read files on HDFS and to run Spark on YARN, so set it up in advance
HADOOP_CONF_DIR=/software/hadoop/etc/hadoop
YARN_CONF_DIR=/software/hadoop/etc/hadoop
## Host of the Spark Master and the communication port for submitting jobs
SPARK_MASTER_HOST=hadoop10
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=1g


2.3.4 Distribute to the Other Nodes

scp -r /software/spark hadoop11:/software/
scp -r /software/spark hadoop12:/software/

2.3.5 Start and Stop

Start the Spark cluster on the master node:

cd /software/spark/sbin
./start-all.sh

or, equivalently, with the full path (Hadoop ships a start-all.sh of its own, so the full path avoids any ambiguity on the PATH):

/software/spark/sbin/start-all.sh

The processes on each node:

[root@hadoop0 sbin]# jps
21720 Master
21788 Jps
[root@hadoop1 software]# jps
66960 Worker
67031 Jps
[root@hadoop2 software]# jps
17061 Jps
16989 Worker

Check the master web UI at http://hadoop10:8080 (the port configured above).

Stop the Spark cluster on the master node:

/software/spark/sbin/stop-all.sh


Start and stop the Master on its own from the master node:

start-master.sh

stop-master.sh

Start and stop the Workers on their own (the workers are the hostnames listed in the workers configuration file):

start-workers.sh

stop-workers.sh

2.3.6 Run the Pi Estimation Example

[root@hadoop10 bin]# cd /software/spark/bin/
[root@hadoop10 bin]# pwd
/software/spark/bin
[root@hadoop10 bin]# run-example SparkPi 10
......(many preceding log lines omitted)
2021-11-09 16:21:27,969 INFO scheduler.TaskSetManager: Starting task 9.0 in stage 0.0 (TID 9) (hadoop10, executor driver, partition 9, PROCESS_LOCAL, 4578 bytes) taskResourceAssignments Map()
2021-11-09 16:21:27,972 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 140 ms on hadoop10 (executor driver) (8/10)
2021-11-09 16:21:27,982 INFO executor.Executor: Running task 9.0 in stage 0.0 (TID 9)
2021-11-09 16:21:28,009 INFO executor.Executor: Finished task 8.0 in stage 0.0 (TID 8). 957 bytes result sent to driver
2021-11-09 16:21:28,016 INFO scheduler.TaskSetManager: Finished task 8.0 in stage 0.0 (TID 8) in 74 ms on hadoop10 (executor driver) (9/10)
2021-11-09 16:21:28,035 INFO executor.Executor: Finished task 9.0 in stage 0.0 (TID 9). 957 bytes result sent to driver
2021-11-09 16:21:28,037 INFO scheduler.TaskSetManager: Finished task 9.0 in stage 0.0 (TID 9) in 68 ms on hadoop10 (executor driver) (10/10)
2021-11-09 16:21:28,039 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
2021-11-09 16:21:28,047 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 1.121 s
2021-11-09 16:21:28,054 INFO scheduler.DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
2021-11-09 16:21:28,055 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
2021-11-09 16:21:28,058 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 1.262871 s
Pi is roughly 3.1433951433951433
2021-11-09 16:21:28,130 INFO server.AbstractConnector: Stopped Spark@777c9dc9{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2021-11-09 16:21:28,131 INFO ui.SparkUI: Stopped Spark web UI at http://hadoop10:4040
2021-11-09 16:21:28,165 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2021-11-09 16:21:28,333 INFO memory.MemoryStore: MemoryStore cleared
2021-11-09 16:21:28,333 INFO storage.BlockManager: BlockManager stopped
2021-11-09 16:21:28,341 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2021-11-09 16:21:28,343 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2021-11-09 16:21:28,355 INFO spark.SparkContext: Successfully stopped SparkContext
2021-11-09 16:21:28,397 INFO util.ShutdownHookManager: Shutdown hook called
2021-11-09 16:21:28,397 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d4aeb352-14b7-4e72-ada7-e66b45192bc5
2021-11-09 16:21:28,402 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-000896f1-04df-45a1-81be-969838a25457
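SparkPi estimates π with a Monte Carlo method: it scatters random points over the unit square and takes 4 × (fraction falling inside the quarter circle). A minimal single-process sketch of the same calculation in plain Python (the function name and sample count here are my own, not part of Spark):

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of pi: sample (x, y) uniformly in the unit
    square and count points with x^2 + y^2 <= 1 (the quarter circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

if __name__ == "__main__":
    print(f"Pi is roughly {estimate_pi(1_000_000)}")
```

The `10` passed to `run-example SparkPi 10` is the number of partitions; Spark simply splits the same sampling loop across tasks and sums the counts with `reduce`.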


2.3.7 Run a WordCount Example

2.3.7.1 spark-shell in Local Mode

[root@hadoop10 bin]# cd /software/spark/bin/
[root@hadoop10 bin]# pwd
/software/spark/bin
[root@hadoop10 bin]# spark-shell
2021-11-09 16:57:03,855 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop10:4040
Spark context available as 'sc' (master = local[*], app id = local-1636448230277).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("file:///home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(hadoop,1)
(hbase,1)
(hello,3)
(world,1)

scala>
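The spark-shell pipeline above chains flatMap (split each line into words), map (pair each word with 1), and reduceByKey (sum the 1s per word). The same per-word aggregation can be sketched in plain Python; the sample lines below are my guess at the contents of wordcount.txt, chosen only to reproduce the counts shown above:

```python
from collections import Counter
from typing import Dict, Iterable

def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Mirror of the spark-shell pipeline:
    flatMap(_.split(" "))  -> split each line into words
    map((_, 1))            -> pair each word with 1
    reduceByKey(_ + _)     -> sum the 1s for each word
    """
    counts: Counter = Counter()
    for line in lines:
        for word in line.split(" "):
            if word:  # skip empty tokens from repeated spaces
                counts[word] += 1
    return dict(counts)

# Hypothetical file contents (not taken from the article):
lines = ["hello world", "hello hadoop", "hello hbase"]
print(word_count(lines))  # {'hello': 3, 'world': 1, 'hadoop': 1, 'hbase': 1}
```

In Spark the same reduction happens distributed: the map side emits (word, 1) pairs per partition and reduceByKey merges them across the cluster.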


2.3.7.2 Spark Cluster Mode

This mode requires the Spark cluster to be started first.

Run the following command from the /software/spark/bin directory:

./spark-shell \
--master spark://hadoop10:7077 \
--executor-memory 512m \
--total-executor-cores 1

The full session:

[root@hadoop10 bin]# pwd
/software/spark/bin
[root@hadoop10 bin]# ./spark-shell \
> --master spark://hadoop10:7077 \
> --executor-memory 512m \
> --total-executor-cores 1
2021-11-09 17:00:13,074 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop10:4040
Spark context available as 'sc' (master = spark://hadoop10:7077, app id = app-20211109170018-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala>


2.3.7.3 FileNotFoundException in Cluster Mode and How to Fix It

Running the same word count in cluster mode now fails:

scala> sc.textFile("file:///home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
2021-11-09 17:01:07,431 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (192.168.22.137 executor 0): java.io.FileNotFoundException: File file:/home/data/wordcount.txt does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:666)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:987)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:656)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146)
    at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:899)
    at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:109)
    at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:286)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:285)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:243)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:131)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-11-09 17:01:07,586 ERROR scheduler.TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (192.168.22.137 executor 0): java.io.FileNotFoundException: File file:/home/data/wordcount.txt does not exist
    (same stack trace as above)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2261)
    at org.apache.spark.rdd.RDD.$anonfun$foreach$1(RDD.scala:1012)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:1010)
    ... 47 elided
Caused by: java.io.FileNotFoundException: File file:/home/data/wordcount.txt does not exist
    (same stack trace as above)

scala>


Solution:

1. Look at the launch command

./spark-shell \
--master spark://hadoop10:7077 \
--executor-memory 512m \
--total-executor-cores 1

We requested only one executor core in total (--total-executor-cores 1), so Spark launched a single executor on one of the workers.

2. Check the processes on the three nodes


On hadoop11 there is an extra process: CoarseGrainedExecutorBackend.

What is CoarseGrainedExecutorBackend?

The Executor is what actually runs tasks; CoarseGrainedExecutorBackend is the backend process responsible for creating and maintaining the Executor object.

3. Summary

When textFile is run from spark-shell with total-executor-cores set to N, and N machines are running a CoarseGrainedExecutorBackend process, a file read via a file:// URL must exist on all N of those machines, because each task resolves the path on the local disk of whichever executor it runs on.
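That rule can be pictured with a toy model in Python (the hostnames and path come from this chapter; the dictionary and function are purely illustrative, not Spark APIs):

```python
# Toy model: a task reading a file:// URL resolves the path on the
# local disk of whichever host runs its executor, not on the driver.
files_on_host = {
    "hadoop10": {"/home/data/wordcount.txt"},  # the file exists here...
    "hadoop11": set(),                         # ...but not on this worker
}

def read_partition(host: str, path: str) -> str:
    """Simulate one task reading its input split on a given host."""
    if path not in files_on_host[host]:
        raise FileNotFoundError(f"File file:{path} does not exist")
    return "<contents>"

# A task scheduled on hadoop11 fails even though the driver's copy exists:
try:
    read_partition("hadoop11", "/home/data/wordcount.txt")
except FileNotFoundError as e:
    print("task failed:", e)

# After copying the file over (the scp step below), the task succeeds:
files_on_host["hadoop11"].add("/home/data/wordcount.txt")
print(read_partition("hadoop11", "/home/data/wordcount.txt"))
```

This is exactly why the fix is to place wordcount.txt at the same path on every node that hosts an executor, or to put the file on HDFS so every executor reads the same shared copy.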

4. Copy the file to the same path on hadoop11

[root@hadoop10 data]# scp /home/data/wordcount.txt hadoop11:/home/data/
wordcount.txt                          100%   37     7.9KB/s   00:00
[root@hadoop10 data]#

5. Run again (note that foreach(println) now prints on the executor rather than in the driver shell, which is why collect() is used to see the result):

scala> sc.textFile("file:///home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
scala> sc.textFile("file:///home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
res2: Array[(String, Int)] = Array((hello,3), (world,1), (hadoop,1), (hbase,1))
scala>


Reference: https://www.cnblogs.com/dummyly/p/10000421.html

2.3.8 Run the WordCount Example Against HDFS

2.3.8.1 spark-shell in Local Mode

[root@hadoop10 bin]# pwd
/software/spark/bin
[root@hadoop10 bin]# spark-shell
2021-11-09 17:29:58,238 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop10:4040
Spark context available as 'sc' (master = local[*], app id = local-1636450204011).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("hdfs://hadoop10:8020/home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(hadoop,1)
(hbase,1)
(hello,3)
(world,1)

scala> sc.textFile("hdfs://hadoop10/home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).foreach(println)
(hadoop,1)
(hbase,1)
(hello,3)
(world,1)

scala>


2.3.8.2 Spark Cluster Mode

[root@hadoop10 bin]# ./spark-shell \
> --master spark://hadoop10:7077 \
> --executor-memory 512m \
> --total-executor-cores 1
2021-11-09 17:31:27,690 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop10:4040
Spark context available as 'sc' (master = spark://hadoop10:7077, app id = app-20211109173133-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sc.textFile("hdfs://hadoop10:8020/home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
res0: Array[(String, Int)] = Array((hello,3), (world,1), (hadoop,1), (hbase,1))
scala> sc.textFile("hdfs://hadoop10/home/data/wordcount.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect()
res1: Array[(String, Int)] = Array((hello,3), (world,1), (hadoop,1), (hbase,1))
scala>


2.4 Viewing Past Applications in the Web UI

Back in the master web UI, you will find the completed applications and their final run states.


Disclaimer:
The code and statements in this article were written from my own understanding, and the images are screenshots from my own practice or images associated with the corresponding technologies. If you have any objections, please contact me for removal. Thank you. Please credit the source when reposting.

By luoyepiaoxue2014

Bilibili: https://space.bilibili.com/1523287361
Weibo: http://weibo.com/luoyepiaoxue2014

Tags: Spark, Big Data

Reposted from: https://blog.csdn.net/luoyepiaoxue2014/article/details/128072477
The copyright belongs to the original author 落叶飘雪2014. In case of infringement, please contact us for removal.
