Accessing Hive Data from Spark

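Spark can interact with Hive directly: it reads Hive data through the Hive metastore and runs the analysis in Spark. The steps below show how to read Hive data from Spark.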

Configure the Hive thrift service. Go to the Hive configuration directory (in this example Hive is installed under /usr/local) and open hive-site.xml:

    cd /usr/local/hive/conf
    gedit hive-site.xml
Add the thrift service configuration to hive-site.xml. The thrift setting is the hive.metastore.uris property at the end of the file. Note that the & in the JDBC URL must be written as &amp; inside XML, otherwise the file is not well-formed. The complete hive-site.xml is as follows:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
      </property>
      <property>
        <name>hive.metastore.local</name>
        <value>true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>123456</value>
        <description>password to use against metastore database</description>
      </property>
      <property>
        <name>hive.metastore.schema.verification</name>
        <value>true</value>
      </property>
      <property>
        <name>datanucleus.schema.autoCreateAll</name>
        <value>true</value>
      </property>
      <property>
        <name>hive.metastore.uris</name>
        <value>thrift://localhost:9083</value>
      </property>
    </configuration>
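As a quick sanity check on the configuration, the following minimal sketch reads one property from hive-site.xml using the XML parser that ships with the JDK. The object name HiveSiteReader and the helper readProperty are hypothetical names chosen for illustration; the default path assumes the /usr/local/hive installation used in this article. The parser throws an exception if the file is not well-formed XML (for example, if it contains an unescaped &), which is a useful check in itself.

```scala
import java.io.File
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper, not part of Hive or Spark.
object HiveSiteReader {

  /** Returns the <value> of the <property> whose <name> equals `name`, if any. */
  def readProperty(file: File, name: String): Option[String] = {
    // parse() throws org.xml.sax.SAXParseException on malformed XML
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file)
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength).iterator
      .map(i => props.item(i).asInstanceOf[Element])
      .find(p => textOf(p, "name").contains(name))
      .flatMap(p => textOf(p, "value"))
  }

  // Text of the first child element with the given tag, trimmed
  private def textOf(parent: Element, tag: String): Option[String] = {
    val nodes = parent.getElementsByTagName(tag)
    if (nodes.getLength > 0) Some(nodes.item(0).getTextContent.trim) else None
  }

  def main(args: Array[String]): Unit = {
    // Assumed default location, taken from the install path used in this article
    val path = args.headOption.getOrElse("/usr/local/hive/conf/hive-site.xml")
    readProperty(new File(path), "hive.metastore.uris") match {
      case Some(uri) => println(s"hive.metastore.uris = $uri")
      case None      => println("hive.metastore.uris is not set")
    }
  }
}
```

The value printed here is the same URI that the Spark application later passes to its hive.metastore.uris setting, so the two should match.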
Start the thrift port:

    # Start the Hadoop cluster
    start-all.sh
    # Start the Hive metastore service
    nohup hive --service metastore &
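The metastore can take a few seconds to start, so it may help to confirm that the thrift port accepts connections before running any Spark code. The sketch below is one possible check using a plain TCP socket. The object name MetastorePortCheck, the helper isPortOpen, and the 2-second timeout are assumptions for illustration. A successful connection only shows that something is listening on the port, not that the metastore is fully healthy.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

// Hypothetical helper, not part of Hive or Spark.
object MetastorePortCheck {

  /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
  def isPortOpen(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      // Connection refused, timeout, and unknown host are all IOExceptions
      case _: IOException => false
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // localhost:9083 matches the hive.metastore.uris value used in this article
    if (isPortOpen("localhost", 9083))
      println("Metastore port 9083 is reachable")
    else
      println("Metastore port 9083 is not reachable yet")
  }
}
```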
Create a Maven project in IDEA. Select JDK 1.8, choose the archetype [org.apache.maven.archetypes:maven-archetype-site-simple], and click [Create] to create the project.
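Add the Scala dependency. In the top menu bar, click [File]->[Project Structure] to open the project structure settings. There, select [Global Libraries]->[+]->[Scala SDK] (if the Scala SDK option is missing, install the Scala plugin in IDEA first). Choose the Scala SDK version that matches your Spark build; this example uses Spark 3.2.1 with Scala 2.12. Click OK to finish the Scala setup. Then create a scala folder under [src/main/] and mark it as a sources root; it will hold the Scala code.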
Add the following dependencies and packaging plugins to pom.xml:

    <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <scala.binary.version>2.12</scala.binary.version>
      <spark.version>3.2.1</spark.version>
      <mysql.version>5.1.49</mysql.version>
    </properties>

    <dependencies>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
      </dependency>
      <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
      </dependency>
      <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>4.4.0</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-assembly-plugin</artifactId>
          <version>3.3.0</version>
          <configuration>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
          </configuration>
          <executions>
            <execution>
              <id>make-assembly</id>
              <phase>package</phase>
              <goals>
                <goal>single</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <configuration>
            <source>8</source>
            <target>8</target>
          </configuration>
        </plugin>
      </plugins>
    </build>

After editing pom.xml, reload the Maven project so that the dependencies are downloaded.
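Create a Scala class in IDEA. Name it SparkOnHiveDemo and choose Object as the kind. Add the following code to SparkOnHiveDemo to list the Hive databases:

    import org.apache.spark.sql.SparkSession

    object SparkOnHiveDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName(this.getClass.getName)
          .enableHiveSupport() // enable Hive support
          .config("hive.metastore.uris", "thrift://localhost:9083") // connect to the Hive thrift port
          .getOrCreate()

        // Use Spark SQL to list the Hive databases.
        spark.sql("show databases").show()
      }
    }

Run the program; it prints the databases registered in Hive. You can also open the Hive CLI and run the same SQL statement to compare. If both return the same result, Spark can access the Hive data correctly, and the analysis can then be carried out in Spark.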
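Once the database listing works, the same session can read a Hive table as a DataFrame. The following is a minimal sketch under stated assumptions: the object name SparkOnHiveQueryDemo, the helper previewTable, and the table name default.demo_table are hypothetical and not part of the original example, so substitute a table that exists in your Hive warehouse. It uses only the standard spark.sql, spark.catalog.tableExists, and spark.table calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical follow-up to SparkOnHiveDemo.
object SparkOnHiveQueryDemo {

  /** Returns the first `limit` rows of the table, or None if it does not exist. */
  def previewTable(spark: SparkSession, tableName: String, limit: Int = 10): Option[DataFrame] =
    if (spark.catalog.tableExists(tableName)) Some(spark.table(tableName).limit(limit))
    else None

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkOnHiveQueryDemo")
      .enableHiveSupport()
      .config("hive.metastore.uris", "thrift://localhost:9083")
      .getOrCreate()

    // List the tables in the default database
    spark.sql("show tables in default").show()

    // Assumed table name; pass a real one as the first program argument
    val tableName = args.headOption.getOrElse("default.demo_table")
    previewTable(spark, tableName) match {
      case Some(df) =>
        df.printSchema()
        df.show()
      case None =>
        println(s"Table $tableName was not found in the metastore")
    }

    spark.stop()
  }
}
```

Checking tableExists first keeps the sketch from failing with an analysis error when the placeholder table name has not been replaced.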