Spark on Hive environment setup (Hadoop + Hive + Spark)

baijin · 2024-10-16 07:39:17

Spark on Hive: Spark uses the Hive metastore for metadata management, while Spark SQL provides its own SQL parsing and execution.

Hive setup

  • Hive download and installation
# download hive
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz

# extract
tar xzvf apache-hive-3.1.3-bin.tar.gz -C /opt/module/

# add hive configuration
cd /opt/module/apache-hive-3.1.3-bin/conf
cp hive-env.sh.template hive-env.sh
# add environment variables
vi hive-env.sh
export HADOOP_HOME=/opt/module/hadoop-3.4.0/
export HIVE_CONF_DIR=/opt/module/apache-hive-3.1.3-bin/conf
export HIVE_AUX_JARS_PATH=/opt/module/apache-hive-3.1.3-bin/lib
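Before moving on, it can be worth confirming that the paths referenced in hive-env.sh actually exist. A small sanity-check sketch (the paths are copied from the configuration above; adjust them to your layout):

import os

# paths taken from hive-env.sh above; adjust if your layout differs
paths = [
    "/opt/module/hadoop-3.4.0/",
    "/opt/module/apache-hive-3.1.3-bin/conf",
    "/opt/module/apache-hive-3.1.3-bin/lib",
]
for path in paths:
    print(path, "->", "ok" if os.path.isdir(path) else "MISSING")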
  • Start Hive in embedded mode (metadata stored in a local embedded Derby database)
# initialize the metastore schema (embedded mode)
cd /opt/module
ln -s apache-hive-3.1.3-bin/ hive
cd hive
bin/schematool -initSchema -dbType derby 

# start hive
bin/hive
  • Hive remote mode configuration (metadata stored in MySQL)
# remote mode setup
# add the mysql jdbc driver
cd /opt/module/apache-hive-3.1.3-bin/lib
wget https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.0.33/mysql-connector-j-8.0.33.jar

# stop the firewall on the mysql server
systemctl stop firewalld

# create the hive user in mysql
create user 'hive'@'%' identified by '<hive user password>';
grant all on *.* to 'hive'@'%';
flush privileges;

# add the hive configuration file hive-site.xml
cd hive/conf
vi hive-site.xml
  • hive-site.xml configuration
<configuration>
    <!-- MySQL connection settings for the metastore -->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://<mysql host>:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8&amp;allowPublicKeyRetrieval=true</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>mysql hive user</value>
    </property>

    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mysql hive user password</value>
    </property>

    <!-- Host that HiveServer2 binds to -->
    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>vm100</value>
    </property>

    <!-- Metastore service address for remote mode -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://vm100:9083</value>
    </property>

    <!-- Disable metastore event notification API auth -->
    <property>
        <name>hive.metastore.event.db.notification.api.auth</name>
        <value>false</value>
    </property>

    <!-- Disable metastore schema version verification -->
    <property>
        <name>hive.metastore.schema.verification</name>
        <value>false</value>
    </property>

    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
</configuration>
  • Hive remote mode initialization and startup
# initialize the metastore schema (remote mode)
# on success, the metastore tables are created in the mysql hive database
cd hive
bin/schematool -initSchema -dbType mysql -verbose

# start hive
bin/hive
hive> show databases;
# error: the metastore service has not been started
FAILED: HiveException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

# start the metastore service
nohup bin/hive --service metastore &
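Once started, the metastore listens on the Thrift port from hive.metastore.uris (9083 in the configuration above). A minimal reachability check, sketched in Python (the port_open helper is just an illustration, not part of Hive):

import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 9083 is the metastore port from hive.metastore.uris above
print(port_open("vm100", 9083))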
  • Connecting clients to Hive
# clients
bin/hive     legacy client (deprecated), can also be used remotely
bin/beeline  newer client (better performance and security)
in remote mode, beeline connects to the HiveServer2 service over the Thrift protocol
other clients: Hue / Apache Impala / HCatalog

# start the HiveServer2 service
nohup bin/hive --service hiveserver2 &

# start beeline
bin/beeline
# connect to hiveserver2
beeline> !connect jdbc:hive2://vm100:10000
# enter the username; press Enter when prompted for the password
  • Connecting with bin/hive
hive> show databases;
OK
default
test
Time taken: 0.159 seconds, Fetched: 2 row(s)
hive> show tables;
OK
people
t_user
t_user1
Time taken: 0.155 seconds, Fetched: 3 row(s)
hive> 
  • Connecting with bin/beeline
0: jdbc:hive2://vm100:10000> show databases;
+----------------+
| database_name  |
+----------------+
| default        |
| test           |
+----------------+
2 rows selected (3.331 seconds)
0: jdbc:hive2://vm100:10000> use test;
0: jdbc:hive2://vm100:10000> show tables;
+-----------+
| tab_name  |
+-----------+
| people    |
| t_user    |
| t_user1   |
+-----------+
3 rows selected (0.164 seconds)
0: jdbc:hive2://vm100:10000> 

Spark on Hive configuration

  • Add a conf/hive-site.xml file under the Spark installation
<configuration>

    <!-- HDFS warehouse directory -->
    <property>
            <name>hive.metastore.warehouse.dir</name>
            <value>/user/hive/warehouse</value>
    </property>

    <property>
      <name>hive.metastore.local</name>
      <value>false</value>
    </property>

    <!-- Metastore service address for remote mode -->
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://vm100:9083</value>
    </property>

</configuration>
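Alternatively, if you would rather not maintain a hive-site.xml under Spark's conf directory, the metastore address can be passed when building the session. A minimal sketch, assuming the same metastore as above (the setting must be supplied before the session is created):

from pyspark.sql import SparkSession

# passing hive.metastore.uris at build time is an alternative to
# shipping conf/hive-site.xml; the address matches the Hive setup above
spark = SparkSession.builder \
    .appName("spark-on-hive") \
    .config("hive.metastore.uris", "thrift://vm100:9083") \
    .enableHiveSupport() \
    .getOrCreate()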
  • Verify by running Spark SQL from pyspark
from pyspark.sql import SparkSession

# enableHiveSupport() points the session at the Hive metastore
# configured in conf/hive-site.xml
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .enableHiveSupport() \
    .getOrCreate()

# query a Hive table through Spark SQL
spark.sql('select * from test.people').show()
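As a quick follow-up check, the same metastore contents are visible through the session's catalog API; a short sketch (the expected names are taken from the beeline session above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# list what the Hive metastore exposes to Spark
print([db.name for db in spark.catalog.listDatabases()])   # expect 'default', 'test'
print([t.name for t in spark.catalog.listTables("test")])  # expect 'people', 't_user', 't_user1'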
  • Verify with Spark's bundled bin/spark-sql
bin/spark-sql 
spark-sql (default)> show databases;
default
test
spark-sql (default)> use test;
spark-sql (test)> show tables;
people
t_user
t_user1
spark-sql (test)> select * from people;
NULL	Michael
30	Andy
19	Justin
spark-sql (test)> show create table people;
CREATE TABLE spark_catalog.test.people (
  age BIGINT,
  name STRING)
USING parquet
  • Spark Thrift Server: Spark ships a service analogous to Hive's ThriftServer (HiveServer2), so SQL clients such as DataGrip, HeidiSQL, and DBeaver can connect directly and work with Spark SQL in plain SQL; likewise, the pyhive Python library can connect to the Spark Thrift Server (see the sketch after the scripts below).
# spark thriftserver start/stop scripts
sbin/start-thriftserver.sh
sbin/stop-thriftserver.sh
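As mentioned above, pyhive can talk to the Spark Thrift Server directly, because the server speaks the HiveServer2 Thrift protocol (the same approach works against HiveServer2 itself). A minimal sketch, assuming the server runs on vm100 with the default port 10000 and no authentication:

from pyhive import hive  # pip install 'pyhive[hive]'

# the Spark Thrift Server implements the HiveServer2 protocol,
# so the connection looks the same as for Hive itself
conn = hive.connect(host="vm100", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute("select * from test.people")
for row in cursor.fetchall():
    print(row)
cursor.close()
conn.close()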
