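Trino: A Distributed SQL Query Engine for Heterogeneous Data Sources

Trino is a distributed SQL query engine designed to query large data sets spread across one or more heterogeneous data sources.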

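In that sense it fills a role similar to Hive: both put a SQL layer over data that lives in external storage, although Trino runs queries on its own engine instead of compiling them into batch jobs.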

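Links

Official repository: github.com/trinodb/trino
Official website: trino.io

Requirements

Operating system and limits

Trino only supports 64-bit systems. To keep cluster queries from running into resource limits, raise the maximum number of open files and processes for the trino user by creating /etc/security/limits.d/88-trino.conf: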

trino soft nofile 131072
trino hard nofile 131072
trino soft nproc 128000
trino hard nproc 128000
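
The limits above can be verified from inside a process started as the trino user. The following is a minimal sketch using Python's standard resource module; the function names and the REQUIRED table are illustrative and not part of Trino. Whether the limits.d entries apply depends on how the process is started: systemd services take their limits from the unit file, not from this file.

```python
import resource

# Minimums taken from the 88-trino.conf entries above.
REQUIRED = {"nofile": 131072, "nproc": 128000}


def limit_ok(value, required):
    """Return True if a limit value satisfies the required minimum."""
    # RLIM_INFINITY means "unlimited", which always satisfies the minimum.
    return value == resource.RLIM_INFINITY or value >= required


def check_limits():
    """Compare this process's soft and hard limits with REQUIRED."""
    current = {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }
    return {
        name: all(limit_ok(value, REQUIRED[name]) for value in pair)
        for name, pair in current.items()
    }


if __name__ == "__main__":
    for name, ok in check_limits().items():
        print(f"{name}: {'ok' if ok else 'too low'}")
```

Running the script in a session of the trino user prints one line per limit, so a value that is too low is visible before Trino is started.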

Java

Trino 420 requires a 64-bit Java 17 runtime, version 17.0.3 or later; other major versions are not supported. The Trino project recommends the Azul Zulu JDK distribution.

sudo tar xf zulu17.42.21-ca-crac-jdk17.0.7-linux_x64.tar.gz -C /opt/

Python

The launcher script needs Python 2.6.x, 2.7.x, or 3.x. Python 3 is recommended.

sudo apt install python3

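Installation

Download the server tarball (the official site links to Maven Central) and unpack it under /opt, the location the rest of this article assumes:

wget https://repo1.maven.org/maven2/io/trino/trino-server/420/trino-server-420.tar.gz
sudo tar xf trino-server-420.tar.gz -C /opt/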

Configuration

Create a dedicated system user and group so that Trino runs isolated from other accounts:

getent group trino >/dev/null || sudo groupadd -r trino
getent passwd trino >/dev/null || sudo useradd --comment "Trino" -s /sbin/nologin -g trino -r -d /var/lib/trino trino

Create the data and log directories:

sudo install --directory --mode=755 /var/lib/trino
sudo install --directory --mode=755 /var/log/trino

Give the trino user ownership of them:

sudo chown -R trino:trino /var/lib/trino
sudo chown -R trino:trino /var/log/trino

etc/node.properties

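Node configuration. Like the other files below, it lives in the etc directory inside the installation directory. The minimal settings are: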

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/lib/trino/data

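The properties are:

node.environment: the environment name. Every node in a cluster must use the same value. It must start with a lowercase letter or digit and may contain only lowercase letters, digits, and underscores (_).
node.id: the unique identifier of this node. It must be unique within the cluster and should stay the same across restarts and upgrades. It must start with a letter or digit and may contain only letters, digits, underscores (_), and hyphens (-). A UUID is a convenient choice: it uses the 8-4-4-4-12 layout, 32 hexadecimal digits or 128 bits in total, the same width as an IPv6 address. Generate one with an online UUID tool, with uuidgen on Red Hat family systems, or with uuid on Debian family systems (the uuid package may need to be installed first).
node.data-dir: the data directory, where Trino stores logs and other local data. The user that runs Trino must have write permission on it.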
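
As an alternative to uuidgen, the file can be generated with a few lines of Python. This is a minimal sketch: render_node_properties and ENVIRONMENT_RE are illustrative names, and the pattern encodes the assumption that environment names consist of lowercase letters, digits, and underscores. Generate the node.id once and keep it, since it should not change between restarts.

```python
import re
import uuid

# Assumed naming rule: starts with a lowercase letter or digit, then
# lowercase letters, digits, or underscores only.
ENVIRONMENT_RE = re.compile(r"[a-z0-9][a-z0-9_]*")


def render_node_properties(environment, data_dir, node_id=None):
    """Return the text of a minimal etc/node.properties file."""
    if not ENVIRONMENT_RE.fullmatch(environment):
        raise ValueError(f"invalid node.environment: {environment!r}")
    if node_id is None:
        # A random (version 4) UUID in the usual 8-4-4-4-12 layout.
        node_id = str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


print(render_node_properties("production", "/var/lib/trino/data"), end="")
```

The printed text can be redirected into etc/node.properties on each node; every node gets a different node.id but the same node.environment.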

etc/jvm.config

JVM configuration. The recommended settings are:

-server
-Xmx16G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
# Disable Preventive GC for performance reasons (JDK-8293861)
-XX:-G1UsePreventiveGC

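Notes on these options:

-Xmx: the maximum heap size available to Trino. On a machine dedicated to Trino, set it to 70% to 85% of physical memory; on a shared machine, base it on the memory actually allocated to Trino. For production, allocate at least 32 GB and disable swap. Most of the memory goes to query processing and a smaller share to JVM internals such as garbage collection.
-XX:+UnlockDiagnosticVMOptions and -XX:+UseAESCTRIntrinsics: recommended on ARM machines, where they improve AES encryption performance for S3.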
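
The 70% to 85% guideline translates into a simple calculation. The sketch below is illustrative only: suggest_xmx and physical_memory_bytes are made-up helper names, the default of 80 percent is an arbitrary point inside the recommended range, and the result is rounded down to whole gigabytes.

```python
import os

GIB = 2**30


def suggest_xmx(total_bytes, percent=80):
    """Return an -Xmx flag sized at the given percentage of total memory."""
    if not 70 <= percent <= 85:
        raise ValueError("percent should stay within the 70-85 guideline")
    # Integer arithmetic avoids float rounding; the result is floored to GiB.
    heap_gib = total_bytes * percent // 100 // GIB
    return f"-Xmx{heap_gib}G"


def physical_memory_bytes():
    """Physical memory of this machine (works on Linux)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


# Example: a dedicated machine with 64 GiB of RAM, using 75% of it.
print(suggest_xmx(64 * GIB, 75))  # -Xmx48G
```

On the target machine, suggest_xmx(physical_memory_bytes()) gives a starting value for the -Xmx line in etc/jvm.config; on a shared machine, pass the memory actually reserved for Trino instead.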

etc/config.properties

Trino server configuration. The content depends on the role of the node: coordinator or worker.

Minimal configuration for a coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery.uri=http://example.net:8080

Minimal configuration for a worker:

coordinator=false
http-server.http.port=8080
discovery.uri=http://example.net:8080

In test and other non-production environments, a single node can act as both coordinator and worker:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://example.net:8080

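The properties are:

coordinator: whether this instance acts as a coordinator, accepting queries from clients and managing query execution.
node-scheduler.include-coordinator: whether query work may be scheduled on the coordinator itself. On larger clusters, processing work on the coordinator can hurt query performance, because the machine's resources are then not available for the critical tasks of scheduling, managing, and monitoring query execution.
http-server.http.port: the port of the HTTP server. Trino uses HTTP for all communication, internal and external.
discovery.uri: the URI of the discovery service, which the nodes use to find each other. Every instance registers with the discovery service at startup and sends heartbeats to stay registered. The discovery service runs inside the coordinator and shares its HTTP port, so the URI must point to the coordinator's host and port; update it if the port is changed, and switch the scheme to https if HTTPS is used.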
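
The three variants differ only in two properties, which a small generator makes explicit. This is a sketch under the assumption that a role is one of "coordinator", "worker", or "combined"; render_config and these role names are illustrative and not Trino terminology.

```python
def render_config(role, discovery_uri, port=8080):
    """Return the text of a minimal etc/config.properties for one role."""
    if role not in ("coordinator", "worker", "combined"):
        raise ValueError(f"unknown role: {role!r}")
    lines = [f"coordinator={'false' if role == 'worker' else 'true'}"]
    if role != "worker":
        # Only a combined node lets the coordinator process query work too.
        include = "true" if role == "combined" else "false"
        lines.append(f"node-scheduler.include-coordinator={include}")
    lines.append(f"http-server.http.port={port}")
    # Every node points at the coordinator, which hosts the discovery service.
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


print(render_config("worker", "http://example.net:8080"), end="")
```

Keeping the discovery URI as a single argument makes it harder for workers and the coordinator to drift apart when the port or scheme changes.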

etc/log.properties

Logging configuration. It sets logging parameters such as the minimum log level of a logger:

io.trino=INFO

Four levels are supported: DEBUG, INFO, WARN, and ERROR. INFO is the default, so the example above changes nothing in practice.

etc/catalog/*.properties

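Catalog configuration. Each file in this directory configures one connector; the file name without the .properties suffix becomes the catalog name, and the connector provides all the schemas and tables of that catalog. Taking the Hive connector as an example, a Hive Metastore service must be running and reachable from the Trino nodes. HiveServer2 is not required, because Trino uses only Hive's metadata and data files, not its execution engine. See the official documentation for details.

Hive connector (etc/catalog/hive.properties):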

connector.name=hive
hive.metastore.uri=thrift://hadoop01.local:9083

If the Thrift address is unknown, search the Hive configuration directory for it:

grep -ri 'thrift:' /opt/apache-hive-3.1.3-bin/conf/
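
The same lookup can be scripted. The sketch below reads the hive.metastore.uris property, Hive's own setting for the metastore address, from the text of a hive-site.xml file and renders the matching catalog file. The function names are illustrative, and the assumption is that the metastore address is configured in hive-site.xml in the configuration directory searched above.

```python
import xml.etree.ElementTree as ET


def metastore_uris(hive_site_xml):
    """Extract the metastore URIs from hive-site.xml content (str or bytes)."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "hive.metastore.uris":
            value = prop.findtext("value", "")
            # Hive allows several comma-separated URIs.
            return [uri.strip() for uri in value.split(",") if uri.strip()]
    return []


def render_hive_catalog(uris):
    """Return the text of a minimal Hive catalog properties file."""
    if not uris:
        raise ValueError("no metastore URI found")
    return "connector.name=hive\nhive.metastore.uri=" + ",".join(uris) + "\n"


sample = """<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hadoop01.local:9083</value>
  </property>
</configuration>"""
print(render_hive_catalog(metastore_uris(sample)), end="")
```

For a real file, pass the bytes of hive-site.xml instead of the sample string, for example the result of pathlib.Path(...).read_bytes().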

All-in-one configuration

For a small cluster or a test environment, the following configuration can be used as is.

config.properties:

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
http-server.log.path=/var/log/trino/http-request.log
discovery.uri=http://172.16.16.241:8080
query.max-memory=8GB

jvm.config:

-server
-Xmx8G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics
-XX:-G1UsePreventiveGC

log.properties:
```
io.trino=INFO
```

node.properties:

node.environment=production
node.id=45C72394-AD86-581D-E603-A7D7A267771A
node.data-dir=/var/lib/trino/data
#catalog.config-dir=/etc/trino/catalog
#node.server-log-file=/var/log/trino/server.log
#node.launcher-log-file=/var/log/trino/launcher.log

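To change the catalog configuration directory, uncomment catalog.config-dir. The two log-file properties at the end have no effect when Trino runs under systemd, so they can stay commented out; they only apply when the launcher runs Trino in the background.

If other programs on the machine need a different JDK version, set the JDK explicitly before starting Trino. With the Zulu JDK, for example: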

export JAVA_HOME=/opt/zulu17.42.21-ca-crac-jdk17.0.7-linux_x64
export PATH=$JAVA_HOME/bin:$PATH
## Run in the foreground first to check that everything works
bin/launcher run
## Run as a background daemon
bin/launcher start
## Check the status
bin/launcher status

Testing and inspection

Testing

Download the official command-line client:

wget https://repo1.maven.org/maven2/io/trino/trino-cli/420/trino-cli-420-executable.jar
chmod +x trino-cli-420-executable.jar
mv trino-cli-420-executable.jar trino-cli
sudo mv trino-cli /usr/local/bin/trino-cli
trino-cli --version

Connect to Trino with the command-line client:

trino-cli --server 127.0.0.1:8080 --catalog hive --schema default

Run some SQL to check the service:

-- List all schemas (like SHOW DATABASES in MySQL)
show schemas;
-- Query a table; the fully qualified name is catalog.schema.table, hence the hive. prefix
select * from hive.xxx.xxx_member_detail;
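
The service can also be checked without the CLI. The sketch below queries the server's /v1/info endpoint with Python's standard library. The field names "starting" and "nodeVersion" are assumptions about the response of a Trino 420 server, and the helper names are illustrative; adjust them if the actual JSON differs.

```python
import json
import urllib.request


def fetch_info(base_url, timeout=5.0):
    """GET <base_url>/v1/info and return the decoded JSON object."""
    url = base_url.rstrip("/") + "/v1/info"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def is_ready(info):
    """A node is treated as ready once it reports starting == false."""
    return info.get("starting") is False


def describe(info):
    """Summarize the node in one line for logs or monitoring output."""
    version = info.get("nodeVersion", {}).get("version", "unknown")
    role = "coordinator" if info.get("coordinator") else "worker"
    state = "ready" if is_ready(info) else "starting"
    return f"Trino {version} {role}: {state}"
```

With the server from this article running, print(describe(fetch_info("http://127.0.0.1:8080"))) prints a one-line summary; the network call is kept out of the block so the helpers can be imported without a running server.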

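Inspection

The web UI shows the state of the service, the status of SQL queries, and more. Open port 8080 of the coordinator in a browser and log in with the configured account; if the cluster is integrated with LDAP, log in with the LDAP account instead.

After logging in, the page shows the cluster status, the service status, and so on.

Selecting Finished in the State filter further down lists the queries that have completed.

Improvements

Daemon

To make starting Trino easier, run it as a systemd unit. Create the file /usr/lib/systemd/system/trino.service: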

[Unit]
Description=Trino, a query engine that runs at ludicrous speed

[Service]
User=trino
Group=trino
EnvironmentFile=/opt/trino-server-420/run.env
WorkingDirectory=/opt/trino-server-420
ExecStart=/usr/bin/python3 /opt/trino-server-420/bin/launcher.py run -Djol.tryWithSudo=true
Restart=always
LimitNOFILE=131072
LimitNPROC=128000

[Install]
WantedBy=multi-user.target

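Adjust the interpreter path if the system Python is a different version or lives elsewhere.

Tip: if the unit passes -Djol.tryWithSudo=true, the trino user needs sudo privileges. On Debian family systems, run sudo usermod -a -G sudo trino (Red Hat family systems use the wheel group instead).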

# /opt/trino-server-420/run.env
JAVA_HOME="/opt/zulu17.42.21-ca-crac-jdk17.0.7-linux_x64"
PATH="/opt/zulu17.42.21-ca-crac-jdk17.0.7-linux_x64/bin:/opt/apache-maven-3.9.3/bin:/opt/apache-hive-3.1.3-bin/bin:/opt/hadoop-3.2.2/bin:/opt/hadoop-3.2.2/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

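Login without entering a user

To skip entering a user name when opening the web UI, configure a fixed user in config.properties. Keep the comment on its own line: a properties file treats text after the value as part of the value.

# The user the web UI logs in as automatically
web-ui.authentication.type=fixed
web-ui.user=hadoop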

Appendix

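This article was written by 柒 and is licensed under the Creative Commons Attribution 4.0 International License.
Please credit the source when reposting; the author reserves all rights.
Last edited: 2023-08-07 17:01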