Accessing MinIO, an Amazon S3-compatible object store, from Trino with the Hive connector

What is this post about?

I wanted to try accessing an object store such as Amazon S3 from Trino.

This time I'll use MinIO, an Amazon S3-compatible object store, and access it from Trino.

MinIO | High Performance, Kubernetes Native Object Storage

To access an object store like Amazon S3 from Trino, the Hive connector looks like the right tool.

Hive connector — Trino 393 Documentation

Hive connector

The Hive connector documentation is here.

Hive connector — Trino 393 Documentation

The documentation for accessing Amazon S3 with the Hive connector is here.

Hive connector with Amazon S3 — Trino 393 Documentation

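The Hive connector lets you query data stored in an Apache Hive data warehouse.

According to this documentation, Apache Hive is a combination of the following three components.

Data files in various formats, stored in HDFS (the Apache Hadoop distributed file system) or in an object store such as Amazon S3
Metadata describing how the data files map to schemas and tables
- The metadata is stored in a database such as MySQL and accessed through the Hive metastore service
A query language called HiveQL
- HiveQL is executed on a distributed computing framework such as MapReduce or Tez

Of these Apache Hive components, Trino uses only the first two: the data and the metadata.
It does not use HiveQL or any part of Hive's execution environment.

In other words, it uses only the Hive metastore service and the data actually stored in the storage.

This blog entry describes how the runtime that executes Apache Hive queries (HiveQL) is replaced by the Trino runtime.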

Trino | A gentle introduction to the Hive connector

Anyway, let's try running it.

Environment

Here is the environment used this time.

$ java --version
openjdk 17.0.4 2022-07-19
OpenJDK Runtime Environment (build 17.0.4+8-Ubuntu-120.04)
OpenJDK 64-Bit Server VM (build 17.0.4+8-Ubuntu-120.04, mixed mode, sharing)


$ python -V
Python 2.7.18

The Trino version. Only the Trino CLI version is shown here, but the server is the same version.

$ ./trino --version
Trino CLI 393

The AWS CLI, used to operate MinIO.

$ aws --version
aws-cli/2.7.25 Python/3.9.11 Linux/5.4.0-124-generic exe/x86_64.ubuntu.20 prompt/off

The MinIO version is as follows.

$ minio --version
minio version RELEASE.2022-08-13T21-54-44Z (commit-id=49862ba3470335decccecb27649167025e18c406)
Runtime: go1.18.5 linux/amd64
License: GNU AGPLv3 <https://www.gnu.org/licenses/agpl-3.0.html>
Copyright: 2015-2022 MinIO, Inc.

MinIO is assumed to be running at 172.17.0.2.

$ MINIO_ROOT_USER=minioadmin MINIO_ROOT_PASSWORD=minioadmin minio server /var/lib/minio/data --console-address :9001

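The task

The task this time is to make CSV files stored in MinIO queryable from Trino. The data is themed on Sazae-san.
I'll also try saving the queried data as a separate table in Parquet format.

Installing the Apache Hive metastore service

According to the Requirements section of the Hive connector documentation, using the Hive connector requires a Hive metastore service or a compatible implementation of the Hive metastore (for example, AWS Glue Data Catalog).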

The Hive connector requires a Hive metastore service (HMS), or a compatible implementation of the Hive metastore, such as AWS Glue Data Catalog.

Hive connector / Requirements

This time I'll use the standalone Hive metastore service provided by Apache Hive.

Apache Hive

The Hive metastore service documentation is here. It covers version 3.0 and later (earlier versions are on a separate page).

AdminManual Metastore 3.0 Administration - Apache Hive - Apache Software Foundation

Download it from here.

Downloads

$ curl -LO https://dlcdn.apache.org/hive/hive-standalone-metastore-3.0.0/hive-standalone-metastore-3.0.0-bin.tar.gz

Extract it and move into the directory.

$ tar xf hive-standalone-metastore-3.0.0-bin.tar.gz
$ cd apache-hive-metastore-3.0.0-bin

Let's look at the directory layout.

$ tree -d
.
├── bin
│   └── ext
├── binary-package-licenses
├── conf
├── lib
│   ├── php
│   │   └── packages
│   │       └── hive_metastore
│   │           └── metastore
│   └── py
│       └── hive_metastore
└── scripts
    └── metastore
        └── upgrade
            ├── derby
            ├── mssql
            ├── mysql
            ├── oracle
            └── postgres

19 directories

The default configuration file looks like this.

conf/metastore-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- These are default values meant to allow easy smoke testing of the metastore.  You will
likely need to add a number of new values. -->
<configuration>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>
</configuration>

metastore.thrift.uris is the URL for accessing the Hive metastore service over the Thrift protocol.

I modified the configuration file slightly, to the following.

conf/metastore-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- These are default values meant to allow easy smoke testing of the metastore.  You will
likely need to add a number of new values. -->
<configuration>
  <property>
    <name>metastore.thrift.uris</name>
    <value>thrift://localhost:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  <property>
    <name>metastore.task.threads.always</name>
    <value>org.apache.hadoop.hive.metastore.events.EventCleanerTask</value>
  </property>
  <property>
    <name>metastore.expression.proxy</name>
    <value>org.apache.hadoop.hive.metastore.DefaultPartitionExpressionProxy</value>
  </property>

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.17.0.2:9000</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>

What I added to the default configuration is this part, which is for connecting to MinIO.

  <property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minioadmin</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://172.17.0.2:9000</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
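
A typo in one of these property names is easy to miss, and the failure only shows up later, when a table is created. As a rough aid, here is a minimal Python sketch that parses the contents of metastore-site.xml and reports which of the five fs.s3a.* properties added above are missing or empty. The helper names (read_properties, missing_s3a_properties) are my own and are not part of Hive or Hadoop.

```python
import xml.etree.ElementTree as ET

# The five properties added to metastore-site.xml in this post.
REQUIRED_S3A_PROPERTIES = (
    "fs.s3a.impl",
    "fs.s3a.access.key",
    "fs.s3a.secret.key",
    "fs.s3a.endpoint",
    "fs.s3a.path.style.access",
)


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the XML text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_s3a_properties(xml_text):
    """Return the required fs.s3a.* names that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [name for name in REQUIRED_S3A_PROPERTIES if not props.get(name)]


if __name__ == "__main__":
    # Hypothetical usage: run from the metastore installation directory.
    with open("conf/metastore-site.xml", "rb") as f:
        print(missing_s3a_properties(f.read()))
```

An empty list means all five properties are present. It says nothing about whether the endpoint or the credentials are actually correct.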

By the way, it seems Apache Hadoop is required to start the Hive metastore service.

$ bin/start-metastore
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path

So, download Apache Hadoop as well.

Apache Hadoop / Download

$ curl -LO https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

Extract it and move into the directory.

$ tar xf hadoop-3.3.4.tar.gz
$ cd hadoop-3.3.4

It looks like this.

$ ll
合計 124
drwxr-xr-x 10 xxxxx xxxxx  4096  7月 29 22:44 ./
drwxrwxr-x  3 xxxxx xxxxx  4096  8月 20 10:09 ../
-rw-rw-r--  1 xxxxx xxxxx 24707  7月 29 05:30 LICENSE-binary
-rw-rw-r--  1 xxxxx xxxxx 15217  7月 17 03:20 LICENSE.txt
-rw-rw-r--  1 xxxxx xxxxx 29473  7月 17 03:20 NOTICE-binary
-rw-rw-r--  1 xxxxx xxxxx  1541  4月 22 23:58 NOTICE.txt
-rw-rw-r--  1 xxxxx xxxxx   175  4月 22 23:58 README.txt
drwxr-xr-x  2 xxxxx xxxxx  4096  7月 29 22:44 bin/
drwxr-xr-x  3 xxxxx xxxxx  4096  7月 29 21:35 etc/
drwxr-xr-x  2 xxxxx xxxxx  4096  7月 29 22:44 include/
drwxr-xr-x  3 xxxxx xxxxx  4096  7月 29 22:44 lib/
drwxr-xr-x  4 xxxxx xxxxx  4096  7月 29 22:44 libexec/
drwxr-xr-x  2 xxxxx xxxxx  4096  7月 29 22:44 licenses-binary/
drwxr-xr-x  3 xxxxx xxxxx  4096  7月 29 21:35 sbin/
drwxr-xr-x  4 xxxxx xxxxx  4096  7月 29 23:21 share/

Set the environment variable HADOOP_HOME to the Apache Hadoop installation directory.

$ export HADOOP_HOME=/path/to/hadoop-3.3.4

Incidentally, it seems Apache Hadoop itself does not need to be running. It is only needed as a set of modules.

Also, for the Hive metastore service to access MinIO (or rather, Amazon S3), Hadoop AWS and the AWS SDK for Java must be added to the classpath. The HADOOP_CLASSPATH environment variable is used for this.

$ export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tools/lib/aws-java-sdk-bundle-1.12.262.jar:${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-aws-3.3.4.jar
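
The jar file names include version numbers, so this value has to be edited whenever the Hadoop version changes. Below is a minimal Python sketch that builds the same value by globbing, assuming the two jars live under share/hadoop/tools/lib as they do in the hadoop-3.3.4 layout used here. build_hadoop_classpath is a hypothetical helper of my own, not a Hadoop tool.

```python
import glob
import os


def build_hadoop_classpath(hadoop_home):
    """Join the AWS SDK bundle and hadoop-aws jars found under tools/lib."""
    tools_lib = os.path.join(hadoop_home, "share", "hadoop", "tools", "lib")
    jars = []
    for pattern in ("aws-java-sdk-bundle-*.jar", "hadoop-aws-*.jar"):
        matches = sorted(glob.glob(os.path.join(tools_lib, pattern)))
        if not matches:
            # Fail early instead of starting the metastore with a broken classpath.
            raise FileNotFoundError("no jar matching %s in %s" % (pattern, tools_lib))
        jars.extend(matches)
    return os.pathsep.join(jars)


if __name__ == "__main__":
    # Prints a line that can be pasted into the shell.
    print("export HADOOP_CLASSPATH=" + build_hadoop_classpath(os.environ["HADOOP_HOME"]))
```

Failing when a jar is not found is a deliberate choice, because a missing jar otherwise only surfaces later as a ClassNotFoundException when a schema is created.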

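The Hive metastore service also appears to need an RDBMS, and its schema has to be initialized.

This time I initialized it as follows, from the apache-hive-metastore-3.0.0-bin directory.

$ bin/schematool -initSchema -dbType derby

For the RDBMS, I use the embedded Apache Derby.

AdminManual Metastore 3.0 Administration / RDBMS / Option 1: Embedding Derby

Then start it.

$ bin/start-metastore

Incidentally, if you start it without initializing the schema, you get an error like this.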

MetaException(message:Version information not found in metastore.)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:84)
        at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:93)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:8541)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:8536)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:8806)
        at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:8723)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

If you then try to initialize the schema from that state, that fails as well.

Schema initialization FAILED! Metastore state would be inconsistent !!
Underlying cause: java.io.IOException : Schema script failed, errorcode OTHER
Use --verbose for detailed stacktrace.
*** schemaTool failed ***

When this happens, deleting the metastore_db directory seems to be enough.

$ rm -rf metastore_db

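That was a bit of a digression, but the Hive metastore service is now ready.

Preparing the test data

Let's prepare the test data to put into MinIO. The theme is Sazae-san.

I prepared three CSV files, each with a header on the first line.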

input/isono-family.csv

family_id,id,first_name,last_name,age
1,1,サザエ,フグ田,24
1,2,マスオ,フグ田,28
1,3,波平,磯野,54
1,4,フネ,磯野,50
1,5,カツオ,磯野,11
1,6,ワカメ,磯野,9
1,7,タラオ,フグ田,3

input/namino-family.csv

family_id,id,first_name,last_name,age
2,1,ノリスケ,波野,26
2,2,タイコ,波野,22
2,3,イクラ,波野,1

input/isasaka-family.csv

family_id,id,first_name,last_name,age
3,1,難物,伊佐坂,60
3,2,お軽,伊佐坂,50
3,3,甚六,伊佐坂,20
3,4,浮江,伊佐,16
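
The table definition used later skips exactly one header line and expects five columns, so it can help to sanity-check the files before uploading them. Here is a small sketch using only the Python standard library. check_family_csv is a name I made up for illustration.

```python
import csv
import io

EXPECTED_HEADER = ["family_id", "id", "first_name", "last_name", "age"]


def check_family_csv(text):
    """Validate the header and column counts; return the number of data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError("unexpected header: %r" % (rows[0] if rows else None))
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError("line %d has %d fields" % (line_no, len(row)))
    return len(rows) - 1


if __name__ == "__main__":
    # Assumes the three files sit in the input directory, as in this post.
    for name in ("isono-family.csv", "namino-family.csv", "isasaka-family.csv"):
        with open("input/" + name, encoding="utf-8", newline="") as f:
            print(name, check_family_csv(f.read()))
```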

Upload these CSV files to MinIO with the AWS CLI.

Set the credentials.

$ export AWS_ACCESS_KEY_ID=minioadmin
$ export AWS_SECRET_ACCESS_KEY=minioadmin
$ export AWS_DEFAULT_REGION=ap-northeast-1

The endpoint is this.

$ MINIO_ENDPOINT=http://172.17.0.2:9000

Create the bucket.

$ aws s3 mb --endpoint-url $MINIO_ENDPOINT s3://trino-bucket

Upload with sync.

$ aws s3 sync --endpoint-url $MINIO_ENDPOINT input s3://trino-bucket/input
upload: input/isono-family.csv to s3://trino-bucket/input/isono-family.csv
upload: input/isasaka-family.csv to s3://trino-bucket/input/isasaka-family.csv
upload: input/namino-family.csv to s3://trino-bucket/input/namino-family.csv

Check the result.

$ aws s3 ls --endpoint-url $MINIO_ENDPOINT trino-bucket/input/
2022-08-24 00:06:51        131 isasaka-family.csv
2022-08-24 00:06:51        207 isono-family.csv
2022-08-24 00:06:51        112 namino-family.csv

The data is now ready too.

Reading data in MinIO from Trino

Now let's access MinIO from Trino and read the CSV files uploaded earlier.

Create directories inside the Trino installation directory.

$ mkdir -p etc/catalog data

I created the configuration files like this.

etc/node.properties

node.environment=my_trino
node.id=340fae6b-55fe-486e-b122-d0fbe61d0ebb
node.data-dir=../data

etc/jvm.config

-server
-Xmx2G
-XX:InitialRAMPercentage=80
-XX:MaxRAMPercentage=80
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
-XX:+UnlockDiagnosticVMOptions
-XX:+UseAESCTRIntrinsics

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://192.168.0.6:8080

Next, create the configuration file that connects to MinIO using the Hive connector. The catalog is named minio.

etc/catalog/minio.properties

connector.name=hive
hive.metastore.uri=thrift://localhost:9083
hive.storage-format=ORC
hive.non-managed-table-writes-enabled=true
hive.non-managed-table-creates-enabled=true

hive.s3.aws-access-key=minioadmin
hive.s3.aws-secret-key=minioadmin
hive.s3.endpoint=http://172.17.0.2:9000
hive.s3.path-style-access=true
#hive.s3select-pushdown.enabled=true

I set the file contents based on these pages.

Hive connector — Trino 393 Documentation

Hive connector with Amazon S3 — Trino 393 Documentation

Set hive.metastore.uri to the Thrift URL of the Hive metastore service prepared earlier.
The hive.s3.* properties are the settings for accessing MinIO.
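
As far as I understand it, hive.s3.path-style-access=true matters for MinIO here because the endpoint is a bare IP address and port. With path-style access the bucket name goes into the URL path, whereas virtual-hosted-style access puts it into the host name, which cannot be resolved for an address like 172.17.0.2. The sketch below only illustrates the two URL shapes. The function names are hypothetical, and this is not how Trino builds its requests internally.

```python
from urllib.parse import urlsplit, urlunsplit


def path_style_url(endpoint, bucket, key):
    """Path-style: the bucket is the first segment of the URL path."""
    parts = urlsplit(endpoint)
    return urlunsplit((parts.scheme, parts.netloc, "/%s/%s" % (bucket, key), "", ""))


def virtual_hosted_url(endpoint, bucket, key):
    """Virtual-hosted-style: the bucket is prepended to the host name."""
    parts = urlsplit(endpoint)
    return urlunsplit((parts.scheme, "%s.%s" % (bucket, parts.netloc), "/" + key, "", ""))


if __name__ == "__main__":
    endpoint = "http://172.17.0.2:9000"
    print(path_style_url(endpoint, "trino-bucket", "input/isono-family.csv"))
    print(virtual_hosted_url(endpoint, "trino-bucket", "input/isono-family.csv"))
```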

Now start Trino.

$ bin/launcher run

Connect with the Trino CLI.

$ ./trino
trino>

Create a schema, specifying the MinIO bucket that the CSV files were uploaded to.

trino> create schema minio.bucket with(location = 's3a://trino-bucket/');
CREATE SCHEMA

Incidentally, if Hadoop AWS and the AWS SDK for Java are not on the classpath of the Hive metastore service at this point, it fails like this.

trino> create schema minio.foo with(location = 's3a://trino-bucket/');
CREATE SCHEMA
Query 20220820_030243_00001_cg7ia failed: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Next, create the table.

create table minio.bucket.people (
  family_id varchar,
  id varchar,
  first_name varchar,
  last_name varchar,
  age varchar
) with (
  format = 'csv',
  csv_separator = ',',
  csv_quote = '"',
  csv_escape = '"',
  skip_header_line_count = 1,
  external_location = 's3a://trino-bucket/input'
);

It seems the with clause lets you specify properties for the table.

CREATE TABLE — Trino 393 Documentation

For Apache Hive tables, format seems to accept the values listed here.

Hive connector / Supported file types

https://github.com/trinodb/trino/blob/393/plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveStorageFormat.java#L60-L104

For the other properties, I checked here.

https://github.com/trinodb/trino/blob/393/plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableProperties.java#L49-L68

The location to read the files from.

  external_location = 's3a://trino-bucket/input'

The CSV format settings and the number of header lines to skip.

  csv_separator = ',',
  csv_quote = '"',
  csv_escape = '"',
  skip_header_line_count = 1,

By the way, if the Hive metastore service configuration does not include the settings for accessing MinIO, it fails at this point.

Query 20220823_152404_00002_t3aw5 failed: Got exception: java.nio.file.AccessDeniedException s3a://trino-bucket/input: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: No AWS Credentials provided by TemporaryAWSCredentialsProvider SimpleAWSCredentialsProvider EnvironmentVariableCredentialsProvider IAMInstanceCredentialsProvider : com.amazonaws.SdkClientException: Unable to load AWS credentials from environment variables (AWS_ACCESS_KEY_ID (or AWS_ACCESS_KEY) and AWS_SECRET_KEY (or AWS_SECRET_ACCESS_KEY))

Also, when the format is CSV, column types appear to be limited to varchar. Suppose you specify something like this:

create table minio.bucket.people (
  family_id integer,
  id integer,
  first_name varchar,
  last_name varchar,
  age integer
) with (
  format = 'csv',
  csv_separator = ',',
  csv_quote = '"',
  csv_escape = '"',
  skip_header_line_count = 1,
  external_location = 's3a://trino-bucket/input'
);

This results in the following error.

Query 20220822_142349_00007_89i97 failed: Hive CSV storage format only supports VARCHAR (unbounded). Unsupported columns: family_id integer, id integer, age integer

For this, it's best to look at the Apache Hive documentation.

Limitation

This SerDe treats all columns to be of type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output would show string column type. The type information is retrieved from the SerDe. To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.

CSV Serde - Apache Hive - Apache Software Foundation

With the CSV format the columns are varchar, but type conversion is possible, so let's give it a quick try.

Conversion functions — Trino 393 Documentation

trino> select family_id, id, first_name, last_name, cast(age as integer) as int_age from minio.bucket.people order by int_age;
 family_id | id | first_name | last_name | int_age
-----------+----+------------+-----------+---------
 2         | 3  | イクラ     | 波野      |       1
 1         | 7  | タラオ     | フグ田    |       3
 1         | 6  | ワカメ     | 磯野      |       9
 1         | 5  | カツオ     | 磯野      |      11
 3         | 4  | 浮江       | 伊佐      |      16
 3         | 3  | 甚六       | 伊佐坂    |      20
 2         | 2  | タイコ     | 波野      |      22
 1         | 1  | サザエ     | フグ田    |      24
 2         | 1  | ノリスケ   | 波野      |      26
 1         | 2  | マスオ     | フグ田    |      28
 3         | 2  | お軽       | 伊佐坂    |      50
 1         | 4  | フネ       | 磯野      |      50
 1         | 3  | 波平       | 磯野      |      54
 3         | 1  | 難物       | 伊佐坂    |      60
(14 rows)

Query 20220823_153714_00004_t3aw5, FINISHED, 1 node
Splits: 9 total, 9 done (100.00%)
1.21 [14 rows, 450B] [11 rows/s, 372B/s]

This is the result.
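
The Hive documentation quoted above suggests creating a view over the table that does the CAST. I haven't tried that in this post, but as a sketch, a small Python helper could generate such a view definition from a list of columns and target types. The view name minio.bucket.people_typed and the helper build_typed_view_sql are hypothetical, and I'm assuming the statement would then be run through the Trino CLI.

```python
def build_typed_view_sql(view, table, columns):
    """Build a CREATE VIEW statement that casts every non-varchar column.

    columns is a list of (column_name, target_type) pairs.
    """
    select_items = []
    for name, sql_type in columns:
        if sql_type == "varchar":
            # Already varchar in the CSV table, so no cast is needed.
            select_items.append("  " + name)
        else:
            select_items.append("  cast(%s as %s) as %s" % (name, sql_type, name))
    body = ",\n".join(select_items)
    return "create view %s as\nselect\n%s\nfrom %s" % (view, body, table)


if __name__ == "__main__":
    print(build_typed_view_sql(
        "minio.bucket.people_typed",  # hypothetical view name
        "minio.bucket.people",
        [
            ("family_id", "integer"),
            ("id", "integer"),
            ("first_name", "varchar"),
            ("last_name", "varchar"),
            ("age", "integer"),
        ],
    ))
```

Queries against such a view would then not need to repeat the cast each time.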

A like search.

trino> select family_id, id, concat(last_name, first_name) as name, age from minio.bucket.people where concat(last_name, first_name) like '磯野%';
 family_id | id |    name    | age
-----------+----+------------+-----
 1         | 3  | 磯野波平   | 54
 1         | 4  | 磯野フネ   | 50
 1         | 5  | 磯野カツオ | 11
 1         | 6  | 磯野ワカメ | 9
(4 rows)

Query 20220823_153733_00005_t3aw5, FINISHED, 1 node
Splits: 3 total, 3 done (100.00%)
0.68 [14 rows, 450B] [20 rows/s, 662B/s]

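In this way, I was able to read data from a table backed by the CSV files uploaded to MinIO.

Writing data to MinIO from Trino

Finally, let's create a Parquet-format table from the data in the CSV-format table created earlier.

This is a data write.

Create the table from the result of a select statement, like this.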

create table minio.bucket.people_parquet
with (
  format = 'parquet',
  external_location = 's3a://trino-bucket/output'
)
as select
  cast(family_id as integer) as family_id,
  cast(id as integer) as id,
  first_name,
  last_name,
  cast(age as integer) as age
from minio.bucket.people;

The family_id, id, and age columns are cast to integer.

The data is stored in a different location from the CSV files.

  external_location = 's3a://trino-bucket/output'

By the way, to run this create table statement, hive.non-managed-table-writes-enabled must be set to true in the Hive connector configuration.

If it isn't, you get an error like this.

Query 20220822_145855_00021_89i97 failed: Writes to non-managed Hive tables is disabled

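Given that the default is false, and that large amounts of data may need to be written, this kind of job may be better done with something like Apache Spark.

Unlike with the CSV format, the table definition can use types other than varchar.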

trino> desc minio.bucket.people_parquet;
   Column   |  Type   | Extra | Comment
------------+---------+-------+---------
 family_id  | integer |       |
 id         | integer |       |
 first_name | varchar |       |
 last_name  | varchar |       |
 age        | integer |       |
(5 rows)

Query 20220823_154211_00007_t3aw5, FINISHED, 1 node
Splits: 7 total, 7 done (100.00%)
0.38 [5 rows, 323B] [13 rows/s, 846B/s]

Check the data.

trino> select * from minio.bucket.people_parquet order by age;
 family_id | id | first_name | last_name | age
-----------+----+------------+-----------+-----
         2 |  3 | イクラ     | 波野      |   1
         1 |  7 | タラオ     | フグ田    |   3
         1 |  6 | ワカメ     | 磯野      |   9
         1 |  5 | カツオ     | 磯野      |  11
         3 |  4 | 浮江       | 伊佐      |  16
         3 |  3 | 甚六       | 伊佐坂    |  20
         2 |  2 | タイコ     | 波野      |  22
         1 |  1 | サザエ     | フグ田    |  24
         2 |  1 | ノリスケ   | 波野      |  26
         1 |  2 | マスオ     | フグ田    |  28
         3 |  2 | お軽       | 伊佐坂    |  50
         1 |  4 | フネ       | 磯野      |  50
         1 |  3 | 波平       | 磯野      |  54
         3 |  1 | 難物       | 伊佐坂    |  60
(14 rows)

Query 20220823_154254_00008_t3aw5, FINISHED, 1 node
Splits: 7 total, 7 done (100.00%)
0.29 [14 rows, 2.04KB] [48 rows/s, 7.09KB/s]

Looks good.

Check on MinIO as well.

$ aws s3 ls --endpoint-url $MINIO_ENDPOINT trino-bucket/output/
2022-08-24 00:39:40       1479 20220823_153938_00006_t3aw5_c29808d5-1047-486f-a54b-c656d5fc6bbd

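The contents are binary, so I won't show them.

That's about it for this time.

Summary

I tried accessing MinIO, an Amazon S3-compatible object store, from Trino.

Even though the query engine is Trino, I had never worked with Apache Hive in the first place, so I struggled quite a bit.
I didn't really understand where the Hive metastore service fits in, or how to configure access to an object store such as Amazon S3.
I was also surprised that Apache Hadoop is required.

Once I got past the Hive metastore service, I was then stuck on how to specify Apache Hive table properties in Trino. I searched around and ended up reading the source code.

Well, I got as far as I was aiming for, so I'll call it good.

Next time, the hurdle for working with the Hive connector in Trino should be a little lower.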
