title | slug |
---|---|
Run the Iceberg Connector Externally | /connectors/database/iceberg/yaml |
{% connectorDetailsHeader name="Iceberg" stage="BETA" platform="OpenMetadata" availableFeatures=["Metadata", "Owners"] unavailableFeatures=["Query Usage", "Data Profiler", "Data Quality", "Lineage", "Column-level Lineage", "dbt", "Tags", "Stored Procedures", "Sample Data"] / %}
In this section, we provide guides and references to use the Iceberg connector.
Configure and schedule Iceberg metadata from the OpenMetadata UI:
{% partial file="/v1.8/connectors/external-ingestion-deployment.md" /%}
Requirements
The requirements depend on the Catalog and the FileSystem used. In short, the credentials must be able to read both the Catalog and the Iceberg metadata files.
Glue Catalog
Must have `glue:GetDatabases` and `glue:GetTables` permissions to be able to read the Catalog.
Must also have the `s3:GetObject` permission for the location of the Iceberg tables.
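As a rough illustration, the permissions above could be expressed as an IAM policy document. This is a minimal sketch: the bucket name, the prefix, and the wildcard Glue resource are placeholders chosen for the example, not values required by the connector.

```python
import json


def glue_catalog_policy(bucket: str, prefix: str = "") -> dict:
    """Build a read-only IAM policy document for a Glue-backed Iceberg catalog."""
    key_prefix = prefix.strip("/")
    # Objects under the table location; an empty prefix covers the whole bucket.
    object_arn = f"arn:aws:s3:::{bucket}/{key_prefix + '/' if key_prefix else ''}*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["glue:GetDatabases", "glue:GetTables"],
                # Assumption: "*" keeps the sketch short; scope this down in practice.
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arn,
            },
        ],
    }


# Hypothetical bucket and prefix, for illustration only.
print(json.dumps(glue_catalog_policy("my-iceberg-bucket", "warehouse"), indent=2))
```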
DynamoDB Catalog
Must have `dynamodb:DescribeTable` and `dynamodb:GetItem` permissions on the Iceberg Catalog table.
Must also have the `s3:GetObject` permission for the location of the Iceberg tables.
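The DynamoDB permissions can be sketched the same way, scoped to the catalog table. The region, account ID, table name, and bucket below are hypothetical placeholders.

```python
import json


def dynamodb_catalog_policy(region: str, account_id: str, table: str, bucket: str) -> dict:
    """Build a read-only IAM policy for a DynamoDB-backed Iceberg catalog."""
    # ARN of the DynamoDB table that stores the Iceberg Catalog.
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:DescribeTable", "dynamodb:GetItem"],
                "Resource": table_arn,
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Assumption: all Iceberg files live in this one bucket.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Hypothetical identifiers, for illustration only.
policy = dynamodb_catalog_policy("us-east-2", "123456789012", "catalog_table", "my-iceberg-bucket")
print(json.dumps(policy, indent=2))
```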
Hive / REST Catalog
The requirements depend on where and how the Hive / REST Catalog is set up and where the Iceberg files are stored.
Python Requirements
{% partial file="/v1.8/connectors/python-requirements.md" /%}
To run the Iceberg ingestion, you will need to install:
pip3 install "openmetadata-ingestion[iceberg]"
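After installing, a quick check that the packages are importable can save a failed run later. The sketch below assumes that the ingestion framework is importable as `metadata` and that the `iceberg` extra pulls in `pyiceberg`; adjust the module names if your environment differs.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumption: these are the top-level modules the install provides.
missing = missing_modules(["metadata", "pyiceberg"])
print("Missing modules:", ", ".join(missing) if missing else "none")
```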
Metadata Ingestion
All connectors are defined as JSON Schemas. Here you can find the structure to create a connection to Iceberg.
In order to create and run a Metadata Ingestion workflow, we will follow the steps to create a YAML configuration able to connect to the source, process the Entities if needed, and reach the OpenMetadata server.
The workflow is modeled around the following JSON Schema
1. Define the YAML Config
This is a sample config for Iceberg using a Glue Catalog:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Service Connection
{% codeInfo srNumber=1 %}
- name: Enter the catalog name of choice.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
- awsAccessKeyId: Enter your secure access key ID for your AWS connection.
- awsSecretAccessKey: Enter the Secret Access Key (the secret paired with the access key ID above).
- awsSessionToken (optional): Enter the Session Access Token (only needed with short-lived credentials).
- awsRegion: Specify the AWS region used.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
- databaseName (optional): Enter the database name of choice. If not set, it defaults to 'default'.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
- ownershipProperty (optional): Property to use when searching for the owner. It defaults to 'owner'.
{% /codeInfo %}
{% partial file="/v1.8/connectors/yaml/database/source-config-def.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink-def.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: iceberg
  serviceName: glue_test
  serviceConnection:
    config:
      type: Iceberg
      catalog:
        name: my_glue
        connection:
          awsConfig:
            awsAccessKeyId: access key id
            awsSecretAccessKey: access secret key
            awsRegion: aws region name
        databaseName: my_database_name
      ownershipProperty: custom_owner_property
{% partial file="/v1.8/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
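If you generate the configuration programmatically, the `awsConfig` values do not have to be hardcoded. The following sketch builds that block from environment variables and includes `awsSessionToken` only when one is present. The environment variable names are the conventional AWS ones and the helper name is hypothetical; neither is mandated by the connector.

```python
import os


def aws_config_from_env(env=os.environ) -> dict:
    """Build the awsConfig mapping from environment variables."""
    config = {
        "awsAccessKeyId": env["AWS_ACCESS_KEY_ID"],
        "awsSecretAccessKey": env["AWS_SECRET_ACCESS_KEY"],
        "awsRegion": env["AWS_REGION"],
    }
    # The session token is optional: only short-lived credentials carry one.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        config["awsSessionToken"] = token
    return config


# Illustrative values only; in real use, call aws_config_from_env() with no argument.
example_env = {
    "AWS_ACCESS_KEY_ID": "example-key-id",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_REGION": "us-east-2",
}
print(aws_config_from_env(example_env))
```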
This is a sample config for Iceberg using a DynamoDB Catalog:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Service Connection
{% codeInfo srNumber=1 %}
- name: Enter the catalog name of choice.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
- tableName: Enter the name of the table where the Iceberg Catalog is stored.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
- awsAccessKeyId: Enter your secure access key ID for your AWS connection.
- awsSecretAccessKey: Enter the Secret Access Key (the secret paired with the access key ID above).
- awsSessionToken (optional): Enter the Session Access Token (only needed with short-lived credentials).
- awsRegion: Specify the AWS region used.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
- databaseName (optional): Enter the database name of choice. If not set, it defaults to 'default'.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
- ownershipProperty (optional): Property to use when searching for the owner. It defaults to 'owner'.
{% /codeInfo %}
{% partial file="/v1.8/connectors/yaml/database/source-config-def.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink-def.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: iceberg
  serviceName: glue_test
  serviceConnection:
    config:
      type: Iceberg
      catalog:
        name: my_dynamo
        connection:
          tableName: catalog_table
          awsConfig:
            awsAccessKeyId: access key id
            awsSecretAccessKey: access secret key
            awsRegion: aws region name
        databaseName: my_database_name
      ownershipProperty: custom_owner_property
{% partial file="/v1.8/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
This is a sample config for Iceberg using a Hive Catalog:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Service Connection
{% codeInfo srNumber=1 %}
- name: Enter the catalog name of choice.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
- uri: Enter the URI of the Hive Metastore. Example: 'thrift://localhost:9083'.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
- fileSystem (Optional): Enter the configuration for the file system used to store the Iceberg files.
  - Local: No configuration needed.
  - S3 (or S3-compatible):
    - awsAccessKeyId: Enter your secure access key ID for your AWS connection.
    - awsSecretAccessKey: Enter the Secret Access Key (the secret paired with the access key ID above).
    - awsSessionToken (optional): Enter the Session Access Token (only needed with short-lived credentials).
    - awsRegion: Specify the AWS region used.
    - endPointURL: Endpoint URL to use with AWS.
  - Azure:
    - clientId: Client ID of the data storage account.
    - clientSecret: Client Secret of the account.
    - tenantId: Tenant ID under which the data storage account falls.
    - accountName: Account Name of the data storage.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
- databaseName (optional): Enter the database name of choice. If not set, it defaults to 'default'.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
- ownershipProperty (optional): Property to use when searching for the owner. It defaults to 'owner'.
{% /codeInfo %}
{% partial file="/v1.8/connectors/yaml/database/source-config-def.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink-def.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: iceberg
  serviceName: glue_test
  serviceConnection:
    config:
      type: Iceberg
      catalog:
        name: my_hive
        connection:
          uri: thrift://localhost:9083
          fileSystem:
            # S3 Compatible
            awsAccessKeyId: access key id
            awsSecretAccessKey: access secret key
            awsRegion: aws region name
            # Azure
            # clientId: client_id
            # clientSecret: client_secret
            # tenantId: tenant_id
            # accountName: account_name
        databaseName: my_database_name
      ownershipProperty: custom_owner_property
{% partial file="/v1.8/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
This is a sample config for Iceberg using a REST Catalog:
{% codePreview %}
{% codeInfoContainer %}
Source Configuration - Service Connection
{% codeInfo srNumber=1 %}
- name: Enter the catalog name of choice.
{% /codeInfo %}
{% codeInfo srNumber=2 %}
- uri: Enter the URI of the REST Catalog. Example: 'http://localhost:8181'.
{% /codeInfo %}
{% codeInfo srNumber=3 %}
- credential (Optional): OAuth2 Credential to use for the authentication flow.
  - clientId: OAuth2 Client ID.
  - clientSecret: OAuth2 Client Secret.
{% /codeInfo %}
{% codeInfo srNumber=4 %}
- token (Optional): Enter the Bearer token for the 'Authorization' header.
{% /codeInfo %}
{% codeInfo srNumber=5 %}
- ssl (Optional): SSL configuration for the connection.
  - caCertPath: CA Certificate Path.
  - clientCertPath: Client Certificate Path.
  - privateKeyPath: Private Key Path.
{% /codeInfo %}
{% codeInfo srNumber=6 %}
- sigv4 (Optional): Used when signing requests with the AWS SigV4 protocol.
  - signingRegion: AWS Region used when signing a request.
  - signingName: Name used to sign the request with.
{% /codeInfo %}
{% codeInfo srNumber=7 %}
- fileSystem (Optional): Enter the configuration for the file system used to store the Iceberg files.
  - Local: No configuration needed.
  - S3 (or S3-compatible):
    - awsAccessKeyId: Enter your secure access key ID for your AWS connection.
    - awsSecretAccessKey: Enter the Secret Access Key (the secret paired with the access key ID above).
    - awsSessionToken (optional): Enter the Session Access Token (only needed with short-lived credentials).
    - awsRegion: Specify the AWS region used.
    - endPointURL: Endpoint URL to use with AWS.
  - Azure:
    - clientId: Client ID of the data storage account.
    - clientSecret: Client Secret of the account.
    - tenantId: Tenant ID under which the data storage account falls.
    - accountName: Account Name of the data storage.
{% /codeInfo %}
{% codeInfo srNumber=8 %}
- databaseName (optional): Enter the database name of choice. If not set, it defaults to 'default'.
{% /codeInfo %}
{% codeInfo srNumber=9 %}
- warehouseLocation (optional): Warehouse Location. Used to specify a custom warehouse location if needed.
Most Catalogs should have a working default warehouse location.
{% /codeInfo %}
{% codeInfo srNumber=10 %}
- ownershipProperty (optional): Property to use when searching for the owner. It defaults to 'owner'.
{% /codeInfo %}
{% partial file="/v1.8/connectors/yaml/database/source-config-def.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink-def.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config-def.md" /%}
{% /codeInfoContainer %}
{% codeBlock fileName="filename.yaml" %}
source:
  type: iceberg
  serviceName: glue_test
  serviceConnection:
    config:
      type: Iceberg
      catalog:
        name: my_rest
        connection:
          uri: http://localhost:8181
          credential:
            clientId: client_id
            clientSecret: client_secret
          token: my_bearer_token
          ssl:
            caCertPath: ./ca_cert.pem
            clientCertPath: ./client_cert.crt
            privateKeyPath: ./private.key
          sigv4:
            signingRegion: us-east-2
            signingName: signing_name
          fileSystem:
            # S3 compatible
            awsAccessKeyId: access key id
            awsSecretAccessKey: access secret key
            awsRegion: aws region name
            # Azure
            # clientId: client_id
            # clientSecret: client_secret
            # tenantId: tenant_id
            # accountName: account_name
        databaseName: my_database_name
        warehouseLocation: warehouse_location
      ownershipProperty: custom_owner_property
{% partial file="/v1.8/connectors/yaml/database/source-config.md" /%}
{% partial file="/v1.8/connectors/yaml/ingestion-sink.md" /%}
{% partial file="/v1.8/connectors/yaml/workflow-config.md" /%}
{% /codeBlock %}
{% /codePreview %}
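To make the `token` option concrete: a Bearer token is conventionally sent as `Authorization: Bearer <token>` on each request. The helper below is a hypothetical illustration of that header format, not the connector's own code.

```python
from typing import Optional


def rest_auth_headers(token: Optional[str] = None) -> dict:
    """Return request headers, adding a Bearer Authorization header when a token is set."""
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# Placeholder token, matching the sample config above.
print(rest_auth_headers("my_bearer_token"))
```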
{% partial file="/v1.8/connectors/yaml/ingestion-cli.md" /%}
Securing Rest Catalog Connection with SSL in OpenMetadata
When using SSL to establish secure connections between OpenMetadata and the REST Catalog, you can specify `caCertPath` to provide the CA certificate used for SSL validation. Alternatively, if both client and server require mutual authentication, you'll need to use all three parameters: `privateKeyPath`, `clientCertPath`, and `caCertPath`. In this case, `clientCertPath` is used for the client's SSL certificate, `privateKeyPath` for the private key associated with the SSL certificate, and `caCertPath` for the CA certificate to validate the server's certificate.
ssl:
  caCertPath: ./ca_cert.pem
  clientCertPath: ./client_cert.crt
  privateKeyPath: ./private.key
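The two supported combinations (CA certificate only, or all three files for mutual authentication) can be checked before running the workflow. This is a minimal sketch: the `ssl_mode` helper and its return labels are hypothetical, and treating any other combination as an error is an assumption of the example rather than documented connector behavior.

```python
def ssl_mode(ssl_config: dict) -> str:
    """Classify an ssl block as server validation only or mutual authentication."""
    ca = ssl_config.get("caCertPath")
    cert = ssl_config.get("clientCertPath")
    key = ssl_config.get("privateKeyPath")

    if ca and cert and key:
        # Client certificate, private key, and CA: both sides authenticate.
        return "mutual"
    if ca and not cert and not key:
        # CA only: the client validates the server's certificate.
        return "server-validation"
    # Assumption: partial or empty combinations are treated as misconfiguration.
    raise ValueError("Provide caCertPath alone, or all three SSL paths together.")


print(ssl_mode({"caCertPath": "./ca_cert.pem"}))
print(ssl_mode({
    "caCertPath": "./ca_cert.pem",
    "clientCertPath": "./client_cert.crt",
    "privateKeyPath": "./private.key",
}))
```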