mirror of
https://github.com/datahub-project/datahub.git
synced 2025-10-04 05:26:24 +00:00
docs(ingestion/redshift): update documentation to cover svv and stv system tables (#14727)
This commit is contained in:
parent
10649f3f38
commit
7e8049bfe6
@ -4,9 +4,9 @@ title: Setup
|
||||
|
||||
# Redshift Ingestion Guide: Setup & Prerequisites
|
||||
|
||||
To configure ingestion from Redshift, you'll need a [User](https://docs.aws.amazon.com/redshift/latest/gsg/t_adding_redshift_user_cmd.html) configured with the proper permission sets, and an associated.
|
||||
To configure ingestion from Redshift, you'll need a [User](https://docs.aws.amazon.com/redshift/latest/gsg/t_adding_redshift_user_cmd.html) configured with the proper permission sets.
|
||||
|
||||
This setup guide will walk you through the steps you'll need to take via your Google Cloud Console.
|
||||
This setup guide will walk you through the steps you'll need to take in your Amazon Redshift cluster.
|
||||
|
||||
## Redshift Prerequisites
|
||||
|
||||
@ -20,13 +20,63 @@ CREATE USER datahub WITH PASSWORD 'Datahub1234';
|
||||
|
||||
## Redshift Setup
|
||||
|
||||
1. Grant the following permission to your `datahub` user:
|
||||
1. Grant the following permissions to your `datahub` user. For most users, the **minimal set** below will be sufficient:
|
||||
|
||||
### Minimal Required Permissions (Recommended)
|
||||
|
||||
```sql
|
||||
-- Core system access (required for lineage and usage statistics)
|
||||
ALTER USER datahub WITH SYSLOG ACCESS UNRESTRICTED;
|
||||
GRANT SELECT ON pg_catalog.svv_table_info to datahub;
|
||||
GRANT SELECT ON pg_catalog.svl_user_info to datahub;
|
||||
|
||||
-- Core metadata extraction (always required)
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub;
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub;
|
||||
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub;
|
||||
GRANT SELECT ON pg_catalog.svv_table_info TO datahub;
|
||||
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub;
|
||||
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub;
|
||||
GRANT SELECT ON pg_catalog.pg_class_info TO datahub;
|
||||
|
||||
-- Datashare lineage (enabled by default)
|
||||
GRANT SELECT ON pg_catalog.svv_datashares TO datahub;
|
||||
|
||||
-- Choose ONE based on your Redshift type:
|
||||
-- For Provisioned Clusters:
|
||||
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub;
|
||||
|
||||
-- For Serverless Workgroups:
|
||||
-- GRANT SELECT ON pg_catalog.svv_user_info TO datahub;
|
||||
-- GRANT SELECT ON pg_catalog.svv_mv_info TO datahub;
|
||||
```
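
The "choose one" step above can also be scripted. The following is a minimal sketch, not part of DataHub: the helper name `minimal_grants` and its `deployment` argument are hypothetical, and the provisioned/serverless split of the user-info and materialized-view objects is assumed from the grants listed in this guide.

```python
# Hypothetical helper: builds the minimal GRANT list described above.
# Object names are copied from the SQL block; nothing here talks to Redshift.
COMMON_VIEWS = [
    "svv_redshift_databases",
    "svv_redshift_schemas",
    "svv_external_schemas",
    "svv_table_info",
    "svv_external_tables",
    "svv_external_columns",
    "pg_class_info",
    "svv_datashares",
]

# Objects that differ by deployment type (assumed split, see lead-in).
DEPLOYMENT_VIEWS = {
    "provisioned": ["svl_user_info", "stv_mv_info"],
    "serverless": ["svv_user_info", "svv_mv_info"],
}


def minimal_grants(user: str, deployment: str) -> list[str]:
    """Return the ALTER USER and GRANT statements for one deployment type."""
    if deployment not in DEPLOYMENT_VIEWS:
        raise ValueError(f"unknown deployment type: {deployment!r}")
    statements = [f"ALTER USER {user} WITH SYSLOG ACCESS UNRESTRICTED;"]
    for view in COMMON_VIEWS + DEPLOYMENT_VIEWS[deployment]:
        statements.append(f"GRANT SELECT ON pg_catalog.{view} TO {user};")
    return statements
```

Calling `minimal_grants("datahub", "serverless")` returns the statements as strings, which can be reviewed before a superuser runs them.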
|
||||
|
||||
### Data Access Permissions (Required for Profiling/Classification)
|
||||
|
||||
**Important**: The above permissions only provide access to metadata. For data profiling, classification, or any feature that reads actual table data, you need:
|
||||
|
||||
```sql
|
||||
-- Schema access (required to access tables within schemas)
|
||||
GRANT USAGE ON SCHEMA public TO datahub;
|
||||
GRANT USAGE ON SCHEMA your_schema_name TO datahub;
|
||||
|
||||
-- Table data access (required for profiling and classification)
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub;
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub;
|
||||
|
||||
-- For production environments (future tables/views):
|
||||
-- IMPORTANT: Only works for objects created by the user running this command
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON TABLES TO datahub;
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON VIEWS TO datahub;
|
||||
--
|
||||
-- Alternative: Run this periodically to catch all new objects regardless of creator:
|
||||
-- GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub;
|
||||
```
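
When several schemas need the same data-access grants, the statements above can be generated rather than typed by hand. This is a rough sketch under stated assumptions: `data_access_grants` is a hypothetical helper, and schema names are assumed to be plain identifiers that need no quoting.

```python
# Hypothetical helper: emits the USAGE and SELECT grants shown above
# for each schema. Assumes simple, unquoted schema identifiers.
def data_access_grants(user: str, schemas: list[str]) -> list[str]:
    """Return USAGE and SELECT grants for each schema, in input order."""
    statements = []
    for schema in schemas:
        statements.append(f"GRANT USAGE ON SCHEMA {schema} TO {user};")
        statements.append(
            f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {user};"
        )
    return statements
```

Re-running the generated `GRANT SELECT ON ALL TABLES` statements on a schedule corresponds to the "run this periodically" alternative mentioned in the block above.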
|
||||
|
||||
### Additional Permissions (Only if needed)
|
||||
|
||||
```sql
|
||||
-- Only if using shared databases (datashare consumers):
|
||||
-- GRANT SELECT ON pg_catalog.svv_redshift_tables TO datahub;
|
||||
-- GRANT SELECT ON pg_catalog.svv_redshift_columns TO datahub;
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
172
metadata-ingestion/docs/sources/redshift/redshift_post.md
Normal file
@ -0,0 +1,172 @@
|
||||
## Troubleshooting
|
||||
|
||||
### Schema Discovery Issues
|
||||
|
||||
If you're not seeing all schemas or tables after following the setup steps, check the following:
|
||||
|
||||
#### Missing Schemas
|
||||
|
||||
**1. Check schema filtering configuration:**
|
||||
|
||||
```yaml
|
||||
# In your recipe, ensure schema patterns are correct
|
||||
schema_pattern:
|
||||
allow:
|
||||
- "your_schema_name"
|
||||
- "public"
|
||||
# Remove deny patterns that might be blocking schemas
|
||||
```
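
To reason about why a schema is being filtered out, it can help to replay the allow/deny logic locally. The sketch below is a simplified approximation, not the connector's code: `is_allowed` is a hypothetical function, and it assumes patterns are regular expressions matched case-insensitively from the start of the name, with deny taking precedence over allow.

```python
import re


# Simplified approximation of allow/deny filtering (assumptions in lead-in).
def is_allowed(name, allow=(".*",), deny=()):
    """Return True if `name` passes the deny patterns and matches an allow pattern."""
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)
```

Under these assumptions, an allow pattern of `public` also admits `public_tmp`, because the match is anchored only at the start; anchoring the pattern as `public$` avoids that.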
|
||||
|
||||
**2. Verify permissions on specific schemas:**
|
||||
|
||||
```sql
|
||||
-- Test if you can see schemas
|
||||
SELECT schema_name, schema_type
|
||||
FROM svv_redshift_schemas
|
||||
WHERE database_name = 'your_database';
|
||||
|
||||
-- Test external schemas
|
||||
SELECT schemaname, eskind, databasename
|
||||
FROM SVV_EXTERNAL_SCHEMAS;
|
||||
```
|
||||
|
||||
**3. Check for external schemas:**
|
||||
External schemas (Redshift Spectrum) require all of the following permissions:
|
||||
|
||||
```sql
|
||||
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub_user;
|
||||
```
|
||||
|
||||
#### Missing Tables Within Schemas
|
||||
|
||||
**1. Check table filtering:**
|
||||
|
||||
```yaml
|
||||
table_pattern:
|
||||
allow:
|
||||
- "your_schema.your_table"
|
||||
# Ensure no overly restrictive deny patterns
|
||||
```
|
||||
|
||||
**2. Test table visibility:**
|
||||
|
||||
```sql
|
||||
-- For regular tables
|
||||
SELECT schemaname, tablename, tableowner
|
||||
FROM pg_tables
|
||||
WHERE schemaname = 'your_schema';
|
||||
|
||||
-- For views
|
||||
SELECT schemaname, viewname
|
||||
FROM pg_views
|
||||
WHERE schemaname = 'your_schema';
|
||||
|
||||
-- For external tables
|
||||
SELECT schemaname, tablename
|
||||
FROM SVV_EXTERNAL_TABLES
|
||||
WHERE schemaname = 'your_schema';
|
||||
```
|
||||
|
||||
#### Configuration Issues
|
||||
|
||||
**1. Database specification:**
|
||||
Ensure you're connecting to the correct database - Redshift ingestion works per database:
|
||||
|
||||
```yaml
|
||||
database: "your_actual_database_name" # Not the cluster name
|
||||
```
|
||||
|
||||
**2. Schema access permissions:**
|
||||
Ensure you have USAGE permissions on the schemas you want to discover:
|
||||
|
||||
```sql
|
||||
-- Check if you have USAGE on schemas
|
||||
SELECT n.nspname as schema_name,
|
||||
has_schema_privilege('datahub_user', n.nspname, 'USAGE') as has_usage
|
||||
FROM pg_catalog.pg_namespace n
|
||||
WHERE n.nspname NOT LIKE 'pg_%'
|
||||
AND n.nspname != 'information_schema';
|
||||
|
||||
-- Grant USAGE if missing
|
||||
GRANT USAGE ON SCHEMA your_schema_name TO datahub_user;
|
||||
```
|
||||
|
||||
**3. Shared database configuration:**
|
||||
If using datashare consumers, add:
|
||||
|
||||
```yaml
|
||||
is_shared_database: true
|
||||
```
|
||||
|
||||
#### Permission Test Queries
|
||||
|
||||
Run these to verify your permissions are working:
|
||||
|
||||
```sql
|
||||
-- Test core permissions
|
||||
SELECT COUNT(*) FROM svv_redshift_schemas WHERE database_name = 'your_database';
|
||||
SELECT COUNT(*) FROM svv_table_info WHERE database = 'your_database';
|
||||
|
||||
-- Test external permissions
|
||||
SELECT COUNT(*) FROM svv_external_schemas;
|
||||
SELECT COUNT(*) FROM svv_external_tables;
|
||||
```
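
The test queries above can be run in one pass and summarised per view. This is a minimal sketch, assuming a Python DB-API style cursor connected as the DataHub user with autocommit enabled (otherwise one failed statement can abort the rest of the transaction). The names `PERMISSION_CHECKS` and `run_permission_checks` are hypothetical.

```python
# Hypothetical check runner: one COUNT(*) query per system view from above.
PERMISSION_CHECKS = {
    "svv_redshift_schemas": "SELECT COUNT(*) FROM svv_redshift_schemas",
    "svv_table_info": "SELECT COUNT(*) FROM svv_table_info",
    "svv_external_schemas": "SELECT COUNT(*) FROM svv_external_schemas",
    "svv_external_tables": "SELECT COUNT(*) FROM svv_external_tables",
}


def run_permission_checks(cursor, checks=PERMISSION_CHECKS):
    """Run each query and report "ok" or the error message per view."""
    results = {}
    for name, query in checks.items():
        try:
            cursor.execute(query)
            cursor.fetchone()
            results[name] = "ok"
        except Exception as exc:  # report any driver error instead of stopping
            results[name] = f"failed: {exc}"
    return results
```

Any view reported as failed points at a missing `GRANT SELECT` from the setup section.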
|
||||
|
||||
### Data Profiling Issues
|
||||
|
||||
#### Profile Data Not Appearing
|
||||
|
||||
**1. Check data access permissions:**
|
||||
Ensure you have `USAGE` on schemas and `SELECT` on tables:
|
||||
|
||||
```sql
|
||||
-- Test schema access
|
||||
SELECT has_schema_privilege('datahub_user', 'your_schema', 'USAGE');
|
||||
|
||||
-- Test table access
|
||||
SELECT has_table_privilege('datahub_user', 'your_schema.your_table', 'SELECT');
|
||||
```
|
||||
|
||||
**2. Enable table-level profiling only:**
|
||||
If you cannot grant `SELECT` on tables, use table-level profiling:
|
||||
|
||||
```yaml
|
||||
profiling:
|
||||
profile_table_level_only: true
|
||||
```
|
||||
|
||||
### Lineage Issues
|
||||
|
||||
#### Missing Lineage Information
|
||||
|
||||
**1. Check lineage configuration:**
|
||||
|
||||
```yaml
|
||||
table_lineage_mode: stl_scan_based # or sql_based, mixed
|
||||
include_usage_statistics: true
|
||||
```
|
||||
|
||||
**2. Verify SYSLOG ACCESS:**
|
||||
|
||||
```sql
|
||||
-- Check if user has SYSLOG ACCESS
|
||||
SELECT usename, usesyslog
|
||||
FROM pg_user
|
||||
WHERE usename = 'datahub_user';
|
||||
-- usesyslog should be 't' (true)
|
||||
```
|
||||
|
||||
#### Cross-Cluster Lineage (Datashares)
|
||||
|
||||
For lineage across datashares, ensure:
|
||||
|
||||
1. DataHub user has `SHARE` privileges on datashares
|
||||
2. Both producer and consumer clusters are ingested
|
||||
3. `include_share_lineage: true` in configuration
|
||||
|
||||
```sql
|
||||
-- Check datashare access
|
||||
SELECT * FROM svv_datashares WHERE share_name = 'your_share';
|
||||
```
|
@ -1,39 +1,203 @@
|
||||
### Prerequisites
|
||||
|
||||
This source needs to access system tables that require extra permissions.
|
||||
To grant these permissions, please alter your datahub Redshift user the following way:
|
||||
The DataHub Redshift connector requires specific database privileges to extract metadata, lineage, and usage statistics from your Amazon Redshift cluster.
|
||||
|
||||
## Permission Overview
|
||||
|
||||
DataHub requires three categories of permissions:
|
||||
|
||||
1. **System Table Access** - Access to Redshift system tables for lineage and usage statistics
|
||||
2. **System View Access** - Access to system views for metadata discovery
|
||||
3. **Data Access** - Access to user schemas and tables for profiling and classification
|
||||
|
||||
## System Table and View Permissions
|
||||
|
||||
Execute the following commands as a database superuser or user with sufficient privileges to grant these permissions:
|
||||
|
||||
```sql
|
||||
-- Enable access to system log tables (STL_*, SVL_*, SYS_*)
|
||||
-- Required for lineage extraction and usage statistics
|
||||
ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED;
|
||||
GRANT SELECT ON pg_catalog.svv_table_info to datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svl_user_info to datahub_user;
|
||||
|
||||
-- Database and schema metadata
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub_user;
|
||||
|
||||
-- Table and column metadata
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_columns TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_table_info TO datahub_user;
|
||||
|
||||
-- External table support (Amazon Redshift Spectrum)
|
||||
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub_user;
|
||||
|
||||
-- User information for usage statistics
|
||||
GRANT SELECT ON pg_catalog.svv_user_info TO datahub_user; -- Serverless workgroups
|
||||
GRANT SELECT ON pg_catalog.svl_user_info TO datahub_user; -- Provisioned clusters
|
||||
|
||||
-- Materialized view information
|
||||
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub_user; -- Provisioned clusters
|
||||
GRANT SELECT ON pg_catalog.svv_mv_info TO datahub_user; -- Serverless workgroups
|
||||
|
||||
-- Datashares (cross-cluster lineage)
|
||||
GRANT SELECT ON pg_catalog.svv_datashares TO datahub_user;
|
||||
|
||||
-- Table creation timestamps (provisioned clusters)
|
||||
GRANT SELECT ON pg_catalog.pg_class_info TO datahub_user;
|
||||
```
|
||||
|
||||
To ingest datashares lineage, ingestion user for both producer and consumer namespace would need alter/share
|
||||
access to datashare. See [svv_datashares](https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DATASHARES.html)
|
||||
docs for more information.
|
||||
## Detailed Permission Breakdown
|
||||
|
||||
The following sections provide detailed information about which permissions are required for specific features and configurations.
|
||||
|
||||
### Core System Views (Always Required)
|
||||
|
||||
These system views are accessed in all DataHub configurations:
|
||||
|
||||
```sql
|
||||
GRANT SHARE ON <share_name> to datahub_user
|
||||
-- Schema discovery
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub_user;
|
||||
|
||||
-- Database information
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub_user;
|
||||
|
||||
-- Table metadata and statistics
|
||||
GRANT SELECT ON pg_catalog.svv_table_info TO datahub_user;
|
||||
|
||||
-- External table support
|
||||
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub_user;
|
||||
|
||||
-- Table creation timestamps
|
||||
GRANT SELECT ON pg_catalog.pg_class_info TO datahub_user;
|
||||
```
|
||||
|
||||
:::note
|
||||
### Conditional System Tables (Feature Dependent)
|
||||
|
||||
Giving a user unrestricted access to system tables gives the user visibility to data generated by other users. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements.
|
||||
#### Shared Database (Datashare Consumer)
|
||||
|
||||
```sql
|
||||
-- Required when is_shared_database = True
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_columns TO datahub_user;
|
||||
```
|
||||
|
||||
#### Redshift Serverless Workgroups
|
||||
|
||||
```sql
|
||||
-- Required for serverless workgroups
|
||||
GRANT SELECT ON pg_catalog.svv_user_info TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_mv_info TO datahub_user;
|
||||
```
|
||||
|
||||
#### Redshift Provisioned Clusters
|
||||
|
||||
```sql
|
||||
-- Required for provisioned clusters
|
||||
GRANT SELECT ON pg_catalog.svl_user_info TO datahub_user; -- Covered by SYSLOG ACCESS
|
||||
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub_user;
|
||||
```
|
||||
|
||||
#### Datashares Lineage
|
||||
|
||||
```sql
|
||||
-- Required when include_share_lineage: true (default)
|
||||
GRANT SELECT ON pg_catalog.svv_datashares TO datahub_user;
|
||||
```
|
||||
|
||||
### Recommended Permission Set
|
||||
|
||||
For a typical provisioned cluster with default settings:
|
||||
|
||||
```sql
|
||||
-- Core system access
|
||||
ALTER USER datahub_user WITH SYSLOG ACCESS UNRESTRICTED;
|
||||
|
||||
-- Always required (7 grants)
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_databases TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_redshift_schemas TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_schemas TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_table_info TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_tables TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.svv_external_columns TO datahub_user;
|
||||
GRANT SELECT ON pg_catalog.pg_class_info TO datahub_user;
|
||||
|
||||
-- Datashares (since include_share_lineage defaults to true)
|
||||
GRANT SELECT ON pg_catalog.svv_datashares TO datahub_user;
|
||||
|
||||
-- Provisioned cluster materialized views
|
||||
GRANT SELECT ON pg_catalog.stv_mv_info TO datahub_user;
|
||||
```
|
||||
|
||||
#### Data Access Privileges (Required for Data Profiling and Classification)
|
||||
|
||||
**Important**: The system table permissions above only provide access to metadata. To enable data profiling, classification, or any feature that reads actual table data, you must grant additional privileges:
|
||||
|
||||
```sql
|
||||
-- Grant USAGE privilege on schemas (required to access schema objects)
|
||||
GRANT USAGE ON SCHEMA public TO datahub_user;
|
||||
GRANT USAGE ON SCHEMA your_schema_name TO datahub_user;
|
||||
|
||||
-- Grant SELECT privilege on existing tables for data access
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub_user;
|
||||
|
||||
-- Grant privileges on future objects (recommended for production)
|
||||
-- IMPORTANT: These must be run by each user who will create tables/views
|
||||
-- OR by a superuser with FOR ROLE clause
|
||||
|
||||
-- Option 1: If you (as admin) will create all future tables/views:
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON VIEWS TO datahub_user;
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON TABLES TO datahub_user;
|
||||
ALTER DEFAULT PRIVILEGES IN SCHEMA your_schema_name GRANT SELECT ON VIEWS TO datahub_user;
|
||||
|
||||
-- Option 2: If other users will create tables/views, run this for each user:
|
||||
-- ALTER DEFAULT PRIVILEGES FOR ROLE other_user_name IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
|
||||
-- ALTER DEFAULT PRIVILEGES FOR ROLE other_user_name IN SCHEMA public GRANT SELECT ON VIEWS TO datahub_user;
|
||||
|
||||
-- Option 3: For all future users (requires superuser):
|
||||
-- ALTER DEFAULT PRIVILEGES FOR ALL ROLES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
|
||||
-- ALTER DEFAULT PRIVILEGES FOR ALL ROLES IN SCHEMA public GRANT SELECT ON VIEWS TO datahub_user;
|
||||
```
|
||||
|
||||
:::caution Data Access vs Metadata Access
|
||||
|
||||
**The permissions are split into two categories:**
|
||||
|
||||
1. **System table permissions** (above) - Required for metadata extraction, lineage, and usage statistics
|
||||
2. **Data access permissions** (this section) - Required for data profiling, classification, and any feature that reads actual table content
|
||||
|
||||
**Default privileges only apply to objects created by the user who ran the ALTER DEFAULT PRIVILEGES command.** If multiple users create tables in your schemas, you need to:
|
||||
|
||||
1. **Run the commands as each user**, OR
|
||||
2. **Use `FOR ROLE other_user_name`** for each user who creates objects, OR
|
||||
3. **Use `FOR ALL ROLES`** (requires superuser privileges)
|
||||
|
||||
**Common gotcha**: If User A runs `ALTER DEFAULT PRIVILEGES` and User B creates a table, DataHub won't have access to User B's table unless you used Option 2 or 3 above.
|
||||
|
||||
**Alternative approach**: Instead of default privileges, consider using a scheduled job to periodically grant access to new tables:
|
||||
|
||||
```sql
|
||||
-- Run this periodically to catch new tables
|
||||
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema_name TO datahub_user;
|
||||
```
|
||||
|
||||
:::
|
||||
|
||||
### Concept mapping
|
||||
#### Optional: Datashare Privileges
|
||||
|
||||
| Source Concept | DataHub Concept | Notes |
|
||||
| -------------- | --------------------------------------------------------- | ------------------ |
|
||||
| `"redshift"` | [Data Platform](../../metamodel/entities/dataPlatform.md) | |
|
||||
| Database | [Container](../../metamodel/entities/container.md) | Subtype `Database` |
|
||||
| Schema | [Container](../../metamodel/entities/container.md) | Subtype `Schema` |
|
||||
| Table | [Dataset](../../metamodel/entities/dataset.md) | Subtype `Table` |
|
||||
| View | [Dataset](../../metamodel/entities/dataset.md) | Subtype `View` |
|
||||
To enable cross-cluster lineage through datashares, grant the following privileges:
|
||||
|
||||
### Ingestion of multiple redshift databases, namespaces
|
||||
```sql
|
||||
-- Grant SHARE privilege on datashares (replace with actual datashare names)
|
||||
GRANT SHARE ON your_datashare_name TO datahub_user;
|
||||
```
|
||||
|
||||
## Ingestion of multiple redshift databases, namespaces
|
||||
|
||||
- If multiple databases are present in the Redshift namespace (or provisioned cluster),
|
||||
you would need to set up a separate ingestion per database.
|
||||
@ -44,50 +208,51 @@ Giving a user unrestricted access to system tables gives the user visibility to
|
||||
you specify a platform_instance equivalent to the namespace in the recipe. It can be the same as the namespace ID or another
human-readable name, but it must be unique across all your Redshift namespaces.
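
The one-ingestion-per-database rule can be sketched as a small recipe generator. This is an illustration only: `build_recipes` is a hypothetical helper, credentials and sink settings are deliberately left out, and the example host and instance names are placeholders.

```python
# Hypothetical helper: one recipe per database, all sharing the
# platform_instance that identifies the Redshift namespace.
def build_recipes(host_port, databases, platform_instance):
    """Return a list of recipe dicts, one for each database."""
    recipes = []
    for database in databases:
        recipes.append(
            {
                "source": {
                    "type": "redshift",
                    "config": {
                        "host_port": host_port,
                        "database": database,
                        "platform_instance": platform_instance,
                    },
                }
            }
        )
    return recipes
```

For example, `build_recipes("example-host:5439", ["dev", "analytics"], "my_namespace")` yields two recipes that differ only in `database`.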
|
||||
|
||||
### Lineage
|
||||
## Lineage
|
||||
|
||||
There are multiple lineage collector implementations as Redshift does not support table lineage out of the box.
|
||||
|
||||
#### stl_scan_based
|
||||
### stl_scan_based
|
||||
|
||||
The stl_scan based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) and [stl_scan](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_SCAN.html) system tables to
|
||||
discover lineage between tables.
|
||||
Pros:
|
||||
|
||||
**Pros:**
|
||||
|
||||
- Fast
|
||||
- Reliable
|
||||
|
||||
Cons:
|
||||
**Cons:**
|
||||
|
||||
- Does not work with Spectrum/external tables because those scans do not show up in the stl_scan table.
- If a table depends on a view, the view won't be listed as a dependency. Instead, the table will be connected to the view's dependencies.
|
||||
|
||||
#### sql_based
|
||||
### sql_based
|
||||
|
||||
The sql_based collector uses Redshift's [stl_insert](https://docs.aws.amazon.com/redshift/latest/dg/r_STL_INSERT.html) to discover all the insert queries
|
||||
and uses sql parsing to discover the dependencies.
|
||||
|
||||
Pros:
|
||||
**Pros:**
|
||||
|
||||
- Works with Spectrum tables
|
||||
- Views are connected properly if a table depends on it
|
||||
|
||||
Cons:
|
||||
**Cons:**
|
||||
|
||||
- Slow.
|
||||
- Less reliable as the query parser can fail on certain queries
|
||||
|
||||
#### mixed
|
||||
### mixed
|
||||
|
||||
Uses both collectors above, applying the sql_based one first and then the stl_scan based one.
|
||||
|
||||
Pros:
|
||||
**Pros:**
|
||||
|
||||
- Works with Spectrum tables
|
||||
- Views are connected properly if a table depends on it
|
||||
- A bit more reliable than the sql_based collector alone
|
||||
|
||||
Cons:
|
||||
**Cons:**
|
||||
|
||||
- Slow
|
||||
- May be incorrect at times as the query parser can fail on certain queries
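
As a rough mental model of the mixed mode, the two collectors' results can be pictured as being combined per table. How the connector actually merges them is not described here; the union below is an assumption, and `merge_lineage` is a hypothetical function operating on plain dictionaries of table name to upstream names.

```python
# Illustrative only: combine upstreams found by the two collectors.
# Assumption: results are merged as a per-table union of upstream names.
def merge_lineage(sql_based, stl_scan_based):
    """Start from the sql_based result, then add stl_scan based upstreams."""
    merged = {table: set(upstreams) for table, upstreams in sql_based.items()}
    for table, upstreams in stl_scan_based.items():
        merged.setdefault(table, set()).update(upstreams)
    return merged
```

Under this assumption, a view found only by SQL parsing and a base table found only by the scan log would both appear as upstreams of the same table.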
|
||||
@ -98,14 +263,14 @@ The redshift stl redshift tables which are used for getting data lineage retain
|
||||
|
||||
:::
|
||||
|
||||
### Datashares Lineage
|
||||
## Datashares Lineage
|
||||
|
||||
This is enabled by default and can be disabled by setting `include_share_lineage: False`.
|
||||
|
||||
You must run Redshift ingestion for the datashare producer namespace at least once so that lineage
shows up correctly after the datashare consumer namespace is ingested.
|
||||
|
||||
### Profiling
|
||||
## Profiling
|
||||
|
||||
Profiling runs SQL queries on the Redshift cluster to gather statistics about the tables. To do that, the user needs read access to the tables that should be profiled.
|
||||
|
||||
@ -115,3 +280,17 @@ If you don't want to grant read access to the tables you can enable table level
|
||||
profiling:
|
||||
profile_table_level_only: true
|
||||
```
|
||||
|
||||
### Caveats
|
||||
|
||||
:::note
|
||||
|
||||
**System table access**: The `SYSLOG ACCESS UNRESTRICTED` privilege gives the user visibility to data generated by other users. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements.
|
||||
|
||||
:::
|
||||
|
||||
:::note
|
||||
|
||||
**Datashare lineage**: For cross-cluster lineage through datashares, the DataHub user requires `SHARE` privileges on datashares in both producer and consumer namespaces. See the [Amazon Redshift datashare documentation](https://docs.aws.amazon.com/redshift/latest/dg/r_SVV_DATASHARES.html) for more information.
|
||||
|
||||
:::
|
||||
|
@ -16,7 +16,7 @@ source:
|
||||
include_table_lineage: true
|
||||
include_usage_statistics: true
|
||||
# The following options are only used when include_usage_statistics is true
|
||||
# it appends the domain after the resdhift username which is extracted from the Redshift audit history
|
||||
# it appends the domain after the redshift username which is extracted from the Redshift audit history
|
||||
# in the format username@email_domain
|
||||
email_domain: mydomain.com
|
||||
|
||||
|