mirror of
				https://github.com/datahub-project/datahub.git
				synced 2025-11-04 04:39:10 +00:00 
import FeatureAvailability from '@site/src/components/FeatureAvailability';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Metadata Ingestion

<FeatureAvailability/>

DataHub helps you discover and understand your organization's data by automatically collecting information about your data sources. This process, called **metadata ingestion**, lets DataHub pull in:

- **Table and column names** from your databases
- **Asset lineage** showing how information flows between systems
- **Usage statistics** revealing which datasets are most popular
- **Data quality information** including freshness and completeness
- **Business context** like ownership and documentation

The ingestion interface makes it simple to connect to popular platforms like Snowflake, BigQuery, dbt, and more, schedule automatic updates, and manage credentials securely.

## Prerequisites and Permissions

To manage metadata ingestion in DataHub, you need appropriate permissions.

### Option 1: Admin-Level Access

Users can be granted the following privileges for full administrative access to all ingestion sources:

- **`Manage Metadata Ingestion`** - Provides complete access to create, edit, run, and delete all ingestion sources
- **`Manage Secrets`** - Allows creation and management of encrypted credentials used in ingestion configurations

These privileges can be granted in two ways:

1. **Admin Role Assignment** - Users assigned to the **Admin Role** receive these privileges by default
2. **Custom Policy with Platform Privileges** - Create a [Custom Policy](authorization/policies.md) that grants the `Manage Metadata Ingestion` and `Manage Secrets` platform privileges to specific users or groups

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-privileges.png"/>
</p>

### Option 2: Resource-Specific Policies

For more granular control, administrators can create [Custom Policies](authorization/policies.md) that apply specifically to **Ingestion Sources**, allowing different users to have different levels of access:

- **View** - View ingestion source configurations and run history
- **Edit** - Modify ingestion source configurations
- **Delete** - Remove ingestion sources
- **Execute** - Run ingestion sources on-demand

**Prerequisites:**

- **DataHub Core**: Enable the `VIEW_INGESTION_SOURCE_PRIVILEGES_ENABLED` feature flag
- **DataHub Cloud**: Work with your customer success team to get the feature enabled

:::caution
**Important**: Once this feature flag is enabled, any policies that apply to "All" resource types will also cover Ingestion Sources, including the default read-only policies. This makes the Ingestion tab visible and potentially actionable, depending on the applied privileges. Enable the flag with care if you have view-only policies that should not expose the Data Sources page.
:::

### Accessing the Ingestion Interface

Once you have the appropriate privileges, navigate to the **Ingestion** tab in DataHub.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-tab.png"/>
</p>

On this page, you'll see a list of active **Ingestion Sources**. An Ingestion Source represents a configured connection to an external data system from which DataHub extracts metadata.

If you're just getting started, you won't have any sources configured. The following sections will guide you through creating your first ingestion source.

## Creating an Ingestion Source

### Step 1: Select a Data Source

Begin by clicking **+ Create new source** to start the ingestion source creation process.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/create-new-ingestion-source-button.png"/>
</p>

Next, select the type of data source you want to connect. DataHub provides pre-built templates for popular platforms including:

- **Data Warehouses**: Snowflake, BigQuery, Redshift, Databricks
- **Databases**: MySQL, PostgreSQL, SQL Server, Oracle
- **Business Intelligence**: Looker, Tableau, PowerBI
- **Streaming**: Kafka, Pulsar
- **And many more...**

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/select-platform-template.png"/>
</p>

Select the template that matches your data source. If your specific platform isn't listed, you can choose **Custom** to configure a source manually, though this requires more technical knowledge.

### Step 2: Configure Connection Details

After selecting your data source template, you'll be presented with a form to configure the connection. The exact fields vary by platform, but typically include:

**Connection Information:**

- Host/server address and port
- Database or project names
- Authentication credentials

**Data Selection:**

- Which databases, schemas, or tables to include
- Filtering options to exclude certain data
- Sampling and profiling settings

#### Managing Sensitive Information with Secrets

For production environments, sensitive information like passwords and API keys should be stored securely using DataHub's **Secrets** functionality.

To create a secret:

1. Navigate to the **Secrets** tab in the Ingestion interface
2. Click **Create new secret**

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/create-secret.png"/>
</p>

3. Provide a descriptive name (e.g., `BIGQUERY_PRIVATE_KEY`)
4. Enter the sensitive value
5. Optionally add a description
6. Click **Create**

Once created, secrets can be referenced in your ingestion configuration forms using the dropdown menus provided for credential fields.

> **Security Note**: Users with the `Manage Secrets` privilege can retrieve plaintext secret values through DataHub's GraphQL API. Ensure secrets are only accessible to trusted administrators.

### Step 3: Test Your Connection

Before proceeding, verify that DataHub can successfully connect to your data source. Most ingestion source forms include a **Test Connection** button that validates:

- Network connectivity to your data source
- Authentication credentials
- Required permissions for metadata extraction

<p align="center">
  <img width="75%" alt="Test BigQuery connection" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/guides/bigquery/bigquery-test-connection.png"/>
</p>

If the connection test fails, review your configuration and ensure that:

- Network access is available between DataHub and your data source
- Credentials are correct and have sufficient permissions
- Any firewall rules allow the connection

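When the test fails and the cause is unclear, it can help to rule out basic network reachability first. The following is a minimal sketch, not part of DataHub: `can_connect` is a hypothetical helper, and the host and port in the trailing comment are placeholders. It only checks that a TCP connection can be opened. It says nothing about credentials or permissions, and it is only meaningful when run from the same network as the ingestion executor.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (hypothetical host and port for a MySQL source):
# print(can_connect("mysql.internal.example.com", 3306))
```

A `False` result points toward DNS, routing, or firewall issues rather than the ingestion configuration itself.
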
### Step 4: Schedule Execution (Optional)

You can configure your ingestion source to run automatically on a regular schedule. This keeps your metadata up to date without manual intervention.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/schedule-ingestion.png"/>
</p>

If you prefer to run ingestion manually or on an ad-hoc basis, you can skip the scheduling step entirely.

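The CLI and GraphQL examples later in this guide express schedules as five-field cron strings such as `5 * * * *`. As a reading aid, the sketch below labels the fields of such a string. `describe_cron` is a hypothetical helper written for this illustration. It only splits the expression and does not validate field ranges or compute run times.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


# "5 * * * *" runs at minute 5 of every hour.
print(describe_cron("5 * * * *"))
# "*/5 * * * *" runs every 5 minutes.
print(describe_cron("*/5 * * * *"))
```

The two sample expressions differ by only a few characters but run at very different frequencies, so it is worth double-checking the minute field before saving a schedule.
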
### Step 5: Finish Up and Run

Finally, provide a descriptive name for your ingestion source that will help you and your team identify it later.

You can also assign **Users** and/or **Groups** as owners of this ingestion source. By default, you (the creator) are assigned as an owner, but you can add owners or change ownership at any time after creation.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/name-ingestion-source.png"/>
</p>

Click **Save and Run** to create the ingestion source and execute it immediately, or **Save** to create it without running.

#### Advanced Configuration Options

For users who need additional control, DataHub provides advanced configuration options:

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/custom-ingestion-cli-version.png"/>
</p>

- **CLI Version:** Specify a particular version of the DataHub CLI for ingestion execution
- **Environment Variables:** Set custom environment variables for the ingestion process

## Running and Monitoring Ingestion

### Executing an Ingestion Source

Once you've created your Ingestion Source, you can run it by clicking the 'Play' button. Shortly after, you should see the 'Last Status' column of the ingestion source change to `Running`, indicating that DataHub has successfully queued the ingestion job.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/running.png"/>
</p>

When ingestion completes successfully, the status will show as `Success` in green.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/success-run.png"/>
</p>

### Viewing Run History

The **Run History** tab shows a complete history of your ingestion runs. Here you can:

- **See all runs**: View every ingestion execution across all your sources
- **Check recent activity**: Runs are listed with the most recent at the top
- **Filter by source**: Use the dropdown to see runs from a specific ingestion source
- **Access from Sources tab**: Click on any source's **Last Run** status or select **View Run History** from the source menu

This makes it easy to track ingestion performance and troubleshoot issues over time.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/run-history-tab.png"/>
</p>

### Viewing Ingestion Results

After successful ingestion, you can view detailed information about what was extracted:

1. Click the **Success** status button on a completed ingestion run
2. Select **View All** to see the list of ingested entities
3. Click on individual entities to validate the extracted metadata

<p align="center">
  <img width="75%" alt="ingestion_details_view_all" src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/ingestion-run-summary.png"/>
</p>

### Cancelling Running Ingestion

If an ingestion run is taking too long or appears to be stuck, you can cancel it by clicking the 'Stop' button on the running job.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/cancelled-run.png"/>
</p>

This is useful when encountering issues like:

- Network timeouts
- Ingestion source bugs
- Resource constraints

## Troubleshooting Failed Ingestion

### Common Failure Reasons

When ingestion fails, the most common causes include:

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/failed-source.png"/>
</p>

1. **Configuration Errors**: Incorrect connection details, missing required fields, or invalid parameter values
2. **Authentication Issues**: Wrong credentials, expired tokens, or insufficient permissions
3. **Network Connectivity**: DNS resolution failures, firewall blocks, or unreachable data sources
4. **Secret Resolution Problems**: Referenced secrets that don't exist or have incorrect names
5. **Resource Constraints**: Memory limits, timeouts, or processing capacity issues

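Secret resolution problems are often simple name mismatches. Recipes reference secrets with `${NAME}` placeholders, as in the `${MYSQL_PASSWORD}` reference in the GraphQL example later in this guide. The sketch below compares the placeholders in a recipe against a list of secret names. `missing_secrets` is a hypothetical helper, and the set of known names is supplied by hand here rather than fetched from DataHub.

```python
import re

# Matches ${NAME} placeholders such as ${MYSQL_PASSWORD}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def missing_secrets(recipe_text: str, known_secrets: set) -> list:
    """Return placeholder names in the recipe that are not known secrets."""
    missing = []
    for name in PLACEHOLDER.findall(recipe_text):
        # Keep first-seen order and skip duplicates.
        if name not in known_secrets and name not in missing:
            missing.append(name)
    return missing


# The password placeholder below is deliberately misspelled.
recipe = 'username: "${MYSQL_USERNAME}"\npassword: "${MYSQL_PASWORD}"'
print(missing_secrets(recipe, {"MYSQL_USERNAME", "MYSQL_PASSWORD"}))
# → ['MYSQL_PASWORD']
```
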
### Viewing Detailed Logs

To diagnose ingestion failures, click the status value of a run in the run history (for example, Failed or Aborted) to view and download the full ingestion run logs.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion/ingestion-run-log.png"/>
</p>

The logs provide detailed information about:

- Connection attempts and errors
- Authentication failures
- Data extraction progress
- Error messages and stack traces

### Authentication for Secured DataHub Instances

If your DataHub instance has [Metadata Service Authentication](authentication/introducing-metadata-service-authentication.md) enabled, you'll need to provide a Personal Access Token in your configuration.

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/ingestion-with-token.png"/>
</p>

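As a rough illustration of where the token goes, the sketch below adds a `datahub-rest` sink with `server` and `token` fields to a recipe held as a Python dictionary. The helper `with_rest_sink`, the server URL, and the secret name `DATAHUB_ACCESS_TOKEN` are assumptions made for this example. Substitute the values for your own deployment.

```python
import json


def with_rest_sink(recipe: dict, server: str, token_secret: str) -> dict:
    """Return a copy of the recipe with a datahub-rest sink that sends a token."""
    updated = dict(recipe)
    updated["sink"] = {
        "type": "datahub-rest",
        "config": {
            "server": server,
            # Reference a DataHub secret rather than pasting the token itself.
            "token": "${" + token_secret + "}",
        },
    }
    return updated


# Hypothetical source-only recipe, server URL, and secret name.
source_only = {"source": {"type": "mysql", "config": {"include_tables": True}}}
recipe = with_rest_sink(source_only, "http://localhost:8080", "DATAHUB_ACCESS_TOKEN")
print(json.dumps(recipe, indent=2))
```

Storing the token as a secret and referencing it by name keeps the plaintext value out of the recipe itself.
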
## Advanced Configuration with YAML

While the UI-based forms handle most common ingestion scenarios, advanced users may need direct access to YAML configuration for:

- Custom ingestion sources not available in the UI
- Complex transformation pipelines
- Advanced filtering and processing logic
- Integration with external systems

For these use cases, DataHub supports direct YAML recipe configuration. For syntax details and examples, see the [Recipe Overview Guide](metadata-ingestion/recipe_overview.md).

<Tabs>
   <TabItem value="cli" label="CLI">

You can deploy recipes with the CLI, as described in the [CLI documentation for uploading ingestion recipes](./cli.md#ingest-deploy).

```bash
datahub ingest deploy --name "My Test Ingestion Source" --schedule "5 * * * *" --time-zone "UTC" -c recipe.yaml
```

   </TabItem>
   <TabItem value="graphql" label="GraphQL">

Create ingestion sources through [DataHub's GraphQL API](./api/graphql/overview.md) with the **createIngestionSource** mutation.

```graphql
mutation {
  createIngestionSource(
    input: {
      name: "My Test Ingestion Source"
      type: "mysql"
      description: "My ingestion source description"
      schedule: { interval: "*/5 * * * *", timezone: "UTC" }
      config: {
        recipe: "{\"source\":{\"type\":\"mysql\",\"config\":{\"include_tables\":true,\"database\":null,\"password\":\"${MYSQL_PASSWORD}\",\"profiling\":{\"enabled\":false},\"host_port\":null,\"include_views\":true,\"username\":\"${MYSQL_USERNAME}\"}},\"pipeline_name\":\"urn:li:dataHubIngestionSource:f38bd060-4ea8-459c-8f24-a773286a2927\"}"
        version: "0.8.18"
        executorId: "mytestexecutor"
      }
    }
  )
}
```

**Note**: When using GraphQL, the recipe must be passed as a single string with its double quotes escaped.

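Escaping the recipe by hand is error-prone. One way to produce the escaped string is to serialize the recipe twice with Python's standard `json` module, as sketched below. The recipe contents are illustrative and only loosely follow the MySQL example above. The approach assumes that JSON string escaping is acceptable as a GraphQL string literal, which holds for the quote and backslash escapes used here.

```python
import json

# Recipe as a plain dictionary (illustrative values).
recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "username": "${MYSQL_USERNAME}",
            "password": "${MYSQL_PASSWORD}",
            "include_tables": True,
            "include_views": True,
            "profiling": {"enabled": False},
        },
    }
}

# First dumps: dictionary -> compact JSON text.
recipe_json = json.dumps(recipe, separators=(",", ":"))
# Second dumps: JSON text -> quoted string literal with inner quotes escaped.
recipe_literal = json.dumps(recipe_json)

print(recipe_literal)
```

The printed value, including its surrounding quotes, can be pasted after `recipe:` in the mutation.
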
   </TabItem>
</Tabs>

## Frequently Asked Questions

### Why does ingestion fail with 'Failed to Connect' errors in Docker environments?

If you're running DataHub using `datahub docker quickstart` and experiencing connection failures, the cause may be a network configuration issue: the ingestion executor might be unable to reach DataHub's backend services.

Try updating your ingestion configuration to use the Docker internal DNS name:

<p align="center">
  <img width="80%"  src="https://raw.githubusercontent.com/datahub-project/static-assets/main/imgs/quickstart-ingestion-config.png"/>
</p>

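In recipe terms, this change usually amounts to pointing the sink's `server` field at the backend container. The sketch below shows the idea on a recipe held as a Python dictionary. `use_docker_dns` is a hypothetical helper, and `http://datahub-gms:8080` is assumed to be the backend's service name and port in a default quickstart deployment. Check your own compose setup if it differs.

```python
def use_docker_dns(recipe: dict, server: str = "http://datahub-gms:8080") -> dict:
    """Return a copy of the recipe whose sink points at `server`."""
    updated = dict(recipe)
    # Fall back to a datahub-rest sink if the recipe has none.
    sink = dict(updated.get("sink", {"type": "datahub-rest"}))
    config = dict(sink.get("config", {}))
    config["server"] = server
    sink["config"] = config
    updated["sink"] = sink
    return updated


recipe = {
    "source": {"type": "mysql", "config": {}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
print(use_docker_dns(recipe)["sink"]["config"]["server"])
# → http://datahub-gms:8080
```
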
### What does a dash mark (-) status mean and how do I fix it?

If your ingestion source shows a dash mark (-) status and never changes to 'Running', this could mean:

1. **The source has never been triggered to run** - Try clicking the "Play" button to execute the source
2. **The DataHub actions executor is not running or healthy** (DataHub Core users only)

If clicking "Play" doesn't resolve the issue, DataHub Core users should diagnose their actions container:

1. Check container status with `docker ps`
2. View executor logs with `docker logs <container-id>`
3. Restart the actions container if necessary

### When should I use CLI/YAML instead of UI ingestion?

Consider using CLI-based ingestion when:

- Your data sources aren't reachable from DataHub's network (use [remote executors](managed-datahub/operator-guide/setting-up-remote-ingestion-executor.md) for DataHub Cloud)
- You need custom ingestion logic not available in UI templates
- Your ingestion requires local file system access
- You want to distribute ingestion across multiple environments
- You need complex transformations or custom metadata processing

## Additional Resources

- **Demo Video**: [Watch a complete UI ingestion walkthrough](https://www.youtube.com/watch?v=EyMyLcaw_74)
- **Quick Start Guides**: Step-by-step setup instructions for popular data sources
- **Recipe Documentation**: [Comprehensive YAML configuration reference](metadata-ingestion/recipe_overview.md)
- **Integration Catalog**: [Browse all supported data sources and their features](https://docs.datahub.com/integrations)