<summary>datahub command not found with PyPI install</summary>
If you've already run the pip install, but running `datahub` in your command line doesn't work, then there is likely an issue with your PATH setup and Python.
The easiest workaround is to invoke the CLI as a Python module: use `python3 -m datahub` in place of `datahub`.
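
If you'd rather find out why the command is missing, a small diagnostic along these lines can help. It is only a sketch using the standard library: `diagnose_cli` is a hypothetical helper name, and it checks the default install scheme's scripts directory, which may differ from where a `pip install --user` put the entry point.

```python
import os
import shutil
import sysconfig


def diagnose_cli(command: str = "datahub") -> dict:
    """Report whether `command` resolves on PATH and where pip usually puts scripts."""
    # Directory where this interpreter's default install scheme places console scripts.
    scripts_dir = sysconfig.get_path("scripts")
    path_entries = os.environ.get("PATH", "").split(os.pathsep)
    return {
        # Full path of the executable, or None if the shell would not find it.
        "resolved": shutil.which(command),
        "scripts_dir": scripts_dir,
        "scripts_dir_on_path": scripts_dir in path_entries,
    }


if __name__ == "__main__":
    for key, value in diagnose_cli().items():
        print(f"{key}: {value}")
```

If `resolved` is `None` and `scripts_dir_on_path` is `False`, adding the scripts directory to your PATH is one likely fix.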
<summary>Failure to install confluent_kafka: "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1"</summary>
This sometimes happens if there's a version mismatch between Kafka's C library and the Python wrapper library. Try running `pip install confluent_kafka==1.5.0` and then retrying.
The base `acryl-datahub` package supports both Pydantic 1.x and 2.x. However, some of our specific sources require Pydantic 1.x because of transitive dependencies.
If you're primarily using `acryl-datahub` for the SDKs, you can install `acryl-datahub` and some extras, like `acryl-datahub[sql-parser]`, without getting conflicts related to Pydantic versioning.
We recommend not installing full ingestion sources into your main environment (e.g. avoid having a dependency on `acryl-datahub[snowflake]` or other ingestion sources).
Instead, we recommend using UI-based ingestion or isolating the ingestion pipelines using [virtual environments](https://docs.python.org/3/library/venv.html). Orchestrators often have first-class support for virtual environments; here's an [example for Airflow](./schedule_docs/airflow.md).
The architecture of this metadata ingestion framework is heavily inspired by [Apache Gobblin](https://gobblin.apache.org/) (also originally a LinkedIn project!). We have a standardized format - the MetadataChangeEvent - and sources and sinks which respectively produce and consume these objects. The sources pull metadata from a variety of data systems, while the sinks are primarily for moving this metadata into DataHub.
- The high level interfaces are defined in the [API directory](./src/datahub/ingestion/api).
- The actual [sources](./src/datahub/ingestion/source) and [sinks](./src/datahub/ingestion/sink) have their own directories. The registry files in those directories import the implementations.
- The metadata models are created using code generation, and eventually live in the `./src/datahub/metadata` directory. However, these files are not checked in and instead are generated at build time. See the [codegen](./scripts/codegen.sh) script for details.
- Tests live in the [`tests`](./tests) directory. They're split between smaller unit tests and larger integration tests.
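
The source-and-sink split described above can be pictured with a toy sketch. Every name here (`ToyMetadataEvent`, `ToySource`, `ToySink`, `run_pipeline`) is hypothetical and chosen for illustration; the real interfaces live in the API directory and look different.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ToyMetadataEvent:
    """Hypothetical stand-in for the standardized event format."""

    urn: str
    aspects: Dict[str, str]


class ToySource:
    """Produces events by pulling metadata from some data system (here, a list)."""

    def __init__(self, tables: List[str]) -> None:
        self.tables = tables

    def get_events(self) -> Iterable[ToyMetadataEvent]:
        for table in self.tables:
            yield ToyMetadataEvent(urn=f"toy:{table}", aspects={"name": table})


class ToySink:
    """Consumes events; a real sink would forward them to DataHub."""

    def __init__(self) -> None:
        self.written: List[ToyMetadataEvent] = []

    def write(self, event: ToyMetadataEvent) -> None:
        self.written.append(event)


def run_pipeline(source: ToySource, sink: ToySink) -> int:
    """Move every event from the source to the sink and return the count."""
    count = 0
    for event in source.get_events():
        sink.write(event)
        count += 1
    return count
```

Because the two sides only share the event format, any source can in principle be paired with any sink.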
The vast majority of our dependencies are not required by the "core" package but instead can be optionally installed using Python "extras". This allows us to keep the core package lightweight. We should be deliberate about adding new dependencies to the core framework.
Where possible, we should avoid pinning dependency versions. The `acryl-datahub` package is frequently used as a library and hence installed alongside other tools. If you need to restrict the version of a dependency, use a range like `>=1.2.3,<2.0.0` or a negative constraint like `>=1.2.3, !=1.2.7` instead. Every upper bound and negative constraint should be accompanied by a comment explaining why it's necessary.
Caveat: Some packages like Great Expectations and Airflow frequently make breaking changes. For such packages, it's ok to add a "defensive" upper bound with the current latest version, accompanied by a comment. It's critical that we revisit these upper bounds at least once a month and broaden them if possible.
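
As a rough illustration of these rules, a requirements list might look like the sketch below. The package names, version numbers, and reasons in the comments are all made up, and `find_exact_pins` is a hypothetical helper rather than something in this repo.

```python
from typing import List

# Hypothetical requirements for illustration only.
example_requirements: List[str] = [
    # A plain lower bound needs no justification.
    "example-client>=1.2.3",
    # Hypothetical reason: 1.2.7 shipped a broken wheel, so exclude just that release.
    "example-parser>=1.2.3,!=1.2.7",
    # Hypothetical reason: this package often makes breaking changes, so we add a
    # defensive upper bound at the current latest version and revisit it regularly.
    "example-orchestrator>=2.0.0,<2.9.0",
]


def find_exact_pins(requirements: List[str]) -> List[str]:
    """Return the requirements that pin an exact version with `==`."""
    return [req for req in requirements if "==" in req]
```

A check like `find_exact_pins(example_requirements)` returning an empty list is one cheap way to catch accidental pins in review.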
- Most important point: we should **match the terminology of the source system**. For example, Snowflake shouldn’t have a `host_port`, it should have an `account_id`.
- We should prefer slightly more verbose names when the alternative isn’t descriptive enough. For example `client_id` or `tenant_id` over a bare `id` and `access_secret` over a bare `secret`.
- AllowDenyPatterns should be used whenever we need to filter a list. The pattern should always apply to the fully qualified name of the entity. These configs should be named `*_pattern`, for example `table_pattern`.
- Avoid `*_only` configs like `profile_table_level_only` in favor of `profile_table_level` and `profile_column_level`. `include_tables` and `include_views` are a good example.
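
The naming and filtering conventions above can be sketched roughly as follows. `SimpleAllowDeny` is a simplified stand-in written for this example and is not DataHub's actual `AllowDenyPattern` class; `ExampleSourceConfig` and the table names are likewise hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleAllowDeny:
    """Simplified stand-in for an allow/deny filter (illustration only)."""

    allow: List[str] = field(default_factory=lambda: [".*"])
    deny: List[str] = field(default_factory=list)

    def allowed(self, name: str) -> bool:
        # Deny patterns take precedence over allow patterns.
        if any(re.match(pattern, name) for pattern in self.deny):
            return False
        return any(re.match(pattern, name) for pattern in self.allow)


@dataclass
class ExampleSourceConfig:
    # Filters are named `*_pattern` and match the fully qualified name.
    table_pattern: SimpleAllowDeny = field(default_factory=SimpleAllowDeny)
    # Independent `include_*` flags instead of a single `*_only` flag.
    include_tables: bool = True
    include_views: bool = True


config = ExampleSourceConfig(
    table_pattern=SimpleAllowDeny(deny=[r"analytics\.staging\..*"])
)
tables = ["analytics.public.orders", "analytics.staging.orders_tmp"]
kept = [name for name in tables if config.table_pattern.allowed(name)]
```

Matching on the fully qualified name is what lets a single pattern target one schema in one database without also hitting same-named tables elsewhere.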
#### Content
- All configs should have a description.
- When using inheritance or mixin classes, make sure that the fields and documentation are applicable in the base class. The `bigquery_temp_table_schema` field definitely shouldn’t be showing up in every single source’s profiling config!
- Set reasonable defaults!
- The configs should not contain a default that you’d reasonably expect to be built in. As a **bad** example, the Postgres source’s `schema_pattern` has a default deny pattern containing `information_schema`. This means that if the user overrides the `schema_pattern`, they’ll need to manually add `information_schema` to their deny patterns. This is bad, and the filtering should’ve been handled automatically by the source’s implementation, not added at runtime by its config.
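
One way to keep built-in exclusions out of the config defaults is sketched below. This is an assumption-laden illustration, not the Postgres source's actual code: `BUILT_IN_DENIED_SCHEMAS` and `select_schemas` are hypothetical names.

```python
import re
from typing import Iterable, List

# Schemas this hypothetical source always skips, regardless of user config.
BUILT_IN_DENIED_SCHEMAS = frozenset({"information_schema"})


def select_schemas(schemas: Iterable[str], user_deny_patterns: List[str]) -> List[str]:
    """Apply the built-in exclusions first, then the user's deny patterns."""
    selected = []
    for schema in schemas:
        if schema in BUILT_IN_DENIED_SCHEMAS:
            # Handled by the implementation, so overriding the config cannot undo it.
            continue
        if any(re.match(pattern, schema) for pattern in user_deny_patterns):
            continue
        selected.append(schema)
    return selected
```

With this split, a user who supplies their own deny patterns, or none at all, still never sees `information_schema` in the results.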
#### Coding
- Use a single pydantic validator per thing to validate - we shouldn’t have validation methods that are 50 lines long.
- Use `SecretStr` for passwords, auth tokens, etc.
- When doing simple field renames, use the `pydantic_renamed_field` helper.
- When doing field deprecations, use the `pydantic_removed_field` helper.
- Validator methods must only raise `ValueError`, `TypeError`, or `AssertionError`. Do not raise `ConfigurationError` from validators.
- Set `hidden_from_docs` for internal-only config flags. However, needing this often indicates a larger problem with the code structure. The hidden field should probably be a class attribute or an instance variable on the corresponding source.
If you made some changes that require generating new "golden" data files for use in testing a specific ingestion source, you can run the following to re-generate them: