Eugenio ef8b19142f
Create documentation resources for Data Quality as Code (closes #23800) (#24169)
* Brief documentation of installation requirements

* Minor fix to run tests only defined in OpenMetadata

* Add full example to Data Quality as Code

* Install `griffe2md` and fix docstrings

* Remove local openmetadata reference

* Fix writing, grammar and typos

* Fix test

* Fix formatting
2025-11-11 10:25:42 +00:00


{
"cells": [
{
"cell_type": "markdown",
"id": "14093fef-c7ab-4566-88e2-318553ba3aef",
"metadata": {},
"source": "# Running Data Quality tests for tables in OpenMetadata\n\nIn this Notebook we will join two data sources and load the result into our `Tutorial Postgres.raw.public.taxi_yellow` table.\n\nWe will be using the following two assets from the [NYC Yellow Taxi Ride Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page):\n- [Yellow Taxi Ride for September 2025 (parquet)](https://python-sdk-examples.s3.eu-west-3.amazonaws.com/data-quality/yellow_tripdata_2025-09.parquet)\n- [Taxi Zones Lookup (csv)](https://python-sdk-examples.s3.eu-west-3.amazonaws.com/data-quality/taxi_zone_lookup.csv)\n\n## Purpose\nWe want to showcase how to leverage OpenMetadata's data quality mechanisms directly from code. To do so, we simulate a very simple ETL that builds the data for which we have set up data quality tests in the [given instructions](/lab/tree/README.md).\n\n## Description of the ETL\nThe Yellow Taxi Ride dataset contains two columns, Pickup Location ID and Dropoff Location ID, which refer to the zone in which each end of the ride takes place. Yellow taxis either start or end in one of those zones, but we want to find only the rides that never leave the yellow area. The Taxi Zones Lookup dataset contains a mapping between the zone ID and the taxi type (e.g. Yellow Zone).\n\nOur ETL will join the two data sources and keep only the rides whose Pickup and Dropoff Location IDs are both yellow zones. Since we only want a subset of the data, we will load just 10,000 rows into our table.\n\nOnce we've loaded the results into the destination table, we will use the [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) library to run the Data Quality tests we have defined in [OpenMetadata](http://localhost:8585/table/Tutorial%20Postgres.raw.public.taxi_yellow/profiler/data-quality).\n\n## Dependencies\nFor our ETL we will be using PyArrow to load the Parquet file, Pandas DataFrames to work with the Taxi Rides and Taxi Zones datasets, and [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) to run the data quality tests. Since we're using Postgres as the database for our fake Data Warehouse, we will also need to install the dependencies for the OpenMetadata [Postgres Connector](https://docs.open-metadata.org/latest/connectors/database/postgres). Finally, we will need SQLAlchemy, which is installed by default with `openmetadata-ingestion`.\n\nWe can install all these dependencies by specifying the right extras. A full list can be found in the project's [`setup.py`](https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/setup.py); check it out if your installation differs from the example below.\n\n## Requirements\nIf you haven't already, please follow the [setup](/lab/tree/README.md#setup) steps in the README.\n\nFor this example you will need:\n\n- A running OpenMetadata instance (achieved by following the setup instructions above)\n- A bot JWT token. You can get one by using the [Ingestion Bot's](http://localhost:8585/bots/ingestion-bot) token from your OpenMetadata instance\n- [`openmetadata-ingestion`](https://pypi.org/project/openmetadata-ingestion/) version 1.11.0.0 or above (installed in this Notebook)"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Obtaining file:///opt/openmetadata/ingestion\n",
" Installing build dependencies ... \u001b[?25ldone\n",
"\u001b[?25h Checking if build backend supports build_editable ... \u001b[?25ldone\n",
"\u001b[?25h Getting requirements to build editable ... \u001b[?25ldone\n",
"\u001b[?25h Preparing editable metadata (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25hRequirement already satisfied: collate-sqllineage~=1.6.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.6.22)\n",
"Requirement already satisfied: azure-keyvault-secrets in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (4.10.0)\n",
"Requirement already satisfied: shapely in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.1.2)\n",
"Requirement already satisfied: tabulate==0.9.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.9.0)\n",
"Requirement already satisfied: sqlalchemy<2,>=1.4.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.4.54)\n",
"Requirement already satisfied: setuptools~=70.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (70.3.0)\n",
"Requirement already satisfied: cached-property==1.5.2 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.5.2)\n",
"Requirement already satisfied: chardet==4.0.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (4.0.0)\n",
"Requirement already satisfied: google-crc32c in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.7.1)\n",
"Requirement already satisfied: boto3<2.0,>=1.20 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.40.68)\n",
"Requirement already satisfied: antlr4-python3-runtime==4.9.2 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (4.9.2)\n",
"Requirement already satisfied: python-dotenv>=0.19.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.2.1)\n",
"Requirement already satisfied: azure-identity~=1.12 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.25.1)\n",
"Requirement already satisfied: Jinja2>=2.11.3 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (3.1.6)\n",
"Requirement already satisfied: pydantic<2.12,>=2.7.0,~=2.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.9.2)\n",
"Requirement already satisfied: httpx~=0.28.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.28.1)\n",
"Requirement already satisfied: typing-inspect in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.9.0)\n",
"Requirement already satisfied: requests>=2.23 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.32.3)\n",
"Requirement already satisfied: pydantic-settings>=2.7.0,~=2.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.11.0)\n",
"Requirement already satisfied: jaraco.functools<4.2.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (4.1.0)\n",
"Requirement already satisfied: mypy-extensions>=0.4.3 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.1.0)\n",
"Requirement already satisfied: PyYAML~=6.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (6.0.2)\n",
"Requirement already satisfied: collate-data-diff>=0.11.6 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.11.7)\n",
"Requirement already satisfied: snowflake-connector-python<4.0.0,>=3.13.1 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (3.18.0)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.9.0)\n",
"Requirement already satisfied: kubernetes>=21.0.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (34.1.0)\n",
"Requirement already satisfied: jsonpatch<2.0,>=1.24 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.33)\n",
"Requirement already satisfied: memory-profiler in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.61.0)\n",
"Requirement already satisfied: email-validator>=2.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.3.0)\n",
"Requirement already satisfied: google-cloud-secret-manager==2.24.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.24.0)\n",
"Requirement already satisfied: cryptography>=42.0.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (46.0.0)\n",
"Requirement already satisfied: importlib-metadata>=4.13.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (8.5.0)\n",
"Requirement already satisfied: packaging in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (24.1)\n",
"Requirement already satisfied: pymysql~=1.0 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.1.2)\n",
"Requirement already satisfied: requests-aws4auth~=1.1 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.3.1)\n",
"Requirement already satisfied: mysql-connector-python>=9.1 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (9.5.0)\n",
"Requirement already satisfied: google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1 in /opt/conda/lib/python3.11/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (2.28.1)\n",
"Requirement already satisfied: google-auth!=2.24.0,!=2.25.0,<3.0.0,>=2.14.1 in /opt/conda/lib/python3.11/site-packages (from google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (2.43.0)\n",
"Requirement already satisfied: proto-plus<2.0.0,>=1.22.3 in /opt/conda/lib/python3.11/site-packages (from google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (1.26.1)\n",
"Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<7.0.0,>=3.20.2 in /opt/conda/lib/python3.11/site-packages (from google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (6.33.0)\n",
"Requirement already satisfied: grpc-google-iam-v1<1.0.0,>=0.14.0 in /opt/conda/lib/python3.11/site-packages (from google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (0.14.3)\n",
"Requirement already satisfied: pandas~=2.0.3 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.0.3)\n",
"Requirement already satisfied: numpy<2 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (1.26.4)\n",
"Requirement already satisfied: psycopg2-binary in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (2.9.11)\n",
"Requirement already satisfied: GeoAlchemy2~=0.12 in /opt/conda/lib/python3.11/site-packages (from openmetadata-ingestion==1.10.0.0.dev0) (0.18.0)\n",
"Collecting pyarrow~=16.0 (from openmetadata-ingestion==1.10.0.0.dev0)\n",
" Downloading pyarrow-16.1.0-cp311-cp311-manylinux_2_28_aarch64.whl.metadata (3.0 kB)\n",
"Requirement already satisfied: azure-core>=1.31.0 in /opt/conda/lib/python3.11/site-packages (from azure-identity~=1.12->openmetadata-ingestion==1.10.0.0.dev0) (1.36.0)\n",
"Requirement already satisfied: msal>=1.30.0 in /opt/conda/lib/python3.11/site-packages (from azure-identity~=1.12->openmetadata-ingestion==1.10.0.0.dev0) (1.34.0)\n",
"Requirement already satisfied: msal-extensions>=1.2.0 in /opt/conda/lib/python3.11/site-packages (from azure-identity~=1.12->openmetadata-ingestion==1.10.0.0.dev0) (1.3.1)\n",
"Requirement already satisfied: typing-extensions>=4.0.0 in /opt/conda/lib/python3.11/site-packages (from azure-identity~=1.12->openmetadata-ingestion==1.10.0.0.dev0) (4.12.2)\n",
"Requirement already satisfied: botocore<1.41.0,>=1.40.68 in /opt/conda/lib/python3.11/site-packages (from boto3<2.0,>=1.20->openmetadata-ingestion==1.10.0.0.dev0) (1.40.68)\n",
"Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.11/site-packages (from boto3<2.0,>=1.20->openmetadata-ingestion==1.10.0.0.dev0) (1.0.1)\n",
"Requirement already satisfied: s3transfer<0.15.0,>=0.14.0 in /opt/conda/lib/python3.11/site-packages (from boto3<2.0,>=1.20->openmetadata-ingestion==1.10.0.0.dev0) (0.14.0)\n",
"Requirement already satisfied: attrs>=23.1.0 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (24.2.0)\n",
"Requirement already satisfied: click>=8.1 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (8.3.0)\n",
"Requirement already satisfied: dbt-core<2.0.0,>=1.0.0 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.10.13)\n",
"Requirement already satisfied: dsnparse<0.2.0 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.1.15)\n",
"Requirement already satisfied: keyring in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (25.6.0)\n",
"Requirement already satisfied: mashumaro<3.11.0,>=2.9 in /opt/conda/lib/python3.11/site-packages (from mashumaro[msgpack]<3.11.0,>=2.9->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (3.10)\n",
"Requirement already satisfied: rich in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (14.2.0)\n",
"Requirement already satisfied: toml>=0.10.2 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.10.2)\n",
"Requirement already satisfied: urllib3<2 in /opt/conda/lib/python3.11/site-packages (from collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.26.20)\n",
"Requirement already satisfied: sqlparse<0.6,>=0.5 in /opt/conda/lib/python3.11/site-packages (from collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (0.5.3)\n",
"Requirement already satisfied: networkx>=2.4 in /opt/conda/lib/python3.11/site-packages (from collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (3.5)\n",
"Requirement already satisfied: collate-sqlfluff~=3.3.0 in /opt/conda/lib/python3.11/site-packages (from collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (3.3.6)\n",
"Requirement already satisfied: cffi>=1.14 in /opt/conda/lib/python3.11/site-packages (from cryptography>=42.0.0->openmetadata-ingestion==1.10.0.0.dev0) (1.17.1)\n",
"Requirement already satisfied: dnspython>=2.0.0 in /opt/conda/lib/python3.11/site-packages (from email-validator>=2.0->openmetadata-ingestion==1.10.0.0.dev0) (2.8.0)\n",
"Requirement already satisfied: idna>=2.0.0 in /opt/conda/lib/python3.11/site-packages (from email-validator>=2.0->openmetadata-ingestion==1.10.0.0.dev0) (3.10)\n",
"Requirement already satisfied: anyio in /opt/conda/lib/python3.11/site-packages (from httpx~=0.28.0->openmetadata-ingestion==1.10.0.0.dev0) (4.6.2.post1)\n",
"Requirement already satisfied: certifi in /opt/conda/lib/python3.11/site-packages (from httpx~=0.28.0->openmetadata-ingestion==1.10.0.0.dev0) (2024.8.30)\n",
"Requirement already satisfied: httpcore==1.* in /opt/conda/lib/python3.11/site-packages (from httpx~=0.28.0->openmetadata-ingestion==1.10.0.0.dev0) (1.0.6)\n",
"Requirement already satisfied: h11<0.15,>=0.13 in /opt/conda/lib/python3.11/site-packages (from httpcore==1.*->httpx~=0.28.0->openmetadata-ingestion==1.10.0.0.dev0) (0.14.0)\n",
"Requirement already satisfied: zipp>=3.20 in /opt/conda/lib/python3.11/site-packages (from importlib-metadata>=4.13.0->openmetadata-ingestion==1.10.0.0.dev0) (3.20.2)\n",
"Requirement already satisfied: more-itertools in /opt/conda/lib/python3.11/site-packages (from jaraco.functools<4.2.0->openmetadata-ingestion==1.10.0.0.dev0) (10.8.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/lib/python3.11/site-packages (from Jinja2>=2.11.3->openmetadata-ingestion==1.10.0.0.dev0) (3.0.2)\n",
"Requirement already satisfied: jsonpointer>=1.9 in /opt/conda/lib/python3.11/site-packages (from jsonpatch<2.0,>=1.24->openmetadata-ingestion==1.10.0.0.dev0) (3.0.0)\n",
"Requirement already satisfied: six>=1.9.0 in /opt/conda/lib/python3.11/site-packages (from kubernetes>=21.0.0->openmetadata-ingestion==1.10.0.0.dev0) (1.16.0)\n",
"Requirement already satisfied: websocket-client!=0.40.0,!=0.41.*,!=0.42.*,>=0.32.0 in /opt/conda/lib/python3.11/site-packages (from kubernetes>=21.0.0->openmetadata-ingestion==1.10.0.0.dev0) (1.8.0)\n",
"Requirement already satisfied: requests-oauthlib in /opt/conda/lib/python3.11/site-packages (from kubernetes>=21.0.0->openmetadata-ingestion==1.10.0.0.dev0) (2.0.0)\n",
"Requirement already satisfied: durationpy>=0.7 in /opt/conda/lib/python3.11/site-packages (from kubernetes>=21.0.0->openmetadata-ingestion==1.10.0.0.dev0) (0.10)\n",
"Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.11/site-packages (from pandas~=2.0.3->openmetadata-ingestion==1.10.0.0.dev0) (2024.2)\n",
"Requirement already satisfied: tzdata>=2022.1 in /opt/conda/lib/python3.11/site-packages (from pandas~=2.0.3->openmetadata-ingestion==1.10.0.0.dev0) (2025.2)\n",
"Requirement already satisfied: annotated-types>=0.6.0 in /opt/conda/lib/python3.11/site-packages (from pydantic<2.12,>=2.7.0,~=2.0->openmetadata-ingestion==1.10.0.0.dev0) (0.7.0)\n",
"Requirement already satisfied: pydantic-core==2.23.4 in /opt/conda/lib/python3.11/site-packages (from pydantic<2.12,>=2.7.0,~=2.0->openmetadata-ingestion==1.10.0.0.dev0) (2.23.4)\n",
"Requirement already satisfied: typing-inspection>=0.4.0 in /opt/conda/lib/python3.11/site-packages (from pydantic-settings>=2.7.0,~=2.0->openmetadata-ingestion==1.10.0.0.dev0) (0.4.2)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.11/site-packages (from requests>=2.23->openmetadata-ingestion==1.10.0.0.dev0) (3.4.0)\n",
"Requirement already satisfied: asn1crypto<2.0.0,>0.24.0 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (1.5.1)\n",
"Requirement already satisfied: pyOpenSSL<26.0.0,>=22.0.0 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (25.3.0)\n",
"Requirement already satisfied: pyjwt<3.0.0 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (2.9.0)\n",
"Requirement already satisfied: filelock<4,>=3.5 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (3.20.0)\n",
"Requirement already satisfied: sortedcontainers>=2.4.0 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (2.4.0)\n",
"Requirement already satisfied: platformdirs<5.0.0,>=2.6.0 in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (4.3.6)\n",
"Requirement already satisfied: tomlkit in /opt/conda/lib/python3.11/site-packages (from snowflake-connector-python<4.0.0,>=3.13.1->openmetadata-ingestion==1.10.0.0.dev0) (0.13.3)\n",
"Requirement already satisfied: greenlet!=0.4.17 in /opt/conda/lib/python3.11/site-packages (from sqlalchemy<2,>=1.4.0->openmetadata-ingestion==1.10.0.0.dev0) (3.1.1)\n",
"Requirement already satisfied: isodate>=0.6.1 in /opt/conda/lib/python3.11/site-packages (from azure-keyvault-secrets->openmetadata-ingestion==1.10.0.0.dev0) (0.6.1)\n",
"Requirement already satisfied: psutil in /opt/conda/lib/python3.11/site-packages (from memory-profiler->openmetadata-ingestion==1.10.0.0.dev0) (6.0.0)\n",
"Requirement already satisfied: pycparser in /opt/conda/lib/python3.11/site-packages (from cffi>=1.14->cryptography>=42.0.0->openmetadata-ingestion==1.10.0.0.dev0) (2.22)\n",
"Requirement already satisfied: colorama>=0.3 in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (0.4.6)\n",
"Requirement already satisfied: diff-cover>=2.5.0 in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (9.7.1)\n",
"Requirement already satisfied: pathspec in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (0.12.1)\n",
"Requirement already satisfied: pytest in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (8.4.2)\n",
"Requirement already satisfied: regex in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (2025.11.3)\n",
"Requirement already satisfied: tblib in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (3.2.1)\n",
"Requirement already satisfied: tqdm in /opt/conda/lib/python3.11/site-packages (from collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (4.66.5)\n",
"Requirement already satisfied: agate<1.10,>=1.7.0 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.9.1)\n",
"Requirement already satisfied: jsonschema<5.0,>=4.19.1 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (4.23.0)\n",
"Requirement already satisfied: snowplow-tracker<2.0,>=1.0.2 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.1.0)\n",
"Requirement already satisfied: dbt-extractor<=0.6,>=0.5.0 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.6.0)\n",
"Requirement already satisfied: dbt-semantic-interfaces<0.10,>=0.9.0 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.9.0)\n",
"Requirement already satisfied: dbt-common<2.0,>=1.27.0 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.36.0)\n",
"Requirement already satisfied: dbt-adapters<2.0,>=1.15.5 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.18.0)\n",
"Requirement already satisfied: dbt-protos<2.0,>=1.0.346 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.0.382)\n",
"Requirement already satisfied: daff>=1.3.46 in /opt/conda/lib/python3.11/site-packages (from dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.4.2)\n",
"Requirement already satisfied: googleapis-common-protos<2.0.0,>=1.56.2 in /opt/conda/lib/python3.11/site-packages (from google-api-core!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1->google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (1.72.0)\n",
"Requirement already satisfied: grpcio<2.0.0,>=1.33.2 in /opt/conda/lib/python3.11/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (1.76.0)\n",
"Requirement already satisfied: grpcio-status<2.0.0,>=1.33.2 in /opt/conda/lib/python3.11/site-packages (from google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.10.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,!=2.8.*,!=2.9.*,<3.0.0,>=1.34.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (1.76.0)\n",
"Requirement already satisfied: cachetools<7.0,>=2.0.0 in /opt/conda/lib/python3.11/site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=2.14.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (6.2.1)\n",
"Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/conda/lib/python3.11/site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=2.14.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (0.4.2)\n",
"Requirement already satisfied: rsa<5,>=3.1.4 in /opt/conda/lib/python3.11/site-packages (from google-auth!=2.24.0,!=2.25.0,<3.0.0,>=2.14.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (4.9.1)\n",
"Requirement already satisfied: msgpack>=0.5.6 in /opt/conda/lib/python3.11/site-packages (from mashumaro[msgpack]<3.11.0,>=2.9->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.1.2)\n",
"Requirement already satisfied: sniffio>=1.1 in /opt/conda/lib/python3.11/site-packages (from anyio->httpx~=0.28.0->openmetadata-ingestion==1.10.0.0.dev0) (1.3.1)\n",
"Requirement already satisfied: SecretStorage>=3.2 in /opt/conda/lib/python3.11/site-packages (from keyring->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (3.4.0)\n",
"Requirement already satisfied: jeepney>=0.4.2 in /opt/conda/lib/python3.11/site-packages (from keyring->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.9.0)\n",
"Requirement already satisfied: jaraco.classes in /opt/conda/lib/python3.11/site-packages (from keyring->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (3.4.0)\n",
"Requirement already satisfied: jaraco.context in /opt/conda/lib/python3.11/site-packages (from keyring->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (6.0.1)\n",
"Requirement already satisfied: oauthlib>=3.0.0 in /opt/conda/lib/python3.11/site-packages (from requests-oauthlib->kubernetes>=21.0.0->openmetadata-ingestion==1.10.0.0.dev0) (3.2.2)\n",
"Requirement already satisfied: markdown-it-py>=2.2.0 in /opt/conda/lib/python3.11/site-packages (from rich->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (4.0.0)\n",
"Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /opt/conda/lib/python3.11/site-packages (from rich->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (2.19.2)\n",
"Requirement already satisfied: Babel>=2.0 in /opt/conda/lib/python3.11/site-packages (from agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (2.14.0)\n",
"Requirement already satisfied: leather>=0.3.2 in /opt/conda/lib/python3.11/site-packages (from agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.4.0)\n",
"Requirement already satisfied: parsedatetime!=2.5,>=2.1 in /opt/conda/lib/python3.11/site-packages (from agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (2.6)\n",
"Requirement already satisfied: python-slugify>=1.2.1 in /opt/conda/lib/python3.11/site-packages (from agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (8.0.4)\n",
"Requirement already satisfied: pytimeparse>=1.1.5 in /opt/conda/lib/python3.11/site-packages (from agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.1.8)\n",
"Requirement already satisfied: deepdiff<9.0,>=7.0 in /opt/conda/lib/python3.11/site-packages (from dbt-common<2.0,>=1.27.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (8.6.1)\n",
"Requirement already satisfied: pluggy<2,>=0.13.1 in /opt/conda/lib/python3.11/site-packages (from diff-cover>=2.5.0->collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (1.5.0)\n",
"Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /opt/conda/lib/python3.11/site-packages (from jsonschema<5.0,>=4.19.1->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (2024.10.1)\n",
"Requirement already satisfied: referencing>=0.28.4 in /opt/conda/lib/python3.11/site-packages (from jsonschema<5.0,>=4.19.1->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.35.1)\n",
"Requirement already satisfied: rpds-py>=0.7.1 in /opt/conda/lib/python3.11/site-packages (from jsonschema<5.0,>=4.19.1->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.20.0)\n",
"Requirement already satisfied: mdurl~=0.1 in /opt/conda/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (0.1.2)\n",
"Requirement already satisfied: pyasn1<0.7.0,>=0.6.1 in /opt/conda/lib/python3.11/site-packages (from pyasn1-modules>=0.2.1->google-auth!=2.24.0,!=2.25.0,<3.0.0,>=2.14.1->google-cloud-secret-manager==2.24.0->openmetadata-ingestion==1.10.0.0.dev0) (0.6.1)\n",
"Requirement already satisfied: backports.tarfile in /opt/conda/lib/python3.11/site-packages (from jaraco.context->keyring->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.2.0)\n",
"Requirement already satisfied: iniconfig>=1 in /opt/conda/lib/python3.11/site-packages (from pytest->collate-sqlfluff~=3.3.0->collate-sqllineage~=1.6.0->openmetadata-ingestion==1.10.0.0.dev0) (2.3.0)\n",
"Requirement already satisfied: orderly-set<6,>=5.4.1 in /opt/conda/lib/python3.11/site-packages (from deepdiff<9.0,>=7.0->dbt-common<2.0,>=1.27.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (5.5.0)\n",
"Requirement already satisfied: text-unidecode>=1.3 in /opt/conda/lib/python3.11/site-packages (from python-slugify>=1.2.1->agate<1.10,>=1.7.0->dbt-core<2.0.0,>=1.0.0->collate-data-diff>=0.11.6->openmetadata-ingestion==1.10.0.0.dev0) (1.3)\n",
"Downloading pyarrow-16.1.0-cp311-cp311-manylinux_2_28_aarch64.whl (38.1 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m38.1/38.1 MB\u001b[0m \u001b[31m22.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
"\u001b[?25hBuilding wheels for collected packages: openmetadata-ingestion\n",
" Building editable for openmetadata-ingestion (pyproject.toml) ... \u001b[?25ldone\n",
"\u001b[?25h Created wheel for openmetadata-ingestion: filename=openmetadata_ingestion-1.10.0.0.dev0-0.editable-py3-none-any.whl size=14132 sha256=5c3a6a7cd0b44a262ae2892074f397b31b22c32350458d08a836971f65907500\n",
" Stored in directory: /tmp/pip-ephem-wheel-cache-6kn7h0xf/wheels/94/a6/4b/951e6297508c20775c8465f8caed457f0821461c94c158f900\n",
"Successfully built openmetadata-ingestion\n",
"Installing collected packages: pyarrow, openmetadata-ingestion\n",
" Attempting uninstall: openmetadata-ingestion\n",
" Found existing installation: openmetadata-ingestion 1.10.0.0.dev0\n",
" Uninstalling openmetadata-ingestion-1.10.0.0.dev0:\n",
" Successfully uninstalled openmetadata-ingestion-1.10.0.0.dev0\n",
"Successfully installed openmetadata-ingestion-1.10.0.0.dev0 pyarrow-16.1.0\n"
]
}
],
"source": "!pip install \"openmetadata-ingestion[pandas,pyarrow,postgres]>=1.11.0.0\""
},
{
"cell_type": "markdown",
"id": "84b422b5-dae7-4094-ae50-cd38c0754a6b",
"metadata": {},
"source": [
"## Initial SDK setup\n",
"In this step we configure the SDK so that our Python code is ready to work against OpenMetadata.\n",
"\n",
"You will be prompted for the JWT token mentioned in the [requirements](#requirements) section."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4c2a0cc0-f335-4b53-943d-357a078d7de2",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"Please enter a JWT token for authentication with OM ········\n"
]
},
{
"data": {
"text/plain": [
"<metadata.sdk.client.OpenMetadata at 0xffff63bcda90>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from getpass import getpass\n",
"\n",
"from metadata.sdk import configure\n",
"\n",
"jwt_token = getpass(\"Please enter a JWT token for authentication with OM\")\n",
"\n",
"configure(\n",
" host=\"http://openmetadata_server:8585/api\",\n",
" jwt_token=jwt_token,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "aa6bb55b-2dd5-4852-a849-78611bbc7ebe",
"metadata": {},
"source": [
"## Implementation of the ETL"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4d27a4eb-9dd8-499a-ab4b-157675972315",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"taxi_rides = pd.read_parquet(\"https://python-sdk-resources.s3.eu-west-3.amazonaws.com/data-quality/yellow_tripdata_2025-09.parquet\")\n",
"taxi_zones = pd.read_csv(\"https://python-sdk-resources.s3.eu-west-3.amazonaws.com/data-quality/taxi_zone_lookup.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9bda2954-cd19-4747-8d2f-976577b4f27b",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VendorID</th>\n",
" <th>tpep_pickup_datetime</th>\n",
" <th>tpep_dropoff_datetime</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>RatecodeID</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>PULocationID</th>\n",
" <th>DOLocationID</th>\n",
" <th>payment_type</th>\n",
" <th>fare_amount</th>\n",
" <th>extra</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" <th>Airport_fee</th>\n",
" <th>cbd_congestion_fee</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:19:20</td>\n",
" <td>2025-09-01 00:45:17</td>\n",
" <td>1.0</td>\n",
" <td>9.92</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>138</td>\n",
" <td>114</td>\n",
" <td>1</td>\n",
" <td>42.9</td>\n",
" <td>6.0</td>\n",
" <td>0.5</td>\n",
" <td>10.73</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>66.13</td>\n",
" <td>2.5</td>\n",
" <td>1.75</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:15:20</td>\n",
" <td>2025-09-01 00:26:08</td>\n",
" <td>2.0</td>\n",
" <td>6.82</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>93</td>\n",
" <td>157</td>\n",
" <td>1</td>\n",
" <td>26.8</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>5.86</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>35.16</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:06:07</td>\n",
" <td>2025-09-01 00:22:23</td>\n",
" <td>1.0</td>\n",
" <td>3.95</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>19.8</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>5.11</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>30.66</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:49:47</td>\n",
" <td>2025-09-01 01:04:49</td>\n",
" <td>1.0</td>\n",
" <td>3.14</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>234</td>\n",
" <td>87</td>\n",
" <td>1</td>\n",
" <td>17.7</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>3.52</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>26.97</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:05:00</td>\n",
" <td>2025-09-01 00:15:32</td>\n",
" <td>6.0</td>\n",
" <td>2.81</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>230</td>\n",
" <td>151</td>\n",
" <td>1</td>\n",
" <td>14.9</td>\n",
" <td>1.0</td>\n",
" <td>0.5</td>\n",
" <td>4.13</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>24.78</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"0 2 2025-09-01 00:19:20 2025-09-01 00:45:17 1.0 \n",
"1 2 2025-09-01 00:15:20 2025-09-01 00:26:08 2.0 \n",
"2 2 2025-09-01 00:06:07 2025-09-01 00:22:23 1.0 \n",
"3 2 2025-09-01 00:49:47 2025-09-01 01:04:49 1.0 \n",
"4 2 2025-09-01 00:05:00 2025-09-01 00:15:32 6.0 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
"0 9.92 1.0 N 138 114 \n",
"1 6.82 1.0 N 93 157 \n",
"2 3.95 1.0 N 68 13 \n",
"3 3.14 1.0 N 234 87 \n",
"4 2.81 1.0 N 230 151 \n",
"\n",
" payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
"0 1 42.9 6.0 0.5 10.73 0.0 \n",
"1 1 26.8 1.0 0.5 5.86 0.0 \n",
"2 1 19.8 1.0 0.5 5.11 0.0 \n",
"3 1 17.7 1.0 0.5 3.52 0.0 \n",
"4 1 14.9 1.0 0.5 4.13 0.0 \n",
"\n",
" improvement_surcharge total_amount congestion_surcharge Airport_fee \\\n",
"0 1.0 66.13 2.5 1.75 \n",
"1 1.0 35.16 0.0 0.00 \n",
"2 1.0 30.66 2.5 0.00 \n",
"3 1.0 26.97 2.5 0.00 \n",
"4 1.0 24.78 2.5 0.00 \n",
"\n",
" cbd_congestion_fee \n",
"0 0.75 \n",
"1 0.00 \n",
"2 0.75 \n",
"3 0.75 \n",
"4 0.75 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taxi_rides.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "56f0272a-e326-4f30-a4c6-2a521cd51f53",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>LocationID</th>\n",
" <th>Borough</th>\n",
" <th>Zone</th>\n",
" <th>service_zone</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>EWR</td>\n",
" <td>Newark Airport</td>\n",
" <td>EWR</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Queens</td>\n",
" <td>Jamaica Bay</td>\n",
" <td>Boro Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Bronx</td>\n",
" <td>Allerton/Pelham Gardens</td>\n",
" <td>Boro Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Manhattan</td>\n",
" <td>Alphabet City</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Staten Island</td>\n",
" <td>Arden Heights</td>\n",
" <td>Boro Zone</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" LocationID Borough Zone service_zone\n",
"0 1 EWR Newark Airport EWR\n",
"1 2 Queens Jamaica Bay Boro Zone\n",
"2 3 Bronx Allerton/Pelham Gardens Boro Zone\n",
"3 4 Manhattan Alphabet City Yellow Zone\n",
"4 5 Staten Island Arden Heights Boro Zone"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"taxi_zones.head()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "872bb8e2-8b74-4d3d-9078-6cbccf10882a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['EWR', 'Boro Zone', 'Yellow Zone', 'Airports', nan], dtype=object)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Check existing values.\n",
"taxi_zones[\"service_zone\"].unique()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "b85dc6b8-819f-476b-8593-788a30461a78",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VendorID</th>\n",
" <th>tpep_pickup_datetime</th>\n",
" <th>tpep_dropoff_datetime</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>RatecodeID</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>PULocationID</th>\n",
" <th>DOLocationID</th>\n",
" <th>payment_type</th>\n",
" <th>...</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" <th>Airport_fee</th>\n",
" <th>cbd_congestion_fee</th>\n",
" <th>PUZone</th>\n",
" <th>DOZone</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:19:20</td>\n",
" <td>2025-09-01 00:45:17</td>\n",
" <td>1.0</td>\n",
" <td>9.92</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>138</td>\n",
" <td>114</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>10.73</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>66.13</td>\n",
" <td>2.5</td>\n",
" <td>1.75</td>\n",
" <td>0.75</td>\n",
" <td>Airports</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:15:20</td>\n",
" <td>2025-09-01 00:26:08</td>\n",
" <td>2.0</td>\n",
" <td>6.82</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>93</td>\n",
" <td>157</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>5.86</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>35.16</td>\n",
" <td>0.0</td>\n",
" <td>0.00</td>\n",
" <td>0.00</td>\n",
" <td>Boro Zone</td>\n",
" <td>Boro Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:06:07</td>\n",
" <td>2025-09-01 00:22:23</td>\n",
" <td>1.0</td>\n",
" <td>3.95</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>5.11</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>30.66</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:49:47</td>\n",
" <td>2025-09-01 01:04:49</td>\n",
" <td>1.0</td>\n",
" <td>3.14</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>234</td>\n",
" <td>87</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>3.52</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>26.97</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:05:00</td>\n",
" <td>2025-09-01 00:15:32</td>\n",
" <td>6.0</td>\n",
" <td>2.81</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>230</td>\n",
" <td>151</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>4.13</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>24.78</td>\n",
" <td>2.5</td>\n",
" <td>0.00</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 22 columns</p>\n",
"</div>"
],
"text/plain": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"0 2 2025-09-01 00:19:20 2025-09-01 00:45:17 1.0 \n",
"1 2 2025-09-01 00:15:20 2025-09-01 00:26:08 2.0 \n",
"2 2 2025-09-01 00:06:07 2025-09-01 00:22:23 1.0 \n",
"3 2 2025-09-01 00:49:47 2025-09-01 01:04:49 1.0 \n",
"4 2 2025-09-01 00:05:00 2025-09-01 00:15:32 6.0 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
"0 9.92 1.0 N 138 114 \n",
"1 6.82 1.0 N 93 157 \n",
"2 3.95 1.0 N 68 13 \n",
"3 3.14 1.0 N 234 87 \n",
"4 2.81 1.0 N 230 151 \n",
"\n",
" payment_type ... mta_tax tip_amount tolls_amount \\\n",
"0 1 ... 0.5 10.73 0.0 \n",
"1 1 ... 0.5 5.86 0.0 \n",
"2 1 ... 0.5 5.11 0.0 \n",
"3 1 ... 0.5 3.52 0.0 \n",
"4 1 ... 0.5 4.13 0.0 \n",
"\n",
" improvement_surcharge total_amount congestion_surcharge Airport_fee \\\n",
"0 1.0 66.13 2.5 1.75 \n",
"1 1.0 35.16 0.0 0.00 \n",
"2 1.0 30.66 2.5 0.00 \n",
"3 1.0 26.97 2.5 0.00 \n",
"4 1.0 24.78 2.5 0.00 \n",
"\n",
" cbd_congestion_fee PUZone DOZone \n",
"0 0.75 Airports Yellow Zone \n",
"1 0.00 Boro Zone Boro Zone \n",
"2 0.75 Yellow Zone Yellow Zone \n",
"3 0.75 Yellow Zone Yellow Zone \n",
"4 0.75 Yellow Zone Yellow Zone \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
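"# Look up the service zone of each ride's pickup (`PULocationID`) and drop-off (`DOLocationID`) location.\n",
"# Each left merge keeps one row per ride, in the original order, as long as `LocationID` is unique in `taxi_zones`.\n",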
"pickup_zones = taxi_rides.merge(taxi_zones[[\"LocationID\", \"service_zone\"]], left_on=\"PULocationID\", right_on=\"LocationID\", how=\"left\")[\"service_zone\"]\n",
"dropoff_zones = taxi_rides.merge(taxi_zones[[\"LocationID\", \"service_zone\"]], left_on=\"DOLocationID\", right_on=\"LocationID\", how=\"left\")[\"service_zone\"]\n",
"taxi_rides_with_pickup_and_dropoff_zone = taxi_rides.assign(PUZone=pickup_zones, DOZone=dropoff_zones)\n",
"taxi_rides_with_pickup_and_dropoff_zone.head()"
]
},
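{
"cell_type": "markdown",
"id": "c7e1a2f4-3b5d-4e6a-9f10-2d4b6a8c0e12",
"metadata": {},
"source": [
"The join above aligns the merged `service_zone` column with `taxi_rides` by row index, which works as long as every `LocationID` appears only once in the lookup table. The following is a minimal sketch of that assumption on toy data, not part of the original ETL: the `demo_rides` and `demo_zones` names and their values are made up for illustration. Passing `validate=\"many_to_one\"` asks pandas to raise a `MergeError` if the lookup keys are duplicated, instead of silently producing extra rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8f2b3a5-4c6e-4f7b-8a21-3e5c7b9d1f23",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for `taxi_rides` and `taxi_zones`\n",
"demo_rides = pd.DataFrame({\"PULocationID\": [4, 1, 4]})\n",
"demo_zones = pd.DataFrame({\"LocationID\": [1, 4], \"service_zone\": [\"EWR\", \"Yellow Zone\"]})\n",
"\n",
"# `validate` raises pandas.errors.MergeError if `LocationID` is not unique in `demo_zones`\n",
"demo_pickup = demo_rides.merge(\n",
"    demo_zones,\n",
"    left_on=\"PULocationID\",\n",
"    right_on=\"LocationID\",\n",
"    how=\"left\",\n",
"    validate=\"many_to_one\",\n",
")\n",
"\n",
"# A left merge on a unique lookup key returns exactly one row per ride\n",
"assert len(demo_pickup) == len(demo_rides)\n",
"demo_pickup"
]
},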
{
"cell_type": "code",
"execution_count": 10,
"id": "788dfa4d-42a2-4af7-a8e1-6dd4d7c20c59",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>VendorID</th>\n",
" <th>tpep_pickup_datetime</th>\n",
" <th>tpep_dropoff_datetime</th>\n",
" <th>passenger_count</th>\n",
" <th>trip_distance</th>\n",
" <th>RatecodeID</th>\n",
" <th>store_and_fwd_flag</th>\n",
" <th>PULocationID</th>\n",
" <th>DOLocationID</th>\n",
" <th>payment_type</th>\n",
" <th>...</th>\n",
" <th>mta_tax</th>\n",
" <th>tip_amount</th>\n",
" <th>tolls_amount</th>\n",
" <th>improvement_surcharge</th>\n",
" <th>total_amount</th>\n",
" <th>congestion_surcharge</th>\n",
" <th>Airport_fee</th>\n",
" <th>cbd_congestion_fee</th>\n",
" <th>PUZone</th>\n",
" <th>DOZone</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:06:07</td>\n",
" <td>2025-09-01 00:22:23</td>\n",
" <td>1.0</td>\n",
" <td>3.95</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>68</td>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>5.11</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>30.66</td>\n",
" <td>2.5</td>\n",
" <td>0.0</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:49:47</td>\n",
" <td>2025-09-01 01:04:49</td>\n",
" <td>1.0</td>\n",
" <td>3.14</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>234</td>\n",
" <td>87</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>3.52</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>26.97</td>\n",
" <td>2.5</td>\n",
" <td>0.0</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2</td>\n",
" <td>2025-09-01 00:05:00</td>\n",
" <td>2025-09-01 00:15:32</td>\n",
" <td>6.0</td>\n",
" <td>2.81</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>230</td>\n",
" <td>151</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>4.13</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>24.78</td>\n",
" <td>2.5</td>\n",
" <td>0.0</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>2025-09-01 00:16:53</td>\n",
" <td>2025-09-01 00:29:36</td>\n",
" <td>2.0</td>\n",
" <td>2.00</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>79</td>\n",
" <td>164</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>4.00</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>23.95</td>\n",
" <td>2.5</td>\n",
" <td>0.0</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" <td>2025-09-01 00:33:01</td>\n",
" <td>2025-09-01 00:43:13</td>\n",
" <td>2.0</td>\n",
" <td>3.10</td>\n",
" <td>1.0</td>\n",
" <td>N</td>\n",
" <td>164</td>\n",
" <td>236</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0.5</td>\n",
" <td>4.10</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>24.75</td>\n",
" <td>2.5</td>\n",
" <td>0.0</td>\n",
" <td>0.75</td>\n",
" <td>Yellow Zone</td>\n",
" <td>Yellow Zone</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 22 columns</p>\n",
"</div>"
],
"text/plain": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"2 2 2025-09-01 00:06:07 2025-09-01 00:22:23 1.0 \n",
"3 2 2025-09-01 00:49:47 2025-09-01 01:04:49 1.0 \n",
"4 2 2025-09-01 00:05:00 2025-09-01 00:15:32 6.0 \n",
"5 1 2025-09-01 00:16:53 2025-09-01 00:29:36 2.0 \n",
"6 1 2025-09-01 00:33:01 2025-09-01 00:43:13 2.0 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
"2 3.95 1.0 N 68 13 \n",
"3 3.14 1.0 N 234 87 \n",
"4 2.81 1.0 N 230 151 \n",
"5 2.00 1.0 N 79 164 \n",
"6 3.10 1.0 N 164 236 \n",
"\n",
" payment_type ... mta_tax tip_amount tolls_amount \\\n",
"2 1 ... 0.5 5.11 0.0 \n",
"3 1 ... 0.5 3.52 0.0 \n",
"4 1 ... 0.5 4.13 0.0 \n",
"5 1 ... 0.5 4.00 0.0 \n",
"6 1 ... 0.5 4.10 0.0 \n",
"\n",
" improvement_surcharge total_amount congestion_surcharge Airport_fee \\\n",
"2 1.0 30.66 2.5 0.0 \n",
"3 1.0 26.97 2.5 0.0 \n",
"4 1.0 24.78 2.5 0.0 \n",
"5 1.0 23.95 2.5 0.0 \n",
"6 1.0 24.75 2.5 0.0 \n",
"\n",
" cbd_congestion_fee PUZone DOZone \n",
"2 0.75 Yellow Zone Yellow Zone \n",
"3 0.75 Yellow Zone Yellow Zone \n",
"4 0.75 Yellow Zone Yellow Zone \n",
"5 0.75 Yellow Zone Yellow Zone \n",
"6 0.75 Yellow Zone Yellow Zone \n",
"\n",
"[5 rows x 22 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Keep only the rides whose pickup and drop-off zones are both `Yellow Zone`\n",
"yellow_only_rides = taxi_rides_with_pickup_and_dropoff_zone.loc[(taxi_rides_with_pickup_and_dropoff_zone.PUZone == \"Yellow Zone\") & (taxi_rides_with_pickup_and_dropoff_zone.DOZone == \"Yellow Zone\")]\n",
"yellow_only_rides.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "6d37d288-882f-47b5-abba-962145d8bc63",
"metadata": {},
"outputs": [],
"source": [
"# Write the dataframe to the database.\n",
"# Credentials for a user with write access are set up in `docker-compose.yml`.\n",
"from sqlalchemy import MetaData, Table, create_engine, delete, insert\n",
"\n",
"def insert_taxi_yellow_table(table, conn, keys, data_iter):\n",
"    # Lowercase the dataframe column names before using them as insert keys\n",
"    keys = [key.lower() for key in keys]\n",
"    # Reflect the existing table by name (`table` is the pandas SQLTable wrapper)\n",
"    taxi_yellow_table = Table(table.name, MetaData(), autoload_with=conn)\n",
"\n",
"    # Clean existing data\n",
"    conn.execute(delete(taxi_yellow_table))\n",
"\n",
"    # Prepare insert statement\n",
"    data = [dict(zip(keys, row)) for row in data_iter]\n",
"\n",
"    stmt = insert(taxi_yellow_table).values(data)\n",
"\n",
"    result = conn.execute(stmt)\n",
"    return result.rowcount\n",
"\n",
"engine = create_engine(\"postgresql://user:pass@dwh:5432/raw\")\n",
"\n",
"with engine.connect() as connection:\n",
" yellow_only_rides.head(10_000).to_sql(\n",
" name=\"taxi_yellow\",\n",
" con=connection,\n",
" index=False,\n",
" if_exists=\"append\",\n",
" method=insert_taxi_yellow_table,\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "df37a76c-5e7a-43e4-90e2-bdda99df4d87",
"metadata": {},
"source": [
"## Run Data Quality tests"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "79db7230-51ca-40c8-b2bd-a8a998d6bb01",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[2025-11-07 10:32:57] INFO {metadata.OMetaAPI:server_mixin:74} - OpenMetadata client running with Server version [1.10.4] and Client version [1.10.0.0]\n",
"[2025-11-07 10:32:57] INFO {metadata.TestSuite:test_suite:102} - Retrieving table entity for FQN: Tutorial Postgres.raw.public.taxi_yellow\n",
"/opt/conda/lib/python3.11/site-packages/pydantic/_internal/_fields.py:172: UserWarning: Field name \"schema\" in \"PostgresStoredProcedure\" shadows an attribute in parent \"BaseModel\"\n",
" warnings.warn(\n",
"[2025-11-07 10:32:58] INFO {metadata.TestSuite:test_suite:245} - Using existing test suite for table taxi_yellow\n",
"[2025-11-07 10:32:58] INFO {metadata.TestSuite:core:33} - Executing test case dozone_column_value_is_yellow_zone for entity Tutorial Postgres.raw.public.taxi_yellow\n",
"[2025-11-07 10:32:58] INFO {metadata.TestSuite:core:33} - Executing test case puzone_column_value_is_yellow_zone for entity Tutorial Postgres.raw.public.taxi_yellow\n",
"[2025-11-07 10:32:58] INFO {metadata.TestSuite:core:33} - Executing test case taxi_yellow_table_row_count_is_10000 for entity Tutorial Postgres.raw.public.taxi_yellow\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - \u001b[1mWorkflow OpenMetadata Summary:\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Processed records: 1\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Updated records: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Warnings: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Errors: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Success %: 100.0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - \u001b[1mWorkflow Processor Summary:\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Processed records: 1\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Updated records: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Warnings: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Errors: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Success %: 100.0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - \u001b[1mWorkflow OpenMetadata Summary:\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Processed records: 4\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Updated records: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Warnings: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Errors: 0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - Success %: 100.0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - \u001b[1m\u001b[36;1mWorkflow Success %: 100.0\u001b[0m\n",
"[2025-11-07 10:32:59] INFO {metadata.Utils:logger:205} - \u001b[1m\u001b[36;1mWorkflow finished in time: 1.61s\u001b[0m\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Test: dozone_column_value_is_yellow_zone\n",
"Status: TestCaseStatus.Success\n",
"Result: Found 10000 value(s) matching regex pattern vs 10000 value(s) in the column.\n",
"\n",
"Test: puzone_column_value_is_yellow_zone\n",
"Status: TestCaseStatus.Success\n",
"Result: Found 10000 value(s) matching regex pattern vs 10000 value(s) in the column.\n",
"\n",
"Test: taxi_yellow_table_row_count_is_10000\n",
"Status: TestCaseStatus.Success\n",
"Result: Found rowCount=10000 rows vs. the expected 10000.0\n"
]
}
],
"source": [
"from metadata.sdk.data_quality import TestRunner\n",
"\n",
"runner = TestRunner.for_table(\"Tutorial Postgres.raw.public.taxi_yellow\")\n",
"results = runner.run()\n",
"\n",
"for result in results:\n",
" test_case = result.testCase\n",
" test_result = result.testCaseResult\n",
"\n",
" print(f\"\\nTest: {test_case.name.root}\")\n",
" print(f\"Status: {test_result.testCaseStatus}\")\n",
" print(f\"Result: {test_result.result}\")"
]
}
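,
{
"cell_type": "markdown",
"id": "e9a3c4b6-5d7f-4a8c-9b32-4f6d8c0e2a34",
"metadata": {},
"source": [
"In a pipeline, you may want to stop when a test does not succeed. The following is a minimal sketch, assuming `results` is the list returned by `runner.run()` above and that `testCaseStatus` is a standard Python `Enum` whose member names match the printed output (for example `Success`). Raising `RuntimeError` is an arbitrary choice for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0b4d5c7-6e80-4b9d-8c43-5a7e9d1f3b45",
"metadata": {},
"outputs": [],
"source": [
"# Assumes `results` from the previous cell; the status name \"Success\" is taken from the output printed above\n",
"failed_tests = [\n",
"    result.testCase.name.root\n",
"    for result in results\n",
"    if result.testCaseResult.testCaseStatus is None\n",
"    or result.testCaseResult.testCaseStatus.name != \"Success\"\n",
"]\n",
"\n",
"# Stop here if any test did not succeed\n",
"if failed_tests:\n",
"    raise RuntimeError(f\"Data quality tests did not succeed: {failed_tests}\")"
]
}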
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}