feat: add ML models (#1721)

* ML Model Schema Initial Version for feedback

* Added Deprecation Model

* Remove lock files

* Committing yarn lock file

* Fix Review Comments

* Using Common VersionTag Entity

* PR Review Comments Round-2

* Updated all model and feature references to MLModel and MLFeature

* Addressing PR Comments (Round 3)

* Updating Hyperparameter to a Map type

* Update to Dataset

* Review comments based on RFC

* fix: modify the etl script dependency (#1726)

Co-authored-by: Cobolbaby <Zhang.Xing-Long@inventec.com>

* fix: correct the way to catch the exception (#1727)

* fix: modify the etl script dependency

* fix: Correct the way to catch the exception

* fix: Support Kafka clusters where the topic message key cannot be empty

* fix: Adjust the Kafka message key; improve the field comment

* fix: Avro schema required for key

Co-authored-by: Cobolbaby <Zhang.Xing-Long@inventec.com>

* refactor(models): remove internal cluster model (#1733)

* refactor(models): remove internal cluster model

Remove internal model which is not used in open source

* build(deps): bump lodash from 4.17.15 to 4.17.19 in /datahub-web (#1738)

Bumps [lodash](https://github.com/lodash/lodash) from 4.17.15 to 4.17.19.
- [Release notes](https://github.com/lodash/lodash/releases)
- [Commits](https://github.com/lodash/lodash/compare/4.17.15...4.17.19)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update README.md

* Update README.md

* Update README.md

* Update the roadmap (#1740)

* Update the roadmap

- Make short term more like what we're doing this quarter
- Medium term is next quarter
- Long term is 2 or 3 quarters from now
- Visionary is even beyond that

Making this PR mostly to discuss the roadmap. I've moved a few items down to "unprioritized"; before merging this we should put these in a category. Mostly saving the state of what I've done so far.

* Update roadmap.md

Co-authored-by: Mars Lan <mars.th.lan@gmail.com>

* Update roadmap.md

* Update README.md

* doc: add a separate doc to keep track of the full list of links (#1744)

* Update README.md

* Create links.md

* Update README.md

* Update links.md

* Update README.md

* Update README.md

* Update features.md

* Update faq.md

* Update README.md

* Update README.md

* feat(gms): add postgres & mariadb supports to GMS (#1742)

* feat(gms): add postgres & mariadb supports to GMS

Also add corresponding docker-compose files

* Update README.md

* build(frontend): Drop unnecessary DB-related dependencies (#1741)

* refactor(frontend): Drop unnecessary DB-related dependencies

* Drop unused dependencies from top-level build script

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update links.md

* Update README.md

* Doc fixes

* Update roadmap.md

* Update faq.md

* Set theme jekyll-theme-cayman

* Create _config.yml

* Delete _config.yml

* Set theme jekyll-theme-cayman

* Update _config.yml

* Update _config.yml

* build: build GitHub Page from /docs directory (#1750)

- Move top-level MD files to /docs and symlink them back
- Update all absolute links to files in /docs to relative links

* Revert "build: build GitHub Page from /docs directory (#1750)" (#1751)

This reverts commit b0f56de7a81b8bf921ff37cb81024692d1b9a8ce.

* build: build GitHub Pages from /docs directory (#1752)

- Move non-README top-level MD files to /docs
- Update all absolute links to files in /docs to relative links
- Add a placeholder front page for GitHub Pages

* Update README.md

* Update README.md

* Update README.md

* feat(kafka-config): Add ability to configure other Kafka props (#1745)

* Integrate spring-kafka & spring-boot for security props

- Upgrade spring-kafka to 2.1.14
- Use KafkaListener and KafkaTemplates to enable KafkaAutoConfiguration
- Integrates spring-boot's KafkaProperties into spring-kafka's config

* Cleanup imports

* Add DataHub kafka env vars

* Remove kafka-streams dependency

* Add KafkaProperties to gms; Add docs

* Add to Adoption

* Remove KAFKA_BOOTSTRAP_SERVER default

Co-authored-by: jsotelo <javier.sotelo@viasat.com>
Co-authored-by: Kerem Sahin <ksahin@linkedin.com>

* Agenda for next town hall

* Update townhalls.md

* Update README.md

* Update README.md

* Add documentation around the DataHub RFC process. (#1754)

Other repos have similar RFC processes (though they seem to have a separate repo for their RFC docs).

This provides a more structured way for contributors to make significant design contributions.

https://github.com/linkedin/datahub/issues/1692

* metadata-models 72.0.8 -> 80.0.0 (#1756)

* refactor(ingestion): align the default Kafka topics with PR #1756 (#1758)

* docs: add a sequence diagram and a description (#1757)

* add a sequence diagram and a description

* update description based on feedback

* Update README.md

* Update README.md

Co-authored-by: Mars Lan <mars.th.lan@gmail.com>

* Update README.md

* Fix reflinks in PR template (#1764)

* Update kafka-config.md (#1763)

Fix name of spring-kafka property to pass SASL_JAAS config

* Update entity.md

* Update README.md

* Update faq.md

* Update townhalls.md

* Update README.md

* Update townhalls.md

* Update townhalls.md

* docs: move quickstart guide to a separate file under docs (#1765)

docs: move quickstart guide to a separate doc under docs directory

* Update slack.md

* Update README.md

* Update slack.md

* Update metadata-ingestion.md

* Add workflow to check build and tests on PRs + releases. (#1769)

PRs are set up to skip docs.

Also, only run docker actions on linkedin/datahub (i.e. disable on forks; makes forks nicer since you don't have failing actions).

* Update developers.md

* Update developers.md

* Update README.md

* fix(models): remove unused model (#1748)

* fix(models): remove unused model

Fixes https://github.com/linkedin/datahub/issues/1719

* Drop DeploymentInfo from Dataset's value model & rebuild snapshot

* Update README.md

* Add a separate page for previous townhalls

* Update for August invite; link to history

* Update README.md

* build: remove travis (we're using GitHub actions). (#1770)

Remove travis (we're using GitHub actions).

Also ignore markdown in our current workflows.

Also update the README.md badge.

* update townhall date

* Update README.md

* Update townhalls.md

* build(docker): build & publish GitHub Package (#1771)

* build(docker): build & publish docker images to GitHub Packages

Will keep publishing to Docker Hub in the meantime, until all Dockerfiles have been updated to point to GitHub.
Fixes https://github.com/linkedin/datahub/issues/1548

* Rebase & fix dockerfile locations

* Update README.md

* Fix README.md

* docs: add placeholders for advanced topics (#1780)

* Create high-cardinality.md

* Create pdl-best-practices

* Create partial-update.md

* Rename pdl-best-practices to pdl-best-practices.md

* Create entity-hierarchy.md

* docs: more placeholders for advance topics (#1781)

* Create aspect-versioning.md

* Create derived-aspects.md

* Create backfilling.md

* Update README.md

* Update aspect-versioning.md

* Update aspect.md

* Update README.md

* Update townhall-history.md

* Update townhall-history.md

* Update rfc.md

* refactor(docker): make docker files easier to use during development. (#1777)

* Make docker files easier to use during development.

During development it is quite nice to have Docker work with locally built code. This allows you to launch all services very quickly, with your changes, and optionally with debugging support.

Changes made to docker files:
- Removed all redundant docker-compose files. We now have 1 giant file, and smaller files to use as overrides.
- Remove redundant README files that provided little information.
- Rename docker/<dir> to match the service name in the docker-compose file for clarity.
- Move environment variables to .env files. We only provide dev / the default environment for quickstart.
- Add debug options to docker files using multistage build to build minimal images with the idea that built files will be mounted instead.
- Add a docker/dev.sh script + compose file to easily use the dev override images (separate tag; images never published; uses debug docker files; mounts binaries to image).
- Added docs/docker documentation for this.

* build: fix docker actions. (#1787)

* bug: Fix docker actions.

We renamed directories in docker/ which broke the actions.

Also try to refactor the action files a little so that we can run (but not publish) these images on pull requests that change the docker/ dir as an extra check. Note this only seems to be supported by the Docker Hub plugin; the GitHub plugin doesn't support it (which will be an issue when we move to it exclusively).

* Drop extra pipes

* Update README.md

* refactor: remove unused model (#1788)

* refactor: remove unused internal models (#1789)

* docs: create search-over-new-field.md (#1790)

Add a doc on searching over a new field

* Update search-onboarding.md

* add description field for dataset index mapping (#1791)

* docs: how to customize the search experience (#1795)

* add description field for dataset index mapping

* documentation on how to customize the search experience

* feat(ingest): add example crawler for MS SQL (#1803)

Also fix the incorrect assumption on column comments & add sample docker-compose file

* Add log documentation

We didn't end up mounting logs; Docker Desktop is a better experience.

* Update townhall-history.md

* Update quickstart.md

* fix(search): clear description from dataset index when it's cleared (#1808)

Fixes https://github.com/linkedin/datahub/issues/1798

* Update README.md

* Revert "Update README.md"

This reverts commit 74a0d7b262a2ac22de9bc52974b721d580914ff0.

* Update README.md

* Update README.md

* Update high-cardinality.md

* Update README.md

* Update relationship.md

* Update high-cardinality.md

* Update metadata-models to head! (#1811)

metadata-models 80.0.0 -> 90.0.13:

   90.0.13: Roll forward: Fix the open source build by avoiding a URN method that isn't part of the open source URN.
    90.0.2: Refactor listUrnsFromIndex method
    90.0.0: Start distinguishing between [] aspects vs null aspects input param
    89.0.4: Fix the open source build by avoiding a URN method that isn't part of the open source URN.
    89.0.2: Fix some test case names
    89.0.0: META-12686: Made the MXE_v5 topics strictly ACL'ed to avoid the wildcard write ACL "MetadataXEvent.+"
    88.0.6: change DAO to take Storage Config as input
    88.0.3: Add a comment on lack of avro generation for MXEv5 + add MXEv5 to the pegasus validation task.
   87.0.15: META-12651: Integrate the metadata-models-ext with metadata-models
   87.0.13: add StorageConfig to Local DAO
    87.0.3: Treat empty aspect vs optional aspect same until all clients are migrated
    87.0.2: Treat empty aspect vs optional aspect differently
    87.0.1: META-12533: Skip processing unregistered aspect specific MAE.
    83.0.6: action method to return list of urns from strong consistent index
    83.0.4: Change input param type for batch backfill
    83.0.3: Implement batch backfill
    83.0.1: Implement support for OR filter in browse query
   82.0.10: Throw UnsupportedOperationException for unsupported condition types in search filter
    82.0.6: Implement local secondary backfilling index as part of backfill method
    82.0.5: [strongly consistent index] implement getUrns method
    82.0.4: Add indexing urn fields to the local secondary index
    82.0.0: Render Delta fields in the MCE_v5.
    81.0.1: Add pegasus to avro conversion for FMCE
    80.0.4: add get all support for BaseSingleAspectEntitySimpleKeyResource
    80.0.2: Add a BaseSearchWriterDAO with an ESBulkWriterDAO implementation.
    80.0.1: META-12254: Produce aspect specific MAE with always emit option
    80.0.0: Convert getNodesInTraversedPath to getSubgraph to return complete view of the subgraph (nodes+edges)

* Update townhalls.md

* Update townhalls.md

* fix: drop the commits badge as it's flakey

* Update README.md

* fix: update defaults of aspectNames params (#1815)

fix: Update defaults of aspectNames params.

The last PR to sync internal code broke the external GMS, as the code now expected aspectNames to be null rather than empty by default. This prevented me from logging into DataHub, as the corp user request would fail (it assumed I asked for no aspects rather than all aspects).

TESTED: Built locally, launched with docker/dev.sh (so used latest frontend, but whatever). Verified I can now log into DataHub, browse and search for datasets, and view my profile.

* Update README.md

* Update README.md

* feat(kubernetes): Improve the security of the kubernetes/helm charts (#1782)

* 1747 | remove obsolete yaml files

* 1747 | remove configmap and its hardcoded references

* 1747 | add missing input parameter of neo4j.host

* 1747 | remove obsolete secrets and parameterize the rest

* 1747 | auto-generate gms secret

* 1747 | remove fullName overrides

* 1747 | fix parameters in subchart's values.yaml

* 1747 | remove hardcoding from parameters for gms host and port

* 1747 | upgrade chart version

* 1747 | update helm docs

* 1747 | add extraEnv, extraVolume and extraMounts

* 1747 | Alters pull policy of images to 'always' for ldh

Co-authored-by: shakti-garg <shakti.garg@gmail.com>

* Update README.md

* feat(data-platforms): adding rest resource for /dataPlatforms and mid-tier support (#1817)

* feat(data-platforms): Adding rest resource for /dataPlatforms and mid-tier support

* Removed data platforms which are LinkedIn internal

* docs: add NOTICE (#1810)

* Copy NOTICE from wherehows

Copies the file from the wherehows branch.

* Update notice.

* Update links.md

* Update links.md

* Update README.md

* feat(dashboards): RFC for dashboards (#1778)

* feature(dashboards): RFC for dashboards

* Change directory structure

* Create goals & non-goals sections

* Removing alternatives section

* Update README.md

* Update links.md

* Update townhalls.md

* Update notice to include embedded licenses

Also list Apache projects specifically.

* feat(frontend): update datahub-web client UI code (#1806)

* Releases updated version of datahub-web client UI code

* Fix typo in yarn lock

* Change yarn lock to match yarn registry directories

* Previous commit missed some paths

* Even more changes to yarn lock missed in the previous commit

* Include codegen file for typings

* Add files to get parity for datahub-web and current OS datahub-midtier

* Add in typo fix from previous commit - change to proper license

* Implement proper OS fix for person entity picture url

* Workarounds for open source DH issues

* Fixes institutional memory api and removes unopensourced tabs for datasets

* Fixes search dataset deprecation and user search issue as a result of changes

* Remove internal only options in the avatar menu

* Update search-over-new-field.md

* docs: add external link (#1828)

* Update README.md

* Update links.md

* Review comments based on RFC

Co-authored-by: cobolbaby <cobolbaby@qq.com>
Co-authored-by: Cobolbaby <Zhang.Xing-Long@inventec.com>
Co-authored-by: Harsh Shah <hrshah@linkedin.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Mars Lan <mars.th.lan@gmail.com>
Co-authored-by: John Plaisted <jplaisted@linkedin.com>
Co-authored-by: Kerem Sahin <ksahin@linkedin.com>
Co-authored-by: Javier Sotelo <javier.a.sotelo@gmail.com>
Co-authored-by: jsotelo <javier.sotelo@viasat.com>
Co-authored-by: Jyoti Wadhwani <jywadhwani@linkedin.com>
Co-authored-by: Chris Lee <wlee@linkedin.com>
Co-authored-by: Liangjun Jiang <ljiang510@gmail.com>
Co-authored-by: shakti-garg-saxo <68685481+shakti-garg-saxo@users.noreply.github.com>
Co-authored-by: na zhang <nazhang@linkedin.com>
Co-authored-by: shakti-garg <shakti.garg@gmail.com>
Co-authored-by: Charlie Tran <catran@linkedin.com>
Arun Vasudevan, 2020-09-10 17:52:50 -05:00, committed by GitHub
commit 66dd008e3d (parent e8a1d61961)
36 changed files with 935 additions and 0 deletions

@@ -0,0 +1,41 @@
namespace com.linkedin.ml
import com.linkedin.common.ChangeAuditStamps
import com.linkedin.common.MLFeatureUrn
import com.linkedin.common.VersionTag
import com.linkedin.common.Ownership
import com.linkedin.common.InstitutionalMemory
import com.linkedin.common.Status
import com.linkedin.common.Deprecation
import com.linkedin.ml.metadata.MLFeatureProperties
/**
* MLFeature spec. for a feature store. A collection of MLFeature metadata schema that can evolve over time.
*/
record MLFeature includes MLFeatureKey, ChangeAuditStamps {
/**
* Ownership Info
*/
ownership: optional Ownership
/**
* MLFeature Properties
*/
featureProperties: optional MLFeatureProperties
/**
* Institutional Memory
*/
institutionalMemory: optional InstitutionalMemory
/**
* Status
*/
status: optional Status
/**
* Deprecation
*/
deprecation: optional Deprecation
}

@@ -0,0 +1,25 @@
namespace com.linkedin.ml
/**
* Key for MLFeature resource
*/
record MLFeatureKey {
/**
* ML Feature Namespace e.g. {db}.{table}, /dir/subdir/{name}, or {name}
*/
@validate.strlen = {
"max" : 500,
"min" : 1
}
featureNamespace: string
/**
* Feature Name
*/
@validate.strlen = {
"max" : 500,
"min" : 1
}
featureName: string
}

@@ -0,0 +1,102 @@
namespace com.linkedin.ml
import com.linkedin.common.ChangeAuditStamps
import com.linkedin.common.MlModelUrn
import com.linkedin.common.VersionTag
import com.linkedin.common.Ownership
import com.linkedin.common.InstitutionalMemory
import com.linkedin.common.Status
import com.linkedin.common.Cost
import com.linkedin.common.Deprecation
import com.linkedin.ml.metadata.MLModelProperties
import com.linkedin.ml.metadata.IntendedUse
import com.linkedin.ml.metadata.MLModelFactors
import com.linkedin.ml.metadata.Metrics
import com.linkedin.ml.metadata.EvaluationData
import com.linkedin.ml.metadata.TrainingData
import com.linkedin.ml.metadata.QuantitativeAnalyses
import com.linkedin.ml.metadata.EthicalConsiderations
import com.linkedin.ml.metadata.CaveatsAndRecommendations
import com.linkedin.ml.metadata.SourceCode
/**
* MLModel spec. for a model store. A collection of MLModel metadata schema that can evolve over time.
*/
record MLModel includes MLModelKey, ChangeAuditStamps {
/**
* Ownership Info
*/
ownership: optional Ownership
/**
* MLModel Properties
*/
mlModelProperties: optional MLModelProperties
/**
* Intended Use
*/
intendedUse: optional IntendedUse
/**
* MLModel Factors
*/
mlModelFactors: optional MLModelFactors
/**
* Metrics
*/
metrics: optional Metrics
/**
* Evaluation Data
*/
evaluationData: optional EvaluationData
/**
* Training Data
*/
trainingData: optional TrainingData
/**
* Quantitative Analyses
*/
quantitativeAnalyses: optional QuantitativeAnalyses
/**
* Ethical Considerations
*/
ethicalConsiderations: optional EthicalConsiderations
/**
* Caveats and Recommendations
*/
caveatsAndRecommendations: optional CaveatsAndRecommendations
/**
* Institutional Memory
*/
institutionalMemory: optional InstitutionalMemory
/**
* Source Code
*/
sourceCode: optional SourceCode
/**
* Status
*/
status: optional Status
/**
* Cost
*/
cost: optional Cost
/**
* Deprecation
*/
deprecation: optional Deprecation
}

@@ -0,0 +1,30 @@
namespace com.linkedin.ml
import com.linkedin.common.DataPlatformUrn
import com.linkedin.common.FabricType
/**
* Key for MLModel resource
*/
record MLModelKey {
/**
* Standardized platform urn where the ML Model is defined, i.e. the data platform urn (urn:li:dataPlatform:{platform_name})
*/
@validate.`com.linkedin.dataset.rest.validator.DataPlatformValidator` = { }
platform: DataPlatformUrn
/**
* ML Model name e.g. {db}.{table}, /dir/subdir/{name}, or {name}
*/
@validate.strlen = {
"max" : 500,
"min" : 1
}
name: string
/**
* Fabric type where the ML Model belongs or where it was generated.
*/
origin: FabricType
}

@@ -0,0 +1,27 @@
package com.linkedin.common.urn;
public final class MLFeatureUrn extends Urn {
public static final String ENTITY_TYPE = "mlFeature";
private static final String CONTENT_FORMAT = "(%s,%s)"; // two-part content: (namespace,name), matching the two constructor args below
private final String mlFeatureNamespace;
private final String mlFeatureName;
public MLFeatureUrn(String mlFeatureNamespace, String mlFeatureName) {
super(ENTITY_TYPE, String.format(CONTENT_FORMAT, mlFeatureNamespace, mlFeatureName));
this.mlFeatureNamespace = mlFeatureNamespace;
this.mlFeatureName = mlFeatureName;
}
public String getMlFeatureName() {
return mlFeatureName;
}
public String getMlFeatureNamespace() {
return mlFeatureNamespace;
}
}
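
A minimal usage sketch for the URN class above; the namespace and feature name values are invented for illustration:

MLFeatureUrn featureUrn = new MLFeatureUrn("member_features", "user_age");
String namespace = featureUrn.getMlFeatureNamespace(); // "member_features"
String name = featureUrn.getMlFeatureName(); // "user_age"
// With the "li" urn namespace, toString() should render along the lines of
// urn:li:mlFeature:(member_features,user_age)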

@@ -0,0 +1,50 @@
package com.linkedin.common.urn;
import java.net.URISyntaxException;
import com.linkedin.common.FabricType;
import static com.linkedin.common.urn.UrnUtils.toFabricType;
public final class MLModelUrn extends Urn {
public static final String ENTITY_TYPE = "mlModel";
private static final String CONTENT_FORMAT = "(%s,%s,%s)";
private final DataPlatformUrn platformEntity;
private final String mlModelNameEntity;
private final FabricType originEntity;
public MLModelUrn(DataPlatformUrn platform, String mlModelName, FabricType origin) {
super(ENTITY_TYPE, String.format(CONTENT_FORMAT, platform.toString(), mlModelName, origin.name()));
this.platformEntity = platform;
this.mlModelNameEntity = mlModelName;
this.originEntity = origin;
}
public DataPlatformUrn getPlatformEntity() {
return platformEntity;
}
public String getMlModelNameEntity() {
return mlModelNameEntity;
}
public FabricType getOriginEntity() {
return originEntity;
}
public static MLModelUrn createFromString(String rawUrn) throws URISyntaxException {
String content = new Urn(rawUrn).getContent();
String[] parts = content.substring(1, content.length() - 1).split(",");
return new MLModelUrn(DataPlatformUrn.createFromString(parts[0]), parts[1], toFabricType(parts[2]));
}
public static MLModelUrn deserialize(String rawUrn) throws URISyntaxException {
return createFromString(rawUrn);
}
}
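
A hedged round-trip sketch for MLModelUrn; the platform urn and model name are invented, this assumes DataPlatformUrn exposes the usual createFromString factory, and URISyntaxException handling is omitted:

DataPlatformUrn platform = DataPlatformUrn.createFromString("urn:li:dataPlatform:tensorflow");
MLModelUrn modelUrn = new MLModelUrn(platform, "churn-predictor", FabricType.PROD);
// toString() should render along the lines of
// urn:li:mlModel:(urn:li:dataPlatform:tensorflow,churn-predictor,PROD)
MLModelUrn roundTripped = MLModelUrn.deserialize(modelUrn.toString());
// Caveat: createFromString splits the urn content on "," so the round trip
// only holds while none of the three parts contains a comma.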

@@ -0,0 +1,83 @@
namespace com.linkedin.common
/**
* MLFeature Data Type
*/
enum MLFeatureDataType {
/**
* Useless data is unique, discrete data with no potential relationship with the outcome variable.
* A useless feature has high cardinality. An example would be bank account numbers that were generated randomly.
*/
USELESS
/**
* Nominal data is made of discrete values with no numerical relationship between the different categories — mean and median are meaningless.
* Animal species is one example: a pig is not higher than a bird, nor lower than a fish.
*/
NOMINAL
/**
* Ordinal data are discrete integers that can be ranked or sorted.
* For example, the distance between first and second may not be the same as the distance between second and third.
*/
ORDINAL
/**
* Binary data is discrete data that can be in only one of two categories — either yes or no, 1 or 0, off or on, etc.
*/
BINARY
/**
* Count data is discrete whole number data — no negative numbers here.
* Count data often has many small values, such as zero and one.
*/
COUNT
/**
* Time data is a cyclical, repeating continuous form of data.
* The relevant time features can be any period — daily, weekly, monthly, annual, etc.
*/
TIME
/**
* Interval data has equal spaces between the numbers and does not represent a temporal pattern.
* Examples include percentages, temperatures, and income.
*/
INTERVAL
/**
* Image Data
*/
IMAGE
/**
* Video Data
*/
VIDEO
/**
* Audio Data
*/
AUDIO
/**
* Text Data
*/
TEXT
/**
* Mapping Data Type ex: dict, map
*/
MAP
/**
* Sequence Data Type ex: list, tuple, range
*/
SEQUENCE
/**
* Set Data Type ex: set, frozenset
*/
SET
}

@@ -0,0 +1,27 @@
namespace com.linkedin.common
/**
* Standardized MLFeature identifier.
*/
@java.class = "com.linkedin.common.urn.MLFeatureUrn"
@validate.`com.linkedin.common.validator.TypedUrnValidator` = {
"accessible" : true,
"owningTeam" : "urn:li:internalTeam:datahub",
"entityType" : "mlFeature",
"constructable" : true,
"namespace" : "li",
"name" : "MLFeature",
"doc" : "Standardized MLFeature identifier.",
"owners" : [ "urn:li:corpuser:fbar", "urn:li:corpuser:bfoo" ],
"fields" : [ {
"name" : "mlFeatureNamespace",
"type" : "string",
"doc" : "Namespace for the MLFeature"
}, { "type" : "string",
"name" : "mlFeatureName",
"doc" : "Name of the MLFeature",
"maxLength" : 210
}],
"maxLength" : 284
}
typeref MLFeatureUrn = string

@@ -0,0 +1,32 @@
namespace com.linkedin.common
/**
* Standardized MLModel identifier.
*/
@java.class = "com.linkedin.common.urn.MLModelUrn"
@validate.`com.linkedin.common.validator.TypedUrnValidator` = {
"accessible" : true,
"owningTeam" : "urn:li:internalTeam:datahub",
"entityType" : "mlModel",
"constructable" : true,
"namespace" : "li",
"name" : "MlModel",
"doc" : "Standardized model identifier.",
"owners" : [ "urn:li:corpuser:fbar", "urn:li:corpuser:bfoo" ],
"fields" : [ {
"name" : "platform",
"type" : "com.linkedin.common.urn.DataPlatformUrn",
"doc" : "Standardized platform urn for the MLModel."
}, {
"name" : "mlModelName",
"doc" : "Name of the MLModel",
"type" : "string",
"maxLength" : 210
}, {
"name" : "origin",
"type" : "com.linkedin.common.FabricType",
"doc" : "Fabric type where model belongs to or where it was generated."
} ],
"maxLength" : 284
}
typeref MLModelUrn = string

@@ -0,0 +1,17 @@
namespace com.linkedin.common
/*
* Cost Details for an Entity
*/
record Cost {
/*
* Type of Cost Code
*/
costType: CostType
/*
* Code to which the Cost of this entity should be attributed
*/
cost: CostValue
}

@@ -0,0 +1,12 @@
namespace com.linkedin.common
/**
* Type of Cost Code
*/
enum CostType {
/**
* Org Cost Type to which the Cost of this entity should be attributed
*/
ORG_COST_TYPE
}

@@ -0,0 +1,9 @@
namespace com.linkedin.common
/**
* A union of all supported Cost Value types
*/
typeref CostValue = union[
costId: double
costCode: string
]
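
Since the union members above are declared with names, a serialized CostValue keys on the member name rather than the member type in standard Pegasus JSON encoding: for example {"costId": 12345.0} or {"costCode": "CC-1234"}, with both values invented for illustration.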

@@ -0,0 +1,18 @@
namespace com.linkedin.metadata.aspect
import com.linkedin.common.InstitutionalMemory
import com.linkedin.common.Ownership
import com.linkedin.common.Status
import com.linkedin.ml.metadata.MLFeatureProperties
import com.linkedin.common.Deprecation
/**
* A union of all supported metadata aspects for a MLFeature
*/
typeref MLFeatureAspect = union[
Ownership,
MLFeatureProperties,
InstitutionalMemory,
Status,
Deprecation
]

@@ -0,0 +1,38 @@
namespace com.linkedin.metadata.aspect
import com.linkedin.common.InstitutionalMemory
import com.linkedin.common.Ownership
import com.linkedin.common.Status
import com.linkedin.ml.metadata.CaveatsAndRecommendations
import com.linkedin.ml.metadata.EthicalConsiderations
import com.linkedin.ml.metadata.EvaluationData
import com.linkedin.ml.metadata.IntendedUse
import com.linkedin.ml.metadata.Metrics
import com.linkedin.ml.metadata.MLModelFactorPrompts
import com.linkedin.ml.metadata.MLModelProperties
import com.linkedin.ml.metadata.QuantitativeAnalyses
import com.linkedin.ml.metadata.TrainingData
import com.linkedin.common.Cost
import com.linkedin.common.Deprecation
import com.linkedin.ml.metadata.SourceCode
/**
* A union of all supported metadata aspects for a ML Model
*/
typeref MLModelAspect = union[
Ownership,
MLModelProperties,
IntendedUse,
MLModelFactorPrompts,
Metrics,
EvaluationData,
TrainingData,
QuantitativeAnalyses,
EthicalConsiderations,
CaveatsAndRecommendations,
InstitutionalMemory,
SourceCode,
Status,
Cost,
Deprecation
]

@@ -0,0 +1,17 @@
namespace com.linkedin.metadata.snapshot
import com.linkedin.common.MLFeatureUrn
import com.linkedin.metadata.aspect.MLFeatureAspect
/**
* MLFeature Snapshot entity details.
*/
record MLFeatureSnapshot {
/**
* URN for the entity the metadata snapshot is associated with.
*/
urn: MLFeatureUrn
/**
* The list of metadata aspects associated with the MLFeature. Depending on the use case, this can either be all, or a selection, of supported aspects.
*/
aspects: array[MLFeatureAspect]
}

@@ -0,0 +1,20 @@
namespace com.linkedin.metadata.snapshot
import com.linkedin.common.MLModelUrn
import com.linkedin.metadata.aspect.MLModelAspect
/**
* MLModel Snapshot entity details.
*/
record MLModelSnapshot {
/**
* URN for the entity the metadata snapshot is associated with.
*/
urn: MLModelUrn
/**
* The list of metadata aspects associated with the MLModel. Depending on the use case, this can either be all, or a selection, of supported aspects.
*/
aspects: array[MLModelAspect]
}
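
For orientation, a snapshot carrying a single aspect might serialize along these lines in standard Pegasus JSON encoding, where each aspect union member keys on its fully-qualified record name (the urn and field values are invented for illustration):

{
  "urn": "urn:li:mlModel:(urn:li:dataPlatform:tensorflow,churn-predictor,PROD)",
  "aspects": [
    {
      "com.linkedin.ml.metadata.MLModelProperties": {
        "description": "Predicts member churn from activity features",
        "tags": []
      }
    }
  ]
}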

@@ -8,4 +8,6 @@ typeref Snapshot = union[
CorpUserSnapshot,
DatasetSnapshot,
DataProcessSnapshot,
MLModelSnapshot,
MLFeatureSnapshot
]

@@ -0,0 +1,24 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.DatasetUrn
/**
* BaseData record
*/
record BaseData {
/**
* What dataset was used in the MLModel?
*/
dataset: DatasetUrn
/**
* Why was this dataset chosen?
*/
motivation: optional string
/**
* How was the data preprocessed (e.g., tokenization of sentences, cropping of images, any filtering such as dropping images without faces)?
*/
preProcessing: optional array[string]
}

@@ -0,0 +1,23 @@
namespace com.linkedin.ml.metadata
/**
* This section should list additional concerns that were not covered in the previous sections. For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset? Are there additional recommendations for model use?
*/
record CaveatDetails {
/**
* Did the results suggest any further testing?
*/
needsFurtherTesting: optional boolean
/**
* Caveat Description
* For example: Given gender classes are binary (male/not male), which we include as male/female. Further work needed to evaluate across a spectrum of genders.
*/
caveatDescription: optional string
/**
* Were there relevant groups that were not represented in the evaluation dataset?
*/
groupsNotRepresented: optional array[string]
}

@@ -0,0 +1,22 @@
namespace com.linkedin.ml.metadata
/**
* This section should list additional concerns that were not covered in the previous sections. For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset? Are there additional recommendations for model use?
*/
record CaveatsAndRecommendations {
/**
* This section should list additional concerns that were not covered in the previous sections. For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?
*/
caveats: optional CaveatDetails
/**
* Recommendations on where this MLModel should be used.
*/
recommendations: optional string
/**
* Ideal characteristics of an evaluation dataset for this MLModel
*/
idealDatasetCharacteristics: optional array[string]
}

@@ -0,0 +1,32 @@
namespace com.linkedin.ml.metadata
/**
* This section is intended to demonstrate the ethical considerations that went into MLModel development, surfacing ethical challenges and solutions to stakeholders.
*/
record EthicalConsiderations {
/**
* Does the MLModel use any sensitive data (e.g., protected classes)?
*/
data: optional array[string]
/**
* Is the MLModel intended to inform decisions about matters central to human life or flourishing e.g., health or safety? Or could it be used in such a way?
*/
humanLife: optional array[string]
/**
* What risk mitigation strategies were used during MLModel development?
*/
mitigations: optional array[string]
/**
* What risks may be present in MLModel usage? Try to identify the potential recipients, likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown.
*/
risksAndHarms: optional array[string]
/**
* Are there any known MLModel use cases that are especially fraught? This may connect directly to the intended use section.
*/
useCases: optional array[string]
}

@@ -0,0 +1,14 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.DatasetUrn
/**
* All referenced datasets would ideally point to any set of documents that provide visibility into the source and composition of the dataset.
*/
record EvaluationData {
/**
* Details on the dataset(s) used for the quantitative analyses in the MLModel
*/
evaluationData: array[BaseData]
}

@@ -0,0 +1,6 @@
namespace com.linkedin.ml.metadata
/**
* A union of all supported HyperParameter value types
*/
typeref HyperParameterValueType = union[string, int, float, double, boolean]
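
In standard Pegasus JSON encoding an unnamed primitive union member keys on its type name, so a hyperParameters map built on this typeref might serialize as {"learningRate": {"double": 0.01}, "numTrees": {"int": 100}, "optimizer": {"string": "adam"}}, with all names and values invented for illustration.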

@@ -0,0 +1,22 @@
namespace com.linkedin.ml.metadata
/**
* Intended Use for the ML Model
*/
record IntendedUse {
/**
* Primary Use cases for the MLModel.
*/
primaryUses: optional array[string]
/**
* Primary Intended Users - For example, was the MLModel developed for entertainment purposes, for hobbyists, or enterprise solutions?
*/
primaryUsers: optional array[IntendedUserType]
/**
* Highlight technology that the MLModel might easily be confused with, or related contexts that users could try to apply the MLModel to.
*/
outOfScopeUses: optional array[string]
}

@@ -0,0 +1,22 @@
namespace com.linkedin.ml.metadata
/*
* Primary Intended User Types or User Categories
*/
enum IntendedUserType {
/*
* Developed for Enterprise Users
*/
ENTERPRISE
/*
* Developed for Hobbyists
*/
HOBBY
/*
* Developed for Entertainment Purposes
*/
ENTERTAINMENT
}

@@ -0,0 +1,25 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.MLFeatureDataType
import com.linkedin.common.VersionTag
/**
* Properties associated with a MLFeature
*/
record MLFeatureProperties {
/**
* Documentation of the MLFeature
*/
description: optional string
/**
* Data Type of the MLFeature
*/
dataType: optional MLFeatureDataType
/**
* Version of the MLFeature
*/
version: optional VersionTag
}

@@ -0,0 +1,17 @@
namespace com.linkedin.ml.metadata
/**
* Prompts which affect the performance of the MLModel
*/
record MLModelFactorPrompts {
/**
* What are foreseeable salient factors for which MLModel performance may vary, and how were these determined?
*/
relevantFactors: optional array[MLModelFactors]
/**
* Which factors are being reported, and why were these chosen?
*/
evaluationFactors: optional array[MLModelFactors]
}

@@ -0,0 +1,25 @@
namespace com.linkedin.ml.metadata
/**
* Factors affecting the performance of the MLModel.
*/
record MLModelFactors {
/**
* Groups refers to distinct categories with similar characteristics that are present in the evaluation data instances.
* For human-centric machine learning MLModels, groups are people who share one or multiple characteristics.
*/
groups: optional array[string]
/**
* The performance of a MLModel can vary depending on what instruments were used to capture the input to the MLModel.
* For example, a face detection model may perform differently depending on the camera's hardware and software,
* including lens, image stabilization, high dynamic range techniques, and background blurring for portrait mode.
*/
instrumentation: optional array[string]
/**
* A further factor affecting MLModel performance is the environment in which it is deployed.
*/
environment: optional array[string]
}

@@ -0,0 +1,46 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.MLFeatureUrn
import com.linkedin.common.Time
import com.linkedin.common.VersionTag
/**
* Properties associated with a ML Model
*/
record MLModelProperties {
/**
* Documentation of the MLModel
*/
description: optional string
/**
* Date when the MLModel was developed
*/
date: optional Time
/**
* Version of the MLModel
*/
version: optional VersionTag
/**
* Type of Algorithm or MLModel, such as whether it is a Naive Bayes classifier, a Convolutional Neural Network, etc.
*/
type: optional string
/**
* Hyper Parameters of the MLModel
*/
hyperParameters: optional map[string, HyperParameterValueType]
/**
* List of features used for MLModel training
*/
mlFeatures: optional array[MLFeatureUrn]
/**
* Tags for the MLModel
*/
tags: array[string] = [ ]
}

@@ -0,0 +1,17 @@
namespace com.linkedin.ml.metadata
/**
* Metrics to be featured for the MLModel.
*/
record Metrics {
/**
* Measures of MLModel performance
*/
performanceMeasures: optional array[string]
/**
* Decision Thresholds used (if any)?
*/
decisionThreshold: optional array[string]
}

@@ -0,0 +1,17 @@
namespace com.linkedin.ml.metadata
/**
* Quantitative analyses should be disaggregated, that is, broken down by the chosen factors. Quantitative analyses should provide the results of evaluating the MLModel according to the chosen metrics, providing confidence interval values when possible.
*/
record QuantitativeAnalyses {
/**
* Link to a dashboard with results showing how the MLModel performed with respect to each factor
*/
unitaryResults: optional ResultsType
/**
* Link to a dashboard with results showing how the MLModel performed with respect to the intersection of evaluated factors
*/
intersectionalResults: optional ResultsType
}

@@ -0,0 +1,6 @@
namespace com.linkedin.ml.metadata
/**
* A union of all supported result types
*/
typeref ResultsType = union[string]

@@ -0,0 +1,12 @@
namespace com.linkedin.ml.metadata
/**
* Source Code
*/
record SourceCode {
/**
* Source Code along with types
*/
sourceCode: array[SourceCodeUrl]
}

@@ -0,0 +1,19 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.Url
/**
* Source Code Url Entity
*/
record SourceCodeUrl {
/**
* Source Code Url Types
*/
type: SourceCodeUrlType
/**
* Source Code Url
*/
sourceCodeUrl: Url
}

@@ -0,0 +1,22 @@
namespace com.linkedin.ml.metadata
/*
* Source Code Url Types
*/
enum SourceCodeUrlType {
/*
* MLModel Source Code
*/
ML_MODEL_SOURCE_CODE
/*
* Training Pipeline Source Code
*/
TRAINING_PIPELINE_SOURCE_CODE
/*
* Evaluation Pipeline Source Code
*/
EVALUATION_PIPELINE_SOURCE_CODE
}

@@ -0,0 +1,14 @@
namespace com.linkedin.ml.metadata
import com.linkedin.common.DatasetUrn
/**
* Ideally, the MLModel card would contain as much information about the training data as the evaluation data. However, there might be cases where it is not feasible to provide this level of detailed information about the training data. For example, the data may be proprietary, or require a non-disclosure agreement. In these cases, we advocate for basic details about the distributions over groups in the data, as well as any other details that could inform stakeholders on the kinds of biases the model may have encoded.
*/
record TrainingData {
/**
* Details on the dataset(s) used for training the MLModel
*/
trainingData: array[BaseData]
}