Tom Aarsen 1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Git functionality common to both connectors is shared between the GitHub and GitLab data connectors, which prevents code duplication between the two ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

The renamed options work for both GitHub and GitLab.
2023-03-08 00:15:21 -08:00


#!/usr/bin/env bash
# Processes the arbitrarily chosen https://gitlab.com/gitlab-com/content-sites/docsy-gitlab repository
# through Unstructured's library in 2 processes.
# Structured outputs are stored in gitlab-ingest-output/
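# Resolve the directory that contains this script, regardless of the caller's working directory.
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
# Move three levels up (presumably the repository root) so the relative path to main.py resolves.
cd "$SCRIPT_DIR"/../../.. || exit 1
PYTHONPATH=. ./unstructured/ingest/main.py \
    --gitlab-url https://gitlab.com/gitlab-com/content-sites/docsy-gitlab \
    --git-branch 'v0.0.7' \
    --structured-output-dir gitlab-ingest-output \
    --num-processes 2 \
    --verbose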
# Alternatively, you can call it using:
# unstructured-ingest --gitlab-url ...
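
The commit message mentions the renamed --git-access-token and --git-file-glob options, which the script above does not use. The following is a minimal sketch of how they might be appended only when needed. The helper name build_ingest_args, the INGEST_ARGS array, and the '*.md' pattern are hypothetical, and the exact value format each option accepts is an assumption. The sketch only prints the command (a dry run) rather than executing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect ingest arguments in an array so that
# optional flags are added only when a value is supplied.
#   $1: access token (optional, assumed to be accepted by --git-access-token)
#   $2: file glob    (optional, assumed to be accepted by --git-file-glob)
build_ingest_args() {
    local token="${1:-}" glob="${2:-}"
    # Same arguments as the script above.
    INGEST_ARGS=(
        --gitlab-url https://gitlab.com/gitlab-com/content-sites/docsy-gitlab
        --git-branch 'v0.0.7'
        --structured-output-dir gitlab-ingest-output
        --num-processes 2
        --verbose
    )
    # Append the renamed options only when a non-empty value was passed.
    if [[ -n "$token" ]]; then
        INGEST_ARGS+=(--git-access-token "$token")
    fi
    if [[ -n "$glob" ]]; then
        INGEST_ARGS+=(--git-file-glob "$glob")
    fi
}

# Dry run: print the shell-quoted command instead of running it.
build_ingest_args "" '*.md'
printf '%q ' PYTHONPATH=. ./unstructured/ingest/main.py "${INGEST_ARGS[@]}"
printf '\n'
```

Keeping the arguments in an array avoids word-splitting problems with values such as glob patterns, and replacing the printf lines with an actual invocation would run the ingest with the same arguments.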