haystack/.github/utils/tutorials.sh
Sara Zan 735ffa635b
[CI refactoring] Tutorials on CI (#2547)
* Experimental CI workflow for running tutorials

* Run on every push for now

* Not starting?

* Disabling paths temporarily

* Sort tutorials in natural order

* Install ipython

* remove ipython install

* Try running ipython with sudo

* env.pythonLocation

* Skipping tutorial2 and 9 for speed

* typo

* Use one runner per tutorial, for now

* Typo in dependent job

* Missing quotes broke scripts matrix

* Simplify setup for the tutorials, try to prevent containers conflict

* Remove needless job dependencies

* Try prevent cache issues, fix small Tut10 bug

* Missing deps for running notebook tutorials

* Create three groups of tutorials excluding the longest among them

* remove deps

* use proper bash loop

* Try with a single string

* Fix typo in echo

* Forgot do

* Typo

* Try to make the GraphDB tutorial without launching its own container

* Run notebook and script together

* Whitespace

* separate scripts and notebooks execution

* Run notebooks first

* Try caching the GoT data before running the scripts

* add note

* fix mkdir

* Fix path

* Update Documentation & Code Style

* missing -r

* Fix folder numbering

* Run notebooks as well

* Typo in notebook command

* complete path in notebook command

* Try with TIKA_LOG_PATH

* Fix folder naming

* Do not use cached data in Tut9

* extracting the number better

* Small tweaks

* Same fix on Tut10 on the notebook

* Exclude GoT cache for tut5 too

* Remove faiss files after tutorial run

* Layout

* fix remove command

* Fix path in tut10 notebook

* Fix typo in node name in tut14

* Third block was too long, rebalancing

* Reduce GoT dataset even more, why wasting time after all...

* Fix paths in tut10 again

* do git clean to make sure to cleanup everything (breaks post Python)

* Remove ES file with bad permission at the end of the run

* Split first block, takes >30mins

* take out tut15 for a moment, has an actual bug

* typo

* Forgot rm option

* Simply remove all ES files

* Improve logs of GoT reduction

* Exclude also tut16 from cache to try fix bug

* Replace ll with ls

* Reintroduce 15_TableQA

* Small regrouping

* regrouping to make the min num of runners go for about 30mins

* Add cron schedule and PR paths conditions

* Add some timing information

* Separate tutorials by diff and tutorials by cron

* temp add pull_request to tutorials nightly

* Add badge in README to keep track of the nightly tutorials run

* Remove prefixes from data folder names

* Add fetch depth to get diff with master

* Fix paths again

* typo

* Exclude long-running ones

* Typo

* Fix tutorials.yml as well

* Use head_ref

* Using an action for now

* exclude other files

* Use only the correct command to run the tutorial

* Add long running tutorials in separate runners, just for experiment

* Factor out the complex bash script

* Pass the python path to the bash script

* Fix paths

* adding log statement

* Missing dollarsign

* Resetting variable in loop

* using mini GoT dataset and improving bash script

* change dataset name

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-06-15 09:53:36 +02:00

75 lines
2.4 KiB
Bash
Executable File

#!/bin/bash
export LAUNCH_GRAPHDB=0 # See tut 10 - GraphDB is already running in CI
export TIKA_LOG_PATH=$PWD # Avoid permission denied errors while importing tika
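set -e # Abort the script on the first failing command
python_path=$1     # Root of the Python installation; its bin/ipython runs the notebooks
files_changed=$2   # Space-separated list of files changed in the PR
exclusion_list=$3  # Space-separated substrings; tutorials whose path contains one are skipped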
no_got_tutorials='4_FAQ_style_QA 5_Evaluation 7_RAG_Generator 8_Preprocessing 10_Knowledge_Graph 15_TableQA 16_Document_Classifier_at_Index_Time'
echo "Files changed in this PR: $files_changed"
echo "Excluding: $exclusion_list"
# Collect the tutorials to run
scripts_to_run=""
for script in $files_changed; do
    if [[ "$script" != *"tutorials/Tutorial"* ]] || ([[ "$script" != *".py"* ]] && [[ "$script" != *".ipynb"* ]]); then
        echo "- not a tutorial: $script"
        continue
    fi
    skip_to_next=0
    for excluded in $exclusion_list; do
        if [[ "$script" == *"$excluded"* ]]; then skip_to_next=1; fi
    done
    if [[ $skip_to_next == 1 ]]; then
        echo "- excluded: $script"
        continue
    fi
    scripts_to_run="$scripts_to_run $script"
done
for script in $scripts_to_run; do
    echo ""
    echo "##################################################################################"
    echo "##################################################################################"
    echo "## Running $script ..."
    echo "##################################################################################"
    echo "##################################################################################"
    # Use the reduced GoT dataset unless the tutorial is listed in no_got_tutorials
    reduce_dataset=1
    for no_got_tut in $no_got_tutorials; do
        if [[ "$script" == *"$no_got_tut"* ]]; then
            reduce_dataset=0
        fi
    done
    if [[ $reduce_dataset == 1 ]]; then
        # Copy the reduced GoT data into a folder named after the tutorial
        # to trigger the caching mechanism of `fetch_archive_from_http`
        echo "Using reduced GoT dataset"
        # Extract the tutorial number: the text between "Tutorial" and the first underscore
        no_prefix=${script#"tutorials/Tutorial"}
        split_on_underscore=(${no_prefix//_/ })
        cp -r data/tutorials "data/tutorial${split_on_underscore[0]}"
    else
        echo "NOT using reduced GoT dataset!"
    fi
    # Run scripts with python and notebooks through ipython's %run
    if [[ "$script" == *".py" ]]; then
        time python "$script"
    else
        sudo "$python_path/bin/ipython" -c "%run $script"
    fi
    # Remove untracked files left behind by the tutorial
    git clean -f
done
# Remove leftover data and Elasticsearch files: they cause permission errors in the Post Cache step
sudo rm -rf data/
sudo rm -rf /home/runner/work/haystack/haystack/elasticsearch-7.9.2/
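
The folder naming used for the reduced GoT dataset can be tried in isolation. The snippet below is a minimal sketch, not part of the script: `tutorial_data_dir` is a hypothetical helper name, the two tutorial paths are only example inputs, and the `${var%%_*}` expansion is swapped in for the script's array split on underscores (it should yield the same number for paths of the form `tutorials/Tutorial<N>_<name>`).

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): maps a tutorial path
# to the data folder name that the script's `cp -r` step would create.
tutorial_data_dir() {
    local script=$1
    # Strip the leading "tutorials/Tutorial" prefix
    local no_prefix=${script#"tutorials/Tutorial"}
    # Drop everything from the first underscore onwards, leaving the number
    local number=${no_prefix%%_*}
    echo "data/tutorial${number}"
}

# Example inputs (assumed paths, for illustration only)
tutorial_data_dir "tutorials/Tutorial1_Basic_QA_Pipeline.py"   # prints data/tutorial1
tutorial_data_dir "tutorials/Tutorial11_Pipelines.ipynb"       # prints data/tutorial11
```

Using a suffix-removal expansion instead of an unquoted array assignment avoids word splitting and globbing on the path, which is the only reason for the substitution here.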