update README for hn mine

2025-06-27 02:39:58 +00:00 · 2025-01-13 14:35:49 +08:00 · 2025-01-13 14:35:49 +08:00 · a572e6312a
commit a572e6312a
parent aaa4727776
2 changed files with 17 additions and 4 deletions
--- a/examples/finetune/embedder/README.md
+++ b/examples/finetune/embedder/README.md
@@ -57,20 +57,33 @@ cd FlagEmbedding/scripts

 ```shell
 python hn_mine.py \
---model_name_or_path BAAI/bge-base-en-v1.5 \
 --input_file toy_finetune_data.jsonl \
 --output_file toy_finetune_data_minedHN.jsonl \
 --range_for_sampling 2-200 \
 --negative_number 15 \
---use_gpu_for_searching
+--use_gpu_for_searching \
+--embedder_name_or_path BAAI/bge-base-en-v1.5
 ```

 - **`input_file`**: JSON data for finetuning. This script will retrieve the top-k documents for each query and randomly sample negatives from them (excluding the positive documents).
 - **`output_file`**: path to save JSON data with mined hard negatives for finetuning
 - **`negative_number`**: the number of sampled negatives
 - **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages).**
-- **`candidate_pool`**: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
+- **`candidate_pool`**: The pool to retrieve negatives from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. If provided, it should be a JSONL file in which each line is a dict with a key `text`, and the script retrieves negatives from this file instead.
 - **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
+- **`search_batch_size`**: batch size for searching. Default is 64.
+- **`embedder_name_or_path`**: The name or path to the embedder.
+- **`embedder_model_class`**: Class of the model used for embedding (current options include 'encoder-only-base', 'encoder-only-m3', 'decoder-only-base', and 'decoder-only-icl'). Default is None. You should set this argument when using a custom model.
+- **`normalize_embeddings`**: Set to `True` to normalize embeddings.
+- **`pooling_method`**: The pooling method for the embedder.
+- **`use_fp16`**: Use FP16 precision for inference.
+- **`devices`**: List of devices used for inference.
+- **`query_instruction_for_retrieval`**, **`query_instruction_format_for_retrieval`**: Instructions and format for query during retrieval.
+- **`examples_for_task`**, **`examples_instruction_format`**: Example tasks and their instruction format. These are only used when `embedder_model_class` is set to `decoder-only-icl`.
+- **`trust_remote_code`**: Set to `True` to trust remote code execution.
+- **`cache_dir`**: Cache directory for models.
+- **`embedder_batch_size`**: Batch sizes for embedding and reranking.
+- **`embedder_query_max_length`**, **`embedder_passage_max_length`**: Maximum length for embedding queries and passages.

 ### Teacher Scores

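As a reading aid for the `range_for_sampling` and `negative_number` descriptions in the hunk above, here is a minimal Python sketch of the sampling step, assuming the retrieved documents arrive as a ranked list. The function name `sample_hard_negatives` and its arguments are hypothetical and are not taken from `hn_mine.py`; the actual script may differ in detail, for example in how the range bounds are indexed.

```python
import random


def sample_hard_negatives(ranked_docs, positives, sample_range, negative_number, seed=None):
    """Pick negatives from a rank window of the retrieved list.

    Hypothetical helper for illustration; not the code in hn_mine.py.
    ranked_docs: documents ordered from most to least similar to the query.
    sample_range: string such as "2-200", treated here as 1-based inclusive
                  ranks (an assumption).
    """
    start, end = (int(x) for x in sample_range.split("-"))
    # Keep only the documents whose rank falls inside the window.
    window = ranked_docs[start - 1:end]
    # Drop positives so a relevant document is never used as a negative.
    positive_set = set(positives)
    candidates = [doc for doc in window if doc not in positive_set]
    # Sample without replacement; return fewer items if the window is small.
    rng = random.Random(seed)
    k = min(negative_number, len(candidates))
    return rng.sample(candidates, k)
```

Under these assumptions, moving the window further down the ranking (for example `60-300`) draws negatives that are less similar to the query, which is why the README describes a larger range as producing easier negatives.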
--- a/scripts/README.md
+++ b/scripts/README.md
@@ -29,7 +29,7 @@ python hn_mine.py \
 - **`output_file`**: path to save JSON data with mined hard negatives for finetuning
 - **`negative_number`**: the number of sampled negatives
 - **`range_for_sampling`**: the rank range to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages).**
-- **`candidate_pool`**: The pool to retrieval. The default value is None, and this script will retrieve from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If input a candidate_pool, this script will retrieve negatives from this file.
+- **`candidate_pool`**: The pool to retrieve negatives from. The default value is None, in which case this script retrieves from the combination of all `neg` in `input_file`. If provided, it should be a JSONL file in which each line is a dict with a key `text`, and the script retrieves negatives from this file instead.
 - **`use_gpu_for_searching`**: whether to use faiss-gpu to retrieve negatives.
 - **`search_batch_size`**: batch size for searching. Default is 64.
 - **`embedder_name_or_path`**: The name or path to the embedder.
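The updated `candidate_pool` description in this commit asks for a JSONL file in which each line is a dict with a key `text`. A minimal Python sketch for producing and checking such a file follows; the helper names `write_candidate_pool` and `read_candidate_pool` and the file name `candidate_pool.jsonl` are illustrative assumptions, not part of the repository.

```python
import json


def write_candidate_pool(passages, path):
    """Write one JSON object per line, each with a single `text` key."""
    with open(path, "w", encoding="utf-8") as f:
        for passage in passages:
            f.write(json.dumps({"text": passage}, ensure_ascii=False) + "\n")


def read_candidate_pool(path):
    """Read the file back and return the list of passage texts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


if __name__ == "__main__":
    # Illustrative passages only; replace with your own corpus.
    write_candidate_pool(
        ["First candidate passage.", "Second candidate passage."],
        "candidate_pool.jsonl",
    )
```

The resulting file would presumably be passed to `hn_mine.py` as `--candidate_pool candidate_pool.jsonl`, following the pattern of the other arguments shown in the README.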