"Suppose we want to fine-tune our model for financial tasks. We found an open-source dataset that could be useful: [financial-qa-10K](https://huggingface.co/datasets/virattt/financial-qa-10K). Let's see how to prepare it properly for fine-tuning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The raw dataset has the following structure:\n",
"- 5 columns: 'question', 'answer', 'context', 'ticker', and 'filing'.\n",
"\n",
"The fine-tuning data uses the following fields:\n",
"- `query`: the query text.\n",
"- `pos`: a list of positive texts.\n",
"- `neg`: a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus.\n",
"- `pos_scores`: a list of scores for the `query` paired with each text in `pos`.\n",
"- `neg_scores`: a list of scores for the `query` paired with each text in `neg`. Both score fields can be ignored if you don't use knowledge distillation.\n",
"- `prompt`: the prompt used for the query. It overrides `query_instruction_for_retrieval`.\n",
"- `type`: used for bge-en-icl. Values include `normal`, `symmetric_class`, `symmetric_clustering`, etc."
]
},
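{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the record layout described above, the snippet below builds one example by hand and checks its field types. The helper `check_record` is hypothetical and written only for illustration; it is not part of any library. The sample texts are taken from this dataset.\n",
"\n",
"```python\n",
"# One record in the query / pos / neg / prompt layout.\n",
"record = {\n",
"    'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',\n",
"    'pos': ['Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.'],\n",
"    'neg': ['As of December 31, 2023, we had 3,157 full-time employees.'],\n",
"    'prompt': 'Represent this sentence for searching relevant passages: ',\n",
"}\n",
"\n",
"def check_record(rec):\n",
"    # Hypothetical helper: `query` must be a string,\n",
"    # and `pos` and `neg` must be lists of strings.\n",
"    if not isinstance(rec.get('query'), str):\n",
"        return False\n",
"    for key in ('pos', 'neg'):\n",
"        texts = rec.get(key)\n",
"        if not isinstance(texts, list):\n",
"            return False\n",
"        if not all(isinstance(t, str) for t in texts):\n",
"            return False\n",
"    return True\n",
"\n",
"print(check_record(record))\n",
"```\n",
"\n",
"Note that `pos` and `neg` are lists even when they hold a single text."
]
},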
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We select the 'question' and 'context' columns as the query and the positive text (`pos`), and rename them accordingly. We then add an 'id' column for use in later evaluation."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',\n",
" 'pos': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',\n",
"ds = ds.add_column(\"id\", [str(i) for i in range(len(ds))])\n",
"ds[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Negative examples are important when training embedding models. Our dataset does not come with negative texts, so we sample a few at random from the whole corpus."
"{'query': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',\n",
" 'pos': ['Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.'],\n",
" 'id': '0',\n",
" 'neg': ['Kroger expects that its value creation model will deliver total shareholder return within a target range of 8% to 11% over time.',\n",
" 'CSB purchased First Mortgages of $2.9 billion during 2023.',\n",
" 'See Note 13 to our Consolidated Financial Statements for information on certain legal proceedings for which there are contingencies.',\n",
" 'Diluted earnings per share were $16.69 in fiscal 2022 compared to $15.53 in fiscal 2021.',\n",
" 'In the year ended December 31, 2023, Total net sales and revenue increased primarily due to: (1) increased net wholesale volumes primarily due to increased sales of crossover vehicles and full-size pickup trucks, partially offset by decreased sales of mid-size pickup trucks; (2) favorable Price as a result of low dealer inventory levels and strong demand for our products; (3) favorable Mix associated with increased sales of full-size pickup trucks and full-size SUVs and decreased sales of vans, passenger cars and mid-size pickup trucks, partially offset by increased sales of crossover vehicles; and (4) favorable Other due to increased sales of parts and accessories.',\n",
" 'As of December 31, 2023, we had 3,157 full-time employees.',\n",
" 'Item 3. Legal Proceedings. The information contained in Note 18 ‘‘Commitments and Contingencies’’ included in Item 8 of this 10-K is incorporated herein by reference.',\n",
" 'Under the amended 2019 Secured Facility, the maturity date is set to July 20, 2026.',\n",
" 'Accounts receivable for Las Vegas Sands Corp. on December 31, 2023, totaled $685 million, with a provision for credit losses of $201 million, resulting in a net balance of $484 million.',\n",
" 'Operating expenses as a percentage of segment net sales decreased 25 basis points for fiscal 2023 when compared to the previous fiscal year, primarily driven by strong sales growth and lower incremental COVID-19 related costs, partially offset by increased wage costs.'],\n",
" 'prompt': 'Represent this sentence for searching relevant passages: '}"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ds[0]"
]
},
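{
"cell_type": "markdown",
"metadata": {},
"source": [
"The negative sampling step above can be sketched in plain Python. This is a minimal sketch, not necessarily the code used in this notebook: the helper `sample_negatives`, the toy corpus, the seed value, and the choice to exclude a query's own positive text are all assumptions made for illustration.\n",
"\n",
"```python\n",
"import random\n",
"\n",
"def sample_negatives(corpus, positive, k, rng):\n",
"    # Hypothetical helper. Candidates are all corpus texts\n",
"    # except the query's own positive text.\n",
"    candidates = [text for text in corpus if text != positive]\n",
"    # Draw k distinct negatives without replacement.\n",
"    return rng.sample(candidates, k)\n",
"\n",
"# Toy corpus standing in for the collection of positive texts.\n",
"corpus = ['passage A', 'passage B', 'passage C', 'passage D']\n",
"rng = random.Random(0)  # fixed seed so the sampling is reproducible\n",
"negs = sample_negatives(corpus, 'passage A', 2, rng)\n",
"print(negs)\n",
"```\n",
"\n",
"Random sampling can occasionally pick a passage that is in fact relevant to the query, so negatives obtained this way are only approximately negative."
]
},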
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we split the dataset into a training set and a test set."