diff --git a/examples/chroma-news-of-the-day/README.md b/examples/chroma-news-of-the-day/README.md new file mode 100644 index 000000000..1b76b8c24 --- /dev/null +++ b/examples/chroma-news-of-the-day/README.md @@ -0,0 +1,6 @@ +## News of the Day + +This notebook shows how to use Unstructured + Chroma + LangChain to preprocess and +summarize the news of the day from CNN Lite. To run this notebook, first install +the dependencies with `pip install -r requirements.txt`. Then run `jupyter lab` and +open the notebook. diff --git a/examples/chroma-news-of-the-day/news-of-the-day.ipynb b/examples/chroma-news-of-the-day/news-of-the-day.ipynb new file mode 100644 index 000000000..be30b948c --- /dev/null +++ b/examples/chroma-news-of-the-day/news-of-the-day.ipynb @@ -0,0 +1,268 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a92fa55b-7051-4aad-9aae-15eec21704d1", + "metadata": {}, + "source": [ + "# News of the Day\n", + "\n", + "In this notebook, we'll show how to use [Unstructured.IO](https://unstructured.io/), [ChromaDB](https://www.trychroma.com/), and [LangChain](https://github.com/langchain-ai/langchain) to summarize topics from the front page of CNN Lite. Without tooling from the modern LLM stack, this would have been a time-consuming project. With Unstructured, Chroma, and LangChain, the entire workflow is less than two dozen lines of code." + ] + }, + { + "cell_type": "markdown", + "id": "28208685-c4a9-4e59-973c-2144f64dc275", + "metadata": {}, + "source": [ + "## Gather links with `unstructured`\n", + "\n", + "First, we'll gather links from the [CNN Lite](https://lite.cnn.com/) homepage using the `partition_html` function from `unstructured`. When `unstructured` partitions HTML pages, links are included in the metadata for each element, make link collection a simple task. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "d994c585-af48-416e-99fa-6bf9f736c4f7", + "metadata": {}, + "outputs": [], + "source": [ + "from unstructured.partition.html import partition_html" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "f3ae24d4-2926-4ea3-ad12-52e372918039", + "metadata": {}, + "outputs": [], + "source": [ + "cnn_lite_url = \"https://lite.cnn.com/\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "deca76d9-8f38-4dcc-a34a-466aaba3ab45", + "metadata": {}, + "outputs": [], + "source": [ + "elements = partition_html(url=cnn_lite_url)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "cb258d76-459e-4249-88e3-5bd340e06a32", + "metadata": {}, + "outputs": [], + "source": [ + "links = []\n", + "for element in elements:\n", + " if element.metadata.links is not None:\n", + " relative_link = element.metadata.links[0][\"url\"][1:]\n", + " if relative_link.startswith(\"2023\"):\n", + " links.append(f\"{cnn_lite_url}{relative_link}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c4189214-0cf4-4ccf-b996-bb89c1d30233", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['https://lite.cnn.com/2023/08/07/health/breast-cancer-overdiagnosis/index.html',\n", + " 'https://lite.cnn.com/2023/08/08/investing/ups-earnings/index.html',\n", + " 'https://lite.cnn.com/2023/08/08/business/molson-coors-blue-run-spirits-acquisition/index.html',\n", + " 'https://lite.cnn.com/2023/08/08/politics/ukraine-counteroffensive-us-briefings/index.html',\n", + " 'https://lite.cnn.com/2023/08/08/europe/italian-cheesemaker-dies-italy-intl-scli/index.html']" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "links[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "0e340a96-e66b-431a-8416-4a9b4d2bdadf", + "metadata": {}, + "source": [ + "## Ingest individual articles with `UnstructuredURLLoader`\n", + "\n", + "Now that we have the links, we can preprocess individual news articles with `UnstructuredURLLoader`. `UnstructuredURLLoader` fetches content from the web and then uses the `unstructured` `partition` function to extract content and metadata. In this example we preprocess HTML files, but it works with other response types such as `application/pdf` as well. After calling `.load()`, the result is a list of `langchain` `Document` objects." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "cd629187-dcd7-4411-8e61-dea34bae2963", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.document_loaders import UnstructuredURLLoader\n", + "\n", + "loaders = UnstructuredURLLoader(urls=links, show_progress_bar=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "6249c4c4-4c0d-4c49-8f0b-4e6f97b5d8c8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:11<00:00, 8.41it/s]\n" + ] + } + ], + "source": [ + "docs = loaders.load()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "aef8c3ea-26e9-4a09-8314-0d1e7580ae26", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Document(page_content='CNN\\n\\n8/8/2023\\n\\nOlder women’s breast cancer is often overdiagnosed, study finds, raising risk of unnecessary treatment\\n\\nBy Amanda Musa, CNN\\n\\nUpdated: \\n 8:38 AM EDT, Tue August 8, 2023\\n\\nSource: CNN\\n\\nA breast cancer diagnosis is an all-too-common reality for women around the world. In the US, about 240,000 cases of breast cancer are diagnosed in women every year, the US Centers for Disease Control and Prevention estimates.\\n\\nHealth care providers and patients alike are usually inclined to pursue treatment to stop the disease. But some experts say that it isn’t always necessary to treat breast cancer in older women with aggressive therapy.\\n\\nA study published Monday in the Annals of Internal Medicine found that large numbers of American women ages 70 to 85 are potentially overdiagnosed with breast cancer and therefore could receive unnecessary treatment.\\n\\n“Overdiagnosis refers to this phenomenon where we find breast cancers on screening that never would have become clinically apparent,” says Dr. Ilana Richman, the lead author of the study and an assistant professor of medicine at the Yale School of Medicine.\\n\\n“The breast cancers are real, true cancers under the microscope, but they would have laid dormant in a person’s body and never caused symptoms,” she added. “We wouldn’t have known about that had we not gone looking.”\\n\\nThe study looked at 54,635 women 70 or older who had recently been screened for breast cancer. Richman and her team estimate that 31% of women ages 70 to 74 were potentially overdiagnosed. Among women 75 to 84, an estimated 47% were potentially overdiagnosed.\\n\\nWomen 85 and older were the age group most likely to be overdiagnosed, at 54%.\\n\\nBreast cancer diagnosis peaks among women in their 70s, according to data from the American Cancer Society. The risk decreases as women age into their 80s, partly “because women die of other health conditions instead like heart disease or other cancers,” Richman says.\\n\\nWhether an older women should get screened for breast cancer is an individual decision, said Richman, whose study found that overdiagnosis of breast cancer was more prevalent as age increased and life expectancy decreased.\\n\\nTechnological developments drive overdiagnosis\\n\\nMore women are being diagnosed with breast cancer thanks to advancements in screening such as three-dimensional mammography as well as CT, MRI and PET scanning. Technology makes it easier for oncologists to find even the smallest\\xa0cancer, said Dr. Otis Brawley, an oncologist and epidemiologist at Johns Hopkins University.\\n\\n“We can now find a lesion in a woman’s breasts that is the size of a green pea,” he said. “We can stick a needle into that woman’s breast and get the needle into that green pea-sized lesion and send a piece of that green pea-sized lesion to a pathologist.”\\n\\nBut Brawley notes that some of these breast cancers are not destined to grow, spread or kill due to their genetic makeup – and meeting them with aggressive treatment such as surgery, radiation or chemotherapy can sometimes be unnecessary and even harmful to the patient.\\n\\nNot only can overtreatment increase the risk of complications, especially in older patients, it can cause unwarranted stress and financial trouble, he said.\\n\\n“There’s also the experience of being diagnosed,” Richman says. “It is one of the most life-changing experiences. If we could spare women that experience who don’t need to have it, then I think that would be beneficial.”\\n\\nShe doesn’t want to suggest that less-threatening cancers always go untreated, however. Less-invasive surgery and the use of medications that block estrogen can often treat these cancers sufficiently without exposing patients to radiation or chemotherapy.\\n\\nMammograms and older women\\n\\nFor women over the age of 75, there is no guidance on whether to continue getting mammograms, according to the US Preventive Services Task Force, a group of independent medical experts whose recommendations help guide doctors’ decisions. The evidence determining the risks and benefits of screening in women in this age group is “insufficient,” the group says.\\n\\nIn May, the USPSTF issued a draft recommendation urging women to start breast cancer screenings at age 40. It also noted an effort to conduct more research on whether women 75 and older need to be screened for breast cancer.\\n\\nMammograms for this age group cost the federal health plan for seniors more than\\xa0$410 million a year, according to a 2013 study in JAMA Internal Medicine.\\n\\nTaxpayers usually foot the bill for these tests because most seniors are covered by Medicare.\\n\\nAnd while cancer screenings generally aren’t expensive – a\\xa0mammogram averages about $100\\xa0– they can launch a cascade of follow-up tests and treatments that add to the total cost of care.\\n\\nMuch of the screening guidance for mammograms also depends on how much longer a person is expected to live, Richman says, something that’s hard to quantify and a topic that most patients and doctors are not eager to discuss.\\n\\n“For any one individual in front of you, there’s a fair bit of uncertainty about how long that person will live,” she said. “It also raises issues that are just typical to talk about, like, certainly nobody wants to hear that they’re not going to live long enough to benefit” from a mammogram.\\n\\nRichman adds that more research needs to be done to conclude whether a woman of a certain age needs to get a mammogram in the first place.\\n\\n“For women who are getting into the upper age range or women who have lots of other serious medical conditions that are also taking up a lot of time and deserve attention, it’s unlikely that mammography is going to be the thing that is most beneficial to that person’s health.”\\n\\nRichman also worries about unnecessary exposure to radiation during a mammogram.\\n\\nUltimately, she says, the decision to get screened is up to the individual.\\n\\n“I think for women who are used to being screened and feel like they derive value from knowing something about their bodies, that’s an important consideration,” she says.\\n\\nThe future of breast cancer diagnosis\\n\\nIn the near future, doctors will be able to monitor “good-behaving” breast cancers in the same way low-risk prostate cancers are treated, Brawley said, adding that estimates say “overdiagnosis occurs in more than half of all screen-detected prostate cancer.”\\n\\n“Within the next 10 years, there will be some women who are going to be told ‘you have breast cancer, but it is one of the more benign breast cancers. Therefore, we’re going to watch you instead of treating you aggressively,”\\xa0he said.\\n\\nUltimately, Brawley says, the answer to overdiagnosis is the study of cancer genomics, analyzing the molecular biology of an individual’s cancer. But the technology is not quite there yet.\\n\\n“It will lead to a redefinition of cancer. The pathologic definition used today was developed in the mid-19th century using the biopsy and a light microscope. We need to move to a 21st-century definition, encompassing both the biopsy and pathologic appearance as well as genomics,” Brawley says. “The 21st-century definition of cancer will recognize that breast cancer is not one entity but many distinct diseases with potentially different patterns of behavior calling for different treatments and, sometimes, no treatment.”\\n\\nCNN’s Jacqueline Howard contributed to this report.\\n\\nCorrection: A previous version of this story misspelled Dr. Ilana Richman’s name.\\n\\nSee Full Web Article\\n\\nGo to the full CNN experience\\n\\n© 2023 Cable News Network. A Warner Bros. Discovery Company. All Rights Reserved.\\n\\nTerms of Use\\n\\nPrivacy Policy\\n\\nAd Choices\\n\\nCookie Settings', metadata={'source': 'https://lite.cnn.com/2023/08/07/health/breast-cancer-overdiagnosis/index.html'})" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "docs[0]" + ] + }, + { + "cell_type": "markdown", + "id": "90852a76-6804-4eb4-9929-b15a3dcfec4f", + "metadata": {}, + "source": [ + "## Load documents into ChromaDB\n", + "\n", + "With the documents preprocessed, we're not ready to load them into ChromaDB. We accomplish this easily by using the OpenAI embeddings the Chroma vectrostore from `langchain`. This workflow will vectorize the documents using the OpenAI embeddings endpoint, and then load the documents and associated vectors into Chroma. Once the documents are in Chroma, we can perform a similarity search to retrieve documents related to our topic of interest." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "dfdf919b-f7b1-4ea3-af49-44d333c53c4d", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.vectorstores.chroma import Chroma\n", + "from langchain.embeddings import OpenAIEmbeddings" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "4576e386-0fca-4d73-88dd-7df7d1afdef8", + "metadata": {}, + "outputs": [], + "source": [ + "embeddings = OpenAIEmbeddings()\n", + "vectorstore = Chroma.from_documents(docs, embeddings)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "dc31bc23-21e8-4c96-9eb2-5a09734ca936", + "metadata": {}, + "outputs": [], + "source": [ + "query_docs = vectorstore.similarity_search(\"Update on the coup in Niger.\", k=1)" + ] + }, + { + "cell_type": "markdown", + "id": "d1c1498d-6b68-4c7f-9787-78c06f5992f1", + "metadata": {}, + "source": [ + "## Summarize the Documents\n", + "\n", + "After retrieving relevant documents from Chroma, we're ready to summarize them! There are multiple ways to accomplish this in `langchain`, but `load_summarization_chain` is the easiest. Simply choose an LLM, load the summarization chain, and you're ready to summarize the documents. Here we limit the summary to snippets related to our topic of choice." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "85326ac4-ed5b-4ae9-aae0-854c8e0587b0", + "metadata": {}, + "outputs": [], + "source": [ + "from langchain.chat_models import ChatOpenAI\n", + "from langchain.chains.summarize import load_summarize_chain" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "de09860c-0b95-4a91-abf3-92c683446c00", + "metadata": {}, + "outputs": [], + "source": [ + "llm = ChatOpenAI(temperature=0, model_name=\"gpt-3.5-turbo-16k\")\n", + "chain = load_summarize_chain(llm, chain_type=\"stuff\")" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "c3e4e37d-286a-4d58-813a-3c4a02f31ab0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Niger's military has deployed reinforcements to the capital city after refusing to cede power following a deadline set by a regional bloc. The military junta took control of the country in a coup last month, leading to political chaos. The Economic Community of West African States (ECOWAS) has imposed sanctions and threatened military intervention if the junta does not step down. The situation remains uncertain, with ECOWAS leaders seeking a diplomatic solution but willing to use force as a last resort. The uncertainty has caused concern among residents, who are stocking up on supplies and attempting to flee the capital. The future of Niger's elected government is important to its democratic neighbors and Western partners, including the United States and France. Russia's mercenary group Wagner has also shown interest in the situation, potentially seeking to gain influence in the region.\n" + ] + } + ], + "source": [ + "print(chain.run(query_docs))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.11" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/chroma-news-of-the-day/requirements.txt b/examples/chroma-news-of-the-day/requirements.txt new file mode 100644 index 000000000..45cf1453c --- /dev/null +++ b/examples/chroma-news-of-the-day/requirements.txt @@ -0,0 +1,3 @@ +chromadb +langchain +unstructured