📒 Looking for the docs?

You can find Haystack's documentation at https://docs.haystack.deepset.ai/.

💻 How to update docs?

Overview, Components, Pipeline Nodes, and Guides

You can find these docs on the Haystack Docs page: https://docs.haystack.deepset.ai/docs/get_started. We welcome every contribution. To suggest a change, do the following:

  1. Make sure you're on the right version (check the version drop-down list in the top left corner).
  2. Use the "Suggest Edits" link you can find in the top right corner of every page.
  3. Suggest a change right in the docs and click Submit Suggested Edits.
  4. Optionally, leave us a comment and submit your change.

Once we've reviewed it, you'll get an email telling you whether the change has been merged. If it hasn't, we'll give you the reason why.

Make sure to check our Contribution Guidelines.

Tutorials

The Tutorials live in a separate repo: https://github.com/deepset-ai/haystack-tutorials. For instructions on how to contribute to tutorials, see Contributing to Tutorials.

API Reference

We use Pydoc-Markdown to create Markdown files from the docstrings in our code. There is a GitHub Action that regenerates the API pages with each commit.

If you want to generate a new Markdown file for a new Haystack module, create a .yml file in docs/src/api/api that configures how Pydoc-Markdown generates the page, and commit it to main.

All updates to docstrings get pushed to the documentation when you commit to the main branch.

Configuration

Pydoc-Markdown reads the configuration from a .yml file located under /haystack/docs/_src/api/pydoc. Our files contain three main sections:

  • loader: A list of plugins that load API objects from Python source files.
    • type: Loader for Python source files
    • search_path: Location of the source files
    • modules: Modules used for generating the Markdown file
    • ignore_when_discovered: Files that should be ignored
  • processor: A list of plugins that process API objects to modify their docstrings (e.g. to adapt them from a documentation format to Markdown or to remove items that should not be rendered into the documentation).
    • type: filter: Filter for specific modules
    • documented_only: Only documented API objects
    • do_not_filter_modules: Do not filter module objects
    • skip_empty_modules: Skip modules without content
  • renderer: A plugin that produces the output files. We use a custom ReadmeRenderer based on the Markdown renderer. It makes sure the Markdown files comply with ReadMe requirements.
    • type: Define the renderer which you want to use. We are using the ReadmeRenderer to make sure the files display properly in ReadMe.
    • excerpt: Add a short description of the page. It shows up right below the page title.
    • category: This is the ReadMe category ID to make sure the doc lands in the right section of Haystack docs.
    • title: The title of the doc as it will appear on the website. Make sure you always add "API" at the end.
    • slug: The page slug. Separate the words with dashes.
    • order: Pages are ordered alphabetically. This defines where in the TOC the page lands.
    • markdown:
      • descriptive_class_title: Remove the word "Object" from class titles.
      • descriptive_module_title: Add the word "Module" before the module name.
      • add_method_class_prefix: Add the class name as a prefix to method names.
      • add_member_class_prefix: Add the class name as a prefix to member names.
      • filename: File name of the generated file. Use underscores to separate the words.
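
The naming conventions for title, slug, and filename listed above can be sketched in a few lines of Python. This is only an illustration, not part of the Haystack tooling: the helper name `page_names`, the lowercasing of the slug and file name, and the `.md` extension are assumptions made for the example.

```python
import re


def page_names(name: str) -> dict:
    """Derive title, slug, and filename for an API page (hypothetical helper)."""
    title = name.strip()
    words = re.findall(r"[A-Za-z0-9]+", title)

    # The title should always end with "API".
    if not words or words[-1].lower() != "api":
        title = f"{title} API".strip()
        words.append("API")

    # Assumption: slug and file name are lowercase.
    lowered = [word.lower() for word in words]
    return {
        "title": title,
        "slug": "-".join(lowered),              # words separated with dashes
        "filename": "_".join(lowered) + ".md",  # words separated with underscores
    }


if __name__ == "__main__":
    print(page_names("Document Stores"))
```

Passing a name that already ends in "API" leaves the title unchanged, so the helper can be run on existing page titles as a consistency check.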