### Summary
Updates documentation references in the README to point to
https://docs.unstructured.io and cleans up a few sections of the README.
Specifically:
- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is
only available through the SaaS API.
Also adds a Python 3.12 tag to `setup.py` since we now support Python
3.12.
### Summary
Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.
### Testing
Run `make docker-run-dev` and then `python3.11` once you're in. And that
point, you can try:
```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```
Stop the container once you're done.
To test:
cd docs && make HTML
changelogs:
point main readme to the correct connector html page
point chroma docs to correct sample code
---------
Co-authored-by: potter-potter <david.potter@gmail.com>
Update to API key link per Slack convo.
There may be more that's needed, @ron-unstructured please run with this
PR if it's the right solution, close it if not, or make changes if
needed. I haven't looked for other places where the link might need to
be changed. The link appears to work, but any further investigation or
testing is appreciated.
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
### Summary
We no longer use the "bricks" terminology for partioning functions, etc
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out, it'll likely be helpful to reorganize the docs a bit as well.
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
There is a built in option to not send data by setting an env var,
SCARF_NO_ANALYTICS=true.
DoD:
- When importing or running unstructured package it will make a get call
to scarf
- When env variable is set to not track, call is not made
This adds extra-index-url to our docs to allow for anonymous install
analytics to help us understand and improve our product.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
We've created a custom domain, downloads.unstructured.io that redirects
to quay.io
(using https://scarf.sh/). This custom domain allows us to swap the
underlying container registry without impacting users. It also provides
us with important metrics about container and package usage, without
surfacing PII
like IP addresses.
Python package follows the same pattern at packages.unstructured.io
This PR adds a new developer tool for profiling performance: `py-spy`.
Additionally it adds a new make command to start a docker with your
local `unstructured` repo mounted for quick testing code in a Rocky
Linux environment (see usage below for intent).
### py-spy
It is a sampling profiler https://github.com/benfred/py-spy and in
practice usually provides more readily usable information than commonly
used `cProfiler`. It also supports output to `speedscope` format,
[which](https://github.com/jlfwong/speedscope#usage) provides a rich
view of the profiling result.
### usage
The new tool is added to the existing `profile.sh` script and is readily
discoverable in the interactive interface. When select to view the new
speedscope format profile it would show up in your local browser if you
followed the readme to install speedscope locally via `npm install -g
speedscope`.
On macOS the profiling tool needs superuser privilege. If you are not
comfortable with that feel free to run the profiling inside a Linux
container if your local dev env is macOS.
Summary
Closes#1124
Updates dead links in repository README
- Quick Start > Install for local development
- Learn more > Batch Processing)
Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass)
Testing
All tests pass
* process attachments for email
* add attachment processing to msg
* fix up metadata for attachments
* add test for processing email attachments
* added test for processing msg attachments
* update docs
* tests for error conditions
* version and changelog
* add max partition size logic
* work splitting logic into split_by_paragraph
* pass through max_partition to other functions
* added test for splitting long document
* add type hint
* add documentation
* version and changelog
* ingest-test-fixtures-update
* Update ingest test fixtures (#819)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* retrigger ci
* ingest-test-fixtures-update
* ingest-test-fixtures-update
* Update ingest test fixtures (#821)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* update default for partition_xml
* update version for release
* update msg doc string
---------
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* first pass at partition_tsv
* working tests
* create constants for tests and debug `make test` failure
* make check and tidy
* undo changes for testing locally
* update changelog and version
* fix bricks.rst
* refactor if statements
* make tidy
* fix README and change try/except to if/else
* update changelog and version
* fix\ docstring
* add support for page numbers in docx when present
* version and changelog
* add comment on page numbers
* add header and footer to doc elements list
* update integrations docs
* include_page_breaks kwarg for doc and docx
* merge element metadata for pagebreaks
* fix typo
* fix changelog typo
* change page number default to None
* add initial_page_number kwarg
* make page number tests in pdf more explicit
* revert test file
* update ingest tests
* update test fixture outputs
* updates to IRS forms fixtures
* ingest-test-fixtures-update
* Update ingest test fixtures (#759)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* first pass on partition_xml
* add option to keep xml tags
* added tests for xml
* fix filename
* update filenames
* remove outdated readme
* add xml to auto
* version and changelog
* update readme and docs
* pass through include_metadata
* update include_metadata description
* add README back in
* linting, linting, linting
* more linting
* spooled to bytes doesnt need to be a tuple
* Add tests for newly supported filetypes
* Correct metadata filetype
* doc typo
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* typo fix
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* keep_xml_tags -> xml_keep_tags
---------
Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
* first pass on partition_xlsx
* add support for files
* add test for xlsx from filename
* added filetype metadata
* add xlsx to auto
* remove fake excel from unsupported
* version and changelog
* update docs
* update readme
* fix removed file reference
* fix some more tests
* pass in metadata filename
* add include_metadata flag