95 Commits

Author SHA1 Message Date
Ronny H
8be7108829
Replace Serverless API to Platform announcement on README page (#4003)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2025-05-20 16:54:53 +00:00
cragwolfe
7ff0ff890d
chore: utils update (#3909) 2025-02-07 05:58:23 +00:00
cragwolfe
71208ca2ee
doc: emphasize deprecation of ingest (#3610)
Given that unstructured-ingest is now maintained in [its own
repo](https://github.com/Unstructured-IO/unstructured-ingest), update
documentation references in this repo to point there.

Note that the forked, deprecated unstructured.ingest [in this
repo](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest)
will be removed in the near future, once CI is updated properly.
2024-09-09 16:03:44 -07:00
Matt Robinson
116200559b
docs: add link to serverless api in readme (#3322)
### Summary

Adds links to the serverless api. README updates look like the
following:

<img width="904" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/fcb2b0c5-0dff-4612-8f18-62836ca6de8b">
2024-07-01 07:39:12 -04:00
Matt Robinson
23e570fc8a
docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to
https://docs.unstructured.io and cleans up a few sections of the README.
Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is
only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python
3.12.
2024-05-30 16:22:54 +00:00
Taha Yassine
ecdfb7a07c
fix: update link in readme (#2493)
Fix link to the "Run the library in a container" section.
2024-05-15 22:48:53 -07:00
Matt Robinson
612905e311
build: wolfi base image for Dockerfile (#3016)
### Summary

Updates the `Dockerfile` to use the Chainguard `wolfi-base` image to
reduce CVEs. Also adds a step in the docker publish job that scans the
images and checks for CVEs before publishing. The job will fail if there
are high or critical vulnerabilities.

### Testing

Run `make docker-run-dev` and then `python3.11` once you're in. At that
point, you can try:

```python
from unstructured.partition.auto import partition
elements = partition(filename="example-docs/DA-1p.pdf", skip_infer_table_types=["pdf"])
elements
```

Stop the container once you're done.
2024-05-15 22:53:15 +00:00
Matt Robinson
6abfb8b2b3
docs: add morph badge (#2666)
Adds the Morph badge to the README. Supersedes #2663. The badge renders
correctly on the branch, as seen below.

<img width="924" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/3dce2e6f-ce9d-452c-a0a7-4077ec7d66ce">
2024-03-19 19:55:17 +00:00
Michał Martyniak
b9aa4b7452
fix: Install pandoc consistently, via Makefile recipe (version that supports .rtf files as input format) (#2593)
## Problem Description
In some cases you might find yourself in a situation where pandoc cannot
process `rtf` as an input file format, because older versions
simply do not support it.

```
RuntimeError: Invalid input format! Got "rtf" but expected one of these: commonmark, creole, csv, docbook, docx, dokuwiki, epub, fb2, gfm, haddock, html, ipynb, jats, jira, json, latex, man, markdown, markdown_github, markdown_mmd, markdown_phpextra, markdown_strict, mediawiki, muse, native, odt, opml, org, rst, t2t, textile, tikiwiki, twiki, vimwiki
```

Basically, a user may install the wrong version. The `README.md` is
not precise enough when mentioning RTF file support:

47b35ccdd6/README.md (L120-L122)

## Example
Installing `pandoc` from a [stable repository, like
Debian](https://packages.debian.org/source/bullseye/pandoc) will give
you `2.9` and the official documentation shows clearly that support for
rtf was introduced in `2.14`
https://pandoc.org/releases.html#pandoc-2.14.2-2021-08-21
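
The version requirement described above could be checked before attempting an RTF conversion. The following is a minimal sketch, not code from this repo: the function names are hypothetical, and the `2.14.2` threshold is taken from the README recommendation mentioned in this PR. It only assumes that the first line of `pandoc --version` looks like `pandoc 3.1.2`.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Threshold assumed from the README recommendation in this PR (>=2.14.2).
MIN_RTF_VERSION = (2, 14, 2)


def parse_pandoc_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract a version tuple from the output of `pandoc --version` (hypothetical helper)."""
    match = re.search(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def supports_rtf_input(version: Tuple[int, ...]) -> bool:
    """Return True when the version is at least MIN_RTF_VERSION."""
    # Pad short versions such as (2, 9) with zeros so every component is compared.
    padded = version + (0,) * (len(MIN_RTF_VERSION) - len(version))
    return padded >= MIN_RTF_VERSION


def installed_pandoc_supports_rtf() -> bool:
    """Check the pandoc on PATH; False if it is missing or its version cannot be parsed."""
    if shutil.which("pandoc") is None:
        return False
    result = subprocess.run(
        ["pandoc", "--version"], capture_output=True, text=True, check=False
    )
    version = parse_pandoc_version(result.stdout)
    return version is not None and supports_rtf_input(version)
```

With a check like this, a caller could fail early with a clear message instead of hitting the `Invalid input format!` error shown above.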

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/3d5199f1-5e39-46ad-ac90-fff9cc5543a8)

### Note that `rtf` is not there

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/de90ebaf-86f2-4b21-83fb-085e27eeea38)

### More detail

![image](https://github.com/Unstructured-IO/unstructured/assets/64484917/59fbb91f-1650-4091-bdcb-15aa035416c8)

## Proposed Solution 
- [x] I've simply added/copied `make install-pandoc` calls, mimicking
other recipes in order to ensure that `3.1.2` will be installed in all
cases. **Side note**: `make install-pandoc` calls
`./scripts/install-pandoc.sh` under the hood.
- [x] Update README file - mention that `make install-pandoc` is
recommended (`>=2.14.2`)
- [x] Verify tests that cover `rtf` cases:
47b35ccdd6/test_unstructured/file_utils/test_file_conversion.py (L14)
- [x] Update `setup_ubuntu.sh` if needed?:
47b35ccdd6/scripts/setup_ubuntu.sh (L87)
2024-03-04 11:02:32 +00:00
David Potter
d7f4c24e21
fix documentation for chroma (#2403)
To test:

cd docs && make HTML

changelogs:

point main readme to the correct connector html page
point chroma docs to correct sample code

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-17 01:53:52 +00:00
qued
231f04eb84
chore: update api key link (#2182)
Update to API key link per Slack convo.

There may be more that's needed, @ron-unstructured please run with this
PR if it's the right solution, close it if not, or make changes if
needed. I haven't looked for other places where the link might need to
be changed. The link appears to work, but any further investigation or
testing is appreciated.

Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-12-06 23:46:32 +00:00
qued
04fcdb91fe
chore: Update readme slack links (#2030)
Updated slack links in the README that were using an old shortened URL.
2023-11-07 13:02:43 -08:00
Matt Robinson
d9c035edb1
docs: no more bricks (#1967)
### Summary

We no longer use the "bricks" terminology for partitioning functions, etc.
in the library. This PR updates various references to bricks within the
repo and the docs. This is just an initial pass to swap the terminology
out; it'll likely be helpful to reorganize the docs a bit as well.

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-11-02 09:43:26 -05:00
Ronny H
f78d4d505a
Updated "join Slack" link (#1948)
Updated "join Slack" links on README page.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-31 00:02:21 -07:00
Trevor Bossert
6acd06987b
Remove extra index url from docs (#1711)
It’s no longer required to specify the extra index url as we utilize a
different method of gathering anonymous install analytics.
2023-10-11 19:34:49 +00:00
Trevor Bossert
f0a63e2712
Add basic call to scarf to get anonymous analytics (#1705)
There is a built-in option to not send data by setting an env var,
SCARF_NO_ANALYTICS=true.

DoD:
- When importing or running the unstructured package, it will make a GET call
to Scarf
- When the env variable is set to not track, the call is not made
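
The opt-out behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the code added by this PR: `analytics_enabled` and `maybe_send_ping` are hypothetical names, the actual request is passed in as a callable rather than shown, and treating the value case-insensitively is an assumption.

```python
import os
from typing import Callable, Mapping, Optional


def analytics_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when SCARF_NO_ANALYTICS is set to "true" (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Assumption: the value is compared case-insensitively, ignoring whitespace.
    return env.get("SCARF_NO_ANALYTICS", "").strip().lower() != "true"


def maybe_send_ping(
    send: Callable[[], None], environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Call `send` unless the user opted out; return whether a call was attempted."""
    if not analytics_enabled(environ):
        return False
    try:
        send()
    except Exception:
        # An analytics failure should never break importing or running the package.
        pass
    return True
```

Passing the environment in as a mapping keeps the check easy to test without mutating `os.environ`.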
2023-10-11 09:15:36 -07:00
Trevor Bossert
ce206f1f85
add extra-index-url for scarf anonymous tracking (#1668)
This adds extra-index-url to our docs to allow for anonymous install
analytics to help us understand and improve our product.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:16:38 +00:00
Ronny H
b1cbfb845c
Update Slack community link on README page (#1653) 2023-10-05 12:07:00 -07:00
Trevor Bossert
961223da2a
Chore: Update readme to using new download location to track download metrics (#1507)
Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
`docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest`

There should be no additional steps needed.
2023-09-22 17:30:37 -07:00
Trevor Bossert
e8dfbfdbe5
Add notification that we will be utilizing scarf for docker and python downloads (#1503)
We've created a custom domain, downloads.unstructured.io, that redirects
to quay.io (using https://scarf.sh/). This custom domain allows us to swap the
underlying container registry without impacting users. It also provides
us with important metrics about container and package usage, without
surfacing PII like IP addresses.

The Python package follows the same pattern at packages.unstructured.io.
2023-09-22 12:59:58 -07:00
Ryan Nikolaidis
8c1d03e5cf update slack invite 2023-09-20 00:02:03 -07:00
Yao You
b504a48e06
dev: add py-spy profiling (#1251)
This PR adds a new developer tool for profiling performance: `py-spy`.
Additionally, it adds a new make command to start a Docker container with your
local `unstructured` repo mounted, for quickly testing code in a Rocky
Linux environment (see usage below for intent).

### py-spy

It is a sampling profiler (https://github.com/benfred/py-spy) and in
practice usually provides more readily usable information than the commonly
used `cProfile`. It also supports output in the `speedscope` format,
[which](https://github.com/jlfwong/speedscope#usage) provides a rich
view of the profiling result.

### usage

The new tool is added to the existing `profile.sh` script and is readily
discoverable in the interactive interface. When you choose to view the new
speedscope-format profile, it opens in your local browser, provided you
followed the readme to install speedscope locally via `npm install -g
speedscope`.

On macOS the profiling tool needs superuser privileges. If you are not
comfortable with that and your local dev env is macOS, feel free to run the
profiling inside a Linux container.
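
Outside of `profile.sh`, a py-spy recording in speedscope format could be started from Python roughly as follows. This is a sketch only: `build_py_spy_command` is a hypothetical helper, the output file name is an arbitrary choice, and it assumes `py-spy` is installed and on the PATH.

```python
from typing import List, Sequence


def build_py_spy_command(
    target: Sequence[str], output: str = "profile.speedscope.json"
) -> List[str]:
    """Build a `py-spy record` command that writes a speedscope-format profile."""
    # Everything after `--` is the program to profile, passed through unchanged.
    return ["py-spy", "record", "--format", "speedscope", "--output", output, "--", *target]


command = build_py_spy_command(["python", "my_script.py"])  # my_script.py is a placeholder
# To actually run it (may need elevated privileges on macOS, as noted above):
#   import subprocess
#   subprocess.run(command, check=True)
```

The resulting JSON file can then be opened with speedscope as described in the usage section.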
2023-08-31 19:26:29 +00:00
Ronny H
2d5f931c3f
Update README to Python-3.10 (#1231) 2023-08-29 03:21:23 +00:00
omahs
64b4287308
fix: typos (#1215)
fix: typos
2023-08-28 12:05:48 +00:00
Newel H
be093d2e66
chore: Update dead links to correct pages (#1127)
Summary
Closes #1124

Updates dead links in repository README
- Quick Start > Install for local development
- Learn more > Batch Processing

Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass)

Testing
All tests pass
2023-08-16 10:43:37 -04:00
Ronny H
0d5b5a0e79
Revamp README & Bricks documentation (#1103)
Reorganize README.md
2023-08-12 19:58:51 +00:00
Ronny H
7a05ef2cd9
Python script to collect environment for debugging issues (#989)
* Tested on Mac, Windows & Rocky Linux OS
* Updated README to include bugs reporting script
2023-08-02 22:54:43 +00:00
Emily Chen
050cfafb70
Add subsection for docs; prioritize getting started with container (#962) 2023-07-21 17:29:58 -07:00
Amanda Cameron
35e529f2d4
updating api key link (#960) 2023-07-21 13:05:40 -07:00
Yuming Long
208148abe7
Chore: update require api key in readme (#952) 2023-07-20 16:10:03 +00:00
Ronny H
31511793cb
Update README and API doc for Chipper announcement (#940)
Update README and API doc for Chipper model beta version announcement
2023-07-19 13:00:37 -07:00
ryannikolaidis
3b33331082
docs: fix readme word docs typo (#946) 2023-07-17 20:04:50 +00:00
Matt Robinson
c581a33c8a
feat: attachment processing for emails (#855)
* process attachments for email

* add attachment processing to msg

* fix up metadata for attachments

* add test for processing email attachments

* added test for processing msg attachments

* update docs

* tests for error conditions

* version and changelog
2023-06-29 18:01:12 -04:00
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
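
The idea of capping element size with a `max_partition` value can be illustrated with a small standalone sketch. This is not the library's splitting logic: `split_to_max_size` is a hypothetical function that splits on whitespace and hard-cuts any single word longer than the limit.

```python
from typing import List, Optional


def split_to_max_size(text: str, max_partition: Optional[int]) -> List[str]:
    """Split text into pieces of at most max_partition characters (illustrative only)."""
    if max_partition is None or len(text) <= max_partition:
        return [text]
    if max_partition < 1:
        raise ValueError("max_partition must be a positive integer or None")

    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-cut words longer than the limit so the size bound always holds.
        while len(word) > max_partition:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_partition])
            word = word[max_partition:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition:
            current = candidate
        else:
            # The next word would overflow, so close the current chunk first.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Passing `None` disables the limit and returns the text as a single piece.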
2023-06-28 15:26:01 -04:00
Amanda Cameron
95f02f290d
chore: update readme for api keys (#792)
* api announcement

* updating copy

* version bump
2023-06-26 11:56:01 -07:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
shreyanid
21c346dab8
broken file link in quick start sample code (#789) 2023-06-21 13:39:10 -07:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed to quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
Matt Robinson
e0c477de68
docs: update slack invite link (#749) 2023-06-14 10:06:45 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Matt Robinson
c6dc466e79
docs: update capabilities table; fix mistake in para grouping docs (#683)
* docs: update capabilities table with rtf/md/epub tables

* fix regex in docs

* revert bricks update

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-06-06 18:29:56 +00:00
Matt Robinson
6c10d8f022
docs: update detectron2 instructions in readme (#678) 2023-06-02 19:44:41 +00:00
Matt Robinson
be04e1b7c4 docs: tables supported for ppt now 2023-05-31 16:15:04 -04:00
cshaddox
d23e0d6420
feat: table extraction for power points (#664)
* Handling tables

* updating changelog

* Adding accidentally removed code

* remove newline

* reuse table extraction function; add test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 18:26:32 +00:00
Matt Robinson
3e983efce3
docs: add feature table to README (#655)
* remove announcement

* add table with filetypes

* remove filetype specific examples

* remove line break

* remove easy gif

* fix extra whitespace
2023-05-30 15:56:25 +00:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
2023-05-19 15:57:42 -04:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesnt need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00
Matt Robinson
b8037118c4
feat: add partition_xlsx for MSFT Excel files (#594)
* first pass on partition_xlsx

* add support for files

* add test for xlsx from filename

* added filetype metadata

* add xlsx to auto

* remove fake excel from unsupported

* version and changelog

* update docs

* update readme

* fix removed file reference

* fix some more tests

* pass in metadata filename

* add include_metadata flag
2023-05-16 19:40:40 +00:00