mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-11-01 10:33:09 +00:00
fix: consider all the required lines instead of first line to detect file type as CSV (#1728)
The csv detection logic in file_utils/filetype.py does not check the
comma count of every sampled line. It checks only the first line, which
is compared against itself, so the check always returns true:
```
lines = lines[: len(lines)] if len(lines) < 10 else lines[:10]
header_count = _count_commas(lines[0])
if any("," not in line for line in lines):
    return False
return all(_count_commas(line) == header_count for line in lines[:1])
```
Fixed by comparing the comma count of every line after the first with
the header's comma count, as shown below:
```
lines = lines[: len(lines)] if len(lines) < 10 else lines[:10]
header_count = _count_commas(lines[0])
if any("," not in line for line in lines):
    return False
return all(_count_commas(line) == header_count for line in lines[1:])
```
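The difference between the two slices can be reproduced in isolation. The following is a minimal sketch, not the library's code: `count_commas`, `is_csv_before_fix`, and `is_csv_after_fix` are hypothetical names, and `count_commas` is a simplified stand-in for `_count_commas` that ignores quoting. The empty-input guard is also an addition for the sketch.

```python
from typing import List


def count_commas(line: str) -> int:
    # Hypothetical stand-in for _count_commas: counts every "," and
    # does not treat quoted fields specially.
    return line.count(",")


def is_csv_before_fix(lines: List[str]) -> bool:
    sample = lines[:10]  # same result as the conditional slice in the commit
    if not sample:  # guard added for this sketch only
        return False
    header_count = count_commas(sample[0])
    if any("," not in line for line in sample):
        return False
    # sample[:1] contains only the header, so it is compared with itself.
    return all(count_commas(line) == header_count for line in sample[:1])


def is_csv_after_fix(lines: List[str]) -> bool:
    sample = lines[:10]
    if not sample:  # guard added for this sketch only
        return False
    header_count = count_commas(sample[0])
    if any("," not in line for line in sample):
        return False
    # sample[1:] is every sampled line after the header.
    return all(count_commas(line) == header_count for line in sample[1:])


ragged = ["name,age,city", "alice,30", "bob,25,paris,extra"]
print(is_csv_before_fix(ragged))  # True: the mismatch goes unnoticed
print(is_csv_after_fix(ragged))   # False: row comma counts differ from the header
```

Both versions still reject input in which any sampled line has no comma at all; only the comparison against the header changes.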
parent ef391e1a3e
commit 21df17f7fa
@@ -3,6 +3,7 @@
 ### Enhancements

 * **Add functionality to limit precision when serializing to json** Precision for `points` is limited to 1 decimal point if coordinates["system"] == "PixelSpace" (otherwise 2 decimal points?). Precision for `detection_class_prob` is limited to 5 decimal points.
+* **Fix csv file detection logic when mime-type is text/plain** Previously, the csv detection logic compared only the first row's comma count with the header row's comma count; because both are the same line, the result was always true. The logic now compares the comma count of every line except the first with the header row's comma count.

 ### Features

@@ -496,7 +496,7 @@ def _is_text_file_a_csv(
     header_count = _count_commas(lines[0])
     if any("," not in line for line in lines):
         return False
-    return all(_count_commas(line) == header_count for line in lines[:1])
+    return all(_count_commas(line) == header_count for line in lines[1:])


 def _check_eml_from_buffer(file: IO[bytes]) -> bool: