ragflow/deepdoc/parser/html_parser.py

# -*- coding: utf-8 -*-
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#  See the License for the specific language governing permissions and
#  limitations under the License.
#
from rag.nlp import find_codec
import readability
import html_text
import chardet


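def get_encoding(file):
    # Guess the file's encoding from its raw bytes.
    with open(file, 'rb') as f:
        tmp = chardet.detect(f.read())
    # chardet reports None when it cannot decide; fall back to UTF-8.
    return tmp['encoding'] or 'utf-8'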


class RAGFlowHtmlParser:
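    def __call__(self, fnm, binary=None):
        txt = ""
        if binary:
            # Raw bytes were supplied: detect the codec and drop undecodable bytes.
            encoding = find_codec(binary)
            txt = binary.decode(encoding, errors="ignore")
        else:
            # No bytes supplied: read the file at fnm using its detected encoding.
            with open(fnm, "r", encoding=get_encoding(fnm)) as f:
                txt = f.read()
        return self.parser_txt(txt)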

    @classmethod
    def parser_txt(cls, txt):
        if not isinstance(txt, str):
            raise TypeError("txt type should be str!")
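        # readability isolates the main article content from the page.
        html_doc = readability.Document(txt)
        title = html_doc.title()
        # Convert the extracted HTML fragment to plain text.
        content = html_text.extract_text(html_doc.summary(html_partial=True))
        # Return the title followed by the content, one entry per line.
        txt = f"{title}\n{content}"
        sections = txt.split("\n")
        return sections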
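
The flow above (decode the bytes, pull out a title and the readable text, return one list entry per line) can be sketched with the standard library alone. This is only an illustration of the output shape, not the parser itself: `_TitleAndTextCollector` and `sketch_parse` are hypothetical names, `html.parser` stands in for `readability` and `html_text` (it does no main-content extraction), and the encoding is passed in rather than detected with `find_codec`.

```python
from html.parser import HTMLParser


class _TitleAndTextCollector(HTMLParser):
    """Hypothetical stand-in for readability + html_text."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.lines = []
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only runs and anything inside script/style.
        if not text or self.SKIPPED.intersection(self._stack):
            return
        if "title" in self._stack:
            self.title = text
        else:
            self.lines.append(text)


def sketch_parse(binary, encoding="utf-8"):
    # Mirrors the binary branch of __call__: decode, ignoring bad bytes.
    txt = binary.decode(encoding, errors="ignore")
    collector = _TitleAndTextCollector()
    collector.feed(txt)
    collector.close()
    # Same output shape as parser_txt: title first, then the text lines.
    return (f"{collector.title}\n" + "\n".join(collector.lines)).split("\n")


page = (b"<html><head><title>Hi</title></head>"
        b"<body><p>One</p><p>Two</p><script>x()</script></body></html>")
sections = sketch_parse(page)
```

As in `parser_txt`, the first element of the returned list is the page title and the remaining elements are lines of body text.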