Drop libxml2 dependency

Python's built-in XML parser (xml.etree.ElementTree) seems good enough to parse hOCR files, so lxml and its libxml2 dependency are no longer needed.
2015-08-17 15:26:07 -07:00
commit 2dff3e07ce
parent 53c88093ad
4 changed files with 7 additions and 5 deletions
--- a/README.rst
+++ b/README.rst
@@ -96,7 +96,6 @@ Install dependencies::
   sudo apt-get install \
      zlib1g-dev \
      libjpeg-dev \
-      libxml2 \
      tesseract-ocr \
      qpdf \
      unpaper \
--- a/RELEASE_NOTES.rst
+++ b/RELEASE_NOTES.rst
@@ -47,6 +47,7 @@ Changes
   - MuPDF_ tools
   - shell scripts
   - Java and JHOVE_
+   - libxml2

 -  Some new external dependencies are required or optional, compared to v2.x:

@@ -66,6 +67,10 @@ Changes
 Release candidates
 ------------------

+-  rc6:
+
+   - dropped libxml2 (Python lxml) since Python 3's internal XML parser is sufficient
+
 -  rc5:

   - dropped Java and JHOVE in favour of qpdf
--- a/ocrmypdf/hocrtransform.py
+++ b/ocrmypdf/hocrtransform.py
@@ -9,7 +9,7 @@
 ##############################################################################
 from reportlab.pdfgen.canvas import Canvas
 from reportlab.lib.units import inch
-from lxml import etree as ElementTree
+from xml.etree import ElementTree
 from PIL import Image
 from collections import namedtuple
 import re
@@ -35,8 +35,7 @@ class HocrTransform():
        self.dpi = dpi
        self.boxPattern = re.compile(r'bbox((\s+\d+){4})')

-        self.hocr = ElementTree.ElementTree()
-        self.hocr.parse(hocrFileName)
+        self.hocr = ElementTree.parse(hocrFileName)

        # if the hOCR file has a namespace, ElementTree requires its use to
        # find elements
--- a/setup.py
+++ b/setup.py
@@ -203,7 +203,6 @@ setup(
    install_requires=[
        'ruffus>=2.6.3',
        'Pillow>=2.4.0',
-        'lxml>=3.3.3',
        'reportlab>=3.1.44',
        'PyPDF2>=1.25.1'
    ],
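
The swap in ocrmypdf/hocrtransform.py relies on the standard library's ElementTree covering the calls the hOCR code needs: parsing a file, namespace-qualified element lookup, and attribute access. The following is a minimal sketch of that usage, not the project's actual code. The sample document, the `word_boxes` helper, and the `ocrx_word` class filter are assumptions made for illustration; only the `bbox` regular expression is taken from the diff above.

```python
import io
import re
from xml.etree import ElementTree

# Hypothetical, minimal hOCR-style document used only for this illustration.
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" title="bbox 0 0 200 100">
      <span class="ocrx_word" title="bbox 10 20 60 40">Hello</span>
    </div>
  </body>
</html>
"""

# Same pattern as self.boxPattern in the diff above.
BOX_PATTERN = re.compile(r'bbox((\s+\d+){4})')


def word_boxes(source):
    """Return (text, (x1, y1, x2, y2)) for each word span in an hOCR source.

    `source` is a file name or a binary file object, as accepted by
    ElementTree.parse. This helper is hypothetical, not part of ocrmypdf.
    """
    root = ElementTree.parse(source).getroot()

    # With a default namespace, the root tag looks like '{namespace}html',
    # and every lookup has to carry the same '{namespace}' prefix.
    match = re.match(r'(\{.*\})html', root.tag)
    xmlns = match.group(1) if match else ''

    boxes = []
    for elem in root.iter(xmlns + 'span'):
        if elem.get('class') != 'ocrx_word':
            continue
        found = BOX_PATTERN.search(elem.get('title', ''))
        if found:
            coords = tuple(int(n) for n in found.group(1).split())
            boxes.append((elem.text, coords))
    return boxes


result = word_boxes(io.BytesIO(SAMPLE_HOCR.encode('utf-8')))
print(result)
```

Because the module is imported under the same `ElementTree` name as before, call sites that stick to this common subset of the API should not need further changes. Code that depends on lxml-only features such as full XPath would still need lxml.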