How to install tesseract-ocr in a training DLC of HF via a script?

Hi,

In order to train a LayoutLMv2 model on AWS SageMaker (inspired by the notebook Fine-tuning LayoutLMForSequenceClassification on RVL-CDIP.ipynb by @nielsr) through a script running in a Hugging Face training Deep Learning Container (DLC), I need to install tesseract-ocr in this container.

Here is the code of my AWS SageMaker notebook (inspired by 01_getting_started_pytorch by @philschmid). It creates a Hugging Face Estimator that launches the DLC and then runs the script tesseract.py. I have no problem with this code; the problem occurs after the DLC starts, when tesseract.py runs and begins with the tesseract-ocr installation:

!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

import sagemaker

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(entry_point='tesseract.py',
                            source_dir='./scripts',
                            instance_type='ml.g4dn.4xlarge', #'ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.12',
                            pytorch_version='1.9',
                            py_version='py38',
                            #hyperparameters = hyperparameters
                                   )

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit()

I tested different versions of the script tesseract.py. None of them successfully installs tesseract-ocr. I always get the following error message:

E: Unable to locate package tesseract-ocr

Below are the different versions of the script tesseract.py that I tested, each followed by its whole error message.
Can anyone help me find the right tesseract-ocr installation commands for this script?
Thank you.

Version 1 of tesseract.py

import os

os.system('apt-get install tesseract-ocr')
os.system('pip install -q pytesseract')

import pytesseract
print("pytesseract:",pytesseract.__version__)

if __name__ == "__main__":

    print("YES")

The error message:

Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9

Version 2 of tesseract.py

import os

os.system('apt install tesseract-ocr')
os.system('pip install -q pytesseract')

import pytesseract
print("pytesseract:",pytesseract.__version__)

if __name__ == "__main__":

    print("YES")

The error message:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9

Version 3 of tesseract.py

import os

os.system('apt-get update')
os.system('apt-get install tesseract-ocr')
os.system('pip install -q pytesseract')

import pytesseract
print("pytesseract:",pytesseract.__version__)

if __name__ == "__main__":

    print("YES")

The error message:

Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Err:2 http://security.ubuntu.com/ubuntu focal-security InRelease
  Couldn't create temporary file /tmp/apt.conf.GCKUDO for passing config to apt-key
Err:1 http://archive.ubuntu.com/ubuntu focal InRelease
  Couldn't create temporary file /tmp/apt.conf.u82fEW for passing config to apt-key
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
  Couldn't create temporary file /tmp/apt.conf.yGlnEY for passing config to apt-key
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
  Couldn't create temporary file /tmp/apt.conf.aez3e6 for passing config to apt-key
Reading package lists...
W: GPG error: http://security.ubuntu.com/ubuntu focal-security InRelease: Couldn't create temporary file /tmp/apt.conf.GCKUDO for passing config to apt-key
E: The repository 'http://security.ubuntu.com/ubuntu focal-security InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal InRelease: Couldn't create temporary file /tmp/apt.conf.u82fEW for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-updates InRelease: Couldn't create temporary file /tmp/apt.conf.yGlnEY for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-backports InRelease: Couldn't create temporary file /tmp/apt.conf.aez3e6 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-backports InRelease' is not signed.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9

Version 4 of tesseract.py

import os

os.system('apt update')
os.system('apt install tesseract-ocr')
os.system('pip install -q pytesseract')

import pytesseract
print("pytesseract:",pytesseract.__version__)

if __name__ == "__main__":

    print("YES")

The error message:

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Err:1 http://security.ubuntu.com/ubuntu focal-security InRelease
  Couldn't create temporary file /tmp/apt.conf.DTV8WR for passing config to apt-key
Err:2 http://archive.ubuntu.com/ubuntu focal InRelease
  Couldn't create temporary file /tmp/apt.conf.XWgcxV for passing config to apt-key
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
  Couldn't create temporary file /tmp/apt.conf.saNs71 for passing config to apt-key
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
  Couldn't create temporary file /tmp/apt.conf.K2jHm2 for passing config to apt-key
Reading package lists...
W: GPG error: http://security.ubuntu.com/ubuntu focal-security InRelease: Couldn't create temporary file /tmp/apt.conf.DTV8WR for passing config to apt-key
E: The repository 'http://security.ubuntu.com/ubuntu focal-security InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal InRelease: Couldn't create temporary file /tmp/apt.conf.XWgcxV for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-updates InRelease: Couldn't create temporary file /tmp/apt.conf.saNs71 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-backports InRelease: Couldn't create temporary file /tmp/apt.conf.K2jHm2 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-backports InRelease' is not signed.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9
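Reading the outputs of versions 3 and 4 from the top, the first failure is not the missing package: apt-key cannot create a temporary file under /tmp, so the package lists are never refreshed, and "Unable to locate package" is most likely only a consequence. As a minimal sketch of how to pull these signals out of a captured log (the function name summarize_apt_output and the dictionary keys are made up for illustration), something like this could be used:

```python
import re

# Patterns taken from the error lines shown in the outputs above.
TMP_ERROR = re.compile(r"Couldn't create temporary file (/tmp/\S+)")
UNSIGNED = re.compile(r"^E: The repository '(.+?)' is not signed\.", re.MULTILINE)
MISSING = re.compile(r"^E: Unable to locate package (\S+)", re.MULTILINE)


def summarize_apt_output(output: str) -> dict:
    """Group the distinct kinds of apt errors found in a captured log."""
    return {
        # Temporary files apt-key failed to create (deduplicated).
        "tmp_write_failures": sorted(set(TMP_ERROR.findall(output))),
        # Repositories rejected because their signature could not be checked.
        "unsigned_repositories": UNSIGNED.findall(output),
        # Packages that apt-get install could not find in its index.
        "missing_packages": MISSING.findall(output),
    }
```

If tmp_write_failures is not empty, fixing access to /tmp is probably the first thing to try, before touching the package name or the repository list.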

@pierreguillou Not sure what the error is here, but I suggest you go with LayoutLMv3 instead of LayoutLM. Among other improvements, it simplifies LayoutLMv2 by using patch embeddings (as in ViT) instead of leveraging a CNN backbone.

@nielsr also has a pretty good example of how to fine-tune it: Transformers-Tutorials/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb in the NielsRogge/Transformers-Tutorials repository on GitHub.

Hello @philschmid . Thanks for your reply but my main problem is installing tesseract-ocr in a HF DLC in AWS Sagemaker. That’s why I tried to install it with a simple and dedicated script (what I called tesseract.py in my initial post).

How to solve this? (tesseract-ocr is useful for many NLP applications and I guess, solving this issue would help a lot of people)

Note: about using LayoutLMv3 instead of LayoutLMv2, this is another interesting topic. My short answer on this: I would love to do this, but I’m dealing with NLP applications in a language other than English (mostly Portuguese). At this stage, without a multilingual LayoutLMv3 (see the answer of @nielsr on this), I can’t use LayoutLMv3.

@philschmid: do you think that the HF DLC on AWS SageMaker does not accept the apt-get update command? (That would explain why the package tesseract-ocr is not found by the apt-get install command.)

The post Fix E: “Unable to Locate Package” Error in Kali Linux talks about updating the file /etc/apt/sources.list. What do you think? (I do not see how I can do it through your SageMaker notebooks, though.) Thanks for your help.

Have you tried following the steps here? Introduction | tessdoc

Note for Ubuntu users : In case apt is unable to find the package try adding universe entry to the sources.list file as shown below.

You can add the line programmatically with the following command:

echo "deb http://archive.ubuntu.com/ubuntu bionic universe" >>  /etc/apt/sources.list
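Since the training script is Python, the same idea can be sketched without the shell. Two caveats: the apt output earlier in this thread shows that the container runs Ubuntu focal, not bionic, so the codename below defaults to focal (an assumption to adjust per image), and the function name ensure_universe_entry is made up for illustration. Unlike the echo command, this version does not append a second copy when the entry already exists:

```python
import re
from pathlib import Path


def ensure_universe_entry(sources_path: str, codename: str = "focal") -> bool:
    """Append a 'universe' entry for `codename` unless an active one exists.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(sources_path)
    text = path.read_text() if path.exists() else ""
    for line in text.splitlines():
        # Drop an optional "[arch=...]" options group, then split into
        # fields: deb <uri> <suite> <component> [<component> ...]
        fields = re.sub(r"\[.*?\]", "", line).split()
        if (
            len(fields) >= 4
            and fields[0] == "deb"
            and fields[2] == codename
            and "universe" in fields[3:]
        ):
            return False  # already enabled, nothing to do
    # Make sure the new entry starts on its own line.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as handle:
        handle.write(f"{prefix}deb http://archive.ubuntu.com/ubuntu {codename} universe\n")
    return True
```

In the container this would be called as ensure_universe_entry("/etc/apt/sources.list") before apt-get update. It only helps when the universe component is really missing from the sources.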

Hi @philschmid. Thank you for taking the time to look into this issue.

Even if your idea didn’t work, it helped me in finding the solution.

In fact, when testing your idea, I again got an error message like (...) Couldn't create temporary file /tmp/apt.conf... (...). I then did a search and found this post (Solve the error: couldn’t create temporary file /tmp/apt.conf.irqbcz), which says to run the following command to solve the problem:

chmod 777 /tmp
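For reference, the conventional mode of /tmp on Linux is 1777, that is, world-writable plus the sticky bit, so chmod 1777 /tmp is the slightly safer variant of the command above. A minimal Python sketch of checking and restoring that mode (the function names is_world_writable and restore_tmp_mode are made up for illustration):

```python
import os
import stat


def is_world_writable(path: str) -> bool:
    """Return True if users other than owner and group may write to `path`."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


def restore_tmp_mode(path: str = "/tmp") -> int:
    """Set mode 1777 (rwx for everyone + sticky bit) and return the new mode."""
    os.chmod(path, 0o1777)
    return stat.S_IMODE(os.stat(path).st_mode)
```

In the training script, one could call restore_tmp_mode() only when is_world_writable("/tmp") returns False, which also documents why the permission change is there.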

Thus, I changed the content of my script tesseract.py to this one:

import os

os.system('chmod 777 /tmp')
os.system('apt-get update -y')
os.system('apt-get install tesseract-ocr -y')
os.system('pip install -q pytesseract')

if __name__ == "__main__":

    print("YES")
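A note on robustness: os.system only returns the exit status and never raises, which is why the earlier versions went on to print the pytesseract version even though apt-get had failed. Below is a sketch of a stricter variant, not the script that was actually run: it uses subprocess.run with check=True so that the job stops at the first failing step. The helper names build_commands and run_all are made up for illustration, mode 1777 replaces 777, and DEBIAN_FRONTEND=noninteractive is added as an assumption to keep apt from prompting.

```python
import os
import subprocess
import sys


def build_commands(apt_packages, pip_packages):
    """Return the install steps as argument lists, in the order they must run."""
    return [
        ["chmod", "1777", "/tmp"],           # let apt-key create its temp files
        ["apt-get", "update", "-y"],         # refresh the package lists
        ["apt-get", "install", "-y", *apt_packages],
        [sys.executable, "-m", "pip", "install", "-q", *pip_packages],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order; a non-zero exit raises CalledProcessError."""
    env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
    for command in commands:
        runner(command, check=True, env=env)
```

At the top of tesseract.py this would be used as run_all(build_commands(["tesseract-ocr"], ["pytesseract"])), placed before import pytesseract.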

… and with this version of tesseract.py it worked! (see what was printed)

Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:3 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [1324 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages [11.3 MB]
Get:8 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [1974 kB]
Get:9 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [27.5 kB]
Get:10 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [881 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]
Get:12 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]
Get:13 http://archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages [30.3 kB]
Get:14 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [1411 kB]
Get:15 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1161 kB]
Get:16 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [2420 kB]
Get:17 http://archive.ubuntu.com/ubuntu focal-backports/main amd64 Packages [54.2 kB]
Get:18 http://archive.ubuntu.com/ubuntu focal-backports/universe amd64 Packages [27.1 kB]
Fetched 22.7 MB in 3s (7964 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
liblept5 libopenjp2-7 libtesseract4 tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
liblept5 libopenjp2-7 libtesseract4 tesseract-ocr tesseract-ocr-eng
tesseract-ocr-osd
0 upgraded, 6 newly installed, 0 to remove and 5 not upgraded.
Need to get 7227 kB of archives.
After this operation, 22.8 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 libopenjp2-7 amd64 2.3.1-1ubuntu4.20.04.1 [141 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 liblept5 amd64 1.79.0-1 [999 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/universe amd64 libtesseract4 amd64 4.1.1-2build2 [1237 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1 [1598 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1 [2990 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal/universe amd64 tesseract-ocr amd64 4.1.1-2build2 [262 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 7227 kB in 1s (6070 kB/s)
Selecting previously unselected package libopenjp2-7:amd64.
(Reading database ... 44741 files and directories currently installed.)
Preparing to unpack .../0-libopenjp2-7_2.3.1-1ubuntu4.20.04.1_amd64.deb ...
Unpacking libopenjp2-7:amd64 (2.3.1-1ubuntu4.20.04.1) ...
Selecting previously unselected package liblept5:amd64.
Preparing to unpack .../1-liblept5_1.79.0-1_amd64.deb ...
Unpacking liblept5:amd64 (1.79.0-1) ...
Selecting previously unselected package libtesseract4:amd64.
Preparing to unpack .../2-libtesseract4_4.1.1-2build2_amd64.deb ...
Unpacking libtesseract4:amd64 (4.1.1-2build2) ...
Selecting previously unselected package tesseract-ocr-eng.
Preparing to unpack .../3-tesseract-ocr-eng_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr-osd.
Preparing to unpack .../4-tesseract-ocr-osd_1%3a4.00~git30-7274cfa-1_all.deb ...
Unpacking tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Selecting previously unselected package tesseract-ocr.
Preparing to unpack .../5-tesseract-ocr_4.1.1-2build2_amd64.deb ...
Unpacking tesseract-ocr (4.1.1-2build2) ...
Setting up tesseract-ocr-eng (1:4.00~git30-7274cfa-1) ...
Setting up libopenjp2-7:amd64 (2.3.1-1ubuntu4.20.04.1) ...
Setting up tesseract-ocr-osd (1:4.00~git30-7274cfa-1) ...
Setting up liblept5:amd64 (1.79.0-1) ...
Setting up libtesseract4:amd64 (4.1.1-2build2) ...
Setting up tesseract-ocr (4.1.1-2build2) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
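The log shows the packages being set up, but the training script can also confirm this itself before any OCR call. A minimal sketch, assuming the tesseract binary ends up on PATH (the helper names find_binary, binary_version and require_binary are made up for illustration):

```python
import shutil
import subprocess
from typing import Optional


def find_binary(name: str = "tesseract") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def binary_version(path: str) -> str:
    """Return the first line printed by `<path> --version`."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    # Some tools print their version to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else ""


def require_binary(name: str = "tesseract") -> str:
    """Return the path of `name` or fail early with a clear message."""
    path = find_binary(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; the install step probably failed")
    return path
```

Calling binary_version(require_binary()) right after the installation makes the job fail immediately, with a readable message, if tesseract is missing.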