Hi,
To train a LayoutLMv2 model on AWS SageMaker (inspired by the notebook Fine-tuning LayoutLMForSequenceClassification on RVL-CDIP.ipynb by @nielsr) with a script that runs in a Hugging Face training Deep Learning Container (DLC), I need to install tesseract-ocr in this container.
Here is the code (inspired by 01_getting_started_pytorch by @philschmid) of my AWS SageMaker notebook. It runs a Hugging Face Estimator that starts the DLC and then runs the script tesseract.py. (I have no problem with this code; the problem comes after the DLC has started, when tesseract.py runs and begins with the tesseract-ocr installation):
!pip install "sagemaker>=2.48.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade
import sagemaker
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
import sagemaker.huggingface
from sagemaker.huggingface import HuggingFace
huggingface_estimator = HuggingFace(
    entry_point='tesseract.py',
    source_dir='./scripts',
    instance_type='ml.g4dn.4xlarge',  # 'ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.12',
    pytorch_version='1.9',
    py_version='py38',
    # hyperparameters=hyperparameters,
)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit()
I tested different versions of the script tesseract.py. None of them successfully installs tesseract-ocr. I always get the following error message:
E: Unable to locate package tesseract-ocr
Below are the different versions of the script tesseract.py that I tested, each with its full error message.
Can anyone help me find the right tesseract-ocr installation commands for this script?
Thank you.
Version 1 of tesseract.py
import os
os.system('apt-get install tesseract-ocr')
os.system('pip install -q pytesseract')
import pytesseract
print("pytesseract:",pytesseract.__version__)
if __name__ == "__main__":
    print("YES")
The error message:
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9
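A note on the log above: the script carries on to the pip install even though apt-get has already failed, because the exit status returned by os.system is never checked. A minimal sketch of a helper that surfaces the failure could look like the following. The names run_step and install_or_fail are hypothetical and not part of the original script.

```python
import subprocess


def run_step(cmd):
    """Run a command given as a list of arguments.

    Returns (exit_code, combined_output). run_step is a hypothetical
    helper name used for illustration only.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
    )
    # Echo the output so it still shows up in the training job log.
    print(result.stdout, end="")
    return result.returncode, result.stdout


def install_or_fail(cmd):
    """Run cmd and raise RuntimeError if it exits with a non-zero status."""
    code, output = run_step(cmd)
    if code != 0:
        raise RuntimeError(f"{' '.join(cmd)} failed with exit code {code}")
    return output
```

For example, install_or_fail(["apt-get", "install", "-y", "tesseract-ocr"]) would stop the script at the failing step instead of continuing to the pip install.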
Version 2 of tesseract.py
import os
os.system('apt install tesseract-ocr')
os.system('pip install -q pytesseract')
import pytesseract
print("pytesseract:",pytesseract.__version__)
if __name__ == "__main__":
    print("YES")
The error message:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9
Version 3 of tesseract.py
import os
os.system('apt-get update')
os.system('apt-get install tesseract-ocr')
os.system('pip install -q pytesseract')
import pytesseract
print("pytesseract:",pytesseract.__version__)
if __name__ == "__main__":
    print("YES")
The error message:
Get:1 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Err:2 http://security.ubuntu.com/ubuntu focal-security InRelease
Couldn't create temporary file /tmp/apt.conf.GCKUDO for passing config to apt-key
Err:1 http://archive.ubuntu.com/ubuntu focal InRelease
Couldn't create temporary file /tmp/apt.conf.u82fEW for passing config to apt-key
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Couldn't create temporary file /tmp/apt.conf.yGlnEY for passing config to apt-key
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Couldn't create temporary file /tmp/apt.conf.aez3e6 for passing config to apt-key
Reading package lists...
W: GPG error: http://security.ubuntu.com/ubuntu focal-security InRelease: Couldn't create temporary file /tmp/apt.conf.GCKUDO for passing config to apt-key
E: The repository 'http://security.ubuntu.com/ubuntu focal-security InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal InRelease: Couldn't create temporary file /tmp/apt.conf.u82fEW for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-updates InRelease: Couldn't create temporary file /tmp/apt.conf.yGlnEY for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-backports InRelease: Couldn't create temporary file /tmp/apt.conf.aez3e6 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-backports InRelease' is not signed.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9
Version 4 of tesseract.py
import os
os.system('apt update')
os.system('apt install tesseract-ocr')
os.system('pip install -q pytesseract')
import pytesseract
print("pytesseract:",pytesseract.__version__)
if __name__ == "__main__":
    print("YES")
The error message:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Err:1 http://security.ubuntu.com/ubuntu focal-security InRelease
Couldn't create temporary file /tmp/apt.conf.DTV8WR for passing config to apt-key
Err:2 http://archive.ubuntu.com/ubuntu focal InRelease
Couldn't create temporary file /tmp/apt.conf.XWgcxV for passing config to apt-key
Get:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Couldn't create temporary file /tmp/apt.conf.saNs71 for passing config to apt-key
Get:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Couldn't create temporary file /tmp/apt.conf.K2jHm2 for passing config to apt-key
Reading package lists...
W: GPG error: http://security.ubuntu.com/ubuntu focal-security InRelease: Couldn't create temporary file /tmp/apt.conf.DTV8WR for passing config to apt-key
E: The repository 'http://security.ubuntu.com/ubuntu focal-security InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal InRelease: Couldn't create temporary file /tmp/apt.conf.XWgcxV for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-updates InRelease: Couldn't create temporary file /tmp/apt.conf.saNs71 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-updates InRelease' is not signed.
W: GPG error: http://archive.ubuntu.com/ubuntu focal-backports InRelease: Couldn't create temporary file /tmp/apt.conf.K2jHm2 for passing config to apt-key
E: The repository 'http://archive.ubuntu.com/ubuntu focal-backports InRelease' is not signed.
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package tesseract-ocr
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
pytesseract: 0.3.9
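A possible lead from the Version 3 and Version 4 logs: the repeated "Couldn't create temporary file /tmp/apt.conf.* for passing config to apt-key" lines suggest that apt cannot write to /tmp inside the container. This is an assumption drawn from the log, not something confirmed. Since apt typically does part of its work as an unprivileged user, root being able to write to /tmp may not be enough, so the sketch below inspects the permission bits instead. The helper names are hypothetical.

```python
import os
import stat


def is_world_writable(path):
    """Return True if the 'others' write bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)


def describe_mode(path):
    """Return the permission string of path, in the style of ls -ld."""
    return stat.filemode(os.stat(path).st_mode)


def restore_tmp_mode(path="/tmp"):
    """Set the conventional sticky, world-writable mode (1777) on path.

    Assumption: the script runs as a user allowed to chmod path.
    """
    os.chmod(path, 0o1777)
```

Printing describe_mode("/tmp") and is_world_writable("/tmp") at the top of tesseract.py would show whether this is the cause. If /tmp turns out not to be world-writable, calling restore_tmp_mode() before apt-get update and apt-get install -y tesseract-ocr is one thing to try. I have not verified this in the DLC.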