Reading PDFs in Python

보조기억장치

캐세이 2024. 3. 14. 12:39

Installing the modules

pip3 install PyMuPDF

- A high-performance Python library for extracting, analyzing, converting, and manipulating data in PDF (and other) documents

pip3 install pdfminer.six

- A text extraction tool for PDF documents

pip3 install pillow

- Pillow, the maintained fork of the Python Imaging Library (PIL)

Extracting text

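import re
from pdfminer.high_level import extract_text, extract_pages

# Walk the layout tree page by page and print each layout element
for page_layout in extract_pages('foo.pdf'):
    for element in page_layout:
        print(element)

# Extract the full text of a document as a single string
text = extract_text('sample.pdf')
print(text)

# Find words followed by a comma and one whitespace character, e.g. "Smith, "
pattern = re.compile(r'[a-zA-Z]+,\s')
matches = pattern.findall(text)
names = [n[:-2] for n in matches]  # drop the trailing comma and whitespace
print(names)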
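
The name-matching step above can be tried without a PDF at hand. The following is a minimal sketch, assuming the extracted text contains entries shaped like "Smith, John"; the helper name extract_names and the sample string are hypothetical and not part of the original post. It uses a capture group instead of slicing off the last two characters.

```python
import re

# Hypothetical helper: returns every run of ASCII letters that is
# immediately followed by a comma and one whitespace character.
NAME_PATTERN = re.compile(r'([a-zA-Z]+),\s')


def extract_names(text):
    # With a single capture group, findall returns only the captured part,
    # so the trailing comma and whitespace are never included.
    return NAME_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "Smith, John\nDoe, Jane\n"  # stand-in for extract_text(...) output
    print(extract_names(sample))
```

In practice the argument would be the string returned by extract_text. Note that the pattern only matches ASCII letters, so names containing hyphens, apostrophes, or non-Latin characters would need a broader character class.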

Extracting images

import io

import fitz  # PyMuPDF
import PIL.Image  # Pillow

pdf = fitz.open('ccsg.pdf')
counter = 1  # running index across all images in the document
for page in pdf:
    # Each entry describes one image on the page; item 0 is its xref number
    for image in page.get_images():
        base_img = pdf.extract_image(image[0])
        image_data = base_img["image"]  # raw bytes of the embedded image
        extension = base_img["ext"]  # file extension reported by PyMuPDF
        img = PIL.Image.open(io.BytesIO(image_data))
        # Pass a path instead of an open file object so no handle is left open
        img.save(f"image{counter}.{extension}")
        counter += 1
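
One thing the loop above does not handle: if the same image (for example a logo) is referenced on several pages, it is saved once per page. The sketch below is one possible way to avoid that, by collecting each xref only once before extracting. The function names unique_xrefs and extract_unique_images and the prefix parameter are hypothetical, not from the original post. It also writes the extracted bytes directly instead of going through Pillow, on the assumption that no re-encoding is needed.

```python
def unique_xrefs(pages_images):
    """Return image xrefs in first-seen order, skipping repeats.

    pages_images is an iterable with one list per page, shaped like the
    result of page.get_images(): sequences whose first item is the xref.
    """
    seen = set()
    ordered = []
    for images in pages_images:
        for image in images:
            xref = image[0]
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


def extract_unique_images(pdf_path, prefix="image"):
    """Save each distinct embedded image once; return how many were saved."""
    import fitz  # PyMuPDF, imported here so unique_xrefs works without it

    pdf = fitz.open(pdf_path)
    try:
        xrefs = unique_xrefs(page.get_images() for page in pdf)
        for counter, xref in enumerate(xrefs, start=1):
            base_img = pdf.extract_image(xref)
            # The bytes are already encoded in the format named by "ext"
            with open(f"{prefix}{counter}.{base_img['ext']}", "wb") as fh:
                fh.write(base_img["image"])
    finally:
        pdf.close()
    return len(xrefs)
```

A call such as extract_unique_images('ccsg.pdf') would write image1, image2, and so on into the current directory, with the extension that PyMuPDF reports for each image.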
