최종 개발물
- demo example
- 결과 요약
1. Layout Detection
2. Text Extraction
- 2.1 microsoft/Phi-3
- 2.2 PdfReader + ocrmypdf
3. Table Parsing
- 3.1 camelot-py
- 3.2 pdfplumber
4. PDF table extraction after image extraction
5. SwinTransformer Encoder - BART Decoder (Donut / Nougat)
layout detection - text extraction - table parsing 문제점
- 해결 방안

실험 대상 pdf는 우리은행 (20240603) 금융시장 브리프.pdf로, Google에 “금융보고서”를 검색하여 랜덤하게 다운받은 pdf이다. 해당 pdf에서 차트와 표 그리고 텍스트가 고루 포함된 페이지를 대표 실험 대상으로 삼음. 해당 보고서 내용과 본 포스트는 무관함.

최종 개발물

demo example

가로형 PDF 최종 개발물 데모 (가로형 PDF)
세로형 PDF 최종 개발물 데모 (세로형 PDF)
image로 삽입된 table parsing 최종 개발물 데모 (table parsing)
text로 삽입된 table parsing 최종 개발물 데모 (table parsing)

결과 요약

Title, Text, Figure, Table layout detectin 성능 양호
layout 내 text 추출 성능 양호
text table, image table parsing 성능 양호

1. Layout Detection

1.1 PP-YOLOE

layout detection 확인한 것들 중 가장 잘 되는 모델

1.1.1 PP-YOLOE 기반 PDF Layout detection

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

1.1.2 PP-YOLOE + PyMuPDF

중국 모델로, layout 내 한국어 text 추출에 취약하여 text 추출은 PyMuPDF로 교체하여 실험한 결과 detection 부터 extraction까지 좋은 결과가 나옴.
table과 image에 대한 처리는 추가 커스텀이 필요해 보임.

PP-YOLOE로 layout detection 이후 PyMuPDF로 text extraction

1.2 microsoft/Florence-2-large

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

<반 PDF Layout detection

1.3 Detectron2

문제점:

Detectron2 모델 사용에 필요한 layoutparser에서 제공하는 text extraction 모듈 TesseractAgent가 한국어를 지원하지 않아 text 추출이 제대로 되지 않음
Detectron2로는 text 위치만 파악하고, 얻어진 bbox를 활용하여 다른 text 추출 라이브러리 혹은 모델을 사용해야 할 것 같음

요약:

Pirma가 가장 나음
다른 detection label이 함께 추출되어 layout만 잘 감지한다면 pdf 구조화 시 custom이 용이할 것으로 보임. e.g. Pirma label {1:”Page Frame”, 2:”Row”, 3:”Title Region”, 4:”Text Region”, 5:”Title”, 6:”Subtitle”, 7:”Other”}

modelzoo에서 확인 가능

1.3.1 Pirma

text dectection:
- PDF 내 텍스트 위치 파악 능력이 나쁘지 않음. 하지만 감지된 부분의 라벨 정보는 부정확함.
table detection:
- 표가 그림으로 삽입된 경우, 표를 감지하지 못함.
- 표가 표로 삽입되었을 때는 잘 감지함.

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

1.3.2 HJDataset

HJDataset은 거대 Aisan language dataset으로, 이를 학습한 모델로 문서 layout을 감지하는 task를 수행한다.
text와 table 전반적으로 다 잘 감지하지 못함.

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

1.3.3 PubLayNet

결과가 포괄적인 경향이 있음.
제목 위치가 layout에 포함되지 않는 등의 문제가 있음.

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

1.4 PDF to XML

PDF 구조를 XML로 변경하여 layout 파악(PDFQuery 사용)

1.4.1 PDF to XML 시각화

표가 그림으로 삽입된 페이지

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

1.4.2 PDF to xml + PyMuPDF

PyMuPDF
- 일반 text는 대부분 잘 추출하는 편.
- 표가 그림으로 삽입된 페이지의 table data 추출 못함.
- 표가 문서 내 표로 삽입된 페이지의 table data의 경우 텍스트는 추출되지만 정리되지 않음.

표가 그림으로 삽입된 페이지 | 표가 문서 내 표로 삽입된 페이지

2. Text Extraction

2.1 microsoft/Phi-3

image에서 한국어 인식 잘 못함.

# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
from IPython.display import display
import time, tqdm


# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"

# Load processor
processor = AutoProcessor.from_pretrained(model_id, 
                                          trust_remote_code=True
                                         )

# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda:1",
    trust_remote_code=True,
    torch_dtype="auto",
    quantization_config=nf4_config,
)

def model_inference(messages, path_image):
    
    start_time = time.time()
    
    image = Image.open(path_image)

    # Prepare prompt with image token
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Process prompt and image for model input
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:1")

    # Generate text response using model
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=500,
        do_sample=False,
    )

    # Remove input tokens from generated response
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1] :]

    # Decode generated IDs to text
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]


    display(image)
    end_time = time.time()
    print("Inference time: {}".format(end_time - start_time))

    # Print the generated response
    print(response)

prompt_cie_front = [{"role": "user", "content": "<|image_1|>\nOCR the text of the image. Most of the text is in Korean."}]

path_image = "./1.png"

model_inference(prompt_cie_front, path_image)

microsoft/Phi-3-vision-128k-instruct OCR 결과

2.2 PdfReader + ocrmypdf

pypdf.PdfReader로 text 추출
추출된 게 거의 없고, pdf 대부분이 imgage로 이루어져 있으면 ocrmypdf로 ocr한 이후 다시 pypdf.PdfReader로 text 추출

from pypdf import PdfReader
import ocrmypdf


def extract_text_from_pdf(reader):
    full_text = ""
    for idx, page in enumerate(reader.pages):
        text = page.extract_text()
        if len(text) > 0:
            full_text += f"---- Page {idx} ----\n" + page.extract_text() + "\n\n"

    return full_text.strip()


def convert(pdf_file):
    reader = PdfReader(pdf_file)

    # Extract metadata
    metadata = {
        "author": reader.metadata.author,
        "creator": reader.metadata.creator,
        "producer": reader.metadata.producer,
        "subject": reader.metadata.subject,
        "title": reader.metadata.title,
    }

    # Extract text
    full_text = extract_text_from_pdf(reader)

    # Check if there are any images
    image_count = 0
    for page in reader.pages:
        image_count += len(page.images)

    # If there are images and not much content, perform OCR on the document
    if image_count > 0 and len(full_text) < 1000:
        out_pdf_file = pdf_file.replace(".pdf", "_ocr.pdf")
        ocrmypdf.ocr(pdf_file, out_pdf_file, force_ocr=True)

        # Re-extract text
        reader = PdfReader(pdf_file)
        full_text = extract_text_from_pdf(reader)

    return full_text, metadata

convert("/workspace/PDF_Parsing/(20240603) 금융시장 브리프.pdf")

3. Table Parsing

3.1 camelot-py

표가 그림으로 삽입된 페이지
- 테이블 감지는 잘 됨
- 테이블 내 text 추출이 잘 되지 않음
- 아래 그림을 보면 표 위치에서 감지하는 text가 없음

camelot-py table detection result(Illustrated by the author)

camelot-py table text extraction

표가 문서 내 표로 삽입된 페이지
- 테이블 감지 잘 됨
- 테이블 내 text 가 추출 됨. 추가 후처리 모듈을 붙이면 깔끔하게 나올 것 같음

camelot-py table detection result(Illustrated by the author)

camelot-py table text extraction

3.2 pdfplumber

표가 그림으로 삽입된 페이지는 추출되는 text가 없음
표가 표로 삽입된 페이지는 아래와 같이 잘 추출됨.

pdfplumber table text extraction

4. PDF table extraction after image extraction

1단계. pdf에서 이미지 추출

import pymupdf

def extract_images_from_pdf(pdf_path, output_folder):
    doc = pymupdf.open(pdf_path)

    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        image_list = page.get_images(full=True)

        for img_index, img in enumerate(image_list):
            xref = img[0]
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]

            # Save the image
            image_filename = f"{output_folder}/page_{page_num + 1}_img_{img_index + 1}.png"
            with open(image_filename, "wb") as image_file:
                image_file.write(image_bytes)

2단계. 이미지로부터 텍스트 추출

from PIL import Image
from pytesseract import pytesseract

image = Image.open('/workspace/PDF_Parsing/Pix2Struct/images_pymu/page_5_img_2.png')
image = image.resize((400,200))

text = pytesseract.image_to_string(image, lang='kor')

print('detected text : ',text)

결과

PDF table extraction after image extraction

5. SwinTransformer Encoder - BART Decoder (Donut / Nougat)

Encoder로 layout detection 후 Decoder로 bounding box 내부 text 추출하는 모델

Donut Github: https://github.com/clovaai/donut/
Nougat Github: https://github.com/facebookresearch/nougat/tree/main?tab=readme-ov-file

5.1 Donut (naver clovaai)

Donut text extraction result

추출되는 것이 거의 없음

5.2 Nougat (meta)

추출되는 것이 전혀 없음

5.3 보강 방안

두 모델 모두 한국어 text 추출에 취약함.

SwinTransformer로 bbox 특정 후 아래 두 방법으로 한국어 추출 문제를 완화할 수 있을 것으로 보임

Decoder가 하는 역할을 Rulbase 한국어 추출을 잘 하는 라이브러리로 text를 따로 추출하는 방법
Decoder model을 변경하여

layout detection - text extraction - table parsing 문제점

pdf layout detection + 한국어 text extraction을 모두 잘하는 모델이 없음
table 정보를 한번에 잘 추출하는 모델이 없음

해결 방안

방법1 (최종 결과물 반영 방법)
- text: layout detection 이후 bbox내 text 추출, 각 단계마다 가장 괜찮은 방법은 아래와 같음:
  - detection : PP-YOLOE
  - text 추출 : PyMuPDF
- table: table은 경우에 따라 다르게 추출.
  1) 문서 내 image로 삽입된 table: 그림 추출 -> 그림에서 텍스트 추출
  - 그림 추출 : PyMuPDF
  - 그림에서 text 추출: pytesseract
  2) 문서 내 text로 삽입된 table: 이 경우 그림으로 표가 추출되지 않음.
  - table 추출 : pdfplumber
방법2

Donut / Nougat 모두 한국어 text 추출에 취약함. 이는 Decoder 문제로 예상됨.
- Encoder로부터 bbox를 얻고, Decoder를 PyMuPDF로 대체하여 text 추출
- Decoder model을 교체(한국어 잘하는 Transforemer decoder only model 들 중 하나로.)

한국어 PDF Parser / PDF OCR (layout detection + text extraction)

02 Feb 2024 | Development

최종 개발물

demo example

결과 요약

1. Layout Detection

1.1 PP-YOLOE

1.1.1 PP-YOLOE 기반 PDF Layout detection

1.1.2 PP-YOLOE + PyMuPDF

1.2 microsoft/Florence-2-large

1.3 Detectron2

1.3.1 Pirma

1.3.2 HJDataset

1.3.3 PubLayNet

1.4 PDF to XML

1.4.1 PDF to XML 시각화

1.4.2 PDF to xml + PyMuPDF

2. Text Extraction

2.1 microsoft/Phi-3

2.2 PdfReader + ocrmypdf

3. Table Parsing

3.1 camelot-py

3.2 pdfplumber

4. PDF table extraction after image extraction

1단계. pdf에서 이미지 추출

2단계. 이미지로부터 텍스트 추출

결과

5. SwinTransformer Encoder - BART Decoder (Donut / Nougat)

5.1 Donut (naver clovaai)

5.2 Nougat (meta)

5.3 보강 방안

layout detection - text extraction - table parsing 문제점

해결 방안