finddme

1. Introduction
2. Model Flow
3. Probing Tasks
4. Models
- 4.1 Probing Model
- 4.2 Contextualizers
5. Pretrained Contextualizer Comparison: What features of language do these vectors capture, and what do they miss?
6. Analyzing Layerwise Transferability: How and why does transferability vary across representation layers in contextualizers?
- 6.1 Experimental Setup
- 6.2 Results and Discussion
7. Transferring Between Tasks: How does the choice of pretraining task affect the vectors’ learned linguistic knowledge and transferability?
- 7.1 Experimental Setup
- 7.2 Results and Discussion
Reference

1. Introduction

이번에 살펴볼 논문은 Liu. et al. (2019)의 Linguistic Knowledge and Transferability of Contextual Representations이다. 본 논문에서 제시하는 문제는 아래와 같다:

What features of language do these vectors capture, and what do they miss?
How and why does transferability vary across representation layers in contextualizers?
How does the choice of pretraining task affect the vectors’ learned linguistic knowledge and transferability?

Transferability(Transfer Learning(전이 학습)이 얼마나 가능한지): Pre-Trained contexualizer가 특정 Task에 대해 어느정도의 성능을 보이느냐에 따라 전이가 잘 되었는지 아닌지를 판단한다.
Contextualizer: Sota를 찍은 NLP model에서 중요한 것은 Pre-trained representation이다. 전통적으로 word vector는 static한 word embedding이다. 즉 1 word, 1 vector이다. 이러한 방식은 Glove와 같은 모델에서 사용되었는데, 이러한 모델들은 단어들을 벡터화하는 것에 집중하여 단어 하나하나를 파악했기 때문에 문맥 파악에 취약하다. 최근 연구에서는 이러한 단점을 해결한 Elmo와 같은 모델을 통해 contextual word representations(CWR)을 사용한다.

Glove보다 BiLM의 성능이 높기도 하지만 둘의 큰 차이점은 다음과 같다:Glove는 특정 동사를 POS tagging할 때 co-occurrence 빈도가 높은 것을 기준으로 tagging이 진행되는 반면 BiLM은 morphological inflection과 morphological derivation을 구분하여 POS tag가 붙는다.(Peters et al. 2018)

2. Model Flow

모델의 흐름은 위와 같다. 위 그림은 POS tagging task에 대한 그림이다. 우선 Input Token이 모델에 입력되면 이는 이미 대량의 데이터로 학습된 Pretrained Contextualizer를 통해 feature를 갖는다. 이 feature들은 Probing model을 거쳐서 결과가 나온다. 이렇게 Pretrained model을 사용했을 때가 그렇지 않은 경우보다 더 좋은 성능을 낸다고 하는데, 이러한 방식이 linguistic knowledge를 가져서 높게 나오는 것인지, 어떻게 linguistic knowledge를 가지는지 그리고 어떻게 그 지식들을 전이하는지에 대해 실험해 보는 것이 본 논문의 주요 골자이다.

3. Probing Tasks

본 연구에서는 17개의 Probing task를 통해 Linguistic Knowledge를 잘 파악했는지 확인한다. 이 Task들은 크게 Token Labeling, Segmentation, Segmentation로 나뉠 수 있다.

3.1 Token Labeling

1) part-of-speech tagging (POS)

형태소 분석 tagging이다. 이를 통해 CWR이 기본적인 통사를 파악했는지를 평가할 수 있다.

dataset:
the Penn Treebank (PTB; Marcus et al., 1993),
the Universal Dependencies English Web Treebank (UDEWT; Silveira et al., 2014).

2) CCG supertagging (CCG)

어것은 문장 내 단어들의 통사적 역할에 대한 세부적인 정보를 확인하는 task이다.

dataset:
CCGbank (Hockenmaier and Steedman, 2007), a conversion of the PTB into CCG derivations

3) Syntactic constituency ancestor tagging

이 task의 경우 문장을 넣었을 때 문장 내에서 특정 단어가 어떤 단어에 대해 Parent인지, GParent인지, GGParent인지를 파악하는 것이다. 이를 통해 모델이 계층적 통사구조를 작 파악했는지 알 수 있다.

dataset:
phrasestructure tree (from the PTB)

어떤 문장이 입력되었을 때 문장 내의 단어들을 트리로 그렸을 때,

위 그림처럼 F자리에 오는 단어의 parent는 D이고, D의 입장에서 F와 G는 자식이다. 그리고 F와 G는 서로 sister이고 B는 F의 입장에서 grandparent이다.

4) Semantic tagging

문맥에서의 word’s semantic role(agent, patient, etc.)을 tagging하는 task이다. 이는 어휘적 의미를 파악하는지 그리고 불필요한 POS 변별하는지, 즉 POS tag에서의 유용한 것을 명확히 하는 지를 평가한다.

dataset:
dataset of Bjerva et al. (2016),
the tagset has since been developed as part of the Parallel Meaning Bank (Abzianidze et al., 2017)

5) Preposition supersense disambiguation

여러 의미를 가질 수 있는 Preposition이 각각 어떤 의미를 위해 사용되었는지 tag해주는 task이다. 이를 통해 의미의 모호성과 어휘 의미론적 지식을 검증할 수 있다. 앞서 살펴본 tagging task들(모든 token에 대해 decision을 만들었던)과 달리 이 모델은 single-token preposition에 대해 학습과 평가가 이루어진다.

dataset:
STREUSLE 4.0 corpus (Schneider et al., 2018)

6) Event Factuality (EF)

이것은 해당 Event가 실제로 발생했는지 안 했는지 확인하는 task이다. 예를 들어 Jo didn’t remember to leave와 Jo didn’t remember leaving에서 전자는 안 떠난 것이고 후자는 떠난 것이다. 이에 대해 factuality관점에서 바라보면 전자는 발생하지 않은 것이고 후자는 발생한 것이다. (해당 모델은 발생/비발생 예측 값을 [-3, 3]범위에서 값을 매기도록 훈련된다.

dataset:
Universal Decompositional Semantics It Happened v2 dataset (Rudinger et al., 2018)

3.2 Segmentation

7) Syntactic chunking (Chunk)

단어들이 서로 어떻게 관계되는지를 나누어 단어가 가지는 span과 boundary를 판별하는 task이다.

dataset:
CoNLL 2000 shared task dataset (Tjong Kim Sang and Buchholz, 2000)

8) Named entity recognition (NER)

개체명 인식. 고유명사를 얼마나 잘 찾아내는지를 평가하는 task이다.

dataset:
CoNLL 2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003)

9) Grammatical error detection (GED)

문법 교정기

dataset:
First Certificate in English (Yannakoudakis et al., 2011) dataset,
converted into sequence-labeling format by Rei and Yannakoudakis (2016)

10) conjunct identification (Conj)

coordination으로 연결된 것인지 판별하는 task.

dataset:
coordinationannotated PTB of Ficler and Goldberg (2016)

3.3 Pairwise Relations

11) Arc prediction

binary classification task로, 두 token이 관계되어 있는지 확인하는 task이다.

dataset:
PTB (converted to UD),
UD-EWT

12) Arc classification

multiclass classification task로, 두 token이 관계되어 있는지, 그리고 어떤 관계이 있는지 분류하는 task이다.

dataset:
PTB (converted to UD),
UD-EWT

13) syntactic dependency arc prediction

14) syntactic dependency arc classification

15) semantic dependency arc prediction

16) semantic dependency arc classification

17) coreference arc prediction

공지시관계를 확인하는 task이다.

dataset
CoNLL 2012 shared task (Pradhan et al., 2012)

4. Models

4.1 Probing Model

Probing Model은 선형모델을 사용했다.

4.2 Contextualizers

Contextualizer로는 6가지 모델이 사용되었다.

ELMo (Peters et al., 2018a): ELMo는 biLM을 통해 독립적으로 학습된 두 contextualizer의 결과를 concatenate한다. 여기서 사용된 ELMo model들은 sentence-shuffled newswire text (the 1 Billion Word Benchmark; Chelba et al., 2014) 800M token으로 훈련되었다.
1) ELMo (original): contextualiz하기 위해 2-layer LSTM을 사용한다.
2) ELMo (4- layer)
3) ELMo (transformer): 6-layer transformer를 사용한다.

4) OpenAI transformer:

left-to-right 12-layer transformer language model로, contiguous text from over 7,000 unique unpublished books (BookCorpus; Zhu et al., 2015) 800M token으로 훈련되었다.

BERT (Devlin et al., 2018): bidirectional transformer로 masked language modeling task 와 next sentence prediction task 훈련을 진행하였고, BookCorpus 와 English Wikipedia를 합쳐 약 3300M token으로 훈련되었다.
5) BERT (base, cased): 12-layer transformer 사용
6) BERT (large, cased): 24-layer transformer 사용

5. Pretrained Contextualizer Comparison: What features of language do these vectors capture, and what do they miss?

첫 번째 실험은 앞서 제시한 문제 중 첫 번째 문제와 관련되며, 어떤 언어학적 지식을 습득하고 어떤 것을 습득하지 못하는가에 대한 실험이다.

5.1 Experimental Setup

본 실험에 사용되는 probing model들은 각 contextualizer의 개별 layer를 통해 생성된 representation으로 훈련되었다. probing model들 간의 비교 외에도 noncontextual vector로 학습된 GloVe와 pretraining을 사용하지 않은 각 task의 이전 state of the art점수와의 비교도 진행된다.

5.2 Results and Discussion

위 표를 보면 각 모델에서 출력된 best layer를 pretrained representation으로 활용하여 각 task를 수행한 것을 볼 수 있다. Avg.는 모델들이 낸 성능의 평균값이다. BERT가 평균 성능도 높고 각 task들 사이에서도 상당 부분 높은 점수를 지니는 것을 확인할 수 있다. 심지어 몇몇 부분에서는 이전에 sota를 찍은 점수보다 높은 점수를 낸 것도 있다.
ELMo-based model을 보면 recurrent model(LSTM)으로 학습된 orginal과 4-layer가 ELMo-transformer보다 높은 성능을 보이는 것도 확인할 수 있다.
OpenAI같은 경우에는 ELMo기반 모델들과 BERT보다 낮은 점수를 보이고 있다. OpenAI는 여기에서 unidirectional로 훈련된 유일한 모델로, OpenAI의 저조한 성적을 통해 bidirection이 고성능 contextualizer에 핵심적인 요소라는 것을 확인할 수 있다.
위 결과를 보면 NER과 coreference task(본 논문 Appendix D에서 확인 가능)에 대한 transferable information을 학습하지 못하고 있는 것을 볼 수 있다. 이러한 취약점을 완화하기 위해 explicit entity representation을 사용한 pretrained contextualizer를 강화하는 시도를 해볼 수 있다.

(위 표는 정리된 표인데 본 논문 Appentix를 보면 결과들이 쭉 나와있다.)

5.3 Probing Failures

앞서 살펴본 표를 보면 해당 실험에서 contextualizer를 통해 추출한 feature가 각 task들에 잘 전이된 것을 확인할 수 있다. 하지만 transfer에 실패한 몇몇 task(논문 Appendix부분에서 보면 개체명 인식, 문법 판독기, conjunction 그리고 grandparents를 뽑는 task의 점수가 낮다.)들이 있다. 실패한 task의 요인은 두 가지로 볼 수 있다:

(1) CWR이 적절한 정보나 예측 가능한 관련성을 encoding하지 못했거나(언어학적 지식들을 학습하지 못했다는 말)
(2) probing model이 정보나 예측 가능한 관련성을 vector로부터 추출할 능력이 없는 것(proving model이 약했다는 말).

1번 요인은 CWR이 task-specific한 정보를 인코딩하지 못하고 있는 것인데, 즉 pretrained contextualizer에 정보가 충분히 들어가고 있지 않다는 것인데, 이를 해결하기 위해 task-specific contextual feature를 학습해야 한다(주입식 교육 시켜야한다는 말…). 이렇게 contextual probing model이 task-specific contextual feature를 학습하는 것은 2번 요인의 해결책이 될 수 있지만 결과적으로 probing model의 능력을 향상시키는 것과 비슷하다.

본 연구의 probing model들의 실패를 더 잘 이해하기 위해 아래 표와 같이 또 다른 실험을 진행하였다.

(1) Linear: 원래 이전에 썼던 모델
(2) MLP: linear probing model을 MLP(multilayer perceptron; here used: single 1024d hidden layer activated by ReLU)으로 교체하여 더 많은 parameter를 probing model에 추가하는 실험. (그냥 다른 task처리는 전혀 안 하고 차원을 늘려 계산을 좀 더 복잡하게 하겠다는 것)
(3) LSTM+ Linear: linear output layer이전에 task-trained LSTM(unidirectional, 200 hidden units)을 사용한 contextual probing model을 사용하여 task-specific한 contextualization을 추가하는 실험. (LSTM을 task에 맞게 훈련시키는 것. 주입식교육시켜서 feature를 더 잘 추출할 수 있게 학습시킨 모델을 probing model을 사용)
(4) BiLSTM+MLP: CWR의 output이 BiLSTM(512 hidden unit)에 들어가고 이 output은 다시 MLP(single 1024-dimensional hidden layer activated by ReLU)에 들어가는 것.

위 그림을 보면 일단 MLP나 LSTM+Linear나 일단 기존의 Contextual Word Representation을 사용하고 있다. 같은 CWR을 기반으로 probing model쪽만 손본 것인데 위 표에서 NER과 grammatical error detection task의 경우 LSTM+Linear(task에 맞게 학습시킨 model)과 MLP(단순히 차원만 늘린 model)의 성능 차이가 별로 없다. 이는 Contextualizer은 개체명 인식을 위한 언어학적 지식을 충분히 학습했는데 Linear model의 차원이 부족해서 계산을 못한 것으로 볼 수 있다.

반면 조금 더 specific syntactic 지식을 요하는 conjunct identification과 great-grandparent prediction task의 경우에는 MLP와 LSTM+Linear(conjunction을 잘 추출하도록 훈련된 LSTM)의 점수 차이가 크다. 이는 Contextualizer가 conjunction에 대한 언어학적 정보를 충분히 학습하지 못했음을 보여준다.

이러한 실험은 end task가 pretraining task가 학습하지 못한 특정 정보를 요구할 때 task-trained contextualization이 중요하다는 것을 보여준다. 따라서 단순히 MLP를 쓰는 것 보다는 fine-tuning을 이용해서 task에 맞는 추출부(decoder)에 신경을 쓰는 것이 중요하다고 본다.

6. Analyzing Layerwise Transferability: How and why does transferability vary across representation layers in contextualizers?

이번 실험은 앞서 제시한 문제 중 두 번째 문제와 관련되며, 모델들의 layer에 어떤 차이가 있는지 보는 실험이다.

6.1 Experimental Setup

ELMo-based model은 contextualizer architecture만 다르기 때문에 제어된 비교에 용이하다. 따라서 여기에서는 layer들의 차이를 보는 것이 중요하기 때문에 ELMo-based model에 집중한다. ELMo-based model들은 수만개의 Word Benchmark로 훈련되었기 때문에 유연한 domain shift를 위해 softmax classifier만 비슷한 데이터로 재학습시킨다.

6.2 Results and Discussion

Recurrent model(LSTM)인 ELMo original(2-layer)과 4-layer ELMo에서는 layer 1의 output을 그대로 proving model로 사용했을 때 가장 높은 performance가 나왔다. 이는 LSTM-ELMo model의 경우, low level layer일수록 더 일반적인 정보를 가지고 있음을 나타낸다. 즉, low level layer일수록 더 transferable하다. 때문에 여러 task에 있어 좋은 점수가 잘 나오는 것이다.

반면 transformer-based model의 경우에는 첫 번째 layer의 결과도 안 좋고 마지막 결과도 안 좋은데 대부분 중간 layer의 결과들이 좋은 것을 확인할 수 있다. 이는 transformer model은 middle level layer가 일반적인 정보를 많이 가지고 있음을 나타낸다. 따라서 middle level layer의 transferability가 높다.

결과적으로 LSTM의 경우, low level layer에서 일반적인 언어지식이 가장 많이 전달될 수 있기 때문에 여러 task를 수행하기 위해서는 LSTM-layer를 많이 쌓아봐야 별 소용이 없음을 알 수 있다. 그리고 Transformer의 경우, middle level layer에서 일반적인 언어지식이 가장 잘 학습되기 때문에 middle level later의 출력을 활용하는 것이 여러 task를 수행하는 것에 도움이 된다.

위 표는 하나의 특정 task에 대한 각 모델의 layer들에 대한 PPL(perplexity, language model을 평가하는 척도로, 낮으면 낮을 수록 좋다)를 평가한 것을 나타낸다. ELMo original은 layer 2에서 PPL이 가장 낮고 ELMo 4-layer에서는 layer 4에서 가장 낮다. 이는 task specific한 정보는 layer를 쌓을수록 higher layer에 담긴다는 것을 나타낸다. 반면 transformer의 경우에는 monotonic increase pattern이 나타나지 않는다.

이러한 결과를 통해 LSTM과 Transformer에 대한 layerwise behavior의 차이도 알 수 있다. LSTM의 layer는 task specific하게 학습되는 반면 transformer을 그렇지 않다.

RNN계열의 모델(ex. LSTM)과 Transformer계열의 모델(ex. BERT)의 syntactic knowledge encoding behavior pattern은 다르다. (본 논문에서는 이 차이가 가지는 의미에 대한 설명이 자세히 나오지 않는다.) RNN은 저차원의 syntax개념부터 고차원의 개념까지 monotonic increasing pattern을 보이며 encoding되고, Transformer는 non-monotonic behavior pattern을 띈다. monotonic increasing pattern은 computer vision쪽으로 생각하면 이해가 편하다.

위 그림처럼 input으로 조각들이 들어가고 hidden layer에서 처음에는 edge를, 그 다음에는 corner과 contour를 학습하고 마지막으로 object part를 학습하여 car, person, animal과 같은 object identity를 구분하는 것이 monotonic increasing pattern이다.

7. Transferring Between Tasks: How does the choice of pretraining task affect the vectors’ learned linguistic knowledge and transferability?

마지막으로 특정 task에 대해 training했을 때 그것이 어떤 결과를 가져오는가에 대한 실험이다. 지금까지 contextualizer들은 bidirectional LM과 같은 self-supervised task를 사용했다. 이런 contextualizer 자체를 특정 task에 대해 사전훈련시킨 결과를 보도록 하겠다.

pretraining task의 결과가 언어학적 지식을 습득하는데 얼마나 영향을 미치고 CWR의 전이성에 얼마나 영향을 미치는지 확인하기 위해 bidirectional LM 사전학습결과와 contextualizer를 사전학습시킨 것을 비교해보도록 하겠다.

7.1 Experimental Setup

사전학습 task들의 제어된 비교를 위해 contextualizer의 architecture와 사전학습 dataset은 고정한다. 이 실험에서 사용하는 모든 contextualizer는 모두 ELMo(original) architecture를 사용하고, pretraining task에 대한 사전학습 dataset은 PTB를 사용한다.

비교 모델은 아래와 같다:

(1) GloVe: contextualization의 결과를 확인하기 위한 non-contextual baseline을 사용한다.
(2) Untrained ELMo(original): pretraining의 결과를 측정하기 위해 randomly-initialized, uncontextualization basline을 사용한다.
(3) ELMo(original): bidirectional LM을 다량의 데이터로 학습시킨 결과를 보기 위해 Billion Word Benchmark로 사전학습시킨 모델을 사용한다.

7.2 Results and Discussion

위 표를 보면 bi-LM의 점수가 가장 높은 것을 확인할 수 있다.

하지만 위 표는 특정 task에 맞춰서 사전학습한 것에 대한 결과인데 syntactic arc prediction의 경우에는 이처럼 특정 task에 맞춰서 학습한 결과가 더 높은 것을 확인할 수 있다. 이에 따라 어느정도 task specific한 contextualize를 하는 것도 고려해야 한다는 것을 알 수 있다.

Reference

Liu et al. “Linguistic Knowledge and Transferability of Contextual Representations,”Association for Computational Linguistics. 2019

https://arxiv.org/abs/1808.08079

https://dmitry.ai/t/topic/172

Deep Learning 2017

Linguistic Knowledge and Transferability of Contextual Representations(Liu(2019))

17 Jan 2021 | NLP & Linguistics