000 | 00000nam c2200205 c 4500 | |
001 | 000045978962 | |
005 | 20190416165931 | |
007 | ta | |
008 | 181226s2019 ulkd bmAC 000c eng | |
040 | ▼a 211009 ▼c 211009 ▼d 211009 | |
085 | 0 | ▼a 0510 ▼2 KDCP |
090 | ▼a 0510 ▼b 6D36 ▼c 1098 | |
100 | 1 | ▼a 김예찬 |
245 | 1 0 | ▼a Toward word embedding techniques to handle out-of-vocabulary problem with subword information / ▼d Yeachan Kim |
260 | ▼a Seoul : ▼b Graduate School, Korea University, ▼c 2019 | |
300 | ▼a v, 41 leaves : ▼b charts ; ▼c 26 cm | |
500 | ▼a Advisor: 이상근 | |
502 | 0 | ▼a Thesis (Master's) -- ▼b Graduate School, Korea University: ▼c Department of Computer and Radio Communications Engineering, ▼d 2019. 2
504 | ▼a Bibliography: leaves 37-41 | |
530 | ▼a Also available as a PDF file; ▼c Requires PDF file reader (application/pdf) | |
653 | ▼a Word Embeddings ▼a NLP | |
776 | 0 | ▼t Toward Word Embedding Techniques to Handle Out-of-Vocabulary Problem with Subword Information ▼w (DCOLL211009)000000083458 |
900 | 1 0 | ▼a Kim, Yea-chan, ▼e author
900 | 1 0 | ▼a 이상근 ▼g 李尙根, ▼e advisor
945 | ▼a KLPA |
Holdings Information

No. | Location | Call Number | Registration No. | Status
---|---|---|---|---
1 | Science Library / Theses Stacks | 0510 6D36 1098 | 123060863 | Available for loan
2 | Science Library / Theses Stacks | 0510 6D36 1098 | 123060864 | Available for loan
Contents Information

Abstract
Word embeddings have been a crucial component in natural language processing (NLP) models. In particular, pre-trained word embeddings (e.g., word2vec, GloVe) have proven invaluable for improving performance on NLP tasks. However, such embeddings are usually blind to the relatedness between words. This gives rise to a serious limitation in representing out-of-vocabulary (OOV) words, even when such words are lexically related to vocabulary words. In this thesis, we present two methodologies to handle this problem. Our first approach is to expand the vocabulary by transforming the word embeddings themselves. To this end, we propose a novel deep neural network that takes a set of pre-trained word embeddings and generalizes it to word entries that include OOV words. The second approach is to modify the encodings (e.g., one-hot encodings) used in the look-up function of an embedding layer. To build a new encoding, we propose a neural network that takes a word as input and outputs an encoding of that word. In particular, we seek to inject the relatedness between words into the encodings, so that OOV words receive their own distinct encodings, represented through their related words. The common characteristic of both methods is that they use subword information to represent words, which lets them capture the relatedness between words and represent any word effectively. Experimental results and our in-depth analysis show that the two methodologies lead to substantial performance improvements by generating representations for OOV words, and demonstrate that our methods produce meaningful representations for OOV words.
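As a reading aid, here is a minimal PyTorch sketch of the embedding-based technique the abstract describes: a character-level convolutional network followed by a highway layer maps a word's characters to a vector in the pre-trained embedding space, so an OOV word can be assigned an embedding from its subword information. The class name, layer sizes, and the regression objective below are illustrative assumptions, not the thesis's actual architecture or training setup.

```python
import torch
import torch.nn as nn

class CharToWordEmbedding(nn.Module):
    """Maps a word's character IDs to a vector in a pre-trained
    word-embedding space (all hyperparameters are hypothetical)."""

    def __init__(self, n_chars=128, char_dim=16, word_dim=300,
                 kernel_sizes=(2, 3, 4), n_filters=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        # One 1-D convolution per kernel size, scanning character n-grams.
        self.convs = nn.ModuleList(
            nn.Conv1d(char_dim, n_filters, k) for k in kernel_sizes)
        hidden = n_filters * len(kernel_sizes)
        # Highway layer: a gate mixes a non-linear transform with identity.
        self.transform = nn.Linear(hidden, hidden)
        self.gate = nn.Linear(hidden, hidden)
        self.proj = nn.Linear(hidden, word_dim)

    def forward(self, char_ids):                      # (batch, max_word_len)
        x = self.char_emb(char_ids).transpose(1, 2)   # (batch, char_dim, len)
        # Max-over-time pooling per kernel size, then concatenate.
        h = torch.cat([conv(x).amax(dim=2) for conv in self.convs], dim=1)
        t = torch.sigmoid(self.gate(h))
        h = t * torch.relu(self.transform(h)) + (1 - t) * h
        return self.proj(h)           # vector in the word-embedding space

# Training sketch: regress onto known pre-trained vectors (e.g., GloVe) with
# torch.nn.functional.mse_loss(model(char_ids), pretrained_vectors); once
# trained, the characters of any OOV word yield an embedding for it.
model = CharToWordEmbedding()
oov_vector = model(torch.randint(1, 128, (1, 10)))  # one 10-character word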
Table of Contents
1 Introduction 1
2 Related Works 5
  2.1 Representing Words using Subword Information 5
  2.2 Utilizing Linguistic Resources to Handle OOV Words 6
3 Embedding-based Technique 7
  3.1 Character-level Convolutional Neural Networks for Embeddings 7
  3.2 Highway Network 9
  3.3 Training and Deriving Word Embeddings 10
4 Encoding-based Technique 12
  4.1 Character-level Convolutional Neural Networks for Encodings 13
  4.2 Knowledge Distillation into Character-based Encodings 15
  4.3 Training and Deriving Word Embeddings 16
5 Experiments 17
  5.1 Experimental Settings 17
  5.2 Experiments for Embedding-based Technique 19
    5.2.1 Word Similarity 19
    5.2.2 Language Modeling 21
  5.3 Experiments for Encoding-based Technique 23
    5.3.1 Word Similarity 23
    5.3.2 Word Analogy 24
    5.3.3 Chunking 27
  5.4 Comparison between Embedding and Encoding-based Technique 28
    5.4.1 Large-scale Text Classification 28
6 Analysis 31
  6.1 Analysis for Embedding-based Technique 31
    6.1.1 Effect of Highway Networks 31
    6.1.2 Nearest Neighbor of Words 33
  6.2 Analysis for Encoding-based Technique 34
    6.2.1 Effect of the Number of Elements in PEN 34
    6.2.2 Nearest Neighbor of Words 35
7 Conclusion 36
Bibliography 37
Acknowledgement 42