[ML] Text Preprocessing (Text Normalization)

-Contents-
1. What Is Text Normalization?

2. Cleansing

3. Text Tokenization

3.1 Sentence Tokenization

3.2 Word Tokenization

4. Stop Word Removal

5. Root Word Extraction (Stemming and Lemmatization)

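1. What Is Text Normalization?

Raw text cannot be turned into features directly; it has to be processed first. Text normalization refers to the preparatory work performed on text data, such as cleansing, refinement, tokenization, and stemming, so that the text can be used as input to machine learning algorithms or NLP applications. These tasks can be broadly grouped as follows.

Cleansing
Tokenization
Filtering / stop word removal / spelling correction
Stemming
Lemmatization

Let's look at what each preprocessing task means and how it is performed with NLTK, a Python package for NLP and text analysis.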

2. Cleansing

Cleansing removes characters and symbols that are unnecessary and would otherwise get in the way of analysis. For example, HTML or XML tags and certain symbols are removed in advance.
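The idea can be sketched with the standard library alone. The snippet below is a minimal illustration, not a robust HTML parser: the function name cleanse, the two regular expressions, and the sample string are all assumptions made for this example, and the set of characters to keep would differ from project to project.

```python
import html
import re

# Hypothetical patterns for this sketch: anything between < and > is treated
# as a tag, and only letters, digits, whitespace, and basic punctuation are kept.
TAG_PATTERN = re.compile(r"<[^>]+>")
NON_TEXT_PATTERN = re.compile(r"[^A-Za-z0-9\s.,!?']")


def cleanse(raw: str) -> str:
    """Strip tags, decode HTML entities, drop stray symbols, collapse spaces."""
    no_tags = TAG_PATTERN.sub(" ", raw)
    unescaped = html.unescape(no_tags)  # e.g. "&amp;" becomes "&"
    no_symbols = NON_TEXT_PATTERN.sub(" ", unescaped)
    return re.sub(r"\s+", " ", no_symbols).strip()


raw = "<p>The Matrix is <b>everywhere</b> &amp; all around us! ###</p>"
print(cleanse(raw))  # The Matrix is everywhere all around us!
```

For real HTML documents a dedicated parser is usually a safer choice than regular expressions; the regex version is only meant to show what "removing tags and symbols in advance" looks like.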

3. Text Tokenization

Tokenization comes in two types: sentence tokenization, which splits a document into sentences, and word tokenization, which splits a sentence into word tokens.

3.1 Sentence Tokenization

Sentence tokenization typically splits text on symbols that mark the end of a sentence, such as a period (.) or a newline character (\n). The example below tokenizes text into sentences with sent_tokenize(), the function most commonly used for this in NLTK.

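Example code

import nltk
from nltk import sent_tokenize

# Download the Punkt sentence tokenizer data used by sent_tokenize().
# Newer NLTK releases may ask for the 'punkt_tab' resource instead;
# download that one if a LookupError says so.
nltk.download('punkt')

# Adjacent string literals are concatenated, so no indentation leaks into the text.
text_sample = ('The Matrix is everywhere its all around us, here even in this room. '
               'You can see it out your window or on your television. '
               'You feel it when you go to work, or go to church or pay your taxes.')
sentences = sent_tokenize(text=text_sample)
print(type(sentences), len(sentences))
print(sentences)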

Output

<class 'list'> 3
['The Matrix is everywhere its all around us, here even in this room.', 
'You can see it out your window or on your television.', 
'You feel it when you go to work, or go to church or pay your taxes.']
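To make the "split on end-of-sentence symbols" rule concrete, here is a rough standard-library sketch. The function naive_sent_tokenize is a hypothetical helper written for this note; it is not how sent_tokenize() works internally, and it is included mainly to show where a simple rule falls short.

```python
import re


def naive_sent_tokenize(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace (toy rule)."""
    normalized = re.sub(r"\s+", " ", text).strip()  # newlines become spaces
    return re.split(r"(?<=[.!?])\s+", normalized)


text_sample = ('The Matrix is everywhere its all around us, here even in this room. '
               'You can see it out your window or on your television. '
               'You feel it when you go to work, or go to church or pay your taxes.')

sentences = naive_sent_tokenize(text_sample)
print(len(sentences))  # 3

# The toy rule breaks on abbreviations, which is one reason to prefer a
# trained tokenizer for real text.
print(naive_sent_tokenize("Dr. Smith arrived."))  # ['Dr.', 'Smith arrived.']
```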

3.2 Word Tokenization

Word tokenization splits a sentence into words. By default, words are separated on whitespace, commas (,), periods (.), newline characters, and the like. The example below uses NLTK's word_tokenize() to tokenize a sentence into words.

Example code

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."
words = word_tokenize(sentence)
print(type(words), len(words))
print(words)

Output

<class 'list'> 15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']
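The splitting rule described above can be approximated with a single regular expression. This is only a sketch under the assumption that a token is either a run of word characters or a single punctuation mark; naive_word_tokenize is a hypothetical name, and word_tokenize() applies more rules than this (contractions, for example, are handled differently).

```python
import re


def naive_word_tokenize(sentence: str) -> list[str]:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sentence = "The Matrix is everywhere its all around us, here even in this room."
words = naive_word_tokenize(sentence)
print(len(words))  # 15
print(words)
```

On this particular sentence the toy rule yields the same 15 tokens as the NLTK output shown above; that agreement should not be expected for arbitrary text.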

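4. Stop Word Removal

A stop word is a word that carries little meaning for analysis. In English, words such as is, the, a, and will are essential grammatical elements of a sentence but carry little contextual meaning. Because of their grammatical role, these words appear in text very frequently, so they can be mistaken for important words. Removing them is therefore an important preprocessing step.

Stop word lists are compiled per language, and NLTK provides stop word lists for many languages. Consider the example below.

Example code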

import nltk
from nltk import sent_tokenize
from nltk import word_tokenize

nltk.download('punkt')      # tokenizer data (already present if downloaded earlier)
nltk.download('stopwords')  # stop word lists

# Tokenize multi-sentence input into words, sentence by sentence.
def tokenize_text(text):
    # Split the text into sentences.
    sentences = sent_tokenize(text)
    # Tokenize each sentence into words.
    word_tokens = [word_tokenize(sentence) for sentence in sentences]
    return word_tokens

text_sample = ('The Matrix is everywhere its all around us, here even in this room. '
               'You can see it out your window or on your television. '
               'You feel it when you go to work, or go to church or pay your taxes.')

# Word-tokenize each sentence of the sample text.
word_tokens = tokenize_text(text_sample)

# A set makes the membership test below fast.
stop_words = set(nltk.corpus.stopwords.words('english'))
all_tokens = []
# Loop over the word-token lists of the three sentences and remove stop words.
for sentence in word_tokens:
    filtered_words = []
    # Loop over the tokens of one sentence.
    for word in sentence:
        # Convert to lowercase, because the stop word list is lowercase.
        word = word.lower()
        # Keep the word in filtered_words only if it is not a stop word.
        if word not in stop_words:
            filtered_words.append(word)
    all_tokens.append(filtered_words)

print(all_tokens)

Output

[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], 
['see', 'window', 'television', '.'], 
['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]

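We downloaded NLTK's stopwords list, combined sent_tokenize and word_tokenize to tokenize every word in the document, and then filtered the stop words out of the resulting token lists, leaving only the words that are meaningful for analysis. Stop words such as is and this have been removed by the filter.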
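The claim that stop words dominate simple frequency counts can be checked on the same sample text. The sketch below uses only the standard library; SMALL_STOP_WORDS is a hand-picked, hypothetical subset chosen for this example and is not NLTK's list.

```python
import re
from collections import Counter

text_sample = ('The Matrix is everywhere its all around us, here even in this room. '
               'You can see it out your window or on your television. '
               'You feel it when you go to work, or go to church or pay your taxes.')

# Hypothetical mini stop word list, just large enough for this sample.
SMALL_STOP_WORDS = {'the', 'is', 'its', 'all', 'here', 'in', 'this', 'you', 'can',
                    'it', 'out', 'your', 'or', 'on', 'when', 'to'}

tokens = re.findall(r"[a-z]+", text_sample.lower())

raw_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in SMALL_STOP_WORDS)

# Before filtering, the most frequent tokens are all grammatical words.
print(raw_counts.most_common(3))       # [('you', 3), ('your', 3), ('or', 3)]
# After filtering, a content word comes out on top.
print(filtered_counts.most_common(1))  # [('go', 2)]
```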

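5. Root Word Extraction (Stemming and Lemmatization)

In many languages, words change form according to grammatical factors. In English, a base word changes under many conditions, such as past/present tense, third-person singular, and the progressive form. Stemming and Lemmatization both find the base form of a word that has changed grammatically or semantically.

Both aim to find the base form of a word, but Lemmatization is more sophisticated than Stemming and finds the base form from a semantic point of view. NLTK provides both Stemmers and Lemmatizers. First, let's use NLTK's LancasterStemmer to see how a Stemmer behaves.

Example code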

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))

Output

work work work
amus amus amus
happy happiest
fant fanciest

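For work, the progressive (working), third-person singular (works), and past tense (worked) are all simple transformations that append ing, s, or ed to the base word work, so the stemmer correctly recognizes work as the base form. For amuse, however, each variant can be read as amus followed by ing, es, or ed, so amus is recognized as the base form. For the adjectives happy and fancy as well, the stemmer does not reliably find the base form of the comparative and superlative: happier is reduced to happy, but happiest and fanciest are left unchanged, and fancier becomes fant, a stem spelled differently from the original word.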
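Why amus comes out instead of amuse is easier to see with a toy suffix stripper. The code below is a deliberately simplified sketch: naive_stem and its SUFFIXES tuple are hypothetical, and this is not the Lancaster algorithm, which applies a much larger rule set. It only shows that chopping off endings, without knowing the vocabulary, cannot restore a dropped final e.

```python
# Hypothetical suffix list for this sketch, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([naive_stem(w) for w in ("working", "works", "worked")])
# ['work', 'work', 'work']
print([naive_stem(w) for w in ("amusing", "amuses", "amused")])
# ['amus', 'amus', 'amus']
```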

The example below performs Lemmatization with WordNetLemmatizer.

Example code

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

lemma = WordNetLemmatizer()
print(lemma.lemmatize('amusing','v'),lemma.lemmatize('amuses','v'),lemma.lemmatize('amused','v'))
print(lemma.lemmatize('happier','a'),lemma.lemmatize('happiest','a'))
print(lemma.lemmatize('fancier','a'),lemma.lemmatize('fanciest','a'))

Output

amuse amuse amuse
happy happy
fancy fancy

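In general, Lemmatization needs the part of speech of the word as input in order to extract the base form accurately. As the example shows, lemmatize() takes 'v' for a verb and 'a' for an adjective as its second argument. The result is more accurate than that of the Stemmer in the previous example.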
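The role of the part-of-speech argument can be illustrated with a toy lookup table. Everything here is hypothetical: LEMMA_TABLE and toy_lemmatize are made up for this note and stand in for a real dictionary such as WordNet. The default of 'n' (noun) is chosen to mirror the default pos of WordNetLemmatizer.lemmatize().

```python
# Hypothetical (word, part of speech) -> lemma table for illustration only.
LEMMA_TABLE = {
    ("amusing", "v"): "amuse",
    ("amuses", "v"): "amuse",
    ("amused", "v"): "amuse",
    ("happier", "a"): "happy",
    ("happiest", "a"): "happy",
    ("saw", "v"): "see",   # past tense of the verb "see"
    ("saw", "n"): "saw",   # the cutting tool
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))  # see saw
print(toy_lemmatize("amusing", "v"))                         # amuse
print(toy_lemmatize("amusing"))                              # amusing
```

The same surface form maps to different lemmas depending on its part of speech, and a word looked up under the wrong part of speech is returned unchanged, which is why passing the correct tag matters.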

References

파이썬 머신러닝 완벽 가이드 (Python Machine Learning Perfect Guide)
