English lemmatization (spaCy "en_core_web_sm" vs NLTK)
Creating example sentences
In [1]:
sentences = [
    "the biggest apple in many apples sold best !",
    "The talented musician played a beautiful melody yesterday.",
    "Dancing in the rain is my favorite activity.",
    "I enjoy swimming in the ocean during the summer.",
    "I was working late at the office last night."
]
sentences[0]
Out[1]:
'the biggest apple in many apples sold best !'
Tokenization
- The process of finding the smallest units of analysis in a sentence that will be lemmatized
In [2]:
from nltk.tokenize import RegexpTokenizer
# Note: RegexpTokenizer with the [\w]+ pattern tokenizes a sentence while dropping special characters.
# Example: "I will be back !!! @.@ #cat" -> ['I', 'will', 'be', 'back', 'cat']
retokenize = RegexpTokenizer(r"[\w]+")
token_sentences = []
for sentence in sentences:
    token_sentences.append(retokenize.tokenize(sentence))
token_sentences[0]
Out[2]:
['the', 'biggest', 'apple', 'in', 'many', 'apples', 'sold', 'best']
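As a quick sanity check of the example given in the comment above, the same tokenizer drops punctuation-only tokens and strips the # from the hashtag (reusing the retokenize object defined above):

retokenize.tokenize("I will be back !!! @.@ #cat")
# -> ['I', 'will', 'be', 'back', 'cat']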
Lemmatization with spaCy
In [3]:
import spacy
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")
spacy_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in token_sentence:
        doc = nlp(token)
        lemma = doc[0].lemma_
        lemmatizing_sentence.append(lemma)
    spacy_lemmatizing_sentences.append(lemmatizing_sentence)
spacy_lemmatizing_sentences[0]
Out[3]:
['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'good']
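Note that the loop above runs nlp() on each token in isolation, so spaCy cannot use any sentence context even though its lemmatizer is context-sensitive. A minimal sketch of the more usual approach, feeding the whole sentence to the nlp object loaded above and letting spaCy do the tokenization:

# Lemmatize a whole sentence at once so spaCy can use context;
# is_alpha filters out the trailing "!" much like RegexpTokenizer did.
doc = nlp(sentences[0])
[token.lemma_ for token in doc if token.is_alpha]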
Lemmatization with NLTK
In [4]:
from nltk.stem import WordNetLemmatizer
import warnings
warnings.filterwarnings(action='ignore')
lemmatizer = WordNetLemmatizer()
nltk_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in token_sentence:
        lemma = lemmatizer.lemmatize(token)
        lemmatizing_sentence.append(lemma)
    nltk_lemmatizing_sentences.append(lemmatizing_sentence)
nltk_lemmatizing_sentences[0]
Out[4]:
['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
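Here only 'apples' changed: WordNetLemmatizer.lemmatize() defaults to pos='n', so verbs and adjectives such as 'sold' and 'biggest' pass through untouched. Passing the part of speech explicitly makes the difference, for example:

lemmatizer.lemmatize("sold")             # 'sold' - treated as a noun by default
lemmatizer.lemmatize("sold", pos="v")    # 'sell'
lemmatizer.lemmatize("biggest", pos="a") # 'big'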
In [5]:
### Compare the lemmatization results of the two packages
for spa, nlt in zip(spacy_lemmatizing_sentences, nltk_lemmatizing_sentences):
    print("spacy :", spa)
    print("nltk :", nlt)
    print("---")
spacy : ['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'good']
nltk : ['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
---
spacy : ['the', 'talente', 'musician', 'play', 'a', 'beautiful', 'melody', 'yesterday']
nltk : ['The', 'talented', 'musician', 'played', 'a', 'beautiful', 'melody', 'yesterday']
---
spacy : ['dancing', 'in', 'the', 'rain', 'be', 'my', 'favorite', 'activity']
nltk : ['Dancing', 'in', 'the', 'rain', 'is', 'my', 'favorite', 'activity']
---
spacy : ['I', 'enjoy', 'swim', 'in', 'the', 'ocean', 'during', 'the', 'summer']
nltk : ['I', 'enjoy', 'swimming', 'in', 'the', 'ocean', 'during', 'the', 'summer']
---
spacy : ['I', 'be', 'work', 'late', 'at', 'the', 'office', 'last', 'night']
nltk : ['I', 'wa', 'working', 'late', 'at', 'the', 'office', 'last', 'night']
---
[Note] When lemmatizing with the nltk package, specifying the part of speech yields more accurate lemmas, although the code gets a little more complicated...
In [6]:
from nltk.tag import pos_tag
# Map each token's Penn Treebank tag to the matching WordNet POS argument before lemmatizing.
nltk_pos_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in pos_tag(token_sentence):
        if token[1] in ["JJ", "JJR", "JJS"]:
            lemma = lemmatizer.lemmatize(token[0], pos="a")
        elif token[1] in ["NN", "NNP", "NNS", "PRP", "PRP$"]:
            lemma = lemmatizer.lemmatize(token[0], pos="n")
        elif token[1] in ["RB", "RBR", "RBS"]:
            lemma = lemmatizer.lemmatize(token[0], pos="r")
        elif token[1] in ["MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
            lemma = lemmatizer.lemmatize(token[0], pos="v")
        else:
            lemma = lemmatizer.lemmatize(token[0])
        lemmatizing_sentence.append(lemma)
    nltk_pos_lemmatizing_sentences.append(lemmatizing_sentence)
nltk_pos_lemmatizing_sentences[0]
Out[6]:
['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'best']
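The chain of elif branches above enumerates the Penn Treebank tags by hand. A more compact sketch of the same idea collapses tags by their first letter and uses the WordNet POS constants; note that get_wordnet_pos is our own helper and differs slightly from the branches above (for example, MD would fall through to the noun default):

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ   # 'a'
    if treebank_tag.startswith("V"):
        return wordnet.VERB  # 'v'
    if treebank_tag.startswith("R"):
        return wordnet.ADV   # 'r'
    return wordnet.NOUN      # 'n'

[lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tag(token_sentences[0])]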
In [7]:
### Compare nltk lemmatization with and without POS tags
for pre, post in zip(nltk_pos_lemmatizing_sentences, nltk_lemmatizing_sentences):
    print("with pos tag :", pre)
    print("without pos tag :", post)
    print("---")
with pos tag : ['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'best']
without pos tag : ['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
---
with pos tag : ['The', 'talented', 'musician', 'play', 'a', 'beautiful', 'melody', 'yesterday']
without pos tag : ['The', 'talented', 'musician', 'played', 'a', 'beautiful', 'melody', 'yesterday']
---
with pos tag : ['Dancing', 'in', 'the', 'rain', 'be', 'my', 'favorite', 'activity']
without pos tag : ['Dancing', 'in', 'the', 'rain', 'is', 'my', 'favorite', 'activity']
---
with pos tag : ['I', 'enjoy', 'swim', 'in', 'the', 'ocean', 'during', 'the', 'summer']
without pos tag : ['I', 'enjoy', 'swimming', 'in', 'the', 'ocean', 'during', 'the', 'summer']
---
with pos tag : ['I', 'be', 'work', 'late', 'at', 'the', 'office', 'last', 'night']
without pos tag : ['I', 'wa', 'working', 'late', 'at', 'the', 'office', 'last', 'night']
---
- Rather than lemmatizing with the nltk package without POS tags, it looks better to either use nltk with POS tags specified, or to use the spacy package.