English lemmatization (spaCy "en_core_web_sm" vs NLTK)
Creating example sentences
In [1]:
sentences = [
    "the biggest apple in many apples sold best !",
    "The talented musician played a beautiful melody yesterday.",
    "Dancing in the rain is my favorite activity.",
    "I enjoy swimming in the ocean during the summer.",
    "I was working late at the office last night."
]
sentences[0]
Out[1]:
'the biggest apple in many apples sold best !'
Tokenization
- The process of finding the smallest units of analysis in a sentence that will be lemmatized
In [2]:
from nltk.tokenize import RegexpTokenizer
# Note: RegexpTokenizer with the [\w]+ pattern tokenizes a sentence while dropping special characters.
# Example: "I will be back !!! @.@ #cat" -> ['I', 'will', 'be', 'back', 'cat']
retokenize = RegexpTokenizer(r"[\w]+")
token_sentences = []
for sentence in sentences:
    token_sentences.append(retokenize.tokenize(sentence))
token_sentences[0]
Out[2]:
['the', 'biggest', 'apple', 'in', 'many', 'apples', 'sold', 'best']
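As a quick sanity check of the example given in the comment above, the same tokenizer drops punctuation-only tokens and strips the # from the hashtag (reusing the retokenize object defined above):

retokenize.tokenize("I will be back !!! @.@ #cat")
# -> ['I', 'will', 'be', 'back', 'cat']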
Lemmatization with spaCy
In [3]:
import spacy
# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")
spacy_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in token_sentence:
        doc = nlp(token)
        lemma = doc[0].lemma_
        lemmatizing_sentence.append(lemma)
    spacy_lemmatizing_sentences.append(lemmatizing_sentence)
spacy_lemmatizing_sentences[0]
Out[3]:
['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'good']
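Note that the loop above runs nlp() on each token in isolation, so spaCy cannot use any sentence context even though its lemmatizer is context-sensitive. A minimal sketch of the more usual approach, feeding the whole sentence to the nlp object loaded above and letting spaCy do the tokenization:

# Lemmatize a whole sentence at once so spaCy can use context;
# is_alpha filters out the trailing "!" much like RegexpTokenizer did.
doc = nlp(sentences[0])
[token.lemma_ for token in doc if token.is_alpha]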
Lemmatization with NLTK
In [4]:
from nltk.stem import WordNetLemmatizer
import warnings
warnings.filterwarnings(action='ignore')
lemmatizer = WordNetLemmatizer()
nltk_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in token_sentence:
        lemma = lemmatizer.lemmatize(token)
        lemmatizing_sentence.append(lemma)
    nltk_lemmatizing_sentences.append(lemmatizing_sentence)
nltk_lemmatizing_sentences[0]
Out[4]:
['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
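Here only 'apples' changed: WordNetLemmatizer.lemmatize() defaults to pos='n', so verbs and adjectives such as 'sold' and 'biggest' pass through untouched. Passing the part of speech explicitly makes the difference, for example:

lemmatizer.lemmatize("sold")             # 'sold' - treated as a noun by default
lemmatizer.lemmatize("sold", pos="v")    # 'sell'
lemmatizer.lemmatize("biggest", pos="a") # 'big'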
In [5]:
### Compare the lemmatization results of the two packages
for spa, nlt in zip(spacy_lemmatizing_sentences, nltk_lemmatizing_sentences):
    print("spacy :", spa)
    print("nltk :", nlt)
    print("---")
spacy : ['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'good']
nltk : ['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
---
spacy : ['the', 'talente', 'musician', 'play', 'a', 'beautiful', 'melody', 'yesterday']
nltk : ['The', 'talented', 'musician', 'played', 'a', 'beautiful', 'melody', 'yesterday']
---
spacy : ['dancing', 'in', 'the', 'rain', 'be', 'my', 'favorite', 'activity']
nltk : ['Dancing', 'in', 'the', 'rain', 'is', 'my', 'favorite', 'activity']
---
spacy : ['I', 'enjoy', 'swim', 'in', 'the', 'ocean', 'during', 'the', 'summer']
nltk : ['I', 'enjoy', 'swimming', 'in', 'the', 'ocean', 'during', 'the', 'summer']
---
spacy : ['I', 'be', 'work', 'late', 'at', 'the', 'office', 'last', 'night']
nltk : ['I', 'wa', 'working', 'late', 'at', 'the', 'office', 'last', 'night']
---
[Note] When lemmatizing with the nltk package, specifying the part of speech yields more accurate lemmas, although the code gets a little more complicated...
In [6]:
from nltk.tag import pos_tag
# Map each token's Penn Treebank tag to the matching WordNet POS argument before lemmatizing.
nltk_pos_lemmatizing_sentences = []
for token_sentence in token_sentences:
    lemmatizing_sentence = []
    for token in pos_tag(token_sentence):
        if token[1] in ["JJ", "JJR", "JJS"]:
            lemma = lemmatizer.lemmatize(token[0], pos="a")
        elif token[1] in ["NN", "NNP", "NNS", "PRP", "PRP$"]:
            lemma = lemmatizer.lemmatize(token[0], pos="n")
        elif token[1] in ["RB", "RBR", "RBS"]:
            lemma = lemmatizer.lemmatize(token[0], pos="r")
        elif token[1] in ["MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
            lemma = lemmatizer.lemmatize(token[0], pos="v")
        else:
            lemma = lemmatizer.lemmatize(token[0])
        lemmatizing_sentence.append(lemma)
    nltk_pos_lemmatizing_sentences.append(lemmatizing_sentence)
nltk_pos_lemmatizing_sentences[0]
Out[6]:
['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'best']
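The chain of elif branches above enumerates the Penn Treebank tags by hand. A more compact sketch of the same idea collapses tags by their first letter and uses the WordNet POS constants; note that get_wordnet_pos is our own helper and differs slightly from the branches above (for example, MD would fall through to the noun default):

from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant (noun by default)."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ   # 'a'
    if treebank_tag.startswith("V"):
        return wordnet.VERB  # 'v'
    if treebank_tag.startswith("R"):
        return wordnet.ADV   # 'r'
    return wordnet.NOUN      # 'n'

[lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tag(token_sentences[0])]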
In [7]:
### Compare nltk lemmatization with and without POS tags
for pre, post in zip(nltk_pos_lemmatizing_sentences, nltk_lemmatizing_sentences):
    print("with pos tag :", pre)
    print("without pos tag :", post)
    print("---")
with pos tag : ['the', 'big', 'apple', 'in', 'many', 'apple', 'sell', 'best']
without pos tag : ['the', 'biggest', 'apple', 'in', 'many', 'apple', 'sold', 'best']
---
with pos tag : ['The', 'talented', 'musician', 'play', 'a', 'beautiful', 'melody', 'yesterday']
without pos tag : ['The', 'talented', 'musician', 'played', 'a', 'beautiful', 'melody', 'yesterday']
---
with pos tag : ['Dancing', 'in', 'the', 'rain', 'be', 'my', 'favorite', 'activity']
without pos tag : ['Dancing', 'in', 'the', 'rain', 'is', 'my', 'favorite', 'activity']
---
with pos tag : ['I', 'enjoy', 'swim', 'in', 'the', 'ocean', 'during', 'the', 'summer']
without pos tag : ['I', 'enjoy', 'swimming', 'in', 'the', 'ocean', 'during', 'the', 'summer']
---
with pos tag : ['I', 'be', 'work', 'late', 'at', 'the', 'office', 'last', 'night']
without pos tag : ['I', 'wa', 'working', 'late', 'at', 'the', 'office', 'last', 'night']
---
- Rather than lemmatizing with the nltk package without POS tags, it looks better to either use nltk with POS tags specified, or to use the spacy package.