Learning NLP from GPT (String Preprocessing)

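Me: What string preprocessing techniques are there?

GPT:

String preprocessing techniques clean up text data and convert it into a form that can be analyzed. There are many of them, and the right choice depends on your goal and on the characteristics of the data. Here are some of the main ones: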

1. Lowercasing

Convert the text to lowercase so that differently cased forms of the same word ("Hello", "hello") are treated as one.

text = "Hello World"
text = text.lower()  # "hello world"
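For text that is not plain English, `lower()` may not be enough to make two spellings compare equal. As a small sketch using only the standard library, `str.casefold()` applies a more aggressive, Unicode-aware folding. The sample words are only illustrative.

```python
# Compare lower() and casefold() on a word containing the German sharp s.
words = ["Hello", "STRASSE", "Straße"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # ['hello', 'strasse', 'straße']
print(folded)   # ['hello', 'strasse', 'strasse']
```

With `casefold()`, the last two words become identical, which is usually what you want when the goal is matching rather than display.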

2. Removing Punctuation

Remove punctuation from the text so that only the words remain.

import string

text = "Hello, world!"
text = text.translate(str.maketrans('', '', string.punctuation))  # "Hello world"
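One thing to watch for: deleting punctuation outright can glue neighbouring words together. A possible variant, sketched below with an illustrative sentence, maps each punctuation character to a space instead and then collapses the extra spaces.

```python
import string

text = "A well-known trick,isn't it?"

# Delete punctuation outright: neighbouring words can merge.
deleted = text.translate(str.maketrans('', '', string.punctuation))

# Map every punctuation character to a space, then collapse the extra spaces.
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = ' '.join(text.translate(to_space).split())

print(deleted)  # "A wellknown trickisnt it"
print(spaced)   # "A well known trick isn t it"
```

Neither result is perfect (contractions are split in the second one), so which variant fits depends on the data. Note also that `string.punctuation` only covers ASCII punctuation.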

3. Removing Stop Words

Remove common words that carry little meaning on their own (e.g. "and", "the").

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead

stop_words = set(stopwords.words('english'))
text = "This is a sample sentence"
words = word_tokenize(text)
# "This" is dropped too, because the comparison uses word.lower()
filtered_words = [word for word in words if word.lower() not in stop_words]  # ['sample', 'sentence']
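The filtering idea itself does not depend on NLTK. Here is a minimal sketch with a tiny hand-written stop word set. Both `STOP_WORDS` and `remove_stop_words` are made-up names for illustration, and a real list would be much longer.

```python
# A tiny, hand-written stop word set (hypothetical; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "is", "the", "this"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "This is a sample sentence".split()
print(remove_stop_words(tokens))  # ['sample', 'sentence']

# Domain-specific words can be added without changing the function.
custom = STOP_WORDS | {"sample"}
print(remove_stop_words(tokens, custom))  # ['sentence']
```

Keeping the stop words in a set makes each membership check fast, and passing the set as a parameter makes it easy to extend for a specific domain.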

4. Tokenization

Split the text into words or sentences.

from nltk.tokenize import word_tokenize

text = "This is a sample sentence."
tokens = word_tokenize(text)  # ['This', 'is', 'a', 'sample', 'sentence', '.']
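If NLTK is not available, a rough approximation of both sentence and word tokenization can be sketched with regular expressions. The patterns below are simplifying assumptions (they will mishandle abbreviations such as "e.g."), not a replacement for a trained tokenizer.

```python
import re

text = "This is a sample sentence. Here is another one!"

# Sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r'(?<=[.!?])\s+', text)

# Word split: runs of word characters, or any single non-space symbol.
tokens = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)  # ['This is a sample sentence.', 'Here is another one!']
print(tokens)     # ['This', 'is', 'a', 'sample', 'sentence', '.']
```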

5. Stemming

Reduce each word to its stem by stripping suffixes with fixed rules. The result is not always a dictionary word.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
text = "running runs runner"
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]  # ['run', 'run', 'runner']
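To get a feel for what rule-based suffix stripping means, here is a deliberately naive sketch. It is not the Porter algorithm; the suffix list, the `naive_stem` name, and the minimum stem length are all assumptions made for illustration.

```python
# Naive suffix stripping: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ["jumping", "jumped", "jumps", "studies", "is"]])
# ['jump', 'jump', 'jump', 'studi', 'is']
```

The "studi" result shows why stems are best treated as grouping keys rather than readable words.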

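6. Lemmatization

Reduce each word to its lemma, i.e. its dictionary form. The result depends on the part of speech you pass in: below every word is treated as a verb, so the noun "runner" is left unchanged.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
text = "running runs runner"
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in words]  # ['run', 'run', 'runner']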
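Why the part of speech matters can be shown with a toy dictionary lookup. The `LEMMAS` table and `lookup_lemma` function below are hypothetical and far from complete; they only sketch the idea that the same surface form can map to different lemmas.

```python
# Tiny (word, part-of-speech) -> lemma table; hypothetical and far from complete.
LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def lookup_lemma(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(lookup_lemma("leaves", "n"))  # leaf
print(lookup_lemma("leaves", "v"))  # leave
print(lookup_lemma("runner", "v"))  # runner
```

In practice the part of speech would come from a POS tagger rather than being hard-coded.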

7. Spelling Correction

Correct misspelled words.

from textblob import TextBlob

text = "I havv goood speling"
corrected_text = str(TextBlob(text).correct())  # "I have good spelling"
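If you would rather not add a dependency, a very rough corrector can be sketched with the standard library's `difflib`, which picks the closest entry from a list of known words. The `VOCABULARY` list and `correct_word` function are assumptions for illustration; a usable corrector needs a full dictionary and word frequencies.

```python
import difflib

# Small known-word list (hypothetical); a real corrector needs a full dictionary.
VOCABULARY = ["i", "have", "good", "spelling"]

def correct_word(word, vocabulary=VOCABULARY):
    """Replace word with its closest vocabulary entry, if one is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word

text = "I havv goood speling"
print(' '.join(correct_word(w) for w in text.split()))  # "i have good spelling"
```

Note that this sketch lowercases every word it matches, and words with no close match are returned unchanged.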

8. Removing Numbers and Special Characters

Remove digits and special characters from the text.

import re

text = "Hello world! 123"
text = re.sub(r'\d+', '', text)  # "Hello world! "
text = re.sub(r'\W+', ' ', text)  # "Hello world "
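It helps to know exactly what `\W` counts as "special". In Python 3 string patterns, letters of any script, digits, and the underscore are all word characters, so they survive. The sketch below contrasts that with an explicit ASCII-letter whitelist; the sample string is only illustrative.

```python
import re

text = "snake_case & 100% café"

# \W keeps letters of any script, digits and "_" (all are word characters).
word_chars = re.sub(r'\W+', ' ', text).strip()

# An explicit ASCII-letter whitelist removes everything else, including "é".
ascii_only = re.sub(r'[^A-Za-z]+', ' ', text).strip()

print(word_chars)  # "snake_case 100 café"
print(ascii_only)  # "snake case caf"
```

The whitelist is stricter but damages non-English words, so it is only a reasonable choice when the text is known to be plain English.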

9. Removing Extra Whitespace

Remove unnecessary whitespace.

text = "This   is  a   sample   sentence."
text = ' '.join(text.split())  # "This is a sample sentence."
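The `split()`/`join()` idiom also flattens line breaks. If the line structure matters (for example one record per line), a possible variant is to clean each line separately, as in this sketch with made-up input:

```python
import re

text = "First   line  \n\n   second \t line  "

# Collapse spaces and tabs inside each line, and drop lines that end up empty.
lines = (re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines())
cleaned = '\n'.join(line for line in lines if line)

print(repr(cleaned))  # 'First line\nsecond line'
```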

10. Using Regular Expressions

Use regular expressions to find and transform specific patterns, such as URLs.

import re

text = "Visit https://example.com"
text = re.sub(r'http\S+', '', text)  # "Visit "
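The same approach extends to other noisy patterns. The sketch below strips HTML-like tags, e-mail-like tokens, and URLs with or without a scheme, and keeps hashtag text while dropping the "#". The patterns are intentionally simple assumptions, not robust parsers.

```python
import re

text = "Mail me@example.com or visit www.example.com <b>now</b> #nlp"

text = re.sub(r'<[^>]+>', ' ', text)                 # HTML-like tags
text = re.sub(r'\S+@\S+', ' ', text)                 # e-mail-like tokens
text = re.sub(r'(?:https?://|www\.)\S+', ' ', text)  # URLs with or without scheme
text = re.sub(r'#(\w+)', r'\1', text)                # keep hashtag text, drop "#"
text = ' '.join(text.split())                        # collapse leftover spaces

print(text)  # "Mail or visit now nlp"
```

Replacing each match with a space rather than an empty string avoids accidentally joining the words on either side.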

Combined Example

Here is an example that combines several of the preprocessing techniques above:

import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('stopwords')
nltk.download('punkt')  # newer NLTK releases may need 'punkt_tab' instead
nltk.download('wordnet')

def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Collapse extra whitespace
    text = ' '.join(text.split())
    # Tokenize
    words = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    # Lemmatize, treating every word as a verb
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word, pos=wordnet.VERB) for word in words]
    return ' '.join(words)

# Example data
text = "Running a test sentence with http://example.com and numbers 1234."
processed_text = preprocess(text)
print(processed_text)  # "run test sentence number"
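When the set of steps changes from project to project, it can be convenient to keep each step as its own small function and compose them. The sketch below does this with the standard library only, so it leaves out the NLTK-based steps; `make_pipeline` and the step names are hypothetical helpers, not part of any library.

```python
import re
import string
from functools import reduce

def lowercase(text):
    return text.lower()

def strip_urls(text):
    return re.sub(r'http\S+', '', text)

def strip_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_digits(text):
    return re.sub(r'\d+', '', text)

def squeeze_spaces(text):
    return ' '.join(text.split())

def make_pipeline(*steps):
    """Return a function that applies the given steps from left to right."""
    return lambda text: reduce(lambda acc, step: step(acc), steps, text)

clean = make_pipeline(lowercase, strip_urls, strip_punctuation,
                      strip_digits, squeeze_spaces)

docs = ["Running a test sentence with http://example.com and numbers 1234.",
        "  Another   DOC, with 42 tokens!  "]
print([clean(d) for d in docs])
# ['running a test sentence with and numbers', 'another doc with tokens']
```

Steps can then be added, removed, or reordered by editing the single `make_pipeline(...)` call.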
