[python] Crawling and Analyzing Naver Blogs (2)

크리쓰마스 2021. 5. 28. 12:05


In Part 1, I got as far as crawling blog posts found through Naver blog search.

[python] Crawling and Analyzing Naver Blogs (1)

I was asked to search Naver for a specific keyword, crawl the material that turns up in blogs, and analyze it. Using python selenium webdriver to search for a specific keyword, the blog body text...

yong0810.tistory.com

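This time, starting from the crawled posts, I use KoNLPy, a Python package for Korean language processing, to analyze the morphemes in the text. From that analysis I extract only the verbs and visualize their frequencies as a bar chart and a word cloud.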

More information about KoNLPy can be found on its official site.

Source code

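from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from konlpy.tag import Okt
from nltk import Text
from matplotlib import font_manager, rc
from wordcloud import WordCloud

import matplotlib.pyplot as plt
import time

path = "D:/4_projects/chromedriver.exe"  # path to the ChromeDriver executable

# Start the browser. Selenium 4 takes the driver path through a Service object;
# the positional form webdriver.Chrome(path) only works on older releases.
driver = webdriver.Chrome(service=Service(path))
url_list = []  # collected blog post URLs
content_list = ""  # accumulated body text of all posts
text = "서울식물원"  # search keyword

for i in range(1, 100):  # read search result pages 1 through 99
    url = 'https://section.blog.naver.com/Search/Post.nhn?pageNo=' + str(i) + '&rangeType=ALL&orderBy=sim&keyword=' + text  # search URL for this page
    driver.get(url)
    time.sleep(0.5)  # give the page time to render before querying it

    for j in range(1, 3):  # the first two result entries on each page
        titles = driver.find_element(By.XPATH, '/html/body/ui-view/div/main/div/div/section/div[2]/div['+str(j)+']/div/div[1]/div[1]/a[1]')
        title = titles.get_attribute('href')
        url_list.append(title)

print("URL collection finished; crawling the collected URLs")

for url in url_list:  # visit each saved blog post in turn
    driver.get(url)

    driver.switch_to.frame('mainFrame')  # the post body lives inside the mainFrame frame
    overlays = ".se-component.se-text.se-l-default"  # CSS selector for the body text blocks
    contents = driver.find_elements(By.CSS_SELECTOR, overlays)

    for content in contents:
        content_list = content_list + content.text + " "  # accumulate text; the space keeps adjacent blocks from fusing

driver.quit()  # close the browser once crawling is done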
 
# Use Okt, the morphological analyzer built at Twitter for social-media analysis
okt = Okt()
myList = okt.pos(content_list, norm=True, stem=True)  # tag every morpheme (normalized and stemmed)
myList_filter = [x for x, y in myList if y in ['Verb']]  # keep only the verbs

# Wrap the verbs in an nltk Text; named verb_text so it does not shadow the imported Okt class
verb_text = Text(myList_filter, name="Okt")

# Fix Hangul not rendering in the chart (it shows up as empty boxes)
font_location = "c:/Windows/Fonts/malgun.ttf"
font_name = font_manager.FontProperties(fname=font_location).get_name()
rc('font', family=font_name)

# Axis labels
plt.xlabel("Verb")
plt.ylabel("Frequency")

# Build the x/y values: take the 50 most common verbs, then drop single-character ones
wordInfo = dict()
for tags, counts in verb_text.vocab().most_common(50):
    if len(str(tags)) > 1:
        wordInfo[tags] = counts

values = sorted(wordInfo.values(), reverse=True)
keys = sorted(wordInfo, key=wordInfo.get, reverse=True)

# Draw the bar chart (rotation must be a number, not the string '70')
plt.bar(range(len(wordInfo)), values, align='center')
plt.xticks(range(len(wordInfo)), list(keys), rotation=70)
plt.show()


# Draw the word cloud
wc = WordCloud(width=1000, height=600, background_color="white", font_path=font_location, max_words=50)
plt.imshow(wc.generate_from_frequencies(verb_text.vocab()))
plt.axis("off")
plt.show()
 
 

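Since I am analyzing blog posts, I expected plenty of neologisms and tokens like ㅋㅋㅋ, so I chose Okt, which as far as I know handles that kind of informal text best among the available analyzers.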

First I extract every morpheme, then I pull out only the verbs and plot them.
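
The two steps can be sketched without running the tagger at all. The snippet below assumes a hand-written list of (token, tag) pairs in the shape that `okt.pos` returns; the sample pairs and the helper name `count_verbs` are made up for illustration and are not real crawl output.

```python
from collections import Counter

# Hypothetical sample shaped like okt.pos(...) output: (token, POS tag) pairs.
# These pairs are hand-written for illustration, not real tagger output.
sample_pos = [
    ("서울", "Noun"), ("식물원", "Noun"), ("가다", "Verb"),
    ("보다", "Verb"), ("꽃", "Noun"), ("가다", "Verb"),
    ("예쁘다", "Adjective"), ("먹다", "Verb"), ("보다", "Verb"),
    ("가다", "Verb"),
]


def count_verbs(pos_pairs, min_len=2, top_n=50):
    """Keep tokens tagged 'Verb', drop tokens shorter than min_len,
    and return the top_n (token, count) pairs, most frequent first."""
    verbs = [token for token, tag in pos_pairs if tag == "Verb"]
    counts = Counter(t for t in verbs if len(t) >= min_len)
    return counts.most_common(top_n)


top_verbs = count_verbs(sample_pos)
print(top_verbs)  # [('가다', 3), ('보다', 2), ('먹다', 1)]
```

Note that this sketch drops short tokens before taking the top N, so it can return a full `top_n` entries. The source code above takes the top 50 first and filters afterwards, which can leave fewer than 50 bars.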

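Hangul in matplotlib sometimes renders as empty boxes (ㅁㅁㅁㅁ), so I set a Korean font explicitly. Both the bar chart and the word cloud are limited to the top 50 words.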
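
The hard-coded `malgun.ttf` path only exists on Windows. One possible way to make the font step portable is to try a few candidate paths and use the first one that exists. This is only a sketch: the macOS and Ubuntu paths are assumptions about common install locations, and `pick_font_path` is a hypothetical helper, not part of the original code.

```python
import os

# Candidate Korean font files. The first entry is the path used in this post;
# the others are assumed common locations on macOS and Ubuntu (fonts-nanum)
# and may differ on your machine.
FONT_CANDIDATES = [
    "c:/Windows/Fonts/malgun.ttf",
    "/System/Library/Fonts/AppleSDGothicNeo.ttc",
    "/usr/share/fonts/truetype/nanum/NanumGothic.ttf",
]


def pick_font_path(candidates=FONT_CANDIDATES):
    """Return the first candidate path that exists on disk, or None."""
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


font_location = pick_font_path()
if font_location is None:
    print("No Korean font found; Hangul may render as empty boxes.")
else:
    print("Using font:", font_location)
```

The returned path can then be passed to `font_manager.FontProperties(fname=...)` and to `WordCloud(font_path=...)` in place of the fixed string.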

Results

The end
