[Machine Learning] Fake News Classification Model - Data Collection and Inspection

Let's build a classification model that tells real news from fake news. The dataset can be downloaded from Kaggle (the "Fake and Real News Dataset").

In the end, we get a model with 98.59% accuracy.

import pandas as pd

# Paths to the two CSV files from the Kaggle dataset
fake_path = './data/fake.csv'
true_path = './data/true.csv'
fake_df = pd.read_csv(fake_path)
true_df = pd.read_csv(true_path)

First, load the data with pd.read_csv.

print(fake_df.tail(3))

                                                   title  \
23478  Sunnistan: US and Allied ‘Safe Zone’ Plan to T...   
23479  How to Blow $700 Million: Al Jazeera America F...   
23480  10 U.S. Navy Sailors Held by Iranian Military ...   

                                                    text      subject  \
23478  Patrick Henningsen  21st Century WireRemember ...  Middle-east   
23479  21st Century Wire says Al Jazeera America will...  Middle-east   
23480  21st Century Wire says As 21WIRE predicted in ...  Middle-east   

                   date  
23478  January 15, 2016  
23479  January 14, 2016  
23480  January 12, 2016

Check the data in fake_df. It has four columns: title, text, subject, and date.

print(true_df.tail(3))

                                                   title  \
21414  Minsk cultural hub becomes haven from authorities   
21415  Vatican upbeat on possibility of Pope Francis ...   
21416  Indonesia to buy $1.14 billion worth of Russia...   

                                                    text    subject  \
21414  MINSK (Reuters) - In the shadow of disused Sov...  worldnews   
21415  MOSCOW (Reuters) - Vatican Secretary of State ...  worldnews   
21416  JAKARTA (Reuters) - Indonesia will buy 11 Sukh...  worldnews   

                   date  
21414  August 22, 2017   
21415  August 22, 2017   
21416  August 22, 2017

true_df has the same structure as fake_df.
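Rather than comparing the two printouts by eye, the column check can also be done in code. This is a minimal sketch: fake_demo and true_demo are hypothetical one-row stand-ins for fake_df and true_df, so it runs without the CSV files.

```python
import pandas as pd

# Hypothetical one-row stand-ins for fake_df and true_df (not real dataset rows).
fake_demo = pd.DataFrame(
    {"title": ["t1"], "text": ["body1"], "subject": ["News"], "date": ["January 1, 2016"]}
)
true_demo = pd.DataFrame(
    {"title": ["t2"], "text": ["body2"], "subject": ["worldnews"], "date": ["August 22, 2017"]}
)

# Columns the post reports for both files.
expected_columns = ["title", "text", "subject", "date"]

# True only if both frames have exactly these columns, in this order.
same_structure = (
    list(fake_demo.columns) == expected_columns
    and list(true_demo.columns) == expected_columns
)
print(same_structure)
```

With the real data, swapping in fake_df and true_df should give the same check.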

fake_df['label'] = 1
true_df['label'] = 0

Add a label column to tell real news from fake news: 1 means fake news and 0 means real news.

result_df = pd.concat([fake_df, true_df], ignore_index=True)

Now that the labels are in place, concatenate fake_df and true_df and store the result in result_df. With ignore_index=True, the index each DataFrame originally had is discarded and a new one, running from 0, is assigned.
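To see what ignore_index=True changes, here is a small sketch on two toy frames. The names left and right and their rows are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for fake_df (label 1) and true_df (label 0).
left = pd.DataFrame({"title": ["f0", "f1"], "label": [1, 1]})
right = pd.DataFrame({"title": ["t0", "t1", "t2"], "label": [0, 0, 0]})

# Default behaviour: each frame keeps its own index, so values repeat.
kept = pd.concat([left, right])

# ignore_index=True: the result gets a fresh 0..n-1 index.
renumbered = pd.concat([left, right], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]
```

Without the fresh index, a lookup such as kept.loc[0] would return two rows, one from each source frame.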

print(result_df[['title','label']].tail(3))

                                                   title  label
44895  Minsk cultural hub becomes haven from authorities      0
44896  Vatican upbeat on possibility of Pope Francis ...      0
44897  Indonesia to buy $1.14 billion worth of Russia...      0

Confirm that the rows were stored in result_df. The last index, 44897, matches the 23,481 fake rows plus the 21,417 real rows.

result_df['message_length'] = result_df['text'].apply(len)
result_df['message_length'] += result_df['title'].apply(len)

Create a new column, message_length, that holds the combined character count of text and title.
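The same column can also be built with the vectorized .str.len() accessor. The sketch below assumes, purely for illustration, that a text cell might be missing; whether the real dataset has such rows is not checked here. In that case apply(len) would raise a TypeError, while .str.len() returns a missing value that can be filled with 0.

```python
import pandas as pd

# Toy rows (hypothetical); the second one has a missing text cell.
demo = pd.DataFrame(
    {"title": ["Short title", "Another"], "text": ["Body text.", None]}
)

# .str.len() yields a missing value for missing cells instead of raising,
# so fill those with 0 before adding the two lengths together.
lengths = demo["text"].str.len().fillna(0) + demo["title"].str.len().fillna(0)
demo["message_length"] = lengths.astype(int)

print(demo["message_length"].tolist())  # [21, 7]
```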

print(result_df['message_length'].tail(3))

44895    1999
44896    1260
44897    1390
Name: message_length, dtype: int64

Confirm that message_length was populated.

The left plot (blue) shows the message_length distribution for real news, and the right plot (red) shows it for fake news. As the plots show, most fake news articles have a low character count.
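A numeric summary per label can complement the plots. This is only a sketch: the message_length values are invented to show the mechanics and say nothing about the actual dataset.

```python
import pandas as pd

# Invented lengths for illustration only: label 1 = fake, label 0 = real.
demo = pd.DataFrame(
    {
        "label": [1, 1, 1, 0, 0, 0],
        "message_length": [300, 450, 600, 1200, 1500, 2100],
    }
)

# Count, mean and median of message_length within each label.
summary = demo.groupby("label")["message_length"].agg(["count", "mean", "median"])
print(summary)
```

Running the same groupby on result_df would give the per-class figures that the histograms show visually.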

청청이chungchung
