Visualizing the Global COVID-19 Situation (1): Crawling Coronavirus Data

Ming's blog

H._.ming 2020. 2. 16. 23:33

To get a picture of where COVID-19 stands, I want to visualize data on the international coronavirus situation.

I will use the data provided by worldometer.

(1) Crawling the coronavirus data

1. Import the pandas, requests, and BeautifulSoup modules needed for crawling.

# Load the modules
import pandas as pd
import requests
from bs4 import BeautifulSoup

2. Fetch the HTML source of the worldometer page with requests, then parse it with BeautifulSoup.

req = requests.get('https://www.worldometers.info/coronavirus/')
html=req.text
soup=BeautifulSoup(html,'html.parser')
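The fetch step above can fail silently if the site is slow or returns an error page. Below is a minimal sketch of a more defensive version. The helper names `fetch_html` and `parse_html` and the 10-second timeout are assumptions made for illustration, not part of the original post.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.worldometers.info/coronavirus/"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_html(html: str) -> BeautifulSoup:
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Typical use (requires network access):
# soup = parse_html(fetch_html(URL))
```

Keeping the download and the parsing apart makes it possible to try out the parsing steps on a saved or hand-written HTML string without hitting the site each time.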

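3. Extract the column names of the international coronavirus status table.

The international coronavirus data sits inside the table whose id is table3.

Each column name is inside <table id="table3"><thead><tr><th>, so we use select to retrieve the values as follows.

# Extract the column headers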
columns=soup.select('table#table3 > thead > tr > th')
columns

[<th width="70"><span class="style1">Country,<br/> Territory </span></th>,
 <th width="20"><span class="style1"><br/> Total Cases</span></th>,
 <th width="30"><span class="style1">New<br/> Cases</span> </th>,
 <th width="30"><span class="style1"> Total<br/> Deaths</span></th>,
 <th width="30"><span class="style1">New<br/>Deaths </span></th>,
 <th width="30"><span class="style1">Total<br/> Recovered</span></th>,
 <th width="30"><span class="style1">Serious, <br/> Critical </span></th>,
 <th class="hidden" width="30"><span class="style1">Region</span></th>]

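We extract only the text of each element in 'columns' and put it into a list called 'columnlist'.

The strip function removes the leading and trailing whitespace from each column name.

Using the extracted column names, we create a DataFrame called 'df'.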

columnlist=[]
for column in columns:
    columnlist.append(column.text.strip())
df=pd.DataFrame(columns=columnlist)
df

Country, Territory	Total Cases	New Cases	Total Deaths	NewDeaths	Total Recovered	Serious, Critical	Region

As shown above, the data consists of 8 columns in total.
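One detail in the header output is worth a closer look: `.text` joins the pieces around a `<br/>` without any separator, which is why one column comes out as "NewDeaths" and another as "Serious,  Critical" with two spaces. The sketch below runs on a hypothetical, trimmed-down copy of the header row and compares `.text.strip()` with `get_text(" ", strip=True)`. It is offered as an optional alternative, not as the method used in the rest of this post.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened header row modeled on the output shown above.
sample_html = """
<table id="table3">
  <thead>
    <tr>
      <th><span class="style1">Country,<br/> Territory </span></th>
      <th><span class="style1">New<br/>Deaths </span></th>
      <th><span class="style1">Serious, <br/> Critical </span></th>
    </tr>
  </thead>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
headers = soup.select("table#table3 > thead > tr > th")

# .text concatenates the text pieces around <br/> with no separator.
raw_names = [th.text.strip() for th in headers]

# get_text(" ", strip=True) strips each piece and joins them with one space.
clean_names = [th.get_text(" ", strip=True) for th in headers]

print(raw_names)
print(clean_names)
```

If you switch to the cleaned names, the later steps must refer to 'New Deaths' and 'Serious, Critical' instead of the names used in this post.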

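4. Fill the df DataFrame created above with the contents of the table.

The data in table3 is stored in <td> tags.

Each <tr> row in the table body is one observation, and the <td> tags inside it hold the values for that row.

We select every <tr>, use the find_all function to collect its <td> tags, extract their text, and load the resulting rows into the df DataFrame.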

# Each <tr> in the table body is one row; each <td> inside it is one cell
contents=soup.select('table#table3 > tbody > tr')
allobs=[]
for content in contents:
    obs=[]
    tds=content.find_all("td")
    for td in tds:
        obs.append(td.text)
    allobs.append(obs)  # add the completed row to the list of rows

df=pd.DataFrame(columns=columnlist, data=allobs)
df.head(5)

	Country, Territory	Total Cases	New Cases	Total Deaths	NewDeaths	Total Recovered	Serious, Critical	Region
0	China	68,509	+2,017	1,666	+143	9,754	11,272	Asia
1	Japan	414	+76	1		17	9	Asia
2	Singapore	75	+3			19	6	Asia
3	Hong Kong	57	+1	1		2	7	Asia
4	Thailand	34				14	2	Asia

As a result, the DataFrame above is created, but it still contains unneeded ',' and '+' characters.
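The row-by-row extraction in step 4 can be tried offline. The sketch below builds the same kind of DataFrame from a small hand-written table. The HTML string is hypothetical and only reuses a few values from the output above, and the nested list comprehension is simply a compact form of the two loops.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the worldometer table, for offline testing.
sample_html = """
<table id="table3">
  <thead>
    <tr><th>Country</th><th>Total Cases</th><th>New Cases</th></tr>
  </thead>
  <tbody>
    <tr><td>China</td><td>68,509</td><td>+2,017</td></tr>
    <tr><td>Thailand</td><td>34</td><td></td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Column names come from the <th> cells of the header row.
columns = [th.get_text(" ", strip=True)
           for th in soup.select("table#table3 > thead > tr > th")]

# One inner list per <tr>; one string per <td>. Empty cells become "".
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.select("table#table3 > tbody > tr")
]

df = pd.DataFrame(rows, columns=columns)
print(df)
```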

5. Use the replace and applymap functions to remove the unneeded characters (',' and '+').

df=df.applymap(lambda x : x.replace(",",""))
df=df.applymap(lambda x : x.replace("+",""))
df.head(5)

	Country, Territory	Total Cases	New Cases	Total Deaths	NewDeaths	Total Recovered	Serious, Critical	Region
0	China	68509	2017	1666	143	9754	11272	Asia
1	Japan	414	76	1		17	9	Asia
2	Singapore	75	3			19	6	Asia
3	Hong Kong	57	1	1		2	7	Asia
4	Thailand	34				14	2	Asia
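As a side note, `applymap` touches every cell, including the country names, and recent pandas releases (2.1 and later) deprecate it in favor of `DataFrame.map`. A possible alternative, sketched below on a small hypothetical frame, is to clean only the numeric-looking columns with the vectorized `str.replace`. The `numeric_cols` list is an assumption for this example.

```python
import pandas as pd

# Small hypothetical frame with the same kind of raw strings as above.
df = pd.DataFrame({
    "Country": ["China", "Japan", "Thailand"],
    "Total Cases": ["68,509", "414", "34"],
    "New Cases": ["+2,017", "+76", ""],
})

# Columns that should hold numbers (assumed for this example).
numeric_cols = ["Total Cases", "New Cases"]

for col in numeric_cols:
    # Remove thousands separators and leading plus signs in one pass.
    df[col] = df[col].str.replace(r"[,+]", "", regex=True)

print(df)
```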

Checking the current data types shows that every column is object.

df.dtypes

Country, Territory    object
Total Cases           object
New Cases             object
Total Deaths          object
NewDeaths             object
Total Recovered       object
Serious,  Critical    object
Region                object
dtype: object

6. Use the to_numeric function to convert the data from strings to numbers.

df['Total Cases']=pd.to_numeric(df['Total Cases'])
df['New Cases']=pd.to_numeric(df['New Cases'])
df['Total Deaths']=pd.to_numeric(df['Total Deaths'])
df['NewDeaths']=pd.to_numeric(df['NewDeaths'])
df['Total Recovered']=pd.to_numeric(df['Total Recovered'])
df['Serious,  Critical']=pd.to_numeric(df['Serious,  Critical'])
df.head(5)

	Country, Territory	Total Cases	New Cases	Total Deaths	NewDeaths	Total Recovered	Serious, Critical	Region
0	China	68509.0	2017.0	1666.0	143.0	9754.0	11272.0	Asia
1	Japan	414.0	76.0	1.0	NaN	17.0	9.0	Asia
2	Singapore	75.0	3.0	NaN	NaN	19.0	6.0	Asia
3	Hong Kong	57.0	1.0	1.0	NaN	2.0	7.0	Asia
4	Thailand	34.0	NaN	NaN	NaN	14.0	2.0	Asia

df.dtypes

Country, Territory     object
Total Cases           float64
New Cases             float64
Total Deaths          float64
NewDeaths             float64
Total Recovered       float64
Serious,  Critical    float64
Region                 object
dtype: object

As shown above, the columns were converted to numeric types successfully.
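The six near-identical to_numeric lines can also be written as a loop. The sketch below uses a small hypothetical frame and adds `errors="coerce"`, so that any cell that cannot be parsed becomes NaN instead of raising an error. Whether silent coercion is appropriate depends on how much you trust the scraped values.

```python
import pandas as pd

# Small hypothetical frame of already-cleaned strings.
df = pd.DataFrame({
    "Country": ["China", "Japan", "Thailand"],
    "Total Cases": ["68509", "414", "34"],
    "New Cases": ["2017", "76", ""],
})

# Columns to convert (assumed for this example).
numeric_cols = ["Total Cases", "New Cases"]

for col in numeric_cols:
    # Unparseable or empty cells turn into NaN rather than raising.
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```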

7. Replace missing values with empty strings.

# Replace missing values with empty strings
df=df.fillna("")

	Country, Territory	Total Cases	New Cases	Total Deaths	NewDeaths	Total Recovered	Serious, Critical	Region
0	China	68509	2017	1666	143	9754	11272	Asia
1	Japan	414	76	1		17	9	Asia
2	Singapore	75	3			19	6	Asia
3	Hong Kong	57	1	1		2	7	Asia
4	Thailand	34				14	2	Asia
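Filling with "" puts strings back into columns that were just converted to numbers, which can get in the way of later arithmetic. If the blanks only matter for display, one possible alternative is pandas' nullable integer dtype "Int64", which keeps whole numbers and marks missing cells as <NA>. The frame below is hypothetical, and this is a sketch of an option rather than the approach used in this post.

```python
import pandas as pd

# Small hypothetical frame of cleaned strings with one missing value.
df = pd.DataFrame({
    "Country": ["China", "Japan", "Thailand"],
    "New Cases": ["2017", "76", ""],
})

# Convert to numbers first, then to the nullable integer dtype.
df["New Cases"] = pd.to_numeric(df["New Cases"], errors="coerce").astype("Int64")

print(df)
print(df["New Cases"].sum())  # missing values are skipped by default
```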

That completes fetching the international coronavirus data!
