9/6 Data Handling: pandas

햄스텅 2021. 9. 6. 16:18

pp. 41-86

import pandas as pd
titanic_df = pd.read_csv('./data/train.csv')
print('DataFrame shape: ', titanic_df.shape)
titanic_df.describe()
titanic_Pclass = titanic_df['Pclass']
print(type(titanic_Pclass))

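First, I used the code above to get a quick look at the Titanic data.

# Converting a DataFrame to a list
# (df_dict is not defined in this post; the small DataFrame below is an assumed stand-in)

df_dict = pd.DataFrame({'col1': [1, 11], 'col2': [2, 22], 'col3': [3, 33]})
list3 = df_dict.values.tolist()
print('df_dict.values.tolist() type:', type(list3))
print(list3)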

# Creating a new column

titanic_df['Age_0']=0
titanic_df.head(3)

A new column, Age_0, has been created with every value set to 0.

# Creating new columns by transforming existing columns

titanic_df['Age_by_10'] = titanic_df['Age']*10
titanic_df['Family_No'] = titanic_df['SibSp']+titanic_df['Parch']+1
titanic_df.head(3)

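axis 0 is the row axis and axis 1 is the column axis.
So passing axis=1 to the drop() method performs the drop along the column axis, which means it drops columns.
Passing the column name(s) you want as labels drops the specified column(s).

# Dropping the Age_0 column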

titanic_drop_df = titanic_df.drop('Age_0', axis=1)
titanic_drop_df.head(3)

# inplace=True deletes the data from the DataFrame itself, and drop() then returns None

drop_result = titanic_df.drop(['Age_0', 'Age_by_10','Family_No'],axis=1,inplace=True)
print('Value returned by drop with inplace=True:', drop_result)
titanic_df.head(3)
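
To make the difference between the two drop() calls above easier to see, here is a minimal sketch on a toy DataFrame. The DataFrame `df` and its values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22, 38, 26], "Age_0": [0, 0, 0]})

# Without inplace, drop() returns a new DataFrame and leaves df untouched.
dropped = df.drop("Age_0", axis=1)
cols_after_copy_drop = list(df.columns)  # df still contains Age_0 here

# With inplace=True, df itself is modified and the return value is None.
result = df.drop("Age_0", axis=1, inplace=True)

print(cols_after_copy_drop)
print(result, list(df.columns))
```

Because the in-place call returns None, assigning its result back to a variable (as drop_result does above) is only useful for showing that nothing is returned.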

# Dropping rows

pd.set_option('display.width',1000)
pd.set_option('display.max_colwidth',15)
print('#### before axis 0 drop ####')
print(titanic_df.head(3))
titanic_df.drop([0,1,2],axis=0, inplace= True)

print('#### after axis 0 drop ####')
print(titanic_df.head(3))

# Assigning a new index (the old index is kept as a new 'index' column)

titanic_reset_df = titanic_df.reset_index(inplace=False)
titanic_reset_df.head(3)
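
As a rough sketch of what reset_index() does after rows have been dropped, the example below uses a made-up four-row DataFrame (not the Titanic data). The drop=True variant is not used in this post; it is shown only for comparison.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [22, 38, 26, 35]})
df = df.drop([0, 1], axis=0)  # remaining index labels are 2 and 3

# Default behavior: a fresh 0..n-1 index, old labels kept in an 'index' column.
with_old = df.reset_index()

# drop=True discards the old labels instead of keeping them as a column.
without_old = df.reset_index(drop=True)

print(with_old)
print(without_old)
```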

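The [ ] operator placed directly after a DataFrame behaves differently from [ ] on a NumPy array or a Series.
The value inside a DataFrame's [ ] should be used only to select columns by name or as a boolean index.
A slice such as DataFrame[0:2] does return rows, but extracting data this way is best avoided.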
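
The rules above can be sketched on a small made-up DataFrame. The names `df` and `int_lookup_failed` are mine, chosen only for this illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22, 65, 70]})

one_col = df["Age"]              # a column name returns a Series
two_cols = df[["Name", "Age"]]   # a list of column names returns a DataFrame
older = df[df["Age"] > 60]       # a boolean mask returns the matching rows

# A bare integer is looked up as a column name, so it fails on this DataFrame.
try:
    df[0]
    int_lookup_failed = False
except KeyError:
    int_lookup_failed = True

# A slice is the exception: it selects rows by position.
first_two = df[0:2]
```

Since df[0] fails while df[0:2] quietly returns rows, using iloc[ ] or loc[ ] for row selection keeps the intent unambiguous.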

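iloc[ ] supports position-based indexing only, so you specify integer positions for the row and the column to get the data you want.
loc[ ] is label-based: the row position takes a DataFrame index label and the column position takes a column name. The row position of loc[ ] also accepts a boolean mask.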
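
Here is a minimal sketch of the difference, assuming a toy DataFrame with string index labels so that positions and labels cannot be confused. The data and variable names are made up for illustration.

```python
import pandas as pd

# Toy data with a string index, invented for illustration.
df = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22, 38, 26]},
    index=["one", "two", "three"],
)

by_position = df.iloc[0, 1]      # first row, second column
by_label = df.loc["one", "Age"]  # row labeled 'one', column 'Age'

# Slices behave differently: iloc excludes the end position,
# while loc includes the end label.
iloc_slice = df.iloc[0:2]
loc_slice = df.loc["one":"three"]

print(by_position, by_label)
print(len(iloc_slice), len(loc_slice))
```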

# Boolean indexing

titanic_boolean = titanic_df[titanic_df['Age']>60]
print(type(titanic_boolean))
titanic_boolean

# Extracting only two columns (Name, Age) from the rows that match the condition

titanic_df[titanic_df['Age']>60][['Name','Age']].head(3)

# Getting the same result with loc

titanic_df.loc[titanic_df['Age']>60, ['Name','Age']].head(3)

1. For an and condition, use &
2. For an or condition, use |
3. For a not condition, use ~

# Wrap each individual condition in ( ) and combine them with the operators above

titanic_df[ (titanic_df['Age']>60)&(titanic_df['Pclass'] == 1) & (titanic_df['Sex'] == 'female')]
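
The line above only uses &, so here is a small sketch that also exercises | and ~. The four-row DataFrame is made up, with column names borrowed from the Titanic data.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Age": [70, 25, 62, 40],
    "Pclass": [1, 3, 3, 1],
    "Sex": ["female", "male", "female", "male"],
})

# Naming each condition first avoids forgetting the parentheses.
old = df["Age"] > 60
first_class = df["Pclass"] == 1

both = df[old & first_class]    # and: older than 60 and in first class
either = df[old | first_class]  # or: older than 60 or in first class
not_old = df[~old]              # not: 60 or younger

print(list(both.index), list(either.index), list(not_old.index))
```

The Python keywords and, or, and not do not work element-wise on a Series, which is why the symbol operators are needed.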

# Ascending order is ascending=True (the default)

# Sort by the Name column in ascending order

titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

# Sort by the Pclass and Name columns in descending order

titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'],ascending=False)
titanic_sorted.head(3)
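
When sorting by several columns, ascending can also be a list with one flag per column. This is not covered in the post itself, so treat the following as a side note on made-up data.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({"Pclass": [3, 1, 1, 3], "Name": ["Dan", "Amy", "Bob", "Cat"]})

# Pclass ascending, then Name descending within each Pclass.
sorted_df = df.sort_values(by=["Pclass", "Name"], ascending=[True, False])

print(sorted_df)
```

sort_values() returns a new DataFrame by default, so df keeps its original order.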

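# To apply an aggregation function to specific columns, select only those columns first.

titanic_df[['Age','Fare']].mean()

# Calling an aggregation function after groupby() applies it to every column except the group-by column.

titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

# With groupby(), use agg() to apply a different function to each column.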

agg_format = {'Age':'max', 'SibSp':'sum','Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)
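
To check what agg() returns, here is a small sketch on made-up numbers that are easy to verify by hand. Passing a list of function names to a single column is my addition; the dict form mirrors agg_format above.

```python
import pandas as pd

# Toy data, invented so the results can be checked by hand.
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, 54.0, 22.0, 26.0],
    "Fare": [70.0, 50.0, 8.0, 10.0],
})

# Several functions on one column: pass a list of function names.
age_stats = df.groupby("Pclass")["Age"].agg(["max", "min"])

# A different function per column: pass a dict of column -> function name.
per_col = df.groupby("Pclass").agg({"Age": "max", "Fare": "mean"})

print(age_stats)
print(per_col)
```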

# Checking for missing data with isna()

titanic_df.isna().head(3)

# Counting missing values per column

titanic_df.isna().sum()

# Handling missing data with fillna()

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)
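
The isna() and fillna() steps can be combined into a before-and-after check. The sketch below uses a made-up three-row DataFrame. Filling Age with the column mean is just one common choice, used here only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with missing values, invented for illustration.
df = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Cabin": [np.nan, "C85", np.nan]})

missing_before = df.isna().sum()

# fillna() returns a new Series, so assign it back to the column.
df["Age"] = df["Age"].fillna(df["Age"].mean())  # mean() skips NaN by default
df["Cabin"] = df["Cabin"].fillna("C000")

missing_after = df.isna().sum()

print(missing_before)
print(missing_after)
```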
