Pandas tips cool part 2
Some basic pandas tricks
Import the libraries and load a few datasets to work with
1
2
3
4
5
6
7
8
9
10
11
| import pandas as pd
import numpy as np
drinks = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/drinks.csv')
movies = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/imdb_1000.csv')
orders = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/chipotle.tsv', sep='\t')
orders['item_price'] = orders.item_price.str.replace('$', '', regex = True).astype('float64')
stocks = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/stocks.csv', parse_dates=['Date'])
titanic = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/titanic_train.csv')
ufo = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/ufo.csv', parse_dates=['Time'])
|
1. Show installed version
```python
# Show the installed pandas version
print(pd.__version__)
```
2. Create an example dataframe
```python
# Create an example DataFrame from a dict of columns
df = pd.DataFrame(
    {
        'col one': [100, 200],
        'col two': [300, 400]
    }
)
df
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/1.png)
```python
# 4 rows x 8 columns of random floats in [0, 1)
pd.DataFrame(np.random.rand(4, 8))
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/2.png)
```python
# Same as above, with named columns
pd.DataFrame(np.random.rand(4, 8), columns=list('abcdefgh'))
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/3.png)
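The random frames above change on every run. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator is acceptable here (the seed `0` and the name `demo` are arbitrary choices, not from the original post):

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same "random" numbers on every run.
rng = np.random.default_rng(0)  # seed value is arbitrary

# 4 rows x 8 columns of floats in [0, 1), with named columns
demo = pd.DataFrame(rng.random((4, 8)), columns=list("abcdefgh"))
print(demo.shape)
```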
3. Rename columns
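```python
# Three interchangeable ways to rename columns; any one of them is enough.

# Option 1: map old names to new names
df = df.rename({
    'col one': 'col_one',
    'col two': 'col_two',
}, axis='columns')

# Option 2: overwrite all column names at once
df.columns = ['col_one', 'col_two']

# Option 3: replace spaces with underscores in every column name
df.columns = df.columns.str.replace(' ', '_')
df
```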
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/4.png)
Add a prefix or a suffix
```python
df.add_prefix('X_')
df.add_suffix('_Y')
```
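Both `add_prefix` and `add_suffix` return a new DataFrame and leave the original untouched, so the result has to be assigned to be kept. A small sketch on a throwaway frame (`df_demo` is a hypothetical name used only for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({"col one": [100, 200], "col two": [300, 400]})

prefixed = df_demo.add_prefix("X_")   # new frame with prefixed column names
suffixed = df_demo.add_suffix("_Y")   # new frame with suffixed column names

print(list(prefixed.columns))
print(list(suffixed.columns))
print(list(df_demo.columns))          # original names are unchanged
```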
4. Reverse row orders
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/5.png)
```python
# Reverse the row order; the original index labels are kept
drinks.loc[::-1].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/6.png)
```python
# Reverse the row order and renumber the index from 0
drinks.loc[::-1].reset_index(drop=True).head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/7.png)
5. Reverse column orders
```python
# Reverse the column order
drinks.loc[:, ::-1].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/8.png)
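The same slicing can be checked on a tiny frame without downloading anything. This is only a sketch; `toy` and its columns are made-up stand-ins for `drinks`:

```python
import pandas as pd

toy = pd.DataFrame({"country": ["A", "B", "C"], "beer": [1, 2, 3]})

# Rows reversed: the index labels travel with their rows
rows_reversed = toy.loc[::-1]

# Rows reversed, then the index renumbered from 0
rows_renumbered = rows_reversed.reset_index(drop=True)

# Columns reversed: the first axis is kept whole, the second is stepped by -1
cols_reversed = toy.loc[:, ::-1]

print(rows_reversed.index.tolist())
print(rows_renumbered.index.tolist())
print(cols_reversed.columns.tolist())
```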
6. Select columns by data type
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/9.png)
```python
# Keep only numeric columns
drinks.select_dtypes(include='number').head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/10.png)
```python
# Keep only object (string) columns
drinks.select_dtypes(include='object').head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/11.png)
```python
# Several data types can be passed as a list
drinks.select_dtypes(include=['number', 'object', 'category', 'datetime']).head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/12.png)
```python
# Drop numeric columns instead of keeping them
drinks.select_dtypes(exclude='number').head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/13.png)
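To see which columns each selector picks, here is a small sketch on a hand-built frame with one column per kind of data (`mixed` and all of its column names are invented for this example):

```python
import pandas as pd

mixed = pd.DataFrame({
    "n_int": [1, 2],
    "n_float": [0.5, 1.5],
    "text": ["a", "b"],
    "when": pd.to_datetime(["2022-01-01", "2022-01-02"]),
})
mixed["cat"] = mixed["text"].astype("category")

# 'number' matches both integer and float columns
numeric_cols = mixed.select_dtypes(include="number").columns.tolist()

# A list selects the union of the listed types
time_or_cat = mixed.select_dtypes(include=["datetime", "category"]).columns.tolist()

# exclude= keeps everything that does not match
non_numeric = mixed.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, time_or_cat, non_numeric)
```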
7. Convert strings to numbers
```python
# Numbers stored as strings; col_three also contains a non-numeric '-'
df = pd.DataFrame(
    {
        'col_one': ['1.1', '2.2', '3.3'],
        'col_two': ['4.4', '5.5', '6.6'],
        'col_three': ['7.7', '8.8', '-']
    }
)
df
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/14.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/15.png)
```python
# astype works for columns whose values are all valid numbers
df.astype(
    {
        'col_one': 'float',
        'col_two': 'float'
    }
).dtypes
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/16.png)
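```python
# errors='coerce' turns values that cannot be parsed (here '-') into NaN
pd.to_numeric(df.col_three, errors='coerce')

# Replace the resulting NaN with 0
pd.to_numeric(df.col_three, errors='coerce').fillna(0)

# Apply the same conversion to every column of the DataFrame
df = df.apply(pd.to_numeric, errors='coerce').fillna(0)
```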
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/17.png)
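The reason `col_three` goes through `pd.to_numeric` rather than `astype` is the `'-'` entry. A short sketch of the difference, using a standalone Series (`raw` is a hypothetical name):

```python
import pandas as pd

raw = pd.Series(["7.7", "8.8", "-"])

# astype refuses the whole column because '-' is not a number
astype_failed = False
try:
    raw.astype("float")
except ValueError:
    astype_failed = True

# to_numeric with errors='coerce' converts what it can and marks the rest as NaN
coerced = pd.to_numeric(raw, errors="coerce")

# fillna(0) then replaces the NaN with a chosen default
filled = coerced.fillna(0)

print(astype_failed)
print(filled.tolist())
```

Filling with 0 is a modelling choice: it is fine when `'-'` means "none", but it hides the gap if the value was simply unknown.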
8. Reduce dataframe size
```python
# memory_usage='deep' measures the real size of string columns
drinks.info(memory_usage='deep')
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/18.png)
```python
# Step 1: read only the columns that are needed
cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/drinks.csv', usecols=cols)
small_drinks.info(memory_usage='deep')
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/19.png)
```python
# Step 2: read repetitive string columns as 'category'
# (reuses cols from the previous step)
dtypes = {'continent': 'category'}
small_drinks = pd.read_csv('https://raw.githubusercontent.com/dophuchao/dophuchao.github.io/master/data/drinks.csv', usecols=cols, dtype=dtypes)
small_drinks.info(memory_usage='deep')
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/20.png)
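The saving from `category` comes from storing each distinct string once and keeping small integer codes per row. A rough sketch on synthetic data (the values and the name `continents` are made up; the exact byte counts depend on the pandas version):

```python
import pandas as pd

# 3000 rows but only 3 distinct values, similar in spirit to a continent column
continents = pd.Series(["Asia", "Europe", "Africa"] * 1000)

as_strings = continents.memory_usage(deep=True)
as_category = continents.astype("category").memory_usage(deep=True)

print(as_strings, as_category)
```

The gain shrinks as the number of distinct values approaches the number of rows, so this mainly helps for low-cardinality columns.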
9. Split a dataframe into two random subsets
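```python
# Take a random 75% of the rows; random_state makes the sample reproducible
movies_1 = movies.sample(frac=0.75, random_state=1234)

# The remaining 25% are the rows whose index is not in movies_1
movies_2 = movies.drop(movies_1.index)

# Inspect which rows ended up in each subset
movies_1.index.sort_values()
movies_2.index.sort_values()
```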
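A quick way to confirm the two subsets form a clean split is to check that their indexes do not overlap and together cover every row. A sketch on a synthetic frame (`frame`, `part_1` and `part_2` are hypothetical names); note that the `drop` step relies on the index labels being unique:

```python
import pandas as pd

frame = pd.DataFrame({"x": range(100)})

part_1 = frame.sample(frac=0.75, random_state=1234)
part_2 = frame.drop(part_1.index)

# No row appears in both parts
overlap = part_1.index.intersection(part_2.index)

# Every row appears in exactly one part
covered = part_1.index.union(part_2.index)

print(len(part_1), len(part_2), len(overlap), len(covered))
```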
10. Filter a dataframe by multiple categories
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/21.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/22.png)
```python
# Combine several conditions with | (or)
movies[
    (movies.genre == 'Action') | (movies.genre == 'Drama') | (movies.genre == 'Western')
].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/23.png)
```python
# Shorter equivalent: isin with a list of categories
movies[movies.genre.isin(
    ['Action', 'Drama', 'Western']
)].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/24.png)
```python
# ~ inverts the mask: every genre except the listed ones
movies[~movies.genre.isin(
    ['Action', 'Drama', 'Western']
)].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/25.png)
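The three filters above can be compared directly on a tiny frame. This is a sketch with invented rows (`films` and its titles are placeholders, not data from `movies`):

```python
import pandas as pd

films = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4", "t5"],
    "genre": ["Action", "Comedy", "Drama", "Western", "Horror"],
})
wanted = ["Action", "Drama", "Western"]

# Chained comparisons joined with |
by_or = films[
    (films.genre == "Action") | (films.genre == "Drama") | (films.genre == "Western")
]

# The same selection with isin
by_isin = films[films.genre.isin(wanted)]

# ~ flips the boolean mask, keeping the complement
others = films[~films.genre.isin(wanted)]

print(by_or.equals(by_isin))
print(others.genre.tolist())
```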
11. Filter a dataframe by largest categories
```python
# Number of movies per genre, largest first
counts = movies.genre.value_counts()
counts
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/26.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/27.png)
```python
# Keep only the movies that belong to the 3 most frequent genres
movies[
    movies.genre.isin(counts.nlargest(3).index)
].head()
```
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/28.png)
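The key step is that `value_counts()` returns a Series indexed by category, so `nlargest(3).index` yields the three most common labels. A sketch on synthetic genres (the counts below are invented and chosen to avoid ties):

```python
import pandas as pd

genres = pd.Series(
    ["Drama"] * 5 + ["Comedy"] * 4 + ["Action"] * 3 + ["Crime"] * 2 + ["Horror"],
    name="genre",
)

counts = genres.value_counts()          # index = genre, value = frequency
top3 = counts.nlargest(3).index         # labels of the 3 most frequent genres
kept = genres[genres.isin(top3)]        # rows belonging to those genres

print(top3.tolist())
print(len(kept))
```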
12. Handle missing values
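```python
ufo.head()

# Number of missing values per column
ufo.isna().sum()

# Share of missing values per column (between 0 and 1)
ufo.isna().mean()

# Drop every column that has at least one missing value
ufo.dropna(axis='columns').head()

# Keep only the columns with at least 90% non-missing values
ufo.dropna(thresh=len(ufo) * 0.9, axis='columns').head()
```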
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/29.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/30.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/31.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/32.png)
![img](/img/in-post/2022-09-10-pandas-tips-cool-part2/33.png)
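How `thresh` decides which columns survive is easier to see on ten rows. A sketch with made-up sightings (`obs` and its columns are hypothetical); `thresh` is documented as an integer count of non-missing values, so the sketch casts the product with `int`:

```python
import numpy as np
import pandas as pd

obs = pd.DataFrame({
    "city": list("ABCDEFGHIJ"),                  # no missing values
    "shape": ["disk"] * 9 + [np.nan],            # 1 of 10 missing
    "color": ["red", "blue"] + [np.nan] * 8,     # 8 of 10 missing
})

# Share of missing values per column
missing_share = obs.isna().mean()

# Strict: any missing value removes the column
complete_only = obs.dropna(axis="columns")

# Tolerant: keep columns with at least 9 non-missing values out of 10
mostly_complete = obs.dropna(thresh=int(len(obs) * 0.9), axis="columns")

print(missing_share.to_dict())
print(complete_only.columns.tolist())
print(mostly_complete.columns.tolist())
```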
Reference links
Full ipynb
References
Machine learning cơ bản
The end.