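Let me state the conclusions first.
- A pandas Series has standard aggregation methods such as max and std
- When working with numeric data that uses commas as thousands separators, pass the thousands option to read_csv
Extracting one-dimensional data from a pandas DataFrame and finding its maximum
Let's select a single column from a pandas DataFrame and compute its maximum, variance, and other statistics. Once again we use the population data for Tokyo's municipalities.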
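import pandas as pd
# thousands=',' makes read_csv parse values such as "63,635" as integers
df = pd.read_csv('population.csv', thousands=',', index_col=0)
# Selecting a single column returns a Series
population = df['総数']
print(population)
# Names such as max and min would shadow Python's built-ins, so use other names
max_value = population.max()
min_value = population.min()
mean_value = population.mean()
var_value = population.var()
std_value = population.std()
print('最大値 {0}'.format(max_value))
print('最小値 {0}'.format(min_value))
print('平均 {0}'.format(mean_value))
print('分散 {0}'.format(var_value))
print('標準偏差 {0}'.format(std_value))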
The result looks like this.
市区町村
千代田区 63635
中央区 162502
港 区 257426
新宿区 346162
文京区 221489
台東区 199292
墨田区 271859
江東区 518479
品川区 394700
目黒区 279342
大田区 729534
世田谷区 908907
渋谷区 226594
中野区 331658
杉並区 569132
豊島区 289508
北 区 351976
荒川区 215966
板橋区 566890
練馬区 732433
足立区 688512
葛飾区 462591
江戸川区 698031
八王子市 562460
立川市 183822
武蔵野市 146399
三鷹市 187199
青梅市 134086
府中市 260011
昭島市 113215
...
小金井市 121443
小平市 193596
日野市 185393
東村山市 150789
国分寺市 123689
国立市 76038
福生市 58243
狛江市 82481
東大和市 85565
清瀬市 74737
東久留米市 116896
武蔵村山市 72546
多摩市 148745
稲城市 90585
羽村市 55607
あきる野市 80851
西東京市 202817
瑞穂町 33213
日の出町 16732
檜原村 2217
奥多摩町 5179
大島町 7716
利島村 323
新島村 2722
神津島村 1898
三宅村 2481
御蔵島村 317
八丈町 7465
青ヶ島村 159
小笠原村 2625
Name: 総数, Length: 62, dtype: int64
最大値 908907
最小値 159
平均 221624.70967741936
分散 48636512171.81597
標準偏差 220536.87259008634
Here, population.csv contains the following data.
市区町村,世帯数,総数,男,女,人口密度
千代田区,"35,830","63,635","31,935","31,700","5,458"
中央区,"91,852","162,502","77,241","85,261","15,916"
港 区,"145,865","257,426","121,326","136,100","12,638"
新宿区,"219,639","346,162","173,743","172,419","18,999"
文京区,"121,128","221,489","105,462","116,027","19,618"
台東区,"118,858","199,292","101,917","97,375","19,712"
墨田区,"150,855","271,859","134,678","137,181","19,743"
江東区,"267,262","518,479","256,116","262,363","12,910"
品川区,"220,678","394,700","193,644","201,056","17,281"
目黒区,"156,583","279,342","132,206","147,136","19,042"
大田区,"391,146","729,534","362,653","366,881","11,993"
世田谷区,"479,792","908,907","431,026","477,881","15,657"
渋谷区,"137,582","226,594","108,768","117,826","14,996"
中野区,"204,613","331,658","167,378","164,280","21,274"
杉並区,"321,531","569,132","273,057","296,075","16,710"
豊島区 ,"179,880","289,508","145,334","144,174","22,253"
北 区,"196,580","351,976","174,910","177,066","17,078"
荒川区,"115,944","215,966","107,283","108,683","21,256"
板橋区,"309,133","566,890","278,662","288,228","17,594"
練馬区,"370,567","732,433","356,279","376,154","15,234"
足立区,"346,739","688,512","345,291","343,221","12,930"
葛飾区,"233,158","462,591","231,272","231,319","13,293"
江戸川区,"342,016","698,031","351,914","346,117","13,989"
八王子市,"267,736","562,460","281,506","280,954","3,018"
立川市,"91,270","183,822","91,460","92,362","7,546"
武蔵野市,"76,765","146,399","70,120","76,279","13,333"
三鷹市,"93,665","187,199","91,624","95,575","11,401"
青梅市,"63,142","134,086","67,393","66,693","1,298"
府中市,"125,060","260,011","130,582","129,429","8,835"
昭島市,"53,827","113,215","56,384","56,831","6,529"
調布市,"118,804","235,169","114,909","120,260","10,898"
町田市,"195,643","428,685","209,971","218,714","5,991"
小金井市,"60,367","121,443","59,955","61,488","10,747"
小平市,"91,602","193,596","95,312","98,284","9,439"
日野市,"88,402","185,393","92,983","92,410","6,729"
東村山市,"72,676","150,789","73,621","77,168","8,797"
国分寺市,"60,111","123,689","60,901","62,788","10,793"
国立市,"37,728","76,038","37,161","38,877","9,330"
福生市,"30,506","58,243","29,132","29,111","5,733"
狛江市,"42,157","82,481","40,005","42,476","12,908"
東大和市,"38,852","85,565","42,208","43,357","6,376"
清瀬市,"35,454","74,737","36,092","38,645","7,306"
東久留米市,"54,257","116,896","57,066","59,830","9,076"
武蔵村山市,"31,640","72,546","36,177","36,369","4,735"
多摩市,"71,851","148,745","72,927","75,818","7,080"
稲城市,"39,991","90,585","45,589","44,996","5,041"
羽村市,"25,718","55,607","28,251","27,356","5,617"
あきる野市,"35,519","80,851","40,304","40,547","1,100"
西東京市,"97,350","202,817","98,839","103,978","12,877"
瑞穂町,"14,912","33,213","16,922","16,291","1,971"
日の出町,"7,383","16,732","8,224","8,508",596
檜原村,"1,181","2,217","1,100","1,117",21
奥多摩町,"2,685","5,179","2,601","2,578",23
大島町,"4,635","7,716","3,971","3,745",85
利島村,174,323,175,148,78
新島村,"1,381","2,722","1,325","1,397",99
神津島村,917,"1,898",975,923,102
三宅村,"1,620","2,481","1,356","1,125",45
御蔵島村,170,317,167,150,15
八丈町,"4,365","7,465","3,720","3,745",103
青ヶ島村,109,159,92,67,27
小笠原村,"1,492","2,625","1,451","1,174",25
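Source: 住民基本台帳による東京都の世帯と人口(町丁別・年齢別) (Households and population of Tokyo based on the Basic Resident Register, by town/block and by age)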
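The code above performs the following steps in order.
- Read the file into a DataFrame with pandas' read_csv
- Select "総数" (total population) from the DataFrame to get one-dimensional data (a Series)
- Compute the maximum and other statistics of the Series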
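The three steps can be tried without the full file. The following is a minimal sketch, assuming a three-row excerpt of population.csv embedded as a string (the name sample_csv and the choice of rows are only for illustration). It also shows one detail worth knowing: var() and std() in pandas default to the sample statistics (ddof=1), so ddof=0 has to be passed explicitly when the population variance is wanted.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt of population.csv, embedded for illustration.
sample_csv = '''市区町村,世帯数,総数,男,女,人口密度
千代田区,"35,830","63,635","31,935","31,700","5,458"
利島村,174,323,175,148,78
青ヶ島村,109,159,92,67,27
'''

# Step 1: parse the text into a DataFrame, treating ',' as a thousands separator.
df = pd.read_csv(io.StringIO(sample_csv), thousands=',', index_col=0)

# Step 2: selecting a single column yields a one-dimensional Series.
population = df['総数']
print(type(population).__name__)  # Series

# Step 3: call the aggregation methods on the Series.
print(population.max(), population.min())

# var() uses ddof=1 (sample variance) by default;
# ddof=0 gives the population variance instead.
sample_var = population.var()
population_var = population.var(ddof=0)
print(sample_var, population_var)
```

With only three rows the two variances differ noticeably; with all 62 municipalities the gap is smaller but still present.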
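Finding the maximum and similar values with pandas is easy, but there are a few pitfalls. Take a look at the original table data. When you download the Tokyo data as-is, the numbers use commas as thousands separators.
Looking at the code above once more, read_csv is called with an argument named thousands. That is the key point this time. What happens if we remove it?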
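import pandas as pd
# No thousands option: values such as "63,635" stay as strings
df = pd.read_csv('population.csv', index_col=0)
population = df['総数']
max_value = population.max()
min_value = population.min()
# The numeric aggregations below cannot work on a column of strings
mean_value = population.mean()
var_value = population.var()
std_value = population.std()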
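This code raises an error. Without thousands, the column is read as strings, and the mean and variance cannot be computed from comma-separated strings. What about the maximum and minimum?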
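One quick way to see what went wrong is to check whether the column was parsed as numbers at all. The following is a minimal sketch, assuming a small excerpt of the file embedded as a string (sample_csv is a hypothetical name used only here). It reads the same text with and without thousands and asks pandas whether the resulting column is numeric.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical three-row excerpt of population.csv, embedded for illustration.
sample_csv = '''市区町村,世帯数,総数,男,女,人口密度
千代田区,"35,830","63,635","31,935","31,700","5,458"
利島村,174,323,175,148,78
青ヶ島村,109,159,92,67,27
'''

# Without thousands, one value like "63,635" is enough to keep the column as text.
raw = pd.read_csv(io.StringIO(sample_csv), index_col=0)['総数']

# With thousands=',', the separators are dropped and the column becomes integers.
parsed = pd.read_csv(io.StringIO(sample_csv), thousands=',', index_col=0)['総数']

print(raw.dtype, is_numeric_dtype(raw))
print(parsed.dtype, is_numeric_dtype(parsed))
```

Printing dtype right after read_csv is a cheap habit that catches this kind of problem before any statistics are computed.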
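import pandas as pd
# Still no thousands option, so the column holds strings
df = pd.read_csv('population.csv', index_col=0)
population = df['総数']
print(population)
max_value = population.max()
min_value = population.min()
print('最大値 {0}'.format(max_value))
print('最小値 {0}'.format(min_value))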
In fact, this code does not raise an error.
最大値 908,907
最小値 1,898
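However, the minimum is reported as 1,898. This is wrong: the correct answer is the 159 residents of 青ヶ島村. The values are compared as strings, character by character, so "1,898" sorts before "159". In the end, comma-separated values have to be handled properly before using the Series methods.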
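If the column has already been loaded as text, the thousands option is not the only way out. The following is a minimal sketch of one possible fix, assuming a hypothetical two-column excerpt embedded as a string (sample_csv and the variable names are for illustration only): strip the commas with Series.str.replace and convert the result to integers.

```python
import io

import pandas as pd

# Hypothetical two-column excerpt, embedded for illustration.
sample_csv = '''市区町村,総数
世田谷区,"908,907"
神津島村,"1,898"
青ヶ島村,159
'''

# Read without thousands, so the column holds strings.
text_column = pd.read_csv(io.StringIO(sample_csv), index_col=0)['総数']

# Compared as strings: ',' sorts before every digit, so "1,898" < "159".
print(text_column.min(), text_column.max())

# Remove the separators as literal text, then convert to integers.
numeric_column = text_column.str.replace(',', '', regex=False).astype(int)
print(numeric_column.min(), numeric_column.max())
```

Passing thousands=',' to read_csv remains the simpler choice when you control how the file is read; the conversion above is mainly useful when the data arrives as an existing DataFrame.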