Visualizing and Predicting the 2022 FIFA World Cup with Python

Source: 奇酷教育 | Published: 2022-11-29 10:45:52

Impressive results often need only the simplest of implementations.

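In testing, the model built in this article correctly predicted, midway through the 2020-21 season, the champions of the Premier League, La Liga, Serie A and the Bundesliga, at a point when roughly 19 matches had been played. Will it do as well when we apply it to the 2022 World Cup? Let's find out.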

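How do we predict a match?

There are different ways to make predictions. I could build a fancy machine learning model and feed it many variables, but after reading a few papers I decided to give the Poisson distribution a try.

The Poisson distribution

Readers may wonder why. Let's start with the definition of the Poisson distribution.

The Poisson distribution is a discrete probability distribution that describes the number of events occurring in a fixed time interval or region of opportunity.

If we treat a goal as an event that can occur during the 90 minutes of a football match, we can calculate the probability of each number of goals that Team A and Team B might score in a match.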

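But that is not enough. We still need to meet the assumptions of the Poisson distribution:

1. The number of events can be counted (a match can have 1, 2, 3 or more goals).

2. Events occur independently (one goal should not affect the probability of another goal).

3. The rate at which events occur is constant (the probability of a goal in a given time interval should be exactly the same for every other interval of the same length).

4. Two events cannot occur at exactly the same instant (two goals cannot be scored at the same time).

Assumptions 1 and 4 clearly hold, while 2 and 3 are only partly true. That said, we will proceed as if assumptions 2 and 3 always hold.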

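When predicting the champions of Europe's top leagues, I plotted histograms of the number of goals in each match of the top 4 leagues over the past 5 years.

[Figure: histograms of goals per match for the 4 leagues]

If you look at the fitted curve for any of the leagues, it resembles a Poisson distribution.

So we can say that the Poisson distribution can be used to calculate the probability of each number of goals that might be scored in a match.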
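As a rough illustration of the comparison described above, the sketch below sets observed goal counts against the counts a Poisson distribution would expect. The sample `goals_per_match` is invented for illustration and is not the league data behind the original figure; `poisson_pmf` is a helper written for this example.

```python
import math
from collections import Counter

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical goals-per-match sample, invented purely for illustration
goals_per_match = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 1, 2, 3, 2, 4, 0, 6]

n = len(goals_per_match)
lam = sum(goals_per_match) / n           # the sample mean estimates lambda
observed = Counter(goals_per_match)

for k in range(max(goals_per_match) + 1):
    expected = n * poisson_pmf(k, lam)   # matches expected to end with k goals
    print(f"{k} goals: observed {observed[k]:2d}, Poisson expects {expected:5.2f}")
```

If the observed and expected columns track each other reasonably well, treating goals as Poisson-distributed is a defensible simplification.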

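Below is the formula for the Poisson distribution.

P(x) = (λ^x * e^(-λ)) / x!

To make predictions, I considered:

lambda (λ): the mean number of goals in 90 minutes (for Team A and for Team B).

x: the number of goals that Team A or Team B might score in a match.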
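To make the roles of lambda and x concrete, here is a minimal sketch that evaluates the Poisson formula for a single team. The value `lam = 1.5` is a made-up number, not one taken from the World Cup data, and `goal_probability` is a name chosen for this example.

```python
import math

def goal_probability(x, lam):
    """Poisson probability of exactly x goals when lam goals are expected."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 1.5  # hypothetical expected goals for one team, not taken from the data
for x in range(5):
    print(f"P({x} goals) = {goal_probability(x, lam):.4f}")

# The probability of scoring at least once is the complement of scoring zero
p_at_least_one = 1 - goal_probability(0, lam)
print(f"P(at least 1 goal) = {p_at_least_one:.4f}")
```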

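To calculate lambda, we need the average goals scored and conceded by each national team. This leads us to the next point.

Goals scored and conceded by each national team

After collecting data on all World Cup matches from 1930 to 2018, we can calculate the average goals scored and conceded by each national team.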
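The averaging step can be sketched without pandas as follows. The three matches and the team names "A", "B" and "C" are hypothetical; the point is only that every match contributes once to the home side's record and once to the away side's.

```python
from collections import defaultdict

# Hypothetical results as (home, home_goals, away_goals, away); not real data
matches = [
    ("A", 2, 1, "B"),
    ("B", 0, 0, "C"),
    ("C", 1, 3, "A"),
]

totals = defaultdict(lambda: {"scored": 0, "conceded": 0, "played": 0})
for home, home_goals, away_goals, away in matches:
    # Each match adds one record for the home side and one for the away side
    for team, scored, conceded in ((home, home_goals, away_goals),
                                   (away, away_goals, home_goals)):
        totals[team]["scored"] += scored
        totals[team]["conceded"] += conceded
        totals[team]["played"] += 1

# (average goals scored, average goals conceded) per team
team_strength = {
    team: (t["scored"] / t["played"], t["conceded"] / t["played"])
    for team, t in totals.items()
}
print(team_strength)
```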

Data cleaning

Reading the data

import pandas as pd

df_historical_data = pd.read_csv('data/fifa_worldcup_matches.csv')
df_fixture = pd.read_csv('data/fifa_worldcup_fixture.csv')
df_missing_data = pd.read_csv('data/fifa_worldcup_missing_data.csv')

Cleaning df_fixture

df_fixture['home'] = df_fixture['home'].str.strip()
df_fixture['away'] = df_fixture['away'].str.strip()

Cleaning df_missing_data

df_missing_data.dropna(inplace=True)
df_historical_data = pd.concat([df_historical_data, df_missing_data], ignore_index=True)
df_historical_data.drop_duplicates(inplace=True)
df_historical_data.sort_values('year', inplace=True)
df_historical_data

Cleaning df_historical_data

# Drop the match that was decided by a walkover
delete_index = df_historical_data[df_historical_data['home'].str.contains('Sweden') &
                                  df_historical_data['away'].str.contains('Austria')].index
df_historical_data.drop(index=delete_index, inplace=True)

# Clean the score and home/away columns
# (raw string, so that \d reaches the regex engine unchanged)
df_historical_data['score'] = df_historical_data['score'].str.replace(r'[^\d–]', '', regex=True)
df_historical_data['home'] = df_historical_data['home'].str.strip()  # strip whitespace: Yugoslavia appeared twice
df_historical_data['away'] = df_historical_data['away'].str.strip()

# Split the score column into home and away goals, then drop the score column
df_historical_data[['HomeGoals', 'AwayGoals']] = df_historical_data['score'].str.split('–', expand=True)
df_historical_data.drop('score', axis=1, inplace=True)

# Rename the columns and change their data types
df_historical_data.rename(columns={'home': 'HomeTeam', 'away': 'AwayTeam',
                                   'year': 'Year'}, inplace=True)
df_historical_data = df_historical_data.astype({'HomeGoals': int, 'AwayGoals': int, 'Year': int})

# Create a new column, TotalGoals
df_historical_data['TotalGoals'] = df_historical_data['HomeGoals'] + df_historical_data['AwayGoals']
df_historical_data

Saving the cleaned data

df_historical_data.to_csv('clean_fifa_worldcup_matches.csv', index=False)
df_fixture.to_csv('clean_fifa_worldcup_fixture.csv', index=False)
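The score-cleaning step above can be tried on a few strings with the standard `re` module. The raw strings below are hypothetical examples of what a scraped score column might contain, and `parse_score` is a helper name invented for this sketch.

```python
import re

def parse_score(raw):
    """Strip everything except digits and the en dash, then split into goals."""
    cleaned = re.sub(r"[^\d–]", "", raw)
    home_goals, away_goals = cleaned.split("–")
    return int(home_goals), int(away_goals)

# Hypothetical raw strings of the kind the cleaning step is meant to handle
for raw in ["4–1", " 2–2 ", "3–2 (a.e.t.)"]:
    print(repr(raw), "->", parse_score(raw))
```

Note that this pattern keeps every digit, so a string carrying extra numbers (a penalty shoot-out tally, for example) would need separate handling before the split.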

Data visualization

# nation_position, club_position, player_positions
df = pd.read_csv('players_22.csv', low_memory=False)

# Select the columns we need
df = df[['short_name', 'age', 'nationality_name', 'overall', 'potential',
         'club_name', 'value_eur', 'wage_eur', 'player_positions']]

# Keep only the first listed position
df['player_positions'] = df['player_positions'].str.split(',', expand=True)[0]

# Drop rows with missing values
df.dropna(inplace=True)

players_missing_worldcup = ['K. Benzema', 'S. Mané', 'S. Agüero', 'Sergio Ramos',
                            'P. Pogba', 'M. Reus', 'Diogo Jota', 'A. Harit',
                            'N. Kanté', 'G. Lo Celso', 'Piqué']

# Drop players who will miss the World Cup (e.g. through injury)
drop_index = df[df['short_name'].isin(players_missing_worldcup)].index
df.drop(drop_index, axis=0, inplace=True)

teams_worldcup = [
    'Qatar', 'Brazil', 'Belgium', 'France', 'Argentina', 'England', 'Spain', 'Portugal',
    'Mexico', 'Netherlands', 'Denmark', 'Germany', 'Uruguay', 'Switzerland', 'United States', 'Croatia',
    'Senegal', 'Iran', 'Japan', 'Morocco', 'Serbia', 'Poland', 'South Korea', 'Tunisia',
    'Cameroon', 'Canada', 'Ecuador', 'Saudi Arabia', 'Ghana', 'Wales', 'Costa Rica', 'Australia'
]

# Keep only players from the World Cup national teams
df = df[df['nationality_name'].isin(teams_worldcup)]

# Sort so that the best players come first
df.sort_values(by=['overall', 'potential', 'value_eur'], ascending=False, inplace=True)

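Player distribution

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(12, 5), tight_layout=True)
sns.histplot(df, x='overall', binwidth=1)
# + 1 so the highest rating also gets a tick (np.arange excludes the stop value)
bins = np.arange(df['overall'].min(), df['overall'].max() + 1, 1)
plt.xticks(bins)
plt.show()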

World Cup dream team

# df is sorted best-first, so this keeps the top-rated player at each position
df.drop_duplicates('player_positions')
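The reason `drop_duplicates` yields a dream team is that it keeps the first row for each position, and the rows are already sorted best-first. A minimal pure-Python sketch of the same idea follows; the player names and ratings are invented.

```python
# Hypothetical players as (name, position, overall); ratings are invented
players = [
    ("Player A", "ST", 91),
    ("Player B", "GK", 90),
    ("Player C", "ST", 89),
    ("Player D", "CB", 88),
    ("Player E", "GK", 87),
]

# Sort best-first, mirroring the sort_values call on `overall`
players.sort(key=lambda p: p[2], reverse=True)

dream_team = {}
for name, position, overall in players:
    # setdefault keeps the first (highest-rated) player seen for each position
    dream_team.setdefault(position, (name, overall))

print(dream_team)
```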

The most skilled player on each national team

df_best_players = df.copy()
df_best_players = df_best_players.drop_duplicates('nationality_name').reset_index(drop=True)
# First three letters of the country name, upper-cased, as a short label
country_short = df_best_players['nationality_name'].str.extract(r'(^\w{3})', expand=False).str.upper()
df_best_players['name_nationality'] = df_best_players['short_name'] + ' (' + country_short + ')'

fig, ax = plt.subplots(figsize=(10, 6), tight_layout=True)
sns.barplot(df_best_players, x='overall', y='name_nationality',
            palette=sns.color_palette('pastel'), width=0.5)
plt.show()

Best squad for each team

def best_squad(nationality):
    # Top two players per position for every nation (df is already sorted best-first)
    df_best_squad = df.copy()
    df_best_squad = df_best_squad.groupby(['nationality_name', 'player_positions']).head(2)
    df_best_squad = df_best_squad[df_best_squad['nationality_name'] == nationality].sort_values(
        ['player_positions', 'overall', 'potential'], ascending=False)
    return df_best_squad

best_squad('Brazil')

average_overall = [best_squad(team)['overall'].mean() for team in teams_worldcup]
df_average_overall = pd.DataFrame({'Teams': teams_worldcup, 'AVG_Overall': average_overall})
df_average_overall = df_average_overall.dropna()
df_average_overall = df_average_overall.sort_values('AVG_Overall', ascending=False)
df_average_overall

fig, ax = plt.subplots(figsize=(12, 5), tight_layout=True)
sns.barplot(df_average_overall[:10], x='Teams', y='AVG_Overall',
            palette=sns.color_palette('pastel'))
plt.show()

Best formation for each team

def best_lineup(nationality, lineup):
    # Count how often each position appears in the formation
    lineup_count = [lineup.count(i) for i in lineup]
    df_lineup = pd.DataFrame({'position': lineup, 'count': lineup_count})
    positions_non_repeated = df_lineup[df_lineup['count'] <= 1]['position'].values
    positions_repeated = df_lineup[df_lineup['count'] > 1]['position'].values

    df_squad = best_squad(nationality)

    # One player for positions used once; for repeated positions, every player
    # that best_squad kept at that position (at most two)
    df_lineup = pd.concat([
        df_squad[df_squad['player_positions'].isin(positions_non_repeated)].drop_duplicates('player_positions', keep='first'),
        df_squad[df_squad['player_positions'].isin(positions_repeated)]]
    )
    return df_lineup[['short_name', 'overall', 'club_name', 'player_positions']]

dict_formation = {
    '4-3-3': ['GK', 'RB', 'CB', 'CB', 'LB', 'CDM', 'CM', 'CAM', 'RW', 'ST', 'LW'],
    '4-4-2': ['GK', 'RB', 'CB', 'CB', 'LB', 'RM', 'CM', 'CM', 'LM', 'ST', 'ST'],
    '4-2-3-1': ['GK', 'RB', 'CB', 'CB', 'LB', 'CDM', 'CDM', 'CAM', 'CAM', 'CAM', 'ST'],
}

for index, row in df_average_overall[:9].iterrows():
    max_average = None
    for key, values in dict_formation.items():
        average = best_lineup(row['Teams'], values)['overall'].mean()
        if max_average is None or average > max_average:
            max_average = average
            formation = key
    print(row['Teams'], formation, max_average)

Spain 4-2-3-1 85.1
Portugal 4-2-3-1 84.9
England 4-4-2 84.45454545454545
Brazil 4-3-3 84.81818181818181
France 4-2-3-1 83.9
Argentina 4-3-3 83.54545454545455
Germany 4-2-3-1 84.1
Belgium 4-3-3 82.54545454545455
Netherlands 4-4-2 82.54545454545455

# best_lineup('Spain', dict_formation['4-2-3-1'])
# best_lineup('Argentina', dict_formation['4-3-3'])
best_lineup('Brazil', dict_formation['4-3-3'])
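The formation search above boils down to "score every formation, keep the best". The sketch below shows that idea in plain Python under simplifying assumptions: the `squad` ratings are invented, and each slot simply takes the next-best player at its position, which is a simplified variant of what `best_lineup` does rather than a reproduction of it.

```python
from statistics import mean

# Hypothetical squad: two rated players per position (ratings invented)
squad = {
    "GK": [88, 84], "RB": [82, 80], "CB": [86, 85], "LB": [83, 79],
    "CDM": [87, 81], "CM": [85, 84], "CAM": [86, 80],
    "RM": [80, 78], "LM": [81, 77], "RW": [88, 82], "LW": [89, 83], "ST": [90, 86],
}

formations = {
    "4-3-3": ["GK", "RB", "CB", "CB", "LB", "CDM", "CM", "CAM", "RW", "ST", "LW"],
    "4-4-2": ["GK", "RB", "CB", "CB", "LB", "RM", "CM", "CM", "LM", "ST", "ST"],
}

def lineup_average(formation):
    """Average rating when each slot takes the next-best player at that position."""
    used = {}
    ratings = []
    for position in formation:
        index = used.get(position, 0)
        ratings.append(squad[position][index])
        used[position] = index + 1
    return mean(ratings)

# Pick the formation whose lineup has the highest average rating
best = max(formations, key=lambda name: lineup_average(formations[name]))
print(best, round(lineup_average(formations[best]), 2))
```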

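Since almost every team at the World Cup plays at a neutral venue, home/away advantage is not considered in this analysis.

Once I had the goals scored and conceded for each national team, I created a function that predicts how many points each team will earn in the group stage.

Predicting the group stage

Below is the code I used to predict how many points each national team will earn in the group stage.

Calculating team strength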

import pickle
import pandas as pd

# dict_table maps each group name to that group's table ('Team' and 'Pts' columns)
with open('dict_table', 'rb') as f:
    dict_table = pickle.load(f)
df_historical_data = pd.read_csv('clean_fifa_worldcup_matches.csv')
df_fixture = pd.read_csv('clean_fifa_worldcup_fixture.csv')

df_home = df_historical_data[['HomeTeam', 'HomeGoals', 'AwayGoals']]
df_away = df_historical_data[['AwayTeam', 'HomeGoals', 'AwayGoals']]

df_home = df_home.rename(columns={'HomeTeam': 'Team', 'HomeGoals': 'GoalsScored', 'AwayGoals': 'GoalsConceded'})
df_away = df_away.rename(columns={'AwayTeam': 'Team', 'HomeGoals': 'GoalsConceded', 'AwayGoals': 'GoalsScored'})

# Average goals scored and conceded per match for each team
df_team_strength = pd.concat([df_home, df_away], ignore_index=True).groupby(['Team']).mean()
df_team_strength

from scipy.stats import poisson

def predict_points(home, away):
    if home in df_team_strength.index and away in df_team_strength.index:
        # lambda: own average goals scored times the opponent's average goals conceded
        lamb_home = df_team_strength.at[home, 'GoalsScored'] * df_team_strength.at[away, 'GoalsConceded']
        lamb_away = df_team_strength.at[away, 'GoalsScored'] * df_team_strength.at[home, 'GoalsConceded']
        prob_home, prob_away, prob_draw = 0, 0, 0
        for x in range(0, 11):  # number of goals home team
            for y in range(0, 11):  # number of goals away team
                p = poisson.pmf(x, lamb_home) * poisson.pmf(y, lamb_away)
                if x == y:
                    prob_draw += p
                elif x > y:
                    prob_home += p
                else:
                    prob_away += p
        points_home = 3 * prob_home + prob_draw
        points_away = 3 * prob_away + prob_draw
        return (points_home, points_away)
    else:
        # No historical data for at least one of the teams
        return (0, 0)

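Put simply, predict_points calculates how many points the home team and the away team are expected to earn. Each team's lambda is calculated with the formula average_goals_scored * average_goals_conceded, that is, the team's own average goals scored multiplied by its opponent's average goals conceded.

The function then steps through every possible scoreline of a match from 0-0 to 10-10 (that last score is simply the limit of my goal range). Once we have lambda and x, we use the Poisson formula to calculate p.

prob_home, prob_draw and prob_away accumulate the value of p depending on whether the scoreline is, say, 1-0 (home win), 1-1 (draw) or 0-1 (away win). Finally, the points are calculated with the formulas below.

points_home = 3 * prob_home + prob_draw

points_away = 3 * prob_away + prob_draw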
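The same calculation can be sketched without pandas or SciPy, which makes the scoreline loop easy to experiment with. The lambdas passed in below are hypothetical values chosen for illustration (in the article they come from df_team_strength), and `expected_points` and `poisson_pmf` are names invented for this sketch.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_points(lamb_home, lamb_away, max_goals=10):
    """Expected points for both sides over all scorelines up to max_goals each."""
    prob_home = prob_draw = prob_away = 0.0
    for x in range(max_goals + 1):        # home goals
        for y in range(max_goals + 1):    # away goals
            p = poisson_pmf(x, lamb_home) * poisson_pmf(y, lamb_away)
            if x > y:
                prob_home += p
            elif x == y:
                prob_draw += p
            else:
                prob_away += p
    return 3 * prob_home + prob_draw, 3 * prob_away + prob_draw

# Hypothetical lambdas, for illustration only
print(expected_points(1.8, 0.9))
print(expected_points(1.2, 1.2))   # equal lambdas give equal expected points
```

Because a draw hands out two points in total and a decisive result three, the two expected values always sum to slightly less than 3.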

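If we use predict_points to predict the England vs. United States match, we get this result.

>>> print(predict_points('England', 'United States'))
(2.2356147635326007, 0.5922397535606193)

This means England is expected to earn about 2.24 points and the United States about 0.59. The values are fractional because they are expected points computed from probabilities.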
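The two expected-points values can be turned back into win, draw and loss probabilities. The sketch below does so under one stated assumption: that the three probabilities sum to essentially 1, meaning the scorelines beyond 10 goals that the loop ignores carry negligible weight.

```python
# Expected points reported above for England vs. United States
points_home = 2.2356147635326007
points_away = 0.5922397535606193

# Assumption: prob_home + prob_draw + prob_away is very close to 1.
# Then points_home + points_away = 3 * (1 - prob_draw) + 2 * prob_draw = 3 - prob_draw.
prob_draw = 3 - (points_home + points_away)
prob_home = (points_home - prob_draw) / 3
prob_away = (points_away - prob_draw) / 3

print(f"home win {prob_home:.3f}, draw {prob_draw:.3f}, away win {prob_away:.3f}")
```

Read this way, the model gives England roughly a two-in-three chance of winning the match.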

If we apply predict_points to every match in the group stage, we get the 1st and 2nd place of each group, and therefore the knockout matches that follow.

df_fixture_group_48 = df_fixture[:48].copy()
df_fixture_knockout = df_fixture[48:56].copy()
df_fixture_quarter = df_fixture[56:60].copy()
df_fixture_semi = df_fixture[60:62].copy()
df_fixture_final = df_fixture[62:].copy()

for group in dict_table:
    teams_in_group = dict_table[group]['Team'].values
    df_fixture_group_6 = df_fixture_group_48[df_fixture_group_48['home'].isin(teams_in_group)]
    for index, row in df_fixture_group_6.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        dict_table[group].loc[dict_table[group]['Team'] == home, 'Pts'] += points_home
        dict_table[group].loc[dict_table[group]['Team'] == away, 'Pts'] += points_away

    # Rank the group by expected points
    dict_table[group] = dict_table[group].sort_values('Pts', ascending=False).reset_index()
    dict_table[group] = dict_table[group][['Team', 'Pts']]
    dict_table[group] = dict_table[group].round(0)

dict_table['Group A']
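The accumulate-then-rank logic of the group stage can be sketched on a toy group. Everything here is hypothetical: the four teams, their `strength` values, and `toy_expected_points`, which is only a stand-in for predict_points so that the example runs on its own.

```python
from itertools import combinations

# Hypothetical group and strengths, for illustration only
teams = ["A", "B", "C", "D"]
strength = {"A": 2.0, "B": 1.5, "C": 1.0, "D": 0.5}

def toy_expected_points(home, away):
    """Stand-in for predict_points: split 3 points in proportion to strength."""
    total = strength[home] + strength[away]
    return 3 * strength[home] / total, 3 * strength[away] / total

table = {team: 0.0 for team in teams}
for home, away in combinations(teams, 2):   # 6 matches in a 4-team group
    points_home, points_away = toy_expected_points(home, away)
    table[home] += points_home
    table[away] += points_away

# Sort by accumulated points; the top two advance
ranking = sorted(table.items(), key=lambda item: item[1], reverse=True)
group_winner, runner_up = ranking[0][0], ranking[1][0]
print(ranking)
print(group_winner, runner_up)
```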

Predicting the knockout stage

df_fixture_knockout

for group in dict_table:
    group_winner = dict_table[group].loc[0, 'Team']
    runners_up = dict_table[group].loc[1, 'Team']
    df_fixture_knockout.replace({f'Winners {group}': group_winner,
                                 f'Runners-up {group}': runners_up}, inplace=True)

df_fixture_knockout['winner'] = '?'
df_fixture_knockout
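The placeholder substitution above can be pictured with plain dictionaries. The group results and team names below are hypothetical; only the placeholder pattern ("Winners Group A", "Runners-up Group A") follows the code above.

```python
# Hypothetical group results: (group winner, runner-up)
group_results = {
    "Group A": ("Team A1", "Team A2"),
    "Group B": ("Team B1", "Team B2"),
}
# Knockout fixtures still hold placeholder names
knockout = [
    {"home": "Winners Group A", "away": "Runners-up Group B"},
    {"home": "Winners Group B", "away": "Runners-up Group A"},
]

# Build one placeholder -> team mapping, then apply it to every fixture
replacements = {}
for group, (winner, runner_up) in group_results.items():
    replacements[f"Winners {group}"] = winner
    replacements[f"Runners-up {group}"] = runner_up

for match in knockout:
    match["home"] = replacements.get(match["home"], match["home"])
    match["away"] = replacements.get(match["away"], match["away"])
    match["winner"] = "?"   # to be filled in by the prediction step

print(knockout)
```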

For the knockout rounds, I don't need to predict points, only the winner of each match. That is why I created a new get_winner function built on top of the earlier predict_points function.

def get_winner(df_fixture_updated):
    for index, row in df_fixture_updated.iterrows():
        home, away = row['home'], row['away']
        points_home, points_away = predict_points(home, away)
        if points_home > points_away:
            winner = home
        else:
            winner = away
        df_fixture_updated.loc[index, 'winner'] = winner
    return df_fixture_updated

Simply put, if the home team's points are greater than the away team's points, the winner is the home team; otherwise, the winner is the away team.
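One consequence of this comparison is that any tie goes to the away team, including the (0, 0) that predict_points returns when a team has no historical data. A hypothetical variant that surfaces such cases instead of silently picking a side might look like the sketch below; `pick_winner` is not part of the original code.

```python
def pick_winner(home, away, points_home, points_away):
    """Return the predicted winner, or None when the model cannot separate the teams."""
    if points_home > points_away:
        return home
    if points_away > points_home:
        return away
    return None   # e.g. both sides got (0, 0) because a team had no history

print(pick_winner("Team X", "Team Y", 1.9, 0.8))
print(pick_winner("Team X", "Team Y", 0, 0))
```

Returning None leaves it to the caller to decide how to break the tie, for instance by falling back to player ratings.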

Using the get_winner function gives the following results.

Predicting the quarter-finals, semi-finals and final

def update_table(df_fixture_round_1, df_fixture_round_2):
    for index, row in df_fixture_round_1.iterrows():
        winner = df_fixture_round_1.loc[index, 'winner']
        match = df_fixture_round_1.loc[index, 'score']
        # Replace the 'Winners <match>' placeholder in the next round with the team
        df_fixture_round_2.replace({f'Winners {match}': winner}, inplace=True)
    df_fixture_round_2['winner'] = '?'
    return df_fixture_round_2
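Repeating "predict the winners, then fill in the next round" until one team remains can be sketched as a small loop. The team labels and `rating` values are hypothetical stand-ins for the model's predictions, and pairing consecutive winners is a simplification: update_table above matches placeholders by match label instead.

```python
# Hypothetical ratings standing in for the model's match predictions
rating = {"T1": 8, "T2": 5, "T3": 7, "T4": 6, "T5": 4, "T6": 9, "T7": 3, "T8": 2}

def play_round(fixtures):
    """Return the higher-rated team of each (home, away) pairing."""
    return [home if rating[home] > rating[away] else away for home, away in fixtures]

def next_round(winners):
    """Pair consecutive winners: winner of match 1 meets winner of match 2, and so on."""
    return list(zip(winners[0::2], winners[1::2]))

fixtures = [("T1", "T2"), ("T3", "T4"), ("T5", "T6"), ("T7", "T8")]
while True:
    winners = play_round(fixtures)
    if len(winners) == 1:
        break
    fixtures = next_round(winners)

champion = winners[0]
print(champion)
```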

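Quarter-finals

Semi-finals

Final

If I use get_winner, I can predict the World Cup champion. Here is the final result!

Running the function once more, the winner I get is... Brazil!

(Any resemblance to actual results is purely coincidental.)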
