Python - Web Scraping: can I use concurrency to improve my code?
So I'm pulling statistics of NFL players. The table shows a maximum of 50 rows, so I have to filter down to make sure I don't miss any stats, which means I'm iterating through pages to collect the data by season, position, team, and week.
I figured out how the URL changes as I cycle through these, but the iteration process takes a long time. So I was thinking: since we're able to open multiple web pages at one time, shouldn't I be able to run these processes in parallel, where each process simultaneously collects the data from its own page and stores it in a temp_df, and then merge them all at the end, instead of collecting one URL, merging, next URL, merging, next, ... one at a time? As it stands, it iterates 6,144 times (if I'm not iterating through positions); with positions, it's over 36,000 iterations.
But I'm stuck on how to implement it, or whether it's even possible.
Here's the code I'm using. I eliminated the cycle through positions to give you an idea of how it works, so it's just for quarterbacks, p = 2.
So it starts at season 2005 = 1, team 1 = 1, week 1 = 0, and iterates to the last season 2016 = 12, team 32 = 33, and week 16 = 17:
    import requests
    import pandas as pd

    seasons = list(range(1,13))
    teams = list(range(1,33))
    weeks = list(range(0,17))

    qb_df = pd.DataFrame()
    p = 2  # position code for quarterbacks

    for s in seasons:
        for t in teams:
            for w in weeks:
                url = 'https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s&st=fantasypointsfanduel&d=1&ls=fantasypointsfanduel&live=false&pid=true&minsnaps=4' % (s,w,w,t,p)
                html = requests.get(url).content
                df_list = pd.read_html(html)
                temp_df = df_list[-1]  # the stats table is the last one on the page
                temp_df['nfl season'] = str(2017-s)
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same merge
                qb_df = pd.concat([qb_df, temp_df], ignore_index = True)

    file = 'player_data_fanduel_2005_to_2016_qb.xls'
    qb_df.to_excel(file)
    print('\ndata has been saved.')
1/ Create a dict of the seasons, teams, and weeks, and build the URLs from it.
2/ Use a multiprocessing pool to call the URLs and get the data.
Or use a dedicated scraping tool such as Scrapy.
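For what it's worth, here is a rough sketch of steps 1/ and 2/, not a tested scraper. The function names (build_jobs, fetch_table, scrape), the pool size of 8, and the 30-second timeout are my own assumptions; the URL template and the "last table on the page" logic are taken from the question. Error handling (for example, pages with no table) and rate limiting are left out.

```python
import io
from itertools import product
from multiprocessing import Pool

# URL template copied from the question; only sn, w, ew, t and p vary.
URL_TEMPLATE = (
    "https://fantasydata.com/nfl-stats/nfl-fantasy-football-stats.aspx"
    "?fs=2&stype=0&sn=%s&scope=1&w=%s&ew=%s&s=&t=%s&p=%s"
    "&st=fantasypointsfanduel&d=1&ls=fantasypointsfanduel"
    "&live=false&pid=true&minsnaps=4"
)


def build_jobs(seasons, teams, weeks, position):
    """Step 1/: one dict per page, holding its parameters and its URL."""
    return [
        {"season": s, "team": t, "week": w,
         "url": URL_TEMPLATE % (s, w, w, t, position)}
        for s, t, w in product(seasons, teams, weeks)
    ]


def fetch_table(job):
    """Worker: download one page and return its last HTML table."""
    # Third-party imports live here so build_jobs works without them.
    import pandas as pd
    import requests

    response = requests.get(job["url"], timeout=30)  # timeout is an assumption
    temp_df = pd.read_html(io.StringIO(response.text))[-1]
    temp_df["nfl season"] = str(2017 - job["season"])
    return temp_df


def scrape(jobs, processes=8):
    """Step 2/: fetch every page with a process pool, then merge once."""
    import pandas as pd

    with Pool(processes=processes) as pool:
        frames = pool.map(fetch_table, jobs)  # results keep the order of jobs
    return pd.concat(frames, ignore_index=True)
```

You would call it as qb_df = scrape(build_jobs(range(1, 13), range(1, 33), range(0, 17), 2)) from under an if __name__ == "__main__": guard, which multiprocessing needs on platforms that start workers by spawning. Since the work is mostly waiting on the network, a thread pool (multiprocessing.pool.ThreadPool has the same map interface) would probably do just as well.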