Python Scrapy XPath not following URL
I'm new to Python and having a bit of trouble getting Scrapy to follow URLs. I suspect it may be my XPath specification, but after doing several tutorials on the topic I'm no closer to resolving this. It loops over the URLs in the referenced table but scrapes the content of the starting page repeatedly. What am I doing wrong?
Code attached:
    import scrapy
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import Request


    class myspider(CrawlSpider):
        name = 'unespider'
        allowed_domains = ['https://my.une.edu.au/']
        start_urls = ['https://my.une.edu.au/courses/']
        rules = Rule(LinkExtractor(canonicalize = True, unique = True), follow = True, callback = "parse"),

        def parse(self, response):
            hxs = Selector(response)
            for url in response.xpath('//*'):
                yield {
                    'title': url.xpath('//*[@id="main-content"]/div/h2/a/text()').extract_first(),
                    'avail': url.xpath('//*[@id="overviewtab-snapshotdiv"]/p[3]/a/text()').extract_first(),
                }
            for url in hxs.xpath('//tr/td/a/@href').extract():
                yield Request(response.urljoin(url), callback=self.parse)
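A side note on the spider above: Scrapy documents `allowed_domains` as a list of domain names, not URLs, so an entry like `'https://my.une.edu.au/'` may be part of why followed links never get scraped. The snippet below is only a rough stdlib sketch of that idea. The `host_is_allowed` helper and the example link are made up for illustration, and this is a simplified stand-in, not Scrapy's actual offsite filtering code.

```python
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed entry.

    Hypothetical helper: a simplified stand-in for a domain filter,
    not the code Scrapy itself runs.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical link of the kind found on the courses page.
link = "https://my.une.edu.au/courses/2017"

# A full URL is compared against a bare hostname, so it never matches.
print(host_is_allowed(link, ["https://my.une.edu.au/"]))  # False

# A plain domain name matches the link's host.
print(host_is_allowed(link, ["my.une.edu.au"]))  # True
```

If this is what is happening, writing the entry as a bare domain such as `'my.une.edu.au'` should be enough for the comparison to succeed.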
**Update: I see what you wanted and have updated the code. It now follows each year and outputs correctly.**
I apologize, I'm not sure what you are trying to follow and scrape on the start page with //*[@id="overviewtab-snapshotdiv"]. I wasn't able to find that XPath. I'd like to help out more since I'm new to programming too, and Scrapy was hard for me at first. I ended up making my own class scraper later, by the way, though I'm sure Scrapy is better :) I've written code to scrape the titles and URLs, and commented out the rule since I don't know what you are trying to follow or why.
    import scrapy
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from overflowquestion2 import items  # make sure to import items.py


    class davesimspider(scrapy.Spider):
        name = 'davesim'
        allowed_domains = ['my.une.edu.au']
        start_urls = ['http://my.une.edu.au/courses/2007', ]  # start at 2007
        rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse")

        def parse(self, response):
            # this will scrape the links and follow them
            # grab the main div wrapping the links
            divlinkwrapper = response.xpath('//div[@class="pagination"]')
            for links in divlinkwrapper:
                # for every element, extract the links
                thelinks = links.xpath('ul/li/a/@href').extract()
                for i in thelinks:
                    # for every link, follow the link
                    yield scrapy.Request(i, callback=self.contentparse)

        def contentparse(self, response):
            # scrape the content you want
            # grab the main div wrapping the content
            divmaincontent = response.xpath('//div[@id="main-content"]')
            for titles in divmaincontent:
                # create an item object from the class in items.py
                item = items.overflowquestion2item()
                thetitles = titles.xpath('div[@class="content"]//a/text()').extract()
                # set the item's scrapy.Field defined in items.py
                item['title'] = thetitles
                yield item  # yield the item through the pipeline
            for urls in divmaincontent:
                item = items.overflowquestion2item()
                theurls = urls.xpath('//table/tr/td/a/@href').extract()
                item['url'] = theurls
                yield item

Now items.py:
    import scrapy


    class overflowquestion2item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        url = scrapy.Field()
        pass

Also remember to uncomment the item pipeline in settings.py:
    ITEM_PIPELINES = {
        'overflowquestion2.pipelines.overflowquestion2pipeline': 300,
    }

I'm positive this could be coded better, and I hope someone here can refine it ;)
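One possible refinement concerns the hrefs pulled out with `@href`, which can be relative or absolute depending on how the page was written. Below is a small stdlib sketch of how resolving them against the page URL behaves. The `resolve_links` helper and the example hrefs are made up for illustration and are not taken from the real site. As far as I know, Scrapy's `response.urljoin` does this same kind of resolution against the response URL.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn a mix of relative and absolute hrefs into absolute URLs.

    Hypothetical helper for illustration only.
    """
    return [urljoin(page_url, href) for href in hrefs]


page = "http://my.une.edu.au/courses/2007"

# Made-up hrefs: root-relative, relative, and already absolute.
hrefs = ["/courses/2008", "2009", "http://my.une.edu.au/courses/2010"]

for url in resolve_links(page, hrefs):
    print(url)
```

Absolute hrefs pass through unchanged, so wrapping every extracted link this way should be safe whichever form the site uses.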