Python Scrapy XPath not following URL

September 15, 2013

I am new to Python and am having a bit of trouble getting Scrapy to follow URLs. I suspect it may be my XPath specification, but after doing several tutorials on the topic I am no closer to resolving this. The spider loops over the URLs in the referenced table, but it scrapes the content of the starting page repeatedly. What am I doing wrong?

Code attached:

    import scrapy
    from scrapy.selector import Selector
    from scrapy.spiders import CrawlSpider
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.http import Request


    class MySpider(CrawlSpider):
        name = 'unespider'
        allowed_domains = ['https://my.une.edu.au/']
        start_urls = ['https://my.une.edu.au/courses/']
        rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse"),

        def parse(self, response):
            hxs = Selector(response)
            for url in response.xpath('//*'):
                yield {
                    'title': url.xpath('//*[@id="main-content"]/div/h2/a/text()').extract_first(),
                    'avail': url.xpath('//*[@id="overviewtab-snapshotdiv"]/p[3]/a/text()').extract_first(),
                }
            for url in hxs.xpath('//tr/td/a/@href').extract():
                yield Request(response.urljoin(url), callback=self.parse)
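One thing worth checking in the spider above is `allowed_domains`. The Scrapy documentation describes it as a list of domain names, not URLs, and here it holds a full URL. The sketch below uses only the standard library to show the difference. The `is_allowed` function is a simplified stand-in written for illustration, not Scrapy's actual offsite filter, and `to_domain` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def to_domain(value):
    """Return the bare hostname for a URL, or for a value that is already a hostname."""
    parsed = urlparse(value if "://" in value else "//" + value)
    return parsed.hostname


def is_allowed(url, allowed_domains):
    """Simplified check: the host equals an allowed domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


raw = ["https://my.une.edu.au/"]        # value used in the question
cleaned = [to_domain(v) for v in raw]   # bare hostnames only

target = "https://my.une.edu.au/courses/2013"  # hypothetical link, for illustration
print(cleaned)
print(is_allowed(target, raw))      # the URL-style entry never equals a hostname
print(is_allowed(target, cleaned))  # the bare hostname matches
```

How Scrapy treats a URL-style entry may differ between versions, so using the bare hostname is the safer choice. Separately, the Scrapy documentation warns against using `parse` as a rule callback in a `CrawlSpider`, because `CrawlSpider` uses `parse` itself to implement its rule handling.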

**Update: I see what you wanted and have updated the code. It now follows each year and outputs correctly.**

I apologize, I'm not sure what you were trying to follow and scrape from the start page with //*[@id="overviewtab-snapshotdiv"]. I wasn't able to find that XPath. I'd like to help out more, since I'm new to programming too and Scrapy was hard for me at first. I ended up making my own scraper class later, by the way, though I'm sure Scrapy is better :) I've written code to scrape the titles and URLs, and commented out the rule since I don't know what you were trying to follow or why.

    import scrapy
    from scrapy.spiders import Rule
    from scrapy.linkextractors import LinkExtractor

    from overflowquestion2 import items  # make sure to import items.py


    class DaveSimSpider(scrapy.Spider):
        name = 'davesim'
        allowed_domains = ['my.une.edu.au']
        start_urls = ['http://my.une.edu.au/courses/2007', ]  # start at 2007
        # note: only CrawlSpider reads `rules`; on a plain scrapy.Spider this line has no effect
        rules = Rule(LinkExtractor(canonicalize=True, unique=True), follow=True, callback="parse")

        def parse(self, response):  # this will scrape the links and follow them
            # grab the main div wrapping the links
            divlinkwrapper = response.xpath('//div[@class="pagination"]')

            for links in divlinkwrapper:  # for every element, extract the links
                thelinks = links.xpath('ul/li/a/@href').extract()
                for i in thelinks:  # for every link, follow the link
                    yield scrapy.Request(i, callback=self.contentparse)

        def contentparse(self, response):  # scrape the content you want
            # grab the main div wrapping the content
            divmaincontent = response.xpath('//div[@id="main-content"]')

            for titles in divmaincontent:
                # create the item object from the class in items.py
                item = items.overflowquestion2item()
                thetitles = titles.xpath('div[@class="content"]//a/text()').extract()
                # set the item's scrapy.Field declared in items.py
                item['title'] = thetitles
                yield item  # yield the item through the pipeline

            for urls in divmaincontent:
                item = items.overflowquestion2item()
                theurls = urls.xpath('//table/tr/td/a/@href').extract()
                item['url'] = theurls
                yield item
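The `parse` method above passes each `href` straight to `scrapy.Request`, which needs an absolute URL with a scheme. If the pagination links turn out to be relative (not verified against the page here), they have to be joined with the page URL first, which is what `response.urljoin` was doing in the question's code. A minimal standard-library sketch of that joining, using made-up `href` values:

```python
from urllib.parse import urljoin

# Page the links were found on (the start URL from the spider above).
base = "http://my.une.edu.au/courses/2007"

# Hypothetical href values, chosen only to show the three common cases.
hrefs = [
    "/courses/2008",                      # root-relative
    "2009",                               # relative to the current path
    "http://my.une.edu.au/courses/2010",  # already absolute
]

absolute = [urljoin(base, h) for h in hrefs]
for url in absolute:
    print(url)
```

Inside a spider the equivalent would be `scrapy.Request(response.urljoin(i), callback=self.contentparse)`, which leaves already-absolute links unchanged.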

Now the items.py:

    import scrapy


    class overflowquestion2item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        url = scrapy.Field()
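The item declares both `title` and `url`, but `contentparse` fills them in separate items, each holding a whole list. One possible refinement is to emit one record per link. The sketch below assumes the title list and the URL list come from the same anchors in the same order, which the two XPath expressions in the spider do not guarantee. `pair_rows` and the sample values are hypothetical.

```python
def pair_rows(titles, urls):
    """Zip parallel lists of titles and hrefs into one record per link.

    zip() stops at the shorter list, so unmatched entries are dropped.
    """
    return [{"title": t.strip(), "url": u} for t, u in zip(titles, urls)]


# Made-up sample data standing in for the two .extract() results.
sample_titles = ["  Bachelor of Arts ", "Bachelor of Science\n"]
sample_urls = ["/courses/2007/courses/BA", "/courses/2007/courses/BSC"]

rows = pair_rows(sample_titles, sample_urls)
for row in rows:
    print(row)
```

Scrapy accepts plain dicts as items, so a spider could yield each `row` directly, or copy the two values into the item class defined in items.py.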

Also remember to uncomment the item pipeline in settings.py:

    ITEM_PIPELINES = {
        'overflowquestion2.pipelines.overflowquestion2pipeline': 300,
    }
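The setting maps a pipeline class path to an order value. Values are conventionally in the 0 to 1000 range, and pipelines with lower values run first. A pipeline is an ordinary class with a `process_item(self, item, spider)` method that returns the item. The class below is a hypothetical example rather than the pipeline from this project, and it assumes `title` holds a list of strings as in the spider above.

```python
class StripTitlePipeline:
    """Hypothetical pipeline: trim whitespace from titles and drop empty ones."""

    def process_item(self, item, spider):
        # Items that only carry 'url' have no title yet, so fall back to [].
        titles = item.get("title") or []
        item["title"] = [t.strip() for t in titles if t.strip()]
        return item  # returning the item passes it to the next pipeline


pipeline = StripTitlePipeline()
result = pipeline.process_item({"title": ["  Bachelor of Arts \n", "   "]}, spider=None)
print(result)
```

To enable a class like this it would need its own entry in `ITEM_PIPELINES`, using its real module path.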

I'm positive this could be coded better, and I hope someone here can refine it ;)
