Django - 2 RabbitMQ workers and 2 Scrapyd daemons running on 2 local Ubuntu instances, in which one of the RabbitMQ workers is not working
I am working on building a "Scrapy spiders control panel" and am testing the existing solution available at [distributed multi-user scrapy spiders control panel] https://github.com/aaldaber/distributed-multi-user-scrapy-system-with-a-web-ui.
I am trying to run it on a local Ubuntu dev machine but am having issues with the scrapyd daemon. One of the workers, linkgenerator, is working, but the scraper worker1 is not. I cannot figure out why scrapyd won't run properly for the second local instance.
The application comes bundled with Django, Scrapy, a MongoDB pipeline (for saving scraped items), and a RabbitMQ scheduler for Scrapy (for distributing links among workers). I have 2 local Ubuntu instances: Django, MongoDB, a scrapyd daemon, and the RabbitMQ server are running on instance1, and another scrapyd daemon is running on instance2. The RabbitMQ workers are:
- linkgenerator
- worker1
IP configuration of the instances:
- IP of local Ubuntu instance1: 192.168.0.101
- IP of local Ubuntu instance2: 192.168.0.106
List of tools used:
- MongoDB server
- RabbitMQ server
- Scrapy, Scrapyd, and the Scrapyd API
- One RabbitMQ link generator worker (worker name: linkgenerator) server, with Scrapy installed and a scrapyd daemon running, on local Ubuntu instance1: 192.168.0.101
- Another RabbitMQ scraper worker (worker name: worker1) server, with Scrapy installed and a scrapyd daemon running, on local Ubuntu instance2: 192.168.0.106
instance1 (192.168.0.101): Django, the RabbitMQ server, and a scrapyd daemon are running here.
instance2 (192.168.0.106): Scrapy is installed here and a scrapyd daemon is running.
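As a basic sanity check, both daemons respond to the Scrapyd HTTP API when queried from instance1. The snippet below is only a rough sketch of the check I run; it assumes the python-scrapyd-api package (which I believe the control panel also uses) and the project name from my deployment:

from scrapyd_api import ScrapydAPI  # pip install python-scrapyd-api

# Scrapyd daemon addresses on instance1 and instance2
for address in ('http://192.168.0.101:6800', 'http://192.168.0.106:6800'):
    scrapyd = ScrapydAPI(address)
    # Should print ['tester2_fda_trial20'] once the project egg has been deployed
    print(address, scrapyd.list_projects())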
Scrapy control panel UI snapshot:
RabbitMQ status info:
The linkgenerator worker can push messages to the RabbitMQ queue, and the linkgenerator spider generates the start_urls.
The scraper spider is consumed by the scraper (worker1), which is not working; please see the worker1 logs at the end of this post.
RabbitMQ settings
The file below contains the MongoDB and RabbitMQ settings:
scheduler = ".rabbitmq.scheduler.scheduler" scheduler_persist = true rabbitmq_host = 'scrapydevu79' rabbitmq_port = 5672 rabbitmq_username = 'guest' rabbitmq_password = 'guest' mongodb_public_address = 'onescience:27017' # shown on web interface, won't used connecting db mongodb_uri = 'localhost:27017' # actual uri connect db mongodb_user = 'tariq' mongodb_password = 'toor' mongodb_sharded = true mongodb_buffer_data = 100 # set link generator worker address here link_generator = 'http://192.168.0.101:6800' scrapers = ['http://192.168.0.106:6800'] linux_user_creation_enabled = false # set true if want linux user account
linkgenerator scrapy.cfg settings:

[settings]
default = tester2_fda_trial20.settings

[deploy:linkgenerator]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
scraper scrapy.cfg settings:

[settings]
default = tester2_fda_trial20.settings

[deploy:worker1]
url = http://192.168.0.101:6800
project = tester2_fda_trial20
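For completeness, scheduling a spider outside the web UI goes through Scrapyd's schedule.json endpoint; a hypothetical manual run against worker1's daemon would look roughly like this (the spider name is assumed to match the project name, as it does in the logs further below):

import requests

# Assumed spider name for illustration; use whatever listspiders.json reports
payload = {'project': 'tester2_fda_trial20', 'spider': 'tester2_fda_trial20'}

response = requests.post('http://192.168.0.106:6800/schedule.json', data=payload)
print(response.json())  # expect something like {'status': 'ok', 'jobid': '...'}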
scrapyd.conf settings for instance1 (192.168.0.101), from cat /etc/scrapyd/scrapyd.conf:
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
scrapyd.conf settings for instance2 (192.168.0.106), from cat /etc/scrapyd/scrapyd.conf:
[scrapyd]
eggs_dir = /var/lib/scrapyd/eggs
dbs_dir = /var/lib/scrapyd/dbs
items_dir = /var/lib/scrapyd/items
logs_dir = /var/log/scrapyd
max_proc = 0
max_proc_per_cpu = 4
finished_to_keep = 100
poll_interval = 5.0
bind_address = 0.0.0.0
#bind_address = 127.0.0.1
http_port = 6800
debug = on
runner = scrapyd.runner
application = scrapyd.app.application
launcher = scrapyd.launcher.Launcher
webroot = scrapyd.website.Root

[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions.json = scrapyd.webservice.ListVersions
listspiders.json = scrapyd.webservice.ListSpiders
delproject.json = scrapyd.webservice.DeleteProject
delversion.json = scrapyd.webservice.DeleteVersion
listjobs.json = scrapyd.webservice.ListJobs
daemonstatus.json = scrapyd.webservice.DaemonStatus
RabbitMQ status, from sudo service rabbitmq-server status:
[sudo] password for mtaziz:
Status of node rabbit@scrapydevu79
[{pid,53715},
 {running_applications,
    [{rabbitmq_shovel_management,"management extension shovel plugin","3.6.11"},
     {rabbitmq_shovel,"data shovel rabbitmq","3.6.11"},
     {rabbitmq_management,"rabbitmq management console","3.6.11"},
     {rabbitmq_web_dispatch,"rabbitmq web dispatcher","3.6.11"},
     {rabbitmq_management_agent,"rabbitmq management agent","3.6.11"},
     {rabbit,"rabbitmq","3.6.11"},
     {os_mon,"cpo cxc 138 46","2.2.14"},
     {cowboy,"small, fast, modular http server.","1.0.4"},
     {ranch,"socket acceptor pool tcp protocols.","1.3.0"},
     {ssl,"erlang/otp ssl application","5.3.2"},
     {public_key,"public key infrastructure","0.21"},
     {cowlib,"support library manipulating web protocols.","1.0.2"},
     {crypto,"crypto version 2","3.2"},
     {amqp_client,"rabbitmq amqp client","3.6.11"},
     {rabbit_common,
        "modules shared rabbitmq-server , rabbitmq-erlang-client","3.6.11"},
     {inets,"inets cxc 138 49","5.9.7"},
     {mnesia,"mnesia cxc 138 12","4.11"},
     {compiler,"erts cxc 138 10","4.9.4"},
     {xmerl,"xml parser","1.3.5"},
     {syntax_tools,"syntax tools","1.6.12"},
     {asn1,"the erlang asn1 compiler version 2.0.4","2.0.4"},
     {sasl,"sasl cxc 138 11","2.3.4"},
     {stdlib,"erts cxc 138 10","1.19.4"},
     {kernel,"erts cxc 138 10","2.16.4"}]},
 {os,{unix,linux}},
 {erlang_version,
    "erlang r16b03 (erts-5.10.4) [source] [64-bit] [smp:4:4] [async-threads:64] [kernel-poll:true]\n"},
 {memory,
    [{connection_readers,0},
     {connection_writers,0},
     {connection_channels,0},
     {connection_other,6856},
     {queue_procs,145160},
     {queue_slave_procs,0},
     {plugins,1959248},
     {other_proc,22328920},
     {metrics,160112},
     {mgmt_db,655320},
     {mnesia,83952},
     {other_ets,2355800},
     {binary,96920},
     {msg_index,47352},
     {code,27101161},
     {atom,992409},
     {other_system,31074022},
     {total,87007232}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"},{http,15672,"::"}]},
 {vm_memory_calculation_strategy,rss},
 {vm_memory_high_watermark,0.4},
 {vm_memory_limit,3343646720},
 {disk_free_limit,50000000},
 {disk_free,56257699840},
 {file_descriptors,
    [{total_limit,924},{total_used,2},{sockets_limit,829},{sockets_used,0}]},
 {processes,[{limit,1048576},{used,351}]},
 {run_queue,0},
 {uptime,34537},
 {kernel,{net_ticktime,60}}]
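The status output only shows that the broker is up, so to see whether the linkgenerator actually left messages for worker1 I inspect the queue depth with a passive declare. The queue name below is only a guess for illustration, since the real name depends on how the RabbitMQ scheduler names its queues:

import pika

credentials = pika.PlainCredentials('guest', 'guest')
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='192.168.0.101', port=5672, credentials=credentials))
channel = connection.channel()

# passive=True only inspects an existing queue and never creates one;
# 'tester2_fda_trial20' is an assumed queue name
status = channel.queue_declare(queue='tester2_fda_trial20', passive=True)
print('messages waiting:', status.method.message_count)
connection.close()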
Scrapyd daemon running status on instance1 (192.168.0.101), started with the scrapyd command:
2017-09-11T06:16:07+0600 [-] Loading /home/mtaziz/.virtualenvs/onescience_dist_env/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:16:07+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:16:07+0600 [-] Loaded.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/onescience_dist_env/bin/python 2.7.6) starting up.
2017-09-11T06:16:07+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:16:07+0600 [-] Site starting on 6800
2017-09-11T06:16:07+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7f5e265c77a0>
2017-09-11T06:16:07+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:16:07+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:16:07 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
Scrapyd daemon running status on instance2 (192.168.0.106), started with the scrapyd command:
2017-09-11T06:09:28+0600 [-] Loading /home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapyd/txapp.py...
2017-09-11T06:09:28+0600 [-] Scrapyd web console available at http://0.0.0.0:6800/
2017-09-11T06:09:28+0600 [-] Loaded.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] twistd 17.5.0 (/home/mtaziz/.virtualenvs/scrapydevenv/bin/python 2.7.6) starting up.
2017-09-11T06:09:28+0600 [twisted.scripts._twistd_unix.UnixAppLogger#info] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-09-11T06:09:28+0600 [-] Site starting on 6800
2017-09-11T06:09:28+0600 [twisted.web.server.Site#info] Starting factory <twisted.web.server.Site instance at 0x7fbe6eaeac20>
2017-09-11T06:09:28+0600 [Launcher] Scrapyd 1.2.0 started: max_proc=16, runner='scrapyd.runner'
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
2017-09-11T06:09:32+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:32 +0000] "GET /listjobs.json?project=tester2_fda_trial20 HTTP/1.1" 200 92 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listprojects.json HTTP/1.1" 200 98 "-" "python-requests/2.18.4"
2017-09-11T06:09:37+0600 [twisted.python.log#info] "192.168.0.101" - - [11/Sep/2017:00:09:37 +0000] "GET /listversions.json?project=tester2_fda_trial20 HTTP/1.1" 200 80 "-" "python-requests/2.18.4"
worker1 logs, after updating the code with the new RabbitMQ server settings, following the suggestions made by @Tarun Lalwani.
The suggestion was to use the RabbitMQ server IP 192.168.0.101:5672 instead of 127.0.0.1:5672. After I updated it as Tarun Lalwani suggested, I got the new problems below:
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tester2_fda_trial20)
2017-09-11 15:49:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tester2_fda_trial20.spiders', 'ROBOTSTXT_OBEY': True, 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['tester2_fda_trial20.spiders'], 'BOT_NAME': 'tester2_fda_trial20', 'FEED_URI': 'file:///var/lib/scrapyd/items/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.jl', 'SCHEDULER': 'tester2_fda_trial20.rabbitmq.scheduler.Scheduler', 'TELNETCONSOLE_ENABLED': False, 'LOG_FILE': '/var/log/scrapyd/tester2_fda_trial20/tester2_fda_trial20/79b1123a96d611e79276000c29bad697.log'}
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-11 15:49:18 [scrapy.middleware] INFO: Enabled item pipelines:
['tester2_fda_trial20.pipelines.FdaTrial20Pipeline',
 'tester2_fda_trial20.mongodb.scrapy_mongodb.MongoDBPipeline']
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider opened
2017-09-11 15:49:18 [pika.adapters.base_connection] INFO: Connecting to 192.168.0.101:5672
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Created channel=1
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Closing spider (shutdown)
2017-09-11 15:49:18 [pika.adapters.blocking_connection] INFO: Channel.close(0, Normal shutdown)
2017-09-11 15:49:18 [pika.channel] INFO: Channel.close(0, Normal shutdown)
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.close_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7f94878b8c50>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 201, in close_spider
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'
2017-09-11 15:49:18 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.spider_closed of <tester2fda_trial20spider 'tester2_fda_trial20' at 0x7f9484f897d0>>
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/tmp/user/1000/tester2_fda_trial20-10-d4req9.egg/tester2_fda_trial20/spiders/tester2_fda_trial20.py", line 28, in spider_closed
AttributeError: 'tester2fda_trial20spider' object has no attribute 'statstask'
2017-09-11 15:49:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2017, 9, 11, 9, 49, 18, 159896),
 'log_count/ERROR': 2,
 'log_count/INFO': 10}
2017-09-11 15:49:18 [scrapy.core.engine] INFO: Spider closed (shutdown)
2017-09-11 15:49:18 [twisted] CRITICAL: Unhandled error in Deferred:
2017-09-11 15:49:18 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 95, in crawl
    six.reraise(*exc_info)
  File "/home/mtaziz/.virtualenvs/scrapydevenv/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
OperationFailure: command SON([('saslStart', 1), ('mechanism', 'SCRAM-SHA-1'), ('payload', Binary('n,,n=tariq,r=mjy5otq0otywmja4', 0)), ('autoAuthorize', 1)]) on namespace admin.$cmd failed: Authentication failed.
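The very last traceback is a MongoDB authentication failure rather than anything RabbitMQ-specific, so I also tried to reproduce the login outside Scrapy with a small pymongo sketch. It mirrors the URI that the pipeline below builds from MONGODB_USER, MONGODB_PASSWORD and MONGODB_URI; adjust the host part for wherever the worker should find MongoDB:

from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Same user:password@host/admin construction as the pipeline's open_spider()
uri = 'mongodb://tariq:toor@localhost:27017/admin'

client = MongoClient(uri)
try:
    # Triggers the same SCRAM authentication that fails in the worker1 log
    print(client.admin.command('ping'))
except OperationFailure as exc:
    print('authentication failed:', exc)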
MongoDBPipeline:
# coding:utf-8

import datetime

from pymongo import errors
from pymongo.mongo_client import MongoClient
from pymongo.mongo_replica_set_client import MongoReplicaSetClient
from pymongo.read_preferences import ReadPreference

from scrapy.exporters import BaseItemExporter

try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote


def not_set(string):
    """ Check if a string is None or ''. Returns True if the string is empty. """
    if string is None:
        return True
    elif string == '':
        return True
    return False


class MongoDBPipeline(BaseItemExporter):
    """ MongoDB pipeline class """
    # Default options
    config = {
        'uri': 'mongodb://localhost:27017',
        'fsync': False,
        'write_concern': 0,
        'database': 'scrapy-mongodb',
        'collection': 'items',
        'replica_set': None,
        'buffer': None,
        'append_timestamp': False,
        'sharded': False
    }

    # Needed for sending acknowledgement signals to RabbitMQ for all persisted items
    queue = None
    acked_signals = []

    # Item buffer
    item_buffer = dict()

    def load_spider(self, spider):
        self.crawler = spider.crawler
        self.settings = spider.settings
        self.queue = self.crawler.engine.slot.scheduler.queue

    def open_spider(self, spider):
        self.load_spider(spider)

        # Configure the connection
        self.configure()
        self.spidername = spider.name
        self.config['uri'] = 'mongodb://' + self.config['username'] + ':' + quote(self.config['password']) + '@' + self.config['uri'] + '/admin'
        self.shardedcolls = []

        if self.config['replica_set'] is not None:
            self.connection = MongoReplicaSetClient(
                self.config['uri'],
                replicaSet=self.config['replica_set'],
                w=self.config['write_concern'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY_PREFERRED)
        else:
            # Connecting to a stand-alone MongoDB
            self.connection = MongoClient(
                self.config['uri'],
                fsync=self.config['fsync'],
                read_preference=ReadPreference.PRIMARY)

        # Set up the collection
        self.database = self.connection[spider.name]

        # Autoshard the DB
        if self.config['sharded']:
            db_statuses = self.connection['config']['databases'].find({})
            partitioned = []
            notpartitioned = []
            for status in db_statuses:
                if status['partitioned']:
                    partitioned.append(status['_id'])
                else:
                    notpartitioned.append(status['_id'])
            if spider.name in notpartitioned or spider.name not in partitioned:
                try:
                    self.connection.admin.command('enableSharding', spider.name)
                except errors.OperationFailure:
                    pass
            else:
                collections = self.connection['config']['collections'].find({})
                for coll in collections:
                    if (spider.name + '.') in coll['_id']:
                        if coll['dropped'] is not True:
                            if coll['_id'].index(spider.name + '.') == 0:
                                self.shardedcolls.append(coll['_id'][coll['_id'].index('.') + 1:])

    def configure(self):
        """ Configure the MongoDB connection """
        # Set all regular options
        options = [
            ('uri', 'MONGODB_URI'),
            ('fsync', 'MONGODB_FSYNC'),
            ('write_concern', 'MONGODB_REPLICA_SET_W'),
            ('database', 'MONGODB_DATABASE'),
            ('collection', 'MONGODB_COLLECTION'),
            ('replica_set', 'MONGODB_REPLICA_SET'),
            ('buffer', 'MONGODB_BUFFER_DATA'),
            ('append_timestamp', 'MONGODB_ADD_TIMESTAMP'),
            ('sharded', 'MONGODB_SHARDED'),
            ('username', 'MONGODB_USER'),
            ('password', 'MONGODB_PASSWORD')
        ]

        for key, setting in options:
            if not not_set(self.settings[setting]):
                self.config[key] = self.settings[setting]

    def process_item(self, item, spider):
        """ Process the item and add it to MongoDB """
        item_name = item.__class__.__name__

        # If we are working with a sharded DB, the collection will also be sharded
        if self.config['sharded']:
            if item_name not in self.shardedcolls:
                try:
                    self.connection.admin.command('shardCollection', '%s.%s' % (self.spidername, item_name), key={'_id': "hashed"})
                    self.shardedcolls.append(item_name)
                except errors.OperationFailure:
                    self.shardedcolls.append(item_name)

        itemtoinsert = dict(self._get_serialized_fields(item))

        if self.config['buffer']:
            if item_name not in self.item_buffer:
                self.item_buffer[item_name] = []
                self.item_buffer[item_name].append([])
                self.item_buffer[item_name].append(0)

            self.item_buffer[item_name][1] += 1

            if self.config['append_timestamp']:
                itemtoinsert['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}

            self.item_buffer[item_name][0].append(itemtoinsert)

            if self.item_buffer[item_name][1] == self.config['buffer']:
                self.item_buffer[item_name][1] = 0
                self.insert_item(self.item_buffer[item_name][0], spider, item_name)

            return item

        self.insert_item(itemtoinsert, spider, item_name)
        return item

    def close_spider(self, spider):
        """ Method called when the spider is closed """
        for key in self.item_buffer:
            if self.item_buffer[key][0]:
                self.insert_item(self.item_buffer[key][0], spider, key)

    def insert_item(self, item, spider, item_name):
        """ Insert the item(s) into MongoDB and acknowledge them on the RabbitMQ queue """
        self.collection = self.database[item_name]

        if not isinstance(item, list):
            if self.config['append_timestamp']:
                item['scrapy-mongodb'] = {'ts': datetime.datetime.utcnow()}
            ack_signal = item['ack_signal']
            item.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            if ack_signal not in self.acked_signals:
                self.queue.acknowledge(ack_signal)
                self.acked_signals.append(ack_signal)
        else:
            signals = []
            for eachitem in item:
                signals.append(eachitem['ack_signal'])
                eachitem.pop('ack_signal', None)
            self.collection.insert(item, continue_on_error=True)
            del item[:]
            for ack_signal in signals:
                if ack_signal not in self.acked_signals:
                    self.queue.acknowledge(ack_signal)
                    self.acked_signals.append(ack_signal)
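For what it's worth, my reading of open_spider() together with the settings shown earlier is that on worker1 the pipeline ends up connecting with a URI like the following (a reconstruction, not an actual log line); since MONGODB_URI is 'localhost:27017', that host is resolved on whichever machine the spider happens to run on:

# With MONGODB_USER = 'tariq', MONGODB_PASSWORD = 'toor', MONGODB_URI = 'localhost:27017'
# the pipeline's open_spider() builds:
uri = 'mongodb://' + 'tariq' + ':' + 'toor' + '@' + 'localhost:27017' + '/admin'
# i.e. 'mongodb://tariq:toor@localhost:27017/admin', authenticating against the admin database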
To sum up, I believe the problem lies with the scrapyd daemons running on the two instances: somehow the scraper (worker1) cannot access them properly. I could not figure it out and did not find any similar use cases on Stack Overflow.
Any help would be highly appreciated in this regard. Thanks in advance!