I need a regular expression that matches the "/page-2" or "/page-3" part of a bigger URL such as http://domain.com/articles/page-number
So far, I have tried these combinations: '/page-\d' '/page-\d' '\b/page-\d\b'
Please note that I am using the regex as part of the rules for the start_urls section in a Scrapy project. Any suggestions are appreciated. Here's the code snippet:
    class ndtvxolonewsitem(CrawlSpider):
        name = "ndtvxolonews"
        allowed_domains = ["http://gadgets.ndtv.com/tags/"]
        start_urls = ["http://gadgets.ndtv.com/tags/xolo/articles"]
        rules = [Rule(LinkExtractor(allow=['\b/page\-\d\b']))]
allowed_domains should contain only the domain name, with no scheme and no path. You can filter on a specific path by including the start of the URL in the regex:
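    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class ndtvxolonewsitem(CrawlSpider):
        name = "ndtvxolonews"
        # Domain name only: no scheme and no path
        allowed_domains = ["gadgets.ndtv.com"]
        start_urls = ["http://gadgets.ndtv.com/tags/xolo/articles"]
        # Raw string so the regex backslashes are passed through unchanged
        rules = [Rule(LinkExtractor(allow=[r'http://gadgets.ndtv.com/tags/.*/page\-\d+']))]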
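If you want to check the pattern before running the spider, a quick test with the plain re module may help. This is only a sketch: the sample URLs are made up for illustration (modelled on the start URL above), and it assumes the link extractor looks for the pattern anywhere in the URL, which is what re.search does here. It also shows one likely reason the '\b...' attempt failed: without the r prefix, Python turns '\b' into a backspace character before the regex engine ever sees it.

```python
import re

# Hypothetical sample URLs, modelled on the start URL in the question.
urls = [
    "http://gadgets.ndtv.com/tags/xolo/articles",
    "http://gadgets.ndtv.com/tags/xolo/articles/page-2",
    "http://gadgets.ndtv.com/tags/xolo/articles/page-13",
    "http://gadgets.ndtv.com/news/page-2",
]

# Pattern from the answer, written as a raw string.
pattern = re.compile(r'http://gadgets.ndtv.com/tags/.*/page\-\d+')

# Keep only the URLs that contain the pattern somewhere.
matched = [u for u in urls if pattern.search(u)]
for u in matched:
    print(u)

# In a normal (non-raw) string, '\b' is the backspace character,
# so it can never act as a regex word boundary.
print('\b' == '\x08')

# With a raw string, the word-boundary version does find paginated URLs.
word_boundary = re.compile(r'\b/page-\d+\b')
print(bool(word_boundary.search(urls[1])))
```

Only the two paginated URLs under /tags/ end up in matched. The /news/page-2 URL is dropped because the pattern requires the /tags/ prefix.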