python - The Scrapy LinkExtractor(allow=(url)) crawls the wrong pages, the regex doesn't work

I want to crawl the page http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie. Here is part of my spider code:

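    from scrapy.http import FormRequest
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class MovieSpider(CrawlSpider):
        name = "doubanmovie"
        allowed_domains = ["douban.com"]
        start_urls = ["http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie"]
        rules = (
            # Follow the paginated tag pages (this is the pattern in question).
            Rule(LinkExtractor(allow=(r'http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie\?start=\d{2}'))),
            # Parse each movie detail page.
            Rule(LinkExtractor(allow=(r"http://movie.douban.com/subject/\d+")), callback="parse_item"),
        )

        def start_requests(self):
            # Send the first request with a browser-like User-Agent header.
            yield FormRequest(
                "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie",
                headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0'},
            )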

I only want to crawl pages matching "\?start=\d{2}", but the Scrapy spider also crawls pages such as "?start=100" or "?start=1000". What is wrong with it, and how can I solve it? Thanks in advance.

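The regular expression \d{2} matches every number that starts with two digits, because nothing in the pattern says the number has to end there. That is why "?start=100" and "?start=1000" are accepted as well.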

If you want to limit the regular expression to exactly two digits, you can use \d{2}$, which matches only if the two digits are at the end of the line.

More generally, use \d{2}\b, which requires that the two digits be followed by a non-alphanumeric character, whitespace, or the end of the string.
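The difference between the three patterns can be checked with plain Python, without running a spider. The following is a minimal sketch, assuming the allow pattern is applied as a regex search against each candidate URL; the sample URLs, the `BASE` prefix (with the dots escaped), and the helper `matching_urls` are made up for illustration and are not part of the original spider.

```python
import re

# Common prefix of the paginated tag URLs (dots escaped; assumption for this sketch).
BASE = r"http://www\.douban\.com/tag/%e7%88%b1%e6%83%85/movie\?start="

# Hypothetical sample links, as they might appear on the tag page.
urls = [
    "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie?start=15",
    "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie?start=100",
    "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie?start=1000",
]

patterns = {
    "unanchored": BASE + r"\d{2}",       # pattern from the question
    "end_of_string": BASE + r"\d{2}$",   # two digits, then the URL must end
    "word_boundary": BASE + r"\d{2}\b",  # two digits, then no further digit or letter
}


def matching_urls(pattern, candidates):
    """Return the candidates in which the pattern is found anywhere (re.search)."""
    regex = re.compile(pattern)
    return [url for url in candidates if regex.search(url)]


results = {name: matching_urls(p, urls) for name, p in patterns.items()}

for name, matched in results.items():
    # Print only the value of the start parameter to keep the output short.
    print(name, [url.rsplit("=", 1)[1] for url in matched])
```

The two anchored variants differ once the URL continues after the number: \d{2}\b would still accept something like "?start=15&sort=time", while \d{2}$ would not, so the choice depends on whether extra query parameters can follow.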
