python - Scrapy LinkExtractor(allow=(url)) crawls the wrong pages; the regular expression doesn't work
I want to crawl the page http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie. Here is part of my spider code:
    class MovieSpider(CrawlSpider):
        name = "doubanmovie"
        allowed_domains = ["douban.com"]
        start_urls = ["http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie"]
        rules = (
            Rule(LinkExtractor(allow=(r'http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie\?start=\d{2}'))),
            Rule(LinkExtractor(allow=(r"http://movie.douban.com/subject/\d+")), callback="parse_item"),
        )

        def start_requests(self):
            yield FormRequest(
                "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie",
                headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:37.0) Gecko/20100101 Firefox/37.0'},
            )
I only want to crawl pages matching "\?start=\d{2}", but the Scrapy spider also crawls pages such as "\?start=100" or "\?start=1000". What's wrong with it, and how can I solve it? Thanks in advance.
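The regular expression \d{2} matches any number that starts with two digits, so "start=100" and "start=1000" match as well: nothing in the pattern says the number has to stop after the second digit.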
If you want to limit the regular expression to exactly two digits, you can use \d{2}$ at the end of the pattern from the question. It matches only if there are two digits at the end of the line.
More generally, you can use \d{2}\b, which requires a word boundary after the two digits: a non-alphanumeric character, whitespace, or the end of the string has to follow.
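The difference between the three patterns can be checked with plain `re`, outside Scrapy. This is only a sketch: it assumes, as the behaviour in the question suggests, that the allow pattern is searched for anywhere in the URL rather than required to match the whole URL. The `?start=15`, `?start=100` and `?start=1000` URLs are illustrative test inputs, and the names `BASE`, `patterns` and `matching_urls` are made up for this example.

```python
import re

# Illustrative URLs built from the tag page in the question.
BASE = "http://www.douban.com/tag/%e7%88%b1%e6%83%85/movie"
urls = [BASE + "?start=15", BASE + "?start=100", BASE + "?start=1000"]

# The three variants discussed above (hypothetical labels).
patterns = {
    "unanchored": r"\?start=\d{2}",        # original pattern from the question
    "end_anchor": r"\?start=\d{2}$",       # two digits, then end of string
    "word_boundary": r"\?start=\d{2}\b",   # two digits, then a word boundary
}


def matching_urls(pattern):
    # re.search finds the pattern anywhere in the string,
    # so a longer number still contains a two-digit prefix.
    return [u for u in urls if re.search(pattern, u)]


results = {name: matching_urls(p) for name, p in patterns.items()}

for name, matched in results.items():
    # Print only the query part of each matched URL.
    print(name, [u.rsplit("?", 1)[1] for u in matched])
```

With these inputs, the unanchored pattern accepts all three URLs, while the `$` and `\b` variants accept only the two-digit one. The `\b` form is the more forgiving choice if another query parameter could follow, for example `?start=15&sort=time`, where `$` would no longer match.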