scrapy
Best used to obtain one "stream" of data at a time, rather than trying to pull data from many different kinds of pages at once
# Run a standalone spider file and export the scraped items to JSON
scrapy runspider spider.py -o file.json
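A minimal sketch of what such a spider.py could contain (the spider name, URL, and selector are hypothetical placeholders, assuming the page <title> is the one stream being scraped):

import scrapy

class TitleSpider(scrapy.Spider):
    # Hypothetical spider: name, URL, and selector are illustrative only
    name = 'titles'
    start_urls = ['https://example.com']

    def parse(self, response):
        # One stream of data; runspider's -o flag writes each yielded dict to file.json
        yield {'title': response.css('title::text').extract_first()}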
# In the Scrapy shell: fetch a page at a URL, then inspect the raw HTML
fetch(url)
print(response.text)
# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()
# Retrieve only the text inside the elements
response.css('p::text').extract()
# Retrieve just the first match; extract_first() returns None when nothing
# matches, whereas indexing with [0] raises IndexError on an empty result
response.css('p::text').extract_first()
response.css('p::text').extract()[0]
# Get the href attribute value from an anchor tag
response.css('a').attrib['href']
# Open an interactive Scrapy shell on a page
scrapy shell $URL
# Create a new spider named "quotes" scoped to the given domain
scrapy genspider quotes domain
# Run a self-contained spider file
scrapy runspider scrapy1.py
# Run a spider and export the scraped items to items.json
scrapy runspider spider.py -o items.json
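For reference, genspider writes a skeleton roughly like this (exact output varies by Scrapy version):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['domain']
    start_urls = ['https://domain']

    def parse(self, response):
        pass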
Data is produced with the `yield` keyword. For multiple items, find a structural basis for iteration (one selector that matches each item's container) and yield one piece of data per iteration, as in the sketch below.
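A minimal sketch of that pattern inside a spider class, assuming a hypothetical page where each div.quote block is one item:

def parse(self, response):
    # Each 'div.quote' block is the structural unit; yield one item per block
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
        }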
# Extract the URL from a link using standard CSS selection techniques,
# then prepend the domain to a relative link:
response.urljoin(relative_link)
# Follow pagination by calling the parse method again on the next page
yield scrapy.Request(url=next_page_url, callback=self.parse)
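Put together, a paginating parse method might look like this, assuming a hypothetical quotes page with a li.next pagination link:

def parse(self, response):
    # Yield the items found on the current page
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').extract_first()}
    # Then queue the next page, reusing this same method as the callback
    next_href = response.css('li.next a::attr(href)').extract_first()
    if next_href is not None:
        next_page_url = response.urljoin(next_href)
        yield scrapy.Request(url=next_page_url, callback=self.parse)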
`parse_details` would be a spider method, a sibling of the main `parse` method.
- If a detail page has more information than the main page, then the item `yield` should live in `parse_details`, as in the sketch below.
yield scrapy.Request(url=detail_page_url, callback=self.parse_details)
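A sketch of that two-level flow, with hypothetical selectors and field names:

def parse(self, response):
    # Main listing page: follow each detail link instead of yielding items here
    for href in response.css('a.detail::attr(href)').extract():
        detail_page_url = response.urljoin(href)
        yield scrapy.Request(url=detail_page_url, callback=self.parse_details)

def parse_details(self, response):
    # The detail page carries the richer data, so the item is yielded here
    yield {
        'title': response.css('h1::text').extract_first(),
        'description': ' '.join(response.css('div.description ::text').extract()),
    }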