
scrapy

Best used to obtain one "stream" of data at a time, rather than trying to pull data from several different kinds of pages in a single run

scrapy runspider spider.py -o file.json
Display HTML source of the scraped page
print(response.text)
Fetch {URL} in the Scrapy shell
fetch('url')
Select elements with a CSS selector
# Returns a `SelectorList`
response.css('p')
# Retrieve full HTML elements
response.css('p').extract()
Retrieve only the text within the element
response.css('p::text').extract()
# Get just the first match; extract_first() returns None if nothing matches,
# while extract()[0] raises an error on an empty result
response.css('p::text').extract_first()
response.css('p::text').extract()[0]
Get the href attribute value for an anchor tag
response.css('a').attrib['href']
Launch the Scrapy shell and fetch $URL for interactive inspection
scrapy shell $URL
Generate a default spider named {quotes}, restricted to crawling {domain}
scrapy genspider quotes domain
Run a spider
scrapy runspider scrapy1.py
Run a spider, saving scraped data to a JSON file
scrapy runspider spider.py -o items.json
The parse method contains most of the spider's logic; items are returned with the yield keyword. To scrape multiple items, find a structural element to iterate over and yield data once per iteration
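
A minimal sketch of such a spider, assuming a quotes page where each item lives in a div.quote element (the selectors and field names here are illustrative assumptions, not from the original):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # One div.quote per item: iterate over them and yield data for each
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }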

Extract a URL from a link using standard CSS selection techniques
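
For example, the href of a "next page" link can be selected with an ::attr() pseudo-element (the li.next selector is an assumption about the page's markup):

next_page_url = response.css('li.next a::attr(href)').extract_first()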

Add the domain name to a relative link

response.urljoin(next_page_url)
Recursively call the parse method on the next page
yield scrapy.Request(url=next_page_url, callback=self.parse)
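
A sketch of pagination inside parse, assuming the next-page link sits in a li.next element (the selector is an assumption):

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {'text': quote.css('span.text::text').extract_first()}
    # Follow the next page, if any, and parse it with this same method
    next_page_url = response.css('li.next a::attr(href)').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)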
Scrape detail pages: parse_details is a spider method defined as a sibling of the main parse method. If a detail page has more information than the main page, the item should be yielded in parse_details
yield scrapy.Request(url=detail_page_url, callback=self.parse_details)
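
A sketch of the two sibling methods, assuming each listing links to its detail page through an a.detail element (selectors and field names are assumptions):

def parse(self, response):
    for href in response.css('a.detail::attr(href)').extract():
        # Hand each detail page off to parse_details
        yield scrapy.Request(url=response.urljoin(href), callback=self.parse_details)

def parse_details(self, response):
    # The detail page carries the richer data, so the item is yielded here
    yield {
        'title': response.css('h1::text').extract_first(),
        'description': response.css('p.description::text').extract_first(),
    }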