<html>
   <head>
      <title>My Website</title>
   </head>
   <body>
      <span>Hello world!!!</span>
      <div class = 'links'>
         <a href = 'one.html'>Link 1<img src = 'image1.jpg'/></a>
         <a href = 'two.html'>Link 2<img src = 'image2.jpg'/></a>
         <a href = 'three.html'>Link 3<img src = 'image3.jpg'/></a>
      </div>
   </body>
</html>
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
Selector(text = body).xpath('//span/text()').extract()
[u'Hello world!!!']
response = HtmlResponse(url = 'http://mysite.com', body = body)
Selector(response = response).xpath('//span/text()').extract()
[u'Hello world!!!']
>>response.selector.xpath('//title/text()')
>>response.xpath('//title/text()').extract()
[u'My Website']
>>response.xpath('//div[@class = "links"]/a/text()').extract()
[u'Link 1', u'Link 2', u'Link 3']
>>response.xpath('//div[@class = "links"]/a/text()').extract_first()
u'Link 1'
links = response.xpath('//a[contains(@href, ".html")]')
for index, link in enumerate(links):
   args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
   print 'The link %d pointing to url %s and image %s' % args

The link 0 pointing to url [u'one.html'] and image [u'image1.jpg']
The link 1 pointing to url [u'two.html'] and image [u'image2.jpg']
The link 2 pointing to url [u'three.html'] and image [u'image3.jpg']
>>response.xpath('//a[contains(@href, ".html")]/text()').re(r'(Link\s*\d+)')
[u'Link 1', u'Link 2', u'Link 3']
>>mydiv = response.xpath('//div')
>>for p in mydiv.xpath('.//p').extract():
      print p
Prefix and usage | Namespace |
re: regular expressions | http://exslt.org/regexp/index.html |
set: set manipulation | http://exslt.org/set/index.html |