python抓取淘宝MM信息

Posted by Lazy on October 24, 2016

##我们打开这个地址 https://mm.taobao.com/json/request_top_list.htm?page=1

Paste_Image.png

首先我们先拿到帖子的标题,通过查看源码,我们发现,他的标题的html为:

<p class="top">
					<a class="lady-name" href="//mm.taobao.com/self/model_card.htm?user_id=687471686" target="_blank">田媛媛</a>
					<em><strong>26</strong></em>
					<span>广州市</span>
										<span class="friend-follow J_FriendFollow" data-custom="type=14&amp;app_id=12052609" data-group="" data-userid="687471686">加关注</span>
					</p>

我们需要中间的标题怎么搞呢?,我要拿到MM的姓名,年龄,那个城市,我总不能一个匹配一次吧,这个就需要组合匹配了?

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
import urllib2
import re

page = 1
url = 'https://mm.taobao.com/json/request_top_list.htm?' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('gbk')
    pattern =re.compile('<a class="lady-name".*?>(.*?)</a>.*?<strong>(.*?)</strong>.*?<span>(.*?)</span>', re.S)
    items = re.findall(pattern, content)
    # 分别对应的是 昵称,年龄,城市
    for item in items:
        print item[0],item[1],item[2]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason


这里需要注意,如果组合多个条件,每当结束一个匹配之后,.*?来标识结束

Paste_Image.png