Getting Started with Python: Scraping Web Page Data

zwhc

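I am currently working through this article:
"Brute-force extraction of book content from NetEase Reading (网易读书) with Python"
http://xanpeng.iteye.com/blog/816748

It is a good one, and just right for learning the basics.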

1. It uses feedparser:
Tip: Taming RSS with the Universal Feed Parser
http://www.ibm.com/developerworks/cn/xml/x-tipufp.html
Visit feedparser.org to learn more about the Universal Feed Parser, including downloads and documentation.

The actual feedparser download page:
http://code.google.com/p/feedparser/downloads/list

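2. The output file also needs a UTF-8 BOM at the start, which means writing raw bytes from Python using hex escapes:
http://linux.byexamples.com/archives/478/python-writing-binary-file/
Writing hex-escaped bytes in Python (the byte values here are only sample values; the UTF-8 BOM itself is \xef\xbb\xbf, which is what the script in this post writes):
file.write("\x5F\x9D\x3E")
file.close()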
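
As a minimal sketch of step 2, the standard library's `codecs.BOM_UTF8` constant holds the three BOM bytes, so they do not have to be typed by hand. This sketch assumes Python 3 (the script in this post is Python 2), and the file name `bom_demo.txt` is made up for illustration.

```python
import codecs
import os
import tempfile


def write_with_bom(path, text):
    """Write text as UTF-8, preceded by the UTF-8 byte order mark."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8)       # the bytes EF BB BF
        f.write(text.encode("utf-8"))


# Hypothetical output path, used only for this demonstration.
demo_path = os.path.join(tempfile.mkdtemp(), "bom_demo.txt")
write_with_bom(demo_path, u"网易读书")

# Read the file back as raw bytes to inspect what was written.
with open(demo_path, "rb") as f:
    raw = f.read()
```

In Python 3, opening a text file with `encoding="utf-8-sig"` writes the same BOM automatically.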

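3. Since I am still debugging, it is more convenient to change the file open mode to w (each run then overwrites the previous output).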
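
The effect of step 3 can be seen in a small sketch. Which mode the referenced article used is not stated here, so the comparison against append mode "a" is an assumption; the file names are made up, and the code is written for Python 3.

```python
import os
import tempfile


def write_run(path, mode, line):
    """Open path in the given mode, write one line, and close it."""
    with open(path, mode) as f:
        f.write(line + "\n")


def read_all(path):
    """Return the whole file as text."""
    with open(path) as f:
        return f.read()


tmp_dir = tempfile.mkdtemp()
w_path = os.path.join(tmp_dir, "mode_w.txt")   # hypothetical file names
a_path = os.path.join(tmp_dir, "mode_a.txt")

# Simulate running a script twice while debugging.
for run in ("first run", "second run"):
    write_run(w_path, "w", run)   # truncates, so only the last run survives
    write_run(a_path, "a", run)   # appends, so output from every run piles up

w_content = read_all(w_path)
a_content = read_all(a_path)
```

With "w", the file always reflects the latest run only, which makes it easier to compare results between debugging attempts.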

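# Python 2 script (print statement, urllib.urlopen).
import urllib
import re
# Private feedparser helper (note the leading underscore), so it may not
# exist in other feedparser versions.
from feedparser import _getCharacterEncoding as enc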

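class TagParser:
    """Pulls <start ...>...</end> fragments out of an HTML string by regex."""
    def __init__(self, value):
        self.value = value
    def get(self, start, end):
        # '.' does not match newlines here, so each match stays on one line.
        regx = re.compile(r'<' + start + r'.*?>.*</' + end + r'>')
        return re.findall(regx, self.value)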

if __name__ == "__main__":
    baseurl = "http://data.book.163.com/book/section/000BAfLU/000BAfLU"
    f = open("test_01.txt", "w")
    f.write("\xef\xbb\xbf")  # UTF-8 BOM
#    for ndx in range(0, 56):
    for ndx in range(0, 1):
        url = baseurl + str(ndx) + ".html"
        print "get content from " + url
        src = urllib.urlopen(url)
        text = src.read()

        # Save the raw page to a temp file for debugging.
        f1 = open("tmp_" + str(ndx) + ".txt", "w")
        f1.write(text)
        f1.close()

        encoding = enc(src.headers, text)[0]
   
        tp = TagParser(text)
   
        title = tp.get('h1 class="f26s tC"', 'h1')
        article = tp.get('p class="ti2em"', 'p')
   
        t = re.sub(r'</.+>', '\n', title[0])
        t = re.sub(r'<.+>', '\n', t)
        data = t
   
        c = ""
        for p in article:
            pt = re.sub(r'</p>', '\n', p)
            c += pt
        c = re.sub(r'<.+>', '\n', c)
        data += c
        data = data.decode(encoding)
        f.write(data.encode('utf-8', 'ignore'))
   
    f.close()
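
The tag extraction and cleanup steps can also be tried without touching the network. The sketch below is not the script above: it is written for Python 3, runs on made-up sample markup, and swaps in a non-greedy pattern with `re.DOTALL` so that elements spanning several lines, or several elements on one line, are matched one at a time. The helper names `find_tags` and `strip_tags` are hypothetical.

```python
import re

# Made-up sample markup; the class names mirror the ones used in the script.
SAMPLE_HTML = (
    '<h1 class="f26s tC">Chapter <b>One</b></h1>\n'
    '<p class="ti2em">First paragraph.</p>\n'
    '<p class="ti2em">Second <i>paragraph</i>,\nwrapped over two lines.</p>\n'
)


def find_tags(html, start, end):
    """Return every <start ...>...</end> fragment, matched non-greedily."""
    pattern = re.compile(
        r'<' + re.escape(start) + r'[^>]*>.*?</' + re.escape(end) + r'>',
        re.DOTALL,  # let '.' cross newlines inside an element
    )
    return pattern.findall(html)


def strip_tags(fragment):
    """Remove every tag, keeping only the text between them."""
    return re.sub(r'<[^>]+>', '', fragment)


title = strip_tags(find_tags(SAMPLE_HTML, 'h1 class="f26s tC"', 'h1')[0])
paragraphs = [
    strip_tags(p) for p in find_tags(SAMPLE_HTML, 'p class="ti2em"', 'p')
]
```

Regex matching like this is fine for a quick experiment on a page whose layout is known, but it is fragile; a real HTML parser is usually the safer choice once the markup gets more complicated.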
