How can I scrape a bank's historical interest rates with Python (the page is dynamic)?

    explist · 2015-09-05 18:38:46 +08:00 · 4729 views
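    I want to scrape the historical savings deposit rates from the ICBC website. The URL is http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx . I would like to write the crawler using only the libraries that ship with Python 3, and would appreciate some guidance:
    When I select a different date, the URL does not change. Is this AJAX?
    Pressing F12 in the browser does nothing, but I can view the page source. I extracted the list of dates from the source, and I assume the selected date has to be sent to the server somehow, but I don't know how to do that in practice.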
    18 replies    2015-09-06 23:08:35 +08:00
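    seki · #1 · 2015-09-05 18:48:27 +08:00
    The form data is submitted with a POST to the same URL; the page simply reloads after each change.

    Inspect the select element and you will see an onchange handler bound to it. From there, look in the <script> blocks; the code is not obfuscated.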
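    seki · #2 · 2015-09-05 18:51:24 +08:00 · ❤️ 1
    On the Python side, use urllib2 or requests to build the same POST request.
    As for whether the backend has any anti-scraping checks, I don't know. My conservative guess is that it doesn't; deal with it if you run into it.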
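A minimal sketch of what "build the same POST request" could look like with only the Python 3 standard library. `urllib2` is the Python 2 name; in Python 3 the equivalent lives in `urllib.request`. The helper name `build_post_request` and the placeholder field are assumptions for illustration, not something taken from the ICBC page.

```python
from urllib import parse, request


def build_post_request(url, fields):
    """Return a urllib Request that will be sent as a form-encoded POST."""
    body = parse.urlencode(fields).encode("utf-8")
    # Passing `data` is what makes urllib issue a POST instead of a GET.
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Placeholder field: the real field names have to be read from the page's form.
req = build_post_request(
    "http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx",
    {"example_field": "example value"},
)
print(req.get_method(), req.data)
```

Nothing is sent until the request is passed to `request.urlopen(req)`, so the sketch can be inspected offline first.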
    ljcarsenal · #3 · 2015-09-05 19:33:47 +08:00
    It isn't AJAX. You can see that an event handler is bound to the select's onchange.
    rwalle · #4 · 2015-09-05 19:40:23 +08:00
    Look at the Network tab.
    1130335361 · #5 · 2015-09-05 19:51:49 +08:00 · ❤️ 1
    explist (OP) · #6 · 2015-09-05 19:52:18 +08:00
    I can't open the Network tab, perhaps because this is a bank site.
    I found the onchange, but... I can't make sense of it (I know very little HTML).
    explist (OP) · #7 · 2015-09-05 19:54:57 +08:00
    from urllib import parse, request

    def ghtest():
        url = r'http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx'
        req = request.Request(url)
        req.add_header("User-Agent", '')
        g = ghHtml()  # HTMLParser subclass (definition not shown)
        with request.urlopen(req) as f:
            g.feed(f.read().decode())
        dataDict = {}
        for item in g.dates:
            dataDict['id'] = item
            log = parse.urlencode(dataDict).encode('utf-8')
            f = request.urlopen(url, log)
            # do something
            f.close()
    paradoxs · #8 · 2015-09-05 19:59:16 +08:00
    Shy07 · #9 · 2015-09-05 20:25:36 +08:00
    I wrote a Ruby version; you only need to change the date.

    ```ruby

    require 'net/http'

    params = {
    'Sel_Date' => '2012-07-06', # just change this date
    '__EVENTTARGET' => 'Sel_Date',
    '__EVENTARGUMENT' => '',
    '__LASTFOCUS' => '',
    '__VIEWSTATE' => '/wEPDwUJNDkwNDM1MTYwD2QWAgIDD2QWAgIBD2QWBmYPEGQPFiFmAgECAgIDAgQCBQIGAgcCCAIJAgoCCwIMAg0CDgIPAhACEQISAhMCFAIVAhYCFwIYAhkCGgIbAhwCHQIeAh8CIBYhEAUP6K+36YCJ5oup5pe26Ze0ZWcQBQoyMDE1LTA4LTI2BQoyMDE1LTA4LTI2ZxAFCjIwMTUtMDYtMjgFCjIwMTUtMDYtMjhnEAUKMjAxNS0wNS0xMQUKMjAxNS0wNS0xMWcQBQoyMDE1LTAzLTAxBQoyMDE1LTAzLTAxZxAFCjIwMTQtMTEtMjIFCjIwMTQtMTEtMjJnEAUKMjAxMi0wNy0wNgUKMjAxMi0wNy0wNmcQBQoyMDEyLTA2LTA4BQoyMDEyLTA2LTA4ZxAFCjIwMTEtMDctMDcFCjIwMTEtMDctMDdnEAUKMjAxMS0wNC0wNgUKMjAxMS0wNC0wNmcQBQoyMDExLTAyLTA5BQoyMDExLTAyLTA5ZxAFCjIwMTAtMTItMjYFCjIwMTAtMTItMjZnEAUKMjAxMC0xMC0yMAUKMjAxMC0xMC0yMGcQBQoyMDA4LTEyLTIzBQoyMDA4LTEyLTIzZxAFCjIwMDgtMTEtMjcFCjIwMDgtMTEtMjdnEAUKMjAwOC0xMC0zMAUKMjAwOC0xMC0zMGcQBQoyMDA4LTEwLTA5BQoyMDA4LTEwLTA5ZxAFCjIwMDctMTItMjEFCjIwMDctMTItMjFnEAUKMjAwNy0wOS0xNQUKMjAwNy0wOS0xNWcQBQoyMDA3LTA4LTIyBQoyMDA3LTA4LTIyZxAFCjIwMDctMDctMjEFCjIwMDctMDctMjFnEAUKMjAwNy0wNS0xOQUKMjAwNy0wNS0xOWcQBQoyMDA3LTAzLTE4BQoyMDA3LTAzLTE4ZxAFCjIwMDYtMDgtMTkFCjIwMDYtMDgtMTlnEAUKMjAwNC0xMC0yOQUKMjAwNC0xMC0yOWcQBQoyMDAyLTAyLTIxBQoyMDAyLTAyLTIxZxAFCjE5OTktMDYtMTAFCjE5OTktMDYtMTBnEAUKMTk5OC0xMi0wNwUKMTk5OC0xMi0wN2cQBQoxOTk4LTA3LTAxBQoxOTk4LTA3LTAxZxAFCjE5OTgtMDMtMjUFCjE5OTgtMDMtMjVnEAUKMTk5Ny0xMC0yMwUKMTk5Ny0xMC0yM2cQBQoxOTk2LTA4LTIzBQoxOTk2LTA4LTIzZxAFCjE5OTYtMDUtMDEFCjE5OTYtMDUtMDFnFgFmZAIBDxYCHgRUZXh0BQoyMDE1LTA4LTI2ZAICDxYCHwAFohU8dGFibGUgYm9yZGVyPSIxIiBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9IjAiIHdpZHRoPSI4NSUiICBydWxlcz0iYWxsIiBmcmFtZT0iYm9yZGVyIiBzdHlsZT0iYm9yZGVyLWNvbGxhcHNlOmNvbGxhcHNlOyBib3JkZXItY29sb3I6ICNDQ0NDQ0M7Ij48dGJvZHk+PHRyPjx0ZCB3aWR0aD0iNTclIiAgdmFsaWduPSJjZW50ZXIiIGJnY29sb3I9IiNlOGU4ZTgiPjxwIGFsaWduPSJjZW50ZXIiPjxiPumhueebrjwvYj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBiZ2NvbG9yPSIjZThlOGU4IiBoZWlnaHQ9IjE5Ij48cCBhbGlnbj0iY2VudGVyIj48Yj7lubTliKnnjoc8L2I+JTwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGhlaWdodD0iMTkiIGFsaWduPSJsZWZ0Ij7kuIDjgIHln47kuaHlsYXmsJHlj4rljZXkvY3lrZjmrL48L3RkPjx0ZCBoZWlnaHQ9IjE5IiB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIj4mbmJzcDs8L3RkPjwvdHI+PHRyPjx0ZCB3aW
R0aD0iNTclIiBoZWlnaHQ9IjE5Ij48ZGl2IGFsaWduPSJsZWZ0Ij7vvIjkuIDvvInmtLvmnJ88L2Rpdj48L3RkPjx0ZCBoZWlnaHQ9IjE5IiB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIj4wLjM1PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgaGVpZ2h0PSIxOSIgYWxpZ249ImxlZnQiPu+8iOS6jO+8ieWumuacnzwvdGQ+PHRkIGhlaWdodD0iMTkiIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiPiZuYnNwOzwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGhlaWdodD0iMTkiPjxkaXYgYWxpZ249ImxlZnQiPjEu5pW05a2Y5pW05Y+WPC9kaXY+PC90ZD48dGQgaGVpZ2h0PSIxOSIgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciI+Jm5ic3A7PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5LiJ5Liq5pyIPC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MS42PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5Y2K5bm0PC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MS44PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+5LiA5bm0PC90ZD48dGQgd2lkdGg9IjQzJSIgYWxpZ249ImNlbnRlciIgaGVpZ2h0PSIxOSI+MjwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPuS6jOW5tDwvdGQ+PHRkIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjIuNTwvdGQ+PC90cj48dHI+PHRkIHdpZHRoPSI1NyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPuS4ieW5tDwvdGQ+PHRkIHdpZHRoPSI0MyUiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjM8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kupTlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4zLjA1PC90ZD48L3RyPjx0cj48dGQgd2lkdGg9IjU3JSIgaGVpZ2h0PSIxOSI+PGRpdiBhbGlnbj0ibGVmdCI+Mi7pm7blrZjmlbTlj5bjgIHmlbTlrZjpm7blj5bjgIHlrZjmnKzlj5bmga88L2Rpdj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBoZWlnaHQ9IjE5Ij4mbmJzcDs8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIDlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjY8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuInlubQ8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjg8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kupTlubQ8L3RkPj
x0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjg1PC90ZD48L3RyPjx0cj48dGQgaGVpZ2h0PSIxOSI+PGRpdiBhbGlnbj0ibGVmdCI+My7lrprmtLvkuKTkvr88L2Rpdj48L3RkPjx0ZCBjb2xzcGFuPSIyIiBoZWlnaHQ9IjE5IiBhbGlnbj0ibGVmdCI+5oyJ5LiA5bm05Lul5YaF5a6a5pyf5pW05a2Y5pW05Y+W5ZCM5qGj5qyh5Yip546H5omTNuaKmDwvdGQ+PC90cj48dHI+PHRkIGhlaWdodD0iMTkiPjxkaXYgYWxpZ249ImxlZnQiPuS6jOOAgeWNj+WumuWtmOasvjwvZGl2PjwvdGQ+PHRkIGNvbHNwYW49IjIiIGFsaWduPSJjZW50ZXIiIGhlaWdodD0iMTkiPjEuMTU8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBoZWlnaHQ9IjE5Ij48ZGl2IGFsaWduPSJsZWZ0Ij7kuInjgIHpgJrnn6XlrZjmrL48L2Rpdj48L3RkPjx0ZCB3aWR0aD0iNDMlIiBoZWlnaHQ9IjE5Ij48Zm9udCBjb2xvcj0iI2ViZWJlYiI+LjwvZm9udD48L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIDlpKk8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4wLjg8L3RkPjwvdHI+PHRyPjx0ZCB3aWR0aD0iNTclIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij7kuIPlpKk8L3RkPjx0ZCB3aWR0aD0iNDMlIiBhbGlnbj0iY2VudGVyIiBoZWlnaHQ9IjE5Ij4xLjM1PC90ZD48L3RyPjwvdGFibGU+ZGRDrgsxnIFuzBq+7MoE9zn85XGzBQ=='
    }

    uri = URI.parse("http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx")
    res = Net::HTTP.post_form uri, params
    puts res.body
    ```
    Shy07 · #10 · 2015-09-05 20:58:35 +08:00
    All done:

    require 'net/http'

    uri = URI.parse("http://www.icbc.com.cn/ICBCDynamicSite2/other/rmbdeposit.aspx")
    html = Net::HTTP.get uri
    dates = []
    # Collect every date offered in the dropdown
    html.scan(/<option value="(\d{4}-\d{2}-\d{2})">/) { |s| dates += s }
    # Capture the hidden __VIEWSTATE value into $1
    html =~ /name="__VIEWSTATE" id="__VIEWSTATE" value="(.*)" \/>/

    params = {
      '__EVENTTARGET' => 'Sel_Date',
      '__EVENTARGUMENT' => '',
      '__LASTFOCUS' => '',
      '__VIEWSTATE' => $1.clone
    }

    dates.each do |date|
      params['Sel_Date'] = date
      res = Net::HTTP.post_form uri, params
      # I skipped the regex that extracts the actual figures; this just writes out the HTML =_=b
      File.open("#{date}.html", 'w') { |io| io.write res.body }
    end
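Since the question asks for Python 3, here is a rough Python counterpart of the extraction step in the Ruby script above: pull the dates out of the dropdown and read the hidden `__VIEWSTATE` value. The function name and the `sample_html` snippet are made up for illustration; the real page markup may differ, and the option pattern here is slightly looser than the Ruby one because it tolerates extra attributes before `value`.

```python
import re

# Dates offered in the dropdown, e.g. <option value="2012-07-06">
OPTION_RE = re.compile(r'<option[^>]*\bvalue="(\d{4}-\d{2}-\d{2})"')
# Hidden ASP.NET field, assumed to be written as id="__VIEWSTATE" value="..."
VIEWSTATE_RE = re.compile(r'id="__VIEWSTATE"\s+value="([^"]*)"')


def extract_dates_and_viewstate(html):
    """Return (list of date strings, __VIEWSTATE value or None)."""
    dates = OPTION_RE.findall(html)
    match = VIEWSTATE_RE.search(html)
    viewstate = match.group(1) if match else None
    return dates, viewstate


# Hypothetical stand-in for the downloaded page.
sample_html = '''
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123==" />
<select name="Sel_Date" id="Sel_Date">
  <option value="">placeholder</option>
  <option selected="selected" value="2015-08-26">2015-08-26</option>
  <option value="2012-07-06">2012-07-06</option>
</select>
'''

dates, viewstate = extract_dates_and_viewstate(sample_html)
print(dates, viewstate)
```

Against the live site, `sample_html` would be replaced by the decoded body of a GET request to the page.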
    ljdawn · #11 · 2015-09-05 21:10:53 +08:00
    First scrape the dates on the left, then POST them one by one..
    explist (OP) · #12 · 2015-09-05 21:42:58 +08:00
    Once I have the list of dates, how do I build the POST request?
    imlonghao · #13 · 2015-09-05 21:45:19 +08:00
    @explist RTFM
    Shy07 · #14 · 2015-09-05 22:01:32 +08:00 via iPhone
    @explist
    The form has only five parameters; just POST them to the original URL:
    '__EVENTTARGET' => 'Sel_Date' (fixed)
    '__EVENTARGUMENT' => '' (fixed)
    '__LASTFOCUS' => '' (fixed)
    '__VIEWSTATE' => that Base64 string (fixed)
    'Sel_Date' => the date (variable)
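The five parameters listed above can be sketched in Python 3 as a small form builder. `build_form` and `encode_form` are hypothetical helper names, and `PLACEHOLDER_VIEWSTATE` stands in for the long Base64 string; reading that value from a fresh copy of the page, as the Ruby script in reply #10 does, is probably safer than hard-coding it.

```python
from urllib import parse


def build_form(date, viewstate):
    """Return the five form fields described in the reply above."""
    return {
        "__EVENTTARGET": "Sel_Date",   # fixed
        "__EVENTARGUMENT": "",         # fixed
        "__LASTFOCUS": "",             # fixed
        "__VIEWSTATE": viewstate,      # Base64 string taken from the page
        "Sel_Date": date,              # the only field that changes
    }


def encode_form(form):
    """URL-encode the fields into the bytes expected as a POST body."""
    return parse.urlencode(form).encode("utf-8")


body = encode_form(build_form("2012-07-06", "PLACEHOLDER_VIEWSTATE"))
print(body)
```

Using `urlencode` rather than joining strings by hand matters here because the real Base64 value contains `+`, `/` and `=`, which have to be percent-encoded in a form body.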
    explist (OP) · #15 · 2015-09-05 22:12:52 +08:00
    @Shy07 That works now. How did you figure out the mapping between them?
    miemiekurisu · #16 · 2015-09-05 22:23:03 +08:00
    ....Why not just spin up scrapy and pull the page data with xpath... it saves time and effort...
    Shy07 · #17 · 2015-09-05 22:34:42 +08:00 via iPhone · ❤️ 1
    @explist
    Look at its JS: it ends by calling submit, so you just need to find every submittable form element on the page.
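One way to "find every submittable form element" with only the standard library is an `HTMLParser` subclass that records named `input` and `select` elements. This is a simplified sketch: the class name and `sample_form` are invented for illustration, and checkboxes, radio buttons and `textarea` are not handled.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> and <select> elements."""

    SKIPPED_INPUT_TYPES = ("submit", "button", "image", "reset")

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._select_name = None  # name of the <select> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name"):
            if (attrs.get("type") or "text").lower() not in self.SKIPPED_INPUT_TYPES:
                self.fields[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select" and attrs.get("name"):
            self._select_name = attrs["name"]
            self.fields.setdefault(self._select_name, "")
        elif tag == "option" and self._select_name and "selected" in attrs:
            # The selected option is what the browser would submit.
            self.fields[self._select_name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select_name = None


# Hypothetical markup standing in for the real page.
sample_form = '''
<form method="post" action="rmbdeposit.aspx">
  <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123==" />
  <select name="Sel_Date" onchange="submitForm()">
    <option value="2015-08-26" selected="selected">2015-08-26</option>
    <option value="2012-07-06">2012-07-06</option>
  </select>
</form>
'''

collector = FormFieldCollector()
collector.feed(sample_form)
print(collector.fields)
```

The resulting dictionary can serve as the starting point for the POST body; only the date field then needs to be overwritten for each request.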
    explist (OP) · #18 · 2015-09-06 23:08:35 +08:00
    A follow-up question, for learning purposes:
    How would I scrape this China Construction Bank page: http://www.ccb.com/cn/personal/interest/rmbdeposit.html ? There is no table tag in its source.