Calling all gentlemen and experts who love doujinshi and know their regex~

    eromoe · 2015-08-28 17:09:08 +08:00 · 2947 views

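    As everyone knows, once you have a lot of doujinshi, keeping them organized becomes a chore =.=
    I want to sort them by artist, by circle, by event, and so on, so I wrote a regex to extract all of that information from a book's name.
    But the titles come in all sorts of formats, for example: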
    0. (event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]

    1. (event ) [group (artist )] title (form ) [addition1]

    2. [event] [group (artist )] title (form ) (addition1 )

    3. (tag ) [group (artist )] title

    4. [group (artist )] title

    5. title

    Here is my attempt:

    import re
    regex_patern = ur'([\(\[](?P<event>[^\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?(\s*[\(\[](?P<more1>[^\)\]]*)[\)\]])'
    
    p = re.compile (regex_patern )
    
    rows= [
    '(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
    '(event ) [group (artist )] title (form ) [addition1]',
    '[event] [group (artist )] title (form ) (addition1 )',
    '(tag ) [group (artist )] title',
    '[group (artist )] title',
    'title',
    ]
    
    for r in rows:
        r = re.search (p, r )
        print r.groupdict ()
    
    # Output:
    
    {u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': 'tag', u'event': 'event'}
    {u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
    {u'from': 'form', u'more1': 'addition1', u'artist': 'artist', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
    {u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'tag'}
    {u'from': None, u'more1': 'group (artist', u'artist': None, u'title': '', u'group': None, u'type': None, u'event': None}
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last )
    <ipython-input-5-831c548bc3f0> in <module>()
         15 for r in rows:
         16     r = re.search (p, r )
    ---> 17     print r.groupdict ()
    
    AttributeError: 'NoneType' object has no attribute 'groupdict'
    

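    From the fourth row onward the results are wrong. My feeling is that re should match the simple rule in the middle first and then extend outward to the most complex one,
    but I don't know how to write that... so I'm here to ask for your help.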
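One way to approximate "simple rule first, then extend" without a single monolithic pattern is to try several smaller patterns in order and take the first that matches. The sketch below is not the approach used later in this thread; it is a different technique (ordered fallback), written for Python 3, and the names `CREATOR`, `TITLE`, `PATTERNS`, and `parse` are made up for illustration. It only covers three of the six formats.

```python
import re

# Reusable fragments (hypothetical names, not from the original post).
CREATOR = r'\[(?P<group>[^(\]]*)(?:\((?P<artist>[^)]*)\))?\]'  # [group (artist)]
TITLE = r'(?P<title>[^()\[\]]+)'                               # bare title text

# Most specific pattern first, simplest last.
PATTERNS = [
    re.compile(r'\s*[(\[](?P<event>[^()\]]*)[)\]]\s*' + CREATOR + TITLE),
    re.compile(r'\s*' + CREATOR + TITLE),
    re.compile(TITLE),
]

def parse(name):
    """Return the fields of the first pattern that matches, or None."""
    for pattern in PATTERNS:
        match = pattern.match(name)
        if match:
            # Drop unmatched optional groups and trim surrounding spaces.
            return {key: value.strip()
                    for key, value in match.groupdict().items()
                    if value is not None}
    return None

print(parse('(event) [group (artist)] title'))
print(parse('[group (artist)] title'))
print(parse('title'))  # {'title': 'title'}
```

The trade-off is more patterns to maintain, but each one stays short enough to read and debug on its own.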

    9 replies    2015-08-29 09:54:02 +08:00
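    plqws · #1 · 2015-08-28 17:48:29 +08:00
    Why does it have to be regex? That code looks really hard to modify.
    Also, I think Japanese word segmentation plus tags would make this kind of thing easier to organize.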
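    rogerchen · #2 · 2015-08-28 18:18:38 +08:00
    (\s*[\(\[](?P<more1>[^\)\]]*)[\)\]]) Why does this last group capture the leading whitespace? That is inconsistent with the earlier groups. Also, the more1 segment should be optional; only the title segment should be mandatory.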
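The effect of a mandatory trailing group can be reproduced with a much smaller pattern. The sketch below (Python 3, with hypothetical names `REQUIRED` and `OPTIONAL`) keeps only a title and one trailing bracketed block; it is a reduced illustration of the point above, not the full pattern from the post.

```python
import re

# Title followed by one bracketed block; the block is mandatory here.
REQUIRED = re.compile(
    r'(?P<title>[^()\[\]]+)\s*([(\[](?P<more1>[^)\]]*)[)\]])')

# Same pattern, but the trailing block is made optional with "?".
OPTIONAL = re.compile(
    r'(?P<title>[^()\[\]]+)\s*([(\[](?P<more1>[^)\]]*)[)\]])?')

# A bare title has no bracketed block, so the mandatory version cannot match
# and search() returns None, which is what triggers the AttributeError.
print(REQUIRED.search('title'))              # None
print(OPTIONAL.search('title').groupdict())  # {'title': 'title', 'more1': None}
```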
    rogerchen · #3 · 2015-08-28 18:20:54 +08:00
    OP, I noticed one more thing: you write the source field as from in some places and form in others. It doesn't break anything, but it really confused me.
    rogerchen · #4 · 2015-08-28 18:24:18 +08:00
    After the change it looks like this. There still seems to be a small problem; I'll keep looking.
    $ python re.py
    {u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
    {u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
    {u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
    {u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
    {u'from': None, u'more1': None, u'artist': None, u'title': '', u'group': None, u'type': None, u'event': 'group (artist '}
    {u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
    rogerchen · #5 · 2015-08-28 18:34:56 +08:00 · ❤️ 1
    import re
    regex_patern = ur'([\(\[](?P<event>[^\()\)\]]*)[\)\]])?\s*([\(\[](?P<type>[^\)\](\)\])]*)[\)\]])?\s*(\[(?P<group>[^\(\]]*)(\((?P<artist>[^\)]*)\))?\])?(?P<title>[^\(\)\[\]]*)([\(\[](?P<from>[^\)\]]*)[\)\]])?\s*([\(\[](?P<more1>[^\)\]]*)[\)\]])?'

    p = re.compile (regex_patern )

    rows= [
    '(event ) (tag ) [group (artist )] title (form ) [addition1] [addition2]',
    '(event ) [group (artist )] title (form ) [addition1]',
    '[event] [group (artist )] title (form ) (addition1 )',
    '(tag ) [group (artist )] title',
    '[group (artist )] title',
    'title',
    ]

    for r in rows:
        r = re.search(p, r)
        print r.groupdict()

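    All fixed now. You had two mistakes: first, the final group was a mandatory capture; second, event must not be allowed to capture [group (artist )], so the character class in the event segment has to exclude \( as well.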

    $ python re.py
    {u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': 'tag ', u'event': 'event '}
    {u'from': 'form ', u'more1': 'addition1', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event '}
    {u'from': 'form ', u'more1': 'addition1 ', u'artist': 'artist ', u'title': ' title ', u'group': 'group ', u'type': None, u'event': 'event'}
    {u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': 'tag '}
    {u'from': None, u'more1': None, u'artist': 'artist ', u'title': ' title', u'group': 'group ', u'type': None, u'event': None}
    {u'from': None, u'more1': None, u'artist': None, u'title': 'title', u'group': None, u'type': None, u'event': None}
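For readers on Python 3, the corrected pattern above can be restated with `re.VERBOSE` so that each segment sits on its own line. This is a sketch: the character classes are simplified but intended to be equivalent to the ones in the reply, and the `parse` helper with its `strip()` step is an addition for illustration, not part of the original code.

```python
import re

# Python 3 restatement of the corrected pattern, one segment per line.
PATTERN = re.compile(r'''
    ([(\[] (?P<event>[^()\]]*) [)\]])?  \s*   # optional (event) or [event]
    ([(\[] (?P<type>[^()\]]*)  [)\]])?  \s*   # optional (tag) or [tag]
    (\[ (?P<group>[^(\]]*) (\( (?P<artist>[^)]*) \))? \])?  # optional creator block
    (?P<title>[^()\[\]]*)                     # title, the only mandatory field
    ([(\[] (?P<from>[^)\]]*)  [)\]])?   \s*   # optional source
    ([(\[] (?P<more1>[^)\]]*) [)\]])?         # optional first addition
''', re.VERBOSE)

def parse(name):
    """Return the named fields with surrounding whitespace stripped."""
    match = PATTERN.search(name)
    return {key: value.strip() if value else value
            for key, value in match.groupdict().items()}

print(parse('[group (artist)] title'))
print(parse('(tag) [group (artist)] title')['event'])  # tag
```

As in the output above, a lone leading `(tag)` still lands in `event` rather than `type`, because the pattern fills the optional slots from left to right.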
    eromoe · #6 · OP · 2015-08-29 08:52:46 +08:00
    @rogerchen Thanks a lot! I'm going to write some code and test how well the classification works~
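    eromoe · #7 · OP · 2015-08-29 09:02:27 +08:00
    I just ran into a really awkward problem...
    [event] [group] title (from )
    [event] [artist] title (from )

    Is this unsolvable...?
    Can a regex grab one [XXX] to the left of the title and, if XXX does not contain something like 同人 (doujin) / Cxx / 成年 (adult) XXX, decide that it is the group+artist block?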
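    rogerchen · #8 · 2015-08-29 09:12:01 +08:00
    Once you need to compare strings, doing it with regex alone turns into black magic. I'd suggest capturing the blocks first and then writing a bit of code to make the decision.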
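A minimal sketch of "capture first, decide in code", written for Python 3. The hint list is an assumption loosely based on the markers mentioned in the previous post (同人, Cxx, 成年) and would need tuning against real names; `EVENT_HINTS`, `leading_blocks`, `classify`, and the sample name are all hypothetical.

```python
import re

# Assumed markers for a block that is NOT a group/artist.
EVENT_HINTS = re.compile(r'同人|成年|\bC\d+\b')

# One [..] block, allowing spaces before it.
LEADING_BLOCK = re.compile(r'\s*\[([^\[\]]*)\]')

def leading_blocks(name):
    """Collect the consecutive [..] blocks at the start of a name.

    Returns (blocks, rest), where rest is the remaining text.
    """
    blocks, pos = [], 0
    while True:
        match = LEADING_BLOCK.match(name, pos)
        if not match:
            return blocks, name[pos:].strip()
        blocks.append(match.group(1).strip())
        pos = match.end()

def classify(block):
    """Guess whether a captured block is an event/tag or a creator."""
    return 'event' if EVENT_HINTS.search(block) else 'creator'

blocks, rest = leading_blocks('[C88] [SomeCircle] title (from)')
print([(block, classify(block)) for block in blocks])
# [('C88', 'event'), ('SomeCircle', 'creator')]
print(rest)  # title (from)
```

This still cannot tell a group from an artist when only one of them is present; that would take a lookup against known names, which is outside what a pattern can do.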
    eromoe · #9 · OP · 2015-08-29 09:54:02 +08:00
    @rogerchen Yeah, that's the only way.