For performance reasons, I tried to combine a set of regexes into a single one by joining them with pipes (|).
self.regexes_token = [
{'descricao':'site www.', 'regex': r'^www\.(.+?)$'},
{'descricao':'apenas pontuacao','regex':r'^[[:punct:]]+?$'},
{'descricao':'palavra com sinal negativo', 'regex': r'^(-)(.*?)$', 'grupo': r'\2'},
{'descricao':'pronomes e títulos', 'regex': r'^(sra?|exm[º|°|o]|dr[a|ª]?|(v\.)?ex\.?(a|ª)\.?)\.??$'},
{'descricao':'oab sigla', 'regex': r'^oab\/[a-z]{2}$'},
{'descricao':'termos irrelevantes', 'regex': r'^(s\/n|e\/ou|e-?mail|cep|rj|tel\.?(\/fax|efone)?|anos?|rua|cpf|www)\.?$'},
{'descricao':'chassi (VIN)', 'regex': r'^[A-Za-z0-9]{1}[A-Za-z]{2}[A-Za-z0-9]{9}[\d+]{5}$'},
{'descricao':'data_br', 'regex': r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))|(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'},
{'descricao':'um char e ponto', 'regex': r'^\w[[:punct:]]$'},
{'descricao':'rg','regex': r'^\d{2}\.\d{3}\.\d{3}-\d(\/.*)?'},
{'descricao':'unidades de medidas', 'regex': r'^(\d{1,2},?x?)+(cm|m(l|²|2|m)?|k(g|m))$'},
{'descricao':'zero seguido de qualquer coisa', 'regex': r'^0(.*)$'},
{'descricao':'::punct:: seguido de qualquer coisa','regex':r'^[[:punct:]](.+?)$'},
{'descricao':'telefone avulso', 'regex': r'^\d{4,5}-\d{4}$'},
{'descricao':'ano', 'regex': r'\b(19|20)\d{2}\.?\b'},
{'descricao':'contém char especial', 'regex': r'^.*?(~|\^|¿|¡|>|<|»|#|£|\?|»|·|#|\*|=|\+|¥|€|\||µ|®)+.*?$'}
]
self.regexes_token_union = r'('+'|'.join([d['regex'] for d in self.regexes_token])+r')'
print(self.regexes_token_union)
This prints the following regex:
(^www\.(.+?)$|^[[:punct:]]+?$|^(-)(.*?)$|^(sra?|exm[º|°|o]|dr[a|ª]?|(v\.)?ex\.?(a|ª)\.?)\.??$|^oab\/[a-z]{2}$|^(s\/n|e\/ou|e-?mail|cep|rj|tel\.?(\/fax|efone)?|anos?|rua|cpf|www)\.?$|^[A-Za-z0-9]{1}[A-Za-z]{2}[A-Za-z0-9]{9}[\d+]{5}$|^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))|(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^\w[[:punct:]]$|^\d{2}\.\d{3}\.\d{3}-\d(\/.*)?|^(\d{1,2},?x?)+(cm|m(l|²|2|m)?|k(g|m))$|^0(.*)$|^[[:punct:]](.+?)$|^\d{4,5}-\d{4}$|\b(19|20)\d{2}\.?\b|^.*?(~|\^|¿|¡|>|<|»|#|£|\?|»|·|#|\*|=|\+|¥|€|\||µ|®)+.*?$)
But when I try to compile it with Python's regex module (not re), I get an error:
regex._regex_core.error: cannot refer to an open group at position 272
I used Notepad++ to locate that column position, but even so I could not work out which open group the message refers to.
The confusing part is that when I run each regex separately in a loop, it works fine (but the performance is not good).
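A likely explanation for that difference, sketched here with the standard re module (which raises the same message): backreferences such as \1 are numbered against the whole pattern, not against the sub-pattern they were written in. Once everything is wrapped in one outer group, that outer parenthesis becomes group 1, so a \1 inside it points at a group that is still open. The two short patterns below are hypothetical stand-ins for the list above, not the original entries.

```python
import re

# Hypothetical stand-ins for two entries of the list: the second one
# uses a backreference, as the 'data_br' pattern does.
patterns = [r'^www\.(.+?)$', r'^(\d)-\1$']

# Each pattern compiles on its own, because \1 refers to its own group 1.
for p in patterns:
    re.compile(p)

# Joined inside an outer group, the outer parenthesis becomes group 1.
# \1 now points at that outer group, which is not closed yet where \1 appears.
union = '(' + '|'.join(patterns) + ')'

try:
    re.compile(union)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```

Dropping the outer parentheses would probably make the error disappear, but \1 would then silently refer to the first group of the first pattern, which is not what the date pattern intends.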
So, how can I fix this?
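One possible approach, offered as a sketch and not a definitive fix: wrap each pattern in a non-capturing group and shift its numeric backreferences by the number of capturing groups that come before it in the joined pattern. The helper name join_patterns and the two sample patterns are hypothetical. The sketch uses the standard re module so that it is self-contained; it assumes the regex module's compiled patterns expose the same groups attribute. The backreference detection is deliberately naive and ignores character classes and octal escapes.

```python
import re

# Matches a numeric backreference such as \1 or \12 that is not itself
# escaped. Naive: it does not look inside character classes.
_BACKREF = re.compile(r'(?<!\\)((?:\\\\)*)\\([1-9]\d?)')


def join_patterns(patterns):
    """Join patterns with '|' and renumber each pattern's backreferences."""
    parts = []
    offset = 0  # capturing groups contributed by the patterns joined so far
    for p in patterns:
        shifted = _BACKREF.sub(
            lambda m, off=offset: f'{m.group(1)}\\{int(m.group(2)) + off}',
            p,
        )
        # A non-capturing wrapper keeps the group numbering predictable.
        parts.append(f'(?:{shifted})')
        offset += re.compile(p).groups
    return '|'.join(parts)


# Hypothetical sample patterns; the second one relies on \1.
union = join_patterns([r'^www\.(.+?)$', r'^(\d)-\1$'])
combined = re.compile(union)
print(union)
print(bool(combined.match('5-5')), bool(combined.match('5-6')))
```

Any group reference stored outside the pattern, such as the 'grupo': r'\2' entry in the list, would need the same offset applied before it is used against the combined regex.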
from Regex module error when join a set of regex using pipes