Thursday 31 December 2020

Regex module error when joining a set of regexes with pipes

For performance reasons, I tried to combine a set of regexes into a single one by joining them with pipes (|).

            self.regexes_token = [
                {'descricao':'site www.', 'regex': r'^www\.(.+?)$'},
                {'descricao':'apenas pontuacao','regex':r'^[[:punct:]]+?$'},
                {'descricao':'palavra com sinal negativo', 'regex': r'^(-)(.*?)$', 'grupo': r'\2'},
                {'descricao':'pronomes e títulos', 'regex': r'^(sra?|exm[º|°|o]|dr[a|ª]?|(v\.)?ex\.?(a|ª)\.?)\.??$'},
                {'descricao':'oab sigla', 'regex': r'^oab\/[a-z]{2}$'},
                {'descricao':'termos irrelevantes', 'regex': r'^(s\/n|e\/ou|e-?mail|cep|rj|tel\.?(\/fax|efone)?|anos?|rua|cpf|www)\.?$'},
                {'descricao':'chassi (VIN)', 'regex': r'^[A-Za-z0-9]{1}[A-Za-z]{2}[A-Za-z0-9]{9}[\d+]{5}$'},
                {'descricao':'data_br', 'regex': r'^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))|(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$'},
                {'descricao':'um char e ponto', 'regex': r'^\w[[:punct:]]$'},
                {'descricao':'rg','regex': r'^\d{2}\.\d{3}\.\d{3}-\d(\/.*)?'},
                {'descricao':'unidades de medidas', 'regex': r'^(\d{1,2},?x?)+(cm|m(l|²|2|m)?|k(g|m))$'},
                {'descricao':'zero seguido de qualquer coisa', 'regex': r'^0(.*)$'},
                {'descricao':'::punct:: seguido de qualquer coisa','regex':r'^[[:punct:]](.+?)$'},
                {'descricao':'telefone avulso', 'regex': r'^\d{4,5}-\d{4}$'},
                {'descricao':'ano', 'regex': r'\b(19|20)\d{2}\.?\b'},
                {'descricao':'contém char especial', 'regex': r'^.*?(~|\^|¿|¡|>|<|»|#|£|\?|»|·|#|\*|=|\+|¥|€|\||µ|®)+.*?$'}
            ]

            # Join every pattern into a single alternation wrapped in one outer capturing group
            self.regexes_token_union = r'(' + '|'.join(d['regex'] for d in self.regexes_token) + r')'
            print(self.regexes_token_union)

This is the resulting regex:

(^www\.(.+?)$|^[[:punct:]]+?$|^(-)(.*?)$|^(sra?|exm[º|°|o]|dr[a|ª]?|(v\.)?ex\.?(a|ª)\.?)\.??$|^oab\/[a-z]{2}$|^(s\/n|e\/ou|e-?mail|cep|rj|tel\.?(\/fax|efone)?|anos?|rua|cpf|www)\.?$|^[A-Za-z0-9]{1}[A-Za-z]{2}[A-Za-z0-9]{9}[\d+]{5}$|^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[1,3-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))|(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^\w[[:punct:]]$|^\d{2}\.\d{3}\.\d{3}-\d(\/.*)?|^(\d{1,2},?x?)+(cm|m(l|²|2|m)?|k(g|m))$|^0(.*)$|^[[:punct:]](.+?)$|^\d{4,5}-\d{4}$|\b(19|20)\d{2}\.?\b|^.*?(~|\^|¿|¡|>|<|»|#|£|\?|»|·|#|\*|=|\+|¥|€|\||µ|®)+.*?$)

But when I tried to compile it, Python's regex module (not re) raised an error:

regex._regex_core.error: cannot refer to an open group at position 272
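
The message is easier to see on a cut-down case. The sketch below uses the standard-library re module, which reports the same message in this situation, and two simplified, hypothetical patterns (not the full list above). The assumption it illustrates is that wrapping the joined patterns in one outer group renumbers every group, so a \1 that was valid inside its own pattern ends up pointing at the outer group, which is still open when the backreference is parsed.

```python
import re

# Two simplified, hypothetical patterns; the second one uses the backreference \1.
patterns = [r'^(-)(.*?)$', r'^\d{2}(\/|-|\.)\d{2}\1\d{4}$']

# Each pattern compiles on its own: \1 points at that pattern's first group.
for p in patterns:
    re.compile(p)

# Joined inside one outer group, the groups are renumbered:
# the outer "(" becomes group 1, so \1 now refers to a group
# that has not been closed yet at the point where \1 appears.
union = '(' + '|'.join(patterns) + ')'

try:
    re.compile(union)
    compile_error = None
except re.error as exc:
    compile_error = exc

print(compile_error)
```

If this reading is right, the same renumbering would apply to \1 through \4 in the date pattern of the full list.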

I used Notepad++ to find that column position, but even then I could not work out which open group the message refers to.
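
Instead of counting columns in an editor, the reported position can be read from the exception itself. This is a minimal sketch using the standard-library re module, whose error objects carry a pos attribute; the helper name error_context and the sample pattern are hypothetical, and the exact offset convention (start or end of the offending token) may differ between re and regex, which is why the helper returns a window around the position rather than a single character.

```python
import re

def error_context(pattern, width=12):
    """Return the slice of `pattern` around a compile error.

    Returns None if the pattern compiles or no position is reported.
    """
    try:
        re.compile(pattern)
    except re.error as exc:
        if exc.pos is None:
            return None
        start = max(0, exc.pos - width)
        return pattern[start:exc.pos + width]
    return None

# Hypothetical, simplified union in which \1 points at the still-open outer group.
sample = r'(^(-)(.*?)$|^\d{2}(\/|-|\.)\d{2}\1\d{4}$)'
context = error_context(sample)
print(context)
```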

The confusing part is that when I run each regex separately in a loop, it works fine (but the performance is poor).
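
For reference, the loop approach looks roughly like the sketch below. It uses the standard-library re module, three patterns taken from the list above, and one simplified, hypothetical date pattern with a backreference; the function name matches_any is also made up for illustration. Compiling each pattern once up front keeps group numbering local to each pattern, which is presumably why the backreferences work here.

```python
import re

# Three patterns from the list above, plus one simplified,
# hypothetical date pattern that uses the backreference \1.
raw_patterns = [
    r'^www\.(.+?)$',
    r'^oab\/[a-z]{2}$',
    r'^\d{4,5}-\d{4}$',
    r'^\d{2}(\/|-|\.)\d{2}\1\d{4}$',
]

# Compile once, outside the hot loop; each pattern keeps its own group numbers.
compiled = [re.compile(p) for p in raw_patterns]

def matches_any(token):
    """Return True if any of the compiled patterns matches the token."""
    return any(p.search(token) for p in compiled)

print(matches_any('www.example.com'))
print(matches_any('31/12-2020'))
```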

So, how can I fix this?
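
One possible direction, offered as a sketch and not as a confirmed fix for the full list: numbered backreferences shift when patterns are joined, but named ones do not. The example below uses the standard-library re module and a simplified, hypothetical date pattern; the group name sep is an assumption, and every pattern in a joined regex would need its own unique group names. The outer group is made non-capturing so it does not take a group number.

```python
import re

# Simplified, hypothetical patterns. The date pattern names its separator
# group and refers back to it by name instead of by number.
patterns = [
    r'^(-)(.*?)$',
    r'^\d{2}(?P<sep>\/|-|\.)\d{2}(?P=sep)\d{4}$',
]

# A non-capturing outer group avoids adding a new numbered group.
union = re.compile('(?:' + '|'.join(patterns) + ')')

print(bool(union.search('31/12/2020')))
print(bool(union.search('31/12-2020')))
```

The same named-group syntax is also accepted by the regex module, so the idea should carry over, though it has not been checked here against the original patterns.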


