[Python] Why do some regex engines match .* twice in a single input string?

Discussion in 'Python' started by Stack, October 3, 2024 at 14:02.

  1. Stack

    Stack (Participating Member)

    Many regex engines match .* twice in a single-line string, e.g., when performing regex-based string replacement:

    • The 1st match is - by definition - the entire (single-line) string, as expected.

    • In many engines there is a 2nd match, namely the empty string: even though the 1st match has consumed the entire input string, the engine tries .* again, and this attempt matches the empty string at the end of the input string.
      • Note: To ensure that only one match is found, use ^.*
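
The two matches described above can be observed directly in Python by listing the match spans. This is a minimal sketch, assuming Python 3.7 or later (the version threshold given further down in this post); the variable names are illustrative only.

```python
import re

text = "a"

# Each span is (start, end); an empty match has start == end.
unanchored = [m.span() for m in re.finditer(r".*", text)]
anchored = [m.span() for m in re.finditer(r"^.*", text)]

print(unanchored)  # [(0, 1), (1, 1)]
print(anchored)    # [(0, 1)]

# The same effect as seen through replacement:
replaced_unanchored = re.sub(r".*", r"[\g<0>]", text)
replaced_anchored = re.sub(r"^.*", r"[\g<0>]", text)

print(replaced_unanchored)  # [a][]  (Python 3.7+)
print(replaced_anchored)    # [a]
```

The second span, (1, 1), is the empty match at the end of the string; with the ^ anchor it disappears.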

    My questions are:


    • Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.


    • Other than by trial and error, can you tell from an engine's documentation, or from the regex dialect or standard it supports, whether it exhibits this behavior?

    Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.
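
As a rough mental model of how the 2nd match can arise, consider a simplified "replace all" loop that resumes searching at the end offset of the previous match and stops only once that offset has moved past the end of the input. The sketch below is an assumption made for illustration, not the code of any particular engine; `global_spans` is a hypothetical helper built on Python's `re` module.

```python
import re


def global_spans(pattern, text):
    """Hypothetical model of a global-match loop (not any engine's real code)."""
    rx = re.compile(pattern)
    pos = 0
    spans = []
    # Note "<=": the offset just past the last character is still tried.
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() > m.start():
            # Non-empty match: resume right where it ended.
            pos = m.end()
        else:
            # Empty match: step forward one character to avoid looping forever.
            pos = m.end() + 1
    return spans


print(global_spans(r".*", "a"))   # [(0, 1), (1, 1)]
print(global_spans(r"^.*", "a"))  # [(0, 1)]
```

In this model, the loop condition `pos <= len(text)` is what leaves room for one more attempt at the very end of the string, where .* can still succeed by matching nothing. ^.* fails there because ^ does not match at offset 1.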

    Languages/platforms that DO exhibit the behavior:

    # .NET, via PowerShell (behavior also applies to the -replace operator)
    PS> [regex]::Replace('a', '.*', '[$&]')
    [a][] # !! Note the *2* matches, first the whole string, then the empty string

    # Node.js
    $ node -pe "'a'.replace(/.*/g, '[$&]')"
    [a][]

    # Ruby
    $ ruby -e "puts 'a'.gsub(/.*/, '[\\0]')"
    [a][]

    # Python 3.7+ only
    $ python -c "import re; print(re.sub('.*', r'[\g<0>]', 'a'))"
    [a][]

    # Perl 5
    $ echo a | perl -ple 's/.*/[$&]/g'
    [a][]

    # Perl 6
    $ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
    [a][]

    # Others?


    Languages/platforms that do NOT exhibit the behavior:

    # Python 2.x and Python 3.x <= 3.6
    $ python -c "import re; print(re.sub('.*', r'[\g<0>]', 'a'))"
    [a] # !! Only 1 match found.

    # Others?


    bobble bubble brings up some good related points:


    If you make it lazy like .*?, you'd even get 3 matches in some and 2 matches in others. Same with .??. As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems ^.*? gives two matches in PCRE for a, whereas ^.* should result in one match everywhere.
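
For reference, the lazy variants mentioned in the quote can be tried in Python as follows. This is a sketch assuming Python 3.7 or later; it shows Python's results only, and other engines may report different counts, as the quote points out.

```python
import re

text = "a"
patterns = [r".*?", r".??", r"^.*?", r"^.*"]

# Map each pattern to the list of substrings it matches in text.
lazy_matches = {p: [m.group() for m in re.finditer(p, text)] for p in patterns}

for p, found in lazy_matches.items():
    print(p, len(found), found)

# On Python 3.7+ this reports:
#   .*?   3 matches: '', 'a', ''
#   .??   3 matches: '', 'a', ''
#   ^.*?  2 matches: '', 'a'
#   ^.*   1 match:   'a'
```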

    Here's a PowerShell snippet for testing the behavior across languages, with multiple regexes:

    Note: Assumes that Python 3.x is available as python3 and Perl 6 as perl6.
    You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.

    & {
      param($inputStr, $regexes)

      # Define the commands as script blocks.
      # IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
      # Always use "..." as the outer quoting, to work around PS quirks.
      $cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
              { node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
              { ruby -e "puts '$inputStr'.gsub(/$regex/, '[\\0]')" },
              { python -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
              { python3 -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
              { "$inputStr" | perl -ple "s/$regex/[$&]/g" },
              { "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }

      $regexes | foreach {
        $regex = $_
        Write-Verbose -vb "----------- '$regex'"
        $cmds | foreach {
          $cmd = $_.ToString().Trim()
          Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '\|')[-1].Trim() -split '[ :]')[0],
                             $cmd -replace '\$inputStr\b', $inputStr -replace '\$regex\b', $regex)
          & $_ $regex
        }
      }

    } -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'
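
If only Python is at hand, the same input and regex list can be run through `re.sub` alone. This is a minimal sketch rather than a replacement for the cross-language snippet; it assumes Python 3.7 or later, and `run_all` is a hypothetical helper name.

```python
import re


def run_all(input_str, regexes):
    """Apply each regex to input_str, wrapping every match in [...]."""
    return {rx: re.sub(rx, r"[\g<0>]", input_str) for rx in regexes}


results = run_all("a", [r".*", r"^.*", r".*$", r"^.*$", r".*?"])

for rx, out in results.items():
    print(f"{rx:6} -> {out}")

# On Python 3.7+ the replaced strings are:
#   .*    -> [a][]
#   ^.*   -> [a]
#   .*$   -> [a][]
#   ^.*$  -> [a]
#   .*?   -> [][a][]
```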


    Sample output for regex ^.*, which confirms bobble bubble's expectation that using the start anchor (^) yields only one match in all languages:

    VERBOSE: ----------- '^.*'
    VERBOSE: [regex] : [regex]::Replace("a", "^.*", '[$&]')
    [a]
    VERBOSE: node : node -pe "'a'.replace(/^.*/g, '[$&]')"
    [a]
    VERBOSE: ruby : ruby -e "puts 'a'.gsub(/^.*/, '[\\0]')"
    [a]
    VERBOSE: python : python -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
    [a]
    VERBOSE: python3 : python3 -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
    [a]
    VERBOSE: perl : "a" | perl -ple "s/^.*/[$&]/g"
    [a]
    VERBOSE: perl6 : "a" | perl6 -pe "s:g/^.*/[$/]/"
    [a]
