
Issue a warning for a missing comma between list items

The Story:

When a list of strings is defined on multiple lines, it is often easy to forget a comma between list items, like in this example case:

test = [
    "item1"
    "item2"
]

The list test would now have a single item "item1item2", because adjacent string literals are joined by implicit string literal concatenation.

Quite often the problem appears after rearranging the items in a list.
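This is easy to confirm in the interpreter; the adjacent literals are silently joined into one string:

```python
test = [
    "item1"
    "item2"
]
print(test)       # ['item1item2']
print(len(test))  # 1
```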


The Question:

Is there a way to, preferably using static code analysis, issue a warning in cases like this in order to spot the problem as early as possible?

    • Having a comma on every line of the list, besides making this less of a problem, also helps with cleaner diffs. Still, the answer is nice, because commas do get forgotten.
    • FWIW, I generally terminate the last item in multiline container literals with a comma in case I want to append more items or rearrange the existing items. IME, this tends to reduce the chances of unintended concatenation.
    • @PM2Ring: Ah, so the detail is: with this feature, we don't need a backslash to join strings that are inside brackets, so we can write print('He' 'llo') without a backslash after 'He'.

These are merely possible approaches, since I'm not really adept at static analysis.

With tokenize:

I recently fiddled around with tokenizing Python code, and I believe it exposes all the information needed to perform this kind of check once sufficient logic is added. For the given list, the tokens generated with python -m tokenize list1.py are as follows:

python -m tokenize list1.py 

1,0-1,4:    NAME    'test'
1,5-1,6:    OP  '='
1,7-1,8:    OP  '['
1,8-1,9:    NL  '\n'
2,1-2,8:    STRING  '"item1"'
2,8-2,9:    NL  '\n'
3,1-3,8:    STRING  '"item2"'
3,8-3,9:    NL  '\n'
4,0-4,1:    OP  ']'
4,1-4,2:    NEWLINE '\n'
5,0-5,0:    ENDMARKER   ''

This of course is the 'problematic' case where the contents are going to get concatenated. When a , is present, the output changes slightly to reflect this (I include only the tokens for the list body):

1,7-1,8:    OP  '['
1,8-1,9:    NL  '\n'
2,1-2,8:    STRING  '"item1"'
2,8-2,9:    OP  ','
2,9-2,10:   NL  '\n'
3,1-3,8:    STRING  '"item2"'
3,8-3,9:    NL  '\n'
4,0-4,1:    OP  ']'

Now we have the additional OP ',' token signifying the presence of a second element separated by a comma.

Given this information, we can use the really handy function generate_tokens in the tokenize module. tokenize.generate_tokens() (tokenize.tokenize() in Py3) takes a single argument, readline, a method of file-like objects which essentially returns the next line of that file-like object (relevant answer). It yields a named tuple with 5 elements in total, containing the token type and the token string along with the line number and position in the line.
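As a small sketch, generate_tokens can be driven from an in-memory string via io.StringIO rather than an actual file:

```python
import io
import tokenize

src = 'test = [\n    "item1"\n    "item2"\n]\n'
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    # TokenInfo is a named tuple: (type, string, start, end, line)
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

For the problematic list above, this prints the two STRING tokens with no OP ',' between them.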

Using this information, one could theoretically loop through a file, and when an OP ',' is absent inside a list initialization (whose beginning is detected by checking that the tokens NAME, OP '=' and OP '[' exist on the same line number), issue a warning on the lines where that was detected.

The good thing about this approach is that it is pretty straightforward to generalize. To cover all cases where string literal concatenation takes place (namely, inside the 'grouping' operators (), {} and []), you check that the token is of type = 51 (or 53 for Python 3) or that a value in any of (, [, { exists on the same line (these are coarse, top-of-the-head suggestions at the moment).
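A rough sketch of that generalization, assuming we simply flag any STRING token that directly follows another STRING token (ignoring the NL and COMMENT tokens that may sit between them); this covers all the grouping operators at once:

```python
import io
import tokenize

def adjacent_strings(source):
    """Yield the start position of every STRING token that directly
    follows another STRING token (implicit concatenation), skipping
    NL and COMMENT tokens in between."""
    prev_was_string = False
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NL, tokenize.COMMENT):
            continue
        if tok.type == tokenize.STRING:
            if prev_was_string:
                yield tok.start
            prev_was_string = True
        else:
            prev_was_string = False

src = '(\n    "a"\n    "b"\n)\nx = "c" + "d"\n'
print(list(adjacent_strings(src)))  # [(3, 4)] -- only the "b" on line 3
```

Note this also flags intentional implicit concatenation, so a real linter rule would need whitelisting on top.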

Now, I'm not really sure how other people go about these sorts of problems, but it seems like something you could look into. All the information necessary is offered by tokenize; the logic to detect it is the only thing missing.

Implementation Note: These values (for example, for type) do differ between versions and are subject to change, so that is something to be aware of. One could possibly work around this by only comparing against the named constants for the tokens, though.
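For instance, comparing against the named constant rather than the numeric value keeps the check version-independent (a small sketch):

```python
import io
import tokenize

src = "x = [1, 2]\n"
# tokenize.OP is a named constant; its numeric value differs between
# Python versions, so compare token types against the constant, never
# against a hard-coded number like 51 or 53.
ops = [t.string for t in tokenize.generate_tokens(io.StringIO(src).readline)
       if t.type == tokenize.OP]
print(ops)  # ['=', '[', ',', ']']
```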


With parser and ast:

Another possible solution, which is probably more tedious, could involve the parser and ast modules. The concatenation of strings is actually performed during the creation of the abstract syntax tree, so you could alternatively detect it there.

I don't really want to dump the full output of the methods for parser and ast that I'm going to mention, but, just to make sure we're on the same page, I'm going to be using the following list initialization statement:

l_init = """
test = [
    "item1"
    "item2",
    "item3"
]
"""

In order to get the parse tree, use p = parser.suite(l_init). After this is done, you can get a view of it with p.tolist() (the output is too large to include). What you will notice is that there are three entries for the three different str objects item1, item2 and item3.

On the other hand, when the AST is created with node = ast.parse(l_init) and viewed with ast.dump(node) there are only two entries: one for the concatenated strs item1item2 and one for the other entry item3.
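A minimal sketch of the ast side (the exact node classes for string literals vary by Python version, but the element count does not):

```python
import ast

l_init = """
test = [
    "item1"
    "item2",
    "item3"
]
"""
list_node = ast.parse(l_init).body[0].value
# The concatenation already happened during parsing: only two elements remain.
print(len(list_node.elts))          # 2
print(ast.literal_eval(list_node))  # ['item1item2', 'item3']
```

Unlike the raw token stream, the tree no longer distinguishes "item1" "item2" from a single literal, which is exactly why the concatenation is detectable here only by counting elements.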

So, this is another possible way to do it but, as I previously mentioned, it is way more tedious. I'm not sure whether line information is available, and you have to deal with two different modules. Just keep it in the back of your mind if you want to play around with internal objects higher up in the compiler chain.


Closing Comments: As a closing note, the tokenize approach seems to be the most logical one in this case. That said, it seems that pylint actually works with astroid, a Python library that eases the analysis of abstract syntax trees for Python code. So, one should ideally look at astroid and how it is used inside pylint.

Note: Of course, I might be completely over-analyzing it, and a simpler 'check for whitespace or newline' solution as you suggested might suffice. :-)

    • Awesome ast and tokenize samples! It would probably be a good idea to make a pylint rule out of it and propose it for merging. I'll play around with it. Thanks!
    • I really don't know how tools like pylint are implemented; it would be quite interesting to find out though. Yes, tokenize is pretty amazing in this specific case; you can gain a great deal of insight with all these modules that so easily let you 'peek in' and see what's happening. Do come back and post what you came up with if you do; I'd like to see the approach you decided to take.
    • The idea for the pylint rule should be improved before being proposed. If the first string ends with whitespace, or if the second string is so long that it cannot fit on the previous line, then no warning is probably desirable and the strings are concatenated intentionally.

I implemented a checker based on @Jim's post. Hopefully it works in all situations:

import tokenize
from io import BytesIO

def my_checker(pycode):
    """
    Tokenizes Python code and yields
    (start, end, source line) for every position where
    a scenario like this happens (missing separator between strings):
      [..., "a string" "derp", ...]
    """
    IDLE = 0
    WAITING_STRING = 1
    CHECKING_SEPARATOR = 2

    tokenizer = tokenize.tokenize(BytesIO(pycode.encode('utf-8')).readline)
    state = IDLE

    for toknum, tokval, start, end, strcode in tokenizer:
        if state == IDLE:
            if toknum == tokenize.OP and tokval == '[':
                state = WAITING_STRING

        elif state == WAITING_STRING:
            if toknum == tokenize.STRING:
                state = CHECKING_SEPARATOR
            elif toknum == tokenize.OP and tokval == ']':
                state = IDLE

        elif state == CHECKING_SEPARATOR:
            if toknum == tokenize.STRING:
                yield (start, end, strcode)
            elif toknum == tokenize.OP and tokval in ['+', ',']:
                state = WAITING_STRING
            elif toknum == tokenize.OP and tokval == ']':
                state = IDLE

my_code = """
foo = "derp"
def derp(a,x): 
    return str('dingdong'+str(a*x))
[
    "derp"+"FOO22"  , "FOO", "donk" "slurp",0, 0
]

class extreme_logical_class():
    STATIC_BAD_LIST = [0,
        "BLA,",
        "FOO"
        "derp"
    ] 
    def __init__(self):
        self._in_method_check = ["A" "B"]

nested_list = [
    ['DERP','FOO'],
    [0,'hello', 'peter' 'pan'],
    ['this', 'is', ['ultra', 'mega'
        'nested']] 
]
"""

for error in my_checker(my_code):
    print('missing , in list at: line {}@{} to line {}@{}: "{}"'.format(
        error[0][0],error[0][1],error[1][0],error[1][1], error[2].strip()
    ))

The result is:

keksnicoh@localhost ~ % python3 find_bad_lists.py
missing , in list at: line 6@36 to line 6@43: ""derp"+"FOO22"  , "FOO", "donk" "slurp",0, 0"
missing , in list at: line 13@8 to line 13@14: ""derp""
missing , in list at: line 16@37 to line 16@40: "self._in_method_check = ["A" "B"]"
missing , in list at: line 20@24 to line 20@29: "[0,'hello', 'peter' 'pan'],"
missing , in list at: line 22@8 to line 22@16: "'nested']]"

In real life I would prefer to avoid making such mistakes in the first place; there are good editors like Sublime Text which allow you to edit and format lists with multiple cursors. If you get used to those concepts, these sorts of "separation" errors won't happen in your code.

Of course, if one has a team of developers, one could integrate such a tool into the testing environment.
