Having some fun with Python

January 20, 2020 Django, Pedantics, Programming, Python, Security

The other day on a Slack I hang out in, someone posted an amusing line of Python code:

port = "{port}:{port}".format(port=port)

If it’s not clear after the inevitable Swedish-chef-muppet impression has run through your mind, this string-formatting operation will replace the contents of port with a string containing two copies of whatever was in port, separated by a colon. So if port was "foo", now it will be "foo:foo".

Someone of course suggested using f-strings for a more readable version:

port = f"{port}:{port}"

But the person working with it replied they couldn’t count on the availability of a sufficiently-new Python (f-strings were introduced in Python 3.6). I suggested this alternative, which works all the way back to Python 2.6:

port = "{0}:{0}".format(port)

And then I joked that I could provide less-readable versions, too, if desired. And of course someone took me up on that, so the next time I had a few minutes to spare I put something together. And then later tweaked it a bit more, to produce… well, whatever this is:

from math import floor, nan, sqrt
from operator import itemgetter, xor


port = "foo"

port = getattr(
    bytes([0x3F ^ 0b00000101]).decode("cp1252").join(
        [repr(set([sum([nan.is_integer()])]))]
        *xor(*map(lambda іn: int(іn, 0x10), str(0o100)))),
    str(format).split()[-1][:-1])(
    vars()[str(str)[1:1].join(
        itemgetter(nоt[False])(nоt[True])
        for nоt in zip(
            reversed(
                [floor(fоr *(sqrt(13) + sqrt(3)))
                for fоr in range(True, bin(0x1F).count(repr(--True)))]
            ),
            (repr(repr), repr(repr), repr(str), str(repr))))])


print(port)

Try it out. It will print "foo:foo". Or change the initial definition of port to whatever value you like.

This is, of course, not something I’d ever do in real code. And being able to do it isn’t a particularly useful skill (the same goes for making it even less readable, which wouldn’t be too difficult, by taking any of the techniques a bit further, or using any of the many other ways to obfuscate what code is doing). But it is a bit silly and fun, and writing it was a nice break from some heftier stuff I’ve been doing at work lately. And understanding why it works, and how it was produced, explores a few things about Python that — when applied more practically — might actually be useful to know.

So let’s work through it.

What’s really going on

The mess of code above is doing exactly the same thing as my original suggestion: port = "{0}:{0}".format(port). It’s just constructing the format string, the method call, and the argument in incredibly roundabout ways.

Hide the method call

Python’s getattr() function lets you pass in an object and a (string) attribute name, and it gives you back whatever’s in that attribute on that object. So, for example, x.y and getattr(x, 'y') are equivalent. This can be useful for inspecting an object’s attributes (especially when paired with hasattr()), and It also provides a compact idiom for handling an attribute that might not exist: you can pass a third argument to getattr(), and that value will be returned if no such attribute is found on the object.

So "{0}:{0}".format(port) can be rewritten as getattr("{0}:{0}", "format")(port). You’d never want to do this in real code, since you know Python’s str type has a format() method, and you can just call it directly, but readable things that would happen in real code are not what we’re going for here.

A really wordy character

Now let’s do something about that format string. "{0}:{0}" is way too clear and might let someone actually understand the code at first glance. So let’s find an alternative way to generate it.

What we’re going to do here is build up the ":" and the two instances of "{0}" separately and then combine them. So first we need to get a colon character. Again, just writing ":" would be too easy, so let’s throw the casual reader off by finding another way to get that single-character string.

One option is to generate it as a byte value and then decode it to a string. In ASCII, the character : is represented by a byte with decimal value 58. so we need a way to generate the number 58. When I actually wrote this code, this was one of the last bits I did and I was running low on novel ideas, so I just used bitwise math. You can get 58 from lots of options, but I went with 63 ^ 5 (if you’re not familiar with the bitwise operators, this one is XOR). And for a bit more obfuscation, 63 is expressed in hexadecimal and 5 in binary (Python supports integer literals in binary, octal, decimal, and hexadecimal). The 63 in particular is nice because it’s 2**6 - 1, and things that either are a power of 2, or are a power of 2 minus 1, are nice red herrings that jump out at a reader.

Then we need to decode it from the byte value to get our one-character string. Decoding from ASCII would be easy but also obvious, so instead I’ve used decoding from Windows-1252, an encoding that will strike fear into the heart of anyone who’s ever had to write string-processing code, largely because parts of it are similar to other common encodings, but also different in ways that are likely to show up at 3AM on a weekend and set off your pager. This gives us the expression:

bytes([0x3F ^ 0b00000101]).decode("cp1252")

as a replacement for the single-character string ":"; anywhere we’d use one, we can use the other. Now, what about those {0} formatting sequences?

NaNing in place

To get the full "{0}:{0}" format string we still need those two positional placeholders. One trick we can use is to find something else whose string representation is "{0}". And it turns out Python has that: the set type, which is for a collection of distinct values (i.e., a set made from the list [1, 1, 1, 1] would contain only a single member). And while there is of course a set() function built in to construct a set from an iterable type, you can also build one using literal syntax, wrapping the contents in curly braces. Which is useful because that literal syntax is also the default string representation of a set, so if we construct a set containing only the integer 0, its string representation will be "{0}", exactly what we want.

Of course, just writing the set literal {0} would give the game away. So too would writing set([0]). How else can we get the integer 0?

Here we turn to a bit of Python trivia: Python originally didn’t have a boolean type, so most programmers used integer 0 and 1 as stand-ins. When the bool type was eventually added in Python 2.3, it was done in a way that was compatible with the older integer convention: bool is a subclass of int, and the two instances False and True have integer values 0 and 1. You can plug them in to anything that expects an int and it’ll work.

So if we need a 0, we can get it by finding some boolean-false expression and slyly upcasting it into a regular int.

My chosen boolean expression is math.nan.is_integer(). The constant math.nan provides access to the NaN value of IEEE 754 floating-point, which is what Python’s float type is. NaN is the bane of certain types of programming, because you can accidentally produce it from floating-point math operations and its behavior is weird and likely to break code that isn’t expecting it.

But importantly, it is a float instance, which means it has the usual methods and attributes of Python floats, including is_integer(), a handy method that tells you if a given float is equivalent to some integer. For example, if my_float = 2.0, then my_float.is_integer() is true (but not if my_float = 2.01). And since NaN is very much not an integer, math.nan.is_integer() will give us a boolean False.

From there we just need to turn it into a regular int in a way that won’t be too obvious. The built-in function sum() will do nicely; it will happily add up a list of bool for us (since bool instances are numeric and support all the usual arithmetic), but in doing so it will launder the input into an int return value. So sum([nan.is_integer()]) == sum([False]) == sum([0]) == 0. We feed that into set() (wrapping it again in a list, since set() wants an iterable), and then repr() it (repr() is similar to str(), but is meant to return a more debugging-friendly representation that could be used to reconstruct the thing you’re printing out; through most of this code, it doesn’t matter which one is used, so I was deliberately inconsistent as a way of covering up the one place where str versus repr really did matter). Which gives the expression:

repr(set([sum([nan.is_integer()])]))

That will evaluate to "{0}", meaning we can use it in place of the literal string.

Now there are two of them!

But we don’t need just one "{0}" placeholder in our format string. We need two, one on either side of the colon. We can get that via the * operator, which doesn’t just do multiplication.

Python supports operator overloading, which means that its various operators, as well as some flow-control protocols, can be implemented on any type via methods with special names. And Python’s built-in types make use of this. So the implementation, and thus the behavior, of * depends on the types of its operands; when both operands are int, for example, it multiplies them. But when one of them is an int and the other a sequence, it performs repetition. Try it yourself: type [1, 2, 3] * 2 into a Python interpreter.

(if you try an operator on types which don’t have a compatible implementation, like two strs, you get TypeError)

So we can use the * operator to turn one instance of our placeholder string into any number of them. We want two, and we can get them, in list form, by doing ["{0}"] * 2, which will yield ["{0}", "{0}"]. And we already have an expression that will evaluate to "{0}". So we could just wrap it in brackets and tack on a * 2, but where’s the fun in that?

Instead we’re going to generate a 2. And once again we’ll use bitwise XOR, but this time via the operator.xor function (the operator module in the standard library provides Python’s standard operators wrapped in functions; operator.xor(x, y) will return the result of x ^ y). Lots of expressions will do this, but I chose 6 ^ 4 because, again, it let me embed a suspicious-looking red herring in the code.

I mentioned this briefly earlier, but to explain in full: programmers often react reflexively to “magic numbers”, values that are instantly recognizable from certain domains. And powers of 2 are fairly commonly known and commonly used. 6 ^ 4 uses two numbers that, treated as individual digits, combine to 64, which is a power of 2. And here we see the actual int 64 — though specified in its octal form as 0o100 — appear in the code. It’s then fed to str() which gives us the string "64".

Now, Python’s strings are a sequence type, which means the same operations that work on other sequences (like lists and tuples) also work on strings. Sometimes this can lead to subtle bugs, but here it’s used as a feature: we’re going to operate on the sequence "64", turning each character in it into an integer, and then feeding them to operator.xor to get our desired result (remember, we’re doing all this work to get a 6 ^ 4 = 2).

You can maybe start piecing this together already: [i for i in map(int, "64")]. The built-in map() takes a function and a sequence, and returns the result of applying the function to each item in the sequence. So that would return [6, 4], which we can then feed to operator.xor() to perform 6 ^ 4 and finally get that 2 we were after.

To do this, we need to use a * to trigger Python’s iterable-unpacking behavior: some_function(*arg), where arg is a list or other iterable, “unpacks” arg into a series of individual arguments to be passed to some_function. Thus, if arg = [1, 2, 3], then some_function(*arg) is exactly equivalent to some_function(1, 2, 3). This also means we use * for two different purposes on the same line of code.

But there are two more tricks here. One is that we don’t just directly do map(int, …) — instead we supply the optional second argument to int, which tells it what base to use when interpreting the incoming digits. Here I used base 16, which doesn’t affect the result of turning "6" and "4" into integer, but does let me complement the octal 0o100 with a hexadecimal 0x10 to specify the base.

The other trick is that doing this in map() requires a lambda expression, defining a one-off anonymous function which will receive a string, and pass it to int() with an extra argument of our choice (the base). And of course that anonymous function needs to name its argument. So what name is used? Why, іn, of course. Now, in is a Python operator, and so it’s reserved; it’s illegal in Python to name anything — function, class, variable, argument, you name it — in. So how does this work? By not naming it in. Python lets you use a broad range of Unicode in identifier names, and here the forbidden Latin-alphabet i which would produce the reserved word in is replaced by its lookalike Cyrillic equivalent, which is perfectly legal to do in Python and creates a name which visually looks exactly like the built-in in operator but crucially isn’t the in operator.

So now we have:

[repr(set([sum([nan.is_integer()])]))]
*xor(*map(lambda іn: int(іn, 0x10), str(0o100)))

Which is just a very roundabout way of saying ["{0}"] * 2, which gives us ["{0}", "{0}"]. We’re almost to the point of being able to write "{0}:{0}".format!

Speaking of formats

Recall we’re doing this via getattr(), so actually what we need is getattr("{0}:{0}", "format"). How are we going to get that "format" in there?

Well, there’s also a built-in function in Python called format(). And like the functions in the operator module, it’s the equivalent of a method: in this case, format(x, y) is the same as x.format(y). But here we pass format — the function object itself, without calling it — to str(). This gets us a string representation of the function, which will be the string "<built-in function format>". Now we just need to extract the substring "format" from it, which is a three-step process:

Split on spaces, using the split() method, yielding ["<built-in", "function", "format>"].
Index into that list with [-1] (the index -1 always refers to the last item in a list or tuple) to get "format>".
Take a slice of that which pulls everything except the last character: [:-1].

The result is the string "format", now available free and clear:

str(format).split()[-1][:-1]

So now we have the sequence getattr("{0}:{0}", "format") constructed. We just need to call the result (which, remember, is going to be the format() method of the string "{0}:{0}") passing in the argument: port.

A string of characters

We’re eventually going to need the actual variable port, but we can work our way there by first generating the string "port". And again, we don’t want to be obvious about it! So we’ll reuse the trick that got us the string "format" above, but in a more complex way. Instead of finding something with a string representation that contains the exact substring "port", we’re going to find things whose string representations contain the individual characters "p", "o", "r", "t", and then join them up.

This is what that repr(repr), repr(repr), repr(str), str(repr) sequence is doing at the very end: the string representation of the repr function is "<built-in function repr>", and the string representation of the str function is "<built-in function str>". So we’re going to look for the index in those strings where particular characters first occur:

p first occurs at index 21 in the representation of repr.
o first occurs at index 16 in the representation of repr.
r first occurs at index 10 in the representation of str.
t first occurs at index 5 in the representation of repr.

I admit this was the only part where I had to resort to trial-and-error a bit. I wanted to find a set of integer indices in the representations of str and repr which would give me the characters I wanted in order, but have that set of integer indices itself be ordered, either ascending or descending. The list above gets me a descending sequence: 21, 16, 10, 5. I can reverse that to 5, 10, 16, 21, and plug it into OEIS, the “On-Line Encyclopedia of Integer Sequences”, which will tell me if there’s some mathematical function which will generate those integers in order. It turns out there are several such sequences, and the first one that popped up was sequence A172334, which is easy to calculate and begins 0, 5, 10, 16, 21, …, so I only need to slice off one value at the start.

This is what that floor() and sqrt() stuff is doing; it’s generating the A172334 sequence. Then we’ll grab the 2nd through 5th values of the sequence, which are the values we want to use to index into string representations of str and repr, but in reverse order (so they get fed to Python’s reversed() function to get them in the correct order).

Once again it looks like the sequence generation is using a reserved name, in this case for:

reversed(
    [floor(fоr *(sqrt(13) + sqrt(3)))
    for fоr in range(True, bin(0x1F).count(repr(--True)))]
)

And once again the trick is using a Cyrillic character (this time, replacing Latin o).

But what about that range call? You may be getting the hang of this enough now to work out that it’s doing range(1, 5) to get the desired values (we want indices 1-4, inclusive, of the mathematical sequence). But it’s using the bool-is-an-int trick again: that’s how we get the 1 to start the range. For the 5 to tell it where to cut off, we pass 31 (hexadecimal 0x1f) to bin, which gives us its binary representation as a string ("0b11111"). Then we use .count() to count the number of times the character 1 appears in that string: it’s 5. To get the single-character string "1" for that operation we use repr(--True), which once again uses True as an integer with value 1, and negates it twice (the first will launder it into a plain int with value -1, the second turns it back into positive 1).

Finally we need to actually pull the right characters out. We’ve generated two sequences here: one is a list of integers (21, 16, 10, 5) and the other is a list of strings where the desired characters are found at the indices given by those integers. These sequences are of the same length, so we can pass them to zip() to make a list of tuples combining them. The result will be the same as doing:

[(21, repr(repr)), (16, repr(repr)), (10, repr(str)), (5, str(repr))]

This indicates which index of each string to use. We then use operator.itemgetter to actually get the characters. The itemgetter function is another of those wrappers for built-in operators: in this case, itemgetter(x)(y) is equivalent to y[x]. One more time we use the lookalike-character trick to make a confusingly-named loop variable (in this case it looks like the reserved-word not), and the booleans-as-integers trick (we want the things at indices 0 and 1 of the tuples we’re generating, so the booleans handle that nicely). Which gives us:

itemgetter(nоt[False])(nоt[True])
for nоt in zip(
    reversed(
        [floor(fоr *(sqrt(13) + sqrt(3)))
        for fоr in range(True, bin(0x1F).count(repr(--True)))]
    )
    (repr(repr), repr(repr), repr(str), str(repr)))

This yields the sequence ["p", "o", "r", "t"].

We’ll then use .join() on an empty string: "".join(["p", "o", "r", "t"]) gives "port".

There’s not much trickery here. To generate the empty string we use str(str)[1:1], which might at first glance look like it actually does something (since it’s working on non-zero indices), but really just takes a zero-length slice of the string str(str).

We’ve come so var

Finally, we need to turn the string "port" into a reference to the actual variable port. There are a few ways to do this, and I didn’t obscure it at all when I chose vars(). This is a function that will return a dictionary of locally-defined variables (when called with no arguments), or a dictionary of an object’s attributes/methods (when called with an object as argument). You could also do this with locals() (which just gets you the dictionary of local variables and doesn’t support working on an object), or even globals() (dictionary of global variables). vars()["port"] will be the actual local variable port.

And since we’ve generated the string "port", we can pass that as a key to the dictionary returned by vars() and finally put it all together:

from math import floor, nan, sqrt
from operator import itemgetter, xor


port = "foo"

port = getattr(
    bytes([0x3F ^ 0b00000101]).decode("cp1252").join(
        [repr(set([sum([nan.is_integer()])]))]
        *xor(*map(lambda іn: int(іn, 0x10), str(0o100)))),
    str(format).split()[-1][:-1])(
    vars()[str(str)[1:1].join(
        itemgetter(nоt[False])(nоt[True])
        for nоt in zip(
            reversed(
                [floor(fоr *(sqrt(13) + sqrt(3)))
                for fоr in range(True, bin(0x1F).count(repr(--True)))]
            ),
            (repr(repr), repr(repr), repr(str), str(repr))))])


print(port)

Which is equivalent to:

port = "foo"
port = "{0}:{0}".format(port)
print(port)

But why?

Again, “writing obfuscated code” as a generic skill is not particularly useful most of the time. But: think on how many different corners and quirks of Python and its standard library went into producing this. Even though it’s not good code (it’s exactly the opposite!), it probably has at least something to teach to Python programmers of almost any level of experience.

Some of the techniques used also point to things that are useful to know about. Even setting aside learning more about Python, several of the things used above have security implications for real-world code, and may stretch your brain a bit by making you wonder “what if someone did that to my app?” I suppose, or hope, that’s the same reason why things like the International Obfuscated C Contest exist. And when I teach web-app security tutorials, for example, I often mention JSFuck, which applies one of those techniques to translate any JavaScript into equivalent code written using an alphabet of only six characters. Web frameworks — like Django, to take an example — can’t protect you out-of-the-box from all these things, so being aware of them can be useful knowledge.

And, ultimately, the reason I did this is because I thought it was fun. I enjoy programming, but it’s not something I typically do for recreational purposes; between my job and my open-source commitments, I already spend plenty of time coding. Still, every once in a while I do write something useless just for the pleasure of doing it (as opposed to doing something useless out of anger, which also does happen occasionally).