XDXA


String Concatenation - Root of all Evil

After college, I spent three years in software security acting as a pen-tester. It was my job to identify security vul­ner­a­bil­i­ties in the company's software via ex­ploita­tion. During that time, string con­cate­na­tion was the root cause of the vast majority of my findings. For that reason, it is my general philosophy that:

If you're con­cate­nat­ing strings, you're doing it wrong.

What?

To make my point easier to grok, let's start with a simple example:

>>> person.first_name + " " + person.last_name
'Bob Smith'

We've joined a person's first and last name with a space. The concatenated value can be thought of as a space-delimited string. In and of itself, this is not an issue. But what happens when we need to reverse this transformation? Easy enough: using the previous output as our input, we split on the space character.

>>> name = "Bob Smith".split(" ")
>>> print(name)
['Bob', 'Smith']

What happens if the first name contains a space? If an attacker controls that value, an injected space could break the application's intended behavior. It's conceivable that, after deserialization, the value might be used in some security-related decision.

>>> name = "Bob Smith RealLastName".split(" ")
>>> print(name)
['Bob', 'Smith', 'RealLastName']
>>> if "Smith" == name[1]:
...     print("Welcome Smith!")
...
Welcome Smith!

To address the vul­ner­a­bil­i­ty, we have a few options:

  • Validate the first name, rejecting requests with spaces
  • Sanitize the first name, removing spaces
  • Introduce an escape sequence and escape the names

Validation and sanitization are appealing choices, but if the business logic forbids them, our only option is escaping. A simple solution would be to replace spaces with '\ ' and backslashes with '\\'.

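>>> def escape(s):
...     # Escape backslashes first, so the ones added for spaces aren't doubled
...     return s.replace('\\', '\\\\').replace(' ', '\\ ')
...
>>> escape(person.first_name) + " " + escape(person.last_name)
'Bob\\ Smith RealLastName'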

You might be inclined to argue that this is a sufficient fix (never mind that it's not to spec). However, I'm here to suggest that's not true. Manually escaping and concatenating strings is inherently brittle. Forgetting to apply the escaping algorithm in one place is enough to introduce a severe vulnerability.

The mistake here was casually performing serialization rather than utilizing a library whose sole responsibility is serialization. This example is somewhat contrived, but the fundamental principle can be expanded to CSVs, XML, JSON, URIs, SQL, or any formatted data.
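To make that concrete, here's a minimal sketch of the name example handed off to a serialization library. I'm using Python's json module purely as an illustration; the function names serialize_name and deserialize_name are hypothetical, and any well-tested format would serve the same purpose.

```python
import json

def serialize_name(first_name, last_name):
    # The library decides how to delimit and escape the two fields.
    return json.dumps([first_name, last_name])

def deserialize_name(payload):
    # Parsing with the same library reverses the transformation exactly.
    first_name, last_name = json.loads(payload)
    return first_name, last_name

payload = serialize_name("Bob Smith", "RealLastName")
print(payload)                    # ["Bob Smith", "RealLastName"]
print(deserialize_name(payload))  # ('Bob Smith', 'RealLastName')
```

The injected space no longer matters, because the delimiter is never something the input can forge.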

Examples

Let's break down a few exploits to demon­strate my point.

SQL

Here we have a simple SQL injection in PHP:

<? "SELECT * FROM table WHERE col='" . $input . "'";

If the input contains a single quote, "'", the SQL query's structure is disrupted. An attacker can use this to alter the query. (This example is benign. There are sig­nif­i­cant­ly more malicious uses for SQL injection, but that's not what I'm here to discuss.)

<?
$input = "' OR ''='";
$query = "SELECT * FROM table WHERE col='" . $input . "'";
echo $query;

SELECT * FROM table WHERE col='' OR ''=''

How do we fix it? There are three general approaches:

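<?
// 1. Sanitize by stripping single quotes
"...WHERE col='" . str_replace("'", "", $x) . "'";
// 2. Escape with a library call
"...WHERE col='" . mysql_real_escape_string($x) . "'";
// 3. Parameterize the query
$stmt = $mysqli->prepare("...WHERE col=?");
$stmt->bind_param("s", $x);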

The first approach manually sanitizes the input by black­list­ing the single quote character. This approach is prob­lem­at­ic, because it doesn't account for all possible special characters. If the sur­round­ing quotes were instead double quotes, this method would fail. When I've seen this type of san­i­ti­za­tion, it is typically done at the beginning of the request, far away from the code which actually utilizes the variable. That increases the prob­a­bil­i­ty of such a mistake.

The second approach utilizes a library call to escape the input. The last approach parameterizes the query, sending the query and its arguments separately to the database. Both the second and third approaches utilize a library call, and both work for this query. If that's the case, what's wrong with using the version that utilizes string concatenation? What if the column was instead a number?

<? "...WHERE number=" . mysql_real_escape_string($x);

There are no characters to escape in this case! An injection like "0 OR 1=1" will still work. If the variable $x had been validated as an integer prior to use, there wouldn't have been an issue. However, that's a brittle solution, because a simple mistake (or refactor) could still compromise the application's security.

What's really wrong here? The query is the formatted data. The library call is used to escape a value which is being serialized into the query. That's the re­spon­si­bil­i­ty of the library. By using string con­cate­na­tion, we've ef­fec­tive­ly written part of a SQL serializer.
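Here's a rough sketch of the numeric-column case in Python, using the standard sqlite3 module with an in-memory database. The accounts table and the two helper functions are made up for illustration; the point is only to contrast concatenation with a bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (number INTEGER, owner TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, "bob"), (2, "eve")])

def find_concatenated(x):
    # Vulnerable: x becomes part of the query text itself.
    return conn.execute(
        "SELECT owner FROM accounts WHERE number=" + x
    ).fetchall()

def find_parameterized(x):
    # x is bound as a value, so it cannot alter the query's structure.
    return conn.execute(
        "SELECT owner FROM accounts WHERE number=?", (x,)
    ).fetchall()

attack = "0 OR 1=1"
print(len(find_concatenated(attack)))   # 2 (every row matches)
print(len(find_parameterized(attack)))  # 0 (no account has that "number")
```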

(Learn more about mo­ti­va­tions to use pa­ra­me­ter­ized queries.)

URI

URIs are a fascinating target for injection attacks. They are extremely easy to construct without a library, which means it's done frequently, and the results work most of the time. As more applications move towards service-oriented architectures, URIs are increasingly used as a component of the communication. If unescaped input is placed into the URI, it might be possible to alter the application's intended behavior.

In the following example, we have a web ap­pli­ca­tion which transfers funds by dis­patch­ing to an HTTP service. The user controls the variable amount, but not user or target. The service domain s.xdxa.org is not internet accessible. We can assume the backend service will validate amount as a positive integer prior to use.

>>> ('http://s.xdxa.org/pay?amount=' + amount +
...  '&from='   + user +
...  '&to='     + target)
'http://s.xdxa.org/pay?amount=100&from=eve&to=bob'

To force Bob to pay Eve, we only need to alter the URI structure.

>>> amount = '100&from=bob&to=eve#'
>>> print ...
'.../pay?amount=100&from=bob&to=eve#&from=eve&to=bob'

The fragment (everything after the '#') is dropped before making the backend request. As a result, the service's log will look completely normal. To fix this, use a library to generate URIs. This is typically accomplished by using either a builder or a map of parameters to arguments.
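As a sketch of the "map of parameters to arguments" style, here's the same URI built with Python's urllib.parse. The pay_url helper is hypothetical; urlencode and urlunsplit are the standard library calls doing the actual work.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def pay_url(amount, user, target):
    # urlencode percent-encodes each value, so '&', '=' and '#' lose
    # their structural meaning inside an argument.
    query = urlencode({"amount": amount, "from": user, "to": target})
    return urlunsplit(("http", "s.xdxa.org", "/pay", query, ""))

url = pay_url("100&from=bob&to=eve#", "eve", "bob")
print(url)
# http://s.xdxa.org/pay?amount=100%26from%3Dbob%26to%3Deve%23&from=eve&to=bob

# The backend sees the attacker's string as the amount, and nothing more.
print(parse_qs(urlsplit(url).query))
```

The backend's integer validation of amount now rejects the request instead of silently paying the wrong person.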

Unfortunately, libraries for RESTful URIs, where arguments may be part of the path, are scarcer. In that circumstance, concatenation plus escaping may be your only option. However, if you do that, isolate that code in its own module, i.e., create a library. The Java library Handy URI Templates is a good model for generating said URIs.
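If you do end up writing that small library yourself, it might look something like the sketch below. The build_path_url function is hypothetical and deliberately conservative: it percent-encodes every segment with urllib.parse.quote and refuses empty and dot segments outright, which is my own assumption about what a safe default looks like.

```python
from urllib.parse import quote

def build_path_url(base, *segments):
    """Join path segments onto base, percent-encoding each one."""
    encoded = []
    for segment in segments:
        # "." and ".." are not changed by percent-encoding, so reject them.
        if segment in ("", ".", ".."):
            raise ValueError("refusing ambiguous path segment: %r" % segment)
        # safe="" ensures '/' inside a segment is encoded as well.
        encoded.append(quote(segment, safe=""))
    return base.rstrip("/") + "/" + "/".join(encoded)

print(build_path_url("http://s.xdxa.org", "users", "bob/../eve", "profile"))
# http://s.xdxa.org/users/bob%2F..%2Feve/profile
```

All of the concatenation lives in one place, so there is exactly one function to audit.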

Additionally, be careful in other sections of the URI. Other components, e.g., scheme and authority, have different rules. If you want to know more, The Tangled Web has an entire chapter dedicated to the URL.

HTML

XSS (Cross-Site Scripting) is a vul­ner­a­bil­i­ty where an attacker is able to inject JavaScript into a page, which is then executed in a victim's browser. It's a subset of the attacks against poorly serialized HTML. The attack is made possible by the fact that HTML mingles mark-up and code. In fact, HTML is composed of a number of different data formats within a single document:

  • HTML Tags
  • JavaScript
  • CSS
  • URIs

When represented in an object structure, we refer to the page as a DOM (document object model). However, to get that page to the browser, it's serialized as a string and transmitted over HTTP. It is in that process that XSS is made possible. To make this concrete, let's take a look at an example. What happens if an attacker includes a script tag in the URL arg parameter? (Again, this is a simplistic example. There are significantly more malicious XSS attacks.)

<span><?php echo $_GET['arg']; ?></span>
becomes
<span>Hello! <script>alert(1);</script></span>

The user content altered the structure of the document. To fix this, we could escape the variable prior to output. However, that is simply string con­cate­na­tion. The output is exactly the same.

<span><?php echo htmlentities($_GET['arg']); ?></span>
is the same as
<?  echo '<span>' . htmlentities($_GET['arg']) . '</span>';

This means, it falls prey to all the same risks as above. One missed escape, or escaping for the wrong context, and we have a vul­ner­a­bil­i­ty.

<a href="<?php echo htmlentities($_GET['arg']); ?>">...</a>
becomes
<a href="javascript:alert(1);">...</a>

But wait! XSS is really because a programmer failed to check for JavaScript within the user inlined content. That's true, in the im­ple­men­ta­tion of HTML we know and love. If the issue wasn't truly with the failure to preserve the document structure, we could simply introduce a tag which would disable all scripts within the child element. That hasn't happened, because the issue is really about the injection. (As an aside, avoid using HTML for user supplied markup. It's a losing battle.)

So what do I recommend? Templating engines. (Keep in mind, templating engines may not au­to­mat­i­cal­ly escape variables within your output. Escaping must be configured on a per-engine basis.)

<span>{{ input.arg }}</span>

That having been said, automatic escaping is typically not a panacea. Why? The escaping is normally context unaware, which means it's performed for HTML tags. If your templating engine allows bad markup, it's doubtful that it un­der­stands the context for which it's escaping. Where can this bite you?

  1. Tag Attributes
  2. URIs
  3. CSS
  4. JavaScript
    (It is rarely safe to output content into a script tag. Avoid this whenever possible.)
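To illustrate the URI case from the list above, here's a small Python sketch. HTML escaping (html.escape from the standard library) leaves a javascript: payload untouched, because nothing in it is special to HTML. The has_allowed_scheme helper is hypothetical, and an allow-list of http and https is just one assumption about what a page might accept.

```python
from html import escape
from urllib.parse import urlsplit

payload = "javascript:alert(1);"

# HTML escaping only rewrites &, <, >, and quotes, so this is a no-op.
print(escape(payload, quote=True))  # javascript:alert(1);

def has_allowed_scheme(url, allowed=("http", "https")):
    # Escaping for the URI context means checking the scheme as well.
    return urlsplit(url).scheme.lower() in allowed

print(has_allowed_scheme(payload))              # False
print(has_allowed_scheme("https://xdxa.org/"))  # True
```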

Another even safer approach is to construct your page as a DOM. This will ensure tags, attributes, and possibly URIs are properly escaped, but probably not CSS or JavaScript. Almost no frameworks actually do this, because it's cumbersome and (probably) more costly in CPU and memory.
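As a rough idea of what that looks like, here's a sketch using Python's xml.etree.ElementTree. It's an XML library rather than a full HTML DOM, so treat it as a stand-in; render_greeting is a hypothetical name.

```python
import xml.etree.ElementTree as ET

def render_greeting(arg):
    # Build the element as an object; user content is only ever text.
    span = ET.Element("span")
    span.text = arg
    # The serializer escapes the text when writing out the markup.
    return ET.tostring(span, encoding="unicode", method="html")

print(render_greeting("Hello! <script>alert(1);</script>"))
# <span>Hello! &lt;script&gt;alert(1);&lt;/script&gt;</span>
```

Note that this serializer writes the contents of script and style elements verbatim, which lines up with the CSS and JavaScript caveat.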

Tangential to string concatenation, is a perfect templating engine enough? Unfortunately, browsers suck (from a security perspective). You could trace this back to the browser wars, where browsers which refused to render bad markup were labeled at fault. Regardless of the history, browsers have extremely "flexible" parsers. As a result, even if a document looks rightish, it might still contain an XSS. While I generally do not recommend sanitizing user input (it feels too much like a blacklist), input destined for HTML is an exception for the reasons outlined above. See OWASP's Sanitization Recommendations for candidate libraries.

You could easily write a book on injection attacks against HTML. For that reason, I'm going to stop here. Want to know more?

JSON

To avoid making this article much longer, I'm not going to break out a JSON example. However, I do want to mention it, because I have frequently seen engineers construct JSON from a string. Don't do this.

Embedded Variables

Note, there is no distinction between embedding a variable and string concatenation. Neither of the following two examples addresses the serialization problem:

<? "$first $last" === $first . ' ' . $last;

Or with sprintf like sub­sti­tu­tion:

"%s %s" % (person.first_name, person.last_name)

Logging

Before we conclude, I want to make a special note with regards to logging. In my code, casual string con­cate­na­tion most frequently occurs in logging statements. Logs are typically loosely formatted, newline delimited files. Loosely structured output is mostly fine when intended for human con­sump­tion. However, this casual se­ri­al­iza­tion format is a nightmare for log parsers.

For this reason, one must be careful when logging output which will eventually be consumed by some downstream process. While not strictly the responsibility of the logger, HTML log viewers have been susceptible to XSS attacks in the past.
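Here's a minimal sketch of the difference, assuming a newline-delimited log and a hypothetical pair of formatters. Emitting one JSON document per line is just one option; the idea is that a serializer, not the caller, owns the delimiter.

```python
import json

def format_plain(user, action):
    # Casual concatenation: a newline in user forges a second entry.
    return "user=" + user + " action=" + action

def format_json(user, action):
    # json.dumps escapes the newline, so one event stays on one line.
    return json.dumps({"user": user, "action": action})

forged = "eve\nuser=bob action=login"
print(len(format_plain(forged, "logout").splitlines()))  # 2
print(len(format_json(forged, "logout").splitlines()))   # 1
```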

See CWE-117 and CAPEC-106 for more in­for­ma­tion.

Clar­i­fi­ca­tion

This concept is not new or innovative. The weakness I'm describing is actually CWE-707: Improper Enforcement of Message or Data Structure, and it is not exclusive to string concatenation or even strings. I wag my finger at string concatenation because it makes these mistakes easy and, consequently, common.

Hyperbole

The title is mostly meant as tongue-in-cheek, but I'm quite serious about the severity of this topic. In my opinion there are only two places string concatenation should occur:

  1. Se­ri­al­iza­tion/De­se­ri­al­iza­tion Libraries
  2. Un­for­mat­ted Data

Many security issues simply do not have an easy solution, e.g. side channel attacks. However, if engineers con­sis­tent­ly use libraries for se­ri­al­iza­tion, the majority of injection based attacks will simply go away.

Is it possible to write secure code with con­cate­na­tion? Yes. However, this code will almost always be brittle. You might understand the security con­sid­er­a­tions, but the next engineer probably will not. Writing secure code isn't just preventing a certain type of attack. It's following best practice throughout the entire code base to set the standard which avoids careless mistakes.
