Shimming PHP for Fun and Profit

November 15 2016

Slow websites in PHP are not especially rare, but most developers and clients do not seem to care too much. When they do care, there are usually strategies to address the problem. This was one of those times when the client cared and I had nothing to offer them.

(Names changed to protect the ~~guilty~~ blameless)

In this instance, it was a PHP 5.5 website that was experiencing horrific (5-20 second+) load times on many of its pages. Usually such extreme symptoms indicate a degraded subsystem - slow disk, network timeouts, etc. In this case though, the CPU was pinned and there was no sign of anything untoward in strace - userland PHP code was legitimately pinning the cores.

I had spent a short amount of time profiling the application some months earlier. By all indications, the framework upon which the site was built was doing something really stupid. tl;dr: there was a 900 KiB longtext column in the database that was being base64 decoded and then unserialize’d some 100+ times per request. This column was fundamental to the site architecture - the client had been painted into a corner by the framework developers and had no escape hatch.

I wrote a benchmark (just with the serialized data and the two PHP calls) that illustrated the problem as part of my initial report to the client:

cat client.b64 | php client.b64.bench.php 
2.21844410896

Yeah, that’s in seconds. Gross.
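
For illustration, a benchmark of this kind can be sketched as below. This is not the original client.b64.bench.php: the function name bench_decode, the iteration count and the stand-in payload are all assumptions. The real script read the client's data from stdin, which would be something like stream_get_contents(STDIN) in place of the generated payload.

```php
<?php
// Hypothetical reconstruction of a decode benchmark, not the original script.

/**
 * Run base64_decode + unserialize on one payload $iterations times.
 * Returns the elapsed wall-clock time in seconds.
 */
function bench_decode($payload, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $value = unserialize(base64_decode($payload));
    }
    return microtime(true) - $start;
}

// Stand-in payload: a largish nested array, serialized and base64 encoded.
// The real input was the contents of the client's longtext column.
$rows = array();
for ($i = 0; $i < 5000; $i++) {
    $rows[] = array('id' => $i, 'title' => 'row ' . $i, 'tags' => array('a', 'b'));
}
$payload = base64_encode(serialize($rows));

printf("payload: %d bytes\n", strlen($payload));
printf("%.4f seconds for 10 iterations\n", bench_decode($payload, 10));
```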

For somebody who is comfortable finding and fixing hotspots like this, it sounds like a dream come true. Not so. A quick grep through the code indicated that this particular hotspot existed in at least a dozen different places in the code base. As my role in this scenario was that of an ops engineer, touching the client’s codebase was a no-no.

So, I left it for some months. The developers appeared to introduce some caching to the front page, but it still performed very poorly. The column had also grown to 3.5MiB, and the previous benchmark would now take at least 10 seconds.

That niggling feeling

Of course, the fact that I had failed to come up with an improvement that would not burden the developers stayed in the back of my mind for a long time.

To me, the hypothetical solution was simple. It was a trivially memoizable operation - many seconds per request could be saved by just caching the output of base64_decode and unserialize for each specific input.
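
As a minimal sketch of that idea (the DecodeCache class is a made-up name for illustration, and it keys on the raw input string for simplicity), memoizing the two calls looks roughly like this:

```php
<?php
// Illustrative only: a per-request memo for base64_decode + unserialize.
class DecodeCache
{
    private $cache = array();
    public $misses = 0; // how many times the expensive path actually ran

    public function decode($encoded)
    {
        // array_key_exists rather than isset, so that cached false/null
        // results (e.g. a failed unserialize) are not recomputed.
        if (!array_key_exists($encoded, $this->cache)) {
            $this->misses++;
            $this->cache[$encoded] = unserialize(base64_decode($encoded));
        }
        return $this->cache[$encoded];
    }
}

$encoded = base64_encode(serialize(array('site_name' => 'example', 'count' => 3)));

$cache = new DecodeCache();
for ($i = 0; $i < 100; $i++) {
    $settings = $cache->decode($encoded);
}

echo 'misses: ', $cache->misses, "\n"; // misses: 1
```

One hundred calls, one actual decode. The catch in my situation was that the calls were spread through code I was not allowed to touch, so the memo had to be slipped in underneath the built-ins themselves.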

Eventually I got annoyed enough with myself that I decided to try to attack this again.

A false start

I looked to the Zend engine - the PHP native runtime. This was clearly a mistake. Fuck Zend.

I would write a PHP extension that would shim the two functions and maintain a cache. Then, I would load the extension into cPanel and enable it for the client.

I soon realized that although writing an extension would get me some access to the engine, I had no idea what I was doing, the access it gave me was not getting me anywhere, and the Zend engine is very, very poorly documented.

So, my second approach would be to write a shared library that would override the specific functions in the /usr/bin/php binary via the dynamic linker’s LD_PRELOAD mechanism.

I identified my targets:

For base64_decode, I would need to shim php_base64_decode_ex
For unserialize, I would need to shim php_var_unserialize

I got close. I could find the original symbols with dlsym(RTLD_NEXT,...) and maintain a malloc’d cache for large inputs to those functions.

However, while base64_decode was quite easy to shim, unserialize proved more difficult, due to the way it constructed and destructed the zval** where the result was stored. In the end, I ended up with memory management issues, unless I turned the Zend Memory Manager off.

I know that I probably could have reached my goal with this approach, but a couple of days of hacking a C API with close to zero documentation had me close to giving up.

A less false start

This is where I should have started. Inevitably there have already been people looking for and implementing ways to override PHP built-ins.

Many references pointed to override_function, but although it would have been perfect, it required a PECL extension (apd) that had been abandoned for well over a decade by now.

runkit to the rescue! It was not compatible with PHP 7, but the client site was stuck on 5.5, and it would still work there. I built the module, deployed it to our shared server and enabled it for the client.

Initial testing indicated that it did not slow response times while turned on and not performing any work - suitable for production. Of course, I would only turn it on for the one client.

Writing the monkey-patch implementation for both functions proved to be doable within 40 LoC.
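
The general shape of such a patch can be sketched as follows. This is not the published code: perf_cached, perf_install_shim and PERF_MIN_BYTES are hypothetical names, and the sketch assumes runkit is loaded with runkit.internal_override=1 in php.ini, which runkit requires before it will modify built-in functions. Without runkit it simply leaves the built-ins alone.

```php
<?php
// Sketch only; names and the size cutoff are assumptions for illustration.
define('PERF_MIN_BYTES', 65536); // only memoize large inputs

function perf_cached($original, $input)
{
    static $cache = array();
    if (!is_string($input) || strlen($input) < PERF_MIN_BYTES) {
        return call_user_func($original, $input); // small input: pass through
    }
    if (!isset($cache[$original]) || !array_key_exists($input, $cache[$original])) {
        $cache[$original][$input] = call_user_func($original, $input);
    }
    return $cache[$original][$input];
}

function perf_install_shim($name)
{
    if (!function_exists('runkit_function_rename') || !ini_get('runkit.internal_override')) {
        return false; // runkit missing, or not allowed to touch built-ins
    }
    $backup = 'perf_orig_' . $name;
    // Move the built-in aside, then define a userland function in its place.
    runkit_function_rename($name, $backup);
    runkit_function_add($name, '$input', "return perf_cached('$backup', \$input);");
    return true;
}

foreach (array('base64_decode', 'unserialize') as $name) {
    $installed = perf_install_shim($name);
    echo $name, ': ', $installed ? "shimmed\n" : "left alone (runkit unavailable)\n";
}
```

Note that this single-argument shim drops the optional $strict parameter of base64_decode, which is only acceptable if nothing in the application passes it.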

My initial tests got me very excited:

$ php bench.php
Without monkey patch: Running 100 Iterations ...
13.661208868027
With monkey patch: Running 100 Iterations ...
1.9911880493164

Out of curiosity, I spun up kcachegrind to see where the patched version was spending its CPU time:

[Figure: kcachegrind results with crc32]

Whoops. I was maintaining a map of the memoized results and keying it with the CRC32 hash of the input - my quick research indicated to me that it was the fastest non-cryptographic hash in the PHP stdlib.

I tried also just using the straight input as the cache key (since it would benefit from string interning), but the result was slightly worse and spent its time in strcmp:

With monkey patch: Running 100 Iterations ...
2.1407101154327

Finally, I tried using strlen. The chance of a collision is obviously much higher, but from my previous investigation, I figured this was safe enough. Since the length is already part of the zval, it would not need any calculation:

With monkey patch: Running 100 Iterations ...
0.053225994110107

Shit. Time to try it on the client website.
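
For reference, the three cache-key strategies can be put side by side in a small sketch (the $keyers array and the sample inputs are made up for illustration). It shows the trade-off rather than the timings: crc32 has to read the entire input, the raw string is compared as a whole, and strlen is free but cannot tell apart two inputs of the same length.

```php
<?php
// Illustrative comparison of cache-key strategies; not the benchmark itself.
$keyers = array(
    'crc32'  => function ($s) { return crc32($s); },   // hashes every byte
    'raw'    => function ($s) { return $s; },          // the input is the key
    'strlen' => function ($s) { return strlen($s); },  // length only
);

// Two different payloads that happen to have the same length.
$a = base64_encode(serialize(array('v' => 1)));
$b = base64_encode(serialize(array('v' => 2)));

$collisions = array();
foreach ($keyers as $name => $keyer) {
    $collisions[$name] = ($keyer($a) === $keyer($b));
    printf("%-6s %s\n", $name, $collisions[$name] ? 'collision' : 'distinct keys');
}
```

Only the strlen key collides here, which is why it is safe only when you know that distinct inputs in your workload have distinct lengths.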

To Production and Beyond

Optimization is worthless without measurement, so I found the slowest link on the front page:

curl -X GET -I http://www.client.com.au/really-slow-page 0.00s user 0.00s system 0% cpu 20.174 total

20 seconds jeeeesus.

I turned on the runkit extension and re-ran the test. No change. Great.

I had distilled the shim into a single file that would redefine the base64_decode and unserialize functions, and it just needed to be included from the application entry point. Minimizing the surface area that the developer would need to interact with was the goal.

I quickly inserted (and subsequently removed) the shim:

include(dirname(__FILE__)."/perf_monkey_patch.php");

and re-ran the test:

curl -X GET -I http://www.client.com.au/really-slow-page 0.00s user 0.00s system 0% cpu 11.102 total

Not bad, but not great. Shedding 9 seconds is nothing to scoff at, but it was not as dramatic as I had hoped (e.g. the drop from roughly 13,700 ms to 53 ms in the benchmark). TTFB was consistently halved, more or less - usually more - across the site.

My instincts had been right that this deserialization was a big contributor to the performance issues. However, I had underestimated the contribution to slowness from other problems - complicated, long and numerous template conditions that took a long time to parse etc.

As an ops engineer, you want to be able to make helpful suggestions that extend past “this code is bad”. As the merely partial improvement in this instance illustrates, you can’t fix everything. However, it pays to have some flex and some empathy for the plight of the developer who regrets their framework choice.

The Code

Using runkit is not complicated, but I have published the monkey patch for anybody else who wishes to work around this absolutely indefensible design decision of ExpressionEngine.

github.com/alexzorin/ee_monkey_patch

Simply include the script from your application’s entry point.

Two things to be careful of:

Use a less collision prone cache key if your use case requires it.
If you are mutating the deserialization result, ensure that each caller receives their own copy of the value.
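
Both caveats can be sketched in a few lines. The helper names safer_cache_key and deep_copy are hypothetical, and deep_copy as written only follows public properties, so treat it as a starting point rather than a general solution. PHP arrays are copied on write, so a cached array is safe to hand out as-is; objects are handles, so every caller of a naive cache would share (and mutate) the same instance.

```php
<?php
// Illustrative helpers; not part of the published patch.

// Length plus checksum: costs a full pass over the input, but two inputs
// must match in both length and CRC32 to collide.
function safer_cache_key($input)
{
    return strlen($input) . ':' . crc32($input);
}

// Recursively copy arrays and objects so callers cannot mutate the cache.
// Only public properties are visited (get_object_vars from outside the class).
function deep_copy($value)
{
    if (is_object($value)) {
        $copy = clone $value; // shallow: nested objects are still shared
        foreach (get_object_vars($copy) as $prop => $inner) {
            $copy->$prop = deep_copy($inner);
        }
        return $copy;
    }
    if (is_array($value)) {
        foreach ($value as $k => $inner) {
            $value[$k] = deep_copy($inner);
        }
    }
    return $value;
}

$cached = new stdClass();
$cached->owner = new stdClass();
$cached->owner->name = 'alice';

$copy = deep_copy($cached);
$copy->owner->name = 'bob';

echo $cached->owner->name, "\n"; // alice: the cached value is untouched
```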

A Note About Consent

Obviously, do not randomly hack the applications or PHP runtime of your clients without permission.

There are some things you can do with some degree of safety (all of which were done in this case):

Find the pathological code and create a reduced, independent test case/benchmark
Work on improving the test case
When it is time to test:
- Deploy to staging or test if it exists
- Do not affect live traffic. Specifically in this instance, use whatever mechanism you have available to ‘switch on’ runkit: query parameters, IP address, session, etc.
- Minimize your test duration, revert all changes
Have a mandate or some form of permission to work on this issue to begin with
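
One way to keep the patch away from live traffic is to gate the include behind a check that only your own test requests can pass. The sketch below is an assumption about how that could look, not what was deployed: the function name perf_patch_enabled, the perf_patch query parameter, the token and the IP address (from the documentation range) are all placeholders.

```php
<?php
// Illustrative gate; every name and value here is a placeholder.
function perf_patch_enabled(array $server, array $get)
{
    $allowedIps = array('203.0.113.10'); // your own address
    $token      = 'perf-test';           // expected ?perf_patch= value

    $ip   = isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '';
    $flag = isset($get['perf_patch']) ? $get['perf_patch'] : '';

    // Require both, so that ordinary visitors never trigger the patch.
    return in_array($ip, $allowedIps, true) && $flag === $token;
}

if (perf_patch_enabled($_SERVER, $_GET)) {
    include dirname(__FILE__) . '/perf_monkey_patch.php';
}
```

A test request would then look like /really-slow-page?perf_patch=perf-test from the allowed address, and everything else is served unpatched.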
