|
|
|
|
|
|||||||||||||||||||||||||||||||||||||||||
|
|
|
|||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||
Author: Martin Nilsson <nilsson@roxen.com>
Last modified: 2001-09-16 1:37:45
An often used, and sometimes essential program component is the cache. A cache is the programmers' way of outsmarting hardware prices. If CPU is cheap, cache less, if memory is cheap, cache more. In this article I'll take a look at caches in general as some of the caches available in Roxen WebServer.
Theory
Imagine a producer that, given some input that we can call "key", produces some sort of result, or "value". The "key" might be the string "A little red house in a forest" and the result a picture of a little red house in a forest. Let's presume that the process of parsing the text and creating the appropriate picture is a rather resource consuming task that we want to perform as seldom as possible. Hence we put a little box in front of our producer that intercepts the request from a consumer as well as the response from the producer and saves both. Next time a consumer, any consumer, requests "A little red house in a forest", we will already have the image ready, and return it to the consumer without letting the request pass through to the producer. The box is a cache.
If we remain a while longer in theory land, what characteristics would we want the cache to have? The reason that we put a cache between consumers and the producer is to produce fewer calls to the producer, hence the first requirement for an ideal cache is:
1. Never produce the same key/value-pair more than once.
Once the producer has been called, the result is stored in the cache for all successive consumer calls that want it. Given that the computer implementing the ideal cache itself isn't ideal, we don't want the cache to grow uncontrollably. One obvious way is to remove old values from the cache, which we can formulate as
2. Never keep a key/value-pair that will not be requested again.
Requirement 2 is of course impossible to implement with present technology, since we need a 100% accurate prediction of the future in order to know if the value will be requested again. We are however still in theory land, designing an ideal cache.Finally, it should be transparent to the consumers that we are using a cache, except for the improved response time. There is really only one way to tell if there is a cache involved, and you've all experienced it yourselves: stale cache values. We avoid this by following the rule:
3. Always return the correct value for any given key.
This is the reason for the awkward definition of rule number one, instead of e.g. "Never call the producer more than once for any key". Questions like "List todays events" would otherwise only be correct the same day the question was asked, given that no event was added, removed or altered.
The author, Martin Nilsson
<nilsson@roxen.com>
Cache of the First Degree
We begin by looking at a very simplistic cache that meets our first requirement, but only our first requirement; a pike mapping. Although not a very advanced solution, to say the least, a simple mapping is often very helpful to speed up your code. As long as the amount of values to be stored isn't huge, or at least expected to be limited, and the size of each entry is reasonable, then a plain mapping is a good choice. The cache overhead is also essentially as low as it gets.
mapping my_cache = ([]); mixed my_wrapper(mixed in) { if(my_cache[in]) return my_cache[in]; mixed result = my_heavy_function(in); my_cache[in] = result; return result; } mixed my_heavy_function(mixed in) { ... }
One, perhaps not so obvious, problem with this code is when a zero is associated with a key. Then the if-clause in my_wrapper will be false even though there exists a key in the cache. To detect this we have to resort to a slightly more advanced feature in pike (or, as some view it, the ugly kludge that fixes some of the problems you get when you have no real NULL-element) called zero types. The difference between a zero that has been indexed from a mapping and a zero that has been returned as a non-existence response is that the later has a zero flag set. That flag can be checked with the zero_type() function. if(my_cache[in]) should therefore be changed into if(!zero_type(my_cache[in])). In the example above I sacrificed some efficiency for readability, which is also fixed below.
mixed my_wrapper(mixed in) { mixed result; if(!zero_type( result=my_cache[in] )) return result; result = my_heavy_function(in); return my_cache[in] = result; }
GC Policies
As already stated, the cache described above is not generally useful, since it can grow uncontrollably, thus breaking my second rule of caching. Since it is almost always impossible to comply with that rule, this is the part of the cache system where people get creative. Let's start with a trivial, yet practically usable modification of the mapping cache, or "one-bucket cache", since all values are stored in the same bucket. The idea is that when the bucket is full, we just empty it.
mixed my_wrapper(mixed in) { mixed result; if(!zero_type( result=my_cache[in] )) return result; if(sizeof(my_cache)>10000) my_cache = ([]); result = my_heavy_function(in); return my_cache[in] = result; }
Although this is a perfectly functional approach, some people have troubles with the fact that this cache can throw out entries that where just added to it. The upside is of course that this is about as efficient cache control as you can get and thus well suited for high-performance applications. Also note that the "size" of the bucket (now 10000 entries) has to be tuned for the application and the hardware it runs on. If you are likely to have about e.g. 10010 entries then the cache will empty itself all the time, hampering performance.
From single bucket cache we move to a double bucket cache. The principle is simply that you have two mappings and take turns which to empty.
mapping my_cache = ({ ([]), ([]) }); int(0..1) active_cache; mixed my_wrapper(mixed in) { mixed result; if(!zero_type( result=my_cache[active_cache][in] )) return result; if(!zero_type( result=my_cache[active_cache^1][in] )) return result; if(sizeof(my_cache[active_cache]>10000)) { active_cache ^= 1; my_cache[active_cache] = ([]); } result = my_heavy_function(in); return my_cache[active][in] = result; }
Let's look a bit closer how this cache really works. First we can add 10001 entries to the cache and have it work more or less like the single bucket version. When we want to add entry 10002, the non-active bucket is emptied and made active. The effect is that no entry is deleted right after it has been inserted. We have however a little higher overhead during the lookup process, since we have to look in the second bucket if we didn't find the key in the first bucket. Note that this solves one of two "unlucky" events possible in the single bucket cache. It is now not possible to remove an item right after it has been added, but it is still possible that an item is removed right after it has been accessed.
We can of course generalize the double bucket idea to any number of buckets. The lookup overhead will then be even worse, so we make a little addition. When an entry is found in any bucket but the primary, copy it to the primary bucket. This will not only reduce the overhead, but also ensure that newly used items are not removed directly afterwards.
#define BUCKETS 5 mapping my_cache = ({ ([]) }) * BUCKETS; mixed my_wrapper(mixed in) { mixed result; if(!zero_type( result=my_cache[0][in] )) return result; foreach( my_cache[1..], mapping bucket) if(!zero_type( result=bucket[in] )) return my_cache[0][in] = result; if(sizeof(my_cache[0]>10000)) my_cache = ({ ([]) }) + my_cache[..BUCKETS-2]; result = my_heavy_function(in); return my_cache[0][in] = result; }
Temporal GC
All past examples have had a trigger-based garbage collect that removes cache entries once a certain threshold has been superseded. Although it makes a robust system with a very predicable memory footprint, it is not very well suited for an uneven load. In an extreme example, it would just waste memory during inactive periods and do excessive garbage collection during active periods. One answer is to clean out the cache on a given time interval instead. Even better is of course to combine that policy with size trigger, as described above, though that is seldom done in practice.
mapping cache = ([]); void init_cache() { // call do_gc once every minute call_out(do_gc, 60); } void do_gc() { // make sure we don't get two // concurrent calls by mistake. remove_call_out(do_gc); cache = ([]); call_out(do_gc, 60); }
There is of course much more that can be said about how my first and second cache criteria can be solved. Two main aspects that I'm not going to look at in any greater details here is meta data and persistence, but I feel that I should mention something about both. A cache entrys meta data could be the time it was put into the cache, when it was last looked upon, a time before which it is required to expire, the size it takes in bytes and how many milliseconds it took to produce the item, to pick but a few examples. All of these can be utilized in a cache to create a better cache policies, where better means closer to an ideal cache. The more complex the policy, the longer it takes for the cache to perform its functions (store, fetch, gc). One must also take into account that the meta data itself takes up memory space in the cache.
Persistence is to connect "another level" of cache under the cache, so that entries are saved to disk, e.g. in a database or as individual files. Saving of entries could be done during gc, but cache misses will become very costly, since the cache will have to look at the database or whatever to see if the entry is there.
Notification
Let's focus on my third criterion for caches: no inaccurate data returned. There is basically three ways we can ensure that the data is always correct. To begin with, it might be the case that we know beforehand how long the information is valid, e.g. information in a calendar such as birthdays and astronomical events. At the very moment the information is retrieved we know that it is valid only until the end of the day, which we can store as an expiration time in the caches meta data.
Secondly, we could be told that the value associated to certain keys are no longer valid, or perhaps get a less specified signal that something has changed and that the whole cache has to be emptied. One example of the former is a cache that maps a filesystem path to the number of words present in that file. When a file is changed the cache could be notified by the program that altered the file that the value for that file might not be correct anymore.
A third option is to let the key depend on all the factors that the resulting item depends on. This is however sometimes as costly as actually producing the result, and often it is impossible altogether. In the example with the cache that maps files to number fo words the key could however consist of both the file name and the last modified time of the file. Then we have to trust that the OS actually makes sure that the last modified time stamp actually gets updated, but that's another story.
A fourth option is to ignore the problem altogether and say that it doesn't matter if I don't know the exact outside temperature and only get an update of my graphic thermometer once every quarter of an hour. This appears to be the most common approach.
Real World
There are several caches in Roxen WebServer that are good to know about when developing WebServer modules in pike. My first example is the "Memory cache", which is available right away through globally defined functions. The most interesting are cache_set and cache_lookup, but here's a list of prototypes for all the globally available cache functions of the memory cache.
mixed cache_set(string cache_name, string key, mixed value, void|int timeout) mixed cache_lookup(string cache_name, string key) array(string) cache_indices(void|string cache_name) void cache_remove(string cache_name, string key) void cache_expire(string cache_name)
cache_set obviously sets an item in a cache with the name provided in cache_name. Note that the cache name space is global in roxen. You might have problem if your cache map a file path to some data and the module is added in two different virtual servers with different URLs, so that /index.html is two different files on the different sites. The problem is easily solved by adding the server name or another unique id to the cache name or the cache key. The optional timeout is when the item expires expressed as a posix time, e.g. if an entry is valid in five minutes you use time(1)+5*60. cache_indices shows all keys in the cache, or all the caches if no argument is given. cache_remove removes a specific item while cache_expire flushes the whole cache.
The actual cache, implemented in server/base_server/cache.pike, is a one-mapping cache with meta data. The cache is garbage collected every n:th second, where n is a configurable value that defaults to 300 seconds. Every time the GC runs, all items that have expired are removed. All items that have not been looked upon in m seconds, where m is n minus the items size in bytes divided by 100. The n value can be configured in the administration interface with the setting "Memory Cache Garbage Collect Interval" under the main tab Globals, subtag Cache. Also note that cache_lookup never returns an item that has expired.
Picture This
Finally I'll describe the functionally most advanced cache in WebServer, the Image Cache, used for modules like gtext, gbutton and cimg, but first we must point out some facts regarding web server-web client interaction. When a web client downloads a web page it will first fetch the HTML page in one HTTP query and then download the images on the page in subsequent queries. This might sound trivial, but it is never the less important to understand that images are fetched in a separate request that, from the servers point of view, has nothing to do with the previous requests.
The problem is that we are dealing with dynamic graphics, where the complete description of an image might be different for every person that downloads the web page, and is decided only when the person begins to fetch the image. Hence we must transfer the parameters needed to generate the dynamic image with the HTML page to the client in such a way that the client then transfers back the parameters when it requests the image.
This is done in the following way. When the HTML page is RXML parsed, the tag in question generates mapping with all states needed to know in order to generate the image. This mapping is then sent to the image cache that returns a hashed value of the cache. The module in question then combines that hash with the URL to its internal location and produces a text string that looks something like <img src="/_internal/graphic_module!0/a683$91a0"> which is returned to the page and then the web client.
// An image cache object // (Though not yet initialized in this example...) roxen.ImageCache image_cache; string simpletag_gsmiley(string name, mapping args, string content, RequestID id) { mapping img_args = ([]); // Adding important stuff to img_args. string hash = image_cache->store(img_args); string url = query_absolute_internal_location(id) + hash; return "<img src='" + url + "' />"; }
When the HTML page arrives at the web client, it begins to load the images from the provided URLs, and the hash gets returned to our module. The module gives the hash to the image cache, which returns the associated image directly if it already had it cached. If not, it calls the function it was initiated with and provides the mapping associated with the hash and gets an image returned. This image is stored in the image cache and sent back to the client. Messy.
void start() { // Initialize the image cache. image_cache = roxen.ImageCache( "gsmiley", draw_callback ); } array(Image.Layer) draw_callback( mapping args, RequestID id ) { // Image generation code return img; } mapping find_internal(string hash, RequestID id) { return image_cache->http_file_answer( hash, id ); }
The capabilities of the image cache don't end here. There is a lot of built-in image manipulation code. E.g. if you add (["quant":16]) to the image arguments mapping, you will have your image quantisized to 16 colors. Other arguments include format, dither, gamma, crop and many more. If you take a look at the cimg module and wonder where all the image code really is, then roxen.pike (class ImageCache) is the answer.
That concludes my little introduction to caches. Other things that would have been cool to include in this article includes the Cache module in pike, the frontend cache in WebServer, the argument cache, the various cache control headers in HTTP and the need for thread safety in caches in a multithreaded environment. Submitted articles in this or any other Pike/WebServer/Internet area would most likely be published (hint, hint!). Your contributions are most welcome.