Not long ago I stumbled upon this post discussing different approaches to string interning.
The author did 4 implementations (quoting the post):

> a control in pure C, a Trie in pure C, a hand-rolled hash table in pure C, and an `std::unordered_map` (hash table) wrapper, in C++.
They benchmarked the implementations and found, among other things, that the C++ version was the slowest of them all.
This intrigued me. True, the C++ standard defines `std::unordered_map` in a way that effectively forces chaining, which is usually inferior performance-wise to open-addressing hash tables, but there was no way it could be *that* bad. And `std::string` should definitely improve performance compared to C strings, not make it worse, thanks to the short string optimization (SSO) etc.
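(As a quick aside: SSO means a `std::string` keeps short contents in a small inline buffer and only touches the heap for longer ones. A minimal sketch to check the inline capacity on your implementation; the exact value is implementation-defined, typically 15 chars for MSVC and libstdc++, 22 for libc++:)

```cpp
#include <iostream>
#include <string>

int main()
{
    // A default-constructed std::string already owns a small inline buffer;
    // anything that fits into it needs no heap allocation.
    std::string s;
    std::cout << "inline capacity: " << s.capacity() << '\n';
}
```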
So, I decided to look into it. Here are the core parts of the original version of the OP's `add` function:
```cpp
// wrapper structure to ensure common interface in all the implementations
struct Intern_pool {
    std::unordered_map<std::string, uint> map;
    uint next;
};

...

uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    if (map.find(s) != std::end(map))
        return map.at(s);
    map.insert(std::make_pair(s, ++pool->next));
    return pool->next;
}

...

int main(int argc, char **argv)
{
    char buffer[256];
    while (fgets(buffer, 256, file))
    {
        buffer[strlen(buffer)-1] = '\0'; // removes the `\n` from fgets()
        add(&intern_pool, buffer);
    }
}
```
Since the author mentioned that he suspected C++ strings to be a source of slowdown, that was the first thing that caught my eye:

- `s` was passed to `add` by value (so the data from `buffer` was copied into it) - that'd require an allocation if the data in `buffer` was long enough to prevent SSO (the sketch after this list shows a quick way to observe this).
- Then it was wrapped into a `std::pair`, which would make a copy of it - again, if it is short enough, that's just a `memcpy`, but for a long one that'll be one more allocation.
- On older compilers, `map.insert` would create yet another copy. Thankfully, since C++11 `insert` must support move semantics, so that's not a reason to worry anymore.
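(A quick way to observe those allocations is to override the global `operator new` with a counting version; this is a debugging trick for a sketch, not production code, and `take_by_value` below is just a hypothetical stand-in for `add`:)

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

static int allocs = 0;

// Count every heap allocation (no error handling; sketch only).
void* operator new(std::size_t n)
{
    ++allocs;
    return std::malloc(n);
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Hypothetical stand-in for add(): takes its string by value, like the OP's code.
void take_by_value(std::string s) { (void)s; }

int main()
{
    std::string longish("an_identifier_long_enough_to_defeat_sso");

    int before = allocs;
    take_by_value(longish);            // copy: one allocation for a long string
    std::printf("copy: %d allocation(s)\n", allocs - before);

    before = allocs;
    take_by_value(std::move(longish)); // move: zero allocations
    std::printf("move: %d allocation(s)\n", allocs - before);
}
```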
Still, extra copies are bad. Let’s eliminate at least one of them:
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    if (map.find(s) != std::end(map))
        return map.at(s);
    map.emplace(std::move(s), ++pool->next);
    return pool->next;
}
```
Here we have moved `s` into the map, and by using `emplace` we also avoid constructing the intermediate `std::pair`.
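(To see the difference concretely, here's a sketch with a hypothetical `Traced` key type that logs copies and moves: `insert(std::make_pair(...))` first copies the key into a temporary pair, while `emplace` with `std::move` gets away with a single move:)

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical key type that logs every copy and move of its payload.
struct Traced {
    std::string s;
    Traced(const char* v) : s(v) {}
    Traced(const Traced& o) : s(o.s) { std::puts("copy"); }
    Traced(Traced&& o) noexcept : s(std::move(o.s)) { std::puts("move"); }
    bool operator==(const Traced& o) const { return s == o.s; }
};

template <>
struct std::hash<Traced> {
    std::size_t operator()(const Traced& t) const
    {
        return std::hash<std::string>{}(t.s);
    }
};

int main()
{
    std::unordered_map<Traced, unsigned> m;
    Traced key("example");

    std::puts("insert(make_pair):");
    m.insert(std::make_pair(key, 1u)); // prints "copy", then "move"

    m.clear();
    std::puts("emplace(move):");
    m.emplace(std::move(key), 1u);     // prints a single "move"
}
```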
But the code is still suboptimal. Let's count how many lookups into the map we do:

- one when doing `map.find`,
- one when doing `map.at`,
- one when doing `map.insert` (or `map.emplace`).
We can do better: the very reason `map.find` returns that cumbersome iterator on success is that it lets us extract the value from the iterator, avoiding an extra lookup. So a better version would be:
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    auto it = map.find(s);
    if (it != std::end(map))
        return it->second;
    map.emplace(std::move(s), ++pool->next);
    return pool->next;
}
```
which leaves only one lookup on the "happy" path, and two on the "unhappy" one. Alternatively, we could use `try_emplace` (i.e. on compilers conforming to C++17):
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    auto [it, inserted_new] = map.try_emplace(std::move(s), pool->next + 1);
    if (inserted_new)
    {
        return ++pool->next;
    }
    return it->second;
}
```
Here are the results I got on my laptop (Intel i7-4270HQ, 16GB, Win10, MSVC 19.16.27043, x64 mode).

(I didn't understand the way the OP built the dataset, so I just ran `tar -xOvf java.* | identifiers > words.txt`. The `identifiers` program (the author's tool to extract keywords and literals) crashed on some png file that `tar` didn't filter out, at which point `words.txt` was ~592M, which I considered enough for a quick experiment.)
| | IO alone | Hash table | STL | STL+strings | same+2 lookups | same+1 lookup | Trie |
|---|---|---|---|---|---|---|---|
| total, s | 7.042 | 15.908 | 16.461 | 15.522 | 13.814 | 13.861 (WTF?) | 15.996 |
| total-IO, s | 0 | 8.866 | 9.419 | 8.480 | 6.772 | 6.819 | 8.954 |
| slowdown (vs Hash table) | | x1 | x1.062 | x0.956 | x0.763 | x0.769 | x1.010 |
Against my original intuition, the version using `try_emplace` performs more or less the same as the version not using it. Why? Because (at least in Microsoft's STL) `try_emplace` literally does the same thing as the previous version.
So we got our C++ version faster than the programs using data structures hand-rolled in C, with just a few changes in the code (by ~23%).
Nice! (But it also shows how easy it is to make code suboptimal with a few minor mistakes; also, it's no surprise that code in a library that has been improved for decades is better than code written by a single person.)
Well, we can also get rid of `std::unordered_map` in favor of an open-addressing map: for example, the `flat_hash_map` from Google's Abseil library.
The change is trivial: we only need to `#include "absl/container/flat_hash_map.h"`, add the relevant libraries to the build, and replace `std::unordered_map` with `absl::flat_hash_map`. Everything else works the same way.
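(A minimal sketch of that swap, assuming the `try_emplace` variant of `add` from above and a `uint` typedef like the OP's; `absl::flat_hash_map` follows the `std::unordered_map` interface closely, so only the map's type changes:)

```cpp
#include <string>
#include <utility>

#include "absl/container/flat_hash_map.h"

using uint = unsigned int; // assumption: stand-in for the OP's uint typedef

struct Intern_pool {
    absl::flat_hash_map<std::string, uint> map; // was std::unordered_map
    uint next = 0;
};

uint add(Intern_pool* pool, std::string s)
{
    // flat_hash_map supports try_emplace, so the body is unchanged.
    auto [it, inserted_new] = pool->map.try_emplace(std::move(s), pool->next + 1);
    if (inserted_new)
        return ++pool->next;
    return it->second;
}
```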
After running this new version, I got an average execution time of 12.574s, i.e. ~5.5s of "useful" work, or 0.62x the time of the OP's "standard" hand-rolled hash table. Great success!
(To be continued in the next post.)