Hashtables & C++ - part 2

(this is a continuation of the previous post)

After playing with the performance of the standard (and semi-standard) hashtables, I wondered: would it be hard to write my own hashmap that works better than std::unordered_map? Algorithms and data structures aren't my strongest skills, so I wondered what I could learn here.

(note: we’re not bound by STL interfaces here, we’re free to design our own; I’m going to keep calling the main method try_emplace, but that’s the only part I’m keeping)

First version

My first attempt (written in 15 minutes or so) looked like this. It’s bad, but it still ran at ~14s (or ~7s of “useful” work) on the same ~600M dataset, similar to the more-or-less optimized version using std::unordered_map from the previous post. Huh, not bad.
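
Schematically, it was something like this (a sketch, not the verbatim code: the std::tuple store, the plain modulo, the in-function resizing loop, the recursion on retry, and the string-only comparison are all as discussed below; the rest is approximate):

struct oa_map {
    // entry: (full hash, key, value); an empty key marks a free slot
    using Item = std::tuple<size_t, std::string, uint32_t>;
    using Store = std::vector<Item>;
    using iterator = Store::iterator;
    Store _store{1024};

    std::pair<iterator, bool> try_emplace(std::string&& s, uint32_t v) {
        auto full_h = std::hash<std::string>{}(s); // hash computed once
        return try_emplace(full_h, std::move(s), v);
    }

    std::pair<iterator, bool> try_emplace(size_t full_h, std::string&& s, uint32_t v) {
        auto h = full_h % _store.size();
        for (auto it = _store.begin() + h; it != _store.end(); ++it) {
            if (std::get<std::string>(*it).empty()) {
                *it = {full_h, std::move(s), v};
                return {it, true};
            }
            else if (std::get<std::string>(*it) == s) {
                return {it, false};
            }
        }
        // ran off the end: grow the store and re-place everything,
        // right here in the main function
        auto old_size = _store.size();
        _store.resize(old_size * 2);
        for (size_t i = 0; i < old_size; ++i) {
            auto item = std::move(_store[i]); // moves even empty entries
            try_emplace(std::get<0>(item), std::move(std::get<1>(item)),
                        std::get<2>(item));
        }
        return try_emplace(full_h, std::move(s), v); // recursion
    }
};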

What’s good here:

  • we only compute hash once for each item;

  • … that’s all? …

What’s bad here?

  • the underlying _store size is always a power of 2¹, so we can use a trick for computing the remainder: full_h & (_store.size() - 1). That alone saves us ~0.5s of total execution (see the short illustration after this list).

  • we use recursion on our second attempt to put the value. We can replace it with iteration; that saves ~0.1s too.

  • moreover, we can move the resizing loop out of the main function, since it only runs at the very end anyway. It’s not a big deal from a performance standpoint, but it’ll make the code cleaner;

  • we do a full string comparison on every probe, but we could compare the stored hashes first and only compare strings when the hashes match. That is, instead of

    else if (std::get<std::string>(*it) == s) {
        return {it, false};
    }

we could do

    else if (std::get<size_t>(*it) == full_h && std::get<std::string>(*it) == s) {
        return {it, false};
    }

With this particular dataset it actually makes execution slightly worse, because short strings dominate here. Still, it’s good practice.

  • we’re using std::tuple to store entries. That’s not necessarily a good thing, because fields may be unaligned. An explicitly defined structure might be slightly better (and is definitely more readable).

  • when resizing, we move entries even if they are empty. We can skip them.
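
As promised, a short illustration of the power-of-two masking trick: size - 1 has all the low bits set, so the bitwise AND keeps exactly the low bits of the hash, which is the remainder, with no integer division involved:

    // 1024 == 0b100'0000'0000, so 1024 - 1 == 0b011'1111'1111;
    // ANDing keeps the low 10 bits, which is exactly `h % 1024`
    static_assert((123456789u & (1024u - 1)) == 123456789u % 1024u);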

Let's fix those!

In the process, let’s also clean up the code a bit, for example by replacing the loop over iterators with a good old loop over integers.

Here’s what I got:

struct alignas(16) Item {
    size_t full_hash;
    std::string key;
    uint32_t value;
};

struct oa_map {
    using Store = std::vector<Item>;
    using iterator = Store::iterator;
    Store _store{1024};

    std::pair<iterator, bool> try_emplace(std::string&& s, uint32_t v) {
        auto full_h = std::hash<std::string>{}(s);
        return try_emplace(full_h, std::move(s), v);
    }

    size_t truncate(size_t v) {
        return v & (_store.size() - 1);
    }

    std::pair<iterator, bool> try_emplace(size_t full_h, std::string&& s, uint32_t v) {
        while(true) {
            auto h = truncate(full_h);
            for(auto i = h; i < _store.size(); ++i) {
                auto& entry = _store[i];
                if (entry.key.empty()) {
                    entry = {full_h, std::move(s), v}; // move, don't copy
                    return {_store.begin() + i, true};
                }
                else if (entry.full_hash == full_h && entry.key == s) {
                    return {_store.begin() + i, false};
                }
            }
            // if we got here, there were too many collisions
            expand();
            // this time it'll succeed, right? RIGHT?
        }
    }

    void expand() {
        // hackish resize
        auto old_size = _store.size();
        _store.resize(old_size * 2);
        // re-place existing values
        for (size_t i = 0; i < old_size; ++i) {
            // relies on the moved-from key being left empty
            // (true in practice, though technically unspecified)
            auto item = std::move(_store[i]);
            if (!item.key.empty()) {
                try_emplace(item.full_hash, std::move(item.key), item.value);
            }
        }
    }
};
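
A quick sanity check of how this is used (a made-up example, not from the benchmark harness):

    oa_map m;
    auto [it1, fresh] = m.try_emplace("tuple", 1); // fresh == true
    auto [it2, again] = m.try_emplace("tuple", 2); // again == false,
    // and it2->value is still 1: try_emplace never overwrites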

This results in ~13.5s of execution. Nice.

BUT…

…but I omitted one more issue with this code, the worst of them all:

  • it only probes slots after the original insertion point, which means that if the insertion point (i.e. the hash modulo size) is close to the end, we’ll end up resizing the _store over and over. Instead, we should “wrap” the probe around so that it continues from the beginning after hitting .end().

Let’s fix that, too!

    ...
    static constexpr size_t collision_threshold = 42;
    ...
    std::pair<iterator, bool> try_emplace(size_t full_h, std::string&& s, uint32_t v) {
        while(true) {
            auto h = truncate(full_h);
            for(size_t i = 0; i < collision_threshold; ++i) {
                auto j = truncate(i + h);

                auto& entry = _store[j];
                if (entry.full_hash == empty_value) { // i.e. a free slot
                    entry = {full_h, std::move(s), v};
                    return {_store.begin() + j, true};
                }
                else if (entry.full_hash == full_h && entry.key == s) {
                    return {_store.begin() + j, false};
                }
            }
            // if we got here, there were too many collisions
            expand();
        }
    }
    ...

This gives us an average execution time of 12.85s, i.e. we just saved ~0.65s.

Do not pay per view

Now that I think of it, we’re creating a std::string for each line in the input file, but we discard it if the string had been interned previously.

That was necessary because it’s what std::unordered_map and absl::flat_hash_map expect, but in our specific program we don’t have to.

Let’s only create std::string on actual insertion, and use std::string_view in all other contexts:

    std::pair<iterator, bool> try_emplace(std::string_view&& s, uint32_t v) {
        auto full_h = std::hash<std::string_view>{}(s);
        return try_emplace(full_h, std::move(s), v);
    }

    std::pair<iterator, bool> try_emplace(size_t full_h, std::string_view&& s, uint32_t v) {
        ...
                if (entry.full_hash == empty_value) {
                    entry = {full_h, std::string{s}, v};
                    return {_store.begin() + j, true};
                }
        ...          
    }
    ...

uint32_t add(Intern_pool* pool, std::string_view&& s) {
    ... // the same
}
...
int main(int argc, char** argv) {
    ...
        char buffer[256];
        while (fgets(buffer, 256, file)) // one identifier per line
        {
            auto len = strlen(buffer);
            buffer[len-1] = '\0'; // removes the `\n` from fgets()
            add(&intern_pool, std::string_view(buffer, len - 1)); // don't include the '\0'
        }
    ...
}

This one gives us 11.76s (so, ~4.7s of “useful” work). So, we have beaten absl::flat_hash_map (by cheating: using a better signature, and not implementing 90% of what it can do).

Could we avoid that cheat? To a degree: we could create a shared std::string in main and .assign() the buffer contents to it. We wouldn’t be able to move the string anymore, but that’s OK. The results weren’t as good, though: I got ~12s of execution, which is worse than the version above.
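
For reference, that variant’s input loop would look roughly like this (a sketch; it assumes an add overload taking const std::string&, which I haven’t shown):

    std::string line;                            // one shared, reused string
    char buffer[256];
    while (fgets(buffer, sizeof buffer, file)) {
        line.assign(buffer, strlen(buffer) - 1); // drop the '\n'
        add(&intern_pool, line);                 // the pool copies on insertion
    }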

Alternatives

There are some other changes that could be made to the code (well, I made them, and things stayed the same or got worse):

  • using a different hash function. I tried xxhash. It doesn’t really change things: computing the hash isn’t a bottleneck in this implementation (remember, the hash is only computed once per call to try_emplace), so it’s just not that important.

  • using prime numbers as sizes for the store. Many (most?) hashtable implementations do that, but it didn’t make things better in my case. The main problem is that truncating the value (computing the modulo) is much harder with prime divisors. Even when using “magic” libraries like fastmod (and xxhash instead of std::hash), it’s close to the original version at best, while being more complex. There’s probably some bug I’m missing that prevents the improvement.

  • double hashing. It may help; maybe I’ll try that later. (A sketch of the idea follows.)
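
For the record, a double-hashing probe inside oa_map could look something like this sketch (hypothetical, not something I benchmarked; the shift amount is arbitrary, and forcing the step to be odd keeps it co-prime with the power-of-two table size, so the probe can reach every slot):

    // instead of stepping by 1 (linear probing), derive the step from the
    // hash itself, so keys colliding on a slot follow different sequences
    size_t probe_slot(size_t full_h, size_t attempt) {
        auto step = (full_h >> 7) | 1; // hash-derived and always odd
        return truncate(full_h + attempt * step);
    }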

Better I/O

I think I’m done with the hashtable tweaking, but there’s one more thing that could be done to improve the program in general.

If we profile the program, we find that I/O (using fgets) consumes a huge part of the execution. Our latest version spends ~4.7s processing strings, but a whole ~7s doing I/O.

Wouldn’t it be great if we could throw away files and buffers and just go through the data as if it were an array of bytes?…

Yes, it would. It will be. That’s memory-mapped file I/O.

Every OS under the sun provides an API for that, although the particular interface varies. Thankfully, there are libraries that abstract away those differences. I used mio because it’s really simple to use and is available via vcpkg.

With it, the input loop turned into

    std::error_code error;
    auto mmap = mio::make_mmap_source(argv[i], error);
    
    if (error)
    {
        ...
    }
    ...
    std::string_view s{mmap.begin(), static_cast<size_t>(mmap.size())};
    std::string_view delim{"\r\n"};
    while (!s.empty()) {
        auto eow = s.find_first_of(delim);
        auto sub = s.substr(0, eow);
        if (!sub.empty())
            add(&intern_pool, std::move(sub));
        if (eow == std::string_view::npos)
            break; // last line without a trailing newline
        s.remove_prefix(eow);
        auto sow = s.find_first_not_of(delim);
        if (sow == std::string_view::npos) {
            break;
        }
        s.remove_prefix(sow);
    }

And it works blazingly fast: going through all ~600M of data takes 4s (instead of 7s with traditional I/O), and the whole run takes slightly over 8s.

Honestly, it’s a shame that I only got the idea of using memory-mapped files late in the game; it would have saved me an hour or more of my life that I spent waiting for benchmarks to run. sigh

Anyway, here’s the final table:

             IO alone   C hash table   unordered_map   flat_hash_map   O/A map
total        7.042s     15.908s        13.814s         12.574s         11.76s
total - IO   0s         8.866s         6.772s          5.532s          4.718s
slowdown     -          x1             x0.763          x0.624          x0.532

Conclusion and Clarification

I said it above and I’d like to repeat it: the main wins here are not due to my ingenuity (algorithms aren’t my strong side), but to using C++ facilities (data structures and semantics) and to focusing on the small subset of functionality needed for this particular scenario. Implementing search and deletion would be trivial, but supporting, say, a full STL-style interface would complicate things and likely require some performance compromises.

And I got a lot of hints from my friends, too. Thank you, guys.


  1. People who are smarter than I am will immediately point out that using prime sizes for the store is better. We’ll get there! ↩︎