Not long ago I stumbled upon this post discussing different approaches to string interning.
The author did 4 implementations (quoting the post):

> a control in pure C, a Trie in pure C, a hand-rolled hash table in pure C, and an `std::unordered_map` (hash table) wrapper, in C++.
They benchmarked the implementations and found, among other things, that the C++ version was the slowest of them all.
This intrigued me. True, the C++ standard defines `std::unordered_map` in a way that effectively forces chaining, which is usually inferior performance-wise to open-addressing hash tables, but there was no way it could be *that* bad. And `std::string` should definitely improve performance compared to C strings, not make it worse, thanks to the short string optimization (SSO) etc.
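(As a quick aside: SSO means a `std::string` keeps short contents in a small inline buffer and only touches the heap for longer ones. A minimal sketch to check the inline capacity on your implementation; the exact value is implementation-defined, typically 15 chars for MSVC and libstdc++, 22 for libc++:)

```cpp
#include <iostream>
#include <string>

int main()
{
    // A default-constructed std::string already owns a small inline buffer;
    // anything that fits into it needs no heap allocation.
    std::string s;
    std::cout << "inline capacity: " << s.capacity() << '\n';
}
```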
So, I decided to look into it. Here are the core parts of the original version of the OP's `add` function:
```cpp
// wrapper structure to ensure common interface in all the implementations
struct Intern_pool {
    std::unordered_map<std::string, uint> map;
    uint next;
};

...

uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    if (map.find(s) != std::end(map))
        return map.at(s);
    map.insert(std::make_pair(s, ++pool->next));
    return pool->next;
}

...

int main(int argc, char **argv)
{
    char buffer[256];
    while (fgets(buffer, 256, file))
    {
        buffer[strlen(buffer)-1] = '\0'; // removes the `\n` from fgets()
        add(&intern_pool, buffer);
    }
}
```
Since the author mentioned that he suspected C++ strings to be a source of slowdown, that was the first thing that caught my eye:

- `s` was passed to `add` by value (so the data from `buffer` was copied into it) - that'd require an allocation if the data in `buffer` was long enough to prevent SSO (the sketch after this list shows a quick way to observe this).
- Then it was wrapped into a `std::pair`, which would make a copy of it - again, if it is short enough, that's just a `memcpy`, but for a long one that'll be one more allocation.
- On older compilers, `map.insert` would create yet another copy. Thankfully, since C++11 `insert` must support move semantics, so that's not a reason to worry anymore.
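(A quick way to observe those allocations is to override the global `operator new` with a counting version; this is a debugging trick for a sketch, not production code, and `take_by_value` below is just a hypothetical stand-in for `add`:)

```cpp
#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>
#include <utility>

static int allocs = 0;

// Count every heap allocation (no error handling; sketch only).
void* operator new(std::size_t n)
{
    ++allocs;
    return std::malloc(n);
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Hypothetical stand-in for add(): takes its string by value, like the OP's code.
void take_by_value(std::string s) { (void)s; }

int main()
{
    std::string longish("an_identifier_long_enough_to_defeat_sso");

    int before = allocs;
    take_by_value(longish);            // copy: one allocation for a long string
    std::printf("copy: %d allocation(s)\n", allocs - before);

    before = allocs;
    take_by_value(std::move(longish)); // move: zero allocations
    std::printf("move: %d allocation(s)\n", allocs - before);
}
```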
Still, extra copies are bad. Let’s eliminate at least one of them:
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    if (map.find(s) != std::end(map))
        return map.at(s);
    map.emplace(std::move(s), ++pool->next);
    return pool->next;
}
```
Here we have moved `s` into the map, and by using `emplace` we also avoid constructing the intermediate `std::pair`.
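(To see the difference concretely, here's a sketch with a hypothetical `Traced` key type that logs copies and moves: `insert(std::make_pair(...))` first copies the key into a temporary pair, while `emplace` with `std::move` gets away with a single move:)

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical key type that logs every copy and move of its payload.
struct Traced {
    std::string s;
    Traced(const char* v) : s(v) {}
    Traced(const Traced& o) : s(o.s) { std::puts("copy"); }
    Traced(Traced&& o) noexcept : s(std::move(o.s)) { std::puts("move"); }
    bool operator==(const Traced& o) const { return s == o.s; }
};

template <>
struct std::hash<Traced> {
    std::size_t operator()(const Traced& t) const
    {
        return std::hash<std::string>{}(t.s);
    }
};

int main()
{
    std::unordered_map<Traced, unsigned> m;
    Traced key("example");

    std::puts("insert(make_pair):");
    m.insert(std::make_pair(key, 1u)); // prints "copy", then "move"

    m.clear();
    std::puts("emplace(move):");
    m.emplace(std::move(key), 1u);     // prints a single "move"
}
```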
But the code is still suboptimal. Let's count how many lookups into the map we do:

- one when doing `map.find`,
- one when doing `map.at`,
- one when doing `map.insert` (or `map.emplace`).
We can do better: the very reason `map.find` returns that cumbersome iterator on success is that it lets us extract the value from the iterator, avoiding an extra lookup. So a better version would be:
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    auto it = map.find(s);
    if (it != std::end(map))
        return it->second;
    map.emplace(std::move(s), ++pool->next);
    return pool->next;
}
```
which leaves only one lookup on the "happy" path, and two on the "unhappy" one. Alternatively, we could use `try_emplace` (i.e. on compilers conforming to C++17):
```cpp
uint add(Intern_pool* pool, std::string s)
{
    auto& map = pool->map;
    auto [it, inserted_new] = map.try_emplace(std::move(s), pool->next + 1);
    if (inserted_new)
    {
        return ++pool->next;
    }
    return it->second;
}
```
Here are the results I got on my laptop (Intel i7-4270HQ, 16GB, Win10, MSVC 19.16.27043, x64 mode).

(I didn't understand the way the OP built the dataset, so I just ran `tar -xOvf java.* | identifiers > words.txt`. The `identifiers` program (the author's tool to extract keywords and literals) crashed on some png file that `tar` didn't filter out, at which point `words.txt` was ~592M, which I considered enough for a quick experiment.)
| | IO alone | Hash table | STL | STL+strings | same+2 lookups | same+1 lookup | Trie |
|---|---|---|---|---|---|---|---|
| total, s | 7.042 | 15.908 | 16.461 | 15.522 | 13.814 | 13.861 (WTF?) | 15.996 |
| total-IO, s | 0 | 8.866 | 9.419 | 8.480 | 6.772 | 6.819 | 8.954 |
| slowdown (vs Hash table) | | x1 | x1.062 | x0.956 | x0.763 | x0.769 | x1.010 |
Against my original intuition, the version using `try_emplace` performs more or less the same as the version not using it. Why? Because (at least in Microsoft's STL) `try_emplace` literally does the same thing as the previous version.
So we got our C++ version faster than the programs using data structures hand-rolled in C, with just a few changes in the code (by ~23%).
Nice! (But it also shows how easy it is to make code suboptimal with a few minor mistakes; also, it's no surprise that code in a library that has been improved for decades is better than code written by a single person.)
Well, we can also get rid of `std::unordered_map` in favor of an open-addressing map: for example, the `flat_hash_map` from Google's Abseil library.
The change is trivial: we only need to `#include "absl/container/flat_hash_map.h"`, add the relevant libraries to the build, and replace `std::unordered_map` with `absl::flat_hash_map`. Everything else works the same way.
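(A minimal sketch of that swap, assuming the `try_emplace` variant of `add` from above and a `uint` typedef like the OP's; `absl::flat_hash_map` follows the `std::unordered_map` interface closely, so only the map's type changes:)

```cpp
#include <string>
#include <utility>

#include "absl/container/flat_hash_map.h"

using uint = unsigned int; // assumption: stand-in for the OP's uint typedef

struct Intern_pool {
    absl::flat_hash_map<std::string, uint> map; // was std::unordered_map
    uint next = 0;
};

uint add(Intern_pool* pool, std::string s)
{
    // flat_hash_map supports try_emplace, so the body is unchanged.
    auto [it, inserted_new] = pool->map.try_emplace(std::move(s), pool->next + 1);
    if (inserted_new)
        return ++pool->next;
    return it->second;
}
```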
After running this new version, I got an average execution time of 12.574s, i.e. ~5.5s of "useful" work, or 0.62x the time of the OP's "standard" hand-rolled hash table. Great success!
(To be continued in the next post.)