Up until a couple of months ago, my default pick for XML-related things in C++ was pugixml, and by all means, it is a cool project.
Still, I started finding minor issues at first, and soon after some major ones; for each workaround I implemented, the experience of using it got more and more frustrating:
- There is no support for XML namespaces, yet most of my other projects rely on them. I was forced to hack them in very inelegantly, making my code less pleasant to read and write.
- It has no good query system for XML documents. Sure, XPath is supported, but it does not align with the kind of queries I am interested in.
- pugixml offers no portable reference to elements, attributes etc. This means I cannot locate elements in the document in $O(1)$, and I cannot share their location with others. It also negatively affects debugging.
- pugixml does not expose its internal representation in a format which is easily memory mappable, requiring a full parse from XML each time the document is loaded. This introduces a significant loading delay for bigger files and hurts performance for offloaded/distributed applications.
There is more, like its use of locale-specific functions to parse floats, but this list covers the most critical issues.
So yeah, pugixml is a great general-purpose library and can cover many applications well, but it cannot fit all scenarios, and many of mine seem to be out of scope.
I looked for alternatives, but I was not able to find much in the C++ ecosystem, aside from other popular projects that do support namespaces.
Well, I will do it myself!
I decided to give it a go, and I started by fixing once and for all the scope of the library, its constraints, and the general design decisions needed to make those possible:
- No need for editable documents once constructed, but external annotations must be allowed.
- It must have an internal representation we can trivially load from and save to disk; being trivially memory-mappable is a strict requirement.
- The internal representation should have good data locality.
- The internal representation must be fully portable and easy to shard to promote distributed applications.
- No need to support all XML features; this library is not meant to be validating. If validation results in a loss of performance, validation should be skipped, but correct input must lead to correct output.
- No exceptions, or at least being able to operate under `-fno-exceptions`.
- Native support for namespaces in elements and attributes.
As for its scope, these are the expected features:
- an XML parser and serializer to convert from and to its binary representation;
- a tree builder, supporting the construction of archives; archives are collections of multiple documents sharing the same table of symbols, so as to improve spatial compression;
- being able to directly work on top of memory mapped files, reducing the startup loading cost of the document to zero;
- some basic CLI utilities;
- a query engine;
- a subset of core features which neither require nor perform heap allocations, so that it can be used on embedded systems.
Local & portable data structures
[!NOTE]
The examples in this section are purely demonstrative and will mostly focus on how we shape data, not the algorithms involved.
The way we shape data has a direct effect on which algorithms can be implemented, and in our case this is especially important due to the many constraints introduced earlier.
So, no surprise, picking the right data structures to represent trees is not exactly trivial in this project.
In C++, it is quite common to find a textbook tree implemented like this:
```cpp
struct node {
    node* parent;
    std::vector<node*> children;
    payload_t payload;
};
```
Even worse, we might find examples using smart pointers, so that there is no need to manually manage the lifetime of nodes.
Needless to say, this approach is awful and works against several of our objectives:
- It lacks memory locality, since the allocated nodes might be anywhere on the heap. You can expect fragmentation and cache misses at all times.
- A pretty large memory footprint, since each pointer takes a 64-bit word on most modern architectures.
- The children require additional heap allocations, since their container is a vector.
- The tree is neither trivially serializable nor memory mappable. It is not stored in a normal form, and two equivalent trees will always differ in memory, as pointers are unique.
A first improvement would be the removal of that vector. One way to do that is to implicitly encode the pointer to the children list along with information about the node's siblings:
```cpp
struct node {
    node* parent;
    node *prev, *next;
    node* child_first;
    payload_t payload;
};
```
We are going to lose random access to children this way, but it is usually a fair trade off.
Next step would be to rethink the way we allocate and access memory:
```cpp
struct node {
    using index_t = size_t;

    index_t parent;
    index_t left, right;
    index_t child_first;
    payload_t payload;
};

struct tree {
    std::vector<node> storage; // or `std::span<node> storage;` if externally owned
    node::index_t root;
};
```
Nodes will now be "allocated" from `storage`. A functioning example would need some logic to do that, plus additional data structures to detect, mark and reuse deleted cells. Regardless of how it is implemented, this family of solutions has some clear advantages:
- `index_t` does not need to be 64 bits long. When working with smaller trees, we can safely use smaller indices, reducing the `node` footprint.
- Significantly less memory fragmentation for smaller trees, if we are good at managing storage. This allocator works on fixed-size cells of size `sizeof(node)`, unlike a generic one.
- We can save the storage vector to disk, or map it to memory, without further action needed.
The only real drawback is that we no longer access memory directly; we need to add an offset to a base address. In practice, this comes for free on most architectures.
[!NOTE]
Some would also go on and remove the `parent` reference/index. It is technically redundant, as long as we keep a stack of ancestors while descending.
This is a common trade-off which reduces the memory footprint of the tree as a whole in exchange for a small extra data structure.
However, it makes random access to a node impossible if we want to use a fixed-size key (like its address/index). Hence, this optimization will not be considered.
What about the payload?
Since we are specifically trying to represent XML trees, we have to carefully consider our payload structure.
In XML, there are several types of node we want to represent:
- element
- attribute
- comment
- proc (processing instruction)
- text
- cdata
One issue is that they are not all the same in terms of memory footprint:
```cpp
struct base_node {
    enum type_t : uint8_t {ELEMENT, ATTRIBUTE, COMMENT, PROC, TEXT, CDATA};
    using index_t = size_t;

    type_t type;
    index_t parent;
    index_t left, right;
};
```
```cpp
template<base_node::type_t type>
struct payload_t {};

template<>
struct payload_t<base_node::ELEMENT> {
    base_node::index_t child_first;
    base_node::index_t attr_first;
    std::string ns;
    std::string name;
};

template<>
struct payload_t<base_node::ATTRIBUTE> {
    std::string ns;
    std::string name;
    std::string value;
};

//Same for PROC, TEXT, and CDATA
template<>
struct payload_t<base_node::COMMENT> {
    std::string content;
};

template<base_node::type_t type>
struct node : base_node, payload_t<type> {};

struct tree {
    std::vector<node<base_node::ELEMENT>> storage_element;
    std::vector<node<base_node::ATTRIBUTE>> storage_attr;
    //...
    //Storage for COMMENT, PROC, TEXT and CDATA can be unified since they all take up the same amount of space.
    base_node::index_t root;
};
```
By splitting the storage into multiple containers, the allocators involved can stay simple, as they operate on fixed-size cells.
This also removes the potential for wasted memory due to alignment issues.
However, storing `std::string` instances in nodes is not a good idea: they require heap allocations, and they are neither relocatable nor trivially memory mappable from a file.
The last step we will perform to optimize this data structure is moving to views with weak ownership:
```cpp
//Custom string view. std::string_view is not usable, since it assumes an absolute position in memory for its base.
struct sv {
    using index_t = size_t;
    using length_t = size_t;

    index_t base;
    length_t size;
};
```
```cpp
//...

template<>
struct payload_t<base_node::ELEMENT> {
    base_node::index_t child_first;
    base_node::index_t attr_first;
    sv ns;
    sv name;
};

template<>
struct payload_t<base_node::ATTRIBUTE> {
    sv ns;
    sv name;
    sv value;
};

//Same for PROC, TEXT, and CDATA
template<>
struct payload_t<base_node::COMMENT> {
    sv content;
};

//...

struct tree {
    std::vector<char> symbols;
    std::vector<node<base_node::ELEMENT>> storage_element;
    std::vector<node<base_node::ATTRIBUTE>> storage_attr;
    //...
    //Storage for COMMENT, PROC, TEXT and CDATA can be unified since they take up the same amount of space.
    base_node::index_t root;
};
```
And this is a good place to stop. Let us briefly summarize how this last design fares:
- This binary representation is portable, as all references are just relative addresses. Relocating, offloading or memory mapping the resulting structure is very much viable, by simply changing the storage model for `tree`.
- Each node can be accessed in $O(1)$ time using a token of $O(1)$ size.
- We gained some form of memory locality for smaller trees. For bigger ones, it depends on the strategy used to handle allocations in the storage buffers.
- The memory footprint is reduced for smaller trees, or assuming smaller strings, as we can configure `index_t` and `length_t` as we desire.
- We have removed the vast majority of memory allocations, and those which are left can still be fully removed if we go with a different storage type.
We could stop here and be happy about this design, but there is much more to squeeze if we recall one important detail: there is no need for mutations!
Stay tuned for part II.