# Xcdat: Fast compressed trie dictionary library

**Xcdat** is a C++17 header-only library of a fast compressed string dictionary based on the improved double-array trie structure described in the paper [Compressed double-array tries for string dictionaries supporting fast lookup](https://doi.org/10.1007/s10115-016-0999-8), *Knowledge and Information Systems*, 2017, available [here](https://kampersanda.github.io/pdf/KAIS2017.pdf).

## Table of contents

- [Features](#features)
- [Build instructions](#build-instructions)
- [Command line tools](#command-line-tools)
- [Sample usage](#sample-usage)
- [API](#api)
- [Performance](#performance)
- [Licensing](#licensing)
- [Todo](#todo)
- [References](#references)
## Features
- **Compressed string dictionary.** Xcdat implements a (static) *compressed string dictionary* that stores a set of strings (or keywords) in a compressed space while supporting several search operations [1,2]. For example, Xcdat can store an entire set of English Wikipedia titles at half the size of the raw data.
- **Fast and compact data structure.** Xcdat employs the *double-array trie* [3], known as the fastest data structure for trie implementation. However, the double-array trie relies on many pointers and consumes a large amount of memory. To address this, Xcdat applies the *XCDA* method [2], which represents the double-array trie in a compressed format while maintaining fast searches.
- **Cache efficiency.** Xcdat employs a *minimal-prefix trie* [4] that replaces redundant trie nodes with strings, which reduces random accesses and improves locality of reference.
- **Dictionary encoding.** Xcdat maps `N` distinct keywords to unique IDs in `[0,N-1]` and supports two symmetric operations: `lookup` returns the ID corresponding to a given keyword; `decode` returns the keyword associated with a given ID. This mapping is the so-called *dictionary encoding* (or *domain encoding*) and is fundamental in many DB applications, as described by Martínez-Prieto et al. [1] or Müller et al. [5].
- **Prefix search operations.** Xcdat supports prefix search operations realized by trie search algorithms: `prefix_search` returns all the keywords contained as prefixes of a given string; `predictive_search` returns all the keywords starting with a given string. These are useful in many NLP applications such as auto-completion [6], stemmed searches [7], or morphological analysis [8].
- **64-bit support.** Because the double array is a pointer-based data structure, most double-array libraries use 32-bit pointers to reduce memory consumption, which limits the scale of the input dataset. In contrast, the XCDA method allows Xcdat to represent 64-bit pointers without sacrificing memory efficiency.
- **Binary key support.** In normal mode, Xcdat uses the `\0` character as an end marker for each keyword. However, if the dataset includes `\0` characters, it uses bit flags instead of end markers, allowing the dataset to consist of binary keywords.
- **Memory mapping.** Xcdat supports *memory mapping*, allowing data to be deserialized quickly without loading it into memory. Deserialization by loading the data into memory is also supported.
- **Header only.** The library consists only of header files, and you can easily install it.
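
To make the double-array idea from the list above concrete, here is a minimal sketch of a plain, uncompressed double-array trie in the classic `base`/`check` formulation. It is not Xcdat's implementation: it omits the XCDA compression and the minimal-prefix tail, assumes keys are sorted, unique, and free of `\0`, and the class name `double_array_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Illustrative only: a plain double-array trie using '\0' as the end marker.
class double_array_sketch {
  public:
    // 'sorted_keys' must be lexicographically sorted, unique, and '\0'-free.
    explicit double_array_sketch(const std::vector<std::string>& sorted_keys)
        : base_(1, 0), check_(1, FREE) {  // slot 0 is the root
        if (!sorted_keys.empty()) build(sorted_keys, 0, sorted_keys.size(), 0, 0);
    }

    bool contains(std::string_view key) const {
        std::int32_t s = 0;
        for (unsigned char c : key) {
            s = child(s, c);
            if (s < 0) return false;
        }
        return child(s, 0) >= 0;  // the key must end at an end marker
    }

  private:
    static constexpr std::int32_t FREE = -1;
    std::vector<std::int32_t> base_;   // base_[s]: offset of the children of s
    std::vector<std::int32_t> check_;  // check_[t]: parent of t, or FREE

    // The child of s by byte c lives at t = base_[s] + c iff check_[t] == s.
    std::int32_t child(std::int32_t s, unsigned char c) const {
        const std::size_t t = static_cast<std::size_t>(base_[s]) + c;
        return (t < check_.size() && check_[t] == s) ? static_cast<std::int32_t>(t) : -1;
    }

    bool fits(std::int32_t b, const std::vector<unsigned char>& labels) const {
        for (unsigned char c : labels) {
            const std::size_t t = static_cast<std::size_t>(b) + c;
            if (t < check_.size() && check_[t] != FREE) return false;
        }
        return true;
    }

    // keys[lo, hi) share a prefix of length 'depth' that ends at 'node'.
    void build(const std::vector<std::string>& keys, std::size_t lo, std::size_t hi,
               std::size_t depth, std::int32_t node) {
        std::vector<unsigned char> labels;  // distinct next bytes (0 = end of key)
        std::vector<std::size_t> starts;    // first key index of each label group
        for (std::size_t i = lo; i < hi; ++i) {
            const unsigned char c =
                depth < keys[i].size() ? static_cast<unsigned char>(keys[i][depth]) : 0;
            if (labels.empty() || labels.back() != c) {
                labels.push_back(c);
                starts.push_back(i);
            }
        }
        starts.push_back(hi);

        std::int32_t b = 1;  // smallest base whose child slots are all free
        while (!fits(b, labels)) ++b;
        base_[node] = b;
        for (unsigned char c : labels) {
            const std::size_t t = static_cast<std::size_t>(b) + c;
            if (t >= check_.size()) {
                base_.resize(t + 1, 0);
                check_.resize(t + 1, FREE);
            }
            check_[t] = node;
        }
        for (std::size_t k = 0; k < labels.size(); ++k) {
            if (labels[k] != 0) {
                build(keys, starts[k], starts[k + 1], depth + 1,
                      static_cast<std::int32_t>(b + labels[k]));
            }
        }
    }
};
```

Each transition costs one addition and one comparison, which is why the structure is fast, while the two integer arrays explain the memory cost that the XCDA method is designed to reduce.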
## Build instructions
You can download, compile, and install Xcdat with the following commands.

```
$ git clone https://github.com/kampersanda/xcdat.git
$ cd xcdat
$ mkdir build
$ cd build
$ cmake ..
$ make -j
$ make install
```

Alternatively, since this library consists only of header files, you can use it by adding the `include` directory to your compiler's include path.

### Requirements

You need a C++17-ready compiler such as `g++ >= 7.0` or `clang >= 4.0`. For the build system, you need `CMake >= 3.0` to compile the library.

The library assumes a 64-bit operating system. The code has been tested only on Mac OS X and Linux. That is, only UNIX-compatible operating systems are supported.

## Command line tools

Xcdat provides command line tools to build the index and perform searches, inspired by [marisa-trie](https://github.com/s-yata/marisa-trie). Every tool prints its command line options when invoked with the parameter `-h`.

### `xcdat_build`

It builds the trie index from a given dataset consisting of keywords separated by newlines. The following command builds the trie index from dataset `enwiki-titles.txt` and writes the index into file `idx.bin`.

```
$ xcdat_build enwiki-titles.txt idx.bin
Number of keys: 15955763
Number of trie nodes: 36441058
Number of DA units: 36520704
Memory usage in bytes: 1.70618e+08
Memory usage in MiB: 162.714
```
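
The two memory figures in the output above are related by a simple unit conversion. The following sketch shows that arithmetic and derives an approximate per-key cost; the helper names `bytes_to_mib` and `bytes_per_key` are hypothetical and not part of Xcdat.

```cpp
#include <cassert>
#include <cstdint>

// 1 MiB = 1024 * 1024 bytes.
constexpr double bytes_to_mib(double bytes) {
    return bytes / (1024.0 * 1024.0);
}

// Average index size per stored keyword.
inline double bytes_per_key(double bytes, std::uint64_t num_keys) {
    assert(num_keys > 0);
    return bytes / static_cast<double>(num_keys);
}
```

For example, `bytes_to_mib(1.70618e+08)` reproduces the reported 162.714 MiB, and `bytes_per_key(1.70618e+08, 15955763)` gives roughly 10.7 bytes per keyword for this dataset (the byte count printed by the tool is itself rounded).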

### `xcdat_lookup`

It tests the `lookup` operation for a given index. Given a query string via `stdin`, it prints the associated ID if found, or `-1` otherwise.

```
$ xcdat_lookup idx.bin
Algorithm
1255938 Algorithm
Double_Array
-1 Double_Array
```

### `xcdat_decode`

It tests the `decode` operation for a given index. Given a query ID via `stdin`, it prints the corresponding keyword if the ID is in the range `[0,N-1]`, where `N` is the number of stored keywords.

```
$ xcdat_decode idx.bin
1255938
1255938 Algorithm
```
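
The symmetry between `lookup` and `decode` can be illustrated with a deliberately naive reference model: a sorted array in which the ID of a keyword is its position. This is only a sketch of the semantics, not of how Xcdat works; in particular, Xcdat's IDs are not lexicographic ranks (in the sample output of this README, `AirTag` has ID 3). The class name `sorted_dictionary` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Reference model of dictionary encoding over a sorted, deduplicated array.
class sorted_dictionary {
  public:
    explicit sorted_dictionary(std::vector<std::string> keys) : keys_(std::move(keys)) {
        std::sort(keys_.begin(), keys_.end());
        keys_.erase(std::unique(keys_.begin(), keys_.end()), keys_.end());
    }

    // Keyword -> ID, or std::nullopt if the keyword is not stored.
    std::optional<std::uint64_t> lookup(std::string_view key) const {
        const auto it = std::lower_bound(
            keys_.begin(), keys_.end(), key,
            [](const std::string& a, std::string_view b) { return std::string_view(a) < b; });
        if (it == keys_.end() || std::string_view(*it) != key) return std::nullopt;
        return static_cast<std::uint64_t>(it - keys_.begin());
    }

    // ID -> keyword; the ID must be in [0, N-1].
    const std::string& decode(std::uint64_t id) const {
        assert(id < keys_.size());
        return keys_[id];
    }

    std::uint64_t num_keys() const { return keys_.size(); }

  private:
    std::vector<std::string> keys_;
};
```

In this model, `decode(*lookup(k)) == k` holds for every stored keyword `k`, which is the property the two tools above exercise from the command line.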

### `xcdat_prefix_search`

It tests the `prefix_search` operation for a given index. Given a query string via `stdin`, it prints all the keywords contained as prefixes of the query.

```
$ xcdat_prefix_search idx.bin
Algorithmic
6 found
57 A
798460 Al
1138004 Alg
1253024 Algo
1255938 Algorithm
1255931 Algorithmic
```
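
As a sketch of what "contained as prefixes" means, the following reference model checks every prefix of the query against a sorted key list. A trie answers the same question in a single root-to-leaf walk, so this is only a specification-style illustration; the function name `prefixes_of` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Returns every stored key that is a prefix of 'query', shortest first.
// 'sorted_keys' must be lexicographically sorted.
inline std::vector<std::string> prefixes_of(const std::vector<std::string>& sorted_keys,
                                            std::string_view query) {
    std::vector<std::string> found;
    for (std::size_t len = 0; len <= query.size(); ++len) {
        const std::string prefix(query.substr(0, len));
        if (std::binary_search(sorted_keys.begin(), sorted_keys.end(), prefix)) {
            found.push_back(prefix);
        }
    }
    return found;
}
```

With the keys `A`, `Al`, `Alg`, `Algorithm`, `Algorithmic`, and `Algorithms`, the query `Algorithmic` yields the first five keys and skips `Algorithms`, because the latter is not a prefix of the query.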

### `xcdat_predictive_search`

It tests the `predictive_search` operation for a given index. Given a query string via `stdin`, it prints the first `n` keywords starting with a given string, where `n` is one of the parameters.

```
$ xcdat_predictive_search idx.bin -n 3
Algorithm
263 found
1255938 Algorithm
1255944 Algorithm's_optimality
1255972 Algorithm_(C++)
```
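
The behavior shown above (a total count plus the first `n` matches) can be modeled on a sorted key list, where all keys sharing a prefix form one contiguous range. This is a hedged reference model of the semantics, not the tool's implementation; `predictive_result` and `predictive_search_model` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct predictive_result {
    std::size_t num_found = 0;       // total number of keys starting with the prefix
    std::vector<std::string> first;  // the first n of them, in lexicographic order
};

// 'sorted_keys' must be lexicographically sorted.
inline predictive_result predictive_search_model(const std::vector<std::string>& sorted_keys,
                                                 std::string_view prefix, std::size_t n) {
    predictive_result result;
    // Jump to the first key that is not smaller than the prefix.
    auto it = std::lower_bound(
        sorted_keys.begin(), sorted_keys.end(), prefix,
        [](const std::string& a, std::string_view b) { return std::string_view(a) < b; });
    // Matching keys are contiguous from here on.
    for (; it != sorted_keys.end(); ++it) {
        if (std::string_view(*it).substr(0, prefix.size()) != prefix) break;
        ++result.num_found;
        if (result.first.size() < n) result.first.push_back(*it);
    }
    return result;
}
```

Note that the byte order places `'` before `_`, which in turn precedes lowercase letters; this is consistent with the order of the three results printed above.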

### `xcdat_enumerate`

It prints all the keywords stored in a given index.

```
$ xcdat_enumerate idx.bin | head -3
0 !
107 !!
138 !!!
```

### `xcdat_benchmark`

It measures the performance of the available trie types for a given dataset. To perform search operations, it randomly samples `n` queries from the dataset, where `n` is one of the parameters.

```
$ xcdat_benchmark enwiki-titles.txt
** xcdat::trie_7_type **
Number of keys: 15955763
Memory usage in bytes: 1.70618e+08
Memory usage in MiB: 162.714
Construction time in seconds: 12.907
Lookup time in microsec/query: 0.4674
Decode time in microsec/query: 0.8722
** xcdat::trie_8_type **
Number of keys: 15955763
Memory usage in bytes: 1.64104e+08
Memory usage in MiB: 156.502
Construction time in seconds: 13.442
Lookup time in microsec/query: 0.7593
Decode time in microsec/query: 1.2341
```
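
A microseconds-per-query figure like the ones above can be obtained by sampling queries from the dataset and timing a loop over them. The sketch below shows one way to do this with the standard library; it is not the benchmark tool's source, and the names `sample_queries` and `microsec_per_query` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Draws n queries uniformly at random (with replacement) from the keys.
inline std::vector<std::string> sample_queries(const std::vector<std::string>& keys,
                                               std::size_t n, std::uint64_t seed) {
    assert(!keys.empty());
    std::mt19937_64 engine(seed);
    std::uniform_int_distribution<std::size_t> pick(0, keys.size() - 1);
    std::vector<std::string> queries;
    queries.reserve(n);
    for (std::size_t i = 0; i < n; ++i) queries.push_back(keys[pick(engine)]);
    return queries;
}

// Runs fn on every query and returns the average time in microseconds.
// fn must return an integer; the results are accumulated so that the
// compiler cannot discard the calls.
template <class Fn>
double microsec_per_query(const std::vector<std::string>& queries, Fn&& fn) {
    assert(!queries.empty());
    volatile std::size_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (const auto& q : queries) sink = sink + fn(q);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(queries.size());
}
```

With a loaded trie, the callable could wrap `trie.lookup(q)` and return, for instance, the found ID or zero; timings from such a loop depend heavily on the machine and on cache state.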
## Sample usage
`sample/sample.cpp` provides a sample usage.

```c++
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>
#include <vector>

#include <xcdat.hpp>

int main() {
    // Input keys
    std::vector<std::string> keys = {
        "AirPods",  "AirTag",  "Mac",  "MacBook", "MacBook_Air", "MacBook_Pro",
        "Mac_Mini", "Mac_Pro", "iMac", "iPad",    "iPhone",      "iPhone_SE",
    };

    // The input keys must be sorted and unique (although these already are).
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

    const char* index_filename = "tmp.idx";

    // The trie index type
    using trie_type = xcdat::trie_8_type;

    // Build and save the trie index.
    {
        const trie_type trie(keys);
        xcdat::save(trie, index_filename);
    }

    // Load the trie index.
    const auto trie = xcdat::load<trie_type>(index_filename);

    // Basic statistics
    std::cout << "NumberKeys: " << trie.num_keys() << std::endl;
    std::cout << "MaxLength: " << trie.max_length() << std::endl;
    std::cout << "AlphabetSize: " << trie.alphabet_size() << std::endl;
    std::cout << "Memory: " << xcdat::memory_in_bytes(trie) << " bytes" << std::endl;

    // Lookup IDs from keys
    {
        const auto id = trie.lookup("Mac_Pro");
        std::cout << "Lookup(Mac_Pro) = " << id.value_or(UINT64_MAX) << std::endl;
    }
    {
        const auto id = trie.lookup("Google_Pixel");
        std::cout << "Lookup(Google_Pixel) = " << id.value_or(UINT64_MAX) << std::endl;
    }

    // Decode keys from IDs
    {
        const auto dec = trie.decode(4);
        std::cout << "Decode(4) = " << dec << std::endl;
    }

    // Common prefix search
    {
        std::cout << "CommonPrefixSearch(MacBook_Air) = {" << std::endl;
        auto itr = trie.make_prefix_iterator("MacBook_Air");
        while (itr.next()) {
            std::cout << "   (" << itr.decoded_view() << ", " << itr.id() << ")," << std::endl;
        }
        std::cout << "}" << std::endl;
    }

    // Predictive search
    {
        std::cout << "PredictiveSearch(Mac) = {" << std::endl;
        auto itr = trie.make_predictive_iterator("Mac");
        while (itr.next()) {
            std::cout << "   (" << itr.decoded_view() << ", " << itr.id() << ")," << std::endl;
        }
        std::cout << "}" << std::endl;
    }

    // Enumerate all the keys (in lexicographic order).
    {
        std::cout << "Enumerate() = {" << std::endl;
        auto itr = trie.make_enumerative_iterator();
        while (itr.next()) {
            std::cout << "   (" << itr.decoded_view() << ", " << itr.id() << ")," << std::endl;
        }
        std::cout << "}" << std::endl;
    }

    // Remove the temporary index file.
    std::remove(index_filename);

    return 0;
}
```

The output will be:

```
NumberKeys: 12
MaxLength: 11
AlphabetSize: 20
Memory: 1762 bytes
Lookup(Mac_Pro) = 7
Lookup(Google_Pixel) = 18446744073709551615
Decode(4) = MacBook_Air
CommonPrefixSearch(MacBook_Air) = {
(Mac, 1),
(MacBook, 2),
(MacBook_Air, 4),
}
PredictiveSearch(Mac) = {
(Mac, 1),
(MacBook, 2),
(MacBook_Air, 4),
(MacBook_Pro, 11),
(Mac_Mini, 5),
(Mac_Pro, 7),
}
Enumerate() = {
(AirPods, 0),
(AirTag, 3),
(Mac, 1),
(MacBook, 2),
(MacBook_Air, 4),
(MacBook_Pro, 11),
(Mac_Mini, 5),
(Mac_Pro, 7),
(iMac, 10),
(iPad, 6),
(iPhone, 8),
(iPhone_SE, 9),
}
```
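
The large number printed for `Lookup(Google_Pixel)` is simply `UINT64_MAX`, the fallback passed to `value_or` when `lookup` returns an empty `std::optional`. If a sentinel value is undesirable, the optional can be tested directly, as in this small sketch (the helper `describe_id` is hypothetical and not part of Xcdat):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Formats a lookup result without resorting to a sentinel ID.
inline std::string describe_id(const std::optional<std::uint64_t>& id) {
    return id ? std::to_string(*id) : std::string("not found");
}
```

For instance, `describe_id(trie.lookup("Google_Pixel"))` would yield `not found` for the sample dictionary, given the `lookup` signature documented in the API section.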
## API

`xcdat.hpp` provides the following trie types:

- `xcdat::trie_7_type`
- `xcdat::trie_8_type`

### Dictionary class
```c++
//! A compressed string dictionary based on an improved double-array trie.
//! 'BcVector' is the data type of Base and Check vectors.
template <class BcVector>
class trie {
  public:
    //! Default constructor
    trie() = default;

    //! Default destructor
    virtual ~trie() = default;

    //! Copy constructor (deleted)
    trie(const trie&) = delete;

    //! Copy assignment operator (deleted)
    trie& operator=(const trie&) = delete;

    //! Move constructor
    trie(trie&&) noexcept = default;

    //! Move assignment operator
    trie& operator=(trie&&) noexcept = default;

    //! Build the trie from the input keywords, which are lexicographically sorted and unique.
    //!
    //! If bin_mode = false, the NULL character is used for the termination of a keyword.
    //! If bin_mode = true, bit flags are used instead, and the keywords can contain NULL characters.
    //! If the input keywords contain NULL characters, bin_mode will be forced to be set to true.
    //!
    //! The types 'Strings' and 'Strings::value_type' should be random-access iterable containers such as std::vector.
    //! Precisely, they should support the following operations:
    //!  - size() returns the container size.
    //!  - operator[](i) accesses the i-th element.
    //!  - begin() returns the iterator to the beginning.
    //!  - end() returns the iterator to the end.
    //! The type 'Strings::value_type::value_type' should be a one-byte integer type such as 'char'.
    template <class Strings>
    trie(const Strings& keys, bool bin_mode = false);

    //! Check whether the trie is in binary mode.
    bool bin_mode() const;

    //! Get the number of stored keywords.
    std::uint64_t num_keys() const;

    //! Get the alphabet size.
    std::uint64_t alphabet_size() const;

    //! Get the maximum length of keywords.
    std::uint64_t max_length() const;

    //! Get the number of trie nodes.
    std::uint64_t num_nodes() const;

    //! Get the number of DA units.
    std::uint64_t num_units() const;

    //! Get the number of unused DA units.
    std::uint64_t num_free_units() const;

    //! Get the length of the TAIL vector.
    std::uint64_t tail_length() const;

    //! Look up the ID of the keyword.
    std::optional<std::uint64_t> lookup(std::string_view key) const;

    //! Decode the keyword associated with the ID.
    std::string decode(std::uint64_t id) const;

    //! Decode the keyword associated with the ID and store it in 'decoded'.
    //! It can avoid reallocation of memory to store the result.
    void decode(std::uint64_t id, std::string& decoded) const;

    //! An iterator class for common prefix search.
    //! It enumerates all the keywords contained as prefixes of a given string.
    //! It should be instantiated via the function 'make_prefix_iterator'.
    class prefix_iterator {
      public:
        prefix_iterator() = default;

        //! Increment the iterator.
        //! Return false if the iteration is terminated.
        bool next();

        //! Get the result ID.
        std::uint64_t id() const;

        //! Get the result keyword.
        std::string decoded() const;

        //! Get the reference to the result keyword.
        //! Note that the referenced data will be changed in the next iteration.
        std::string_view decoded_view() const;
    };

    //! Make the common prefix searcher for the given keyword.
    prefix_iterator make_prefix_iterator(std::string_view key) const;

    //! Perform common prefix search for the keyword.
    void prefix_search(std::string_view key, const std::function<void(std::uint64_t, std::string_view)>& fn) const;

    //! An iterator class for predictive search.
    //! It enumerates all the keywords starting with a given string.
    //! It should be instantiated via the function 'make_predictive_iterator'.
    class predictive_iterator {
      public:
        predictive_iterator() = default;

        //! Increment the iterator.
        //! Return false if the iteration is terminated.
        bool next();

        //! Get the result ID.
        std::uint64_t id() const;

        //! Get the result keyword.
        std::string decoded() const;

        //! Get the reference to the result keyword.
        //! Note that the referenced data will be changed in the next iteration.
        std::string_view decoded_view() const;
    };

    //! Make the predictive searcher for the keyword.
    predictive_iterator make_predictive_iterator(std::string_view key) const;

    //! Perform predictive search for the keyword.
    void predictive_search(std::string_view key, const std::function<void(std::uint64_t, std::string_view)>& fn) const;

    //! An iterator class for enumeration.
    //! It enumerates all the keywords stored in the trie.
    //! It should be instantiated via the function 'make_enumerative_iterator'.
    using enumerative_iterator = predictive_iterator;

    //! Make the enumerator of all the stored keywords.
    enumerative_iterator make_enumerative_iterator() const;

    //! Enumerate all the keywords and their IDs stored in the trie.
    void enumerate(const std::function<void(std::uint64_t, std::string_view)>& fn) const;

    //! Visit the members (commonly used for I/O).
    template <class Visitor>
    void visit(Visitor& visitor);
};
```
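
The three iterator classes above share one protocol: call `next()` until it returns `false`, and read `id()` and `decoded_view()` in between. The following sketch shows a generic helper written against that protocol that collects at most `limit` results. The helper `take` and the stand-in `fake_iterator` are hypothetical; the latter exists only so that the example compiles and runs without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Collects at most 'limit' (ID, keyword) pairs from any iterator that
// exposes next(), id(), and decoded_view().
template <class Iterator>
std::vector<std::pair<std::uint64_t, std::string>> take(Iterator itr, std::size_t limit) {
    std::vector<std::pair<std::uint64_t, std::string>> results;
    while (results.size() < limit && itr.next()) {
        // Copy the keyword now: the view is only valid until the next call to next().
        results.emplace_back(itr.id(), std::string(itr.decoded_view()));
    }
    return results;
}

// Hypothetical stand-in that mimics the iterator protocol over a plain vector.
class fake_iterator {
  public:
    explicit fake_iterator(std::vector<std::string> keys) : keys_(std::move(keys)) {}

    bool next() {
        if (pos_ >= keys_.size()) return false;
        current_ = keys_[pos_++];
        return true;
    }
    std::uint64_t id() const { return pos_ - 1; }
    std::string_view decoded_view() const { return current_; }

  private:
    std::vector<std::string> keys_;
    std::size_t pos_ = 0;
    std::string current_;
};
```

Assuming the library's iterators can be passed by value, a call such as `take(trie.make_predictive_iterator("Mac"), 3)` would return the first three completions, similar to the `-n 3` option of `xcdat_predictive_search`.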
### I/O handlers

`xcdat.hpp` provides some functions for handling I/O operations.

```c++
//! Map a contiguous memory block to a new trie instance.
template <class Trie>
Trie mmap(const char* address);

//! Load the trie index from the file.
template <class Trie>
Trie load(std::string_view filepath);

//! Save the trie index to the file and return the file size in bytes.
template <class Trie>
std::uint64_t save(const Trie& idx, std::string_view filepath);

//! Get the index size in bytes.
template <class Trie>
std::uint64_t memory_in_bytes(const Trie& idx);

//! Get the flag indicating the trie type, embedded by the function 'save'.
//! The flag corresponds to trie::l1_bits and will be used to detect the trie type from the file.
std::uint32_t get_flag(std::string_view filepath);

//! Load the keywords from the file.
std::vector<std::string> load_strings(std::string_view filepath, char delim = '\n');
```
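
Since `mmap` above takes the address of a memory block, the caller has to obtain that address first. On the UNIX-compatible systems this library targets, one option is the POSIX `mmap` call wrapped in a small RAII class, as sketched below. The class `mapped_file` is hypothetical and not part of Xcdat.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only memory mapping of a whole file, unmapped on destruction.
class mapped_file {
  public:
    explicit mapped_file(const std::string& path) {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open " + path);
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size == 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat, or empty file: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        data_ = static_cast<const char*>(p);
    }

    ~mapped_file() {
        if (data_ != nullptr) ::munmap(const_cast<char*>(data_), size_);
    }

    mapped_file(const mapped_file&) = delete;
    mapped_file& operator=(const mapped_file&) = delete;

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

  private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Given a file written by `xcdat::save`, the pointer could then be handed over as `xcdat::mmap<xcdat::trie_8_type>(file.data())`. The `mapped_file` object presumably has to outlive the resulting trie, since the point of memory mapping is to avoid copying the data.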
## Performance
To be added...
## Licensing
This library is free software provided under the MIT License.
If you use the library in academic settings, please cite the following paper.
```
@article{kanda2017compressed,
title={Compressed double-array tries for string dictionaries supporting fast lookup},
author={Kanda, Shunsuke and Morita, Kazuhiro and Fuketa, Masao},
journal={Knowledge and Information Systems (KAIS)},
volume={51},
number={3},
pages={1023--1042},
year={2017},
publisher={Springer}
}
```
## Todo
- Support other language bindings.
- Add SIMD-ization.
## References

1. J. Aoe. An efficient digital search algorithm by using a double-array structure. IEEE Transactions on Software Engineering, 15(9):1066-1077, 1989.
2. N. R. Brisaboa, S. Ladra, and G. Navarro. DACs: Bringing direct access to variable-length codes. Information Processing & Management, 49(1):392-404, 2013.
3. S. Kanda, K. Morita, and M. Fuketa. Compressed double-array tries for string dictionaries supporting fast lookup. Knowledge and Information Systems, 51(3):1023-1042, 2017.
4. M. A. Martínez-Prieto, N. Brisaboa, R. Cánovas, F. Claude, and G. Navarro. Practical compressed string dictionaries. Information Systems, 56:73-108, 2016.
5. S. Yata, M. Oono, K. Morita, M. Fuketa, T. Sumitomo, and J. Aoe. A compact static double-array keeping character codes. Information Processing & Management, 43(1):237-247, 2007.