This commit is contained in:
Shunsuke Kanda 2021-07-11 03:29:10 +09:00
parent d804607eec
commit 2687a98bf8

View file

@ -18,7 +18,7 @@
## Features
- **Compressed string dictionary.** Xcdat implements a (static) *compressed string dictioanry* that stores a set of strings (or keywords) in a compressed space while supporting several search operations [1,2]. For example, Xcdat can store an entire set of English Wikipedia titles at half the size of the raw data. (see [Performance](#performance))
- **Compressed string dictionary.** Xcdat implements a (static) *compressed string dictioanry* that stores a set of strings (or keywords) in a compressed space while supporting several search operations [1,2]. For example, Xcdat can store an entire set of English Wikipedia titles at half the size of the raw data. (See [Performance](#performance))
- **Fast and compact data structure.** Xcdat employs the *double-array trie* [3] known as the fastest trie implementation. However, the double-array trie resorts to many pointers and consumes a large amount of memory. To address this, Xcdat applies the *XCDA* method [2] that represents the double-array trie in a compressed format while maintaining the fast searches.
- **Cache efficiency.** Xcdat employs a *minimal-prefix trie* [4] that replaces redundant trie nodes into strings to reduce random access and to improve locality of references.
- **Dictionary encoding.** Xcdat maps `N` distinct keywords into unique IDs from `[0,N-1]`, and supports the two symmetric operations: `lookup` returns the ID corresponding to a given keyword; `decode` returns the keyword associated with a given ID. The mapping is so-called *dictionary encoding* (or *domain encoding*) and is fundamental in many DB applications as described by Martínez-Prieto et al [1] or Müller et al. [5].
@ -27,7 +27,7 @@
- **Binary key support.** In normal mode, Xcdat will use the `\0` character as an end marker for each keyword. However, if the dataset include `\0` characters, it will use bit flags instead of end markers, allowing the dataset to consist of binary keywords.
- **Memory mapping.** Xcdat supports *memory mapping*, allowing data to be deserialized quickly without loading it into memory. Of course, deserialization by the loading is also supported.
- **Header only.** The library consists only of header files, and you can easily install it.
- **Python binding.** You can use Xcdat in Python3 via [pybind11](https://github.com/pybind/pybind11). (visit the directory [pybind](https://github.com/kampersanda/xcdat/tree/master/pybind))
- **Python binding.** You can use Xcdat in Python3 via [pybind11](https://github.com/pybind/pybind11). (Visit the directory [pybind](https://github.com/kampersanda/xcdat/tree/master/pybind))
## Build instructions
@ -51,6 +51,10 @@ You need to install a modern C++17 ready compiler such as `g++ >= 7.0` or `clang
The library considers a 64-bit operating system. The code has been tested only on Mac OS X and Linux. That is, this library considers only UNIX-compatible OS.
### Python binding
Xcdat supports the Python binding via [pybind11](https://github.com/pybind/pybind11). The description can be found in the directory [pybind](https://github.com/kampersanda/xcdat/tree/master/pybind).
## Command line tools
Xcdat provides command line tools to build the dictionary and perform searches, which are inspired by [marisa-trie](https://github.com/s-yata/marisa-trie). All the tools will print the command line options by specifying the parameter `-h`.