Shunsuke Kanda 2021-07-01 07:29:27 +09:00
parent dffbac05c1
commit d29442de00

@ -20,7 +20,7 @@
- **Fast and compact data structure.** Xcdat employs the *double-array trie* [3], known as the fastest data structure for trie implementation. However, the double-array trie relies on many pointers and consumes a large amount of memory. To address this, Xcdat applies the *XCDA* method [2], which represents the double-array trie in a compressed format while maintaining fast searches.
- **Cache efficiency.** Xcdat employs a *minimal-prefix trie* [4] that replaces redundant trie nodes with strings, reducing random access and improving locality of reference.
- **Dictionary encoding.** Xcdat maps `N` distinct keywords to unique IDs in `[0,N-1]` and supports two symmetric operations: `lookup` returns the ID corresponding to a given keyword; `decode` returns the keyword associated with a given ID. This mapping is known as *dictionary encoding* (or *domain encoding*) and is fundamental in many DB applications, as described by Martínez-Prieto et al. [1] and Müller et al. [5].
- **Prefix search operations.** Xcdat supports prefix search operations realized by trie search algorithms: `prefix_search` returns all the keywords contained as prefixes of a given string; `predictive_search` returns all the keywords starting with a given string. These are useful in many NLP applications such as auto-completion [6], stemmed search [7], and morphological analysis [8].
- **64-bit support.** As mentioned above, the double array is a pointer-based data structure, so most double-array libraries use 32-bit pointers to reduce memory consumption, which limits the scale of the input dataset. In contrast, the XCDA method allows Xcdat to represent 64-bit pointers without sacrificing memory efficiency.
- **Binary key support.** In normal mode, Xcdat uses the `\0` character as an end marker for each keyword. However, if the dataset includes `\0` characters, it uses bit flags instead of end markers, allowing the dataset to consist of binary keywords.
- **Memory mapping.** Xcdat supports *memory mapping*, allowing data to be deserialized quickly without loading it into memory. Deserialization by loading is also supported.
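
The `lookup`, `decode`, and `prefix_search` semantics listed above can be illustrated without a trie. The following is a minimal sketch, not Xcdat's implementation or API: it keeps the keywords in a sorted `std::vector` and, as an assumption made only for this example, uses each keyword's byte-wise lexicographic rank as its ID. The name `SortedDictionary` and all of its members are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoding structure (NOT Xcdat's API).
// Assumption: the ID of a keyword is its rank in byte-wise sorted order.
struct SortedDictionary {
    std::vector<std::string> keys;  // sorted and unique after construction

    explicit SortedDictionary(std::vector<std::string> input) : keys(std::move(input)) {
        std::sort(keys.begin(), keys.end());
        keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
    }

    // Returns the ID of `key`, or -1 if the keyword is not stored.
    std::int64_t lookup(const std::string& key) const {
        auto it = std::lower_bound(keys.begin(), keys.end(), key);
        if (it == keys.end() || *it != key) return -1;
        return static_cast<std::int64_t>(it - keys.begin());
    }

    // Returns the keyword for `id`; throws std::out_of_range outside [0, N-1].
    const std::string& decode(std::size_t id) const {
        return keys.at(id);
    }

    // Returns every stored keyword that is a prefix of `query`, shortest first.
    std::vector<std::string> prefix_search(const std::string& query) const {
        std::vector<std::string> results;
        for (std::size_t len = 0; len <= query.size(); ++len) {
            const std::string candidate = query.substr(0, len);
            if (lookup(candidate) >= 0) results.push_back(candidate);
        }
        return results;
    }
};
```

With the keywords `Mac`, `MacBook`, `MacBook_Air`, and `iPhone`, this sketch assigns IDs 0 to 3 in that order, and `prefix_search("MacBook_Pro")` yields `Mac` and `MacBook`. A trie answers the same queries in a single traversal of the query string instead of one binary search per prefix length.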
@ -50,11 +50,13 @@ The library considers a 64-bit operating system. The code has been tested only o
## Command line tools
Xcdat provides command line tools to build the index and perform searches, which are inspired by [marisa-trie](https://github.com/s-yata/marisa-trie). Every tool prints its command line options when given the parameter `-h`.
### `xcdat_build`
It builds the trie index from a given dataset consisting of keywords separated by newlines. The keywords must be unique and sorted in ASCII order.
The following command builds the trie index from dataset `enwiki-latest-all-titles-in-ns0` and writes the index into file `idx.bin`.
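
Because `xcdat_build` expects sorted, unique keywords, it can help to validate or normalize a keyword list before building. The sketch below is an illustration and not part of Xcdat; the function names `is_sorted_and_unique` and `sort_and_deduplicate` are hypothetical. It assumes that "ASCII order" means plain byte-wise comparison, which is what `operator<` on `std::string` performs.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not part of Xcdat): returns true if every keyword is
// strictly greater than its predecessor, i.e. the list is sorted and unique.
bool is_sorted_and_unique(const std::vector<std::string>& keys) {
    // adjacent_find locates the first neighboring pair with prev >= next.
    return std::adjacent_find(keys.begin(), keys.end(),
                              std::greater_equal<std::string>()) == keys.end();
}

// Hypothetical helper: sorts the keywords byte-wise and drops duplicates.
void sort_and_deduplicate(std::vector<std::string>& keys) {
    std::sort(keys.begin(), keys.end());
    keys.erase(std::unique(keys.begin(), keys.end()), keys.end());
}
```

Running the check first avoids rewriting a large dataset that is already in the required order.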
```
$ xcdat_build enwiki-latest-all-titles-in-ns0 idx.bin -u 1
@ -68,7 +70,7 @@ max_length: 253
### `xcdat_lookup`
It tests the `lookup` operation for a given index. Given a query string via `stdin`, it prints the associated ID if found, or `-1` otherwise.
```
$ xcdat_lookup idx.bin
@ -80,6 +82,8 @@ Double_Array
### `xcdat_decode`
It tests the `decode` operation for a given index. Given a query ID via `stdin`, it prints the corresponding keyword if the ID is in the range `[0,N-1]`, where `N` is the number of stored keywords.
```
$ xcdat_decode idx.bin
1255938
@ -88,6 +92,8 @@ $ xcdat_decode idx.bin
### `xcdat_prefix_search`
It tests the `prefix_search` operation for a given index. Given a query string via `stdin`, it prints all the keywords contained as prefixes of the query string.
```
$ xcdat_prefix_search idx.bin
Algorithmic
@ -102,6 +108,8 @@ Algorithmic
### `xcdat_predictive_search`
It tests the `predictive_search` operation for a given index. Given a query string via `stdin`, it prints the first `n` keywords starting with the query string, where `n` is one of the parameters.
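
The effect of the `-n` parameter can be sketched over a sorted keyword list. This is an illustration under stated assumptions, not Xcdat's trie-based algorithm or API: the function `predictive_search_first_n` is hypothetical, and it assumes that results are reported in byte-wise lexicographic order, which the text above does not state.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not Xcdat's API): returns at most `n` keywords that
// start with `prefix`. `sorted_keys` must be sorted byte-wise and unique.
std::vector<std::string> predictive_search_first_n(
        const std::vector<std::string>& sorted_keys,
        const std::string& prefix, std::size_t n) {
    std::vector<std::string> results;
    // Keywords sharing `prefix` form a contiguous run that begins at lower_bound.
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), prefix);
    for (; it != sorted_keys.end() && results.size() < n; ++it) {
        // Stop as soon as a keyword no longer starts with `prefix`.
        if (it->compare(0, prefix.size(), prefix) != 0) break;
        results.push_back(*it);
    }
    return results;
}
```

Stopping after `n` matches keeps the work proportional to the number of reported keywords rather than to the total number of matches.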
```
$ xcdat_predictive_search idx.bin -n 3
Algorithm
@ -113,6 +121,8 @@ Algorithm
### `xcdat_enumerate`
It prints all the keywords stored in a given index.
```
$ xcdat_enumerate idx.bin | head -3
0 !