jitenbot
Jitenbot is a program for scraping Japanese dictionary websites and compiling the scraped data into compact dictionary file formats.
Supported Dictionaries
- Online
- Offline
Supported Output Formats
Examples
Usage
usage: jitenbot [-h] [-p PAGE_DIR] [-i IMAGE_DIR]
                {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}

Convert Japanese dictionary files to new formats.

positional arguments:
  {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
                        name of dictionary to convert

options:
  -h, --help            show this help message and exit
  -p PAGE_DIR, --page-dir PAGE_DIR
                        path to directory containing XML page files
  -i IMAGE_DIR, --image-dir IMAGE_DIR
                        path to directory containing image folders (gaiji,
                        graphics, etc.)
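For readers who want to see how a command line like this is typically declared, the following is a minimal sketch using Python's standard `argparse` module. It is not taken from the jitenbot source: the function name `build_parser` and the positional name `target` are assumptions, while the flags, choices, and help strings are copied from the help text above.

```python
import argparse

# Dictionary names copied from the help text above.
TARGETS = ["jitenon-yoji", "jitenon-kotowaza", "smk8", "daijirin2"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="jitenbot",
        description="Convert Japanese dictionary files to new formats.",
    )
    # "target" is a hypothetical attribute name; argparse displays the choices.
    parser.add_argument(
        "target",
        choices=TARGETS,
        help="name of dictionary to convert",
    )
    parser.add_argument(
        "-p", "--page-dir",
        help="path to directory containing XML page files",
    )
    parser.add_argument(
        "-i", "--image-dir",
        help="path to directory containing image folders (gaiji, graphics, etc.)",
    )
    return parser


if __name__ == "__main__":
    # Example: an invocation that passes both directories explicitly.
    args = build_parser().parse_args(["smk8", "-p", "pages", "-i", "images"])
    print(args.target, args.page_dir, args.image_dir)
```

Both directory options default to `None` when omitted, so a caller can tell whether the user supplied them.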
Online Targets
Jitenbot scrapes the target website and saves the pages to the user cache directory. As a courtesy to the website owners, jitenbot is configured to pause for 10 seconds between page requests. Consequently, a complete crawl of a target website may take several days.
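The crawl behavior described above can be sketched roughly as follows. This is an illustration, not jitenbot's actual implementation: the function names, the cache file naming scheme (SHA-1 of the URL), and the choice to skip the pause for already cached pages are all assumptions. Only the 10-second delay and the idea of saving pages to a cache directory come from the text.

```python
import hashlib
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable

DELAY_SECONDS = 10  # pause between page requests, as stated above


def fetch_url(url: str) -> bytes:
    """Download one page using only the standard library."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def crawl(
    urls: Iterable[str],
    cache_dir: Path,
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
    delay: float = DELAY_SECONDS,
) -> int:
    """Fetch each URL not yet cached, pausing between network requests.

    Returns the number of requests actually made.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    requests_made = 0
    for url in urls:
        # Hypothetical naming scheme: one cache file per URL.
        name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
        cache_file = cache_dir / name
        if cache_file.exists():
            continue  # already cached: no request, so no pause
        if requests_made > 0:
            sleep(delay)  # wait before every request except the first
        cache_file.write_bytes(fetch(url))
        requests_made += 1
    return requests_made
```

Because cached pages are skipped, an interrupted crawl can be resumed without repeating earlier requests. The delay also explains the multi-day estimate: a site with a hypothetical 20,000 pages would need about 200,000 seconds of waiting, a little over two days.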
HTTP request headers (user agent string, etc.) may be customized by editing the config.json file created in the user config directory.
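As a rough illustration of how such a file might be created and read, here is a minimal sketch. The key name `http-request-headers`, the default header value, and the function name `load_headers` are hypothetical; the actual layout of jitenbot's config.json may differ, so check the generated file before editing it.

```python
import json
from pathlib import Path

# Hypothetical key and default values; the real config.json may differ.
HEADERS_KEY = "http-request-headers"
DEFAULT_HEADERS = {"User-Agent": "example-agent/0.1"}


def load_headers(config_path: Path) -> dict:
    """Return request headers, creating the config file on first use."""
    if not config_path.exists():
        config_path.parent.mkdir(parents=True, exist_ok=True)
        config_path.write_text(
            json.dumps({HEADERS_KEY: DEFAULT_HEADERS}, indent=4),
            encoding="utf-8",
        )
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Values from the file override the defaults key by key.
    return {**DEFAULT_HEADERS, **config.get(HEADERS_KEY, {})}
```

The path is passed in by the caller because the location of the user config directory depends on the operating system.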
Offline Targets
Page data and image data must be procured by the user and passed to jitenbot via the appropriate command line flags.
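Since the data for offline targets comes from the user, a program of this kind would reasonably check the supplied paths before starting a conversion. The helper below is a sketch of such a check under that assumption; `require_directory` is a hypothetical name and is not part of jitenbot.

```python
from pathlib import Path
from typing import Optional


def require_directory(value: Optional[str], flag: str) -> Path:
    """Return the path for a flag, or raise if it is missing or not a directory."""
    if value is None:
        raise ValueError(f"{flag} is required for offline targets")
    path = Path(value)
    if not path.is_dir():
        raise ValueError(f"{flag}: not a directory: {path}")
    return path
```

A caller might run this once for `--page-dir` and once for `--image-dir`, reporting any `ValueError` to the user before doing further work.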
Attribution
Adobe-Japan1_sequences.txt is provided by The Adobe-Japan1-7 Character Collection.