Web crawler for creating personal copies of Japanese dictionaries

jitenbot

Jitenbot is a program for scraping Japanese dictionary websites and compiling the scraped data into compact dictionary file formats.

Supported Dictionaries

Online
- 四字熟語辞典オンライン
- 故事・ことわざ・慣用句オンライン
Offline
- 新明解国語辞典第八版
- 大辞林第四版

Supported Output Formats

Yomichan

Examples

四字熟語辞典オンライン (web | yomichan)

故事・ことわざ・慣用句オンライン (web | yomichan)

新明解国語辞典第八版 (print | yomichan)

大辞林第四版 (print | yomichan)

Usage

usage: jitenbot [-h] [-p PAGE_DIR] [-i IMAGE_DIR]
                {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}

Convert Japanese dictionary files to new formats.

positional arguments:
  {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
                        name of dictionary to convert

options:
  -h, --help            show this help message and exit
  -p PAGE_DIR, --page-dir PAGE_DIR
                        path to directory containing XML page files
  -i IMAGE_DIR, --image-dir IMAGE_DIR
                        path to directory containing image folders (gaiji,
                        graphics, etc.)
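
The help text above has the shape produced by Python's `argparse` module. As a rough sketch only, a parser along these lines would accept the same arguments. The destination name `target` and the function name `build_parser` are assumptions for illustration and are not taken from the jitenbot source.

```python
import argparse

# Dictionary names copied from the usage text above.
TARGETS = ["jitenon-yoji", "jitenon-kotowaza", "smk8", "daijirin2"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the arguments shown in the usage text."""
    parser = argparse.ArgumentParser(
        prog="jitenbot",
        description="Convert Japanese dictionary files to new formats.",
    )
    # "target" is a hypothetical destination name chosen for this sketch.
    parser.add_argument(
        "target",
        choices=TARGETS,
        help="name of dictionary to convert",
    )
    parser.add_argument(
        "-p", "--page-dir",
        help="path to directory containing XML page files",
    )
    parser.add_argument(
        "-i", "--image-dir",
        help="path to directory containing image folders (gaiji, graphics, etc.)",
    )
    return parser
```

For example, `build_parser().parse_args(["smk8", "-p", "pages", "-i", "images"])` yields a namespace whose `page_dir` is `"pages"` and whose `image_dir` is `"images"`. Both options default to `None` when omitted.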

Online Targets

Jitenbot scrapes the target website and saves the pages to the user cache directory. As a courtesy to the website owners, jitenbot is configured to pause for 10 seconds between page requests. Consequently, a complete crawl of a target website may take several days.
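
The crawl behavior described above can be sketched as a loop that saves each page to a cache directory and sleeps between network requests. This is a minimal illustration, not jitenbot's actual code: the names `crawl` and `cache_name`, the hashed file names, and the choice to skip pages that are already cached are all assumptions. The `fetch` and `sleep` callables are passed in so the loop can be exercised without network access.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable, Iterable

DELAY_SECONDS = 10  # pause between page requests, as stated in the README


def cache_name(url: str) -> str:
    """Map a URL to a file name (hypothetical naming scheme for this sketch)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"


def crawl(
    urls: Iterable[str],
    cache_dir: Path,
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
    delay: float = DELAY_SECONDS,
) -> int:
    """Save each uncached page to cache_dir; return the number of requests made."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    requests_made = 0
    for url in urls:
        cache_file = cache_dir / cache_name(url)
        if cache_file.exists():
            # Assumption: a cached page needs no request and therefore no pause.
            continue
        if requests_made > 0:
            # Pause only between consecutive network requests.
            sleep(delay)
        cache_file.write_text(fetch(url), encoding="utf-8")
        requests_made += 1
    return requests_made
```

To see why a crawl can run for days: at 10 seconds per request, a hypothetical site of 20,000 pages needs about 200,000 seconds of waiting, or roughly 2.3 days, before counting download time.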

HTTP request headers (user agent string, etc.) may be customized by editing the config.json file created in the user config directory.
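
As an illustration of how header customization of this kind might work, the sketch below reads a JSON file and overlays its header values on a set of defaults. The key name `http-request-headers`, the default user agent string, and the function names are all hypothetical. The actual layout of jitenbot's config.json is not described in this README and may differ.

```python
import json
from pathlib import Path
from urllib.request import Request

# Placeholder defaults for this sketch; not jitenbot's real values.
DEFAULT_HEADERS = {"User-Agent": "jitenbot-example/0.0"}


def load_headers(config_path: Path) -> dict[str, str]:
    """Return default headers overlaid with any headers found in the config file."""
    headers = dict(DEFAULT_HEADERS)
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
        # "http-request-headers" is an assumed key name, used for illustration only.
        headers.update(config.get("http-request-headers", {}))
    return headers


def build_request(url: str, config_path: Path) -> Request:
    """Build (but do not send) a request that carries the configured headers."""
    return Request(url, headers=load_headers(config_path))
```

Falling back to defaults when the file is missing keeps the sketch usable before any config file has been created.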

Offline Targets

Page data and image data must be procured by the user and passed to jitenbot via the `--page-dir` and `--image-dir` command line flags, respectively.

Attribution

Adobe-Japan1_sequences.txt is provided by The Adobe-Japan1-7 Character Collection.