Web crawler for creating personal copies of Japanese dictionaries

jitenbot

Jitenbot is a program for scraping Japanese dictionary websites and compiling the scraped data into compact dictionary file formats.

Supported Dictionaries

Supported Output Formats

Usage

usage: jitenbot [-h] [-p PAGE_DIR] [-i IMAGE_DIR]
                {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}

Convert Japanese dictionary files to new formats.

positional arguments:
  {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
                        name of dictionary to convert

options:
  -h, --help            show this help message and exit
  -p PAGE_DIR, --page-dir PAGE_DIR
                        path to directory containing XML page files
  -i IMAGE_DIR, --image-dir IMAGE_DIR
                        path to directory containing image folders (gaiji,
                        graphics, etc.)

Online Targets

For online targets, jitenbot scrapes the target website and saves the pages to the user's cache directory. As a courtesy to the website owners, jitenbot is configured to pause for 10 seconds between page requests. Consequently, a complete crawl of a target website may take several hours.
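
The crawl behavior described above (cache each page, pause between requests) can be illustrated with a minimal sketch. This is not jitenbot's actual implementation: the function names, the hashed cache file layout, and the injectable `download` and `sleep` parameters are assumptions made for illustration. Only the 10-second delay comes from this README.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen

# The 10-second pause is the value stated in this README.
REQUEST_DELAY_SECONDS = 10


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory.

    Hashing the URL is an assumption; jitenbot may name cache files differently.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def fetch_page(url, cache_dir, delay=REQUEST_DELAY_SECONDS,
               download=None, sleep=time.sleep):
    """Return the page body, using the cache when possible.

    A network request is made only on a cache miss, and the polite
    pause happens only after a real request.
    """
    path = cache_path(Path(cache_dir), url)
    if path.exists():
        return path.read_bytes()

    if download is None:
        # Default downloader: a plain standard-library HTTP GET.
        def download(target_url):
            with urlopen(target_url) as response:
                return response.read()

    data = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    sleep(delay)  # courtesy pause before the caller requests the next page
    return data
```

Because cached pages skip both the request and the pause, an interrupted crawl can resume without re-downloading earlier pages. The delay also explains the long run time: at 10 seconds per request, a hypothetical site of 1,000 uncached pages needs about 2.8 hours of waiting alone.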

Offline Targets

For offline targets, page data and image data must be supplied by the user and passed to jitenbot via the --page-dir and --image-dir command line flags.
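
As a rough sketch of how this requirement could be enforced, the following reproduces the command line interface from the Usage section with `argparse` and rejects offline targets that lack either directory. It is not jitenbot's actual code. In particular, treating `smk8` and `daijirin2` as the offline targets is an assumption (this README does not say which targets are offline), and `parse_args` and `OFFLINE_TARGETS` are hypothetical names.

```python
import argparse

TARGETS = ["jitenon-yoji", "jitenon-kotowaza", "smk8", "daijirin2"]

# Assumption: these are the offline targets. The README does not state this.
OFFLINE_TARGETS = {"smk8", "daijirin2"}


def build_parser():
    """Build a parser matching the usage text shown in the Usage section."""
    parser = argparse.ArgumentParser(
        prog="jitenbot",
        description="Convert Japanese dictionary files to new formats.",
    )
    parser.add_argument("target", choices=TARGETS,
                        help="name of dictionary to convert")
    parser.add_argument("-p", "--page-dir",
                        help="path to directory containing XML page files")
    parser.add_argument("-i", "--image-dir",
                        help="path to directory containing image folders "
                             "(gaiji, graphics, etc.)")
    return parser


def parse_args(argv):
    """Parse arguments and require both directories for offline targets."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.target in OFFLINE_TARGETS:
        missing = [flag for flag, value in (("--page-dir", args.page_dir),
                                            ("--image-dir", args.image_dir))
                   if value is None]
        if missing:
            # parser.error prints the message and exits with status 2.
            parser.error(f"{args.target} requires {' and '.join(missing)}")
    return args
```

Under these assumptions, an offline conversion would be started with something like `jitenbot smk8 --page-dir ./pages --image-dir ./images`, where both paths are placeholders for the user's own data directories.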

Attribution

Adobe-Japan1_sequences.txt is provided by The Adobe-Japan1-7 Character Collection.