Web crawler for creating personal copies of Japanese dictionaries


stephenmk 8868383a08 Organize crawler logic into classes		2023-04-22 17:56:52 -05:00
bot	Organize crawler logic into classes	2023-04-22 17:56:52 -05:00
data	Move Yomichan index and tag metadata to data file	2023-04-22 14:14:28 -05:00
.gitignore	First version	2023-04-07 22:05:36 -05:00
jitenbot.py	Organize crawler logic into classes	2023-04-22 17:56:52 -05:00
LICENSE	Initial commit	2023-04-07 16:37:51 -05:00
README.md	Update README.md	2023-04-11 16:27:21 -05:00
requirements.txt	Use full version of sudachi dictionary	2023-04-22 12:09:36 -05:00

README.md

jitenbot

Jitenbot is a program for scraping Japanese dictionary websites and converting the scraped data into structured dictionary files.

Target Websites

Export Formats

Yomichan

Usage

Add your desired HTTP request headers to config.json and ensure that all requirements listed in requirements.txt are installed.
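The README does not show the layout of config.json, so the following is only a sketch of how such a file might be read. It assumes the file is a flat JSON object mapping header names to values; both that layout and the helper name load_headers are hypothetical and are not taken from jitenbot itself.

```python
import json


def load_headers(config_path):
    """Read HTTP request headers from a JSON config file.

    ASSUMPTION: the file is a flat JSON object such as
    {"User-Agent": "..."}; jitenbot's real config.json layout may differ.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object of header names and values")
    # Coerce keys and values to strings, since HTTP headers are textual.
    return {str(name): str(value) for name, value in config.items()}
```

Under this assumption, the returned dictionary could be passed to whichever HTTP client performs the requests.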

jitenbot [-h] {all,jitenon-yoji,jitenon-kotowaza}

positional arguments:
  {all,jitenon-yoji,jitenon-kotowaza}
                        website to crawl

options:
  -h, --help            show this help message and exit

Scraped webpages are written to a webcache directory. Each page may be as large as 100 KiB, and a single dictionary may include thousands of pages. Ensure that adequate disk space is available.
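To make the disk space warning concrete, here is a rough upper-bound calculation. The 100 KiB figure comes from the text above; the page count of 5000 is a made-up example, and the function names are illustrative rather than part of jitenbot.

```python
import shutil

KIB = 1024
MAX_PAGE_KIB = 100  # upper bound per page, as stated in this README


def max_cache_bytes(page_count, max_page_kib=MAX_PAGE_KIB):
    """Worst-case size of the webcache if every page hits the upper bound."""
    return page_count * max_page_kib * KIB


def has_room(directory, page_count):
    """Check whether the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= max_cache_bytes(page_count)


if __name__ == "__main__":
    pages = 5000  # hypothetical dictionary size
    needed_mib = max_cache_bytes(pages) / (1024 * 1024)
    print(f"{pages} pages need at most {needed_mib:.1f} MiB")
    print("enough free space here:", has_room(".", pages))
```

For the hypothetical 5000-page dictionary, the worst case comes to just under 500 MiB; actual usage will be lower if most pages are smaller than the bound.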

Jitenbot will pause for at least 10 seconds between web requests. Depending upon the size of the target dictionary, it may take hours or days to finish scraping.
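The effect of the pause can be sketched as follows. This is not jitenbot's own code: the Throttle class and min_crawl_time function are illustrative names, and only the 10 second minimum comes from the text above. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time
from datetime import timedelta

MIN_DELAY_SECONDS = 10  # minimum pause between requests, per this README


def min_crawl_time(page_count, delay=MIN_DELAY_SECONDS):
    """Lower bound on crawl time: n requests are separated by n - 1 pauses."""
    return timedelta(seconds=max(page_count - 1, 0) * delay)


class Throttle:
    """Block until at least `delay` seconds have passed since the last call."""

    def __init__(self, delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self._delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self._delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


if __name__ == "__main__":
    # A hypothetical 5000-page dictionary needs roughly 14 hours at minimum.
    print(min_crawl_time(5000))
```

Calling wait() immediately before each request keeps the gap between requests at or above the minimum, and the estimate excludes download and processing time, so real crawls take longer.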

Exported dictionary files will be saved in an output directory.