Web crawler for creating personal copies of Japanese dictionaries


stephenmk 7c40dafd52 Adjust padding style for child link lists in mdict		2023-07-12 20:27:47 -05:00
bot	Redesign search key logic for mdict	2023-07-12 19:02:07 -05:00
data	Adjust padding style for child link lists in mdict	2023-07-12 20:27:47 -05:00
tests	Add tests for `Expressions` functions	2023-05-06 20:07:07 -05:00
.gitignore	Add export support for the MDict dictionary format	2023-07-08 16:49:03 -05:00
jitenbot.py	Add export support for the MDict dictionary format	2023-07-08 16:49:03 -05:00
LICENSE	Initial commit	2023-04-07 16:37:51 -05:00
README.md	Update GoldenDict example in README.md	2023-07-09 14:11:15 -05:00
requirements.txt	Add export support for the MDict dictionary format	2023-07-08 16:49:03 -05:00
run_all.sh	Add export support for the MDict dictionary format	2023-07-08 16:49:03 -05:00
TODO.md	Add export support for the MDict dictionary format	2023-07-08 16:49:03 -05:00

README.md

jitenbot

Jitenbot is a program for scraping Japanese dictionary websites and compiling the scraped data into compact dictionary file formats.

Supported Dictionaries

Web Dictionaries
- 国語辞典オンライン (Jitenon Kokugo)
- 四字熟語辞典オンライン (Jitenon Yoji)
- 故事・ことわざ・慣用句オンライン (Jitenon Kotowaza)

Monokakido ("辞書 by 物書堂")
- 新明解国語辞典第八版 (Shinmeikai 8e)
- 大辞林第四版 (Daijirin 4e)

Supported Output Formats

- Yomichan
- MDict (.MDX & .MDD)

Examples

Jitenon Kokugo (web | yomichan)

Jitenon Yoji (web | yomichan)

Jitenon Kotowaza (web | yomichan)

Shinmeikai 8e (print | yomichan)

Daijirin 4e (print | yomichan)

Various (GoldenDict)

Usage

usage: jitenbot [-h] [-p PAGE_DIR] [-m MEDIA_DIR] [-i MDICT_ICON]
                [--no-yomichan-export] [--no-mdict-export]
                {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}

Convert Japanese dictionary files to new formats.

positional arguments:
  {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
                        name of dictionary to convert

options:
  -h, --help            show this help message and exit
  -p PAGE_DIR, --page-dir PAGE_DIR
                        path to directory containing XML page files
  -m MEDIA_DIR, --media-dir MEDIA_DIR
                        path to directory containing media folders (gaiji,
                        graphics, audio, etc.)
  -i MDICT_ICON, --mdict-icon MDICT_ICON
                        path to icon file to be used with MDict
  --no-yomichan-export  skip export of dictionary data to Yomichan format
  --no-mdict-export     skip export of dictionary data to MDict format

See README.md for details regarding media directory structures
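
The help text above maps naturally onto Python's standard `argparse` module. The following is a minimal sketch of how such an interface could be declared, assuming nothing about how jitenbot.py actually builds its parser; the function name `build_parser` and the destination name `target` are hypothetical.

```python
import argparse

# Dictionary names copied from the usage text above.
TARGETS = ["jitenon-kokugo", "jitenon-yoji", "jitenon-kotowaza", "smk8", "daijirin2"]


def build_parser():
    """Declare the flags shown in the usage text (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="jitenbot",
        description="Convert Japanese dictionary files to new formats.",
        epilog="See README.md for details regarding media directory structures",
    )
    parser.add_argument("target", choices=TARGETS,
                        help="name of dictionary to convert")
    parser.add_argument("-p", "--page-dir",
                        help="path to directory containing XML page files")
    parser.add_argument("-m", "--media-dir",
                        help="path to directory containing media folders")
    parser.add_argument("-i", "--mdict-icon",
                        help="path to icon file to be used with MDict")
    parser.add_argument("--no-yomichan-export", action="store_true",
                        help="skip export of dictionary data to Yomichan format")
    parser.add_argument("--no-mdict-export", action="store_true",
                        help="skip export of dictionary data to MDict format")
    return parser


# Example: convert smk8 from local files and skip the MDict export.
args = build_parser().parse_args(
    ["smk8", "-p", "pages", "-m", "media", "--no-mdict-export"]
)
```

With this argument list, `args.target` is `"smk8"`, `args.no_mdict_export` is `True`, and the unused `args.mdict_icon` defaults to `None`.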

Web Targets

Jitenbot scrapes the target website and saves the pages to the user cache directory. As a courtesy to the website owners, jitenbot pauses for 10 seconds between page requests, so a complete crawl of a target website may take several days.
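
The crawl-and-cache behavior described above can be sketched as follows. This is an illustration under stated assumptions, not jitenbot's actual crawler: the names `fetch_page` and `crawl`, the hash-based cache file naming, and the rule that cached pages are skipped without a pause are all hypothetical. The fetcher and the sleep function are parameters so that the loop can be exercised without touching the network.

```python
import hashlib
import tempfile
import time
from pathlib import Path
from urllib.request import Request, urlopen


def fetch_page(url, headers=None):
    """Download one page and return its raw bytes."""
    request = Request(url, headers=headers or {})
    with urlopen(request) as response:
        return response.read()


def crawl(urls, cache_dir, delay=10.0, fetch=fetch_page, sleep=time.sleep):
    """Save each URL to cache_dir, pausing `delay` seconds between requests.

    Pages already in the cache are skipped without a request or a pause.
    Returns the number of pages actually fetched.
    """
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    fetched = 0
    for url in urls:
        # Hypothetical cache naming scheme: one file per URL hash.
        target = cache / hashlib.sha256(url.encode("utf-8")).hexdigest()
        if target.exists():
            continue
        if fetched > 0:
            sleep(delay)  # courtesy pause before every request after the first
        target.write_bytes(fetch(url))
        fetched += 1
    return fetched


# Dry run with a fake fetcher and a recording sleep; no network access.
pauses = []
with tempfile.TemporaryDirectory() as tmp:
    urls = [
        "https://example.invalid/1",
        "https://example.invalid/2",
        "https://example.invalid/3",
    ]
    first_run = crawl(urls, tmp, fetch=lambda u: u.encode("utf-8"), sleep=pauses.append)
    second_run = crawl(urls, tmp, fetch=lambda u: u.encode("utf-8"), sleep=pauses.append)
```

At one request every 10 seconds, a crawler fetches at most 8,640 pages per day, which is why a site with tens of thousands of entries takes several days. Caching also means an interrupted crawl can resume without requesting the same pages again.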

HTTP request headers (user agent string, etc.) may be customized by editing the config.json file created in the user config directory.
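
As a rough illustration of how header overrides from a JSON config file might be merged with defaults, consider the sketch below. The key name `http-request-headers`, the default header values, and the function `load_headers` are assumptions made for this example; check the generated config.json for the keys jitenbot really uses.

```python
import json
from pathlib import Path

# Placeholder defaults; jitenbot's real defaults may differ.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0"}


def load_headers(config_path):
    """Return default headers overridden by those found in the config file."""
    headers = dict(DEFAULT_HEADERS)
    path = Path(config_path)
    if not path.is_file():
        return headers  # no config yet: fall back to the defaults
    with path.open(encoding="utf-8") as f:
        config = json.load(f)
    # Hypothetical key name; values from the file win over the defaults.
    headers.update(config.get("http-request-headers", {}))
    return headers
```

Copying the defaults before updating keeps `DEFAULT_HEADERS` unchanged, so repeated calls with different config files do not leak settings into one another.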

Monokakido Targets

Page data and media data must be procured by the user and passed to jitenbot via the appropriate command line flags.

smk8 media directory

Since Yomichan does not support audio files from imported dictionaries, the audio/ directory may be omitted to reduce the size of the output ZIP file if desired.

media
├── Audio.png
├── audio
│   ├── 00001.aac
│   ├── 00002.aac
│   ├── 00003.aac
│   │   ...
│   └── 82682.aac
└── gaiji
    ├── 1d110.svg
    ├── 1d15d.svg
    ├── 1d15e.svg
    │   ...
    └── xbunnoa.svg

daijirin2 media directory

The graphics/ directory may be omitted to save space if desired.

media
├── gaiji
│   ├── 1D10B.svg
│   ├── 1D110.svg
│   ├── 1D12A.svg
│   │   ...
│   └── vectorOB.svg
└── graphics
    ├── 3djr_0002.png
    ├── 3djr_0004.png
    ├── 3djr_0005.png
    │   ...
    └── 4djr_yahazu.png
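
Before starting a long conversion, it can help to check that a media directory matches the trees shown above. The helper below is a minimal sketch; `check_media_dir` and `MEDIA_LAYOUT` are hypothetical names, and treating gaiji/ as required is an assumption (the text above only states that audio/ and graphics/ may be omitted). The Audio.png file in the smk8 tree is not checked.

```python
from pathlib import Path

# Subdirectory names come from the trees above; the required/optional
# split is an assumption made for this sketch.
MEDIA_LAYOUT = {
    "smk8": {"required": ["gaiji"], "optional": ["audio"]},
    "daijirin2": {"required": ["gaiji"], "optional": ["graphics"]},
}


def check_media_dir(target, media_dir):
    """Return (missing, omitted) subdirectory names for the given target.

    `missing` lists required subdirectories that are absent;
    `omitted` lists optional subdirectories that were left out.
    """
    root = Path(media_dir)
    layout = MEDIA_LAYOUT[target]
    missing = [name for name in layout["required"] if not (root / name).is_dir()]
    omitted = [name for name in layout["optional"] if not (root / name).is_dir()]
    return missing, omitted
```

For example, `check_media_dir("daijirin2", "media")` returns `([], ["graphics"])` when media/ contains gaiji/ but no graphics/ directory.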

Attribution

Adobe-Japan1_sequences.txt is provided by the Adobe-Japan1-7 Character Collection.