From c23db8c50eac92af612037b87513aa08863e66d9 Mon Sep 17 00:00:00 2001
From: Stephen Kraus <8003332+stephenmk@users.noreply.github.com>
Date: Mon, 1 May 2023 18:23:05 -0500
Subject: [PATCH] Create README.md

---
 README.md | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..30fb8af
--- /dev/null
+++ b/README.md
@@ -0,0 +1,47 @@
+# jitenbot
+Jitenbot is a program for scraping Japanese dictionary websites and
+compiling the scraped data into compact dictionary file formats.
+
+### Supported Dictionaries
+* Online
+  * [四字熟語辞典オンライン](https://yoji.jitenon.jp/)
+  * [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/)
+* Offline
+  * [新明解国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/smk8/index.html)
+  * [大辞林 第四版](https://www.monokakido.jp/ja/dictionaries/daijirin2/index.html)
+
+
+### Supported Output Formats
+
+* [Yomichan](https://github.com/foosoft/yomichan)
+
+# Usage
+```
+usage: jitenbot [-h] [-p PAGE_DIR] [-i IMAGE_DIR]
+                {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
+
+Convert Japanese dictionary files to new formats.
+
+positional arguments:
+  {jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
+                        name of dictionary to convert
+
+options:
+  -h, --help            show this help message and exit
+  -p PAGE_DIR, --page-dir PAGE_DIR
+                        path to directory containing XML page files
+  -i IMAGE_DIR, --image-dir IMAGE_DIR
+                        path to directory containing image folders (gaiji,
+                        graphics, etc.)
+
+```
+### Online Targets
+Jitenbot will scrape the target website and save the pages to the [user's cache directory](https://pypi.org/project/platformdirs/).
+As a courtesy to the website owners, jitenbot is configured to pause for 10 seconds between each page request. Consequently,
+a complete crawl of a target website may take several hours.
+
+### Offline Targets
+Page data and image data must be supplied by the user and passed to jitenbot via the appropriate command line flags.
+
+# Attribution
+`Adobe-Japan1_sequences.txt` is provided by [The Adobe-Japan1-7 Character Collection](https://github.com/adobe-type-tools/Adobe-Japan1).
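As a reading aid for the "Online Targets" section of the README above, here is a minimal Python sketch of the behavior it describes: each page is saved to a cache directory, and the crawler pauses for 10 seconds after every real page request. This is not jitenbot's actual code. The names `fetch_page`, `cache_path`, `download`, and `REQUEST_DELAY_SECONDS` are hypothetical, and naming cache files by a SHA-256 hash of the URL is an assumption made only for illustration.

```python
import hashlib
import time
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

# The README states a 10-second pause between page requests.
REQUEST_DELAY_SECONDS = 10


def cache_path(cache_dir: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory (assumed scheme)."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def download(url: str) -> bytes:
    """Fetch the raw bytes of a page over the network."""
    with urlopen(url) as response:
        return response.read()


def fetch_page(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes] = download,
    delay: float = REQUEST_DELAY_SECONDS,
    sleep: Callable[[float], object] = time.sleep,
) -> bytes:
    """Return the page for `url`, downloading it only if it is not cached."""
    path = cache_path(cache_dir, url)
    if path.exists():
        # Cache hit: no network request, so no courtesy pause is needed.
        return path.read_bytes()
    data = fetch(url)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    # Pause after every real request so the next one is spaced out.
    sleep(delay)
    return data
```

In this sketch the cache directory is a plain argument so that the example needs only the standard library; a real tool would presumably obtain it from the `platformdirs` package that the README links to. The delay also explains the crawl time the README mentions: with a hypothetical 1,000 uncached pages, the pauses alone add up to 10,000 seconds, or roughly 2.8 hours, while a repeated run over already cached pages would skip both the requests and the pauses.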