From bc692f6c5ae50f9c93cfe61cccd64a981885a1e1 Mon Sep 17 00:00:00 2001
From: Stephen Kraus <8003332+stephenmk@users.noreply.github.com>
Date: Tue, 11 Apr 2023 14:12:55 -0500
Subject: [PATCH] Create README.md

---
 README.md | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..1e631d1
--- /dev/null
+++ b/README.md
@@ -0,0 +1,35 @@
+# jitenbot
+Jitenbot is a program for scraping Japanese dictionary websites and converting the scraped data into structured dictionary files.
+
+### Target Websites
+
+* [四字熟語辞典オンライン](https://yoji.jitenon.jp/)
+* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/)
+
+### Export Formats
+
+* [Yomichan](https://github.com/foosoft/yomichan)
+
+# Usage
+Add your desired HTTP request headers to [config.json](https://github.com/stephenmk/jitenbot/blob/main/config.json)
+and ensure that all [requirements](https://github.com/stephenmk/jitenbot/blob/main/requirements.txt)
+are installed.
+
+```
+jitenbot [-h] {all,jitenon-yoji,jitenon-kotowaza}
+
+positional arguments:
+  {all,jitenon-yoji,jitenon-kotowaza}
+                        website to crawl
+
+options:
+  -h, --help            show this help message and exit
+```
+
+Scraped webpages are written to a `webcache` directory. Each page may be as large as a megabyte,
+and a single dictionary may include thousands of pages. Ensure that adequate disk space is available.
+
+Jitenbot will pause for at least 10 seconds between each web request. Depending upon the size of
+the target dictionary, it may take hours or days to finish scraping.
+
+Exported dictionary files will be saved in an `output` directory.
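
The README describes two behaviors that work together: scraped pages are kept in a `webcache` directory, and there is a pause of at least 10 seconds between web requests. The following is a minimal sketch of that pattern, not jitenbot's actual code. The class name `ThrottledCache`, the injected `download` callable, and the SHA-256 file naming scheme are all assumptions made for illustration.

```python
import hashlib
import time
from pathlib import Path


class ThrottledCache:
    """Cache-first page fetcher with a minimum pause between downloads.

    Hypothetical helper: the names and the on-disk layout are assumptions,
    not taken from the jitenbot source.
    """

    def __init__(self, download, cache_dir="webcache", min_delay=10.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.download = download          # callable: url -> page text
        self.cache_dir = Path(cache_dir)
        self.min_delay = min_delay        # seconds between real requests
        self.clock = clock
        self.sleep = sleep
        self._last_request = None         # time of the last real download

    def _cache_path(self, url):
        # Assumed naming scheme: one file per URL, keyed by its SHA-256 hash.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.html"

    def get(self, url):
        path = self._cache_path(url)
        if path.exists():
            # Cached pages are returned immediately, with no pause.
            return path.read_text(encoding="utf-8")

        # Wait out whatever remains of the minimum delay since the last download.
        if self._last_request is not None:
            remaining = self.min_delay - (self.clock() - self._last_request)
            if remaining > 0:
                self.sleep(remaining)

        html = self.download(url)
        self._last_request = self.clock()

        self.cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(html, encoding="utf-8")
        return html
```

Because only uncached pages trigger a download, an interrupted crawl could resume from the cache without paying the delay again for pages already on disk. Passing `download`, `clock`, and `sleep` as parameters keeps the sketch independent of any particular HTTP library and makes the timing logic easy to exercise without real waiting.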