Create README.md

This commit is contained in:
Stephen Kraus 2023-04-11 14:12:55 -05:00 committed by GitHub
parent 1b89a3542c
commit bc692f6c5a
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23

35
README.md Normal file
View file

@ -0,0 +1,35 @@
# jitenbot
Jitenbot is a program for scraping Japanese dictionary websites and converting the scraped data into structured dictionary files.
### Target Websites
* [四字熟語辞典オンライン](https://yoji.jitenon.jp/)
* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/)
### Export Formats
* [Yomichan](https://github.com/foosoft/yomichan)
# Usage
Add your desired HTTP request headers to [config.json](https://github.com/stephenmk/jitenbot/blob/main/config.json)
and ensure that all [requirements](https://github.com/stephenmk/jitenbot/blob/main/requirements.txt)
are installed.
```
jitenbot [-h] {all,jitenon-yoji,jitenon-kotowaza}
positional arguments:
{all,jitenon-yoji,jitenon-kotowaza}
website to crawl
options:
-h, --help show this help message and exit
```
Scraped webpages are written to a `webcache` directory. Each page may be as large as a megabyte,
and a single dictionary may include thousands of pages. Ensure that adequate disk space is available.
Jitenbot will pause for at least 10 seconds between each web request. Depending upon the size of
the target dictionary, it make take hours or days to finish scraping.
Exported dictionary files will be saved in an `output` directory.