Create README.md
parent 1b89a3542c
commit bc692f6c5a
35
README.md
Normal file
# jitenbot
Jitenbot is a program for scraping Japanese dictionary websites and converting the scraped data into structured dictionary files.

### Target Websites

* [四字熟語辞典オンライン](https://yoji.jitenon.jp/)
* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/)

### Export Formats

* [Yomichan](https://github.com/foosoft/yomichan)

# Usage
Add your desired HTTP request headers to [config.json](https://github.com/stephenmk/jitenbot/blob/main/config.json)
and ensure that all [requirements](https://github.com/stephenmk/jitenbot/blob/main/requirements.txt)
are installed.

```
jitenbot [-h] {all,jitenon-yoji,jitenon-kotowaza}

positional arguments:
  {all,jitenon-yoji,jitenon-kotowaza}
                        website to crawl

options:
  -h, --help            show this help message and exit
```
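
The help text above has the shape that Python's standard `argparse` module generates for a single positional argument restricted to a fixed set of choices. The following is a minimal sketch of a parser with an equivalent interface, not jitenbot's actual source. The names `TARGETS`, `build_parser`, and the attribute name `target` are assumptions made for illustration.

```python
import argparse

# Choices copied from the help text; the constant name is hypothetical.
TARGETS = ["all", "jitenon-yoji", "jitenon-kotowaza"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts exactly one crawl target."""
    parser = argparse.ArgumentParser(prog="jitenbot")
    # With `choices` set and no metavar, argparse displays the argument
    # as {all,jitenon-yoji,jitenon-kotowaza} in the help output.
    # The attribute name "target" is an assumption, not taken from jitenbot.
    parser.add_argument("target", choices=TARGETS, help="website to crawl")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Selected target: {args.target}")
```

Any argument outside the three listed choices is rejected by `argparse` with a usage error before any crawling would start.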
Scraped webpages are written to a `webcache` directory. Each page may be as large as a megabyte,
and a single dictionary may include thousands of pages. Ensure that adequate disk space is available.

Jitenbot will pause for at least 10 seconds between web requests. Depending upon the size of
the target dictionary, it may take hours or days to finish scraping.

Exported dictionary files will be saved in an `output` directory.
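
For planning purposes, the disk and time figures above can be turned into a rough estimate. The sketch below uses only the numbers stated in this README (a pause of at least 10 seconds between requests, pages of up to one megabyte). The function name `estimate_crawl` and the page count of 5000 are hypothetical; actual page counts depend on the target dictionary.

```python
from dataclasses import dataclass


@dataclass
class CrawlEstimate:
    pages: int
    min_hours: float    # lower bound: pauses only, download time ignored
    max_disk_mb: float  # upper bound: every page at the maximum size


def estimate_crawl(pages: int,
                   delay_seconds: float = 10.0,
                   max_page_mb: float = 1.0) -> CrawlEstimate:
    """Estimate the minimum crawl time and maximum cache size."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # One pause sits between each pair of consecutive requests.
    pauses = max(pages - 1, 0)
    min_hours = pauses * delay_seconds / 3600
    return CrawlEstimate(pages, min_hours, pages * max_page_mb)


if __name__ == "__main__":
    # 5000 is an illustrative page count, not a measured value.
    est = estimate_crawl(5000)
    print(f"{est.pages} pages: at least {est.min_hours:.1f} hours, "
          f"up to {est.max_disk_mb:.0f} MB in webcache")
```

Because the pause is a minimum and the page size is a maximum, the time is a lower bound and the disk figure an upper bound.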