Compare commits
30 commits
Commits included in this comparison:
630b529287, bd5b7a91d9, 22f35ffe7f, 4c97c52534, 09b585c49d, b03978d1f7,
8f30f9419d, d37c3aca5b, a5bb8d6f40, 4eb7e12f37, 7b2ba96db9, 9b3fdc86d1,
775da669b9, 0cd530585f, 94c68b4e26, 28dbf039d5, d6044e0c12, cfe1e98ab3,
4ccb97f088, 14e50fb4f4, e85d0a1625, b0a9ab5cae, dbf0cf0eb8, 4d6c3c3cf5,
4e06482657, fd8d304726, 7c40dafd52, d51de0b3dc, c9ab0aea46, 4cd81cda35

README.md (157 lines changed)
@@ -4,12 +4,13 @@ compiling the scraped data into compact dictionary file formats.

 ### Supported Dictionaries
 * Web Dictionaries
-* [国語辞典オンライン](https://kokugo.jitenon.jp/) (Jitenon Kokugo)
-* [四字熟語辞典オンライン](https://yoji.jitenon.jp/) (Jitenon Yoji)
-* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/) (Jitenon Kotowaza)
-* Monokakido (["辞書 by 物書堂"](https://www.monokakido.jp/ja/dictionaries/app/))
-* [新明解国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/smk8/index.html) (Shinmeikai 8e)
-* [大辞林 第四版](https://www.monokakido.jp/ja/dictionaries/daijirin2/index.html) (Daijirin 4e)
+* [国語辞典オンライン](https://kokugo.jitenon.jp/) (`jitenon-kokugo`)
+* [四字熟語辞典オンライン](https://yoji.jitenon.jp/) (`jitenon-yoji`)
+* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/) (`jitenon-kotowaza`)
+* Monokakido
+* [新明解国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/smk8/index.html) (`smk8`)
+* [大辞林 第四版](https://www.monokakido.jp/ja/dictionaries/daijirin2/index.html) (`daijirin2`)
+* [三省堂国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/sankoku8/index.html) (`sankoku8`)

 ### Supported Output Formats
@@ -48,6 +49,12 @@ compiling the scraped data into compact dictionary file formats.

 ![daijirin2](https://user-images.githubusercontent.com/8003332/235578700-9dbf4fb0-0154-48b5-817c-8fe75e442afc.png)
 </details>

+<details>
+<summary>Sanseidō 8e (print | yomichan)</summary>
+
+![sankoku8](https://github.com/stephenmk/jitenbot/assets/8003332/0358b3fc-71fb-4557-977c-1976a12229ec)
+</details>
+
 <details>
 <summary>Various (GoldenDict)</summary>
@@ -57,13 +64,14 @@ compiling the scraped data into compact dictionary file formats.

 # Usage
 ```
 usage: jitenbot [-h] [-p PAGE_DIR] [-m MEDIA_DIR] [-i MDICT_ICON]
-                [--no-yomichan-export] [--no-mdict-export]
-                {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
+                [--no-mdict-export] [--no-yomichan-export]
+                [--validate-yomichan-terms]
+                {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}

 Convert Japanese dictionary files to new formats.

 positional arguments:
-  {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
+  {jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}
                         name of dictionary to convert

 options:

@@ -75,10 +83,14 @@ options:
                         graphics, audio, etc.)
   -i MDICT_ICON, --mdict-icon MDICT_ICON
                         path to icon file to be used with MDict
-  --no-yomichan-export  skip export of dictionary data to Yomichan format
   --no-mdict-export     skip export of dictionary data to MDict format
+  --no-yomichan-export  skip export of dictionary data to Yomichan format
+  --validate-yomichan-terms
+                        validate JSON structure of exported Yomichan
+                        dictionary terms

 See README.md for details regarding media directory structures
 ```

 ### Web Targets
 Jitenbot will scrape the target website and save the pages to the [user cache directory](https://pypi.org/project/platformdirs/).
@@ -89,55 +101,112 @@ HTTP request headers (user agent string, etc.) may be customized by editing the
 [user config directory](https://pypi.org/project/platformdirs/).

 ### Monokakido Targets
-Page data and media data must be [procured by the user](https://github.com/golddranks/monokakido/)
-and passed to jitenbot via the appropriate command line flags.
+These digital dictionaries are available for purchase through the [Monokakido Dictionaries app](https://www.monokakido.jp/ja/dictionaries/app/) on MacOS/iOS. Under ideal circumstances, Jitenbot would be able to automatically fetch all the data it needs from this app's data directory[^1] on your system. In its current state of development, Jitenbot unfortunately requires you to find and assemble the necessary data yourself. The files must be organized into a particular folder structure (defined below) and then passed to Jitenbot via the corresponding command line arguments.
+
+Some of the folders in the app's data directory[^1] contain encoded files that must be unencoded using [golddranks' monokakido tool](https://github.com/golddranks/monokakido/). These folders are indicated by a reference mark (※) in the notes below.
+
+[^1]: `/Library/Application Support/AppStoreContent/jp.monokakido.Dictionaries/Products/`
 <details>
-<summary>smk8 media directory</summary>
+<summary>smk8 files</summary>

-Since Yomichan does not support audio files from imported
-dictionaries, the `audio/` directory may be omitted to save filesize
-space in the output ZIP file if desired.
+Since Yomichan does not support audio files from imported dictionaries, the `audio/` directory may be omitted to save filesize space in the output ZIP file if desired.

 ```
-media
-├── Audio.png
-├── audio
-│ ├── 00001.aac
-│ ├── 00002.aac
-│ ├── 00003.aac
-│ │ ...
-│ └── 82682.aac
-└── gaiji
-├── 1d110.svg
-├── 1d15d.svg
-├── 1d15e.svg
-│ ...
-└── xbunnoa.svg
+.
+├── media
+│ ├── audio (※)
+│ │ ├── 00001.aac
+│ │ ├── 00002.aac
+│ │ ├── 00003.aac
+│ │ ├── ...
+│ │ └── 82682.aac
+│ ├── Audio.png
+│ └── gaiji
+│ ├── 1d110.svg
+│ ├── 1d15d.svg
+│ ├── 1d15e.svg
+│ ├── ...
+│ └── xbunnoa.svg
+└── pages (※)
+├── 0000000000.xml
+├── 0000000001.xml
+├── 0000000002.xml
+├── ...
+└── 0000064581.xml
 ```
 </details>
 <details>
-<summary>daijirin2 media directory</summary>
+<summary>daijirin2 files</summary>

 The `graphics/` directory may be omitted to save space if desired.

 ```
-media
-├── gaiji
-│ ├── 1D10B.svg
-│ ├── 1D110.svg
-│ ├── 1D12A.svg
-│ │ ...
-│ └── vectorOB.svg
-└── graphics
-├── 3djr_0002.png
-├── 3djr_0004.png
-├── 3djr_0005.png
-│ ...
-└── 4djr_yahazu.png
+.
+├── media
+│ ├── gaiji
+│ │ ├── 1D10B.svg
+│ │ ├── 1D110.svg
+│ │ ├── 1D12A.svg
+│ │ ├── ...
+│ │ └── vectorOB.svg
+│ └── graphics (※)
+│ ├── 3djr_0002.png
+│ ├── 3djr_0004.png
+│ ├── 3djr_0005.png
+│ ├── ...
+│ └── 4djr_yahazu.png
+└── pages (※)
+├── 0000000001.xml
+├── 0000000002.xml
+├── 0000000003.xml
+├── ...
+└── 0000182633.xml
 ```
 </details>
+<details>
+<summary>sankoku8 files</summary>
+
+```
+.
+├── media
+│ ├── graphics
+│ │ ├── 000chouchou.png
+│ │ ├── ...
+│ │ └── 888udatsu.png
+│ ├── svg-accent
+│ │ ├── アクセント.svg
+│ │ └── 平板.svg
+│ ├── svg-frac
+│ │ ├── frac-1-2.svg
+│ │ ├── ...
+│ │ └── frac-a-b.svg
+│ ├── svg-gaiji
+│ │ ├── aiaigasa.svg
+│ │ ├── ...
+│ │ └── 異体字_西.svg
+│ ├── svg-intonation
+│ │ ├── 上昇下降.svg
+│ │ ├── ...
+│ │ └── 長.svg
+│ ├── svg-logo
+│ │ ├── denshi.svg
+│ │ ├── ...
+│ │ └── 重要語.svg
+│ └── svg-special
+│ └── 区切り線.svg
+└── pages (※)
+├── 0000000001.xml
+├── ...
+└── 0000065457.xml
+```
+</details>
+
 # Attribution
 `Adobe-Japan1_sequences.txt` is provided by [The Adobe-Japan1-7 Character Collection](https://github.com/adobe-type-tools/Adobe-Japan1).

 The Yomichan term-bank schema definition `dictionary-term-bank-v3-schema.json` is provided by the [Yomichan](https://github.com/foosoft/yomichan) project.

 Many thanks to [epistularum](https://github.com/epistularum) for providing thoughtful feedback regarding the implementation of the MDict export functionality.
TODO.md (7 lines changed)

@@ -1,10 +1,13 @@
 ### Todo

 - [x] Add factory classes to reduce the amount of class import statements
+- [x] Add dynamic import functionality to factory classes to reduce boilerplate
 - [x] Support exporting to MDict (.MDX) dictionary format
+- [x] Validate JSON schema of Yomichan terms during export
+- [ ] Add support for monokakido search keys from index files
+- [ ] Delete unneeded media from temp build directory before final export
 - [ ] Add test suite
 - [ ] Add documentation (docstrings, etc.)
-- [ ] Validate JSON schema of Yomichan terms during export
 - [ ] Add build scripts for producing program binaries
 - [ ] Validate scraped webpages after downloading
 - [ ] Log non-fatal failures to a log file instead of raising exceptions

@@ -13,7 +16,7 @@
 - [ ] [Yoji-Jukugo.com](https://yoji-jukugo.com/)
 - [ ] [実用日本語表現辞典](https://www.weblio.jp/cat/dictionary/jtnhj)
 - [ ] Support more Monokakido dictionaries
-- [ ] 三省堂国語辞典 第8版 (SANKOKU8)
+- [x] 三省堂国語辞典 第8版 (SANKOKU8)
 - [ ] 精選版 日本国語大辞典 (NDS)
 - [ ] 大辞泉 第2版 (DAIJISEN2)
 - [ ] 明鏡国語辞典 第3版 (MK3)
bot/crawlers/base/crawler.py (new file, 54 lines)

```python
import re
from abc import ABC, abstractmethod

from bot.factory import new_entry
from bot.factory import new_yomichan_exporter
from bot.factory import new_mdict_exporter


class BaseCrawler(ABC):
    def __init__(self, target):
        self._target = target
        self._page_map = {}
        self._entries = []
        self._page_id_pattern = None

    @abstractmethod
    def collect_pages(self, page_dir):
        raise NotImplementedError

    def read_pages(self):
        pages_len = len(self._page_map)
        items = self._page_map.items()
        for idx, (page_id, page_path) in enumerate(items):
            update = f"\tReading page {idx+1}/{pages_len}"
            print(update, end='\r', flush=True)
            entry = new_entry(self._target, page_id)
            with open(page_path, "r", encoding="utf-8") as f:
                page = f.read()
            try:
                entry.set_page(page)
            except ValueError as err:
                print(err)
                print("Try deleting and redownloading file:")
                print(f"\t{page_path}\n")
                continue
            self._entries.append(entry)
        print()

    def make_yomichan_dictionary(self, media_dir, validate):
        exporter = new_yomichan_exporter(self._target)
        exporter.export(self._entries, media_dir, validate)

    def make_mdict_dictionary(self, media_dir, icon_file):
        exporter = new_mdict_exporter(self._target)
        exporter.export(self._entries, media_dir, icon_file)

    def _parse_page_id(self, page_link):
        m = re.search(self._page_id_pattern, page_link)
        if m is None:
            return None
        page_id = int(m.group(1))
        if page_id in self._page_map:
            return None
        return page_id
```
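To show how these pieces fit together, here is a minimal usage sketch of the conversion pipeline driven through a concrete crawler subclass. Only the methods shown above (`collect_pages`, `read_pages`, `make_yomichan_dictionary`, `make_mdict_dictionary`) are taken from the diff; the `run` helper and the literal paths are hypothetical.

```python
# Hypothetical usage sketch; not part of the diff.
from bot.crawlers.smk8 import Crawler   # concrete subclass added in this changeset
from bot.targets import Targets


def run(page_dir, media_dir, icon_file, validate=True):
    crawler = Crawler(Targets.SMK8)
    crawler.collect_pages(page_dir)                 # build the page-id -> file-path map
    crawler.read_pages()                            # parse each page file into an Entry
    crawler.make_yomichan_dictionary(media_dir, validate)
    crawler.make_mdict_dictionary(media_dir, icon_file)


# run("./pages", "./media", "./icon.png")  # illustrative paths only
```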
bot/crawlers/base/jitenon.py (new file, 30 lines)

```python
from bs4 import BeautifulSoup

from bot.time import timestamp
from bot.crawlers.scrapers.jitenon import Jitenon as JitenonScraper
from bot.crawlers.base.crawler import BaseCrawler


class JitenonCrawler(BaseCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = None

    def collect_pages(self, page_dir):
        print(f"{timestamp()} Scraping {self._gojuon_url}")
        jitenon = JitenonScraper()
        gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
        gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
        for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
            gojuon_href = gojuon_a['href']
            kana_doc, _ = jitenon.scrape(gojuon_href)
            kana_soup = BeautifulSoup(kana_doc, features="html.parser")
            for kana_a in kana_soup.select(".word_box a", href=True):
                page_link = kana_a['href']
                page_id = self._parse_page_id(page_link)
                if page_id is None:
                    continue
                _, page_path = jitenon.scrape(page_link)
                self._page_map[page_id] = page_path
        pages_len = len(self._page_map)
        print(f"\n{timestamp()} Found {pages_len} entry pages")
```
bot/crawlers/base/monokakido.py (new file, 20 lines)

```python
import os
from bot.time import timestamp
from bot.crawlers.base.crawler import BaseCrawler


class MonokakidoCrawler(BaseCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._page_id_pattern = r"^([0-9]+)\.xml$"

    def collect_pages(self, page_dir):
        print(f"{timestamp()} Searching for page files in `{page_dir}`")
        for pagefile in os.listdir(page_dir):
            page_id = self._parse_page_id(pagefile)
            if page_id is None or page_id == 0:
                continue
            path = os.path.join(page_dir, pagefile)
            self._page_map[page_id] = path
        pages_len = len(self._page_map)
        print(f"{timestamp()} Found {pages_len} page files for processing")
```
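A quick illustration of the filename filter above (an illustrative snippet, not part of the diff):

```python
import re

pattern = r"^([0-9]+)\.xml$"                            # same pattern as MonokakidoCrawler
print(re.search(pattern, "0000000123.xml").group(1))    # "0000000123" -> page id 123 after int()
print(re.search(pattern, "index.csv"))                  # None: non-page files are skipped
```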
Deleted file (154 lines):

```python
import os
import re
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup

import bot.crawlers.scraper as Scraper
from bot.entries.factory import new_entry
from bot.yomichan.exporters.factory import new_yomi_exporter
from bot.mdict.exporters.factory import new_mdict_exporter


class Crawler(ABC):
    def __init__(self, target):
        self._target = target
        self._page_map = {}
        self._entries = []
        self._page_id_pattern = None

    @abstractmethod
    def collect_pages(self, page_dir):
        pass

    def read_pages(self):
        pages_len = len(self._page_map)
        items = self._page_map.items()
        for idx, (page_id, page_path) in enumerate(items):
            update = f"Reading page {idx+1}/{pages_len}"
            print(update, end='\r', flush=True)
            entry = new_entry(self._target, page_id)
            with open(page_path, "r", encoding="utf-8") as f:
                page = f.read()
            try:
                entry.set_page(page)
            except ValueError as err:
                print(err)
                print("Try deleting and redownloading file:")
                print(f"\t{page_path}\n")
                continue
            self._entries.append(entry)
        print()

    def make_yomichan_dictionary(self, media_dir):
        exporter = new_yomi_exporter(self._target)
        exporter.export(self._entries, media_dir)

    def make_mdict_dictionary(self, media_dir, icon_file):
        exporter = new_mdict_exporter(self._target)
        exporter.export(self._entries, media_dir, icon_file)

    def _parse_page_id(self, page_link):
        m = re.search(self._page_id_pattern, page_link)
        if m is None:
            return None
        page_id = int(m.group(1))
        if page_id in self._page_map:
            return None
        return page_id


class JitenonKokugoCrawler(Crawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://kokugo.jitenon.jp/cat/gojuonindex.php"
        self._page_id_pattern = r"word/p([0-9]+)$"

    def collect_pages(self, page_dir):
        jitenon = Scraper.Jitenon()
        gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
        gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
        for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
            gojuon_href = gojuon_a['href']
            max_kana_page = 1
            current_kana_page = 1
            while current_kana_page <= max_kana_page:
                kana_doc, _ = jitenon.scrape(f"{gojuon_href}&page={current_kana_page}")
                current_kana_page += 1
                kana_soup = BeautifulSoup(kana_doc, features="html.parser")
                page_total = kana_soup.find(class_="page_total").text
                m = re.search(r"全([0-9]+)件", page_total)
                if m:
                    max_kana_page = int(m.group(1))
                for kana_a in kana_soup.select(".word_box a", href=True):
                    page_link = kana_a['href']
                    page_id = self._parse_page_id(page_link)
                    if page_id is None:
                        continue
                    _, page_path = jitenon.scrape(page_link)
                    self._page_map[page_id] = page_path
        pages_len = len(self._page_map)
        print(f"Finished scraping {pages_len} pages")


class _JitenonCrawler(Crawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = None

    def collect_pages(self, page_dir):
        print("Scraping jitenon.jp")
        jitenon = Scraper.Jitenon()
        gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
        gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
        for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
            gojuon_href = gojuon_a['href']
            kana_doc, _ = jitenon.scrape(gojuon_href)
            kana_soup = BeautifulSoup(kana_doc, features="html.parser")
            for kana_a in kana_soup.select(".word_box a", href=True):
                page_link = kana_a['href']
                page_id = self._parse_page_id(page_link)
                if page_id is None:
                    continue
                _, page_path = jitenon.scrape(page_link)
                self._page_map[page_id] = page_path
        pages_len = len(self._page_map)
        print(f"Finished scraping {pages_len} pages")


class JitenonYojiCrawler(_JitenonCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://yoji.jitenon.jp/cat/gojuon.html"
        self._page_id_pattern = r"([0-9]+)\.html$"


class JitenonKotowazaCrawler(_JitenonCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://kotowaza.jitenon.jp/cat/gojuon.php"
        self._page_id_pattern = r"([0-9]+)\.php$"


class _MonokakidoCrawler(Crawler):
    def __init__(self, target):
        super().__init__(target)
        self._page_id_pattern = r"^([0-9]+)\.xml$"

    def collect_pages(self, page_dir):
        print(f"Searching for page files in `{page_dir}`")
        for pagefile in os.listdir(page_dir):
            page_id = self._parse_page_id(pagefile)
            if page_id is None or page_id == 0:
                continue
            path = os.path.join(page_dir, pagefile)
            self._page_map[page_id] = path
        pages_len = len(self._page_map)
        print(f"Found {pages_len} page files for processing")


class Smk8Crawler(_MonokakidoCrawler):
    pass


class Daijirin2Crawler(_MonokakidoCrawler):
    pass
```
bot/crawlers/daijirin2.py (new file, 5 lines)

```python
from bot.crawlers.base.monokakido import MonokakidoCrawler


class Crawler(MonokakidoCrawler):
    pass
```
Deleted file (18 lines):

```python
from bot.targets import Targets

from bot.crawlers.crawlers import JitenonKokugoCrawler
from bot.crawlers.crawlers import JitenonYojiCrawler
from bot.crawlers.crawlers import JitenonKotowazaCrawler
from bot.crawlers.crawlers import Smk8Crawler
from bot.crawlers.crawlers import Daijirin2Crawler


def new_crawler(target):
    crawler_map = {
        Targets.JITENON_KOKUGO: JitenonKokugoCrawler,
        Targets.JITENON_YOJI: JitenonYojiCrawler,
        Targets.JITENON_KOTOWAZA: JitenonKotowazaCrawler,
        Targets.SMK8: Smk8Crawler,
        Targets.DAIJIRIN2: Daijirin2Crawler,
    }
    return crawler_map[target](target)
```
bot/crawlers/jitenon_kokugo.py (new file, 40 lines)

```python
import re
from bs4 import BeautifulSoup

from bot.time import timestamp
from bot.crawlers.base.crawler import BaseCrawler
from bot.crawlers.scrapers.jitenon import Jitenon as JitenonScraper


class Crawler(BaseCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://kokugo.jitenon.jp/cat/gojuonindex.php"
        self._page_id_pattern = r"word/p([0-9]+)$"

    def collect_pages(self, page_dir):
        print(f"{timestamp()} Scraping {self._gojuon_url}")
        jitenon = JitenonScraper()
        gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
        gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
        for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
            gojuon_href = gojuon_a['href']
            max_kana_page = 1
            current_kana_page = 1
            while current_kana_page <= max_kana_page:
                kana_doc, _ = jitenon.scrape(f"{gojuon_href}&page={current_kana_page}")
                current_kana_page += 1
                kana_soup = BeautifulSoup(kana_doc, features="html.parser")
                page_total = kana_soup.find(class_="page_total").text
                m = re.search(r"全([0-9]+)件", page_total)
                if m:
                    max_kana_page = int(m.group(1))
                for kana_a in kana_soup.select(".word_box a", href=True):
                    page_link = kana_a['href']
                    page_id = self._parse_page_id(page_link)
                    if page_id is None:
                        continue
                    _, page_path = jitenon.scrape(page_link)
                    self._page_map[page_id] = page_path
        pages_len = len(self._page_map)
        print(f"\n{timestamp()} Found {pages_len} entry pages")
```
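A small illustration of the pagination probe above (not part of the diff): the crawler reads the 全…件 counter from each kana index page and uses it as the bound for its `&page=` loop.

```python
import re

page_total = "全3件"                        # sample text from a .page_total element (illustrative)
m = re.search(r"全([0-9]+)件", page_total)
max_kana_page = int(m.group(1)) if m else 1
print(max_kana_page)                        # 3 -> the while-loop visits ...&page=1 through &page=3
```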
bot/crawlers/jitenon_kotowaza.py (new file, 8 lines)

```python
from bot.crawlers.base.jitenon import JitenonCrawler


class Crawler(JitenonCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://kotowaza.jitenon.jp/cat/gojuon.php"
        self._page_id_pattern = r"([0-9]+)\.php$"
```
bot/crawlers/jitenon_yoji.py (new file, 8 lines)

```python
from bot.crawlers.base.jitenon import JitenonCrawler


class Crawler(JitenonCrawler):
    def __init__(self, target):
        super().__init__(target)
        self._gojuon_url = "https://yoji.jitenon.jp/cat/gojuon.html"
        self._page_id_pattern = r"([0-9]+)\.html$"
```
bot/crawlers/sankoku8.py (new file, 5 lines)

```python
from bot.crawlers.base.monokakido import MonokakidoCrawler


class Crawler(MonokakidoCrawler):
    pass
```
bot/crawlers/scrapers/jitenon.py (new file, 10 lines)

```python
import re
from bot.crawlers.scrapers.scraper import BaseScraper


class Jitenon(BaseScraper):
    def _get_netloc_re(self):
        domain = r"jitenon\.jp"
        pattern = r"^(?:([A-Za-z0-9.\-]+)\.)?" + domain + r"$"
        netloc_re = re.compile(pattern)
        return netloc_re
```
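For illustration (not part of the diff), the compiled pattern accepts the bare domain and any subdomain of jitenon.jp, which is how one scraper class can serve all of the Jitenon sites:

```python
import re

domain = r"jitenon\.jp"
netloc_re = re.compile(r"^(?:([A-Za-z0-9.\-]+)\.)?" + domain + r"$")
print(bool(netloc_re.match("kokugo.jitenon.jp")))   # True
print(bool(netloc_re.match("yoji.jitenon.jp")))     # True
print(bool(netloc_re.match("example.com")))         # False
```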
@@ -1,24 +1,28 @@
 import time
-import requests
 import re
 import os
 import hashlib
 import random
 import math
 from datetime import datetime
-from pathlib import Path
-
-from platformdirs import user_cache_dir
 from urllib.parse import urlparse
+from pathlib import Path
+from abc import ABC, abstractmethod
+
+import requests
+from requests.adapters import HTTPAdapter
+from requests.packages.urllib3.util.retry import Retry
+from platformdirs import user_cache_dir
+
+from bot.time import timestamp
 from bot.data import load_config


-class Scraper():
+class BaseScraper(ABC):
     def __init__(self):
+        self.cache_count = 0
         self._config = load_config()
-        pattern = r"^(?:([A-Za-z0-9.\-]+)\.)?" + self.domain + r"$"
-        self.netloc_re = re.compile(pattern)
+        self.netloc_re = self._get_netloc_re()
         self.__set_session()

     def scrape(self, urlstring):

@@ -31,9 +35,14 @@ class Scraper():
             with open(cache_path, "w", encoding="utf-8") as f:
                 f.write(html)
         else:
-            print("Discovering cached files...", end='\r', flush=True)
+            self.cache_count += 1
+            print(f"\tDiscovering cached file {self.cache_count}", end='\r', flush=True)
         return html, cache_path

+    @abstractmethod
+    def _get_netloc_re(self):
+        raise NotImplementedError
+
     def __set_session(self):
         retry_strategy = Retry(
             total=3,

@@ -87,21 +96,14 @@
     def __get(self, urlstring):
         delay = 10
         time.sleep(delay)
-        now = datetime.now().strftime("%H:%M:%S")
-        print(f"{now} scraping {urlstring} ...", end='')
+        print(f"{timestamp()} Scraping {urlstring} ...", end='')
         try:
             response = self.session.get(urlstring, timeout=10)
-            print("OK")
+            print(f"{timestamp()} OK")
             return response.text
-        except Exception:
-            print("failed")
-            print("resetting session and trying again")
+        except Exception as ex:
+            print(f"\tFailed: {str(ex)}")
+            print(f"{timestamp()} Resetting session and trying again")
             self.__set_session()
             response = self.session.get(urlstring, timeout=10)
             return response.text
-
-
-class Jitenon(Scraper):
-    def __init__(self):
-        self.domain = r"jitenon\.jp"
-        super().__init__()
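The new imports (HTTPAdapter and Retry) suggest the session is rebuilt with automatic retries; the following is a minimal, illustrative sketch of that pattern, not the project's actual `__set_session` body. Only `Retry(total=3, ...)` appears in the diff; the mount prefixes and any other retry settings are assumptions.

```python
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry


def make_session():
    # Sketch of the __set_session idea shown in the diff; details are assumed.
    retry_strategy = Retry(total=3)              # total=3 is taken from the diff
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)           # assumed mount prefixes
    session.mount("http://", adapter)
    return session
```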
bot/crawlers/smk8.py (new file, 5 lines)

```python
from bot.crawlers.base.monokakido import MonokakidoCrawler


class Crawler(MonokakidoCrawler):
    pass
```
bot/data.py (49 lines changed)

@@ -37,14 +37,16 @@ def load_config():

 @cache
 def load_yomichan_inflection_categories():
-    file_name = os.path.join("yomichan", "inflection_categories.json")
+    file_name = os.path.join(
+        "yomichan", "inflection_categories.json")
     data = __load_json(file_name)
     return data


 @cache
 def load_yomichan_metadata():
-    file_name = os.path.join("yomichan", "index.json")
+    file_name = os.path.join(
+        "yomichan", "index.json")
     data = __load_json(file_name)
     return data

@@ -53,31 +55,21 @@ def load_yomichan_metadata():
 def load_variant_kanji():
     def loader(data, row):
         data[row[0]] = row[1]
-    file_name = os.path.join("entries", "variant_kanji.csv")
+    file_name = os.path.join(
+        "entries", "variant_kanji.csv")
     data = {}
     __load_csv(file_name, loader, data)
     return data


 @cache
-def load_smk8_phrase_readings():
+def load_phrase_readings(target):
     def loader(data, row):
         entry_id = (int(row[0]), int(row[1]))
         reading = row[2]
         data[entry_id] = reading
-    file_name = os.path.join("entries", "smk8", "phrase_readings.csv")
-    data = {}
-    __load_csv(file_name, loader, data)
-    return data
-
-
-@cache
-def load_daijirin2_phrase_readings():
-    def loader(data, row):
-        entry_id = (int(row[0]), int(row[1]))
-        reading = row[2]
-        data[entry_id] = reading
-    file_name = os.path.join("entries", "daijirin2", "phrase_readings.csv")
+    file_name = os.path.join(
+        "entries", target.value, "phrase_readings.csv")
     data = {}
     __load_csv(file_name, loader, data)
     return data

@@ -92,7 +84,8 @@ def load_daijirin2_kana_abbreviations():
             if abbr.strip() != "":
                 abbreviations.append(abbr)
         data[entry_id] = abbreviations
-    file_name = os.path.join("entries", "daijirin2", "kana_abbreviations.csv")
+    file_name = os.path.join(
+        "entries", "daijirin2", "kana_abbreviations.csv")
     data = {}
     __load_csv(file_name, loader, data)
     return data

@@ -100,14 +93,24 @@ def load_daijirin2_kana_abbreviations():

 @cache
 def load_yomichan_name_conversion(target):
-    file_name = os.path.join("yomichan", "name_conversion", f"{target.value}.json")
+    file_name = os.path.join(
+        "yomichan", "name_conversion", f"{target.value}.json")
     data = __load_json(file_name)
     return data


+@cache
+def load_yomichan_term_schema():
+    file_name = os.path.join(
+        "yomichan", "dictionary-term-bank-v3-schema.json")
+    schema = __load_json(file_name)
+    return schema
+
+
 @cache
 def load_mdict_name_conversion(target):
-    file_name = os.path.join("mdict", "name_conversion", f"{target.value}.json")
+    file_name = os.path.join(
+        "mdict", "name_conversion", f"{target.value}.json")
     data = __load_json(file_name)
     return data

@@ -131,7 +134,8 @@ def __load_adobe_glyphs():
             data[code].append(character)
         else:
             data[code] = [character]
-    file_name = os.path.join("entries", "adobe", "Adobe-Japan1_sequences.txt")
+    file_name = os.path.join(
+        "entries", "adobe", "Adobe-Japan1_sequences.txt")
     data = {}
     __load_csv(file_name, loader, data, delim=';')
     return data

@@ -139,7 +143,8 @@ def __load_adobe_glyphs():

 @cache
 def __load_override_adobe_glyphs():
-    file_name = os.path.join("entries", "adobe", "override_glyphs.json")
+    file_name = os.path.join(
+        "entries", "adobe", "override_glyphs.json")
     json_data = __load_json(file_name)
     data = {}
     for key, val in json_data.items():
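The two per-dictionary loaders are folded into a single `load_phrase_readings(target)` that builds the CSV path from `target.value`. A brief, illustrative sketch of the resulting call site follows; the example target is taken from the codebase, while the sample key shown in the comment is hypothetical.

```python
# Illustrative only: how the generalized loader is used after this change.
from bot.data import load_phrase_readings
from bot.targets import Targets

readings = load_phrase_readings(Targets.DAIJIRIN2)
# Per the loader above, keys are (int, int) tuples built from the first two CSV
# columns and values are reading strings, e.g. readings[(182633, 1)] (hypothetical key).
```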
@@ -18,15 +18,15 @@ class Entry(ABC):

     @abstractmethod
     def get_global_identifier(self):
-        pass
+        raise NotImplementedError

     @abstractmethod
     def set_page(self, page):
-        pass
+        raise NotImplementedError

     @abstractmethod
     def get_page_soup(self):
-        pass
+        raise NotImplementedError

     def get_headwords(self):
         if self._headwords is not None:

@@ -38,15 +38,15 @@ class Entry(ABC):

     @abstractmethod
     def _get_headwords(self):
-        pass
+        raise NotImplementedError

     @abstractmethod
     def _add_variant_expressions(self, headwords):
-        pass
+        raise NotImplementedError

     @abstractmethod
     def get_part_of_speech_tags(self):
-        pass
+        raise NotImplementedError

     def get_parent(self):
         if self.entry_id in self.SUBENTRY_ID_TO_ENTRY_ID:
@@ -31,11 +31,14 @@ def add_fullwidth(expressions):

 def add_variant_kanji(expressions):
     variant_kanji = load_variant_kanji()
-    for old_kanji, new_kanji in variant_kanji.items():
+    for kyuuji, shinji in variant_kanji.items():
         new_exps = []
         for expression in expressions:
-            if old_kanji in expression:
-                new_exp = expression.replace(old_kanji, new_kanji)
+            if kyuuji in expression:
+                new_exp = expression.replace(kyuuji, shinji)
                 new_exps.append(new_exp)
+            if shinji in expression:
+                new_exp = expression.replace(shinji, kyuuji)
+                new_exps.append(new_exp)
         for new_exp in new_exps:
             if new_exp not in expressions:

@@ -85,40 +88,3 @@ def expand_abbreviation_list(expressions):
             if new_exp not in new_exps:
                 new_exps.append(new_exp)
     return new_exps
-
-
-def expand_smk_alternatives(text):
-    """Return a list of strings described by △ notation."""
-    m = re.search(r"△([^(]+)(([^(]+))", text)
-    if m is None:
-        return [text]
-    alt_parts = [m.group(1)]
-    for alt_part in m.group(2).split("・"):
-        alt_parts.append(alt_part)
-    alts = []
-    for alt_part in alt_parts:
-        alt_exp = re.sub(r"△[^(]+([^(]+)", alt_part, text)
-        alts.append(alt_exp)
-    return alts
-
-
-def expand_daijirin_alternatives(text):
-    """Return a list of strings described by = notation."""
-    group_pattern = r"([^=]+)(=([^(]+)(=([^(]+)))?"
-    groups = re.findall(group_pattern, text)
-    expressions = [""]
-    for group in groups:
-        new_exps = []
-        for expression in expressions:
-            new_exps.append(expression + group[0])
-        expressions = new_exps.copy()
-        if group[1] == "":
-            continue
-        new_exps = []
-        for expression in expressions:
-            new_exps.append(expression + group[2])
-        for expression in expressions:
-            for alt in group[3].split("・"):
-                new_exps.append(expression + alt)
-        expressions = new_exps.copy()
-    return expressions
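The loop now expands variant kanji in both directions, so a kyuuji (old-form) spelling also generates its shinji (new-form) counterpart and vice versa. A small illustrative example follows; the 體/体 pair is an assumed sample entry, since the actual contents of variant_kanji.csv are not shown in the diff.

```python
# Illustrative sketch of the bidirectional expansion; not part of the diff.
variant_kanji = {"體": "体"}     # assumed sample row: kyuuji -> shinji
expressions = ["身體"]

for kyuuji, shinji in variant_kanji.items():
    new_exps = []
    for expression in expressions:
        if kyuuji in expression:
            new_exps.append(expression.replace(kyuuji, shinji))
        if shinji in expression:
            new_exps.append(expression.replace(shinji, kyuuji))
    for new_exp in new_exps:
        if new_exp not in expressions:
            expressions.append(new_exp)

print(expressions)   # ['身體', '身体'] -- each spelling now yields the other
```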
@@ -3,11 +3,11 @@ from abc import abstractmethod
 from datetime import datetime, date
 from bs4 import BeautifulSoup

-from bot.entries.entry import Entry
-import bot.entries.expressions as Expressions
+from bot.entries.base.entry import Entry
+import bot.entries.base.expressions as Expressions


-class _JitenonEntry(Entry):
+class JitenonEntry(Entry):
     def __init__(self, target, entry_id):
         super().__init__(target, entry_id)
         self.expression = ""

@@ -58,7 +58,7 @@ class _JitenonEntry(Entry):

     @abstractmethod
     def _get_column_map(self):
-        pass
+        raise NotImplementedError

     def __set_modified_date(self, page):
         m = re.search(r"\"dateModified\": \"(\d{4}-\d{2}-\d{2})", page)

@@ -140,104 +140,3 @@ class _JitenonEntry(Entry):
         elif isinstance(attr_val, list):
             colvals.append(";".join(attr_val))
         return ",".join(colvals)
-
-
-class JitenonYojiEntry(_JitenonEntry):
-    def __init__(self, target, entry_id):
-        super().__init__(target, entry_id)
-        self.origin = ""
-        self.kanken_level = ""
-        self.category = ""
-        self.related_expressions = []
-
-    def _get_column_map(self):
-        return {
-            "四字熟語": "expression",
-            "読み方": "yomikata",
-            "意味": "definition",
-            "異形": "other_forms",
-            "出典": "origin",
-            "漢検級": "kanken_level",
-            "場面用途": "category",
-            "類義語": "related_expressions",
-        }
-
-    def _add_variant_expressions(self, headwords):
-        for expressions in headwords.values():
-            Expressions.add_variant_kanji(expressions)
-
-
-class JitenonKotowazaEntry(_JitenonEntry):
-    def __init__(self, target, entry_id):
-        super().__init__(target, entry_id)
-        self.origin = ""
-        self.example = ""
-        self.related_expressions = []
-
-    def _get_column_map(self):
-        return {
-            "言葉": "expression",
-            "読み方": "yomikata",
-            "意味": "definition",
-            "異形": "other_forms",
-            "出典": "origin",
-            "例文": "example",
-            "類句": "related_expressions",
-        }
-
-    def _get_headwords(self):
-        if self.expression == "金棒引き・鉄棒引き":
-            headwords = {
-                "かなぼうひき": ["金棒引き", "鉄棒引き"]
-            }
-        else:
-            headwords = super()._get_headwords()
-        return headwords
-
-    def _add_variant_expressions(self, headwords):
-        for expressions in headwords.values():
-            Expressions.add_variant_kanji(expressions)
-            Expressions.add_fullwidth(expressions)
-
-
-class JitenonKokugoEntry(_JitenonEntry):
-    def __init__(self, target, entry_id):
-        super().__init__(target, entry_id)
-        self.example = ""
-        self.alt_expression = ""
-        self.antonym = ""
-        self.attachments = ""
-        self.compounds = ""
-        self.related_words = ""
-
-    def _get_column_map(self):
-        return {
-            "言葉": "expression",
-            "読み方": "yomikata",
-            "意味": "definition",
-            "例文": "example",
-            "別表記": "alt_expression",
-            "対義語": "antonym",
-            "活用": "attachments",
-            "用例": "compounds",
-            "類語": "related_words",
-        }
-
-    def _get_headwords(self):
-        headwords = {}
-        for reading in self.yomikata.split("・"):
-            if reading not in headwords:
-                headwords[reading] = []
-            for expression in self.expression.split("・"):
-                headwords[reading].append(expression)
-            if self.alt_expression.strip() != "":
-                for expression in self.alt_expression.split("・"):
-                    headwords[reading].append(expression)
-        return headwords
-
-    def _add_variant_expressions(self, headwords):
-        for expressions in headwords.values():
-            Expressions.add_variant_kanji(expressions)
-            Expressions.add_fullwidth(expressions)
-            Expressions.remove_iteration_mark(expressions)
-            Expressions.add_iteration_mark(expressions)
bot/entries/base/sanseido_entry.py (new file, 60 lines)

```python
from abc import abstractmethod
from bs4 import BeautifulSoup

from bot.entries.base.entry import Entry
import bot.entries.base.expressions as Expressions


class SanseidoEntry(Entry):
    def set_page(self, page):
        page = self._decompose_subentries(page)
        self._page = page

    def get_page_soup(self):
        soup = BeautifulSoup(self._page, "xml")
        return soup

    def get_global_identifier(self):
        parent_part = format(self.entry_id[0], '06')
        child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
        return f"@{self.target.value}-{parent_part}-{child_part}"

    def _decompose_subentries(self, page):
        soup = BeautifulSoup(page, features="xml")
        for x in self._get_subentry_parameters():
            subentry_class, tags, subentry_list = x
            for tag in tags:
                tag_soup = soup.find(tag)
                while tag_soup is not None:
                    tag_soup.name = "項目"
                    subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
                    self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
                    subentry = subentry_class(self.target, subentry_id)
                    page = tag_soup.decode()
                    subentry.set_page(page)
                    subentry_list.append(subentry)
                    tag_soup.decompose()
                    tag_soup = soup.find(tag)
        return soup.decode()

    @abstractmethod
    def _get_subentry_parameters(self):
        raise NotImplementedError

    def _add_variant_expressions(self, headwords):
        for expressions in headwords.values():
            Expressions.add_variant_kanji(expressions)
            Expressions.add_fullwidth(expressions)
            Expressions.remove_iteration_mark(expressions)
            Expressions.add_iteration_mark(expressions)

    @staticmethod
    def id_string_to_entry_id(id_string):
        parts = id_string.split("-")
        if len(parts) == 1:
            return (int(parts[0]), 0)
        elif len(parts) == 2:
            # subentries have a hexadecimal part
            return (int(parts[0]), int(parts[1], 16))
        else:
            raise Exception(f"Invalid entry ID: {id_string}")
```
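For illustration (the specific id strings below are made up), the helper converts an XML `id` attribute into a `(page, subentry)` tuple, parsing the optional suffix as hexadecimal:

```python
from bot.entries.base.sanseido_entry import SanseidoEntry

# Illustrative values only; real id strings come from the dictionary XML.
print(SanseidoEntry.id_string_to_entry_id("0000001234"))        # (1234, 0)
print(SanseidoEntry.id_string_to_entry_id("0000001234-00AF"))   # (1234, 175)
```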
|
@ -1,231 +0,0 @@
|
|||
from bs4 import BeautifulSoup
|
||||
|
||||
import bot.entries.expressions as Expressions
|
||||
import bot.soup as Soup
|
||||
from bot.data import load_daijirin2_phrase_readings
|
||||
from bot.data import load_daijirin2_kana_abbreviations
|
||||
from bot.entries.entry import Entry
|
||||
from bot.entries.daijirin2_preprocess import preprocess_page
|
||||
|
||||
|
||||
class _BaseDaijirin2Entry(Entry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.children = []
|
||||
self.phrases = []
|
||||
self._kana_abbreviations = load_daijirin2_kana_abbreviations()
|
||||
|
||||
def get_global_identifier(self):
|
||||
parent_part = format(self.entry_id[0], '06')
|
||||
child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
|
||||
return f"@{self.target.value}-{parent_part}-{child_part}"
|
||||
|
||||
def set_page(self, page):
|
||||
page = self.__decompose_subentries(page)
|
||||
self._page = page
|
||||
|
||||
def get_page_soup(self):
|
||||
soup = BeautifulSoup(self._page, "xml")
|
||||
return soup
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
if self._part_of_speech_tags is not None:
|
||||
return self._part_of_speech_tags
|
||||
self._part_of_speech_tags = []
|
||||
soup = self.get_page_soup()
|
||||
for pos_group in soup.find_all("品詞G"):
|
||||
if pos_group.parent.name == "大語義":
|
||||
self._set_part_of_speech_tags(pos_group)
|
||||
return self._part_of_speech_tags
|
||||
|
||||
def _set_part_of_speech_tags(self, el):
|
||||
pos_names = ["品詞", "品詞活用", "品詞行", "用法"]
|
||||
for child in el.children:
|
||||
if child.name is not None:
|
||||
self._set_part_of_speech_tags(child)
|
||||
continue
|
||||
pos = str(child)
|
||||
if el.name not in pos_names:
|
||||
continue
|
||||
elif pos in ["[", "]"]:
|
||||
continue
|
||||
elif pos in self._part_of_speech_tags:
|
||||
continue
|
||||
else:
|
||||
self._part_of_speech_tags.append(pos)
|
||||
|
||||
def _get_regular_headwords(self, soup):
|
||||
self._fill_alts(soup)
|
||||
reading = soup.find("見出仮名").text
|
||||
expressions = []
|
||||
for el in soup.find_all("標準表記"):
|
||||
expression = self._clean_expression(el.text)
|
||||
if "—" in expression:
|
||||
kana_abbrs = self._kana_abbreviations[self.entry_id]
|
||||
for abbr in kana_abbrs:
|
||||
expression = expression.replace("—", abbr, 1)
|
||||
expressions.append(expression)
|
||||
expressions = Expressions.expand_abbreviation_list(expressions)
|
||||
if len(expressions) == 0:
|
||||
expressions.append(reading)
|
||||
headwords = {reading: expressions}
|
||||
return headwords
|
||||
|
||||
def _add_variant_expressions(self, headwords):
|
||||
for expressions in headwords.values():
|
||||
Expressions.add_variant_kanji(expressions)
|
||||
Expressions.add_fullwidth(expressions)
|
||||
Expressions.remove_iteration_mark(expressions)
|
||||
Expressions.add_iteration_mark(expressions)
|
||||
|
||||
def __decompose_subentries(self, page):
|
||||
soup = BeautifulSoup(page, features="xml")
|
||||
subentry_parameters = [
|
||||
[Daijirin2ChildEntry, ["子項目"], self.children],
|
||||
[Daijirin2PhraseEntry, ["句項目"], self.phrases],
|
||||
]
|
||||
for x in subentry_parameters:
|
||||
subentry_class, tags, subentry_list = x
|
||||
for tag in tags:
|
||||
tag_soup = soup.find(tag)
|
||||
while tag_soup is not None:
|
||||
tag_soup.name = "項目"
|
||||
subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
|
||||
self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
|
||||
subentry = subentry_class(self.target, subentry_id)
|
||||
page = tag_soup.decode()
|
||||
subentry.set_page(page)
|
||||
subentry_list.append(subentry)
|
||||
tag_soup.decompose()
|
||||
tag_soup = soup.find(tag)
|
||||
return soup.decode()
|
||||
|
||||
@staticmethod
|
||||
def id_string_to_entry_id(id_string):
|
||||
parts = id_string.split("-")
|
||||
if len(parts) == 1:
|
||||
return (int(parts[0]), 0)
|
||||
elif len(parts) == 2:
|
||||
# subentries have a hexadecimal part
|
||||
return (int(parts[0]), int(parts[1], 16))
|
||||
else:
|
||||
raise Exception(f"Invalid entry ID: {id_string}")
|
||||
|
||||
@staticmethod
|
||||
def _delete_unused_nodes(soup):
|
||||
"""Remove extra markup elements that appear in the entry
|
||||
headword line which are not part of the entry headword"""
|
||||
unused_nodes = [
|
||||
"漢字音logo", "活用分節", "連語句活用分節", "語構成",
|
||||
"表外字マーク", "表外字マーク", "ルビG"
|
||||
]
|
||||
for name in unused_nodes:
|
||||
Soup.delete_soup_nodes(soup, name)
|
||||
|
||||
@staticmethod
|
||||
def _clean_expression(expression):
|
||||
for x in ["〈", "〉", "《", "》", " "]:
|
||||
expression = expression.replace(x, "")
|
||||
return expression
|
||||
|
||||
@staticmethod
|
||||
def _fill_alts(soup):
|
||||
for gaiji in soup.find_all(class_="gaiji"):
|
||||
if gaiji.name == "img" and gaiji.has_attr("alt"):
|
||||
gaiji.name = "span"
|
||||
gaiji.string = gaiji.attrs["alt"]
|
||||
|
||||
|
||||
class Daijirin2Entry(_BaseDaijirin2Entry):
|
||||
def __init__(self, target, page_id):
|
||||
entry_id = (page_id, 0)
|
||||
super().__init__(target, entry_id)
|
||||
|
||||
def set_page(self, page):
|
||||
page = preprocess_page(page)
|
||||
super().set_page(page)
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
if soup.find("漢字見出") is not None:
|
||||
headwords = self._get_kanji_headwords(soup)
|
||||
elif soup.find("略語G") is not None:
|
||||
headwords = self._get_acronym_headwords(soup)
|
||||
else:
|
||||
headwords = self._get_regular_headwords(soup)
|
||||
return headwords
|
||||
|
||||
def _get_kanji_headwords(self, soup):
|
||||
readings = []
|
||||
for el in soup.find_all("漢字音"):
|
||||
hira = Expressions.kata_to_hira(el.text)
|
||||
readings.append(hira)
|
||||
if soup.find("漢字音") is None:
|
||||
readings.append("")
|
||||
expressions = []
|
||||
for el in soup.find_all("漢字見出"):
|
||||
expressions.append(el.text)
|
||||
headwords = {}
|
||||
for reading in readings:
|
||||
headwords[reading] = expressions
|
||||
return headwords
|
||||
|
||||
def _get_acronym_headwords(self, soup):
|
||||
expressions = []
|
||||
for el in soup.find_all("略語"):
|
||||
expression_parts = []
|
||||
for part in el.find_all(["欧字", "和字"]):
|
||||
expression_parts.append(part.text)
|
||||
expression = "".join(expression_parts)
|
||||
expressions.append(expression)
|
||||
headwords = {"": expressions}
|
||||
return headwords
|
||||
|
||||
|
||||
class Daijirin2ChildEntry(_BaseDaijirin2Entry):
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
headwords = self._get_regular_headwords(soup)
|
||||
return headwords
|
||||
|
||||
|
||||
class Daijirin2PhraseEntry(_BaseDaijirin2Entry):
|
||||
def get_part_of_speech_tags(self):
|
||||
# phrases do not contain these tags
|
||||
return []
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
headwords = {}
|
||||
expressions = self._find_expressions(soup)
|
||||
readings = self._find_readings()
|
||||
for idx, expression in enumerate(expressions):
|
||||
reading = readings[idx]
|
||||
if reading in headwords:
|
||||
headwords[reading].append(expression)
|
||||
else:
|
||||
headwords[reading] = [expression]
|
||||
return headwords
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
self._delete_unused_nodes(soup)
|
||||
text = soup.find("句表記").text
|
||||
text = self._clean_expression(text)
|
||||
alternatives = Expressions.expand_daijirin_alternatives(text)
|
||||
expressions = []
|
||||
for alt in alternatives:
|
||||
for exp in Expressions.expand_abbreviation(alt):
|
||||
expressions.append(exp)
|
||||
return expressions
|
||||
|
||||
def _find_readings(self):
|
||||
phrase_readings = load_daijirin2_phrase_readings()
|
||||
text = phrase_readings[self.entry_id]
|
||||
alternatives = Expressions.expand_daijirin_alternatives(text)
|
||||
readings = []
|
||||
for alt in alternatives:
|
||||
for reading in Expressions.expand_abbreviation(alt):
|
||||
readings.append(reading)
|
||||
return readings
|
88
bot/entries/daijirin2/base_entry.py
Normal file
88
bot/entries/daijirin2/base_entry.py
Normal file
|
@ -0,0 +1,88 @@
|
|||
import bot.soup as Soup
|
||||
from bot.data import load_daijirin2_kana_abbreviations
|
||||
from bot.entries.base.sanseido_entry import SanseidoEntry
|
||||
import bot.entries.base.expressions as Expressions
|
||||
|
||||
|
||||
class BaseEntry(SanseidoEntry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.children = []
|
||||
self.phrases = []
|
||||
self._kana_abbreviations = load_daijirin2_kana_abbreviations()
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
if self._part_of_speech_tags is not None:
|
||||
return self._part_of_speech_tags
|
||||
self._part_of_speech_tags = []
|
||||
soup = self.get_page_soup()
|
||||
for pos_group in soup.find_all("品詞G"):
|
||||
if pos_group.parent.name == "大語義":
|
||||
self._set_part_of_speech_tags(pos_group)
|
||||
return self._part_of_speech_tags
|
||||
|
||||
def _set_part_of_speech_tags(self, el):
|
||||
pos_names = ["品詞", "品詞活用", "品詞行", "用法"]
|
||||
for child in el.children:
|
||||
if child.name is not None:
|
||||
self._set_part_of_speech_tags(child)
|
||||
continue
|
||||
pos = str(child)
|
||||
if el.name not in pos_names:
|
||||
continue
|
||||
elif pos in ["[", "]"]:
|
||||
continue
|
||||
elif pos in self._part_of_speech_tags:
|
||||
continue
|
||||
else:
|
||||
self._part_of_speech_tags.append(pos)
|
||||
|
||||
def _get_regular_headwords(self, soup):
|
||||
self._fill_alts(soup)
|
||||
reading = soup.find("見出仮名").text
|
||||
expressions = []
|
||||
for el in soup.find_all("標準表記"):
|
||||
expression = self._clean_expression(el.text)
|
||||
if "—" in expression:
|
||||
kana_abbrs = self._kana_abbreviations[self.entry_id]
|
||||
for abbr in kana_abbrs:
|
||||
expression = expression.replace("—", abbr, 1)
|
||||
expressions.append(expression)
|
||||
expressions = Expressions.expand_abbreviation_list(expressions)
|
||||
if len(expressions) == 0:
|
||||
expressions.append(reading)
|
||||
headwords = {reading: expressions}
|
||||
return headwords
|
||||
|
||||
def _get_subentry_parameters(self):
|
||||
from bot.entries.daijirin2.child_entry import ChildEntry
|
||||
from bot.entries.daijirin2.phrase_entry import PhraseEntry
|
||||
subentry_parameters = [
|
||||
[ChildEntry, ["子項目"], self.children],
|
||||
[PhraseEntry, ["句項目"], self.phrases],
|
||||
]
|
||||
return subentry_parameters
|
||||
|
||||
@staticmethod
|
||||
def _delete_unused_nodes(soup):
|
||||
"""Remove extra markup elements that appear in the entry
|
||||
headword line which are not part of the entry headword"""
|
||||
unused_nodes = [
|
||||
"漢字音logo", "活用分節", "連語句活用分節", "語構成",
|
||||
"表外字マーク", "表外字マーク", "ルビG"
|
||||
]
|
||||
for name in unused_nodes:
|
||||
Soup.delete_soup_nodes(soup, name)
|
||||
|
||||
@staticmethod
|
||||
def _clean_expression(expression):
|
||||
for x in ["〈", "〉", "《", "》", " "]:
|
||||
expression = expression.replace(x, "")
|
||||
return expression
|
||||
|
||||
@staticmethod
|
||||
def _fill_alts(soup):
|
||||
for gaiji in soup.find_all(class_="gaiji"):
|
||||
if gaiji.name == "img" and gaiji.has_attr("alt"):
|
||||
gaiji.name = "span"
|
||||
gaiji.string = gaiji.attrs["alt"]
|
9
bot/entries/daijirin2/child_entry.py
Normal file
9
bot/entries/daijirin2/child_entry.py
Normal file
|
@ -0,0 +1,9 @@
|
|||
from bot.entries.daijirin2.base_entry import BaseEntry
|
||||
|
||||
|
||||
class ChildEntry(BaseEntry):
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
headwords = self._get_regular_headwords(soup)
|
||||
return headwords
|
50
bot/entries/daijirin2/entry.py
Normal file
50
bot/entries/daijirin2/entry.py
Normal file
|
@ -0,0 +1,50 @@
|
|||
import bot.entries.base.expressions as Expressions
|
||||
from bot.entries.daijirin2.base_entry import BaseEntry
|
||||
from bot.entries.daijirin2.preprocess import preprocess_page
|
||||
|
||||
|
||||
class Entry(BaseEntry):
|
||||
def __init__(self, target, page_id):
|
||||
entry_id = (page_id, 0)
|
||||
super().__init__(target, entry_id)
|
||||
|
||||
def set_page(self, page):
|
||||
page = preprocess_page(page)
|
||||
super().set_page(page)
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
if soup.find("漢字見出") is not None:
|
||||
headwords = self._get_kanji_headwords(soup)
|
||||
elif soup.find("略語G") is not None:
|
||||
headwords = self._get_acronym_headwords(soup)
|
||||
else:
|
||||
headwords = self._get_regular_headwords(soup)
|
||||
return headwords
|
||||
|
||||
def _get_kanji_headwords(self, soup):
|
||||
readings = []
|
||||
for el in soup.find_all("漢字音"):
|
||||
hira = Expressions.kata_to_hira(el.text)
|
||||
readings.append(hira)
|
||||
if soup.find("漢字音") is None:
|
||||
readings.append("")
|
||||
expressions = []
|
||||
for el in soup.find_all("漢字見出"):
|
||||
expressions.append(el.text)
|
||||
headwords = {}
|
||||
for reading in readings:
|
||||
headwords[reading] = expressions
|
||||
return headwords
|
||||
|
||||
def _get_acronym_headwords(self, soup):
|
||||
expressions = []
|
||||
for el in soup.find_all("略語"):
|
||||
expression_parts = []
|
||||
for part in el.find_all(["欧字", "和字"]):
|
||||
expression_parts.append(part.text)
|
||||
expression = "".join(expression_parts)
|
||||
expressions.append(expression)
|
||||
headwords = {"": expressions}
|
||||
return headwords
|
67
bot/entries/daijirin2/phrase_entry.py
Normal file
67
bot/entries/daijirin2/phrase_entry.py
Normal file
|
@ -0,0 +1,67 @@
|
|||
import re
|
||||
|
||||
import bot.entries.base.expressions as Expressions
|
||||
from bot.data import load_phrase_readings
|
||||
from bot.entries.daijirin2.base_entry import BaseEntry
|
||||
|
||||
|
||||
class PhraseEntry(BaseEntry):
|
||||
def get_part_of_speech_tags(self):
|
||||
# phrases do not contain these tags
|
||||
return []
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
headwords = {}
|
||||
expressions = self._find_expressions(soup)
|
||||
readings = self._find_readings()
|
||||
for idx, expression in enumerate(expressions):
|
||||
reading = readings[idx]
|
||||
if reading in headwords:
|
||||
headwords[reading].append(expression)
|
||||
else:
|
||||
headwords[reading] = [expression]
|
||||
return headwords
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
self._delete_unused_nodes(soup)
|
||||
text = soup.find("句表記").text
|
||||
text = self._clean_expression(text)
|
||||
alternatives = parse_phrase(text)
|
||||
expressions = []
|
||||
for alt in alternatives:
|
||||
for exp in Expressions.expand_abbreviation(alt):
|
||||
expressions.append(exp)
|
||||
return expressions
|
||||
|
||||
def _find_readings(self):
|
||||
phrase_readings = load_phrase_readings(self.target)
|
||||
text = phrase_readings[self.entry_id]
|
||||
alternatives = parse_phrase(text)
|
||||
readings = []
|
||||
for alt in alternatives:
|
||||
for reading in Expressions.expand_abbreviation(alt):
|
||||
readings.append(reading)
|
||||
return readings
|
||||
|
||||
|
||||
def parse_phrase(text):
|
||||
"""Return a list of strings described by = notation."""
|
||||
group_pattern = r"([^=]+)(=([^(]+)(=([^(]+)))?"
|
||||
groups = re.findall(group_pattern, text)
|
||||
expressions = [""]
|
||||
for group in groups:
|
||||
new_exps = []
|
||||
for expression in expressions:
|
||||
new_exps.append(expression + group[0])
|
||||
expressions = new_exps.copy()
|
||||
if group[1] == "":
|
||||
continue
|
||||
new_exps = []
|
||||
for expression in expressions:
|
||||
new_exps.append(expression + group[2])
|
||||
for expression in expressions:
|
||||
for alt in group[3].split("・"):
|
||||
new_exps.append(expression + alt)
|
||||
expressions = new_exps.copy()
|
||||
return expressions
|
|
@ -1,18 +0,0 @@
|
|||
from bot.targets import Targets
|
||||
|
||||
from bot.entries.jitenon import JitenonKokugoEntry
|
||||
from bot.entries.jitenon import JitenonYojiEntry
|
||||
from bot.entries.jitenon import JitenonKotowazaEntry
|
||||
from bot.entries.smk8 import Smk8Entry
|
||||
from bot.entries.daijirin2 import Daijirin2Entry
|
||||
|
||||
|
||||
def new_entry(target, page_id):
|
||||
entry_map = {
|
||||
Targets.JITENON_KOKUGO: JitenonKokugoEntry,
|
||||
Targets.JITENON_YOJI: JitenonYojiEntry,
|
||||
Targets.JITENON_KOTOWAZA: JitenonKotowazaEntry,
|
||||
Targets.SMK8: Smk8Entry,
|
||||
Targets.DAIJIRIN2: Daijirin2Entry,
|
||||
}
|
||||
return entry_map[target](target, page_id)
|
bot/entries/jitenon_kokugo/entry.py (new file, 45 lines)
@@ -0,0 +1,45 @@
from bot.entries.base.jitenon_entry import JitenonEntry
import bot.entries.base.expressions as Expressions


class Entry(JitenonEntry):
    def __init__(self, target, entry_id):
        super().__init__(target, entry_id)
        self.example = ""
        self.alt_expression = ""
        self.antonym = ""
        self.attachments = ""
        self.compounds = ""
        self.related_words = ""

    def _get_column_map(self):
        return {
            "言葉": "expression",
            "読み方": "yomikata",
            "意味": "definition",
            "例文": "example",
            "別表記": "alt_expression",
            "対義語": "antonym",
            "活用": "attachments",
            "用例": "compounds",
            "類語": "related_words",
        }

    def _get_headwords(self):
        headwords = {}
        for reading in self.yomikata.split("・"):
            if reading not in headwords:
                headwords[reading] = []
            for expression in self.expression.split("・"):
                headwords[reading].append(expression)
            if self.alt_expression.strip() != "":
                for expression in self.alt_expression.split("・"):
                    headwords[reading].append(expression)
        return headwords

    def _add_variant_expressions(self, headwords):
        for expressions in headwords.values():
            Expressions.add_variant_kanji(expressions)
            Expressions.add_fullwidth(expressions)
            Expressions.remove_iteration_mark(expressions)
            Expressions.add_iteration_mark(expressions)
|
bot/entries/jitenon_kotowaza/entry.py (new file, 35 lines)
@@ -0,0 +1,35 @@
from bot.entries.base.jitenon_entry import JitenonEntry
import bot.entries.base.expressions as Expressions


class Entry(JitenonEntry):
    def __init__(self, target, entry_id):
        super().__init__(target, entry_id)
        self.origin = ""
        self.example = ""
        self.related_expressions = []

    def _get_column_map(self):
        return {
            "言葉": "expression",
            "読み方": "yomikata",
            "意味": "definition",
            "異形": "other_forms",
            "出典": "origin",
            "例文": "example",
            "類句": "related_expressions",
        }

    def _get_headwords(self):
        if self.expression == "金棒引き・鉄棒引き":
            headwords = {
                "かなぼうひき": ["金棒引き", "鉄棒引き"]
            }
        else:
            headwords = super()._get_headwords()
        return headwords

    def _add_variant_expressions(self, headwords):
        for expressions in headwords.values():
            Expressions.add_variant_kanji(expressions)
            Expressions.add_fullwidth(expressions)
|
bot/entries/jitenon_yoji/entry.py (new file, 27 lines)
@@ -0,0 +1,27 @@
import bot.entries.base.expressions as Expressions
from bot.entries.base.jitenon_entry import JitenonEntry


class Entry(JitenonEntry):
    def __init__(self, target, entry_id):
        super().__init__(target, entry_id)
        self.origin = ""
        self.kanken_level = ""
        self.category = ""
        self.related_expressions = []

    def _get_column_map(self):
        return {
            "四字熟語": "expression",
            "読み方": "yomikata",
            "意味": "definition",
            "異形": "other_forms",
            "出典": "origin",
            "漢検級": "kanken_level",
            "場面用途": "category",
            "類義語": "related_expressions",
        }

    def _add_variant_expressions(self, headwords):
        for expressions in headwords.values():
            Expressions.add_variant_kanji(expressions)
|
104
bot/entries/sankoku8/base_entry.py
Normal file
104
bot/entries/sankoku8/base_entry.py
Normal file
|
@ -0,0 +1,104 @@
|
|||
import bot.soup as Soup
|
||||
from bot.entries.base.sanseido_entry import SanseidoEntry
|
||||
from bot.entries.sankoku8.parse import parse_hyouki_soup
|
||||
|
||||
|
||||
class BaseEntry(SanseidoEntry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.children = []
|
||||
self.phrases = []
|
||||
self._hyouki_name = "表記"
|
||||
self._midashi_name = None
|
||||
self._midashi_kana_name = None
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
readings = self._find_readings(soup)
|
||||
expressions = self._find_expressions(soup)
|
||||
headwords = {}
|
||||
for reading in readings:
|
||||
headwords[reading] = []
|
||||
if len(readings) == 1:
|
||||
reading = readings[0]
|
||||
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
|
||||
headwords[reading].append(reading)
|
||||
for exp in expressions:
|
||||
if exp not in headwords[reading]:
|
||||
headwords[reading].append(exp)
|
||||
elif len(readings) > 1 and len(expressions) == 0:
|
||||
for reading in readings:
|
||||
headwords[reading].append(reading)
|
||||
elif len(readings) > 1 and len(expressions) == 1:
|
||||
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
|
||||
for reading in readings:
|
||||
headwords[reading].append(reading)
|
||||
expression = expressions[0]
|
||||
for reading in readings:
|
||||
if expression not in headwords[reading]:
|
||||
headwords[reading].append(expression)
|
||||
elif len(readings) > 1 and len(expressions) == len(readings):
|
||||
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
|
||||
for reading in readings:
|
||||
headwords[reading].append(reading)
|
||||
for idx, reading in enumerate(readings):
|
||||
exp = expressions[idx]
|
||||
if exp not in headwords[reading]:
|
||||
headwords[reading].append(exp)
|
||||
else:
|
||||
raise Exception() # shouldn't happen
|
||||
return headwords
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
if self._part_of_speech_tags is not None:
|
||||
return self._part_of_speech_tags
|
||||
self._part_of_speech_tags = []
|
||||
soup = self.get_page_soup()
|
||||
for midashi in soup.find_all([self._midashi_name, "見出部要素"]):
|
||||
pos_group = midashi.find("品詞G")
|
||||
if pos_group is None:
|
||||
continue
|
||||
for tag in pos_group.find_all("a"):
|
||||
if tag.text not in self._part_of_speech_tags:
|
||||
self._part_of_speech_tags.append(tag.text)
|
||||
return self._part_of_speech_tags
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
expressions = []
|
||||
for hyouki in soup.find_all(self._hyouki_name):
|
||||
self._fill_alts(hyouki)
|
||||
for expression in parse_hyouki_soup(hyouki, [""]):
|
||||
expressions.append(expression)
|
||||
return expressions
|
||||
|
||||
def _find_readings(self, soup):
|
||||
midasi_kana = soup.find(self._midashi_kana_name)
|
||||
readings = parse_hyouki_soup(midasi_kana, [""])
|
||||
return readings
|
||||
|
||||
def _get_subentry_parameters(self):
|
||||
from bot.entries.sankoku8.child_entry import ChildEntry
|
||||
from bot.entries.sankoku8.phrase_entry import PhraseEntry
|
||||
subentry_parameters = [
|
||||
[ChildEntry, ["子項目"], self.children],
|
||||
[PhraseEntry, ["句項目"], self.phrases],
|
||||
]
|
||||
return subentry_parameters
|
||||
|
||||
@staticmethod
|
||||
def _delete_unused_nodes(soup):
|
||||
"""Remove extra markup elements that appear in the entry
|
||||
headword line which are not part of the entry headword"""
|
||||
unused_nodes = [
|
||||
"語構成", "平板", "アクセント", "表外字マーク", "表外音訓マーク",
|
||||
"アクセント分節", "活用分節", "ルビG", "分書"
|
||||
]
|
||||
for name in unused_nodes:
|
||||
Soup.delete_soup_nodes(soup, name)
|
||||
|
||||
@staticmethod
|
||||
def _fill_alts(soup):
|
||||
for img in soup.find_all("img"):
|
||||
if img.has_attr("alt"):
|
||||
img.string = img.attrs["alt"]
|
bot/entries/sankoku8/child_entry.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from bot.entries.sankoku8.base_entry import BaseEntry


class ChildEntry(BaseEntry):
    def __init__(self, target, page_id):
        super().__init__(target, page_id)
        self._midashi_name = "子見出部"
        self._midashi_kana_name = "子見出仮名"
|
bot/entries/sankoku8/entry.py (new file, 14 lines)
@@ -0,0 +1,14 @@
from bot.entries.sankoku8.base_entry import BaseEntry
from bot.entries.sankoku8.preprocess import preprocess_page


class Entry(BaseEntry):
    def __init__(self, target, page_id):
        entry_id = (page_id, 0)
        super().__init__(target, entry_id)
        self._midashi_name = "見出部"
        self._midashi_kana_name = "見出仮名"

    def set_page(self, page):
        page = preprocess_page(page)
        super().set_page(page)
|
bot/entries/sankoku8/parse.py (new file, 65 lines)
@@ -0,0 +1,65 @@
from bs4 import BeautifulSoup


def parse_hyouki_soup(soup, base_exps):
    omitted_characters = [
        "／", "〈", "〉", "（", "）", "⦅", "⦆", "：", "…"
    ]
    exps = base_exps.copy()
    for child in soup.children:
        new_exps = []
        if child.name == "言換G":
            for alt in child.find_all("言換"):
                parts = parse_hyouki_soup(alt, [""])
                for exp in exps:
                    for part in parts:
                        new_exps.append(exp + part)
        elif child.name == "補足表記":
            alt1 = child.find("表記対象")
            alt2 = child.find("表記内容G")
            parts1 = parse_hyouki_soup(alt1, [""])
            parts2 = parse_hyouki_soup(alt2, [""])
            for exp in exps:
                for part in parts1:
                    new_exps.append(exp + part)
                for part in parts2:
                    new_exps.append(exp + part)
        elif child.name == "省略":
            parts = parse_hyouki_soup(child, [""])
            for exp in exps:
                new_exps.append(exp)
                for part in parts:
                    new_exps.append(exp + part)
        elif child.name is not None:
            new_exps = parse_hyouki_soup(child, exps)
        else:
            text = child.text
            for char in omitted_characters:
                text = text.replace(char, "")
            for exp in exps:
                new_exps.append(exp + text)
        exps = new_exps.copy()
    return exps


def parse_hyouki_pattern(pattern):
    replacements = {
        "（": "<省略>（",
        "）": "）</省略>",
        "{": "<補足表記><表記対象>",
        "・": "</表記対象><表記内容G>（<表記内容>",
        "}": "</表記内容>）</表記内容G></補足表記>",
        "〈": "<言換G>〈<言換>",
        "／": "</言換>／<言換>",
        "〉": "</言換>〉</言換G>",
        "⦅": "<補足表記><表記対象>",
        "＼": "</表記対象><表記内容G>⦅<表記内容>",
        "⦆": "</表記内容>⦆</表記内容G></補足表記>",
    }
    markup = f"<span>{pattern}</span>"
    for key, val in replacements.items():
        markup = markup.replace(key, val)
    soup = BeautifulSoup(markup, "xml")
    hyouki_soup = soup.find("span")
    exps = parse_hyouki_soup(hyouki_soup, [""])
    return exps
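
The two helpers above drive headword expansion for sankoku8. A minimal sketch of how they behave, assuming the fullwidth bracket conventions shown in the replacement table; the patterns themselves are hypothetical and not taken from the dictionary data:

```python
from bot.entries.sankoku8.parse import parse_hyouki_pattern

# 〈…／…〉 marks interchangeable alternatives; （…） marks an omissible part.
print(parse_hyouki_pattern("〈頂／戴〉く"))  # -> ['頂く', '戴く']
print(parse_hyouki_pattern("頂（戴）"))      # -> ['頂', '頂戴']
```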
|
bot/entries/sankoku8/phrase_entry.py (new file, 37 lines)
@@ -0,0 +1,37 @@
from bot.data import load_phrase_readings
from bot.entries.sankoku8.base_entry import BaseEntry
from bot.entries.sankoku8.parse import parse_hyouki_soup
from bot.entries.sankoku8.parse import parse_hyouki_pattern


class PhraseEntry(BaseEntry):
    def get_part_of_speech_tags(self):
        # phrases do not contain these tags
        return []

    def _get_headwords(self):
        soup = self.get_page_soup()
        self._delete_unused_nodes(soup)
        expressions = self._find_expressions(soup)
        readings = self._find_readings(soup)
        headwords = {}
        if len(expressions) != len(readings):
            raise Exception(f"{self.entry_id[0]}-{self.entry_id[1]}")
        for idx, expression in enumerate(expressions):
            reading = readings[idx]
            if reading in headwords:
                headwords[reading].append(expression)
            else:
                headwords[reading] = [expression]
        return headwords

    def _find_expressions(self, soup):
        phrase_soup = soup.find("句表記")
        expressions = parse_hyouki_soup(phrase_soup, [""])
        return expressions

    def _find_readings(self, soup):
        reading_patterns = load_phrase_readings(self.target)
        reading_pattern = reading_patterns[self.entry_id]
        readings = parse_hyouki_pattern(reading_pattern)
        return readings
|
bot/entries/sankoku8/preprocess.py (new file, 51 lines)
@@ -0,0 +1,51 @@
import re
from bs4 import BeautifulSoup

from bot.data import get_adobe_glyph


__GAIJI = {
    "svg-gaiji/byan.svg": "𰻞",
    "svg-gaiji/G16EF.svg": "篡",
}


def preprocess_page(page):
    soup = BeautifulSoup(page, features="xml")
    __replace_glyph_codes(soup)
    __add_image_alt_text(soup)
    __replace_tatehyphen(soup)
    page = __strip_page(soup)
    return page


def __replace_glyph_codes(soup):
    for el in soup.find_all("glyph"):
        m = re.search(r"^glyph:([0-9]+);?$", el.attrs["style"])
        code = int(m.group(1))
        for geta in el.find_all(string="〓"):
            glyph = get_adobe_glyph(code)
            geta.replace_with(glyph)


def __add_image_alt_text(soup):
    for img in soup.find_all("img"):
        if not img.has_attr("src"):
            continue
        src = img.attrs["src"]
        if src in __GAIJI:
            img.attrs["alt"] = __GAIJI[src]


def __replace_tatehyphen(soup):
    for img in soup.find_all("img", {"src": "svg-gaiji/tatehyphen.svg"}):
        img.string = "−"
        img.unwrap()


def __strip_page(soup):
    koumoku = soup.find(["項目"])
    if koumoku is not None:
        return koumoku.decode()
    else:
        raise Exception(f"Primary 項目 not found in page:\n{soup.prettify()}")
|
|
@ -1,221 +0,0 @@
|
|||
from bs4 import BeautifulSoup
|
||||
|
||||
import bot.entries.expressions as Expressions
|
||||
import bot.soup as Soup
|
||||
from bot.data import load_smk8_phrase_readings
|
||||
from bot.entries.entry import Entry
|
||||
from bot.entries.smk8_preprocess import preprocess_page
|
||||
|
||||
|
||||
class _BaseSmk8Entry(Entry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.children = []
|
||||
self.phrases = []
|
||||
self.kanjis = []
|
||||
|
||||
def get_global_identifier(self):
|
||||
parent_part = format(self.entry_id[0], '06')
|
||||
child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
|
||||
return f"@{self.target.value}-{parent_part}-{child_part}"
|
||||
|
||||
def set_page(self, page):
|
||||
page = self.__decompose_subentries(page)
|
||||
self._page = page
|
||||
|
||||
def get_page_soup(self):
|
||||
soup = BeautifulSoup(self._page, "xml")
|
||||
return soup
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
if self._part_of_speech_tags is not None:
|
||||
return self._part_of_speech_tags
|
||||
self._part_of_speech_tags = []
|
||||
soup = self.get_page_soup()
|
||||
headword_info = soup.find("見出要素")
|
||||
if headword_info is None:
|
||||
return self._part_of_speech_tags
|
||||
for tag in headword_info.find_all("品詞M"):
|
||||
if tag.text not in self._part_of_speech_tags:
|
||||
self._part_of_speech_tags.append(tag.text)
|
||||
return self._part_of_speech_tags
|
||||
|
||||
def _add_variant_expressions(self, headwords):
|
||||
for expressions in headwords.values():
|
||||
Expressions.add_variant_kanji(expressions)
|
||||
Expressions.add_fullwidth(expressions)
|
||||
Expressions.remove_iteration_mark(expressions)
|
||||
Expressions.add_iteration_mark(expressions)
|
||||
|
||||
def _find_reading(self, soup):
|
||||
midasi_kana = soup.find("見出仮名")
|
||||
reading = midasi_kana.text
|
||||
for x in [" ", "・"]:
|
||||
reading = reading.replace(x, "")
|
||||
return reading
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
clean_expressions = []
|
||||
for expression in soup.find_all("標準表記"):
|
||||
clean_expression = self._clean_expression(expression.text)
|
||||
clean_expressions.append(clean_expression)
|
||||
expressions = Expressions.expand_abbreviation_list(clean_expressions)
|
||||
return expressions
|
||||
|
||||
def __decompose_subentries(self, page):
|
||||
soup = BeautifulSoup(page, features="xml")
|
||||
subentry_parameters = [
|
||||
[Smk8ChildEntry, ["子項目F", "子項目"], self.children],
|
||||
[Smk8PhraseEntry, ["句項目F", "句項目"], self.phrases],
|
||||
[Smk8KanjiEntry, ["造語成分項目"], self.kanjis],
|
||||
]
|
||||
for x in subentry_parameters:
|
||||
subentry_class, tags, subentry_list = x
|
||||
for tag in tags:
|
||||
tag_soup = soup.find(tag)
|
||||
while tag_soup is not None:
|
||||
tag_soup.name = "項目"
|
||||
subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
|
||||
self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
|
||||
subentry = subentry_class(self.target, subentry_id)
|
||||
page = tag_soup.decode()
|
||||
subentry.set_page(page)
|
||||
subentry_list.append(subentry)
|
||||
tag_soup.decompose()
|
||||
tag_soup = soup.find(tag)
|
||||
return soup.decode()
|
||||
|
||||
@staticmethod
|
||||
def id_string_to_entry_id(id_string):
|
||||
parts = id_string.split("-")
|
||||
if len(parts) == 1:
|
||||
return (int(parts[0]), 0)
|
||||
elif len(parts) == 2:
|
||||
# subentries have a hexadecimal part
|
||||
return (int(parts[0]), int(parts[1], 16))
|
||||
else:
|
||||
raise Exception(f"Invalid entry ID: {id_string}")
|
||||
|
||||
@staticmethod
|
||||
def _delete_unused_nodes(soup):
|
||||
"""Remove extra markup elements that appear in the entry
|
||||
headword line which are not part of the entry headword"""
|
||||
unused_nodes = [
|
||||
"表音表記", "表外音訓マーク", "表外字マーク", "ルビG"
|
||||
]
|
||||
for name in unused_nodes:
|
||||
Soup.delete_soup_nodes(soup, name)
|
||||
|
||||
@staticmethod
|
||||
def _clean_expression(expression):
|
||||
for x in ["〈", "〉", "{", "}", "…", " "]:
|
||||
expression = expression.replace(x, "")
|
||||
return expression
|
||||
|
||||
@staticmethod
|
||||
def _fill_alts(soup):
|
||||
for el in soup.find_all(["親見出仮名", "親見出表記"]):
|
||||
el.string = el.attrs["alt"]
|
||||
for gaiji in soup.find_all("外字"):
|
||||
gaiji.string = gaiji.img.attrs["alt"]
|
||||
|
||||
|
||||
class Smk8Entry(_BaseSmk8Entry):
|
||||
def __init__(self, target, page_id):
|
||||
entry_id = (page_id, 0)
|
||||
super().__init__(target, entry_id)
|
||||
|
||||
def set_page(self, page):
|
||||
page = preprocess_page(page)
|
||||
super().set_page(page)
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
self._fill_alts(soup)
|
||||
reading = self._find_reading(soup)
|
||||
expressions = []
|
||||
if soup.find("見出部").find("標準表記") is None:
|
||||
expressions.append(reading)
|
||||
for expression in self._find_expressions(soup):
|
||||
if expression not in expressions:
|
||||
expressions.append(expression)
|
||||
headwords = {reading: expressions}
|
||||
return headwords
|
||||
|
||||
|
||||
class Smk8ChildEntry(_BaseSmk8Entry):
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
self._fill_alts(soup)
|
||||
reading = self._find_reading(soup)
|
||||
expressions = []
|
||||
if soup.find("子見出部").find("標準表記") is None:
|
||||
expressions.append(reading)
|
||||
for expression in self._find_expressions(soup):
|
||||
if expression not in expressions:
|
||||
expressions.append(expression)
|
||||
headwords = {reading: expressions}
|
||||
return headwords
|
||||
|
||||
|
||||
class Smk8PhraseEntry(_BaseSmk8Entry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.__phrase_readings = load_smk8_phrase_readings()
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
# phrases do not contain these tags
|
||||
return []
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
headwords = {}
|
||||
expressions = self._find_expressions(soup)
|
||||
readings = self._find_readings()
|
||||
for idx, expression in enumerate(expressions):
|
||||
reading = readings[idx]
|
||||
if reading in headwords:
|
||||
headwords[reading].append(expression)
|
||||
else:
|
||||
headwords[reading] = [expression]
|
||||
return headwords
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
self._delete_unused_nodes(soup)
|
||||
self._fill_alts(soup)
|
||||
text = soup.find("標準表記").text
|
||||
text = self._clean_expression(text)
|
||||
alternatives = Expressions.expand_smk_alternatives(text)
|
||||
expressions = []
|
||||
for alt in alternatives:
|
||||
for exp in Expressions.expand_abbreviation(alt):
|
||||
expressions.append(exp)
|
||||
return expressions
|
||||
|
||||
def _find_readings(self):
|
||||
text = self.__phrase_readings[self.entry_id]
|
||||
alternatives = Expressions.expand_smk_alternatives(text)
|
||||
readings = []
|
||||
for alt in alternatives:
|
||||
for reading in Expressions.expand_abbreviation(alt):
|
||||
readings.append(reading)
|
||||
return readings
|
||||
|
||||
|
||||
class Smk8KanjiEntry(_BaseSmk8Entry):
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
self._delete_unused_nodes(soup)
|
||||
self._fill_alts(soup)
|
||||
reading = self.__get_parent_reading()
|
||||
expressions = self._find_expressions(soup)
|
||||
headwords = {reading: expressions}
|
||||
return headwords
|
||||
|
||||
def __get_parent_reading(self):
|
||||
parent_id = self.SUBENTRY_ID_TO_ENTRY_ID[self.entry_id]
|
||||
parent = self.ID_TO_ENTRY[parent_id]
|
||||
reading = parent.get_first_reading()
|
||||
return reading
|
73
bot/entries/smk8/base_entry.py
Normal file
73
bot/entries/smk8/base_entry.py
Normal file
|
@ -0,0 +1,73 @@
|
|||
import bot.soup as Soup
|
||||
import bot.entries.base.expressions as Expressions
|
||||
from bot.entries.base.sanseido_entry import SanseidoEntry
|
||||
|
||||
|
||||
class BaseEntry(SanseidoEntry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.children = []
|
||||
self.phrases = []
|
||||
self.kanjis = []
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
if self._part_of_speech_tags is not None:
|
||||
return self._part_of_speech_tags
|
||||
self._part_of_speech_tags = []
|
||||
soup = self.get_page_soup()
|
||||
headword_info = soup.find("見出要素")
|
||||
if headword_info is None:
|
||||
return self._part_of_speech_tags
|
||||
for tag in headword_info.find_all("品詞M"):
|
||||
if tag.text not in self._part_of_speech_tags:
|
||||
self._part_of_speech_tags.append(tag.text)
|
||||
return self._part_of_speech_tags
|
||||
|
||||
def _find_reading(self, soup):
|
||||
midasi_kana = soup.find("見出仮名")
|
||||
reading = midasi_kana.text
|
||||
for x in [" ", "・"]:
|
||||
reading = reading.replace(x, "")
|
||||
return reading
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
clean_expressions = []
|
||||
for expression in soup.find_all("標準表記"):
|
||||
clean_expression = self._clean_expression(expression.text)
|
||||
clean_expressions.append(clean_expression)
|
||||
expressions = Expressions.expand_abbreviation_list(clean_expressions)
|
||||
return expressions
|
||||
|
||||
def _get_subentry_parameters(self):
|
||||
from bot.entries.smk8.child_entry import ChildEntry
|
||||
from bot.entries.smk8.phrase_entry import PhraseEntry
|
||||
from bot.entries.smk8.kanji_entry import KanjiEntry
|
||||
subentry_parameters = [
|
||||
[ChildEntry, ["子項目F", "子項目"], self.children],
|
||||
[PhraseEntry, ["句項目F", "句項目"], self.phrases],
|
||||
[KanjiEntry, ["造語成分項目"], self.kanjis],
|
||||
]
|
||||
return subentry_parameters
|
||||
|
||||
@staticmethod
|
||||
def _delete_unused_nodes(soup):
|
||||
"""Remove extra markup elements that appear in the entry
|
||||
headword line which are not part of the entry headword"""
|
||||
unused_nodes = [
|
||||
"表音表記", "表外音訓マーク", "表外字マーク", "ルビG"
|
||||
]
|
||||
for name in unused_nodes:
|
||||
Soup.delete_soup_nodes(soup, name)
|
||||
|
||||
@staticmethod
|
||||
def _clean_expression(expression):
|
||||
for x in ["〈", "〉", "{", "}", "…", " "]:
|
||||
expression = expression.replace(x, "")
|
||||
return expression
|
||||
|
||||
@staticmethod
|
||||
def _fill_alts(soup):
|
||||
for elm in soup.find_all(["親見出仮名", "親見出表記"]):
|
||||
elm.string = elm.attrs["alt"]
|
||||
for gaiji in soup.find_all("外字"):
|
||||
gaiji.string = gaiji.img.attrs["alt"]
|
bot/entries/smk8/child_entry.py (new file, 17 lines)
@@ -0,0 +1,17 @@
from bot.entries.smk8.base_entry import BaseEntry


class ChildEntry(BaseEntry):
    def _get_headwords(self):
        soup = self.get_page_soup()
        self._delete_unused_nodes(soup)
        self._fill_alts(soup)
        reading = self._find_reading(soup)
        expressions = []
        if soup.find("子見出部").find("標準表記") is None:
            expressions.append(reading)
        for expression in self._find_expressions(soup):
            if expression not in expressions:
                expressions.append(expression)
        headwords = {reading: expressions}
        return headwords
|
bot/entries/smk8/entry.py (new file, 26 lines)
@@ -0,0 +1,26 @@
from bot.entries.smk8.base_entry import BaseEntry
from bot.entries.smk8.preprocess import preprocess_page


class Entry(BaseEntry):
    def __init__(self, target, page_id):
        entry_id = (page_id, 0)
        super().__init__(target, entry_id)

    def set_page(self, page):
        page = preprocess_page(page)
        super().set_page(page)

    def _get_headwords(self):
        soup = self.get_page_soup()
        self._delete_unused_nodes(soup)
        self._fill_alts(soup)
        reading = self._find_reading(soup)
        expressions = []
        if soup.find("見出部").find("標準表記") is None:
            expressions.append(reading)
        for expression in self._find_expressions(soup):
            if expression not in expressions:
                expressions.append(expression)
        headwords = {reading: expressions}
        return headwords
|
bot/entries/smk8/kanji_entry.py (new file, 22 lines)
@@ -0,0 +1,22 @@
from bot.entries.smk8.base_entry import BaseEntry


class KanjiEntry(BaseEntry):
    def get_part_of_speech_tags(self):
        # kanji entries do not contain these tags
        return []

    def _get_headwords(self):
        soup = self.get_page_soup()
        self._delete_unused_nodes(soup)
        self._fill_alts(soup)
        reading = self.__get_parent_reading()
        expressions = self._find_expressions(soup)
        headwords = {reading: expressions}
        return headwords

    def __get_parent_reading(self):
        parent_id = self.SUBENTRY_ID_TO_ENTRY_ID[self.entry_id]
        parent = self.ID_TO_ENTRY[parent_id]
        reading = parent.get_first_reading()
        return reading
|
64
bot/entries/smk8/phrase_entry.py
Normal file
64
bot/entries/smk8/phrase_entry.py
Normal file
|
@ -0,0 +1,64 @@
|
|||
import re
|
||||
|
||||
import bot.entries.base.expressions as Expressions
|
||||
from bot.data import load_phrase_readings
|
||||
from bot.entries.smk8.base_entry import BaseEntry
|
||||
|
||||
|
||||
class PhraseEntry(BaseEntry):
|
||||
def __init__(self, target, entry_id):
|
||||
super().__init__(target, entry_id)
|
||||
self.__phrase_readings = load_phrase_readings(self.target)
|
||||
|
||||
def get_part_of_speech_tags(self):
|
||||
# phrase entries do not contain these tags
|
||||
return []
|
||||
|
||||
def _get_headwords(self):
|
||||
soup = self.get_page_soup()
|
||||
headwords = {}
|
||||
expressions = self._find_expressions(soup)
|
||||
readings = self._find_readings()
|
||||
for idx, expression in enumerate(expressions):
|
||||
reading = readings[idx]
|
||||
if reading in headwords:
|
||||
headwords[reading].append(expression)
|
||||
else:
|
||||
headwords[reading] = [expression]
|
||||
return headwords
|
||||
|
||||
def _find_expressions(self, soup):
|
||||
self._delete_unused_nodes(soup)
|
||||
self._fill_alts(soup)
|
||||
text = soup.find("標準表記").text
|
||||
text = self._clean_expression(text)
|
||||
alternatives = parse_phrase(text)
|
||||
expressions = []
|
||||
for alt in alternatives:
|
||||
for exp in Expressions.expand_abbreviation(alt):
|
||||
expressions.append(exp)
|
||||
return expressions
|
||||
|
||||
def _find_readings(self):
|
||||
text = self.__phrase_readings[self.entry_id]
|
||||
alternatives = parse_phrase(text)
|
||||
readings = []
|
||||
for alt in alternatives:
|
||||
for reading in Expressions.expand_abbreviation(alt):
|
||||
readings.append(reading)
|
||||
return readings
|
||||
|
||||
|
||||
def parse_phrase(text):
|
||||
"""Return a list of strings described by △ notation."""
|
||||
match = re.search(r"△([^(]+)(([^(]+))", text)
|
||||
if match is None:
|
||||
return [text]
|
||||
alt_parts = [match.group(1)]
|
||||
for alt_part in match.group(2).split("・"):
|
||||
alt_parts.append(alt_part)
|
||||
alts = []
|
||||
for alt_part in alt_parts:
|
||||
alt_exp = re.sub(r"△[^(]+([^(]+)", alt_part, text)
|
||||
alts.append(alt_exp)
|
||||
return alts
|
|
@@ -6,8 +6,8 @@ from bot.data import get_adobe_glyph

 __GAIJI = {
     "gaiji/5350.svg": "卐",
-    "gaiji/62cb.svg": "抛",
-    "gaiji/7be1.svg": "簒",
+    "gaiji/62cb.svg": "拋",
+    "gaiji/7be1.svg": "篡",
 }
|
||||
|
||||
|
bot/factory.py (new file, 37 lines)
@@ -0,0 +1,37 @@
import importlib


def new_crawler(target):
    module_path = f"bot.crawlers.{target.name.lower()}"
    module = importlib.import_module(module_path)
    return module.Crawler(target)


def new_entry(target, page_id):
    module_path = f"bot.entries.{target.name.lower()}.entry"
    module = importlib.import_module(module_path)
    return module.Entry(target, page_id)


def new_yomichan_exporter(target):
    module_path = f"bot.yomichan.exporters.{target.name.lower()}"
    module = importlib.import_module(module_path)
    return module.Exporter(target)


def new_yomichan_terminator(target):
    module_path = f"bot.yomichan.terms.{target.name.lower()}"
    module = importlib.import_module(module_path)
    return module.Terminator(target)


def new_mdict_exporter(target):
    module_path = f"bot.mdict.exporters.{target.name.lower()}"
    module = importlib.import_module(module_path)
    return module.Exporter(target)


def new_mdict_terminator(target):
    module_path = f"bot.mdict.terms.{target.name.lower()}"
    module = importlib.import_module(module_path)
    return module.Terminator(target)
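
This module replaces the per-dictionary lookup tables deleted elsewhere in this changeset: each factory function builds a module path from the target's name and imports the class lazily. A sketch of the intended call pattern; the page id is made up:

```python
from bot.targets import Targets
from bot.factory import new_entry, new_mdict_exporter

# Resolves to bot.entries.smk8.entry.Entry and bot.mdict.exporters.smk8.Exporter,
# so adding a dictionary only requires adding modules that follow this layout.
entry = new_entry(Targets.SMK8, 12345)       # hypothetical page id
exporter = new_mdict_exporter(Targets.SMK8)
```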
|
|
@ -1,21 +1,19 @@
|
|||
# pylint: disable=too-few-public-methods
|
||||
|
||||
import subprocess
|
||||
import os
|
||||
import shutil
|
||||
import subprocess
|
||||
from abc import ABC, abstractmethod
|
||||
from pathlib import Path
|
||||
from datetime import datetime
|
||||
|
||||
from platformdirs import user_documents_dir, user_cache_dir
|
||||
|
||||
from bot.targets import Targets
|
||||
from bot.mdict.terms.factory import new_terminator
|
||||
from bot.time import timestamp
|
||||
from bot.factory import new_mdict_terminator
|
||||
|
||||
|
||||
class Exporter(ABC):
|
||||
class BaseExporter(ABC):
|
||||
def __init__(self, target):
|
||||
self._target = target
|
||||
self._terminator = new_terminator(target)
|
||||
self._terminator = new_mdict_terminator(target)
|
||||
self._build_dir = None
|
||||
self._build_media_dir = None
|
||||
self._description_file = None
|
||||
|
@ -24,11 +22,10 @@ class Exporter(ABC):
|
|||
def export(self, entries, media_dir, icon_file):
|
||||
self._init_build_media_dir(media_dir)
|
||||
self._init_description_file(entries)
|
||||
terms = self._get_terms(entries)
|
||||
print(f"Exporting {len(terms)} Mdict keys...")
|
||||
self._write_mdx_file(terms)
|
||||
self._write_mdx_file(entries)
|
||||
self._write_mdd_file()
|
||||
self._write_icon_file(icon_file)
|
||||
self._write_css_file()
|
||||
self._rm_build_dir()
|
||||
|
||||
def _get_build_dir(self):
|
||||
|
@ -36,7 +33,7 @@ class Exporter(ABC):
|
|||
return self._build_dir
|
||||
cache_dir = user_cache_dir("jitenbot")
|
||||
build_directory = os.path.join(cache_dir, "mdict_build")
|
||||
print(f"Initializing build directory `{build_directory}`")
|
||||
print(f"{timestamp()} Initializing build directory `{build_directory}`")
|
||||
if Path(build_directory).is_dir():
|
||||
shutil.rmtree(build_directory)
|
||||
os.makedirs(build_directory)
|
||||
|
@ -47,7 +44,7 @@ class Exporter(ABC):
|
|||
build_dir = self._get_build_dir()
|
||||
build_media_dir = os.path.join(build_dir, self._target.value)
|
||||
if media_dir is not None:
|
||||
print("Copying media files to build directory...")
|
||||
print(f"{timestamp()} Copying media files to build directory...")
|
||||
shutil.copytree(media_dir, build_media_dir)
|
||||
else:
|
||||
os.makedirs(build_media_dir)
|
||||
|
@ -57,34 +54,23 @@ class Exporter(ABC):
|
|||
self._build_media_dir = build_media_dir
|
||||
|
||||
def _init_description_file(self, entries):
|
||||
filename = f"{self._target.value}.mdx.description.html"
|
||||
original_file = os.path.join(
|
||||
"data", "mdict", "description", filename)
|
||||
with open(original_file, "r", encoding="utf8") as f:
|
||||
description_template_file = self._get_description_template_file()
|
||||
with open(description_template_file, "r", encoding="utf8") as f:
|
||||
description = f.read()
|
||||
description = description.replace(
|
||||
"{{revision}}", self._get_revision(entries))
|
||||
description = description.replace(
|
||||
"{{attribution}}", self._get_attribution(entries))
|
||||
build_dir = self._get_build_dir()
|
||||
description_file = os.path.join(build_dir, filename)
|
||||
description_file = os.path.join(
|
||||
build_dir, f"{self._target.value}.mdx.description.html")
|
||||
with open(description_file, "w", encoding="utf8") as f:
|
||||
f.write(description)
|
||||
self._description_file = description_file
|
||||
|
||||
def _get_terms(self, entries):
|
||||
terms = []
|
||||
entries_len = len(entries)
|
||||
for idx, entry in enumerate(entries):
|
||||
update = f"Creating Mdict terms for entry {idx+1}/{entries_len}"
|
||||
print(update, end='\r', flush=True)
|
||||
new_terms = self._terminator.make_terms(entry)
|
||||
for term in new_terms:
|
||||
terms.append(term)
|
||||
print()
|
||||
return terms
|
||||
|
||||
def _write_mdx_file(self, terms):
|
||||
def _write_mdx_file(self, entries):
|
||||
terms = self._get_terms(entries)
|
||||
print(f"{timestamp()} Exporting {len(terms)} Mdict keys...")
|
||||
out_dir = self._get_out_dir()
|
||||
out_file = os.path.join(out_dir, f"{self._target.value}.mdx")
|
||||
params = [
|
||||
|
@ -96,6 +82,18 @@ class Exporter(ABC):
|
|||
]
|
||||
subprocess.run(params, check=True)
|
||||
|
||||
def _get_terms(self, entries):
|
||||
terms = []
|
||||
entries_len = len(entries)
|
||||
for idx, entry in enumerate(entries):
|
||||
update = f"\tCreating MDict terms for entry {idx+1}/{entries_len}"
|
||||
print(update, end='\r', flush=True)
|
||||
new_terms = self._terminator.make_terms(entry)
|
||||
for term in new_terms:
|
||||
terms.append(term)
|
||||
print()
|
||||
return terms
|
||||
|
||||
def _write_mdd_file(self):
|
||||
out_dir = self._get_out_dir()
|
||||
out_file = os.path.join(out_dir, f"{self._target.value}.mdd")
|
||||
|
@ -109,7 +107,7 @@ class Exporter(ABC):
|
|||
subprocess.run(params, check=True)
|
||||
|
||||
def _write_icon_file(self, icon_file):
|
||||
premade_icon_file = f"data/mdict/icon/{self._target.value}.png"
|
||||
premade_icon_file = self._get_premade_icon_file()
|
||||
out_dir = self._get_out_dir()
|
||||
out_file = os.path.join(out_dir, f"{self._target.value}.png")
|
||||
if icon_file is not None and Path(icon_file).is_file():
|
||||
|
@ -117,12 +115,17 @@ class Exporter(ABC):
|
|||
elif Path(premade_icon_file).is_file():
|
||||
shutil.copy(premade_icon_file, out_file)
|
||||
|
||||
def _write_css_file(self):
|
||||
css_file = self._get_css_file()
|
||||
out_dir = self._get_out_dir()
|
||||
shutil.copy(css_file, out_dir)
|
||||
|
||||
def _get_out_dir(self):
|
||||
if self._out_dir is not None:
|
||||
return self._out_dir
|
||||
out_dir = os.path.join(
|
||||
user_documents_dir(), "jitenbot", "mdict", self._target.value)
|
||||
print(f"Initializing output directory `{out_dir}`")
|
||||
print(f"{timestamp()} Initializing output directory `{out_dir}`")
|
||||
if Path(out_dir).is_dir():
|
||||
shutil.rmtree(out_dir)
|
||||
os.makedirs(out_dir)
|
||||
|
@ -148,59 +151,24 @@ class Exporter(ABC):
|
|||
"data", "mdict", "css",
|
||||
f"{self._target.value}.css")
|
||||
|
||||
def _get_premade_icon_file(self):
|
||||
return os.path.join(
|
||||
"data", "mdict", "icon",
|
||||
f"{self._target.value}.png")
|
||||
|
||||
def _get_description_template_file(self):
|
||||
return os.path.join(
|
||||
"data", "mdict", "description",
|
||||
f"{self._target.value}.mdx.description.html")
|
||||
|
||||
def _rm_build_dir(self):
|
||||
build_dir = self._get_build_dir()
|
||||
shutil.rmtree(build_dir)
|
||||
|
||||
@abstractmethod
|
||||
def _get_revision(self, entries):
|
||||
pass
|
||||
raise NotImplementedError
|
||||
|
||||
@abstractmethod
|
||||
def _get_attribution(self, entries):
|
||||
pass
|
||||
|
||||
|
||||
class _JitenonExporter(Exporter):
|
||||
def _get_revision(self, entries):
|
||||
modified_date = None
|
||||
for entry in entries:
|
||||
if modified_date is None or entry.modified_date > modified_date:
|
||||
modified_date = entry.modified_date
|
||||
revision = modified_date.strftime("%Y年%m月%d日閲覧")
|
||||
return revision
|
||||
|
||||
def _get_attribution(self, entries):
|
||||
modified_date = None
|
||||
for entry in entries:
|
||||
if modified_date is None or entry.modified_date > modified_date:
|
||||
attribution = entry.attribution
|
||||
return attribution
|
||||
|
||||
|
||||
class JitenonKokugoExporter(_JitenonExporter):
|
||||
pass
|
||||
|
||||
|
||||
class JitenonYojiExporter(_JitenonExporter):
|
||||
pass
|
||||
|
||||
|
||||
class JitenonKotowazaExporter(_JitenonExporter):
|
||||
pass
|
||||
|
||||
|
||||
class _MonokakidoExporter(Exporter):
|
||||
def _get_revision(self, entries):
|
||||
timestamp = datetime.now().strftime("%Y年%m月%d日作成")
|
||||
return timestamp
|
||||
|
||||
|
||||
class Smk8Exporter(_MonokakidoExporter):
|
||||
def _get_attribution(self, entries):
|
||||
return "© Sanseido Co., LTD. 2020"
|
||||
|
||||
|
||||
class Daijirin2Exporter(_MonokakidoExporter):
|
||||
def _get_attribution(self, entries):
|
||||
return "© Sanseido Co., LTD. 2019"
|
||||
raise NotImplementedError
|
bot/mdict/exporters/base/jitenon.py (new file, 18 lines)
@@ -0,0 +1,18 @@
from bot.mdict.exporters.base.exporter import BaseExporter


class JitenonExporter(BaseExporter):
    def _get_revision(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                modified_date = entry.modified_date
        revision = modified_date.strftime("%Y年%m月%d日閲覧")
        return revision

    def _get_attribution(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                attribution = entry.attribution
        return attribution
|
bot/mdict/exporters/base/monokakido.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from datetime import datetime
from bot.mdict.exporters.base.exporter import BaseExporter


class MonokakidoExporter(BaseExporter):
    def _get_revision(self, entries):
        timestamp = datetime.now().strftime("%Y年%m月%d日作成")
        return timestamp
|
bot/mdict/exporters/daijirin2.py (new file, 6 lines)
@@ -0,0 +1,6 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2019"
|
|
@@ -1,18 +0,0 @@ (deleted file)
from bot.targets import Targets

from bot.mdict.exporters.export import JitenonKokugoExporter
from bot.mdict.exporters.export import JitenonYojiExporter
from bot.mdict.exporters.export import JitenonKotowazaExporter
from bot.mdict.exporters.export import Smk8Exporter
from bot.mdict.exporters.export import Daijirin2Exporter


def new_mdict_exporter(target):
    exporter_map = {
        Targets.JITENON_KOKUGO: JitenonKokugoExporter,
        Targets.JITENON_YOJI: JitenonYojiExporter,
        Targets.JITENON_KOTOWAZA: JitenonKotowazaExporter,
        Targets.SMK8: Smk8Exporter,
        Targets.DAIJIRIN2: Daijirin2Exporter,
    }
    return exporter_map[target](target)
|
bot/mdict/exporters/jitenon_kokugo.py (new file, 5 lines)
@@ -0,0 +1,5 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass

bot/mdict/exporters/jitenon_kotowaza.py (new file, 5 lines)
@@ -0,0 +1,5 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass

bot/mdict/exporters/jitenon_yoji.py (new file, 5 lines)
@@ -0,0 +1,5 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass
|
bot/mdict/exporters/sankoku8.py (new file, 6 lines)
@@ -0,0 +1,6 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2021"

bot/mdict/exporters/smk8.py (new file, 6 lines)
@@ -0,0 +1,6 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2020"
|
137
bot/mdict/glossary/sankoku8.py
Normal file
137
bot/mdict/glossary/sankoku8.py
Normal file
|
@ -0,0 +1,137 @@
|
|||
import re
|
||||
from bs4 import BeautifulSoup
|
||||
from bot.data import load_mdict_name_conversion
|
||||
from bot.name_conversion import convert_names
|
||||
|
||||
|
||||
def make_glossary(entry, media_dir):
|
||||
soup = entry.get_page_soup()
|
||||
__reposition_marks(soup)
|
||||
__remove_appendix_links(soup)
|
||||
__convert_images(soup)
|
||||
__remove_links_without_href(soup)
|
||||
__convert_links(soup, entry)
|
||||
__add_parent_link(soup, entry)
|
||||
__add_homophone_links(soup, entry)
|
||||
|
||||
name_conversion = load_mdict_name_conversion(entry.target)
|
||||
convert_names(soup, name_conversion)
|
||||
|
||||
glossary = soup.span.decode()
|
||||
return glossary
|
||||
|
||||
|
||||
def __reposition_marks(soup):
|
||||
"""These 表外字マーク symbols will be converted to rubies later, so they need to
|
||||
be positioned after the corresponding text in order to appear correctly"""
|
||||
for elm in soup.find_all("表外字"):
|
||||
mark = elm.find("表外字マーク")
|
||||
elm.append(mark)
|
||||
for elm in soup.find_all("表外音訓"):
|
||||
mark = elm.find("表外音訓マーク")
|
||||
elm.append(mark)
|
||||
|
||||
|
||||
def __remove_appendix_links(soup):
|
||||
"""This info would be useful and nice to have, but jitenbot currently
|
||||
isn't designed to fetch and process these appendix files. It probably
|
||||
wouldn't be possible to include them in Yomichan, but it would definitely
|
||||
be possible for Mdict."""
|
||||
for elm in soup.find_all("a"):
|
||||
if not elm.has_attr("href"):
|
||||
continue
|
||||
if elm.attrs["href"].startswith("appendix"):
|
||||
elm.attrs["data-name"] = "a"
|
||||
elm.attrs["data-href"] = elm.attrs["href"]
|
||||
elm.name = "span"
|
||||
del elm.attrs["href"]
|
||||
|
||||
|
||||
def __convert_images(soup):
|
||||
conversions = [
|
||||
["svg-logo/重要語.svg", "*"],
|
||||
["svg-logo/最重要語.svg", "**"],
|
||||
["svg-logo/一般常識語.svg", "☆☆"],
|
||||
["svg-logo/追い込み.svg", ""],
|
||||
["svg-special/区切り線.svg", "|"],
|
||||
["svg-accent/平板.svg", "⎺"],
|
||||
["svg-accent/アクセント.svg", "⌝"],
|
||||
["svg-logo/アク.svg", "アク"],
|
||||
["svg-logo/丁寧.svg", "丁寧"],
|
||||
["svg-logo/可能.svg", "可能"],
|
||||
["svg-logo/尊敬.svg", "尊敬"],
|
||||
["svg-logo/接尾.svg", "接尾"],
|
||||
["svg-logo/接頭.svg", "接頭"],
|
||||
["svg-logo/表記.svg", "表記"],
|
||||
["svg-logo/謙譲.svg", "謙譲"],
|
||||
["svg-logo/区別.svg", "区別"],
|
||||
["svg-logo/由来.svg", "由来"],
|
||||
]
|
||||
for conversion in conversions:
|
||||
filename, text = conversion
|
||||
for elm in soup.find_all("img", attrs={"src": filename}):
|
||||
elm.attrs["data-name"] = elm.name
|
||||
elm.attrs["data-src"] = elm.attrs["src"]
|
||||
elm.name = "span"
|
||||
elm.string = text
|
||||
del elm.attrs["src"]
|
||||
|
||||
|
||||
def __remove_links_without_href(soup):
|
||||
for elm in soup.find_all("a"):
|
||||
if elm.has_attr("href"):
|
||||
continue
|
||||
elm.attrs["data-name"] = elm.name
|
||||
elm.name = "span"
|
||||
|
||||
|
||||
def __convert_links(soup, entry):
|
||||
for elm in soup.find_all("a"):
|
||||
href = elm.attrs["href"].split(" ")[0]
|
||||
if re.match(r"^#?[0-9]+(?:-[0-9A-F]{4})?$", href):
|
||||
href = href.removeprefix("#")
|
||||
ref_entry_id = entry.id_string_to_entry_id(href)
|
||||
if ref_entry_id in entry.ID_TO_ENTRY:
|
||||
ref_entry = entry.ID_TO_ENTRY[ref_entry_id]
|
||||
else:
|
||||
ref_entry = entry.ID_TO_ENTRY[(ref_entry_id[0], 0)]
|
||||
gid = ref_entry.get_global_identifier()
|
||||
elm.attrs["href"] = f"entry://{gid}"
|
||||
elif re.match(r"^entry:", href):
|
||||
pass
|
||||
elif re.match(r"^https?:[\w\W]*", href):
|
||||
pass
|
||||
else:
|
||||
raise Exception(f"Invalid href format: {href}")
|
||||
|
||||
|
||||
def __add_parent_link(soup, entry):
|
||||
elm = soup.find("親見出相当部")
|
||||
if elm is not None:
|
||||
parent_entry = entry.get_parent()
|
||||
gid = parent_entry.get_global_identifier()
|
||||
elm.attrs["href"] = f"entry://{gid}"
|
||||
elm.attrs["data-name"] = elm.name
|
||||
elm.name = "a"
|
||||
|
||||
|
||||
def __add_homophone_links(soup, entry):
|
||||
forward_link = ["←", entry.entry_id[0] + 1]
|
||||
backward_link = ["→", entry.entry_id[0] - 1]
|
||||
homophone_info_list = [
|
||||
["svg-logo/homophone1.svg", [forward_link]],
|
||||
["svg-logo/homophone2.svg", [forward_link, backward_link]],
|
||||
["svg-logo/homophone3.svg", [backward_link]],
|
||||
]
|
||||
for homophone_info in homophone_info_list:
|
||||
filename, link_info = homophone_info
|
||||
for elm in soup.find_all("img", attrs={"src": filename}):
|
||||
for info in link_info:
|
||||
text, link_id = info
|
||||
link_entry = entry.ID_TO_ENTRY[(link_id, 0)]
|
||||
gid = link_entry.get_global_identifier()
|
||||
link = BeautifulSoup("<a/>", "xml").a
|
||||
link.string = text
|
||||
link.attrs["href"] = f"entry://{gid}"
|
||||
elm.append(link)
|
||||
elm.unwrap()
|
bot/mdict/terms/base/jitenon.py (new file, 20 lines)
@@ -0,0 +1,20 @@
from bot.mdict.terms.base.terminator import BaseTerminator


class JitenonTerminator(BaseTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = None

    def _glossary(self, entry):
        if entry.entry_id in self._glossary_cache:
            return self._glossary_cache[entry.entry_id]
        glossary = self._glossary_maker.make_glossary(entry, self._media_dir)
        self._glossary_cache[entry.entry_id] = glossary
        return glossary

    def _link_glossary_parameters(self, entry):
        return []

    def _subentry_lists(self, entry):
        return []
|
|
@ -1,7 +1,8 @@
|
|||
import re
|
||||
from abc import abstractmethod, ABC
|
||||
|
||||
|
||||
class Terminator(ABC):
|
||||
class BaseTerminator(ABC):
|
||||
def __init__(self, target):
|
||||
self._target = target
|
||||
self._glossary_cache = {}
|
||||
|
@ -12,35 +13,20 @@ class Terminator(ABC):
|
|||
|
||||
def make_terms(self, entry):
|
||||
gid = entry.get_global_identifier()
|
||||
glossary = self.__full_glossary(entry)
|
||||
glossary = self.__get_full_glossary(entry)
|
||||
terms = [[gid, glossary]]
|
||||
keys = set()
|
||||
headwords = entry.get_headwords()
|
||||
for reading, expressions in headwords.items():
|
||||
if len(expressions) == 0:
|
||||
keys.add(reading)
|
||||
for expression in expressions:
|
||||
if expression.strip() == "":
|
||||
keys.add(reading)
|
||||
continue
|
||||
keys.add(expression)
|
||||
if reading.strip() == "":
|
||||
continue
|
||||
if reading != expression:
|
||||
keys.add(f"{reading}【{expression}】")
|
||||
else:
|
||||
keys.add(reading)
|
||||
keys = self.__get_keys(entry)
|
||||
link = f"@@@LINK={gid}"
|
||||
for key in keys:
|
||||
if key.strip() != "":
|
||||
terms.append([key, link])
|
||||
for subentries in self._subentry_lists(entry):
|
||||
for subentry in subentries:
|
||||
for subentry_list in self._subentry_lists(entry):
|
||||
for subentry in subentry_list:
|
||||
for term in self.make_terms(subentry):
|
||||
terms.append(term)
|
||||
return terms
|
||||
|
||||
def __full_glossary(self, entry):
|
||||
def __get_full_glossary(self, entry):
|
||||
glossary = []
|
||||
style_link = f"<link rel='stylesheet' href='{self._target.value}.css' type='text/css'>"
|
||||
glossary.append(style_link)
|
||||
|
@ -60,14 +46,38 @@ class Terminator(ABC):
|
|||
glossary.append(link_glossary)
|
||||
return "\n".join(glossary)
|
||||
|
||||
def __get_keys(self, entry):
|
||||
keys = set()
|
||||
headwords = entry.get_headwords()
|
||||
for reading, expressions in headwords.items():
|
||||
stripped_reading = reading.strip()
|
||||
keys.add(stripped_reading)
|
||||
if re.match(r"^[ぁ-ヿ、]+$", stripped_reading):
|
||||
kana_only_key = f"{stripped_reading}【∅】"
|
||||
else:
|
||||
kana_only_key = ""
|
||||
if len(expressions) == 0:
|
||||
keys.add(kana_only_key)
|
||||
for expression in expressions:
|
||||
stripped_expression = expression.strip()
|
||||
keys.add(stripped_expression)
|
||||
if stripped_expression == "":
|
||||
keys.add(kana_only_key)
|
||||
elif stripped_expression == stripped_reading:
|
||||
keys.add(kana_only_key)
|
||||
else:
|
||||
combo_key = f"{stripped_reading}【{stripped_expression}】"
|
||||
keys.add(combo_key)
|
||||
return keys
|
||||
|
||||
@abstractmethod
|
||||
def _glossary(self, entry):
|
||||
pass
|
||||
raise NotImplementedError
|
||||
|
||||
@abstractmethod
|
||||
def _link_glossary_parameters(self, entry):
|
||||
pass
|
||||
raise NotImplementedError
|
||||
|
||||
@abstractmethod
|
||||
def _subentry_lists(self, entry):
|
||||
pass
|
||||
raise NotImplementedError
|
|
@@ -1,8 +1,8 @@
-from bot.mdict.terms.terminator import Terminator
+from bot.mdict.terms.base.terminator import BaseTerminator
 from bot.mdict.glossary.daijirin2 import make_glossary


-class Daijirin2Terminator(Terminator):
+class Terminator(BaseTerminator):
     def _glossary(self, entry):
         if entry.entry_id in self._glossary_cache:
             return self._glossary_cache[entry.entry_id]
|
||||
|
|
|
@@ -1,18 +0,0 @@ (deleted file)
from bot.targets import Targets

from bot.mdict.terms.jitenon import JitenonKokugoTerminator
from bot.mdict.terms.jitenon import JitenonYojiTerminator
from bot.mdict.terms.jitenon import JitenonKotowazaTerminator
from bot.mdict.terms.smk8 import Smk8Terminator
from bot.mdict.terms.daijirin2 import Daijirin2Terminator


def new_terminator(target):
    terminator_map = {
        Targets.JITENON_KOKUGO: JitenonKokugoTerminator,
        Targets.JITENON_YOJI: JitenonYojiTerminator,
        Targets.JITENON_KOTOWAZA: JitenonKotowazaTerminator,
        Targets.SMK8: Smk8Terminator,
        Targets.DAIJIRIN2: Daijirin2Terminator,
    }
    return terminator_map[target](target)
|
|
@ -1,42 +0,0 @@
|
|||
from bot.mdict.terms.terminator import Terminator
|
||||
|
||||
from bot.mdict.glossary.jitenon import JitenonKokugoGlossary
|
||||
from bot.mdict.glossary.jitenon import JitenonYojiGlossary
|
||||
from bot.mdict.glossary.jitenon import JitenonKotowazaGlossary
|
||||
|
||||
|
||||
class JitenonTerminator(Terminator):
|
||||
def __init__(self, target):
|
||||
super().__init__(target)
|
||||
self._glossary_maker = None
|
||||
|
||||
def _glossary(self, entry):
|
||||
if entry.entry_id in self._glossary_cache:
|
||||
return self._glossary_cache[entry.entry_id]
|
||||
glossary = self._glossary_maker.make_glossary(entry, self._media_dir)
|
||||
self._glossary_cache[entry.entry_id] = glossary
|
||||
return glossary
|
||||
|
||||
def _link_glossary_parameters(self, entry):
|
||||
return []
|
||||
|
||||
def _subentry_lists(self, entry):
|
||||
return []
|
||||
|
||||
|
||||
class JitenonKokugoTerminator(JitenonTerminator):
|
||||
def __init__(self, target):
|
||||
super().__init__(target)
|
||||
self._glossary_maker = JitenonKokugoGlossary()
|
||||
|
||||
|
||||
class JitenonYojiTerminator(JitenonTerminator):
|
||||
def __init__(self, target):
|
||||
super().__init__(target)
|
||||
self._glossary_maker = JitenonYojiGlossary()
|
||||
|
||||
|
||||
class JitenonKotowazaTerminator(JitenonTerminator):
|
||||
def __init__(self, target):
|
||||
super().__init__(target)
|
||||
self._glossary_maker = JitenonKotowazaGlossary()
|
bot/mdict/terms/jitenon_kokugo.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonKokugoGlossary


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKokugoGlossary()

bot/mdict/terms/jitenon_kotowaza.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonKotowazaGlossary


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKotowazaGlossary()

bot/mdict/terms/jitenon_yoji.py (new file, 8 lines)
@@ -0,0 +1,8 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonYojiGlossary


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonYojiGlossary()
|
bot/mdict/terms/sankoku8.py (new file, 23 lines)
@@ -0,0 +1,23 @@
from bot.mdict.terms.base.terminator import BaseTerminator
from bot.mdict.glossary.sankoku8 import make_glossary


class Terminator(BaseTerminator):
    def _glossary(self, entry):
        if entry.entry_id in self._glossary_cache:
            return self._glossary_cache[entry.entry_id]
        glossary = make_glossary(entry, self._media_dir)
        self._glossary_cache[entry.entry_id] = glossary
        return glossary

    def _link_glossary_parameters(self, entry):
        return [
            [entry.children, "子項目"],
            [entry.phrases, "句項目"],
        ]

    def _subentry_lists(self, entry):
        return [
            entry.children,
            entry.phrases,
        ]
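
Combined with the shared `make_terms()` logic shown earlier in this diff, the terminator yields one glossary record per entry plus `@@@LINK` redirects for every headword key. A rough sketch of the resulting MDX term list; the identifier follows the `get_global_identifier()` format from the removed smk8.py above, and the concrete values are made up:

```python
# Hypothetical output of Terminator.make_terms(entry) for a single entry:
terms = [
    ["@sankoku8-001234-0000", "<link rel='stylesheet' href='sankoku8.css' ...> ...glossary..."],
    ["いぬ", "@@@LINK=@sankoku8-001234-0000"],        # reading key
    ["犬", "@@@LINK=@sankoku8-001234-0000"],          # expression key
    ["いぬ【犬】", "@@@LINK=@sankoku8-001234-0000"],  # reading【expression】 key
]
```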
|
|
@@ -1,8 +1,8 @@
-from bot.mdict.terms.terminator import Terminator
+from bot.mdict.terms.base.terminator import BaseTerminator
 from bot.mdict.glossary.smk8 import make_glossary


-class Smk8Terminator(Terminator):
+class Terminator(BaseTerminator):
     def _glossary(self, entry):
         if entry.entry_id in self._glossary_cache:
             return self._glossary_cache[entry.entry_id]
|
||||
|
|
|
@@ -7,3 +7,4 @@ class Targets(Enum):
     JITENON_KOTOWAZA = "jitenon-kotowaza"
     SMK8 = "smk8"
     DAIJIRIN2 = "daijirin2"
+    SANKOKU8 = "sankoku8"
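
Because the factory functions key module paths off `target.name.lower()`, this one-line enum addition is all the registration the new dictionary needs. For example, illustrative only:

```python
from bot.targets import Targets
from bot.factory import new_mdict_terminator

assert Targets.SANKOKU8.value == "sankoku8"
terminator = new_mdict_terminator(Targets.SANKOKU8)  # -> bot.mdict.terms.sankoku8.Terminator
```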
|
||||
|
|
bot/time.py (new file, 5 lines)
@@ -0,0 +1,5 @@
import time


def timestamp():
    return time.strftime('%X')
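
`timestamp()` simply wraps `time.strftime('%X')`, the locale's clock time, and is prepended to the progress messages used throughout the exporters, e.g.:

```python
from bot.time import timestamp

print(f"{timestamp()} Copying media files to build directory...")
```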
|

@@ -1,25 +1,27 @@
# pylint: disable=too-few-public-methods

import json
import os
import shutil
import copy
from pathlib import Path
from datetime import datetime
from abc import ABC, abstractmethod

import fastjsonschema
from platformdirs import user_documents_dir, user_cache_dir

from bot.time import timestamp
from bot.data import load_yomichan_metadata
from bot.yomichan.terms.factory import new_terminator
from bot.data import load_yomichan_term_schema
from bot.factory import new_yomichan_terminator


-class Exporter(ABC):
+class BaseExporter(ABC):
    def __init__(self, target):
        self._target = target
-        self._terminator = new_terminator(target)
+        self._terminator = new_yomichan_terminator(target)
        self._build_dir = None
        self._terms_per_file = 2000

-    def export(self, entries, image_dir):
+    def export(self, entries, image_dir, validate):
        self.__init_build_image_dir(image_dir)
        meta = load_yomichan_metadata()
        index = meta[self._target.value]["index"]

@@ -27,34 +29,45 @@ class Exporter(ABC):
        index["attribution"] = self._get_attribution(entries)
        tags = meta[self._target.value]["tags"]
        terms = self.__get_terms(entries)
        if validate:
            self.__validate_terms(terms)
        self.__make_dictionary(terms, index, tags)

    @abstractmethod
    def _get_revision(self, entries):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _get_attribution(self, entries):
-        pass
+        raise NotImplementedError

    def _get_build_dir(self):
        if self._build_dir is not None:
            return self._build_dir
        cache_dir = user_cache_dir("jitenbot")
        build_directory = os.path.join(cache_dir, "yomichan_build")
-        print(f"Initializing build directory `{build_directory}`")
+        print(f"{timestamp()} Initializing build directory `{build_directory}`")
        if Path(build_directory).is_dir():
            shutil.rmtree(build_directory)
        os.makedirs(build_directory)
        self._build_dir = build_directory
        return self._build_dir

    def __get_invalid_term_dir(self):
        cache_dir = user_cache_dir("jitenbot")
        log_dir = os.path.join(cache_dir, "invalid_yomichan_terms")
        if Path(log_dir).is_dir():
            shutil.rmtree(log_dir)
        os.makedirs(log_dir)
        return log_dir

    def __init_build_image_dir(self, image_dir):
        build_dir = self._get_build_dir()
        build_img_dir = os.path.join(build_dir, self._target.value)
        if image_dir is not None:
-            print("Copying media files to build directory...")
+            print(f"{timestamp()} Copying media files to build directory...")
            shutil.copytree(image_dir, build_img_dir)
            print(f"{timestamp()} Finished copying files")
        else:
            os.makedirs(build_img_dir)
        self._terminator.set_image_dir(build_img_dir)

@@ -63,7 +76,7 @@ class Exporter(ABC):
        terms = []
        entries_len = len(entries)
        for idx, entry in enumerate(entries):
-            update = f"Creating Yomichan terms for entry {idx+1}/{entries_len}"
+            update = f"\tCreating Yomichan terms for entry {idx+1}/{entries_len}"
            print(update, end='\r', flush=True)
            new_terms = self._terminator.make_terms(entry)
            for term in new_terms:

@@ -71,8 +84,29 @@ class Exporter(ABC):
        print()
        return terms

    def __validate_terms(self, terms):
        print(f"{timestamp()} Making a copy of term data for validation...")
        terms_copy = copy.deepcopy(terms)  # because validator will alter data!
        term_count = len(terms_copy)
        log_dir = self.__get_invalid_term_dir()
        schema = load_yomichan_term_schema()
        validator = fastjsonschema.compile(schema)
        failure_count = 0
        for idx, term in enumerate(terms_copy):
            update = f"\tValidating term {idx+1}/{term_count}"
            print(update, end='\r', flush=True)
            try:
                validator([term])
            except fastjsonschema.JsonSchemaException:
                failure_count += 1
                term_file = os.path.join(log_dir, f"{idx}.json")
                with open(term_file, "w", encoding='utf8') as f:
                    json.dump([term], f, indent=4, ensure_ascii=False)
        print(f"\n{timestamp()} Finished validating with {failure_count} error{'' if failure_count == 1 else 's'}")
        if failure_count > 0:
            print(f"{timestamp()} Invalid terms saved to `{log_dir}` for debugging")

    def __make_dictionary(self, terms, index, tags):
        print(f"Exporting {len(terms)} Yomichan terms...")
        self.__write_term_banks(terms)
        self.__write_index(index)
        self.__write_tag_bank(tags)

@@ -80,14 +114,18 @@ class Exporter(ABC):
        self.__rm_build_dir()

    def __write_term_banks(self, terms):
        print(f"{timestamp()} Exporting {len(terms)} JSON terms")
        build_dir = self._get_build_dir()
        max_i = int(len(terms) / self._terms_per_file) + 1
        for i in range(max_i):
            update = f"\tWriting terms to term bank {i+1}/{max_i}"
            print(update, end='\r', flush=True)
+            start = self._terms_per_file * i
+            end = self._terms_per_file * (i + 1)
            term_file = os.path.join(build_dir, f"term_bank_{i+1}.json")
            with open(term_file, "w", encoding='utf8') as f:
-                start = self._terms_per_file * i
-                end = self._terms_per_file * (i + 1)
                json.dump(terms[start:end], f, indent=4, ensure_ascii=False)
        print()

    def __write_index(self, index):
        build_dir = self._get_build_dir()

@@ -105,6 +143,7 @@ class Exporter(ABC):

    def __write_archive(self, filename):
        archive_format = "zip"
        print(f"{timestamp()} Archiving data to {archive_format.upper()} file...")
        out_dir = os.path.join(user_documents_dir(), "jitenbot", "yomichan")
        if not Path(out_dir).is_dir():
            os.makedirs(out_dir)

@@ -115,55 +154,8 @@ class Exporter(ABC):
        base_filename = os.path.join(out_dir, filename)
        build_dir = self._get_build_dir()
        shutil.make_archive(base_filename, archive_format, build_dir)
-        print(f"Dictionary file saved to {out_filepath}")
+        print(f"{timestamp()} Dictionary file saved to `{out_filepath}`")

    def __rm_build_dir(self):
        build_dir = self._get_build_dir()
        shutil.rmtree(build_dir)


class _JitenonExporter(Exporter):
    def _get_revision(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                modified_date = entry.modified_date
        revision = f"{self._target.value};{modified_date}"
        return revision

    def _get_attribution(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                attribution = entry.attribution
        return attribution


class JitenonKokugoExporter(_JitenonExporter):
    pass


class JitenonYojiExporter(_JitenonExporter):
    pass


class JitenonKotowazaExporter(_JitenonExporter):
    pass


class Smk8Exporter(Exporter):
    def _get_revision(self, entries):
        timestamp = datetime.now().strftime("%Y-%m-%d")
        return f"{self._target.value};{timestamp}"

    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2020"


class Daijirin2Exporter(Exporter):
    def _get_revision(self, entries):
        timestamp = datetime.now().strftime("%Y-%m-%d")
        return f"{self._target.value};{timestamp}"

    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2019"
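
For orientation, a minimal sketch of how one of the per-dictionary exporter classes defined in the files below is meant to be driven after this split. The `entries` and `image_dir` values are assumptions here; they are produced by the crawling and processing steps elsewhere in the project.

```
from bot.targets import Targets
from bot.yomichan.exporters.sankoku8 import Exporter

# `entries` and `image_dir` come from earlier pipeline stages (illustrative only).
exporter = Exporter(Targets.SANKOKU8)
exporter.export(entries, image_dir, validate=True)
```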

bot/yomichan/exporters/base/jitenon.py (new file, 18 lines)
from bot.yomichan.exporters.base.exporter import BaseExporter


class JitenonExporter(BaseExporter):
    def _get_revision(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                modified_date = entry.modified_date
        revision = f"{self._target.value};{modified_date}"
        return revision

    def _get_attribution(self, entries):
        modified_date = None
        for entry in entries:
            if modified_date is None or entry.modified_date > modified_date:
                attribution = entry.attribution
        return attribution

bot/yomichan/exporters/base/monokakido.py (new file, 8 lines)
from datetime import datetime
from bot.yomichan.exporters.base.exporter import BaseExporter


class MonokakidoExporter(BaseExporter):
    def _get_revision(self, entries):
        timestamp = datetime.now().strftime("%Y-%m-%d")
        return f"{self._target.value};{timestamp}"

bot/yomichan/exporters/daijirin2.py (new file, 6 lines)
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2019"

@@ -1,18 +0,0 @@
from bot.targets import Targets

from bot.yomichan.exporters.export import JitenonKokugoExporter
from bot.yomichan.exporters.export import JitenonYojiExporter
from bot.yomichan.exporters.export import JitenonKotowazaExporter
from bot.yomichan.exporters.export import Smk8Exporter
from bot.yomichan.exporters.export import Daijirin2Exporter


def new_yomi_exporter(target):
    exporter_map = {
        Targets.JITENON_KOKUGO: JitenonKokugoExporter,
        Targets.JITENON_YOJI: JitenonYojiExporter,
        Targets.JITENON_KOTOWAZA: JitenonKotowazaExporter,
        Targets.SMK8: Smk8Exporter,
        Targets.DAIJIRIN2: Daijirin2Exporter,
    }
    return exporter_map[target](target)

bot/yomichan/exporters/jitenon_kokugo.py (new file, 5 lines)
from bot.yomichan.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass

bot/yomichan/exporters/jitenon_kotowaza.py (new file, 5 lines)
from bot.yomichan.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass

bot/yomichan/exporters/jitenon_yoji.py (new file, 5 lines)
from bot.yomichan.exporters.base.jitenon import JitenonExporter


class Exporter(JitenonExporter):
    pass

bot/yomichan/exporters/sankoku8.py (new file, 6 lines)
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2021"

bot/yomichan/exporters/smk8.py (new file, 6 lines)
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter


class Exporter(MonokakidoExporter):
    def _get_attribution(self, entries):
        return "© Sanseido Co., LTD. 2020"

@@ -1,9 +1,10 @@
import re
import os
from bs4 import BeautifulSoup
from functools import cache
from pathlib import Path

from bs4 import BeautifulSoup

import bot.yomichan.glossary.icons as Icons
from bot.soup import delete_soup_nodes
from bot.data import load_yomichan_name_conversion

@@ -111,8 +112,8 @@ def __convert_gaiji(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -150,8 +151,8 @@ def __convert_logos(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -174,8 +175,8 @@ def __convert_kanjion_logos(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -198,8 +199,8 @@ def __convert_daigoginum(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -222,8 +223,8 @@ def __convert_jundaigoginum(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -76,6 +76,7 @@ def __get_attributes(attrs):


def __get_style(inline_style_string):
+    # pylint: disable=no-member
    style = {}
    parsed_style = parseStyle(inline_style_string)
    if parsed_style.fontStyle != "":

@@ -100,7 +101,7 @@ def __get_style(inline_style_string):
        "marginLeft": parsed_style.marginLeft,
    }
    for key, val in margins.items():
-        m = re.search(r"(\d+(\.\d*)?|\.\d+)em", val)
+        m = re.search(r"(-?\d+(\.\d*)?|-?\.\d+)em", val)
        if m:
            style[key] = float(m.group(1))
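
The margin-regex change above is easy to miss: the old pattern still matched a negative value such as `-0.5em`, but dropped the sign when capturing, while the new pattern preserves it. A minimal check:

```
import re

old = r"(\d+(\.\d*)?|\.\d+)em"
new = r"(-?\d+(\.\d*)?|-?\.\d+)em"

val = "-0.5em"
print(float(re.search(old, val).group(1)))  # 0.5  (sign lost)
print(float(re.search(new, val).group(1)))  # -0.5 (negative margin preserved)
```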

@@ -26,6 +26,27 @@ def make_monochrome_fill_rectangle(path, text):
        f.write(svg)


@cache
def make_accent(path):
    svg = __svg_accent()
    with open(path, "w", encoding="utf-8") as f:
        f.write(svg)


@cache
def make_heiban(path):
    svg = __svg_heiban()
    with open(path, "w", encoding="utf-8") as f:
        f.write(svg)


@cache
def make_red_char(path, char):
    svg = __svg_red_character(char)
    with open(path, "w", encoding="utf-8") as f:
        f.write(svg)


def __calculate_svg_ratio(path):
    with open(path, "r", encoding="utf-8") as f:
        xml = f.read()

@@ -82,3 +103,30 @@ def __svg_masked_rectangle(text):
            fill='black' mask='url(#a)'/>
    </svg>"""
    return svg.strip()


def __svg_heiban():
    svg = f"""
    <svg viewBox='0 0 210 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
    <rect width='210' height='30' fill='red'/>
    </svg>"""
    return svg.strip()


def __svg_accent():
    svg = f"""
    <svg viewBox='0 0 150 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
    <rect width='150' height='30' fill='red'/>
    <rect width='30' height='150' x='120' fill='red'/>
    </svg>"""
    return svg.strip()


def __svg_red_character(char):
    svg = f"""
    <svg viewBox='0 0 300 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
    <text text-anchor='middle' x='50%' y='50%' dy='.37em'
          font-family='sans-serif' font-size='300px'
          fill='red'>{char}</text>
    </svg>"""
    return svg.strip()

@@ -118,8 +118,8 @@ class JitenonKokugoGlossary(JitenonGlossary):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

bot/yomichan/glossary/sankoku8.py (new file, 344 lines)
import re
import os
from bs4 import BeautifulSoup

import bot.yomichan.glossary.icons as Icons
from bot.data import load_yomichan_name_conversion
from bot.yomichan.glossary.gloss import make_gloss
from bot.name_conversion import convert_names


def make_glossary(entry, media_dir):
    soup = entry.get_page_soup()
    __remove_glyph_styles(soup)
    __reposition_marks(soup)
    __remove_links_without_href(soup)
    __remove_appendix_links(soup)
    __convert_links(soup, entry)
    __add_parent_link(soup, entry)
    __add_homophone_links(soup, entry)
    __convert_images_to_text(soup)
    __text_parens_to_images(soup, media_dir)
    __replace_icons(soup, media_dir)
    __replace_accent_symbols(soup, media_dir)
    __convert_gaiji(soup, media_dir)
    __convert_graphics(soup, media_dir)
    __convert_number_icons(soup, media_dir)

    name_conversion = load_yomichan_name_conversion(entry.target)
    convert_names(soup, name_conversion)

    gloss = make_gloss(soup.span)
    glossary = [gloss]
    return glossary


def __remove_glyph_styles(soup):
    """The css_parser library will emit annoying warning messages
    later if it sees these glyph character styles"""
    for elm in soup.find_all("glyph"):
        if elm.has_attr("style"):
            elm["data-style"] = elm.attrs["style"]
            del elm.attrs["style"]


def __reposition_marks(soup):
    """These マーク symbols will be converted to rubies later, so they need to
    be positioned after the corresponding text in order to appear correctly"""
    for elm in soup.find_all("表外字"):
        mark = elm.find("表外字マーク")
        elm.append(mark)
    for elm in soup.find_all("表外音訓"):
        mark = elm.find("表外音訓マーク")
        elm.append(mark)


def __remove_links_without_href(soup):
    for elm in soup.find_all("a"):
        if elm.has_attr("href"):
            continue
        elm.attrs["data-name"] = elm.name
        elm.name = "span"


def __remove_appendix_links(soup):
    for elm in soup.find_all("a"):
        if elm.attrs["href"].startswith("appendix"):
            elm.unwrap()


def __convert_links(soup, entry):
    for elm in soup.find_all("a"):
        href = elm.attrs["href"].split(" ")[0]
        href = href.removeprefix("#")
        if not re.match(r"^[0-9]+(?:-[0-9A-F]{4})?$", href):
            raise Exception(f"Invalid href format: {href}")
        ref_entry_id = entry.id_string_to_entry_id(href)
        if ref_entry_id in entry.ID_TO_ENTRY:
            ref_entry = entry.ID_TO_ENTRY[ref_entry_id]
        else:
            ref_entry = entry.ID_TO_ENTRY[(ref_entry_id[0], 0)]
        expression = ref_entry.get_first_expression()
        elm.attrs["href"] = f"?query={expression}&wildcards=off"


def __add_parent_link(soup, entry):
    elm = soup.find("親見出相当部")
    if elm is not None:
        parent_entry = entry.get_parent()
        expression = parent_entry.get_first_expression()
        elm.attrs["href"] = f"?query={expression}&wildcards=off"
        elm.name = "a"


def __add_homophone_links(soup, entry):
    forward_link = ["←", entry.entry_id[0] + 1]
    backward_link = ["→", entry.entry_id[0] - 1]
    homophone_info_list = [
        ["svg-logo/homophone1.svg", [forward_link]],
        ["svg-logo/homophone2.svg", [forward_link, backward_link]],
        ["svg-logo/homophone3.svg", [backward_link]],
    ]
    for homophone_info in homophone_info_list:
        filename, link_info = homophone_info
        for elm in soup.find_all("img", attrs={"src": filename}):
            for info in link_info:
                text, link_id = info
                link_entry = entry.ID_TO_ENTRY[(link_id, 0)]
                expression = link_entry.get_first_expression()
                link = BeautifulSoup("<a/>", "xml").a
                link.string = text
                link.attrs["href"] = f"?query={expression}&wildcards=off"
                elm.append(link)
            elm.unwrap()


def __convert_images_to_text(soup):
    conversions = [
        ["svg-logo/重要語.svg", "*", "vertical-align: super; font-size: 0.6em"],
        ["svg-logo/最重要語.svg", "**", "vertical-align: super; font-size: 0.6em"],
        ["svg-logo/一般常識語.svg", "☆☆", "vertical-align: super; font-size: 0.6em"],
        ["svg-logo/追い込み.svg", "", ""],
        ["svg-special/区切り線.svg", "|", ""],
    ]
    for conversion in conversions:
        filename, text, style = conversion
        for elm in soup.find_all("img", attrs={"src": filename}):
            if text == "":
                elm.unwrap()
                continue
            if style != "":
                elm.attrs["style"] = style
            elm.attrs["data-name"] = elm.name
            elm.attrs["data-src"] = elm.attrs["src"]
            elm.name = "span"
            elm.string = text
            del elm.attrs["src"]


def __text_parens_to_images(soup, media_dir):
    for elm in soup.find_all("red"):
        char = elm.text
        if char not in ["(", ")"]:
            continue
        filename = f"red_{char}.svg"
        path = os.path.join(media_dir, filename)
        Icons.make_red_char(path, char)
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
            "height": 1.0,
            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,
            "background": False,
            "appearance": "auto",
            "path": f"{os.path.basename(media_dir)}/{filename}",
        }
        elm.attrs["data-name"] = elm.name
        elm.name = "span"
        elm.string = ""
        elm.append(img)
        elm.attrs["style"] = "vertical-align: text-bottom;"


def __replace_icons(soup, media_dir):
    cls_to_appearance = {
        "default": "monochrome",
        "fill": "monochrome",
        "red": "auto",
        "redfill": "auto",
        "none": "monochrome",
    }
    icon_info_list = [
        ["svg-logo/アク.svg", "アク", "default"],
        ["svg-logo/丁寧.svg", "丁寧", "default"],
        ["svg-logo/可能.svg", "可能", "default"],
        ["svg-logo/尊敬.svg", "尊敬", "default"],
        ["svg-logo/接尾.svg", "接尾", "default"],
        ["svg-logo/接頭.svg", "接頭", "default"],
        ["svg-logo/表記.svg", "表記", "default"],
        ["svg-logo/謙譲.svg", "謙譲", "default"],
        ["svg-logo/区別.svg", "区別", "redfill"],
        ["svg-logo/由来.svg", "由来", "redfill"],
        ["svg-logo/人.svg", "", "none"],
        ["svg-logo/他.svg", "", "none"],
        ["svg-logo/動.svg", "", "none"],
        ["svg-logo/名.svg", "", "none"],
        ["svg-logo/句.svg", "", "none"],
        ["svg-logo/派.svg", "", "none"],
        ["svg-logo/自.svg", "", "none"],
        ["svg-logo/連.svg", "", "none"],
        ["svg-logo/造.svg", "", "none"],
        ["svg-logo/造2.svg", "", "none"],
        ["svg-logo/造3.svg", "", "none"],
        ["svg-logo/百科.svg", "", "none"],
    ]
    for icon_info in icon_info_list:
        src, text, cls = icon_info
        for elm in soup.find_all("img", attrs={"src": src}):
            path = media_dir
            for part in src.split("/"):
                path = os.path.join(path, part)
            __make_rectangle(path, text, cls)
            ratio = Icons.calculate_ratio(path)
            img = BeautifulSoup("<img/>", "xml").img
            img.attrs = {
                "height": 1.0,
                "width": ratio,
                "sizeUnits": "em",
                "collapsible": False,
                "collapsed": False,
                "background": False,
                "appearance": cls_to_appearance[cls],
                "title": elm.attrs["alt"] if elm.has_attr("alt") else "",
                "path": f"{os.path.basename(media_dir)}/{src}",
            }
            elm.name = "span"
            elm.clear()
            elm.append(img)
            elm.attrs["style"] = "vertical-align: text-bottom; margin-right: 0.25em;"


def __replace_accent_symbols(soup, media_dir):
    accent_info_list = [
        ["svg-accent/平板.svg", Icons.make_heiban],
        ["svg-accent/アクセント.svg", Icons.make_accent],
    ]
    for info in accent_info_list:
        src, write_svg_function = info
        for elm in soup.find_all("img", attrs={"src": src}):
            path = media_dir
            for part in src.split("/"):
                path = os.path.join(path, part)
            write_svg_function(path)
            ratio = Icons.calculate_ratio(path)
            img = BeautifulSoup("<img/>", "xml").img
            img.attrs = {
                "height": 1.0,
                "width": ratio,
                "sizeUnits": "em",
                "collapsible": False,
                "collapsed": False,
                "background": False,
                "appearance": "auto",
                "path": f"{os.path.basename(media_dir)}/{src}",
            }
            elm.name = "span"
            elm.clear()
            elm.append(img)
            elm.attrs["style"] = "vertical-align: super; margin-left: -0.25em;"


def __convert_gaiji(soup, media_dir):
    for elm in soup.find_all("img"):
        if not elm.has_attr("src"):
            continue
        src = elm.attrs["src"]
        if src.startswith("graphics"):
            continue
        path = media_dir
        for part in src.split("/"):
            if part.strip() == "":
                continue
            path = os.path.join(path, part)
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
            "height": 1.0,
            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,
            "background": False,
            "appearance": "monochrome",
            "title": elm.attrs["alt"] if elm.has_attr("alt") else "",
            "path": f"{os.path.basename(media_dir)}/{src}",
        }
        elm.name = "span"
        elm.clear()
        elm.append(img)
        elm.attrs["style"] = "vertical-align: text-bottom;"


def __convert_graphics(soup, media_dir):
    for elm in soup.find_all("img"):
        if not elm.has_attr("src"):
            continue
        src = elm.attrs["src"]
        if not src.startswith("graphics"):
            continue
        elm.attrs = {
            "collapsible": True,
            "collapsed": True,
            "title": elm.attrs["alt"] if elm.has_attr("alt") else "",
            "path": f"{os.path.basename(media_dir)}/{src}",
            "src": src,
        }


def __convert_number_icons(soup, media_dir):
    for elm in soup.find_all("大語義番号"):
        if elm.find_parent("a") is None:
            filename = f"{elm.text}-fill.svg"
            appearance = "monochrome"
            path = os.path.join(media_dir, filename)
            __make_rectangle(path, elm.text, "fill")
        else:
            filename = f"{elm.text}-bluefill.svg"
            appearance = "auto"
            path = os.path.join(media_dir, filename)
            __make_rectangle(path, elm.text, "bluefill")
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
            "height": 1.0,
            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,
            "background": False,
            "appearance": appearance,
            "title": elm.text,
            "path": f"{os.path.basename(media_dir)}/{filename}",
        }
        elm.name = "span"
        elm.clear()
        elm.append(img)
        elm.attrs["style"] = "vertical-align: text-bottom; margin-right: 0.25em;"


def __make_rectangle(path, text, cls):
    if cls == "none":
        pass
    elif cls == "fill":
        Icons.make_monochrome_fill_rectangle(path, text)
    elif cls == "red":
        Icons.make_rectangle(path, text, "red", "white", "red")
    elif cls == "redfill":
        Icons.make_rectangle(path, text, "red", "red", "white")
    elif cls == "bluefill":
        Icons.make_rectangle(path, text, "blue", "blue", "white")
    else:
        Icons.make_rectangle(path, text, "black", "transparent", "black")

@@ -92,8 +92,8 @@ def __convert_gaiji(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

@@ -124,8 +124,8 @@ def __convert_rectangles(soup, image_dir):
        ratio = Icons.calculate_ratio(path)
        img = BeautifulSoup("<img/>", "xml").img
        img.attrs = {
-            "height": 1.0 if ratio > 1.0 else ratio,
-            "width": ratio if ratio > 1.0 else 1.0,
+            "height": 1.0,
+            "width": ratio,
            "sizeUnits": "em",
            "collapsible": False,
            "collapsed": False,

bot/yomichan/terms/base/jitenon.py (new file, 26 lines)
from bot.yomichan.terms.base.terminator import BaseTerminator


class JitenonTerminator(BaseTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = None

    def _definition_tags(self, entry):
        return None

    def _glossary(self, entry):
        if entry.entry_id in self._glossary_cache:
            return self._glossary_cache[entry.entry_id]
        glossary = self._glossary_maker.make_glossary(entry, self._image_dir)
        self._glossary_cache[entry.entry_id] = glossary
        return glossary

    def _sequence(self, entry):
        return entry.entry_id

    def _link_glossary_parameters(self, entry):
        return []

    def _subentry_lists(self, entry):
        return []

@@ -2,7 +2,7 @@ from abc import abstractmethod, ABC
from bot.data import load_yomichan_inflection_categories


-class Terminator(ABC):
+class BaseTerminator(ABC):
    def __init__(self, target):
        self._target = target
        self._glossary_cache = {}

@@ -66,28 +66,28 @@ class Terminator(ABC):

    @abstractmethod
    def _definition_tags(self, entry):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _inflection_rules(self, entry, expression):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _glossary(self, entry):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _sequence(self, entry):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _term_tags(self, entry):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _link_glossary_parameters(self, entry):
-        pass
+        raise NotImplementedError

    @abstractmethod
    def _subentry_lists(self, entry):
-        pass
+        raise NotImplementedError

@@ -1,14 +1,10 @@
-from bot.entries.daijirin2 import Daijirin2PhraseEntry as PhraseEntry
-
-from bot.yomichan.terms.terminator import Terminator
+from bot.entries.daijirin2.phrase_entry import PhraseEntry
+from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.yomichan.glossary.daijirin2 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules


-class Daijirin2Terminator(Terminator):
-    def __init__(self, target):
-        super().__init__(target)
-
+class Terminator(BaseTerminator):
    def _definition_tags(self, entry):
        return ""

@@ -1,18 +0,0 @@
from bot.targets import Targets

from bot.yomichan.terms.jitenon import JitenonKokugoTerminator
from bot.yomichan.terms.jitenon import JitenonYojiTerminator
from bot.yomichan.terms.jitenon import JitenonKotowazaTerminator
from bot.yomichan.terms.smk8 import Smk8Terminator
from bot.yomichan.terms.daijirin2 import Daijirin2Terminator


def new_terminator(target):
    terminator_map = {
        Targets.JITENON_KOKUGO: JitenonKokugoTerminator,
        Targets.JITENON_YOJI: JitenonYojiTerminator,
        Targets.JITENON_KOTOWAZA: JitenonKotowazaTerminator,
        Targets.SMK8: Smk8Terminator,
        Targets.DAIJIRIN2: Daijirin2Terminator,
    }
    return terminator_map[target](target)

@@ -1,68 +0,0 @@
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.terms.terminator import Terminator

from bot.yomichan.glossary.jitenon import JitenonKokugoGlossary
from bot.yomichan.glossary.jitenon import JitenonYojiGlossary
from bot.yomichan.glossary.jitenon import JitenonKotowazaGlossary


class JitenonTerminator(Terminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = None

    def _definition_tags(self, entry):
        return None

    def _glossary(self, entry):
        if entry.entry_id in self._glossary_cache:
            return self._glossary_cache[entry.entry_id]
        glossary = self._glossary_maker.make_glossary(entry, self._image_dir)
        self._glossary_cache[entry.entry_id] = glossary
        return glossary

    def _sequence(self, entry):
        return entry.entry_id

    def _link_glossary_parameters(self, entry):
        return []

    def _subentry_lists(self, entry):
        return []


class JitenonKokugoTerminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKokugoGlossary()

    def _inflection_rules(self, entry, expression):
        return sudachi_rules(expression)

    def _term_tags(self, entry):
        return ""


class JitenonYojiTerminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonYojiGlossary()

    def _inflection_rules(self, entry, expression):
        return ""

    def _term_tags(self, entry):
        tags = entry.kanken_level.split("/")
        return " ".join(tags)


class JitenonKotowazaTerminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKotowazaGlossary()

    def _inflection_rules(self, entry, expression):
        return sudachi_rules(expression)

    def _term_tags(self, entry):
        return ""

bot/yomichan/terms/jitenon_kokugo.py (new file, 15 lines)
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.glossary.jitenon import JitenonKokugoGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKokugoGlossary()

    def _inflection_rules(self, entry, expression):
        return sudachi_rules(expression)

    def _term_tags(self, entry):
        return ""

bot/yomichan/terms/jitenon_kotowaza.py (new file, 15 lines)
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.glossary.jitenon import JitenonKotowazaGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonKotowazaGlossary()

    def _inflection_rules(self, entry, expression):
        return sudachi_rules(expression)

    def _term_tags(self, entry):
        return ""

bot/yomichan/terms/jitenon_yoji.py (new file, 15 lines)
from bot.yomichan.glossary.jitenon import JitenonYojiGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator


class Terminator(JitenonTerminator):
    def __init__(self, target):
        super().__init__(target)
        self._glossary_maker = JitenonYojiGlossary()

    def _inflection_rules(self, entry, expression):
        return ""

    def _term_tags(self, entry):
        tags = entry.kanken_level.split("/")
        return " ".join(tags)

bot/yomichan/terms/sankoku8.py (new file, 43 lines)
from bot.entries.sankoku8.phrase_entry import PhraseEntry
from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.yomichan.glossary.sankoku8 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules


class Terminator(BaseTerminator):
    def _definition_tags(self, entry):
        return ""

    def _inflection_rules(self, entry, expression):
        if isinstance(entry, PhraseEntry):
            return sudachi_rules(expression)
        pos_tags = entry.get_part_of_speech_tags()
        if len(pos_tags) == 0:
            return sudachi_rules(expression)
        else:
            return tags_to_rules(expression, pos_tags, self._inflection_categories)

    def _glossary(self, entry):
        if entry.entry_id in self._glossary_cache:
            return self._glossary_cache[entry.entry_id]
        glossary = make_glossary(entry, self._image_dir)
        self._glossary_cache[entry.entry_id] = glossary
        return glossary

    def _sequence(self, entry):
        return entry.entry_id[0] * 100000 + entry.entry_id[1]

    def _term_tags(self, entry):
        return ""

    def _link_glossary_parameters(self, entry):
        return [
            [entry.children, "子"],
            [entry.phrases, "句"]
        ]

    def _subentry_lists(self, entry):
        return [
            entry.children,
            entry.phrases,
        ]

@@ -1,12 +1,11 @@
-from bot.entries.smk8 import Smk8KanjiEntry as KanjiEntry
-from bot.entries.smk8 import Smk8PhraseEntry as PhraseEntry
-
-from bot.yomichan.terms.terminator import Terminator
+from bot.entries.smk8.kanji_entry import KanjiEntry
+from bot.entries.smk8.phrase_entry import PhraseEntry
+from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.yomichan.glossary.smk8 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules


-class Smk8Terminator(Terminator):
+class Terminator(BaseTerminator):
    def __init__(self, target):
        super().__init__(target)

@@ -1391,7 +1391,7 @@
22544,16385,おもいこしをあげる
22634,16385,おもいたったがきちにち
22634,16386,おもいたつひがきちじつ
-22728,16385,おもうゆえに
+22728,16385,おもうえに
22728,16386,おもうこころ
22728,16387,おもうこといわねばはらふくる
22728,16388,おもうそら

@@ -5224,7 +5224,7 @@
111520,16385,てんちょうにたっする
111583,16385,てんどうぜかひか
111583,16386,てんどうひとをころさず
-111645,16385,てんばくうをいく
+111645,16385,てんばくうをゆく
111695,16385,てんびんにかける
111790,16385,てんめいをしる
111801,16385,てんもうかいかいそにしてもらさず

@@ -5713,7 +5713,7 @@
119456,16385,なまきにくぎ
119456,16386,なまきをさく
119472,16385,なまけもののあしからとりがたつ
-119472,16386,なまけもののせっくはたらき
+119472,16386,なまけもののせっくばたらき
119503,16385,なますにたたく
119503,16386,なますをふく
119507,16385,なまずをひょうたんでおさえる

@@ -7215,7 +7215,7 @@
154782,16388,みずがはいる
154782,16389,みずがひく
154782,16390,みずかる
-154782,16391,みずきょければうおすまず
+154782,16391,みずきよければうおすまず
154782,16392,みずすむ
154782,16393,みずでわる
154782,16394,みずとあぶら

data/entries/sankoku8/phrase_readings.csv (new file, 3573 lines; diff suppressed because it is too large)

@@ -1,47 +1,61 @@
俠,侠
俱,倶
儘,侭
凜,凛
剝,剥
𠮟,叱
吞,呑
靭,靱
臈,﨟
啞,唖
噓,嘘
嚙,噛
屛,屏
幷,并
彎,弯
搔,掻
攪,撹
枡,桝
濾,沪
繡,繍
蔣,蒋
蠟,蝋
醬,醤
穎,頴
鷗,鴎
鹼,鹸
麴,麹
俠,侠
俱,倶
剝,剥
噓,嘘
囊,嚢
塡,填
姸,妍
屛,屏
屢,屡
拋,抛
搔,掻
摑,掴
攪,撹
潑,溌
瀆,涜
潑,溌
焰,焔
禱,祷
竜,龍
筓,笄
簞,箪
籠,篭
繡,繍
繫,繋
腁,胼
萊,莱
藪,薮
蟬,蝉
蠟,蝋
軀,躯
醬,醤
醱,醗
頰,頬
顚,顛
驒,騨
鶯,鴬
鷗,鴎
鷽,鴬
鹼,鹸
麴,麹
靭,靱
靱,靭
姸,妍
攢,攅
𣜜,杤
檔,档
槶,椢
櫳,槞
纊,絋
纘,纉
隯,陦
筓,笄
逬,迸
腁,胼
騈,駢
拋,抛
篡,簒
檜,桧
禰,祢
禱,祷
蘆,芦
凜,凛

@@ -1,19 +1,19 @@

@font-face {
    font-family: jpgothic;
-    src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local("MS Pゴシック"), local("MS Pgothic"), local("sans-serif");
+    src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
}

@font-face {
    font-family: jpmincho;
-    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
+    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
}

body {
-    margin: 0em 1em;
+    /*margin: 0em 1em;*/
    line-height: 1.5em;
-    font-family: jpmincho;
-    font-size: 1.2em;
+    font-family: jpmincho, serif;
+    /*font-size: 1.2em;*/
    color: black;
}

@@ -43,7 +43,7 @@ span[data-name="i"] {
}

span[data-name="h1"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 1em;
    font-weight: bold;
}

@@ -134,7 +134,7 @@ span[data-name="キャプション"] {
}

span[data-name="ルビG"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 0.7em;
    font-weight: normal;
    vertical-align: 0.35em;

@@ -142,7 +142,7 @@ span[data-name="ルビG"] {
}

.warichu span[data-name="ルビG"] {
-    font-family: jpmincho;
+    font-family: jpmincho, serif;
    font-size: 0.5em;
    font-weight: normal;
    vertical-align: 0em;

@@ -178,7 +178,7 @@ span[data-name="句仮名"] {
}

span[data-name="句表記"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

@@ -189,7 +189,7 @@ span[data-name="句項目"] {
}

span[data-name="和字"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
}

span[data-name="品詞行"] {

@@ -209,7 +209,7 @@ span[data-name="大語義"] {
span[data-name="大語義num"] {
    margin: 0.025em;
    padding: 0.1em;
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 0.8em;
    color: white;
    background-color: black;

@@ -227,7 +227,7 @@ span[data-name="慣用G"] {
}

span[data-name="欧字"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
}

span[data-name="歴史仮名"] {

@@ -248,7 +248,7 @@ span[data-name="準大語義"] {
span[data-name="準大語義num"] {
    margin: 0.025em;
    padding: 0.1em;
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 0.8em;
    border: solid 1px black;
}

@@ -256,7 +256,7 @@ span[data-name="準大語義num"] {
span[data-name="漢字音logo"] {
    margin: 0.025em;
    padding: 0.1em;
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 0.8em;
    border: solid 0.5px black;
    border-radius: 1em;

@@ -290,17 +290,17 @@ span[data-name="異字同訓"] {
}

span[data-name="異字同訓仮名"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

span[data-name="異字同訓漢字"] {
-    font-family: jpmincho;
+    font-family: jpmincho, serif;
    font-weight: normal;
}

span[data-name="異字同訓表記"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

@@ -321,12 +321,12 @@ rt {
}

span[data-name="見出仮名"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

span[data-name="見出相当部"] {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

@@ -371,7 +371,7 @@ span[data-name="logo"] {
}

.gothic {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

@@ -407,7 +407,7 @@ span[data-name="付記"]:after {
}

div[data-child-links] {
-    padding-top: 1em;
+    padding-left: 1em;
}

div[data-child-links] ul {

@@ -417,7 +417,7 @@ div[data-child-links] ul {

div[data-child-links] span {
    padding: 0.1em;
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-size: 0.8em;
    color: white;
    border-width: 0.05em;

@@ -1,20 +1,17 @@

@font-face {
    font-family: jpgothic;
-    src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local("MS Pゴシック"), local("MS Pgothic"), local("sans-serif");
+    src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
}

@font-face {
    font-family: jpmincho;
-    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
+    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
}

body {
    font-family: jpmincho;
    margin: 0em 1em;
    font-family: jpmincho, serif;
    line-height: 1.5em;
    font-size: 1.2em;
    color: black;
}

table, th, td {

@@ -24,7 +21,7 @@ table, th, td {
}

th {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    color: black;
    background-color: lightgray;
    font-weight: normal;

@@ -43,17 +40,18 @@ td ul {
}

.読み方 {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

-.意味 {
+.意味,
+.kanjirighttb {
    margin-left: 1.0em;
    margin-bottom: 0.5em;
}

.num_icon {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    padding-left: 0.25em;
    margin-right: 0.5em;
    font-size: 0.8em;

@@ -63,4 +61,3 @@ td ul {
    border-style: none;
    -webkit-border-radius: 0.1em;
}

@@ -1,20 +1,17 @@

@font-face {
    font-family: jpgothic;
-    src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local("MS Pゴシック"), local("MS Pgothic"), local("sans-serif");
+    src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
}

@font-face {
    font-family: jpmincho;
-    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
+    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
}

body {
    font-family: jpmincho;
    margin: 0em 1em;
    font-family: jpmincho, serif;
    line-height: 1.5em;
    font-size: 1.2em;
    color: black;
}

table, th, td {

@@ -24,7 +21,7 @@ table, th, td {
}

th {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    color: black;
    background-color: lightgray;
    font-weight: normal;

@@ -39,12 +36,12 @@ a {
}

.読み方 {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

-.意味 {
+.意味,
+.kanjirighttb {
    margin-left: 1.0em;
    margin-bottom: 0.5em;
}

@@ -1,20 +1,17 @@

@font-face {
    font-family: jpgothic;
-    src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local("MS Pゴシック"), local("MS Pgothic"), local("sans-serif");
+    src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
}

@font-face {
    font-family: jpmincho;
-    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
+    src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
}

body {
    font-family: jpmincho;
    margin: 0em 1em;
    font-family: jpmincho, serif;
    line-height: 1.5em;
    font-size: 1.2em;
    color: black;
}

table, th, td {

@@ -24,7 +21,7 @@ table, th, td {
}

th {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    color: black;
    background-color: lightgray;
    font-weight: normal;

@@ -39,12 +36,12 @@ a {
}

.読み方 {
-    font-family: jpgothic;
+    font-family: jpgothic, sans-serif;
    font-weight: bold;
}

-.意味 {
+.意味,
+.kanjirighttb {
    margin-left: 1.0em;
    margin-bottom: 0.5em;
}

Some files were not shown because too many files have changed in this diff.