Compare commits


1 commit

epistularum · 9e0ff422d8 · User instructions for gaiji/audio icon · 2023-07-12 21:36:39 +09:00
116 changed files with 1407 additions and 7718 deletions

README.md (158 lines changed)

@@ -4,13 +4,12 @@ compiling the scraped data into compact dictionary file formats.
### Supported Dictionaries
* Web Dictionaries
  * [国語辞典オンライン](https://kokugo.jitenon.jp/) (`jitenon-kokugo`)
  * [四字熟語辞典オンライン](https://yoji.jitenon.jp/) (`jitenon-yoji`)
  * [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/) (`jitenon-kotowaza`)
* Monokakido
  * [新明解国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/smk8/index.html) (`smk8`)
  * [大辞林 第四版](https://www.monokakido.jp/ja/dictionaries/daijirin2/index.html) (`daijirin2`)
  * [三省堂国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/sankoku8/index.html) (`sankoku8`)
* [国語辞典オンライン](https://kokugo.jitenon.jp/) (Jitenon Kokugo)
* [四字熟語辞典オンライン](https://yoji.jitenon.jp/) (Jitenon Yoji)
* [故事・ことわざ・慣用句オンライン](https://kotowaza.jitenon.jp/) (Jitenon Kotowaza)
* Monokakido (["辞書 by 物書堂"](https://www.monokakido.jp/ja/dictionaries/app/))
  * [新明解国語辞典 第八版](https://www.monokakido.jp/ja/dictionaries/smk8/index.html) (Shinmeikai 8e)
  * [大辞林 第四版](https://www.monokakido.jp/ja/dictionaries/daijirin2/index.html) (Daijirin 4e)
### Supported Output Formats
@@ -49,12 +48,6 @@ compiling the scraped data into compact dictionary file formats.
![daijirin2](https://user-images.githubusercontent.com/8003332/235578700-9dbf4fb0-0154-48b5-817c-8fe75e442afc.png)
</details>
<details>
<summary>Sanseidō 8e (print | yomichan)</summary>
![sankoku8](https://github.com/stephenmk/jitenbot/assets/8003332/0358b3fc-71fb-4557-977c-1976a12229ec)
</details>
<details>
<summary>Various (GoldenDict)</summary>
@@ -64,14 +57,13 @@ compiling the scraped data into compact dictionary file formats.
# Usage
```
usage: jitenbot [-h] [-p PAGE_DIR] [-m MEDIA_DIR] [-i MDICT_ICON]
[--no-mdict-export] [--no-yomichan-export]
[--validate-yomichan-terms]
{jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}
[--no-yomichan-export] [--no-mdict-export]
{jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
Convert Japanese dictionary files to new formats.
positional arguments:
{jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2,sankoku8}
{jitenon-kokugo,jitenon-yoji,jitenon-kotowaza,smk8,daijirin2}
name of dictionary to convert
options:
@@ -83,14 +75,10 @@ options:
graphics, audio, etc.)
-i MDICT_ICON, --mdict-icon MDICT_ICON
path to icon file to be used with MDict
--no-mdict-export skip export of dictionary data to MDict format
--no-yomichan-export skip export of dictionary data to Yomichan format
--validate-yomichan-terms
validate JSON structure of exported Yomichan
dictionary terms
--no-mdict-export skip export of dictionary data to MDict format
See README.md for details regarding media directory structures
```
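For example, a web target can be converted end-to-end with a single command. The following sketch assumes the program is invoked as `python jitenbot.py`, which is an assumption not shown in the usage text above:

```
python jitenbot.py jitenon-yoji
python jitenbot.py jitenon-yoji --no-mdict-export
```

The first run scrapes the site (or reuses the cache) and exports both formats; the second skips the MDict export and produces only the Yomichan dictionary.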
### Web Targets
Jitenbot will scrape the target website and save the pages to the [user cache directory](https://pypi.org/project/platformdirs/).
@@ -101,112 +89,58 @@ HTTP request headers (user agent string, etc.) may be customized by editing the
[user config directory](https://pypi.org/project/platformdirs/).
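As a purely hypothetical illustration (the key names below are an assumption, not the project's documented schema), the config file might contain something like:

```
{
    "http-request-headers": {
        "User-Agent": "Mozilla/5.0 ..."
    }
}
```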
### Monokakido Targets
These digital dictionaries are available for purchase through the [Monokakido Dictionaries app](https://www.monokakido.jp/ja/dictionaries/app/) on macOS/iOS. Under ideal circumstances, Jitenbot would be able to automatically fetch all the data it needs from this app's data directory[^1] on your system. In its current state of development, Jitenbot unfortunately requires you to find and assemble the necessary data yourself. The files must be organized into a particular folder structure (defined below) and then passed to Jitenbot via the corresponding command line arguments.
Page data and media data must be [procured by the user](https://github.com/golddranks/monokakido/)
and passed to jitenbot via the appropriate command line flags. Additionally, the gaiji folder and the audio icon must be manually copied from the original dictionary folder into the media folder.
Some of the folders in the app's data directory[^1] contain encoded files that must be unencoded using [golddranks' monokakido tool](https://github.com/golddranks/monokakido/). These folders are indicated by a reference mark (※) in the notes below.
[^1]: `/Library/Application Support/AppStoreContent/jp.monokakido.Dictionaries/Products/`
Path:
```/YOUR_SAVE_PATH/jp.monokakido.Dictionaries.DICTIONARY_NAME/Contents/DICTIONARY_NAME/```.
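For example, once the files are assembled as described in the notes below, an smk8 conversion might be invoked like this (a sketch; all paths are placeholders):

```
python jitenbot.py smk8 -p /path/to/smk8/pages -m /path/to/smk8/media -i /path/to/icon.png
```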
<details>
<summary>smk8 files</summary>
<summary>smk8 media directory</summary>
Since Yomichan does not support audio files from imported dictionaries, the `audio/` directory may be omitted to save space in the output ZIP file if desired.
```
.
├── media
│   ├── audio (※)
│   │   ├── 00001.aac
│   │   ├── 00002.aac
│   │   ├── 00003.aac
│   │   ├── ...
│   │   └── 82682.aac
│   ├── Audio.png
│   └── gaiji
│       ├── 1d110.svg
│       ├── 1d15d.svg
│       ├── 1d15e.svg
│       ├── ...
│       └── xbunnoa.svg
└── pages (※)
    ├── 0000000000.xml
    ├── 0000000001.xml
    ├── 0000000002.xml
    ├── ...
    └── 0000064581.xml
media
├── Audio.png
├── audio
│   ├── 00001.aac
│   ├── 00002.aac
│   ├── 00003.aac
│   │  ...
│   └── 82682.aac
└── gaiji
    ├── 1d110.svg
    ├── 1d15d.svg
    ├── 1d15e.svg
    │  ...
    └── xbunnoa.svg
```
</details>
<details>
<summary>daijirin2 files</summary>
<summary>daijirin2 media directory</summary>
The `graphics/` directory may be omitted to save space if desired.
```
.
├── media
│   ├── gaiji
│   │   ├── 1D10B.svg
│   │   ├── 1D110.svg
│   │   ├── 1D12A.svg
│   │   ├── ...
│   │   └── vectorOB.svg
│   └── graphics (※)
│       ├── 3djr_0002.png
│       ├── 3djr_0004.png
│       ├── 3djr_0005.png
│       ├── ...
│       └── 4djr_yahazu.png
└── pages (※)
    ├── 0000000001.xml
    ├── 0000000002.xml
    ├── 0000000003.xml
    ├── ...
    └── 0000182633.xml
```
</details>
<details>
<summary>sankoku8 files</summary>
```
.
├── media
│   ├── graphics
│   │   ├── 000chouchou.png
│   │   ├── ...
│   │   └── 888udatsu.png
│   ├── svg-accent
│   │   ├── アクセント.svg
│   │   └── 平板.svg
│   ├── svg-frac
│   │   ├── frac-1-2.svg
│   │   ├── ...
│   │   └── frac-a-b.svg
│   ├── svg-gaiji
│   │   ├── aiaigasa.svg
│   │   ├── ...
│   │   └── 異体字_西.svg
│   ├── svg-intonation
│   │   ├── 上昇下降.svg
│   │   ├── ...
│   │   └── 長.svg
│   ├── svg-logo
│   │   ├── denshi.svg
│   │   ├── ...
│   │   └── 重要語.svg
│   └── svg-special
│       └── 区切り線.svg
└── pages (※)
    ├── 0000000001.xml
    ├── ...
    └── 0000065457.xml
media
├── gaiji
│   ├── 1D10B.svg
│   ├── 1D110.svg
│   ├── 1D12A.svg
│   │  ...
│   └── vectorOB.svg
└── graphics
    ├── 3djr_0002.png
    ├── 3djr_0004.png
    ├── 3djr_0005.png
    │  ...
    └── 4djr_yahazu.png
```
</details>
# Attribution
`Adobe-Japan1_sequences.txt` is provided by [The Adobe-Japan1-7 Character Collection](https://github.com/adobe-type-tools/Adobe-Japan1).
The Yomichan term-bank schema definition `dictionary-term-bank-v3-schema.json` is provided by the [Yomichan](https://github.com/foosoft/yomichan) project.
Many thanks to [epistularum](https://github.com/epistularum) for providing thoughtful feedback regarding the implementation of the MDict export functionality.


@@ -1,13 +1,10 @@
### Todo
- [x] Add factory classes to reduce the amount of class import statements
- [x] Add dynamic import functionality to factory classes to reduce boilerplate
- [x] Support exporting to MDict (.MDX) dictionary format
- [x] Validate JSON schema of Yomichan terms during export
- [ ] Add support for monokakido search keys from index files
- [ ] Delete unneeded media from temp build directory before final export
- [ ] Add test suite
- [ ] Add documentation (docstrings, etc.)
- [ ] Validate JSON schema of Yomichan terms during export
- [ ] Add build scripts for producing program binaries
- [ ] Validate scraped webpages after downloading
- [ ] Log non-fatal failures to a log file instead of raising exceptions
@@ -16,7 +13,7 @@
- [ ] [Yoji-Jukugo.com](https://yoji-jukugo.com/)
- [ ] [実用日本語表現辞典](https://www.weblio.jp/cat/dictionary/jtnhj)
- [ ] Support more Monokakido dictionaries
- [x] 三省堂国語辞典 第8版 (SANKOKU8)
- [ ] 三省堂国語辞典 第8版 (SANKOKU8)
- [ ] 精選版 日本国語大辞典 (NDS)
- [ ] 大辞泉 第2版 (DAIJISEN2)
- [ ] 明鏡国語辞典 第3版 (MK3)


@@ -1,54 +0,0 @@
import re
from abc import ABC, abstractmethod
from bot.factory import new_entry
from bot.factory import new_yomichan_exporter
from bot.factory import new_mdict_exporter
class BaseCrawler(ABC):
def __init__(self, target):
self._target = target
self._page_map = {}
self._entries = []
self._page_id_pattern = None
@abstractmethod
def collect_pages(self, page_dir):
raise NotImplementedError
def read_pages(self):
pages_len = len(self._page_map)
items = self._page_map.items()
for idx, (page_id, page_path) in enumerate(items):
update = f"\tReading page {idx+1}/{pages_len}"
print(update, end='\r', flush=True)
entry = new_entry(self._target, page_id)
with open(page_path, "r", encoding="utf-8") as f:
page = f.read()
try:
entry.set_page(page)
except ValueError as err:
print(err)
print("Try deleting and redownloading file:")
print(f"\t{page_path}\n")
continue
self._entries.append(entry)
print()
def make_yomichan_dictionary(self, media_dir, validate):
exporter = new_yomichan_exporter(self._target)
exporter.export(self._entries, media_dir, validate)
def make_mdict_dictionary(self, media_dir, icon_file):
exporter = new_mdict_exporter(self._target)
exporter.export(self._entries, media_dir, icon_file)
def _parse_page_id(self, page_link):
m = re.search(self._page_id_pattern, page_link)
if m is None:
return None
page_id = int(m.group(1))
if page_id in self._page_map:
return None
return page_id


@@ -1,30 +0,0 @@
from bs4 import BeautifulSoup
from bot.time import timestamp
from bot.crawlers.scrapers.jitenon import Jitenon as JitenonScraper
from bot.crawlers.base.crawler import BaseCrawler
class JitenonCrawler(BaseCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = None
def collect_pages(self, page_dir):
print(f"{timestamp()} Scraping {self._gojuon_url}")
jitenon = JitenonScraper()
gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
gojuon_href = gojuon_a['href']
kana_doc, _ = jitenon.scrape(gojuon_href)
kana_soup = BeautifulSoup(kana_doc, features="html.parser")
for kana_a in kana_soup.select(".word_box a", href=True):
page_link = kana_a['href']
page_id = self._parse_page_id(page_link)
if page_id is None:
continue
_, page_path = jitenon.scrape(page_link)
self._page_map[page_id] = page_path
pages_len = len(self._page_map)
print(f"\n{timestamp()} Found {pages_len} entry pages")


@@ -1,20 +0,0 @@
import os
from bot.time import timestamp
from bot.crawlers.base.crawler import BaseCrawler
class MonokakidoCrawler(BaseCrawler):
def __init__(self, target):
super().__init__(target)
self._page_id_pattern = r"^([0-9]+)\.xml$"
def collect_pages(self, page_dir):
print(f"{timestamp()} Searching for page files in `{page_dir}`")
for pagefile in os.listdir(page_dir):
page_id = self._parse_page_id(pagefile)
if page_id is None or page_id == 0:
continue
path = os.path.join(page_dir, pagefile)
self._page_map[page_id] = path
pages_len = len(self._page_map)
print(f"{timestamp()} Found {pages_len} page files for processing")

bot/crawlers/crawlers.py (new file, 154 lines)

@@ -0,0 +1,154 @@
import os
import re
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup
import bot.crawlers.scraper as Scraper
from bot.entries.factory import new_entry
from bot.yomichan.exporters.factory import new_yomi_exporter
from bot.mdict.exporters.factory import new_mdict_exporter
class Crawler(ABC):
def __init__(self, target):
self._target = target
self._page_map = {}
self._entries = []
self._page_id_pattern = None
@abstractmethod
def collect_pages(self, page_dir):
pass
def read_pages(self):
pages_len = len(self._page_map)
items = self._page_map.items()
for idx, (page_id, page_path) in enumerate(items):
update = f"Reading page {idx+1}/{pages_len}"
print(update, end='\r', flush=True)
entry = new_entry(self._target, page_id)
with open(page_path, "r", encoding="utf-8") as f:
page = f.read()
try:
entry.set_page(page)
except ValueError as err:
print(err)
print("Try deleting and redownloading file:")
print(f"\t{page_path}\n")
continue
self._entries.append(entry)
print()
def make_yomichan_dictionary(self, media_dir):
exporter = new_yomi_exporter(self._target)
exporter.export(self._entries, media_dir)
def make_mdict_dictionary(self, media_dir, icon_file):
exporter = new_mdict_exporter(self._target)
exporter.export(self._entries, media_dir, icon_file)
def _parse_page_id(self, page_link):
m = re.search(self._page_id_pattern, page_link)
if m is None:
return None
page_id = int(m.group(1))
if page_id in self._page_map:
return None
return page_id
class JitenonKokugoCrawler(Crawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://kokugo.jitenon.jp/cat/gojuonindex.php"
self._page_id_pattern = r"word/p([0-9]+)$"
def collect_pages(self, page_dir):
jitenon = Scraper.Jitenon()
gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
gojuon_href = gojuon_a['href']
max_kana_page = 1
current_kana_page = 1
while current_kana_page <= max_kana_page:
kana_doc, _ = jitenon.scrape(f"{gojuon_href}&page={current_kana_page}")
current_kana_page += 1
kana_soup = BeautifulSoup(kana_doc, features="html.parser")
page_total = kana_soup.find(class_="page_total").text
m = re.search(r"全([0-9]+)件", page_total)
if m:
max_kana_page = int(m.group(1))
for kana_a in kana_soup.select(".word_box a", href=True):
page_link = kana_a['href']
page_id = self._parse_page_id(page_link)
if page_id is None:
continue
_, page_path = jitenon.scrape(page_link)
self._page_map[page_id] = page_path
pages_len = len(self._page_map)
print(f"Finished scraping {pages_len} pages")
class _JitenonCrawler(Crawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = None
def collect_pages(self, page_dir):
print("Scraping jitenon.jp")
jitenon = Scraper.Jitenon()
gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
gojuon_href = gojuon_a['href']
kana_doc, _ = jitenon.scrape(gojuon_href)
kana_soup = BeautifulSoup(kana_doc, features="html.parser")
for kana_a in kana_soup.select(".word_box a", href=True):
page_link = kana_a['href']
page_id = self._parse_page_id(page_link)
if page_id is None:
continue
_, page_path = jitenon.scrape(page_link)
self._page_map[page_id] = page_path
pages_len = len(self._page_map)
print(f"Finished scraping {pages_len} pages")
class JitenonYojiCrawler(_JitenonCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://yoji.jitenon.jp/cat/gojuon.html"
self._page_id_pattern = r"([0-9]+)\.html$"
class JitenonKotowazaCrawler(_JitenonCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://kotowaza.jitenon.jp/cat/gojuon.php"
self._page_id_pattern = r"([0-9]+)\.php$"
class _MonokakidoCrawler(Crawler):
def __init__(self, target):
super().__init__(target)
self._page_id_pattern = r"^([0-9]+)\.xml$"
def collect_pages(self, page_dir):
print(f"Searching for page files in `{page_dir}`")
for pagefile in os.listdir(page_dir):
page_id = self._parse_page_id(pagefile)
if page_id is None or page_id == 0:
continue
path = os.path.join(page_dir, pagefile)
self._page_map[page_id] = path
pages_len = len(self._page_map)
print(f"Found {pages_len} page files for processing")
class Smk8Crawler(_MonokakidoCrawler):
pass
class Daijirin2Crawler(_MonokakidoCrawler):
pass


@@ -1,5 +0,0 @@
from bot.crawlers.base.monokakido import MonokakidoCrawler
class Crawler(MonokakidoCrawler):
pass

bot/crawlers/factory.py (new file, 18 lines)

@@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.crawlers.crawlers import JitenonKokugoCrawler
from bot.crawlers.crawlers import JitenonYojiCrawler
from bot.crawlers.crawlers import JitenonKotowazaCrawler
from bot.crawlers.crawlers import Smk8Crawler
from bot.crawlers.crawlers import Daijirin2Crawler
def new_crawler(target):
crawler_map = {
Targets.JITENON_KOKUGO: JitenonKokugoCrawler,
Targets.JITENON_YOJI: JitenonYojiCrawler,
Targets.JITENON_KOTOWAZA: JitenonKotowazaCrawler,
Targets.SMK8: Smk8Crawler,
Targets.DAIJIRIN2: Daijirin2Crawler,
}
return crawler_map[target](target)
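This factory keeps call sites target-agnostic: a caller selects a `Targets` member and receives an instantiated crawler for it. A minimal sketch of how the pieces fit together, using method names from the `Crawler` base class above (the driver script itself is hypothetical):

```
from bot.targets import Targets
from bot.crawlers.factory import new_crawler

crawler = new_crawler(Targets.SMK8)            # look up and instantiate the crawler
crawler.collect_pages("/path/to/smk8/pages")   # map page IDs to page files
crawler.read_pages()                           # parse each page file into an entry
crawler.make_yomichan_dictionary("/path/to/smk8/media")
```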


@@ -1,40 +0,0 @@
import re
from bs4 import BeautifulSoup
from bot.time import timestamp
from bot.crawlers.base.crawler import BaseCrawler
from bot.crawlers.scrapers.jitenon import Jitenon as JitenonScraper
class Crawler(BaseCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://kokugo.jitenon.jp/cat/gojuonindex.php"
self._page_id_pattern = r"word/p([0-9]+)$"
def collect_pages(self, page_dir):
print(f"{timestamp()} Scraping {self._gojuon_url}")
jitenon = JitenonScraper()
gojuon_doc, _ = jitenon.scrape(self._gojuon_url)
gojuon_soup = BeautifulSoup(gojuon_doc, features="html.parser")
for gojuon_a in gojuon_soup.select(".kana_area a", href=True):
gojuon_href = gojuon_a['href']
max_kana_page = 1
current_kana_page = 1
while current_kana_page <= max_kana_page:
kana_doc, _ = jitenon.scrape(f"{gojuon_href}&page={current_kana_page}")
current_kana_page += 1
kana_soup = BeautifulSoup(kana_doc, features="html.parser")
page_total = kana_soup.find(class_="page_total").text
m = re.search(r"全([0-9]+)件", page_total)
if m:
max_kana_page = int(m.group(1))
for kana_a in kana_soup.select(".word_box a", href=True):
page_link = kana_a['href']
page_id = self._parse_page_id(page_link)
if page_id is None:
continue
_, page_path = jitenon.scrape(page_link)
self._page_map[page_id] = page_path
pages_len = len(self._page_map)
print(f"\n{timestamp()} Found {pages_len} entry pages")


@@ -1,8 +0,0 @@
from bot.crawlers.base.jitenon import JitenonCrawler
class Crawler(JitenonCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://kotowaza.jitenon.jp/cat/gojuon.php"
self._page_id_pattern = r"([0-9]+)\.php$"


@@ -1,8 +0,0 @@
from bot.crawlers.base.jitenon import JitenonCrawler
class Crawler(JitenonCrawler):
def __init__(self, target):
super().__init__(target)
self._gojuon_url = "https://yoji.jitenon.jp/cat/gojuon.html"
self._page_id_pattern = r"([0-9]+)\.html$"


@@ -1,5 +0,0 @@
from bot.crawlers.base.monokakido import MonokakidoCrawler
class Crawler(MonokakidoCrawler):
pass


@@ -1,28 +1,24 @@
import time
import requests
import re
import os
import hashlib
import random
import math
from datetime import datetime
from urllib.parse import urlparse
from pathlib import Path
from abc import ABC, abstractmethod
import requests
from platformdirs import user_cache_dir
from urllib.parse import urlparse
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from platformdirs import user_cache_dir
from bot.time import timestamp
from bot.data import load_config
class BaseScraper(ABC):
class Scraper():
def __init__(self):
self.cache_count = 0
self._config = load_config()
self.netloc_re = self._get_netloc_re()
pattern = r"^(?:([A-Za-z0-9.\-]+)\.)?" + self.domain + r"$"
self.netloc_re = re.compile(pattern)
self.__set_session()
def scrape(self, urlstring):
@@ -35,14 +31,9 @@ class BaseScraper(ABC):
with open(cache_path, "w", encoding="utf-8") as f:
f.write(html)
else:
self.cache_count += 1
print(f"\tDiscovering cached file {self.cache_count}", end='\r', flush=True)
print("Discovering cached files...", end='\r', flush=True)
return html, cache_path
@abstractmethod
def _get_netloc_re(self):
raise NotImplementedError
def __set_session(self):
retry_strategy = Retry(
total=3,
@@ -96,14 +87,21 @@
def __get(self, urlstring):
delay = 10
time.sleep(delay)
print(f"{timestamp()} Scraping {urlstring} ...", end='')
now = datetime.now().strftime("%H:%M:%S")
print(f"{now} scraping {urlstring} ...", end='')
try:
response = self.session.get(urlstring, timeout=10)
print(f"{timestamp()} OK")
print("OK")
return response.text
except Exception as ex:
print(f"\tFailed: {str(ex)}")
print(f"{timestamp()} Resetting session and trying again")
except Exception:
print("failed")
print("resetting session and trying again")
self.__set_session()
response = self.session.get(urlstring, timeout=10)
return response.text
class Jitenon(Scraper):
def __init__(self):
self.domain = r"jitenon\.jp"
super().__init__()
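Each `scrape()` call returns both the page HTML and the path of its cached copy, so repeated runs read from the cache instead of hitting the network. A brief usage sketch (the URL is illustrative):

```
import bot.crawlers.scraper as Scraper

jitenon = Scraper.Jitenon()
# First call downloads and caches; later calls reuse the cached file
html, cache_path = jitenon.scrape("https://yoji.jitenon.jp/cat/gojuon.html")
```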


@@ -1,10 +0,0 @@
import re
from bot.crawlers.scrapers.scraper import BaseScraper
class Jitenon(BaseScraper):
def _get_netloc_re(self):
domain = r"jitenon\.jp"
pattern = r"^(?:([A-Za-z0-9.\-]+)\.)?" + domain + r"$"
netloc_re = re.compile(pattern)
return netloc_re


@@ -1,5 +0,0 @@
from bot.crawlers.base.monokakido import MonokakidoCrawler
class Crawler(MonokakidoCrawler):
pass


@@ -37,16 +37,14 @@ def load_config():
@cache
def load_yomichan_inflection_categories():
file_name = os.path.join(
"yomichan", "inflection_categories.json")
file_name = os.path.join("yomichan", "inflection_categories.json")
data = __load_json(file_name)
return data
@cache
def load_yomichan_metadata():
file_name = os.path.join(
"yomichan", "index.json")
file_name = os.path.join("yomichan", "index.json")
data = __load_json(file_name)
return data
@@ -55,21 +53,31 @@ def load_yomichan_metadata():
def load_variant_kanji():
def loader(data, row):
data[row[0]] = row[1]
file_name = os.path.join(
"entries", "variant_kanji.csv")
file_name = os.path.join("entries", "variant_kanji.csv")
data = {}
__load_csv(file_name, loader, data)
return data
@cache
def load_phrase_readings(target):
def load_smk8_phrase_readings():
def loader(data, row):
entry_id = (int(row[0]), int(row[1]))
reading = row[2]
data[entry_id] = reading
file_name = os.path.join(
"entries", target.value, "phrase_readings.csv")
file_name = os.path.join("entries", "smk8", "phrase_readings.csv")
data = {}
__load_csv(file_name, loader, data)
return data
@cache
def load_daijirin2_phrase_readings():
def loader(data, row):
entry_id = (int(row[0]), int(row[1]))
reading = row[2]
data[entry_id] = reading
file_name = os.path.join("entries", "daijirin2", "phrase_readings.csv")
data = {}
__load_csv(file_name, loader, data)
return data
@@ -84,8 +92,7 @@ def load_daijirin2_kana_abbreviations():
if abbr.strip() != "":
abbreviations.append(abbr)
data[entry_id] = abbreviations
file_name = os.path.join(
"entries", "daijirin2", "kana_abbreviations.csv")
file_name = os.path.join("entries", "daijirin2", "kana_abbreviations.csv")
data = {}
__load_csv(file_name, loader, data)
return data
@@ -93,24 +100,14 @@ def load_daijirin2_kana_abbreviations():
@cache
def load_yomichan_name_conversion(target):
file_name = os.path.join(
"yomichan", "name_conversion", f"{target.value}.json")
file_name = os.path.join("yomichan", "name_conversion", f"{target.value}.json")
data = __load_json(file_name)
return data
@cache
def load_yomichan_term_schema():
file_name = os.path.join(
"yomichan", "dictionary-term-bank-v3-schema.json")
schema = __load_json(file_name)
return schema
@cache
def load_mdict_name_conversion(target):
file_name = os.path.join(
"mdict", "name_conversion", f"{target.value}.json")
file_name = os.path.join("mdict", "name_conversion", f"{target.value}.json")
data = __load_json(file_name)
return data
@@ -134,8 +131,7 @@ def __load_adobe_glyphs():
data[code].append(character)
else:
data[code] = [character]
file_name = os.path.join(
"entries", "adobe", "Adobe-Japan1_sequences.txt")
file_name = os.path.join("entries", "adobe", "Adobe-Japan1_sequences.txt")
data = {}
__load_csv(file_name, loader, data, delim=';')
return data
@@ -143,8 +139,7 @@ def __load_adobe_glyphs():
@cache
def __load_override_adobe_glyphs():
file_name = os.path.join(
"entries", "adobe", "override_glyphs.json")
file_name = os.path.join("entries", "adobe", "override_glyphs.json")
json_data = __load_json(file_name)
data = {}
for key, val in json_data.items():


@@ -1,60 +0,0 @@
from abc import abstractmethod
from bs4 import BeautifulSoup
from bot.entries.base.entry import Entry
import bot.entries.base.expressions as Expressions
class SanseidoEntry(Entry):
def set_page(self, page):
page = self._decompose_subentries(page)
self._page = page
def get_page_soup(self):
soup = BeautifulSoup(self._page, "xml")
return soup
def get_global_identifier(self):
parent_part = format(self.entry_id[0], '06')
child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
return f"@{self.target.value}-{parent_part}-{child_part}"
def _decompose_subentries(self, page):
soup = BeautifulSoup(page, features="xml")
for x in self._get_subentry_parameters():
subentry_class, tags, subentry_list = x
for tag in tags:
tag_soup = soup.find(tag)
while tag_soup is not None:
tag_soup.name = "項目"
subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
subentry = subentry_class(self.target, subentry_id)
page = tag_soup.decode()
subentry.set_page(page)
subentry_list.append(subentry)
tag_soup.decompose()
tag_soup = soup.find(tag)
return soup.decode()
@abstractmethod
def _get_subentry_parameters(self):
raise NotImplementedError
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
Expressions.remove_iteration_mark(expressions)
Expressions.add_iteration_mark(expressions)
@staticmethod
def id_string_to_entry_id(id_string):
parts = id_string.split("-")
if len(parts) == 1:
return (int(parts[0]), 0)
elif len(parts) == 2:
# subentries have a hexadecimal part
return (int(parts[0]), int(parts[1], 16))
else:
raise Exception(f"Invalid entry ID: {id_string}")

bot/entries/daijirin2.py (new file, 231 lines)

@@ -0,0 +1,231 @@
from bs4 import BeautifulSoup
import bot.entries.expressions as Expressions
import bot.soup as Soup
from bot.data import load_daijirin2_phrase_readings
from bot.data import load_daijirin2_kana_abbreviations
from bot.entries.entry import Entry
from bot.entries.daijirin2_preprocess import preprocess_page
class _BaseDaijirin2Entry(Entry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.children = []
self.phrases = []
self._kana_abbreviations = load_daijirin2_kana_abbreviations()
def get_global_identifier(self):
parent_part = format(self.entry_id[0], '06')
child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
return f"@{self.target.value}-{parent_part}-{child_part}"
def set_page(self, page):
page = self.__decompose_subentries(page)
self._page = page
def get_page_soup(self):
soup = BeautifulSoup(self._page, "xml")
return soup
def get_part_of_speech_tags(self):
if self._part_of_speech_tags is not None:
return self._part_of_speech_tags
self._part_of_speech_tags = []
soup = self.get_page_soup()
for pos_group in soup.find_all("品詞G"):
if pos_group.parent.name == "大語義":
self._set_part_of_speech_tags(pos_group)
return self._part_of_speech_tags
def _set_part_of_speech_tags(self, el):
pos_names = ["品詞", "品詞活用", "品詞行", "用法"]
for child in el.children:
if child.name is not None:
self._set_part_of_speech_tags(child)
continue
pos = str(child)
if el.name not in pos_names:
continue
elif pos in ["", ""]:
continue
elif pos in self._part_of_speech_tags:
continue
else:
self._part_of_speech_tags.append(pos)
def _get_regular_headwords(self, soup):
self._fill_alts(soup)
reading = soup.find("見出仮名").text
expressions = []
for el in soup.find_all("標準表記"):
expression = self._clean_expression(el.text)
if "" in expression:
kana_abbrs = self._kana_abbreviations[self.entry_id]
for abbr in kana_abbrs:
expression = expression.replace("", abbr, 1)
expressions.append(expression)
expressions = Expressions.expand_abbreviation_list(expressions)
if len(expressions) == 0:
expressions.append(reading)
headwords = {reading: expressions}
return headwords
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
Expressions.remove_iteration_mark(expressions)
Expressions.add_iteration_mark(expressions)
def __decompose_subentries(self, page):
soup = BeautifulSoup(page, features="xml")
subentry_parameters = [
[Daijirin2ChildEntry, ["子項目"], self.children],
[Daijirin2PhraseEntry, ["句項目"], self.phrases],
]
for x in subentry_parameters:
subentry_class, tags, subentry_list = x
for tag in tags:
tag_soup = soup.find(tag)
while tag_soup is not None:
tag_soup.name = "項目"
subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
subentry = subentry_class(self.target, subentry_id)
page = tag_soup.decode()
subentry.set_page(page)
subentry_list.append(subentry)
tag_soup.decompose()
tag_soup = soup.find(tag)
return soup.decode()
@staticmethod
def id_string_to_entry_id(id_string):
parts = id_string.split("-")
if len(parts) == 1:
return (int(parts[0]), 0)
elif len(parts) == 2:
# subentries have a hexadecimal part
return (int(parts[0]), int(parts[1], 16))
else:
raise Exception(f"Invalid entry ID: {id_string}")
@staticmethod
def _delete_unused_nodes(soup):
"""Remove extra markup elements that appear in the entry
headword line which are not part of the entry headword"""
unused_nodes = [
"漢字音logo", "活用分節", "連語句活用分節", "語構成",
"表外字マーク", "表外字マーク", "ルビG"
]
for name in unused_nodes:
Soup.delete_soup_nodes(soup, name)
@staticmethod
def _clean_expression(expression):
for x in ["", "", "", "", " "]:
expression = expression.replace(x, "")
return expression
@staticmethod
def _fill_alts(soup):
for gaiji in soup.find_all(class_="gaiji"):
if gaiji.name == "img" and gaiji.has_attr("alt"):
gaiji.name = "span"
gaiji.string = gaiji.attrs["alt"]
class Daijirin2Entry(_BaseDaijirin2Entry):
def __init__(self, target, page_id):
entry_id = (page_id, 0)
super().__init__(target, entry_id)
def set_page(self, page):
page = preprocess_page(page)
super().set_page(page)
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
if soup.find("漢字見出") is not None:
headwords = self._get_kanji_headwords(soup)
elif soup.find("略語G") is not None:
headwords = self._get_acronym_headwords(soup)
else:
headwords = self._get_regular_headwords(soup)
return headwords
def _get_kanji_headwords(self, soup):
readings = []
for el in soup.find_all("漢字音"):
hira = Expressions.kata_to_hira(el.text)
readings.append(hira)
if soup.find("漢字音") is None:
readings.append("")
expressions = []
for el in soup.find_all("漢字見出"):
expressions.append(el.text)
headwords = {}
for reading in readings:
headwords[reading] = expressions
return headwords
def _get_acronym_headwords(self, soup):
expressions = []
for el in soup.find_all("略語"):
expression_parts = []
for part in el.find_all(["欧字", "和字"]):
expression_parts.append(part.text)
expression = "".join(expression_parts)
expressions.append(expression)
headwords = {"": expressions}
return headwords
class Daijirin2ChildEntry(_BaseDaijirin2Entry):
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
headwords = self._get_regular_headwords(soup)
return headwords
class Daijirin2PhraseEntry(_BaseDaijirin2Entry):
def get_part_of_speech_tags(self):
# phrases do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
headwords = {}
expressions = self._find_expressions(soup)
readings = self._find_readings()
for idx, expression in enumerate(expressions):
reading = readings[idx]
if reading in headwords:
headwords[reading].append(expression)
else:
headwords[reading] = [expression]
return headwords
def _find_expressions(self, soup):
self._delete_unused_nodes(soup)
text = soup.find("句表記").text
text = self._clean_expression(text)
alternatives = Expressions.expand_daijirin_alternatives(text)
expressions = []
for alt in alternatives:
for exp in Expressions.expand_abbreviation(alt):
expressions.append(exp)
return expressions
def _find_readings(self):
phrase_readings = load_daijirin2_phrase_readings()
text = phrase_readings[self.entry_id]
alternatives = Expressions.expand_daijirin_alternatives(text)
readings = []
for alt in alternatives:
for reading in Expressions.expand_abbreviation(alt):
readings.append(reading)
return readings


@@ -1,88 +0,0 @@
import bot.soup as Soup
from bot.data import load_daijirin2_kana_abbreviations
from bot.entries.base.sanseido_entry import SanseidoEntry
import bot.entries.base.expressions as Expressions
class BaseEntry(SanseidoEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.children = []
self.phrases = []
self._kana_abbreviations = load_daijirin2_kana_abbreviations()
def get_part_of_speech_tags(self):
if self._part_of_speech_tags is not None:
return self._part_of_speech_tags
self._part_of_speech_tags = []
soup = self.get_page_soup()
for pos_group in soup.find_all("品詞G"):
if pos_group.parent.name == "大語義":
self._set_part_of_speech_tags(pos_group)
return self._part_of_speech_tags
def _set_part_of_speech_tags(self, el):
pos_names = ["品詞", "品詞活用", "品詞行", "用法"]
for child in el.children:
if child.name is not None:
self._set_part_of_speech_tags(child)
continue
pos = str(child)
if el.name not in pos_names:
continue
elif pos in ["", ""]:
continue
elif pos in self._part_of_speech_tags:
continue
else:
self._part_of_speech_tags.append(pos)
def _get_regular_headwords(self, soup):
self._fill_alts(soup)
reading = soup.find("見出仮名").text
expressions = []
for el in soup.find_all("標準表記"):
expression = self._clean_expression(el.text)
if "" in expression:
kana_abbrs = self._kana_abbreviations[self.entry_id]
for abbr in kana_abbrs:
expression = expression.replace("", abbr, 1)
expressions.append(expression)
expressions = Expressions.expand_abbreviation_list(expressions)
if len(expressions) == 0:
expressions.append(reading)
headwords = {reading: expressions}
return headwords
def _get_subentry_parameters(self):
from bot.entries.daijirin2.child_entry import ChildEntry
from bot.entries.daijirin2.phrase_entry import PhraseEntry
subentry_parameters = [
[ChildEntry, ["子項目"], self.children],
[PhraseEntry, ["句項目"], self.phrases],
]
return subentry_parameters
@staticmethod
def _delete_unused_nodes(soup):
"""Remove extra markup elements that appear in the entry
headword line which are not part of the entry headword"""
unused_nodes = [
"漢字音logo", "活用分節", "連語句活用分節", "語構成",
"表外字マーク", "表外字マーク", "ルビG"
]
for name in unused_nodes:
Soup.delete_soup_nodes(soup, name)
@staticmethod
def _clean_expression(expression):
for x in ["", "", "", "", " "]:
expression = expression.replace(x, "")
return expression
@staticmethod
def _fill_alts(soup):
for gaiji in soup.find_all(class_="gaiji"):
if gaiji.name == "img" and gaiji.has_attr("alt"):
gaiji.name = "span"
gaiji.string = gaiji.attrs["alt"]


@@ -1,9 +0,0 @@
from bot.entries.daijirin2.base_entry import BaseEntry
class ChildEntry(BaseEntry):
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
headwords = self._get_regular_headwords(soup)
return headwords


@@ -1,50 +0,0 @@
import bot.entries.base.expressions as Expressions
from bot.entries.daijirin2.base_entry import BaseEntry
from bot.entries.daijirin2.preprocess import preprocess_page
class Entry(BaseEntry):
def __init__(self, target, page_id):
entry_id = (page_id, 0)
super().__init__(target, entry_id)
def set_page(self, page):
page = preprocess_page(page)
super().set_page(page)
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
if soup.find("漢字見出") is not None:
headwords = self._get_kanji_headwords(soup)
elif soup.find("略語G") is not None:
headwords = self._get_acronym_headwords(soup)
else:
headwords = self._get_regular_headwords(soup)
return headwords
def _get_kanji_headwords(self, soup):
readings = []
for el in soup.find_all("漢字音"):
hira = Expressions.kata_to_hira(el.text)
readings.append(hira)
if soup.find("漢字音") is None:
readings.append("")
expressions = []
for el in soup.find_all("漢字見出"):
expressions.append(el.text)
headwords = {}
for reading in readings:
headwords[reading] = expressions
return headwords
def _get_acronym_headwords(self, soup):
expressions = []
for el in soup.find_all("略語"):
expression_parts = []
for part in el.find_all(["欧字", "和字"]):
expression_parts.append(part.text)
expression = "".join(expression_parts)
expressions.append(expression)
headwords = {"": expressions}
return headwords


@@ -1,67 +0,0 @@
import re
import bot.entries.base.expressions as Expressions
from bot.data import load_phrase_readings
from bot.entries.daijirin2.base_entry import BaseEntry
class PhraseEntry(BaseEntry):
def get_part_of_speech_tags(self):
# phrases do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
headwords = {}
expressions = self._find_expressions(soup)
readings = self._find_readings()
for idx, expression in enumerate(expressions):
reading = readings[idx]
if reading in headwords:
headwords[reading].append(expression)
else:
headwords[reading] = [expression]
return headwords
def _find_expressions(self, soup):
self._delete_unused_nodes(soup)
text = soup.find("句表記").text
text = self._clean_expression(text)
alternatives = parse_phrase(text)
expressions = []
for alt in alternatives:
for exp in Expressions.expand_abbreviation(alt):
expressions.append(exp)
return expressions
def _find_readings(self):
phrase_readings = load_phrase_readings(self.target)
text = phrase_readings[self.entry_id]
alternatives = parse_phrase(text)
readings = []
for alt in alternatives:
for reading in Expressions.expand_abbreviation(alt):
readings.append(reading)
return readings
def parse_phrase(text):
"""Return a list of strings described by notation."""
group_pattern = r"([^]+)(([^]+)([^]+))?"
groups = re.findall(group_pattern, text)
expressions = [""]
for group in groups:
new_exps = []
for expression in expressions:
new_exps.append(expression + group[0])
expressions = new_exps.copy()
if group[1] == "":
continue
new_exps = []
for expression in expressions:
new_exps.append(expression + group[2])
for expression in expressions:
for alt in group[3].split(""):
new_exps.append(expression + alt)
expressions = new_exps.copy()
return expressions


@@ -18,15 +18,15 @@ class Entry(ABC):
@abstractmethod
def get_global_identifier(self):
raise NotImplementedError
pass
@abstractmethod
def set_page(self, page):
raise NotImplementedError
pass
@abstractmethod
def get_page_soup(self):
raise NotImplementedError
pass
def get_headwords(self):
if self._headwords is not None:
@@ -38,15 +38,15 @@ class Entry(ABC):
@abstractmethod
def _get_headwords(self):
raise NotImplementedError
pass
@abstractmethod
def _add_variant_expressions(self, headwords):
raise NotImplementedError
pass
@abstractmethod
def get_part_of_speech_tags(self):
raise NotImplementedError
pass
def get_parent(self):
if self.entry_id in self.SUBENTRY_ID_TO_ENTRY_ID:


@@ -31,14 +31,11 @@ def add_fullwidth(expressions):
def add_variant_kanji(expressions):
variant_kanji = load_variant_kanji()
for kyuuji, shinji in variant_kanji.items():
for old_kanji, new_kanji in variant_kanji.items():
new_exps = []
for expression in expressions:
if kyuuji in expression:
new_exp = expression.replace(kyuuji, shinji)
new_exps.append(new_exp)
if shinji in expression:
new_exp = expression.replace(shinji, kyuuji)
if old_kanji in expression:
new_exp = expression.replace(old_kanji, new_kanji)
new_exps.append(new_exp)
for new_exp in new_exps:
if new_exp not in expressions:
@@ -88,3 +85,40 @@ def expand_abbreviation_list(expressions):
if new_exp not in new_exps:
new_exps.append(new_exp)
return new_exps
def expand_smk_alternatives(text):
"""Return a list of strings described by △ notation."""
m = re.search(r"△([^]+)([^]+)", text)
if m is None:
return [text]
alt_parts = [m.group(1)]
for alt_part in m.group(2).split(""):
alt_parts.append(alt_part)
alts = []
for alt_part in alt_parts:
alt_exp = re.sub(r"△[^]+[^]+", alt_part, text)
alts.append(alt_exp)
return alts
def expand_daijirin_alternatives(text):
"""Return a list of strings described by notation."""
group_pattern = r"([^]+)(([^]+)([^]+))?"
groups = re.findall(group_pattern, text)
expressions = [""]
for group in groups:
new_exps = []
for expression in expressions:
new_exps.append(expression + group[0])
expressions = new_exps.copy()
if group[1] == "":
continue
new_exps = []
for expression in expressions:
new_exps.append(expression + group[2])
for expression in expressions:
for alt in group[3].split(""):
new_exps.append(expression + alt)
expressions = new_exps.copy()
return expressions

bot/entries/factory.py (new file, 18 lines)

@@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.entries.jitenon import JitenonKokugoEntry
from bot.entries.jitenon import JitenonYojiEntry
from bot.entries.jitenon import JitenonKotowazaEntry
from bot.entries.smk8 import Smk8Entry
from bot.entries.daijirin2 import Daijirin2Entry
def new_entry(target, page_id):
entry_map = {
Targets.JITENON_KOKUGO: JitenonKokugoEntry,
Targets.JITENON_YOJI: JitenonYojiEntry,
Targets.JITENON_KOTOWAZA: JitenonKotowazaEntry,
Targets.SMK8: Smk8Entry,
Targets.DAIJIRIN2: Daijirin2Entry,
}
return entry_map[target](target, page_id)


@@ -3,11 +3,11 @@ from abc import abstractmethod
from datetime import datetime, date
from bs4 import BeautifulSoup
from bot.entries.base.entry import Entry
import bot.entries.base.expressions as Expressions
from bot.entries.entry import Entry
import bot.entries.expressions as Expressions
class JitenonEntry(Entry):
class _JitenonEntry(Entry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.expression = ""
@@ -58,7 +58,7 @@ class JitenonEntry(Entry):
@abstractmethod
def _get_column_map(self):
raise NotImplementedError
pass
def __set_modified_date(self, page):
m = re.search(r"\"dateModified\": \"(\d{4}-\d{2}-\d{2})", page)
@@ -140,3 +140,104 @@ class JitenonEntry(Entry):
elif isinstance(attr_val, list):
colvals.append("".join(attr_val))
return ",".join(colvals)
class JitenonYojiEntry(_JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.origin = ""
self.kanken_level = ""
self.category = ""
self.related_expressions = []
def _get_column_map(self):
return {
"四字熟語": "expression",
"読み方": "yomikata",
"意味": "definition",
"異形": "other_forms",
"出典": "origin",
"漢検級": "kanken_level",
"場面用途": "category",
"類義語": "related_expressions",
}
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
class JitenonKotowazaEntry(_JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.origin = ""
self.example = ""
self.related_expressions = []
def _get_column_map(self):
return {
"言葉": "expression",
"読み方": "yomikata",
"意味": "definition",
"異形": "other_forms",
"出典": "origin",
"例文": "example",
"類句": "related_expressions",
}
def _get_headwords(self):
if self.expression == "金棒引き・鉄棒引き":
headwords = {
"かなぼうひき": ["金棒引き", "鉄棒引き"]
}
else:
headwords = super()._get_headwords()
return headwords
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
class JitenonKokugoEntry(_JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.example = ""
self.alt_expression = ""
self.antonym = ""
self.attachments = ""
self.compounds = ""
self.related_words = ""
def _get_column_map(self):
return {
"言葉": "expression",
"読み方": "yomikata",
"意味": "definition",
"例文": "example",
"別表記": "alt_expression",
"対義語": "antonym",
"活用": "attachments",
"用例": "compounds",
"類語": "related_words",
}
def _get_headwords(self):
headwords = {}
for reading in self.yomikata.split(""):
if reading not in headwords:
headwords[reading] = []
for expression in self.expression.split(""):
headwords[reading].append(expression)
if self.alt_expression.strip() != "":
for expression in self.alt_expression.split(""):
headwords[reading].append(expression)
return headwords
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
Expressions.remove_iteration_mark(expressions)
Expressions.add_iteration_mark(expressions)


@@ -1,45 +0,0 @@
from bot.entries.base.jitenon_entry import JitenonEntry
import bot.entries.base.expressions as Expressions
class Entry(JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.example = ""
self.alt_expression = ""
self.antonym = ""
self.attachments = ""
self.compounds = ""
self.related_words = ""
def _get_column_map(self):
return {
"言葉": "expression",
"読み方": "yomikata",
"意味": "definition",
"例文": "example",
"別表記": "alt_expression",
"対義語": "antonym",
"活用": "attachments",
"用例": "compounds",
"類語": "related_words",
}
def _get_headwords(self):
headwords = {}
for reading in self.yomikata.split(""):
if reading not in headwords:
headwords[reading] = []
for expression in self.expression.split(""):
headwords[reading].append(expression)
if self.alt_expression.strip() != "":
for expression in self.alt_expression.split(""):
headwords[reading].append(expression)
return headwords
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
Expressions.remove_iteration_mark(expressions)
Expressions.add_iteration_mark(expressions)


@@ -1,35 +0,0 @@
from bot.entries.base.jitenon_entry import JitenonEntry
import bot.entries.base.expressions as Expressions
class Entry(JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.origin = ""
self.example = ""
self.related_expressions = []
def _get_column_map(self):
return {
"言葉": "expression",
"読み方": "yomikata",
"意味": "definition",
"異形": "other_forms",
"出典": "origin",
"例文": "example",
"類句": "related_expressions",
}
def _get_headwords(self):
if self.expression == "金棒引き・鉄棒引き":
headwords = {
"かなぼうひき": ["金棒引き", "鉄棒引き"]
}
else:
headwords = super()._get_headwords()
return headwords
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)


@@ -1,27 +0,0 @@
import bot.entries.base.expressions as Expressions
from bot.entries.base.jitenon_entry import JitenonEntry
class Entry(JitenonEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.origin = ""
self.kanken_level = ""
self.category = ""
self.related_expressions = []
def _get_column_map(self):
return {
"四字熟語": "expression",
"読み方": "yomikata",
"意味": "definition",
"異形": "other_forms",
"出典": "origin",
"漢検級": "kanken_level",
"場面用途": "category",
"類義語": "related_expressions",
}
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)


@@ -1,104 +0,0 @@
import bot.soup as Soup
from bot.entries.base.sanseido_entry import SanseidoEntry
from bot.entries.sankoku8.parse import parse_hyouki_soup
class BaseEntry(SanseidoEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.children = []
self.phrases = []
self._hyouki_name = "表記"
self._midashi_name = None
self._midashi_kana_name = None
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
readings = self._find_readings(soup)
expressions = self._find_expressions(soup)
headwords = {}
for reading in readings:
headwords[reading] = []
if len(readings) == 1:
reading = readings[0]
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
headwords[reading].append(reading)
for exp in expressions:
if exp not in headwords[reading]:
headwords[reading].append(exp)
elif len(readings) > 1 and len(expressions) == 0:
for reading in readings:
headwords[reading].append(reading)
elif len(readings) > 1 and len(expressions) == 1:
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
for reading in readings:
headwords[reading].append(reading)
expression = expressions[0]
for reading in readings:
if expression not in headwords[reading]:
headwords[reading].append(expression)
elif len(readings) > 1 and len(expressions) == len(readings):
if soup.find(self._midashi_name).find(self._hyouki_name) is None:
for reading in readings:
headwords[reading].append(reading)
for idx, reading in enumerate(readings):
exp = expressions[idx]
if exp not in headwords[reading]:
headwords[reading].append(exp)
else:
raise Exception() # shouldn't happen
return headwords
def get_part_of_speech_tags(self):
if self._part_of_speech_tags is not None:
return self._part_of_speech_tags
self._part_of_speech_tags = []
soup = self.get_page_soup()
for midashi in soup.find_all([self._midashi_name, "見出部要素"]):
pos_group = midashi.find("品詞G")
if pos_group is None:
continue
for tag in pos_group.find_all("a"):
if tag.text not in self._part_of_speech_tags:
self._part_of_speech_tags.append(tag.text)
return self._part_of_speech_tags
def _find_expressions(self, soup):
expressions = []
for hyouki in soup.find_all(self._hyouki_name):
self._fill_alts(hyouki)
for expression in parse_hyouki_soup(hyouki, [""]):
expressions.append(expression)
return expressions
def _find_readings(self, soup):
midasi_kana = soup.find(self._midashi_kana_name)
readings = parse_hyouki_soup(midasi_kana, [""])
return readings
def _get_subentry_parameters(self):
from bot.entries.sankoku8.child_entry import ChildEntry
from bot.entries.sankoku8.phrase_entry import PhraseEntry
subentry_parameters = [
[ChildEntry, ["子項目"], self.children],
[PhraseEntry, ["句項目"], self.phrases],
]
return subentry_parameters
@staticmethod
def _delete_unused_nodes(soup):
"""Remove extra markup elements that appear in the entry
headword line which are not part of the entry headword"""
unused_nodes = [
"語構成", "平板", "アクセント", "表外字マーク", "表外音訓マーク",
"アクセント分節", "活用分節", "ルビG", "分書"
]
for name in unused_nodes:
Soup.delete_soup_nodes(soup, name)
@staticmethod
def _fill_alts(soup):
for img in soup.find_all("img"):
if img.has_attr("alt"):
img.string = img.attrs["alt"]


@@ -1,8 +0,0 @@
from bot.entries.sankoku8.base_entry import BaseEntry
class ChildEntry(BaseEntry):
def __init__(self, target, page_id):
super().__init__(target, page_id)
self._midashi_name = "子見出部"
self._midashi_kana_name = "子見出仮名"


@@ -1,14 +0,0 @@
from bot.entries.sankoku8.base_entry import BaseEntry
from bot.entries.sankoku8.preprocess import preprocess_page
class Entry(BaseEntry):
def __init__(self, target, page_id):
entry_id = (page_id, 0)
super().__init__(target, entry_id)
self._midashi_name = "見出部"
self._midashi_kana_name = "見出仮名"
def set_page(self, page):
page = preprocess_page(page)
super().set_page(page)


@@ -1,65 +0,0 @@
from bs4 import BeautifulSoup
def parse_hyouki_soup(soup, base_exps):
omitted_characters = [
"", "", "", "", "", "", "", "", ""
]
exps = base_exps.copy()
for child in soup.children:
new_exps = []
if child.name == "言換G":
for alt in child.find_all("言換"):
parts = parse_hyouki_soup(alt, [""])
for exp in exps:
for part in parts:
new_exps.append(exp + part)
elif child.name == "補足表記":
alt1 = child.find("表記対象")
alt2 = child.find("表記内容G")
parts1 = parse_hyouki_soup(alt1, [""])
parts2 = parse_hyouki_soup(alt2, [""])
for exp in exps:
for part in parts1:
new_exps.append(exp + part)
for part in parts2:
new_exps.append(exp + part)
elif child.name == "省略":
parts = parse_hyouki_soup(child, [""])
for exp in exps:
new_exps.append(exp)
for part in parts:
new_exps.append(exp + part)
elif child.name is not None:
new_exps = parse_hyouki_soup(child, exps)
else:
text = child.text
for char in omitted_characters:
text = text.replace(char, "")
for exp in exps:
new_exps.append(exp + text)
exps = new_exps.copy()
return exps
def parse_hyouki_pattern(pattern):
replacements = {
"": "<省略>",
"": "</省略>",
"": "<補足表記><表記対象>",
"": "</表記対象><表記内容G><表記内容>",
"": "</表記内容></表記内容G></補足表記>",
"": "<言換G>〈<言換>",
"": "</言換><言換>",
"": "</言換>〉</言換G>",
"": "<補足表記><表記対象>",
"": "</表記対象><表記内容G>⦅<表記内容>",
"": "</表記内容>⦆</表記内容G></補足表記>",
}
markup = f"<span>{pattern}</span>"
for key, val in replacements.items():
markup = markup.replace(key, val)
soup = BeautifulSoup(markup, "xml")
hyouki_soup = soup.find("span")
exps = parse_hyouki_soup(hyouki_soup, [""])
return exps


@@ -1,37 +0,0 @@
from bot.data import load_phrase_readings
from bot.entries.sankoku8.base_entry import BaseEntry
from bot.entries.sankoku8.parse import parse_hyouki_soup
from bot.entries.sankoku8.parse import parse_hyouki_pattern
class PhraseEntry(BaseEntry):
def get_part_of_speech_tags(self):
# phrases do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
expressions = self._find_expressions(soup)
readings = self._find_readings(soup)
headwords = {}
if len(expressions) != len(readings):
raise Exception(f"{self.entry_id[0]}-{self.entry_id[1]}")
for idx, expression in enumerate(expressions):
reading = readings[idx]
if reading in headwords:
headwords[reading].append(expression)
else:
headwords[reading] = [expression]
return headwords
def _find_expressions(self, soup):
phrase_soup = soup.find("句表記")
expressions = parse_hyouki_soup(phrase_soup, [""])
return expressions
def _find_readings(self, soup):
reading_patterns = load_phrase_readings(self.target)
reading_pattern = reading_patterns[self.entry_id]
readings = parse_hyouki_pattern(reading_pattern)
return readings


@@ -1,51 +0,0 @@
import re
from bs4 import BeautifulSoup
from bot.data import get_adobe_glyph
__GAIJI = {
"svg-gaiji/byan.svg": "𰻞",
"svg-gaiji/G16EF.svg": "",
}
def preprocess_page(page):
soup = BeautifulSoup(page, features="xml")
__replace_glyph_codes(soup)
__add_image_alt_text(soup)
__replace_tatehyphen(soup)
page = __strip_page(soup)
return page
def __replace_glyph_codes(soup):
for el in soup.find_all("glyph"):
m = re.search(r"^glyph:([0-9]+);?$", el.attrs["style"])
code = int(m.group(1))
for geta in el.find_all(string="〓"):
glyph = get_adobe_glyph(code)
geta.replace_with(glyph)
def __add_image_alt_text(soup):
for img in soup.find_all("img"):
if not img.has_attr("src"):
continue
src = img.attrs["src"]
if src in __GAIJI:
img.attrs["alt"] = __GAIJI[src]
def __replace_tatehyphen(soup):
for img in soup.find_all("img", {"src": "svg-gaiji/tatehyphen.svg"}):
img.string = ""
img.unwrap()
def __strip_page(soup):
koumoku = soup.find(["項目"])
if koumoku is not None:
return koumoku.decode()
else:
raise Exception(f"Primary 項目 not found in page:\n{soup.prettify()}")

bot/entries/smk8.py (new file, 221 lines)

@@ -0,0 +1,221 @@
from bs4 import BeautifulSoup
import bot.entries.expressions as Expressions
import bot.soup as Soup
from bot.data import load_smk8_phrase_readings
from bot.entries.entry import Entry
from bot.entries.smk8_preprocess import preprocess_page
class _BaseSmk8Entry(Entry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.children = []
self.phrases = []
self.kanjis = []
def get_global_identifier(self):
parent_part = format(self.entry_id[0], '06')
child_part = hex(self.entry_id[1]).lstrip('0x').zfill(4).upper()
return f"@{self.target.value}-{parent_part}-{child_part}"
def set_page(self, page):
page = self.__decompose_subentries(page)
self._page = page
def get_page_soup(self):
soup = BeautifulSoup(self._page, "xml")
return soup
def get_part_of_speech_tags(self):
if self._part_of_speech_tags is not None:
return self._part_of_speech_tags
self._part_of_speech_tags = []
soup = self.get_page_soup()
headword_info = soup.find("見出要素")
if headword_info is None:
return self._part_of_speech_tags
for tag in headword_info.find_all("品詞M"):
if tag.text not in self._part_of_speech_tags:
self._part_of_speech_tags.append(tag.text)
return self._part_of_speech_tags
def _add_variant_expressions(self, headwords):
for expressions in headwords.values():
Expressions.add_variant_kanji(expressions)
Expressions.add_fullwidth(expressions)
Expressions.remove_iteration_mark(expressions)
Expressions.add_iteration_mark(expressions)
def _find_reading(self, soup):
midasi_kana = soup.find("見出仮名")
reading = midasi_kana.text
for x in [" ", ""]:
reading = reading.replace(x, "")
return reading
def _find_expressions(self, soup):
clean_expressions = []
for expression in soup.find_all("標準表記"):
clean_expression = self._clean_expression(expression.text)
clean_expressions.append(clean_expression)
expressions = Expressions.expand_abbreviation_list(clean_expressions)
return expressions
def __decompose_subentries(self, page):
soup = BeautifulSoup(page, features="xml")
subentry_parameters = [
[Smk8ChildEntry, ["子項目F", "子項目"], self.children],
[Smk8PhraseEntry, ["句項目F", "句項目"], self.phrases],
[Smk8KanjiEntry, ["造語成分項目"], self.kanjis],
]
for x in subentry_parameters:
subentry_class, tags, subentry_list = x
for tag in tags:
tag_soup = soup.find(tag)
while tag_soup is not None:
tag_soup.name = "項目"
subentry_id = self.id_string_to_entry_id(tag_soup.attrs["id"])
self.SUBENTRY_ID_TO_ENTRY_ID[subentry_id] = self.entry_id
subentry = subentry_class(self.target, subentry_id)
page = tag_soup.decode()
subentry.set_page(page)
subentry_list.append(subentry)
tag_soup.decompose()
tag_soup = soup.find(tag)
return soup.decode()
@staticmethod
def id_string_to_entry_id(id_string):
parts = id_string.split("-")
if len(parts) == 1:
return (int(parts[0]), 0)
elif len(parts) == 2:
# subentries have a hexadecimal part
return (int(parts[0]), int(parts[1], 16))
else:
raise Exception(f"Invalid entry ID: {id_string}")
@staticmethod
def _delete_unused_nodes(soup):
"""Remove extra markup elements that appear in the entry
headword line which are not part of the entry headword"""
unused_nodes = [
"表音表記", "表外音訓マーク", "表外字マーク", "ルビG"
]
for name in unused_nodes:
Soup.delete_soup_nodes(soup, name)
@staticmethod
def _clean_expression(expression):
for x in ["", "", "", "", "", " "]:
expression = expression.replace(x, "")
return expression
@staticmethod
def _fill_alts(soup):
for el in soup.find_all(["親見出仮名", "親見出表記"]):
el.string = el.attrs["alt"]
for gaiji in soup.find_all("外字"):
gaiji.string = gaiji.img.attrs["alt"]
class Smk8Entry(_BaseSmk8Entry):
def __init__(self, target, page_id):
entry_id = (page_id, 0)
super().__init__(target, entry_id)
def set_page(self, page):
page = preprocess_page(page)
super().set_page(page)
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self._find_reading(soup)
expressions = []
if soup.find("見出部").find("標準表記") is None:
expressions.append(reading)
for expression in self._find_expressions(soup):
if expression not in expressions:
expressions.append(expression)
headwords = {reading: expressions}
return headwords
class Smk8ChildEntry(_BaseSmk8Entry):
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self._find_reading(soup)
expressions = []
if soup.find("子見出部").find("標準表記") is None:
expressions.append(reading)
for expression in self._find_expressions(soup):
if expression not in expressions:
expressions.append(expression)
headwords = {reading: expressions}
return headwords
class Smk8PhraseEntry(_BaseSmk8Entry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.__phrase_readings = load_smk8_phrase_readings()
def get_part_of_speech_tags(self):
# phrases do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
headwords = {}
expressions = self._find_expressions(soup)
readings = self._find_readings()
for idx, expression in enumerate(expressions):
reading = readings[idx]
if reading in headwords:
headwords[reading].append(expression)
else:
headwords[reading] = [expression]
return headwords
def _find_expressions(self, soup):
self._delete_unused_nodes(soup)
self._fill_alts(soup)
text = soup.find("標準表記").text
text = self._clean_expression(text)
alternatives = Expressions.expand_smk_alternatives(text)
expressions = []
for alt in alternatives:
for exp in Expressions.expand_abbreviation(alt):
expressions.append(exp)
return expressions
def _find_readings(self):
text = self.__phrase_readings[self.entry_id]
alternatives = Expressions.expand_smk_alternatives(text)
readings = []
for alt in alternatives:
for reading in Expressions.expand_abbreviation(alt):
readings.append(reading)
return readings
class Smk8KanjiEntry(_BaseSmk8Entry):
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self.__get_parent_reading()
expressions = self._find_expressions(soup)
headwords = {reading: expressions}
return headwords
def __get_parent_reading(self):
parent_id = self.SUBENTRY_ID_TO_ENTRY_ID[self.entry_id]
parent = self.ID_TO_ENTRY[parent_id]
reading = parent.get_first_reading()
return reading
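To illustrate the ID scheme above (values hypothetical):
# id_string_to_entry_id("12345")      -> (12345, 0)
# id_string_to_entry_id("12345-00AF") -> (12345, 175)  # child part parsed as hex
# get_global_identifier() then renders (12345, 175) as "@smk8-012345-00AF":
# six-digit zero-padded decimal parent, four-digit uppercase hex child.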


@ -1,73 +0,0 @@
import bot.soup as Soup
import bot.entries.base.expressions as Expressions
from bot.entries.base.sanseido_entry import SanseidoEntry
class BaseEntry(SanseidoEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.children = []
self.phrases = []
self.kanjis = []
def get_part_of_speech_tags(self):
if self._part_of_speech_tags is not None:
return self._part_of_speech_tags
self._part_of_speech_tags = []
soup = self.get_page_soup()
headword_info = soup.find("見出要素")
if headword_info is None:
return self._part_of_speech_tags
for tag in headword_info.find_all("品詞M"):
if tag.text not in self._part_of_speech_tags:
self._part_of_speech_tags.append(tag.text)
return self._part_of_speech_tags
def _find_reading(self, soup):
midasi_kana = soup.find("見出仮名")
reading = midasi_kana.text
for x in [" ", ""]:
reading = reading.replace(x, "")
return reading
def _find_expressions(self, soup):
clean_expressions = []
for expression in soup.find_all("標準表記"):
clean_expression = self._clean_expression(expression.text)
clean_expressions.append(clean_expression)
expressions = Expressions.expand_abbreviation_list(clean_expressions)
return expressions
def _get_subentry_parameters(self):
from bot.entries.smk8.child_entry import ChildEntry
from bot.entries.smk8.phrase_entry import PhraseEntry
from bot.entries.smk8.kanji_entry import KanjiEntry
subentry_parameters = [
[ChildEntry, ["子項目F", "子項目"], self.children],
[PhraseEntry, ["句項目F", "句項目"], self.phrases],
[KanjiEntry, ["造語成分項目"], self.kanjis],
]
return subentry_parameters
@staticmethod
def _delete_unused_nodes(soup):
"""Remove extra markup elements that appear in the entry
headword line which are not part of the entry headword"""
unused_nodes = [
"表音表記", "表外音訓マーク", "表外字マーク", "ルビG"
]
for name in unused_nodes:
Soup.delete_soup_nodes(soup, name)
@staticmethod
def _clean_expression(expression):
for x in ["", "", "", "", "", " "]:
expression = expression.replace(x, "")
return expression
@staticmethod
def _fill_alts(soup):
for elm in soup.find_all(["親見出仮名", "親見出表記"]):
elm.string = elm.attrs["alt"]
for gaiji in soup.find_all("外字"):
gaiji.string = gaiji.img.attrs["alt"]


@ -1,17 +0,0 @@
from bot.entries.smk8.base_entry import BaseEntry
class ChildEntry(BaseEntry):
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self._find_reading(soup)
expressions = []
if soup.find("子見出部").find("標準表記") is None:
expressions.append(reading)
for expression in self._find_expressions(soup):
if expression not in expressions:
expressions.append(expression)
headwords = {reading: expressions}
return headwords


@ -1,26 +0,0 @@
from bot.entries.smk8.base_entry import BaseEntry
from bot.entries.smk8.preprocess import preprocess_page
class Entry(BaseEntry):
def __init__(self, target, page_id):
entry_id = (page_id, 0)
super().__init__(target, entry_id)
def set_page(self, page):
page = preprocess_page(page)
super().set_page(page)
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self._find_reading(soup)
expressions = []
if soup.find("見出部").find("標準表記") is None:
expressions.append(reading)
for expression in self._find_expressions(soup):
if expression not in expressions:
expressions.append(expression)
headwords = {reading: expressions}
return headwords


@ -1,22 +0,0 @@
from bot.entries.smk8.base_entry import BaseEntry
class KanjiEntry(BaseEntry):
def get_part_of_speech_tags(self):
# kanji entries do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
self._delete_unused_nodes(soup)
self._fill_alts(soup)
reading = self.__get_parent_reading()
expressions = self._find_expressions(soup)
headwords = {reading: expressions}
return headwords
def __get_parent_reading(self):
parent_id = self.SUBENTRY_ID_TO_ENTRY_ID[self.entry_id]
parent = self.ID_TO_ENTRY[parent_id]
reading = parent.get_first_reading()
return reading


@ -1,64 +0,0 @@
import re
import bot.entries.base.expressions as Expressions
from bot.data import load_phrase_readings
from bot.entries.smk8.base_entry import BaseEntry
class PhraseEntry(BaseEntry):
def __init__(self, target, entry_id):
super().__init__(target, entry_id)
self.__phrase_readings = load_phrase_readings(self.target)
def get_part_of_speech_tags(self):
# phrase entries do not contain these tags
return []
def _get_headwords(self):
soup = self.get_page_soup()
headwords = {}
expressions = self._find_expressions(soup)
readings = self._find_readings()
for idx, expression in enumerate(expressions):
reading = readings[idx]
if reading in headwords:
headwords[reading].append(expression)
else:
headwords[reading] = [expression]
return headwords
def _find_expressions(self, soup):
self._delete_unused_nodes(soup)
self._fill_alts(soup)
text = soup.find("標準表記").text
text = self._clean_expression(text)
alternatives = parse_phrase(text)
expressions = []
for alt in alternatives:
for exp in Expressions.expand_abbreviation(alt):
expressions.append(exp)
return expressions
def _find_readings(self):
text = self.__phrase_readings[self.entry_id]
alternatives = parse_phrase(text)
readings = []
for alt in alternatives:
for reading in Expressions.expand_abbreviation(alt):
readings.append(reading)
return readings
def parse_phrase(text):
"""Return a list of strings described by △ notation."""
match = re.search(r"△([^]+)([^]+)", text)
if match is None:
return [text]
alt_parts = [match.group(1)]
for alt_part in match.group(2).split("・"):
alt_parts.append(alt_part)
alts = []
for alt_part in alt_parts:
alt_exp = re.sub(r"△[^]+[^]+", alt_part, text)
alts.append(alt_exp)
return alts
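A worked example of the △ notation above (phrase hypothetical, full-width punctuation as reconstructed):
# parse_phrase("△手（掌）を合わせる")
# -> ["手を合わせる", "掌を合わせる"]
# group(1) is the inline form; group(2) holds the alternatives, split on "・".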


@ -6,8 +6,8 @@ from bot.data import get_adobe_glyph
__GAIJI = {
"gaiji/5350.svg": "",
"gaiji/62cb.svg": "",
"gaiji/7be1.svg": "",
"gaiji/62cb.svg": "",
"gaiji/7be1.svg": "",
}


@ -1,37 +0,0 @@
import importlib
def new_crawler(target):
module_path = f"bot.crawlers.{target.name.lower()}"
module = importlib.import_module(module_path)
return module.Crawler(target)
def new_entry(target, page_id):
module_path = f"bot.entries.{target.name.lower()}.entry"
module = importlib.import_module(module_path)
return module.Entry(target, page_id)
def new_yomichan_exporter(target):
module_path = f"bot.yomichan.exporters.{target.name.lower()}"
module = importlib.import_module(module_path)
return module.Exporter(target)
def new_yomichan_terminator(target):
module_path = f"bot.yomichan.terms.{target.name.lower()}"
module = importlib.import_module(module_path)
return module.Terminator(target)
def new_mdict_exporter(target):
module_path = f"bot.mdict.exporters.{target.name.lower()}"
module = importlib.import_module(module_path)
return module.Exporter(target)
def new_mdict_terminator(target):
module_path = f"bot.mdict.terms.{target.name.lower()}"
module = importlib.import_module(module_path)
return module.Terminator(target)


@ -1,18 +0,0 @@
from bot.mdict.exporters.base.exporter import BaseExporter
class JitenonExporter(BaseExporter):
def _get_revision(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
modified_date = entry.modified_date
revision = modified_date.strftime("%Y年%m月%d日閲覧")
return revision
def _get_attribution(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
attribution = entry.attribution
return attribution


@ -1,8 +0,0 @@
from datetime import datetime
from bot.mdict.exporters.base.exporter import BaseExporter
class MonokakidoExporter(BaseExporter):
def _get_revision(self, entries):
timestamp = datetime.now().strftime("%Y年%m月%d日作成")
return timestamp


@ -1,6 +0,0 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2019"


@ -1,19 +1,21 @@
# pylint: disable=too-few-public-methods
import subprocess
import os
import shutil
import subprocess
from abc import ABC, abstractmethod
from pathlib import Path
from datetime import datetime
from platformdirs import user_documents_dir, user_cache_dir
from bot.time import timestamp
from bot.factory import new_mdict_terminator
from bot.targets import Targets
from bot.mdict.terms.factory import new_terminator
class BaseExporter(ABC):
class Exporter(ABC):
def __init__(self, target):
self._target = target
self._terminator = new_mdict_terminator(target)
self._terminator = new_terminator(target)
self._build_dir = None
self._build_media_dir = None
self._description_file = None
@ -22,10 +24,11 @@ class BaseExporter(ABC):
def export(self, entries, media_dir, icon_file):
self._init_build_media_dir(media_dir)
self._init_description_file(entries)
self._write_mdx_file(entries)
terms = self._get_terms(entries)
print(f"Exporting {len(terms)} Mdict keys...")
self._write_mdx_file(terms)
self._write_mdd_file()
self._write_icon_file(icon_file)
self._write_css_file()
self._rm_build_dir()
def _get_build_dir(self):
@ -33,7 +36,7 @@ class BaseExporter(ABC):
return self._build_dir
cache_dir = user_cache_dir("jitenbot")
build_directory = os.path.join(cache_dir, "mdict_build")
print(f"{timestamp()} Initializing build directory `{build_directory}`")
print(f"Initializing build directory `{build_directory}`")
if Path(build_directory).is_dir():
shutil.rmtree(build_directory)
os.makedirs(build_directory)
@ -44,7 +47,7 @@ class BaseExporter(ABC):
build_dir = self._get_build_dir()
build_media_dir = os.path.join(build_dir, self._target.value)
if media_dir is not None:
print(f"{timestamp()} Copying media files to build directory...")
print("Copying media files to build directory...")
shutil.copytree(media_dir, build_media_dir)
else:
os.makedirs(build_media_dir)
@ -54,23 +57,34 @@ class BaseExporter(ABC):
self._build_media_dir = build_media_dir
def _init_description_file(self, entries):
description_template_file = self._get_description_template_file()
with open(description_template_file, "r", encoding="utf8") as f:
filename = f"{self._target.value}.mdx.description.html"
original_file = os.path.join(
"data", "mdict", "description", filename)
with open(original_file, "r", encoding="utf8") as f:
description = f.read()
description = description.replace(
"{{revision}}", self._get_revision(entries))
description = description.replace(
"{{attribution}}", self._get_attribution(entries))
build_dir = self._get_build_dir()
description_file = os.path.join(
build_dir, f"{self._target.value}.mdx.description.html")
description_file = os.path.join(build_dir, filename)
with open(description_file, "w", encoding="utf8") as f:
f.write(description)
self._description_file = description_file
def _write_mdx_file(self, entries):
terms = self._get_terms(entries)
print(f"{timestamp()} Exporting {len(terms)} Mdict keys...")
def _get_terms(self, entries):
terms = []
entries_len = len(entries)
for idx, entry in enumerate(entries):
update = f"Creating Mdict terms for entry {idx+1}/{entries_len}"
print(update, end='\r', flush=True)
new_terms = self._terminator.make_terms(entry)
for term in new_terms:
terms.append(term)
print()
return terms
def _write_mdx_file(self, terms):
out_dir = self._get_out_dir()
out_file = os.path.join(out_dir, f"{self._target.value}.mdx")
params = [
@ -82,18 +96,6 @@ class BaseExporter(ABC):
]
subprocess.run(params, check=True)
def _get_terms(self, entries):
terms = []
entries_len = len(entries)
for idx, entry in enumerate(entries):
update = f"\tCreating MDict terms for entry {idx+1}/{entries_len}"
print(update, end='\r', flush=True)
new_terms = self._terminator.make_terms(entry)
for term in new_terms:
terms.append(term)
print()
return terms
def _write_mdd_file(self):
out_dir = self._get_out_dir()
out_file = os.path.join(out_dir, f"{self._target.value}.mdd")
@ -107,7 +109,7 @@ class BaseExporter(ABC):
subprocess.run(params, check=True)
def _write_icon_file(self, icon_file):
premade_icon_file = self._get_premade_icon_file()
premade_icon_file = f"data/mdict/icon/{self._target.value}.png"
out_dir = self._get_out_dir()
out_file = os.path.join(out_dir, f"{self._target.value}.png")
if icon_file is not None and Path(icon_file).is_file():
@ -115,17 +117,12 @@ class BaseExporter(ABC):
elif Path(premade_icon_file).is_file():
shutil.copy(premade_icon_file, out_file)
def _write_css_file(self):
css_file = self._get_css_file()
out_dir = self._get_out_dir()
shutil.copy(css_file, out_dir)
def _get_out_dir(self):
if self._out_dir is not None:
return self._out_dir
out_dir = os.path.join(
user_documents_dir(), "jitenbot", "mdict", self._target.value)
print(f"{timestamp()} Initializing output directory `{out_dir}`")
print(f"Initializing output directory `{out_dir}`")
if Path(out_dir).is_dir():
shutil.rmtree(out_dir)
os.makedirs(out_dir)
@ -151,24 +148,59 @@ class BaseExporter(ABC):
"data", "mdict", "css",
f"{self._target.value}.css")
def _get_premade_icon_file(self):
return os.path.join(
"data", "mdict", "icon",
f"{self._target.value}.png")
def _get_description_template_file(self):
return os.path.join(
"data", "mdict", "description",
f"{self._target.value}.mdx.description.html")
def _rm_build_dir(self):
build_dir = self._get_build_dir()
shutil.rmtree(build_dir)
@abstractmethod
def _get_revision(self, entries):
raise NotImplementedError
pass
@abstractmethod
def _get_attribution(self, entries):
raise NotImplementedError
pass
class _JitenonExporter(Exporter):
def _get_revision(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
modified_date = entry.modified_date
revision = modified_date.strftime("%Y年%m月%d日閲覧")
return revision
def _get_attribution(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
attribution = entry.attribution
return attribution
class JitenonKokugoExporter(_JitenonExporter):
pass
class JitenonYojiExporter(_JitenonExporter):
pass
class JitenonKotowazaExporter(_JitenonExporter):
pass
class _MonokakidoExporter(Exporter):
def _get_revision(self, entries):
timestamp = datetime.now().strftime("%Y年%m月%d日作成")
return timestamp
class Smk8Exporter(_MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2020"
class Daijirin2Exporter(_MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2019"


@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.mdict.exporters.export import JitenonKokugoExporter
from bot.mdict.exporters.export import JitenonYojiExporter
from bot.mdict.exporters.export import JitenonKotowazaExporter
from bot.mdict.exporters.export import Smk8Exporter
from bot.mdict.exporters.export import Daijirin2Exporter
def new_mdict_exporter(target):
exporter_map = {
Targets.JITENON_KOKUGO: JitenonKokugoExporter,
Targets.JITENON_YOJI: JitenonYojiExporter,
Targets.JITENON_KOTOWAZA: JitenonKotowazaExporter,
Targets.SMK8: Smk8Exporter,
Targets.DAIJIRIN2: Daijirin2Exporter,
}
return exporter_map[target](target)
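Hypothetical end-to-end use of the map-based factory above; entries, media_dir, and icon_file are assumed to be prepared by the caller:
from bot.targets import Targets

exporter = new_mdict_exporter(Targets.SMK8)
exporter.export(entries, media_dir, icon_file)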


@ -1,5 +0,0 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,5 +0,0 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,5 +0,0 @@
from bot.mdict.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,6 +0,0 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2021"


@ -1,6 +0,0 @@
from bot.mdict.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2020"


@ -1,137 +0,0 @@
import re
from bs4 import BeautifulSoup
from bot.data import load_mdict_name_conversion
from bot.name_conversion import convert_names
def make_glossary(entry, media_dir):
soup = entry.get_page_soup()
__reposition_marks(soup)
__remove_appendix_links(soup)
__convert_images(soup)
__remove_links_without_href(soup)
__convert_links(soup, entry)
__add_parent_link(soup, entry)
__add_homophone_links(soup, entry)
name_conversion = load_mdict_name_conversion(entry.target)
convert_names(soup, name_conversion)
glossary = soup.span.decode()
return glossary
def __reposition_marks(soup):
"""These 表外字マーク symbols will be converted to rubies later, so they need to
be positioned after the corresponding text in order to appear correctly"""
for elm in soup.find_all("表外字"):
mark = elm.find("表外字マーク")
elm.append(mark)
for elm in soup.find_all("表外音訓"):
mark = elm.find("表外音訓マーク")
elm.append(mark)
def __remove_appendix_links(soup):
"""This info would be useful and nice to have, but jitenbot currently
isn't designed to fetch and process these appendix files. It probably
wouldn't be possible to include them in Yomichan, but it would definitely
be possible for Mdict."""
for elm in soup.find_all("a"):
if not elm.has_attr("href"):
continue
if elm.attrs["href"].startswith("appendix"):
elm.attrs["data-name"] = "a"
elm.attrs["data-href"] = elm.attrs["href"]
elm.name = "span"
del elm.attrs["href"]
def __convert_images(soup):
conversions = [
["svg-logo/重要語.svg", ""],
["svg-logo/最重要語.svg", ""],
["svg-logo/一般常識語.svg", "☆☆"],
["svg-logo/追い込み.svg", ""],
["svg-special/区切り線.svg", "|"],
["svg-accent/平板.svg", ""],
["svg-accent/アクセント.svg", ""],
["svg-logo/アク.svg", "アク"],
["svg-logo/丁寧.svg", "丁寧"],
["svg-logo/可能.svg", "可能"],
["svg-logo/尊敬.svg", "尊敬"],
["svg-logo/接尾.svg", "接尾"],
["svg-logo/接頭.svg", "接頭"],
["svg-logo/表記.svg", "表記"],
["svg-logo/謙譲.svg", "謙譲"],
["svg-logo/区別.svg", "区別"],
["svg-logo/由来.svg", "由来"],
]
for conversion in conversions:
filename, text = conversion
for elm in soup.find_all("img", attrs={"src": filename}):
elm.attrs["data-name"] = elm.name
elm.attrs["data-src"] = elm.attrs["src"]
elm.name = "span"
elm.string = text
del elm.attrs["src"]
def __remove_links_without_href(soup):
for elm in soup.find_all("a"):
if elm.has_attr("href"):
continue
elm.attrs["data-name"] = elm.name
elm.name = "span"
def __convert_links(soup, entry):
for elm in soup.find_all("a"):
href = elm.attrs["href"].split(" ")[0]
if re.match(r"^#?[0-9]+(?:-[0-9A-F]{4})?$", href):
href = href.removeprefix("#")
ref_entry_id = entry.id_string_to_entry_id(href)
if ref_entry_id in entry.ID_TO_ENTRY:
ref_entry = entry.ID_TO_ENTRY[ref_entry_id]
else:
ref_entry = entry.ID_TO_ENTRY[(ref_entry_id[0], 0)]
gid = ref_entry.get_global_identifier()
elm.attrs["href"] = f"entry://{gid}"
elif re.match(r"^entry:", href):
pass
elif re.match(r"^https?:[\w\W]*", href):
pass
else:
raise Exception(f"Invalid href format: {href}")
def __add_parent_link(soup, entry):
elm = soup.find("親見出相当部")
if elm is not None:
parent_entry = entry.get_parent()
gid = parent_entry.get_global_identifier()
elm.attrs["href"] = f"entry://{gid}"
elm.attrs["data-name"] = elm.name
elm.name = "a"
def __add_homophone_links(soup, entry):
forward_link = ["", entry.entry_id[0] + 1]
backward_link = ["", entry.entry_id[0] - 1]
homophone_info_list = [
["svg-logo/homophone1.svg", [forward_link]],
["svg-logo/homophone2.svg", [forward_link, backward_link]],
["svg-logo/homophone3.svg", [backward_link]],
]
for homophone_info in homophone_info_list:
filename, link_info = homophone_info
for elm in soup.find_all("img", attrs={"src": filename}):
for info in link_info:
text, link_id = info
link_entry = entry.ID_TO_ENTRY[(link_id, 0)]
gid = link_entry.get_global_identifier()
link = BeautifulSoup("<a/>", "xml").a
link.string = text
link.attrs["href"] = f"entry://{gid}"
elm.append(link)
elm.unwrap()
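The cross-reference pattern in __convert_links above accepts bare page IDs and hex-suffixed subentry IDs (sample values hypothetical):
import re

pattern = r"^#?[0-9]+(?:-[0-9A-F]{4})?$"
for href in ["#12345", "12345-00AF", "0042"]:
    print(bool(re.match(pattern, href)))  # True for all three
# entry:// and http(s) hrefs pass through unchanged; appendix links were
# already converted to spans by __remove_appendix_links, and any other
# format raises an exception.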


@ -1,20 +0,0 @@
from bot.mdict.terms.base.terminator import BaseTerminator
class JitenonTerminator(BaseTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = None
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = self._glossary_maker.make_glossary(entry, self._media_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _link_glossary_parameters(self, entry):
return []
def _subentry_lists(self, entry):
return []


@ -1,8 +1,8 @@
from bot.mdict.terms.base.terminator import BaseTerminator
from bot.mdict.terms.terminator import Terminator
from bot.mdict.glossary.daijirin2 import make_glossary
class Terminator(BaseTerminator):
class Daijirin2Terminator(Terminator):
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]


@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.mdict.terms.jitenon import JitenonKokugoTerminator
from bot.mdict.terms.jitenon import JitenonYojiTerminator
from bot.mdict.terms.jitenon import JitenonKotowazaTerminator
from bot.mdict.terms.smk8 import Smk8Terminator
from bot.mdict.terms.daijirin2 import Daijirin2Terminator
def new_terminator(target):
terminator_map = {
Targets.JITENON_KOKUGO: JitenonKokugoTerminator,
Targets.JITENON_YOJI: JitenonYojiTerminator,
Targets.JITENON_KOTOWAZA: JitenonKotowazaTerminator,
Targets.SMK8: Smk8Terminator,
Targets.DAIJIRIN2: Daijirin2Terminator,
}
return terminator_map[target](target)


@ -0,0 +1,42 @@
from bot.mdict.terms.terminator import Terminator
from bot.mdict.glossary.jitenon import JitenonKokugoGlossary
from bot.mdict.glossary.jitenon import JitenonYojiGlossary
from bot.mdict.glossary.jitenon import JitenonKotowazaGlossary
class JitenonTerminator(Terminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = None
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = self._glossary_maker.make_glossary(entry, self._media_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _link_glossary_parameters(self, entry):
return []
def _subentry_lists(self, entry):
return []
class JitenonKokugoTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKokugoGlossary()
class JitenonYojiTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonYojiGlossary()
class JitenonKotowazaTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKotowazaGlossary()


@ -1,8 +0,0 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonKokugoGlossary
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKokugoGlossary()


@ -1,8 +0,0 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonKotowazaGlossary
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKotowazaGlossary()


@ -1,8 +0,0 @@
from bot.mdict.terms.base.jitenon import JitenonTerminator
from bot.mdict.glossary.jitenon import JitenonYojiGlossary
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonYojiGlossary()


@ -1,23 +0,0 @@
from bot.mdict.terms.base.terminator import BaseTerminator
from bot.mdict.glossary.sankoku8 import make_glossary
class Terminator(BaseTerminator):
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = make_glossary(entry, self._media_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _link_glossary_parameters(self, entry):
return [
[entry.children, "子項目"],
[entry.phrases, "句項目"],
]
def _subentry_lists(self, entry):
return [
entry.children,
entry.phrases,
]


@ -1,8 +1,8 @@
from bot.mdict.terms.base.terminator import BaseTerminator
from bot.mdict.terms.terminator import Terminator
from bot.mdict.glossary.smk8 import make_glossary
class Terminator(BaseTerminator):
class Smk8Terminator(Terminator):
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]


@ -1,8 +1,7 @@
import re
from abc import abstractmethod, ABC
class BaseTerminator(ABC):
class Terminator(ABC):
def __init__(self, target):
self._target = target
self._glossary_cache = {}
@ -13,20 +12,35 @@ class BaseTerminator(ABC):
def make_terms(self, entry):
gid = entry.get_global_identifier()
glossary = self.__get_full_glossary(entry)
glossary = self.__full_glossary(entry)
terms = [[gid, glossary]]
keys = self.__get_keys(entry)
keys = set()
headwords = entry.get_headwords()
for reading, expressions in headwords.items():
if len(expressions) == 0:
keys.add(reading)
for expression in expressions:
if expression.strip() == "":
keys.add(reading)
continue
keys.add(expression)
if reading.strip() == "":
continue
if reading != expression:
keys.add(f"{reading}{expression}")
else:
keys.add(reading)
link = f"@@@LINK={gid}"
for key in keys:
if key.strip() != "":
terms.append([key, link])
for subentry_list in self._subentry_lists(entry):
for subentry in subentry_list:
for subentries in self._subentry_lists(entry):
for subentry in subentries:
for term in self.make_terms(subentry):
terms.append(term)
return terms
def __get_full_glossary(self, entry):
def __full_glossary(self, entry):
glossary = []
style_link = f"<link rel='stylesheet' href='{self._target.value}.css' type='text/css'>"
glossary.append(style_link)
@ -46,38 +60,14 @@ class BaseTerminator(ABC):
glossary.append(link_glossary)
return "\n".join(glossary)
def __get_keys(self, entry):
keys = set()
headwords = entry.get_headwords()
for reading, expressions in headwords.items():
stripped_reading = reading.strip()
keys.add(stripped_reading)
if re.match(r"^[ぁ-ヿ、]+$", stripped_reading):
kana_only_key = f"{stripped_reading}【∅】"
else:
kana_only_key = ""
if len(expressions) == 0:
keys.add(kana_only_key)
for expression in expressions:
stripped_expression = expression.strip()
keys.add(stripped_expression)
if stripped_expression == "":
keys.add(kana_only_key)
elif stripped_expression == stripped_reading:
keys.add(kana_only_key)
else:
combo_key = f"{stripped_reading}{stripped_expression}"
keys.add(combo_key)
return keys
@abstractmethod
def _glossary(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _link_glossary_parameters(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _subentry_lists(self, entry):
raise NotImplementedError
pass
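Tracing the inlined key generation above with a hypothetical headword map {"はな": ["花", "華"]} for entry @smk8-000001-0000:
# ["@smk8-000001-0000", "<glossary>"]         # full article keyed by gid
# ["花", "@@@LINK=@smk8-000001-0000"]
# ["華", "@@@LINK=@smk8-000001-0000"]
# ["はな【花】", "@@@LINK=@smk8-000001-0000"]  # reading【expression】 combo keys
# ["はな【華】", "@@@LINK=@smk8-000001-0000"]
# The bare reading is only keyed when it equals an expression or when the
# expression list is empty.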


@ -7,4 +7,3 @@ class Targets(Enum):
JITENON_KOTOWAZA = "jitenon-kotowaza"
SMK8 = "smk8"
DAIJIRIN2 = "daijirin2"
SANKOKU8 = "sankoku8"


@ -1,5 +0,0 @@
import time
def timestamp():
return time.strftime('%X')


@ -1,18 +0,0 @@
from bot.yomichan.exporters.base.exporter import BaseExporter
class JitenonExporter(BaseExporter):
def _get_revision(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
modified_date = entry.modified_date
revision = f"{self._target.value};{modified_date}"
return revision
def _get_attribution(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
attribution = entry.attribution
return attribution


@ -1,8 +0,0 @@
from datetime import datetime
from bot.yomichan.exporters.base.exporter import BaseExporter
class MonokakidoExporter(BaseExporter):
def _get_revision(self, entries):
timestamp = datetime.now().strftime("%Y-%m-%d")
return f"{self._target.value};{timestamp}"


@ -1,6 +0,0 @@
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2019"


@ -1,27 +1,25 @@
# pylint: disable=too-few-public-methods
import json
import os
import shutil
import copy
from pathlib import Path
from datetime import datetime
from abc import ABC, abstractmethod
import fastjsonschema
from platformdirs import user_documents_dir, user_cache_dir
from bot.time import timestamp
from bot.data import load_yomichan_metadata
from bot.data import load_yomichan_term_schema
from bot.factory import new_yomichan_terminator
from bot.yomichan.terms.factory import new_terminator
class BaseExporter(ABC):
class Exporter(ABC):
def __init__(self, target):
self._target = target
self._terminator = new_yomichan_terminator(target)
self._terminator = new_terminator(target)
self._build_dir = None
self._terms_per_file = 2000
def export(self, entries, image_dir, validate):
def export(self, entries, image_dir):
self.__init_build_image_dir(image_dir)
meta = load_yomichan_metadata()
index = meta[self._target.value]["index"]
@ -29,45 +27,34 @@ class BaseExporter(ABC):
index["attribution"] = self._get_attribution(entries)
tags = meta[self._target.value]["tags"]
terms = self.__get_terms(entries)
if validate:
self.__validate_terms(terms)
self.__make_dictionary(terms, index, tags)
@abstractmethod
def _get_revision(self, entries):
raise NotImplementedError
pass
@abstractmethod
def _get_attribution(self, entries):
raise NotImplementedError
pass
def _get_build_dir(self):
if self._build_dir is not None:
return self._build_dir
cache_dir = user_cache_dir("jitenbot")
build_directory = os.path.join(cache_dir, "yomichan_build")
print(f"{timestamp()} Initializing build directory `{build_directory}`")
print(f"Initializing build directory `{build_directory}`")
if Path(build_directory).is_dir():
shutil.rmtree(build_directory)
os.makedirs(build_directory)
self._build_dir = build_directory
return self._build_dir
def __get_invalid_term_dir(self):
cache_dir = user_cache_dir("jitenbot")
log_dir = os.path.join(cache_dir, "invalid_yomichan_terms")
if Path(log_dir).is_dir():
shutil.rmtree(log_dir)
os.makedirs(log_dir)
return log_dir
def __init_build_image_dir(self, image_dir):
build_dir = self._get_build_dir()
build_img_dir = os.path.join(build_dir, self._target.value)
if image_dir is not None:
print(f"{timestamp()} Copying media files to build directory...")
print("Copying media files to build directory...")
shutil.copytree(image_dir, build_img_dir)
print(f"{timestamp()} Finished copying files")
else:
os.makedirs(build_img_dir)
self._terminator.set_image_dir(build_img_dir)
@ -76,7 +63,7 @@ class BaseExporter(ABC):
terms = []
entries_len = len(entries)
for idx, entry in enumerate(entries):
update = f"\tCreating Yomichan terms for entry {idx+1}/{entries_len}"
update = f"Creating Yomichan terms for entry {idx+1}/{entries_len}"
print(update, end='\r', flush=True)
new_terms = self._terminator.make_terms(entry)
for term in new_terms:
@ -84,29 +71,8 @@ class BaseExporter(ABC):
print()
return terms
def __validate_terms(self, terms):
print(f"{timestamp()} Making a copy of term data for validation...")
terms_copy = copy.deepcopy(terms) # because validator will alter data!
term_count = len(terms_copy)
log_dir = self.__get_invalid_term_dir()
schema = load_yomichan_term_schema()
validator = fastjsonschema.compile(schema)
failure_count = 0
for idx, term in enumerate(terms_copy):
update = f"\tValidating term {idx+1}/{term_count}"
print(update, end='\r', flush=True)
try:
validator([term])
except fastjsonschema.JsonSchemaException:
failure_count += 1
term_file = os.path.join(log_dir, f"{idx}.json")
with open(term_file, "w", encoding='utf8') as f:
json.dump([term], f, indent=4, ensure_ascii=False)
print(f"\n{timestamp()} Finished validating with {failure_count} error{'' if failure_count == 1 else 's'}")
if failure_count > 0:
print(f"{timestamp()} Invalid terms saved to `{log_dir}` for debugging")
def __make_dictionary(self, terms, index, tags):
print(f"Exporting {len(terms)} Yomichan terms...")
self.__write_term_banks(terms)
self.__write_index(index)
self.__write_tag_bank(tags)
@ -114,18 +80,14 @@ class BaseExporter(ABC):
self.__rm_build_dir()
def __write_term_banks(self, terms):
print(f"{timestamp()} Exporting {len(terms)} JSON terms")
build_dir = self._get_build_dir()
max_i = int(len(terms) / self._terms_per_file) + 1
for i in range(max_i):
update = f"\tWriting terms to term bank {i+1}/{max_i}"
print(update, end='\r', flush=True)
start = self._terms_per_file * i
end = self._terms_per_file * (i + 1)
term_file = os.path.join(build_dir, f"term_bank_{i+1}.json")
with open(term_file, "w", encoding='utf8') as f:
start = self._terms_per_file * i
end = self._terms_per_file * (i + 1)
json.dump(terms[start:end], f, indent=4, ensure_ascii=False)
print()
def __write_index(self, index):
build_dir = self._get_build_dir()
@ -143,7 +105,6 @@ class BaseExporter(ABC):
def __write_archive(self, filename):
archive_format = "zip"
print(f"{timestamp()} Archiving data to {archive_format.upper()} file...")
out_dir = os.path.join(user_documents_dir(), "jitenbot", "yomichan")
if not Path(out_dir).is_dir():
os.makedirs(out_dir)
@ -154,8 +115,55 @@ class BaseExporter(ABC):
base_filename = os.path.join(out_dir, filename)
build_dir = self._get_build_dir()
shutil.make_archive(base_filename, archive_format, build_dir)
print(f"{timestamp()} Dictionary file saved to `{out_filepath}`")
print(f"Dictionary file saved to {out_filepath}")
def __rm_build_dir(self):
build_dir = self._get_build_dir()
shutil.rmtree(build_dir)
class _JitenonExporter(Exporter):
def _get_revision(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
modified_date = entry.modified_date
revision = f"{self._target.value};{modified_date}"
return revision
def _get_attribution(self, entries):
modified_date = None
for entry in entries:
if modified_date is None or entry.modified_date > modified_date:
attribution = entry.attribution
return attribution
class JitenonKokugoExporter(_JitenonExporter):
pass
class JitenonYojiExporter(_JitenonExporter):
pass
class JitenonKotowazaExporter(_JitenonExporter):
pass
class Smk8Exporter(Exporter):
def _get_revision(self, entries):
timestamp = datetime.now().strftime("%Y-%m-%d")
return f"{self._target.value};{timestamp}"
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2020"
class Daijirin2Exporter(Exporter):
def _get_revision(self, entries):
timestamp = datetime.now().strftime("%Y-%m-%d")
return f"{self._target.value};{timestamp}"
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2019"


@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.yomichan.exporters.export import JitenonKokugoExporter
from bot.yomichan.exporters.export import JitenonYojiExporter
from bot.yomichan.exporters.export import JitenonKotowazaExporter
from bot.yomichan.exporters.export import Smk8Exporter
from bot.yomichan.exporters.export import Daijirin2Exporter
def new_yomi_exporter(target):
exporter_map = {
Targets.JITENON_KOKUGO: JitenonKokugoExporter,
Targets.JITENON_YOJI: JitenonYojiExporter,
Targets.JITENON_KOTOWAZA: JitenonKotowazaExporter,
Targets.SMK8: Smk8Exporter,
Targets.DAIJIRIN2: Daijirin2Exporter,
}
return exporter_map[target](target)


@ -1,5 +0,0 @@
from bot.yomichan.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,5 +0,0 @@
from bot.yomichan.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,5 +0,0 @@
from bot.yomichan.exporters.base.jitenon import JitenonExporter
class Exporter(JitenonExporter):
pass


@ -1,6 +0,0 @@
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2021"


@ -1,6 +0,0 @@
from bot.yomichan.exporters.base.monokakido import MonokakidoExporter
class Exporter(MonokakidoExporter):
def _get_attribution(self, entries):
return "© Sanseido Co., LTD. 2020"


@ -1,10 +1,9 @@
import re
import os
from bs4 import BeautifulSoup
from functools import cache
from pathlib import Path
from bs4 import BeautifulSoup
import bot.yomichan.glossary.icons as Icons
from bot.soup import delete_soup_nodes
from bot.data import load_yomichan_name_conversion
@ -112,8 +111,8 @@ def __convert_gaiji(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
@ -151,8 +150,8 @@ def __convert_logos(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
@ -175,8 +174,8 @@ def __convert_kanjion_logos(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
@ -199,8 +198,8 @@ def __convert_daigoginum(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
@ -223,8 +222,8 @@ def __convert_jundaigoginum(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,


@ -76,7 +76,6 @@ def __get_attributes(attrs):
def __get_style(inline_style_string):
# pylint: disable=no-member
style = {}
parsed_style = parseStyle(inline_style_string)
if parsed_style.fontStyle != "":
@ -101,7 +100,7 @@ def __get_style(inline_style_string):
"marginLeft": parsed_style.marginLeft,
}
for key, val in margins.items():
m = re.search(r"(-?\d+(\.\d*)?|-?\.\d+)em", val)
m = re.search(r"(\d+(\.\d*)?|\.\d+)em", val)
if m:
style[key] = float(m.group(1))
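Comparing the two margin patterns above on a hypothetical negative margin value:
import re

signed = r"(-?\d+(\.\d*)?|-?\.\d+)em"
unsigned = r"(\d+(\.\d*)?|\.\d+)em"
val = "-0.25em"
print(re.search(signed, val).group(1))    # -0.25
print(re.search(unsigned, val).group(1))  # 0.25 (sign falls outside the match)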


@ -26,27 +26,6 @@ def make_monochrome_fill_rectangle(path, text):
f.write(svg)
@cache
def make_accent(path):
svg = __svg_accent()
with open(path, "w", encoding="utf-8") as f:
f.write(svg)
@cache
def make_heiban(path):
svg = __svg_heiban()
with open(path, "w", encoding="utf-8") as f:
f.write(svg)
@cache
def make_red_char(path, char):
svg = __svg_red_character(char)
with open(path, "w", encoding="utf-8") as f:
f.write(svg)
def __calculate_svg_ratio(path):
with open(path, "r", encoding="utf-8") as f:
xml = f.read()
@ -103,30 +82,3 @@ def __svg_masked_rectangle(text):
fill='black' mask='url(#a)'/>
</svg>"""
return svg.strip()
def __svg_heiban():
svg = f"""
<svg viewBox='0 0 210 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
<rect width='210' height='30' fill='red'/>
</svg>"""
return svg.strip()
def __svg_accent():
svg = f"""
<svg viewBox='0 0 150 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
<rect width='150' height='30' fill='red'/>
<rect width='30' height='150' x='120' fill='red'/>
</svg>"""
return svg.strip()
def __svg_red_character(char):
svg = f"""
<svg viewBox='0 0 300 300' xmlns='http://www.w3.org/2000/svg' version='1.1'>
<text text-anchor='middle' x='50%' y='50%' dy='.37em'
font-family='sans-serif' font-size='300px'
fill='red'>{char}</text>
</svg>"""
return svg.strip()


@ -118,8 +118,8 @@ class JitenonKokugoGlossary(JitenonGlossary):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,


@ -1,344 +0,0 @@
import re
import os
from bs4 import BeautifulSoup
import bot.yomichan.glossary.icons as Icons
from bot.data import load_yomichan_name_conversion
from bot.yomichan.glossary.gloss import make_gloss
from bot.name_conversion import convert_names
def make_glossary(entry, media_dir):
soup = entry.get_page_soup()
__remove_glyph_styles(soup)
__reposition_marks(soup)
__remove_links_without_href(soup)
__remove_appendix_links(soup)
__convert_links(soup, entry)
__add_parent_link(soup, entry)
__add_homophone_links(soup, entry)
__convert_images_to_text(soup)
__text_parens_to_images(soup, media_dir)
__replace_icons(soup, media_dir)
__replace_accent_symbols(soup, media_dir)
__convert_gaiji(soup, media_dir)
__convert_graphics(soup, media_dir)
__convert_number_icons(soup, media_dir)
name_conversion = load_yomichan_name_conversion(entry.target)
convert_names(soup, name_conversion)
gloss = make_gloss(soup.span)
glossary = [gloss]
return glossary
def __remove_glyph_styles(soup):
"""The css_parser library will emit annoying warning messages
later if it sees these glyph character styles"""
for elm in soup.find_all("glyph"):
if elm.has_attr("style"):
elm["data-style"] = elm.attrs["style"]
del elm.attrs["style"]
def __reposition_marks(soup):
"""These マーク symbols will be converted to rubies later, so they need to
be positioned after the corresponding text in order to appear correctly"""
for elm in soup.find_all("表外字"):
mark = elm.find("表外字マーク")
elm.append(mark)
for elm in soup.find_all("表外音訓"):
mark = elm.find("表外音訓マーク")
elm.append(mark)
def __remove_links_without_href(soup):
for elm in soup.find_all("a"):
if elm.has_attr("href"):
continue
elm.attrs["data-name"] = elm.name
elm.name = "span"
def __remove_appendix_links(soup):
for elm in soup.find_all("a"):
if elm.attrs["href"].startswith("appendix"):
elm.unwrap()
def __convert_links(soup, entry):
for elm in soup.find_all("a"):
href = elm.attrs["href"].split(" ")[0]
href = href.removeprefix("#")
if not re.match(r"^[0-9]+(?:-[0-9A-F]{4})?$", href):
raise Exception(f"Invalid href format: {href}")
ref_entry_id = entry.id_string_to_entry_id(href)
if ref_entry_id in entry.ID_TO_ENTRY:
ref_entry = entry.ID_TO_ENTRY[ref_entry_id]
else:
ref_entry = entry.ID_TO_ENTRY[(ref_entry_id[0], 0)]
expression = ref_entry.get_first_expression()
elm.attrs["href"] = f"?query={expression}&wildcards=off"
def __add_parent_link(soup, entry):
elm = soup.find("親見出相当部")
if elm is not None:
parent_entry = entry.get_parent()
expression = parent_entry.get_first_expression()
elm.attrs["href"] = f"?query={expression}&wildcards=off"
elm.name = "a"
def __add_homophone_links(soup, entry):
forward_link = ["", entry.entry_id[0] + 1]
backward_link = ["", entry.entry_id[0] - 1]
homophone_info_list = [
["svg-logo/homophone1.svg", [forward_link]],
["svg-logo/homophone2.svg", [forward_link, backward_link]],
["svg-logo/homophone3.svg", [backward_link]],
]
for homophone_info in homophone_info_list:
filename, link_info = homophone_info
for elm in soup.find_all("img", attrs={"src": filename}):
for info in link_info:
text, link_id = info
link_entry = entry.ID_TO_ENTRY[(link_id, 0)]
expression = link_entry.get_first_expression()
link = BeautifulSoup("<a/>", "xml").a
link.string = text
link.attrs["href"] = f"?query={expression}&wildcards=off"
elm.append(link)
elm.unwrap()
def __convert_images_to_text(soup):
conversions = [
["svg-logo/重要語.svg", "", "vertical-align: super; font-size: 0.6em"],
["svg-logo/最重要語.svg", "", "vertical-align: super; font-size: 0.6em"],
["svg-logo/一般常識語.svg", "☆☆", "vertical-align: super; font-size: 0.6em"],
["svg-logo/追い込み.svg", "", ""],
["svg-special/区切り線.svg", "|", ""],
]
for conversion in conversions:
filename, text, style = conversion
for elm in soup.find_all("img", attrs={"src": filename}):
if text == "":
elm.unwrap()
continue
if style != "":
elm.attrs["style"] = style
elm.attrs["data-name"] = elm.name
elm.attrs["data-src"] = elm.attrs["src"]
elm.name = "span"
elm.string = text
del elm.attrs["src"]
def __text_parens_to_images(soup, media_dir):
for elm in soup.find_all("red"):
char = elm.text
if char not in ["（", "）"]:
continue
filename = f"red_{char}.svg"
path = os.path.join(media_dir, filename)
Icons.make_red_char(path, char)
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
"background": False,
"appearance": "auto",
"path": f"{os.path.basename(media_dir)}/{filename}",
}
elm.attrs["data-name"] = elm.name
elm.name = "span"
elm.string = ""
elm.append(img)
elm.attrs["style"] = "vertical-align: text-bottom;"
def __replace_icons(soup, media_dir):
cls_to_appearance = {
"default": "monochrome",
"fill": "monochrome",
"red": "auto",
"redfill": "auto",
"none": "monochrome",
}
icon_info_list = [
["svg-logo/アク.svg", "アク", "default"],
["svg-logo/丁寧.svg", "丁寧", "default"],
["svg-logo/可能.svg", "可能", "default"],
["svg-logo/尊敬.svg", "尊敬", "default"],
["svg-logo/接尾.svg", "接尾", "default"],
["svg-logo/接頭.svg", "接頭", "default"],
["svg-logo/表記.svg", "表記", "default"],
["svg-logo/謙譲.svg", "謙譲", "default"],
["svg-logo/区別.svg", "区別", "redfill"],
["svg-logo/由来.svg", "由来", "redfill"],
["svg-logo/人.svg", "", "none"],
["svg-logo/他.svg", "", "none"],
["svg-logo/動.svg", "", "none"],
["svg-logo/名.svg", "", "none"],
["svg-logo/句.svg", "", "none"],
["svg-logo/派.svg", "", "none"],
["svg-logo/自.svg", "", "none"],
["svg-logo/連.svg", "", "none"],
["svg-logo/造.svg", "", "none"],
["svg-logo/造2.svg", "", "none"],
["svg-logo/造3.svg", "", "none"],
["svg-logo/百科.svg", "", "none"],
]
for icon_info in icon_info_list:
src, text, cls = icon_info
for elm in soup.find_all("img", attrs={"src": src}):
path = media_dir
for part in src.split("/"):
path = os.path.join(path, part)
__make_rectangle(path, text, cls)
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
"background": False,
"appearance": cls_to_appearance[cls],
"title": elm.attrs["alt"] if elm.has_attr("alt") else "",
"path": f"{os.path.basename(media_dir)}/{src}",
}
elm.name = "span"
elm.clear()
elm.append(img)
elm.attrs["style"] = "vertical-align: text-bottom; margin-right: 0.25em;"
def __replace_accent_symbols(soup, media_dir):
accent_info_list = [
["svg-accent/平板.svg", Icons.make_heiban],
["svg-accent/アクセント.svg", Icons.make_accent],
]
for info in accent_info_list:
src, write_svg_function = info
for elm in soup.find_all("img", attrs={"src": src}):
path = media_dir
for part in src.split("/"):
path = os.path.join(path, part)
write_svg_function(path)
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
"background": False,
"appearance": "auto",
"path": f"{os.path.basename(media_dir)}/{src}",
}
elm.name = "span"
elm.clear()
elm.append(img)
elm.attrs["style"] = "vertical-align: super; margin-left: -0.25em;"
def __convert_gaiji(soup, media_dir):
for elm in soup.find_all("img"):
if not elm.has_attr("src"):
continue
src = elm.attrs["src"]
if src.startswith("graphics"):
continue
path = media_dir
for part in src.split("/"):
if part.strip() == "":
continue
path = os.path.join(path, part)
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
"background": False,
"appearance": "monochrome",
"title": elm.attrs["alt"] if elm.has_attr("alt") else "",
"path": f"{os.path.basename(media_dir)}/{src}",
}
elm.name = "span"
elm.clear()
elm.append(img)
elm.attrs["style"] = "vertical-align: text-bottom;"
def __convert_graphics(soup, media_dir):
for elm in soup.find_all("img"):
if not elm.has_attr("src"):
continue
src = elm.attrs["src"]
if not src.startswith("graphics"):
continue
elm.attrs = {
"collapsible": True,
"collapsed": True,
"title": elm.attrs["alt"] if elm.has_attr("alt") else "",
"path": f"{os.path.basename(media_dir)}/{src}",
"src": src,
}
def __convert_number_icons(soup, media_dir):
for elm in soup.find_all("大語義番号"):
if elm.find_parent("a") is None:
filename = f"{elm.text}-fill.svg"
appearance = "monochrome"
path = os.path.join(media_dir, filename)
__make_rectangle(path, elm.text, "fill")
else:
filename = f"{elm.text}-bluefill.svg"
appearance = "auto"
path = os.path.join(media_dir, filename)
__make_rectangle(path, elm.text, "bluefill")
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
"background": False,
"appearance": appearance,
"title": elm.text,
"path": f"{os.path.basename(media_dir)}/{filename}",
}
elm.name = "span"
elm.clear()
elm.append(img)
elm.attrs["style"] = "vertical-align: text-bottom; margin-right: 0.25em;"
def __make_rectangle(path, text, cls):
if cls == "none":
pass
elif cls == "fill":
Icons.make_monochrome_fill_rectangle(path, text)
elif cls == "red":
Icons.make_rectangle(path, text, "red", "white", "red")
elif cls == "redfill":
Icons.make_rectangle(path, text, "red", "red", "white")
elif cls == "bluefill":
Icons.make_rectangle(path, text, "blue", "blue", "white")
else:
Icons.make_rectangle(path, text, "black", "transparent", "black")


@ -92,8 +92,8 @@ def __convert_gaiji(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,
@ -124,8 +124,8 @@ def __convert_rectangles(soup, image_dir):
ratio = Icons.calculate_ratio(path)
img = BeautifulSoup("<img/>", "xml").img
img.attrs = {
"height": 1.0,
"width": ratio,
"height": 1.0 if ratio > 1.0 else ratio,
"width": ratio if ratio > 1.0 else 1.0,
"sizeUnits": "em",
"collapsible": False,
"collapsed": False,


@ -1,26 +0,0 @@
from bot.yomichan.terms.base.terminator import BaseTerminator
class JitenonTerminator(BaseTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = None
def _definition_tags(self, entry):
return None
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = self._glossary_maker.make_glossary(entry, self._image_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _sequence(self, entry):
return entry.entry_id
def _link_glossary_parameters(self, entry):
return []
def _subentry_lists(self, entry):
return []


@ -1,10 +1,14 @@
from bot.entries.daijirin2.phrase_entry import PhraseEntry
from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.entries.daijirin2 import Daijirin2PhraseEntry as PhraseEntry
from bot.yomichan.terms.terminator import Terminator
from bot.yomichan.glossary.daijirin2 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules
class Terminator(BaseTerminator):
class Daijirin2Terminator(Terminator):
def __init__(self, target):
super().__init__(target)
def _definition_tags(self, entry):
return ""


@ -0,0 +1,18 @@
from bot.targets import Targets
from bot.yomichan.terms.jitenon import JitenonKokugoTerminator
from bot.yomichan.terms.jitenon import JitenonYojiTerminator
from bot.yomichan.terms.jitenon import JitenonKotowazaTerminator
from bot.yomichan.terms.smk8 import Smk8Terminator
from bot.yomichan.terms.daijirin2 import Daijirin2Terminator
def new_terminator(target):
terminator_map = {
Targets.JITENON_KOKUGO: JitenonKokugoTerminator,
Targets.JITENON_YOJI: JitenonYojiTerminator,
Targets.JITENON_KOTOWAZA: JitenonKotowazaTerminator,
Targets.SMK8: Smk8Terminator,
Targets.DAIJIRIN2: Daijirin2Terminator,
}
return terminator_map[target](target)


@ -0,0 +1,68 @@
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.terms.terminator import Terminator
from bot.yomichan.glossary.jitenon import JitenonKokugoGlossary
from bot.yomichan.glossary.jitenon import JitenonYojiGlossary
from bot.yomichan.glossary.jitenon import JitenonKotowazaGlossary
class JitenonTerminator(Terminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = None
def _definition_tags(self, entry):
return None
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = self._glossary_maker.make_glossary(entry, self._image_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _sequence(self, entry):
return entry.entry_id
def _link_glossary_parameters(self, entry):
return []
def _subentry_lists(self, entry):
return []
class JitenonKokugoTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKokugoGlossary()
def _inflection_rules(self, entry, expression):
return sudachi_rules(expression)
def _term_tags(self, entry):
return ""
class JitenonYojiTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonYojiGlossary()
def _inflection_rules(self, entry, expression):
return ""
def _term_tags(self, entry):
tags = entry.kanken_level.split("/")
return " ".join(tags)
class JitenonKotowazaTerminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKotowazaGlossary()
def _inflection_rules(self, entry, expression):
return sudachi_rules(expression)
def _term_tags(self, entry):
return ""


@ -1,15 +0,0 @@
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.glossary.jitenon import JitenonKokugoGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKokugoGlossary()
def _inflection_rules(self, entry, expression):
return sudachi_rules(expression)
def _term_tags(self, entry):
return ""


@ -1,15 +0,0 @@
from bot.yomichan.grammar import sudachi_rules
from bot.yomichan.glossary.jitenon import JitenonKotowazaGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonKotowazaGlossary()
def _inflection_rules(self, entry, expression):
return sudachi_rules(expression)
def _term_tags(self, entry):
return ""


@ -1,15 +0,0 @@
from bot.yomichan.glossary.jitenon import JitenonYojiGlossary
from bot.yomichan.terms.base.jitenon import JitenonTerminator
class Terminator(JitenonTerminator):
def __init__(self, target):
super().__init__(target)
self._glossary_maker = JitenonYojiGlossary()
def _inflection_rules(self, entry, expression):
return ""
def _term_tags(self, entry):
tags = entry.kanken_level.split("/")
return " ".join(tags)


@ -1,43 +0,0 @@
from bot.entries.sankoku8.phrase_entry import PhraseEntry
from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.yomichan.glossary.sankoku8 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules
class Terminator(BaseTerminator):
def _definition_tags(self, entry):
return ""
def _inflection_rules(self, entry, expression):
if isinstance(entry, PhraseEntry):
return sudachi_rules(expression)
pos_tags = entry.get_part_of_speech_tags()
if len(pos_tags) == 0:
return sudachi_rules(expression)
else:
return tags_to_rules(expression, pos_tags, self._inflection_categories)
def _glossary(self, entry):
if entry.entry_id in self._glossary_cache:
return self._glossary_cache[entry.entry_id]
glossary = make_glossary(entry, self._image_dir)
self._glossary_cache[entry.entry_id] = glossary
return glossary
def _sequence(self, entry):
return entry.entry_id[0] * 100000 + entry.entry_id[1]
def _term_tags(self, entry):
return ""
def _link_glossary_parameters(self, entry):
return [
[entry.children, ""],
[entry.phrases, ""]
]
def _subentry_lists(self, entry):
return [
entry.children,
entry.phrases,
]
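
The _sequence packing above merges the two-part entry id into a single integer, e.g. (111645, 16385) → 111645 * 100000 + 16385 = 11164516385; the subentry ids in this data appear to stay below 100000, so the packed values cannot collide.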


@ -1,11 +1,12 @@
from bot.entries.smk8.kanji_entry import KanjiEntry
from bot.entries.smk8.phrase_entry import PhraseEntry
from bot.yomichan.terms.base.terminator import BaseTerminator
from bot.entries.smk8 import Smk8KanjiEntry as KanjiEntry
from bot.entries.smk8 import Smk8PhraseEntry as PhraseEntry
from bot.yomichan.terms.terminator import Terminator
from bot.yomichan.glossary.smk8 import make_glossary
from bot.yomichan.grammar import sudachi_rules, tags_to_rules
class Terminator(BaseTerminator):
class Smk8Terminator(Terminator):
def __init__(self, target):
super().__init__(target)


@ -2,7 +2,7 @@ from abc import abstractmethod, ABC
from bot.data import load_yomichan_inflection_categories
class BaseTerminator(ABC):
class Terminator(ABC):
def __init__(self, target):
self._target = target
self._glossary_cache = {}
@ -66,28 +66,28 @@ class BaseTerminator(ABC):
@abstractmethod
def _definition_tags(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _inflection_rules(self, entry, expression):
raise NotImplementedError
pass
@abstractmethod
def _glossary(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _sequence(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _term_tags(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _link_glossary_parameters(self, entry):
raise NotImplementedError
pass
@abstractmethod
def _subentry_lists(self, entry):
raise NotImplementedError
pass
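
The raise NotImplementedError bodies shown replaced by pass here differ only when a subclass delegates upward; @abstractmethod alone already blocks instantiation of the base class. A minimal runnable illustration:

from abc import ABC, abstractmethod

class Base(ABC):
    @abstractmethod
    def f(self):
        pass  # with `raise NotImplementedError`, super().f() would raise

class Impl(Base):
    def f(self):
        return super().f()  # silently returns None with a `pass` body

assert Impl().f() is None
try:
    Base()  # still a TypeError either way: abstract method unimplemented
except TypeError:
    pass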


@ -1391,7 +1391,7 @@
22544,16385,おもいこしをあげる
22634,16385,おもいたったがきちにち
22634,16386,おもいたつひがきちじつ
22728,16385,おもうえに
22728,16385,おもうゆえに
22728,16386,おもうこころ
22728,16387,おもうこといわねばはらふくる
22728,16388,おもうそら
@ -5224,7 +5224,7 @@
111520,16385,てんちょうにたっする
111583,16385,てんどうぜかひか
111583,16386,てんどうひとをころさず
111645,16385,てんばくうをゆく
111645,16385,てんばくうをいく
111695,16385,てんびんにかける
111790,16385,てんめいをしる
111801,16385,てんもうかいかいそにしてもらさず
@ -5713,7 +5713,7 @@
119456,16385,なまきにくぎ
119456,16386,なまきをさく
119472,16385,なまけもののあしからとりがたつ
119472,16386,なまけもののせっくばたらき
119472,16386,なまけもののせっくはたらき
119503,16385,なますにたたく
119503,16386,なますをふく
119507,16385,なまずをひょうたんでおさえる
@ -7215,7 +7215,7 @@
154782,16388,みずがはいる
154782,16389,みずがひく
154782,16390,みずかる
154782,16391,みずきよければうおすまず
154782,16391,みずきょければうおすまず
154782,16392,みずすむ
154782,16393,みずでわる
154782,16394,みずとあぶら
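
For reference, each row of this index appears to be an (entry id, subentry id, kana reading) triple; a hedged parsing sketch:

import csv, io

row = next(csv.reader(io.StringIO("111645,16385,てんばくうをいく")))
entry_id, subentry_id, reading = int(row[0]), int(row[1]), row[2]
assert (entry_id, subentry_id) == (111645, 16385)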


File diff suppressed because it is too large.


@ -1,61 +1,47 @@
𠮟,叱
吞,呑
靭,靱
臈,﨟
啞,唖
嚙,噛
屛,屏
幷,并
彎,弯
搔,掻
攪,撹
枡,桝
濾,沪
繡,繍
蔣,蒋
蠟,蝋
醬,醤
穎,頴
鷗,鴎
鹼,鹸
麴,麹
俠,侠
俱,倶
儘,侭
凜,凛
剝,剥
𠮟,叱
吞,呑
啞,唖
噓,嘘
嚙,噛
囊,嚢
塡,填
姸,妍
屛,屏
屢,屡
拋,抛
搔,掻
摑,掴
瀆,涜
攪,撹
潑,溌
瀆,涜
焰,焔
禱,祷
竜,龍
筓,笄
簞,箪
籠,篭
繡,繍
繫,繋
腁,胼
萊,莱
藪,薮
蟬,蝉
蠟,蝋
軀,躯
醬,醤
醱,醗
頰,頬
顚,顛
驒,騨
姸,妍
攢,攅
𣜜,杤
檔,档
槶,椢
櫳,槞
纊,絋
纘,纉
隯,陦
筓,笄
逬,迸
腁,胼
騈,駢
拋,抛
篡,簒
檜,桧
禰,祢
禱,祷
蘆,芦
凜,凛
鶯,鴬
鷗,鴎
鷽,鴬
鹼,鹸
麴,麹
靭,靱
靱,靭
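
This file is a two-column list pairing each glyph variant with a replacement glyph. A minimal loader sketch, hypothetical and not the repository's own code:

import csv

def load_variants(path):
    # Each line is "variant_glyph,substitute_glyph".
    with open(path, encoding="utf-8") as f:
        return {old: new for old, new in csv.reader(f)}

# Applying the table one character at a time:
# text = text.translate(str.maketrans(load_variants("variants.csv")))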



@ -1,19 +1,19 @@
@font-face {
font-family: jpgothic;
src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local(" Pゴシック"), local("MS Pgothic"), local("sans-serif");
}
@font-face {
font-family: jpmincho;
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
}
body {
/*margin: 0em 1em;*/
margin: 0em 1em;
line-height: 1.5em;
font-family: jpmincho, serif;
/*font-size: 1.2em;*/
font-family: jpmincho;
font-size: 1.2em;
color: black;
}
@ -43,7 +43,7 @@ span[data-name="i"] {
}
span[data-name="h1"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 1em;
font-weight: bold;
}
@ -134,7 +134,7 @@ span[data-name="キャプション"] {
}
span[data-name="ルビG"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 0.7em;
font-weight: normal;
vertical-align: 0.35em;
@ -142,7 +142,7 @@ span[data-name="ルビG"] {
}
.warichu span[data-name="ルビG"] {
font-family: jpmincho, serif;
font-family: jpmincho;
font-size: 0.5em;
font-weight: normal;
vertical-align: 0em;
@ -178,7 +178,7 @@ span[data-name="句仮名"] {
}
span[data-name="句表記"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
@ -189,7 +189,7 @@ span[data-name="句項目"] {
}
span[data-name="和字"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
}
span[data-name="品詞行"] {
@ -209,7 +209,7 @@ span[data-name="大語義"] {
span[data-name="大語義num"] {
margin: 0.025em;
padding: 0.1em;
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 0.8em;
color: white;
background-color: black;
@ -227,7 +227,7 @@ span[data-name="慣用G"] {
}
span[data-name="欧字"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
}
span[data-name="歴史仮名"] {
@ -248,7 +248,7 @@ span[data-name="準大語義"] {
span[data-name="準大語義num"] {
margin: 0.025em;
padding: 0.1em;
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 0.8em;
border: solid 1px black;
}
@ -256,7 +256,7 @@ span[data-name="準大語義num"] {
span[data-name="漢字音logo"] {
margin: 0.025em;
padding: 0.1em;
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 0.8em;
border: solid 0.5px black;
border-radius: 1em;
@ -290,17 +290,17 @@ span[data-name="異字同訓"] {
}
span[data-name="異字同訓仮名"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
span[data-name="異字同訓漢字"] {
font-family: jpmincho, serif;
font-family: jpmincho;
font-weight: normal;
}
span[data-name="異字同訓表記"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
@ -321,12 +321,12 @@ rt {
}
span[data-name="見出仮名"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
span[data-name="見出相当部"] {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
@ -371,7 +371,7 @@ span[data-name="logo"] {
}
.gothic {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
@ -407,7 +407,7 @@ span[data-name="付記"]:after {
}
div[data-child-links] {
padding-left: 1em;
padding-top: 1em;
}
div[data-child-links] ul {
@ -417,7 +417,7 @@ div[data-child-links] ul {
div[data-child-links] span {
padding: 0.1em;
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-size: 0.8em;
color: white;
border-width: 0.05em;


@ -1,17 +1,20 @@
@font-face {
font-family: jpgothic;
src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local(" Pゴシック"), local("MS Pgothic"), local("sans-serif");
}
@font-face {
font-family: jpmincho;
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
}
body {
font-family: jpmincho, serif;
font-family: jpmincho;
margin: 0em 1em;
line-height: 1.5em;
font-size: 1.2em;
color: black;
}
table, th, td {
@ -21,7 +24,7 @@ table, th, td {
}
th {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
color: black;
background-color: lightgray;
font-weight: normal;
@ -40,18 +43,17 @@ td ul {
}
.読み方 {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
.意味,
.kanjirighttb {
.意味 {
margin-left: 1.0em;
margin-bottom: 0.5em;
}
.num_icon {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
padding-left: 0.25em;
margin-right: 0.5em;
font-size: 0.8em;
@ -61,3 +63,4 @@ td ul {
border-style: none;
-webkit-border-radius: 0.1em;
}


@ -1,17 +1,20 @@
@font-face {
font-family: jpgothic;
src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local(" Pゴシック"), local("MS Pgothic"), local("sans-serif");
}
@font-face {
font-family: jpmincho;
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
}
body {
font-family: jpmincho, serif;
font-family: jpmincho;
margin: 0em 1em;
line-height: 1.5em;
font-size: 1.2em;
color: black;
}
table, th, td {
@ -21,7 +24,7 @@ table, th, td {
}
th {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
color: black;
background-color: lightgray;
font-weight: normal;
@ -36,12 +39,12 @@ a {
}
.読み方 {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
.意味,
.kanjirighttb {
.意味 {
margin-left: 1.0em;
margin-bottom: 0.5em;
}


@ -1,17 +1,20 @@
@font-face {
font-family: jpgothic;
src: local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP");
src: local("メイリオ"), local("ヒラギノ角ゴ Pro W3"), local("Hiragino Kaku Gothic Pro"), local("Meiryo"), local("Noto Sans CJK JP"), local("IPAexGothic"), local("Source Han Sans JP"), local(" Pゴシック"), local("MS Pgothic"), local("sans-serif");
}
@font-face {
font-family: jpmincho;
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("IPAmjMincho"), local("Source Han Serif JP"), local("HanaMinA"), local("HanaMinB");
src: local("Noto Serif CJK JP"), local("IPAexMincho"), local("Source Han Serif JP"), local("MS PMincho"), local("serif");
}
body {
font-family: jpmincho, serif;
font-family: jpmincho;
margin: 0em 1em;
line-height: 1.5em;
font-size: 1.2em;
color: black;
}
table, th, td {
@ -21,7 +24,7 @@ table, th, td {
}
th {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
color: black;
background-color: lightgray;
font-weight: normal;
@ -36,12 +39,12 @@ a {
}
.読み方 {
font-family: jpgothic, sans-serif;
font-family: jpgothic;
font-weight: bold;
}
.意味,
.kanjirighttb {
.意味 {
margin-left: 1.0em;
margin-bottom: 0.5em;
}

Some files were not shown because too many files have changed in this diff.