jitenbot/bot/crawlers/base/monokakido.py

import os
from bot.time import timestamp
from bot.crawlers.base.crawler import BaseCrawler


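class MonokakidoCrawler(BaseCrawler):
    """Base crawler for Monokakido sources, whose pages are numbered XML files."""

    def __init__(self, target):
        super().__init__(target)
        # Page files are named by numeric ID, e.g. `123.xml`.
        self._page_id_pattern = r"^([0-9]+)\.xml$"

    def collect_pages(self, page_dir):
        """Map each page ID found in `page_dir` to the path of its XML file."""
        print(f"{timestamp()} Searching for page files in `{page_dir}`")
        for pagefile in os.listdir(page_dir):
            page_id = self._parse_page_id(pagefile)
            # Skip files without a usable page ID (none parsed, or ID 0).
            if page_id is None or page_id == 0:
                continue
            path = os.path.join(page_dir, pagefile)
            self._page_map[page_id] = path
        pages_len = len(self._page_map)
        print(f"{timestamp()} Found {pages_len} page files for processing")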

# Commit history (from the blame view):
#
# 2023-07-27 04:48:24 +00:00  Reorganize file structure of all other modules
#   Covers everything except the lines listed under the next commit.
#
# 2023-07-29 04:17:42 +00:00  Add timestamps to command line messages
#   Covers the `timestamp` import and both `print` calls.
#   This is a clumsy way of doing it (since it would be better to have a
#   wrapper function append the timestamp), but that will be taken care of
#   when the logging logic is all overhauled anyway.
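
The commit note on the timestamp change says a wrapper function would be cleaner than calling `timestamp()` inside every f-string. A minimal sketch of that idea follows. The name `log`, the `now` parameter, and the timestamp format are all assumptions made for illustration; they are not taken from `bot.time` or from the project's later logging code.

```python
from datetime import datetime


def log(message, now=None):
    """Print `message` prefixed with a timestamp and return the printed line.

    Hypothetical helper: `now` exists only so the output can be made
    deterministic; the format string is an assumed one.
    """
    if now is None:
        now = datetime.now()
    line = f"{now.strftime('%Y-%m-%d %H:%M:%S')} {message}"
    print(line)
    return line
```

With a helper like this, a call such as `print(f"{timestamp()} Found {pages_len} page files for processing")` could shrink to `log(f"Found {pages_len} page files for processing")`, keeping the timestamp logic in one place.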