Preferences::sanitizeInputPhrase() transforms an input phrase by
removing its whitespace/punctuation prefix and suffix. Translating a
phrase from X11 primary selection or from clipboard, via mouse-over or
from the command line results in such sanitization. This is useful when
a punctuation mark or a space is selected accidentally alongside a word.
This sanitization can be undesirable, however, when an abbreviated word
is selected. For example: "etc.", "e.g.", "i.e.".
This commit implements searching for the input word with the punctuation
suffix preserved as an alternative form of the sanitized word to show
articles for both. For example, when the word "etc." is translated from
the clipboard, both "ETC" and "etc." articles are displayed.
The punctuation suffix is preserved when the word is passed from the
scan popup to the main window and when the translate line text is
refreshed (e.g. when the current group is changed). The suffix is not
stored in history and favorites (doing so would require file format
changes and possibly substantial code changes, this can be implemented
later if need be).
Trim the input phrase once in ArticleNetworkAccessManager::getResource()
instead of verbose trimming in multiple places in
ArticleMaker::makeDefinitionFor().
Closes #1350.
For example, the first audio link in "The United States" English
Wikipedia article - "The Star-Spangled Banner" - ends with ".oga".
Without this commit the audio link is not recognized by GoldenDict:
* it is not pronounced when a Preferences=>Audio=>"Auto-pronounce..."
option is enabled;
* clicking on the link opens it in the default browser instead of
playing inside GoldenDict.
I have searched for the "<button" string and even for the "<\s*button"
pattern in tens of articles from all 5 default Wikipedia and all 5
default Wiktionary sites. Found none. I assume this pattern is obsolete.
Removing this useless code improves performance by doing less searching.
I have run the following command on directories that contained many
Wikipedia and Wiktionary articles received by GoldenDict:
pcregrep -MrI --buffer-size 20M '<\s*button' DIR-WITH-ARTICLES
This string replacement is 3-5 times faster than the QRegularExpression
replacement in "The United States" and "Paris" English Wikipedia
articles on my GNU/Linux system.
Before fe39fc8a05 the pattern started with
"<a\\shref=" instead of the current "<a\\s+href=", and no related bug
has been reported. I haven't encountered any whitespace character other
than space in this position. I believe that a single tab or a single EOL
character do not make sense after "<a". So a regression is unlikely.
I have searched for a tab or a newline character after "<a" and for a
whitespace character after "<a " in tens of articles from all 5 default
Wikipedia and all 5 default Wiktionary sites. Found none.
I have run the following command on directories that contained many
Wikipedia and Wiktionary articles received by GoldenDict:
pcregrep -MrI --buffer-size 20M "$PATTERN" DIR-WITH-ARTICLES
with PATTERN='<a(\t|\n)' and PATTERN='<a \s+href'.
Those who use GoldenDict to translate long text with Google Translate
or pronounce it with a text-to-speech engine (see a discussion in
comments under #1313) may still want to limit the input phrase length.
But they might prefer a limit greater than the current maximum - ten
thousand symbols. Ten million minus one symbols should be generous
enough. I don't want to increase the maximum further to avoid excessive
widening of spinboxes in the Preferences UI. Besides, a ten megabyte
input phrase freezes GoldenDict's UI with high CPU usage for a minute on
my system.
This speeds up decreasing the default value that is probably too large
for most users. I think that very few users would want to tune this
option's value finer than 10. Those who need such precision can enter
the desired number manually.
When a Wikipedia article is already cached, this change reduces the
amount of sent and received network data almost tenfold.
Setting up a network disk cache in the same way for dictNetMgr does not
noticeably impact the amount of network traffic. Either this network
access manager sends and receives very little data or the data is never
the same. So dictNetMgr does not need a disk cache.
Use QNetworkDiskCache's default maximum size of 50 MiB as the default
network cache size. This size is large enough to accommodate tens of
huge MediaWiki articles. It is also small enough that the user is
unlikely to run out of disk space because of the cache.
Clear network cache on exit by default because most users probably
don't load the same online articles after restarting GoldenDict. Plus
storing the network cache on disk indefinitely by default would be a new
and unexpected to the users privacy risk.
Nikita Moor came up with the idea and wrote an initial network disk
cache implementation in #1310.
When a long text is accidentally selected or copied while the scan popup
is on, Goldendict spends a lot of CPU time to gradually create a
"No translation..." article. When translation of huge (e.g. 15 MB) text
from the clipboard is (accidentally) requested, Goldendict freezes for a
while. Turning the added input phrase limit option on eliminates this
waste of the CPU time.
I have implemented these options primarily for selection and clipboard,
but they also affect mouse-over translation on Windows and command line
translation requests. This is mostly because I did not bother to limit
the options' scope. I guess hovering over an extremely long text without
spaces (e.g. Base64-encoded) could cause the same performance issue on
Windows. The command-line translation could be requested from a script
that integrates Goldendict with some other application, from which long
text could be sent for translation by accident.
I hope that the default value of 200 characters will be sufficient for
just about any real-world user input in any language. The option is on
by default, because the default length limit is generous and any longer
text is unlikely to be sent for translation intentionally. My personal
preference for the input phrase length limit is 100 symbols.
ArticleView::pasteTriggered() didn't call QString::simplified() on the
text retrieved from the clipboard. I assumed this was an oversight, so
now it *is* called - indirectly, via Preferences::sanitizeInputPhrase().
When an integral value is converted to a signed integer type, if the
result is not representable, the resulting value is implementation
defined (until C++20). Convert the string value from configuration file
to the target type (int) to avoid the redundant type conversion.
Use the more direct and efficient QSpinBox::value() instead of
QAbstractSpinBox::text().toInt().