Duplicate articles can be shown when the alts collection is not empty
and a MediaWiki site redirects multiple words to a single page. The
alts collection can be populated when:
* the Preferences=>Advanced=>"Extra search via synonyms" option is
enabled;
* a Morphology dictionary is active;
* a translation of a phrase is requested in a way that makes GoldenDict
pass the input phrase to Preferences::sanitizeInputPhrase().
Steps to reproduce 1:
1. Create and switch to a dictionary group with (1) "English Wikipedia"
and (2) "English (US) Morphology" dictionaries in it.
2. Request a translation of the word "plays" (without quotes).
Steps to reproduce 2:
1. Create a dictionary group with "English Wiktionary" dictionary in it;
switch to this group in the scan popup window (or in the main window
if the Preferences=>Scan Popup=>"Send translated word to main window"
option is enabled).
2. Select the word "i.e." (without quotes) and press Ctrl+C+C (or
whatever hotkey is configured to translate a word from clipboard).
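A minimal sketch of the deduplication idea, not the actual GoldenDict
code: when the input word and its alts resolve to the same page, the
article should be shown only once. The function name
uniqueArticleTitles and the resolveRedirect callback are hypothetical;
the callback stands in for whatever the MediaWiki site answers for a
given word.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the page titles to display, in request order,
// skipping any word whose redirect target has already been seen.
std::vector<std::string> uniqueArticleTitles(
    const std::vector<std::string>& words,
    const std::function<std::string(const std::string&)>& resolveRedirect)
{
    std::set<std::string> seen;
    std::vector<std::string> titles;
    for (const std::string& word : words) {
        const std::string title = resolveRedirect(word);
        // insert() reports whether the title was new to the set.
        if (seen.insert(title).second)
            titles.push_back(title);
    }
    return titles;
}
```

With a stand-in resolver that maps both the input word and its alt to
the same title, the result holds that title exactly once.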
For example, the first audio link in "The United States" English
Wikipedia article - "The Star-Spangled Banner" - ends with ".oga".
Without this commit the audio link is not recognized by GoldenDict:
* it is not pronounced when a Preferences=>Audio=>"Auto-pronounce..."
option is enabled;
* clicking on the link opens it in the default browser instead of
playing inside GoldenDict.
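The extension check described above can be sketched as follows. This is
an illustration only: the function name hasAudioExtension and the list
of extensions are assumptions, not the set GoldenDict actually
recognizes. The point is just that ".oga" has to be in the list for the
link to be treated as audio.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <string>

// Hypothetical check: does the URL end with a known audio file extension?
bool hasAudioExtension(const std::string& url)
{
    // Illustrative list; the real set of extensions is an assumption here.
    static const std::array<std::string, 4> extensions = {
        ".ogg", ".oga", ".mp3", ".wav"};

    // Compare case-insensitively by lowercasing a copy of the URL.
    std::string lower(url.size(), '\0');
    std::transform(url.begin(), url.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    for (const std::string& ext : extensions) {
        if (lower.size() >= ext.size() &&
            lower.compare(lower.size() - ext.size(), ext.size(), ext) == 0)
            return true;
    }
    return false;
}
```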
I have searched for the "<button" string and even for the "<\s*button"
pattern in dozens of articles from all 5 default Wikipedia and all 5
default Wiktionary sites. Found none. I assume this pattern is obsolete.
Removing this dead code improves performance because less searching is
done.
I have run the following command on directories that contained many
Wikipedia and Wiktionary articles received by GoldenDict:
pcregrep -MrI --buffer-size 20M '<\s*button' DIR-WITH-ARTICLES
This string replacement is 3-5 times faster than the QRegularExpression
replacement in "The United States" and "Paris" English Wikipedia
articles on my GNU/Linux system.
Before fe39fc8a05 the pattern started with
"<a\\shref=" instead of the current "<a\\s+href=", and no related bug
has been reported. I haven't encountered any whitespace character other
than a space in this position. I believe that a single tab or a single
EOL character does not make sense after "<a". So a regression is unlikely.
I have searched for a tab or a newline character after "<a" and for a
whitespace character after "<a " in dozens of articles from all 5
default Wikipedia and all 5 default Wiktionary sites. Found none.
I have run the following command on directories that contained many
Wikipedia and Wiktionary articles received by GoldenDict:
pcregrep -MrI --buffer-size 20M "$PATTERN" DIR-WITH-ARTICLES
with PATTERN='<a(\t|\n)' and PATTERN='<a \s+href'.
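A minimal sketch of a literal replacement that needs no regular
expression, written with the C++ standard library rather than Qt. The
helper name replaceAll is hypothetical, and the needle and replacement
strings in the usage note are assumptions chosen for illustration, not
the exact strings from the commit.

```cpp
#include <cstddef>
#include <string>

// Replaces every occurrence of `needle` in `text` with `replacement`,
// using plain substring search instead of a regular expression.
void replaceAll(std::string& text, const std::string& needle,
                const std::string& replacement)
{
    if (needle.empty())
        return; // an empty needle would match everywhere; do nothing
    std::size_t pos = 0;
    while ((pos = text.find(needle, pos)) != std::string::npos) {
        text.replace(pos, needle.size(), replacement);
        pos += replacement.size(); // resume the search after the inserted text
    }
}
```

For instance, replaceAll(html, "<a href=\"/wiki/", "<a href=\"") would
rewrite every link that starts with exactly one space after "<a", which
is the only form found in the articles checked above.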
I haven't encountered any prefix other than "/wiki/" that should be
discarded. If other such prefixes exist, I expect they would follow
some pattern, so the replacement code could be adjusted to accommodate
them.
This commit fixes #813.
Examples of pages with subpage links in English Wikipedia that are fixed
by this commit: "Asio (disambiguation)", "Asio C plus plus library".
This issue is much more prevalent in Wookieepedia because it has
a two-tab link system with the patterns */Legends and */Canon.