MediaWiki: remove the /wiki/ prefix from links w/o regexp

This string replacement is 3-5 times faster than the QRegularExpression
replacement in "The United States" and "Paris" English Wikipedia
articles on my GNU/Linux system.

Before fe39fc8a05 the pattern started with
"<a\\shref=" instead of the current "<a\\s+href=", and no related bug
has been reported. I haven't encountered any whitespace character other
than space in this position. I believe that a single tab or a single EOL
character does not make sense after "<a", so a regression is unlikely.

I have searched for a tab or a newline character after "<a" and for a
whitespace character after "<a " in tens of articles from all 5 default
Wikipedia and all 5 default Wiktionary sites. Found none.

I have run the following command on directories that contained many
Wikipedia and Wiktionary articles received by GoldenDict:
  pcregrep -MrI --buffer-size 20M "$PATTERN" DIR-WITH-ARTICLES
with PATTERN='<a(\t|\n)' and PATTERN='<a \s+href'.
Igor Kushnir 2020-11-23 18:43:50 +02:00
parent b7da546dd5
commit dec59439b9

@@ -493,11 +493,7 @@ void MediaWikiArticleRequest::requestFinished( QNetworkReply * r )
 articleString.replace( "src=\"/", "src=\"" + wikiUrl.toString() );
 // Remove the /wiki/ prefix from links
-#if QT_VERSION >= QT_VERSION_CHECK( 5, 0, 0 )
-articleString.replace( QRegularExpression( "<a\\s+href=\"/wiki/" ), "<a href=\"" );
-#else
-articleString.replace( QRegExp( "<a\\s+href=\"/wiki/" ), "<a href=\"" );
-#endif
+articleString.replace( "<a href=\"/wiki/", "<a href=\"" );
 //fix audio
 #if QT_VERSION >= QT_VERSION_CHECK( 5, 0, 0 )