WebHelp Responsive search: How do "Stop Words" work?
Post by Anonymous1 »
Hello,
first of all thank you for the new search capabilities in Oxygen 19.
We are currently translating the new search strings in our various languages. Two strings mention the "stop words", such as "of", "the", and "by".
How does this work in other languages? I can see that you have translated them into Spanish, for example. How should we proceed if we would like to add Russian, for example? Is there a way to add or remove stop words?
Thanks,
Benjamin
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by Anonymous1 »
Correction: I've just realized that the Spanish translation was done by a colleague of mine and not by you. So the more general question: How should we deal with translating the strings in the WebHelp search?
- Posts: 404
- Joined: Thu Aug 21, 2003 11:36 am
- Location: Craiova
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by radu_pisoi »
Hi,
The procedure for localizing the WebHelp output is described in our user manual in the Localizing the Interface of WebHelp Output (for DITA Map Transformations) topic.
Do you need the context in which these strings are used? If so, could you tell us which strings you need additional information about?
Radu Pisoi
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by Anonymous1 »
Thanks for your answer.
We know how to localize the WebHelp output; I mean something different here.
The search treats some words as so-called "stop words", meaning that they are ignored when searching for terms. Two strings mention stop words:
Code: Select all
No results were found because the search query only contains <span>stop words</span> that are excluded by the search engine.
Code: Select all
Stop words are very common words or adjectives that hinder search efforts. Words such as: 'of', 'the', 'by', etc.
We must translate those strings into our target languages (Spanish, French, Japanese, Russian, etc.).
The question now is: what do we do with the stop words (of, the, by, ...)? Just because we translate them doesn't mean that the search actually ignores them in other languages.
How does the search know which words are stop words? And can we add stop words for other languages as well?
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by radu_pisoi »
Hi,
The stop words are computed dynamically, depending on the language you choose when you publish your documentation. The search indexer computes them and writes them to the out/webhelp-responsive/oxygen-webhelp/search/index-1.js file:
Code: Select all
stopWords = new Array();
stopWords[0]= "but";
stopWords[1]= "be";
stopWords[2]= "with";
stopWords[3]= "such";
....
So, if you want to know which stop words are used for a certain language, you need to inspect the index-1.js file.
There is no parameter to control the stop words.
Radu Pisoi
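To illustrate the behavior discussed above, here is a minimal sketch of how a client-side search could drop stop words from a query. It is not the actual WebHelp code: the function name removeStopWords and the whitespace tokenization are assumptions, and the list is only a short sample (the first entries of the generated array plus the words named in the UI string).

```javascript
// Hypothetical sketch, not the actual WebHelp search code.
// Short sample list: first entries of the generated array plus 'of', 'the', 'by'.
const stopWords = ["but", "be", "with", "such", "of", "the", "by"];

// Split a query into lowercase terms and drop every stop word.
function removeStopWords(query, stopWordList) {
  const ignored = new Set(stopWordList);
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && !ignored.has(term));
}

console.log(removeStopWords("the history of XML", stopWords)); // → ["history", "xml"]
console.log(removeStopWords("of the by", stopWords)); // → []
```

A query made up only of stop words yields an empty term list, which is the situation the "No results were found" string describes.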
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by Anonymous1 »
Thank you, that helps a lot.
- Posts: 25
- Joined: Mon Sep 17, 2007 10:02 am
- Location: Flanders
Re: WebHelp Responsive search: How do "Stop Words" work?
Hi,
I realise I am reviving a pretty old thread...
In v23 I looked for index-1.js, but realized that the array construction has moved to
...\oxygen-webhelp\app\search\index\stopwords.js
Code: Select all
define(function() {
  // Auto generated list of analyzer stop words that must be ignored by search.
  return ["but","be","with","such","then","for","no","will","not","are","and","their","if","this","on","into","a","or","there","in","that","they","was","is","it","an","the","as","at","these","by","to","of"];
});
Does that imply that we can now influence the stop words?
I guess I could swap that file with a project/language-dependent function, either manually or through a plugin change,
but doing it from the configuration of the customization would be my preferred path.
Thanks for any other suggestions,
Geert Bormans
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by radu_pisoi »
Hi,
Starting with version 23, you can customize the stop words list by using the following two parameters: webhelp.search.stop.words.exclude and webhelp.search.stop.words.include. They allow you to exclude/include custom stop words.
Please see the WebHelp Responsive Transformation Parameters topic in the WebHelp documentation for more details.
Radu Pisoi
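A rough model of what the two parameters do, assuming that webhelp.search.stop.words.include adds words to the generated list, that webhelp.search.stop.words.exclude removes words from it, and that both take comma-separated values. The helper effectiveStopWords is hypothetical; the real processing happens in the WebHelp indexer, so the parameters topic remains the reference for the exact semantics.

```javascript
// Hypothetical helper, not part of WebHelp: models the assumed effect
// of the include/exclude parameters on the generated stop words list.
function effectiveStopWords(defaults, includeParam, excludeParam) {
  // Parse a comma-separated parameter value into clean lowercase words.
  const parse = (value) =>
    value
      .split(",")
      .map((word) => word.trim().toLowerCase())
      .filter((word) => word.length > 0);

  const result = new Set(defaults);
  for (const word of parse(includeParam)) result.add(word); // assumed: include adds
  for (const word of parse(excludeParam)) result.delete(word); // assumed: exclude removes
  return [...result];
}

// Sample default list (a subset of the generated English list).
const defaults = ["but", "be", "with", "such", "by", "into"];
console.log(effectiveStopWords(defaults, "etc", "by,into"));
// → ["but", "be", "with", "such", "etc"]
```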
Re: WebHelp Responsive search: How do "Stop Words" work?
Hi Radu,
Thanks for pointing me to the right place in the manual
(and thank you, Oxygen, for adding that functionality).
I assume this cannot be made language dependent, other than by adding all languages to one parameter?
Anyhow, the functionality is already extremely useful as it is.
Thanks,
Geert
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by radu_pisoi »
Hi,
No, you should update the exclude/include stop words parameters depending on the current language.
Radu Pisoi
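One way to keep the parameters language dependent is to store a set of values per language and pick the matching set for each publishing run. The sketch below assumes a hypothetical stopWordSettings table and buildParameters helper; the word lists are placeholders, not recommendations, and how the resulting values are passed to the transformation is left open.

```javascript
// Hypothetical per-language settings; the word lists are placeholders only.
const stopWordSettings = {
  en: { include: ["etc"], exclude: ["by", "into"] },
  ru: { include: ["это", "что"], exclude: [] },
};

// Build the two parameter values for one language.
// Unknown languages fall back to empty values (no customization).
function buildParameters(lang) {
  const settings = stopWordSettings[lang] || { include: [], exclude: [] };
  return {
    "webhelp.search.stop.words.include": settings.include.join(","),
    "webhelp.search.stop.words.exclude": settings.exclude.join(","),
  };
}

console.log(buildParameters("en"));
console.log(buildParameters("ru"));
```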
- Posts: 4
- Joined: Fri Feb 04, 2022 6:09 pm
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by mmgHinchey »
I've noticed that some languages do not produce language-specific stop words (for example, Simplified Chinese, Ukrainian, and Korean). Is there a list somewhere of which languages generate a language-specific stopwords.js and which generate a stopwords.js based on English?
Thank you
- Posts: 115
- Joined: Mon Jul 10, 2023 11:49 am
Re: WebHelp Responsive search: How do "Stop Words" work?
I'm a native Chinese speaker, and I know a little Korean because of the influence of ancient Chinese. In my opinion, what should be included or excluded in a parameter such as "webhelp.search.stop.words.exclude", like what I specified for the English WebHelp output in the .opt file:
<parameter name="webhelp.search.stop.words.exclude"
value="in,at,with,and,or,from,into,not,by,if,for,as,a,an,is,no,not,of,on"/>
actually depends on how you define the efficiency of a search and on the context in which you keep a whitelist/blacklist of stop words.
1. Context. In English, many words, such as "is", "for", "and", and "at", occur in almost any sentence. Normally, we don't want users to sit through a long, meaningless keyword search that returns thousands of results unrelated to what they are actually looking for. In certain contexts, however, for example if the product is about SQL or a SQL-like database, treating words such as "into", "by", "and", "at", "in", and "as" as stop words can block searches for SQL keywords or statements that contain them. So it is better to exclude these words from the stop words list, or at least to exclude specific keywords such as "like", "group by", and "context by".
2. Language. In Chinese, Korean, Japanese, etc., there are always some words that do not mean anything specific; if they do anything, they function as a formal/polite word ending, such as 습니다 ("smida") at the end of a descriptive sentence, especially in TV news or in newspapers. In Chinese, especially ancient Chinese, we have many similar ending words such as "也", "矣", and "哉"; these modal particles don't mean anything. In modern Chinese, we have some words, mostly adverbs, such as "有时" (sometimes/иногда) and "非常" (very/очень), and sometimes miscellaneous words such as "什么" (what/что) and "这个" (this/это). These words usually don't bring readers meaningful search results, so they should be included in the stop words list.
So, instead of setting an absolute rule for various languages, consulting native speakers and asking for their opinions to build a stop words list might be a better practice.
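The SQL case can be made concrete with a small sketch. It assumes that the default list is the English one quoted earlier in this thread, that the exclude parameter removes words from that list, and that queries are split on whitespace; the searchableTerms helper is hypothetical and not part of WebHelp.

```javascript
// English list as generated in stopwords.js (quoted earlier in the thread).
const defaults = ["but","be","with","such","then","for","no","will","not","are",
  "and","their","if","this","on","into","a","or","there","in","that","they",
  "was","is","it","an","the","as","at","these","by","to","of"];

// Value of webhelp.search.stop.words.exclude from the .opt file above.
const excluded = "in,at,with,and,or,from,into,not,by,if,for,as,a,an,is,no,not,of,on".split(",");

// Assumed semantics: excluded words are removed from the default list.
const remaining = defaults.filter((word) => !excluded.includes(word));

// Hypothetical helper: the query terms that survive stop word filtering.
function searchableTerms(query, stops) {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && !stops.includes(term));
}

console.log(searchableTerms("GROUP BY", defaults)); // → ["group"]
console.log(searchableTerms("GROUP BY", remaining)); // → ["group", "by"]
```

With the default list, a search for "GROUP BY" is reduced to "group"; once "by" is excluded from the stop words, both terms take part in the search.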
- Posts: 145
- Joined: Mon Jun 12, 2017 10:50 am
Re: WebHelp Responsive search: How do "Stop Words" work?
Post by cosmin_andrei »
Hi galanohan,
Note that there is no hardcoded stop words list in the Oxygen WebHelp code.
For the content indexing we use the Apache Lucene library and the stop words list is obtained from the Lucene library for each individual language.
Regards,
Cosmin
--
Cosmin Andrei
oXygen XML Editor and Author Support