The Index Plan

In order to index the CSV, we want to take two fields from each row, title and description, and turn them into suitable terms. For straightforward textual search we don’t need document values.
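As a rough illustration of the first step, the sketch below pulls just the two fields out of each CSV row using Python's standard `csv` module. The column names `TITLE` and `DESCRIPTION` and the sample rows are assumptions made up for this example; the real file may use different headers.

```python
import csv
import io

# Made-up sample data for illustration; column names are an assumption.
SAMPLE_CSV = """id_NUMBER,TITLE,DESCRIPTION
1,Pocket sundial,Brass ring dial with a folding gnomon
2,Wall clock,Painted wooden case
"""

def read_fields(csv_file):
    """Yield (title, description) pairs, ignoring every other column."""
    for row in csv.DictReader(csv_file):
        yield row["TITLE"], row["DESCRIPTION"]

# io.StringIO stands in for an open file so the sketch is self-contained.
rows = list(read_fields(io.StringIO(SAMPLE_CSV)))
for title, description in rows:
    print(title, "|", description)
```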

Because we’re dealing with free text, and because we know the whole dataset is in English, we can use stemming so that, for instance, searching for “sundial” or “sundials” will match the same documents. This way people don’t need to worry too much about exactly which words to use in their query.
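To make the idea concrete, here is a deliberately crude stand-in for a stemmer. It is not the algorithm Xapian uses (Xapian relies on Snowball stemmers, available in the Python bindings as `xapian.Stem("en")`); the function `crude_stem` is a hypothetical helper that only strips a plural "s", which is enough to show why both spellings end up as the same term.

```python
def crude_stem(word):
    """Toy normaliser: lowercase and strip a trailing plural "s".

    This is NOT a real stemming algorithm, only an illustration of the effect.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# Both forms reduce to the same term, so a query for either matches
# documents indexed under the other.
print(crude_stem("sundial") == crude_stem("sundials"))
```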

Finally, we want a way of separating the two fields. In Xapian this is done using term prefixes: short strings placed at the beginning of terms to indicate which field the term indexes. As well as prefixed terms, we also want to generate unprefixed terms, so that as well as searching within a field you can also search for text in any field.

There are some conventional prefixes, which is helpful if you ever need to interoperate with Omega (a web-based search engine) or other compatible systems. Following these conventions, we’ll use ‘S’ as the prefix for the title (it stands for ‘subject’) and ‘XD’ for the description. A full list of conventional prefixes is given at the top of the Omega documentation on term prefixes.
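The following sketch shows what prefixed and unprefixed terms look like side by side. It is a simplified model in plain Python, not Xapian's own term generation: the helper `make_terms`, the word-splitting regex, and the sample text are all assumptions for illustration, and stemming is left out.

```python
import re

# Prefixes taken from the text: 'S' for title, 'XD' for description.
PREFIXES = {"title": "S", "description": "XD"}

def make_terms(fields):
    """Return the set of prefixed and unprefixed terms for a record."""
    terms = set()
    for name, text in fields.items():
        for word in re.findall(r"\w+", text.lower()):
            terms.add(PREFIXES[name] + word)  # field-specific term
            terms.add(word)                   # searchable in any field
    return terms

terms = make_terms({"title": "Pocket sundial", "description": "Brass ring dial"})
print(sorted(terms))
```

A search restricted to the title would look for `Ssundial`, while an unrestricted search would look for plain `sundial`.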

When you’re indexing multiple fields like this, the term positions used for each field when indexed unprefixed need to be kept apart. Say you have a title of “The Saints”, and description “Don’t like rabbits? Keep reading.” If you index those fields without a gap, the phrase search “Saints don’t like rabbits” will match, where it really shouldn’t. Usually a gap of 100 between each field is enough.
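The effect of the gap can be reproduced with a small positional index written in plain Python. This is only a model of the idea, not how Xapian stores positions; `positions` and `phrase_matches` are hypothetical helpers, and the tokenising regex is an assumption.

```python
import re

def positions(fields, gap):
    """Map each word to its positions, skipping `gap` positions between fields."""
    index = {}
    pos = 0
    for text in fields:
        for word in re.findall(r"[\w']+", text.lower()):
            pos += 1
            index.setdefault(word, []).append(pos)
        pos += gap  # leave unused positions before the next field
    return index

def phrase_matches(index, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = re.findall(r"[\w']+", phrase.lower())
    if not words:
        return False
    return any(
        all(start + i in index.get(w, []) for i, w in enumerate(words))
        for start in index.get(words[0], [])
    )

fields = ["The Saints", "Don't like rabbits? Keep reading."]
phrase = "Saints don't like rabbits"
no_gap = phrase_matches(positions(fields, 0), phrase)
with_gap = phrase_matches(positions(fields, 100), phrase)
print(no_gap, with_gap)
```

With no gap, the last word of the title and the first word of the description sit at adjacent positions, so the phrase spans both fields; with a gap of 100 it no longer does, while phrases inside a single field still match.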

To write to a database, we use the WritableDatabase class, which allows us to create, update or overwrite a database.

To create terms, we use Xapian’s TermGenerator, a built-in class to make turning free text into terms easier. It will split into words, apply stemming, and then add term prefixes as needed. It can also take care of term positions, including the gap between different fields.
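Putting the plan together, a minimal sketch using the Xapian Python bindings might look like the following. It assumes the bindings are installed and that the CSV has `TITLE` and `DESCRIPTION` columns; the function name `index_csv` and its parameters are hypothetical, and it omits error handling and any stored document data.

```python
import csv

def index_csv(csv_path, db_path):
    """Index the title and description of each CSV row into a Xapian database."""
    # Imported here so the sketch can be loaded without the bindings installed.
    import xapian

    # Create the database if it does not exist, otherwise open it for writing.
    db = xapian.WritableDatabase(db_path, xapian.DB_CREATE_OR_OPEN)

    # TermGenerator splits text into words, stems them and adds prefixes.
    termgenerator = xapian.TermGenerator()
    termgenerator.set_stemmer(xapian.Stem("en"))

    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            title = row.get("TITLE", "")              # assumed column name
            description = row.get("DESCRIPTION", "")  # assumed column name

            doc = xapian.Document()
            termgenerator.set_document(doc)

            # Prefixed terms, for searching within a single field.
            termgenerator.index_text(title, 1, "S")
            termgenerator.index_text(description, 1, "XD")

            # Unprefixed terms, for searching across all fields, with a
            # position gap (100 by default) between the two fields.
            termgenerator.index_text(title)
            termgenerator.increase_termpos()
            termgenerator.index_text(description)

            db.add_document(doc)

    db.commit()
```

It could be called as `index_csv("data.csv", "db")`, where both paths are placeholders.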
