
Sorting search results by relevance

To help you organize search results, you can sort them based on how relevant they are to your search criteria. This means that the results most relevant to your search are displayed at the top of the search results list.

When generating search results, Sitecore Content Hub:

  1. Reduces the number of search result candidates by applying the search criteria you define.
  2. Scores and ranks search results.

The first step is to apply the search criteria and reduce the number of search candidates. This step outputs all assets that match the defined search or filter criteria.

In the second step, Content Hub calculates and assigns a score to each asset in the candidate set. This score reflects how relevant the asset is for the defined query. After the relevancy score is assigned to assets, the search results are sorted and ranked.
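
The two steps can be sketched in Python as follows. This is a minimal illustration, not Content Hub's implementation: the Asset class, the matches and toy_score helpers, and the sample descriptions are all hypothetical, and the scoring step simply counts term occurrences instead of computing the real relevance score.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    # Hypothetical, simplified asset with a single searchable field.
    title: str
    description: str


def matches(asset, terms):
    """Step 1: keep an asset if any query term appears in its description."""
    words = asset.description.lower().split()
    return any(term in words for term in terms)


def toy_score(asset, terms):
    """Step 2 (placeholder): count how often the query terms occur."""
    words = asset.description.lower().split()
    return sum(words.count(term) for term in terms)


def search(assets, query):
    terms = query.lower().split()
    # Step 1: reduce the candidate set.
    candidates = [a for a in assets if matches(a, terms)]
    # Step 2: score each candidate and sort by descending score.
    return sorted(candidates, key=lambda a: toy_score(a, terms), reverse=True)


assets = [
    Asset("Winter cookbook", "cook with a cook who loves to cook"),
    Asset("Classic Cocktails recipe book", "a cook shares cocktail recipes"),
    Asset("Brand guidelines", "logo and colour usage"),
]
ranked = [a.title for a in search(assets, "cook")]
print(ranked)  # ['Winter cookbook', 'Classic Cocktails recipe book']
```

The third asset never reaches the scoring step because it is removed when the search criteria are applied.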

Relevance score

Content Hub uses the BM25 (best matching) algorithm to calculate relevancy. This algorithm uses three factors to determine each asset's score, as described in the following table.

Factor | Description

Term frequency (TF) | This refers to the number of times the search term is repeated in the asset fields. The more often it is repeated, the more relevant the asset is. For example, the Winter cookbook and the Classic Cocktails recipe book are both assets in Content Hub:
  • In the asset descriptions, the term cook is used more often in the Winter cookbook description than in the Classic Cocktails recipe book description.
  • This means that when a user searches for the term cook, the Winter cookbook is more relevant than the Classic Cocktails recipe book as far as term frequency goes.

Inverse document frequency (IDF) | This refers to the number of assets that contain the search term. The higher the number of assets, the less important that term is. For example, consider the Winter cookbook and the Classic Cocktails recipe book from the previous example along with eight more assets in the same context:
  • The user searches for the term famous chef.
  • Nine out of the ten assets include the term famous in their description. However, only three assets have the term chef in the description.
  • This means that the term famous is less important than the term chef in this search attempt.
  • The result is that the three assets that have the term chef are more relevant than the other assets to the famous chef search term.

Field length | This means that if an asset contains the search term in a shorter field, it is likely more relevant than an asset that contains the same term in a longer field. For example, the Winter cookbook has a description of 350 characters while the Classic Cocktails recipe book has a description of 1200 characters:
  • The user searches for the term ingredient and both assets have this term in the description.
  • However, because of the differing field length, the Winter cookbook is more relevant than the Classic Cocktails recipe book.
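
To make the famous chef example concrete, the following Python sketch compares the weight of the two terms. It assumes the IDF form commonly used with BM25, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the total number of assets and n is the number of assets that contain the term. The idf function name is hypothetical.

```python
import math


def idf(doc_count, docs_with_term):
    """Assumed BM25 inverse document frequency (common Lucene-style form)."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Ten assets: "famous" appears in nine of them, "chef" in only three.
idf_famous = idf(10, 9)
idf_chef = idf(10, 3)
print(f"famous: {idf_famous:.3f}, chef: {idf_chef:.3f}")  # famous: 0.147, chef: 1.145
```

Under this assumption, a match on chef contributes far more to the score than a match on famous, which is why the three assets that mention chef rank highest.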
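
Relevancy score algorithm

The following formula is used to calculate the relevancy score:

$$
\text{score}(D, Q) = \sum_{i=0}^{n-1} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{\text{fieldLen}}{\text{avgFieldLen}}\right)} \tag{1}
$$

Where: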

  • qi is the ith query term. For example, if users search for cook, there is one term and q0 is cook. But if they search for famous chef, there are two terms. In this example, q0 is famous and q1 is chef.
  • IDF(qi) is the inverse document frequency of the ith query term. The formula used to calculate IDF(qi) is:

$$
\text{IDF}(q_i) = \ln\left(1 + \frac{\text{docCount} - f(q_i) + 0.5}{f(q_i) + 0.5}\right)
$$

  • docCount is the total number of assets and f(qi) is the number of assets that contain the qi search term. For example, the user searches for the term book across ten assets. If the term book occurs in six assets, IDF(book) is calculated as:

$$
\text{IDF}(\text{book}) = \ln\left(1 + \frac{10 - 6 + 0.5}{6 + 0.5}\right) \approx 0.526
$$

  • The field's length is divided by the average field length in formula (1). If an asset's field contains more terms than average, its score decreases because the denominator is larger. If it contains fewer terms than average, its score increases because the denominator is smaller. Content Hub measures field length as the number of terms, not the number of characters, so a 500-page cookbook that mentions chef once is less relevant than a short tweet that uses chef once.
  • Variable b has a default value of 0.75. Changing b changes how strongly field length, relative to the average length, affects the score.
  • k1 and f(qi, D) appear in both the numerator and the denominator. k1 is a variable that defines the term frequency saturation characteristics. Its default value is 1.2. For more information, see link. This component affects how term frequency is scored.
  • f(qi, D) is the number of times the ith query term occurs in asset D. Using the previous example of ten assets in a cooking context, the user searches for "pepper", which occurs 11 times in document 4 and twice in document 7, but not at all in the other documents. This means that f("pepper", 4) = 11 and f("pepper", 7) = 2, and the value is 0 for all other documents. The more often a query term occurs in an asset, the higher the asset's score.
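
The components described above can be combined in a short Python sketch. It assumes the standard BM25 form with the defaults stated above (k1 = 1.2, b = 0.75) and is not taken from Content Hub's code. The function names and the field lengths of 50, 100, and 200 terms are hypothetical values chosen for illustration.

```python
import math

K1 = 1.2   # term frequency saturation (default stated above)
B = 0.75   # strength of field length normalisation (default stated above)


def idf(doc_count, docs_with_term):
    """Assumed BM25 inverse document frequency."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25_term(tf, field_len, avg_field_len, doc_count, docs_with_term, k1=K1, b=B):
    """Score contribution of one query term for one asset (assumed BM25 form)."""
    length_norm = 1 - b + b * field_len / avg_field_len
    return idf(doc_count, docs_with_term) * tf * (k1 + 1) / (tf + k1 * length_norm)


# "book" occurs in six of ten assets.
idf_book = idf(10, 6)

# "pepper" occurs 11 times in document 4 and twice in document 7 (2 of 10 assets).
# Both fields are assumed to have the average length of 100 terms.
score_doc4 = bm25_term(11, 100, 100, 10, 2)
score_doc7 = bm25_term(2, 100, 100, 10, 2)

# Same single occurrence, but in a short field versus a long field.
short_field = bm25_term(1, 50, 100, 10, 2)
long_field = bm25_term(1, 200, 100, 10, 2)

print(f"IDF(book) = {idf_book:.3f}")
print(f"doc 4 = {score_doc4:.3f}, doc 7 = {score_doc7:.3f}")
print(f"short field = {short_field:.3f}, long field = {long_field:.3f}")
```

Although pepper occurs more than five times as often in document 4 as in document 7, the score of document 4 is less than twice as high. This is the saturation effect that k1 controls. The last two values show the field length effect: the same single occurrence scores higher in the shorter field.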

Boosting an asset

You can influence how an asset is ranked in search results by boosting an asset.

Consider the following example. The superuser adds two fields to the schema for M.Asset:

  • Author
  • Information About Author

On the Author field, the superuser enables the boost property.

The user uploads two cookbooks, both written by Sara Dubler, and fills in the fields as follows:

  • Summer Salads cookbook - adds the name of the author in the Author field and leaves the other field blank.
  • Mediterranean Salads cookbook - leaves the Author field blank but adds information about Sara Dubler in the Information About Author field.

A search for Sara Dubler then lists the Summer Salads cookbook first, because it has Sara Dubler in the Author field and this field is boosted in comparison with the Information About Author field.
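
The effect of a boosted field on ranking can be illustrated with a small Python sketch. This article does not describe how Content Hub applies the boost internally, so everything here is an assumption: the boost is modelled as a per-field multiplier, the weights 2.0 and 1.0 are hypothetical, the field score is a simple match test, and the text in the Information About Author field is invented sample data.

```python
# Hypothetical per-field weights: the boosted Author field counts double.
BOOST = {"author": 2.0, "information_about_author": 1.0}


def field_score(text, query):
    """Toy field score: 1.0 if the whole query appears in the field, else 0.0."""
    return 1.0 if query.lower() in text.lower() else 0.0


def boosted_score(asset, query):
    """Sum the field scores, each multiplied by its (assumed) boost weight."""
    return sum(BOOST[field] * field_score(asset.get(field, ""), query) for field in BOOST)


assets = {
    "Summer Salads cookbook": {
        "author": "Sara Dubler",
        "information_about_author": "",
    },
    "Mediterranean Salads cookbook": {
        "author": "",
        "information_about_author": "Sara Dubler is a chef and food writer.",
    },
}

ranking = sorted(
    assets,
    key=lambda name: boosted_score(assets[name], "Sara Dubler"),
    reverse=True,
)
print(ranking)  # ['Summer Salads cookbook', 'Mediterranean Salads cookbook']
```

Both assets match the query, but the match in the boosted field carries more weight, so the Summer Salads cookbook is listed first.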

Note

The boost feature and search support wildcard searches if enabled in the Search component.
