Follow

Handling accents and special characters in search

Users of countless software products use search often to locate what they need. Search, in it’s most simplest form, is to take a search keyword from the user, run down a list and find results which partially or fully (it’s up to you) match the keyword. In this blog, we’ll talk about two features that we feel should be provided with any user-friendly search implementation. These are related to handling search with data having accented and special characters. We’ll see how to implement these using MS SQL Server.

  1. Take a table in the database which has a column word that contains an item “João”, basic search should fetch this item with keyword “Joao”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality, Search for "Joao" in WordPress.org
  2. In a database table which has column Word that contains an item “e-commerce”, basic search should fetch this item with keywords like “ecommerce” or “e commerce”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality, Search for "ecommerce" in WordPress.org

We’re going to focus on implementing these features using SQL Server.

Handling accented characters

The reason why we’re not able to search for “João” using the keyword “Joao” is because of the collation used in the database.

Link to know everything about collations for good

When a new database is created in SQL Server, the collation of the database is set to SQL_Latin1_General_CP1_CI_AS  by default.

While every word in the name of the collation has a meaning, we’ll focus only on CI and AS.

CI (Determines case sensitivity) means that the data that will be stored in the database will be case insensitive. AS means that the data that will be stored in the database will be accent sensitive.

AS (Determines accent sensitivity) is the reason why we were not able to search for “João” using the keyword “Joao”.

For our intended functionality, it’s obvious that the collation needs to be changed to SQL_Latin1_General_CP1_CI_AI. This collation will not differentiate between ‘ã’ and ‘a’.

Collations can be set on an entire server, to a database in a server or to specific columns in a table of a database. For our requirement we will only change the collation of the column that we wish to make case and accent insensitive. Do note that, there is a performance impact when modifying the collation. To find out more, check here.

For the list of collations in SQL Server, run this query,

We have a database search_table which looks like below.

Search table in database

The easiest way to find the collation of the table’s column is to check the properties of the column in SQL Server Management Studio’s Object Explorer.

Check collation of the table's column under properties in Server Management Studio's Object Explorer

You can alter the collation of the Name column using the following query.

Now, on searching the table for the word “Joao”, the following will be the result,

Search result for "Joao" after collation.

Neat!

Handling hyphenated characters

Solving this problem is also fairly simple. We do it by following these steps on the same table search_table.

First, we create an SQL function in the database. The function will perform the following steps sequentially,

  1. Take a string as an input.
  2. Remove all the special characters from the string and lower the case of the string.
  3. Return the result string.

This function will look like below,

Next, we create a computed column named NormalizedName in search_table table whose value in the column will be derived from the Name column. We will use the function created above and store result string in the computed column. Make sure the collation of this computed column is set to SQL_Latin1_General_CP1_CI_AI.

We then remove special characters including spaces from the search keyword before using it in the query. You can use the same function that was created in the first step. The searching will be done against NormalizedName column. The query to do this looks like this,

Following is the result,

The result “E-Commerce” will be fetched and returned for search keywords like “ECommerce”, “E commerce” and “e comm”.

Code implementation

If an API or external C# code is used to interact with the database, then C# itself can be used to remove special character in the search keyword before passing it to the query.

In the above code example, RemoveSpecialChars method is used to remove special characters using regular expressions. We are preserving accented characters. We can pass accented characters since the collation of the column in the database table has been changed to SQL_Latin1_General_CP1_CI_AI.

Results

Posted By: Sameer Shaik, Osmosee

Are you interested? follow us and get notified of new posts

Leave A Reply

1 × 4 =