Users of countless software products use search often to locate what they need. Search, in it’s most simplest form, is to take a search keyword from the user, run down a list and find results which partially or fully (it’s up to you) match the keyword. In this blog, we’ll talk about two features that we feel should be provided with any user-friendly search implementation. These are related to handling search with data having accented and special characters. We’ll see how to implement these using MS SQL Server.
- Take a table in the database which has a column word that contains an item “João”, basic search should fetch this item with keyword “Joao”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality,
- In a database table which has column Word that contains an item “e-commerce”, basic search should fetch this item with keywords like “ecommerce” or “e commerce”. See the below screenshot from WordPress.org that uses Google Search to provide this functionality,
We’re going to focus on implementing these features using SQL Server.
Handling accented characters
The reason why we’re not able to search for “João” using the keyword “Joao” is because of the collation used in the database.
Link to know everything about collations for good
When a new database is created in SQL Server, the collation of the database is set to SQL_Latin1_General_CP1_CI_AS by default.
While every word in the name of the collation has a meaning, we’ll focus only on CI and AS.
CI (Determines case sensitivity) means that the data that will be stored in the database will be case insensitive. AS means that the data that will be stored in the database will be accent sensitive.
AS (Determines accent sensitivity) is the reason why we were not able to search for “João” using the keyword “Joao”.
For our intended functionality, it’s obvious that the collation needs to be changed to SQL_Latin1_General_CP1_CI_AI. This collation will not differentiate between ‘ã’ and ‘a’.
Collations can be set on an entire server, to a database in a server or to specific columns in a table of a database. For our requirement we will only change the collation of the column that we wish to make case and accent insensitive. Do note that, there is a performance impact when modifying the collation. To find out more, check here.
For the list of collations in SQL Server, run this query,
1 |
SELECT * FROM fn_helpcollations(); |
We have a database search_table which looks like below.
The easiest way to find the collation of the table’s column is to check the properties of the column in SQL Server Management Studio’s Object Explorer.
You can alter the collation of the Name column using the following query.
1 2 |
ALTER TABLE dbo.search_table ALTER COLUMN Name varchar(50) COLLATE SQL_Latin1_General_CP1_CI_AI NOT NULL; |
Now, on searching the table for the word “Joao”, the following will be the result,
Neat!
Handling hyphenated characters
Solving this problem is also fairly simple. We do it by following these steps on the same table search_table.
First, we create an SQL function in the database. The function will perform the following steps sequentially,
- Take a string as an input.
- Remove all the special characters from the string and lower the case of the string.
- Return the result string.
This function will look like below,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
CREATE FUNCTION [dbo].[RemoveSpecialChars] ( @OriginalName VARCHAR(256)) RETURNS VARCHAR(256) WITH SCHEMABINDING BEGIN IF @OriginalName is null return null DECLARE @ResultText varchar(256) = '' ,@Length int = len(@OriginalName) ,@Index int = 1; WHILE @Index <= @Length BEGIN DECLARE @char INT SET @char = ASCII(SUBSTRING(@OriginalName, @Index, 1)) IF (@char BETWEEN 48 AND 57 OR @char BETWEEN 65 AND 90 OR @char BETWEEN 97 AND 122 OR @char BETWEEN 192 AND 255) begin SET @ResultText = @ResultText + CHAR(@char) end SET @Index = @Index + 1 END IF LEN(@ResultText) = 0 RETURN null RETURN @ResultText END |
Next, we create a computed column named NormalizedName in search_table table whose value in the column will be derived from the Name column. We will use the function created above and store result string in the computed column. Make sure the collation of this computed column is set to SQL_Latin1_General_CP1_CI_AI.
1 2 3 |
ALTER TABLE search_table ADD [NormalizedName] AS (dbo.RemoveSpecialChars(Name)) COLLATE SQL_Latin1_General_CP1_CI_AI PERSISTED |
We then remove special characters including spaces from the search keyword before using it in the query. You can use the same function that was created in the first step. The searching will be done against NormalizedName column. The query to do this looks like this,
1 2 3 4 5 |
DECLARE @searchStr nvarchar(max) = 'E-Commerce'; SELECT Name FROM search_table WHERE NormalizedName LIKE '%' + dbo.RemoveSpecialChars(@searchStr) + '%'; |
Following is the result,
The result “E-Commerce” will be fetched and returned for search keywords like “ECommerce”, “E commerce” and “e comm”.
Code implementation
If an API or external C# code is used to interact with the database, then C# itself can be used to remove special character in the search keyword before passing it to the query.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
public static void Main() { string keyword = "ECommerce"; SearchName(RemoveSpecialChars(keyword)); } private static string RemoveSpecialChars(string text) { // Removing the special chars excluding the accented characters. text = Regex.Replace(text, @"[^0-9a-zA-ZÁ-Źá-ź]+", ""); return text; } private static void SearchName(string keyword) { string connString = ConfigurationManager.ConnectionStrings["DbStorage"].ToString(); using (SqlConnection sqlConnection = new SqlConnection(connString)) { string query = @"SELECT Name FROM search_table WHERE NormalizedName LIKE '%@keyword%'; "; SqlCommand sqlCommand = new SqlCommand(); sqlCommand.Connection = sqlConnection; sqlCommand.CommandType = CommandType.Text; sqlCommand.CommandText = query; sqlCommand.Parameters.AddWithValue("@keyword", keyword); SqlDataReader reader = sqlCommand.ExecuteReader(); while (reader.Read()) { ReadSingleRow((IDataRecord)reader); } reader.Close(); } } private static void ReadSingleRow(IDataRecord record) { Console.WriteLine(String.Format("{0}, {1}", record[0], record[1])); } |
In the above code example, RemoveSpecialChars method is used to remove special characters using regular expressions. We are preserving accented characters. We can pass accented characters since the collation of the column in the database table has been changed to SQL_Latin1_General_CP1_CI_AI.
Results
Posted By: Sameer Shaik, Osmosee