Vincent is Coding
March 23rd, 2024 Ruby on Rails

Unicode Normalise a String in Rails

One strange issue I had on Scribbles, especially with creating a page/post url based on the title, was that sometimes someone would use... wait for it... full-width characters. Huh? What?

Yes, I never encountered this... EVER. Here is a full-width page title: Ａｂｏｕｔ

Now that actually is just "About" but as "full-width" characters, and that poses a problem.

When Scribbles saves a post or page, it'll go ahead and try to parameterize the page/post title. Here is an example of my method:

if self.title.present?
  url = self.title.parameterize
else
  url = self.published_date.strftime("%Y-%m-%d")
end

The problem with full-width characters is that they're not plain ASCII, and parameterize has no ASCII replacement for them, so it strips them out and returns an empty string, and thus just throws errors like there is no tomorrow. Not really, but having an empty URL kinda defeats the purpose of creating a URL.
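To make the failure concrete, here is a rough sketch in plain Ruby. Note that ascii_slug is a made-up, simplified stand-in for an ASCII-only slug helper. It is not how Rails implements parameterize; it only assumes the same basic idea of keeping ASCII letters and digits and turning everything else into separators.

```ruby
# Hypothetical, simplified stand-in for an ASCII-only slug helper.
# This is NOT Rails' parameterize, just an illustration of the idea.
def ascii_slug(text)
  text.downcase
      .gsub(/[^a-z0-9]+/, "-") # anything outside a-z / 0-9 becomes a separator
      .gsub(/\A-+|-+\z/, "")   # trim separators from both ends
end

plain      = "About"
full_width = "\uFF21\uFF42\uFF4F\uFF55\uFF54" # "About" typed as full-width characters

ascii_slug(plain)      # => "about"
ascii_slug(full_width) # => "" (every character was treated as a separator)
```

Every full-width letter falls outside a-z, so the whole title collapses into separators and nothing is left for the URL.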

Thankfully Ruby has a built-in string method called unicode_normalize (it's part of Ruby's String class rather than a Rails helper, so it works outside Rails too). You can see what it does here in the docs.

So now I do the following:

url = self.title.unicode_normalize(:nfkc).parameterize.downcase

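You can see here I also passed :nfkc, which applies compatibility decomposition followed by canonical composition. I had no idea what that meant at first, but Rails suggested it when it threw an error on my initial attempt. In short, it folds "compatibility" variants such as full-width letters back into their plain equivalents, so the full-width title becomes a regular "About". I also added a downcase just to make sure everything is lower case, although parameterize already lowercases by default, so that part is only a safety net.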
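If you want to see the difference between the normalization forms for yourself, here is a small sketch using only plain Ruby (no Rails needed). The variable names are just for illustration. It compares :nfc, which is what unicode_normalize uses when you pass no argument, with :nfkc on the same full-width title.

```ruby
full_width = "\uFF21\uFF42\uFF4F\uFF55\uFF54" # "About" typed as full-width characters

# :nfc only does canonical composition, so full-width letters are left alone.
nfc = full_width.unicode_normalize(:nfc)

# :nfkc also applies compatibility mappings, folding full-width letters to ASCII.
nfkc = full_width.unicode_normalize(:nfkc)

puts nfc == full_width # true
puts nfkc              # About
puts nfkc.ascii_only?  # true
```

So the form matters here: only the compatibility forms turn the title into something a slug helper can work with.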

Anyway, this was interesting to figure out.