PG Casts

Citext for Emails

Episode #13 hosted by Josh Branchaud

In this episode, we're going to be looking at how we can use the citext module when working with case insensitive data.

Postgres Version 9.5

Transcript

In previous episodes, we explored the issues presented by case-insensitive data:

We had to jump through a number of hoops to make sure that we can efficiently query our tables as well as prevent duplicate records.

We learned how to efficiently query our tables and prevent duplicate records, but there is an easier way -- the citext module.

Let's look at how to use it.

First, here is the table that we left off with:

\d old.users

In this table we used the varchar data type for our email column.

In our new table we will be replacing it with the citext data type provided by the citext module. This data type works just like the text data type, but during comparisons it internally calls the lower() function.

Like I said, citext is going to make things easier.

To get started,

create extension citext;

Then we create our new users table:

create table users (
  id serial primary key,
  email citext not null unique
);

and we can take a quick look at the description:

\d users

Let's insert an initial record so that we have something to work with, pay attention to the casing here:

insert into users (email) values ('PERSON1@example.com');

Now we can see the case-insensitive comparison in action by querying for the record with different casing:

select * from users where email = 'person1@example.com';

We find the record and the casing from the original insertion is preserved.

And what if we try to insert a duplicate record with different casing:

insert into users (email) values ('Person1@example.com');

As expected, we get an error, the insertion was prevented by our uniqueness constraint.

Lastly, let's see how the performance compares.

We can insert some additional records to our new users table:

insert into users (email)
select 'person' || num || '@example.com'
from generate_series(2,10000) as num;

Now we can look at the explain analyze output for queries on each of the tables:

explain analyze select * from old.users
where lower(email) = lower('person5000@example.com');
explain analyze select * from users
where email = 'person5000@example.com';

We can see that we get comparable performance when using the citext field with a unique index.

You may be wondering where the ilike operator fits into all of this. We will look at that in the following episode.

Until then, may your data be consistent and your queries performant.

Setup

create schema old;
create table old.users (
  id serial primary key,
  email varchar not null unique
);
create unique index users_unique_lower_email_idx on old.users (lower(email));

insert into old.users (email)
select 'person' || num || '@example.com'
from generate_series(1,10000) as num;

create schema old;
create table old.users (
  id serial primary key,
  email varchar not null unique
);
create unique index users_unique_lower_email_idx on old.users (lower(email));

insert into old.users (email)
select 'person' || num || '@example.com'
from generate_series(1,10000) as num;