PG Casts

More Case Insensitive Data

Episode #8 hosted by Josh Branchaud

In this episode, we'll be looking at a unique index we can use when working with case-insensitive data.

Postgres Version 9.5

Transcript

In a previous episode, we explored some considerations when querying data that is considered to be case-insensitive.

We started with a users table that looked like this:

\d old.users

We have a unique index on the email field, but because we expect most queries to use the lower() function on email, we decided to add another index. We can see that on our current table:

\d users

With this table, we can do indexed case-insensitive queries.
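For instance, a lookup like the following compares against the indexed expression lower(email), so Postgres can use users_lower_email_idx instead of scanning the whole table. (The specific email value here is just for illustration.)

-- matches the indexed expression lower(email),
-- so this lookup can use users_lower_email_idx
select * from users where lower(email) = lower('Person1@Example.com');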

Both of these tables have another major issue, though. They will both allow us to insert records with duplicate emails by using different casing for those emails.

To see what this looks like, let's look at the first record in both tables:

select * from old.users, users limit 1;

Now let's insert some records that should violate the uniqueness that we'd like our table to be enforcing:

insert into old.users (email) values ('PERSON1@EXAMPLE.COM');
insert into users (email) values ('PERSON1@EXAMPLE.COM');

Yikes. We are not sufficiently enforcing uniqueness here. We need a better index. Let's focus on the users table, ignoring the old.users table.

If we take another look at the description of the table, we may notice the problem.

\d users

The users_lower_email_idx is not a unique index. If we update it to enforce uniqueness on the lowercase version of the email column, then we can be sure we won't end up with any duplicate records.

So, what if we just try adding another index with this additional constraint?

create unique index users_unique_lower_email_idx on users (lower(email));
-- fails

Oh right, we need to clear out any duplicates first, which means we need to remove that record we inserted earlier.
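In this case we know exactly which record is the offender, but in general an aggregate query can surface every group of emails that collides case-insensitively. This query is illustrative and not part of the cleanup below:

-- find emails that are duplicates once casing is ignored
select lower(email), count(*)
from users
group by lower(email)
having count(*) > 1;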

Also, we should replace the existing index to reduce the overhead on our table. To do this, we'll want to start a transaction to ensure data consistency as the indexes are being swapped.

begin;
delete from users where email = 'PERSON1@EXAMPLE.COM';
drop index users_lower_email_idx;
create unique index users_unique_lower_email_idx on users (lower(email));
commit;

We can see our new index in action by trying what is now an invalid insert statement:

insert into users (email) values ('PERSON1@EXAMPLE.COM');
-- fails

Our case-insensitive email column is now even more robust.

Until next time, thanks for watching.

Setup Script

create schema old;
create table old.users (
  id serial primary key,
  email varchar not null unique
);

create table users (
  id serial primary key,
  email varchar not null unique
);
create index users_lower_email_idx on users (lower(email));

insert into old.users (email)
select 'person' || num || '@example.com'
from generate_series(1,10000) as num;

insert into users (email)
select 'person' || num || '@example.com'
from generate_series(1,10000) as num;