Find and sort unique email addresses in a text file

The task sounds trivial, but the solution is not. Fortunately there's a Perl module written for exactly what we aim to do. It's called Email::Find and it can be installed through the libemail-find-perl package on Debian-based systems.

This module provides a find() method that locates all RFC 822 compliant email addresses in the string it is given and executes a callback function for each address. We only have to do a unique sort on the result.

Here's a complete example script:
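#!/usr/bin/perl
use strict;
use warnings;

# requires the libemail-find-perl package on Debian/Ubuntu systems
use Email::Find;

my %addresses;

# the callback gets a Mail::Address object and the original matched text;
# it stores the lowercased address as a hash key, which keeps values unique
my $finder = Email::Find->new(sub {
  my ($email, $orig_email) = @_;
  $addresses{lc($email->address)} = 1;
  # the return value replaces the match in the scanned text, so hand back the original
  return $orig_email;
});

# scan every input line (files named on the command line, or standard input)
while (my $line = <>) {
  $finder->find(\$line);
}

# print the sorted values to standard output
print "$_\n" foreach sort keys %addresses;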

Save the code into a file (e.g. find_emails.pl), make it executable (chmod u+x find_emails.pl) and run it either with the input file's name as its first argument or with the text supplied on its standard input:
./find_emails.pl inputfile.txt
or
cat inputfile.txt | ./find_emails.pl

The script prints one address per line to standard output. I've added a conversion to lowercase too; you can skip it by removing the call to the lc() function.
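
If you'd rather not fold the whole address to lowercase, a middle ground is to lowercase only the domain part, because domain names are case-insensitive while the local part is interpreted by the receiving mail server. Here's a minimal sketch of that idea. The helper name normalize_address() is my own invention for illustration (it is not part of Email::Find), and the sample address is made up:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Email::Find): lowercase only the
# domain part of an address and leave the local part untouched.
sub normalize_address {
  my ($address) = @_;
  # split at the last "@", since a quoted local part may itself contain "@"
  my $at = rindex($address, '@');
  return $address if $at < 0;    # no "@" found: return the input unchanged
  return substr($address, 0, $at + 1) . lc(substr($address, $at + 1));
}

# made-up sample address, just to show the effect
print normalize_address('John.Smith@Example.COM'), "\n";
```

To use it in the script above, you would store normalize_address(...) of the found address as the hash key instead of the lc() of it. Keep in mind that John.Smith@example.com and john.smith@example.com would then be listed as two separate addresses.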

Comments

E-mail address case

According to RFC 2822 the local part of an e-mail address must be handled in a case-sensitive manner, whereas the domain part is case-insensitive.

Re: E-mail address case

That's true and I was aware of it, but practice shows that ...
  1. 99% of the mail delivery agents handle it case-insensitively anyway
  2. a lot of users write their email addresses with capitals (e.g. if someone is called John Smith and his email address is john.smith@example.com, he tends to write it as John.Smith@example.com, although he should not)
So I figured that converting the email addresses to lowercase has more pros than cons. I just tested with Gmail and it handles addresses case-insensitively (I get the mail regardless of the case of the letters in my address).
