Start Date
Immediately after hiring
Project Description
We want to commission someone to write Java code that extracts phone numbers and email addresses from a given file.
To start with a trivial example, given
[email protected]
we would want to return
[email protected]
We also need to deal with more complex examples where emails are less obvious, such as the following types of examples that I'm sure you've seen:
jurafsky(at)cs.stanford.edu
jurafsky at csli dot stanford dot edu
We even want to handle examples like the one in the uploaded screenshot. (I wanted to type it in, but I would need to use metacharacters...)
For all of the above you should return the corresponding email address:
[email protected]
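As a rough illustration of how the "(at)" and "at ... dot ..." forms above might be normalized, here is a minimal sketch. The class name EmailSketch and its extract method are hypothetical and are not part of the uploaded starter code. The sketch assumes addresses end in "edu" (as every example in this posting does), and it simply rebuilds the address that each obfuscated form spells out literally; whether that matches the canonical address expected in the gold file is an assumption to check against the development set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the starter code.
class EmailSketch {
    // "@", "(at)", or the word "at" surrounded by whitespace.
    private static final String AT = "(?:\\s*@\\s*|\\s*\\(at\\)\\s*|\\s+at\\s+)";
    // A literal "." or the word "dot" surrounded by whitespace.
    private static final String DOT = "(?:\\.|\\s+dot\\s+)";
    // Assumption: only addresses whose last label is "edu" are of interest.
    private static final Pattern EMAIL = Pattern.compile(
        "([A-Za-z0-9._-]+)" + AT + "((?:[A-Za-z0-9-]+" + DOT + ")+edu)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every address found in the text, rewritten as user@domain.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            String user = m.group(1);
            // Turn spelled-out " dot " separators back into ".".
            String domain = m.group(2).replaceAll("(?i)\\s+dot\\s+", ".");
            found.add((user + "@" + domain).toLowerCase(Locale.ROOT));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("jurafsky(at)cs.stanford.edu"));
        System.out.println(extract("jurafsky at csli dot stanford dot edu"));
    }
}
```

A pattern this loose will also fire on ordinary prose such as "look at stanford dot edu", so it would likely need tightening against the false positives reported on the development set.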
Similarly, for phone numbers, we want to handle examples like the following:
TEL +1 650 723 0293
Phone: (650) 723-0293
Tel (+1): 650-723-0293
<a href="contact.html">TEL</a> +1 650 723 0293
all of which should return the following canonical form:
650-723-0293
(you can assume all phone numbers will be inside North America).
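The four phone examples above can be reduced to the canonical form with a single pattern, sketched below. The class name PhoneSketch and its extract method are hypothetical and not taken from the uploaded code. The sketch relies on the stated assumption that all numbers are North American (three-digit area code, three-digit exchange, four-digit line number) and simply ignores any "+1" prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the starter code.
class PhoneSketch {
    // Optional parentheses around the area code, then separators made of
    // whitespace, dots, or hyphens. The lookarounds keep the match from
    // starting or ending in the middle of a longer run of digits.
    private static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)\\(?(\\d{3})\\)?[\\s.-]*(\\d{3})[\\s.-]+(\\d{4})(?!\\d)");

    // Returns every phone number found in the text as AAA-EEE-NNNN.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "-" + m.group(2) + "-" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] samples = {
            "TEL +1 650 723 0293",
            "Phone: (650) 723-0293",
            "Tel (+1): 650-723-0293",
            "<a href=\"contact.html\">TEL</a> +1 650 723 0293"
        };
        for (String s : samples) {
            System.out.println(extract(s)); // each line prints [650-723-0293]
        }
    }
}
```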
If you wish to take this on, we have uploaded the code we have so far, plus a development test set with some emails and phone numbers, together with the correct answers for your testing purposes.
You should be creative in looking at the web and thinking of different types of ways of encoding emails and phone numbers, not just the examples here. Finally, we won't need to deal with really difficult examples like images of any kind, or examples that require parsing names into first/last like:
"first name"@cs.stanford.edu
*We would need a final submission by Thurs, Jan 19, 2012.*
Please let us know if this is something that you would be up for, and of course if you have any questions. We are happy to compensate you by the hour or at a flat rate which you set -- whichever is best for you. Thank you for considering this, and we hope all is well with you.
Deliverables
The program should correctly produce the devGOLD file from the given HTML files in the data/dev folder. By default, if you execute:
$ cd java
$ mkdir classes
$ javac -d classes *.java
$ java -cp classes SpamLord ../data/dev/ ../data/devGOLD
It will run your code on the files contained in data/dev/ and compare the results of a simple regular expression against the correct results. The results will look something like this:
True Positives (3) ###############################
balaji e [email protected]
nass e [email protected]
shoham e [email protected]
False Positives (2) ###############################
psyoung e [email protected]
thm e [email protected]
False Negatives (111) ###############################
ashishg e [email protected]
ashishg e [email protected]
ashishg p 650-723-1614
...
The true positives section lists the emails and phone numbers that the starter code matches correctly. The false positives section lists matches that the starter code's regular expressions produce but that are not correct. The false negatives section lists the emails and phone numbers that exist in the HTML files but that the starter code did not match. Your goal, then, is to reduce the number of false positives and false negatives to zero, and to submit the files to us on or before Thursday, January 19, 2012.
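To make the three categories concrete, the comparison can be viewed as plain set arithmetic over "name type value" lines. The sketch below is only an illustration of that idea: the class ScoreSketch, its methods, and the sample entries are all made up, and the uploaded code may well compute its report differently.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not the starter code's scorer.
class ScoreSketch {
    // Entries present in both sets.
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Entries present in a but missing from b.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    static void report(String title, Set<String> entries) {
        System.out.println(title + " (" + entries.size() + ")");
        for (String entry : entries) {
            System.out.println(entry);
        }
    }

    public static void main(String[] args) {
        // Made-up data in the "name type value" layout of the report.
        Set<String> gold = new TreeSet<>(Arrays.asList(
            "alice e alice@example.edu",
            "alice p 650-555-0100",
            "bob e bob@example.edu"));
        Set<String> guesses = new TreeSet<>(Arrays.asList(
            "alice e alice@example.edu",
            "carol e carol@example.edu"));

        report("True Positives", intersection(guesses, gold));  // guessed and correct
        report("False Positives", difference(guesses, gold));   // guessed but not in gold
        report("False Negatives", difference(gold, guesses));   // in gold but not guessed
    }
}
```

Under this view, reaching zero false positives and zero false negatives means the set of guesses equals the gold set exactly.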