
Start Date

Immediately after hiring


Files

  1. EmailPhoneExtractor.zip
  2. screenshot.png

Employer Facts

Location: United States
Rating: 0 stars
Spent: $0.00
Projects: 0
No payment method

Project Description

We want to commission someone to write Java code that extracts phone numbers and email addresses from a given file.
To start with a trivial example, given

jurafsky@stanford.edu
we would want to return
jurafsky@stanford.edu

We also need to handle more complex cases where the email address is less obvious, such as the following, which I'm sure you've seen:

jurafsky(at)cs.stanford.edu
jurafsky at csli dot stanford dot edu
We even want to handle examples like the one in the uploaded screenshot. (I wanted to type it in, but I would need to use metachars...)

For all of the above you should return the corresponding email address:

jurafsky@stanford.edu
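
As a rough illustration of the kind of matching involved, here is a minimal Java sketch that rewrites the "(at)" and " at ... dot ..." spellings above into user@domain form. It is not the uploaded starter code: the class name EmailSketch, the method extract, and the restriction to .edu domains are all assumptions made for the example. The posting does not say whether a subdomain such as "cs." should be kept or dropped, so the sketch keeps the domain exactly as written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the uploaded starter code.
final class EmailSketch {
    // Group 1: user name, followed by "@", "(at)" or " at ".
    // Group 2: domain labels joined by "." or " dot ", ending in "edu"
    // (assumption: only .edu addresses matter for this sketch).
    private static final Pattern EMAIL = Pattern.compile(
        "([A-Za-z0-9._-]+)"
        + "(?:\\s*@\\s*|\\s*\\(at\\)\\s*|\\s+at\\s+)"
        + "((?:[A-Za-z0-9-]+(?:\\.|\\s+dot\\s+))+edu)\\b",
        Pattern.CASE_INSENSITIVE);

    // Returns every address found in the text, rewritten as user@domain.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            // Turn " dot " separators back into literal dots.
            String domain = m.group(2).replaceAll("(?i)\\s+dot\\s+", ".");
            found.add(m.group(1) + "@" + domain);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("jurafsky@stanford.edu"));
        System.out.println(extract("jurafsky(at)cs.stanford.edu"));
        System.out.println(extract("jurafsky at csli dot stanford dot edu"));
    }
}
```

A pattern this loose will also fire on ordinary prose such as "look at stanford.edu", which is exactly the kind of false positive the test set is meant to expose.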

Similarly, for phone numbers, we want to handle examples like the following:

TEL +1 650 723 0293
Phone: (650) 723-0293
Tel (+1): 650-723-0293
<a href="contact.html">TEL</a> +1 650 723 0293

all of which should return the following canonical form:
650-723-0293
(you can assume all phone numbers will be inside North America).
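
The phone examples above could be reduced to the canonical form along the following lines. This is only a sketch under assumptions: the class name PhoneSketch and the method extract are made up for illustration, and the pattern simply looks for a 3-3-4 digit grouping with optional parentheses around the area code, ignoring any leading "+1".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the uploaded starter code.
final class PhoneSketch {
    // Area code (optionally in parentheses), prefix, line number.
    // The lookarounds keep the match from starting or ending inside a longer digit run.
    private static final Pattern PHONE = Pattern.compile(
        "(?<!\\d)\\(?(\\d{3})\\)?[\\s.-]*(\\d{3})[\\s.-]+(\\d{4})(?!\\d)");

    // Returns every phone number found in the text as AAA-PPP-LLLL.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group(1) + "-" + m.group(2) + "-" + m.group(3));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("TEL +1 650 723 0293"));
        System.out.println(extract("Phone: (650) 723-0293"));
        System.out.println(extract("Tel (+1): 650-723-0293"));
        System.out.println(extract("<a href=\"contact.html\">TEL</a> +1 650 723 0293"));
    }
}
```

Under these assumptions each of the four lines prints [650-723-0293]. Other separators and layouts found on real pages would need additional alternatives in the pattern.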

If you wish to take this on, we have uploaded the code we have so far, plus a development test set with some emails and phone numbers, together with the correct answers for your testing purposes.

You should be creative in looking at the web and thinking of the different ways emails and phone numbers can be encoded, not just the examples here. Finally, we won't need to deal with really difficult cases such as images of any kind, or examples that require parsing names into first/last like:
"first name"@cs.stanford.edu

We would need a final submission by Thursday, January 19, 2012.

Please let us know if this is something that you would be up for, and of course if you have any questions. We are happy to compensate you by the hour or at a flat rate which you set -- whichever is best for you. Thank you for considering this, and we hope all is well with you.

Deliverables

The program should correctly produce the devGOLD file from the given HTML files in the data/dev folder. By default, if you execute:

$ cd java
$ mkdir classes
$ javac -d classes *.java
$ java -cp classes SpamLord ../data/dev/ ../data/devGOLD

it will run the code on the files contained in data/dev/ and compare its output (initially that of a simple starter regular expression) against the correct results in devGOLD. The results will look something like this:
True Positives (3) ###############################
balaji e balaji@stanford.edu
nass e nass@stanford.edu
shoham e shoham@stanford.edu
False Positives (2) ###############################
psyoung e young@stanford.edu
thm e pkrokel@Stanford.edu
False Negatives (111) ###############################
ashishg e ashishg@stanford.edu
ashishg e rozm@stanford.edu
ashishg p 650-723-1614
...
The true positives section lists the e-mails and phone numbers that the starter code matches correctly. The false positives section lists e-mails that the starter code's regular expressions match but that are not correct. The false negatives section lists e-mails and phone numbers that exist in the HTML files but that the starter code did not match. Your goal, then, is to reduce the number of false positives and false negatives to zero, and submit the files to us on or before Thursday, January 19, 2012.
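
For reference, the three sections amount to simple set operations between what the program guessed and what the gold file contains. The sketch below is not the scoring code from the uploaded zip; the class name ScoreSketch and the tab-separated "name, type, value" string encoding are assumptions chosen for illustration, and the sample entries are taken from the output shown above.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical scorer for illustration; not the code shipped in the starter zip.
final class ScoreSketch {
    // Entries present in both the guesses and the gold set.
    static Set<String> truePositives(Set<String> guesses, Set<String> gold) {
        Set<String> tp = new TreeSet<>(guesses);
        tp.retainAll(gold);
        return tp;
    }

    // Entries the program guessed that are not in the gold set.
    static Set<String> falsePositives(Set<String> guesses, Set<String> gold) {
        Set<String> fp = new TreeSet<>(guesses);
        fp.removeAll(gold);
        return fp;
    }

    // Entries in the gold set that the program failed to guess.
    static Set<String> falseNegatives(Set<String> guesses, Set<String> gold) {
        Set<String> fn = new TreeSet<>(gold);
        fn.removeAll(guesses);
        return fn;
    }

    public static void main(String[] args) {
        // Assumed encoding: file name, type (e or p), value, separated by tabs.
        Set<String> guesses = new TreeSet<>(Arrays.asList(
            "balaji\te\tbalaji@stanford.edu",
            "psyoung\te\tyoung@stanford.edu"));
        Set<String> gold = new TreeSet<>(Arrays.asList(
            "balaji\te\tbalaji@stanford.edu",
            "ashishg\tp\t650-723-1614"));
        System.out.println("True Positives (" + truePositives(guesses, gold).size() + ")");
        System.out.println("False Positives (" + falsePositives(guesses, gold).size() + ")");
        System.out.println("False Negatives (" + falseNegatives(guesses, gold).size() + ")");
    }
}
```

Reaching zero false positives and zero false negatives means the guessed set and the gold set are identical.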

Skills Required

Java