Harvesting Email Addresses from a Website using Java - Ex094 - 08-26-2016

Github: https://github.com/Ex094/Email-Extractor

In this tutorial we'll be creating a small Java command line application that extracts email addresses from websites. A program like this comes in handy for people who work in advertising and similar fields.

So before we jump right into programming, let's think about the possible steps of the program.

Since we are extracting emails from a website, we will definitely be asking the user to input the URL of the website. Having the URL doesn't magically give us the emails; we first have to get the contents of the URL. Now that we have the contents, how are we going to extract the emails? Yes, you guessed it right, we will be using REGEX.

So the list of steps is:

Code:
Get website URL from the user
Get the contents of the URL
Run REGEX on the contents
Print out email addresses extracted by the REGEX from the contents.

Now that we have a basic layout of our program, let's start coding part by part, adding possible improvements along the way. But first we will create our EmailExtractor class.

Code:
/**
* @author ex094
*/
public class EmailExtractor {

}


Handling URL

We will be initializing the EmailExtractor with a URL which the user will input via command line arguments, but we will cover that part at the end. For now we will create the constructor for the EmailExtractor, which takes a URL string as an argument and initializes the URL object.

So the code is:

Code:
import java.net.URL;

/**
* @author ex094
*/
public class EmailExtractor {

   URL url; //URL Instance Variable

   EmailExtractor(String url) {
       this.url = new URL(url); //Initializing our URL object
   }

}

If you are new to the Java URL class, it simply allows us to open a connection to the specified URL and then read data from it. You must specify the protocol (http/https) in the URL, otherwise the URL constructor will throw a MalformedURLException. Because that is a checked exception, the code above will not compile until we handle it, hence we will enclose the statement in Try..Catch.

Code:
import java.net.MalformedURLException;
import java.net.URL;

/**
* @author ex094
*/
public class EmailExtractor {

   URL url; //URL Instance Variable

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }
}
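
To see the protocol rule on its own, here is a minimal sketch. The class UrlProtocolDemo and the helper isParsable are names made up for this illustration and are not part of the EmailExtractor; the sketch only shows which strings the URL constructor accepts.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlProtocolDemo {

    // Hypothetical helper: true if java.net.URL accepts the string
    static boolean isParsable(String candidate) {
        try {
            new URL(candidate); // throws when the protocol is missing or unknown
            return true;
        } catch (MalformedURLException ex) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsable("http://www.google.com")); // protocol present
        System.out.println(isParsable("www.google.com"));        // protocol missing
    }
}
```

Note that this only checks the format of the string. No connection is opened, so a well-formed URL that doesn't exist on the internet still passes.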

Getting URL Contents

In the previous section we initialized our URL object to hold the user's URL. What we need to do now is read the contents of the URL and store them in a variable so that we can later apply regex and extract email addresses from them.

Let's create a method readContents which will read the contents from the URL. It uses a BufferedReader to read the InputStream from the URL object and saves the contents in a StringBuilder variable.

Code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
* @author ex094
*/
public class EmailExtractor {

   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {

      //Open Connection to URL and get stream to read
       BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
       contents = new StringBuilder();
       //Read and Save Contents to StringBuilder variable
       String input;
       while((input = read.readLine()) != null) {
           //readLine() drops the line break, so add one back to keep lines separate
           contents.append(input).append('\n');
       }

   }
}

The url.openStream() call opens a connection to the URL and returns an InputStream so that we can read the data from it. The InputStreamReader turns the bytes of that stream into characters, and the BufferedReader buffers them so that we can read the contents line by line.
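
One detail of readLine() is worth seeing in isolation: it returns each line without its line break. The sketch below reads from a String instead of a URL so it runs without a network connection. The class ReadLineDemo and the helper joinLines are names invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLineDemo {

    // Hypothetical helper: reads the text line by line and appends the
    // given separator after every line
    static String joinLines(String text, String separator) {
        StringBuilder contents = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append(separator);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return contents.toString();
    }

    public static void main(String[] args) {
        String page = "contact\nbob@example.com\n";
        System.out.println(joinLines(page, ""));   // lines run together
        System.out.println(joinLines(page, "\n")); // lines stay separate
    }
}
```

With an empty separator the word at the end of one line runs into the start of the next, which could turn two unrelated pieces of text into one bogus address. Appending a line break after each line avoids that.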

The readContents method is complete but there's a problem: url.openStream() and readLine() throw the checked IOException, for example when the URL supplied by the user is in the correct format but doesn't actually exist on the internet. We need to handle that exception too, so we just surround the whole block with Try..Catch.

Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
* @author ex094
*/
public class EmailExtractor {

   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {
       contents = new StringBuilder(); //Start empty so later steps never see null
       //Open Connection to URL and get stream to read (closed automatically)
       try (BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()))) {
           //Read and Save Contents to StringBuilder variable
           String input;
           while((input = read.readLine()) != null) {
               //readLine() drops the line break, so add one back to keep lines separate
               contents.append(input).append('\n');
           }
       } catch (IOException ex) {
           System.out.println("Unable to read URL due to Unknown Host..");
       }
   }
}

Now if the user enters a URL like http://123asd.com which doesn't exist, our program will catch the exception and print Unable to read URL due to Unknown Host..

Extracting Email Addresses Using REGEX

When we obtain the contents of the URL, they'll be in messy HTML form. Using a regular expression pattern for email addresses, we can find the matching strings in the contents.

The regular expression used is: \b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b

We will create an extractEmail method which uses the regex to search for email addresses in the contents. Each time it gets a hit, it stores that email address in a collection. Since the same email may appear on a page more than once, we will use a Set (a HashSet of Strings) rather than an ArrayList so that every address is stored only once.

Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author ex094
*/
public class EmailExtractor {

   String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents
   Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {
       contents = new StringBuilder(); //Start empty so later steps never see null
       //Open Connection to URL and get stream to read (closed automatically)
       try (BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()))) {
           //Read and Save Contents to StringBuilder variable
           String input;
           while((input = read.readLine()) != null) {
               //readLine() drops the line break, so add one back to keep lines separate
               contents.append(input).append('\n');
           }
       } catch (IOException ex) {
           System.out.println("Unable to read URL due to Unknown Host..");
       }
   }

   public void extractEmail() {
       //Creates a Pattern
       Pattern pat = Pattern.compile(pattern);
       //Matches contents against the given Email Address Pattern
       Matcher match = pat.matcher(contents);
       //If match found, append to emailAddresses
       while(match.find()) {
           emailAddresses.add(match.group());
       }
   }
}
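
To try the pattern without fetching a page, here is a small sketch that runs the same regular expression on a hard-coded sample string. The class RegexDemo, the helper extract and the sample addresses are made up for this example. A TreeSet is used here instead of the HashSet from the tutorial only so that the printed order is repeatable.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexDemo {

    // Same pattern string as in the tutorial
    static final Pattern EMAIL =
            Pattern.compile("\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b");

    // Hypothetical helper: collects every match, duplicates collapse in the Set
    static Set<String> extract(CharSequence text) {
        Set<String> found = new TreeSet<>();
        Matcher matcher = EMAIL.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Write to alice@example.com or bob@example.org. Again: alice@example.com";
        System.out.println(extract(sample));                 // two unique addresses
        System.out.println(extract("john_doe@example.com")); // empty, see note below
    }
}
```

The repeated address is stored only once, and the full stop after bob@example.org is not included in the match. One limitation to be aware of: the character class in front of the @ has no underscore or plus sign, so an address like john_doe@example.com is skipped entirely by this pattern.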

Printing out Email Addresses

To print out the email addresses from the emailAddresses set to the command line, we will create a method printAddresses:

Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author ex094
*/
public class EmailExtractor {

   String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents
   Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {
       contents = new StringBuilder(); //Start empty so later steps never see null
       //Open Connection to URL and get stream to read (closed automatically)
       try (BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()))) {
           //Read and Save Contents to StringBuilder variable
           String input;
           while((input = read.readLine()) != null) {
               //readLine() drops the line break, so add one back to keep lines separate
               contents.append(input).append('\n');
           }
       } catch (IOException ex) {
           System.out.println("Unable to read URL due to Unknown Host..");
       }
   }

   public void extractEmail() {
       //Creates a Pattern
       Pattern pat = Pattern.compile(pattern);
       //Matches contents against the given Email Address Pattern
       Matcher match = pat.matcher(contents);
       //If match found, append to emailAddresses
       while(match.find()) {
           emailAddresses.add(match.group());
       }
   }

   public void printAddresses() {
       //Check if email addresses have been extracted
       if(emailAddresses.size() > 0) {
           //Print out all the extracted emails
           System.out.println("Extracted Email Addresses: ");
           for(String emails : emailAddresses) {
               System.out.println(emails);
           }
       } else {
           //In case, no email addresses were extracted
           System.out.println("No emails were extracted!");
       }
   }
}

The printAddresses method first checks that the Set is not empty, i.e. that there are emails in it. If there are, it prints all of the email addresses. If no email addresses were found in the website contents, i.e. the size of the Set is zero, it prints No emails were extracted!

Saving Email Addresses to a Text File (Extra)

Suppose a site you just scraped contains 1000 email addresses and all of them get printed on your terminal. Scrolling through them and copying and pasting them is time consuming and annoying. So instead we can create a method called saveAddresses which saves all the extracted email addresses to a file with a name that the user chooses.

Code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author ex094
*/
public class EmailExtractor {

   String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents
   Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {
       contents = new StringBuilder(); //Start empty so later steps never see null
       //Open Connection to URL and get stream to read (closed automatically)
       try (BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()))) {
           //Read and Save Contents to StringBuilder variable
           String input;
           while((input = read.readLine()) != null) {
               //readLine() drops the line break, so add one back to keep lines separate
               contents.append(input).append('\n');
           }
       } catch (IOException ex) {
           System.out.println("Unable to read URL due to Unknown Host..");
       }
   }

   public void extractEmail() {
       //Creates a Pattern
       Pattern pat = Pattern.compile(pattern);
       //Matches contents against the given Email Address Pattern
       Matcher match = pat.matcher(contents);
       //If match found, append to emailAddresses
       while(match.find()) {
           emailAddresses.add(match.group());
       }
   }

   public void printAddresses() {
       //Check if email addresses have been extracted
       if(emailAddresses.size() > 0) {
           //Print out all the extracted emails
           System.out.println("Extracted Email Addresses: ");
           for(String emails : emailAddresses) {
               System.out.println(emails);
           }
       } else {
           //In case, no email addresses were extracted
           System.out.println("No emails were extracted!");
       }
   }

   public void saveAddresses(String filename) {
       //Create a new .txt file
       File file = new File(filename + ".txt");
       //Setting charset
       Charset charset = Charset.forName("UTF-8");

       //Create a BufferedWriter that writes emails to the file using the charset above
       try(BufferedWriter write = java.nio.file.Files.newBufferedWriter(file.toPath(), charset)) {
           //Write each email address on a newline in the file
           for(String emails : emailAddresses) {
               write.write(emails);
               write.newLine();
           }
       } catch (IOException ex) {
           System.out.println("Could not save email addresses to text file!");
       }
   }
}

The File object represents a text file with the name that is passed as an argument to the method. The BufferedWriter creates that file and writes the email addresses to it, each on a new line. In case there's a problem writing the file, the Try..Catch block will handle the IOException.
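
If you want to check the one-address-per-line format without running the whole extractor, here is a minimal sketch that writes a small Set to a temporary file and reads it back. The class SaveDemo, the helper writeAndReadBack and the sample addresses are assumptions for this illustration. It uses Files.newBufferedWriter with an explicit UTF-8 charset, which is one possible way to write the file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SaveDemo {

    // Hypothetical helper: writes one address per line, then returns the lines read back
    static List<String> writeAndReadBack(Set<String> addresses) {
        try {
            Path file = Files.createTempFile("emails", ".txt"); // temporary file for the demo
            try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                for (String address : addresses) {
                    writer.write(address);
                    writer.newLine();
                }
            }
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            Files.delete(file); // clean up the temporary file
            return lines;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Set<String> addresses = new TreeSet<>(); // sorted so the output order is repeatable
        addresses.add("bob@example.org");
        addresses.add("alice@example.com");
        System.out.println(writeAndReadBack(addresses));
    }
}
```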

Command Line Arguments

So we are basically done creating our EmailExtractor class and its essential methods. What we need to do now is handle our user inputs. Time for us to create our main method:

Code:
public static void main (String[] args) {

}

You've written this piece of code thousands of times, yet have you ever wondered what String[] args means?

String[] args is simply an array of Strings that contains the command line arguments passed by the user.
If no arguments are passed, the args array is empty,

Code:
args = []

So when you type in your terminal something like:

Code:
java EmailExtractor hello world

The args array becomes,

Code:
args = ["hello", "world"]

Since it's a typical Java array, we can access the passed arguments by index. So if I wanted to see the first argument the user has passed, I would simply do

Code:
System.out.println(args[0]);

And it will print hello.

Another thing to keep in mind is that args is just the name of the array. You can name it anything, like String[] myArguments, but it's recommended that you follow the convention and keep it as String[] args.
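
Here is a small sketch that puts both points together: the array is indexed from 0, and the parameter name is free to choose. The class ArgsDemo and the helper describe are names invented for this example.

```java
class ArgsDemo {

    // Hypothetical helper: lists each argument together with its index
    static String describe(String[] arguments) {
        if (arguments.length == 0) {
            return "no arguments";
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < arguments.length; i++) {
            if (i > 0) {
                out.append(", ");
            }
            out.append("args[").append(i).append("]=").append(arguments[i]);
        }
        return out.toString();
    }

    // The parameter is named myArguments on purpose; Java only cares about the type
    public static void main(String[] myArguments) {
        System.out.println(describe(myArguments));
    }
}
```

Running it as java ArgsDemo hello world should list hello at index 0 and world at index 1, and running it with no arguments takes the empty array branch.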

Handling Command Line Arguments

For our application here, we will have 2 arguments

Code:
URL of the website
Save Email Addresses

Of these, the first argument is required and the second is optional, depending on whether you want to save the addresses to a file. When you run the app with just the URL as the argument, it'll extract the email addresses and print them by default. But if you want to save those email addresses to a text file, you need to add the -s argument followed by another argument, which is the name of the file,

Code:
java EmailExtractor http://www.google.com -s emails

-s is the argument that will indicate that the user wants to save the file, and emails is the name of that file. So our main method becomes

Code:
public static void main (String[] args) {
       EmailExtractor extract;

       //Check if arguments are supplied and URL is supplied
       if(args.length > 0 && args[0] != null) {

           //If length of args is 3 and -s in args, then save the emails
           if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {

           //Just print them normally
           } else {

           }
       } else {
           System.out.println("Invalid Arguments supplied...");
       }
   }

In the If condition we check that the user has supplied arguments, in particular the URL. If the URL is not supplied, the program simply tells the user Invalid Arguments supplied... If the URL is included and -s along with the file name has also been entered by the user, we will save the email addresses to a file using the saveAddresses method, otherwise the list of email addresses will simply be displayed. Now our code becomes,

Code:
public static void main (String[] args) {

       EmailExtractor extract;

       //Check if arguments are supplied and URL is supplied
       if(args.length > 0 && args[0] != null) {

           extract = new EmailExtractor(args[0]);//Initialize Extractor with URL
           extract.readContents(); //Read the URL contents
           extract.extractEmail(); //Extract the email addresses

           //If -s in args, then save the emails
           if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
               extract.saveAddresses(args[2]); //Save the email address in a file with name from args[2]
           //Just print them normally
           } else {
               extract.printAddresses(); //Otherwise normally display the email addresses
           }
       } else {
           System.out.println("Invalid Arguments supplied...");
       }
   }
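
The argument checks can also be pulled out into a small helper that is easy to test on its own. The sketch below is a stricter variation and not the tutorial's exact behaviour: it accepts either the URL alone or the URL followed by -s and a file name, and rejects everything else, whereas the main method above falls back to printing. The class CliOptions and its fields are hypothetical names.

```java
class CliOptions {

    final String url;      // required first argument
    final String saveName; // file name given after -s, or null when only printing

    private CliOptions(String url, String saveName) {
        this.url = url;
        this.saveName = saveName;
    }

    // Returns the parsed options, or null when the arguments are not in a supported form
    static CliOptions parse(String[] args) {
        if (args.length == 1) {
            return new CliOptions(args[0], null);
        }
        if (args.length == 3 && "-s".equals(args[1])) {
            return new CliOptions(args[0], args[2]);
        }
        return null;
    }

    public static void main(String[] args) {
        CliOptions options = parse(args);
        if (options == null) {
            System.out.println("Invalid Arguments supplied...");
        } else if (options.saveName != null) {
            System.out.println("Would save emails from " + options.url + " to " + options.saveName + ".txt");
        } else {
            System.out.println("Would print emails from " + options.url);
        }
    }
}
```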

And our Email Extractor is complete. Build the jar file using NetBeans and run this command in the terminal:

Code:
java -jar emailExtractor.jar https://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/

It'll produce the following output:

[Image: screenshot-from-2016-08-25-23-26-53.png?w=756]

The complete code:

Code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author ex094
*/
public class EmailExtractor {

   String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
   URL url; //URL Instance Variable
   StringBuilder contents; //Stores our URL Contents
   Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

   EmailExtractor(String url) {

       try {
           this.url = new URL(url); //Initializing our URL object
       } catch (MalformedURLException ex) {
          System.out.println("\tPlease include Protocol in your URL e.g. http://www.google.com");
          System.exit(1);
       }
   }

   public void readContents() {
       contents = new StringBuilder(); //Start empty so later steps never see null
       //Open Connection to URL and get stream to read (closed automatically)
       try (BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()))) {
           //Read and Save Contents to StringBuilder variable
           String input;
           while((input = read.readLine()) != null) {
               //readLine() drops the line break, so add one back to keep lines separate
               contents.append(input).append('\n');
           }
       } catch (IOException ex) {
           System.out.println("\tUnable to read URL due to Unknown Host..");
       }
   }

   public void extractEmail() {
       //Creates a Pattern
       Pattern pat = Pattern.compile(pattern);
       //Matches contents against the given Email Address Pattern
       Matcher match = pat.matcher(contents);
       //If match found, append to emailAddresses
       while(match.find()) {
           emailAddresses.add(match.group());
       }
   }

   public void printAddresses() {
       //Check if email addresses have been extracted
       if(emailAddresses.size() > 0) {
           //Print out all the extracted emails
           System.out.println("\tExtracted Email Addresses: ");
           for(String emails : emailAddresses) {
               System.out.println(emails);
           }
       } else {
           //In case, no email addresses were extracted
           System.out.println("\tNo emails were extracted!");
       }
   }

   public void saveAddresses(String filename) {
       //Create a new .txt file
       File file = new File(filename + ".txt");
       //Setting charset
       Charset charset = Charset.forName("UTF-8");

       //Create a BufferedWriter that writes emails to the file using the charset above
       try(BufferedWriter write = java.nio.file.Files.newBufferedWriter(file.toPath(), charset)) {
           //Write each email address on a newline in the file
           for(String emails : emailAddresses) {
               write.write(emails);
               write.newLine();
           }
           System.out.println("\tEmails have been saved to " + filename + ".txt");
       } catch (IOException ex) {
           System.out.println("\tCould not save email addresses to text file!");
       }
   }

   public static void main (String[] args) {

       EmailExtractor extract;

       //Check if arguments are supplied and URL is supplied
       if(args.length > 0 && args[0] != null) {

           extract = new EmailExtractor(args[0]);//Initialize Extractor with URL
           extract.readContents(); //Read the URL contents
           extract.extractEmail(); //Extract the email addresses

           //If -s in args, then save the emails
           if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
               extract.saveAddresses(args[2]); //Save the email address in a file with name from args[2]
           //Just print them normally
           } else {
               extract.printAddresses(); //Otherwise normally display the email addresses
           }
       } else {
           System.out.println("\tInvalid Arguments supplied...");
       }
   }
}

This tutorial was fun. I'll write a separate tutorial about Command Line Arguments in Java so that if you have any kind of confusion regarding that topic, you can clear it up. Have fun coding :)

Procurity: Original Blog Post

Regards,
Ex094


RE: Harvesting Email Addresses from a Website using Java - mothered - 08-26-2016

I need to shift my syntax away from PHP and SQL and apply myself here.

I've bookmarked this, and shall delve into it when time permits. A very well documented, elaborated and formatted guide.
Thanks @"Ex094", appreciated.


RE: Harvesting Email Addresses from a Website using Java - BORW3 - 08-26-2016

Nice one, hope to see more.


RE: Harvesting Email Addresses from a Website using Java - Hu3c0 - 08-26-2016

[Image: thumbs-up-192.png]

Create or replace procedure hugs ( word Varchar2)
As
v_hugs_string varchar2(4000);
Begin
For i IN REVERSE 1..LENGTH(word) LOOP
v_hugs_string:=v_hugs_string|| SUBSTR(word,i,1);
End LOOP;
DBMS_OUTPUT.PUT_LINE('Your reverse String '||v_hugs_string);
DBMS_OUTPUT.PUT_LINE('You''re very helpful, thanks for sharing your knowledge');
End hugs;

begin
hugs('Blessyou brother');
end;


RE: Harvesting Email Addresses from a Website using Java - Xiledcore - 09-13-2016

Really cool tutorial! Awesome job. :D

Also: if you want to work with HTML content in Java, JSoup is a great library for that.


RE: Harvesting Email Addresses from a Website using Java - Hu3c0 - 09-13-2016

(09-13-2016, 02:36 PM)Xiledcore Wrote: Really cool tutorial! Awesome job. :D

Also: if you want to work with HTML content in Java, JSoup is a great library for that.

I vouch...!


RE: Harvesting Email Addresses from a Website using Java - AnonymousDenial - 09-13-2016

Awesome tutorial. You can monetize easily off of this.


RE: Harvesting Email Addresses from a Website using Java - default - 02-23-2018

Let's see where this takes us.