Harvesting Email Addresses from a Website using Java
08-26-2016, 04:09 AM
Github: https://github.com/Ex094/Email-Extractor
In this tutorial we'll create a small Java command-line application that extracts email addresses from websites. A program like this comes in handy for people who are into advertising and the like.
So before we jump into programming, let's think about the steps the program needs.
Since we are extracting emails from a website, we'll need to ask the user for the website's URL. Having the URL doesn't magically give us the emails; we first have to fetch the contents behind it. Once we have the contents, how do we extract the emails? You guessed it: with a regular expression (REGEX).
So the steps are:
1. Get the website URL from the user
2. Get the contents of the URL
3. Run REGEX on the contents
4. Print out the email addresses extracted by the REGEX from the contents
Now that we have a basic layout of our program, let's start coding part by part, adding improvements along the way. First we will create our EmailExtractor class.
Code:
/**
 * @author ex094
 */
public class EmailExtractor {
}
Handling URL
We will initialize the EmailExtractor with a URL that the user supplies as a command-line argument; we'll cover that part at the end. For now, we'll create the EmailExtractor constructor, which takes the URL as a String argument and initializes the URL object.
So the code is:
Code:
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {
    URL url; //URL Instance Variable

    EmailExtractor(String url) {
        this.url = new URL(url); //Initializing our URL object
    }
}
If you are new to Java's URL class: it lets us open a connection to the specified URL and read data from it. The URL must include the protocol (http/https); otherwise the URL constructor throws a MalformedURLException. That is a checked exception, so the code above will not compile until we handle it, hence we enclose the statement in try..catch.
Code:
import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {
    URL url; //URL Instance Variable

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }
}
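To see the protocol rule in isolation, here is a minimal sketch. The class UrlCheckSketch and its method hasValidProtocol are made-up names for illustration and are not part of the EmailExtractor. It only relies on the URL(String) constructor throwing MalformedURLException when the protocol is missing (on recent JDKs that constructor is marked deprecated, but it still works).

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper class, used only to illustrate the protocol check
class UrlCheckSketch {
    // Returns true when the string parses as a URL with a known protocol
    static boolean hasValidProtocol(String candidate) {
        try {
            new URL(candidate);
            return true;
        } catch (MalformedURLException ex) {
            // Thrown, for example, when the "http://" part is missing
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasValidProtocol("http://www.google.com")); // true
        System.out.println(hasValidProtocol("www.google.com"));        // false
    }
}
```

Note that this only checks the format of the string; it says nothing about whether the host actually exists.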
Getting URL Contents
In the previous section we initialized our URL object to hold the user's URL. Now we need to read the contents of the URL and store them in a variable so that we can later apply the regex and extract email addresses.
Let's create a method readContents which reads the contents from the URL. It uses a BufferedReader to read the InputStream from the URL object and saves the contents in a StringBuilder variable.
Code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {
    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }

    public void readContents() {
        //Open Connection to URL and get stream to read
        BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
        contents = new StringBuilder();
        //Read and Save Contents to StringBuilder variable
        String input = "";
        while((input = read.readLine()) != null) {
            //Keep the line break so text on adjacent lines does not run together
            contents.append(input).append('\n');
        }
    }
}
url.openStream() opens a connection to the URL and returns an InputStream so that we can read its data. The InputStreamReader turns those bytes into characters, and the BufferedReader buffers them so that we can read one line at a time.
The readContents method is complete, but there's a problem: openStream() and readLine() can throw an IOException, for example when the URL supplied by the user is in the correct format but the host doesn't actually exist on the internet. We need to handle that exception too, so we simply surround the whole block with try..catch.
Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * @author ex094
 */
public class EmailExtractor {
    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                //Keep the line break so text on adjacent lines does not run together
                contents.append(input).append('\n');
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
            System.exit(1); //Nothing was read, so there is nothing to search
        }
    }
}
Now if the user enters a URL like http://123asd.com, which doesn't exist, our program will print the message Unable to read URL due to Unknown Host..
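As a variation on readContents, here is a hedged sketch that closes the reader with try-with-resources and names the character set explicitly. ReadSketch and read are hypothetical names, UTF-8 is an assumption about the page's encoding, and a temporary local file stands in for a web page so the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical class, shown only to illustrate the reading step
class ReadSketch {
    // Reads everything behind the URL into a String, one '\n' per line
    static String read(URL url) throws IOException {
        StringBuilder contents = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) automatically
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                contents.append(line).append('\n');
            }
        }
        return contents.toString();
    }

    public static void main(String[] args) throws IOException {
        // A temporary local file stands in for a web page so the sketch runs offline
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<p>first line</p>\n<p>second line</p>\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(read(page.toUri().toURL()));
        Files.delete(page);
    }
}
```

Here the method lets the IOException propagate to the caller instead of printing a message, which leaves the decision about what to tell the user to main.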
Extracting Email Addresses Using REGEX
The contents we get from the URL are messy HTML. Using a regular expression pattern for email addresses, we can find the matching strings in the content.
The regular expression used is: \b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b
We will create an extractEmail method which uses the regex to search for email addresses in the contents. Each hit could be stored in an ArrayList of Strings, but since the same email may appear more than once on a page, we will use a Set to keep the addresses unique.
Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {
    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                //Keep the line break so text on adjacent lines does not run together
                contents.append(input).append('\n');
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
            System.exit(1); //Nothing was read, so there is nothing to search
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, add it to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }
}
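To try the pattern without fetching a page, here is a small sketch that runs the same regular expression over a hard-coded HTML snippet. RegexSketch, extract and the sample addresses are invented for the example. A LinkedHashSet is used instead of HashSet only so that the printed order matches the order of first appearance.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, shown only to exercise the tutorial's pattern
class RegexSketch {
    // Same pattern as in the tutorial, compiled once
    static final Pattern EMAIL =
            Pattern.compile("\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b");

    // Collects every distinct match, in order of first appearance
    static Set<String> extract(CharSequence text) {
        Set<String> found = new LinkedHashSet<>();
        Matcher match = EMAIL.matcher(text);
        while (match.find()) {
            found.add(match.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Contact <a href=\"mailto:alice@example.com\">alice@example.com</a>"
                + " or bob@example.org.</p>";
        System.out.println(extract(html)); // [alice@example.com, bob@example.org]
    }
}
```

The address that appears twice is stored once, and the full stop after bob@example.org is not included in the match. Keep in mind that the character classes in this pattern do not include characters such as _ or +, so addresses that use them in the part before the @ can be missed or cut short.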
Printing out Email Addresses
To print the email addresses from the emailAddresses set to the command line, we will create a method printAddresses:
Code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {
    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                //Keep the line break so text on adjacent lines does not run together
                contents.append(input).append('\n');
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
            System.exit(1); //Nothing was read, so there is nothing to search
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, add it to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }

    public void printAddresses() {
        //Check if email addresses have been extracted
        if(emailAddresses.size() > 0) {
            //Print out all the extracted emails
            System.out.println("Extracted Email Addresses: ");
            for(String emails : emailAddresses) {
                System.out.println(emails);
            }
        } else {
            //In case, no email addresses were extracted
            System.out.println("No emails were extracted!");
        }
    }
}
The printAddresses method first checks whether the Set is non-empty, i.e. whether any emails were found. If so, it prints all of the email addresses; if the Set's size is zero because no email addresses were found in the website contents, it prints No emails were extracted!
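A HashSet does not promise any particular iteration order, so the addresses may come out in a different order than they appear on the page. If sorted output is preferred, one possible variation (a sketch only; PrintSketch and format are hypothetical names) copies the set into a TreeSet before printing:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, shown only to illustrate sorted output
class PrintSketch {
    // Builds the text to print; the TreeSet copy yields alphabetical order
    static String format(Set<String> emailAddresses) {
        if (emailAddresses.isEmpty()) {
            return "No emails were extracted!";
        }
        StringBuilder out = new StringBuilder("Extracted Email Addresses: ");
        for (String email : new TreeSet<>(emailAddresses)) {
            out.append('\n').append(email);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> found = new HashSet<>();
        found.add("bob@example.org");
        found.add("alice@example.com");
        System.out.println(format(found));
    }
}
```

Returning the text instead of printing it directly also makes the method easy to check in a test.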
Saving Email Addresses to a Text File (Extra)
Suppose a site you just scraped contains 1000 email addresses and all of them get printed on your terminal; scrolling through and copy-pasting them is time consuming and annoying. So instead we can create a method called saveAddresses which saves all the extracted email addresses to a file with a name chosen by the user.
Code:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author ex094
 */
public class EmailExtractor {
    String pattern = "\\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z0-9.-]+\\b"; //Email Address Pattern
    URL url; //URL Instance Variable
    StringBuilder contents; //Stores our URL Contents
    Set<String> emailAddresses = new HashSet<>(); //Contains unique email addresses

    EmailExtractor(String url) {
        try {
            this.url = new URL(url); //Initializing our URL object
        } catch (MalformedURLException ex) {
            System.out.println("Please include Protocol in your URL e.g. http://www.google.com");
            System.exit(1);
        }
    }

    public void readContents() {
        try {
            //Open Connection to URL and get stream to read
            BufferedReader read = new BufferedReader(new InputStreamReader(url.openStream()));
            contents = new StringBuilder();
            //Read and Save Contents to StringBuilder variable
            String input = "";
            while((input = read.readLine()) != null) {
                //Keep the line break so text on adjacent lines does not run together
                contents.append(input).append('\n');
            }
        } catch (IOException ex) {
            System.out.println("Unable to read URL due to Unknown Host..");
            System.exit(1); //Nothing was read, so there is nothing to search
        }
    }

    public void extractEmail() {
        //Creates a Pattern
        Pattern pat = Pattern.compile(pattern);
        //Matches contents against the given Email Address Pattern
        Matcher match = pat.matcher(contents);
        //If match found, add it to emailAddresses
        while(match.find()) {
            emailAddresses.add(match.group());
        }
    }

    public void printAddresses() {
        //Check if email addresses have been extracted
        if(emailAddresses.size() > 0) {
            //Print out all the extracted emails
            System.out.println("Extracted Email Addresses: ");
            for(String emails : emailAddresses) {
                System.out.println(emails);
            }
        } else {
            //In case, no email addresses were extracted
            System.out.println("No emails were extracted!");
        }
    }

    public void saveAddresses(String filename) {
        //Create a new .txt file
        File file = new File(filename + ".txt");
        //Setting charset
        Charset charset = Charset.forName("UTF-8");
        //Create a BufferedWriter that writes emails to the file using that charset
        try(BufferedWriter write = Files.newBufferedWriter(file.toPath(), charset)) {
            //Write each email address on a newline in the file
            for(String emails : emailAddresses) {
                write.write(emails);
                write.newLine();
            }
        } catch (IOException ex) {
            System.out.println("Could not save email addresses to text file!");
        }
    }
}
The File object represents a text file with the name passed as an argument to the method (plus a .txt extension); the file itself is created when the BufferedWriter opens it. The BufferedWriter then writes the email addresses to the text file, each on a new line. In case there's a problem writing the file, the try..catch block handles the IOException.
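For comparison, the same idea can be written more compactly with java.nio.file.Files.write, which accepts any Iterable of lines and a charset. This is a sketch under assumptions, not the tutorial's method: SaveSketch and save are hypothetical names, and the demo writes into a temporary directory so it does not touch the working directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical class, shown only to illustrate the saving step
class SaveSketch {
    // Writes one address per line to <filename>.txt and returns the path written
    static Path save(Set<String> emailAddresses, String filename) throws IOException {
        Path target = Paths.get(filename + ".txt");
        // Files.write creates (or overwrites) the file and appends a line break after each entry
        return Files.write(target, emailAddresses, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Set<String> found = new TreeSet<>();
        found.add("alice@example.com");
        found.add("bob@example.org");
        // A temporary directory keeps the demo away from the working directory
        Path dir = Files.createTempDirectory("emails-demo");
        Path written = save(found, dir.resolve("emails").toString());
        System.out.println(Files.readAllLines(written, StandardCharsets.UTF_8));
    }
}
```

As with the reading sketch, the IOException is passed on to the caller here rather than being turned into a printed message.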
Command Line Arguments
So we are basically done creating our EmailExtractor class and its essential methods. Now we need to handle the user's input. Time to create our main method:
Code:
public static void main (String[] args) {
}
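You've written this piece of code thousands of times, yet have you ever wondered what String[] args means?
String[] args is simply an array of Strings that contains the command-line arguments passed by the user.
When no arguments are passed, the args array is empty: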
Code:
args = []
So when you type something like this in your terminal:
Code:
java EmailExtractor hello world
The args array becomes:
Code:
args = ["hello", "world"]
Since it's a regular Java array, we can access the passed arguments by index. So if I wanted to see the first argument the user passed, I would simply do:
Code:
System.out.println(args[0]);
And it will print hello.
Another thing to keep in mind is that args is just the name of the array. You can name it anything, like String[] myArguments, but it's recommended that you follow the convention and keep it as String[] args.
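One detail worth seeing in code: reading args[0] when no arguments were passed throws an ArrayIndexOutOfBoundsException, so it is safer to check args.length first. The sketch below (ArgsSketch and describe are hypothetical names) lists every argument together with its index.

```java
// Hypothetical class, shown only to illustrate how args is indexed
class ArgsSketch {
    // Describes each argument with its index, or reports that there are none
    static String describe(String[] args) {
        if (args.length == 0) {
            return "no arguments";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < args.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append("args[").append(i).append("] = ").append(args[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```

Running it as java ArgsSketch hello world prints args[0] = hello, args[1] = world, and running it with no arguments prints no arguments.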
Handling Command Line Arguments
For our application, we will have 2 arguments:
1. URL of the website
2. Save Email Addresses
The first argument is required; the second is optional and controls whether the addresses are saved to a file. When you run the app with just the URL as the argument, it extracts the email addresses and prints them. If you want to save those email addresses to a text file instead, add the extra argument followed by the name of the file:
Code:
java EmailExtractor http://www.google.com -s emails
-s is the argument that indicates the user wants to save the addresses to a file, and emails is the name of that file (saveAddresses appends the .txt extension). So our main method becomes:
Code:
public static void main (String[] args) {
    EmailExtractor extract;
    //Check if arguments are supplied and URL is supplied
    if(args.length > 0 && args[0] != null) {
        //If length of args is 3 and -s in args, then save the emails
        if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
        } else {
            //Just print them normally
        }
    } else {
        System.out.println("Invalid Arguments supplied...");
    }
}
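In the if condition we check that arguments were supplied by the user, specifically the URL; if the URL is missing, the program simply tells the user Invalid Arguments supplied... If the URL is present and -s along with a file name has also been supplied, we save the email addresses to a file using the saveAddresses method; otherwise the list of email addresses is simply displayed. Now our code becomes:
Code:
public static void main (String[] args) {
    EmailExtractor extract;
    //Check if arguments are supplied and URL is supplied
    if(args.length > 0 && args[0] != null) {
        extract = new EmailExtractor(args[0]); //Initialize Extractor with URL
        extract.readContents(); //Read the URL contents
        extract.extractEmail(); //Extract the email addresses
        //If -s in args, then save the emails
        if(args.length == 3 && args[1] != null && args[1].equals("-s") && args[2] != null) {
            extract.saveAddresses(args[2]); //Save the email addresses in a file with name from args[2]
        } else {
            //Just print them normally
            extract.printAddresses(); //Otherwise normally display the email addresses
        }
    } else {
        System.out.println("Invalid Arguments supplied...");
    }
}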
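The argument checks can also be pulled out of main into a small method, which makes them easy to test on their own. The following is only a sketch: CliSketch, Options and parse are made-up names, and it is deliberately stricter than the main method above, which also accepts other argument counts and simply prints the addresses in those cases.

```java
// Hypothetical class, shown only to illustrate separating argument parsing from main
class CliSketch {
    // Hypothetical holder for the parsed arguments
    static final class Options {
        final String url;
        final String saveName; // null means "print to the terminal"

        Options(String url, String saveName) {
            this.url = url;
            this.saveName = saveName;
        }
    }

    // Returns null when the arguments match neither "<url>" nor "<url> -s <name>"
    static Options parse(String[] args) {
        if (args.length == 1) {
            return new Options(args[0], null);
        }
        if (args.length == 3 && "-s".equals(args[1])) {
            return new Options(args[0], args[2]);
        }
        return null;
    }

    public static void main(String[] args) {
        Options options = parse(args);
        if (options == null) {
            System.out.println("Invalid Arguments supplied...");
        } else if (options.saveName != null) {
            System.out.println("Would save addresses from " + options.url + " to " + options.saveName + ".txt");
        } else {
            System.out.println("Would print addresses from " + options.url);
        }
    }
}
```

In the real program, the two "Would ..." branches are where the calls to saveAddresses and printAddresses would go.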
And our Email Extractor is complete. Build the jar file using NetBeans and run this command in the terminal:
Code:
java -jar emailExtractor.jar https://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
It'll produce the following output:
![[Image: screenshot-from-2016-08-25-23-26-53.png?w=756]](https://procurity.files.wordpress.com/2016/08/screenshot-from-2016-08-25-23-26-53.png?w=756)
The complete code is available in the GitHub repository linked at the top of this post.
This tutorial was fun. I'll write a separate tutorial about command-line arguments in Java so that you can clear up any confusion you have about that topic. Have fun coding!
Procurity: Original Blog Post
Regards,
Ex094