Wednesday, October 31, 2012

Reading a Remote File Using PHP


To read a remote file from PHP you have at least four options:
1. Use fopen()
2. Use file_get_contents()
3. CURL
4. Make your own function using php's socket functions.
First I need to warn you about something. You can only use fopen() and file_get_contents() with a url when the fopen wrappers are enabled. This setting (allow_url_fopen) is specified in the php.ini file and cannot be changed at run time using ini_set(). To find out whether you can use these two functions, check the value of the setting with the following code:
// ini_get() returns the setting as a string such as '1' or 'On', so normalise it first
if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
   // use fopen() or file_get_contents()
} else {
   // use curl or your custom function
}
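The same check can be extended to cover all the options listed above. Here is a minimal sketch, assuming a helper named pickFetchMethod() (a name made up for this example) that reports the first method available on the current server:

```php
<?php
// Hypothetical helper: returns the name of the first fetch method that is usable here.
function pickFetchMethod()
{
   // ini_get() may return '1', 'On', 'true' and so on, depending on how the value was set
   if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
      return 'fopen';    // fopen() and file_get_contents() accept urls
   }
   if (function_exists('curl_init')) {
      return 'curl';     // the curl extension is loaded
   }
   if (function_exists('fsockopen')) {
      return 'socket';   // fall back to a hand-written http request
   }
   return 'none';
}

echo pickFetchMethod(), "\n";
```

The order of the checks is only a suggestion; swap them if you prefer cURL over the fopen wrappers.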

1. Using fopen()

If you use fopen() to read a remote file, the process is as simple as reading from a local file. The only difference is that you specify the url instead of the file name. Take a look at the example below:
// make sure the remote file is successfully opened before doing anything else
if ($fp = fopen('http://www.google.com/', 'r')) {
   $content = '';
   // keep reading until there's nothing left
   while (!feof($fp)) {
      $chunk = fread($fp, 1024);
      if ($chunk === false) {
         // the read failed, stop here
         break;
      }
      $content .= $chunk;
   }
   fclose($fp);
   // do something with the content here
   // ...
} else {
   // an error occurred when trying to open the specified url
}
Now, the code above uses the fread() function in the while loop to read up to 1024 bytes of data per iteration. The same code can also be written like this:
// make sure the remote file is successfully opened before doing anything else
if ($fp = fopen('http://www.google.com/', 'r')) {
   $content = '';
   // keep reading until there's nothing left
   // (compare with false so that a last line containing only "0" is not lost)
   while (($line = fgets($fp, 1024)) !== false) {
      $content .= $line;
   }
   fclose($fp);
   // do something with the content here
   // ...
} else {
   // an error occurred when trying to open the specified url
}
Instead of fread() we use fgets(), which reads one line of data, up to 1023 bytes (one less than the length argument). The first version is preferable, though. Imagine the remote file is 50 kilobytes and consists of 300 lines. The first version runs the loop about fifty times, while the second runs it three hundred times. (On a network stream fread() may return fewer bytes than requested, so the real count can be somewhat higher.)
If you consider the cost of 300 function calls compared to about 50, the first one is clearly the winner.
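To see the difference without touching the network, here is a small sketch that counts the reads on an in-memory stream. The helper countReads() and the test data are made up for this example, and php://memory stands in for the remote file, so the numbers only illustrate the idea:

```php
<?php
// About 50 kilobytes of text made of 300 lines (170 characters plus a newline each).
$data = str_repeat(str_repeat('x', 170) . "\n", 300);

// Hypothetical helper: counts how many non-empty reads are needed to drain $data.
function countReads($data, $reader)
{
   $fp = fopen('php://memory', 'r+');
   fwrite($fp, $data);
   rewind($fp);

   $calls = 0;
   $content = '';
   // stop on false (error or end of file) or on an empty string (nothing left)
   while (($piece = $reader($fp)) !== false && $piece !== '') {
      $content .= $piece;
      $calls++;
   }
   fclose($fp);

   return array($calls, $content);
}

list($freadCalls, $a) = countReads($data, function ($fp) { return fread($fp, 1024); });
list($fgetsCalls, $b) = countReads($data, function ($fp) { return fgets($fp, 1024); });

echo "fread: $freadCalls reads, fgets: $fgetsCalls reads\n";
```

Both loops end up with the same content; only the number of calls differs.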

2. Using file_get_contents()

This is my favorite way of reading a remote file because it is very simple. Just call this function and specify a url as the parameter. But remember to check the return value first, to see whether it returned an error, before processing the result:
$content = file_get_contents('http://www.google.com/');
if ($content !== false) {
   // do something with the content
} else {
   // an error happened
}
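The reason for !== rather than != is that an empty file (or a file containing just "0") is also "falsy". A minimal sketch, assuming a wrapper named readOrNull() (a made-up name) and using local files so it runs without network access:

```php
<?php
// Hypothetical wrapper: returns the content as a string, or null when the read fails.
function readOrNull($location)
{
   // @ hides the warning PHP raises on failure; the return value is checked instead
   $content = @file_get_contents($location);

   // compare with === because an empty file gives '' which is also "falsy"
   return ($content === false) ? null : $content;
}

// demonstration with local files; a url works the same way when fopen wrappers are on
$tmp = tempnam(sys_get_temp_dir(), 'rf');
file_put_contents($tmp, '');
var_dump(readOrNull($tmp));               // empty file: string(0) ""
var_dump(readOrNull($tmp . '.missing'));  // failure: NULL
unlink($tmp);
```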

3. CURL

Unlike the two methods above, using cURL is not exactly straightforward. Although this library is very useful for connecting and communicating over many different protocols (not just http), it requires more effort to learn. Another problem is that not all web hosts have this library in their PHP installation, so we had better check that the library is installed before trying to use it.
Here is a basic example of fetching a remote file:
// make sure curl is installed
if (function_exists('curl_init')) {
   // initialize a new curl resource
   $ch = curl_init();

   // set the url to fetch
   curl_setopt($ch, CURLOPT_URL, 'http://www.google.com');

   // don't give me the headers just the content
   curl_setopt($ch, CURLOPT_HEADER, 0);

   // return the value instead of printing the response to browser
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

   // use a user agent to mimic a browser
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0');

   $content = curl_exec($ch);
   if ($content === false) {
      // the request failed; curl_error($ch) describes what went wrong
   }

   // remember to always close the session and free all resources
   curl_close($ch);
} else {
   // curl library is not installed so we better use something else
}
In some cases using cURL is faster than using file_get_contents() or fopen(). This is because cURL can request and decode compressed responses (for example gzip). Note that it does not do this by default: you have to set the CURLOPT_ENCODING option, and an empty string asks for every encoding your cURL build supports. Many sites, big and small, use gzip compression to compress their web pages in order to save bandwidth. This site, for example, also uses gzip compression, which cuts the bandwidth used in half. So if you're the type who just can't wait, cURL will suit you best.
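Here is a minimal sketch of a compressed fetch. The function name fetchCompressed() and the 10 second timeout are my own assumptions for this example; the options themselves (CURLOPT_ENCODING, CURLOPT_FAILONERROR, CURLOPT_TIMEOUT) are standard cURL options:

```php
<?php
// Hypothetical helper: fetches $url and lets the server send a compressed response.
// Returns the decoded body, or false if curl is missing or the request fails.
function fetchCompressed($url)
{
   if (!function_exists('curl_init')) {
      return false;
   }

   $ch = curl_init($url);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
   // an empty string asks for every encoding this curl build can decode (gzip, deflate, ...)
   curl_setopt($ch, CURLOPT_ENCODING, '');
   // treat http status codes of 400 and above as a failure
   curl_setopt($ch, CURLOPT_FAILONERROR, true);
   // assumption: give up after 10 seconds
   curl_setopt($ch, CURLOPT_TIMEOUT, 10);

   // curl_exec() returns false on failure; the handle is released when the function returns
   return curl_exec($ch);
}
```

You would call it like $content = fetchCompressed('http://www.google.com/'); and check the result against false, just as with file_get_contents().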


4. Custom functions

In the worst case your server will have fopen wrappers disabled and won't have the cURL library installed. In this sad situation you just have to make your own way.
Our function will be named getRemoteFile() and takes only one parameter, the url of the remote file. The skeleton for this function is shown below:
function getRemoteFile($url)
{
   // 1. get the host name and url path

   // 2. connect to the remote server

   // 3. send the necessary headers to get the file

   // 4. retrieve the response from the remote server

   // 5. strip the headers

   // 6. return the file content
}
To extract the host name and url path from the given url we'll use the parse_url() function. Given a url, this function returns an array containing whichever of the following components are present:
  • scheme
  • host
  • port
  • user
  • pass
  • path
  • query
  • fragment
For example, if the url is http://www.php-mysql-tutorial.com/somepage.php then parse_url() will return :
Array
(
    [scheme] => http
    [host] => www.php-mysql-tutorial.com
    [path] => /somepage.php
)
and if the url is http://myusername:mypassword@www.php-mysql-tutorial.com/somepage.php?q=whatsthis#ouch then parse_url() will return this :
Array
(
    [scheme] => http
    [host] => www.php-mysql-tutorial.com
    [user] => myusername
    [pass] => mypassword
    [path] => /somepage.php
    [query] => q=whatsthis
    [fragment] => ouch
)
For our new function we only care about the host, port, path and query.
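Pulling out just those parts, with sensible defaults for the ones that are missing, might look like this. It is only a sketch, and splitUrl() is a name I made up for the example:

```php
<?php
// Hypothetical helper: returns array(host, port, request path) for an http url,
// or false when the url has no host part.
function splitUrl($url)
{
   $parts = parse_url($url);
   if ($parts === false || !isset($parts['host'])) {
      return false;   // seriously malformed url, or no host in it
   }

   $host = $parts['host'];
   $port = isset($parts['port']) ? $parts['port'] : 80;    // default http port
   $path = isset($parts['path']) ? $parts['path'] : '/';   // a bare host means the root
   if (isset($parts['query'])) {
      $path .= '?' . $parts['query'];
   }

   return array($host, $port, $path);
}

print_r(splitUrl('http://myusername:mypassword@www.php-mysql-tutorial.com/somepage.php?q=whatsthis#ouch'));
```

The user, pass and fragment are simply ignored, since the request we are going to send does not use them.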
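To establish a connection to a remote server we use fsockopen(). Only the hostname is strictly required, but we pass all five arguments: the hostname, the port number, a variable (passed by reference) that receives the error number, another that receives the error message, and the connection timeout in seconds.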
function getRemoteFile($url)
{
   // get the host name and url path
   $parsedUrl = parse_url($url);
   $host = $parsedUrl['host'];
   if (isset($parsedUrl['path'])) {
      $path = $parsedUrl['path'];
   } else {
      // the url is pointing to the host like http://www.mysite.com
      $path = '/';
   }

   if (isset($parsedUrl['query'])) {
      $path .= '?' . $parsedUrl['query'];
   }

   if (isset($parsedUrl['port'])) {
      $port = $parsedUrl['port'];
   } else {
      // most sites use port 80
      $port = '80';
   }

   $timeout = 10;
   $response = '';

   // connect to the remote server, using the port taken from the url
   $fp = @fsockopen($host, (int) $port, $errno, $errstr, $timeout);
   if (!$fp) {
      echo "Cannot retrieve $url";
   } else {
      // send the necessary headers to get the file
      // "Connection: close" asks the server to hang up when it is done,
      // otherwise the read loop below may wait until the socket times out
      fputs($fp, "GET $path HTTP/1.0\r\n" .
                 "Host: $host\r\n" .
                 "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\r\n" .
                 "Accept: */*\r\n" .
                 "Accept-Language: en-us,en;q=0.5\r\n" .
                 "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
                 "Connection: close\r\n" .
                 "Referer: http://$host\r\n\r\n");
      // retrieve the response from the remote server
      while (!feof($fp)) {
         $chunk = fread($fp, 4096);
         if ($chunk === false) {
            break;
         }
         $response .= $chunk;
      }
      fclose($fp);

      // strip the headers (everything up to the first blank line)
      $pos = strpos($response, "\r\n\r\n");
      if ($pos !== false) {
         $response = substr($response, $pos + 4);
      }
   }
   // return the file content
   return $response;
}
The code above sends eight header lines, but only the first two are mandatory. So even if you send only these:
fputs($fp, "GET $path HTTP/1.0\r\n" .
           "Host: $host\r\n\r\n");
the function will likely work correctly. Not always, though. Since the file is stored on a remote server, it's really up to that server whether to reply to your request or not. Some people code their pages to block any request without the proper Referer header. Some will only accept a specific user agent. Others will require cookies to be set in the headers.
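If you find yourself adding and removing headers for different servers, it may help to build the request text separately. A minimal sketch, assuming a helper named buildRequest() (made up for this example) that always sends the two mandatory lines and then whatever extra headers you hand it:

```php
<?php
// Hypothetical helper: builds the text of an HTTP/1.0 GET request.
// $extraHeaders maps header names to values, e.g. array('Referer' => '...').
function buildRequest($host, $path, $extraHeaders = array())
{
   $request = "GET $path HTTP/1.0\r\n" .
              "Host: $host\r\n";
   foreach ($extraHeaders as $name => $value) {
      $request .= "$name: $value\r\n";
   }
   // a blank line marks the end of the headers
   return $request . "\r\n";
}

echo buildRequest('www.php-mysql-tutorial.com', '/somepage.php', array(
   'Referer' => 'http://www.php-mysql-tutorial.com',
   'Cookie'  => 'name=value',   // placeholder cookie, purely illustrative
));
```

The returned string can be passed to fputs() in place of the hard-coded headers in getRemoteFile().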
If you want to see which headers should be sent to successfully fetch a specific remote file, try using Firefox plus the Live HTTP Headers plugin. It's a really useful little tool.
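One more thing worth checking is what the server sent back. getRemoteFile() throws the response headers away, but the first line of them tells you whether the request succeeded. Here is a sketch of a slightly more careful split, assuming a helper named splitResponse() (again a made-up name):

```php
<?php
// Hypothetical helper: splits a raw http response into status code, header lines and body.
// Returns false when the blank line that ends the headers is missing.
function splitResponse($raw)
{
   $pos = strpos($raw, "\r\n\r\n");
   if ($pos === false) {
      return false;   // no blank line, so this is not a complete response
   }

   $headerLines = explode("\r\n", substr($raw, 0, $pos));
   $body        = substr($raw, $pos + 4);

   // the first line looks like "HTTP/1.0 200 OK"
   $statusParts = explode(' ', $headerLines[0], 3);
   $status      = isset($statusParts[1]) ? (int) $statusParts[1] : 0;

   return array($status, $headerLines, $body);
}

list($status, $headers, $body) = splitResponse("HTTP/1.0 404 Not Found\r\nContent-Type: text/html\r\n\r\nnot here");
echo $status, "\n";   // 404
echo $body, "\n";     // not here
```

With this you can treat anything other than status 200 as a failure instead of returning an error page as if it were the file.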


Reference: http://www.php-mysql-tutorial.com/wikis/php-tutorial/reading-a-remote-file-using-php.aspx
