[nycphp-talk] Curl & Traversing Pages
Joseph Crawford
codebowl at gmail.com
Wed Nov 23 10:30:09 EST 2005
Guys, I am still in need of help with this ;)
Here is an explanation of what I have tried so far.
I am trying to fetch data from yellowpages.superpages.com. The script
I have written works like this: I feed it a category URL, it grabs the
records and checks whether there is a next page. If there is, it
grabs that URL and re-executes with it, repeating until it hits the
last page of the category.
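For anyone following along, the traversal described above could be sketched roughly like this. It is not the actual YellowPages class; traversePages(), the stand-in page data, and the "Next" link markup are all made up for illustration, and the fetch step is passed in as a callback so the loop can be tried without hitting the site.

```php
<?php
// Hypothetical helper, not the poster's code: follow "next page" links
// until a page has none, a URL repeats, or $maxPages is reached.
function traversePages($startUrl, callable $fetch, callable $findNext, $maxPages = 50)
{
    $pages = array();
    $url = $startUrl;
    while ($url !== null && !isset($pages[$url]) && count($pages) < $maxPages) {
        $pages[$url] = $fetch($url);          // grab the records for this page
        $url = $findNext($pages[$url]);       // null when there is no next page
    }
    return $pages;
}

// Stand-in data so the sketch runs without touching the network.
$fakeSite = array(
    'page1' => 'records A <a href="page2">Next</a>',
    'page2' => 'records B <a href="page3">Next</a>',
    'page3' => 'records C',
);
$fetch = function ($url) use ($fakeSite) {
    return $fakeSite[$url];
};
$findNext = function ($html) {
    // Assumed markup: the next-page link is an anchor whose text is "Next".
    if (preg_match('/<a href="([^"]+)">Next<\/a>/', $html, $m)) {
        return $m[1];
    }
    return null;
};

$pages = traversePages('page1', $fetch, $findNext);
echo implode(', ', array_keys($pages)), "\n";
```

In a real run, $fetch would wrap the cURL call and $findNext would use whatever pattern matches the site's paging link.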
The issue I am having is this: it fetches the first page and
grabs the results, but when it tries to fetch the second page I
get the following error:
http://codebowl.dontexist.net/images/ypresult.jpg
What doesn't make any sense to me is that if I echo the URL that was
grabbed (the second page) and paste it into my browser, I get the results
fine and don't see that error page. If I feed the second page's URL to
the script, it grabs the records and then errors when trying to go to the
third page.
The following is the current code I have:
// Split the URL into its base address and query string. Use separate
// variables so that $url stays intact (indexing a string with [0]
// would return only its first character).
$parts = explode('?', $url, 2);
$base = $parts[0];
$query = isset($parts[1]) ? $parts[1] : '';
echo $url . '<br>';
$c = new Curl($base);
$c->SetOpt(CURLOPT_FOLLOWLOCATION, 1);
$c->SetOpt(CURLOPT_RETURNTRANSFER, 1);
//$c->SetOpt(CURLOPT_URL, $url);
$c->SetOpt(CURLOPT_POST, 1);
$c->SetOpt(CURLOPT_POSTFIELDS, $query);
$c->SetOpt(CURLOPT_HEADER, 1);
// Note: CURLOPT_COOKIE expects a cookie string ("name=value"), not a flag.
$c->SetOpt(CURLOPT_COOKIE, 1);
$c->SetOpt(CURLOPT_ENCODING, "gzip,deflate");
// The option value is the header value only, without a "User-Agent=" prefix.
$c->SetOpt(CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
$c->SetOpt(CURLOPT_REFERER, "http://yellowpages.superpages.com/");
$c->SetOpt(CURLOPT_COOKIEJAR, 'e:\htdocs\tmp\cookies\superpages.cookiejar.txt');
//$c->SetOpt(CURLOPT_COOKIEFILE, 'e:\htdocs\tmp\cookies\superpages.cookiefile.txt');
$this->source = $c->Execute();
That's the cURL code I have. I have tried using POST and POSTFIELDS, I
have tried encoding the query string values, and I also have cURL setting
the cookie and have tried with and without that.
Here's the cookie file written by cURL:
# Netscape HTTP Cookie File
# http://www.netscape.com/newsref/std/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
.superpages.com TRUE / FALSE 1290438669 SPC 1132758669510-yellowpages.superpages.com-42072-519770
.superpages.com TRUE / FALSE 0 web
.superpages.com TRUE / FALSE 0 shopping
.superpages.com TRUE / FALSE 0 yp PS:45$
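As a reading aid for the cookie file above: each non-comment line in the Netscape format has seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). The archive shows spaces where the file itself has tabs, so the sample line below is rebuilt with tabs as an assumption. parseCookieLine() is a made-up helper, only a rough sketch.

```php
<?php
// Hypothetical parser for one line of a Netscape-format cookie file.
// Returns null for comments, blank lines, and malformed lines.
function parseCookieLine($line)
{
    $line = rtrim($line, "\r\n");
    if ($line === '' || $line[0] === '#') {
        return null;
    }
    $fields = explode("\t", $line);
    if (count($fields) < 7) {
        return null;
    }
    return array(
        'domain'     => $fields[0],
        'subdomains' => $fields[1] === 'TRUE',
        'path'       => $fields[2],
        'secure'     => $fields[3] === 'TRUE',
        'expires'    => (int) $fields[4], // 0 marks a session cookie
        'name'       => $fields[5],
        'value'      => $fields[6],
    );
}

// Sample line rebuilt with tabs (assumed; the archive collapsed them).
$sample = ".superpages.com\tTRUE\t/\tFALSE\t0\typ\t" . 'PS:45$' . "\n";
$cookie = parseCookieLine($sample);
echo $cookie['name'], ' = ', $cookie['value'], "\n";
```

Read this way, the SPC line is a long-lived cookie, while web, shopping, and yp are session cookies with an expiry of 0.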
Here are the headers output by the cURL exec:
HTTP/1.1 200 OK
P3P: CP="NOI DSP COR DEVa TAIa OUR BUS UNI"
Set-Cookie: SPC=1132758669510-yellowpages.superpages.com-42072-519770; Domain=.superpages.com; Expires=Mon, 22-Nov-2010 15:11:09 GMT; Path=/
Content-Encoding: gzip
Set-Cookie: web=; Domain=.superpages.com; Path=/
Set-Cookie: shopping=; Domain=.superpages.com; Path=/
Set-Cookie: yp=PS:45$; Domain=.superpages.com; Path=/
Content-Type: text/html;charset=ISO-8859-1
Content-Language: en-US
Content-Length: 9082
Date: Wed, 23 Nov 2005 15:11:09 GMT
Server: Apache Coyote/1.0
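To check that the cookies in these headers match what ended up in the cookie jar, the Set-Cookie lines can be pulled out of the raw header block that CURLOPT_HEADER prepends to the response. This is just a sketch; cookiesFromHeaders() is a made-up name, and it ignores cookie attributes such as Domain and Path.

```php
<?php
// Hypothetical helper: collect cookie name => value pairs from the
// Set-Cookie lines of a raw header block.
function cookiesFromHeaders($rawHeaders)
{
    $cookies = array();
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue; // not a Set-Cookie header
        }
        // Keep only the leading "name=value" pair; attributes follow the first ';'.
        $pair = trim(substr($line, strlen('Set-Cookie:')));
        $pair = explode(';', $pair, 2)[0];
        $parts = explode('=', $pair, 2);
        $cookies[trim($parts[0])] = isset($parts[1]) ? trim($parts[1]) : '';
    }
    return $cookies;
}

// A shortened copy of the headers from this post.
$headers = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: web=; Domain=.superpages.com; Path=/\r\n"
    . 'Set-Cookie: yp=PS:45$; Domain=.superpages.com; Path=/' . "\r\n"
    . "Content-Type: text/html;charset=ISO-8859-1\r\n";
$found = cookiesFromHeaders($headers);
print_r($found);
```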
You can also see the page running at the following URL:
http://codebowl.homelinux.net:8001/codebowl/yp.php
What really strikes me as odd is that I can feed the script an array of
URLs and make it loop over each one to grab each page. However, making it
traverse the pages automatically is where I am having the issues.
I have also compared the URL from the actual HTML page with the one
grabbed by cURL.
Here is the one from the HTML page:
http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1&L=VT&CID=00000518939&paging=1&F=1&OO=1&PI=15
And here is the one grabbed by cURL:
http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1&L=VT&CID=00000518939&paging=1&F=1&OO=1&PI=15
As you can see, they are exactly the same, so I am not sure what could
be going wrong here.
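One thing that might be worth ruling out, purely as an assumption and not a diagnosis: two URLs can look identical when echoed into a browser and still differ as raw strings, because a link scraped from HTML source may carry entities such as &amp;amp; that the browser renders as a plain &amp;. A rough sketch of a byte-for-byte check follows; describeUrlDifference() is a made-up name.

```php
<?php
// Hypothetical check: compare a scraped URL with the expected one as raw strings.
function describeUrlDifference($scraped, $expected)
{
    if ($scraped === $expected) {
        return 'identical';
    }
    // An href copied out of HTML source may still contain entities like &amp;
    if (html_entity_decode($scraped, ENT_QUOTES) === $expected) {
        return 'identical after decoding HTML entities';
    }
    return 'different';
}

// Example strings based on the URL in this post; the &amp; form is assumed.
$expected = 'http://yellowpages.superpages.com/listings.jsp?PS=15&CB=1&L=VT';
$scraped  = 'http://yellowpages.superpages.com/listings.jsp?PS=15&amp;CB=1&amp;L=VT';
$verdict = describeUrlDifference($scraped, $expected);
echo $verdict, "\n";
```

Echoing the URL through htmlspecialchars(), or viewing the page source instead of the rendered page, shows the raw string the script is really working with.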
As before, the latest code is located at:
http://codebowl.dontexist.net/codebowl/System/Misc/YellowPages.phps
http://codebowl.dontexist.net/codebowl/System/Misc/Curl.phps
I have taken a look at that Mozilla plugin; however, I don't know what I
am looking for, as it doesn't show what happens in cURL, and that's
really what I need to know.
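For seeing what cURL itself does, the PHP cURL extension has CURLOPT_VERBOSE, which logs the request and response headers, and CURLOPT_STDERR, which redirects that log to a stream. A minimal sketch using the plain curl_* functions rather than the Curl wrapper class (whose internals are not shown here) is below; fetchWithTrace() is a made-up name.

```php
<?php
// Hypothetical helper: fetch a URL and return the body together with
// cURL's verbose log of the exchange.
function fetchWithTrace($url)
{
    $log = fopen('php://temp', 'w+');   // in-memory stream for the verbose log
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_VERBOSE        => true, // log the conversation...
        CURLOPT_STDERR         => $log, // ...to our stream instead of stderr
    ));
    $body = curl_exec($ch);

    rewind($log);
    $trace = stream_get_contents($log);
    fclose($log);

    return array($body, $trace);
}
```

Calling list($body, $trace) = fetchWithTrace($url) and echoing $trace inside a pre tag for both the first and second page requests should make it possible to compare exactly which URL, method, and cookies are sent each time.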
Any help would be appreciated.
--
Joseph Crawford Jr.
Zend Certified Engineer
Codebowl Solutions, Inc.
1-802-671-2021
codebowl at gmail.com