PHP: Offline browsers [Crawlers & Spiders][+ 6rep]



Luckyrare
26-02-2007, 09:33 PM
Is there a way of telling real users apart from crawlers? I don't want to change what the crawler sees, just log that stat differently in MySQL.

Could you tell me any functions that identify this, or a way to do it?

Thanks guys

timROGERS
26-02-2007, 09:56 PM
Find out the user agent names of the bots, put them in an array, and log the visit differently whenever the visitor's user agent matches one of them.

Alternatively, I'd suggest you get some bot IPs and do the same with them.
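
A minimal sketch of both suggestions above, assuming a small hand-made list. The function name looksLikeCrawler and both arrays are hypothetical; the IP prefixes are taken from the Google list later in this thread, and the agent names are only examples, not a complete list.

```php
<?php
// Hypothetical, incomplete lists for illustration only.
$crawlerAgents = array('Googlebot', 'msnbot', 'Slurp', 'ia_archiver');
$crawlerIpPrefixes = array('66.249.64.', '66.249.65.', '64.68.90.');

// Returns true if the user agent contains a known crawler name
// or the IP address starts with a known crawler prefix.
function looksLikeCrawler($userAgent, $ip, array $agents, array $ipPrefixes)
{
    foreach ($agents as $needle) {
        // stripos() is a case-insensitive substring search
        if (stripos($userAgent, $needle) !== false) {
            return true;
        }
    }
    foreach ($ipPrefixes as $prefix) {
        // strpos() === 0 means the IP begins with the prefix
        if (strpos($ip, $prefix) === 0) {
            return true;
        }
    }
    return false;
}

// Both values may be missing (for example on the command line), so default to ''.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip    = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';

// Value you could store in a column of your stats table.
$visitorType = looksLikeCrawler($agent, $ip, $crawlerAgents, $crawlerIpPrefixes) ? 'crawler' : 'user';
```

The result is just a label, so the page the visitor sees stays the same and only the logged stat differs.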

Luckyrare
27-02-2007, 08:09 AM
That's what I was thinking of doing as a last resort, as it won't pick up lesser-known crawlers. Anyway, if that's the only way to do it, I'll work on it ;')

Heinous
27-02-2007, 12:33 PM
You could just (if I'm on the right track) use JavaScript: use document.write to place a dynamic image (i.e. image.php) that outputs a 1x1 transparent image wherever you like, while the script itself does whatever you want it to.

As far as I know, web crawlers don't have JavaScript enabled.
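
A rough sketch of that idea, assuming the image script is called image.php as in the post. The function names are hypothetical, the logging step is left as a comment, and the whole approach relies on crawlers not running JavaScript, which does not hold for every crawler.

```php
<?php
// Raw bytes of a 1x1 transparent GIF, stored as base64.
function transparentPixel()
{
    return base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
}

// HTML for the page: the <img> only exists if the client runs the script,
// so only JavaScript-enabled clients ever request the image URL.
function beaconSnippet($url)
{
    $img = '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" width="1" height="1" alt="">';
    // json_encode() turns the HTML into a safely quoted JavaScript string literal
    return '<script type="text/javascript">document.write(' . json_encode($img) . ');</script>';
}

// Body of image.php: record the visit, then send the pixel.
function servePixel()
{
    // ... log the "real user" hit here (for example an INSERT into your stats table) ...
    header('Content-Type: image/gif');
    header('Cache-Control: no-store'); // ask browsers not to cache, so each view is requested
    echo transparentPixel();
}

// In the page template:  echo beaconSnippet('image.php');
// In image.php:          servePixel();
```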

Luckyrare
01-03-2007, 08:40 PM
How do I get the user's or server's DNS/host information? Matching by IP looks like quite a task because of the number of IPs they use.

Google uses loads:

209.185.108, 209.185.253, 216.239.33.96, 216.239.33.97, 216.239.33.98, 216.239.33.99, 216.239.37.98, 216.239.37.99, 216.239.39.98, 216.239.39.99, 216.239.41.96, 216.239.41.97, 216.239.41.98, 216.239.41.99, 216.239.45.4, 216.239.46, 216.239.51.96, 216.239.51.97, 216.239.51.98, 216.239.51.99, 216.239.53.98, 216.239.53.99, 216.239.57.96, 216.239.57.97, 216.239.57.98, 216.239.57.99, 216.239.59.98, 216.239.59.99, 216.33.229.163, 64.233.173.193, 64.233.173.194, 64.233.173.195, 64.233.173.196, 64.233.173.197, 64.233.173.198, 64.233.173.199, 64.233.173.200, 64.233.173.201, 64.233.173.202, 64.233.173.203, 64.233.173.204, 64.233.173.205, 64.233.173.206, 64.233.173.207, 64.233.173.208, 64.233.173.209, 64.233.173.210, 64.233.173.211, 64.233.173.212, 64.233.173.213, 64.233.173.214, 64.233.173.215, 64.233.173.216, 64.233.173.217, 64.233.173.218, 64.233.173.219, 64.233.173.220, 64.233.173.221, 64.233.173.222, 64.233.173.223, 64.233.173.224, 64.233.173.225, 64.233.173.226, 64.233.173.227, 64.233.173.228, 64.233.173.229, 64.233.173.230, 64.233.173.231, 64.233.173.232, 64.233.173.233, 64.233.173.234, 64.233.173.235, 64.233.173.236, 64.233.173.237, 64.233.173.238, 64.233.173.239, 64.233.173.240, 64.233.173.241, 64.233.173.242, 64.233.173.243, 64.233.173.244, 64.233.173.245, 64.233.173.246, 64.233.173.247, 64.233.173.248, 64.233.173.249, 64.233.173.250, 64.233.173.251, 64.233.173.252, 64.233.173.253, 64.233.173.254, 64.233.173.255, 64.68.80, 64.68.81, 64.68.82, 64.68.83, 64.68.84, 64.68.85, 64.68.86, 64.68.87, 64.68.88, 64.68.89, 64.68.90.1, 64.68.90.10, 64.68.90.11, 64.68.90.12, 64.68.90.129, 64.68.90.13, 64.68.90.130, 64.68.90.131, 64.68.90.132, 64.68.90.133, 64.68.90.134, 64.68.90.135, 64.68.90.136, 64.68.90.137, 64.68.90.138, 64.68.90.139, 64.68.90.14, 64.68.90.140, 64.68.90.141, 64.68.90.142, 64.68.90.143, 64.68.90.144, 64.68.90.145, 64.68.90.146, 64.68.90.147, 64.68.90.148, 64.68.90.149, 64.68.90.15, 64.68.90.150, 64.68.90.151, 64.68.90.152, 64.68.90.153, 64.68.90.154, 
64.68.90.155, 64.68.90.156, 64.68.90.157, 64.68.90.158, 64.68.90.159, 64.68.90.16, 64.68.90.160, 64.68.90.161, 64.68.90.162, 64.68.90.163, 64.68.90.164, 64.68.90.165, 64.68.90.166, 64.68.90.167, 64.68.90.168, 64.68.90.169, 64.68.90.17, 64.68.90.170, 64.68.90.171, 64.68.90.172, 64.68.90.173, 64.68.90.174, 64.68.90.175, 64.68.90.176, 64.68.90.177, 64.68.90.178, 64.68.90.179, 64.68.90.18, 64.68.90.180, 64.68.90.181, 64.68.90.182, 64.68.90.183, 64.68.90.184, 64.68.90.185, 64.68.90.186, 64.68.90.187, 64.68.90.188, 64.68.90.189, 64.68.90.19, 64.68.90.190, 64.68.90.191, 64.68.90.192, 64.68.90.193, 64.68.90.194, 64.68.90.195, 64.68.90.196, 64.68.90.197, 64.68.90.198, 64.68.90.199, 64.68.90.2, 64.68.90.20, 64.68.90.200, 64.68.90.201, 64.68.90.202, 64.68.90.203, 64.68.90.204, 64.68.90.205, 64.68.90.206, 64.68.90.207, 64.68.90.208, 64.68.90.21, 64.68.90.22, 64.68.90.23, 64.68.90.24, 64.68.90.25, 64.68.90.26, 64.68.90.27, 64.68.90.28, 64.68.90.29, 64.68.90.3, 64.68.90.30, 64.68.90.31, 64.68.90.32, 64.68.90.33, 64.68.90.34, 64.68.90.35, 64.68.90.36, 64.68.90.37, 64.68.90.38, 64.68.90.39, 64.68.90.4, 64.68.90.40, 64.68.90.41, 64.68.90.42, 64.68.90.43, 64.68.90.44, 64.68.90.45, 64.68.90.46, 64.68.90.47, 64.68.90.48, 64.68.90.49, 64.68.90.5, 64.68.90.50, 64.68.90.51, 64.68.90.52, 64.68.90.53, 64.68.90.54, 64.68.90.55, 64.68.90.56, 64.68.90.57, 64.68.90.58, 64.68.90.59, 64.68.90.6, 64.68.90.60, 64.68.90.61, 64.68.90.62, 64.68.90.63, 64.68.90.64, 64.68.90.65, 64.68.90.66, 64.68.90.67, 64.68.90.68, 64.68.90.69, 64.68.90.7, 64.68.90.70, 64.68.90.71, 64.68.90.72, 64.68.90.73, 64.68.90.74, 64.68.90.75, 64.68.90.76, 64.68.90.77, 64.68.90.78, 64.68.90.79, 64.68.90.8, 64.68.90.80, 64.68.90.9, 64.68.91, 64.68.92, 66.249.64, 66.249.65, 66.249.66, 66.249.67, 66.249.68, 66.249.69, 66.249.70, 66.249.71, 66.249.72, 66.249.78, 66.249.79, 72.14.199, 8.6.48


Edit:

I'll do the user agent.
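
On the DNS/host question above: PHP's gethostbyaddr() returns the host name for an IP, and gethostbyname() goes the other way. A hedged sketch, assuming you only care about Google and that its crawler hosts end in googlebot.com or google.com; the function name isVerifiedGooglebot is hypothetical, and the two lookup parameters exist only so the check can be tried without real DNS.

```php
<?php
// Reverse-resolve the IP, check the host name suffix, then forward-resolve
// the host name and make sure it points back at the same IP.
function isVerifiedGooglebot($ip, $reverse = 'gethostbyaddr', $forward = 'gethostbyname')
{
    $host = call_user_func($reverse, $ip);

    // gethostbyaddr() returns the unchanged IP on failure, or false on bad input
    if ($host === false || $host === $ip) {
        return false;
    }

    // The suffix must be anchored at the end, otherwise
    // "googlebot.com.evil.example" would pass.
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // gethostbyname() returns the host name unchanged on failure,
    // which can never equal the IP, so a failed lookup is rejected too.
    return call_user_func($forward, $host) === $ip;
}

// Typical call: isVerifiedGooglebot($_SERVER['REMOTE_ADDR']);
```

DNS lookups are slow, so a check like this is probably better run on hits that already claim to be a crawler than on every request.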

Mentor
01-03-2007, 08:45 PM
The user agent's the trick. Basically, the user agent of a bot will be substantially different from that of a browser. You don't necessarily need a list of the bots out there either: simply checking for an occurrence of the word "bot" in the browser string will usually turn them up. I'd guess there are probably a number of other differences in it too, if you compared a few with normal browser strings :/
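
A small sketch of that keyword check. The function name mentionsBot is hypothetical, and "crawl" and "spider" are extra guesses added next to the suggested word "bot".

```php
<?php
// Returns true if the user agent string contains "bot", "crawl" or "spider"
// anywhere, ignoring case.
function mentionsBot($userAgent)
{
    return preg_match('/bot|crawl|spider/i', $userAgent) === 1;
}

// Example strings: a crawler that names itself, a browser, and a crawler
// whose name contains none of the keywords.
$googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$firefox   = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2';
$alexa     = 'ia_archiver';

$results = array(
    'googlebot' => mentionsBot($googlebot), // matched via "bot"
    'firefox'   => mentionsBot($firefox),   // no keyword present
    'alexa'     => mentionsBot($alexa),     // a crawler, but missed by this check
);
```

The last case is why a keyword check is usually combined with a list of known names.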

Luckyrare
01-03-2007, 09:13 PM
OK, thanks. So something like this would do the trick; I haven't finished it yet. I'm pretty sure I will have to list all of them, as they are all in different formats.

http://willmacc.wordpress.com/bot-ips/



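$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Case-insensitive substring test. eregi() was removed in PHP 7, and as a regex
// it read "(" and ")" as grouping, so 'Lycos_Spider_(modspider)' never matched
// the literal name.
function agent_has($agent, $needle) {
    return stripos($agent, $needle) !== false;
}

// "||" is the logical OR; the original "|" is the bitwise operator.
if (agent_has($agent, 'google') || agent_has($agent, 'gsa-crawler')) {
    // Googlebot
}
elseif (agent_has($agent, 'search.msn.com') || agent_has($agent, 'MS Search 4.0 Robot')) {
    // MSNBot
}
elseif (agent_has($agent, 'ZyBorg/1.0')) {
    // WISEnut
}
elseif (agent_has($agent, 'Scooter/3.3Y!CrawlX')) {
    // Alta Vista
}
elseif (agent_has($agent, 'Ask Jeeves/Teoma') || agent_has($agent, 'AskJeeves/Teoma') || agent_has($agent, 'teoma_agent1') || agent_has($agent, 'ask.com')) {
    // Ask Jeeves/Teoma
}
elseif (agent_has($agent, 'Lycos_Spider_(modspider)')) {
    // Lycos
}
elseif (agent_has($agent, 'ia_archiver')) {
    // Alexa
}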

Mentor
01-03-2007, 09:28 PM
Hmm, true. Maybe use a DB to help you out?
http://www.robotstxt.org/wc/active/html/index.html
http://www.siteware.ch/webresources/useragents/spiders/
http://www.pgts.com.au/pgtsj/pgtsj0208d.html
etc?

Luckyrare
01-03-2007, 09:34 PM
Nice, I'll build up a list tomorrow. Thanks for your help.

Blinger1
02-03-2007, 05:23 AM
Why not just use something like

if (stripos($agent, 'Mozilla/5.0') !== false) {

// WISEnut

}

or something?

Luckyrare
02-03-2007, 04:45 PM
I think that would pick up some real users as well, and mix them up with a load of other bots.
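
To illustrate why: a quick check, assuming the two example user agent strings below (a Firefox 2 string and the string Googlebot sends). The helper name hasMozilla5 is hypothetical.

```php
<?php
// Example user agent strings: one browser, one crawler.
$firefox   = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2';
$googlebot = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';

// Case-insensitive test for the "Mozilla/5.0" token.
function hasMozilla5($userAgent)
{
    return stripos($userAgent, 'Mozilla/5.0') !== false;
}

// Both strings contain the token, so this test alone cannot
// separate the browser from the crawler.
$browserMatches = hasMozilla5($firefox);
$crawlerMatches = hasMozilla5($googlebot);
```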
