From ocl at gih.com  Thu Nov 26 18:32:49 2015
From: ocl at gih.com (Olivier MJ Crepin-Leblond)
Date: Thu, 26 Nov 2015 19:32:49 +0100
Subject: [IPv6crawler-wg] An important update about the IPv6 Matrix Project
Message-ID: <56575051.3070402@gih.com>

Hello all,

Two noteworthy pieces of news regarding the IPv6 Matrix Project ( http://www.ipv6matrix.org ):

1. I have updated the Web site with the latest results, ending in late October, hence the Crawl display date of November 2015. We now have 294 Gb of comma separated value text data in store, tracing the spread of the IPv6 Internet since July 2010. Altogether, we ran the test approximately 36 times on all of the 1 million busiest domain names in the Alexa list. This represented testing about 6.5 million hosts, carefully collecting traceroute information for each and every one of them. We now have a unique database showing the spread of IPv6 Internet information sources worldwide.

2. Today I took out my very dusty Linux & Python gloves and performed a much needed update to the IPv6 Matrix Crawler input database, including the Alexa 1 million list as well as the GeoIP databases.

Indeed, the Alexa database of the world's 1 million busiest Web sites dated from the Crawler's inception in the first half of 2010. We are now more than 5 years on! In a way, keeping the same input database has kept the base of the crawls steady, which made it possible to compare results from one run to the next. However, the flip side of the coin is that we are ending up with more and more domain names marked as dysfunctional. Nearly 5% of the domain names in the database were unreachable. The updated input database should resolve this, but we might also see a jump in some results. It will be interesting to see what the next run yields.

Why do we not update the input database more often? Because buried in that database are the domain names of the people who wanted to opt out over the years.
Having never thought about this before, I spent several hours going back through 5 years of emails from people complaining about the crawl triggering their firewalls. I put together a blacklist of the domain names I have manually deleted from the crawl input files. The blacklist, as it stands now:

Deleted:
  it-mate.co.uk
  indianic.com
  your-server.de
  catacombscds.com
  dewlance.com
  tcs.com
  printweb.de
  nocser.net
  shoppingnsales.com
  bsaadmail.com
  epayservice.ru
  4footyfans.com
  guitarspeed99.com
  saga.co.uk

Already gone from the current Alexa list:
  infinityautosurf.com
  canada-traffic.com
  usahitz.com
  jawatankosong.com.my
  4d.com.my
  fitnessuncovered.co.uk
  kualalumpurbookfair.com
  xgen-it.com
  bpanet.de
  edns.de
  back2web.de
  waaaouh.com
  every-web.com
  w3sexe.com
  gratuits-web.com
  france-mateur.com
  pliagedepapier.com
  immobilieretparticuliers.com
  chronobio.com
  stickers-origines.com
  tailor-made.co.uk

With these out of the input files, we are able to start the next crawl.

*I hope I have not missed any complaints, but if I have, this is advance notice that we might receive a few emails in the forthcoming weeks. We might also receive a few emails from sites that have appeared on the Alexa 1 million list since 2010.*

Back to the Alexa list: the excellent filtering program that was used to process and clean up the original list was used again for the new one. In the past, the Alexa list contained a number of entries that were actually sub-directories, as well as some invalid domains. Alexa has since tightened things up, and the latest list is much cleaner. It holds 999998 valid domains vs. 984587 domains for the original 2010 list.

Finally, new gTLDs have now appeared in the Alexa list, including some Internationalised Domain Names (IDNs). The world is indeed a very different place!
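For illustration, the clean-up described above could be sketched in Python roughly as follows. This is not the project's actual filtering program: the `rank,domain` input format, the names `is_valid_domain` and `filter_alexa`, the short sample blacklist and the validity rule (plain DNS label syntax, entries containing a "/" dropped) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical excerpt of the opt-out blacklist (the full list is in this email).
BLACKLIST = {"it-mate.co.uk", "saga.co.uk"}

# One DNS label: 1 to 63 letters, digits or hyphens, with no hyphen at
# either end. Punycode IDN labels such as "xn--p1ai" fit this pattern.
LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_valid_domain(name):
    """Return True if name looks like a plain domain, not a sub-directory."""
    if "/" in name or len(name) > 253:
        return False
    labels = name.split(".")
    return len(labels) >= 2 and all(LABEL.fullmatch(label) for label in labels)


def filter_alexa(lines, blacklist=BLACKLIST):
    """Group valid, non-blacklisted domains by TLD.

    Each input line is assumed to look like "rank,domain".
    """
    by_tld = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _rank, _, name = line.partition(",")
        name = name.strip().lower()
        if not is_valid_domain(name) or name in blacklist:
            continue
        tld = name.rsplit(".", 1)[1]
        by_tld[tld].append(name)
    return dict(by_tld)
```

Each key of the returned dictionary is a TLD, so writing one output file per key (for example `guru.csv` or `xn--p1ai.csv`) would give a per-TLD layout like the file names listed in this email. Whether the real tool drops sub-directory entries or trims them back to the bare domain is not stated here, so dropping them is only one possible choice.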
It will be interesting to see how the Crawler, as well as all the other scripts that process the information into displayable data on the Web server, will cope with these:

academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
app.csv          date.csv         horse.csv          parts.csv        team.csv
archi.csv        dating.csv       host.csv           party.csv        tech.csv
associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
bar.csv          direct.csv       investments.csv    place.csv        tools.csv
bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
best.csv         earth.csv        kim.csv            press.csv        training.csv
bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
black.csv        energy.csv       krd.csv            property.csv     uol.csv
blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
boo.csv          events.csv       legal.csv          recipes.csv      video.csv
boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
build.csv        exposed.csv      link.csv           reise.csv        wales.csv
builders.csv     express.csv      live.csv           reisen.csv       wang.csv
business.csv     fail.csv         lol.csv            ren.csv          watch.csv
buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
bzh.csv          farm.csv         love.csv           repair.csv       website.csv
cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
camp.csv         fishing.csv      mango.csv          review.csv       win.csv
capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
care.csv         flights.csv      markets.csv        rip.csv          works.csv
career.csv       foo.csv          media.csv          rocks.csv        world.csv
careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
club.csv         global.csv       nexus.csv          social.csv       xyz.csv
coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
community.csv    google.csv       nyc.csv            space.csv        zone.csv
company.csv      graphics.csv     office.csv         style.csv
computer.csv     gratis.csv       okinawa.csv        sucks.csv

In the meantime I'd like to credit once again the Nile University Crew, expertly led by Sameh El Ansary, for designing and coding a Crawler that has been able to cope with sifting through 5 years of DNS junk with minimal maintenance, save the love and attention I give the servers by keeping them up to date with
patches so they don't end up toppling over. They haven't been rebooted in 464 days and I am keeping my fingers crossed for their well-being.

And of course, thanks to the University of Southampton Crew, who built the excellent 2nd version of the Web site under Tim Chown's supervision.

I am still writing an article for RIPE Labs. I am struggling to find the time to finish it, but I am getting there.

Warmest regards,

Olivier