[IPv6crawler-wg] An important update about the IPv6 Matrix Project

Olivier MJ Crepin-Leblond ocl at gih.com
Thu Nov 26 18:32:49 GMT 2015


Hello all,

Two worthy pieces of news regarding the IPv6 Matrix Project (
http://www.ipv6matrix.org ):

1. I have updated the Web site with the latest results ending in late
October - hence noting a Crawl display date of November 2015.
We now have 294 GB of comma-separated value text data in store, tracing
back the spread of the IPv6 Internet since July 2010.
Altogether, we ran the test approximately 36 times on all of Alexa's
1 million busiest domain names. This represented testing of about 6.5
million hosts, carefully collecting traceroute information for each and
every one of them. We now have a unique database that shows the
spread of IPv6 Internet information sources worldwide.

2. Today I took out my very dusty Linux & Python gloves and performed a
much-needed update to the IPv6 Matrix Crawler input database, including
the Alexa 1 million list as well as the GeoIP databases.

Indeed, the Alexa database of the world's 1 million busiest Web sites
dated from the Crawler's inception in the first half of 2010.
That was more than 5 years ago!

In a way, keeping the same input database has kept the base of the
crawls steady, which made it possible to compare results across runs.
However, the flip side of the coin is that we are ending up with more
and more domain names marked as being dysfunctional. Nearly 5% of the
domain names in the database were unreachable. The updated input
database should resolve this, but we might also see a jump in some
results. It will be interesting to see what the next run yields.
Why do we not update the input database more often? Because buried in
that database are the domain names of the people who wanted to opt out
over the years. Having never thought about this, I spent several hours
going back through 5 years of emails from people complaining about the
crawl triggering their firewalls. I put together a blacklist of domain
names, which I have manually deleted from the crawl input files.
The blacklist, as it stands now:

Deleted:

it-mate.co.uk
indianic.com
your-server.de
catacombscds.com
dewlance.com
tcs.com
printweb.de
nocser.net
shoppingnsales.com
bsaadmail.com
epayservice.ru
4footyfans.com
guitarspeed99.com
saga.co.uk

Already gone from the current Alexa list:

infinityautosurf.com
canada-traffic.com
usahitz.com
jawatankosong.com.my
4d.com.my
fitnessuncovered.co.uk
kualalumpurbookfair.com
xgen-it.com
bpanet.de
edns.de
back2web.de
waaaouh.com
every-web.com
w3sexe.com
gratuits-web.com
france-mateur.com
pliagedepapier.com
immobilieretparticuliers.com
chronobio.com
stickers-origines.com
tailor-made.co.uk

With these out of the input files, we are able to start the next crawl.
*I hope I have not missed any complaints, but if I have, this is advance
notice that we might receive a few emails in the forthcoming weeks. We
might also receive a few emails from sites that have appeared on the
Alexa 1 million list since 2010.*
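
For anyone who wants to script the deletion rather than do it by hand,
here is a minimal Python sketch. It is not the procedure actually used
for the crawl. It assumes the input is in the "rank,domain" layout of
the Alexa CSV, and the function name and sample rows are hypothetical.

```python
import csv
import io


def remove_blacklisted(rows, blacklist):
    """Yield (rank, domain) rows whose domain is not blacklisted.

    A domain is dropped if it equals a blacklisted name or is a
    subdomain of one (e.g. www.saga.co.uk for saga.co.uk).
    """
    blocked = {d.strip().lower().rstrip(".") for d in blacklist if d.strip()}
    for rank, domain in rows:
        name = domain.strip().lower().rstrip(".")
        labels = name.split(".")
        # Build the name itself plus every parent suffix of it.
        suffixes = {".".join(labels[i:]) for i in range(len(labels))}
        if suffixes & blocked:
            continue
        yield rank, name


# Hypothetical sample data, assumed to follow the "rank,domain" layout.
sample = io.StringIO(
    "1,example.com\n2,saga.co.uk\n3,www.tcs.com\n4,example.org\n"
)
blacklist = ["saga.co.uk", "tcs.com"]

kept = list(remove_blacklisted(csv.reader(sample), blacklist))
print(kept)
```

Matching on parent suffixes as well as exact names is a design choice:
it errs on the side of not crawling anything under a domain whose owner
has opted out.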

Back to the Alexa list: the excellent filtering program that was used to
process the original list and clean it up was used again for the new
list. In the past, the Alexa list contained a number of entries that
were actually sub-directories rather than domain names, as well as some
invalid domains. Alexa has since cleaned up its act, and the latest list
is much cleaner. It holds 999998 valid domains vs. 984587 domains for
the original 2010 list.
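
To illustrate the kind of clean-up described above, here is a minimal
Python sketch. It is not the project's filtering program, whose rules
are not described here. The checks (reject entries containing a path,
reject names that break the usual host name label syntax) and all names
in the code are assumptions made for illustration.

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, and it must not
# start or end with a hyphen.
LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def clean_entry(entry):
    """Return a normalised domain name, or None if the entry is rejected."""
    name = entry.strip().lower().rstrip(".")
    if "/" in name:              # "sub-directory" entry, e.g. example.com/blog
        return None
    if len(name) > 253:          # longer than a full domain name may be
        return None
    labels = name.split(".")
    if len(labels) < 2:          # need at least a name and a TLD
        return None
    if not all(LABEL.fullmatch(label) for label in labels):
        return None
    if labels[-1].isdigit():     # numeric last label: an IP address, not a domain
        return None
    return name


# Hypothetical sample entries.
entries = [
    "Example.com",
    "example.com/blog",
    "bad_domain.com",
    "-x.com",
    "example.xn--p1ai",
    "192.168.0.1",
]
cleaned = [d for d in map(clean_entry, entries) if d is not None]
print(cleaned)
```

IDNs pass this check in their ASCII ("xn--") form, since that form only
uses letters, digits and hyphens.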

Finally, new gTLDs have now appeared in the Alexa list, including some
Internationalised Domain Names (IDNs). The world is indeed a very
different place!
It will be interesting to see how the Crawler, as well as all the other
scripts that process the information into displayable data on the Web
server, will cope with these:

academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
app.csv          date.csv         horse.csv          parts.csv        team.csv
archi.csv        dating.csv       host.csv           party.csv        tech.csv
associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
bar.csv          direct.csv       investments.csv    place.csv        tools.csv
bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
best.csv         earth.csv        kim.csv            press.csv        training.csv
bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
black.csv        energy.csv       krd.csv            property.csv     uol.csv
blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
boo.csv          events.csv       legal.csv          recipes.csv      video.csv
boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
build.csv        exposed.csv      link.csv           reise.csv        wales.csv
builders.csv     express.csv      live.csv           reisen.csv       wang.csv
business.csv     fail.csv         lol.csv            ren.csv          watch.csv
buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
bzh.csv          farm.csv         love.csv           repair.csv       website.csv
cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
camp.csv         fishing.csv      mango.csv          review.csv       win.csv
capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
care.csv         flights.csv      markets.csv        rip.csv          works.csv
career.csv       foo.csv          media.csv          rocks.csv        world.csv
careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
club.csv         global.csv       nexus.csv          social.csv       xyz.csv
coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
community.csv    google.csv       nyc.csv            space.csv        zone.csv
company.csv      graphics.csv     office.csv         style.csv
computer.csv     gratis.csv       okinawa.csv        sucks.csv
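
The listing above suggests one input file per TLD, with IDN TLDs kept
in their ASCII ("xn--") form. As a rough Python sketch of how such a
split could be done and how the "xn--" names could be shown in Unicode
on the Web site: the one-file-per-TLD layout is inferred from the
listing only, and all function names and sample domains here are
hypothetical, not taken from the Crawler.

```python
from collections import defaultdict


def tld_of(domain):
    """Return the last label of a domain name, lower-cased."""
    return domain.strip().lower().rstrip(".").rsplit(".", 1)[-1]


def group_by_tld(domains):
    """Map a per-TLD file name such as 'london.csv' to its domains."""
    groups = defaultdict(list)
    for domain in domains:
        groups[tld_of(domain) + ".csv"].append(domain)
    return dict(groups)


def display_tld(tld):
    """Decode an 'xn--' TLD to Unicode for display; pass others through."""
    if tld.startswith("xn--"):
        # Python's built-in "idna" codec handles the Punycode decoding.
        return tld.encode("ascii").decode("idna")
    return tld


# Hypothetical sample domains.
domains = ["example.london", "example.xyz", "another.xyz", "example.xn--p1ai"]
groups = group_by_tld(domains)

for filename in sorted(groups):
    tld = filename[: -len(".csv")]
    print(filename, display_tld(tld), groups[filename])
```

Keeping the file names in ASCII form and decoding only for display
avoids depending on how the file system and the other scripts handle
non-ASCII names.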

In the meantime I'd like to cite again the Nile University Crew, expertly
led by Sameh El Ansary, for designing and coding a Crawler that has been
able to cope with sifting through 5 years of DNS junk with minimal
maintenance, save the love and attention I give the servers by keeping
them up to date with patches so they don't end up toppling over. They
haven't been rebooted in 464 days and I am crossing my fingers for their
well-being.
And of course, thanks to the University of Southampton Crew who built
the excellent 2nd version of the Web Site under Tim Chown's supervision.

I am still writing an article for RIPE Labs - just struggling to find
the time to finish it, but getting there.

Warmest regards,

Olivier
