[IPv6crawler-wg] An important update about the IPv6 Matrix Project

Olivier MJ Crepin-Leblond ocl at gih.com
Sat Feb 6 16:32:34 GMT 2016


Hello all,

another update: the first full run using the new TLDs has completed!
You can view the results up to February 2016 at http://www.ipv6matrix.org

In adding new gTLDs we have hit a snag, although this snag does not
significantly affect overall results since it appears to only affect a
tiny number of domains.

I am speaking about Internationalized Top Level Domains (IDNs):

xn--3e0b707e  xn--80adxhks  xn--90ais    xn--j1amh  xn--pgbs0dh  xn--wgbl6a
xn--4gbrim    xn--80asehdb  xn--d1acj3b  xn--p1ai   xn--q9jyb4c

Each of these is the ASCII equivalent of a non-ASCII domain name. Whilst
the Crawler works well with them and we are able to collect all of the
data pertaining to crawls in IDNs, the program that builds the database
uses SQLite. Until now, database entries made use of domain names that
were plain ASCII, but IDNs use a double dash "--" in the domain. SQLite
coughs on the dash, so we have not been able to produce the database
needed to display the results when IDNs are included.
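
For illustration only, here is a minimal sketch of one possible workaround.
It assumes (this has not been checked against the actual database builder)
that the TLD ends up as an SQL identifier such as a table name. The table
layout and column names are hypothetical. In SQLite a bare dash in an
identifier is not valid, and "--" starts a comment, but the same name is
accepted once it is wrapped in double quotes. Dashes inside stored values
are harmless when passed as bound parameters.

```python
import sqlite3

# Hypothetical schema: one table per TLD, named after the TLD itself.
# This is an assumption for illustration, not the project's actual layout.
IDN_TLD = "xn--p1ai"


def quote_identifier(name: str) -> str:
    """Wrap an SQL identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def create_unquoted(conn: sqlite3.Connection, tld: str) -> bool:
    """Try to create a table with a bare identifier; report whether it worked."""
    try:
        conn.execute(f"CREATE TABLE {tld} (domain TEXT, has_aaaa INTEGER)")
        return True
    except sqlite3.Error:
        # A bare dash is parsed as a minus sign, and "--" starts an SQL
        # comment, so the statement is not valid SQL.
        return False


def create_quoted(conn: sqlite3.Connection, tld: str) -> None:
    """Create the same table with the identifier safely quoted."""
    conn.execute(
        f"CREATE TABLE {quote_identifier(tld)} (domain TEXT, has_aaaa INTEGER)"
    )


conn = sqlite3.connect(":memory:")
unquoted_ok = create_unquoted(conn, IDN_TLD)  # expected to fail
create_quoted(conn, IDN_TLD)                  # expected to succeed

# Values containing dashes are fine when bound as parameters.
conn.execute(
    f"INSERT INTO {quote_identifier(IDN_TLD)} VALUES (?, ?)",
    ("example.xn--p1ai", 1),
)
rows = conn.execute(
    f"SELECT domain, has_aaaa FROM {quote_identifier(IDN_TLD)}"
).fetchall()
print(unquoted_ok, rows)
```

If the builder constructs its SQL strings by hand, quoting every identifier
in one helper like this would cover IDNs and any future TLD with a dash.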

Until we have a workaround, I have manually isolated the data collected for
IDNs, which means we still collect them, but we will not take them into
account in the final database results. As I have said, this is a tiny
subset of domains: 760 entries out of a total of 1 million domains.
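
The isolation step could probably be automated along these lines. This is
only a sketch: it assumes the input is a plain list of domain names, the
sample data is made up, and the helper names are hypothetical. It relies on
the fact that every IDN label in its ASCII form starts with "xn--", and it
uses Python's built-in "idna" codec to show the Unicode form of a label.

```python
ACE_PREFIX = "xn--"  # prefix carried by every IDN label in its ASCII form


def is_idn_tld(domain: str) -> bool:
    """Return True when the top-level label of `domain` is an IDN."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld.startswith(ACE_PREFIX)


def split_idn(domains):
    """Split domains into (regular, idn) lists, preserving input order."""
    regular, idn = [], []
    for domain in domains:
        (idn if is_idn_tld(domain) else regular).append(domain)
    return regular, idn


def to_unicode(label: str) -> str:
    """Decode an ASCII IDN label to Unicode with the built-in idna codec."""
    return label.encode("ascii").decode("idna")


# Made-up sample input; the real list would come from the crawl input files.
sample = [
    "example.com",
    "example.xn--p1ai",
    "example.co.uk",
    "example.xn--80adxhks",
]
regular, idn = split_idn(sample)
print(regular)                 # domains kept for the database build
print(idn)                     # domains set aside until a workaround exists
print(to_unicode("xn--p1ai"))  # Unicode form of the TLD label
```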

I am *still* drafting a very long article for RIPE Labs. In fact, we
might publish it in two parts. In the meantime, the results appear to
be somewhat consistent with those of other tracking projects, some of
which use other methods to track IPv6 adoption:

- http://6lab.cisco.com/stats/
- https://www.vyncke.org/ipv6status/
- http://www.mrp.net/ipv6_survey/

We now have 306 GB of comma-separated value text data in store, tracing
the spread of the IPv6 Internet since July 2010 (up from 294 GB in
November 2015).

I look forward to your kind feedback.

Warmest regards,
Olivier


On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
> Hello all,
>
> Two worthy pieces of news regarding the IPv6 Matrix Project (
> http://www.ipv6matrix.org ):
>
> 1. I have updated the Web site with the latest results ending in late
> October - hence noting a Crawl display date of November 2015.
> We now have 294 GB of comma-separated value text data in store,
> tracing the spread of the IPv6 Internet since July 2010.
> Altogether, we ran the test approximately 36 times on all 1 million
> of Alexa's busiest domain names. This represented testing of about 6.5
> million hosts, carefully collecting traceroute information for each
> and every one of them. We now have a unique database that shows
> the spread of IPv6 Internet information sources worldwide.
>
> 2. Today I took out my very dusty Linux & Python gloves and performed
> a much needed update to the IPv6 Matrix Crawler input database,
> including the Alexa 1 million list as well as GeoIP Databases.
>
> Indeed, the Alexa database of the world's 1 million busiest Web sites
> dated from the Crawler's inception in the first half of 2010.
> That is more than 5 years ago!
>
> In a way, keeping the same input database has kept the base of crawls
> steady, which made it possible to compare results. However,
> the flip side of the coin is that we are ending up with more and more
> domain names marked as being dysfunctional. Nearly 5% of the domain
> names in the database were unreachable. The updated input database
> should resolve this, but we might also see a jump in some results. It
> will be interesting to see what the next run yields.
> Why do we not update the input database more often? Because buried in
> that database are the domain names of the people who wanted to opt out
> over the years. Having never thought about this, I spent several hours
> tracing back 5 years of emails of people complaining about the crawl
> triggering their firewalls. I put together a blacklist of domain names
> I have manually deleted from the crawl input files.
> The blacklist, as it stands now:
>
> Deleted:
>
> it-mate.co.uk
> indianic.com
> your-server.de
> catacombscds.com
> dewlance.com
> tcs.com
> printweb.de
> nocser.net
> shoppingnsales.com
> bsaadmail.com
> epayservice.ru
> 4footyfans.com
> guitarspeed99.com
> saga.co.uk
>
> Already gone from the current Alexa list:
>
> infinityautosurf.com
> canada-traffic.com
> usahitz.com
> jawatankosong.com.my
> 4d.com.my
> fitnessuncovered.co.uk
> kualalumpurbookfair.com
> xgen-it.com
> bpanet.de
> edns.de
> back2web.de
> waaaouh.com
> every-web.com
> w3sexe.com
> gratuits-web.com
> france-mateur.com
> pliagedepapier.com
> immobilieretparticuliers.com
> chronobio.com
> stickers-origines.com
> tailor-made.co.uk
>
> With these out of the input files, we are able to start the next crawl. *
> I hope I have not missed any complaints, but if I have, this is
> advance notice that we might receive a few emails in the forthcoming
> weeks. We might also receive a few emails from sites that have
> appeared on the Alexa 1 million list since 2010.*
>
> Back to this list, the excellent filtering program which was used to
> process the original list and clean it up was used again for the
> modern list. The Alexa list had a number of domain names which were
> actually sub-directories in the past, as well as some invalid domains.
> Alexa has since tightened its act. The latest Alexa list is much
> cleaner. It holds 999998 valid domains vs. 984587 domains for the
> original 2010 list.
>
> Finally, new gTLDs have now appeared in the Alexa list, including some
> Internationalised Domain Names (IDNs). The world is indeed a very
> different place!
> It will be interesting to see how the Crawler, as well as all the other
> scripts that process the information into displayable data on the Web
> server, will cope with these:
>
> academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
> accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
> actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
> ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
> adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
> agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
> alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
> amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
> app.csv          date.csv         horse.csv          parts.csv        team.csv
> archi.csv        dating.csv       host.csv           party.csv        tech.csv
> associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
> attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
> auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
> audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
> axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
> barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
> barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
> bar.csv          direct.csv       investments.csv    place.csv        tools.csv
> bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
> bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
> beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
> berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
> best.csv         earth.csv        kim.csv            press.csv        training.csv
> bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
> bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
> bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
> black.csv        energy.csv       krd.csv            property.csv     uol.csv
> blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
> blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
> bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
> boo.csv          events.csv       legal.csv          recipes.csv      video.csv
> boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
> brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
> build.csv        exposed.csv      link.csv           reise.csv        wales.csv
> builders.csv     express.csv      live.csv           reisen.csv       wang.csv
> business.csv     fail.csv         lol.csv            ren.csv          watch.csv
> buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
> bzh.csv          farm.csv         love.csv           repair.csv       website.csv
> cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
> camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
> camp.csv         fishing.csv      mango.csv          review.csv       win.csv
> capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
> cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
> care.csv         flights.csv      markets.csv        rip.csv          works.csv
> career.csv       foo.csv          media.csv          rocks.csv        world.csv
> careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
> casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
> cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
> casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
> center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
> ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
> chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
> church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
> city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
> claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
> click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
> clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
> clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
> club.csv         global.csv       nexus.csv          social.csv       xyz.csv
> coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
> codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
> coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
> college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
> community.csv    google.csv       nyc.csv            space.csv        zone.csv
> company.csv      graphics.csv     office.csv         style.csv
> computer.csv     gratis.csv       okinawa.csv        sucks.csv
>
> In the meantime I'd like to cite again the Nile University Crew
> expertly led by Sameh El Ansary for designing and coding a Crawler
> that has been able to cope with sifting through 5 years of DNS junk with
> minimal maintenance, save the love and attention I give the servers by
> keeping them up to date with patches so they don't end up toppling
> over. They haven't been rebooted in 464 days and I am crossing fingers
> for their well-being.
> And of course, thanks to the University of Southampton Crew who built
> the excellent 2nd version of the Web Site under Tim Chown's supervision.
>
> I am still writing an article for RIPE Labs - just struggling to find
> the time to finish it, but getting there.
>
> Warmest regards,
>
> Olivier

-- 
Olivier MJ Crépin-Leblond, PhD
http://www.gih.com/ocl.html
