[IPv6crawler-wg] An important update about the IPv6 Matrix Project

Christian de Larrinaga cdel at firsthand.net
Sat Feb 6 17:09:54 GMT 2016


That is a humongous SQLite database! Or are you only using SQLite as a
form of cache while collecting the data, and then exporting it to CSV
once it is organised?

SQLite v3 supports UTF-8, which might help?
If it doesn't break something else, of course.
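
To make the UTF-8 idea a little more concrete, here is a rough Python
sketch. It assumes (I have not seen the database builder) that the TLD
can be kept as data in a text column rather than in a table or file
name. The `tld` table and the `to_unicode` helper are names I made up
for illustration. It decodes the ASCII forms from Olivier's list with
Python's built-in IDNA codec and stores both forms in SQLite as
ordinary UTF-8 text.

```python
import sqlite3

# ASCII (Punycode) forms of the IDN TLDs listed in the quoted message.
IDN_TLDS = [
    "xn--3e0b707e", "xn--80adxhks", "xn--90ais", "xn--j1amh",
    "xn--pgbs0dh", "xn--wgbl6a", "xn--4gbrim", "xn--80asehdb",
    "xn--d1acj3b", "xn--p1ai", "xn--q9jyb4c",
]


def to_unicode(label: str) -> str:
    """Decode an 'xn--' label with Python's built-in IDNA (2003) codec.

    Hypothetical helper, not part of the Crawler.
    """
    return label.encode("ascii").decode("idna")


# In-memory database for the sketch; the table layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tld (ascii_form TEXT PRIMARY KEY, unicode_form TEXT)"
)

# Values are bound as parameters, so they never become part of the SQL text.
conn.executemany(
    "INSERT INTO tld (ascii_form, unicode_form) VALUES (?, ?)",
    [(t, to_unicode(t)) for t in IDN_TLDS],
)

rows = conn.execute(
    "SELECT ascii_form, unicode_form FROM tld ORDER BY ascii_form"
).fetchall()
for ascii_form, unicode_form in rows:
    print(ascii_form, unicode_form)
```

Because the values are bound as parameters, neither the "--" nor the
non-ASCII characters ever appear in the SQL statement itself.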

C

Olivier MJ Crepin-Leblond wrote:
> Hello all,
>
> another update: the first complete run using the new TLDs has completed!
> You can view the results up to February 2016 from
> http://www.ipv6matrix.org
>
> In adding new gTLDs we have hit a snag, although this snag does not
> significantly affect overall results since it appears to only affect a
> tiny number of domains.
>
> I am speaking about Internationalized Top Level Domains (IDNs):
>
> xn--3e0b707e  xn--80adxhks  xn--90ais    xn--j1amh  xn--pgbs0dh  xn--wgbl6a
> xn--4gbrim    xn--80asehdb  xn--d1acj3b  xn--p1ai   xn--q9jyb4c
>
> Each of these is the ASCII equivalent of a non-ASCII domain name.
> Whilst the Crawler works well with them and we are able to collect all
> of the data pertaining to crawls in IDNs, the program that builds the
> Database uses SQLite. Until now, database entries made use of domain
> names that were plain ASCII, but IDNs use a double dash "--" in the
> domain. SQLite chokes on the dash, so we have not been able to produce
> the database needed for displaying the results when including IDNs.
>
> Until we have a workaround, I have manually isolated data collected
> for IDNs, which means we still collect them, but we will not take them
> into account in the final database results. As I have said, this is a
> tiny subset of domains: 760 entries out of a total of 1 million domains.
>
> I am *still* drafting a very long article for RIPE labs. In fact, we
> might publish this in two parts. In the meantime, the results appear
> to be broadly consistent with results of other tracking projects, some
> of which use other methods to track IPv6 adoption:
>
> - http://6lab.cisco.com/stats/
> - https://www.vyncke.org/ipv6status/
> - http://www.mrp.net/ipv6_survey/
>
> We now have 306 GB of comma-separated value text data in store,
> tracing back the spread of the IPv6 Internet since July 2010 (294 GB
> in November 2015).
>
> I look forward to your kind feedback.
>
> Warmest regards,
> Olivier
>
>
> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>> Hello all,
>>
>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>> http://www.ipv6matrix.org ):
>>
>> 1. I have updated the Web site with the latest results ending in late
>> October - hence noting a Crawl display date of November 2015.
>> We now have 294 GB of comma-separated value text data in store,
>> tracing back the spread of the IPv6 Internet since July 2010.
>> Altogether, we ran the test approximately 36 times on all 1 million
>> of Alexa's busiest domain names. This represented testing of about 6.5
>> million hosts, carefully collecting traceroute information for each
>> and every one of them. We now have a unique database that shows
>> the spread of IPv6 Internet information sources worldwide.
>>
>> 2. Today I took out my very dusty Linux & Python gloves and performed
>> a much needed update to the IPv6 Matrix Crawler input database,
>> including the Alexa 1 million list as well as GeoIP Databases.
>>
>> Indeed, the Alexa database of the world's 1 million busiest Web sites
>> dated from the Crawler's first inception in the first half of 2010.
>> We're more than 5 years later!
>>
>> In a way, keeping the same input database has kept the base of crawls
>> steady, which made it possible to compare results. However,
>> the flip-side of the coin is that we are ending up with more and more
>> domain names marked as being dysfunctional. Nearly 5% of the domain
>> names in the database were unreachable. The updated input database
>> should resolve this, but we might also see a jump in some results. It
>> will be interesting to see what the next run yields.
>> Why do we not update the input database more often? Because buried in
>> that database are the domain names of the people who wanted to opt
>> out over the years. Having never thought about this, I spent several
>> hours tracing back 5 years of emails of people complaining about the
>> crawl triggering their firewalls. I put together a blacklist of
>> domain names I have manually deleted from the crawl input files.
>> The blacklist, as it stands now:
>>
>> Deleted:
>>
>> it-mate.co.uk
>> indianic.com
>> your-server.de
>> catacombscds.com
>> dewlance.com
>> tcs.com
>> printweb.de
>> nocser.net
>> shoppingnsales.com
>> bsaadmail.com
>> epayservice.ru
>> 4footyfans.com
>> guitarspeed99.com
>> saga.co.uk
>>
>> Already gone from the current Alexa list:
>>
>> infinityautosurf.com
>> canada-traffic.com
>> usahitz.com
>> jawatankosong.com.my
>> 4d.com.my
>> fitnessuncovered.co.uk
>> kualalumpurbookfair.com
>> xgen-it.com
>> bpanet.de
>> edns.de
>> back2web.de
>> waaaouh.com
>> every-web.com
>> w3sexe.com
>> gratuits-web.com
>> france-mateur.com
>> pliagedepapier.com
>> immobilieretparticuliers.com
>> chronobio.com
>> stickers-origines.com
>> tailor-made.co.uk
>>
>> With these out of the input files, we are able to start the next crawl.
>> *I hope I have not missed any complaints, but if I have, this is
>> advance notice that we might receive a few emails in the forthcoming
>> weeks. We might also receive a few emails from sites that have
>> appeared on the Alexa 1 million list since 2010.*
>>
>> Back to this list, the excellent filtering program which was used to
>> process the original list and clean it up was used again for the
>> modern list. The Alexa list had a number of domain names which were
>> actually sub-directories in the past, as well as some invalid
>> domains. Alexa has since tightened its act. The latest Alexa list is
>> much cleaner. It holds 999998 valid domains vs. 984587 domains for
>> the original 2010 list.
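
I don't have the filtering program to hand, so the following is only a
sketch of the kind of clean-up described above: dropping entries that
are really sub-directories, dropping invalid domains, and removing the
opt-out blacklist. The function names, the hostname rule and the sample
entries are my assumptions, not the real program.

```python
import re

# Assumed rule for one hostname label: letters, digits and hyphens,
# 1 to 63 characters, no leading or trailing hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_domain(entry: str) -> bool:
    """Return True if the entry looks like a bare domain name."""
    if "/" in entry:  # sub-directory entries such as "example.com/blog"
        return False
    labels = entry.lower().rstrip(".").split(".")
    return len(labels) >= 2 and all(LABEL.match(label) for label in labels)


def clean(entries, blacklist):
    """Keep valid, non-blacklisted domains, preserving order, no duplicates."""
    seen = set()
    kept = []
    for entry in entries:
        domain = entry.strip().lower()
        if domain in seen or domain in blacklist or not is_valid_domain(domain):
            continue
        seen.add(domain)
        kept.append(domain)
    return kept


# Two names taken from the blacklist earlier in this message.
blacklist = {"saga.co.uk", "tcs.com"}

# Made-up sample input for illustration.
sample = [
    "example.com",
    "example.com/blog",   # sub-directory, dropped
    "bad_domain.com",     # underscore, dropped as invalid
    "saga.co.uk",         # opted out, dropped
    "Example.com",        # duplicate after lower-casing, dropped
    "example.xn--p1ai",   # IDN in ASCII form, kept
]

cleaned = clean(sample, blacklist)
print(cleaned)
```

Note that the hyphen rule still accepts the "xn--" labels, so IDNs in
their ASCII form pass through this kind of filter untouched.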
>>
>> Finally, new gTLDs have now appeared in the Alexa list, including
>> some Internationalised Domain Names (IDNs). The world is indeed a
>> very different place!
>> It will be interesting to see how the Crawler, as well as all the
>> other scripts that process the information into displayable data on
>> the Web server, will cope with these:
>>
>> academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
>> accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
>> actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
>> ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
>> adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
>> agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
>> alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
>> amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
>> app.csv          date.csv         horse.csv          parts.csv        team.csv
>> archi.csv        dating.csv       host.csv           party.csv        tech.csv
>> associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
>> attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
>> auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
>> audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
>> axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
>> barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
>> barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
>> bar.csv          direct.csv       investments.csv    place.csv        tools.csv
>> bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
>> bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
>> beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
>> berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
>> best.csv         earth.csv        kim.csv            press.csv        training.csv
>> bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
>> bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
>> bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
>> black.csv        energy.csv       krd.csv            property.csv     uol.csv
>> blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
>> blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
>> bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
>> boo.csv          events.csv       legal.csv          recipes.csv      video.csv
>> boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
>> brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
>> build.csv        exposed.csv      link.csv           reise.csv        wales.csv
>> builders.csv     express.csv      live.csv           reisen.csv       wang.csv
>> business.csv     fail.csv         lol.csv            ren.csv          watch.csv
>> buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
>> bzh.csv          farm.csv         love.csv           repair.csv       website.csv
>> cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
>> camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
>> camp.csv         fishing.csv      mango.csv          review.csv       win.csv
>> capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
>> cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
>> care.csv         flights.csv      markets.csv        rip.csv          works.csv
>> career.csv       foo.csv          media.csv          rocks.csv        world.csv
>> careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
>> casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
>> cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
>> casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
>> center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
>> ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
>> chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
>> church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
>> city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
>> claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
>> click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
>> clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
>> clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
>> club.csv         global.csv       nexus.csv          social.csv       xyz.csv
>> coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
>> codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
>> coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
>> college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
>> community.csv    google.csv       nyc.csv            space.csv        zone.csv
>> company.csv      graphics.csv     office.csv         style.csv
>> computer.csv     gratis.csv       okinawa.csv        sucks.csv
>>
>> In the meantime I'd like to cite again the Nile University Crew
>> expertly led by Sameh El Ansary for designing and coding a Crawler
>> that has been able to cope with sifting through 5 years of DNS junk with
>> minimal maintenance, save the love and attention I give the servers
>> by keeping them up to date with patches so they don't end up toppling
>> over. They haven't been rebooted in 464 days and I am crossing
>> fingers for their well-being.
>> And of course, thanks to the University of Southampton Crew who built
>> the excellent 2nd version of the Web Site under Tim Chown's supervision.
>>
>> I am still writing an article for RIPE Labs - just struggling to find
>> the time to finish it, but getting there.
>>
>> Warmest regards,
>>
>> Olivier
>>
>>
>>
>> _______________________________________________
>> IPv6crawler-wg mailing list
>> IPv6crawler-wg at gih.co.uk
>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>
> -- 
> Olivier MJ Crépin-Leblond, PhD
> http://www.gih.com/ocl.html

-- 
Christian de Larrinaga  FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------
