[IPv6crawler-wg] An important update about the IPv6 Matrix Project
Christian de Larrinaga
cdel at firsthand.net
Sun Feb 7 22:41:10 GMT 2016
Actually, looking into whether to formalise the Matrix as a web
observatory is not a bad idea. Should I ask Thanassis or Wendy?
Christian
Tim Chown wrote:
> Hi,
>
> This is pretty cool, and the db is slowly but surely making its way
> into Big Data territory ;)
>
> The internationalised domain name problem is also interesting.
> Christian’s solution sounds good.
>
> It seems timely to have another push on both virtualising the system
> (so we can run it from other vantage points) and distributing the data
> / results to minimise any potential of any loss to the increasingly
> valuable data set.
>
> This might fit well with the growing web observatory activity in
> Southampton. We also have a new highly resilient data centre in
> Fareham which could be a good place for the virtualised copy to be
> hosted. If you’re OK with it, I can make some contacts to initiate,
> but let me know.
>
> Tim
>
>
>
>> On 6 Feb 2016, at 17:23, Olivier MJ Crepin-Leblond <ocl at gih.com> wrote:
>>
>> Hello Christian,
>>
>> The SQLite database comes in when it comes to displaying the
>> results. The results of the crawls are in native CSV, all 306 GB of
>> them. The SQLite database is much smaller, as it uses only a subset
>> of all the data collected (the data which is used in the GUI), and we
>> are not using a single SQLite database but one for each crawl - a
>> summary of each crawl for each TLD.
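[The CSV-to-SQLite summarisation Olivier describes can be sketched roughly as follows. This is a minimal illustration only, not the project's actual code: the column names (`domain`, `has_aaaa`) and the `summarise_crawl` helper are assumptions, and a real per-crawl database would be a file rather than in-memory.]

```python
import csv
import io
import sqlite3

def summarise_crawl(csv_text):
    """Reduce raw per-host crawl rows to one (tld, v6_hosts, total)
    summary row per TLD, stored in its own small SQLite database."""
    db = sqlite3.connect(":memory:")  # one DB per crawl in this sketch
    db.execute("CREATE TABLE summary (tld TEXT PRIMARY KEY,"
               " v6_hosts INTEGER, total INTEGER)")
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        tld = row["domain"].rsplit(".", 1)[-1]
        v6, total = totals.get(tld, (0, 0))
        totals[tld] = (v6 + (row["has_aaaa"] == "1"), total + 1)
    db.executemany("INSERT INTO summary VALUES (?, ?, ?)",
                   [(t, v6, n) for t, (v6, n) in totals.items()])
    return db

raw = "domain,has_aaaa\nexample.com,1\ntest.com,0\ndemo.org,1\n"
db = summarise_crawl(raw)
print(db.execute(
    "SELECT v6_hosts, total FROM summary WHERE tld='com'").fetchone())
# → (1, 2)
```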
>> The question of SQLite v3 is a good one, and unfortunately I have
>> no idea whether it would work or whether it would break things.
>> To be added to the list of things to do.
>> Kindest regards,
>>
>> Olivier
>>
>> On 06/02/2016 18:09, Christian de Larrinaga wrote:
>>> That is a humongous SQLite database! Or are you only collecting
>>> the data in SQLite as a form of cache and then exporting it to
>>> CSV once organised?
>>>
>>> SQLite v3 supports UTF-8, which might help?
>>> If it doesn't break something else, of course.
>>>
>>> C
>>>
>>> Olivier MJ Crepin-Leblond wrote:
>>>> Hello all,
>>>>
>>>> another update: the first complete run using the new TLDs has
>>>> completed!
>>>> You can view the results up to February 2016 from
>>>> http://www.ipv6matrix.org
>>>>
>>>> In adding new gTLDs we have hit a snag, although it does not
>>>> significantly affect overall results, since it appears to affect
>>>> only a tiny number of domains.
>>>>
>>>> I am speaking about Internationalised Domain Names (IDNs) at the top level:
>>>>
>>>> xn--3e0b707e xn--80adxhks xn--90ais xn--j1amh xn--pgbs0dh
>>>> xn--wgbl6a
>>>> xn--4gbrim xn--80asehdb xn--d1acj3b xn--p1ai xn--q9jyb4c
>>>>
>>>> Each of these is the ASCII equivalent of a non-ASCII domain name.
>>>> Whilst the Crawler works well with them and we are able to collect
>>>> all of the data pertaining to crawls in IDNs, the program that
>>>> builds the database uses SQLite. Until now, database entries made
>>>> use of domain names that were pure ASCII - but IDNs use a double
>>>> dash "--" in the domain. SQLite chokes on the dash, so we have not
>>>> been able to produce the database needed to display the results
>>>> when IDNs are included.
>>>>
>>>> Until we have a workaround, I have manually isolated the data
>>>> collected for IDNs, which means we still collect it, but we will
>>>> not take it into account in the final database results. As I have
>>>> said, this is a tiny subset of domains: 760 entries out of a total
>>>> of 1 million domains.
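[For reference, each xn-- name above is the Punycode "A-label" form of a Unicode domain, and Python's standard library can convert between the two with its built-in `idna` codec. A quick sketch, using one label from the list above:]

```python
# Decode an IDN "A-label" (ASCII/Punycode form) to its Unicode "U-label"
# using Python's built-in "idna" codec, and encode it back again.
unicode_tld = b"xn--p1ai".decode("idna")  # the Russian Federation TLD
ascii_tld = "рф".encode("idna")
print(unicode_tld, ascii_tld)  # → рф b'xn--p1ai'
```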
>>>>
>>>> I am *still* drafting a very long article for RIPE Labs. In fact,
>>>> we might publish it in two parts. In the meantime, the results
>>>> appear to be broadly consistent with the results of other tracking
>>>> projects, some of which use other methods to track IPv6 adoption:
>>>>
>>>> - http://6lab.cisco.com/stats/
>>>> - https://www.vyncke.org/ipv6status/
>>>> - http://www.mrp.net/ipv6_survey/
>>>>
>>>> We now have 306 GB of comma-separated value text data in store,
>>>> tracing the spread of the IPv6 Internet back to July 2010.
>>>> (294 GB in November 2015)
>>>>
>>>> I look forward to your kind feedback.
>>>>
>>>> Warmest regards,
>>>> Olivier
>>>>
>>>>
>>>> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>>>>> Hello all,
>>>>>
>>>>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>>>>> http://www.ipv6matrix.org ):
>>>>>
>>>>> 1. I have updated the Web site with the latest results, ending in
>>>>> late October - hence a Crawl display date of November 2015.
>>>>> We now have 294 GB of comma-separated value text data in store,
>>>>> tracing the spread of the IPv6 Internet back to July 2010.
>>>>> Altogether, we have run the test approximately 36 times on all 1
>>>>> million of Alexa's busiest domain names. This represents testing
>>>>> of about 6.5 million hosts, carefully collecting traceroute
>>>>> information for each and every one of them. We now have a unique
>>>>> database showing the spread of IPv6 Internet information sources
>>>>> worldwide.
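[At its simplest, the per-host test described above amounts to asking whether a name resolves over IPv6 at all. A minimal standard-library sketch of that one check - the project's actual probes, including the traceroutes, are of course far more involved:]

```python
import socket

def has_ipv6_address(hostname):
    """Return True if the name resolves to at least one IPv6 address."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        # No AAAA record, NXDOMAIN, or resolution failure.
        return False
    return len(infos) > 0

# .invalid is reserved (RFC 2606) and is guaranteed never to resolve.
print(has_ipv6_address("name.invalid"))  # → False
```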
>>>>>
>>>>> 2. Today I took out my very dusty Linux & Python gloves and
>>>>> performed a much needed update to the IPv6 Matrix Crawler input
>>>>> database, including the Alexa 1 million list as well as GeoIP
>>>>> Databases.
>>>>>
>>>>> Indeed, the Alexa database of the world's 1 million busiest Web
>>>>> sites dated from the Crawler's inception in the first half of 2010.
>>>>> We're now more than 5 years later!
>>>>>
>>>>> In a way, keeping the same input database kept the crawl baseline
>>>>> steady, which made it possible to compare results over time.
>>>>> However, the flip-side of the coin is that we ended up with more
>>>>> and more domain names marked as dysfunctional: nearly 5% of the
>>>>> domain names in the database were unreachable. The updated input
>>>>> database should resolve this, but we might also see a jump in some
>>>>> results. It will be interesting to see what the next run yields.
>>>>> Why do we not update the input database more often? Because buried
>>>>> in that database are the domain names of the people who have asked
>>>>> to opt out over the years. Having never catalogued these, I spent
>>>>> several hours tracing back through 5 years of emails from people
>>>>> complaining about the crawl triggering their firewalls, and put
>>>>> together a blacklist of domain names which I have manually deleted
>>>>> from the crawl input files.
>>>>> The blacklist, as it stands now:
>>>>>
>>>>> Deleted:
>>>>>
>>>>> it-mate.co.uk
>>>>> indianic.com
>>>>> your-server.de
>>>>> catacombscds.com
>>>>> dewlance.com
>>>>> tcs.com
>>>>> printweb.de
>>>>> nocser.net
>>>>> shoppingnsales.com
>>>>> bsaadmail.com
>>>>> epayservice.ru
>>>>> 4footyfans.com
>>>>> guitarspeed99.com
>>>>> saga.co.uk
>>>>>
>>>>> Already gone from the current Alexa list:
>>>>>
>>>>> infinityautosurf.com
>>>>> canada-traffic.com
>>>>> usahitz.com
>>>>> jawatankosong.com.my
>>>>> 4d.com.my
>>>>> fitnessuncovered.co.uk
>>>>> kualalumpurbookfair.com
>>>>> xgen-it.com
>>>>> bpanet.de
>>>>> edns.de
>>>>> back2web.de
>>>>> waaaouh.com
>>>>> every-web.com
>>>>> w3sexe.com
>>>>> gratuits-web.com
>>>>> france-mateur.com
>>>>> pliagedepapier.com
>>>>> immobilieretparticuliers.com
>>>>> chronobio.com
>>>>> stickers-origines.com
>>>>> tailor-made.co.uk
>>>>>
>>>>> With these out of the input files, we are able to start the next
>>>>> crawl.
>>>>> I hope I have not missed any complaints, but if I have, this is
>>>>> advance notice that we might receive a few emails in the
>>>>> forthcoming weeks. We might also receive a few emails from sites
>>>>> that have appeared on the Alexa 1 million list since 2010.
>>>>>
>>>>> Back to this list: the excellent filtering program which was used
>>>>> to process and clean up the original list was used again on the
>>>>> new list. The old Alexa list had a number of domain names which
>>>>> were actually sub-directories, as well as some invalid domains.
>>>>> Alexa has since tightened its act, and the latest list is much
>>>>> cleaner. It holds 999,998 valid domains vs. 984,587 domains
>>>>> for the original 2010 list.
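[The clean-up that filtering program performs can be approximated as: discard entries that carry a path component (a sub-directory, not a domain) or that fail a basic hostname syntax check. A rough sketch under those assumptions - the real program's rules are not shown in this thread:]

```python
import re

# Conservative hostname check: dot-separated labels of letters, digits,
# and hyphens, with no leading or trailing hyphen, and at least one dot.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
VALID = re.compile(rf"^{LABEL}(?:\.{LABEL})+$", re.IGNORECASE)

def clean(entries):
    kept = []
    for e in entries:
        e = e.strip().lower()
        if "/" in e:  # a sub-directory path, not a domain name
            continue
        if VALID.match(e):
            kept.append(e)
    return kept

print(clean(["example.com", "example.com/forum", "-bad-.com",
             "ok.xn--p1ai"]))
# → ['example.com', 'ok.xn--p1ai']
```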
>>>>>
>>>>> Finally, new gTLDs have now appeared in the Alexa list, including
>>>>> some Internationalised Domain Names (IDNs). The world is indeed a
>>>>> very different place!
>>>>> It will be interesting to see how the Crawler, as well as all the
>>>>> other scripts that process the information into displayable data
>>>>> on the Web server, will cope with these:
>>>>>
>>>>> academy.csv consulting.csv guide.csv one.csv supply.csv
>>>>> accountant.csv contractors.csv guru.csv onl.csv support.csv
>>>>> actor.csv cool.csv hamburg.csv online.csv surf.csv
>>>>> ads.csv country.csv haus.csv ooo.csv swiss.csv
>>>>> adult.csv creditcard.csv healthcare.csv orange.csv sydney.csv
>>>>> agency.csv cricket.csv help.csv ovh.csv systems.csv
>>>>> alsace.csv cymru.csv hiphop.csv paris.csv taipei.csv
>>>>> amsterdam.csv dance.csv holiday.csv partners.csv tattoo.csv
>>>>> app.csv date.csv horse.csv parts.csv team.csv
>>>>> archi.csv dating.csv host.csv party.csv tech.csv
>>>>> associates.csv deals.csv hosting.csv photo.csv technology.csv
>>>>> attorney.csv delivery.csv house.csv photography.csv theater.csv
>>>>> auction.csv desi.csv how.csv photos.csv tienda.csv
>>>>> audio.csv design.csv immobilien.csv pics.csv tips.csv
>>>>> axa.csv dev.csv immo.csv pictures.csv tirol.csv
>>>>> barclaycard.csv diet.csv ink.csv pink.csv today.csv
>>>>> barclays.csv digital.csv international.csv pizza.csv tokyo.csv
>>>>> bar.csv direct.csv investments.csv place.csv tools.csv
>>>>> bargains.csv directory.csv irish.csv plus.csv top.csv
>>>>> bayern.csv discount.csv jetzt.csv poker.csv town.csv
>>>>> beer.csv dog.csv joburg.csv porn.csv toys.csv
>>>>> berlin.csv domains.csv juegos.csv post.csv trade.csv
>>>>> best.csv earth.csv kim.csv press.csv training.csv
>>>>> bid.csv education.csv kitchen.csv prod.csv trust.csv
>>>>> bike.csv email.csv kiwi.csv productions.csv university.csv
>>>>> bio.csv emerck.csv koeln.csv properties.csv uno.csv
>>>>> black.csv energy.csv krd.csv property.csv uol.csv
>>>>> blackfriday.csv equipment.csv kred.csv pub.csv vacations.csv
>>>>> blue.csv estate.csv land.csv quebec.csv vegas.csv
>>>>> bnpparibas.csv eus.csv law.csv realtor.csv ventures.csv
>>>>> boo.csv events.csv legal.csv recipes.csv video.csv
>>>>> boutique.csv exchange.csv life.csv red.csv vision.csv
>>>>> brussels.csv expert.csv limited.csv rehab.csv voyage.csv
>>>>> build.csv exposed.csv link.csv reise.csv wales.csv
>>>>> builders.csv express.csv live.csv reisen.csv wang.csv
>>>>> business.csv fail.csv lol.csv ren.csv watch.csv
>>>>> buzz.csv faith.csv london.csv rentals.csv webcam.csv
>>>>> bzh.csv farm.csv love.csv repair.csv website.csv
>>>>> cab.csv finance.csv luxury.csv report.csv wien.csv
>>>>> camera.csv fish.csv management.csv rest.csv wiki.csv
>>>>> camp.csv fishing.csv mango.csv review.csv win.csv
>>>>> capital.csv fit.csv market.csv reviews.csv windows.csv
>>>>> cards.csv fitness.csv marketing.csv rio.csv work.csv
>>>>> care.csv flights.csv markets.csv rip.csv works.csv
>>>>> career.csv foo.csv media.csv rocks.csv world.csv
>>>>> careers.csv football.csv melbourne.csv ruhr.csv wtf.csv
>>>>> casa.csv forsale.csv menu.csv ryukyu.csv xn--3e0b707e.csv
>>>>> cash.csv foundation.csv microsoft.csv sale.csv xn--4gbrim.csv
>>>>> casino.csv frl.csv moda.csv scb.csv xn--80adxhks.csv
>>>>> center.csv fund.csv moe.csv school.csv xn--80asehdb.csv
>>>>> ceo.csv futbol.csv monash.csv science.csv xn--90ais.csv
>>>>> chat.csv gal.csv money.csv scot.csv xn--d1acj3b.csv
>>>>> church.csv gallery.csv moscow.csv services.csv xn--j1amh.csv
>>>>> city.csv garden.csv movie.csv sexy.csv xn--p1ai.csv
>>>>> claims.csv gent.csv nagoya.csv shiksha.csv xn--pgbs0dh.csv
>>>>> click.csv gift.csv network.csv shoes.csv xn--q9jyb4c.csv
>>>>> clinic.csv gifts.csv new.csv singles.csv xn--wgbl6a.csv
>>>>> clothing.csv glass.csv news.csv site.csv xxx.csv
>>>>> club.csv global.csv nexus.csv social.csv xyz.csv
>>>>> coach.csv globo.csv ngo.csv software.csv yandex.csv
>>>>> codes.csv gmail.csv ninja.csv solar.csv yoga.csv
>>>>> coffee.csv goo.csv nrw.csv solutions.csv yokohama.csv
>>>>> college.csv goog.csv ntt.csv soy.csv youtube.csv
>>>>> community.csv google.csv nyc.csv space.csv zone.csv
>>>>> company.csv graphics.csv office.csv style.csv
>>>>> computer.csv gratis.csv okinawa.csv sucks.csv
>>>>>
>>>>> In the meantime I'd like to credit again the Nile University crew,
>>>>> expertly led by Sameh El Ansary, for designing and coding a Crawler
>>>>> that has been able to cope with sifting through 5 years of DNS junk
>>>>> with minimal maintenance, save the love and attention I give the
>>>>> servers by keeping them up to date with patches so they don't end
>>>>> up toppling over. They haven't been rebooted in 464 days and I am
>>>>> crossing my fingers for their continued well-being.
>>>>> And of course, thanks to the University of Southampton crew who
>>>>> built the excellent 2nd version of the Web site under Tim Chown's
>>>>> supervision.
>>>>>
>>>>> I am still writing an article for RIPE Labs - just struggling to
>>>>> find the time to finish it, but getting there.
>>>>>
>>>>> Warmest regards,
>>>>>
>>>>> Olivier
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> IPv6crawler-wg mailing list
>>>>> IPv6crawler-wg at gih.co.uk
>>>>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>>>>
>>>> --
>>>> Olivier MJ Crépin-Leblond, PhD
>>>> http://www.gih.com/ocl.html
>>>
>>> --
>>> Christian de Larrinaga FBCS, CITP,
>>> -------------------------
>>> @ FirstHand
>>> -------------------------
>>> +44 7989 386778
>>> cdel at firsthand.net
>>> -------------------------
>>>
>>
>> --
>> Olivier MJ Crépin-Leblond, PhD
>> http://www.gih.com/ocl.html
>
--
Christian de Larrinaga FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------
More information about the IPv6crawler-wg
mailing list