[IPv6crawler-wg] An important update about the IPv6 Matrix Project

Christian de Larrinaga cdel at firsthand.net
Sat Feb 6 17:53:47 GMT 2016


ahh I know that list ;-)


Olivier MJ Crepin-Leblond wrote:
> Hello Christian,
>
> the SQLite database comes in when it comes down to displaying the
> results. The results of the crawls are in native CSV. All 306Gb of
> these. The Sqlite database is much smaller as it only uses a subset of
> all data collected (the data which is used in the GUI) and we are not
> using a single Sqlite database but one for each crawl - a summary of
> each crawl for each TLD.
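
As a rough illustration of the CSV-to-summary step described above, the sketch below reduces the rows of one crawl to a small SQLite file holding per-TLD counts. The column names (`domain`, `has_aaaa`), the `summary` table and the function name are assumptions made for the example; the project's real CSV layout and schema are not described in this thread.

```python
import csv
import io
import sqlite3
from collections import Counter


def build_summary(csv_text, db_path=":memory:"):
    """Summarise one crawl's CSV rows into a small per-TLD SQLite table."""
    totals, with_v6 = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        tld = row["domain"].rsplit(".", 1)[-1].lower()
        totals[tld] += 1
        if row["has_aaaa"] == "1":  # hypothetical "domain has an AAAA record" flag
            with_v6[tld] += 1

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE summary (tld TEXT PRIMARY KEY, domains INTEGER, ipv6 INTEGER)"
    )
    # Values are bound as parameters, so their content never reaches the SQL parser.
    conn.executemany(
        "INSERT INTO summary VALUES (?, ?, ?)",
        [(tld, totals[tld], with_v6[tld]) for tld in sorted(totals)],
    )
    conn.commit()
    return conn


sample = "domain,has_aaaa\nexample.com,1\nexample.org,0\nexample.net,1\nanother.com,0\n"
conn = build_summary(sample)
rows = conn.execute("SELECT tld, domains, ipv6 FROM summary ORDER BY tld").fetchall()
print(rows)
```

Passing a file path instead of `":memory:"` would give one database file per crawl, which seems to match the "one for each crawl" arrangement mentioned above.
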
> The question of Sqlite v3 is a good one -- and I have unfortunately
> got no idea whether it would work or whether it would break things. To
> be added to the list of things to do.
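
One low-risk first step for that to-do item might be to check which file format the existing databases are in. The sketch below relies on two facts: every SQLite 3 database file begins with the 16-byte string "SQLite format 3" followed by a NUL byte, and Python's `sqlite3` module wraps the SQLite 3 library. The file names and the helper `is_sqlite3_file` are made up for the example.

```python
import os
import sqlite3
import tempfile

SQLITE3_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def is_sqlite3_file(path):
    """Return True if the file starts with the SQLite 3 header string."""
    with open(path, "rb") as handle:
        return handle.read(16) == SQLITE3_MAGIC


# Build a throwaway SQLite 3 database and a plain text file to compare.
tmpdir = tempfile.mkdtemp()
db_path = os.path.join(tmpdir, "crawl.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE t (x INTEGER)")
conn.commit()
conn.close()

other_path = os.path.join(tmpdir, "notes.txt")
with open(other_path, "w") as handle:
    handle.write("not a database\n")

print(sqlite3.sqlite_version)       # version of the SQLite library Python links against
print(is_sqlite3_file(db_path))     # the freshly created database
print(is_sqlite3_file(other_path))  # the text file
```
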
> Kindest regards,
>
> Olivier
>
> On 06/02/2016 18:09, Christian de Larrinaga wrote:
>> That is a humungous large sqlite database! or are you only collecting
>> the data as a form of cache using sqlite and then exporting it out
>> once organised into csv?
>>
>> Sqlite v3 supports utf-8 which might help?
>> if it doesn't break something else of course.
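
To make the UTF-8 suggestion concrete, here is a minimal sketch that decodes an "xn--" label to its Unicode form with Python's built-in `idna` codec (which implements IDNA 2003) and stores both forms as text in an SQLite 3 table. The table and column names are assumptions. Whether this helps depends on where the builder actually fails, since the "xn--" form is already plain ASCII.

```python
import sqlite3


def to_unicode(label):
    """Decode an ASCII 'xn--' label to its Unicode form (IDNA 2003 codec)."""
    return label.encode("ascii").decode("idna")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tld (ascii_form TEXT PRIMARY KEY, unicode_form TEXT)")
for label in ["xn--p1ai", "xn--j1amh", "com"]:
    # Bound parameters keep the label out of the SQL text itself.
    conn.execute("INSERT INTO tld VALUES (?, ?)", (label, to_unicode(label)))

stored = dict(conn.execute("SELECT ascii_form, unicode_form FROM tld"))
print(stored)
```
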
>>
>> C
>>
>> Olivier MJ Crepin-Leblond wrote:
>>> Hello all,
>>>
>>> another update: the first complete run using the new TLDs has
>>> completed!
>>> You can view the results up to February 2016 from
>>> http://www.ipv6matrix.org
>>>
>>> In adding new gTLDs we have hit a snag, although this snag does not
>>> significantly affect overall results since it appears to only affect
>>> a tiny number of domains.
>>>
>>> I am speaking about Internationalized Top Level Domains (IDNs):
>>>
>>> xn--3e0b707e  xn--80adxhks  xn--90ais    xn--j1amh  xn--pgbs0dh 
>>> xn--wgbl6a
>>> xn--4gbrim    xn--80asehdb  xn--d1acj3b  xn--p1ai   xn--q9jyb4c
>>>
>>> Each of these is the ASCII equivalent of a non ASCII domain name.
>>> Whilst the Crawler works well with them and we are able to collect
>>> all of the data pertaining to crawls in IDNs, the program that
>>> builds the Database uses SQLite. Until now, database entries made
>>> use of domain names that were ASCII - but IDNs use a double dash
>>> "--" in the domain. SQLite coughs on DASH - so we have not been able
>>> to produce the database needed for the displaying of the results
>>> when including IDNs.
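
The message does not say where exactly the dash breaks things. One plausible cause, assumed here purely for illustration, is that the TLD is used as an unquoted table or column name: in SQL, "--" starts a comment, so the rest of the statement is ignored. The sketch below shows that failure and a possible workaround, quoting the identifier in double quotes. The helper names are hypothetical.

```python
import sqlite3


def try_create(conn, table_sql_name):
    """Try to create a table; return True on success, False on an SQLite error."""
    try:
        conn.execute(f"CREATE TABLE {table_sql_name} (domain TEXT)")
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False


def quote_identifier(name):
    """Quote an SQL identifier with double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
unquoted_ok = try_create(conn, "xn--p1ai")                  # "--" starts an SQL comment
quoted_ok = try_create(conn, quote_identifier("xn--p1ai"))  # parsed as a single name
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(unquoted_ok, quoted_ok, tables)
```

If the names are only ever stored as values rather than used as identifiers, bound parameters ("?" placeholders) would sidestep the issue in the same way.
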
>>>
>>> Until we have a workaround, I have manually isolated data collected
>>> for IDNs, which means we still collect them, but we will not take
>>> them into account in the final database results. As I have said,
>>> this is a tiny subset of domains: 760 entries out of a total of 1
>>> Million domains.
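
The manual isolation described above could be approximated along these lines. This is a sketch only: it assumes that an IDN entry can be recognised by a TLD label starting with "xn--", and the sample domains are invented. The last lines work out the share quoted in the message.

```python
def split_idn(domains):
    """Partition domains into (ascii_tld, idn_tld) lists based on the TLD label."""
    ascii_tld, idn_tld = [], []
    for domain in domains:
        tld = domain.rsplit(".", 1)[-1].lower()
        (idn_tld if tld.startswith("xn--") else ascii_tld).append(domain)
    return ascii_tld, idn_tld


sample = ["example.com", "example.xn--p1ai", "example.org", "example.xn--90ais"]
kept, isolated = split_idn(sample)
print(kept, isolated)

# Share of IDN entries quoted in the message: 760 out of 1 million domains.
idn_share_percent = 760 / 1_000_000 * 100
print(f"{idn_share_percent:.3f}%")  # prints 0.076%
```
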
>>>
>>> I am *still* drafting a very long article for RIPE labs. In fact, we
>>> might publish this in two parts. In the meantime, the results appear
>>> to be somewhat consistent with results of other tracking projects,
>>> some of which use other methods to track IPv6 adoption:
>>>
>>> - http://6lab.cisco.com/stats/
>>> - https://www.vyncke.org/ipv6status/
>>> - http://www.mrp.net/ipv6_survey/
>>>
>>> We now have 306 Gb of comma separated value text data in store,
>>> tracing back the spread of the IPv6 Internet since July 2010. 
>>> (294Gb in November 2015)
>>>
>>> I look forward to your kind feedback.
>>>
>>> Warmest regards,
>>> Olivier
>>>
>>>
>>> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>>>> Hello all,
>>>>
>>>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>>>> http://www.ipv6matrix.org ):
>>>>
>>>> 1. I have updated the Web site with the latest results ending in
>>>> late October - hence noting a Crawl display date of November 2015.
>>>> We now have 294 Gb of comma separated value text data in store,
>>>> tracing back the spread of the IPv6 Internet since July 2010.
>>>> Altogether, we ran the test approximately 36 times on all of
>>>> Alexa's 1 million busiest domain names. This represented testing of
>>>> about 6.5 million hosts, carefully collecting traceroute information
>>>> for each and every one of them. We now have a unique database that is
>>>> showing the spread of the IPv6 Internet information sources worldwide.
>>>>
>>>> 2. Today I took out my very dusty Linux & Python gloves and
>>>> performed a much needed update to the IPv6 Matrix Crawler input
>>>> database, including the Alexa 1 million list as well as GeoIP
>>>> Databases.
>>>>
>>>> Indeed, the Alexa database of the world's 1 million busiest Web
>>>> sites dated from the Crawler's first inception in the first half of
>>>> 2010.
>>>> We're more than 5 years later!
>>>>
>>>> In a way, keeping the same input database has kept the base of
>>>> crawls steady, which made it possible to compare results.
>>>> However, the flip-side of the coin is that we are ending up with
>>>> more and more domain names marked as being dysfunctional. Nearly 5%
>>>> of the domain names in the database were unreachable. The updated
>>>> input database should resolve this, but we might also see a jump in
>>>> some results. It will be interesting to see what the next run yields.
>>>> Why do we not update the input database more often? Because buried
>>>> in that database are the domain names of the people who wanted to
>>>> opt out over the years. Having never thought about this, I spent
>>>> several hours tracing back 5 years of emails of people complaining
>>>> about the crawl triggering their firewalls. I put together a
>>>> blacklist of domain names I have manually deleted from the crawl
>>>> input files.
>>>> The blacklist, as it stands now:
>>>>
>>>> Deleted:
>>>>
>>>> it-mate.co.uk
>>>> indianic.com
>>>> your-server.de
>>>> catacombscds.com
>>>> dewlance.com
>>>> tcs.com
>>>> printweb.de
>>>> nocser.net
>>>> shoppingnsales.com
>>>> bsaadmail.com
>>>> epayservice.ru
>>>> 4footyfans.com
>>>> guitarspeed99.com
>>>> saga.co.uk
>>>>
>>>> Already gone from the current Alexa list:
>>>>
>>>> infinityautosurf.com
>>>> canada-traffic.com
>>>> usahitz.com
>>>> jawatankosong.com.my
>>>> 4d.com.my
>>>> fitnessuncovered.co.uk
>>>> kualalumpurbookfair.com
>>>> xgen-it.com
>>>> bpanet.de
>>>> edns.de
>>>> back2web.de
>>>> waaaouh.com
>>>> every-web.com
>>>> w3sexe.com
>>>> gratuits-web.com
>>>> france-mateur.com
>>>> pliagedepapier.com
>>>> immobilieretparticuliers.com
>>>> chronobio.com
>>>> stickers-origines.com
>>>> tailor-made.co.uk
>>>>
>>>> With these out of the input files, we are able to start the next
>>>> crawl.
>>>> I hope I have not missed any complaints, but if I have, this is
>>>> advance notice that we might receive a few emails in the
>>>> forthcoming weeks. We might also receive a few emails from sites
>>>> that have appeared on the Alexa 1 million list since 2010.
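
The manual deletion described above could be scripted roughly as follows. This is a sketch under assumptions: the input is taken to be a two-column "rank,domain" CSV, matching is exact on the domain name (subdomains are not handled), and only a few of the opt-out entries listed above are included.

```python
import csv
import io

# A few opt-out entries taken from the list above.
OPT_OUT = {"it-mate.co.uk", "indianic.com", "your-server.de", "saga.co.uk"}


def remove_opt_outs(csv_text, opt_out=OPT_OUT):
    """Return (rank, domain) rows whose domain is not in the opt-out set."""
    blocked = {d.lower() for d in opt_out}
    kept = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        if domain.strip().lower() not in blocked:
            kept.append((int(rank), domain.strip()))
    return kept


sample = "1,example.com\n2,saga.co.uk\n3,example.org\n4,Indianic.com\n"
filtered = remove_opt_outs(sample)
print(filtered)
```

Keeping the opt-out set in a file of its own, separate from the Alexa list, would let the list be refreshed without the opt-outs being lost again.
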
>>>>
>>>> Back to this list, the excellent filtering program which was used
>>>> to process the original list and clean it up was used again for the
>>>> modern list. The Alexa list had a number of domain names which were
>>>> actually sub-directories in the past, as well as some invalid
>>>> domains. Alexa has since tightened its act. The latest Alexa list
>>>> is much cleaner. It holds 999998 valid domains vs. 984587 domains
>>>> for the original 2010 list.
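
The filtering program itself is not shown in this thread. As a hedged illustration of the kind of clean-up described, the sketch below drops entries that contain a path ("sub-directories") or that are not syntactically valid host names. The regular expression and the function name are assumptions, not the project's code.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen; 1-63 chars.
LABEL = r"(?!-)[a-z0-9-]{1,63}(?<!-)"
DOMAIN_RE = re.compile(rf"^(?:{LABEL}\.)+{LABEL}$")


def clean_entries(entries):
    """Keep syntactically valid host names; drop paths and malformed names."""
    valid, rejected = [], []
    for entry in entries:
        name = entry.strip().lower()
        if "/" not in name and len(name) <= 253 and DOMAIN_RE.match(name):
            valid.append(name)
        else:
            rejected.append(entry)
    return valid, rejected


sample = ["example.com", "example.com/blog", "bad_name.com", "-dash.org", "xn--p1ai.example"]
valid, rejected = clean_entries(sample)
print(valid)
print(rejected)
```
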
>>>>
>>>> Finally, new gTLDs have now appeared in the Alexa list, including
>>>> some Internationalised Domain Names (IDNs). The world is indeed a
>>>> very different place!
>>>> It will be interesting to see how the Crawler, as well as all the
>>>> other scripts that process the information into displayable data on
>>>> the Web server, will cope with these:
>>>>
>>>> academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
>>>> accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
>>>> actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
>>>> ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
>>>> adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
>>>> agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
>>>> alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
>>>> amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
>>>> app.csv          date.csv         horse.csv          parts.csv        team.csv
>>>> archi.csv        dating.csv       host.csv           party.csv        tech.csv
>>>> associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
>>>> attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
>>>> auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
>>>> audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
>>>> axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
>>>> barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
>>>> barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
>>>> bar.csv          direct.csv       investments.csv    place.csv        tools.csv
>>>> bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
>>>> bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
>>>> beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
>>>> berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
>>>> best.csv         earth.csv        kim.csv            press.csv        training.csv
>>>> bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
>>>> bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
>>>> bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
>>>> black.csv        energy.csv       krd.csv            property.csv     uol.csv
>>>> blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
>>>> blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
>>>> bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
>>>> boo.csv          events.csv       legal.csv          recipes.csv      video.csv
>>>> boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
>>>> brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
>>>> build.csv        exposed.csv      link.csv           reise.csv        wales.csv
>>>> builders.csv     express.csv      live.csv           reisen.csv       wang.csv
>>>> business.csv     fail.csv         lol.csv            ren.csv          watch.csv
>>>> buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
>>>> bzh.csv          farm.csv         love.csv           repair.csv       website.csv
>>>> cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
>>>> camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
>>>> camp.csv         fishing.csv      mango.csv          review.csv       win.csv
>>>> capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
>>>> cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
>>>> care.csv         flights.csv      markets.csv        rip.csv          works.csv
>>>> career.csv       foo.csv          media.csv          rocks.csv        world.csv
>>>> careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
>>>> casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
>>>> cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
>>>> casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
>>>> center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
>>>> ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
>>>> chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
>>>> church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
>>>> city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
>>>> claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
>>>> click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
>>>> clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
>>>> clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
>>>> club.csv         global.csv       nexus.csv          social.csv       xyz.csv
>>>> coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
>>>> codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
>>>> coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
>>>> college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
>>>> community.csv    google.csv       nyc.csv            space.csv        zone.csv
>>>> company.csv      graphics.csv     office.csv         style.csv
>>>> computer.csv     gratis.csv       okinawa.csv        sucks.csv
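
The listing above suggests one CSV file per TLD. How the project actually splits its input is not described here, so the following is only a sketch of how domains might be grouped into `<tld>.csv` file names. The function name and sample domains are invented.

```python
from collections import defaultdict


def group_by_tld(domains):
    """Map '<tld>.csv' file names to the domains that fall under that TLD."""
    groups = defaultdict(list)
    for domain in domains:
        tld = domain.rstrip(".").rsplit(".", 1)[-1].lower()
        groups[f"{tld}.csv"].append(domain)
    return dict(groups)


sample = ["example.academy", "example.berlin", "another.berlin", "example.xn--p1ai"]
files = group_by_tld(sample)
for name in sorted(files):
    print(name, len(files[name]))
```
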
>>>>
>>>> In the meantime I'd like to cite again the Nile University Crew
>>>> expertly led by Sameh El Ansary for designing and coding a
>>>> Crawler that has been able to cope with sifting through 5 years of
>>>> DNS junk with minimal maintenance, save the love and attention I
>>>> give the servers by keeping them up to date with patches so they
>>>> don't end up toppling over. They haven't been rebooted in 464 days
>>>> and I am crossing fingers for their well-being.
>>>> And of course, thanks to the University of Southampton Crew who
>>>> built the excellent 2nd version of the Web Site under Tim Chown's
>>>> supervision.
>>>>
>>>> I am still writing an article for RIPE Labs - just struggling to
>>>> find the time to finish it, but getting there.
>>>>
>>>> Warmest regards,
>>>>
>>>> Olivier
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> IPv6crawler-wg mailing list
>>>> IPv6crawler-wg at gih.co.uk
>>>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>>>
>>> -- 
>>> Olivier MJ Crépin-Leblond, PhD
>>> http://www.gih.com/ocl.html
>>
>> -- 
>> Christian de Larrinaga  FBCS, CITP,
>> -------------------------
>> @ FirstHand
>> -------------------------
>> +44 7989 386778
>> cdel at firsthand.net
>> -------------------------
>>
>
> -- 
> Olivier MJ Crépin-Leblond, PhD
> http://www.gih.com/ocl.html

-- 
Christian de Larrinaga  FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------
