[IPv6crawler-wg] An important update about the IPv6 Matrix Project
Christian de Larrinaga
cdel at firsthand.net
Sun Feb 7 22:41:10 GMT 2016
Actually, looking into whether to formalise the Matrix as a web
observatory is not a bad idea. Should I ask Thanassis or Wendy?
Christian
Tim Chown wrote:
> Hi,
>
> This is pretty cool, and the db is slowly but surely making its way
> into Big Data territory ;)
>
> The internationalised domain name problem is also interesting.
> Christian’s solution sounds good.
>
> It seems timely to have another push on both virtualising the system
> (so we can run it from other vantage points) and distributing the data
> / results to minimise any potential of any loss to the increasingly
> valuable data set.
>
> This might fit well with the growing web observatory activity in
> Southampton. We also have a new highly resilient data centre in
> Fareham which could be a good place for the virtualised copy to be
> hosted. If you’re OK with it, I can make some contacts to initiate,
> but let me know.
>
> Tim
>
>
>
>> On 6 Feb 2016, at 17:23, Olivier MJ Crepin-Leblond <ocl at gih.com> wrote:
>>
>> Hello Christian,
>>
>> The SQLite database comes in when it comes to displaying the
>> results. The results of the crawls are in native CSV, all 306 GB of
>> them. The SQLite database is much smaller, as it uses only a subset
>> of all the data collected (the data which is used in the GUI), and we
>> are not using a single SQLite database but one for each crawl - a
>> summary of each crawl for each TLD.
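[The CSV-to-SQLite summarisation Olivier describes can be sketched roughly as follows. This is a minimal illustration only, not the project's actual code: the column names (`domain`, `has_aaaa`) and the `summarise_crawl` helper are assumptions, and a real per-crawl database would be a file rather than in-memory.]

```python
import csv
import io
import sqlite3

def summarise_crawl(csv_text):
    """Reduce raw per-host crawl rows to one (tld, v6_hosts, total)
    summary row per TLD, stored in its own small SQLite database."""
    db = sqlite3.connect(":memory:")  # one DB per crawl in this sketch
    db.execute("CREATE TABLE summary (tld TEXT PRIMARY KEY,"
               " v6_hosts INTEGER, total INTEGER)")
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        tld = row["domain"].rsplit(".", 1)[-1]
        v6, total = totals.get(tld, (0, 0))
        totals[tld] = (v6 + (row["has_aaaa"] == "1"), total + 1)
    db.executemany("INSERT INTO summary VALUES (?, ?, ?)",
                   [(t, v6, n) for t, (v6, n) in totals.items()])
    return db

raw = "domain,has_aaaa\nexample.com,1\ntest.com,0\ndemo.org,1\n"
db = summarise_crawl(raw)
print(db.execute(
    "SELECT v6_hosts, total FROM summary WHERE tld='com'").fetchone())
# → (1, 2)
```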
>> The question of SQLite v3 is a good one, and unfortunately I have
>> no idea whether it would work or whether it would break things.
>> To be added to the list of things to do.
>> Kindest regards,
>>
>> Olivier
>>
>> On 06/02/2016 18:09, Christian de Larrinaga wrote:
>>> That is a humongous SQLite database! Or are you only collecting
>>> the data in SQLite as a form of cache and then exporting it to
>>> CSV once organised?
>>>
>>> SQLite v3 supports UTF-8, which might help?
>>> If it doesn't break something else, of course.
>>>
>>> C
>>>
>>> Olivier MJ Crepin-Leblond wrote:
>>>> Hello all,
>>>>
>>>> another update: the first complete run using the new TLDs has
>>>> completed!
>>>> You can view the results up to February 2016 from
>>>> http://www.ipv6matrix.org
>>>>
>>>> In adding new gTLDs we have hit a snag, although it does not
>>>> significantly affect overall results, since it appears to affect
>>>> only a tiny number of domains.
>>>>
>>>> I am speaking about Internationalised Domain Names (IDNs) at the top level:
>>>>
>>>> xn--3e0b707e xn--80adxhks xn--90ais xn--j1amh xn--pgbs0dh
>>>> xn--wgbl6a
>>>> xn--4gbrim xn--80asehdb xn--d1acj3b xn--p1ai xn--q9jyb4c
>>>>
>>>> Each of these is the ASCII equivalent of a non-ASCII domain name.
>>>> Whilst the Crawler works well with them and we are able to collect
>>>> all of the data pertaining to crawls in IDNs, the program that
>>>> builds the database uses SQLite. Until now, database entries made
>>>> use of domain names that were pure ASCII - but IDNs use a double
>>>> dash "--" in the domain. SQLite chokes on the dash, so we have not
>>>> been able to produce the database needed to display the results
>>>> when IDNs are included.
>>>>
>>>> Until we have a workaround, I have manually isolated the data
>>>> collected for IDNs, which means we still collect it, but we will
>>>> not take it into account in the final database results. As I have
>>>> said, this is a tiny subset of domains: 760 entries out of a total
>>>> of 1 million domains.
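[For reference, each xn-- name above is the Punycode "A-label" form of a Unicode domain, and Python's standard library can convert between the two with its built-in `idna` codec. A quick sketch, using one label from the list above:]

```python
# Decode an IDN "A-label" (ASCII/Punycode form) to its Unicode "U-label"
# using Python's built-in "idna" codec, and encode it back again.
unicode_tld = b"xn--p1ai".decode("idna")  # the Russian Federation TLD
ascii_tld = "рф".encode("idna")
print(unicode_tld, ascii_tld)  # → рф b'xn--p1ai'
```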
>>>>
>>>> I am *still* drafting a very long article for RIPE Labs. In fact,
>>>> we might publish it in two parts. In the meantime, the results
>>>> appear to be broadly consistent with the results of other tracking
>>>> projects, some of which use other methods to track IPv6 adoption:
>>>>
>>>> - http://6lab.cisco.com/stats/
>>>> - https://www.vyncke.org/ipv6status/
>>>> - http://www.mrp.net/ipv6_survey/
>>>>
>>>> We now have 306 GB of comma-separated value text data in store,
>>>> tracing the spread of the IPv6 Internet back to July 2010.
>>>> (294 GB in November 2015)
>>>>
>>>> I look forward to your kind feedback.
>>>>
>>>> Warmest regards,
>>>> Olivier
>>>>
>>>>
>>>> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>>>>> Hello all,
>>>>>
>>>>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>>>>> http://www.ipv6matrix.org ):
>>>>>
>>>>> 1. I have updated the Web site with the latest results, ending in
>>>>> late October - hence a Crawl display date of November 2015.
>>>>> We now have 294 GB of comma-separated value text data in store,
>>>>> tracing the spread of the IPv6 Internet back to July 2010.
>>>>> Altogether, we have run the test approximately 36 times on all 1
>>>>> million of Alexa's busiest domain names. This represents testing
>>>>> of about 6.5 million hosts, carefully collecting traceroute
>>>>> information for each and every one of them. We now have a unique
>>>>> database showing the spread of IPv6 Internet information sources
>>>>> worldwide.
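[At its simplest, the per-host test described above amounts to asking whether a name resolves over IPv6 at all. A minimal standard-library sketch of that one check - the project's actual probes, including the traceroutes, are of course far more involved:]

```python
import socket

def has_ipv6_address(hostname):
    """Return True if the name resolves to at least one IPv6 address."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET6)
    except socket.gaierror:
        # No AAAA record, NXDOMAIN, or resolution failure.
        return False
    return len(infos) > 0

# .invalid is reserved (RFC 2606) and is guaranteed never to resolve.
print(has_ipv6_address("name.invalid"))  # → False
```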
>>>>>
>>>>> 2. Today I took out my very dusty Linux & Python gloves and
>>>>> performed a much needed update to the IPv6 Matrix Crawler input
>>>>> database, including the Alexa 1 million list as well as GeoIP
>>>>> Databases.
>>>>>
>>>>> Indeed, the Alexa database of the world's 1 million busiest Web
>>>>> sites dated from the Crawler's inception in the first half of 2010.
>>>>> We're now more than 5 years later!
>>>>>
>>>>> In a way, keeping the same input database kept the crawl baseline
>>>>> steady, which made it possible to compare results over time.
>>>>> However, the flip-side of the coin is that we ended up with more
>>>>> and more domain names marked as dysfunctional: nearly 5% of the
>>>>> domain names in the database were unreachable. The updated input
>>>>> database should resolve this, but we might also see a jump in some
>>>>> results. It will be interesting to see what the next run yields.
>>>>> Why do we not update the input database more often? Because buried
>>>>> in that database are the domain names of the people who have asked
>>>>> to opt out over the years. Having never catalogued these, I spent
>>>>> several hours tracing back through 5 years of emails from people
>>>>> complaining about the crawl triggering their firewalls, and put
>>>>> together a blacklist of domain names which I have manually deleted
>>>>> from the crawl input files.
>>>>> The blacklist, as it stands now:
>>>>>
>>>>> Deleted:
>>>>>
>>>>> it-mate.co.uk
>>>>> indianic.com
>>>>> your-server.de
>>>>> catacombscds.com
>>>>> dewlance.com
>>>>> tcs.com
>>>>> printweb.de
>>>>> nocser.net
>>>>> shoppingnsales.com
>>>>> bsaadmail.com
>>>>> epayservice.ru
>>>>> 4footyfans.com
>>>>> guitarspeed99.com
>>>>> saga.co.uk
>>>>>
>>>>> Already gone from the current Alexa list:
>>>>>
>>>>> infinityautosurf.com
>>>>> canada-traffic.com
>>>>> usahitz.com
>>>>> jawatankosong.com.my
>>>>> 4d.com.my
>>>>> fitnessuncovered.co.uk
>>>>> kualalumpurbookfair.com
>>>>> xgen-it.com
>>>>> bpanet.de
>>>>> edns.de
>>>>> back2web.de
>>>>> waaaouh.com
>>>>> every-web.com
>>>>> w3sexe.com
>>>>> gratuits-web.com
>>>>> france-mateur.com
>>>>> pliagedepapier.com
>>>>> immobilieretparticuliers.com
>>>>> chronobio.com
>>>>> stickers-origines.com
>>>>> tailor-made.co.uk
>>>>>
>>>>> With these out of the input files, we are able to start the next
>>>>> crawl.
>>>>> I hope I have not missed any complaints, but if I have, this is
>>>>> advance notice that we might receive a few emails in the
>>>>> forthcoming weeks. We might also receive a few emails from sites
>>>>> that have appeared on the Alexa 1 million list since 2010.
>>>>>
>>>>> Back to this list: the excellent filtering program which was used
>>>>> to process and clean up the original list was used again on the
>>>>> new list. The old Alexa list had a number of domain names which
>>>>> were actually sub-directories, as well as some invalid domains.
>>>>> Alexa has since tightened its act, and the latest list is much
>>>>> cleaner. It holds 999,998 valid domains vs. 984,587 domains
>>>>> for the original 2010 list.
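[The clean-up that filtering program performs can be approximated as: discard entries that carry a path component (a sub-directory, not a domain) or that fail a basic hostname syntax check. A rough sketch under those assumptions - the real program's rules are not shown in this thread:]

```python
import re

# Conservative hostname check: dot-separated labels of letters, digits,
# and hyphens, with no leading or trailing hyphen, and at least one dot.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
VALID = re.compile(rf"^{LABEL}(?:\.{LABEL})+$", re.IGNORECASE)

def clean(entries):
    kept = []
    for e in entries:
        e = e.strip().lower()
        if "/" in e:  # a sub-directory path, not a domain name
            continue
        if VALID.match(e):
            kept.append(e)
    return kept

print(clean(["example.com", "example.com/forum", "-bad-.com",
             "ok.xn--p1ai"]))
# → ['example.com', 'ok.xn--p1ai']
```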
>>>>>
>>>>> Finally, new gTLDs have now appeared in the Alexa list, including
>>>>> some Internationalised Domain Names (IDNs). The world is indeed a
>>>>> very different place!
>>>>> It will be interesting to see how the Crawler, as well as all the
>>>>> other scripts that process the information into displayable data
>>>>> on the Web server, will cope with these:
>>>>>
>>>>> academy.csv consulting.csv guide.csv one.csv supply.csv
>>>>> accountant.csv contractors.csv guru.csv onl.csv support.csv
>>>>> actor.csv cool.csv hamburg.csv online.csv surf.csv
>>>>> ads.csv country.csv haus.csv ooo.csv swiss.csv
>>>>> adult.csv creditcard.csv healthcare.csv orange.csv sydney.csv
>>>>> agency.csv cricket.csv help.csv ovh.csv systems.csv
>>>>> alsace.csv cymru.csv hiphop.csv paris.csv taipei.csv
>>>>> amsterdam.csv dance.csv holiday.csv partners.csv tattoo.csv
>>>>> app.csv date.csv horse.csv parts.csv team.csv
>>>>> archi.csv dating.csv host.csv party.csv tech.csv
>>>>> associates.csv deals.csv hosting.csv photo.csv technology.csv
>>>>> attorney.csv delivery.csv house.csv photography.csv theater.csv
>>>>> auction.csv desi.csv how.csv photos.csv tienda.csv
>>>>> audio.csv design.csv immobilien.csv pics.csv tips.csv
>>>>> axa.csv dev.csv immo.csv pictures.csv tirol.csv
>>>>> barclaycard.csv diet.csv ink.csv pink.csv today.csv
>>>>> barclays.csv digital.csv international.csv pizza.csv tokyo.csv
>>>>> bar.csv direct.csv investments.csv place.csv tools.csv
>>>>> bargains.csv directory.csv irish.csv plus.csv top.csv
>>>>> bayern.csv discount.csv jetzt.csv poker.csv town.csv
>>>>> beer.csv dog.csv joburg.csv porn.csv toys.csv
>>>>> berlin.csv domains.csv juegos.csv post.csv trade.csv
>>>>> best.csv earth.csv kim.csv press.csv training.csv
>>>>> bid.csv education.csv kitchen.csv prod.csv trust.csv
>>>>> bike.csv email.csv kiwi.csv productions.csv university.csv
>>>>> bio.csv emerck.csv koeln.csv properties.csv uno.csv
>>>>> black.csv energy.csv krd.csv property.csv uol.csv
>>>>> blackfriday.csv equipment.csv kred.csv pub.csv vacations.csv
>>>>> blue.csv estate.csv land.csv quebec.csv vegas.csv
>>>>> bnpparibas.csv eus.csv law.csv realtor.csv ventures.csv
>>>>> boo.csv events.csv legal.csv recipes.csv video.csv
>>>>> boutique.csv exchange.csv life.csv red.csv vision.csv
>>>>> brussels.csv expert.csv limited.csv rehab.csv voyage.csv
>>>>> build.csv exposed.csv link.csv reise.csv wales.csv
>>>>> builders.csv express.csv live.csv reisen.csv wang.csv
>>>>> business.csv fail.csv lol.csv ren.csv watch.csv
>>>>> buzz.csv faith.csv london.csv rentals.csv webcam.csv
>>>>> bzh.csv farm.csv love.csv repair.csv website.csv
>>>>> cab.csv finance.csv luxury.csv report.csv wien.csv
>>>>> camera.csv fish.csv management.csv rest.csv wiki.csv
>>>>> camp.csv fishing.csv mango.csv review.csv win.csv
>>>>> capital.csv fit.csv market.csv reviews.csv windows.csv
>>>>> cards.csv fitness.csv marketing.csv rio.csv work.csv
>>>>> care.csv flights.csv markets.csv rip.csv works.csv
>>>>> career.csv foo.csv media.csv rocks.csv world.csv
>>>>> careers.csv football.csv melbourne.csv ruhr.csv wtf.csv
>>>>> casa.csv forsale.csv menu.csv ryukyu.csv xn--3e0b707e.csv
>>>>> cash.csv foundation.csv microsoft.csv sale.csv xn--4gbrim.csv
>>>>> casino.csv frl.csv moda.csv scb.csv xn--80adxhks.csv
>>>>> center.csv fund.csv moe.csv school.csv xn--80asehdb.csv
>>>>> ceo.csv futbol.csv monash.csv science.csv xn--90ais.csv
>>>>> chat.csv gal.csv money.csv scot.csv xn--d1acj3b.csv
>>>>> church.csv gallery.csv moscow.csv services.csv xn--j1amh.csv
>>>>> city.csv garden.csv movie.csv sexy.csv xn--p1ai.csv
>>>>> claims.csv gent.csv nagoya.csv shiksha.csv xn--pgbs0dh.csv
>>>>> click.csv gift.csv network.csv shoes.csv xn--q9jyb4c.csv
>>>>> clinic.csv gifts.csv new.csv singles.csv xn--wgbl6a.csv
>>>>> clothing.csv glass.csv news.csv site.csv xxx.csv
>>>>> club.csv global.csv nexus.csv social.csv xyz.csv
>>>>> coach.csv globo.csv ngo.csv software.csv yandex.csv
>>>>> codes.csv gmail.csv ninja.csv solar.csv yoga.csv
>>>>> coffee.csv goo.csv nrw.csv solutions.csv yokohama.csv
>>>>> college.csv goog.csv ntt.csv soy.csv youtube.csv
>>>>> community.csv google.csv nyc.csv space.csv zone.csv
>>>>> company.csv graphics.csv office.csv style.csv
>>>>> computer.csv gratis.csv okinawa.csv sucks.csv
>>>>>
>>>>> In the meantime I'd like to credit again the Nile University crew,
>>>>> expertly led by Sameh El Ansary, for designing and coding a Crawler
>>>>> that has been able to cope with sifting through 5 years of DNS junk
>>>>> with minimal maintenance, save the love and attention I give the
>>>>> servers by keeping them up to date with patches so they don't end
>>>>> up toppling over. They haven't been rebooted in 464 days and I am
>>>>> crossing my fingers for their continued well-being.
>>>>> And of course, thanks to the University of Southampton crew who
>>>>> built the excellent 2nd version of the Web site under Tim Chown's
>>>>> supervision.
>>>>>
>>>>> I am still writing an article for RIPE Labs - just struggling to
>>>>> find the time to finish it, but getting there.
>>>>>
>>>>> Warmest regards,
>>>>>
>>>>> Olivier
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> IPv6crawler-wg mailing list
>>>>> IPv6crawler-wg at gih.co.uk
>>>>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>>>>
>>>> --
>>>> Olivier MJ Crépin-Leblond, PhD
>>>> http://www.gih.com/ocl.html
>>>
>>> --
>>> Christian de Larrinaga FBCS, CITP,
>>> -------------------------
>>> @ FirstHand
>>> -------------------------
>>> +44 7989 386778
>>> cdel at firsthand.net
>>> -------------------------
>>>
>>
>> --
>> Olivier MJ Crépin-Leblond, PhD
>> http://www.gih.com/ocl.html
>
--
Christian de Larrinaga FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------
More information about the IPv6crawler-wg
mailing list