[IPv6crawler-wg] An important update about the IPv6 Matrix Project
Christian de Larrinaga
cdel at firsthand.net
Sat Feb 6 17:53:47 GMT 2016
ahh I know that list ;-)
Olivier MJ Crepin-Leblond wrote:
> Hello Christian,
>
> the SQLite database comes in when it comes down to displaying the
> results. The results of the crawls are in native CSV, all 306 Gb of
> them. The SQLite database is much smaller as it only uses a subset of
> all data collected (the data which is used in the GUI), and we are not
> using a single SQLite database but one for each crawl: a summary of
> each crawl for each TLD.
> The question of Sqlite v3 is a good one -- and I have unfortunately
> got no idea whether it would work or whether it would break things. To
> be added to the list of things to do.
> Kindest regards,
>
> Olivier
>
> On 06/02/2016 18:09, Christian de Larrinaga wrote:
>> That is a humungous large sqlite database! or are you only collecting
>> the data as a form of cache using sqlite and then exporting it out
>> once organised into csv?
>>
>> Sqlite v3 supports utf-8 which might help?
>> if it doesn't break something else of course.
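On the UTF-8 question, here is a minimal sketch of storing the Unicode form of a TLD next to its ASCII (xn--) form, using Python's built-in sqlite3 module, which wraps SQLite 3. The table name "tld" and its columns are made up for illustration and are not the project's schema. Decoding uses Python's standard "idna" codec.

```python
import sqlite3


def to_unicode(ace_label: str) -> str:
    """Decode an ASCII-compatible (xn--) label to its Unicode form."""
    return ace_label.encode("ascii").decode("idna")


conn = sqlite3.connect(":memory:")
# Hypothetical table: ASCII form as the key, Unicode form alongside it.
conn.execute("CREATE TABLE tld (ace TEXT PRIMARY KEY, unicode_form TEXT)")

for ace in ("xn--p1ai", "xn--90ais", "com"):
    # Values are bound as parameters, so dashes and non-ASCII text are safe.
    conn.execute("INSERT INTO tld VALUES (?, ?)", (ace, to_unicode(ace)))

rows = conn.execute("SELECT ace, unicode_form FROM tld ORDER BY ace").fetchall()
for ace, unicode_form in rows:
    print(ace, unicode_form)
```

Values bound as parameters are never parsed as SQL, so in this sketch the dashes and the Unicode text cause no trouble.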
>>
>> C
>>
>> Olivier MJ Crepin-Leblond wrote:
>>> Hello all,
>>>
>>> another update: the first complete run using the new TLDs has
>>> completed!
>>> You can view the results up to February 2016 from
>>> http://www.ipv6matrix.org
>>>
>>> In adding new gTLDs we have hit a snag, although this snag does not
>>> significantly affect overall results since it appears to only affect
>>> a tiny number of domains.
>>>
>>> I am speaking about Internationalized Top Level Domains (IDNs):
>>>
>>> xn--3e0b707e xn--80adxhks xn--90ais xn--j1amh xn--pgbs0dh
>>> xn--wgbl6a
>>> xn--4gbrim xn--80asehdb xn--d1acj3b xn--p1ai xn--q9jyb4c
>>>
>>> Each of these is the ASCII equivalent of a non-ASCII domain name.
>>> Whilst the Crawler works well with them and we are able to collect
>>> all of the data pertaining to crawls in IDNs, the program that
>>> builds the database uses SQLite. Until now, database entries made
>>> use of domain names that were ASCII, but IDNs use a double dash
>>> "--" in the domain. SQLite coughs on the dash, so we have not been
>>> able to produce the database needed for displaying the results
>>> when including IDNs.
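One possible cause, sketched under an assumption: if the database builder puts the TLD name directly into the SQL text as a table name (the thread does not say that it does), then "--" starts an SQL comment and the statement is cut short. Wrapping the name in double quotes avoids that. The helper quote_identifier and the table layout below are hypothetical.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Double-quote an SQL identifier, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")
tld = "xn--p1ai"

# Unquoted: everything from "--" onwards is treated as a comment,
# so SQLite only sees "CREATE TABLE xn" and rejects it.
try:
    conn.execute(f"CREATE TABLE {tld} (domain TEXT, has_aaaa INTEGER)")
    unquoted_ok = True
except sqlite3.Error:
    unquoted_ok = False

# Quoted: the whole name, dashes included, is taken as one identifier.
table = quote_identifier(tld)
conn.execute(f"CREATE TABLE {table} (domain TEXT, has_aaaa INTEGER)")
conn.execute(f"INSERT INTO {table} VALUES (?, ?)", ("example.xn--p1ai", 1))
count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
print(unquoted_ok, count)
```

If the dashes only ever appear in values rather than in table or column names, binding them as parameters, as in the INSERT above, should be enough.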
>>>
>>> Until we have a workaround, I have manually isolated data collected
>>> for IDNs, which means we still collect them, but we will not take
>>> them into account in the final database results. As I have said,
>>> this is a tiny subset of domains: 760 entries out of a total of 1
>>> Million domains.
>>>
>>> I am *still* drafting a very long article for RIPE Labs. In fact, we
>>> might publish this in two parts. In the meantime, the results appear
>>> to be somewhat consistent with results of other tracking projects,
>>> some of which use other methods to track IPv6 adoption:
>>>
>>> - http://6lab.cisco.com/stats/
>>> - https://www.vyncke.org/ipv6status/
>>> - http://www.mrp.net/ipv6_survey/
>>>
>>> We now have 306 Gb of comma separated value text data in store,
>>> tracing back the spread of the IPv6 Internet since July 2010.
>>> (294Gb in November 2015)
>>>
>>> I look forward to your kind feedback.
>>>
>>> Warmest regards,
>>> Olivier
>>>
>>>
>>> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>>>> Hello all,
>>>>
>>>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>>>> http://www.ipv6matrix.org ):
>>>>
>>>> 1. I have updated the Web site with the latest results ending in
>>>> late October - hence noting a Crawl display date of November 2015.
>>>> We now have 294 Gb of comma separated value text data in store,
>>>> tracing back the spread of the IPv6 Internet since July 2010.
>>>> Altogether, we ran the test approximately 36 times on all 1 million
>>>> Alexa busiest domain names. This represented testing of about 6.5
>>>> million hosts, carefully collecting traceroute information for each
>>>> and every one of them. We now have a unique database that shows
>>>> the spread of IPv6 Internet information sources worldwide.
>>>>
>>>> 2. Today I took out my very dusty Linux & Python gloves and
>>>> performed a much needed update to the IPv6 Matrix Crawler input
>>>> database, including the Alexa 1 million list as well as GeoIP
>>>> Databases.
>>>>
>>>> Indeed, the Alexa database of the world's 1 million busiest Web
>>>> sites dated from the Crawler's first inception in the first half of
>>>> 2010.
>>>> We're more than 5 years later!
>>>>
>>>> In a way, keeping the same input database has kept the base of the
>>>> crawls steady, which made it possible to compare results over time.
>>>> However, the flip-side of the coin is that we are ending up with
>>>> more and more domain names marked as being dysfunctional. Nearly 5%
>>>> of the domain names in the database were unreachable. The updated
>>>> input database should resolve this, but we might also see a jump in
>>>> some results. It will be interesting to see what the next run yields.
>>>> Why do we not update the input database more often? Because buried
>>>> in that database are the domain names of the people who wanted to
>>>> opt out over the years. Having never thought about this, I spent
>>>> several hours tracing back 5 years of emails of people complaining
>>>> about the crawl triggering their firewalls. I put together a
>>>> blacklist of domain names I have manually deleted from the crawl
>>>> input files.
>>>> The blacklist, as it stands now:
>>>>
>>>> Deleted:
>>>>
>>>> it-mate.co.uk
>>>> indianic.com
>>>> your-server.de
>>>> catacombscds.com
>>>> dewlance.com
>>>> tcs.com
>>>> printweb.de
>>>> nocser.net
>>>> shoppingnsales.com
>>>> bsaadmail.com
>>>> epayservice.ru
>>>> 4footyfans.com
>>>> guitarspeed99.com
>>>> saga.co.uk
>>>>
>>>> Already gone from the current Alexa list:
>>>>
>>>> infinityautosurf.com
>>>> canada-traffic.com
>>>> usahitz.com
>>>> jawatankosong.com.my
>>>> 4d.com.my
>>>> fitnessuncovered.co.uk
>>>> kualalumpurbookfair.com
>>>> xgen-it.com
>>>> bpanet.de
>>>> edns.de
>>>> back2web.de
>>>> waaaouh.com
>>>> every-web.com
>>>> w3sexe.com
>>>> gratuits-web.com
>>>> france-mateur.com
>>>> pliagedepapier.com
>>>> immobilieretparticuliers.com
>>>> chronobio.com
>>>> stickers-origines.com
>>>> tailor-made.co.uk
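The manual deletion described above could be scripted along these lines. This is only a sketch: it assumes the opt-out list and the crawl input can both be read as one domain per line, which the thread does not confirm, and the function names are invented for illustration.

```python
def load_blacklist(lines):
    """Build a set of opted-out domains, ignoring blank lines and case."""
    return {line.strip().lower() for line in lines if line.strip()}


def filter_domains(domains, blacklist):
    """Return the domains that are not on the opt-out list, in input order."""
    return [d.strip() for d in domains if d.strip().lower() not in blacklist]


# A few entries taken from the list above.
blacklist = load_blacklist(["it-mate.co.uk", "tcs.com", "", "saga.co.uk"])

# Hypothetical crawl input, one domain per line.
crawl_input = ["example.com", "TCS.com", "saga.co.uk", "example.org"]

kept = filter_domains(crawl_input, blacklist)
print(kept)
```

Keeping the opt-out list in its own file would let it be re-applied each time the Alexa list is refreshed, rather than being buried in the input database.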
>>>>
>>>> With these out of the input files, we are able to start the next
>>>> crawl.
>>>> *I hope I have not missed any complaints, but if I have, this is
>>>> advance notice that we might receive a few emails in the
>>>> forthcoming weeks. We might also receive a few emails from sites
>>>> that have appeared on the Alexa 1 million list since 2010.*
>>>>
>>>> Back to this list: the excellent filtering program which was used
>>>> to process the original list and clean it up was used again for the
>>>> new list. In the past, the Alexa list contained a number of entries
>>>> which were actually sub-directories rather than domain names, as
>>>> well as some invalid domains. Alexa has since cleaned up its act and
>>>> the latest list is much cleaner: it holds 999998 valid domains vs.
>>>> 984587 for the original 2010 list.
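The filtering program itself is not shown in this thread. As a rough illustration of the kind of check involved, the sketch below drops entries that contain a path (the "sub-directory" case) or that are not syntactically valid host names. The rules and names here are assumptions and may differ from what the real filter does.

```python
import re

# One DNS label: letters, digits and hyphens, 1 to 63 characters,
# not starting or ending with a hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_valid_domain(entry: str) -> bool:
    """Return True if the entry looks like a bare domain name."""
    entry = entry.strip().lower()
    if "/" in entry or len(entry) > 253:
        return False  # a path component means it is not a bare domain
    labels = entry.split(".")
    return len(labels) >= 2 and all(LABEL.match(label) for label in labels)


def clean(entries):
    """Keep only entries that pass is_valid_domain, normalised to lower case."""
    return [e.strip().lower() for e in entries if is_valid_domain(e)]


sample = ["Example.com", "example.com/blog", "bad..com", "-bad.com", "site.xn--p1ai"]
cleaned = clean(sample)
print(cleaned)
```

IDN entries in their xn-- form pass this check unchanged, since they consist only of letters, digits and hyphens.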
>>>>
>>>> Finally, new gTLDs have now appeared in the Alexa list, including
>>>> some Internationalised Domain Names (IDNs). The world is indeed a
>>>> very different place!
>>>> It will be interesting to see how the Crawler, as well as all the
>>>> other scripts that process the information into displayable data on
>>>> the Web server, will cope with these:
>>>>
>>>> academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
>>>> accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
>>>> actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
>>>> ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
>>>> adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
>>>> agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
>>>> alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
>>>> amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
>>>> app.csv          date.csv         horse.csv          parts.csv        team.csv
>>>> archi.csv        dating.csv       host.csv           party.csv        tech.csv
>>>> associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
>>>> attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
>>>> auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
>>>> audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
>>>> axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
>>>> barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
>>>> barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
>>>> bar.csv          direct.csv       investments.csv    place.csv        tools.csv
>>>> bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
>>>> bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
>>>> beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
>>>> berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
>>>> best.csv         earth.csv        kim.csv            press.csv        training.csv
>>>> bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
>>>> bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
>>>> bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
>>>> black.csv        energy.csv       krd.csv            property.csv     uol.csv
>>>> blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
>>>> blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
>>>> bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
>>>> boo.csv          events.csv       legal.csv          recipes.csv      video.csv
>>>> boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
>>>> brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
>>>> build.csv        exposed.csv      link.csv           reise.csv        wales.csv
>>>> builders.csv     express.csv      live.csv           reisen.csv       wang.csv
>>>> business.csv     fail.csv         lol.csv            ren.csv          watch.csv
>>>> buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
>>>> bzh.csv          farm.csv         love.csv           repair.csv       website.csv
>>>> cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
>>>> camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
>>>> camp.csv         fishing.csv      mango.csv          review.csv       win.csv
>>>> capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
>>>> cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
>>>> care.csv         flights.csv      markets.csv        rip.csv          works.csv
>>>> career.csv       foo.csv          media.csv          rocks.csv        world.csv
>>>> careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
>>>> casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
>>>> cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
>>>> casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
>>>> center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
>>>> ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
>>>> chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
>>>> church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
>>>> city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
>>>> claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
>>>> click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
>>>> clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
>>>> clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
>>>> club.csv         global.csv       nexus.csv          social.csv       xyz.csv
>>>> coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
>>>> codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
>>>> coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
>>>> college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
>>>> community.csv    google.csv       nyc.csv            space.csv        zone.csv
>>>> company.csv      graphics.csv     office.csv         style.csv
>>>> computer.csv     gratis.csv       okinawa.csv        sucks.csv
>>>>
>>>> In the meantime I'd like to cite again the Nile University Crew
>>>> expertly led by Sameh El Ansary for designing and coding a
>>>> Crawler that has been able to cope with sifting through 5 years of
>>>> DNS junk with minimal maintenance, save the love and attention I
>>>> give the servers by keeping them up to date with patches so they
>>>> don't end up toppling over. They haven't been rebooted in 464 days
>>>> and I am crossing fingers for their well-being.
>>>> And of course, thanks to the University of Southampton Crew who
>>>> built the excellent 2nd version of the Web Site under Tim Chown's
>>>> supervision.
>>>>
>>>> I am still writing an article for RIPE Labs - just struggling to
>>>> find the time to finish it, but getting there.
>>>>
>>>> Warmest regards,
>>>>
>>>> Olivier
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> IPv6crawler-wg mailing list
>>>> IPv6crawler-wg at gih.co.uk
>>>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>>>
>>> --
>>> Olivier MJ Crépin-Leblond, PhD
>>> http://www.gih.com/ocl.html
--
Christian de Larrinaga FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------