[IPv6crawler-wg] An important update about the IPv6 Matrix Project
Christian de Larrinaga
cdel at firsthand.net
Sat Feb 6 17:09:54 GMT 2016
That is a humongous SQLite database! Or are you only using SQLite as a
form of cache for collecting the data, and then exporting it to CSV once
it is organised?
SQLite v3 supports UTF-8, which might help?
If it doesn't break something else, of course.
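
A guess at what might be going on, since the database builder code is not
shown in this thread: the xn-- labels are already plain ASCII, so UTF-8
support on its own may not change anything. But if the TLD ends up as an
unquoted table or column name, SQLite will read "--" as the start of a
comment and the statement is cut short. A minimal sketch of that, assuming a
hypothetical one-table-per-TLD layout (the table and column names are mine,
not the project's):

```python
import sqlite3

# Sample labels taken from the list quoted below, plus one ordinary TLD.
SAMPLE_TLDS = ["xn--p1ai", "xn--90ais", "com"]


def quote_ident(name: str) -> str:
    """Quote an SQL identifier for SQLite (double quotes, doubled if embedded)."""
    return '"' + name.replace('"', '""') + '"'


def create_tld_table(conn: sqlite3.Connection, tld: str) -> None:
    # Hypothetical schema: the real one is not shown in the thread.
    conn.execute(
        f"CREATE TABLE {quote_ident(tld)} (domain TEXT, has_aaaa INTEGER)"
    )


def unquoted_create_fails(conn: sqlite3.Connection, tld: str) -> bool:
    """Return True if the same CREATE TABLE fails without identifier quoting."""
    try:
        conn.execute(f"CREATE TABLE {tld} (domain TEXT, has_aaaa INTEGER)")
    except sqlite3.OperationalError:
        return True
    return False


conn = sqlite3.connect(":memory:")
for tld in SAMPLE_TLDS:
    create_tld_table(conn, tld)

# Values (as opposed to identifiers) go through bound parameters,
# where a double dash is just ordinary text.
conn.execute('INSERT INTO "xn--p1ai" VALUES (?, ?)', ("example.xn--p1ai", 0))

print(unquoted_create_fails(conn, "xn--p1ai"))
```

If the names are only ever stored as values through bound parameters, the
double dash should be harmless, so it would be worth checking which of the
two cases the builder is in.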
C
Olivier MJ Crepin-Leblond wrote:
> Hello all,
>
> another update: the first complete run using the new TLDs has completed!
> You can view the results up to February 2016 from
> http://www.ipv6matrix.org
>
> In adding new gTLDs we have hit a snag, although this snag does not
> significantly affect overall results since it appears to only affect a
> tiny number of domains.
>
> I am speaking about Internationalized Top Level Domains (IDNs):
>
> xn--3e0b707e  xn--80adxhks  xn--90ais    xn--j1amh  xn--pgbs0dh  xn--wgbl6a
> xn--4gbrim    xn--80asehdb  xn--d1acj3b  xn--p1ai   xn--q9jyb4c
>
> Each of these is the ASCII equivalent of a non-ASCII domain name.
> Whilst the Crawler works well with them and we are able to collect all
> of the data pertaining to crawls in IDNs, the program that builds the
> Database uses SQLite. Until now, database entries made use of domain
> names that were ASCII - but IDNs use a double dash "--" in the domain.
> SQLite coughs on DASH - so we have not been able to produce the
> database needed for displaying the results when including IDNs.
>
> Until we have a workaround, I have manually isolated data collected
> for IDNs, which means we still collect them, but we will not take them
> into account in the final database results. As I have said, this is a
> tiny subset of domains: 760 entries out of a total of 1 million domains.
>
> I am *still* drafting a very long article for RIPE Labs. In fact, we
> might publish this in two parts. In the meantime, the results appear
> to be broadly consistent with those of other tracking projects, some
> of which use other methods to track IPv6 adoption:
>
> - http://6lab.cisco.com/stats/
> - https://www.vyncke.org/ipv6status/
> - http://www.mrp.net/ipv6_survey/
>
> We now have 306 GB of comma-separated value text data in store,
> tracing back the spread of the IPv6 Internet since July 2010 (294 GB
> in November 2015).
>
> I look forward to your kind feedback.
>
> Warmest regards,
> Olivier
>
>
> On 26/11/2015 19:32, Olivier MJ Crepin-Leblond wrote:
>> Hello all,
>>
>> Two worthy pieces of news regarding the IPv6 Matrix Project (
>> http://www.ipv6matrix.org ):
>>
>> 1. I have updated the Web site with the latest results ending in late
>> October - hence the Crawl display date of November 2015.
>> We now have 294 GB of comma-separated value text data in store,
>> tracing back the spread of the IPv6 Internet since July 2010.
>> Altogether, we ran the test approximately 36 times on all 1 million
>> of Alexa's busiest domain names. This represented testing of about 6.5
>> million hosts, carefully collecting traceroute information for each
>> and every one of them. We now have a unique database that shows
>> the spread of IPv6 Internet information sources worldwide.
>>
>> 2. Today I took out my very dusty Linux & Python gloves and performed
>> a much needed update to the IPv6 Matrix Crawler input database,
>> including the Alexa 1 million list as well as GeoIP Databases.
>>
>> Indeed, the Alexa database of the world's 1 million busiest Web sites
>> dated from the Crawler's first inception in the first half of 2010.
>> We're more than 5 years later!
>>
>> In a way, keeping the same input database has kept the base of the
>> crawls steady, so results could be compared from run to run. However,
>> the flip-side of the coin is that we are ending up with more and more
>> domain names marked as being dysfunctional. Nearly 5% of the domain
>> names in the database were unreachable. The updated input database
>> should resolve this, but we might also see a jump in some results. It
>> will be interesting to see what the next run yields.
>> Why do we not update the input database more often? Because buried in
>> that database are the domain names of the people who wanted to opt
>> out over the years. Having never thought about this, I spent several
>> hours tracing back 5 years of emails of people complaining about the
>> crawl triggering their firewalls. I put together a blacklist of
>> domain names I have manually deleted from the crawl input files.
>> The blacklist, as it stands now:
>>
>> Deleted:
>>
>> it-mate.co.uk
>> indianic.com
>> your-server.de
>> catacombscds.com
>> dewlance.com
>> tcs.com
>> printweb.de
>> nocser.net
>> shoppingnsales.com
>> bsaadmail.com
>> epayservice.ru
>> 4footyfans.com
>> guitarspeed99.com
>> saga.co.uk
>>
>> Already gone from the current Alexa list:
>>
>> infinityautosurf.com
>> canada-traffic.com
>> usahitz.com
>> jawatankosong.com.my
>> 4d.com.my
>> fitnessuncovered.co.uk
>> kualalumpurbookfair.com
>> xgen-it.com
>> bpanet.de
>> edns.de
>> back2web.de
>> waaaouh.com
>> every-web.com
>> w3sexe.com
>> gratuits-web.com
>> france-mateur.com
>> pliagedepapier.com
>> immobilieretparticuliers.com
>> chronobio.com
>> stickers-origines.com
>> tailor-made.co.uk
>>
>> With these out of the input files, we are able to start the next crawl.
>> *I hope I have not missed any complaints, but if I have, this is
>> advance notice that we might receive a few emails in the forthcoming
>> weeks. We might also receive a few emails from sites that have
>> appeared on the Alexa 1 million list since 2010.*
>>
>> Back to this list: the excellent filtering program that was used to
>> process the original list and clean it up was used again for the
>> modern list. In the past, the Alexa list had a number of entries which
>> were actually sub-directories, as well as some invalid domains. Alexa
>> has since cleaned up its act and the latest list is much cleaner. It
>> holds 999998 valid domains vs. 984587 domains for the original 2010
>> list.
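
For anyone who wants to reproduce this kind of clean-up, here is a rough
sketch of the two steps described above: dropping entries that are not plain
domain names, and dropping the opt-out domains. It is not the project's
filtering program, which is not shown in this thread. The function names and
the exact validity rule (at least two dot-separated labels of letters, digits
and hyphens, no path component) are my assumptions.

```python
import re

# One DNS label: 1-63 letters, digits or hyphens, no leading/trailing hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_plain_domain(entry: str) -> bool:
    """True for 'example.com', False for 'example.com/blog' or malformed names."""
    entry = entry.strip().lower()
    if "/" in entry or len(entry) > 253:
        return False
    labels = entry.split(".")
    return len(labels) >= 2 and all(LABEL.match(label) for label in labels)


def clean(entries, opt_out):
    """Keep valid, non-opted-out domains, lower-cased, in their original order."""
    blocked = {domain.strip().lower() for domain in opt_out}
    seen = set()
    kept = []
    for raw in entries:
        domain = raw.strip().lower()
        if is_plain_domain(domain) and domain not in blocked and domain not in seen:
            seen.add(domain)
            kept.append(domain)
    return kept


sample = ["Example.com", "example.com/blog", "saga.co.uk", "example.xn--p1ai"]
print(clean(sample, opt_out=["saga.co.uk"]))
```

Note that the opt-out check here is an exact match only. Whether sub-domains
of an opted-out domain should also be dropped is a policy decision this
sketch does not make.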
>>
>> Finally, new gTLDs have now appeared in the Alexa list, including
>> some Internationalised Domain Names (IDNs). The world is indeed a
>> very different place!
>> It will be interesting to see how the Crawler, as well as all the
>> other scripts that process the information into displayable data on
>> the Web server, will cope with these:
>>
>> academy.csv      consulting.csv   guide.csv          one.csv          supply.csv
>> accountant.csv   contractors.csv  guru.csv           onl.csv          support.csv
>> actor.csv        cool.csv         hamburg.csv        online.csv       surf.csv
>> ads.csv          country.csv      haus.csv           ooo.csv          swiss.csv
>> adult.csv        creditcard.csv   healthcare.csv     orange.csv       sydney.csv
>> agency.csv       cricket.csv      help.csv           ovh.csv          systems.csv
>> alsace.csv       cymru.csv        hiphop.csv         paris.csv        taipei.csv
>> amsterdam.csv    dance.csv        holiday.csv        partners.csv     tattoo.csv
>> app.csv          date.csv         horse.csv          parts.csv        team.csv
>> archi.csv        dating.csv       host.csv           party.csv        tech.csv
>> associates.csv   deals.csv        hosting.csv        photo.csv        technology.csv
>> attorney.csv     delivery.csv     house.csv          photography.csv  theater.csv
>> auction.csv      desi.csv         how.csv            photos.csv       tienda.csv
>> audio.csv        design.csv       immobilien.csv     pics.csv         tips.csv
>> axa.csv          dev.csv          immo.csv           pictures.csv     tirol.csv
>> barclaycard.csv  diet.csv         ink.csv            pink.csv         today.csv
>> barclays.csv     digital.csv      international.csv  pizza.csv        tokyo.csv
>> bar.csv          direct.csv       investments.csv    place.csv        tools.csv
>> bargains.csv     directory.csv    irish.csv          plus.csv         top.csv
>> bayern.csv       discount.csv     jetzt.csv          poker.csv        town.csv
>> beer.csv         dog.csv          joburg.csv         porn.csv         toys.csv
>> berlin.csv       domains.csv      juegos.csv         post.csv         trade.csv
>> best.csv         earth.csv        kim.csv            press.csv        training.csv
>> bid.csv          education.csv    kitchen.csv        prod.csv         trust.csv
>> bike.csv         email.csv        kiwi.csv           productions.csv  university.csv
>> bio.csv          emerck.csv       koeln.csv          properties.csv   uno.csv
>> black.csv        energy.csv       krd.csv            property.csv     uol.csv
>> blackfriday.csv  equipment.csv    kred.csv           pub.csv          vacations.csv
>> blue.csv         estate.csv       land.csv           quebec.csv       vegas.csv
>> bnpparibas.csv   eus.csv          law.csv            realtor.csv      ventures.csv
>> boo.csv          events.csv       legal.csv          recipes.csv      video.csv
>> boutique.csv     exchange.csv     life.csv           red.csv          vision.csv
>> brussels.csv     expert.csv       limited.csv        rehab.csv        voyage.csv
>> build.csv        exposed.csv      link.csv           reise.csv        wales.csv
>> builders.csv     express.csv      live.csv           reisen.csv       wang.csv
>> business.csv     fail.csv         lol.csv            ren.csv          watch.csv
>> buzz.csv         faith.csv        london.csv         rentals.csv      webcam.csv
>> bzh.csv          farm.csv         love.csv           repair.csv       website.csv
>> cab.csv          finance.csv      luxury.csv         report.csv       wien.csv
>> camera.csv       fish.csv         management.csv     rest.csv         wiki.csv
>> camp.csv         fishing.csv      mango.csv          review.csv       win.csv
>> capital.csv      fit.csv          market.csv         reviews.csv      windows.csv
>> cards.csv        fitness.csv      marketing.csv      rio.csv          work.csv
>> care.csv         flights.csv      markets.csv        rip.csv          works.csv
>> career.csv       foo.csv          media.csv          rocks.csv        world.csv
>> careers.csv      football.csv     melbourne.csv      ruhr.csv         wtf.csv
>> casa.csv         forsale.csv      menu.csv           ryukyu.csv       xn--3e0b707e.csv
>> cash.csv         foundation.csv   microsoft.csv      sale.csv         xn--4gbrim.csv
>> casino.csv       frl.csv          moda.csv           scb.csv          xn--80adxhks.csv
>> center.csv       fund.csv         moe.csv            school.csv       xn--80asehdb.csv
>> ceo.csv          futbol.csv       monash.csv         science.csv      xn--90ais.csv
>> chat.csv         gal.csv          money.csv          scot.csv         xn--d1acj3b.csv
>> church.csv       gallery.csv      moscow.csv         services.csv     xn--j1amh.csv
>> city.csv         garden.csv       movie.csv          sexy.csv         xn--p1ai.csv
>> claims.csv       gent.csv         nagoya.csv         shiksha.csv      xn--pgbs0dh.csv
>> click.csv        gift.csv         network.csv        shoes.csv        xn--q9jyb4c.csv
>> clinic.csv       gifts.csv        new.csv            singles.csv      xn--wgbl6a.csv
>> clothing.csv     glass.csv        news.csv           site.csv         xxx.csv
>> club.csv         global.csv       nexus.csv          social.csv       xyz.csv
>> coach.csv        globo.csv        ngo.csv            software.csv     yandex.csv
>> codes.csv        gmail.csv        ninja.csv          solar.csv        yoga.csv
>> coffee.csv       goo.csv          nrw.csv            solutions.csv    yokohama.csv
>> college.csv      goog.csv         ntt.csv            soy.csv          youtube.csv
>> community.csv    google.csv       nyc.csv            space.csv        zone.csv
>> company.csv      graphics.csv     office.csv         style.csv
>> computer.csv     gratis.csv       okinawa.csv        sucks.csv
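
The listing above suggests one CSV file per TLD. As an illustration only (the
Crawler's own input format is not described in this thread), a list of
domains could be split into such files roughly like this. The function name,
the single-column layout and the use of the last DNS label as the file name
are all assumptions on my part.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_by_tld(domains, out_dir):
    """Write one '<tld>.csv' per top-level domain; return a count per TLD."""
    groups = defaultdict(list)
    for domain in domains:
        # The TLD is taken to be the last dot-separated label, lower-cased.
        tld = domain.strip().rstrip(".").rsplit(".", 1)[-1].lower()
        groups[tld].append(domain.strip())

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tld, names in groups.items():
        # IDN TLDs stay in their ASCII "xn--" form, so file names remain ASCII.
        with open(out / f"{tld}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            for name in names:
                writer.writerow([name])
    return {tld: len(names) for tld, names in groups.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        counts = split_by_tld(
            ["example.com", "example.org", "example.xn--p1ai"], tmp
        )
        print(counts)
        print(sorted(path.name for path in Path(tmp).iterdir()))
```

A double dash in a file name is harmless on Linux file systems, which fits
with the crawl itself coping with the IDN TLDs.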
>>
>> In the meantime I'd like to cite again the Nile University Crew
>> expertly led by Sameh El Ansary for designing and coding a Crawler
>> that has been able to cope with sifting through 5 years of DNS junk with
>> minimal maintenance, save the love and attention I give the servers
>> by keeping them up to date with patches so they don't end up toppling
>> over. They haven't been rebooted in 464 days and I am crossing my
>> fingers for their well-being.
>> And of course, thanks to the University of Southampton Crew who built
>> the excellent 2nd version of the Web Site under Tim Chown's supervision.
>>
>> I am still writing an article for RIPE Labs - just struggling to find
>> the time to finish it, but getting there.
>>
>> Warmest regards,
>>
>> Olivier
>>
>>
>>
>> _______________________________________________
>> IPv6crawler-wg mailing list
>> IPv6crawler-wg at gih.co.uk
>> http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg
>
> --
> Olivier MJ Crépin-Leblond, PhD
> http://www.gih.com/ocl.html
--
Christian de Larrinaga FBCS, CITP,
-------------------------
@ FirstHand
-------------------------
+44 7989 386778
cdel at firsthand.net
-------------------------