<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Hello Christian,<br>
<br>
the SQLite database comes in when it comes to displaying
the results. The results of the crawls are in native CSV, all 306 Gb
of them. The SQLite database is much smaller, as it only uses a
subset of all data collected (the data which is used in the GUI), and
we are not using a single SQLite database but one for each crawl: a
summary of each crawl for each TLD.<br>
The question of Sqlite v3 is a good one -- and I have unfortunately
got no idea whether it would work or whether it would break things.
To be added to the list of things to do.<br>
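On the UTF-8 point, a minimal sketch, assuming the database builder could use Python's standard sqlite3 module (which binds to SQLite version 3). The table name tld_summary and the row values are hypothetical and not taken from the project:<br>
<pre>
```python
import sqlite3

# Python's sqlite3 module is a binding to SQLite version 3.
conn = sqlite3.connect(":memory:")

# Hypothetical summary table; the real schema is not shown in this thread.
conn.execute("CREATE TABLE tld_summary (tld TEXT PRIMARY KEY, domains INTEGER)")

# The ASCII (punycode) form and the Unicode form of the same TLD.
rows = [("xn--p1ai", 1), ("\u0440\u0444", 1)]
conn.executemany("INSERT INTO tld_summary VALUES (?, ?)", rows)

# Read both values back in insertion order.
stored = [r[0] for r in conn.execute("SELECT tld FROM tld_summary ORDER BY rowid")]
print(stored)
conn.close()
```
</pre>
Both forms are passed as bound parameters, so neither the dashes nor the non-ASCII characters are parsed as SQL.<br>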
Kindest regards,<br>
<br>
Olivier<br>
<br>
<div class="moz-cite-prefix">On 06/02/2016 18:09, Christian de
Larrinaga wrote:<br>
</div>
<blockquote cite="mid:56B628E2.5090601@firsthand.net" type="cite">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<div style="font-size: 12pt;font-family: Helvetica Neue;"><span
style="font-family: Helvetica Neue;">That is a humongous
sqlite database! Or are you only collecting the data as a form
of cache using sqlite and then exporting it out once organised
into csv?<br>
<br>
Sqlite v3 supports utf-8 which might help?<br>
if it doesn't break something else of course. <br>
<br>
C <br>
</span><br>
<span>Olivier MJ Crepin-Leblond wrote:</span><br>
<blockquote cite="mid:56B62022.7060009@gih.com" type="cite">
<meta content="text/html; charset=utf-8"
http-equiv="Content-Type">
Hello all,<br>
<br>
another update: the first complete run using the new TLDs has
completed! <br>
You can view the results up to February 2016 from <a
moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.ipv6matrix.org">http://www.ipv6matrix.org</a><br>
<br>
In adding new gTLDs we have hit a snag, although this snag
does not significantly affect overall results since it appears
to only affect a tiny number of domains.<br>
<br>
I am speaking about Internationalized Top Level Domains
(IDNs):<br>
<br>
xn--3e0b707e xn--80adxhks xn--90ais xn--j1amh
xn--pgbs0dh xn--wgbl6a<br>
xn--4gbrim xn--80asehdb xn--d1acj3b xn--p1ai
xn--q9jyb4c<br>
<br>
Each of these is the ASCII equivalent of a non-ASCII domain
name. Whilst the Crawler works well with them and we are able
to collect all of the data pertaining to crawls in IDNs, the
program that builds the database uses SQLite. Until now,
database entries made use of domain names that were plain ASCII,
but IDNs use a double dash "--" in the domain. SQLite chokes
on the dash, so we have not been able to produce the database
needed for displaying the results when including IDNs.<br>
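<br>
One plausible reading of the failure, as a minimal sketch: it assumes the database builder interpolates the TLD into the SQL text as a table name, which is an assumption since the build script is not shown here. In SQL, "--" starts a comment, so an unquoted name containing it cuts the statement short. The helper quote_identifier is hypothetical:<br>
<pre>
```python
import sqlite3

def quote_identifier(name):
    # Standard SQL identifier quoting: wrap in double quotes
    # and double any embedded double quote.
    return '"' + name.replace('"', '""') + '"'

conn = sqlite3.connect(":memory:")
tld = "xn--p1ai"

# Unquoted: everything after "--" is treated as a comment,
# so SQLite only sees "CREATE TABLE xn" and rejects it.
try:
    conn.execute("CREATE TABLE " + tld + " (domain TEXT)")
    unquoted_ok = True
except sqlite3.Error:
    unquoted_ok = False

# Quoted: the whole string is accepted as a table name.
table = quote_identifier(tld)
conn.execute("CREATE TABLE " + table + " (domain TEXT)")
conn.execute("INSERT INTO " + table + " VALUES (?)", ("example.xn--p1ai",))
count = conn.execute("SELECT COUNT(*) FROM " + table).fetchone()[0]

print(unquoted_ok, count)
conn.close()
```
</pre>
If the dashes only ever appear in values rather than in table or file names, binding them as parameters (the "?" placeholder) avoids the problem without any quoting.<br>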
<br>
Until we have a workaround, I have manually isolated data
collected for IDNs, which means we still collect them, but we
will not take them into account in the final database results.
As I have said, this is a tiny subset of domains: 760 entries
out of a total of 1 Million domains.<br>
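<br>
The manual isolation described above could be automated along these lines. This is a sketch only: the function names split_idn and to_unicode are hypothetical, and it relies on the fact that IDN labels in their ASCII form start with the "xn--" prefix:<br>
<pre>
```python
def split_idn(tlds):
    # Return (plain, idn): IDN labels in ASCII form start with "xn--".
    idn = [t for t in tlds if t.lower().startswith("xn--")]
    plain = [t for t in tlds if not t.lower().startswith("xn--")]
    return plain, idn

def to_unicode(a_label):
    # Strip the "xn--" prefix and decode the rest with
    # Python's built-in punycode codec.
    return a_label[4:].encode("ascii").decode("punycode")

# Small sample taken from the TLD names listed in this thread.
tlds = ["com", "xn--p1ai", "paris", "xn--90ais"]
plain, idn = split_idn(tlds)

print(plain)
print(idn)
print([to_unicode(t) for t in idn])
```
</pre>
Keeping the two groups separate means the IDN data stays collected and can be merged back in once the database step copes with it.<br>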
<br>
I am *still* drafting a very long article for RIPE labs. In
fact, we might publish this in two parts. In the meantime, the
results appear to be somewhat consistent with results of other
tracking projects, some of which use other methods to track
IPv6 adoption:<br>
<br>
- <a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://6lab.cisco.com/stats/">http://6lab.cisco.com/stats/</a><br>
- <a moz-do-not-send="true" class="moz-txt-link-freetext"
href="https://www.vyncke.org/ipv6status/">https://www.vyncke.org/ipv6status/</a><br>
- <a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.mrp.net/ipv6_survey/">http://www.mrp.net/ipv6_survey/</a><br>
<br>
We now have 306 Gb of comma separated value text data in
store, tracing back the spread of the IPv6 Internet since July
2010. (294Gb in November 2015)<br>
<br>
I look forward to your kind feedback.<br>
<br>
Warmest regards,<br>
Olivier<br>
<br>
<br>
<div class="moz-cite-prefix">On 26/11/2015 19:32, Olivier MJ
Crepin-Leblond wrote:<br>
</div>
<blockquote cite="mid:56575051.3070402@gih.com" type="cite">
<meta http-equiv="content-type" content="text/html;
charset=utf-8">
Hello all,<br>
<br>
Two worthy pieces of news regarding the IPv6 Matrix Project
( <a moz-do-not-send="true" class="moz-txt-link-freetext"
href="http://www.ipv6matrix.org">http://www.ipv6matrix.org</a>
):<br>
<br>
1. I have updated the Web site with the latest results
ending in late October - hence noting a Crawl display date
of November 2015.<br>
We now have 294 Gb of comma separated value text data in
store, tracing back the spread of the IPv6 Internet since
July 2010.<br>
Altogether, we ran the test approximately 36 times on all 1
million Alexa busiest domain names. This represented testing
of about 6.5 million hosts, carefully collecting traceroute
information for each and every one of them. We now have a
unique database that shows the spread of the IPv6
Internet information sources worldwide.<br>
<br>
2. Today I took out my very dusty Linux &amp; Python gloves
and performed a much needed update to the IPv6 Matrix
Crawler input database, including the Alexa 1 million list
as well as GeoIP Databases.<br>
<br>
Indeed, the Alexa database of the world's 1 million busiest
Web sites dated from the Crawler's first inception in the
first half of 2010.<br>
We're more than 5 years later!<br>
<br>
In a way, keeping the same input database has kept the base
of crawls steady, which made it possible to compare results
over time. However, the flip-side of the coin is that we are
ending up with more and more domain names marked as being
dysfunctional. Nearly 5% of the domain names in the database
were unreachable. The updated input database should resolve
this, but we might also see a jump in some results. It will
be interesting to see what the next run yields.<br>
Why do we not update the input database more often? Because
buried in that database are the domain names of the people
who wanted to opt out over the years. Having never thought
about this, I spent several hours tracing back 5 years of
emails of people complaining about the crawl triggering
their firewalls. I put together a blacklist of domain names
I have manually deleted from the crawl input files. <br>
The blacklist, as it stands now:<br>
<br>
Deleted:<br>
<br>
it-mate.co.uk<br>
indianic.com<br>
your-server.de<br>
catacombscds.com<br>
dewlance.com<br>
tcs.com<br>
printweb.de<br>
nocser.net<br>
shoppingnsales.com<br>
bsaadmail.com<br>
epayservice.ru<br>
4footyfans.com <br>
guitarspeed99.com<br>
saga.co.uk<br>
<br>
Already gone from the current Alexa list:<br>
<br>
infinityautosurf.com<br>
canada-traffic.com<br>
usahitz.com<br>
jawatankosong.com.my<br>
4d.com.my<br>
fitnessuncovered.co.uk<br>
kualalumpurbookfair.com<br>
xgen-it.com<br>
bpanet.de<br>
edns.de<br>
back2web.de<br>
waaaouh.com<br>
every-web.com<br>
w3sexe.com<br>
gratuits-web.com<br>
france-mateur.com<br>
pliagedepapier.com<br>
immobilieretparticuliers.com<br>
chronobio.com<br>
stickers-origines.com<br>
tailor-made.co.uk<br>
<br>
With these out of the input files, we are able to start the
next crawl. <b><br>
I hope I have not missed any complaints, but if I have,
this is advance notice that we might receive a few emails
in the forthcoming weeks. We might also receive a few
emails from sites that have appeared on the Alexa 1
million list since 2010.</b><br>
<br>
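Removing opted-out domains from an input file could be scripted roughly as follows. This is a sketch under assumptions: the "rank,domain" layout of the sample, the names is_blocked and filter_rows, and the choice to also drop subdomains of a listed name are all illustrative, not a description of what was done by hand:<br>
<pre>
```python
import csv
import io

# Hypothetical sample in "rank,domain" form (an assumed input layout).
SAMPLE = "1,example.com\n2,saga.co.uk\n3,mail.tcs.com\n4,example.org\n"

# Two entries from the blacklist in this message.
BLACKLIST = ["saga.co.uk", "tcs.com"]

def is_blocked(domain, blocked):
    # Blocked if the domain is listed or is a subdomain of a listed name.
    domain = domain.strip().lower()
    return any(domain == b or domain.endswith("." + b) for b in blocked)

def filter_rows(rows, blacklist):
    # Normalise the blacklist once, then keep only rows that are not blocked.
    blocked = {b.strip().lower() for b in blacklist}
    return [row for row in rows if not is_blocked(row[1], blocked)]

rows = list(csv.reader(io.StringIO(SAMPLE)))
kept = filter_rows(rows, BLACKLIST)
print([row[1] for row in kept])
```
</pre>
Keeping the blacklist in its own file would let it be reapplied automatically each time the input list is refreshed.<br>
<br>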
Back to this list, the excellent filtering program which was
used to process the original list and clean it up was used
again for the modern list. The Alexa list had a number of
domain names which were actually sub-directories in the
past, as well as some invalid domains. Alexa has since
tightened its act. The latest Alexa list is much cleaner. It
holds 999998 valid domains vs. 984587 domains for the
original 2010 list.<br>
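<br>
The original filtering program is not reproduced here. As an illustration only, a clean-up pass of the kind described (dropping sub-directory entries and invalid names) might look like this; the names LABEL, is_valid_domain and clean_list are hypothetical, and the rules are the usual DNS label limits rather than the project's own:<br>
<pre>
```python
import re

# One DNS label: letters, digits and hyphens, with no leading
# or trailing hyphen, and 63 characters at most.
LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_valid_domain(entry):
    name = entry.strip().lower()
    # Reject sub-directory entries and names outside 1 to 253 characters.
    if "/" in name or len(name) not in range(1, 254):
        return False
    labels = name.split(".")
    # A bare TLD on its own is not a crawlable domain name.
    if len(labels) == 1:
        return False
    return all(LABEL.match(label) for label in labels)

def clean_list(entries):
    # Keep valid names only, lower-cased, without duplicates, in order.
    seen = set()
    cleaned = []
    for entry in entries:
        name = entry.strip().lower()
        if is_valid_domain(name) and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned

raw = ["example.com", "example.com/blog", "-bad-.com",
       "Example.COM", "example.xn--p1ai"]
print(clean_list(raw))
```
</pre>
Note that IDN names in their ASCII form pass this check unchanged, since "xn--" labels only contain letters, digits and hyphens.<br>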
<br>
Finally, new gTLDs have now appeared in the Alexa list,
including some Internationalised Domain Names (IDNs). The
world is indeed a very different place!<br>
It will be interesting to see how the Crawler, as well as all
the other scripts that process the information into displayable
data on the Web server, will cope with these:<br>
<br>
academy.csv consulting.csv guide.csv
one.csv supply.csv<br>
accountant.csv contractors.csv guru.csv
onl.csv support.csv<br>
actor.csv cool.csv hamburg.csv
online.csv surf.csv<br>
ads.csv country.csv haus.csv
ooo.csv swiss.csv<br>
adult.csv creditcard.csv healthcare.csv
orange.csv sydney.csv<br>
agency.csv cricket.csv help.csv
ovh.csv systems.csv<br>
alsace.csv cymru.csv hiphop.csv
paris.csv taipei.csv<br>
amsterdam.csv dance.csv holiday.csv
partners.csv tattoo.csv<br>
app.csv date.csv horse.csv
parts.csv team.csv<br>
archi.csv dating.csv host.csv
party.csv tech.csv<br>
associates.csv deals.csv hosting.csv
photo.csv technology.csv<br>
attorney.csv delivery.csv house.csv
photography.csv theater.csv<br>
auction.csv desi.csv how.csv
photos.csv tienda.csv<br>
audio.csv design.csv immobilien.csv
pics.csv tips.csv<br>
axa.csv dev.csv immo.csv
pictures.csv tirol.csv<br>
barclaycard.csv diet.csv ink.csv
pink.csv today.csv<br>
barclays.csv digital.csv international.csv
pizza.csv tokyo.csv<br>
bar.csv direct.csv investments.csv
place.csv tools.csv<br>
bargains.csv directory.csv irish.csv
plus.csv top.csv<br>
bayern.csv discount.csv jetzt.csv
poker.csv town.csv<br>
beer.csv dog.csv joburg.csv
porn.csv toys.csv<br>
berlin.csv domains.csv juegos.csv
post.csv trade.csv<br>
best.csv earth.csv kim.csv
press.csv training.csv<br>
bid.csv education.csv kitchen.csv
prod.csv trust.csv<br>
bike.csv email.csv kiwi.csv
productions.csv university.csv<br>
bio.csv emerck.csv koeln.csv
properties.csv uno.csv<br>
black.csv energy.csv krd.csv
property.csv uol.csv<br>
blackfriday.csv equipment.csv kred.csv
pub.csv vacations.csv<br>
blue.csv estate.csv land.csv
quebec.csv vegas.csv<br>
bnpparibas.csv eus.csv law.csv
realtor.csv ventures.csv<br>
boo.csv events.csv legal.csv
recipes.csv video.csv<br>
boutique.csv exchange.csv life.csv
red.csv vision.csv<br>
brussels.csv expert.csv limited.csv
rehab.csv voyage.csv<br>
build.csv exposed.csv link.csv
reise.csv wales.csv<br>
builders.csv express.csv live.csv
reisen.csv wang.csv<br>
business.csv fail.csv lol.csv
ren.csv watch.csv<br>
buzz.csv faith.csv london.csv
rentals.csv webcam.csv<br>
bzh.csv farm.csv love.csv
repair.csv website.csv<br>
cab.csv finance.csv luxury.csv
report.csv wien.csv<br>
camera.csv fish.csv management.csv
rest.csv wiki.csv<br>
camp.csv fishing.csv mango.csv
review.csv win.csv<br>
capital.csv fit.csv market.csv
reviews.csv windows.csv<br>
cards.csv fitness.csv marketing.csv
rio.csv work.csv<br>
care.csv flights.csv markets.csv
rip.csv works.csv<br>
career.csv foo.csv media.csv
rocks.csv world.csv<br>
careers.csv football.csv melbourne.csv
ruhr.csv wtf.csv<br>
casa.csv forsale.csv menu.csv
ryukyu.csv xn--3e0b707e.csv<br>
cash.csv foundation.csv microsoft.csv
sale.csv xn--4gbrim.csv<br>
casino.csv frl.csv moda.csv
scb.csv xn--80adxhks.csv<br>
center.csv fund.csv moe.csv
school.csv xn--80asehdb.csv<br>
ceo.csv futbol.csv monash.csv
science.csv xn--90ais.csv<br>
chat.csv gal.csv money.csv
scot.csv xn--d1acj3b.csv<br>
church.csv gallery.csv moscow.csv
services.csv xn--j1amh.csv<br>
city.csv garden.csv movie.csv
sexy.csv xn--p1ai.csv<br>
claims.csv gent.csv nagoya.csv
shiksha.csv xn--pgbs0dh.csv<br>
click.csv gift.csv network.csv
shoes.csv xn--q9jyb4c.csv<br>
clinic.csv gifts.csv new.csv
singles.csv xn--wgbl6a.csv<br>
clothing.csv glass.csv news.csv
site.csv xxx.csv<br>
club.csv global.csv nexus.csv
social.csv xyz.csv<br>
coach.csv globo.csv ngo.csv
software.csv yandex.csv<br>
codes.csv gmail.csv ninja.csv
solar.csv yoga.csv<br>
coffee.csv goo.csv nrw.csv
solutions.csv yokohama.csv<br>
college.csv goog.csv ntt.csv
soy.csv youtube.csv<br>
community.csv google.csv nyc.csv
space.csv zone.csv<br>
company.csv graphics.csv office.csv
style.csv<br>
computer.csv gratis.csv okinawa.csv
sucks.csv<br>
<br>
In the meantime I'd like to cite again the Nile University
Crew expertly led by Sameh El Ansary for designing and
coding a Crawler that has been able to cope with sifting
through 5 years of DNS junk with minimal maintenance, save
the love and attention I give the servers by keeping them up
to date with patches so they don't end up toppling over.
They haven't been rebooted in 464 days and I am crossing
fingers for their well-being.<br>
And of course, thanks to the University of Southampton Crew
who built the excellent 2nd version of the Web Site under
Tim Chown's supervision.<br>
<br>
I am still writing an article for RIPE Labs - just
struggling to find the time to finish it, but getting there.<br>
<br>
Warmest regards,<br>
<br>
Olivier<br>
<br>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
IPv6crawler-wg mailing list
<a moz-do-not-send="true" class="moz-txt-link-abbreviated" href="mailto:IPv6crawler-wg@gih.co.uk">IPv6crawler-wg@gih.co.uk</a>
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg">http://gypsy.gih.co.uk/mailman/listinfo/ipv6crawler-wg</a>
</pre>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Olivier MJ Crépin-Leblond, PhD
<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://www.gih.com/ocl.html">http://www.gih.com/ocl.html</a>
</pre>
</blockquote>
<br>
<div class="moz-signature">-- <br>
Christian de Larrinaga FBCS, CITP,<br>
-------------------------<br>
<span style="font-weight: bold;">@ FirstHand</span><br
style="font-weight: bold;">
-------------------------<br>
+44 7989 386778<br>
<a moz-do-not-send="true" class="moz-txt-link-abbreviated"
href="mailto:cdel@firsthand.net">cdel@firsthand.net</a> <br>
-------------------------<br>
<br>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Olivier MJ Crépin-Leblond, PhD
<a class="moz-txt-link-freetext" href="http://www.gih.com/ocl.html">http://www.gih.com/ocl.html</a>
</pre>
</body>
</html>